from ngboost import NGBClassifier, NGBRegressor
from ngboost.distns import k_categorical, Normal
from ngboost.scores import LogScore

from sklearn.datasets import load_breast_cancer, load_boston
from sklearn.model_selection import train_test_split

X, Y = load_boston(return_X_y=True) # load_boston requires scikit-learn < 1.2
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test = train_test_split(X, y, test_size=0.2)

Staged Prediction

All fitted NGBoost objects support staged prediction, which returns the predictions made after each boosting iteration.

ngb_cls = NGBClassifier(Dist=k_categorical(3), Score=LogScore, n_estimators=500, verbose=False).fit(X_cls_train, Y_cls_train)

For instance, to get the predictions on the first 5 examples after fitting 415 base learners, use:

preds = ngb_cls.staged_predict(X_cls_test)
preds[415][0:5]
array([0, 0, 1, 1, 1])
pred_dists = ngb_cls.staged_pred_dist(X_cls_test)
pred_dists[415][0:5].params
{'p0': array([0.99074995, 0.91368635, 0.00517919, 0.00517919, 0.00517919]),
 'p1': array([0.00860966, 0.03267806, 0.99450359, 0.99450359, 0.99450359]),
 'p2': array([0.00064039, 0.05363559, 0.00031722, 0.00031722, 0.00031722])}
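Staged predictions make it easy to see how a metric evolves as base learners are added. The following is a minimal sketch, assuming the staged predictions arrive as a sequence with one entry per iteration, each holding one predicted label per example. The helper name accuracy_by_stage and the toy data are hypothetical and stand in for the output of ngb_cls.staged_predict(X_cls_test) and Y_cls_test.

```python
def accuracy_by_stage(staged_preds, y_true):
    """Return the accuracy of each stage's predictions against y_true."""
    y_true = list(y_true)
    return [
        sum(int(p == t) for p, t in zip(stage, y_true)) / len(y_true)
        for stage in staged_preds
    ]

# Toy stand-in for staged predictions: three stages, five examples (hypothetical data).
toy_staged_preds = [
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
toy_labels = [0, 0, 1, 1, 2]

accuracies = accuracy_by_stage(toy_staged_preds, toy_labels)
# Index of the stage with the highest accuracy (first one wins on ties).
best_stage = max(range(len(accuracies)), key=accuracies.__getitem__)
print(accuracies, best_stage)
```

With a fitted classifier, the same helper could be called as accuracy_by_stage(ngb_cls.staged_predict(X_val), Y_val), where X_val and Y_val are a held-out validation set.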

This is useful in conjunction with tracking errors on a validation set, which you can do by passing the X_val and Y_val arguments to fit() and then inspecting the best_val_loss_itr instance attribute, which holds the iteration with the lowest validation loss.

ngb = NGBRegressor().fit(X_reg_train, Y_reg_train, X_val=X_reg_test, Y_val=Y_reg_test) # use a validation set instead of test set here in your own work
best_preds = ngb.predict(X_reg_test, max_iter=ngb.best_val_loss_itr)
[iter 0] loss=3.6556 val_loss=3.5575 scale=0.5000 norm=3.4142
[iter 100] loss=3.1118 val_loss=3.1284 scale=1.0000 norm=3.9174
[iter 200] loss=2.4839 val_loss=2.6398 scale=2.0000 norm=4.0907
[iter 300] loss=2.0183 val_loss=2.7162 scale=1.0000 norm=1.5637
[iter 400] loss=1.8111 val_loss=3.1315 scale=1.0000 norm=1.3983

Early Stopping

NGBoost also supports early stopping. If an integer early_stopping_rounds and a validation set (X_val, Y_val) are passed to fit(), training stops once the validation loss has increased for early_stopping_rounds consecutive iterations.

_ = NGBRegressor().fit(X_reg_train, Y_reg_train, X_val=X_reg_test, Y_val=Y_reg_test, early_stopping_rounds=2)
[iter 0] loss=3.6556 val_loss=3.5575 scale=0.5000 norm=3.4142
[iter 100] loss=3.1118 val_loss=3.1292 scale=1.0000 norm=3.9174
[iter 200] loss=2.4839 val_loss=2.6422 scale=2.0000 norm=4.0907
== Early stopping achieved.
== Best iteration / VAL234 (val_loss=2.5693)
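The stopping rule described above can be sketched in plain Python. This is an illustration of the rule as worded in this section (stop after the validation loss has risen for a given number of consecutive iterations), not NGBoost's internal implementation. The function name early_stop_index and the toy loss values are hypothetical.

```python
def early_stop_index(val_losses, early_stopping_rounds):
    """Return (stop_iteration, best_iteration) for a sequence of validation losses.

    Training is considered stopped at the first iteration where the loss has
    risen for `early_stopping_rounds` consecutive iterations. If that never
    happens, stop_iteration is the last index.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_itr = 0
    consecutive_increases = 0
    for itr, loss in enumerate(val_losses):
        if loss < val_losses[best_itr]:
            best_itr = itr  # new lowest validation loss
        if itr > 0 and loss > val_losses[itr - 1]:
            consecutive_increases += 1
        else:
            consecutive_increases = 0  # the run of increases is broken
        if consecutive_increases >= early_stopping_rounds:
            return itr, best_itr
    return len(val_losses) - 1, best_itr

# Toy validation-loss trace (hypothetical values).
toy_losses = [3.5, 3.1, 2.6, 2.7, 3.1, 3.0]
stop_itr, best_itr = early_stop_index(toy_losses, early_stopping_rounds=2)
print(stop_itr, best_itr)
```

Keeping the best iteration separate from the stopping iteration mirrors the log above, where training halts some iterations after the best validation loss was reached.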

Validation set sample weights can be passed using the val_sample_weight argument to fit.
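To illustrate what validation weights do, the sketch below computes a weighted mean of per-example negative log-likelihoods under a Normal distribution. It assumes the weights enter the validation loss as a weighted average of per-example log scores; that is an assumption about the behavior, and the function names normal_nll and weighted_mean_nll are hypothetical rather than part of NGBoost.

```python
import math

def normal_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def weighted_mean_nll(ys, mus, sigmas, weights=None):
    """Weighted average of per-example Normal NLLs (uniform weights if None)."""
    scores = [normal_nll(y, m, s) for y, m, s in zip(ys, mus, sigmas)]
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Toy validation set: the second example is fit poorly (hypothetical values).
ys, mus, sigmas = [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]
unweighted = weighted_mean_nll(ys, mus, sigmas)
upweight_first = weighted_mean_nll(ys, mus, sigmas, weights=[3.0, 1.0])
print(unweighted, upweight_first)
```

Up-weighting the well-fit example lowers the reported loss, which is why the choice of validation weights can change which iteration looks best.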

Using sklearn Model Selection

sklearn model selection methods, such as GridSearchCV, are compatible with NGBoost.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

b1 = DecisionTreeRegressor(criterion='friedman_mse', max_depth=2)
b2 = DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)

param_grid = {
    'minibatch_frac': [1.0, 0.5],
    'Base': [b1, b2],
}

ngb = NGBRegressor(Dist=Normal, verbose=False)

grid_search = GridSearchCV(ngb, param_grid=param_grid, cv=5)
grid_search.fit(X_reg_train, Y_reg_train)
print(grid_search.best_params_)
{'Base': DecisionTreeRegressor(criterion='friedman_mse', max_depth=4, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best'), 'minibatch_frac': 1.0}
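To make the cost of the search concrete, the sketch below enumerates the parameter combinations that a grid like the one above expands to. It is a plain-Python illustration using itertools, not sklearn's own code; the helper name expand_grid is hypothetical, and strings stand in for the two base learners.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

# Same shape as the grid above, with placeholder strings for the base learners.
toy_grid = {
    'minibatch_frac': [1.0, 0.5],
    'Base': ['depth_2_tree', 'depth_4_tree'],
}

candidates = expand_grid(toy_grid)
n_folds = 5
n_cv_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), n_cv_fits)
```

With two values for each of two parameters and cv=5, the search fits 20 models during cross-validation, plus one final refit on the full training data when GridSearchCV's default refit=True is used.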