Usage

We'll start with a probabilistic regression example on the Boston housing dataset:

from ngboost import NGBRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

[iter 0] loss=3.6486 val_loss=0.0000 scale=0.5000 norm=3.4791
[iter 100] loss=3.1043 val_loss=0.0000 scale=1.0000 norm=3.9358
[iter 200] loss=2.4762 val_loss=0.0000 scale=2.0000 norm=4.1521
[iter 300] loss=2.0484 val_loss=0.0000 scale=1.0000 norm=1.6249
[iter 400] loss=1.8610 val_loss=0.0000 scale=1.0000 norm=1.4547
Test MSE 7.719871354323341
Test NLL 2.8867507325340243


Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:

Y_dists[0:5].params

{'loc': array([15.71909047, 19.51384116, 19.24509285, 17.8645122 , 24.31325397]),
'scale': array([1.48748154, 1.37673424, 1.67090687, 1.63854999, 1.52513887])}

## Distributions

NGBoost can be used with a variety of distributions, broken down into those for regression (support on an infinite set) and those for classification (support on a finite set).

### Regression Distributions

Distribution Parameters Implemented Scores Reference
Normal loc, scale LogScore, CRPScore scipy.stats normal
LogNormal s, scale LogScore, CRPScore scipy.stats lognormal
Exponential scale LogScore, CRPScore scipy.stats exponential

Regression distributions can be used through the NGBRegressor() constructor by passing the appropriate class as the Dist argument. Normal is the default.

from ngboost.distns import Exponential, Normal

X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)


There are two prediction methods for NGBRegressor objects: predict(), which returns point predictions as one would expect from a standard regressor, and pred_dist(), which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.

ngb_norm.predict(X_reg_test)[0:5]

array([21.25837828,  9.88964092, 23.01338315, 10.89576892, 16.12806237])
ngb_exp.predict(X_reg_test)[0:5]

array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])
ngb_exp.pred_dist(X_reg_test)[0:5].params

{'scale': array([20.94799589,  9.38317525, 22.88445968, 10.33327537, 14.83048942])}

#### Survival Regression

NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, LogNormal and Exponential have these scores implemented. To do survival analysis, use NGBSurvival and pass both the time-to-event (or censoring) and event indicator vectors to fit():

import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30) # time of an event or censoring
E_surv_train = Y_surv_train > 30 # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)

[iter 0] loss=1.2960 val_loss=0.0000 scale=8.0000 norm=4.8495
[iter 100] loss=0.6339 val_loss=0.0000 scale=2.0000 norm=0.7358
[iter 200] loss=0.3803 val_loss=0.0000 scale=4.0000 norm=0.9619
[iter 300] loss=0.2276 val_loss=0.0000 scale=8.0000 norm=0.9190
[iter 400] loss=0.1178 val_loss=0.0000 scale=4.0000 norm=0.3496


The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.

### Classification Distributions

Distribution Parameters Implemented Scores Reference
k_categorical(K) p0, p1... p{K-1} LogScore Categorical distribution on Wikipedia
Bernoulli p LogScore Bernoulli distribution on Wikipedia

Classification distributions can be used through the NGBClassifier() constructor by passing the appropriate class as the Dist argument. Bernoulli is the default and is equivalent to k_categorical(2).

from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli

y[0:15] = 2 # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test  = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False) # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train) # Y should have only 3 values: {0,1,2}


When using NGBoost for classification, the outcome vector Y must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.

NGBClassifier objects have three prediction methods: predict() returns the most likely class, predict_proba() returns the class probabilities, and pred_dist() returns the distribution object.

ngb_cat.predict(X_cls_test)[0:5]

array([1, 1, 1, 0, 1])
ngb_cat.predict_proba(X_cls_test)[0:5]

array([[3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
[6.59565268e-03, 9.93168490e-01, 2.35857004e-04],
[3.53080012e-03, 9.96242905e-01, 2.26294536e-04],
[9.92981053e-01, 6.07012737e-03, 9.48819937e-04],
[3.53080012e-03, 9.96242905e-01, 2.26294536e-04]])
ngb_cat.pred_dist(X_cls_test)[0:5].params

{'p0': array([0.0035308 , 0.00659565, 0.0035308 , 0.99298105, 0.0035308 ]),
'p1': array([0.99624291, 0.99316849, 0.99624291, 0.00607013, 0.99624291]),
'p2': array([0.00022629, 0.00023586, 0.00022629, 0.00094882, 0.00022629])}

## Scores

NGBoost supports the log score (LogScore, also known as negative log-likelihood) and CRPS (CRPScore), although each score may not be implemented for each distribution. The score is specified by the Score argument in the constructor.

from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)

NGBClassifier(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
Dist=<class 'ngboost.distns.categorical.k_categorical.<locals>.Categorical'>,
Score=<class 'ngboost.scores.LogScore'>, col_sample=1.0,
learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
verbose=False, verbose_eval=100)

## Base Learners

NGBoost can be used with any sklearn regressor as the base learner, specified with the Base argument. The default is a depth-3 regression tree.

from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)

NGBSurvival(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
Dist=<class 'ngboost.api.NGBSurvival.__init__.<locals>.SurvivalDistn'>,
Score=<class 'ngboost.scores.CRPScore'>, col_sample=1.0,
learning_rate=0.01, minibatch_frac=1.0, n_estimators=500,
random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
verbose=False, verbose_eval=100)

## Other Arguments

The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)

[iter 0] loss=3.6328 val_loss=0.0000 scale=0.5000 norm=3.3554

NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
Dist=<class 'ngboost.distns.normal.Normal'>,
Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
random_state=RandomState(MT19937) at 0x117AF2D10, tol=0.0001,
verbose=True, verbose_eval=100)

Sample weights (for training) are set using the sample_weight argument to fit.

ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)

[iter 0] loss=3.6257 val_loss=0.0000 scale=1.0000 norm=6.6551

NGBRegressor(Base=DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
Dist=<class 'ngboost.distns.normal.Normal'>,
Score=<class 'ngboost.scores.LogScore'>, col_sample=0.5,
learning_rate=0.01, minibatch_frac=0.5, n_estimators=100,
verbose=True, verbose_eval=100)