We'll start with a probabilistic regression example on the Boston housing dataset:
```python
from ngboost import NGBRegressor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_test, Y_preds)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)
```
Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:
```python
Y_dists[0:5].params
```
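The distribution object forwards the methods of the corresponding `scipy.stats` distribution (the `logpdf()` call above relies on this), so other distributional quantities are available directly. As a minimal sketch, reusing `Y_dists` from the example above, this reads off 95% prediction intervals via the forwarded `ppf()` (quantile) method:

```python
# 95% prediction intervals for the first five test observations,
# using the quantile function of each predicted normal distribution
lower = Y_dists[0:5].ppf(0.025)
upper = Y_dists[0:5].ppf(0.975)
print(lower, upper)
```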
Distribution | Parameters | Implemented Scores | Reference |
---|---|---|---|
`Normal` | `loc`, `scale` | `LogScore`, `CRPScore` | `scipy.stats` normal |
`LogNormal` | `s`, `scale` | `LogScore`, `CRPScore` | `scipy.stats` lognormal |
`Exponential` | `scale` | `LogScore`, `CRPScore` | `scipy.stats` exponential |
Regression distributions can be used through the `NGBRegressor()` constructor by passing the appropriate class as the `Dist` argument. `Normal` is the default.
```python
from ngboost.distns import Exponential, Normal

X, Y = load_boston(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)
```
There are two prediction methods for `NGBRegressor` objects: `predict()`, which returns point predictions as one would expect from a standard regressor, and `pred_dist()`, which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.
```python
ngb_norm.predict(X_reg_test)[0:5]
ngb_exp.predict(X_reg_test)[0:5]
ngb_exp.pred_dist(X_reg_test)[0:5].params
```
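Since both models expose `pred_dist()`, the competing distributional assumptions can be compared on held-out likelihood. A minimal sketch reusing the two fits above (the `logpdf()` usage mirrors the first example):

```python
# lower negative log-likelihood indicates a better distributional fit
nll_norm = -ngb_norm.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
nll_exp = -ngb_exp.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
print('Normal NLL:', nll_norm)
print('Exponential NLL:', nll_exp)
```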
NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, `LogNormal` and `Exponential` have these scores implemented. To do survival analysis, use `NGBSurvival` and pass both the time-to-event (or censoring) and event indicator vectors to `fit()`:
```python
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X, Y = load_boston(return_X_y=True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30)  # time of an event or censoring
E_surv_train = Y_surv_train <= 30  # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)
```
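The fitted survival model exposes `pred_dist()` just like the regressor, so per-observation survival-time distributions can be inspected. A sketch, assuming the `scipy.stats` methods are forwarded as in the regression examples:

```python
# predicted survival-time distributions for the held-out set
surv_dists = ngb.pred_dist(X_surv_test)

# median predicted survival time for the first five observations,
# via the forwarded quantile function
print(surv_dists[0:5].ppf(0.5))
```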
The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.
Distribution | Parameters | Implemented Scores | Reference |
---|---|---|---|
`k_categorical(K)` | `p0`, `p1`, ... `p{K-1}` | `LogScore` | Categorical distribution on Wikipedia |
`Bernoulli` | `p` | `LogScore` | Bernoulli distribution on Wikipedia |
Classification distributions can be used through the `NGBClassifier()` constructor by passing the appropriate class as the `Dist` argument. `Bernoulli` is the default and is equivalent to `k_categorical(2)`.
```python
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2  # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False)  # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train)  # Y should have only 3 values: {0,1,2}
```
When using NGBoost for classification, the outcome vector `Y` must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.
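If your labels are not already coded this way, sklearn's `LabelEncoder` maps arbitrary labels onto 0 to K-1. A minimal sketch with hypothetical string labels:

```python
from sklearn.preprocessing import LabelEncoder

raw_labels = ['benign', 'malignant', 'other', 'benign']  # hypothetical labels
y_encoded = LabelEncoder().fit_transform(raw_labels)     # -> array([0, 1, 2, 0])
```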
`NGBClassifier` objects have three prediction methods: `predict()` returns the most likely class, `predict_proba()` returns the class probabilities, and `pred_dist()` returns the distribution object.
```python
ngb_cat.predict(X_cls_test)[0:5]
ngb_cat.predict_proba(X_cls_test)[0:5]
ngb_cat.pred_dist(X_cls_test)[0:5].params
```
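The three methods give consistent views of the same predictive distribution. As a quick sanity check (a sketch, assuming `predict()` returns the highest-probability class, as described above):

```python
import numpy as np

proba = ngb_cat.predict_proba(X_cls_test)

# each row should be a probability vector over the 3 classes
assert np.allclose(proba.sum(axis=1), 1.0)

# the predicted class should be the most likely one
assert (ngb_cat.predict(X_cls_test) == proba.argmax(axis=1)).all()
```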
NGBoost supports the log score (`LogScore`, also known as negative log-likelihood) and CRPS (`CRPScore`), although each score may not be implemented for each distribution. The score is specified by the `Score` argument in the constructor.
```python
from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)
```
NGBoost can be used with any sklearn regressor as the base learner, specified with the `Base` argument. The default is a depth-3 regression tree.
```python
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)
```
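Tree learners are not required: any sklearn regressor with `fit()` and `predict()` should work as a base. A sketch using a linear base learner (`Ridge` is an illustrative choice here, not a library default):

```python
from sklearn.linear_model import Ridge

# each boosting stage fits a ridge regression instead of a tree
ngb_linear = NGBRegressor(Base=Ridge(alpha=1.0), verbose=False)
ngb_linear.fit(X_reg_train, Y_reg_train)
```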
The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)
```
Sample weights (for training) are set using the `sample_weight` argument to `fit()`.
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)
```