We'll start with a probabilistic regression example on the Boston housing dataset:
```python
from ngboost import NGBRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_test, Y_preds)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)
```
Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:
```python
Y_dists[0:5].params
```
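As an illustration of what can be done with these parameters, here is a minimal sketch (not part of the NGBoost API) that builds 95% prediction intervals from the predicted Normal parameters using scipy.stats.norm:

```python
from scipy.stats import norm

# a minimal sketch: 95% prediction intervals from the predicted Normal parameters
params = Y_dists[0:5].params  # dict with 'loc' (mean) and 'scale' (std. dev.) arrays
lower = norm.ppf(0.025, loc=params['loc'], scale=params['scale'])
upper = norm.ppf(0.975, loc=params['loc'], scale=params['scale'])
print(list(zip(lower, upper)))
```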
| Distribution | Parameters | Implemented Scores | Reference |
|---|---|---|---|
| Normal | loc, scale | LogScore, CRPScore | scipy.stats normal |
| LogNormal | s, scale | LogScore, CRPScore | scipy.stats lognormal |
| Exponential | scale | LogScore, CRPScore | scipy.stats exponential |
Regression distributions can be used through the NGBRegressor() constructor by passing the appropriate class as the Dist argument. Normal is the default.
```python
from ngboost.distns import Exponential, Normal

X, Y = load_boston(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)
```
There are two prediction methods for NGBRegressor objects: predict(), which returns point predictions as one would expect from a standard regressor, and pred_dist(), which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.
```python
ngb_norm.predict(X_reg_test)[0:5]
ngb_exp.predict(X_reg_test)[0:5]
ngb_exp.pred_dist(X_reg_test)[0:5].params
```
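To judge which distributional assumption fits the data better, one option is to compare held-out negative log-likelihoods. A minimal sketch, reusing the two fitted models above and assuming each distribution object forwards logpdf to the underlying scipy distribution (as in the Normal example earlier):

```python
# a minimal sketch: compare held-out fit of the two distributional assumptions
nll_norm = -ngb_norm.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
nll_exp = -ngb_exp.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
print('Normal test NLL:', nll_norm)
print('Exponential test NLL:', nll_exp)
```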
NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, LogNormal and Exponential have these scores implemented. To do survival analysis, use NGBSurvival and pass both the time-to-event (or censoring) and event indicator vectors to fit():
```python
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X, Y = load_boston(return_X_y=True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring at 30 to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30)  # time of an event or censoring
E_surv_train = Y_surv_train <= 30            # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)
```
The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.
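After fitting, the predicted survival distributions can be inspected in the same way as in regression. A minimal sketch using the model above (here pred_dist() returns LogNormal distribution objects, so params contains the s and scale parameters):

```python
# a minimal sketch: inspect predicted survival distributions on the held-out set
surv_dists = ngb.pred_dist(X_surv_test)
print(surv_dists[0:5].params)  # LogNormal parameters for the first five observations
```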
| Distribution | Parameters | Implemented Scores | Reference |
|---|---|---|---|
| k_categorical(K) | p0, p1, ..., p{K-1} | LogScore | Categorical distribution on Wikipedia |
| Bernoulli | p | LogScore | Bernoulli distribution on Wikipedia |
Classification distributions can be used through the NGBClassifier() constructor by passing the appropriate class as the Dist argument. Bernoulli is the default and is equivalent to k_categorical(2).
```python
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2  # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False)  # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train)  # Y should have only 3 values: {0,1,2}
```
When using NGBoost for classification, the outcome vector Y must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.
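If your raw labels are strings or non-consecutive integers, one way to map them onto 0 to K-1 is sklearn's LabelEncoder. A minimal sketch with made-up labels for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# a minimal sketch: map arbitrary labels onto the integers 0..K-1 that NGBClassifier expects
raw_labels = ['benign', 'malignant', 'benign', 'unknown']  # hypothetical labels for illustration
encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)                # e.g. array([0, 1, 0, 2])
print(encoded, encoder.classes_)
```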
NGBClassifier objects have three prediction methods: predict() returns the most likely class, predict_proba() returns the class probabilities, and pred_dist() returns the distribution object.
```python
ngb_cat.predict(X_cls_test)[0:5]
ngb_cat.predict_proba(X_cls_test)[0:5]
ngb_cat.pred_dist(X_cls_test)[0:5].params
```
NGBoost supports the log score (LogScore, also known as negative log-likelihood) and CRPS (CRPScore), although each score may not be implemented for each distribution. The score is specified by the Score argument in the constructor.
```python
from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)
```
NGBoost can be used with any sklearn regressor as the base learner, specified with the Base argument. The default is a depth-3 regression tree.
```python
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)
```
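The base learner does not have to be a tree. As a sketch, here is a linear base learner; the choice of Ridge is only an illustration of the "any sklearn regressor" point, not a recommendation:

```python
from sklearn.linear_model import Ridge

# a sketch: any sklearn-style regressor with fit()/predict() can serve as the base learner
ngb_linear = NGBRegressor(Base=Ridge(alpha=1.0), verbose=False)
ngb_linear.fit(X_reg_train, Y_reg_train)
```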
The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)
```
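If overfitting is a concern when raising n_estimators, fit() also accepts a validation set. A minimal sketch, assuming the X_val, Y_val, and early_stopping_rounds arguments are available in your ngboost version:

```python
# a sketch of early stopping on a validation split; X_val, Y_val, and
# early_stopping_rounds are assumed to be supported by your ngboost version
X_tr, X_val, Y_tr, Y_val = train_test_split(X_reg_train, Y_reg_train, test_size=0.2)
ngb = NGBRegressor(n_estimators=500, learning_rate=0.01)
ngb.fit(X_tr, Y_tr, X_val=X_val, Y_val=Y_val, early_stopping_rounds=20)
```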
Sample weights (for training) are set using the sample_weight argument to fit():
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)
```