We'll start with a probabilistic regression example on the Boston housing dataset:
```python
from ngboost import NGBRegressor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor().fit(X_train, Y_train)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_test, Y_preds)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)
```
Getting the estimated distributional parameters at a set of points is easy. This returns the predicted mean and standard deviation of the first five observations in the test set:
```python
Y_dists[0:5].params
```
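The distribution object forwards the methods of the corresponding `scipy.stats` distribution (the `logpdf()` call above relies on this), so other distributional quantities are available directly. As a minimal sketch, reusing `Y_dists` from the example above, this reads off 95% prediction intervals via the forwarded `ppf()` (quantile) method:

```python
# 95% prediction intervals for the first five test observations,
# using the quantile function of each predicted normal distribution
lower = Y_dists[0:5].ppf(0.025)
upper = Y_dists[0:5].ppf(0.975)
print(lower, upper)
```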
Distribution | Parameters | Implemented Scores | Reference |
---|---|---|---|
`Normal` | `loc`, `scale` | `LogScore`, `CRPScore` | `scipy.stats` normal |
`LogNormal` | `s`, `scale` | `LogScore`, `CRPScore` | `scipy.stats` lognormal |
`Exponential` | `scale` | `LogScore`, `CRPScore` | `scipy.stats` exponential |
Regression distributions can be used through the `NGBRegressor()` constructor by passing the appropriate class as the `Dist` argument. `Normal` is the default.
```python
from ngboost.distns import Exponential, Normal

X, Y = load_boston(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb_norm = NGBRegressor(Dist=Normal, verbose=False).fit(X_reg_train, Y_reg_train)
ngb_exp = NGBRegressor(Dist=Exponential, verbose=False).fit(X_reg_train, Y_reg_train)
```
There are two prediction methods for `NGBRegressor` objects: `predict()`, which returns point predictions as one would expect from a standard regressor, and `pred_dist()`, which returns a distribution object representing the conditional distribution of $Y|X=x_i$ at the points $x_i$ in the test set.
```python
ngb_norm.predict(X_reg_test)[0:5]
ngb_exp.predict(X_reg_test)[0:5]
ngb_exp.pred_dist(X_reg_test)[0:5].params
```
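Since both models expose `pred_dist()`, the competing distributional assumptions can be compared on held-out likelihood. A minimal sketch reusing the two fits above (the `logpdf()` usage mirrors the first example):

```python
# lower negative log-likelihood indicates a better distributional fit
nll_norm = -ngb_norm.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
nll_exp = -ngb_exp.pred_dist(X_reg_test).logpdf(Y_reg_test).mean()
print('Normal NLL:', nll_norm)
print('Exponential NLL:', nll_exp)
```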
NGBoost supports analyses of right-censored data. Any distribution that can be used for regression in NGBoost can also be used for survival analysis in theory, but this requires the implementation of the right-censored version of the appropriate score. At the moment, `LogNormal` and `Exponential` have these scores implemented. To do survival analysis, use `NGBSurvival` and pass both the time-to-event (or censoring) and event indicator vectors to `fit()`:
```python
import numpy as np
from ngboost import NGBSurvival
from ngboost.distns import LogNormal

X, Y = load_boston(return_X_y=True)
X_surv_train, X_surv_test, Y_surv_train, Y_surv_test = train_test_split(X, Y, test_size=0.2)

# introduce administrative censoring to simulate survival data
T_surv_train = np.minimum(Y_surv_train, 30)  # time of an event or censoring
E_surv_train = Y_surv_train <= 30  # 1 if T[i] is the time of an event, 0 if it's a time of censoring

ngb = NGBSurvival(Dist=LogNormal).fit(X_surv_train, T_surv_train, E_surv_train)
```
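The fitted survival model exposes `pred_dist()` just like the regressor, so per-observation survival-time distributions can be inspected. A sketch, assuming the `scipy.stats` methods are forwarded as in the regression examples:

```python
# predicted survival-time distributions for the held-out set
surv_dists = ngb.pred_dist(X_surv_test)

# median predicted survival time for the first five observations,
# via the forwarded quantile function
print(surv_dists[0:5].ppf(0.5))
```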
The scores currently implemented assume that the censoring is independent of survival, conditional on the observed predictors.
Distribution | Parameters | Implemented Scores | Reference |
---|---|---|---|
`k_categorical(K)` | `p0`, `p1`, ... `p{K-1}` | `LogScore` | Categorical distribution on Wikipedia |
`Bernoulli` | `p` | `LogScore` | Bernoulli distribution on Wikipedia |
Classification distributions can be used through the `NGBClassifier()` constructor by passing the appropriate class as the `Dist` argument. `Bernoulli` is the default and is equivalent to `k_categorical(2)`.
```python
from ngboost import NGBClassifier
from ngboost.distns import k_categorical, Bernoulli
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
y[0:15] = 2  # artificially make this a 3-class problem instead of a 2-class problem
X_cls_train, X_cls_test, Y_cls_train, Y_cls_test = train_test_split(X, y, test_size=0.2)

ngb_cat = NGBClassifier(Dist=k_categorical(3), verbose=False)  # tell ngboost that there are 3 possible outcomes
_ = ngb_cat.fit(X_cls_train, Y_cls_train)  # Y should have only 3 values: {0,1,2}
```
When using NGBoost for classification, the outcome vector `Y` must consist only of integers from 0 to K-1, where K is the total number of classes. This is consistent with the classification standards in sklearn.
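If your labels are not already coded this way, sklearn's `LabelEncoder` maps arbitrary labels onto 0 to K-1. A minimal sketch with hypothetical string labels:

```python
from sklearn.preprocessing import LabelEncoder

raw_labels = ['benign', 'malignant', 'other', 'benign']  # hypothetical labels
y_encoded = LabelEncoder().fit_transform(raw_labels)     # -> array([0, 1, 2, 0])
```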
`NGBClassifier` objects have three prediction methods: `predict()` returns the most likely class, `predict_proba()` returns the class probabilities, and `pred_dist()` returns the distribution object.
```python
ngb_cat.predict(X_cls_test)[0:5]
ngb_cat.predict_proba(X_cls_test)[0:5]
ngb_cat.pred_dist(X_cls_test)[0:5].params
```
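The three methods give consistent views of the same predictive distribution. As a quick sanity check (a sketch, assuming `predict()` returns the highest-probability class, as described above):

```python
import numpy as np

proba = ngb_cat.predict_proba(X_cls_test)

# each row should be a probability vector over the 3 classes
assert np.allclose(proba.sum(axis=1), 1.0)

# the predicted class should be the most likely one
assert (ngb_cat.predict(X_cls_test) == proba.argmax(axis=1)).all()
```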
NGBoost supports the log score (`LogScore`, also known as negative log-likelihood) and CRPS (`CRPScore`), although each score may not be implemented for each distribution. The score is specified by the `Score` argument in the constructor.
```python
from ngboost.scores import LogScore, CRPScore

NGBRegressor(Dist=Exponential, Score=CRPScore, verbose=False).fit(X_reg_train, Y_reg_train)
NGBClassifier(Dist=k_categorical(3), Score=LogScore, verbose=False).fit(X_cls_train, Y_cls_train)
```
NGBoost can be used with any sklearn regressor as the base learner, specified with the `Base` argument. The default is a depth-3 regression tree.
```python
from sklearn.tree import DecisionTreeRegressor

learner = DecisionTreeRegressor(criterion='friedman_mse', max_depth=5)

NGBSurvival(Dist=Exponential, Score=CRPScore, Base=learner, verbose=False).fit(X_surv_train, T_surv_train, E_surv_train)
```
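Tree learners are not required: any sklearn regressor with `fit()` and `predict()` should work as a base. A sketch using a linear base learner (`Ridge` is an illustrative choice here, not a library default):

```python
from sklearn.linear_model import Ridge

# each boosting stage fits a ridge regression instead of a tree
ngb_linear = NGBRegressor(Base=Ridge(alpha=1.0), verbose=False)
ngb_linear.fit(X_reg_train, Y_reg_train)
```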
The learning rate, number of estimators, minibatch fraction, and column subsampling are also easily adjusted:
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
ngb.fit(X_reg_train, Y_reg_train)
```
Sample weights (for training) are set using the `sample_weight` argument to `fit()`.
```python
ngb = NGBRegressor(n_estimators=100, learning_rate=0.01,
                   minibatch_frac=0.5, col_sample=0.5)
weights = np.random.random(Y_reg_train.shape)
ngb.fit(X_reg_train, Y_reg_train, sample_weight=weights)
```