As you work with NGBoost, you may want to experiment with distributions or scores that are not yet supported. Here we will walk through the process of implementing a new distribution or score.
The first order of business is to write the class for your new distribution. The distribution class must subclass the appropriate distribution type (either `RegressionDistn` or `ClassificationDistn`) and must implement `fit()` and `sample()` methods. The scores compatible with the distribution should be stored in a class attribute called `scores`, and the number of parameters in a class attribute `n_params`. The class must also store the (internal) distributional parameters in a `_params` instance attribute. Additionally, regression distributions must implement a `mean()` method to support point prediction.
We'll use the Laplace distribution as an example. The Laplace distribution has PDF $\frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$ with user-facing parameters $\mu \in \mathbb{R}$ and $b > 0$, which we will call `loc` and `scale` to conform to the `scipy.stats` implementation.

In NGBoost, all parameters must be represented internally in $\mathbb{R}$, so we need to reparametrize $(\mu, b)$ to, for instance, $(\mu, \log(b))$. The latter are the parameters we need to work with when we initialize a `Laplace` object and when we implement the score.
```python
from scipy.stats import laplace as dist
import numpy as np

from ngboost.distns.distn import RegressionDistn
from ngboost.scores import LogScore


class LaplaceLogScore(LogScore):  # will implement this later
    pass


class Laplace(RegressionDistn):

    n_params = 2
    scores = [LaplaceLogScore]  # will implement this later

    def __init__(self, params):
        # save the parameters
        self._params = params

        # create other objects that will be useful later
        self.loc = params[0]
        self.logscale = params[1]
        self.scale = np.exp(params[1])  # since params[1] is log(scale)
        self.dist = dist(loc=self.loc, scale=self.scale)

    def fit(Y):
        m, s = dist.fit(Y)  # use scipy's implementation
        return np.array([m, np.log(s)])

    def sample(self, m):
        return np.array([self.dist.rvs() for i in range(m)])

    def __getattr__(self, name):  # gives us access to Laplace.mean() required for RegressionDist.predict()
        if name in dir(self.dist):
            return getattr(self.dist, name)
        return None

    @property
    def params(self):
        return {'loc': self.loc, 'scale': self.scale}
```
The `fit()` method is a class method that takes a vector of observations, fits a marginal distribution, and returns the fitted parameters in their internal form. Meanwhile, `sample()` should return $m$ samples from $P(Y|X=x)$, each of which is a vector of length `len(Y)`.
Here we're taking advantage of the fact that `scipy.stats` already has the Laplace distribution implemented, so we can steal its `fit()` method and put a thin wrapper around `rvs()` to get samples. We also use `__getattr__()` on the internal `scipy.stats` object to get access to its `mean()` method.

Lastly, we write a convenience property `params` that, when accessed, returns the distributional parameters as the user expects to see them, i.e. $(\mu, b)$, not $(\mu, \log b)$.
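As a quick standalone sanity check (illustrative only, not part of NGBoost's fitting loop), we can exercise the class directly; the `reshape(-1, 1)` below reflects the convention in the code above that `params` holds one row per parameter:

```python
# Quick check of the Laplace class on its own (hypothetical usage sketch).
Y_marginal = np.random.laplace(loc=1.0, scale=2.0, size=500)

internal = Laplace.fit(Y_marginal)     # -> array([mu_hat, log(b_hat)])
d = Laplace(internal.reshape(-1, 1))   # parameters for a single "observation"

print(d.params)              # user-facing {'loc': ..., 'scale': ...}
print(d.sample(3).shape)     # 3 samples, each a vector of length 1
```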
### Implementing a Score for our Distribution
Now we turn our attention to implementing a score that we can use with this distribution. We'll use the log score as an example.
All implemented scores should subclass the appropriate score and implement three methods:

- `score()`: the value of the score at the current parameters, given the data `Y`
- `d_score()`: the derivative of the score at the current parameters, given the data `Y`
- `metric()`: the value of the Riemannian metric at the current parameters
```python
class LaplaceLogScore(LogScore):

    def score(self, Y):
        return -self.dist.logpdf(Y)

    def d_score(self, Y):
        D = np.zeros((len(Y), 2))  # first col is dS/d𝜇, second col is dS/d(log(b))
        D[:, 0] = np.sign(self.loc - Y) / self.scale
        D[:, 1] = 1 - np.abs(self.loc - Y) / self.scale
        return D
```
Notice that the attributes of an instance of `Laplace` are referenced using the `self.attr` notation even though we haven't said these will be attributes of the `LaplaceLogScore` class. When a user asks NGBoost to use the `Laplace` distribution with the `LogScore`, NGBoost will first find the implementation of the log score that is compatible with `Laplace`, i.e. `LaplaceLogScore`, and dynamically create a new class that has both the attributes of the distribution and the appropriate implementation of the score. For this to work, the distribution class `Laplace` must have a `scores` class attribute that includes the implementation `LaplaceLogScore`, and `LaplaceLogScore` must subclass `LogScore`. As long as those conditions are satisfied, NGBoost can take care of the rest.
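To build intuition for what that dynamically created class looks like, here is a rough sketch of the idea using plain `type()`. This illustrates the mechanism only; it is not NGBoost's actual internals:

```python
# Illustration only: manually combine the distribution with its score
# implementation, roughly in the spirit of what NGBoost does behind the scenes.
LaplaceWithLogScore = type("LaplaceWithLogScore", (Laplace, LaplaceLogScore), {})

d = LaplaceWithLogScore(np.array([[0.0], [0.0]]))  # loc = 0, log(scale) = 0 -> scale = 1
print(d.score(np.array([0.5])))    # -log pdf of 0.5 under Laplace(0, 1)
print(d.d_score(np.array([0.5])))  # gradient w.r.t. (mu, log b)
```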
In this example we won't bother implementing `metric()`, which would return the current Fisher information. The reason is that the NGBoost implementation of `LogScore` has a default `metric()` method that uses a Monte Carlo method to approximate the Fisher information using the gradients from `d_score()` and the distribution's `sample()` method (that's why we needed to implement `sample()`). By inheriting from `LogScore`, not only can NGBoost find our implementation for the Laplace distribution, it can also fall back on the default `metric()` method. More on that later.
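For intuition, that Monte Carlo approximation amounts to averaging outer products of gradients over samples drawn from the current predicted distributions. A conceptual sketch (not NGBoost's exact internals) might look like:

```python
# Conceptual sketch of a Monte Carlo Fisher information estimate for the log
# score: E[grad grad^T], estimated by sampling from the fitted distribution.
def monte_carlo_metric(dist, n_mc_samples=100):
    grads = np.stack([dist.d_score(Y) for Y in dist.sample(n_mc_samples)])  # (samples, N, n_params)
    return np.mean(np.einsum("sij,sik->sijk", grads, grads), axis=0)        # (N, n_params, n_params)
```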
Putting it all together:
```python
class LaplaceLogScore(LogScore):

    def score(self, Y):
        return -self.dist.logpdf(Y)

    def d_score(self, Y):
        D = np.zeros((len(Y), 2))  # first col is dS/d𝜇, second col is dS/d(log(b))
        D[:, 0] = np.sign(self.loc - Y) / self.scale
        D[:, 1] = 1 - np.abs(self.loc - Y) / self.scale
        return D


class Laplace(RegressionDistn):

    n_params = 2
    scores = [LaplaceLogScore]

    def __init__(self, params):
        # save the parameters
        self._params = params

        # create other objects that will be useful later
        self.loc = params[0]
        self.logscale = params[1]
        self.scale = np.exp(params[1])  # since params[1] is log(scale)
        self.dist = dist(loc=self.loc, scale=self.scale)

    def fit(Y):
        m, s = dist.fit(Y)  # use scipy's implementation
        return np.array([m, np.log(s)])

    def sample(self, m):
        return np.array([self.dist.rvs() for i in range(m)])

    def __getattr__(self, name):  # gives us access to Laplace.mean() required for RegressionDist.predict()
        if name in dir(self.dist):
            return getattr(self.dist, name)
        return None

    @property
    def params(self):
        return {'loc': self.loc, 'scale': self.scale}
```
And we can test our method:
```python
from ngboost import NGBRegressor
from sklearn.datasets import fetch_california_housing  # load_boston was removed from recent scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = fetch_california_housing(return_X_y=True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor(Dist=Laplace, Score=LogScore).fit(X_reg_train, Y_reg_train)
Y_preds = ngb.predict(X_reg_test)
Y_dists = ngb.pred_dist(X_reg_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_reg_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_reg_test).mean()
print('Test NLL', test_NLL)
```
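We can also inspect the user-facing parameters of the predicted distributions through the `params` property we defined, e.g. for the first few test points:

```python
# Peek at the predicted (loc, scale) for the first three test observations
print({name: values[:3] for name, values in Y_dists.params.items()})
```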
Dig into the source of `ngboost.distns` to find more examples. If you write and test your own distribution, please contribute it to NGBoost by making a pull request!
You can make your distribution suitable for use in survival analysis if you implement a censored version of the score. The signatures of the `score()`, `d_score()` and `metric()` methods should be the same, but they should expect `Y` to be indexable into two arrays like `E, T = Y["Event"], Y["Time"]`. Furthermore, any censored scores should be linked to the distribution class definition via a class attribute called `censored_scores` instead of `scores`.

Since censored scores are more general than their standard counterparts (fully observed data is a special case of censored data), if you implement a censored score in NGBoost, it will automatically become available as a usable score for standard regression analysis. There is no need to implement the regression score separately or register it in the `scores` class attribute.
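As an illustration, a right-censored log score for our Laplace example might look roughly like the sketch below. The class name is ours, the convention assumed is right-censoring, and the gradient and metric (or the Monte Carlo fallback) would be handled analogously:

```python
# Hedged sketch of a censored log score for the Laplace example: observed points
# contribute -log f(T); right-censored points contribute -log S(T) = -log(1 - F(T)).
class LaplaceLogScoreCensored(LogScore):

    def score(self, Y):
        E, T = Y["Event"], Y["Time"]
        uncens = -self.dist.logpdf(T)          # contribution if the event was observed at T
        cens = -np.log(1 - self.dist.cdf(T))   # contribution if still event-free at T
        return np.where(E, uncens, cens)
```

A class like this would then be registered on the distribution via `censored_scores = [LaplaceLogScoreCensored]`.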
As we saw with the log score, the easiest thing to do as a developer is to lean on the default NGBoost method that calculates the log score metric.

However, the distribution-agnostic default method is slow because it must sample from the distribution many times to build up an approximation of the metric. If you want to make it faster, you must derive and implement the distribution-specific Riemannian metric, which for the log score is the Fisher information matrix of that distribution. You have to derive the Fisher with respect to the internal NGBoost parametrization (if that is different from the user-facing parametrization, e.g. $\log(\sigma)$, not $\sigma$). Deriving a Fisher is not necessarily easy, since you have to compute an expectation analytically, but there are many worked examples of Fisher matrices online that you can look through.
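For the Laplace example above, this works out nicely: the Fisher information in the $(\mu, b)$ parametrization is $\mathrm{diag}(1/b^2,\, 1/b^2)$ (a standard result, but worth double-checking), which becomes $\mathrm{diag}(1/b^2,\, 1)$ after reparametrizing to $(\mu, \log b)$. A `metric()` for `LaplaceLogScore` could then look like this sketch:

```python
class LaplaceLogScore(LogScore):
    # ... score() and d_score() as before ...

    # Hedged sketch (our derivation): the analytic Fisher information of the
    # Laplace distribution in the internal (mu, log b) parametrization.
    def metric(self):
        FI = np.zeros((self.loc.shape[0], 2, 2))  # one 2x2 matrix per observation
        FI[:, 0, 0] = 1 / self.scale ** 2         # entry for mu
        FI[:, 1, 1] = 1                           # entry for log(b)
        return FI
```

The returned array holds one $2 \times 2$ matrix per predicted distribution, matching the shape produced by the Monte Carlo fallback.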
For example, consider the Student's-t distribution, parameterised by its location $\mu$ and scale $\sigma$, with the degrees of freedom $\nu$ treated as fixed.

The Fisher information of this distribution can be found here, and is

$$\begin{align*} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \end{align*}$$

Since $\sigma > 0$, for NGBoost we must work internally with $\log(\sigma)$ rather than $\sigma$. This requires us to reparameterise the distribution. To find the Fisher information under this reparameterisation, we can follow the procedure laid out here on Wikipedia.
Let $\eta = (\mu, \sigma)$ denote the user-facing parameters and $\theta = (\mu, \log \sigma)$ the internal NGBoost parameters. Then

$$I_{\eta}(\eta) = J^T I_{\theta}(\theta) J,$$

where $J$ is the $2 \times 2$ Jacobian matrix defined by

$$[J]_{ij} = \dfrac{\partial \theta_i}{\partial \eta_j},$$

which evaluates to

$$\begin{align*} J = J^T &= \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{\sigma} \end{bmatrix}\\ J^{-1} &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix} \end{align*}$$

We can thus obtain the desired Fisher information by rearranging:
$$\begin{align*} I_{\theta}(\theta) &= J^{-1} I_{\eta}(\eta) J^{-1}\\ &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}\\ &= \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^2} \end{bmatrix} \end{align*}$$

If you don't want to use the log score (say you want to use CRPS, for example), then NGBoost does not (yet?) have a default method for calculating the metric and you must derive and implement it yourself. This is harder than deriving a Fisher because there are not many worked examples. The most general derivation process should follow the outline here, replacing the KL divergence (which is induced by the log score) with whichever divergence is induced by the scoring rule you want to use (e.g. $L_2$ for CRPS), again taking care to derive with respect to the internal NGBoost parametrization, not the user-facing one. For any particular score, there may be a specific closed-form expression that you can use to calculate the metric across distributions (the expression for the Fisher information serves this purpose for the log score) or there may not be; I actually don't know the answer to this question! But if there were, that could suggest some kind of default implementation for that score's `metric()` method.
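Before moving on, here is what the Student's-t result above might look like in code. This is a hedged sketch: the class name and the `self.scale` ($\sigma$) and `self.df` ($\nu$) attributes are illustrative assumptions, not taken from NGBoost's actual T implementation:

```python
# Hedged sketch: an analytic metric() for a Student's-t log score with fixed
# degrees of freedom, in the internal (mu, log sigma) parametrization.
class StudentTLogScore(LogScore):
    # score() and d_score() omitted for brevity

    def metric(self):
        FI = np.zeros((self.scale.shape[0], 2, 2))  # one 2x2 Fisher matrix per observation
        FI[:, 0, 0] = (self.df + 1) / ((self.df + 3) * self.scale ** 2)
        FI[:, 1, 1] = self.df / (2 * (self.df + 3) * self.scale ** 2)
        return FI
```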
We've seen how to implement an existing score for a new distribution, but making a new score altogether in NGBoost is also easy: just make a new class that subclasses `Score`:

```python
from ngboost.scores import Score

class SphericalScore(Score):
    pass
```

That's it. Distribution-specific implementations of this score (e.g. `LaplaceSphericalScore`) should subclass `SphericalScore`. The implementations of `LogScore` and `CRPScore` are in `ngboost.scores` for reference.