Developing NGBoost

As you work with NGBoost, you may want to experiment with distributions or scores that are not yet supported. Here we will walk through the process of implementing a new distribution or score.

Adding Distributions

The first order of business is to write the class for your new distribution. The distribution class must subclass the appropriate distribution type (either RegressionDistn or ClassificationDistn) and must implement methods for fit() and sample(). The scores compatible with the distribution should be stored in a class attribute called score and the number of parameters in an class attribute n_params. The class must also store the (internal) distributional parameters in a _params instance attribute. Additionally, regression distributions must implement a mean() method to support point prediction.

We'll use the Laplace distribution as an example. The Laplace distribution has PDF $\frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$ with user-facing parameters $\mu \in \mathbb{R}$ and $b > 0$, which we will call loc and scale to conform to the scipy.stats implementation.

In NGBoost, all parameters must be represented internally in $\mathbb R$, so we need to reparametrize $(\mu, b)$ to, for instance, $(\mu, \log(b))$. The latter are the parameters we need to work with when we initialize a Laplace object and when implement the score.

from scipy.stats import laplace as dist
import numpy as np
from ngboost.distns.distn import RegressionDistn
from ngboost.scores import LogScore

class LaplaceLogScore(LogScore): # will implement this later
    pass

class Laplace(RegressionDistn):

    n_params = 2
    scores = [LaplaceLogScore] # will implement this later

    def __init__(self, params):
        # save the parameters
        self._params = params

        # create other objects that will be useful later
        self.loc = params[0]
        self.logscale = params[1]
        self.scale = np.exp(params[1]) # since params[1] is log(scale)
        self.dist = dist(loc=self.loc, scale=self.scale)

    def fit(Y):
        m, s = dist.fit(Y) # use scipy's implementation
        return np.array([m, np.log(s)])

    def sample(self, m):
        return np.array([self.dist.rvs() for i in range(m)])

    def __getattr__(self, name): # gives us access to Laplace.mean() required for RegressionDist.predict()
        if name in dir(self.dist):
            return getattr(self.dist, name)
        return None

    @property
    def params(self):
        return {'loc':self.loc, 'scale':self.scale}

The fit() method is a class method that takes a vector of observations and fits a marginal distribution. Meanwhile, sample() should return a $m$ samples from $P(Y|X=x)$, each of which is a vector of len(Y).

Here we're taking advantage of the fact that scipy.stats already has the Laplace distribution implemented so we can steal its fit() method and put a thin wrapper around rvs() to get samples. We also use __getattr__() on the internal scipy.stats object to get access to its mean() method.

Lastly, we write a convenience method params() that, when called, returns the distributional parameters as the user expects to see them, i.e. $(\mu, b)$, not $(\mu, \log b)$.

Implementing a Score for our Distribution

Now we turn our attention to implementing a score that we can use with this distribution. We'll use the log score as an example.

All implemented scores should subclass the appropriate score and implement three methods:

score() : the value of the score at the current parameters, given the data Y
d_score() : the derivative of the score at the current parameters, given the data Y
metric() : the value of the Riemannian metric at the current parameters

class LaplaceLogScore(LogScore):

    def score(self, Y):
        return -self.dist.logpdf(Y)

    def d_score(self, Y):
        D = np.zeros((len(Y), 2)) # first col is dS/d𝜇, second col is dS/d(log(b))
        D[:, 0] = np.sign(self.loc - Y)/self.scale
        D[:, 1] = 1 - np.abs(self.loc - Y)/self.scale
        return D

Notice that the attributes of an instance of Laplace are referenced using the self.attr notation even though we haven't said these will be attributes of the LaplaceLogScore class. When a user asks NGBoost to use the Laplace distribution with the LogScore, NGBoost will first find the implmentation of the log score that is compatible with Laplace, i.e. LaplaceLogScore and dynamically create a new class that has both the attributes of the distribution and the appropriate implementation of the score. For this to work, the distribution class Laplace must have a scores class attribute that includes the implementation LaplaceLogScore and LaplaceLogScore must subclass LogScore. As long as those conditions are satisfied, NGBoost can take care of the rest.

The derivatives with respect to $\log b$ and $\mu$ are easily derived using, for instance, WolframAlpha.

In this example we won't bother implementing metric(), which would return the current Fisher Information. The reason is that the NGBoost implmentation of LogScore has a default metric() method that uses a Monte Carlo method to approximate the Fisher Information using the gradient() method and the distribution's sample() method (that's why we needed to implement sample()). By inhereting from LogScore(), not only can NGBoost find our implementation for the Laplace distribution, it can also fall back on the defualt metric() method. More on that later.

Putting it all together:

class LaplaceLogScore(LogScore):

    def score(self, Y):
        return -self.dist.logpdf(Y)

    def d_score(self, Y):
        D = np.zeros((len(Y), 2)) # first col is dS/d𝜇, second col is dS/d(log(b))
        D[:, 0] = np.sign(self.loc - Y)/self.scale
        D[:, 1] = 1 - np.abs(self.loc - Y)/self.scale
        return D

class Laplace(RegressionDistn):

    n_params = 2
    scores = [LaplaceLogScore]

    def __init__(self, params):
        # save the parameters
        self._params = params

        # create other objects that will be useful later
        self.loc = params[0]
        self.logscale = params[1]
        self.scale = np.exp(params[1]) # since params[1] is log(scale)
        self.dist = dist(loc=self.loc, scale=self.scale)

    def fit(Y):
        m, s = dist.fit(Y) # use scipy's implementation
        return np.array([m, np.log(s)])

    def sample(self, m):
        return np.array([self.dist.rvs() for i in range(m)])

    def __getattr__(self, name): # gives us access to Laplace.mean() required for RegressionDist.predict()
        if name in dir(self.dist):
            return getattr(self.dist, name)
        return None

    @property
    def params(self):
        return {'loc':self.loc, 'scale':self.scale}

And we can test our method:

from ngboost import NGBRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBRegressor(Dist=Laplace, Score=LogScore).fit(X_reg_train, Y_reg_train)
Y_preds = ngb.predict(X_reg_test)
Y_dists = ngb.pred_dist(X_reg_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_reg_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_reg_test).mean()
print('Test NLL', test_NLL)

[iter 0] loss=3.5989 val_loss=0.0000 scale=1.0000 norm=6.8968
[iter 100] loss=2.8432 val_loss=0.0000 scale=1.0000 norm=5.3980
[iter 200] loss=2.4432 val_loss=0.0000 scale=1.0000 norm=3.5194
[iter 300] loss=2.2069 val_loss=0.0000 scale=1.0000 norm=2.4722
[iter 400] loss=2.0325 val_loss=0.0000 scale=1.0000 norm=1.9901
Test MSE 13.593387122611777
Test NLL 2.552651522788599

Dig into the source of ngboost.distns to find more examples. If you write and test your own distribution, please contribute it to NGBoost by making a pull request!

Censored Scores

You can make your distribution suitable for use in surival analysis if you implement a censored version of the score. The signature for the score(), d_score() and metric() methods should be the same, but they should expect Y to be indexable into two arrays like E, T = Y["Event"], Y["Time"]. Furthermore, any censored scores should be linked to the distribution class definition via a class attribute called censored_scores instead of scores.

Since censored scores are more general than their standard counterparts (fully observed data is a specific case of censored data), if you implement a censored score in NGBoost, it will automatically become available as a useable score for standard regression analysis. No need to implement the regression score seperately or register it in the scores class attribute.

Metrics

As we saw, using the log score, the easiest thing to do as a developer is to lean on the default ngboost method that calculates the log score metric.

However, the distribution-agnostic default method is slow because it must sample from the distribution many times to build up an approximation of the metric. If you want to make it faster, then you must derive and implement the distribution-specific Riemannian metric, which for the log score is the Fisher information matrix of that distribution. You have to derive the Fisher with respect to the internal ngboost parameterization (if that is different to the user-facing parametrization, e.g. $\log(\sigma)$, not $\sigma$). Deriving a Fisher is not necessarily easy since you have to compute an expectation analytically, but there are many examples online of deriving Fisher matrices that you can look through.

For example, consider the Student's-t distribution. This distribution is parameterised by degrees of freedom $\nu$, mean $\mu$, and standard deviation $\sigma$.

The Fisher information of this distribution can be found here, and is

$$\begin{align*} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \end{align*}$$

As $\sigma > 0$, for NGBoost we must replace this with $\log ( \sigma )$. This requires us to reparameterise the distribution. To find the Fisher information under this reparameterisation, we can follow the procedure laid out here on Wikipedia.

$\eta = (\mu, \sigma), \theta = (\mu, \log \sigma)$

$I_{\eta}(\eta) = J^T I_{\theta}(\theta) J$

Where $J$ is the $2 \times 2$ Jacobian matrix defined by

$$[J]_{ij} = \dfrac{\partial \theta_i}{\partial \eta_j}$$

Which evaluates to

$$\begin{align*} J = J^T &= \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{\sigma} \end{bmatrix}\\ J^{-1} &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}\\ \end{align*}$$

We can thus obtain the desired Fisher information by rearranging as such,

$$\begin{align*} I_{\theta}(\theta) &= J^{-1} I_{\eta}(\eta) J^{-1}\\ &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}\\ &= \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^2} \end{bmatrix} \end{align*} $$

If you don't want to use the log score (say you want CRP score, for example), then ngboost does not (yet?) have a default method for calculating the metric and you must derive and implement it yourself. This is harder than deriving a Fisher because there are not many worked examples. The most general derivation process should follow the outline here, replacing the KL divergence (which is induced by the log score) with whichever divergence is induced by the scoring rule you want to use (e.g. L2 for CRPS), again taking care to derive with respect to the internal ngboost parameterization, not the user-facing one. For any particular score, there may be a specific closed-form expression that you can use to calculate the metric across distributions (the expression for the Fisher Info serves this purpose for the log score) or there may not be- I actually don't know the answer to this question! But if there were, that could suggest some kind of default implementation for that score's metric() method.

Adding Scores

We've seen how to implement an existing score for a new distribution, but making a new score altogether in NGBoost is also easy: just make a new class that subclasses Score:

from ngboost.scores import Score

class SphericalScore(Score):
    pass

That's it. Distribution-specific implemenations of this score (e.g. LaplaceSphericalScore) should subclass SphericalScore. The implementations of LogScore and CRPScore are in ngboost.scores for reference.