As you work with NGBoost, you may want to experiment with distributions or scores that are not yet supported. Here we will walk through the process of implementing a new distribution or score.

The first order of business is to write the class for your new distribution. The distribution class must subclass the appropriate distribution type (either `RegressionDistn`

or `ClassificationDistn`

) and must implement methods for `fit()`

and `sample()`

. The scores compatible with the distribution should be stored in a class attribute called `score`

and the number of parameters in an class attribute n_params. The class must also store the (internal) distributional parameters in a `_params`

instance attribute. Additionally, regression distributions must implement a `mean()`

method to support point prediction.

We'll use the Laplace distribution as an example. The Laplace distribution has PDF $\frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$ with user-facing parameters $\mu \in \mathbb{R}$ and $b > 0$, which we will call `loc`

and `scale`

to conform to the `scipy.stats`

implementation.

In NGBoost, all parameters must be represented internally in $\mathbb R$, so we need to reparametrize $(\mu, b)$ to, for instance, $(\mu, \log(b))$. The latter are the parameters we need to work with when we initialize a `Laplace`

object and when implement the score.

```
from scipy.stats import laplace as dist
import numpy as np
from ngboost.distns.distn import RegressionDistn
from ngboost.scores import LogScore
class LaplaceLogScore(LogScore): # will implement this later
pass
class Laplace(RegressionDistn):
n_params = 2
scores = [LaplaceLogScore] # will implement this later
def __init__(self, params):
# save the parameters
self._params = params
# create other objects that will be useful later
self.loc = params[0]
self.logscale = params[1]
self.scale = np.exp(params[1]) # since params[1] is log(scale)
self.dist = dist(loc=self.loc, scale=self.scale)
def fit(Y):
m, s = dist.fit(Y) # use scipy's implementation
return np.array([m, np.log(s)])
def sample(self, m):
return np.array([self.dist.rvs() for i in range(m)])
def __getattr__(self, name): # gives us access to Laplace.mean() required for RegressionDist.predict()
if name in dir(self.dist):
return getattr(self.dist, name)
return None
@property
def params(self):
return {'loc':self.loc, 'scale':self.scale}
```

The `fit()`

method is a class method that takes a vector of observations and fits a marginal distribution. Meanwhile, `sample()`

should return a $m$ samples from $P(Y|X=x)$, each of which is a vector of `len(Y)`

.

Here we're taking advantage of the fact that `scipy.stats`

already has the Laplace distribution implemented so we can steal its `fit()`

method and put a thin wrapper around `rvs()`

to get samples. We also use `__getattr__()`

on the internal `scipy.stats`

object to get access to its `mean()`

method.

Lastly, we write a convenience method `params()`

that, when called, returns the distributional parameters as the user expects to see them, i.e. $(\mu, b)$, not $(\mu, \log b)$.

### Implementing a Score for our Distribution

Now we turn our attention to implementing a score that we can use with this distribution. We'll use the log score as an example.

All implemented scores should subclass the appropriate score and implement three methods:

`score()`

: the value of the score at the current parameters, given the data`Y`

`d_score()`

: the derivative of the score at the current parameters, given the data`Y`

`metric()`

: the value of the Riemannian metric at the current parameters

```
class LaplaceLogScore(LogScore):
def score(self, Y):
return -self.dist.logpdf(Y)
def d_score(self, Y):
D = np.zeros((len(Y), 2)) # first col is dS/d𝜇, second col is dS/d(log(b))
D[:, 0] = np.sign(self.loc - Y)/self.scale
D[:, 1] = 1 - np.abs(self.loc - Y)/self.scale
return D
```

Notice that the attributes of an instance of `Laplace`

are referenced using the `self.attr`

notation even though we haven't said these will be attributes of the `LaplaceLogScore`

class. When a user asks NGBoost to use the `Laplace`

distribution with the `LogScore`

, NGBoost will first find the implmentation of the log score that is compatible with `Laplace`

, i.e. `LaplaceLogScore`

and dynamically create a new class that has both the attributes of the distribution and the appropriate implementation of the score. For this to work, the distribution class `Laplace`

must have a `scores`

class attribute that includes the implementation `LaplaceLogScore`

and `LaplaceLogScore`

must subclass `LogScore`

. As long as those conditions are satisfied, NGBoost can take care of the rest.

In this example we won't bother implementing `metric()`

, which would return the current Fisher Information. The reason is that the NGBoost implmentation of `LogScore`

has a default `metric()`

method that uses a Monte Carlo method to approximate the Fisher Information using the `gradient()`

method and the distribution's `sample()`

method (that's why we needed to implement `sample()`

). By inhereting from `LogScore()`

, not only can NGBoost find our implementation for the Laplace distribution, it can also fall back on the defualt `metric()`

method. More on that later.

Putting it all together:

```
class LaplaceLogScore(LogScore):
def score(self, Y):
return -self.dist.logpdf(Y)
def d_score(self, Y):
D = np.zeros((len(Y), 2)) # first col is dS/d𝜇, second col is dS/d(log(b))
D[:, 0] = np.sign(self.loc - Y)/self.scale
D[:, 1] = 1 - np.abs(self.loc - Y)/self.scale
return D
class Laplace(RegressionDistn):
n_params = 2
scores = [LaplaceLogScore]
def __init__(self, params):
# save the parameters
self._params = params
# create other objects that will be useful later
self.loc = params[0]
self.logscale = params[1]
self.scale = np.exp(params[1]) # since params[1] is log(scale)
self.dist = dist(loc=self.loc, scale=self.scale)
def fit(Y):
m, s = dist.fit(Y) # use scipy's implementation
return np.array([m, np.log(s)])
def sample(self, m):
return np.array([self.dist.rvs() for i in range(m)])
def __getattr__(self, name): # gives us access to Laplace.mean() required for RegressionDist.predict()
if name in dir(self.dist):
return getattr(self.dist, name)
return None
@property
def params(self):
return {'loc':self.loc, 'scale':self.scale}
```

And we can test our method:

```
from ngboost import NGBRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X, Y = load_boston(True)
X_reg_train, X_reg_test, Y_reg_train, Y_reg_test = train_test_split(X, Y, test_size=0.2)
ngb = NGBRegressor(Dist=Laplace, Score=LogScore).fit(X_reg_train, Y_reg_train)
Y_preds = ngb.predict(X_reg_test)
Y_dists = ngb.pred_dist(X_reg_test)
# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_reg_test)
print('Test MSE', test_MSE)
# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_reg_test).mean()
print('Test NLL', test_NLL)
```

Dig into the source of `ngboost.distns`

to find more examples. If you write and test your own distribution, please contribute it to NGBoost by making a pull request!

You can make your distribution suitable for use in surival analysis if you implement a censored version of the score. The signature for the `score()`

, `d_score()`

and `metric()`

methods should be the same, but they should expect `Y`

to be indexable into two arrays like `E, T = Y["Event"], Y["Time"]`

. Furthermore, any censored scores should be linked to the distribution class definition via a class attribute called `censored_scores`

instead of `scores`

.

Since censored scores are more general than their standard counterparts (fully observed data is a specific case of censored data), if you implement a censored score in NGBoost, it will automatically become available as a useable score for standard regression analysis. No need to implement the regression score seperately or register it in the `scores`

class attribute.

As we saw, using the log score, the easiest thing to do as a developer is to lean on the default ngboost method that calculates the log score metric.

However, the distribution-agnostic default method is slow because it must sample from the distribution many times to build up an approximation of the metric. If you want to make it faster, then you must derive and implement the distribution-specific Riemannian metric, which for the log score is the Fisher information matrix of that distribution. You have to derive the Fisher with respect to the internal ngboost parameterization (if that is different to the user-facing parametrization, e.g. $\log(\sigma)$, not $\sigma$). Deriving a Fisher is not necessarily easy since you have to compute an expectation analytically, but there are many examples online of deriving Fisher matrices that you can look through.

For example, consider the Student's-t distribution. This distribution is parameterised by degrees of freedom $\nu$, mean $\mu$, and standard deviation $\sigma$.

The Fisher information of this distribution can be found here, and is

$$\begin{align*} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \end{align*}$$As $\sigma > 0$, for NGBoost we must replace this with $\log ( \sigma )$. This requires us to reparameterise the distribution. To find the Fisher information under this reparameterisation, we can follow the procedure laid out here on Wikipedia.

$\eta = (\mu, \sigma), \theta = (\mu, \log \sigma)$

$I_{\eta}(\eta) = J^T I_{\theta}(\theta) J$

Where $J$ is the $2 \times 2$ Jacobian matrix defined by

$$[J]_{ij} = \dfrac{\partial \theta_i}{\partial \eta_j}$$Which evaluates to

$$\begin{align*} J = J^T &= \begin{bmatrix} 1 & 0 \\ 0 & \frac{1}{\sigma} \end{bmatrix}\\ J^{-1} &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}\\ \end{align*}$$We can thus obtain the desired Fisher information by rearranging as such,

$$\begin{align*} I_{\theta}(\theta) &= J^{-1} I_{\eta}(\eta) J^{-1}\\ &= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix} \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^4} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}\\ &= \begin{bmatrix} \frac{\nu + 1}{(\nu + 3) \sigma^2} & 0 \\ 0 & \frac{\nu}{2(\nu + 3) \sigma^2} \end{bmatrix} \end{align*} $$If you don't want to use the log score (say you want CRP score, for example), then ngboost does not (yet?) have a default method for calculating the metric and you *must* derive and implement it yourself. This is harder than deriving a Fisher because there are not many worked examples. The most general derivation process should follow the outline here, replacing the KL divergence (which is induced by the log score) with whichever divergence is induced by the scoring rule you want to use (e.g. L2 for CRPS), again taking care to derive with respect to the internal ngboost parameterization, not the user-facing one. For any particular score, there may be a specific closed-form expression that you can use to calculate the metric across distributions (the expression for the Fisher Info serves this purpose for the log score) or there may not be- I actually don't know the answer to this question! But if there were, that could suggest some kind of default implementation for that score's `metric()`

method.

We've seen how to *implement* an existing score for a new distribution, but making a new score altogether in NGBoost is also easy: just make a new class that subclasses `Score`

:

```
from ngboost.scores import Score
class SphericalScore(Score):
pass
```

That's it. Distribution-specific implemenations of this score (e.g. `LaplaceSphericalScore`

) should subclass `SphericalScore`

. The implementations of `LogScore`

and `CRPScore`

are in `ngboost.scores`

for reference.