Tony Duan*, Anand Avati*, Daisy Yi Ding, Sanjay Basu, Andrew Ng, Alejandro Schuler
Estimating the uncertainty in the predictions of a machine learning model is crucial for production deployments in the real world. Not only do we want our models to make accurate predictions, but we also want a correct estimate of uncertainty along with each prediction. When model predictions are part of an automated decision-making workflow or production line, predictive uncertainty estimates are important for determining manual fallback alternatives or for human inspection and intervenion.
Probabilistic prediction (or probabilistic forecasting), which is the approach where the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties.
Compare the point predictions vs probabilistic predictions in the following examples.
|Question||Point Prediction (No uncertainty estimate)||Probabilistic Prediction (Uncertainty is implicit)|
|What will be the temperature at noon tomorrow?||73.4 Fahrenheit|
|How long will this patient live?||11.3 months|
Gradient Boosting methods have generally been among the top performers in predictive accuracy over structured or tabular input data.
NGBoost enables predictive uncertainty estimation with Gradient Boosting through probabilistic predictions (including real valued outputs). With the use of Natural Gradients, NGBoost overcomes technical challenges that make generic probabilistic prediction hard with gradient boosting.
We release an open source package of our implementation on GitHub.Read our paper GitHub
The NGBoost algorithm is simple to use. It has three abstract modular components that are chosen as configuration:
Base LearnerThe most common choice is Decision Trees, which tend to work well on structured inputs.
Probability DistributionThe distribution needs to be compatible with the output type. For e.g. Normal distribution for real valued outputs, Bernoulli for binary outputs.
Scoring ruleMaximum Likelihood Estimation is an obvious choice. More robust rules such as Continuous Ranked Probability Score are also suitable.
The above choices can be mixed and matched to be customized for the specific prediction problem at hand.
Our key innovation is in employing the natural gradient to perform gradient boosting by casting it as a problem of determining the parameters of a probability distribution.
Ordinary gradients can be highly unsuitable for learning multi-parameter probability distributions (such as the Normal distribution). The training dynamics with the use of natural gradients tends to be much more stable and result in a better fit, as seen in the probabilistic regression example above.