Bayesian Additive Regression Trees
Context
This paper develops a Bayesian approach to an ensemble of trees. It is extremely readable for an academic paper and I recommend taking the time to read it if you find the subject interesting.
Bayesian Additive Regression Trees (BART) are similar to Gradient Boosting Tree (GBT) methods in that they sum the contribution of sequential weak learners. This is opposed to Random Forests, which average many independent estimates. But instead of multiplying each sequential tree by a small constant (the learning rate) as in GBT, the Bayesian approach is to use a prior.
By using a prior and likelihood to get a posterior distribution of our prediction, we’re given a much richer set of information than the point estimates of classical regression methods. Furthermore, the Bayesian framework has a built-in complexity penalty, which means we no longer have to make empirical choices about regularization, max tree-depth and the plethora of other options we normally tune via cross-validation.
It also turns out that this method outperforms all the others that were compared, including GBM and Random Forests, on the 42 different datasets evaluated.
What’s new hear
The novel thing in this paper is really the combination of three previous works: the Bayesian framing of individual trees in the Bayesian Treed Models paper; the idea of Gradient Boosting Trees; and the use of Bayesian Backfitting to do Markov Chain Monte Carlo (MCMC) sampling from a general additive model’s posterior distribution. If you can make one tree Bayesian and you know how to sample from any model that’s a sum of base learners, then you have BART.
Approach
The task of framing the tree problem in such a way that you can coherently define prior parameter distributions is not trivial. This work was mostly done in the previously linked Treed Models paper. Essentially, the overall prior is split into three sub-priors: one for the tree structure (depth, splitting criteria), another for the values in the terminal nodes conditional on the tree structure, and a final one for the residual noise’s standard deviation. Some clever, but straight-forward arguments are made to lead to reasonable default recommendations. For instance, it’s argued that the mean of the function is very likely between \(y_{min}\) and \(y_{max}\) of your training data, so it’s framed such that most of the prior’s mass is in this region.
The choice on the number of trees, \(m\), is interestingly not given a prior, and this is due to computational concerns. The reality is that the results were incredibly robust to this parameter, so he recommended setting it to 200 and moving on. In fact, the results seem very robust to reasonable prior choices in general and he mostly recommends just using the specified defaults.
The second important part is sampling from the model’s posterior. Like many problems, this is done by a Metropolis-Hastings algorithm where you generate a sample from a distribution and then keep/reject it based on how well it performs. In this case, it boils down to mostly this: pick a tree in the additive sequence, morph it by randomly choosing among some rules (PRUNE, GROW…etc), from this new tree, sample from the terminal node value distribution, then choose whether to keep this new tree or the original according to their ratio of posterior probabilities. In this way, the trees are continually changed to balance their complexity and ability to explain the data. The priors are chosen to naturally favor simpler trees, so deeper are only chosen if it’s necessary to explain the data.
Results
This framework was applied to 42 different real datasets and compared to other popular approaches: linear regression with Lasso/L1 regularization, Gradient Boosting Trees, Random Forests, and Neural Networks with one hidden layer. These models were tuned over a variety of parameter choices. The punchline is: BART wins. You get the best performance if you perform cross-validation for hyper-parameter tuning, similar to how you would with all the other models to tune parameters, but you can avoid all of that and get highly competitive results just using BART with the default values.
Further, the results were shown to be highly robust to changes in the choice of prior. And in all its Bayesian glory, it was shown that even when the predictors were thrown in with many useless random variables (as would often be the case when we’re building models without being clear which ones are most important), it performed very well as compared to the other methods.
The author also suggested a variable selection technique that involved artificially limiting the number of trees and then counting the prevalence of the variables in the splitting rules of the resulting trees. Although this might be a nice heuristic, the question of variable selection is a complicated one and it wasn’t explored with any rigor.
Discussion
This was a super cool paper that gave very impressive results. Although I had difficulty understanding the details of the Bayesian Backfitting algorithm and how the Gibbs sampling is actually achieved, the basics seem mostly in line with other Metropolis-Hastings approaches.
Ultimately, this approach provides an “out-of-the-box” set of defaults that are statistically robust, give best-in-class performance, and are resilient to changes in hyper-parameter choices. I think it’s an excellent demonstration of how the Bayesian probabilistic framework can avoid many of the problems with traditional, ad hoc approaches that mostly rely on maximum likelihood.