Random Forests:  Why They Work and Why That's a Problem (via Zoom)

Abstract: Random forests remain among the most popular off-the-shelf supervised machine learning tools, with a well-established track record of predictive accuracy in both regression and classification settings. Despite this empirical success, a full and satisfying explanation for why they work has yet to be put forth. In this talk, we will show that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. From a model-complexity perspective, this means that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicit regularization procedures like the lasso. Realizing this, we demonstrate that alternative forms of randomness can provide similarly beneficial stabilization. In particular, we show that augmenting the feature space with additional features consisting of only random noise can substantially improve the predictive accuracy of the model. This surprising fact has been largely overlooked in the statistics community, but it has crucial implications for how to objectively define and measure variable importance. Numerous demonstrations on both real and synthetic data are provided.
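The noise-augmentation idea from the abstract is easy to try for oneself. The sketch below, using scikit-learn rather than the speaker's own code, builds a low-SNR regression problem, appends pure-noise columns to the design matrix, and compares the test error of a random forest fit with and without them; the specific sample sizes, SNR level, and number of noise columns are illustrative choices, not values from the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Low-SNR synthetic data: a weak linear signal buried in noise.
n, p = 500, 10
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 0] + 2.0 * rng.standard_normal(n)  # SNR well below 1

# Augment the feature space with columns of pure random noise.
k_noise = 20
X_aug = np.hstack([X, rng.standard_normal((n, k_noise))])

# Same train/test split for both designs.
X_tr, X_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(
    X, X_aug, y, test_size=0.3, random_state=0
)

rf_plain = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rf_noise = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xa_tr, y_tr)

mse_plain = mean_squared_error(y_te, rf_plain.predict(X_te))
mse_noise = mean_squared_error(y_te, rf_noise.predict(Xa_te))
print(f"test MSE without noise features: {mse_plain:.3f}")
print(f"test MSE with {k_noise} noise features: {mse_noise:.3f}")
```

The noise columns dilute the pool of candidate split variables, so each tree considers the true features less often; this plays the same stabilizing role the abstract attributes to a small mtry. Results will vary with the seed and SNR, and the benefit of the noise features is expected mainly when the signal is weak.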

Bio: Lucas Mentch is an Assistant Professor in the Department of Statistics at the University of Pittsburgh. He obtained his PhD in statistics from Cornell University in 2015 and spent the 2015-2016 academic year as a visiting researcher at the Statistics and Applied Mathematical Sciences Institute (SAMSI) working on problems in statistics and forensic science. His primary research area is at the intersection of machine learning and classical statistical inference, and he has worked in a number of applied areas including law, policing, forensic science, sports, biomedicine, and sleep-related science.