https://www.ams.org/journals/bull/2019-56-01/S0273-0979-2018-01611-X/S0273-0979-2018-01611-X.pdf

That computers are changing the nature of scientific inquiry and the way we learn about ourselves, the world around us, and our day-to-day lives is obvious. Computing is ubiquitous. Focus has shifted from “if” to “how” when contemplating the extent to which computers are augmenting, automating, and enhancing a particular enterprise. Some experiences are touched more than others, and not always in a good way. Nowhere is this more true than in data science. Computers have revolutionized the degree of automation in learning from data and have expanded the fidelity of statistical inquiry to an astonishing degree in recent decades. At the same time, computers are responsible for a similarly dramatic expansion in the amount of data recorded, banking measurements on more things and at resolutions vastly higher than ever imagined, except perhaps in science fiction. To cope, statistical methodology has had to pare back in some areas as much as ratchet up in others. What was for a century or more a fairly stable, plug-and-play toolkit for testing hypotheses and making predictions is today rapidly evolving in a dynamic landscape. Computational machinery has broken us out of the point-estimation, central limit theorem-style uncertainty quantification that practitioners seldom fully internalize. Now we have more intuitive tools, such as cross-validation (CV), the bootstrap, and Bayesian posterior inference via Markov chain Monte Carlo (MCMC). Computation offers robustness via ensembles and long-run averages or (when all eggs must be in one basket) model selection from alternatives which, if fully enumerated, would be of astronomical proportions. Methods are scored not just on the old-school trifecta of theoretical justification, empirical accuracy, and interpretive capability but increasingly on their algorithmics, implementations, speed, potential for parallelization, distribution, execution on specialized hardware, automation, and so on.

Development has been so feverish that it can be hard for practitioners—experts in other areas of science—to keep up. Yet at the same time, it has never been more essential to utilize statistical machinery: to incorporate data and to make decisions, often in real time, in the face of uncertainty. Many are desperate for an atlas to help navigate the modern data science landscape. More data and greater automation may have paradoxically led to greater uncertainty.

Essential tools. Despite the dizzying array of acronyms, experts largely agree on a relatively compact set of modern fundamentals: the bias-variance trade-off and regularization, control for false discovery, randomization and Monte Carlo, latent variables, divide-and-conquer, basis expansion, and kernels. Take Monte Carlo as a first example, a class of methods that would be meaningless without the advent of cheap computing. Monte Carlo, or MC for short, is named for the games of chance played in the eponymous city on the Mediterranean. From a data science perspective, one applies MC by injecting randomness into an otherwise deterministic procedure. At first glance this seems to be of dubious benefit: how could adding random noise help, and doesn’t that require a lot of extra effort?
Computers don’t mind repetitive tasks, potentially accommodating a substantial degree of randomization without breaking a sweat. As for why randomization is useful, well, that’s more subtle and depends on the task at hand.

The simplest and perhaps most widely applied MC method is the bootstrap. The bootstrap draws its power from the empirical distribution of the (training) data. Recall that, in the statistics literature, observations are regarded as a random sample from an underlying population, and the goal is to learn about that population from the sample. Most statistical methods posit a model offering a mathematically convenient caricature of the data-generating mechanism. Models have parameters that are somehow optimized, say via a measure of fit to the data like squared-error loss or the likelihood. Since the data are a random sample, the optimized parameters may be regarded as random variables whose distribution depends on the relative frequencies of occurrences in the underlying population. Those estimated parameters are, in the jargon, a statistic. The trouble is, only with very special models, parameters, and populations can the distribution of a statistic be derived, or even asymptotically approximated, and therefore properly understood. The literature of old is peppered with mathematical acrobatics toward closed-form (approximate) so-called sampling distributions. Even when they work out, the results can be inscrutable, at least from a practitioner’s perspective, and therefore they rarely furnish forecasts with meaningful summaries of uncertainty.

Along comes the bootstrap, which says you can usually get the same thing, at least empirically, for a wide class of models and parameters with a simple loop: randomly resample your data, estimate parameters, repeat. The collection of optimized parameters constitutes an empirical sampling distribution. The implementation is trivial, may be highly parallelized (because each data resample is handled in a statistically and algorithmically independent way), and may even be distributed, meaning that you can bootstrap over partitioned data whose elements are, due to storage, legislative, or communication bottlenecks, effectively quarantined from one another. Examples include data collected and stored locally by ISPs or e-commerce giants like Amazon or Google.
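To make the “resample, estimate, repeat” loop concrete, here is a minimal sketch in Python; the toy data, the least-squares slope used as the statistic, and all variable names are illustrative assumptions, not anything taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n pairs (x, y); the "statistic" is a least-squares slope.
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.5, size=n)

def slope(x, y):
    # Least-squares slope of y on x (with an intercept).
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

B = 2000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)       # resample rows with replacement
    boot[b] = slope(x[idx], y[idx])        # re-estimate on the resample

# The B re-estimates form an empirical sampling distribution.
se = boot.std(ddof=1)                      # bootstrap standard error
lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile interval
print(f"slope {slope(x, y):.3f}, SE {se:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

Because each pass through the loop is statistically and algorithmically independent, the iterations can be farmed out to separate cores or machines with essentially no coordination, which is exactly the parallel and distributed appeal noted above.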
Methods like the bootstrap, offering the potential to understand uncertainty in almost any estimate, open up the potential to explore a vast array of alternative explanations of data. But that uncertainty, or variance, is but one side of the accuracy coin, i.e., how far our inferences are from the “truth”, or at least something useful. The other side is bias. Statisticians learned long ago that it is relatively easy to reduce uncertainty in forecasts with stronger (or simpler) modeling assumptions, more data, and usually both. But often that did not lead to better forecasts. Exploring the bias-variance trade-off was difficult before computers got fast. These days it is easy to enumerate thousands of alternative models and evaluate forecasts out-of-sample with MC validation schemes such as CV. The most common CV setup partitions the data into equal-sized chunks and then iterates, alternately holding each out for testing. Candidate predictors are fit to training data comprising the chunk’s complement in the partition, and are evaluated on the held-out testing set. By performing a double loop, over predictors/models and partition elements, one can assess the predictive accuracy, or any other out-of-sample score, and select the best alternative.

Now for some classes of models it is possible to leverage a degree of analytic tractability while exploring the bias-variance trade-off, and thereby explore (at least implicitly) a dizzying array of alternatives. The best example is the lasso for linear models relating a response, or output variable, to a potentially enormous set of explanatory, or input, variables. The lasso is part of a wider family of regularized regression methods pairing a loss function (usually squared error) with a constraint that the estimated coefficients are not too large. It turns out to be easier, and equivalent, to work with an additive penalty instead. The lasso uses an L1 penalty, λ ∑_j |β_j|, which maps out a space wherein optimal solutions are on “corners” where some coefficients are set identically to zero, effectively deselecting input coordinates—de facto model selection, in a limited sense. Clever coordinate-wise algorithms make the search for optimal coefficients blazingly fast and even enable a continuum of penalty parameters λ to be entertained in one fell swoop, so that all that remains is to pick one according to some meta-criterion. CV is the most popular option, for which automations are readily available in software. Information criteria (IC), which essentially contemplate out-of-sample accuracy without actually measuring it empirically, offer further computational savings in many settings. Some formulations, i.e., of penalty and information criteria, have links to Bayesian posterior inference under certain priors on the unknown coefficients. It can be shown that the lasso estimator represents the maximum a posteriori (MAP) estimator under independent Laplace priors for the β_j. In a fully Bayesian framework one has the option of Markov chain Monte Carlo (MCMC) inference if sufficient computational resources are available. The advantage could be more accurate prediction via model averaging, i.e., entertaining many models of high posterior probability rather than simply the most probable (MAP) selection.

Researchers have discovered that this “trick” (regularizing in such a way as to automatically detect useful explanatory variables) has analogues in settings well beyond ordinary linear regression: from linear-logistic (and other generalized linear model families) to nonlinear and nonparametric settings. Basis expansion, which generates so-called features from explanatory variables by transformation and interaction (multiplying pairs of features together), allows linear models on those features to span rich function spaces mapping inputs to outputs. This all works splendidly as long as judicious regularization is applied. Again, CV, other MC validation methods, and MCMC can play a vital role.
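A minimal sketch of the double loop just described, using the lasso as the candidate family with a grid of penalty levels λ standing in for the alternative models; the toy data, the grid, the five-fold split, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)

# Toy data: 100 observations, 50 inputs, only 5 of which matter.
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]
y = X @ beta + rng.normal(scale=0.5, size=n)

lambdas = np.logspace(-3, 1, 30)               # candidate penalty levels
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_error = np.zeros(len(lambdas))

# Double loop: over candidate models (outer) and partition elements (inner).
for i, lam in enumerate(lambdas):
    for train, test in kf.split(X):
        fit = Lasso(alpha=lam).fit(X[train], y[train])
        resid = y[test] - fit.predict(X[test])
        cv_error[i] += np.mean(resid**2) / kf.get_n_splits()

best = lambdas[np.argmin(cv_error)]
final = Lasso(alpha=best).fit(X, y)
print("chosen lambda:", best, "nonzero coefficients:", np.sum(final.coef_ != 0))
```

In practice both loops are automated; scikit-learn’s LassoCV, for example, performs essentially this search along the whole λ path in one call, which is the “one fell swoop” convenience referred to above.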
MC and regularization are even important when entertaining inherently nonlinear models, such as those based on trees, kernels, and artificial neural networks. Tree-based regression provides a divide-and-conquer approach to large-scale nonlinear modeling, and it is especially attractive when nonlinear interactions are present. Trees—merely special graphs to mathematicians—are a fundamental data structure to computer scientists, with many efficient libraries available for convenient abstraction and fast implementation. Statisticians have simply ported tree-based data structures over to learning. The idea is to let the data decide how to “divvy” up the input space recursively via binary splits on individual input coordinates (e.g., x_j ≤ 5) placed at internal nodes of the tree, so that simple models can be fit in the regions of the input space demarcated by the partition, the so-called leaves of the tree. Appropriate leaf models are dictated by the nature of the response, with constant (unknown mean) and linear models being appropriate for regression, and multinomial models for classification. Splitting locations can be chosen according to any of several optimization heuristics, again depending on the leaf model, but the key is to prevent over-fitting by not growing trees too deep, i.e., by not having too many elements in the partition. One solution again involves MC validation (e.g., CV). These form the basis of the older CART (classification and regression tree) methods. However, two newer schemes have become popular as computing power has increased severalfold: Bayesian MCMC exploration of tree posteriors, and boosting and the bootstrap. The former boasts organic regularization through a prior over tree space, and it has been extended to sums of trees with Bayesian additive regression trees (BART). In BART, the prior encourages shallow trees when many are being summed. Non-Bayesian analogues of sums of trees, both actually predating BART, can be obtained via boosting or the bootstrap. Boosting targets a best (weighted) sum of shallow trees, or decision stumps, whereas the bootstrap can yield many trees, each fit to randomly resampled data, which can be averaged, leading to a random forest predictor—an example of so-called bootstrap aggregating, or bagging. Tree-based predictors, especially in ensembles, are hard to beat when inputs interact in their functional relationship to outputs, and when both sets of variables lack a degree of regularity, usually manifested in the form of smoothness.

Take either of those challenges away, and kernels are king. To understand how kernels work, it helps to think about how distance in the input space relates to correlation (i.e., linear dependence) and other forms of probabilistic or functional dependence in the outputs. Testing locations with inputs far from the training inputs should have outputs which are less highly correlated, or dependent, and vice versa. There are some nice computational tricks in play when distances are measured in a certain way, and you can think of the choice of distance as a mapping from the original input space into a feature space wherein calculations are relatively straightforward (linear). The best example is so-called Gaussian process regression, where pairwise (often inverse exponential Euclidean) distances in the input space define a multivariate normal (MVN) covariance structure. Then, simple MVN conditioning rules provide the predictive distribution. One can interpret the entire enterprise as Bayesian, with priors over function spaces leading to posteriors. However, in the opinion of many practitioners, that endows things with the sort of technical scaffolding that obfuscates, unless you are already of the opinion that all things Bayesian are good.
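Two short sketches may help make the preceding passage concrete; the toy data, tree depths, and kernel settings are all illustrative assumptions. First, bootstrap aggregating of regression trees, i.e., a bare-bones random forest (a full random forest, as in scikit-learn’s RandomForestRegressor, additionally randomizes the input coordinates considered at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Toy nonlinear data with an interaction between the two inputs.
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=n)

# Bagging: fit one tree per bootstrap resample, then average predictions.
trees = []
for b in range(100):
    idx = rng.integers(0, n, size=n)  # resample the data with replacement
    trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

X_new = rng.uniform(-1, 1, size=(5, 2))
forest_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
print(forest_pred)
```

Second, Gaussian process regression by MVN conditioning, with covariance decaying as an inverse exponential function of squared Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-d data: a noisy sine, with predictions wanted at new inputs.
n = 50
X = rng.uniform(0, 2 * np.pi, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=n)
X_new = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)

def gauss_kernel(A, B, lengthscale=1.0):
    # Covariance as an inverse exponential of squared Euclidean distance.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

noise = 0.1**2
K = gauss_kernel(X, X) + noise * np.eye(n)   # train-train covariance
Ks = gauss_kernel(X_new, X)                  # test-train cross-covariance
Kss = gauss_kernel(X_new, X_new)             # test-test covariance

# Simple MVN conditioning rules give the predictive distribution.
mean = Ks @ np.linalg.solve(K, y)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
print(mean)
print(np.sqrt(np.diag(cov) + noise))         # predictive standard deviations
```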
A road-map. One criticism may be that modern statistical learning feels a bit like shooting first and asking questions later: algorithms target a particular behavior, and if they seem to work somebody might try to prove a little theory about them—to explain why they work and how similar tactics may (or may not) port to other learning tasks. That makes for a landscape that can be hard to navigate, particularly for “newbies”. Although there are prospectors who know their particular canyons well, there are few who know how we got here or where “here” is. Building a map from the old to the new requires cartographers who span both worlds. Such folks are in short supply.

Brad Efron and Trevor Hastie are vanguards in the statistics world, having managed to be many things to many people and having been claimed by (modern) classical statisticians, Bayesians, and machine learning researchers alike. Their new book, Computer age statistical inference, is the (mathematically sophisticated) data scientist’s road-map to modern statistical thinking and computation. It covers nearly all of the topics outlined above, in a succinct and elegant way, and with carefully crafted illustrative examples, emphasizing each methodology’s evolution as well as its implementation, aspects which often say more about a method’s popularity than technical detail does. However, this is not a cookbook. Although pointers are made, usually to R packages, the book has almost no code. This is probably by design. If they had provided, say, R code, few in the machine learning community would buy it, as they prefer Python. And vice versa with statisticians.

The book is in three parts. The first part, on classic statistical inference, offers some context. This is a great read for someone who knows the material already—an abridged summary of the landscape “a hundred years ago”. The middle of the book, Part II, is even better. It covers a transitional period. The researchers who developed this methodology were prescient, or lucky, in that they anticipated computational developments on the horizon. They developed algorithms for the machines of the 1970s–1990s, but if computing had never matured, none of them would have become household names (James–Stein, ridge regression, expectation maximization, the bootstrap, CV, etc.). Although the topics in Part III comprise essential tools in the modern arsenal, those in Part II are in many ways more important to the reader. These chapters teach the foundational statistical and computational concepts out of which the Part III topics grew, and they will be key to understanding the next big thing.

Part III offers a sampling of the most important and most recent advances. It starts with large-scale (big-p) model selection, and the multiple testing issues which ensue, perhaps as a means to transition into variable selection via the lasso. Then come tree-based methods, and ensembles thereof, via the bootstrap and boosting. These chapters, though short and sweet, are surprisingly complete. The later chapters, however, leave much to the imagination. In part that is because research on some topics (e.g., deep neural networks (DNNs)) is perhaps in its infancy. In other places, full monographs offer greater insight, and references are provided. Or it may be a matter of the authors’ taste and expertise.
Support vector machines, by contrast, are rather more mature and thus well presented in the text, but they are paired with kernels and local regression, and the development there barely scratches the surface. Gaussian processes (GPs), which often out-perform neural networks (even deep ones) in lower signal-to-noise regimes (in part because they require less training data), do not even get a mention. The final two chapters feel a little misplaced. Post-selection inference, or how to correctly quantify uncertainty in estimated quantities (like standard errors) after model selection via optimization (e.g., the lasso), might be better placed closer to the start of the chapter.

Experts in the literature would pick up on the following three omissions. The first is that the treatment of Bayesian inference is awkward. Of the four chapters covering Bayes, two have essentially the same title, “Empirical Bayes”. Just about every method—from lasso/linear regression to generalized linear models, trees, forests, neural networks, and kernels—has highly impactful Bayesian analogues which do not get a mention in the text. Emphasis in the text is on Bayesian-lite methodology, including objective and empirical Bayes, yet these are the variations which have benefited least from the advent of modern computation. The second issue regards methods tailored to ubiquitous yet specialized computing architectures such as graphics processing units (GPUs), symmetric multi-core processors, and distributed computing. Much work has been done to leverage these modern paradigms, achieving astounding gains in computational efficiency, though sometimes at the expense of statistical efficiency. Divide-and-conquer, the technique exploited by trees, has been applied to kernels and GPs for vastly parallelized prediction. GPU implementations explain much of the excitement in DNNs, and algorithmic tricks derived therefrom (not limited to stochastic gradient descent) are bleeding into other modeling paradigms. Third and finally, there is nothing on fast/sparse and distributed linear algebra, which is dramatically expanding the sizes of problems that can be tackled with otherwise established (“old”) methodology.

To wrap up with such shortcomings is definitely unfair, as overall this is an excellent text, and to do justice to the hottest methods in statistics/machine learning could take thousands of pages. Computing advances have revolutionized so many aspects of our lives, and data—either its collection or the science of making sense of it—has come to dominate almost all of those aspects. The pace of innovation is feverish and is poaching talent from every corner of the quantitative sciences. Efron and Hastie’s Computer age statistical inference offers an excellent handbook for new recruits: either post-doctoral scholars from non–data science backgrounds, or graduate students in statistics, machine learning, and computer science. It is a great primer that helps readers appreciate how we got to where we are, and how challenges going forward are linked to computation, implementation, automation, and a pragmatic yet disciplined approach to inference.