Predictive Modeling, or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, Applied Predictive Modeling by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.
The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily with building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.
Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e., something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based Models contains what I thought to be an illuminating discussion of the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.
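To make the grouped-versus-dummy distinction concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the book (whose scripts are in R), and the predictor levels are hypothetical. It decomposes a categorical predictor into 0/1 dummy columns and then counts the candidate splits a tree could consider under each representation: with grouped categories a split may send any subset of levels to one side, while with dummy variables each split can only separate one level from all the rest.

```python
from itertools import combinations


def to_dummies(values):
    """Decompose a categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def grouped_splits(levels):
    """All two-way partitions of the levels (splits on a grouped predictor)."""
    levels = sorted(set(levels))
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level on the left so each partition is listed only once;
    # r stops short of len(rest) so the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            splits.append((left, set(levels) - left))
    return splits


def dummy_splits(levels):
    """One-level-versus-the-rest splits (splits on binary dummy variables)."""
    levels = sorted(set(levels))
    return [({level}, set(levels) - {level}) for level in levels]


# Hypothetical predictor values, invented for this example.
purpose = ["car", "education", "furniture", "repairs", "car", "furniture"]

for level, column in to_dummies(purpose).items():
    print(level, column)

print("grouped splits:", len(grouped_splits(purpose)))
print("dummy splits:  ", len(dummy_splits(purpose)))
```

With four levels the grouped representation offers more candidate splits than the dummy representation, and the gap widens quickly as the number of levels grows. That is one reason the choice between the two can change the tree that gets built.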
Applied Predictive Modeling begins with a chapter that introduces the case studies that are referenced throughout the book. Thereafter, chapters are organized into four parts (General Strategies, Regression Models, Classification Models and Other Considerations), followed by three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees have a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques, such as bagging and boosting, that are applicable to both kinds of models.
There are many ways to go about reading Applied Predictive Modeling. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing”, which I found to be very helpful. As an example, in section 3.6, where the authors mention credit scoring in a discussion of adding predictors to a model, they point ahead to section 4.5, which is a short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.
Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. Applied Predictive Modeling contains very few equations and very little statistical jargon, but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)
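For readers who have not met stratified sampling in the context of data splitting, the idea is to draw the test set separately within each outcome class so that the class proportions are preserved in both the training and test sets. The following is a minimal sketch in plain Python, written for this review rather than taken from the book. The function name `stratified_split`, its parameters and the made-up labels are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), sampling within each class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up outcome labels with a 70/30 class imbalance.
labels = ["good"] * 70 + ["bad"] * 30
train, test = stratified_split(labels, test_fraction=0.2)

print("train:", Counter(labels[i] for i in train))
print("test: ", Counter(labels[i] for i in test))
```

A simple random split of the same data could, by chance, leave the test set with noticeably more or fewer "bad" cases than the training set. Sampling within each class removes that source of variation.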
A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the University of Melbourne grant funding data set from the Kaggle contest is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.
The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas, accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self study.)
Applied Predictive Modeling is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.
[Figure: Prob_Bad_Credit (first part of Figure 11.3)]
The authors point out that the estimates of bad credit in the right panel are skewed, showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good, which is just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.
Finally, Applied Predictive Modeling can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code, icing on the cake so to speak; rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.
Applied Predictive Modeling is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s The Elements of Statistical Learning.