Book Review: Statistical Rethinking
I'll start with the conclusion: McElreath's Statistical Rethinking is my favorite textbook. In the rest of this post I'll explain why and offer advice on how to get the most out of this book.
My background before reading this book
By the time I started reading the first edition of this book in late 2017, I was already leading a sizable data science and analytics team and had a master's degree in statistics. For that degree, most courses used frequentist methods, although I also took the one Bayesian statistics course that was offered. So I wasn't exactly a beginner, and yet I still claim that I learned at least as much from this book as I did from that degree program. I later used this book to teach a course at my old company, so I've also seen how well it works for those with non-stats backgrounds.
In this post, I'll refer to the 2nd edition.
Connecting methods to science
McElreath starts off with the typical maze of statistical tests that students encounter in a statistics curriculum, and explains why this often causes confusion. He then goes on to argue that these tests are not enough for research because they are difficult to adapt to unique contexts and fail in unpredictable ways.
Chapter one continues on to describe why deductive falsification, which is often used as the justification for applying frequentist null hypothesis tests, is impossible in most scientific contexts. This is because "many models correspond to the same hypothesis, and many hypotheses correspond to a single model," which "makes strict falsification impossible." This idea is illustrated here, which shows that many processes are consistent with stated hypotheses, and many processes may be consistent with a single statistical model.
Measurement errors and continuous hypotheses also make falsification challenging in practice, and because of these issues, falsification is consensual rather than logical. That is, scientists debate the merits of the evidence and come to a consensus over time.
The remainder of the chapter is spent outlining the methods for doing better science that are covered in the book: Bayesian data analysis, model comparison, multilevel models, and graphical causal models.
Illustrative explanations
I think people like this book so much because the explanations bring the concepts to life, giving the reader an intuition that goes well beyond the definitions. These are given throughout the book, but here are two examples that have always stuck with me:
Chapter 2 contains perhaps the best illustration of a likelihood function that I've seen. It does so by presenting a simple case, in which a bag contains four marbles, each of which is blue or white. Three marbles are drawn with replacement, and their colors are the observed data. The unknown parameter of interest, $p$, is the proportion of marbles that are blue, so this sets up a discrete parameter, discrete outcome case. The likelihood is calculated by enumerating all of the possible ways data can be generated for each conjecture (each possible value of parameter $p \in \{ 0,\frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1 \}$ ) by taking three draws from the bag, and highlighting which possible outcomes match the observed outcome. You can see the example play out over consecutive slides starting here. The lesson is simple: the likelihood is the relative number of ways that the observed data can occur for each value of the parameter. While such literal counting is only feasible in the discrete/discrete case, it gives the right intuition for continuous cases as well.
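The counting argument is easy to reproduce. Here is a minimal sketch in Python; the observed sequence (blue, white, blue) and the function name are my own assumptions for illustration, not something taken from the text above.

```python
from itertools import product

def ways_to_produce(observed, n_blue, bag_size=4):
    """Count the ordered draws (with replacement) that match `observed`
    when the bag holds `n_blue` blue marbles out of `bag_size`."""
    bag = ["blue"] * n_blue + ["white"] * (bag_size - n_blue)
    target = tuple(observed)
    # Enumerate every possible sequence of draws and keep the matching ones.
    return sum(1 for draws in product(bag, repeat=len(target)) if draws == target)

# Assumed observation, chosen only for illustration.
observed = ["blue", "white", "blue"]

# One count per conjecture p = n_blue / 4.
counts = {n_blue / 4: ways_to_produce(observed, n_blue) for n_blue in range(5)}

# Relative number of ways: the likelihood, normalized over the conjectures.
total = sum(counts.values())
plausibility = {p: c / total for p, c in counts.items()}

for p in counts:
    print(f"p = {p:.2f}: {counts[p]} ways, relative plausibility {plausibility[p]:.2f}")
```

The conjectures $p = 0$ and $p = 1$ get zero ways because they cannot produce a mixed sample, which is the same logic a continuous likelihood encodes with densities instead of counts.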
Another explanation that stands out is the one for KL divergence, which demonstrates why it is not symmetric like a distance metric. The example given concerns predicting whether we will land on land or water, using Earth to set expectations for landing on Mars and then using Mars to set expectations for Earth. Mars' surface is almost entirely land (we count the ice caps as water on 1% of the surface for the sake of the example), while Earth's surface is mostly water but with a significant amount of land. So if you randomly land on Mars while using Earth to guide expectations, you'll be a little surprised but not shocked to (almost certainly) touch down on land. If you randomly land on Earth using Mars to guide your expectations, there's a good chance you'll land in water, which will be shocking if you were picturing the 1% water on Mars. The extra surprise means that using the proportions of land and water on Mars to guide expectations about Earth produces larger KL divergence than vice versa.
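The asymmetry is quick to check numerically. This is a small sketch, assuming Earth is 70% water and 30% land (my round numbers, not stated above) and Mars is 1% water as in the example.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q): the extra surprise from using q to predict
    events that actually occur with probabilities p."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Probabilities are ordered [water, land].
earth = [0.70, 0.30]  # assumed proportions for illustration
mars = [0.01, 0.99]   # 1% "water" as in the example

# Landing on Mars with Earth-based expectations.
earth_predicts_mars = kl_divergence(mars, earth)

# Landing on Earth with Mars-based expectations.
mars_predicts_earth = kl_divergence(earth, mars)

print(f"Earth used to predict Mars: {earth_predicts_mars:.2f}")
print(f"Mars used to predict Earth: {mars_predicts_earth:.2f}")
```

Under these assumed proportions the second divergence is more than twice the first, which matches the intuition that Mars-based expectations leave you badly unprepared for all that water.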
Insightful plots
A strength of this book is the plots used for model checking. It conveys good habits for plotting relationships and comparing the model to data. As every experienced practitioner knows, plotting adds valuable information far beyond what model fit metrics can provide alone.
Bayesian philosophy
While he points out advantages of Bayesian methods over frequentist ones, McElreath avoids making sweeping claims for Bayesian methods. In fact, he argues that no statistical approach by itself is sufficient, and there's a healthy humility as he discusses the modeling process.
The book takes the "logical" view of probability (see here for a long explanation), often associated with Jaynes, and also does a nice job delineating epistemology from ontology. The simulation of tadpoles in tanks and the role of hyperparameters in the Models with Memory chapter is a good example of clarifying this difference, as is the discussion of the i.i.d. assumption and Jaynes' mind projection fallacy.
Working up to multilevel modeling and beyond
The book is sensibly organized to gradually build up to multilevel models with partial pooling, which is one of the Bayesian superpowers. Along the way, the book introduces essential and useful topics for a practicing statistician, including spurious association, masked relationships, confounding variables, interactions, and causal DAGs. It also covers modeling with Gaussian, binomial, Poisson, categorical, zero-inflated, and ordinal response variables.
It consistently presents the formal specification of each model before moving on to the code. It's a great practice to get into and helps to clarify and communicate a model's intention. Plus, modern Bayesian libraries use declarative model syntax that closely aligns with this format, bridging conceptual understanding and assumptions to code.
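To make that concrete, here is an illustrative specification in this format for a varying-intercepts model of tadpole survival across tanks. The symbols and the particular priors are my assumptions for the sake of the example, so treat it as a sketch of the style rather than a quotation from the book.

$$
\begin{aligned}
S_i &\sim \text{Binomial}(N_i, p_i) \\
\text{logit}(p_i) &= \alpha_{\text{tank}[i]} \\
\alpha_j &\sim \text{Normal}(\bar{\alpha}, \sigma) \\
\bar{\alpha} &\sim \text{Normal}(0, 1.5) \\
\sigma &\sim \text{Exponential}(1)
\end{aligned}
$$

Each line maps to roughly one statement in a declarative PPL model: a likelihood for the survivors $S_i$ out of $N_i$ tadpoles, a link function, an adaptive prior on the tank intercepts $\alpha_j$, and hyperpriors on $\bar{\alpha}$ and $\sigma$ that control how much the tanks are pooled.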
Continuing a bit beyond multilevel models to some other advanced topics, the book includes covariance, Gaussian Processes, missing data, measurement error, and instrumental variables. After I used it to teach the class at work, our team members were able to tackle some pretty sophisticated problems where quantifying uncertainty or leveraging a flexible model structure was crucially important.
Traditional topics missing
One thing the book skips over, which is normally covered in an introductory Bayesian course, is solving simple problems with conjugate priors analytically. In practice, it's only applicable to a very small set of problems. But I think doing the math by hand for a few examples will help some learners better understand what the MCMC sampling is approximating. I don't think this belongs in McElreath's book; it's just a suggestion to find another resource to try that out separately, perhaps before reading this book, or sometime around chapter 3 or 4.
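As a taste of what that exercise looks like, here is a minimal sketch of the Beta-binomial conjugate update checked against a brute-force grid approximation. The data (6 successes in 9 trials), the flat Beta(1, 1) prior, and the function names are assumptions picked for illustration.

```python
def beta_binomial_update(a, b, successes, trials):
    """Conjugate update: a Beta(a, b) prior with a binomial likelihood
    gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + (trials - successes)

def grid_posterior_mean(a, b, successes, trials, n_grid=1000):
    """Approximate the posterior mean by evaluating prior * likelihood
    on a grid of midpoints in (0, 1) and normalizing."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    failures = trials - successes
    unnormalized = [
        p ** (a - 1) * (1 - p) ** (b - 1)   # Beta prior kernel
        * p ** successes * (1 - p) ** failures  # binomial likelihood kernel
        for p in grid
    ]
    total = sum(unnormalized)
    return sum(p * w for p, w in zip(grid, unnormalized)) / total

# Assumed example: flat prior, 6 successes in 9 trials.
a_post, b_post = beta_binomial_update(1, 1, successes=6, trials=9)
analytic_mean = a_post / (a_post + b_post)
grid_mean = grid_posterior_mean(1, 1, successes=6, trials=9)

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Analytic mean: {analytic_mean:.4f}, grid mean: {grid_mean:.4f}")
```

The two means agree closely, and that is the point of the exercise: the sampler or grid is approximating a distribution that, in this special case, you can write down exactly.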
Continuing along those lines, it's generally not a math-heavy book. The math is usually presented, but not required, since most problems are solved computationally. I think this serves the book and its scope well, but for moving on to more advanced material afterwards, I think a mathematical statistics course or book would be helpful to supplement. Probabilistic Machine Learning (PML) actually covers this material quite well (and also includes a section on conjugate prior models). Having that background ahead of time would probably make this book a bit easier to comprehend, but I don't think it's necessary.
Another traditional topic is implementing your own Gibbs and Metropolis-Hastings samplers. This is again not particularly practical with modern libraries for Bayesian inference that do the hard work for you (and tend to use Hamiltonian Monte Carlo). I don't necessarily think a reader needs to supplement with this, although diving into the sampling algorithms may help with diagnosing problems and understanding when reparameterization is helpful in more complex models. PML2 covers MCMC sampling.
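For anyone curious what such an exercise involves, here is a minimal random-walk Metropolis sketch in plain Python. It targets the posterior for a binomial proportion with a flat prior; the data (6 successes in 9 trials), step size, burn-in length, and seed are all assumptions for illustration. Because the Gaussian proposal is symmetric, the Hastings correction term drops out.

```python
import math
import random

def log_posterior(p, successes=6, trials=9):
    """Unnormalized log posterior: binomial likelihood, flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf  # zero posterior mass outside the unit interval
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

def metropolis(n_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current value
    return samples

samples = metropolis(20000)
kept = samples[2000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
print(f"Estimated posterior mean: {posterior_mean:.3f}")
```

With a flat prior the exact posterior here is Beta(7, 4), with mean 7/11, so the estimate can be checked against a known answer. Playing with `step` is a cheap way to see how poor tuning shows up as slow, sticky exploration.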
And finally, another traditional topic not found in the book is Jeffreys priors. These are intellectually interesting, but rarely used in applied work. PML2 covers Jeffreys priors.
Evolving editions
As mentioned, I originally read the first edition of the book, which was already great. I also bought a copy of the second edition, which added several more topics, but the main difference may be how it stresses causal diagrams (DAGs) throughout. The forthcoming 3rd edition will apparently feature more on the Bayesian workflow.
Recommendation
McElreath thoughtfully weaves together philosophical, conceptual, and computational considerations through motivating examples and insightful illustrations. If you want to learn (or relearn) statistics in a way that will clarify your thinking and prepare you for a wide variety of modeling problems, this is a must read. I'll describe how I went through the book, which is what I'd recommend to get the most out of it.
When I read this book, I wanted to make a collection of the code examples from it in Python. It's such a popular book that the author's code, which uses an R package the author created, has been translated to several other probabilistic programming languages (PPLs). I was using PyMC3 at the time (now replaced by PyMC version 5), so I grabbed an available translation for it, and started making my own notebooks. However, I noticed there were some differences between results from the resource I found and the book's results. I think doing the full translation from scratch myself would have been tedious, but figuring out how to correct some examples was helpful.
By the way, I don't think which framework you choose is that important. In fact, when I taught the class, I stressed that our chosen framework was somewhat arbitrary and that frameworks will change over time; what's important is learning the concepts and practicing with modern MCMC samplers. In practice, I've used PyMC, NumPyro, and Stan for various projects.
I also reproduced every plot in the book with code. Much of McElreath's plotting code was not published in the book, so figuring that out also helped my understanding of the mounds of samples the models were generating. Plotting is so important for model improvement.
And finally, as preparation for teaching a class with the book, I selected a small number of problems to do from each chapter. I recommend choosing at least one conceptual question and one computational question (and ideally one from each major topic) from each chapter to help get hands-on experience with the material in the book. McElreath also publishes homework problems and solutions from his course, which is a good option.
At the end of it, I had an amazing set of code examples in a framework I wanted to use, along with a solid understanding of how to approach a wide variety of problems. This book on its own can bring an active learner up to an intermediate level. It's also great conceptual preparation for denser books like PML or BDA3.
Take your time with this one. You'll get out of it what you put into it.
Extra Resources
McElreath publishes lectures for the book here. You can just find the latest (or whichever one matches your book's edition) and follow along. Or you can look for the latest class GitHub repo with lectures and assigned homework and solutions here (specific example here).
I mentioned translations for various PPLs. Here are some for the 2nd edition of the book.
I haven't seen a complete translation for PyStan, but Stan itself is a language, so it should be relatively straightforward to convert to PyStan for those with Python data experience.