Book Review: Probably Overthinking It

Dave

Overview

Allen Downey’s Probably Overthinking It describes and resolves a number of statistical fallacies and paradoxes in an accessible way. The subtitle is How to use data to answer questions, avoid statistical traps, and make better decisions, and an overarching theme is the various ways sampling bias can impact inference. He illustrates the issues visually, demonstrates each with multiple examples, and shows adjustments that can explain and overcome them.

This book is well suited for data scientists, analysts, and mathematically inclined consumers of quantitative analyses. While a background in probability and statistics would help get more out of the book, it’s not particularly technical; Downey does not dive deeply into the mathematical details, leaving much of the explanation to well designed plots and stories that accompany them. It’s primarily a book on how to reason about observational data and evaluate models.

Having read Downey’s blog (and his previous blog before that) for the past 12 years or so, I was familiar with many themes in the book, although some of the examples were new. Topics include (among others) the Inspection Paradox, Berkson’s Paradox, Simpson’s Paradox, and the base rate fallacy.

What makes the book unique

One thing I like about the book is its emphasis on choosing appropriate probability distributions and plotting methods. Students tend to come out of university statistics courses overemphasizing the Gaussian distribution, and a big lesson from this book is that the choice of distribution can matter a lot to a model’s quality and the decisions it informs. He also explains circumstances under which the Gaussian distribution is an appropriate choice and how to compare the fit of different modeling distributions. He does quite a bit of modeling with the log-normal distribution, which I've used extensively in industry, and even the log-t distribution, which I've also applied in practice.
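To make that kind of comparison concrete, here's a quick sketch of my own (not an example from the book) that fits a Gaussian and a log-normal to skewed, positive-valued data with scipy and compares the fits by log-likelihood; the data are simulated just for illustration:

    # Fit a Gaussian and a log-normal to skewed data and compare log-likelihoods.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    data = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)  # hypothetical skewed sample

    # Fit both candidate distributions by maximum likelihood.
    norm_params = stats.norm.fit(data)
    lognorm_params = stats.lognorm.fit(data, floc=0)

    # Higher total log-likelihood indicates the better-fitting model.
    ll_norm = stats.norm.logpdf(data, *norm_params).sum()
    ll_lognorm = stats.lognorm.logpdf(data, *lognorm_params).sum()
    print(f"Gaussian log-likelihood:   {ll_norm:,.0f}")
    print(f"log-normal log-likelihood: {ll_lognorm:,.0f}")

In practice you'd also compare the fitted CDFs to the empirical CDF visually, which is closer to the book's approach.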

The book also focuses on comparing model predictions to observable data, rather than on parameters, standard errors, or hypothesis tests. This is done with extensive use of ECDFs, along with log-log complementary CDFs when we want to emphasize tail behavior (a common tool for power law distributions, often found in networks). It also touches on survival analysis, which many analysts and data scientists don’t seem to encounter in school.
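For readers who haven't seen these plots, here is a minimal sketch (my own code, simulated data) of an ECDF alongside a complementary CDF on log-log axes:

    # Plot an empirical CDF and a log-log complementary CDF to show tail behavior.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

    x = np.sort(sample)
    ecdf = np.arange(1, len(x) + 1) / len(x)   # empirical CDF
    ccdf = 1.0 - ecdf                          # complementary CDF, P(X > x)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(x, ecdf)
    ax1.set_xlabel("value")
    ax1.set_ylabel("ECDF")

    ax2.loglog(x[:-1], ccdf[:-1])   # drop the last point, where the CCDF hits zero
    ax2.set_xlabel("value (log scale)")
    ax2.set_ylabel("P(X > x) (log scale)")
    plt.tight_layout()
    plt.show()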

Most interesting chapters

Perhaps the most interesting topic is Berkson’s Paradox, for which the following example is given:

  • Babies of mothers who smoked were about 6% lighter at birth
  • Smokers were about twice as likely to have babies lighter than 2,500 grams, which is considered “low birthweight”
  • Low-birthweight babies were much more likely to die within a month of birth
  • But the mortality rate of low-birthweight babies of smokers was lower than that of non-smokers

This led some to conclude that smoking was protective for low-birthweight babies. I won't spoil the solution here, but he posted some slides on the subject. The chapter also introduces causal diagrams, but they are only covered to the extent needed by the chapter.

Another chapter I particularly enjoyed was Chasing the Overton Window, where he explains the following phenomena:

  • People are more likely to identify as conservative as they get older.
  • Older people hold more conservative views, on average.
  • However, as people get older, their views do not become more conservative.

Again, I won't spoil it, but you can see some of his analysis here and watch a talk on it here.

Although Downey is a card-carrying Bayesian, the book doesn't emphasize Bayesian analysis beyond the chapter on the base rate fallacy, which covers the uncontroversial application of Bayes' rule to diagnostic tests (although I don't recall seeing the word "Bayes" at all). An important point he makes is that the information available before the test matters for establishing the relevant base rate, e.g. random screening implies a different base rate for a disease than being tested because of symptoms. He then moves on to applying base rates to the fairness impossibility theorem.
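To make the point concrete, here's a small illustration of my own (the numbers are hypothetical, not from the book): the same positive test result implies very different probabilities of disease depending on the base rate, i.e. on why the person was tested in the first place.

    # P(disease | positive test) via Bayes' rule, under two different base rates.
    def posterior_positive(base_rate, sensitivity, specificity):
        p_pos_given_disease = sensitivity
        p_pos_given_healthy = 1.0 - specificity
        p_pos = base_rate * p_pos_given_disease + (1.0 - base_rate) * p_pos_given_healthy
        return base_rate * p_pos_given_disease / p_pos

    sensitivity, specificity = 0.9, 0.95
    print(posterior_positive(0.001, sensitivity, specificity))  # random screening: ~0.018
    print(posterior_positive(0.10, sensitivity, specificity))   # tested due to symptoms: ~0.67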

Another interesting chapter I'll mention is about the inspection paradox. He gives several examples, but the most illustrative is about running a distance race. A given runner will observe more runners who are either significantly faster or slower than themselves, but almost none who are running at about the same pace. Not only does Downey do a nice job of explaining why this occurs, he also shows how to adjust for the sampling bias that leads to the paradox. You can see a version of the analysis here.
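Here's a rough sketch of the race example of my own (a simulation, not Downey's analysis): if a runner encounters others at a rate roughly proportional to the difference in speed, runners near the observer's own pace are heavily underrepresented among those observed.

    # Length-biased sampling in a race: observation rate ~ |speed difference|.
    import numpy as np

    rng = np.random.default_rng(1)
    speeds = rng.normal(loc=11.0, scale=1.5, size=100_000)  # hypothetical speeds, km/h
    observer = 11.0                                         # observer runs the average pace

    # Weight each runner by how quickly they pass (or are passed by) the observer.
    weights = np.abs(speeds - observer)
    weights /= weights.sum()
    observed = rng.choice(speeds, size=10_000, p=weights)

    def share_near_pace(s, tol=0.5):
        """Fraction of runners within tol km/h of the observer's pace."""
        return np.mean(np.abs(s - observer) < tol)

    print(f"share of the field near the observer's pace: {share_near_pace(speeds):.1%}")
    print(f"share of observed runners near that pace:    {share_near_pace(observed):.1%}")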

Expanding on Chapter 1

The basic premise of chapter 1 is that no one is “normal” (defined as being within 0.3 standard deviations of the average) with respect to even a modest number of bodily measurements or personality traits. He shows through a couple of examples that very few people are "normal" on numerous metrics. One specific example uses the Big Five personality traits, which we are told are roughly independent and pretty well approximated by Gaussian distributions. For each trait, this definition of normal corresponds to roughly the middle 20-28% of the distribution.
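As a quick check of my own (not from the book): for an exactly Gaussian trait, "within 0.3 standard deviations of the average" covers about 23.6% of people, consistent with the roughly 20-28% per trait in the data.

    # Probability a standard Gaussian falls within 0.3 standard deviations of the mean.
    from scipy.stats import norm

    p_normal = norm.cdf(0.3) - norm.cdf(-0.3)
    print(f"{p_normal:.1%}")  # about 23.6%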

He then presents data on the traits and normality within them and in combinations. I'm combining some information from two tables here to show the percentage who are considered "normal" for each trait, followed by the count and percentage of all people who make it through each successive "normal" filter.

Trait                  Pct "Normal"   Count     Pct Remain
Extroversion           23.4           204,077   23.4
Emotional stability    20.9           46,988    5.4
Conscientiousness      20.2           10,976    1.3
Agreeableness          21.1           2,981     0.3
Openness               28.3           926       0.1

In other words, only 0.1% of those in the sample data are in a "normal" range in each trait. There are at least two ways to explain why this is an expected result.

The first is the curse of dimensionality, which offers a geometric explanation. Suppose each of $d$ traits follows an independent standard Gaussian distribution. The mean squared distance from the origin for points in the joint distribution is $d$, which follows from the chi-squared distribution with $d$ degrees of freedom. In other words, the more independent traits we add, the further from the mean we expect a sampled point to be. And more to the point, as $d$ grows, the probability increases that at least one of the draws from the independent Gaussian traits will be far from the mean.
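A small simulation of my own makes both points concrete: the mean squared distance grows like $d$, and the chance that at least one coordinate lands far from the mean (beyond 2 standard deviations, say) rises quickly with $d$.

    # Curse of dimensionality for d independent standard Gaussian traits.
    import numpy as np

    rng = np.random.default_rng(7)
    for d in (1, 5, 20, 100):
        x = rng.standard_normal(size=(100_000, d))
        mean_sq_dist = np.mean(np.sum(x**2, axis=1))        # approximately d
        p_any_far = np.mean(np.any(np.abs(x) > 2, axis=1))  # grows toward 1
        print(f"d={d:3d}  mean squared distance={mean_sq_dist:6.1f}  "
              f"P(any coordinate beyond 2 SD)={p_any_far:.2f}")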

From a geometric point of view, if a point is far from the mean in any dimension, then we would say it is on the "edge" of the distribution. The curse of dimensionality tells us that as $d$ grows large, eventually most of our data will be found on the edge of the data set, i.e. each point will be far from the mean (not "normal", or "weird") in at least one dimension. And with enough dimensions, all data will be at an edge in approximately the same number of attributes. Downey alludes to this idea when he writes, "In the limit, if we consider the nearly infinite ways people vary, we find we are all equally weird," but without naming the concept directly. So there may have been an opportunity to connect the chapter 1 lesson back to a more broadly applicable concept that readers may have encountered in other areas of statistics or machine learning. See section 2.5 of Elements of Statistical Learning for more on the idea.
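To put a number on the "edge" claim in a simple setting (my gloss on that kind of argument, not a passage from the book): for data spread uniformly over a $d$-dimensional unit cube, the fraction of the volume lying within $\epsilon$ of the boundary is $1 - (1 - 2\epsilon)^d$, which approaches 1 as $d$ grows for any fixed $\epsilon > 0$. With $\epsilon = 0.05$ and $d = 50$, for example, about 99.5% of the volume is within 0.05 of a face.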

A simpler reason this result is expected is that if you multiply roughly independent probabilities between 0.2 and 0.3, you get a small probability quickly; if the events were strongly positively correlated, the joint probability would not shrink nearly as fast. For instance, suppose we treat the percent "normal" for each trait as the probability of an independent event and just multiply them. We would get the results in the far-right column of the following table, which aren't too different from the actual results.

Trait                  Pct "Normal"   Pct Remain   Pct if Independent
Extroversion           23.4           23.4         23.4
Emotional stability    20.9           5.4          4.9
Conscientiousness      20.2           1.3          1.0
Agreeableness          21.1           0.3          0.2
Openness               28.3           0.1          0.1

Notice this result does not have anything to do with being "normal" or any concept of distance at all. We could have been talking about the probabilities of being in the farthest-left portions of the trait distributions, or simply assigned the probabilities to a set of independent Bernoulli trials. To simplify further, let's suppose the probability of being "normal" for any trait is $p$. Then the probability that an instance is normal in all $d$ independent traits is just $p^d$, which becomes very small as $d$ grows large, if $p < 1$. Here, we can get roughly the same result by setting $p = 0.234$ for each of 5 independent traits.
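The following snippet of mine reproduces both the chained product (the "Pct if Independent" column above) and the $p^d$ shortcut:

    # Multiply the per-trait "normal" probabilities from the table, assuming independence.
    probs = {
        "Extroversion": 0.234,
        "Emotional stability": 0.209,
        "Conscientiousness": 0.202,
        "Agreeableness": 0.211,
        "Openness": 0.283,
    }

    running = 1.0
    for trait, p in probs.items():
        running *= p
        print(f"{trait:20s} {running:.1%}")

    # The simplified version: p**d with p = 0.234 and d = 5 traits.
    print(f"p**d approximation: {0.234**5:.2%}")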

Verdict

Overall, I recommend this book to producers and consumers of data analysis to learn a useful set of concepts through examples. At the very least, readers will learn to ask good questions with respect to common issues in observational data analysis, and readers with the right background may learn some new methods to apply. While it eschews the technical density of a textbook, it demands more intellectual engagement than a typical pop science book, drawing readers in with its broad scope of topics and colorful storytelling.

All the notebooks used to generate the analyses and plots are available here, which can aid practitioners looking to apply the lessons to their own work. You can buy the book here, and he also has a number of high quality technical books available for free here.