Book Review: Probabilistic Machine Learning: An Introduction

Dave

I'm surprised people aren't making a bigger deal about Kevin Murphy's new textbooks, the first of which I'll review here.

Overview

Probabilistic Machine Learning: An Introduction covers an incredible breadth and surprising depth of machine learning and statistics topics. It can be thought of as a "best of" from Bishop, MacKay, The Elements of Statistical Learning, and others, along with a view of recent, relevant research, all of which is pulled together under a probabilistic perspective and consistent notation. The book is almost entirely self-contained, with an extensive Foundations section covering most prerequisite topics[1] before moving on to linear models, deep neural networks, nonparametric models, unsupervised learning, and a few other topics.

Clear exposition

The book takes a Bayesian view of probability (that we can treat all unknown quantities, such as future outcomes and parameters, as random variables and model them with probability distributions) and applies it to statistical and machine learning models. In doing so, Murphy is able to apply a consistent framework and common concepts across a very wide range of topics, regardless of how they were originally motivated or conceived. This approach turns the exploration of various models into a seamless composition of building blocks, rather than a series of jumps between the origin stories of each model.

Perhaps surprisingly, the book manages to cover the most important[2] concepts one would encounter in graduate statistics coursework. In fact, even after the aforementioned Foundations section, a good chunk of the book is devoted to linear models, much of which would overlap with a statistics curriculum. This is helpful for two reasons:

  1. First, it's important to apply the right tool for the job. While complex ML models increasingly do amazing things, there are still many applications and contexts in which we need the imposed structure of GLMs and other simpler models.
  2. More importantly, it applies the structure of Murphy's approach, which extends naturally from the Foundations section, to relatively simple models before moving on to more complex topics. This allows the reader to more easily adapt and generalize the framework.

A common pattern is:

  1. Motivate the model and define the likelihood, such as $$ p(y \mid \mathbf{x}; \boldsymbol{\theta}) = \text{Ber}(y \mid \sigma(\mathbf{w}^\intercal \mathbf{x} + b)) $$ in the case of logistic regression, adding definitions, priors, or nested functions if applicable.
  2. Define and derive the objective function, such as the negative log likelihood.
  3. Show how the optimization of this objective is done in practice (a minimal sketch of all three steps follows after this list).
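To make the pattern concrete, here is a minimal sketch (my own, not from the book or its companion code) of the three steps for logistic regression: the Bernoulli likelihood above, the negative log likelihood as the objective, and plain gradient descent as the optimizer. The toy data and learning rate are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, b, X, y):
    # Step 2: negative log likelihood under p(y=1|x) = sigma(w^T x + b).
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit(X, y, lr=0.1, steps=1000):
    # Step 3: minimize the mean NLL with gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)  # gradient w.r.t. w
        b -= lr * np.mean(p - y)          # gradient w.r.t. b
    return w, b

# Toy usage: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = fit(X, y)
print(nll(w, b, X, y))
```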

The book extensively borrows from earlier textbooks and papers for explanations, figures, and problems, always with proper attribution. In fact, it may have the most citations per page of any book I've encountered. Murphy has meticulously adapted the clearest plots and explanations for each subject and selected problems that provoke deep insights into the material. Additionally, the extensive use of self-references within the book helps facilitate refreshing earlier topics as needed.

Practical theory

In practice, I find the probabilistic perspective to be much more useful than the theory emphasized in some other ML books. For instance, while I understand the role of VC-dimension and PAC learning in the development of statistical learning theory, I rarely use those ideas in applied work.[3] Murphy's book explains them in under two pages, while some other books devote entire chapters to each of these concepts.

Also, as he notes, the probabilistic perspective lends itself to optimal decision making and is shared across science and engineering disciplines, which is what helps fuel the breadth of the two books.

Breadth and depth

Have a look at the table of contents. I don't know of another textbook that covers linear Gaussian systems, GLMs, and transformers, not to mention Gaussian processes, factorization machines, and graph embeddings (plus the even longer sequel book on "advanced topics" and its supplemental material!). The power of his approach is that it allows him to write succinctly, yet clearly and precisely, on a variety of subjects, adding color and intuition when needed.

The book examines the most crucial and fundamental topics, offering detailed proofs and derivations, unveiling important results, and comparing the performance of competing methods across various problems. For example, linear regression methods are thoroughly explored over forty pages. Furthermore, the text dedicates three chapters to deep neural networks, addressing structured data, images, and sequences specifically. In contrast, it touches briefly on many specialized and niche topics, subsequently directing readers to one or more citations for further investigation.

Source code

While the text focuses on the theory, concepts, and math, Murphy makes available the code to reproduce most of the figures. This often includes experiments and model fitting, for those who want to try out models from the book. I have not used the code much myself, although I did find an issue with the mixture of linear experts code.

Corrections

I preordered a hard copy of the first printing and, while meticulously working through it, encountered several typos, some even within the mathematical content. For each typo discovered, I referred to the latest draft version to verify if it had already been corrected. Many had been previously identified and fixed, yet I was still able to report several new ones, all of which were subsequently addressed. Contributing in this way to such an important book was an unexpected perk.

In a weird way, the typos actually enhanced my experience, since catching them was a nice way to validate that I understood the material (your mileage may vary). Collecting corrections via GitHub probably increased reader participation, and I would assume the current draft is in great shape as a result.

Recommendation

Probabilistic Machine Learning is one of two introductory ML textbooks I would recommend. I think its perspective is the perfect complement to the inductive bias[4] lens found in Tom Mitchell's classic Machine Learning textbook.

If you're looking to develop expertise in machine learning, I recommend going through this book cover to cover and attempting every problem. You might not be able to solve them all, so time-box each attempt. Solutions to many of the problems are available, but give each a shot on your own first.

I was able to move through it relatively quickly, but it helped that I already had graduate degrees focused on statistics and machine learning.[5] If you find some of your math background is shaky or that the Foundations section is too dense, there are plenty of online resources available to help out. I particularly like the Mathematics for Machine Learning course from Imperial College London for linear algebra and multivariate calculus. If you need to supplement with such outside sources, I would still use the structure within PML to guide what specific topics to brush up on, rather than committing to an endless math curriculum before moving on to ML. This would also be an ideal book for readers with a Bayesian statistics background to dive into machine learning, since they will be familiar with the framing.

If you find a particular subject interesting, grab the associated code and apply it to some new problems or run experiments. If you feel like you've mastered a chapter of the book, try to generate data sets that deliberately demonstrate the comparative strengths or weaknesses of the models within it.
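For example, here is a minimal sketch (my own toy construction, not something from the book or its code) of a data set that exposes the restriction bias of a linear classifier: XOR-style labels that logistic regression cannot separate but a shallow decision tree handles easily.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR-style labels: the class depends on the interaction of the two features,
# so no single linear boundary can separate them.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=4)):
    print(type(model).__name__, model.fit(X, y).score(X, y))
# Expect the linear model near 0.5 training accuracy and the tree near 1.0.
```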

Conclusion

This has become my go-to reference on ML. Whenever I encounter a question about ML or need to refresh my memory on a topic, I usually go straight to this book (or its sequel) over Google, Wikipedia, or other books.

For a long time, it's been common for study groups to read through ESL together while learning ML. I think some of its sections are good, but it never appealed to me in the same way and relies too much on null hypothesis testing and p-values for my taste. Probabilistic Machine Learning now offers a better, modern alternative to cultivate a way of thinking that extends well beyond what many people narrowly think of as "machine learning."

The book is freely available online; however, considering its extensive length, opting for a hard copy allows for a much-needed respite from computer screens. You can purchase it here. And if you want to hear more about the author's perspective, you can listen to a podcast interview here.

I also have the second book, but haven't yet had time to read it (although I've already used it as a reference on several topics). Hopefully I will start it later this year.

Footnotes

[1] The book assumes knowledge of basic set theory and calculus, and comfort with math notation will help, although there is an appendix for that. It very briefly covers derivatives and matrix calculus, so that might be challenging for someone new to those subjects. Integration is not covered within the book, so I think it's fair to say that a reader should know calculus well before starting.

[2] While it briefly describes null hypothesis testing, it does not devolve into various statistical tests. I said important, not common.

[3] These concepts may help develop an intuition about machine learning, but they are certainly not the concepts I start with when facing a new prediction problem.

[4] Inductive bias describes how a learning algorithm conducts its search over a hypothesis space, and can be further broken down into restriction bias (which hypotheses are considered) and preference bias (which hypotheses are preferred) for analysis. Mitchell's book covers much more than this, of course, but it's the unique aspect I find most useful.

[5] Even with this background, I still learned quite a bit from the book.