Book Review: Information Theory, Inference, and Learning Alogrithms
Overview
David MacKay's classic textbook Information Theory, Inference, and Learning Algorithms was first published in 2003, and yet it holds up very well over 20 years later, seamlessly weaving together concepts across compression, noisy-channel coding, Bayesian statistics, and neural networks. By the end, a diligent reader will have a very good understanding of fundamentals that are vital to more advanced work in Bayesian statistics and deep learning.
A unique perspective
MacKay uniquely presents the material, drawing parallels between seemingly distinct topics. As he states in Chapter 2, "One of the themes of this book is that data compression and data modeling are one and the same, and that they should both be addressed... using inverse probability."
One chapter presented a unique idea for model comparison using Occam's Razor. He presents an "Occam's Factor" which "provides the ratio of the poterior accessible volume of $\mathcal{H}_i$'s hypothesis space to the priori accessible volume, or the factor by which $\mathcal{H}_i$'s hypothesis space collapses when the data arrive." It's logarithm "is a measure of the amount of information we gain about the model's parameters when the data arrive."
So models with many parameters and few constraints will be penalized by a stronger Occam's factor than a simpler model because they had too much "wiggle room" to overfit the data. It's similar in spirit to AIC or BIC, except it can be applied to the full distributions over parameters, as opposed to just the maximium likelihood point estimation models. Later he goes so far as to say it can be used as a substitute for validation sets in Bayesian machine learning. I'm not sure I'm ready to commit to that, but I'm interested to explore it further. It seemes like it could be gamed. For instance, if you know of an overfit model and its parameter values for a set of training data, and you then set your priors to be equal to these, then the parameters wouldn't change and the Occam's Factor of that model would be small. So this seems to be relying on restricting oneself to truly using priors before seeing the data.
Diverse topics
As if the diversity of high level topics weren't enough, perhaps the most interesting sequence in the book are chapters 18 and 19. He applies the tools from the Information Theory section first to crossword puzzles, and then presents a simplified version of the Enigma Code breaking done at Bletchley Park. Finally, he ends the section with an information theoretic point of view on the benefits of sexual reproduction with recombination using both analytical and simulation methods.
Relevance to NLP and deep learning
Even in Claude Shannon's original paper that introduced information theory, A Mathemetical Theory of Communication, he presented language models. And if one studies natural language processing today, it's still filled with concepts from information theory. Although this book came out 14 years before the transformer architecture was introduced in Attention is All You Need, it will give readers an outstanding foundation from which to study modern approaches. I worry too many practitioners just take the shortest path to the latest architectures without building the requisite knowledge first.
I also like how he distills neural networks into their architecture, activity rule, and learning rule. The architecture specifies the variables and relationships within the network; the activity rule details how the outputs of neurons change in response to each other; the learning rule defines how to update weights during training.
One other topic I found particularly interesting was on the information capacity of a single neuron (and relating it back to VC dimension), as well as the capacity of Hopfield networks. Walking through these analyses gives the reader building blocks for reasoning about the capacity of larger, more complex networks.
While it's true that the Bayesian methods for which MacKay advocates are not the dominate paradigm used in deep learning today, the probabilistic perspective is broad enough to interpret parameters with point estimates. Regularization and drop out also have Bayesian interpretations.
Comparison to frequentist methods
MacKay mostly focuses on presenting the material at hand, but occasionally contrasts to frequentist methods. One example comes in Chapter 24, where he discusses the estimators for $\sigma$ in a Gaussian distribution. As he states:
Given data $D = \{x_n\}_{n=1}^N$, and 'estimator' of $\mu$ is
$$ \bar{x} \equiv \sum_{n=1}^N x_n / N, $$ and two estimators of $\sigma$ are: $$ \sigma_N = \sqrt{\frac{\sum_{n=1}^N (x_n - \bar{x})^2}{N}} \;\text{and}\; \sigma_{N-1} = \sqrt{\frac{\sum_{n=1}^N (x_n - \bar{x})^2}{N-1}} $$
He discusses how they invent estimators in the frequentist paradigm and choose the one that best meets some criteria of sampling properties. After pointing out that there is no clear principle for deciding which criterion to use, and given most criteria, there's no systematic way to produce an optimal estimator, he then explains the frequentist interpretation of these estimators. The estimator $(\bar{x}, \sigma_N)$ is the maximum likelihood estimator, but $\sigma_N$ is biased. That is, averaging over repeated sampling, $\sigma_N$ will not equal to $\sigma$. The $\sigma_{N-1}$ version is unbiased.
In contrast, the Bayesian view arrives at these quantities as maximum a posteriori estimates of $\sigma$ with different conditioning. The maximum of $P(\sigma | D)$ is at $\sigma_{N-1}$ and the maximum of $P(\sigma | D, \mu = \bar{x})$ is $\sigma_N$, using uninformative priors. In other words, $\sigma_{N-1}$ is when we are jointly estimating our uncertainty in $\mu$ and $\sigma_N$ is when we hold $\mu$ fixed at $\bar{x}$. There are nice supporting visuals on page 321.
One of the more lively and entertaining chapters is 37, Bayesian Inference and Sampling Theory. In this short chapter, he offers very simple examples to demonstrate that frequentist methods calculate unhelpful quantities to the decision at hand or are sensitive to irrelevant information.
Nitpick
Perhaps the one thing I didn't care for in the book was the early insistence on approximating $x!$ and ${n \choose x}$, which seemed like a bit of an unnecessary distraction. This did not last long, however.
Problems
The book offers a lot of problems, each with a difficulty rating between 1 and 3 next to it. Some of the problems are quite challenging. I recognized one problem as being a slight reframing of a Putnam A6 question, and it was only marked with a difficulty of 2. Luckily, he also marked a small portion as recommended problems, and these problems were thoughtfully chosen to reinforce the content.
Recommendation
Overall, I see this book as being a good one to read after Kevin Murphy's Probabilistic Machine Learning. They draw from a similar perspective, but MacKay's goes deeper within what it covers.
So my recommendation is to read the whole book, after PML, and do all of the recommended problems. The book is comprised of 50 relatively short chapters. I think readers could alternate between reading a chapter and then doing its recommended problems, and complete the book in 100 days without much problem. Very motivated readers could probably finish in 50 days by reading a chapter and completing the problems each day. Consider it a "must read" if you want to go deep into NLP.