Shallow Learning Theory
27 March 2026

The last few years, I (and the rest of the ML community) have been swallowed up by the current wave of deep learning models, which exhibit unusual generalization behaviors [1].
While these models may work formidably well, I have been missing an environment where theory actually makes sense. So I decided to write a blog post on traditional statistical learning theory and its generalization bounds. Since machine learning nowadays is associated more strongly with deep learning than with traditional statistical models, we might as well call this shallow learning theory.
Empirical Risk Minimization
Starting from the beginning, back in the day (thank god I did not live through those days) when generalization was a real concern among ML practitioners, people were often attempting to minimize the empirical risk: the average loss over a training set. Formally, given a dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. from some unknown distribution $\mathcal{D}$, and a hypothesis class $\mathcal{H}$, the empirical risk of a hypothesis $h \in \mathcal{H}$ is defined as

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i),$$
where $\ell$ is a loss function measuring the discrepancy between the prediction $h(x_i)$ and the true label $y_i$. The Empirical Risk Minimizer (ERM) is then the hypothesis that minimizes this quantity:

$$\hat{h} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \hat{R}(h).$$
The hope, of course, is that minimizing $\hat{R}(h)$ also minimizes the expected risk:

$$R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(h(x), y)\right].$$
But $\mathcal{D}$ is unknown; we only have access to the training samples. So the central question of learning theory becomes: how well does $\hat{R}(h)$ approximate $R(h)$, and under what conditions does minimizing the former guarantee a small value of the latter?
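To make ERM concrete, here is a minimal sketch over a toy hypothesis class of 1-D threshold classifiers. The data-generating process (threshold 0.3 with 10% label noise) and the grid of candidate thresholds are illustrative assumptions, not anything prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: the true label is 1 when x > 0.3, with 10% label noise.
n = 200
x = rng.uniform(0, 1, size=n)
y = (x > 0.3).astype(int)
flip = rng.random(n) < 0.1
y[flip] = 1 - y[flip]

def empirical_risk(theta, x, y):
    """Average 0-1 loss of the threshold classifier h_theta(x) = 1[x > theta]."""
    preds = (x > theta).astype(int)
    return float(np.mean(preds != y))

# Hypothesis class H: threshold classifiers indexed by theta on a grid.
thetas = np.linspace(0, 1, 101)
risks = [empirical_risk(t, x, y) for t in thetas]
erm = float(thetas[int(np.argmin(risks))])
print(f"ERM threshold: {erm:.2f}, empirical risk: {min(risks):.3f}")
```

Because of the label noise, even the best threshold in the class has nonzero empirical risk, which is exactly the regime where the gap between empirical and expected risk matters.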
Fortunately, we have generalization bounds (built on concentration inequalities), which guarantee that, for a given hypothesis $h$, the empirical risk $\hat{R}(h)$ and the expected risk $R(h)$ will be close with high probability.
Hoeffding's Inequality
One of the most important inequalities for deriving generalization bounds in ML is Hoeffding's inequality [2]. In its general form, it states that if $X_1, \dots, X_n$ are independent random variables with $a_i \le X_i \le b_i$ almost surely, then for any $t > 0$:

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]\right| \ge t\right) \le 2\exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$
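As a quick sanity check (not part of any proof), we can compare the bound against a Monte Carlo estimate for fair-coin variables, where $a_i = 0$ and $b_i = 1$. The sample size, number of trials, and deviation threshold below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# X_i ~ Bernoulli(0.5), so a_i = 0, b_i = 1, and E[sample mean] = 0.5.
n, trials, t = 100, 100_000, 0.1
samples = rng.integers(0, 2, size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)

# Fraction of trials where the sample mean deviates by at least t,
# versus the Hoeffding bound 2 * exp(-2 n^2 t^2 / sum_i (b_i - a_i)^2),
# which equals 2 * exp(-2 n t^2) here since each (b_i - a_i)^2 = 1.
empirical_prob = float(np.mean(deviations >= t))
hoeffding_bound = 2 * np.exp(-2 * n * t**2)
print(f"empirical: {empirical_prob:.4f}, Hoeffding bound: {hoeffding_bound:.4f}")
```

The empirical probability comes out well below the bound, as expected: Hoeffding is a worst-case guarantee over all distributions supported on $[a_i, b_i]$, so it is loose for any particular one.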
Application to binary classification risk
Now let's specialize to the case most relevant to classification: the 0-1 loss, defined as $\ell(h(x), y) = \mathbb{1}[h(x) \neq y]$. Each individual loss term takes values in $\{0, 1\}$, so $a_i = 0$ and $b_i = 1$, which means $(b_i - a_i)^2 = 1$ for all $i$. Noting that

$$\hat{R}(h) = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{with} \quad X_i = \ell(h(x_i), y_i) \quad \text{and} \quad \mathbb{E}\left[\hat{R}(h)\right] = R(h),$$

the denominator $\sum_{i=1}^{n} (b_i - a_i)^2$ simplifies to $n$, and Hoeffding's inequality becomes:

$$\mathbb{P}\left(\left|\hat{R}(h) - R(h)\right| \ge t\right) \le 2\exp\left(-2 n t^2\right).$$
This is a remarkably clean result. Setting the right-hand side equal to $\delta$ and solving for $t$, we get that with probability at least $1 - \delta$:

$$\left|\hat{R}(h) - R(h)\right| \le \sqrt{\frac{\ln(2/\delta)}{2n}},$$
or equivalently, the true risk is bounded by:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
The term $\sqrt{\ln(2/\delta)/(2n)}$ is the generalization gap: it shrinks as $n$ grows and is only mildly sensitive to the confidence parameter $\delta$, which enters through a logarithm. The catch, however, is that this bound holds for a fixed hypothesis $h$. In practice, $h$ is chosen after seeing the data, so we need to account for the fact that we are searching over an entire class $\mathcal{H}$, but that is a story for the next section.
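To get a feel for the numbers, the gap term can be evaluated directly. The sample sizes and confidence levels below are arbitrary illustrative choices:

```python
import math

def generalization_gap(n, delta=0.05):
    """Hoeffding generalization gap sqrt(ln(2/delta) / (2n)) for one fixed hypothesis."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# The gap shrinks like 1/sqrt(n)...
for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: gap = {generalization_gap(n):.4f}")

# ...and grows only logarithmically as delta tightens.
print(f"n = 1000, delta = 1e-6: gap = {generalization_gap(1_000, 1e-6):.4f}")
```

Note how demanding a millionfold smaller failure probability costs less than a doubling of the gap, while a hundredfold more data shrinks it tenfold; this is the sense in which the bound is "only mildly sensitive" to $\delta$.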
[1] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021.

[2] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.