Supervised Machine Learning Lecture 3: Learning with infinite hypothesis classes
Recall: PAC learnability
A class $C$ is PAC-learnable if there exists an algorithm $\mathcal{A}$ that, given a training sample $S$, outputs a hypothesis $h_S$ whose generalization error satisfies
\(Pr(R(h_S) \leq \epsilon) \geq 1 - \delta\)
for any distribution $D$ and arbitrary $\epsilon, \delta > 0$, with a sample size $m = |S|$ that grows polynomially in $1/\epsilon$ and $1/\delta$
Recall: PAC learning of a finite hypothesis class
Sample complexity bound relying on the size of the hypothesis class (Mohri et al., 2018):
$Pr(R(h_S) \leq \epsilon) \geq 1 - \delta$ if
\(m \geq \frac{1}{\epsilon} \left(\log(|\mathcal{H}|) + \log\left(\frac{1}{\delta}\right)\right)\)
An equivalent generalization error bound:
\(R(h) \leq \frac{1}{m}\left(\log(|\mathcal{H}|) + \log\left(\frac{1}{\delta}\right)\right)\)
This holds for any finite hypothesis class, assuming there is a consistent hypothesis, i.e. one with zero empirical risk
Extra term compared to the rectangle learning example is the term $\frac{1}{\epsilon} \log(|\mathcal{H}|)$
- The more hypotheses there are in $\mathcal{H}$, the more training examples are needed
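To make the bound concrete, here is a minimal Python sketch (my own illustration, not lecture material; the function name and the choices of $\epsilon$, $\delta$ and hypothesis class are arbitrary) that evaluates the required sample size $m \geq \frac{1}{\epsilon}(\log|\mathcal{H}| + \log(1/\delta))$ for Boolean conjunctions over $d$ variables, of which there are $3^d$ (each variable appears positively, negated, or not at all):

```python
import math

def finite_class_sample_bound(h_size: int, eps: float, delta: float) -> int:
    """Smallest m satisfying m >= (1/eps) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Boolean conjunctions (monomials) over d variables: |H| = 3^d.
for d in (5, 10, 20):
    m = finite_class_sample_bound(3 ** d, eps=0.05, delta=0.05)
    print(f"d = {d:2d}  |H| = {3 ** d:12d}  required m >= {m}")
```

As expected, the required sample size grows only linearly in $d$, since $\log|\mathcal{H}| = d \log 3$.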
Learning with infinite hypothesis classes
- The size of the hypothesis class is a useful measure of complexity for finite hypothesis classes (e.g. Boolean formulae)
- However, most classifiers used in practice rely on infinite hypothesis classes, e.g.
- $\mathcal{H} =$ axis-aligned rectangles in $\mathbb{R}^2$ (the example last lecture)
- $\mathcal{H} =$ hyperplanes in $\mathbb{R}^d$ (e.g. Support vector machines)
- $\mathcal{H} =$ neural networks with continuous input variables
- Need better tools to analyze these cases
Vapnik-Chervonenkis dimension
Intuition
- VC dimension can be understood as measuring the capacity of a hypothesis class to adapt to different concepts
- It can be understood through the following thought experiment:
- Pick a fixed hypothesis class $\mathcal{H}$, e.g. axis-aligned rectangles in $\mathbb{R}^2$
- Let us enumerate all possible labelings of a training set of size $m$: $\mathcal{Y}^m = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{2^m}\}$, where $\mathbf{y}_j = (y_{j1}, \ldots, y_{jm})$, and $y_{ji} \in \{0,1\}$ is the label of the $i$'th example in the $j$'th labeling
- We are allowed to freely choose a distribution $D$ generating the inputs and to generate the input data $x_1, \ldots, x_m$
- $VCdim(\mathcal{H}) =$ the size of the largest training set for which we can find a consistent classifier for every labeling in $\mathcal{Y}^m$
- Intuitively:
- low $VCdim \implies$ easy to learn, low sample complexity
- high $VCdim \implies$ hard to learn, high sample complexity
- infinite $VCdim \implies$ cannot learn in PAC framework
Shattering
- The underlying concept in VC dimension is shattering
- Given a set of points $S = \{ x_1, \ldots, x_m \}$ and a fixed class of functions $\mathcal{H}$
- $\mathcal{H}$ is said to shatter $S$ if, for any possible partition of $S$ into a positive subset $S_+$ and a negative subset $S_-$, we can find a hypothesis $h \in \mathcal{H}$ for which $h(x) = 1$ if and only if $x \in S_+$
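As a concrete companion to the definition, the following Python sketch (my own illustration; the helper name `shatters` and the candidate hypothesis pool are assumptions, not lecture code) checks by brute force whether a finite pool of hypotheses, given as callables, shatters a point set:

```python
from itertools import product

def shatters(hypotheses, points):
    """True if every labeling in {0,1}^m of `points` is realized by some hypothesis."""
    realized = {tuple(int(h(x)) for x in points) for h in hypotheses}
    return all(labeling in realized for labeling in product((0, 1), repeat=len(points)))

# Example: threshold functions h_t(x) = 1 iff x >= t, for a few candidate thresholds.
thresholds = [lambda x, t=t: x >= t for t in (-1.0, 0.5, 1.5, 2.5)]
print(shatters(thresholds, [1.0]))       # True: a single point can be labeled both ways
print(shatters(thresholds, [1.0, 2.0]))  # False: the labeling (1, 0) is never realized
```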
How to show that $VCdim(\mathcal{H}) = d$
- To show that $VCdim(\mathcal{H}) = d$ for a hypothesis class, we need to establish two facts:
- There exists a set of inputs of size $d$ that can be shattered by hypotheses in $\mathcal{H}$ (i.e. we can pick the set of inputs any way we like): \(VCdim(\mathcal{H}) \geq d\)
- There does not exist any set of inputs of size $d+1$ that can be shattered (i.e. we need to show a general property): \(VCdim(\mathcal{H}) < d + 1\)
Example: intervals on a real line
- Let the hypothesis class be intervals in $\mathbb{R}$
- Each hypothesis is defined by two parameters $b_h, e_h \in \mathbb{R}$: the beginning and end of the interval, $h(x) = \mathbf{1}_{b_h \leq x \leq e_h}$
- We can shatter any set of two points by changing the endpoints of the interval
- We cannot shatter a three point set, as the middle point cannot be excluded while the left-hand and right-hand side points are included
- We conclude that the VC dimension of intervals on the real line is 2
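The interval example can be verified by brute force. A sketch (my own; it assumes it is enough to consider candidate endpoints placed between and around the given points, which is valid because the labeling only depends on which gaps the endpoints fall into):

```python
import itertools

def interval_shatters(points):
    """Check whether intervals [b, e] on the real line shatter the given points."""
    pts = sorted(points)
    # Candidate endpoints: midpoints between consecutive points plus values outside the range.
    cuts = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    realized = {tuple(int(b <= x <= e) for x in points)
                for b, e in itertools.product(cuts, repeat=2)}
    return len(realized) == 2 ** len(points)

print(interval_shatters([0.0, 1.0]))       # True  -> VCdim >= 2
print(interval_shatters([0.0, 1.0, 2.0]))  # False -> a 3-point set cannot be shattered
```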
Lines in $\mathbb{R}^2$
The hypothesis class of lines $h(x) = ax + b$ (used as decision boundaries) shatters a set of three points in $\mathbb{R}^2$.
We conclude that the VC dimension is $\geq 3$
Four points cannot be shattered by lines in ℝ²:
- There are only two possible configurations of four points in ℝ²:
- All four points reside on the boundary of the convex hull
- Three points form the convex hull and one is in the interior
- In the first case, we cannot draw a line separating the top and bottom points from the left-hand and right-hand side points
- In the second case, we cannot separate the interior point from the points on the boundary of the convex hull with a line
- These two cases are sufficient to show that $VCdim = 3$
VC-dimension of axis-aligned rectangles
- With axis-aligned rectangles we can shatter a set of four points (the picture shows 4 of the 16 labelings)
- This implies $VCdim(\mathcal{H}) \geq 4$
- For five distinct points, consider the minimum bounding box of the points
- There are two possible configurations:
- One or more points lie in the interior of the box: then one cannot include the points on the boundary while excluding the points in the interior
- At least one of the edges contains two points: in this case we can pick either of the two points and verify that this point cannot be excluded while all the other points are included
- Thus, by the four-point example and these two cases, we have established that $VCdim(\mathcal{H}) = 4$
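The rectangle argument can also be checked numerically. The sketch below (an illustration, not lecture code; the diamond configuration and function names are my own) uses the fact that, for a given labeling, the tightest consistent axis-aligned rectangle is the bounding box of the positive points, so a labeling is realizable exactly when that box contains no negative point:

```python
import itertools
import numpy as np

def rectangles_shatter(points):
    """Brute-force check whether axis-aligned rectangles shatter a 2-D point set."""
    pts = np.asarray(points, dtype=float)
    for labels in itertools.product((0, 1), repeat=len(pts)):
        pos = pts[np.array(labels) == 1]
        neg = pts[np.array(labels) == 0]
        if len(pos) == 0:
            continue  # the all-negative labeling is realized by a rectangle away from the points
        lo, hi = pos.min(axis=0), pos.max(axis=0)       # bounding box of the positive points
        if np.all((neg >= lo) & (neg <= hi), axis=1).any():
            return False   # some negative point lies inside every consistent rectangle
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(rectangles_shatter(diamond))                    # True  -> VCdim >= 4
print(rectangles_shatter(diamond + [(0.25, 0.25)]))   # False -> this 5-point set is not shattered
```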
Vapnik-Chervonenkis dimension formally
- Formally $VCdim(\mathcal{H})$ is defined through the growth function
\(\Pi_{\mathcal{H}}(m) = \max_{\{x_1, \ldots, x_m\} \subset X} \left| \{ (h(x_1), \ldots, h(x_m)) : h \in \mathcal{H} \} \right|\)
- The growth function gives the maximum number of distinct labelings that the hypothesis class $\mathcal{H}$ can produce, maximized over all sets of $m$ input points
- The maximum of the growth function is $2^m$ for a set of $m$ examples
- Vapnik-Chervonenkis dimension is then
\(VCdim(\mathcal{H}) = \max_m \{ m | \Pi_{\mathcal{H}}(m) = 2^m \}\)
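Reusing the candidate-endpoint idea from the interval sketch above, the growth function of intervals on the real line can be evaluated directly (a sketch under the assumption that, for intervals, any set of $m$ distinct points attains the maximum, so a single evenly spaced set suffices):

```python
import itertools

def interval_labelings(points):
    """Distinct labelings of `points` produced by intervals [b, e] on the real line."""
    pts = sorted(points)
    cuts = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    return {tuple(int(b <= x <= e) for x in points)
            for b, e in itertools.product(cuts, repeat=2)}

for m in range(1, 6):
    growth = len(interval_labelings(list(range(m))))
    print(f"m = {m}: growth = {growth}, 2^m = {2 ** m}, shattered = {growth == 2 ** m}")
# The largest m with growth = 2^m is 2, matching VCdim(intervals) = 2.
```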
Visualization
- The figure shows the ratio of the growth function $\Pi_{\mathcal{H}}(m)$ to the maximum number of labelings $2^m$ of a set of size $m$
- The hypothesis class is hyperplanes in a 20-dimensional space (VC dimension = 21)
VC dimension of finite hypothesis classes
Any finite hypothesis class has VC dimension $VCdim(\mathcal{H}) \leq \log_2 |\mathcal{H}|$
- To see this:
- Consider a set of $m$ examples $S = \{x_1, \ldots, x_m\}$
- This set can be labeled in $2^m$ different ways, by choosing the labels $y_i \in \{0,1\}$ independently
- Each hypothesis $h \in \mathcal{H}$ fixes one labeling, a length-$m$ binary vector $\mathbf{y}(h,S) = (h(x_1), \ldots, h(x_m))$
- All hypotheses in $\mathcal{H}$ together can provide at most $|\mathcal{H}|$ different labelings in total (different vectors $\mathbf{y}(h,S), h \in \mathcal{H}$)
- If $|\mathcal{H}| < 2^m$ we cannot shatter $S$ $\implies$ we cannot shatter any set of size $m > \log_2 |\mathcal{H}|$
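A quick numerical illustration of this argument (my own sketch; the domain size, class size and seed are arbitrary): draw a random finite hypothesis class over a small finite domain, compute its VC dimension by brute force, and confirm it does not exceed $\log_2|\mathcal{H}|$:

```python
import itertools
import math
import random

random.seed(0)
DOMAIN = list(range(8))   # a small finite input domain
# A finite hypothesis class: each hypothesis is an arbitrary labeling of the domain.
H = [tuple(random.randint(0, 1) for _ in DOMAIN) for _ in range(16)]

def vc_dimension(hypotheses, domain):
    """Largest size of a subset of `domain` shattered by `hypotheses` (brute force)."""
    best = 0
    for d in range(1, len(domain) + 1):
        for subset in itertools.combinations(range(len(domain)), d):
            realized = {tuple(h[i] for i in subset) for h in hypotheses}
            if len(realized) == 2 ** d:
                best = d
                break       # found a shattered set of size d; try d + 1
        else:
            break           # no set of size d is shattered, so no larger set is either
    return best

print(vc_dimension(H, DOMAIN), "<=", math.log2(len(H)))
```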
VC dimension: Further examples
Examples of classes with a finite VC dimension:
- convex d-polygons in $\mathbb{R}^2$: $VCdim = 2d + 1$ (e.g. for general, not restricted to axis-aligned, rectangles $VCdim = 5$)
- hyperplanes in $\mathbb{R}^d$: $VCdim = d + 1$ - (e.g. single neural unit, linear SVM)
- neural networks with sign activation functions: $VCdim = O(|E| \log |E|)$, where $E$ is the set of edges in the network
- Boolean monomials of $d$ variables: $VCdim = d$
- arbitrary Boolean formulae of $d$ variables: $VCdim = 2^d$
Half-time poll: VC dimension of threshold functions in ℝ
Consider the hypothesis class $\mathcal{H} = \{h_{\theta}\}$ of threshold functions
$h_{\theta} : \mathbb{R} \mapsto \{0,1\}, \theta \in \mathbb{R}$:
What is the VC dimension of this hypothesis class?
- $VCdim = 1$
- $VCdim = 2$
- $VCdim = \infty$
Answer to the poll in Mycourses by 11:15: Go to Lectures page and scroll down to “Lecture 3 poll”:
Answers are anonymous and do not affect grading of the course.
Convex polygons have VC dimension $= ∞$
- Let our hypothesis class be convex polygons in $\mathbb{R}^2$ without restriction of number of vertices $d$
- Let us draw an arbitrary circle on $\mathbb{R}^2$ - the distribution $D$ will be concentrated on the circumference of the circle
- This is a difficult distribution for learning polygons - we choose it on purpose
- Let us consider a set of $m$ points with arbitrary binary labels
- For any $m$, let us position the $m$ points on the circumference of the circle
- simulating drawing the inputs from the distribution $D$
- Start from an arbitrary positive point (red circles in the figure)
- Traverse the circumference clockwise, skipping all negative points and stopping at positive points
- Connect adjacent positive points with an edge
- This forms a $p$-polygon inside the circle, where $p$ is the number of positive data points
- Define $h(x) = +1$ for points inside the polygon and $h(x) = 0$ outside
- Each of the $2^m$ labelings of $m$ examples gives us a $p$-polygon that includes the $p$ positive points in that labeling and excludes the negative points $\implies$ we can shatter a set of size $m$: $VCdim(\mathcal{H}) \geq m$
- Since $m$ was arbitrary, we can grow it without limit $\implies$ $VCdim(\mathcal{H}) = \infty$
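The construction is easy to simulate (a sketch; the number of points and the unit-circle placement are arbitrary choices of mine, not lecture code): place $m$ points on a circle and, for every labeling, form the convex polygon of the positive points and check that it classifies all $m$ points correctly:

```python
import itertools
import numpy as np

def polygon_contains(vertices, p, eps=1e-9):
    """Point-in-convex-polygon test; `vertices` must be in counter-clockwise order."""
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i]
        bx, by = vertices[(i + 1) % n]
        # z-component of the cross product (b - a) x (p - a)
        if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) < -eps:
            return False   # p lies strictly to the right of the edge a -> b
    return True

m = 8
angles = 2 * np.pi * np.arange(m) / m                   # m points on the unit circle, CCW order
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)

shattered = True
for labels in itertools.product((0, 1), repeat=m):
    pos = [points[i] for i in range(m) if labels[i] == 1]
    if len(pos) == 0:
        continue    # realized by a polygon placed far away from the circle
    if len(pos) == 1:
        predictions = tuple(int(np.allclose(points[i], pos[0])) for i in range(m))
    else:
        # Two positive points give a chord, three or more a convex polygon; in both cases
        # only the positive points on the circle are contained.
        predictions = tuple(int(polygon_contains(pos, points[i])) for i in range(m))
    if predictions != labels:
        shattered = False
        break

print(f"all {2 ** m} labelings of {m} points on the circle realized:", shattered)  # True
```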
Generalization bound based on the VC-dimension
(Mohri et al., 2018) Let $\mathcal{H}$ be a family of functions taking values in $\{-1,+1\}$ with VC-dimension $d$. Then for any $\delta > 0$, with probability at least $1-\delta$ the following holds for all $h \in \mathcal{H}$:
\[R(h) \leq \hat{R}(h) + \sqrt{\frac{2 \log(em/d)}{m/d}} + \sqrt{\frac{\log(1/\delta)}{2m}}\]
- $e \approx 2.71828$ is the base of the natural logarithm
- The bound reveals that the critical quantity is $m/d$, i.e. the number of examples divided by the VC-dimension
- Manifestation of the Occam’s razor principle: to justify an increase in the complexity, we need reciprocally more data
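A sketch that evaluates the bound numerically (the values of $m$, $d = 21$, $\hat{R}(h) = 0$ and $\delta = 0.05$ are illustrative assumptions of mine):

```python
import math

def vc_bound(emp_risk: float, m: int, d: int, delta: float) -> float:
    """R(h) <= emp_risk + sqrt(2*log(e*m/d)/(m/d)) + sqrt(log(1/delta)/(2*m))."""
    return (emp_risk
            + math.sqrt(2 * math.log(math.e * m / d) / (m / d))
            + math.sqrt(math.log(1 / delta) / (2 * m)))

for m in (500, 5_000, 50_000):
    print(f"m = {m:6d}, d = 21: bound = {vc_bound(0.0, m, d=21, delta=0.05):.3f}")
```

With $d = 21$ (the hyperplane class from the visualization) the bound is still above 0.5 at $m = 500$ and only becomes small once $m/d$ reaches the thousands.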
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?
- Consider a set of training examples $S_0 = \{(x_i, y_i)\}_{i=1}^m$
- Generate $M$ new datasets $S_1, \ldots, S_M$ from $S_0$ by randomly drawing a new label $\sigma_{ik} \in \mathcal{Y}$ for each training example in $S_0$:
\[S_k = \{(x_i, \sigma_{ik})\}_{i=1}^m\]
- Train a classifier $h_k$ minimizing the empirical risk on training set $S_k$, and record its empirical risk
\[\hat{R}(h_k) = \frac{1}{m} \sum_{i=1}^m \mathbf{1}_{h_k(x_i) \neq \sigma_{ik}}\]
- Compute the average empirical risk over all datasets:
\[\bar{\epsilon} = \frac{1}{M} \sum_{k=1}^M \hat{R}(h_k)\]
- Observe the quantity
\(\hat{\mathcal{R}} = \frac{1}{2} - \bar{\epsilon}\)
- We have $\hat{\mathcal{R}} = 0$ when $\bar{\epsilon} = 0.5$, that is, when the predictions correspond to random coin flips (0.5 probability of predicting either class)
- We have $\hat{\mathcal{R}} = 0.5$ when $\bar{\epsilon} = 0$, that is, when all hypotheses $h_k, k = 1, \ldots, M$ have zero empirical error (a perfect fit to noise, not good!)
- Intuitively, we would like our hypothesis class
- to be able to separate noise from signal, i.e. to have low $\hat{\mathcal{R}}$
- to have low empirical error on the real data; otherwise it is impossible to obtain low generalization error
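This experiment is easy to simulate. Below is a minimal sketch (my own, not lecture code) using decision stumps, i.e. threshold classifiers on a single feature fitted by exhaustive search over thresholds at the observed feature values; the data set is synthetic and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Exhaustive search over stumps (feature, threshold, sign) minimizing empirical risk."""
    best_err, best = float("inf"), None
    for j in range(X.shape[1]):
        for t in X[:, j]:                    # thresholds restricted to observed feature values
            for sign in (1, -1):
                pred = (sign * (X[:, j] - t) >= 0).astype(int)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, best = err, (j, t, sign)
    return best, best_err

m, M = 50, 200                               # training set size and number of noise datasets
X = rng.normal(size=(m, 5))                  # synthetic inputs; labels will be pure noise

errors = [fit_stump(X, rng.integers(0, 2, size=m))[1] for _ in range(M)]
eps_bar = np.mean(errors)
print(f"average empirical risk on noise: {eps_bar:.3f}")
print(f"estimated complexity 1/2 - eps_bar: {0.5 - eps_bar:.3f}")
```

By the identity derived in the following slides, $1/2 - \bar{\epsilon}$ is a Monte Carlo estimate of the empirical Rademacher complexity of the stump class on this sample.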
Rademacher complexity
- Rademacher complexity defines complexity as the capacity of hypothesis class to fit random noise
- For binary classification with labels $\mathcal{Y} = \{-1, +1\}$, the empirical Rademacher complexity can be defined as
\[\hat{\mathcal{R}}_S(\mathcal{H}) = \frac{1}{2} E_{\sigma} \bigg( \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_{i} h(\mathbf{x}_i) \bigg)\]
- $\sigma_i \in \{-1, +1\}$ are Rademacher random variables, drawn independently from the uniform distribution (i.e. $Pr(\sigma_i = 1) = 0.5$)
- The expression inside the expectation takes the highest correlation, over all hypotheses $h \in \mathcal{H}$, between the random labels $\sigma_i$ and the predicted labels $h(\mathbf{x}_i)$
- Let us rewrite $\hat{\mathcal{R}}_{S}(\mathcal{H})$ in terms of empirical error
- Note that with labels $\mathcal{Y} = \{+1, -1\}$, $\sigma_i h(\mathbf{x}_i) = 1 - 2 \cdot \mathbf{1}_{h(\mathbf{x}_i) \neq \sigma_i}$
- Thus $\frac{1}{m}\sum_{i=1}^m \sigma_i h(\mathbf{x}_i) = 1 - 2\hat{\epsilon}(h)$, where $\hat{\epsilon}(h)$ is the empirical error of $h$ on the randomly labeled data
- Plugging in:
\(\hat{\mathcal{R}}_S(\mathcal{H}) = \frac{1}{2} E_{\sigma}\left(\sup_{h \in \mathcal{H}}(1 - 2\hat{\epsilon}(h))\right) = \frac{1}{2} \left(1 - 2 E_{\sigma} \inf_{h \in \mathcal{H}} \hat{\epsilon}(h)\right) = \frac{1}{2} - E_{\sigma} \inf_{h \in \mathcal{H}} \hat{\epsilon}(h)\)
- Now we have expressed the empirical Rademacher complexity in terms of the expected empirical error of classifying randomly labeled data
- But how does the Rademacher complexity help in model selection?
- We need to relate it to generalization error
Generalization bound with Rademacher complexity
(Mohri et al., 2018): For any $\delta > 0$, with probability at least $1 - \delta$ over a sample drawn from an unknown distribution $D$, for any $h \in \mathcal{H}$ we have:
\[R(h) \leq \hat{R}_S(h) + \hat{\mathcal{R}}_S(\mathcal{H}) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}\]
The bound is composed of the sum of:
- The empirical risk of $h$ on the training data $S$ (with the original labels): $\hat{R}_S(h)$
- The empirical Rademacher complexity: $\hat{\mathcal{R}}_S(\mathcal{H})$
- A term that tends to zero as a function of the size of the training data, as $O(1/\sqrt{m})$ assuming constant $\delta$
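Continuing with illustrative numbers (the empirical risk 0.10, the Rademacher complexity estimate 0.15, $m = 500$ and $\delta = 0.05$ are assumptions of mine, not values from the lecture), the three terms of the bound combine as in this sketch:

```python
import math

def rademacher_bound(emp_risk: float, emp_rademacher: float, m: int, delta: float) -> float:
    """R(h) <= emp_risk + emp_rademacher + 3 * sqrt(log(2/delta) / (2m))."""
    return emp_risk + emp_rademacher + 3 * math.sqrt(math.log(2 / delta) / (2 * m))

print(f"bound = {rademacher_bound(0.10, 0.15, m=500, delta=0.05):.3f}")  # ≈ 0.43
```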
Example: Rademacher and VC bounds on a real dataset
- Prediction of protein subcellular localization
- 10-500 training examples, 172 test examples
- Comparing Rademacher and VC bounds using $\delta = 0.05$
- Training and test errors are also shown
- The Rademacher bound is sharper than the VC bound
- The VC bound is not yet informative with 500 examples (it remains above 0.5 at $\delta = 0.05$)
- The gap between the mean of the error distribution (≈ test error) and the 0.05 probability tail (the VC and Rademacher bounds) is evident (and expected)
Rademacher vs. VC
Note the differences between Rademacher complexity and VC dimension:
- VC dimension is independent of any training sample or distribution generating the data: it measures the worst case, where the data is generated in a way that is bad for the learner
- Rademacher complexity depends on the training sample and is thus dependent on the data-generating distribution
- VC dimension focuses on the extreme case of realizing all labelings of the data; Rademacher complexity smoothly measures the ability to realize random labelings
- Generalization bounds based on Rademacher complexity are applicable to any binary classifier (SVM, neural network, decision tree)
- It motivates state-of-the-art learning algorithms such as support vector machines
- But computing it may be hard if we need to train a large number of classifiers
- Vapnik-Chervonenkis dimension (VCdim) is an alternative that is usually easier to derive analytically
Summary: Statistical learning theory
- Statistical learning theory focuses on analyzing the generalization ability of learning algorithms
- The Probably Approximately Correct (PAC) framework is the most studied theoretical framework, asking for the generalization error to be bounded by $\epsilon$ with high probability $(1 - \delta)$, for an arbitrary error level $\epsilon > 0$ and confidence parameter $\delta > 0$
- The Vapnik-Chervonenkis dimension lets us study the learnability of infinite hypothesis classes through the concept of shattering
- Rademacher complexity is a practical alternative to the VC dimension, typically giving sharper bounds (but requiring a lot of simulations to be run)