ASML-Foundation

Monday week 6

ADVANCED STATISTICS AND MACHINE LEARNING FOUNDATIONS

Introduction

What is the goal?

  • Make rational inferences and decisions from data.
  • Inference about unknown quantity $U$ given knowledge $K$:
    • Current data $X$
    • Other knowledge $K’$, e.g.:
      • Previous data; physical theory; expert opinion.
  • Decision $D$ about action to take, given this knowledge and the utility (or loss) consequent on the results.


Why is this goal important?

  • World data volume is huge:
    • Growing by 40% per year.
    • 175ZB (25TB per person) by 2025.
  • Most (95%) is images and video.
    • E.g. earth observation satellite generates 1TB/day and there are thousands of them.
  • Data usage is tiny:
    • ~1% is analysed.
  • Automatic data analysis is crucial.

Particle physics

  • One hour of LHC data ≈ one year of Facebook data.
  • 70% of all data are classified by machine learning.
  • 100% charged-particle tracks vetted by NNs.
  • Achieving the sensitivity of a recent LHCb search would need 10 years of data without ML.
  • Determination of particle energies. - Multivariate regression, boosted decision trees.
  • Detection and classification of H → γγ decays. - Boosted decision trees.
  • Categorization of neutrino events in the NOvA and MicroBooNE detectors. - Computer vision using CNNs.

How do we achieve the goal?

  • Remarkably, there is essentially one way to do this correctly.
  • Making inferences under uncertainty:
    • Probability theory: ‘unique’ generalization of classical logic to uncertain propositions (Cox’s theorem).
  • Making decisions under uncertainty:
    • Decision theory: ‘unique’ method for rational choices (von Neumann–Morgenstern utility theorem).
  • Together: Bayesian statistics.


And that’s it! Easy, right? Wrong!

  • Data and decisions are often very complex and high-dimensional.
    • E.g. a satellite image: a 200-million-dimensional vector, i.e. an element of a set of ~10^500,000,000 possibilities.
      • Cf 10^80 atoms in observable universe.
    • Neural networks may have ~ 10^9 parameters.
  • Constructing a probability distribution that represents enough knowledge to be useful.
    • Very difficult: most ‘images’ are just noise.
    • Perhaps not possible ‘by hand’.
  • Extracting information from this distribution.
    • We need to integrate and optimize over these spaces.

Bayesian statistics in use


Structure

  • Framework
    • Bayesian statistics: the maths of rationality
    • Cox’s theorem
    • Von Neumann-Morgenstern theorem
  • Framing the problem
    • Making inferences
    • Making decisions
  • Building the components
    • Ignorance
    • Information
    • Dependencies
    • Conditional models
  • Computing the answer
    • The problem
    • Integration
    • Optimization

Framework

Bayesian Statistics: the mathematics of rationality

  • Probability theory
    • The unique method for rational thinking
  • Decision theory
    • The unique method for rational acting
  • Bayesian statistics = PT + DT
  • Why these ingredients?
    • Cox’s theorem
    • Von Neumann-Morgenstern theorem

Cox’s theorem

Cox’s theorem: postulates

  • Concerns plausibilities $P(A|B)$ of propositions.
    • Propositions $A, B,$ etc., meaning ‘A is true’, etc.
    • $P(A|B)$ is the plausibility of A being true given that B is true.
  • Axioms:
    • Comparability
      • The plausibility of a proposition is a real number (all plausibilities can be compared);
    • Rationality
      • Plausibilities should vary rationally with the variation of other plausibilities, including reducing to classical logic when all propositions are certain;
    • Consistency
      • If the plausibility of a proposition can be derived in many ways, all the results must be equal.

Cox’s theorem: unique solution

  • Plausibilities satisfy:
    • Range:
      • Certain truth is represented by $P(A|B) = 1$
      • Certain falsehood is represented by $P(A|B) = 0$.
    • Negation:
      • $P(A|B) + P(\neg A|B) = 1$
        • $\neg A$ is the proposition ‘A is false’.
    • Conjunction (product rule):
      • $P(A, B|C) = P(A|B, C)P(B|C)$
        • $A, B$ is the proposition ‘A and B’.

Cox’s theorem: consequences

  • Bayes theorem (from product rule)
    $P(B|A, C) = \frac{P(A|B, C)P(B|C)}{P(A|C)}$

  • Disjunction (sum rule)
    $P(A \vee B|C) = P(A|C) + P(B|C) - P(A, B|C)$

  • Partition (extending the conversation)
    $P(A|C) = \sum_i P(A, B_i|C) = \sum_i P(A|B_i, C) P(B_i|C)$

  • Often:
    $P(A|C) = \sum_i P(A|B_i) P(B_i|C)$
    Sum over all ‘paths’ from C to A.
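A minimal numerical illustration of these rules (hypothetical numbers for a disease $D$ and a positive test $T$; not from the slides):

```python
# Illustration of Bayes' theorem and 'extending the conversation'
# using a hypothetical two-proposition example (disease D, positive test T).

p_D = 0.01                      # prior P(D | C)
p_T_given_D = 0.95              # P(T | D, C)
p_T_given_notD = 0.05           # P(T | not D, C)

# Partition: P(T | C) = sum over the two 'paths' from C to T.
p_T = p_T_given_D * p_D + p_T_given_notD * (1 - p_D)

# Bayes' theorem: P(D | T, C) = P(T | D, C) P(D | C) / P(T | C).
p_D_given_T = p_T_given_D * p_D / p_T

print(f"P(T | C)    = {p_T:.4f}")
print(f"P(D | T, C) = {p_D_given_T:.4f}")   # ~ 0.16
```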

Von Neumann-Morgenstern theorem

Von Neumann-Morgenstern theorem: postulates

  • Concerns preferences ≼ between probability distributions on a set of rewards $\mathcal{R}$.
    • Known as ‘gambles’.
    • $R \prec S$ means R is a worse gamble than S.
    • Decisions $d \in \mathcal{D}$ choose between gambles.
  • Axioms:
    • Completeness
      • For any two gambles R and S, either $R \preceq S$, $S \preceq R$, or both ($R \sim S$).
    • Transitivity
      • If $R \prec S$ and $S \prec T$, then $R \prec T$.
    • Continuity
      • If $R \prec S \prec T$, then $\exists p \in [0,1]$: $S \sim pR + (1 - p)T$.
    • Independence
      • If $R \prec S$, then $\forall T$ and $\forall p \in (0, 1]$: $pR + (1 - p)T \prec pS + (1 - p)T$.

Von Neumann-Morgenstern theorem: unique solution

  • There exists a function $L$ on gambles, known as the ‘loss function’, that satisfies:
    • $L(R) \geq L(S) \iff R \preceq S$
    • $L(pR + (1 - p)S) = pL(R) + (1 - p)L(S)$
  • So, to decide between gambles: Minimize the expected loss

Von Neumann-Morgenstern theorem: consequences

  • Remember $L(d, u)$? How do we get there?
  • Rewards (i.e. consequences of decision) are pairs $(u’, u) \in W \times W$.
    • $W$ are possible ‘states of the world’ (SoW).
    • $u$ is state of the world before decision.
    • $u’$ is state of the world after decision.
  • $\tilde{L}(u’, u)$ measures how bad is the change in the SoW.
  • Pick the decision $d \in \mathcal{D}$ that minimizes
    • $\langle \tilde{L} \rangle (d) = \sum_{u'} \tilde{L}(u', u) P(u'|u, d, K)$.
  • If $u’ = f(u, d)$, then we can instead minimize
    • $\langle L \rangle (d) = \sum_{u} L(d, u) P(u|d, K)$.
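A minimal sketch of ‘minimize the expected loss’ for a toy discrete problem (the states, posterior, and loss table below are made up for illustration):

```python
import numpy as np

# Minimal sketch of expected-loss minimization for a discrete problem.
# Two states of the world u, a posterior P(u | K), and a hypothetical loss table L(d, u).

p_u = np.array([0.7, 0.3])            # P(u | K) over two states
L = np.array([[0.0, 10.0],            # L(d=0, u)
              [1.0,  1.0]])           # L(d=1, u)

expected_loss = L @ p_u               # <L>(d) = sum_u L(d, u) P(u | K)
best_d = int(np.argmin(expected_loss))

print("Expected losses:", expected_loss)   # [3.0, 1.0]
print("Best decision:", best_d)            # 1
```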

Framing the problem

Making inferences

Making inferences: set-up

  • Wish to make an inference about unknown $U$:
    • Class of galaxy; location of neutrino tracks in image; energy of particle; disease diagnosis; guilt or innocence; …
  • We have some knowledge $K$ already:
    • The world is 3d (or 4d or 10d); physical laws; general knowledge; previous data; expert opinion; …
  • The quantity of interest for inference is what we know about $U$ given $K$:
    • $P(U|K)$
    • We must construct this probability.
  • How?!
\[S = \frac{1}{4 \pi \alpha'} \int d^2 \sigma \sqrt{|h|} h^{\alpha \beta} g_{\mu \nu} \partial_\alpha X^\mu \partial_\beta X^\nu\]

Making inferences: Bayes theorem

  • We have only the product and sum rules!
  • Use these to rework $P(U|K)$ into probabilities that we can reasonably assign.
  • Often $K$ splits: current data $X$ and other knowledge $K'$.
  • It is frequently much easier to construct:
    • $P(X|U,K')$: ‘causal’ data-formation model.
      • E.g. image from object.
    • $P(U|K')$: ‘theoretical’ model of $U$.
      • E.g. physics.
  • Then $P(U|X,K') \propto P(X|U,K')P(U|K')$.
    • N.B. Normalization may be very difficult.

Making inferences: partition

  • Sometimes other quantities $V$ can help:
    • Make the connection between $U, X,$ and $K$ ‘causal’.
    • Reduce dependence.
  • There are several ways this can happen.
  • Example:
    • $P(U|K) = \sum_V P(U|V,K)P(V|K)$
  • Sometimes this is just an extra formal parameter.
  • Sometimes it is a physical quantity.
    • Future solar exposure, rainfall, and temperature, if plant growth is being predicted from fertilizer application and other known quantities.
    • Alarm makes JohnCalls and MaryCalls independent.

Checkpoint

  • Bayesian statistics:
    • Probability theory is rational thinking: P(U|K).
    • Decision theory is rational acting: L(D,U).
  • Problems are how to:
    • Model and compute P(U|K).
    • Compute and optimize E[L](D).
  • Often useful to separate data X: P(U|X,K).
    • Use Bayes’ theorem: P(U|X,K) ∝ P(X|U,K)P(U|K).
  • Examples:
    • Medical diagnosis: X binary, U binary: P(U|K) and P(X|U,K) Bernoulli.
    • Labour voting: X ∈ ℕ, U ∈ [0,1]: P(U|K) beta and P(X|U,K) binomial.
  • Often useful to introduce new unknown quantities to make the connection between U, X, and K easier to model:
    • More causal or more independence: e.g. plant growth.

Making inferences: mean of Gaussian

  • We have some data $x = \{ x_i \}_{i=1}^{n}$.
  • We will model it using a Gaussian, with unknown mean $\mu$ and variance $v$.
  • We want to know about $\mu$.
    • Quantity of interest is $P(\mu|x, K)$.
    • Note that we are not estimating $\mu$.
  • Result: $\frac{\mu - \bar{x}}{s/\sqrt{n}} \,\Big|\, x, K \sim t(n - 1)$, where $s$ is the sample standard deviation (see the sketch below).
    • Familiar but quite different in meaning.
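A minimal sketch of this result in code, assuming the standard noninformative-prior derivation quoted above and using synthetic data:

```python
import numpy as np
from scipy import stats

# Minimal sketch: posterior credible interval for the mean of a Gaussian
# with unknown variance, using the t(n-1) result quoted above.
# The data here are synthetic.

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=30)

n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# (mu - xbar) / (s / sqrt(n)) | x, K  ~  t(n - 1)
post = stats.t(df=n - 1, loc=xbar, scale=s / np.sqrt(n))
lo, hi = post.interval(0.95)

print(f"Posterior mean {xbar:.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```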


Making inferences: summaries

  • Mode:
    • No measure of uncertainty.
    • Can be unrepresentative.
    • Not invariant.
  • Mean:
    • No measure of uncertainty.
    • Can be unrepresentative.
  • Mean and variance:
    • Can be unrepresentative.
  • Credible intervals:
    • Better at summarizing uncertainty, but no estimate.
  • In high dimensions, all these become difficult to compute.

Checkpoint

  • Often useful to introduce new unknown quantities to make the connection between $U, X,$ and $K$ easier to model:
    • More causal or more independence: e.g. plant growth.
    • Example:
      • Introduce variance to make inference about mean.
        • Also contrast with classical t-tests.
  • Summaries of distributions:
    • Pros and cons.

Making inferences: posterior predictive

  • Another example of partition.
  • Known $K$ is data $X = \{X_i\}$ and other info $K'$.
  • Unknown $U$ is next piece of data $Y$.
  • Quantity of interest: $P(Y|X,K)$.
    • Given a name: ‘posterior predictive distribution’.
    • Just another special case.
  • Missing information:
    • Parameters of process.
      • E.g. $P(\checkmark) = p$ and $P(\times) = 1 - p$.


Parameter estimation in integrals: plug-in estimates

  • Often arises not as part of a decision, but as an approximation to an integral.
  • Posterior predictive distribution is usual example.
  • Leads to ‘plug-in estimates’:
    • Estimate parameter $a$ from current data via $P(a|x,K)$, giving estimate $\hat{a}$.
    • Treat it as the true value and plug into $P(y|a,K)$, giving $P(y|\hat{a},K)$.
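A minimal sketch contrasting the full posterior predictive with a plug-in (here MAP) estimate for a Beta–Bernoulli model; the prior and counts are hypothetical:

```python
# Sketch: plug-in estimate vs full posterior predictive for a Bernoulli process
# with a Beta(1, 1) prior (hypothetical data: 3 successes out of 4).

k, n = 3, 4
a0, b0 = 1.0, 1.0                 # Beta prior parameters
a, b = a0 + k, b0 + (n - k)       # posterior Beta(a, b)

# Full posterior predictive: P(y = 1 | x, K) = E[p | x, K] = a / (a + b)
p_pred = a / (a + b)

# Plug-in with the MAP estimate p_hat = (a - 1) / (a + b - 2)
p_hat = (a - 1) / (a + b - 2)

print(f"Posterior predictive P(y=1): {p_pred:.3f}")   # 0.667
print(f"Plug-in (MAP) estimate:      {p_hat:.3f}")    # 0.750
```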

Making decisions

  • Once we have $P(U|K)$, we want to use it to:
    • Tell us about $U$.
    • Make decisions.
  • Hypothesis testing, model selection, classification:
    • $U$ takes values in a finite set, often of two elements.
  • Parameter estimation, regression:
    • $U$ takes values in an uncountable set, often $\mathbb{R}^n$.
  • Unlike in classical statistics, these are just special cases of the general Bayesian statistics framework.

Making decisions: hypotheses


  • Often there are two (or more) competing hypotheses/classes/models, {H_i}.
    • A classic statistical question is how to decide between them.
    • Example:
      • H_0: μ ∈ I ⊆ ℝ
      • H_1: μ ∉ I ⊆ ℝ
    • Cf term 1 data analysis and t distributions.
  • As always, the quantity of interest is P(H_i|x, K).
    • We saw a two-class example of medical diagnosis.
    • Let us look at a simple approach to multi-class classification.

Making decisions: estimation

  • Often we want to give a single-value summary of a quantity $x \in X$.
    This is called an ‘estimate’.

  • The decisions $D$ are:
    • ‘Treat the value $d \in D \equiv X$ as if it were the true value’.
  • Different to:
    • Distribution summary: convey information succinctly.
    • Plug-in estimate: approximate an integral.
  • We need a loss function $L: X \times X \rightarrow \mathbb{R}$.
    This quantifies how bad it is to treat $d$ as the true value if it is in fact $x$.

  • Should always be bespoke, but in practice…

Building the components

The problem

  • We have seen:
    • The general framework for inferences and decisions.
    • Some examples.
  • More generally:
    • How do we construct probability distributions that represent what we know?
  • Here we will briefly describe several principles for doing this.
    • Many examples will occur in the rest of the module.

Ignorance

  • What if we know nothing about a quantity U?
    • This is never true: we usually know at least a range.
    • However it may be a useful limit.
  • For a finite set A, easy:
    • Ignorance means all elements are equivalent.
    • Equivalence means invariance under all permutations.
    • Invariance implies the uniform distribution: $P(a) = 1/|A|$.
  • But on a continuous set, e.g. ℝ, no distribution is invariant to all (even smooth) permutations.
    • How to define ignorance?
  • Solution: Jeffreys’ prior.
    • Pullback of Fisher-Rao metric to space of parameters using the ‘likelihood’.

Convenience

  • We have seen examples of conjugate priors.
  • Data $D$, unknown quantity of interest $U$, probability $P(U|D, K)$.
    • Then $P(U|D, K) \propto P(D|U, K)P(U|K)$.
  • If, for the given $P(D|U, K)$ and parametric family $P(U|K)$:
    • $P(U|D, K)$ is in the same family, we say that $P(U|K)$ is a conjugate prior for $P(D|U, K)$.
  • Very convenient for initial calculations.
  • No longer as important with fast computers.

Information

  • Sometimes we may know constraints on $P$: e.g. expectations of quantities, or averages over previous data.
  • To avoid assumptions, $P$ should:
    • Maximize our ignorance
    • Be consistent with this knowledge.
  • We need a measure of ignorance or uncertainty.
  • Entropy $H(P)$ of a distribution $P$ on a space $X$ is the unique measure satisfying:
    • Continuity under changes in $P_i$;
    • Symmetry under permutations of $X_i$;
    • Maximality at the uniform distribution;
    • Additivity under coarsening of $X$.
  • Formula:
    • $H(P) = -\sum_x P(x) \ln P(x)$ or, for a density $\rho$, $H(P) = -\int_X \rho(x) \ln \rho(x)\, dx$

Maximum entropy

  • Solution to problem:
    • Maximize entropy subject to constraints.
  • Examples:
    • No constraints, finite set: ⇒ uniform distribution.
    • Known mean and covariance ⇒ Gaussian.
  • In general, under different constraints: any exponential family distribution:
    • $P(x) \propto \exp(-\sum_a \lambda_a f_a(x))$
    • Uniform, Gaussian, Beta, Gamma, …

Dependencies

  • We may be able to characterize what one quantity tells us about another.
  • This is usefully summarized in ‘graphical models’.
  • Solutions:
    • Directed graphical models capture ‘causal’ relationships and joint dependencies.
    • Undirected graphical models capture spatio-temporal relationships and conditional independencies.

Directed graphical models

DGMs:

  • Summarize dependency information.
  • Provide a convenient ‘way to think’.
  • Help to analyze models.
  • Complex inferences can be expressed as graph operations, simplifying implementation.

Thus they enable:

  • Construction of probability distributions that capture important dependencies.
  • While keeping the models relatively simple (fewer parameters).
  • Use of the resulting models in practice.


Directed acyclic graphs (DAG)

  • Graph: $G = (V, E)$, where $E \subseteq V \times V$.
    • Cycle: a directed path of edges leading back to its starting vertex.
    • Acyclic: containing no cycles.
  • Terms:
    • Parents of $v \in V$:
      • $pa(v) = \{ u \in V : (u, v) \in E \}$
    • Children of $v \in V$:
      • $ch(v) = \{ u \in V : (v, u) \in E \}$
    • Ancestors and descendants of $v \in V$ are defined similarly.


Probability distribution

  • Each vertex $v \in V$ has an associated variable $x_v$.
  • Probability distribution on $x = \{x_v\}$: \(P(x) = \prod_v P(x_v \mid x_{pa(v)})\)
  • Advantage of structure:
    • Has $O(|V| A^{P+1})$ parameters, where each vertex has $O(P)$ parents and takes values in a set of size $A$.
    • Cf the full model with $O(A^{|V|})$ parameters.
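A quick numerical check of this parameter-counting argument (the sizes $|V| = 20$, $A = 2$, $P = 3$ are arbitrary):

```python
# Parameter counting for a DAG with |V| binary variables (A = 2),
# each with at most P parents, vs the full joint distribution.

V, A, P = 20, 2, 3

dag_params = V * A ** (P + 1)     # O(|V| A^(P+1))
full_params = A ** V              # O(A^|V|)

print(f"DAG factorization: ~{dag_params} parameters")      # 320
print(f"Full joint table:  ~{full_params} parameters")     # 1,048,576
```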


Examples

  • Naïve Bayes:
    $P(c, \theta,X) = P(c) \prod_a P(X_a|\theta_a)P(\theta_a|c)$

  • Markov chain:
    $P(X) = \prod_a P(x_a|x_{a-1})$

  • Second-order Markov chain:
    $P(X) = \prod_a P(x_a|x_{a-1},x_{a-2})$

  • Hidden Markov models:
    $P(X,Z) = \prod_a P(X_a|Z_a)P(Z_a|Z_{a-1})$


Hierarchical models

  • Simple graphical model:
    • η → θ → D
    • $P(D, \theta, \eta) = P(D|\theta)P(\theta|\eta)P(\eta)$
  • Can then calculate quantities of interest $P(U|K)$:
    • E.g. $P(\theta|D)$ or $P(\eta|D)$.
  • Example: modelling cancer rates.

Conditional independence

  • Minimal ancestral subgraph of a set $S \subseteq V$:
    • Smallest subgraph containing all ancestors of $S$.
  • Moral graph:
    • Connect all parents of each node.
    • Drop the edge directions.
  • $A \subseteq V$ is independent of $B \subseteq V$ given $S \subseteq V$ iff:
    • $S$ separates $A$ and $B$ in the moralized minimal ancestral subgraph of $A \cup B \cup S$.
  • Markov blanket:
    • $P(x_v \mid x_{V \setminus v}) = P(x_v \mid x_{N(v)})$
    • $N(v) =$ neighbours of $v$ in moral graph.

Conditional models

Conditional (discriminative) models

  • Sometimes it is hard to build joint models ‘by hand’:
    • $P(U|X, K) \propto P(X|U, K)P(U|K) = P(X, U|K)$
  • Solution: model $P(U|X, K)$ directly.
    1. Pick a parameterized distribution $P(U|a)$:
      • Parameter $a$ is often the mean.
    2. Make $a$ a function of $X$:
      • $P(U|X, f, K) = P(U|a = f(X))$.
    3. Pick a parameterized set of functions $f(X) \in \{f_w(X)\}_w$:
      • $P(U|X, w, K) = P(U|a = f_w(X))$.
    4. Pick a prior for $w$, $P(w|K)$.
    5. Compute $P(U|X, K) = \int_W dw\, P(U|X, w, K)P(w|K)$:
      • In practice, $K$ contains previous data $\{(U_i, X_i)\}$.
      • Use plug-in estimates: approximate the integral by estimating the parameters $w$ using MAP:
        • $P(w|\{U_i, X_i\}, K) \propto P(\{U_i\}|\{X_i\}, w, K)P(w|K) \propto P(\{U_i\}|\{f_w(X_i)\})P(w|K)$.
  • Classic examples: linear and logistic regression (a minimal sketch follows below).
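A minimal sketch of the recipe above for the linear-regression case, using a Gaussian likelihood and Gaussian prior so that the MAP (plug-in) estimate reduces to ridge regression; the data are synthetic and the noise/prior variances are assumed known:

```python
import numpy as np

# Sketch: P(U | a) Gaussian with mean a = f_w(X) = w . X, Gaussian prior P(w | K),
# and a MAP (plug-in) estimate of w. With these choices the MAP is ridge regression.

rng = np.random.default_rng(8)

# Synthetic previous data {(U_i, X_i)} contained in K.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
U = X @ w_true + 0.3 * rng.normal(size=n)

sigma2, tau2 = 0.3 ** 2, 1.0      # noise variance and prior variance (assumed known)
lam = sigma2 / tau2

# MAP: w_hat = argmax P(w | {U, X}, K)  =>  (X^T X + lam I) w = X^T U
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ U)

# Plug-in predictive mean for a new input x_new: a = f_{w_hat}(x_new).
x_new = np.array([0.5, 1.0, -1.0])
print("w_hat =", w_hat)
print("Plug-in predictive mean:", x_new @ w_hat)
```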

Problem:

  • Hard to incorporate prior knowledge about $U$.

Solution:

  • Generic models with many parameters to capture dependencies.
  • Large amounts of data needed to learn all these parameters.
  • Forests, deep neural networks, …
  • Can also be used for $P(X|U, K)$.


Computing the answer

The problem

  • $U$ often lies in a very high-dimensional space.
  • We need to:
    • Understand what $P(U|K)$ tells us about $U$.
    • Calculate expectations, e.g. of the loss function.
    • Optimize to find estimates, etc.
  • How to integrate and optimize over such spaces?
    • Right now, we drop a few names.
    • We will look in more detail at Monte Carlo methods.

Integration

  • Analytical approximations
    • Laplace approximation and other expansions (loop,…).
  • Numerical approximations
    • Variational: fit a simpler distribution to $P(U|K)$.
      • Gaussian (‘variational Bayes’)
      • Completely factorized (mean field approximation)
      • Pairwise interactions (Bethe approximation)
        • Message-passing algorithms.
  • Numerical integration
    • Monte-Carlo methods, in particular many flavours of Markov-chain Monte-Carlo.


Monte Carlo approximation

  • We often need to compute expectations:
    • $E[f] = \int_X f(x) P(x|K)\, dx$
  • Examples:
    • Estimation (mean, MMSE estimate).
    • Posterior predictive distribution: $f(x) = P(c|x)$.
  • Use samples $\{x_i\}$ from $P$ to approximate $E[f]$:
    • $\bar{f} = \frac{1}{n} \sum_i f(x_i)$
    • $E[\bar{f}] = E[f]$; $\mathrm{Var}[\bar{f}] = \frac{1}{n} \mathrm{Var}[f]$.
  • But need independent samples from $P$. How?
    • What does this even mean?
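A minimal sketch of the Monte Carlo estimator and its standard error, assuming we can draw iid samples from $P$ (here a standard Gaussian with $f(x) = x^2$, so the true value is 1):

```python
import numpy as np

# Sketch of a Monte Carlo estimate of E[f] under P, with P a standard Gaussian
# and f(x) = x^2 (so the true value is 1). Assumes iid samples from P are available.

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)            # samples from P
f = x ** 2

f_bar = f.mean()                              # estimate of E[f]
std_err = f.std(ddof=1) / np.sqrt(n)          # Var[f_bar] = Var[f] / n

print(f"Estimate {f_bar:.3f} +/- {std_err:.3f} (true value 1.0)")
```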

Sampling

  • Needed: a method to sample from $U([0, 1])$.
    • These are pseudo-random number generators.
    • There is a lot of theory, but we will just assume we have one.
  • Zero dimensions (discrete distributions).

  • One dimension
    • Inverse transform sampling.
  • Low dimensions
    • Rejection sampling.
    • Importance sampling.
  • High dimensions
    • Markov Chain Monte Carlo (MCMC).
    • Low-dimensional sampling is also important for high dimensions:
      • Sampling one (or a few) variables at a time.

Discrete distributions

  • Finite set $A$, with probabilities $P(a)$ for $a \in A$.
    • Order $A$ arbitrarily, giving $a_1, a_2, \ldots, a_{|A|}$.
    • Define $P_i = \sum_{j=1}^{i} P(a_j)$, with $P_0 = 0$.
    • Divide $[0,1]$ into intervals $[P_{i-1}, P_i]$, $i \in [1..|A|]$.
    • Let $P(x) = dx$, i.e. $x \sim U([0,1])$.
      • Then $P(x \in [P_{i-1}, P_i]) = P_i - P_{i-1} = P(a_i)$.
  • Algorithm
    • Sample $x$ from $U([0,1])$.
    • Return $a_i$ such that $x \in [P_{i-1}, P_i]$.
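A minimal sketch of this algorithm using cumulative probabilities (the probabilities below are arbitrary):

```python
import numpy as np

# Sketch of the discrete sampling algorithm above: cumulative probabilities
# P_i partition [0, 1]; a uniform draw x selects the interval it falls in.

rng = np.random.default_rng(2)
probs = np.array([0.2, 0.5, 0.3])         # P(a_i) for a finite set A
cum = np.cumsum(probs)                    # P_1, P_2, ..., P_|A|

def sample_discrete(n):
    x = rng.uniform(size=n)               # x ~ U([0, 1])
    return np.searchsorted(cum, x)        # index i with x in [P_{i-1}, P_i]

samples = sample_discrete(100_000)
print(np.bincount(samples) / len(samples))   # ~ [0.2, 0.5, 0.3]
```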


Inverse transform sampling

  • We want to sample from:
    $P_Y(y) = dy \rho_Y(y)$.

  • Recall:
    If $P(x) = dx$ i.e. $U([0,1])$, then under a transformation $y = \epsilon(x)$:
    $P_Y(y) = dy |(\epsilon^{-1})’(y)|$.

  • Algorithm

    • Find $\epsilon^{-1}(y) = \int_0^y dz \rho_Y(z) = F_Y(y)$, the cdf of $y$.
    • Invert it to find $\epsilon$.
    • Sample ${x_i}$ from $U([0,1])$.
    • Compute ${y_i} = {\epsilon(x_i)}$.
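A minimal sketch for an Exponential($\lambda$) target, whose cdf inverts in closed form; $\lambda = 2$ is arbitrary:

```python
import numpy as np

# Sketch of inverse transform sampling for an Exponential(lam) distribution:
# F_Y(y) = 1 - exp(-lam * y), so eps(x) = F_Y^{-1}(x) = -ln(1 - x) / lam.

rng = np.random.default_rng(3)
lam = 2.0

x = rng.uniform(size=100_000)        # x ~ U([0, 1])
y = -np.log1p(-x) / lam              # y = eps(x)

print(f"Sample mean {y.mean():.3f} (theory {1 / lam:.3f})")
print(f"Sample var  {y.var():.3f} (theory {1 / lam**2:.3f})")
```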


Rejection sampling

  • Idea:
    • Probability is area under density.
    • Sample in 2D and throw away y value.
  • Want to sample from $p = \frac{\tilde{p}}{Z_p}$, but only know $\tilde{p}$.
    • Suppose:
      • We can sample from another distribution $q$.
      • There is a $k$ such that $kq \geq \tilde{p}$.
  • Algorithm
    • Sample $z_0$ from $q$.
    • Sample $u_0$ from $U([0, kq(z_0)])$.
    • If $u_0 < \tilde{p}(z_0)$, keep $x = z_0$; else reject.
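A minimal sketch of rejection sampling for an unnormalized Beta(2, 5)-shaped target with a uniform proposal; the constant $k$ is chosen just above $\max_z \tilde{p}(z)$:

```python
import numpy as np

# Sketch of rejection sampling. Target: unnormalized p~(z) = z (1 - z)^4 on [0, 1]
# (a Beta(2, 5) shape). Proposal q = U([0, 1]); k chosen so that k q >= p~.

rng = np.random.default_rng(4)

def p_tilde(z):
    return z * (1 - z) ** 4

k = 0.082            # just above max_z p_tilde(z) = 0.08192 (at z = 0.2)

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z0 = rng.uniform()                       # sample from q
        u0 = rng.uniform(0, k)                   # sample from U([0, k q(z0)])
        if u0 < p_tilde(z0):                     # keep or reject
            samples.append(z0)
    return np.array(samples)

z = rejection_sample(50_000)
print(f"Sample mean {z.mean():.3f} (Beta(2, 5) mean {2 / 7:.3f})")
```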


Markov Chain Monte Carlo

  • In high dimensions, it is very difficult to apply rejection sampling:
    • Almost all samples are rejected.
    • E.g. hypersphere in cube.
  • Abandon independent samples:
    • Generate dependent samples using Markov chains.
  • Bad: more samples now needed for given variance.

  • Good: very flexible, so can sample from almost any distribution.


Intuition

  • Probability distribution $P(x)$ on space $X$.
  • Random walk in $X$ generates a series of samples.
  • Ensure that the relative frequencies of values in this chain are the same as $P(x)$.
  • Note that successive values are dependent.
    • Chain depends on initial point:
      • Need to eliminate burn-in.
    • Variance does not go down like $1/n$:
      • Need to estimate effective sample size.
    • Thinning can be used to reduce dependence, but it is not really a good idea.


Markov chains 1

  • A Markov chain is a probability distribution on variables $\{x_i \in A_i\}_{i \in \mathbb{N}}$.


  • Markov property:
    • $P(x_i \mid x_{<i}) = P(x_i \mid x_{i-1})$.
    • Independence of the past given the previous value.
    • Define $P(x_i \mid x_{i-1}) = T_i(x_i, x_{i-1})$.
      • Transition probability.
  • Stationary/homogeneous:
    • $A_i = A$ for all $i$.
    • $T_i(x_i, x_{i-1}) = T(x_i, x_{i-1}) = T(x', x)$.


Markov chains 2

  • Marginal probabilities related by transition probability:
    $P(x_{i+1}) = \sum_{x_i} P(x_{i+1}|x_i)P(x_i)$

  • All $x$ in same space: think of evolving distribution:
    $\rho_{i+1}(x) = \sum_{x’} T(x, x’)\rho_i(x’)$

  • Matrix multiplication:
    $\rho_{i+1} = T \rho_i$

  • After $N$ steps:
    $\rho_{i+N} = T^N \rho_i$
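A minimal sketch of this evolution for a hypothetical 3-state chain, with $T$ stored column-stochastically so that $\rho_{i+1} = T\rho_i$:

```python
import numpy as np

# Sketch: evolving a distribution with a transition matrix T, rho_{i+1} = T rho_i.
# Columns of T are the conditional distributions T(x', x); a hypothetical 3-state chain.

T = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])    # columns sum to 1

rho = np.array([1.0, 0.0, 0.0])       # initial distribution rho_0

for N in [1, 10, 100]:
    print(N, np.linalg.matrix_power(T, N) @ rho)

# The iterates approach the invariant distribution pi with pi = T pi,
# i.e. the eigenvector of T with eigenvalue 1 (normalized to sum to 1).
vals, vecs = np.linalg.eig(T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
print("pi =", pi / pi.sum())
```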


Markov chains 3

  • Under certain conditions, the evolution converges to an ‘invariant’, ‘stationary’, or ‘equilibrium’ distribution $\pi$ that does not change:
    • Chain converges: $\pi = \lim_{N \to \infty} T^N \pi_0$
    • Invariant: π = Tπ
  • If we sample from successive distributions in the chain, and wait long enough, our samples will become samples from π.
  • So if we can construct a chain with stationary distribution the one from which we want to sample, i.e. such that π = P, we are done!
  • How can we construct such Markov chains?

Gibbs sampling

  • Variable $x = \{x_a\} \in \mathcal{A} = \prod_a A_a$, with $a \in \{1, \ldots, d\}$.
    • Could be components, or different variables entirely.
    • Could be groups of variables.
  • Sample from the conditional distributions $P(x_a \mid x_{\setminus a})$, updating one variable at a time (see the sketch below):
    1. For $a = 1:d$
      1. Sample $x'_a$ from $P(x_a \mid x_{\setminus a})$
      2. Set $x_a = x'_a$
    2. Return $x$
    3. Go to 1
  • How do we know it works?
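A minimal sketch of Gibbs sampling for a bivariate Gaussian with correlation $\rho$, whose full conditionals are known in closed form:

```python
import numpy as np

# Sketch of Gibbs sampling for a bivariate Gaussian with correlation rho:
# the full conditionals are P(x1 | x2) = N(rho * x2, 1 - rho^2) and vice versa.

rng = np.random.default_rng(5)
rho = 0.8
n_samples = 20_000

x = np.zeros(2)                      # arbitrary starting point
samples = np.empty((n_samples, 2))

for i in range(n_samples):
    # Update each coordinate from its conditional, one at a time.
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho ** 2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho ** 2))
    samples[i] = x

print("Empirical correlation:", np.corrcoef(samples.T)[0, 1])   # ~ 0.8
```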

Gibbs sampling: the good

  • Only requires conditional distributions. Often:
    • Easy to find (at least unnormalized)
      • Especially for graphical models, for which conditionals are either obvious or easy to calculate.
    • Easy to sample from as low- (often one-) dimensional.
  • No tuning parameters:
    • Can effectively be automated (WinBUGS, JAGS)

Gibbs sampling: the not-so-good

  • The Gibbs chain can easily be reducible because it cannot move diagonally.
    • But if the conditional distributions are all positive, then the chain is irreducible.
  • Sampling can be slow, for two reasons:
    • You cannot move diagonally.
    • Step size vs distribution size.


Metropolis-Hastings

  • Splits sampling into two steps:
    • Propose a new state $x^*$ according to some proposal distribution $Q(x^*|x)$.
    • Accept proposal $x’ = x^*$, or stay put $x’ = x$.
  • To achieve detailed balance:
    • Accept with probability $\alpha = \min(1, R(x^*, x))$, where:
    \[R(x^*, x) = \frac{Q(x|x^*)\pi(x^*)}{Q(x^*|x)\pi(x)} = \frac{\pi(x^*)/Q(x^*|x)}{\pi(x)/Q(x|x^*)}\]
  • If the proposal is symmetric, $R(x^*, x) = \frac{\pi(x^*)}{\pi(x)}$:
    • Metropolis algorithm.
  • How do we know it works?
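A minimal sketch of random-walk Metropolis (symmetric Gaussian proposal) targeting an unnormalized one-dimensional density; the target and step size are arbitrary:

```python
import numpy as np

# Sketch of random-walk Metropolis (symmetric Gaussian proposal), targeting an
# unnormalized density pi~(x) proportional to a mixture of two Gaussians.

rng = np.random.default_rng(6)

def pi_tilde(x):
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis(n_samples, step=1.0, x0=0.0):
    x = x0
    out = np.empty(n_samples)
    for i in range(n_samples):
        x_star = x + step * rng.normal()          # propose x* ~ Q(x* | x)
        r = pi_tilde(x_star) / pi_tilde(x)        # symmetric Q => R = pi(x*)/pi(x)
        if rng.uniform() < min(1.0, r):           # accept with probability alpha
            x = x_star
        out[i] = x                                # staying put also yields a sample
    return out

samples = metropolis(50_000)
print("Mean of samples:", samples.mean())
```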

Metropolis-Hastings: the good and bad

Good

  • Incredibly flexible: π can be more or less anything.
  • Only ratios of π are needed, so normalization is not required.

Bad

  • Choosing good proposal distributions is not always easy.
  • Proposal distributions often have tuning parameters that affect acceptance probability and mixing.
    • These must usually be chosen via trial and error.

Metropolis-Hastings: common proposals

  • Gaussian centred on $x$, with some (co)variance:
    $Q(x’|x) = N(x’; x, \Sigma)$.
    ‘Random walk Metropolis-Hastings’.

  • Independence sampler:
    $Q(x’|x) = Q(x’)$.

  • Gibbs sampler:
    $Q(x'|x) = P(x'_a|x_{\setminus a})\,\delta(x'_{\setminus a}, x_{\setminus a})$
    Acceptance probability is 1.

Metropolis-Hastings: Gaussian proposals

  • Small step size bad:
    • Can fail to explore space.
    • Many steps to generate a given (effective) number of samples because of high correlation.

  • Large step size bad:
    • Many steps to generate a given number of samples due to large rejection probability.


  • Variance adjusted using pilot runs:
    • Target 25% – 40% acceptance rate.


Metropolis-Hastings: variants

  • Mixtures:
    • Pick one of a number of valid proposals at random.
  • Data-driven:
    • Predictors $Q_k(x_k|f_k(D))$ trained on data $(x, D)$ generated from the model.
    • $Q(x'|x) = p_0 Q_0(x'|x) + \sum_k p_k Q_k(x'_k|f_k(D))$
  • Adaptive:
    • Change Gaussian proposal $\Sigma$ over iterations according to empirical covariance.
  • Reversible-jump (RJ-MCMC)

  • Hamiltonian MCMC

MCMC: Practicalities

  • Acceptance:
    • Target ~ 40% acceptance by tuning parameters via pilot runs.
  • Burn-in:
    • Eye-balling traces is still the most reliable method.
  • Autocorrelation and thinning:
    • Function acf in R.
  • Effective sample size:
    • Package coda in R.
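A minimal sketch of estimating autocorrelation and a crude effective sample size from a trace (a synthetic AR(1) series stands in for MCMC output; R's acf and coda do this more carefully):

```python
import numpy as np

# Sketch: autocorrelation and a crude effective-sample-size estimate for an MCMC
# trace (here a synthetic AR(1) chain standing in for MCMC output).

rng = np.random.default_rng(7)
phi, n = 0.9, 20_000
trace = np.zeros(n)
for i in range(1, n):
    trace[i] = phi * trace[i - 1] + rng.normal()

def autocorr(x, max_lag):
    x = x - x.mean()
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k], x[k:]) / (len(x) * c0) for k in range(1, max_lag + 1)])

rho = autocorr(trace, max_lag=200)
rho = rho[:np.argmax(rho < 0)] if np.any(rho < 0) else rho   # truncate at first negative lag
ess = n / (1 + 2 * rho.sum())

print(f"Lag-1 autocorrelation: {rho[0]:.3f}")
print(f"Effective sample size: {ess:.0f} of {n}")
```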


Summary of methods


Recent application: MCMC GANs

  • Discriminator Rejection Sampling
    • Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, Augustus Odena
    • (Submitted on 16 Oct 2018 (v1), last revised 26 Feb 2019 (this version, v3))
  • Metropolis-Hastings Generative Adversarial Networks
    • Ryan Turner, Jane Hung, Eric Frank, Yunus Saatci, Jason Yosinski
    • (Submitted on 28 Nov 2018 (v1), last revised 17 May 2019 (this version, v2))

Optimization

  • Gradient ascent.
    • Hill climbing.
  • Stochastic gradient ascent.
    • Add noise to avoid local maxima.
      • Explicitly.
      • Implicitly by training using subsets of the data.
    • Decrease noise (learning rate) slowly.
  • Simulated annealing.
    • Sampling e.g. MCMC.
    • Decrease ‘temperature’ slowly.
  • Message-passing algorithms.
    • Max-sum instead of sum-product.