Diffusion Models and Flow Matching: From Score-Based Diffusion to Continuous Normalizing Flows

1. The Deep Generative Modeling Problem
Generative modeling sits at the heart of modern machine learning, yet the problem it poses is deceptively simple to state and brutally hard to solve. We are handed a finite collection of observations — photographs, audio clips, protein structures, text documents — and asked to learn the hidden probability distribution that produced them well enough to both evaluate and sample from it. Everything else in this lecture grows out of understanding precisely why that is difficult.
Formally, suppose we observe a dataset $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ drawn i.i.d. from an unknown distribution $p_{\text{data}}(x)$, where each $x^{(i)} \in \mathbb{R}^d$. For a standard RGB image at 256×256 resolution, $d = 256 \times 256 \times 3 = 196{,}608$. Our task is to find parameters $\theta$ such that a model distribution $p_\theta(x)$ satisfies

$$p_\theta(x) \approx p_{\text{data}}(x).$$
A useful generative model must satisfy two simultaneous desiderata. First, it should assign high likelihood to real data — meaning $p_\theta(x)$ should be large wherever $p_{\text{data}}(x)$ is large. Second, it should support efficient, diverse sampling — drawing fresh examples that look indistinguishable from real ones in finite compute time. These two goals are more in tension than they might first appear; many architectures that excel at density estimation are slow to sample from, and vice versa.
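The first desideratum is conventionally formalized as maximum-likelihood estimation, written here for concreteness (this is the classical objective, not anything specific to diffusion models):

$$\theta^\star = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log p_\theta\big(x^{(i)}\big),$$

which, as $N \to \infty$, is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$ up to an additive constant. Note that the second desideratum, efficient sampling, is exactly what this objective does not guarantee.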
The root cause of almost every difficulty is the curse of dimensionality. The most naïve approach to density estimation is a histogram: discretize each dimension into $k$ bins, count how many samples fall into each cell, and normalize. The number of cells scales as $k^d$, so even with $k = 10$ bins per dimension and $d = 100$, the histogram has $10^{100}$ cells — more cells than there are atoms in the observable universe. Kernel density estimation (KDE) fares no better asymptotically — its sample complexity grows exponentially in $d$. The ambient space is simply too large to cover with any finite dataset.
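A few lines of code make the explosion concrete. The sketch below is plain Python with no dependencies; the bin count $k = 10$ and the list of dimensions are illustrative choices:

```python
# Illustrative sketch: how many cells a histogram density estimator needs.
k = 10                      # bins per dimension (an arbitrary but typical choice)
ATOMS_IN_UNIVERSE = 10**80  # rough order-of-magnitude estimate

for d in [1, 2, 10, 100, 196_608]:   # 196,608 = 256 * 256 * 3, one RGB image
    cells = k**d            # exact big-integer arithmetic, so this never overflows
    note = "  <- exceeds atoms in the observable universe" if cells > ATOMS_IN_UNIVERSE else ""
    print(f"d = {d:>7}: 10^{d} cells{note}")
```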
What saves us — partially — is the manifold hypothesis: real data does not spread uniformly over $\mathbb{R}^d$. Natural images, for instance, live on a vastly lower-dimensional manifold embedded in pixel space. Randomly sampled Gaussian vectors in $\mathbb{R}^d$ look like snow; real images occupy an astronomically small corner of that space. This means the effective complexity of the problem is much lower than the ambient dimension suggests — but we have to build a model that discovers and exploits that low-dimensional structure without ever being told what it is.
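The "snow" claim is easy to check empirically. Below is a minimal NumPy sketch (the resolution and the affine map to pixel intensities are illustrative assumptions): a single draw from $\mathcal{N}(0, I)$ in pixel space is structureless static, because a random point in $\mathbb{R}^d$ almost surely misses the image manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sample from N(0, I) in ambient pixel space R^d with d = 256 * 256 * 3.
x = rng.standard_normal((256, 256, 3))

# Map to valid 8-bit intensities for display; the offset and scale are
# arbitrary choices, and any affine map gives the same qualitative picture.
pixels = np.clip(127.5 + 64.0 * x, 0, 255).astype(np.uint8)
print(pixels.shape, pixels.dtype)  # (256, 256, 3) uint8 -- pure static, no structure
```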
Parametric models — neural networks, normalizing flows — try to encode this structure implicitly. But a second fundamental obstacle arises immediately: computing the normalizing constant

$$Z(\theta) = \int_{\mathbb{R}^d} \tilde{p}_\theta(x)\, dx$$

of an unnormalized model $\tilde{p}_\theta$ is generally intractable for flexible models. If we parameterize an energy-based model $p_\theta(x) = \exp(-E_\theta(x)) / Z(\theta)$, for instance, the integral over all of $\mathbb{R}^d$ is unavailable in closed form, blocking both maximum-likelihood training and exact sampling. Normalizing flows avoid this by restricting to invertible architectures with tractable Jacobians — an elegant fix, but one that constrains expressivity and is computationally expensive at scale.
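To feel the intractability, consider estimating $Z(\theta)$ by brute-force quadrature. The sketch below (NumPy; the quadratic toy energy, grid bounds, and resolution are all illustrative assumptions) works in $d = 2$, where the Gaussian integral provides an exact answer to check against; the same grid at image dimension would require $k^d$ energy evaluations.

```python
import numpy as np

def energy(x, theta):
    """Toy quadratic energy E_theta(x) = 0.5 * theta * ||x||^2 (illustrative)."""
    return 0.5 * theta * np.sum(x**2, axis=-1)

# Brute-force quadrature for Z(theta) = integral of exp(-E_theta(x)) dx, d = 2.
theta, k = 1.0, 200                   # k grid points per dimension
grid = np.linspace(-6.0, 6.0, k)
dx = grid[1] - grid[0]
X = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)  # k^2 grid points
Z = np.sum(np.exp(-energy(X, theta))) * dx**2

# For this Gaussian toy the exact answer is known: Z = 2 * pi / theta.
print(f"quadrature Z = {Z:.4f}   exact Z = {2 * np.pi / theta:.4f}")
# The grid has k^2 = 40,000 points; at d = 196,608 it would have k^d points.
# For a flexible neural energy there is no closed form and no feasible grid.
```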
The conceptual breakthrough that motivates this entire lecture is a different kind of resolution: rather than trying to model $p_{\text{data}}(x)$ directly, decompose the problem through a noise schedule. We design a process that gradually corrupts a data point $x_0$ into pure Gaussian noise over $T$ steps. This forward process is easy and known by construction. The hard distribution is then recovered by learning to reverse this corruption — denoising noise back into data, one small step at a time. Each individual denoising step operates on a nearly Gaussian local distribution, sidestepping the global normalization problem entirely.
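Because the Gaussian corruptions compose, the marginal has a closed form, $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I\big)$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$, so $x_t$ can be sampled from $x_0$ in one shot. Below is a minimal NumPy sketch of this forward process in the style of DDPM; the linear $\beta$ schedule and $T = 1000$ are common choices borrowed from the literature, not requirements.

```python
import numpy as np

# Variance-preserving forward process in the style of DDPM.
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule beta_1, ..., beta_T
alphas_bar = np.cumprod(1.0 - betas)  # abar_t = prod_{s <= t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I) directly."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)          # stand-in for a data point in R^d
for t in [0, 249, 499, 999]:          # 0-indexed timesteps
    xt = q_sample(x0, t, rng)
    print(f"t = {t:>3}  abar_t = {alphas_bar[t]:.4f}  ||x_t|| = {np.linalg.norm(xt):.2f}")
# As t -> T, abar_t -> 0 and x_t becomes statistically indistinguishable from N(0, I).
```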
This noise-based decomposition is elegant for several reasons. It replaces one intractable global problem with a sequence of tractable local problems. It naturally exploits the manifold structure of data, because the noising process smoothly interpolates between the sharp data manifold and a featureless Gaussian. And it connects to deep mathematical tools — stochastic differential equations, score functions, and optimal transport — that we will develop carefully throughout this lecture.
The visual below captures both the difficulty and the proposed resolution in a compact diagram. On one side, it depicts the core tension: data lives on a tiny, irregular support within the vast ambient space $\mathbb{R}^d$, while the model must simultaneously achieve high likelihood and efficient sampling from that distribution. On the other side, the noise-decomposition idea appears as a bridge — a continuum connecting structureless Gaussian noise to the rich, structured data distribution. This bridge is exactly the object we will learn to traverse, and building it rigorously is the subject of everything that follows.
[Figure: left, the data distribution as a tiny, irregular support inside $\mathbb{R}^d$; right, the noising bridge connecting a featureless Gaussian to the structured data distribution.]