Variational Autoencoders: Principles, Derivations, and Applications - FeynmanWiki

1. The Generative Modeling Problem

Before we can understand what a variational autoencoder is optimizing, we need to be precise about the problem it is trying to solve. A VAE is not merely an “autoencoder with noise.” It is a generative model: a model whose goal is to learn something about the probability distribution that produced the observed data.
Suppose we are given a dataset
$$\mathcal{D}=\{x^{(1)},\ldots,x^{(N)}\},$$
where each observation $x^{(n)}\in\mathbb{R}^D$. For images, $D$ might be the number of pixels; for audio, it might be waveform samples or spectrogram coefficients; for text embeddings, it might be the embedding dimension. The generative modeling problem is to learn a distribution $p_\theta(x)$ that approximates the unknown data-generating distribution $p_{\text{data}}(x)$.
This goal has three closely related interpretations. First, we may want density estimation: given a new point $x$, how likely is it under the model? That is, can we evaluate or approximate $p_\theta(x)$? Second, we may want synthesis: can we draw new samples $x\sim p_\theta(x)$ that look like plausible members of the dataset, but are not exact copies? Third, we may want representation learning: can the model discover compact, structured variables that explain meaningful factors of variation in the data?
These goals are connected, but they are not identical. A model may produce sharp-looking samples while assigning poor likelihoods, or it may achieve good likelihood while generating visually mediocre samples. VAEs are especially interesting because they try to tie these goals together through a probabilistic latent-variable framework: they define a likelihood, support sampling, and learn internal representations.
A first naive idea is to estimate the density directly from the observed data. For example, kernel density estimation places a small “bump” of probability mass around each training example:
$$\hat{p}(x)=\frac{1}{N}\sum_{n=1}^{N}\mathcal{K}\!\left(x-x^{(n)}\right).$$
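As a minimal sketch of the estimator above, the following one-dimensional example uses a Gaussian kernel. The dataset, the bandwidth value, and the function names are assumptions made for illustration; they do not come from the text.

```python
import math

def gaussian_kernel(u, bandwidth):
    """Density of N(0, bandwidth^2) evaluated at u."""
    return math.exp(-0.5 * (u / bandwidth) ** 2) / (bandwidth * math.sqrt(2.0 * math.pi))

def kde(x, data, bandwidth):
    """Kernel density estimate: the average of one kernel bump per training point."""
    return sum(gaussian_kernel(x - xn, bandwidth) for xn in data) / len(data)

# Hypothetical 1-D dataset with two clusters, one near 0 and one near 5.
data = [-0.3, -0.1, 0.0, 0.2, 0.4, 4.7, 4.9, 5.0, 5.2]
bandwidth = 0.5  # assumed value; in practice the bandwidth has to be tuned

density_near_cluster = kde(0.0, data, bandwidth)  # query inside a cluster
density_in_gap = kde(2.5, data, bandwidth)        # query between the clusters
```

In one dimension, nine points are enough for the bumps to overlap into two clear modes, with far lower density in the gap between them.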
In low dimensions, this can work surprisingly well. If the data points densely cover the relevant region of space, then nearby kernels overlap and form a smooth estimate of the distribution. But in high dimensions, this intuition breaks down catastrophically.
The reason is the curse of dimensionality. The number of distinguishable regions in a $D$-dimensional space grows exponentially with $D$: tiling the unit cube with cells of side $\varepsilon$ takes $(1/\varepsilon)^D$ cells. For MNIST, a $28\times 28$ grayscale image lives in $\mathbb{R}^{784}$. Even $N=60{,}000$ images are vanishingly sparse in such a space. Almost every possible pixel vector is not a digit at all; it is visual noise. Worse, many distance-based intuitions fail: in very high dimensions, points tend to become nearly equidistant, making “nearest neighbors” and local smoothing much less reliable.
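The near-equidistance effect can be checked numerically. The sketch below draws standard Gaussian points (an assumption made for convenience, not a model of real images) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` is hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dists.append(euclidean(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(dim=2)     # nearest neighbor is much closer than the farthest
contrast_high = relative_contrast(dim=784)  # all distances are nearly the same
```

A small contrast in 784 dimensions means that the "nearest" training example is barely nearer than any other, which is what undermines local smoothing.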
So the issue is not just that we need more data. The ambient space $\mathbb{R}^D$ is overwhelmingly large, and the data occupies only a tiny, highly structured subset of it. A kernel density estimator spreads probability around observed examples in the ambient space, but most of that space is irrelevant. Unless the kernel bandwidth is extremely small, it assigns mass to unrealistic regions; if the bandwidth is extremely small, it assigns nearly zero density almost everywhere. Either way, the method fails to capture the actual structure of the data distribution.
This motivates the data manifold hypothesis. Although observations may be represented as vectors in a high-dimensional ambient space, the meaningful variation in the data often has much lower intrinsic dimension. For example, handwritten digits vary by stroke thickness, slant, rotation, identity, local style, and other factors. These factors are far fewer than 784 independent pixel degrees of freedom. Informally, the data may lie near a low-dimensional manifold embedded in the high-dimensional observation space.
This is where latent variables enter the story. Instead of modeling $x\in\mathbb{R}^D$ directly as an arbitrary point in the ambient space, we introduce a lower-dimensional variable
$$z\in\mathbb{R}^K,\qquad K\ll D,$$
and imagine that observations are generated from these latent coordinates. The latent variable $z$ should capture the compact explanatory factors, while the model maps from latent space into data space. This does not mean the true data manifold is literally linear, smooth everywhere, or exactly $K$-dimensional. Rather, it is a modeling assumption: useful data distributions often have enough low-dimensional structure that exploiting it is far better than treating all directions in $\mathbb{R}^D$ equally.
The visual below condenses this motivation into a geometric picture. On the left, the high-dimensional ambient space is mostly empty: even many training examples provide negligible coverage when $D$ is large, which is why direct kernel-style density estimation becomes ineffective. On the right, the same data becomes much more intelligible when viewed as concentrated near a lower-dimensional manifold.
The key transition is from “estimate density everywhere in $\mathbb{R}^D$” to “model how low-dimensional latent structure gives rise to high-dimensional observations.” That transition is the conceptual starting point for latent variable models, and it is exactly the path that will lead us to variational autoencoders.

2. Latent Variable Models: The Core Idea

The previous discussion framed generative modeling as a density-learning problem: we want a model that assigns high probability to realistic data and can also produce new samples. But high-dimensional observations—images, audio, text embeddings—rarely vary freely in all ambient dimensions. A handwritten digit image may have thousands of pixels, yet much of its variation can be described by a smaller set of factors: which digit it is, how thick the stroke is, whether it is tilted, how centered it is, and so on. Latent variable models make this intuition explicit.
The core assumption is that each observed data point $x$ is generated from an unobserved, lower-dimensional variable $z$. We do not get to see $z$ in the dataset; it is a hidden explanation for the observation. Instead of modeling the distribution over $x$ directly, we define a two-step generative process:
$$z \sim p(z) = \mathcal{N}(0, I),\qquad x \sim p_{\theta}(x \mid z).$$
Here, $p(z)$ is a simple prior over latent codes, usually chosen to be a standard Gaussian. The conditional distribution $p_{\theta}(x \mid z)$ is the decoder or generative model: given a latent code, it describes a distribution over possible observations. In modern VAEs, this conditional distribution is parameterized by a neural network $f_{\theta}(z)$, which maps latent coordinates into the parameters of a likelihood over data space.
This setup separates two kinds of complexity. The prior $p(z)$ is deliberately simple: sampling from $\mathcal{N}(0,I)$ is easy, and its geometry is well behaved. The decoder carries the burden of learning how simple latent variation becomes rich observed structure. For example, in an idealized MNIST model, nearby values of $z$ might correspond to visually similar digits, while different directions in latent space could control properties such as digit identity, stroke width, or tilt.
The probability assigned to an observation $x$ is obtained by considering all possible latent explanations that could have generated it. This gives the marginal likelihood, also called the evidence:
$$p_{\theta}(x)=\int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, dz.$$
This integral is the mathematical heart of latent variable modeling. It says: to evaluate how likely $x$ is, average the likelihood $p_{\theta}(x \mid z)$ over every possible latent code $z$, weighted by how plausible that code was under the prior. A particular $z$ might reconstruct $x$ very well, but if that $z$ lies in an extremely unlikely region of the prior, its contribution is limited. Conversely, common latent codes contribute more, but only if the decoder can plausibly produce $x$ from them.
This is why latent variable models are so appealing. They offer a structured way to generate data:
sample a simple latent code $z$,
pass it through a learned decoder,
obtain a complex observation $x$.
They also offer a conceptual interpretation of representation learning: the model is encouraged to organize meaningful variation in data through the latent space. If the learned representation is smooth, moving through $z$-space should produce coherent changes in generated samples rather than abrupt jumps.
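The three generation steps listed above can be sketched in a few lines. The decoder here is a fixed random linear map followed by `tanh`, a hypothetical stand-in for a trained network; the sizes `K` and `D`, the noise level `SIGMA`, and all function names are assumptions chosen for illustration.

```python
import math
import random

K, D = 2, 6    # latent and observed dimensions (toy sizes)
SIGMA = 0.1    # assumed observation noise standard deviation

rng = random.Random(0)
# Hypothetical fixed decoder weights; a real VAE would learn these.
W = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(D)]

def decoder_mean(z):
    """Toy nonlinear decoder f_theta(z): a linear map followed by tanh."""
    return [math.tanh(sum(W[d][k] * z[k] for k in range(K))) for d in range(D)]

def sample_observation():
    """Ancestral sampling: z ~ N(0, I), then x ~ N(f_theta(z), SIGMA^2 I)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    mean = decoder_mean(z)
    x = [m + SIGMA * rng.gauss(0.0, 1.0) for m in mean]
    return z, x

z, x = sample_observation()
```

Sampling is cheap in this direction: one draw from the prior and one pass through the decoder. Nothing in the sketch evaluates how probable a given observation is, which is the harder question.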
But the same integral that makes the model principled also creates the central computational difficulty. When $p_{\theta}(x \mid z)$ is parameterized by a neural network, the integral
$$\int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, dz$$
usually has no closed form. The decoder can be highly nonlinear, and the latent space may have many dimensions. Direct numerical integration becomes infeasible, and naïve Monte Carlo estimates can be too noisy or inefficient for maximum likelihood training. Therefore, although the model defines $p_{\theta}(x)$ formally, we cannot usually evaluate or maximize $\log p_{\theta}(x)$ directly.
This is the key tension that motivates variational inference and, ultimately, the VAE objective. We have a clean generative story and a meaningful likelihood, but the marginalization over hidden causes is intractable. The rest of the VAE framework can be understood as a way to optimize a tractable surrogate for this inaccessible log-likelihood.
The visual below compactly summarizes this idea. The left side uses a graphical-model view: a latent variable $z$ is drawn from the prior, then an observed variable $x$ is generated through $p_{\theta}(x \mid z)$, repeated independently across data points. The shaded observed node emphasizes that $x$ is in the dataset, while $z$ remains hidden.
The right side grounds the abstraction in an MNIST-style example: a latent code can be thought of as controlling semantic or stylistic factors that the decoder turns into an image. The important warning is the intractable integral over $z$: even though the sampling story is simple, evaluating the probability of a given observation requires summing over all latent explanations, which is precisely what we cannot do directly with a neural decoder.

3. Failure Case: Why Not Just Use EM or MAP?

Having introduced latent variable models, we now run into the first serious computational obstacle: the model is easy to write down, but hard to fit. The whole appeal was to define a simple prior $p(z)$, pass $z$ through a decoder, and obtain a flexible distribution over observations $x$. But maximum likelihood training asks us to evaluate
$$p_{\theta}(x) = \int p_{\theta}(x \mid z)\,p(z)\,dz,$$
and that integral is exactly where the trouble begins. For simple decoders, the integral and the posterior over latents may be analytically tractable. For neural network decoders, they usually are not.
The classical tool for latent variable models is Expectation-Maximization. EM alternates between inferring the latent posterior under the current parameters and then updating the parameters using expectations under that posterior. The E-step requires
$$p_{\theta}(z \mid x)=\frac{p_{\theta}(x \mid z)\,p(z)}{p_{\theta}(x)}.$$
This equation is innocent-looking but deceptive. The denominator $p_{\theta}(x)$ is the very marginal likelihood integral we were trying to avoid. In models with conjugacy or linear-Gaussian structure, the posterior has a known form and EM is elegant. But once the decoder becomes nonlinear, the posterior can become highly warped, multimodal, and unavailable in closed form.
A useful contrast is probabilistic PCA. Suppose
$$p(z)=\mathcal{N}(0,I),\qquad p_{\theta}(x \mid z)=\mathcal{N}(Wz+b,\sigma^2 I).$$
Because everything is linear and Gaussian, the posterior $p_{\theta}(z \mid x)$ is also Gaussian. EM can compute its mean and covariance exactly. The latent space remains geometrically well-behaved: observing $x$ carves out an elliptical Gaussian belief over possible values of $z$.
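For concreteness, the standard closed-form results for this linear-Gaussian model are written out below. The symbol $M$ is introduced here only as shorthand and is not used elsewhere in the text.

```latex
M = W^{\top} W + \sigma^{2} I_{K},
\qquad
p_{\theta}(z \mid x) = \mathcal{N}\!\left( M^{-1} W^{\top} (x - b),\; \sigma^{2} M^{-1} \right),
\qquad
p_{\theta}(x) = \mathcal{N}\!\left( b,\; W W^{\top} + \sigma^{2} I_{D} \right).
```

Both the posterior and the evidence are Gaussians whose parameters need only one $K \times K$ matrix inverse, which is why the E-step is exact in this case.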
Now replace the linear map $Wz+b$ with a neural network $f_{\theta}(z)$:
$$p_{\theta}(x \mid z)=\mathcal{N}(f_{\theta}(z),\sigma^2 I).$$
The prior is still simple, and the observation noise may still be Gaussian, but the posterior is no longer Gaussian. The inverse image of a given observation $x$ under $f_{\theta}$ may contain many disconnected regions. Different latent codes can decode to similar observations. The posterior may have sharp ridges, separated modes, and strong nonlinear dependencies between latent dimensions. In other words, the generative direction $z \mapsto x$ may be easy to evaluate, while the inference direction $x \mapsto z$ is hard.
One tempting fallback is MAP inference: instead of representing the whole posterior, choose the most likely latent point,
$$\hat{z}=\arg\max_z \log p_{\theta}(z \mid x).$$
This can be useful in some settings, but it is not a satisfactory replacement for posterior inference. A point estimate throws away uncertainty. If the posterior has several plausible modes, MAP picks one and ignores the rest. It also introduces an inner optimization problem for every datapoint, which is expensive during training. Worse, if we need gradients through that optimization procedure, the computation becomes cumbersome and brittle, especially when latent variables are discrete or when the optimization landscape is poorly conditioned.
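To see what the inner optimization looks like, the MAP objective can be expanded with Bayes' rule. The second line assumes the Gaussian decoder $\mathcal{N}(f_{\theta}(z), \sigma^2 I)$ and the standard normal prior used in this section.

```latex
\hat{z}
= \arg\max_{z} \bigl[ \log p_{\theta}(x \mid z) + \log p(z) - \log p_{\theta}(x) \bigr]
= \arg\max_{z} \bigl[ \log p_{\theta}(x \mid z) + \log p(z) \bigr]
\\
\phantom{\hat{z}}
= \arg\min_{z} \left[ \frac{1}{2\sigma^{2}} \lVert x - f_{\theta}(z) \rVert^{2} + \frac{1}{2} \lVert z \rVert^{2} \right].
```

The intractable term $\log p_{\theta}(x)$ drops out because it does not depend on $z$. What remains is a regularized reconstruction problem that is nonconvex in $z$ for a neural $f_{\theta}$ and has to be solved again for every datapoint.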
Another seemingly generic solution is Monte Carlo EM. We might sample latent candidates from the prior,
$$z^{(l)} \sim p(z),$$
and weight them according to how well they explain $x$:
$$w^{(l)} \propto p_{\theta}(x \mid z^{(l)}).$$
This is importance sampling with the prior as the proposal distribution. The problem is that the prior is usually a terrible proposal for the posterior. In high dimensions, most samples from $p(z)$ land in regions that explain a specific $x$ extremely poorly. A tiny number of samples may receive almost all the probability mass, while the rest contribute essentially nothing.
This phenomenon is often called importance weight collapse. As the latent dimension $K$ grows, the effective number of useful samples can collapse toward one:
$$\text{effective sample size} \approx 1 \qquad \text{as } K \to \infty.$$
The phrase “effective sample size” is important: even if we draw thousands of samples, the estimator may behave as though it had only one meaningful sample. This creates exponentially high-variance estimates of the E-step quantities and makes naive Monte Carlo learning impractical for expressive latent-variable models.
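Weight collapse is easy to reproduce in a toy setting. The sketch below assumes an identity decoder $f(z) = z$ with Gaussian noise and an all-ones observation, and it measures the effective sample size with the common estimate $(\sum_l w^{(l)})^2 / \sum_l (w^{(l)})^2$. The model, the constants, and the function name are illustrative assumptions, not part of the text.

```python
import math
import random

def prior_proposal_ess(latent_dim, n_samples=2000, sigma=0.5, seed=0):
    """Effective sample size of importance weights w proportional to p(x | z), z ~ p(z).

    Toy assumption: identity decoder f(z) = z, so p(x | z) = N(x; z, sigma^2 I),
    and the observation x is the all-ones vector.
    """
    rng = random.Random(seed)
    log_weights = []
    for _ in range(n_samples):
        sq_err = 0.0
        for _ in range(latent_dim):
            z_k = rng.gauss(0.0, 1.0)
            sq_err += (1.0 - z_k) ** 2
        # Log-likelihood up to a constant that cancels in the ESS ratio.
        log_weights.append(-sq_err / (2.0 * sigma ** 2))
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    return sum(weights) ** 2 / sum(w * w for w in weights)

ess_by_dim = {k: prior_proposal_ess(k) for k in (1, 5, 20, 50)}
```

With the same budget of 2000 prior samples, the effective sample size is in the hundreds for one latent dimension and falls to nearly one by fifty dimensions in this toy model.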
So the failure is not that EM, MAP, or Monte Carlo are conceptually wrong. Each is reasonable under the right assumptions. The failure is a mismatch between those assumptions and neural generative models:
Exact EM needs a tractable posterior.
MAP replaces uncertainty with a single point.
Prior-based Monte Carlo wastes samples in high-dimensional latent spaces.
Neural decoders make the true posterior complex and hard to normalize.
The visual below compresses this comparison into the key geometric intuition. In the linear-Gaussian case, posterior inference is clean: the posterior over $z$ is a single Gaussian-shaped region, and EM can proceed exactly. In the nonlinear case, the posterior becomes irregular and multimodal, so the E-step no longer has a closed-form solution.
The weight-collapse sketch at the bottom emphasizes why simply sampling many values of $z$ from the prior does not rescue us. In high-dimensional latent spaces, almost all prior samples are irrelevant for a particular observation $x$, and the importance weights concentrate on one lucky sample. This is the motivation for the next idea: instead of solving a separate hard inference problem from scratch for every datapoint, VAEs introduce amortized approximate inference, a learned encoder that predicts an approximate posterior directly.

4. The VAE Framework: Three Distributions

The failure of direct EM or per-example MAP inference points us toward a different compromise: instead of solving a fresh optimization problem for every datapoint, we will learn a function that performs inference. This is the central move in a variational autoencoder. A VAE is not just “an autoencoder with noise”; it is a probabilistic latent-variable model equipped with a trainable approximation to the posterior.
The generative story begins with a latent variable $z$, typically chosen to live in a relatively low-dimensional continuous space:
$$p(z) = \mathcal{N}(0, I), \qquad z \in \mathbb{R}^K.$$
This distribution is called the prior. It says what kinds of latent codes are plausible before seeing any data. The standard Gaussian prior is not chosen because we believe the true hidden factors of images, text, or molecules are literally independent standard normal variables. Rather, it is chosen because it gives us a simple, smooth, sampleable reference distribution. Later, when we generate new data, we will sample $z \sim p(z)$ and decode it into an observation.
The second ingredient is the decoder, also called the generative model or likelihood model. It specifies how an observed datapoint $x$ is produced from a latent code $z$:
$$p_{\theta}(x \mid z).$$
In a neural VAE, this conditional distribution is parameterized by a neural network $f_{\theta}(z)$. The network does not directly output “the reconstruction” in a purely deterministic sense; rather, it outputs the parameters of a probability distribution over possible observations. For continuous data, a common choice is an isotropic Gaussian likelihood,
$$p_{\theta}(x \mid z)=\mathcal{N}\!\left(f_{\theta}(z),\, \sigma^2 I\right),$$
where $f_{\theta}(z)$ is the mean of the conditional distribution. For binary data, such as binarized MNIST pixels, a common choice is a Bernoulli likelihood,
$$p_{\theta}(x \mid z)=\mathrm{Bernoulli}\!\left(\mathrm{sigmoid}(f_{\theta}(z))\right).$$
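Both likelihoods reduce to simple formulas once the decoder output is known. The sketch below evaluates $\log p_{\theta}(x \mid z)$ for each case, taking the decoder output $f_{\theta}(z)$ as a plain list of numbers. The function names are hypothetical, and the Bernoulli version treats the decoder outputs as logits, matching the sigmoid in the formula above.

```python
import math

def gaussian_log_likelihood(x, mean, sigma):
    """log N(x; mean, sigma^2 I) for an isotropic Gaussian decoder."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2.0 * math.pi * sigma ** 2) - sq_err / (2.0 * sigma ** 2)

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def bernoulli_log_likelihood(x, logits):
    """log Bernoulli(x; sigmoid(logits)) for binary x, summed over dimensions.

    Uses x * logit - softplus(logit), which equals
    x * log(p) + (1 - x) * log(1 - p) with p = sigmoid(logit).
    """
    return sum(xi * li - softplus(li) for xi, li in zip(x, logits))

# Example: a 3-pixel binary "image" and hypothetical decoder logits.
log_p = bernoulli_log_likelihood([1, 0, 1], [2.0, -1.5, 0.3])
```

With fixed $\sigma$, the Gaussian case is a scaled squared error plus a constant, and the Bernoulli case is the negative binary cross-entropy. This is why VAE reconstruction terms often resemble ordinary autoencoder losses.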
Together, the prior and decoder define the actual generative model:
$$p_{\theta}(x, z)=p_{\theta}(x \mid z)\,p(z).$$
This joint distribution is the model’s claim about how data and latent variables co-occur. If we could compute the posterior exactly,
$$p_{\theta}(z \mid x)=\frac{p_{\theta}(x \mid z)\,p(z)}{p_{\theta}(x)},$$
then inference would be straightforward: given an observed $x$, infer which latent codes $z$ plausibly generated it. But the denominator,
$$p_{\theta}(x)=\int p_{\theta}(x \mid z)\,p(z)\,dz,$$
is generally intractable for a nonlinear neural decoder. This is the same obstacle we encountered when trying to use exact EM: the posterior is the object we need, but it is not available in closed form.
The VAE introduces a third distribution to address this: the encoder, or approximate posterior,
$$q_{\phi}(z \mid x)=\mathcal{N}\!\left(\mu_{\phi}(x),\,\mathrm{diag}(\sigma_{\phi}(x)^2)\right).$$
Here another neural network, often denoted $g_{\phi}(x)$, maps an input datapoint to the parameters of a Gaussian distribution over latent codes:
$$g_{\phi}(x)\longrightarrow\left(\mu_{\phi}(x), \sigma_{\phi}(x)\right).$$
This is an approximation to $p_{\theta}(z \mid x)$, not part of the generative model itself. The generative model is still $p(z)\,p_{\theta}(x \mid z)$. The encoder is an inference mechanism: it gives us a tractable distribution from which we can sample latent codes likely to explain $x$.
The key innovation is amortized inference. In classical variational inference, we might introduce separate variational parameters for each datapoint, for example $q_{\lambda_n}(z \mid x^{(n)})$ with its own $\lambda_n$. That would mean every new datapoint requires its own inference optimization. VAEs instead use a single shared network $q_{\phi}(z \mid x)$ for all datapoints:
$$x^{(n)}\mapsto\left(\mu_{\phi}(x^{(n)}), \sigma_{\phi}(x^{(n)})\right).$$
The cost of inference is therefore amortized over the dataset. We pay once to learn $\phi$, and then inference for a new $x$ is just a forward pass. This is why VAEs scale well to large datasets and why they look architecturally like autoencoders: data goes through an encoder into a latent representation, then through a decoder back into data space. But probabilistically, the encoder and decoder play asymmetric roles.
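The "one forward pass" idea can be sketched with a deliberately tiny encoder. The example assumes a single linear layer with random weights standing in for trained parameters $\phi$. The names `encode` and `sample_latent`, the choice to predict $\log \sigma$, and the toy data are all illustrative assumptions.

```python
import math
import random

D, K = 4, 2  # toy observed and latent dimensions (illustrative only)
rng = random.Random(0)

# Hypothetical shared encoder parameters phi: one set for the whole dataset.
A_mu = [[rng.gauss(0.0, 0.5) for _ in range(D)] for _ in range(K)]
A_log_sigma = [[rng.gauss(0.0, 0.5) for _ in range(D)] for _ in range(K)]

def encode(x):
    """Map x to (mu, sigma) of a diagonal Gaussian q(z | x) in one forward pass."""
    mu = [sum(A_mu[k][d] * x[d] for d in range(D)) for k in range(K)]
    # Predict log sigma and exponentiate so that sigma is always positive.
    sigma = [math.exp(sum(A_log_sigma[k][d] * x[d] for d in range(D))) for k in range(K)]
    return mu, sigma

def sample_latent(x):
    """Draw z ~ q(z | x) = N(mu(x), diag(sigma(x)^2))."""
    mu, sigma = encode(x)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

dataset = [[0.1, 0.9, 0.0, 0.4], [0.8, 0.2, 0.7, 0.1]]
posteriors = [encode(x) for x in dataset]  # same parameters, different outputs
z = sample_latent(dataset[0])
```

No per-datapoint optimization appears anywhere: each datapoint gets its own posterior parameters, but they all come from the same function with the same weights.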
There is also an important modeling assumption hidden in the standard encoder form. By choosing
$$q_{\phi}(z \mid x)=\mathcal{N}\!\left(\mu_{\phi}(x),\,\mathrm{diag}(\sigma_{\phi}(x)^2)\right),$$
we assume the approximate posterior is Gaussian with diagonal covariance. This makes sampling and KL-divergence computations convenient, but it can be restrictive. The true posterior $p_{\theta}(z \mid x)$ may be multimodal, skewed, or highly correlated across latent dimensions. Much of the behavior of VAEs, including some failure modes we will discuss later, comes from the tension between a flexible neural decoder and a relatively simple approximate posterior family.
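As a reminder of why the diagonal Gaussian form is convenient, two standard results are written out below. The noise variable $\epsilon$ and the dimension index $k$ are notation introduced here for illustration.

```latex
z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\\
\mathrm{KL}\bigl( q_{\phi}(z \mid x) \,\|\, p(z) \bigr)
= \frac{1}{2} \sum_{k=1}^{K}
\left( \mu_{\phi,k}(x)^{2} + \sigma_{\phi,k}(x)^{2} - 1 - \log \sigma_{\phi,k}(x)^{2} \right).
```

Sampling costs one draw of standard normal noise, and the divergence from the standard normal prior is a closed-form sum over latent dimensions, with no integration or sampling needed.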
The three distributions therefore have distinct responsibilities:
Prior $p(z)$: defines the latent space we can sample from.
Decoder $p_{\theta}(x \mid z)$: defines how latent codes generate observations.
Encoder $q_{\phi}(z \mid x)$: approximates the intractable posterior for efficient inference.
The visual below consolidates this separation by showing two complementary paths. The generative path starts from $z \sim p(z)$ and moves downward through $p_{\theta}(x \mid z)$ to produce $x$. The inference path goes in the opposite direction: given $x$, the encoder $q_{\phi}(z \mid x)$ produces a distribution over plausible latent codes.
The most important detail to keep in mind is that the right-hand path is not a separate model of the data. It is a learned approximation used to make training and inference tractable. The shared parameter vector $\phi$ is what makes the inference procedure amortized: instead of optimizing a new posterior approximation for every $x^{(n)}$, the VAE learns one encoder network that serves the entire dataset.

5. Objective: Maximize the Log-Evidence

Now that we have separated the VAE into its three distributions (the prior $p(z)$, the decoder $p_\theta(x \mid z)$, and the encoder-like approximation $q_\phi(z \mid x)$), we can ask the most basic statistical question: what objective should train the generative model? If the decoder is supposed to define a probability model over observations, then the natural answer is maximum likelihood. We want parameters $\theta$ that assign high probability to the observed dataset:
$$\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}(x^{(n)}).$$
This is the same principle used in ordinary probabilistic modeling: choose the model under which the data would have been most likely. The complication is that in a latent-variable model, $p_\theta(x)$ is not given directly. It is the probability of observing $x$ after averaging over all possible latent explanations $z$.
For a single datapoint, the marginal likelihood, also called the evidence, is
$$p_\theta(x)=\int_{\mathcal{Z}} p_\theta(x \mid z)\,p(z)\,dz,$$
so the log-evidence is
$$\log p_{\theta}(x)=\log \int_{\mathcal{Z}} p_{\theta}(x \mid z)\,p(z)\,dz.$$
Conceptually, this integral says: sample a latent code $z$ from the prior, decode it into a distribution over $x$, and average the probability assigned to the observed $x$ across all possible latent codes. If many plausible latent codes explain $x$, the evidence should be high. If almost no latent code decodes near $x$, the evidence should be low.
The problem is that this integral is almost never analytically tractable for a neural decoder. In a simple linear-Gaussian latent-variable model, the integral may have a closed form. But in a VAE, $p_\theta(x \mid z)$ is parameterized by a deep network $f_\theta(z)$. The latent space is typically continuous, often $\mathcal{Z} = \mathbb{R}^K$, and the integrand can be nonzero over an enormous region. We are therefore trying to integrate a nonlinear neural-network-shaped function over a high-dimensional space:
$$\int_{\mathbb{R}^K} p_\theta(x \mid z)\,p(z)\,dz.$$
There is no general symbolic simplification available.
A first instinct is to use Monte Carlo sampling from the prior. Draw
$$z^{(1)}, \dots, z^{(L)} \sim p(z),$$
approximate the expectation, and then take the logarithm:
$$\log p_{\theta}(x)=\log \mathbb{E}_{p(z)}\bigl[p_\theta(x \mid z)\bigr]\approx\log \frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)}).$$
This looks reasonable, and for very large $L$ it is a consistent estimator. But as a training objective, it has two serious issues. First, prior samples are usually a terrible way to find the latent codes that explain a specific datapoint $x$. In high dimensions, most $z \sim p(z)$ will decode to samples unrelated to $x$, so the average can be dominated by rare lucky samples. This leads to high variance.
Second, and more fundamentally, the log of a Monte Carlo average is a biased estimator of the log-evidence. The logarithm is concave, so Jensen’s inequality gives
$$
\mathbb{E}\left[
\log \frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)})
\right]
\leq
\log \mathbb{E}\left[
\frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)})
\right]
=
\log p_\theta(x).
$$
The inequality is strict unless the random quantity inside the log is essentially constant. In particular, in the single-sample case $L = 1$ it reads
$$
\log \mathbb{E}_{p(z)}\bigl[p_\theta(x \mid z)\bigr]
\geq
\mathbb{E}_{p(z)}\bigl[\log p_\theta(x \mid z)\bigr].
$$
This is the key obstruction: moving the logarithm outside or inside an expectation changes the objective. The log-evidence is what we want, but it contains an intractable integral. The expectation of the log-likelihood is tractable by sampling, but it is generally a lower quantity and, by itself, is not the right maximum-likelihood objective.
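To make the bias concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is an assumption, not part of the VAE setup above): a scalar latent $z \sim \mathcal{N}(0,1)$ with likelihood $x \mid z \sim \mathcal{N}(z, 0.1)$, so the exact evidence is $\mathcal{N}(x; 0, 1.1)$. The function names and the constants are hypothetical. The sketch averages the naive estimator over many repetitions for a small and a large number of prior samples $L$.

```python
import math
import random

def log_normal_pdf(value, mean, var):
    """Log density of the univariate Gaussian N(mean, var) at `value`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def naive_log_evidence(x, num_samples, rng, noise_var=0.1):
    """Log of the Monte Carlo average of p(x | z) with z drawn from the prior N(0, 1)."""
    log_liks = [log_normal_pdf(x, rng.gauss(0.0, 1.0), noise_var)
                for _ in range(num_samples)]
    top = max(log_liks)  # log-mean-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in log_liks) / num_samples)

def exact_log_evidence(x, noise_var=0.1):
    """Closed form for the toy model: x = z + noise, so x ~ N(0, 1 + noise_var)."""
    return log_normal_pdf(x, 0.0, 1.0 + noise_var)

rng = random.Random(0)
x_obs = 1.5
repeats = 1000
exact = exact_log_evidence(x_obs)
avg_small = sum(naive_log_evidence(x_obs, 1, rng) for _ in range(repeats)) / repeats
avg_large = sum(naive_log_evidence(x_obs, 200, rng) for _ in range(repeats)) / repeats

print(f"exact log-evidence:         {exact:.3f}")
print(f"mean estimate with L = 1:   {avg_small:.3f}")
print(f"mean estimate with L = 200: {avg_large:.3f}")
```

With $L = 1$ the average sits far below the exact value, which is the Jensen gap; with larger $L$ the gap shrinks, but only because this toy problem is one-dimensional and prior samples still land near the observation reasonably often.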
This is where the VAE’s central idea begins to emerge. We need a surrogate objective that is:
tractable by sampling, so it avoids exact integration over all $z$;
a principled lower bound on $\log p_\theta(x)$, so optimizing it still pushes up the evidence;
differentiable with low-variance gradients, so neural networks can be trained efficiently;
aware of the datapoint $x$, so we sample useful latent codes rather than blind samples from the prior.
That surrogate will be the Evidence Lower Bound, or ELBO, written schematically as
$$
\mathcal{L}(\theta,\phi; x) \leq \log p_\theta(x).
$$
The additional parameter $\phi$ appears because we introduce an inference network $q_\phi(z \mid x)$, whose role is to propose latent codes likely to explain the observed datapoint. Instead of sampling $z$ blindly from the prior, the VAE learns where in latent space to look.
The visual below compresses this motivation into three layers: the maximum-likelihood goal, the intractable log-evidence integral, and the failed naive Monte Carlo shortcut. The red warning around Jensen’s inequality highlights the precise mathematical reason we cannot simply replace the integral with a sampled average inside a logarithm and proceed as if nothing changed.
The bottom layer then points to the resolution: rather than estimating the log-evidence directly, we construct a computable lower bound with good gradient behavior. The next step is to derive that bound carefully and see exactly why it is lower, when it becomes tight, and how it decomposes into the reconstruction and KL terms used in VAE training.

6. Deriving the ELBO: Jensen's Inequality Route

Having written the evidence as an integral over the latent variable, we now face the central obstacle of latent-variable maximum likelihood: the quantity we want,
$$
\log p_{\theta}(x)=\log \int p_{\theta}(x,z)\,dz,
$$
is usually not computable in closed form. The integral sums over all possible latent explanations $z$ of the observation $x$. For expressive neural decoders $p_{\theta}(x\mid z)$, this marginalization is precisely what makes the model powerful, and precisely what makes direct optimization difficult.
The key variational idea is to introduce an auxiliary distribution $q_{\phi}(z\mid x)$, which we will later implement as the encoder. At this point, however, $q_{\phi}$ is just a mathematical device: a distribution over latent variables conditioned on the observed datapoint. We use it to rewrite the intractable integral as an expectation under a distribution from which we can sample.
Starting from the evidence,
$$
\log p_{\theta}(x)
=
\log \int p_{\theta}(x,z)\,dz,
$$
we multiply and divide by $q_{\phi}(z\mid x)$:
$$
\log p_{\theta}(x)
=
\log \int q_{\phi}(z\mid x)
\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}
\,dz.
$$
This is an exact algebraic identity, not an approximation. It is the same logic as importance sampling: instead of integrating directly with respect to the latent variable measure, we express the integral as an expectation under a chosen proposal distribution $q_{\phi}(z\mid x)$:
$$
\log p_{\theta}(x)
=
\log
\mathbb{E}_{q_{\phi}(z\mid x)}
\!\left[
\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}
\right].
$$
There is an important support condition hidden in this step. The ratio $p_{\theta}(x,z)/q_{\phi}(z\mid x)$ must be well-defined wherever $p_{\theta}(x,z)$ contributes mass. Informally, $q_{\phi}(z\mid x)$ must not assign zero probability to latent regions that the model considers possible explanations of $x$. Otherwise, the importance-weighted expression can become undefined or miss parts of the integral entirely.
Now comes the only inequality in the derivation. Since $\log$ is concave, Jensen’s inequality tells us that for a positive random variable $Y$,
$$
\log \mathbb{E}[Y]
\geq
\mathbb{E}[\log Y].
$$
Here the random variable is
$$
Y
=
\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)},
\qquad
z\sim q_{\phi}(z\mid x).
$$
Applying Jensen’s inequality gives
$$
\log p_{\theta}(x)
\geq
\mathbb{E}_{q_{\phi}(z\mid x)}
\left[
\log
\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}
\right],
$$
or equivalently,
$$
\log p_{\theta}(x)
\geq
\mathbb{E}_{q_{\phi}(z\mid x)}
\!\left[
\log p_{\theta}(x,z)
-
\log q_{\phi}(z\mid x)
\right].
$$
This lower bound is the evidence lower bound, or ELBO:
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z\mid x)}
\!\left[
\log p_{\theta}(x,z)
-
\log q_{\phi}(z\mid x)
\right].
$$
The name is literal: it is a lower bound on the log-evidence $\log p_{\theta}(x)$. For any valid choice of $q_{\phi}(z\mid x)$,
$$
\mathcal{L}(\theta,\phi;x)
\leq
\log p_{\theta}(x).
$$
This is the foundational move in VAEs. We replace an intractable objective with a tractable lower bound that can be estimated by sampling from the encoder.
The bound becomes tight exactly when Jensen’s inequality becomes equality. For a strictly concave function like $\log$, equality occurs only when the random variable inside the expectation is constant almost surely under $q_{\phi}(z\mid x)$. Since that random variable has expectation $p_{\theta}(x)$, the constant must be $p_{\theta}(x)$, so we need
$$
\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}
=
p_{\theta}(x)
\quad
\text{for } q_{\phi}\text{-almost every }z.
$$
Rearranging,
$$
q_{\phi}(z\mid x)
=
\frac{p_{\theta}(x,z)}{p_{\theta}(x)}
=
p_{\theta}(z\mid x).
$$
So the ELBO is tight when the variational distribution $q_{\phi}(z\mid x)$ exactly matches the true posterior $p_{\theta}(z\mid x)$. This is also why the encoder is often described as an approximate posterior: it is trained to behave like the posterior distribution that would make the bound exact, but which is usually unavailable because it depends on the intractable evidence $p_{\theta}(x)$.
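The tightness condition can be checked numerically on a toy model where the true posterior is known. The sketch below assumes a scalar linear-Gaussian model chosen only for illustration ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, 0.1)$), for which the posterior is $\mathcal{N}\bigl(x/1.1,\; 0.1/1.1\bigr)$; all names are hypothetical. It evaluates the per-sample quantity $\log p_\theta(x,z) - \log q_\phi(z \mid x)$ for two choices of $q$.

```python
import math
import random

NOISE_VAR = 0.1  # assumed likelihood variance of the toy model

def log_normal_pdf(value, mean, var):
    """Log density of the univariate Gaussian N(mean, var) at `value`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def log_joint(x, z):
    """log p(x, z) = log p(z) + log p(x | z) for the toy model."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, NOISE_VAR)

def elbo_samples(x, q_mean, q_var, num_samples, rng):
    """Per-sample values of log p(x, z) - log q(z | x) with z ~ N(q_mean, q_var)."""
    values = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        values.append(log_joint(x, z) - log_normal_pdf(z, q_mean, q_var))
    return values

x_obs = 1.5
log_evidence = log_normal_pdf(x_obs, 0.0, 1.0 + NOISE_VAR)
post_mean = x_obs / (1.0 + NOISE_VAR)
post_var = NOISE_VAR / (1.0 + NOISE_VAR)

rng = random.Random(0)
tight = elbo_samples(x_obs, post_mean, post_var, 1000, rng)  # q = true posterior
loose = elbo_samples(x_obs, 0.0, 1.0, 1000, rng)             # q = prior
elbo_tight = sum(tight) / len(tight)
elbo_loose = sum(loose) / len(loose)

print(f"log-evidence:            {log_evidence:.4f}")
print(f"ELBO with q = posterior: {elbo_tight:.4f}")
print(f"ELBO with q = prior:     {elbo_loose:.4f}")
```

When $q$ is the exact posterior, every sample returns the same value, $\log p_\theta(x)$, so the estimate has zero variance as well as zero gap. When $q$ is the prior, the estimate falls well below the log-evidence.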
This derivation matters because it gives us a practical optimization target with a clear interpretation:
maximizing the ELBO with respect to $\theta$ improves the generative model;
maximizing it with respect to $\phi$ improves the approximate posterior;
the gap between $\log p_{\theta}(x)$ and the ELBO measures how far $q_{\phi}(z\mid x)$ is from the true posterior.
There is a subtle but important caveat: maximizing a lower bound is not automatically the same as maximizing the original objective. If the bound is loose, we may improve $\mathcal{L}$ without significantly improving $\log p_{\theta}(x)$. The success of VAEs depends on making the variational family expressive enough, and the optimization stable enough, that this lower bound remains a useful surrogate for maximum likelihood.
The visual below condenses the derivation into its three essential moves: start from the marginal likelihood, rewrite it as an expectation using $q_{\phi}(z\mid x)$, then apply Jensen’s inequality to move the logarithm inside the expectation. The boxed expression is the ELBO, the quantity we can optimize in place of the inaccessible log-evidence.
It also highlights the geometric meaning of the inequality: $\mathcal{L}(\theta,\phi;x)$ sits below $\log p_{\theta}(x)$, with a gap determined by how imperfect the variational posterior is. In the next section, we will decompose this same ELBO into the two terms that make VAEs operational: a reconstruction term and a KL regularization term.

7. ELBO Decomposition: Reconstruction + KL

Having obtained the ELBO through Jensen’s inequality, the next question is: what exactly are we optimizing? The expression
$$
\mathcal{L}(\theta, \phi; x)
=
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_{\theta}(x,z) - \log q_{\phi}(z|x)
\right]
$$
is already a valid lower bound on $\log p_\theta(x)$, but in this form it hides the two competing forces that make VAEs work. To understand the training objective operationally, we want to rewrite it in terms of the decoder likelihood and a penalty on the encoder distribution.
The key move is simply to expand the joint model. In a VAE, the generative story is assumed to factor as: first sample a latent variable $z$ from a prior $p(z)$, then sample the observation $x$ from a decoder likelihood $p_\theta(x|z)$. Therefore,
$$
p_\theta(x,z) = p_\theta(x|z)\,p(z),
$$
and taking logs gives
$$
\log p_\theta(x,z)
=
\log p_\theta(x|z) + \log p(z).
$$
Substituting this into the ELBO gives
$$
\mathcal{L}(\theta, \phi; x)
=
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)
\right].
$$
Because expectation is linear, we can separate the terms:
$$
\mathcal{L}(\theta, \phi; x)
=
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_\theta(x|z)
\right]
+
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p(z) - \log q_\phi(z|x)
\right].
$$
The first term has a direct modeling interpretation. We draw latent codes $z$ from the encoder distribution $q_\phi(z|x)$, then ask how much probability the decoder assigns back to the original observation $x$. This is the reconstruction term:
$$
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_\theta(x|z)
\right].
$$
It rewards latent representations that preserve information useful for explaining the input. If $p_\theta(x|z)$ is a Bernoulli distribution, this term is the negative binary cross-entropy between the input and the decoder output; if it is a Gaussian with fixed variance, it is a negative squared-error loss up to a positive scale factor and an additive constant. This is why VAEs often look, at implementation time, like autoencoders with a probabilistic reconstruction loss.
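The correspondence between decoder likelihoods and familiar reconstruction losses can be sketched in a few lines. This is a minimal illustration with hypothetical function names, working on plain Python lists rather than tensors: the Gaussian log-likelihood with fixed variance differs from the negative squared error only by the factor $1/(2\sigma^2)$ and a constant, and the Bernoulli log-likelihood is the negative binary cross-entropy summed over dimensions.

```python
import math

def squared_error(x, recon):
    """Sum of squared differences between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, recon))

def gaussian_log_lik(x, recon, var=1.0):
    """log N(x; recon, var * I): negative squared error, scaled, plus a constant."""
    constant = -0.5 * len(x) * math.log(2.0 * math.pi * var)
    return constant - squared_error(x, recon) / (2.0 * var)

def bernoulli_log_lik(x, probs):
    """Log-likelihood of binary x under independent Bernoulli(probs).

    This is minus the binary cross-entropy summed over dimensions.
    """
    return sum(a * math.log(p) + (1 - a) * math.log(1.0 - p)
               for a, p in zip(x, probs))

x = [0.2, 0.7, 1.0]
recon_a = [0.1, 0.8, 0.9]
recon_b = [0.5, 0.5, 0.5]
var = 0.25

# The constant cancels when comparing two reconstructions of the same input.
log_lik_gap = gaussian_log_lik(x, recon_a, var) - gaussian_log_lik(x, recon_b, var)
scaled_error_gap = -(squared_error(x, recon_a) - squared_error(x, recon_b)) / (2.0 * var)

binary_log_lik = bernoulli_log_lik([1, 0], [0.9, 0.2])
```

Because the additive constant does not depend on the reconstruction, it has no effect on gradients; the variance, however, sets the relative weight of the reconstruction term against the KL term.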
The second term is more subtle. Recall that the KL divergence is
$$
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)
=
\mathbb{E}_{q_\phi(z|x)}
\!\left[
\log q_\phi(z|x) - \log p(z)
\right].
$$
Our ELBO contains the negative of this quantity:
$$
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p(z) - \log q_\phi(z|x)
\right]
=
-
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right).
$$
So the ELBO decomposes into the now-famous form
$$
\boxed{
\mathcal{L}(\theta, \phi; x)
=
\underbrace{
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_\theta(x|z)
\right]
}_{\text{reconstruction term}}
-
\underbrace{
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)
}_{\text{regularisation term}}
}
$$
This decomposition explains the central tension in VAE training. The reconstruction term wants $q_\phi(z|x)$ to encode enough information about $x$ that the decoder can reconstruct it accurately. The KL term wants $q_\phi(z|x)$ to stay close to the prior $p(z)$, usually a standard Gaussian $\mathcal{N}(0,I)$. In other words, the encoder is not free to assign every datapoint an arbitrary isolated latent code; its codes must remain compatible with a shared latent space from which we can later sample.
This regularization is not merely aesthetic. Without it, the model could behave like an ordinary deterministic autoencoder: excellent reconstructions, but a latent space with holes, disconnected regions, and no reliable way to generate new samples by drawing $z \sim p(z)$. The KL penalty pushes the aggregate structure of the latent codes toward something smooth and sampleable. At the same time, if the KL pressure is too strong, the encoder may ignore $x$, producing $q_\phi(z|x) \approx p(z)$ for every input. That failure mode is known as posterior collapse, and it causes the latent variable to carry little or no information.
A useful way to remember the objective is:
Reconstruction term: “Can the decoder explain this datapoint using latents from the encoder?”
KL term: “Is the encoder’s posterior still close to the prior distribution we intend to sample from?”
ELBO maximization: “Find a compromise between faithful reconstruction and a well-structured latent space.”
For the common Gaussian case, this tradeoff becomes especially convenient computationally. If
$$
q_\phi(z|x) = \mathcal{N}\bigl(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\bigr)
$$
and
$$
p(z) = \mathcal{N}(0,I),
$$
then the KL term has a closed-form expression. That means VAE training usually combines a Monte Carlo estimate of the reconstruction expectation with an analytic KL penalty, giving a stable and differentiable objective once we introduce the reparameterization trick.
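For reference, the closed form for a diagonal Gaussian against a standard normal is $D_{\mathrm{KL}} = \tfrac{1}{2}\sum_{k}\bigl(\mu_k^2 + \sigma_k^2 - 1 - \log \sigma_k^2\bigr)$. The sketch below implements it and compares it with a plain Monte Carlo estimate of the same quantity. The function names, the log-variance parameterization, and the example numbers are illustrative assumptions rather than anything prescribed by the text.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, num_samples, rng):
    """Monte Carlo estimate of the same KL as the mean of log q(z) - log p(z), z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = m + math.sqrt(var) * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2.0 * math.pi) + lv + (z - m) ** 2 / var)
            log_p = -0.5 * (math.log(2.0 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples

mu = [0.5, -1.0]
log_var = [-0.5, 0.3]

kl_exact = kl_diag_gaussian_to_standard_normal(mu, log_var)
kl_estimate = kl_monte_carlo(mu, log_var, 20000, random.Random(0))
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_estimate:.4f}")
```

Parameterizing the encoder output as a log-variance is a common choice because it keeps the variance positive without constraints. The analytic form removes one source of sampling noise from the objective, leaving only the reconstruction expectation to be estimated.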
The visual below condenses this derivation into its algebraic skeleton: start from the Jensen-derived ELBO, expand the joint distribution using the generative factorization, separate the expectation, and recognize the second piece as a negative KL divergence. The color split emphasizes that the final objective is not one monolithic loss, but a sum of two interpretable forces with opposite pressures.
Read the final boxed equation as the practical training objective for a single datapoint $x$: maximize expected decoder log-likelihood while minimizing deviation from the prior. This compact decomposition is the bridge between the abstract variational bound and the loss function we will soon implement in an actual VAE.

8. Theorem: ELBO–Evidence Gap is the KL Divergence

Having split the ELBO into a reconstruction term and a KL-to-prior regularizer, we can now ask a deeper question: what exactly did we lose when we replaced the true log-evidence $\log p_\theta(x)$ by the ELBO? The answer is one of the most important identities in variational inference: the missing quantity is not mysterious approximation error, nor a loose heuristic penalty. It is exactly a KL divergence between the variational encoder and the true Bayesian posterior.
For a latent-variable model, the evidence is
$$
p_\theta(x)=\int p_\theta(x|z)\,p(z)\,dz,
$$
and the true posterior over latents is
$$
p_\theta(z|x)
=
\frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}.
$$
This posterior is the distribution we would ideally use for inference: after observing $x$, it tells us which latent explanations $z$ are plausible. The difficulty is that $p_\theta(z|x)$ depends on $p_\theta(x)$, the very integral we cannot usually compute. VAEs therefore introduce an encoder $q_\phi(z|x)$, a tractable approximation to this intractable posterior.
The central theorem says that, for a suitable variational distribution $q_\phi(z|x)$,
$$
\log p_{\theta}(x)
=
\mathcal{L}(\theta,\phi;x)
+
D_{\mathrm{KL}}
\!\left(
q_{\phi}(z|x)
\,\|\,
p_{\theta}(z|x)
\right).
$$
Here the ELBO is
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z|x)}
\!\left[
\log p_{\theta}(x|z)
\right]
-
D_{\mathrm{KL}}
\!\left(
q_{\phi}(z|x)
\,\|\,
p(z)
\right).
$$
So the ELBO is not merely inspired by the evidence. It is the evidence minus a very specific nonnegative gap.
Because KL divergence is always nonnegative,
$$
D_{\mathrm{KL}}
\!\left(
q_{\phi}(z|x)
\,\|\,
p_{\theta}(z|x)
\right)
\geq 0,
$$
we immediately recover the lower-bound property:
$$
\mathcal{L}(\theta,\phi;x)
\leq
\log p_\theta(x).
$$
This also explains the name Evidence Lower Bound. The ELBO sits below the log-evidence, and the vertical distance between them is precisely the mismatch between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$.
The theorem also tells us exactly when the bound is tight. We have
$$
\mathcal{L}(\theta,\phi;x)
=
\log p_\theta(x)
$$
if and only if
$$
q_\phi(z|x)=p_\theta(z|x)
\quad \text{almost everywhere}.
$$
In words: the ELBO becomes the true log-evidence exactly when the encoder recovers the true posterior. There is no remaining variational gap. This is why the encoder is not just an auxiliary neural network used for amortized sampling; it is performing approximate Bayesian inference.
There is a subtle support assumption hiding here. For the KL
$$
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)
$$
to be finite, $q_\phi$ should not place probability mass where the true posterior has zero mass. More formally, we usually require $q_\phi(z|x)$ to be absolutely continuous with respect to $p_\theta(z|x)$. At the same time, if the variational family is too restrictive and cannot represent the true posterior’s important regions, the gap cannot close. This is one reason why posterior approximation quality depends not only on optimization, but also on the expressiveness of the encoder family.
The identity has an important optimization consequence. Suppose $\theta$ is fixed. Then $\log p_\theta(x)$ is constant with respect to $\phi$, so maximizing the ELBO over $\phi$ is exactly equivalent to minimizing
$$
D_{\mathrm{KL}}
\!\left(
q_\phi(z|x)
\,\|\,
p_\theta(z|x)
\right).
$$
That is the variational inference interpretation of the VAE encoder: training $q_\phi(z|x)$ by maximizing the ELBO pushes it toward the true posterior. When we also optimize $\theta$, we are doing two things at once: learning a generative model that assigns high probability to the data, and learning an approximate inference model that explains each datapoint in latent space.
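A small numerical experiment illustrates this equivalence. The sketch assumes a toy scalar model that is not part of the text ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, 0.1)$) and a Gaussian variational family $q = \mathcal{N}(m, v)$, for which the ELBO has a closed form. With the model held fixed, a brute-force grid search over $(m, v)$ should land on the true posterior parameters. All names and grid ranges are hypothetical.

```python
import math

NOISE_VAR = 0.1  # assumed likelihood variance of the toy model

def elbo(x, q_mean, q_var):
    """Closed-form ELBO for prior N(0, 1), likelihood N(x; z, NOISE_VAR), q = N(q_mean, q_var)."""
    expected_log_lik = (-0.5 * math.log(2.0 * math.pi * NOISE_VAR)
                        - ((x - q_mean) ** 2 + q_var) / (2.0 * NOISE_VAR))
    kl_to_prior = 0.5 * (q_mean ** 2 + q_var - 1.0 - math.log(q_var))
    return expected_log_lik - kl_to_prior

x_obs = 1.5

# True posterior of the toy model, available here only because it is linear-Gaussian.
post_mean = x_obs / (1.0 + NOISE_VAR)
post_var = NOISE_VAR / (1.0 + NOISE_VAR)

# Brute-force search over the variational parameters (the role played by phi).
means = [i / 100.0 for i in range(0, 301)]
variances = [i / 1000.0 for i in range(1, 501)]
best_mean, best_var = max(
    ((m, v) for m in means for v in variances),
    key=lambda params: elbo(x_obs, params[0], params[1]),
)

print(f"posterior:  mean = {post_mean:.4f}, var = {post_var:.4f}")
print(f"ELBO argmax: mean = {best_mean:.4f}, var = {best_var:.4f}")
```

In a real VAE the search is replaced by gradient ascent on encoder weights, and the Gaussian family generally cannot match the posterior exactly, so the maximizer is the closest member of the family in the $D_{\mathrm{KL}}(q \,\|\, p)$ sense rather than the posterior itself.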
A useful way to remember the theorem is:
Evidence is the target quantity we wish we could maximize directly.
ELBO is the tractable surrogate we can optimize.
KL gap is the price paid for using $q_\phi(z|x)$ instead of the true posterior.
The visual below compresses this relationship into a geometric picture. The log-evidence is represented as a fixed point on a “nats” axis, while the ELBO lies to its left. The red interval between them is the posterior KL divergence. If the encoder improves, that interval shrinks; if $q_\phi(z|x)$ matches $p_\theta(z|x)$, the two points coincide.
The three corollaries in the visual are therefore not separate facts, but direct consequences of the same decomposition: the ELBO is a lower bound because KL is nonnegative; it is tight exactly when the approximate and true posteriors agree; and maximizing it with respect to $\phi$ is variational inference. The next step is to prove the identity algebraically, showing how Bayes’ rule turns the apparently inaccessible evidence into the sum of an optimizable lower bound and an explicit posterior mismatch term.

9. Proof: ELBO + KL Gap = Log Evidence

Having stated that the gap between the log evidence and the ELBO is a KL divergence, we should now verify that this is not a heuristic slogan. It is an exact identity. The derivation is short, but it is worth reading carefully because it explains why variational inference works: we are not inventing an arbitrary surrogate objective; we are decomposing the true marginal log-likelihood into a tractable lower bound plus a nonnegative error term.
Fix one observed datapoint $x$. The model defines a latent-variable joint distribution
$$
p_{\theta}(x,z) = p_{\theta}(x|z)\,p(z),
$$
and the true posterior is
$$
p_{\theta}(z|x) = \frac{p_{\theta}(x,z)}{p_{\theta}(x)}.
$$
The problem is that $p_{\theta}(x)$, the evidence, usually requires integrating out $z$:
$$
p_{\theta}(x) = \int p_{\theta}(x,z)\,dz.
$$
In a VAE, this integral is generally intractable because the decoder $p_{\theta}(x|z)$ is represented by a neural network. So instead of working directly with the true posterior $p_{\theta}(z|x)$, we introduce an approximate posterior, or encoder distribution,
$$
q_{\phi}(z|x).
$$
The key question is: how does optimizing an objective involving $q_{\phi}(z|x)$ relate to maximizing the true log evidence $\log p_{\theta}(x)$?
Start from the KL divergence between the variational posterior and the true posterior:
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
=
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\log q_{\phi}(z|x) - \log p_{\theta}(z|x)
\right].
$$
Now substitute Bayes’ rule into the true posterior term:
$$
p_{\theta}(z|x)
=
\frac{p_{\theta}(x,z)}{p_{\theta}(x)}.
$$
Taking logs gives
$$
\log p_{\theta}(z|x)
=
\log p_{\theta}(x,z) - \log p_{\theta}(x).
$$
Therefore,
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
=
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\log q_{\phi}(z|x)
-
\log p_{\theta}(x,z)
+
\log p_{\theta}(x)
\right].
$$
The subtle but crucial observation is that $\log p_{\theta}(x)$ does not depend on $z$. The expectation is over $z \sim q_{\phi}(z|x)$, while $x$ is fixed. So we can pull $\log p_{\theta}(x)$ outside the expectation:
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
=
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\log q_{\phi}(z|x)
-
\log p_{\theta}(x,z)
\right]
+
\log p_{\theta}(x).
$$
Equivalently,
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
=
-
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\log p_{\theta}(x,z)
-
\log q_{\phi}(z|x)
\right]
+
\log p_{\theta}(x).
$$
The expectation inside the negative sign is precisely the evidence lower bound:
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\log p_{\theta}(x,z)
-
\log q_{\phi}(z|x)
\right].
$$
So the KL divergence becomes
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
=
-\mathcal{L}(\theta,\phi;x)
+
\log p_{\theta}(x).
$$
Rearranging gives the central decomposition:
$$
\boxed{
\log p_{\theta}(x)
=
\mathcal{L}(\theta,\phi;x)
+
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
}
$$
This is the whole proof. No approximation has been made. We used only the definition of KL divergence, Bayes’ theorem, and the fact that constants can be moved outside expectations.
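Since the identity is exact, it can be verified to floating-point precision whenever every term has a closed form. The sketch below does this for an assumed toy model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, 0.1)$, with $q = \mathcal{N}(m, v)$), where the evidence, the ELBO, and the KL to the true posterior are all available analytically. The function names and test values are hypothetical.

```python
import math

NOISE_VAR = 0.1  # assumed likelihood variance of the toy model

def log_normal_pdf(value, mean, var):
    """Log density of the univariate Gaussian N(mean, var) at `value`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def kl_gaussians(mean_a, var_a, mean_b, var_b):
    """KL( N(mean_a, var_a) || N(mean_b, var_b) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_b / var_a) + (var_a + (mean_a - mean_b) ** 2) / var_b - 1.0)

def elbo(x, q_mean, q_var):
    """Reconstruction term minus KL to the prior N(0, 1), both in closed form."""
    expected_log_lik = (-0.5 * math.log(2.0 * math.pi * NOISE_VAR)
                        - ((x - q_mean) ** 2 + q_var) / (2.0 * NOISE_VAR))
    return expected_log_lik - kl_gaussians(q_mean, q_var, 0.0, 1.0)

def evidence_and_decomposition(x, q_mean, q_var):
    """Return (log p(x), ELBO + KL(q || true posterior)) for the toy model."""
    log_evidence = log_normal_pdf(x, 0.0, 1.0 + NOISE_VAR)
    post_mean = x / (1.0 + NOISE_VAR)
    post_var = NOISE_VAR / (1.0 + NOISE_VAR)
    gap = kl_gaussians(q_mean, q_var, post_mean, post_var)
    return log_evidence, elbo(x, q_mean, q_var) + gap

# Deliberately poor and good choices of q: the identity holds for all of them.
checks = [evidence_and_decomposition(1.5, m, v)
          for m, v in [(0.0, 1.0), (2.0, 0.05), (1.36, 0.09), (-3.0, 4.0)]]
for lhs, rhs in checks:
    print(f"log p(x) = {lhs:.6f}, ELBO + KL = {rhs:.6f}")
```

The two columns agree for every choice of $q$; only the split between ELBO and gap changes.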
The consequence is immediate. Since KL divergence is always nonnegative,
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right) \geq 0,
$$
we have
$$
\log p_{\theta}(x) \geq \mathcal{L}(\theta,\phi;x).
$$
That is why $\mathcal{L}$ is called a lower bound on the log evidence. The bound is tight exactly when
$$
D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right) = 0,
$$
which occurs when
$$
q_{\phi}(z|x) = p_{\theta}(z|x)
$$
almost everywhere under $q_{\phi}$. In words: the ELBO equals the true log evidence when the encoder distribution exactly matches the model’s true posterior.
This identity also clarifies a common failure mode in variational modeling. Maximizing the ELBO improves a lower bound on $\log p_{\theta}(x)$, but it does so through the restricted family $q_{\phi}(z|x)$. If the encoder family is too limited, the KL gap may remain large even after optimization. The objective is still valid, but the approximation may be poor. Conversely, if $q_{\phi}$ is expressive enough and optimization succeeds, the ELBO can become a very accurate proxy for the true marginal likelihood.
The visual below compresses the derivation into a chain of equalities: start with the KL divergence to the true posterior, replace the posterior using Bayes’ theorem, separate the constant $\log p_{\theta}(x)$, recognize the ELBO, and rearrange. The boxed final identity is not an additional assumption; it is simply the same expression written so that the evidence appears on the left.
It is useful to keep this picture in mind as we move forward. The proof never used the “reconstruction plus regularization” form of the ELBO; that reading comes from expanding the joint model, as in Section 7. The most important fact here is structural: log evidence decomposes exactly into an optimizable lower bound plus a nonnegative posterior-approximation gap.

10. ELBO as Negative Free Energy

Having just seen that the ELBO differs from the true log evidence by a nonnegative KL gap, we can now reinterpret the bound in a more operational way. The ELBO is not merely an algebraic trick for lower-bounding $\log p_\theta(x)$; it is the objective that tells a VAE how to trade off using latent information against staying compatible with the prior.
For a single datapoint $x$, the VAE objective is
$$
\mathcal{L}(\theta, \phi; x)
=
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
-
D_{\mathrm{KL}}(q_{\phi}(z|x) \,\|\, p(z)).
$$
The first term is the reconstruction term. It asks: if the encoder distribution $q_\phi(z|x)$ samples a latent code $z$, how much probability does the decoder $p_\theta(x|z)$ assign back to the original observation $x$? Maximizing this term encourages the latent representation to preserve information that the decoder needs. In an image VAE, this means encoding shape, pose, color, texture, or any features that help predict the input under the chosen likelihood model.
The second term is the regularization term, but that phrase can be slightly misleading if we think of it as a generic penalty. More precisely, it penalizes the encoder distribution $q_\phi(z|x)$ for moving too far away from the prior $p(z)$, usually $p(z)=\mathcal{N}(0,I)$. This matters because generation will later sample $z \sim p(z)$, not $z \sim q_\phi(z|x)$. If the encoder learns latent codes that live in strange isolated regions far from the prior, then samples drawn from the prior may decode poorly. The KL term forces the aggregate geometry of the latent space to remain usable for generation.
We can unpack the KL penalty further using
$$
D_{\mathrm{KL}}(q\,\|\,p)
=
-\mathcal{H}[q]
+
\mathbb{E}_{q}[-\log p(z)].
$$
Therefore,
$$
-D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z))
=
\mathcal{H}[q_{\phi}(z|x)]
-
\mathbb{E}_{q_{\phi}(z|x)}[-\log p(z)].
$$
This decomposition reveals two forces hidden inside the regularization term. The entropy term $\mathcal{H}[q_\phi(z|x)]$ rewards the encoder for being uncertain, or spread out, rather than collapsing to a deterministic point code. Meanwhile, the expected negative log prior term penalizes codes that fall in regions where $p(z)$ assigns low probability. For a standard Gaussian prior, this means the model is discouraged from placing latent mass far away from the origin.
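For a scalar Gaussian encoder $q = \mathcal{N}(m, v)$ and a standard normal prior, both pieces have simple closed forms: the entropy is $\tfrac{1}{2}\log(2\pi e\, v)$ and the expected negative log prior is $\tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}(m^2 + v)$. The sketch below, with hypothetical function names, checks that they recombine into the usual KL expression and shows how each force responds to the encoder parameters.

```python
import math

def gaussian_entropy(var):
    """Differential entropy of N(mean, var) in nats; it does not depend on the mean."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def expected_neg_log_prior(mean, var):
    """E_q[-log p(z)] for q = N(mean, var) and the prior p = N(0, 1)."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * (mean ** 2 + var)

def kl_to_standard_normal(mean, var):
    """Closed-form KL( N(mean, var) || N(0, 1) )."""
    return 0.5 * (mean ** 2 + var - 1.0 - math.log(var))

mean, var = 1.2, 0.3
recombined = -gaussian_entropy(var) + expected_neg_log_prior(mean, var)
direct = kl_to_standard_normal(mean, var)

# Shrinking the variance lowers the entropy: a near-deterministic code is penalized.
entropy_wide = gaussian_entropy(0.5)
entropy_narrow = gaussian_entropy(0.01)

# Moving the mean away from the origin raises the expected negative log prior.
cost_near = expected_neg_log_prior(0.0, 0.3)
cost_far = expected_neg_log_prior(3.0, 0.3)

print(f"-H + E_q[-log p] = {recombined:.6f}, closed-form KL = {direct:.6f}")
```

The entropy term depends only on the variance and the prior term grows with both $m^2$ and $v$, so together they favor a code centered near the origin with variance close to one.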
So the ELBO is balancing two competing desires:
Reconstruction: encode enough information in zzz to explain xxx well.
Regularization: keep qϕ(z∣x)q_\phi(z|x)qϕ​(z∣x) close enough to p(z)p(z)p(z) that latent samples remain meaningful.
This tension is fundamental. If the KL penalty is too weak, the encoder may memorize examples using highly specialized latent codes, producing good reconstructions but a poorly organized latent space. If the KL penalty is too strong, the encoder may ignore xxx and make qϕ(z∣x)≈p(z)q_\phi(z|x)\approx p(z)qϕ​(z∣x)≈p(z), which can lead to posterior collapse: the decoder receives little useful information from zzz, and the latent variables stop carrying semantic structure.
This is why the ELBO also has a natural rate–distortion interpretation. The KL term acts like a rate: it measures, in nats (or in bits after dividing by $\ln 2$), the extra cost of communicating a latent code drawn from $q_\phi(z|x)$ with a code designed for the prior $p(z)$. The reconstruction error acts like a distortion: poor reconstructions correspond to high distortion, while high log-likelihood corresponds to low distortion. Maximizing the ELBO means finding a good operating point between compression cost and reconstruction fidelity.
The connection to physics and variational inference comes from writing the training problem as minimization of the negative ELBO:
$$-\mathcal{L}(\theta,\phi;x).$$
This quantity is often called the variational free energy or negative evidence lower bound. Since we previously established
$$\log p_{\theta}(x) = \mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}(q_{\phi}(z|x)\|p_{\theta}(z|x)),$$
minimizing free energy simultaneously pushes the ELBO upward and tries to reduce the gap between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$. The bound becomes tight exactly when
$$q_\phi(z|x)=p_\theta(z|x),$$
assuming the variational family is expressive enough to represent the true posterior.
The visual below condenses this interpretation into a balance: reconstruction pulls the objective toward faithful recovery of $x$, while KL regularization pulls the encoder distribution back toward the prior. The scale metaphor is useful because neither side is “bad”; a good VAE needs both. Reconstruction gives the latent variable meaning, while the KL term gives the latent space global structure.
It also highlights the rate–distortion view: moving toward lower distortion usually requires spending more rate, while enforcing a very low rate often sacrifices detail. Much of practical VAE design (choosing decoder likelihoods, KL schedules, $\beta$-VAE objectives, or more expressive priors) is about controlling exactly where this operating point lies.

11. Worked Example: ELBO on a 1D Gaussian

Having seen the ELBO as a negative free energy, it is worth pausing on a case where nothing is hidden behind approximation. In most VAE settings, the posterior $p_\theta(z \mid x)$ is intractable, which is exactly why we introduce a variational distribution $q_\phi(z \mid x)$. But in a one-dimensional linear-Gaussian model, the posterior, marginal likelihood, KL terms, and ELBO can all be written down exactly. That makes it a useful sanity check: if our interpretation of the ELBO is right, then optimizing it over a sufficiently expressive variational family should recover the true posterior and make the ELBO equal to the evidence.
Consider the toy generative model
$$p(z) = \mathcal{N}(0,1), \qquad p_\theta(x \mid z) = \mathcal{N}(z, 0.1).$$
Here the latent variable $z$ is drawn from a standard normal prior, and the observation $x$ is a noisy version of $z$ with small Gaussian noise variance $0.1$. Because both the prior and likelihood are Gaussian, the marginal distribution of $x$ is also Gaussian:
$$p_\theta(x) = \mathcal{N}(0, 1.1),$$
so the log evidence is available in closed form:
$$\log p_\theta(x) = -\frac{1}{2}\log(2\pi \cdot 1.1) - \frac{x^2}{2.2}.$$
The true posterior is also Gaussian. Combining the prior precision $1$ with the likelihood precision $1/0.1 = 10$, the posterior variance is
$$(\sigma^*)^2 = \frac{1}{1 + 10} = \frac{0.1}{1.1},$$
and the posterior mean is the precision-weighted estimate
$$\mu^* = \frac{x}{1.1}.$$
Thus
$$p_\theta(z \mid x) = \mathcal{N}\!\left(\mu^*,(\sigma^*)^2\right), \qquad \mu^* = \frac{x}{1.1}, \qquad (\sigma^*)^2 = \frac{0.1}{1.1}.$$
This already contains an important intuition. The observation $x$ pulls the posterior mean toward $x$, but not all the way: the prior still shrinks the estimate toward zero. Since the observation noise is small, the posterior variance is much smaller than the prior variance. For $x=1$, we get
$$\mu^* \approx 0.91, \qquad \sigma^* \approx 0.30.$$
Now suppose our variational family is
$$q_\phi(z \mid x) = \mathcal{N}(\mu,\sigma^2).$$
This family is expressive enough to contain the true posterior exactly, because the true posterior is also a one-dimensional Gaussian. The ELBO is
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z)).$$
In this Gaussian case, both terms can be evaluated analytically. The expected reconstruction term is
$$\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] = -\frac{1}{2}\log(2\pi \cdot 0.1) - \frac{(x-\mu)^2+\sigma^2}{2\cdot 0.1},$$
because under $q_\phi$,
$$\mathbb{E}_{q_\phi}[(x-z)^2] = (x-\mu)^2+\sigma^2.$$
The KL regularizer against the standard normal prior is
$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\right) = \frac{1}{2}\left(\mu^2+\sigma^2-1-\log\sigma^2\right).$$
So the ELBO is a deterministic function of only two variational parameters, $\mu$ and $\sigma$. There is no Monte Carlo noise, no neural network approximation, and no optimization mystery. We can inspect the entire objective surface directly.
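Because both terms are closed-form, the objective surface can be evaluated directly. The following is a minimal sketch for $x=1$ under the toy model above; the grid resolution and the names `elbo`, `log_evidence`, and `NOISE_VAR` are illustrative assumptions. The coarse grid search is only a way of inspecting the surface, not a recommended optimizer.

```python
import math

NOISE_VAR = 0.1  # decoder noise variance of the toy model

def log_evidence(x):
    # log N(x; 0, 1 + NOISE_VAR), the exact marginal likelihood.
    total_var = 1.0 + NOISE_VAR
    return -0.5 * math.log(2 * math.pi * total_var) - x**2 / (2 * total_var)

def elbo(mu, sigma, x):
    # Expected reconstruction term minus KL to the standard normal prior.
    recon = (-0.5 * math.log(2 * math.pi * NOISE_VAR)
             - ((x - mu)**2 + sigma**2) / (2 * NOISE_VAR))
    kl = 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))
    return recon - kl

x = 1.0
# Evaluate the ELBO on a grid with spacing 0.01 in both mu and sigma.
grid = ((elbo(m / 100, s / 100, x), m / 100, s / 100)
        for m in range(0, 201) for s in range(1, 201))
best_elbo, best_mu, best_sigma = max(grid, key=lambda t: t[0])

print(f"grid maximizer: mu={best_mu:.2f}, sigma={best_sigma:.2f}")
print(f"ELBO at maximizer: {best_elbo:.5f}")
print(f"log evidence:      {log_evidence(x):.5f}")
```

The grid maximizer should sit at the grid point nearest the posterior parameters computed earlier, with the ELBO just below the log evidence.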
The key identity from the previous section was
$$\log p_\theta(x) = \mathcal{L}(\theta,\phi;x) + D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
Because the evidence $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO over $\phi$ is equivalent to minimizing
$$D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
In this toy example, the minimum possible KL gap is exactly zero, since the variational family contains the true posterior. Therefore the global maximizer is
$$q_\phi(z \mid x) = p_\theta(z \mid x),$$
or equivalently,
$$\mu = \mu^*, \qquad \sigma = \sigma^*.$$
At that point,
$$D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)) = 0,$$
and the ELBO becomes tight:
$$\mathcal{L}(\theta,\phi;x) = \log p_\theta(x).$$
This is the tightness corollary in its cleanest possible form.
It is useful to compare three operating points for $x=1$:
Prior proposal: $q(z \mid x)=\mathcal{N}(0,1)$. This ignores the observation entirely. It has no KL cost relative to the prior, but it reconstructs $x=1$ poorly, so the ELBO is low.
Centered but too wide: $q(z \mid x)=\mathcal{N}(\mu^*, (3\sigma^*)^2)$. This puts its mean in the right place but spreads too much probability mass over implausible latent values. The ELBO improves relative to the prior proposal, but the KL gap to the true posterior remains positive.
Exact posterior: $q(z \mid x)=\mathcal{N}(\mu^*,(\sigma^*)^2)$. This matches both the mean and uncertainty of the true posterior, so the KL gap vanishes and the ELBO reaches $\log p_\theta(x)$.
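The three operating points can be compared numerically. This sketch recomputes the closed-form ELBO for each candidate and adds the KL gap to the true posterior, using the standard formula for the KL divergence between two univariate Gaussians; the helper names and the `candidates` dictionary are illustrative assumptions.

```python
import math

NOISE_VAR = 0.1
x = 1.0
mu_star = x / (1.0 + NOISE_VAR)                        # posterior mean
sigma_star = math.sqrt(NOISE_VAR / (1.0 + NOISE_VAR))  # posterior std
log_px = -0.5 * math.log(2 * math.pi * (1.0 + NOISE_VAR)) - x**2 / (2 * (1.0 + NOISE_VAR))

def elbo(mu, sigma):
    recon = (-0.5 * math.log(2 * math.pi * NOISE_VAR)
             - ((x - mu)**2 + sigma**2) / (2 * NOISE_VAR))
    kl_prior = 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))
    return recon - kl_prior

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in closed form.
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

candidates = {
    "prior proposal": (0.0, 1.0),
    "centered but too wide": (mu_star, 3 * sigma_star),
    "exact posterior": (mu_star, sigma_star),
}

results = {}
for name, (mu, sigma) in candidates.items():
    bound = elbo(mu, sigma)
    gap = kl_gaussians(mu, sigma, mu_star, sigma_star)
    results[name] = (bound, gap)
    print(f"{name:>22}: ELBO={bound:8.4f}  gap={gap:7.4f}  ELBO+gap={bound + gap:8.4f}")
print(f"{'log evidence':>22}: {log_px:8.4f}")
```

In every row the ELBO plus the gap should equal the log evidence, and only the exact posterior should have a gap of zero.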
The visual below consolidates these facts in two complementary ways. The contour plot treats the ELBO as a surface over $(\mu,\sigma)$: the maximum occurs exactly at the true posterior parameters, not merely near them. The prior-like approximation sits far away from the optimum, while the centered-but-wide approximation lands on the correct mean but remains suboptimal because its uncertainty is wrong.
The companion bar chart emphasizes the decomposition
$$\log p_\theta(x) = \mathcal{L}(\theta,\phi;x) + D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
For each candidate $q_\phi$, the ELBO plus the KL gap reaches the same evidence level. In the exact posterior case, the red “gap” disappears entirely, making the lower bound not just a bound, but the true log marginal likelihood.

12. The Gradient Problem: Why We Cannot Backprop Through Sampling

The one-dimensional Gaussian example makes the ELBO feel almost deceptively friendly: the KL term can be written down, the expectation can sometimes be evaluated or estimated, and the whole objective looks like something we should be able to optimize with ordinary backpropagation. But the moment we replace that toy setup with a neural encoder and decoder, a subtle obstruction appears. The problem is not that the ELBO is undefined, nor that Monte Carlo estimation is impossible. The problem is that the Monte Carlo samples themselves sit in the middle of the computation graph.
Recall that for a single datapoint $x$, the VAE objective is
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - D_{\mathrm{KL}}\bigl(q_{\phi}(z|x)\|p(z)\bigr).$$
When we differentiate with respect to the encoder parameters $\phi$, the gradient decomposes as
$$\nabla_{\phi}\,\mathcal{L}(\theta,\phi;x) = \underbrace{\nabla_{\phi}\,\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]}_{\text{reconstruction term: hard}} - \underbrace{\nabla_{\phi}\,D_{\mathrm{KL}}\bigl(q_{\phi}(z|x)\|p(z)\bigr)}_{\text{regularization term: usually easy}}.$$
The KL term is usually not the source of trouble in the standard VAE. If $q_{\phi}(z|x)$ is a diagonal Gaussian,
$$q_{\phi}(z|x)=\mathcal{N}\bigl(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\bigr),$$
and the prior is $p(z)=\mathcal{N}(0,I)$, then the KL divergence has a closed form. It is just a differentiable expression involving $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$. Autodiff can handle this directly.
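To see why this term is easy, the closed form can be written out and differentiated by hand. The sketch below assumes the diagonal-Gaussian KL is the sum over coordinates of the one-dimensional formula from the worked example, treats `mu` and `sigma` as plain lists standing in for encoder outputs, and checks the analytic gradient against a finite difference. All names and values are illustrative.

```python
import math

def kl_diag_gaussian(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) as a sum of per-coordinate terms.
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def kl_grad(mu, sigma):
    # Analytic partial derivatives: dKL/dmu_k = mu_k, dKL/dsigma_k = sigma_k - 1/sigma_k.
    d_mu = list(mu)
    d_sigma = [s - 1.0 / s for s in sigma]
    return d_mu, d_sigma

mu = [0.5, -1.0, 0.0]     # stand-in for encoder means
sigma = [0.8, 1.2, 1.0]   # stand-in for encoder standard deviations
d_mu, d_sigma = kl_grad(mu, sigma)

# Central finite differences on the first coordinate as a sanity check.
h = 1e-6
fd_mu0 = (kl_diag_gaussian([mu[0] + h] + mu[1:], sigma)
          - kl_diag_gaussian([mu[0] - h] + mu[1:], sigma)) / (2 * h)
fd_sigma0 = (kl_diag_gaussian(mu, [sigma[0] + h] + sigma[1:])
             - kl_diag_gaussian(mu, [sigma[0] - h] + sigma[1:])) / (2 * h)

print("analytic:", d_mu[0], d_sigma[0])
print("finite difference:", fd_mu0, fd_sigma0)
```

No sampling appears anywhere in this computation, which is the sense in which the regularization term poses no difficulty for automatic differentiation.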
The reconstruction term is different because the distribution inside the expectation depends on $\phi$:
$$\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] = \int q_{\phi}(z|x)\,\log p_{\theta}(x|z)\,dz.$$
Here $\phi$ controls where we sample from. Changing $\phi$ changes the density $q_{\phi}(z|x)$, which changes the regions of latent space that contribute to the expectation. This is not the same as differentiating an ordinary deterministic function. The parameter $\phi$ affects the objective through the sampling distribution itself.
A natural first attempt is to use Monte Carlo sampling. Draw
$$z^{(l)} \sim q_{\phi}(z|x),$$
and approximate the reconstruction expectation by
$$\tilde{\mathcal{L}} = \frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)}).$$
This gives an unbiased estimate of the expectation. For optimizing $\theta$, it is fine: once $z^{(l)}$ is sampled, the decoder likelihood $\log p_{\theta}(x|z^{(l)})$ is an ordinary differentiable function of $\theta$. But for optimizing $\phi$, naive backpropagation runs into a wall. The sampled value $z^{(l)}$ is not represented as a differentiable deterministic function of $\phi$. In the computation graph, the operation “sample from $q_{\phi}(z|x)$” behaves like a stochastic node, not like a smooth layer.
This is the key failure mode of naive Monte Carlo:
$$\nabla_{\phi}\left[\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)})\right]$$
does not correctly estimate
$$\nabla_{\phi}\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right].$$
If we treat the sampled $z^{(l)}$ values as fixed constants after sampling, then the reconstruction term has no differentiable path back to $\phi$. Informally, the gradient is blocked at the sampling operation. More precisely, the particular sampled numerical value may change if we rerun the sampler with a different $\phi$, but that dependence is not available to standard backpropagation as a local derivative through the realized sample.
There is a general-purpose workaround called the score-function estimator, often known in this context as REINFORCE. It uses the identity
$$\nabla_{\phi}\,\mathbb{E}_{q_{\phi}}[f(z)] = \mathbb{E}_{q_{\phi}}\left[f(z)\,\nabla_{\phi}\log q_{\phi}(z|x)\right].$$
This identity is valid under mild regularity assumptions and does not require differentiating through the sampled value $z$. Instead, it differentiates the log-density of the sampling distribution. That is powerful: it works even for discrete random variables, where ordinary pathwise derivatives are unavailable.
But for VAEs with neural decoders, this estimator is usually too noisy to be practical without substantial variance reduction. The scalar reward-like term $f(z)=\log p_{\theta}(x|z)$ can vary dramatically across samples, especially early in training. Multiplying it by $\nabla_{\phi}\log q_{\phi}(z|x)$ often produces gradients with very high variance. In principle the estimator is unbiased; in practice, its updates can be so unstable that learning becomes slow, erratic, or ineffective.
So the issue is not merely “sampling is random.” The deeper issue is that we need a low-variance differentiable estimator of how the reconstruction objective changes when the encoder distribution changes. Naive Monte Carlo gives samples but no usable pathwise gradient. REINFORCE gives a gradient but often with prohibitive variance. This tension is exactly what motivates the reparameterization trick.
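A small experiment makes the score-function identity concrete. The sketch below is an illustration under simplifying assumptions: a one-dimensional $q=\mathcal{N}(\mu,\sigma^2)$ and the test function $f(z)=z^2$, chosen only because $\mathbb{E}_q[f]=\mu^2+\sigma^2$ gives a known gradient $2\mu$ with respect to $\mu$. It is not a VAE training loop, and the function names are hypothetical.

```python
import random
import statistics

def f(z):
    # Illustrative test function with a known expectation under a Gaussian.
    return z * z

def score_function_grad_mu(mu, sigma, rng):
    # One-sample estimate f(z) * d/dmu log q(z), where d/dmu log q(z) = (z - mu) / sigma^2.
    z = rng.gauss(mu, sigma)
    score = (z - mu) / sigma**2
    return f(z) * score

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = 1.0, 0.5
draws = [score_function_grad_mu(mu, sigma, rng) for _ in range(200_000)]

score_mean = statistics.fmean(draws)
score_std = statistics.pstdev(draws)
true_grad = 2.0 * mu  # d/dmu (mu^2 + sigma^2)

print(f"true gradient:        {true_grad:.3f}")
print(f"score-function mean:  {score_mean:.3f}")
print(f"per-sample std:       {score_std:.3f}")
```

The average should land close to the true gradient, while the per-sample standard deviation is several times larger than the gradient itself, which is the noise problem described above.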
The visual below compresses this obstruction into a computation graph. The forward pass is straightforward: $x$ is encoded into parameters of $q_{\phi}(z|x)$, a latent sample $z^{(l)}$ is drawn, and the decoder evaluates $\log p_{\theta}(x|z^{(l)})$. The backward pass is where the asymmetry appears: gradients flow cleanly through the decoder, but the attempted gradient path back through the stochastic sampling node is blocked.
The small REINFORCE annotation is the important caveat. There is a mathematical gradient estimator that avoids differentiating through the sample, but it pays for that generality with high variance. The next step is to replace the problematic sampling operation with an equivalent differentiable construction, one that preserves the same distribution over $z$ while allowing backpropagation to reach $\phi$.

13. The Reparameterization Trick: Core Identity

The obstruction we just ran into was not that the encoder distribution $q_{\phi}(z \mid x)$ is mysterious, nor that Gaussian sampling is impossible to differentiate in some philosophical sense. The issue is more specific: if we treat the operation “draw $z \sim q_{\phi}(z \mid x)$” as an opaque stochastic node, then the sampled value changes with $\phi$ only through the distribution from which it was drawn. Standard backpropagation expects deterministic computational paths; it does not know how to assign a pathwise derivative through the act of sampling itself.
The reparameterization trick resolves this by changing how we generate the same random variable. Instead of sampling directly from a $\phi$-dependent Gaussian, we sample from a fixed, parameter-free source of randomness and then transform that noise deterministically using the encoder outputs. For the diagonal Gaussian encoder used in the standard VAE,
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}(x)^2)\right),$$
we write a sample as
$$\hat{z} = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I).$$
This is the core identity. The randomness now lives entirely in $\epsilon$, whose distribution does not depend on $\phi$. The encoder parameters only determine the deterministic map that takes $(x,\epsilon)$ to $\hat{z}$.
It is worth verifying that this has not changed the distribution being sampled. Componentwise,
$$\hat{z}_k = \mu_{\phi,k}(x) + \sigma_{\phi,k}(x)\epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0,1).$$
An affine transformation of a standard normal random variable is normal, with shifted mean and rescaled variance:
$$\hat{z}_k \sim \mathcal{N}\!\left(\mu_{\phi,k}(x), \sigma_{\phi,k}(x)^2\right).$$
Since the $\epsilon_k$ are independent under $\mathcal{N}(0,I)$, the resulting vector has diagonal covariance:
$$\hat{z} \sim \mathcal{N}\!\left(\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}(x)^2)\right) = q_{\phi}(z \mid x).$$
So the reparameterized sample $\hat{z}$ is distributionally identical to a direct sample from the encoder. The trick is not an approximation to the Gaussian; it is an exact change of representation.
The important shift is computational. Before reparameterization, we had a sample $z \sim q_{\phi}(z \mid x)$, and the dependence of $z$ on $\phi$ was hidden inside the sampling operation. After reparameterization, we have
$$\hat{z} = g_{\phi}(x,\epsilon) = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon,$$
where $\epsilon$ is treated as an external random input. For a fixed draw of $\epsilon$, the map from $\phi$ to $\hat{z}$ is just an ordinary differentiable computation graph. Gradients can flow through $\mu_{\phi}(x)$, through $\sigma_{\phi}(x)$, and then through the decoder likelihood term $\log p_{\theta}(x \mid \hat{z})$.
This is why the reparameterization trick is sometimes called a pathwise gradient estimator. Instead of asking how the probability density changes with $\phi$, we ask how the sampled point moves in latent space when $\phi$ changes while holding the underlying noise fixed. Intuitively, $\epsilon$ chooses a location in “standard normal coordinates,” and the encoder stretches and shifts that location into the latent space used by the decoder.
For example, if increasing one encoder parameter increases $\mu_{\phi,k}(x)$, then every sampled $\hat{z}_k$ shifts upward by the corresponding amount. If increasing another parameter increases $\sigma_{\phi,k}(x)$, then samples with positive $\epsilon_k$ move upward and samples with negative $\epsilon_k$ move downward. These are ordinary derivatives:
$$\frac{\partial \hat{z}_k}{\partial \mu_{\phi,k}} = 1, \qquad \frac{\partial \hat{z}_k}{\partial \sigma_{\phi,k}} = \epsilon_k.$$
Backpropagation can now see exactly how changing the encoder changes the latent sample and, through the decoder, changes the reconstruction term in the ELBO.
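Both claims, that the reparameterized sample has the right distribution and that its derivatives are $1$ and $\epsilon_k$, can be checked in a few lines. This is a one-dimensional sketch with fixed scalars `mu` and `sigma` standing in for the encoder outputs; the function name `g` mirrors $g_\phi$ but omits $x$ for brevity, and the finite-difference check holds the noise draw fixed.

```python
import random
import statistics

def g(mu, sigma, eps):
    # Deterministic reparameterization map: z_hat = mu + sigma * eps.
    return mu + sigma * eps

rng = random.Random(0)
mu, sigma = 1.5, 0.4  # stand-ins for encoder outputs at a fixed x

# Distribution check: transformed standard-normal noise should look like N(mu, sigma^2).
eps_draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
z_hat = [g(mu, sigma, e) for e in eps_draws]
sample_mean = statistics.fmean(z_hat)
sample_std = statistics.pstdev(z_hat)

# Pathwise derivative check for one fixed noise draw, via central differences.
e0 = eps_draws[0]
h = 1e-6
dz_dmu = (g(mu + h, sigma, e0) - g(mu - h, sigma, e0)) / (2 * h)
dz_dsigma = (g(mu, sigma + h, e0) - g(mu, sigma - h, e0)) / (2 * h)

print(f"sample mean {sample_mean:.4f} (target {mu}), sample std {sample_std:.4f} (target {sigma})")
print(f"dz/dmu = {dz_dmu:.6f}, dz/dsigma = {dz_dsigma:.6f}, eps = {e0:.6f}")
```

In an autodiff framework the same derivatives are produced automatically, since `g` is an ordinary differentiable function once `eps` is treated as an input.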
There are a few assumptions hiding here. First, we need a distribution that can be expressed as a deterministic transformation of parameter-free noise. This is easy for location-scale families such as Gaussians, where samples can be written as “mean plus scale times noise.” Second, the transformation should be differentiable with respect to the parameters we want to optimize. Third, the base noise distribution $p(\epsilon)$ must not depend on $\phi$; otherwise the original problem reappears inside the supposedly fixed randomness.
In its most general form, the idea is
$$z = g_{\phi}(\epsilon, x), \qquad \epsilon \sim p(\epsilon),$$
where $p(\epsilon)$ is fixed and $g_{\phi}$ is differentiable. The diagonal Gaussian VAE is the canonical example, but the principle extends beyond it whenever such a transformation is available. When it is not available, as for many discrete latent variables, we need different gradient estimators or relaxations, and those usually come with higher variance or bias.
The visual below compresses this logic into two computation graphs. The left side represents the problematic formulation: $\phi$ determines a distribution, a stochastic sample is drawn, and the gradient path is interrupted at that sampling node. The right side rewrites the same sample as a deterministic function of encoder outputs and independent noise. The stochasticity has not disappeared; it has simply been moved to a place where it no longer carries $\phi$-dependence.
The key message to carry forward is that reparameterization does not make the VAE deterministic. Each forward pass can still use a fresh $\epsilon$, so the latent code remains random. What changes is that, conditional on that noise draw, the ELBO becomes an ordinary differentiable objective with respect to $\phi$. That is the bridge from stochastic latent-variable learning to standard neural-network optimization.

14. Theorem: Reparameterized Gradient Estimator is Unbiased

Having rewritten sampling as $z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon$, the next question is not merely whether this is a convenient implementation trick. The real question is whether it gives us the right gradient. If we replace a draw from $q_\phi(z\mid x)$ with a deterministic transformation of noise, are we still estimating
$$\nabla_\phi \, \mathbb{E}_{q_\phi(z\mid x)}[f(z)]$$
without bias?
This matters because the VAE objective contains expectations over latent variables sampled from the encoder distribution. For a single datapoint $x$, the reconstruction-related part of the ELBO often has the form
$$\mathbb{E}_{q_\phi(z\mid x)}[f(z)],$$
where, for example, $f(z)$ may be $\log p_\theta(x\mid z)$. The encoder parameters $\phi$ affect this expectation indirectly: changing $\phi$ changes the distribution over $z$. Naively sampling $z \sim q_\phi(z\mid x)$ creates a problem for backpropagation because the sampling operation itself is not a deterministic differentiable map of $\phi$.
The reparameterization trick resolves this by expressing the random variable as
$$\hat z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I).$$
Now all randomness lives in $\epsilon$, whose distribution does not depend on $\phi$. The dependence on $\phi$ has been moved into a deterministic function:
$$z = T_\phi(x,\epsilon) = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon.$$
So instead of differentiating through the act of sampling from $q_\phi$, we differentiate through the deterministic path
$$\phi \longrightarrow \mu_\phi(x), \sigma_\phi(x) \longrightarrow z \longrightarrow f(z).$$
The theorem states that, under the usual smoothness and integrability assumptions that allow us to interchange differentiation and expectation,
$$\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi}\, f\!\left(\mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon\right)\right].$$
This is the central identity. The left side is the gradient we actually want: the derivative of an expectation under the encoder distribution. The right side is something we can estimate by ordinary backpropagation after sampling $\epsilon$. Importantly, the expectation is now over $p(\epsilon)$, a fixed distribution independent of $\phi$.
Given $L$ independent samples $\epsilon^{(1)}, \ldots, \epsilon^{(L)} \sim \mathcal{N}(0,I)$, the Monte Carlo estimator is
$$\hat{g} = \frac{1}{L}\sum_{l=1}^{L}\nabla_{\phi}\, f\!\left(\mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon^{(l)}\right).$$
Because each term inside the average is a single-sample unbiased estimate of the expectation on the right-hand side of the theorem, linearity of expectation gives
$$\mathbb{E}[\hat g] = \nabla_{\phi}\,\mathbb{E}_{q_{\phi}(z|x)}[f(z)].$$
So the estimator is unbiased: averaging over repeated draws of $\epsilon$, it recovers the exact gradient of the expected objective.
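The estimator $\hat g$ can be written out by hand for a case where the true gradient is known. The sketch below assumes a one-dimensional encoder whose "parameters" are simply $\mu$ and $\sigma$, and the illustrative test function $f(z)=z^2$, for which $\mathbb{E}_q[f]=\mu^2+\sigma^2$ and the true gradients are $2\mu$ and $2\sigma$. The chain rule through $z=\mu+\sigma\epsilon$ is applied manually in place of autodiff.

```python
import random
import statistics

def f_prime(z):
    # Derivative of the illustrative test function f(z) = z^2.
    return 2.0 * z

def pathwise_grad(mu, sigma, num_samples, rng):
    # Monte Carlo average of grad f(mu + sigma * eps) with respect to (mu, sigma).
    g_mu, g_sigma = 0.0, 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)      # parameter-free noise
        z = mu + sigma * eps           # reparameterized sample
        g_mu += f_prime(z) * 1.0       # chain rule with dz/dmu = 1
        g_sigma += f_prime(z) * eps    # chain rule with dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

rng = random.Random(0)
mu, sigma = 1.0, 0.5

# Repeat the L = 1 estimator many times and average, to examine its expectation.
estimates = [pathwise_grad(mu, sigma, 1, rng) for _ in range(50_000)]
mean_g_mu = statistics.fmean([g[0] for g in estimates])
mean_g_sigma = statistics.fmean([g[1] for g in estimates])

print(f"mean estimate wrt mu:    {mean_g_mu:.3f} (true {2 * mu})")
print(f"mean estimate wrt sigma: {mean_g_sigma:.3f} (true {2 * sigma})")
```

Individual `L = 1` estimates still fluctuate; it is only their average that matches the true gradient, which is what unbiasedness promises.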
There are a few assumptions hiding inside this clean statement. First, $f$ must be differentiable with respect to $z$, and the transformation from $\phi$ to $z$ must also be differentiable. This is why the standard Gaussian VAE encoder, parameterized by $\mu_\phi(x)$ and $\log \sigma_\phi^2(x)$, is so convenient. Second, the support of the distribution should not change in a pathological way as $\phi$ changes; otherwise, differentiating under the integral can become delicate. Third, the estimator is unbiased for the gradient of the reparameterized expectation, but finite-sample estimates still have variance. Unbiased does not mean noiseless.
The reason this estimator is so effective is not only that it is unbiased, but that it tends to have low variance. Compare it to the score-function or REINFORCE estimator:
$$\hat g_{\text{RF}} = f(z)\,\nabla_\phi \log q_\phi(z\mid x).$$
REINFORCE is also unbiased and applies more generally, including to discrete random variables. But it usually has much higher variance because it does not exploit the local derivative $\nabla_z f(z)$. It treats $f(z)$ more like a black-box reward. The reparameterized estimator, by contrast, uses the geometry of $f$: gradients flow through the sampled latent value back into $\mu_\phi(x)$ and $\sigma_\phi(x)$.
This explains a practical fact about VAEs: we often use $L=1$ latent sample per datapoint during training. At first this may seem surprisingly crude, but minibatch stochastic optimization already averages gradients across many datapoints. The total gradient noise comes from both minibatch sampling and latent-variable sampling; in practice, the low variance of the pathwise gradient makes a single latent draw per example sufficient for stable learning.
The visual below condenses this theorem into two linked ideas. The top part emphasizes the algebraic identity: a gradient of an expectation under $q_\phi(z\mid x)$ can be rewritten as an expectation over fixed noise $\epsilon$, with gradients traveling through the deterministic map $\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon$. The key annotation is that $\epsilon$ is independent of $\phi$, which is precisely what makes ordinary backpropagation valid.
The bottom part gives the statistical intuition. Both REINFORCE and the reparameterized estimator are centered on the true gradient, so both are unbiased. But the REINFORCE distribution is much wider, representing high-variance gradient estimates, while the reparameterized estimator concentrates much more tightly around the true value. That difference in variance is what turns the theorem from a formal identity into a practical training method for VAEs.
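The variance contrast can be measured on the toy decoder from the worked example. The sketch below takes $f(z)=\log p_\theta(x\mid z)$ with $p_\theta(x\mid z)=\mathcal{N}(z,0.1)$ and $x=1$, and estimates the gradient with respect to the encoder mean $\mu$ using both estimators on the same noise draws. The encoder values `mu = 0.5` and `sigma = 0.5` are arbitrary assumptions, and the size of the variance gap depends on that choice, so the numbers are illustrative rather than general.

```python
import math
import random
import statistics

NOISE_VAR = 0.1
x = 1.0

def log_lik(z):
    # f(z) = log N(x; z, NOISE_VAR), the toy decoder log-likelihood.
    return -0.5 * math.log(2 * math.pi * NOISE_VAR) - (x - z)**2 / (2 * NOISE_VAR)

def d_log_lik_dz(z):
    # Local derivative of f with respect to z, used by the pathwise estimator.
    return (x - z) / NOISE_VAR

rng = random.Random(0)
mu, sigma = 0.5, 0.5  # arbitrary encoder outputs for illustration
score_draws, path_draws = [], []
for _ in range(100_000):
    eps = rng.gauss(0.0, 1.0)
    z = mu + sigma * eps
    score_draws.append(log_lik(z) * (z - mu) / sigma**2)  # REINFORCE: f(z) * dlog q/dmu
    path_draws.append(d_log_lik_dz(z) * 1.0)              # pathwise: f'(z) * dz/dmu

true_grad = (x - mu) / NOISE_VAR  # d/dmu of E_q[f], available in closed form here
score_mean, path_mean = statistics.fmean(score_draws), statistics.fmean(path_draws)
score_var, path_var = statistics.pvariance(score_draws), statistics.pvariance(path_draws)

print(f"true gradient: {true_grad:.3f}")
print(f"REINFORCE: mean {score_mean:.3f}, variance {score_var:.1f}")
print(f"pathwise:  mean {path_mean:.3f}, variance {path_var:.1f}")
```

Both sample means should sit near the true gradient, while the REINFORCE variance should come out several times larger than the pathwise variance in this configuration.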

15. Proof: Reparameterized Gradient via Change of Variables

Now that we know the reparameterized gradient estimator is unbiased, it is worth slowing down and asking why the identity is true. The key point is not mysterious: we are simply rewriting a $\phi$-dependent distribution as a deterministic transformation of noise drawn from a fixed distribution. Once the randomness no longer depends on $\phi$, differentiation becomes an ordinary backpropagation problem through a deterministic computation graph.
Start with the quantity we care about. For a fixed datapoint $x$, suppose the encoder distribution is a diagonal Gaussian,
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^{2}(x))\right),$$
and suppose $f(z)$ is some downstream scalar objective term, such as a decoder log-likelihood contribution. We want the gradient
$$\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)].$$
The difficulty is that the distribution inside the expectation depends on $\phi$. If we sample $z \sim q_{\phi}(z \mid x)$, the sample itself changes as the encoder parameters change, but this dependence is not explicit in the notation. Naively differentiating through a random draw is not well-defined in the usual computational graph sense.
Writing the expectation as an integral makes the dependency explicit:
$$\mathbb{E}_{q_{\phi}(z|x)}[f(z)] = \int f(z)\, q_{\phi}(z \mid x)\, dz.$$
Here, $\phi$ appears in the density $q_{\phi}(z \mid x)$. A score-function estimator would differentiate the density directly, producing terms involving $\nabla_{\phi}\log q_{\phi}(z \mid x)$. That approach is general, but often high-variance. The reparameterization trick instead asks whether we can move the $\phi$-dependence out of the density and into the argument of $f$.
For a diagonal Gaussian, we can. Let
$$\epsilon \sim p(\epsilon) = \mathcal{N}(0,I),$$
and define
$$\hat z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon.$$
This deterministic transformation maps standard normal noise into a sample from $q_{\phi}(z \mid x)$. Componentwise,
$$z_k = \mu_{\phi,k}(x) + \sigma_{\phi,k}(x)\epsilon_k.$$
Assuming $\sigma_{\phi,k}(x) > 0$, this map is invertible in each coordinate:
$$\epsilon_k = \frac{z_k - \mu_{\phi,k}(x)}{\sigma_{\phi,k}(x)}.$$
The Jacobian is diagonal, with entries $\sigma_{\phi,k}(x)$, so the volume element transforms as
$$dz = \prod_{k=1}^{K}\sigma_{\phi,k}(x)\; d\epsilon.$$
This is the whole mathematical engine of the proof. Under the change of variables $z = \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon$, the original density-weighted measure becomes
$$q_{\phi}(z \mid x)\,dz = p(\epsilon)\,d\epsilon.$$
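For readers who want the intermediate algebra, the measure identity can be verified by substituting the inverse map into the diagonal Gaussian density. The display below is a sketch of that step; it uses only the definitions already given and the standard form of the univariate normal density.

```latex
\begin{aligned}
q_{\phi}(z \mid x)
&= \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_{\phi,k}(x)}
   \exp\!\left(-\frac{\bigl(z_k-\mu_{\phi,k}(x)\bigr)^2}{2\,\sigma_{\phi,k}(x)^2}\right) \\
&= \prod_{k=1}^{K} \frac{1}{\sigma_{\phi,k}(x)} \cdot
   \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\epsilon_k^2}{2}\right)
 = \frac{p(\epsilon)}{\prod_{k=1}^{K}\sigma_{\phi,k}(x)}, \\[4pt]
q_{\phi}(z \mid x)\,dz
&= \frac{p(\epsilon)}{\prod_{k=1}^{K}\sigma_{\phi,k}(x)}
   \cdot \prod_{k=1}^{K}\sigma_{\phi,k}(x)\,d\epsilon
 = p(\epsilon)\,d\epsilon .
\end{aligned}
```

The factor $\prod_k \sigma_{\phi,k}(x)$ from the Jacobian cancels the normalization introduced by the scale parameters, so no $\phi$-dependence survives in the measure.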
Therefore,
$$\int f(z)\, q_{\phi}(z|x)\, dz = \int f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right) p(\epsilon)\,d\epsilon.$$
Equivalently,
$$\mathbb{E}_{q_{\phi}(z|x)}[f(z)] = \mathbb{E}_{p(\epsilon)}\!\left[f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].$$
Notice what changed. The distribution $p(\epsilon)$ is now fixed: it does not depend on $\phi$. All dependence on $\phi$ has moved into the deterministic map
$$\epsilon \mapsto \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon.$$
That is why backpropagation becomes possible. We are no longer differentiating “through sampling” from a moving distribution; we are differentiating through a deterministic function of fixed external noise.
Under the usual regularity assumptions (smooth enough $f$, differentiable encoder outputs, and an integrable dominating function allowing us to exchange differentiation and integration), we can pass the gradient through the expectation:
$$
\nabla_{\phi}\,
\mathbb{E}_{p(\epsilon)}\!\left[f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right]
=
\mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi} f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].
$$
Combining the equality of expectations with this differentiation step gives the reparameterized gradient identity:
$$
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)]
=
\mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi}\, f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].
$$
This is exactly the statement that the Monte Carlo estimator
$$
\nabla_{\phi} f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon^{(s)}\right),
\qquad
\epsilon^{(s)} \sim \mathcal{N}(0,I),
$$
is unbiased for the desired gradient.
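The unbiasedness claim can be checked numerically in a case where the answer is known. The sketch below is an illustration only: it assumes a scalar latent and the test function $f(z)=z^2$ (neither comes from the text), for which $\mathbb{E}[z^2]=\mu^2+\sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The function name `reparam_grad_estimate` is hypothetical.

```python
import random

def reparam_grad_estimate(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo pathwise gradient of E_{z ~ N(mu, sigma^2)}[f(z)] for f(z) = z^2."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # fixed, parameter-free noise
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # f'(z) for the assumed f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

mu, sigma = 1.5, 0.5
est_mu, est_sigma = reparam_grad_estimate(mu, sigma)

# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradients are (2*mu, 2*sigma).
print(est_mu, 2 * mu)
print(est_sigma, 2 * sigma)
```

The two printed pairs should agree up to Monte Carlo error. In a real VAE the hand-written chain rule is replaced by automatic differentiation through the decoder.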
There are two subtle assumptions hiding here. First, the transformation must be valid: for the diagonal Gaussian case, this means the scale parameters are positive and the mapping between $\epsilon$ and $z$ is invertible almost everywhere. In practice, VAEs often parameterize $\log\sigma_{\phi}^{2}(x)$ or use a softplus transformation to guarantee positivity. Second, exchanging $\nabla_{\phi}$ and the integral requires mild analytic conditions. Neural networks used in practice are usually treated as satisfying these conditions almost everywhere, but this step is still a mathematical assumption, not magic.
The practical takeaway is compact:
Before reparameterization: randomness comes from $q_{\phi}(z \mid x)$, whose parameters depend on $\phi$.
After reparameterization: randomness comes from $p(\epsilon)$, which is fixed.
Optimization benefit: gradients flow through $\mu_{\phi}(x)$, $\sigma_{\phi}(x)$, and $f$ by ordinary backpropagation.
The visual summary below condenses the proof into its four logical moves: write the expectation as an integral, perform the Gaussian change of variables, observe that the transformed density becomes $p(\epsilon)$, and finally move the gradient through the expectation because $p(\epsilon)$ is independent of $\phi$.
The highlighted substitution is the pivotal step. Once $q_{\phi}(z \mid x)\,dz$ has been rewritten as $p(\epsilon)\,d\epsilon$, the proof is essentially finished: the parameter dependence has migrated from the probability measure into a differentiable deterministic path, which is precisely what makes the VAE encoder trainable with low-variance gradient estimates.

16. Gaussian Encoder: Closed-Form KL Divergence

Now that we have a pathwise gradient estimator for samples from the encoder, it is tempting to think that every part of the VAE objective needs Monte Carlo estimation. But one of the conveniences of the standard VAE is that this is not true. The reconstruction term usually requires samples $z \sim q_{\phi}(z \mid x)$, because it passes through a nonlinear decoder. The KL regularizer, however, has a closed-form expression when the encoder is Gaussian and the prior is standard normal.
Recall the per-example ELBO:
$$
\mathcal{L}(x;\theta,\phi)
=
\mathbb{E}_{q_{\phi}(z\mid x)}\left[\log p_{\theta}(x\mid z)\right]
-
D_{\mathrm{KL}}\!\left(q_{\phi}(z\mid x)\,\|\,p(z)\right).
$$
In the common diagonal Gaussian VAE, the encoder outputs two vectors:
$$
\mu_{\phi}(x) \in \mathbb{R}^K,
\qquad
\sigma_{\phi}(x) \in \mathbb{R}_{>0}^K,
$$
and defines
$$
q_{\phi}(z\mid x)
=
\mathcal{N}\left(\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}(x)^2)\right).
$$
The prior is usually chosen as
$$
p(z)=\mathcal{N}(0,I).
$$
This particular pairing is not accidental. A diagonal Gaussian compared against a standard Gaussian gives a KL term that decomposes dimension-by-dimension. That means the model can penalize each latent coordinate independently for moving its posterior mean away from $0$, or its posterior variance away from $1$.
Start from the definition:
$$
D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z))
=
\mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(z)\right].
$$
Because both distributions factorize across latent dimensions, their log-densities are sums over coordinates. The normalizers $-\tfrac{1}{2}\log(2\pi)$ appear once per coordinate in both log-densities and cancel exactly, so no additive constant is left over and we can write
$$
D_{\mathrm{KL}}
=
\mathbb{E}_{q}
\left[
\sum_{k=1}^{K}
\left(
-\frac{(z_k-\mu_k)^2}{2\sigma_k^2}
-\log\sigma_k
\right)
-
\sum_{k=1}^{K}
\left(
-\frac{z_k^2}{2}
\right)
\right].
$$
Here $\mu_k$ and $\sigma_k$ mean the $k$-th encoder outputs for the current datapoint $x$. The expectation is still with respect to $q_{\phi}(z\mid x)$, but now only simple Gaussian moments are needed. Under $z_k \sim \mathcal{N}(\mu_k,\sigma_k^2)$,
$$
\mathbb{E}_q[(z_k-\mu_k)^2] = \sigma_k^2,
\qquad
\mathbb{E}_q[z_k^2] = \mu_k^2+\sigma_k^2.
$$
Substituting these moments and simplifying gives the exact closed form:
$$
D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z))
=
\frac{1}{2}
\sum_{k=1}^{K}
\left(
\mu_k^2 + \sigma_k^2 - 1 - \log\sigma_k^2
\right).
$$
This expression is worth reading term by term. The $\mu_k^2$ penalty discourages the approximate posterior from shifting its mean far away from the prior mean $0$. The combination
$$
\sigma_k^2 - 1 - \log\sigma_k^2
$$
penalizes the posterior variance for deviating from the prior variance $1$. It is minimized at $\sigma_k^2=1$, where it equals $0$. Thus each latent coordinate pays no KL cost only when
$$
\mu_k=0,
\qquad
\sigma_k^2=1,
$$
meaning $q_{\phi}(z\mid x)$ matches the prior along that coordinate.
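For readers who want the intermediate algebra, the per-coordinate substitution can be spelled out as follows. This is a worked sketch using only the two Gaussian moment identities stated earlier; the shorthand $s=\sigma_k^2$ in the last line is introduced here for convenience.

```latex
\begin{aligned}
\mathbb{E}_q\!\left[-\frac{(z_k-\mu_k)^2}{2\sigma_k^2}\right] &= -\frac{1}{2},
\qquad
\mathbb{E}_q\!\left[\frac{z_k^2}{2}\right] = \frac{\mu_k^2+\sigma_k^2}{2}, \\
-\frac{1}{2} - \log\sigma_k + \frac{\mu_k^2+\sigma_k^2}{2}
&= \frac{1}{2}\left(\mu_k^2 + \sigma_k^2 - 1 - \log\sigma_k^2\right), \\
\frac{d}{ds}\left(s - 1 - \log s\right) &= 1 - \frac{1}{s} = 0
\iff s = 1 .
\end{aligned}
```

The second derivative $1/s^2$ is positive, so $s=1$ is the unique minimum of the variance term, which confirms that the penalty vanishes only at unit variance.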
A crucial practical point is that this KL term is fully differentiable and sampling-free. We do not need to draw $z$ to estimate it, and we do not need the reparameterization trick for this part of the objective. Gradients can flow directly through the encoder outputs $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$, or more commonly through $\mu_{\phi}(x)$ and $\log\sigma_{\phi}(x)^2$, since neural networks typically output log-variance for numerical stability.
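A minimal sketch of the closed-form KL in the log-variance parameterization is given below, together with a sampling-based estimate used purely as a sanity check. The function names and the example values of `mu` and `logvar` are illustrative assumptions, not part of any particular library.

```python
import math
import random

def kl_diag_gaussian_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over coordinates."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def kl_monte_carlo(mu, logvar, num_samples=100_000, seed=0):
    """Sampling-based estimate of the same KL; only here to check the formula."""
    rng = random.Random(seed)
    log_2pi = math.log(2 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, logvar):
            sigma = math.exp(0.5 * lv)
            z = m + sigma * rng.gauss(0.0, 1.0)
            log_q = -0.5 * log_2pi - 0.5 * lv - (z - m) ** 2 / (2 * sigma ** 2)
            log_p = -0.5 * log_2pi - 0.5 * z ** 2
            total += log_q - log_p
    return total / num_samples

# Hypothetical encoder outputs for one datapoint with K = 2.
mu = [0.3, -1.0]
logvar = [math.log(0.25), math.log(2.0)]

kl_exact = kl_diag_gaussian_to_std_normal(mu, logvar)
kl_mc = kl_monte_carlo(mu, logvar)
print(kl_exact, kl_mc)
```

The closed form needs no random draws and is exact, while the Monte Carlo value only approaches it as the sample count grows. That is the practical reason implementations prefer the analytic expression when it is available.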
This exactness also clarifies an important failure mode. Because the KL term pushes $q_{\phi}(z\mid x)$ toward $\mathcal{N}(0,I)$, a powerful decoder may learn to reconstruct well while ignoring $z$. In that case the encoder can choose $\mu_{\phi}(x)\approx 0$ and $\sigma_{\phi}(x)^2\approx 1$, making the KL nearly zero. This is one view of posterior collapse: the approximate posterior becomes almost identical to the prior, so the latent code carries little information about $x$.
There are a few assumptions hiding inside the neat formula:
The encoder covariance is diagonal, so the KL separates over coordinates.
The prior is exactly standard Gaussian, $p(z)=\mathcal{N}(0,I)$.
The variance parameters must remain positive, which is why implementations often parameterize $\log\sigma_k^2$ rather than $\sigma_k$ directly.
The formula is per datapoint; minibatch training averages or sums it across examples depending on the convention used for the loss.
The visual below condenses the derivation into the three essential moves: expand the KL definition, substitute the diagonal Gaussian log-densities, and apply the two Gaussian moment identities. The boxed expression is the quantity that appears directly in a VAE implementation.
It also separates the computational roles of the two ELBO terms. The KL side is analytic and deterministic; the reconstruction side still requires samples from $q_{\phi}(z\mid x)$, which is where the reparameterization trick from the previous section becomes necessary.

17. Gaussian Decoder: Reconstruction Term as MSE

Having handled the KL term for a Gaussian encoder, the remaining piece of the ELBO is the part that asks a simple but crucial question: if we sample a latent code from the encoder, how well can the decoder explain the original observation? This is the reconstruction term,
$$
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)].
$$
It is often implemented as a mean-squared error loss, but that is not merely a convenient engineering choice. Under a Gaussian decoder assumption, MSE falls directly out of the likelihood model.
Assume the decoder defines a conditional Gaussian distribution over observations:
$$
p_{\theta}(x|z) = \mathcal{N}(f_{\theta}(z), \sigma^2 I).
$$
Here $f_{\theta}(z)$ is the decoder network output, interpreted as the mean of the conditional distribution over $x$. The variance $\sigma^2 I$ says that, given $z$, each observed dimension is corrupted by independent isotropic Gaussian noise with the same variance $\sigma^2$. This is a modeling assumption: it is reasonable for continuous-valued data after suitable normalization, but it is not automatically appropriate for binary pixels, counts, categorical variables, or perceptually structured image data.
For one latent sample $z$, the Gaussian log-likelihood is
$$
\log p_{\theta}(x|z)
=
-\frac{D}{2}\log(2\pi\sigma^2)
-
\frac{1}{2\sigma^2}\|x - f_{\theta}(z)\|^2,
$$
where $D$ is the dimensionality of $x$. The first term is the Gaussian normalization constant. If $\sigma^2$ is fixed, it does not depend on the decoder parameters $\theta$. The second term is the squared reconstruction error, scaled by $-1/(2\sigma^2)$. Therefore, maximizing the Gaussian likelihood pushes $f_{\theta}(z)$ toward $x$ in squared-error distance.
Inside the VAE, however, $z$ is not a single deterministic code. It is drawn from the approximate posterior $q_{\phi}(z|x)$, so the ELBO uses the expected log-likelihood:
$$
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
=
-\frac{D}{2}\log(2\pi\sigma^2)
-
\frac{1}{2\sigma^2}
\mathbb{E}_{q_{\phi}(z|x)}\left[\|x - f_{\theta}(z)\|^2\right].
$$
Using the reparameterization trick, we usually write the sampled latent variable as
$$
\hat{z}
=
\mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,I),
$$
so the reconstruction term becomes an expectation over noise injected through a differentiable transformation. This is what lets gradients flow not only into the decoder parameters $\theta$, but also back into the encoder parameters $\phi$.
If $\sigma^2$ is treated as a fixed hyperparameter, then maximizing the reconstruction term with respect to $\theta$ is equivalent to minimizing
$$
\mathbb{E}_{p(\epsilon)}\left[\|x - f_{\theta}(\hat{z})\|^2\right],
\qquad
\hat{z} = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot\epsilon,
$$
up to a constant and a positive scaling factor. By the reparameterization identity, this is the same quantity as $\mathbb{E}_{q_{\phi}(z|x)}[\|x - f_{\theta}(z)\|^2]$. This is the precise sense in which the Gaussian decoder gives rise to expected MSE as the reconstruction loss. Strictly, the norm is a squared error summed over the $D$ dimensions; dividing by $D$ to obtain a mean changes only the scale. In practice, this expectation is usually approximated with one or a few Monte Carlo samples of $\epsilon$ per datapoint.
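The link between the Gaussian log-likelihood and squared error can be seen directly in a few lines. The sketch below assumes plain Python lists for `x` and for two hypothetical decoder outputs; the function name and the numbers are chosen only for illustration.

```python
import math

def gaussian_log_likelihood(x, mean, sigma2):
    """log N(x; mean, sigma2 * I) for lists x and mean of equal length D."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * sigma2) - sq_err / (2 * sigma2)

x = [0.2, -0.4, 1.0]
recon_a = [0.1, -0.5, 0.9]   # hypothetical decoder output, close to x
recon_b = [1.0, 0.5, -0.2]   # hypothetical decoder output, far from x
sigma2 = 0.5                 # fixed decoder variance (a hyperparameter here)

ll_a = gaussian_log_likelihood(x, recon_a, sigma2)
ll_b = gaussian_log_likelihood(x, recon_b, sigma2)
sse_a = sum((xi - mi) ** 2 for xi, mi in zip(x, recon_a))
sse_b = sum((xi - mi) ** 2 for xi, mi in zip(x, recon_b))

# With sigma2 fixed, the normalizer cancels in the difference,
# leaving only the scaled difference in squared error.
print(ll_a - ll_b, -(sse_a - sse_b) / (2 * sigma2))
```

The two printed numbers coincide, so ranking reconstructions by Gaussian log-likelihood with fixed variance is the same as ranking them by squared error.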
There are two subtle points worth keeping separate. First, the constant term
$$
-\frac{D}{2}\log(2\pi\sigma^2)
$$
can be dropped only when $\sigma^2$ is fixed. If $\sigma^2$ is learned, that term matters; otherwise the model could change its variance without paying the correct likelihood normalization cost. Second, although dropping constants is harmless for optimizing $\theta$, the reconstruction expectation still depends on $\phi$ through $\hat{z}$. So the encoder is trained not only to match the prior through the KL term, but also to choose latent distributions that allow accurate reconstructions.
The variance $\sigma^2$ also has an important practical interpretation. In the full ELBO,
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
-
\mathrm{KL}(q_{\phi}(z|x)\|p(z)),
$$
the reconstruction penalty is scaled by $1/(2\sigma^2)$. Smaller $\sigma^2$ makes reconstruction errors more expensive relative to the KL regularizer; larger $\sigma^2$ softens the reconstruction pressure and makes the KL term comparatively more influential. Thus $\sigma^2$ acts like a reconstruction-regularization trade-off knob, much like the weighting coefficient in a $\beta$-VAE, though it arises from the likelihood model itself.
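The comparison with a $\beta$-VAE can be made explicit with one line of algebra. As a sketch, assuming $\sigma^2$ is fixed, multiply the negative ELBO by $2\sigma^2$ and substitute the Gaussian expected log-likelihood:

```latex
-2\sigma^2\,\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z|x)}\!\left[\|x - f_{\theta}(z)\|^2\right]
+
2\sigma^2\,\mathrm{KL}(q_{\phi}(z|x)\,\|\,p(z))
+
D\sigma^2\log(2\pi\sigma^2).
```

The last term is constant when $\sigma^2$ is fixed, so minimizing an unscaled squared error plus a KL term weighted by $2\sigma^2$ has the same optimum. In that reading, $2\sigma^2$ plays the role of the KL weight.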
This likelihood choice also helps explain a classic VAE failure mode: blurry reconstructions. A Gaussian decoder trained with squared error is rewarded for predicting conditional means. When the posterior or decoder is uncertain among several plausible outputs, averaging them can produce visually smooth or blurry samples. This is not just an optimization artifact; it is partly a consequence of the assumed observation model. For binary data, a more appropriate decoder is often Bernoulli,
$$
p_{\theta}(x|z)
=
\prod_d \mathrm{Bernoulli}\left(x_d;\sigma(f_{\theta}(z)_d)\right),
$$
which leads to a binary cross-entropy reconstruction term rather than MSE. In this formula only, $\sigma(\cdot)$ denotes the logistic sigmoid applied to the decoder logits, not the noise scale $\sigma$ of the Gaussian decoder.
The visual below condenses this derivation into a chain: start from the Gaussian decoder, expand the log-likelihood, take the expectation under the encoder distribution, and then drop the $\theta$-constant terms to reveal the expected MSE objective. The key algebraic move is that the squared Euclidean error appears directly inside the Gaussian log-density.
It also highlights the modeling choices surrounding the derivation: the reparameterized sample $\hat{z}$ is what makes the expected reconstruction loss trainable by backpropagation, the Bernoulli alternative reminds us that MSE is not universal, and the note about $\sigma^2$ previews how the reconstruction term will be balanced against the KL divergence in the full VAE objective.

18. The Full VAE Objective: Putting It All Together

Now that the Gaussian decoder has turned the reconstruction term into something as familiar as squared error, we can finally assemble the pieces into the objective that is actually optimized in a VAE. The important point is that the VAE loss is not just reconstruction error with noise injected into the latent space. It is a variational lower bound with two coupled responsibilities: explain each datapoint well through the decoder, while keeping the encoder’s approximate posterior close enough to the prior that generation remains possible.
For a single datapoint $x$, the ELBO is
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
-
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right).
$$
The first term rewards latent samples $z$ that allow the decoder $p_\theta(x|z)$ to assign high probability to the observed datapoint. The second term penalizes the encoder distribution $q_\phi(z|x)$ for drifting too far from the prior $p(z)$, usually $p(z)=\mathcal{N}(0,I)$. This is what makes the latent space usable at generation time: after training, we want to sample $z\sim p(z)$, not from some fragmented collection of unrelated encoder distributions.
In practice, the expectation in the reconstruction term is estimated with samples. With the reparameterization trick, we write
$$
\hat{z}
=
\mu_\phi(x) + \sigma_\phi(x)\odot \epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$
This separates the randomness $\epsilon$ from the encoder parameters $\phi$. Instead of sampling $z$ from a distribution whose parameters depend on $\phi$ in a way that blocks ordinary backpropagation, we sample parameter-free noise and transform it differentiably. That is the key technical move that allows gradients from the decoder’s reconstruction likelihood to flow backward through $\hat{z}$, then into $\mu_\phi(x)$ and $\sigma_\phi(x)$.
Using one Monte Carlo sample $\hat{z}$, the per-datapoint stochastic ELBO estimate becomes
$$
\tilde{\mathcal{L}}(\theta,\phi;x)
=
\log p_\theta(x|\hat{z})
-
\frac{1}{2}
\sum_{k=1}^{K}
\left[
\mu_{\phi,k}(x)^2 + \sigma_{\phi,k}(x)^2 - 1 - \log \sigma_{\phi,k}(x)^2
\right].
$$
Here the KL term has been written in closed form for the common diagonal Gaussian encoder
$$
q_\phi(z|x)
=
\mathcal{N}\left(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2)\right),
$$
with standard normal prior $p(z)=\mathcal{N}(0,I)$. This analytic KL is one of the conveniences of the classical VAE setup. It avoids adding more sampling noise to the objective and gives a direct regularizing signal to the encoder.
A useful way to read the objective is as a controlled compromise:
Reconstruction term: make $\hat{z}$ informative enough that the decoder can recover $x$.
KL term: prevent $q_\phi(z|x)$ from becoming an arbitrary lookup table for each datapoint.
Reparameterization: make the sampled latent path differentiable with respect to $\phi$.
Closed-form KL: regularize the encoder without needing a Monte Carlo estimate.
This compromise is also where many VAE failure modes originate. If the KL penalty dominates too early, the encoder may learn $q_\phi(z|x)\approx p(z)$ for all $x$, causing posterior collapse: the latent variable carries little information, and the decoder behaves almost like an unconditional model. If the reconstruction term dominates, the model may encode datapoints very precisely but produce a latent space that does not match the prior, making prior samples poor. With Gaussian decoders, another common issue is blurry reconstructions, because maximizing a pixelwise Gaussian likelihood often encourages conditional means rather than sharp, multimodal outputs.
For a dataset $\{x^{(n)}\}_{n=1}^N$, training maximizes the sum of per-datapoint estimates:
$$
\mathcal{L}(\theta,\phi)
=
\sum_{n=1}^{N}
\tilde{\mathcal{L}}(\theta,\phi;x^{(n)}).
$$
Since full-dataset optimization is usually impractical, we use minibatches. For a batch of size $B$, an unbiased estimate of the full objective is
$$
\mathcal{L}_{\text{batch}}
=
\frac{N}{B}
\sum_{b=1}^{B}
\tilde{\mathcal{L}}(\theta,\phi;x^{(b)}).
$$
Many implementations omit the factor $N/B$ when optimizing an average loss, because it only rescales the gradient for a fixed dataset and can be absorbed into the learning rate. What matters conceptually is that the minibatch objective estimates the same dataset-level ELBO. Also note the sign convention: derivations usually maximize the ELBO, while code often minimizes the negative ELBO, sometimes reported as reconstruction loss plus KL penalty.
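The role of the $N/B$ factor can be illustrated with a small simulation. The sketch below uses made-up numbers as stand-ins for per-datapoint ELBO estimates and assumes minibatches are drawn uniformly without replacement; `minibatch_estimate` is a hypothetical helper, not a library function.

```python
import random

def minibatch_estimate(values, batch_size, rng):
    """(N / B) * sum over a uniformly sampled minibatch (without replacement)."""
    n = len(values)
    batch = rng.sample(values, batch_size)
    return (n / batch_size) * sum(batch)

rng = random.Random(0)

# Stand-ins for per-datapoint ELBO estimates; the values are invented for illustration.
per_example = [-rng.uniform(80.0, 120.0) for _ in range(1000)]
full_sum = sum(per_example)

# Averaging many scaled minibatch sums should recover the full-dataset sum.
num_trials = 5000
avg_estimate = sum(
    minibatch_estimate(per_example, 32, rng) for _ in range(num_trials)
) / num_trials

print(full_sum, avg_estimate)
```

Dropping the factor and averaging over the batch instead would estimate `full_sum / N`, which differs only by the constant $N$. This is why the two conventions lead to the same optimum with a rescaled learning rate.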
The gradient dependencies are asymmetric but tightly coordinated. The decoder parameters $\theta$ appear only inside the likelihood term $\log p_\theta(x|\hat{z})$, so $\nabla_\theta$ is ordinary decoder backpropagation. The encoder parameters $\phi$, however, receive two kinds of signal: an analytic KL gradient that pushes $\mu_\phi(x)$ toward zero and $\sigma_\phi(x)$ toward one, and a reconstruction gradient that flows through the reparameterized latent sample $\hat{z}$. This is why the full VAE objective trains encoder and decoder jointly rather than treating inference and generation as separate procedures.
The visual below condenses this entire training objective into its two main colored pathways. The blue reconstruction path represents the stochastic, reparameterized route from encoder outputs through $\hat{z}$ into the decoder likelihood. The red KL path represents the closed-form regularizer acting directly on the encoder’s Gaussian parameters.
It also separates the three levels at which the same idea appears: the per-datapoint ELBO estimate, the full-dataset sum, and the minibatch estimator used in SGD. That hierarchy is worth keeping in mind before moving to the training algorithm: implementation is mostly bookkeeping, but the bookkeeping must preserve these two terms and their gradient paths.

19. Algorithm: VAE Training (Minibatch SGD)

Now that the ELBO has been assembled into a reconstruction term minus a regularization term, the remaining question is operational: what exactly happens during one training iteration? A VAE can look conceptually complicated because it contains an encoder, a decoder, a latent random variable, and a variational objective. But the actual training loop is quite close to ordinary minibatch SGD once we express the stochastic latent sample in a differentiable way.
For each datapoint $x^{(b)}$ in a minibatch, the encoder produces the parameters of an approximate posterior distribution,
$$
q_{\phi}(z \mid x^{(b)}) = \mathcal{N}\!\left(
z;\mu_{\phi}(x^{(b)}), \operatorname{diag}(\sigma_{\phi}(x^{(b)})^2)
\right).
$$
The diagonal Gaussian assumption is doing a lot of work here. It makes sampling cheap, makes the KL divergence against the standard normal prior analytic, and gives us a simple parameterization for uncertainty in each latent dimension. In implementations, the network often predicts $\log \sigma^2$, or logvar, rather than $\sigma$ directly, because the variance must remain positive and because log-variance is numerically more stable.
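Two common ways to turn an unconstrained network output into a positive scale are sketched below: exponentiating half the log-variance, and applying a softplus. The function names are illustrative, and the softplus is written in a numerically stable form that is an implementation choice of this sketch.

```python
import math

def sigma_from_logvar(logvar):
    """sigma = exp(0.5 * logvar); positive for any real logvar in exact arithmetic."""
    return math.exp(0.5 * logvar)

def softplus(raw):
    """Stable log(1 + exp(raw)); avoids overflow for large positive raw."""
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw)))

# Unconstrained values such as a network might output.
for raw in (-5.0, 0.0, 3.0):
    print(raw, sigma_from_logvar(raw), softplus(raw))
```

Both maps accept any real input, so the encoder's final layer needs no constraint. In floating point, extremely negative inputs can still underflow to zero, which is one reason some implementations clamp the log-variance to a bounded range.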
The central step is the reparameterized forward pass. Instead of sampling directly from $q_{\phi}(z \mid x)$, we sample parameter-free noise
$$
\epsilon^{(b)} \sim \mathcal{N}(0,I),
$$
and construct
$$
\hat{z}^{(b)}
=
\mu_{\phi}(x^{(b)}) + \sigma_{\phi}(x^{(b)}) \odot \epsilon^{(b)}.
$$
This turns the random draw into a deterministic differentiable function of $\mu_{\phi}$, $\sigma_{\phi}$, and external noise $\epsilon$. That distinction is essential: gradients cannot flow through “sample from this distribution” in the usual backpropagation sense, but they can flow through addition, multiplication, and neural network outputs. The randomness remains, but it has been isolated from the parameters.
Once $\hat{z}^{(b)}$ is sampled, the decoder evaluates the likelihood of reconstructing $x^{(b)}$ from that latent code:
$$
\log p_{\theta}(x^{(b)} \mid \hat{z}^{(b)}).
$$
This term depends on the chosen observation model. For binarized MNIST, it is usually a Bernoulli log-likelihood. For continuous data, one often uses a Gaussian likelihood, sometimes with fixed variance. This modeling choice matters: a simple Gaussian decoder trained with squared-error-like losses often encourages averaged reconstructions, which is one reason VAEs can produce blurry samples compared with adversarial models.
The KL term is computed analytically for each example because both $q_{\phi}(z \mid x)$ and the prior $p(z)=\mathcal{N}(0,I)$ are Gaussian. For a $K$-dimensional diagonal posterior,
$$
\mathrm{KL}^{(b)}
=
\frac{1}{2}
\sum_{k=1}^{K}
\left[
\mu_{\phi,k}(x^{(b)})^2 + \sigma_{\phi,k}(x^{(b)})^2 - 1 - \log \sigma_{\phi,k}(x^{(b)})^2
\right].
$$
This term penalizes approximate posteriors that drift too far from the prior. Intuitively, it asks the encoder to use latent space economically: encode information only when it improves reconstruction enough to justify the cost. That tradeoff is powerful, but it also creates one of the most important VAE failure modes: posterior collapse. If the decoder is expressive enough to model the data without using $z$, optimization may drive $q_{\phi}(z \mid x)$ close to $p(z)$, making the latent code nearly uninformative.
Putting the two pieces together, a single-sample Monte Carlo estimate of the per-example ELBO is
$$
\tilde{\mathcal{L}}^{(b)}
=
\log p_{\theta}\!\left(x^{(b)} \mid \hat{z}^{(b)}\right)
-
\mathrm{KL}^{(b)}.
$$
Using one latent sample per datapoint is common because the minibatch itself already provides stochasticity, and the reparameterized gradient estimator usually has manageable variance. More samples can reduce variance, but they also increase computation; in standard VAE training, one sample is often a good tradeoff.
For a minibatch of size $B$, the objective is typically written as an unbiased estimate of the full-data ELBO:
$$
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
The factor $N/B$ appears if we are estimating the sum of ELBOs over the dataset. Many software implementations instead optimize the minibatch mean and omit $N$; this changes the scale of the loss and therefore the effective learning rate, but not the location of the optimum. The sign convention is another common source of bugs: mathematically we maximize the ELBO, while most deep learning libraries minimize losses, so implementations often minimize
$$
-\frac{1}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
Both parameter sets are updated from the same scalar objective. The decoder parameters $\theta$ receive gradients through the reconstruction likelihood. The encoder parameters $\phi$ receive gradients through two paths: directly through the analytic KL term, and indirectly through $\hat{z}^{(b)}$ into the reconstruction term because of the reparameterization trick. In gradient-ascent form,
$$
\theta
\leftarrow
\theta + \alpha \nabla_{\theta}
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)},
\qquad
\phi
\leftarrow
\phi + \alpha \nabla_{\phi}
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
In practice, these are simultaneous optimizer updates, usually performed by Adam or a related adaptive method.
The key algorithmic pattern is therefore:
Encode $x$ into $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$.
Sample $\epsilon$, then form $\hat{z}$ using the reparameterization trick.
Decode $\hat{z}$ and evaluate the reconstruction log-likelihood.
Compute the KL term analytically.
Optimize the reconstruction-minus-KL objective by backpropagating through both networks.
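The first four steps of this pattern can be sketched as a forward pass in plain Python. Everything model-specific here is a placeholder assumption: the tiny linear `encode` and `decode` functions, the one-dimensional latent, the unit-variance Gaussian decoder, and the parameter values. The fifth step, the gradient update, is left to an automatic differentiation framework and is not shown.

```python
import math
import random

def encode(x, enc):
    """Hypothetical linear encoder: returns (mu, logvar) for a 1-D latent."""
    s = sum(x)
    return enc["w_mu"] * s + enc["b_mu"], enc["w_lv"] * s + enc["b_lv"]

def decode(z, dec):
    """Hypothetical linear decoder: mean of a unit-variance Gaussian over x."""
    return [w * z + b for w, b in zip(dec["w"], dec["b"])]

def negative_elbo_batch(batch, enc, dec, rng):
    """Mean single-sample negative ELBO over a minibatch."""
    total = 0.0
    for x in batch:
        mu, logvar = encode(x, enc)                    # 1. encode
        eps = rng.gauss(0.0, 1.0)                      # 2. sample noise ...
        z_hat = mu + math.exp(0.5 * logvar) * eps      #    ... and reparameterize
        mean = decode(z_hat, dec)                      # 3. decode
        sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
        log_lik = -0.5 * len(x) * math.log(2 * math.pi) - 0.5 * sq_err
        kl = 0.5 * (mu * mu + math.exp(logvar) - 1.0 - logvar)  # 4. analytic KL
        total += -(log_lik - kl)                       # negative ELBO for this example
    return total / len(batch)

rng = random.Random(0)
enc = {"w_mu": 0.1, "b_mu": 0.0, "w_lv": 0.0, "b_lv": -1.0}
dec = {"w": [1.0, -0.5, 0.3], "b": [0.0, 0.1, -0.1]}
batch = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.3]]

loss = negative_elbo_batch(batch, enc, dec, rng)
print(loss)
```

The function returns the minibatch mean of the negative ELBO, matching the minimization convention described above. A training loop would differentiate `loss` with respect to both parameter sets and apply an optimizer step.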
The visual that follows compresses this whole training iteration into a pseudocode-style view. The highlighted latent-sampling line is the computational heart of the VAE: it is what converts a stochastic latent-variable model into something compatible with backpropagation. The highlighted KL line emphasizes the other major simplification: for the standard Gaussian prior and diagonal Gaussian encoder, this part of the ELBO does not need Monte Carlo estimation.
The update lines at the bottom summarize the most important implementation detail: $\theta$ and $\phi$ are optimized together using the same minibatch ELBO estimate. The gradient notes make explicit where each derivative flows: $\nabla_{\theta}$ through the decoder likelihood, and $\nabla_{\phi}$ through both the analytic KL and the reparameterized reconstruction path.

20. Worked Example: VAE on Binarized MNIST

After the training loop is written down, the VAE can still feel slightly abstract: we say “encode, sample, decode, add KL, backpropagate,” but it is worth seeing what those words mean for one concrete datapoint. Let us trace a single minibatch element: a binarized MNIST digit, say a handwritten $3$, represented as $x \in \{0,1\}^{784}$. The pixels are binary, so the natural decoder likelihood is a product of Bernoulli distributions over pixels, with the decoder producing logits or probabilities for each of the 784 coordinates.
For this worked example, suppose the latent space is only two-dimensional, $K=2$. This is much smaller than we would typically use for high-quality generation, but it is ideal for understanding the mechanics. The encoder network $g_{\phi}$ maps the image to the parameters of a diagonal Gaussian approximate posterior,
$$
q_{\phi}(z \mid x)
=
\mathcal{N}\!\left(
z;\ \mu_{\phi}(x),\ \operatorname{diag}(\sigma_{\phi}^2(x))
\right).
$$
For our particular image, assume the encoder outputs
$$
\mu_{\phi}(x) = [0.8,\ -1.2],
\qquad
\sigma_{\phi}(x) = [0.5,\ 0.9].
$$
These numbers have a simple interpretation. The encoder believes that, for this digit, plausible latent codes are centered near $(0.8,-1.2)$, with less uncertainty in the first coordinate than the second. The diagonal covariance assumption means the two latent dimensions are conditionally independent under $q_{\phi}(z \mid x)$, which makes both sampling and the KL term cheap. This assumption is computationally convenient, but it is also restrictive: if the true posterior has strong correlations between latent dimensions, a diagonal Gaussian encoder cannot represent them directly.
Now we need a latent sample. A naive expression would be
$$\hat{z} \sim q_{\phi}(z \mid x),$$
but this hides a problem: the sampling operation depends on $\phi$, and we want gradients to flow through the sampled latent code into the encoder. The reparameterization trick rewrites the sample as a deterministic differentiable function of $\mu_{\phi}(x)$, $\sigma_{\phi}(x)$, and external noise $\epsilon$ that does not depend on $\phi$:
$$\epsilon \sim \mathcal{N}(0,I_2), \qquad \hat{z} = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon.$$
If, for this forward pass, we draw
$$\epsilon = [0.3,\ -0.7],$$
then the latent sample is
$$\hat{z} = [0.8,\ -1.2] + [0.5,\ 0.9] \odot [0.3,\ -0.7] = [0.95,\ -1.83].$$
The important point is not just the numerical value. It is the computational path: $\hat{z}$ is now differentiable with respect to both $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$. The randomness has been isolated in $\epsilon$, while the dependence on encoder parameters remains inside ordinary differentiable arithmetic.
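The arithmetic of this step can be reproduced in a few lines. The following is a minimal sketch in plain Python, not the article's implementation; the function name `reparameterize` is an illustrative assumption, and the derivative lists simply spell out the elementwise partials of $\hat{z}=\mu+\sigma\odot\epsilon$.

```python
import random

def reparameterize(mu, sigma, eps):
    """Return z = mu + sigma * eps, computed elementwise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

mu = [0.8, -1.2]      # encoder means from the worked example
sigma = [0.5, 0.9]    # encoder standard deviations from the worked example
eps = [0.3, -0.7]     # the noise draw used in the text

z_hat = reparameterize(mu, sigma, eps)
print([round(v, 2) for v in z_hat])  # [0.95, -1.83]

# Elementwise partial derivatives of z with respect to the encoder outputs:
# dz_k/dmu_k = 1 and dz_k/dsigma_k = eps_k. The noise itself has no
# dependence on the encoder parameters.
dz_dmu = [1.0 for _ in mu]
dz_dsigma = list(eps)

# During training the noise would be drawn fresh on every forward pass.
rng = random.Random(0)  # fixed seed only to make this sketch repeatable
fresh_eps = [rng.gauss(0.0, 1.0) for _ in mu]
fresh_z = reparameterize(mu, sigma, fresh_eps)
```

Because `eps` enters only as a multiplier, an automatic-differentiation framework can treat it as a constant input and backpropagate through `mu` and `sigma` as usual.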
The decoder $f_{\theta}$ then maps $\hat{z}$ back to Bernoulli parameters for the 784 pixels. For a binary image, the reconstruction log-likelihood has the form
$$\log p_{\theta}(x \mid \hat{z}) = \sum_{j=1}^{784} \left[ x_j \log \pi_{\theta,j}(\hat{z}) + (1-x_j)\log(1-\pi_{\theta,j}(\hat{z})) \right],$$
where $\pi_{\theta,j}(\hat{z})$ is the decoder’s predicted probability that pixel $j$ is on. Suppose this particular decoder evaluation gives
$$\log p_{\theta}(x \mid \hat{z}) = -89.2 \text{ nats}.$$
This term rewards reconstructions that assign high probability to the observed binary pixels. Since it is a log probability over 784 dimensions, a negative value is expected. What matters during optimization is whether changing $\theta$ and $\phi$ makes this quantity less negative on average while maintaining a reasonable latent posterior.
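To make the Bernoulli log-likelihood concrete, here is a small sketch on a hypothetical four-pixel image instead of 784 pixels. The pixel values, the probabilities, and the clamping constant `tiny` are illustrative assumptions and do not come from the worked example.

```python
import math

def bernoulli_log_likelihood(x, pi, tiny=1e-12):
    """Sum over pixels of x_j*log(pi_j) + (1 - x_j)*log(1 - pi_j), in nats."""
    total = 0.0
    for x_j, pi_j in zip(x, pi):
        p = min(max(pi_j, tiny), 1.0 - tiny)  # clamp to avoid log(0)
        total += x_j * math.log(p) + (1 - x_j) * math.log(1.0 - p)
    return total

x = [1, 0, 1, 1]             # hypothetical binary pixels
pi = [0.9, 0.2, 0.7, 0.6]    # hypothetical decoder probabilities

log_lik = bernoulli_log_likelihood(x, pi)
# Equals log(0.9 * 0.8 * 0.7 * 0.6): each pixel contributes the log of the
# probability the decoder assigned to the value that was actually observed.
print(round(log_lik, 3))
```

Every term is the log of a probability, so each one is at most zero; summing hundreds of them is what produces values on the order of the $-89.2$ nats quoted above.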
The regularizer is the KL divergence from the approximate posterior to the standard normal prior $p(z)=\mathcal{N}(0,I)$. Because both distributions are diagonal Gaussians, the KL has a closed form:
$$D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z)) = \frac{1}{2} \sum_{k=1}^{2} \left[ \mu_{\phi,k}^2 + \sigma_{\phi,k}^2 - 1 - \log\sigma_{\phi,k}^2 \right].$$
Plugging in the two latent dimensions gives
$$D_{\mathrm{KL}} = \frac{1}{2} \left[ (0.64+0.25-1+1.386) + (1.44+0.81-1+0.211) \right] = \frac{1}{2}[1.276+1.461] = 1.37.$$
This number measures how much the encoder’s posterior for this example deviates from the prior. Large means, very small variances, or very large variances all tend to increase the KL. Intuitively, the KL penalizes the encoder for using latent codes that are too specialized or too far from the standard normal geometry that the decoder will later sample from at generation time.
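The closed-form KL is easy to check numerically. This sketch evaluates the per-dimension terms for the encoder outputs used above; the function name `kl_per_dimension` is an assumption introduced for illustration.

```python
import math

def kl_per_dimension(mu, sigma):
    """KL(N(mu_k, sigma_k^2) || N(0, 1)) for each latent dimension, in nats."""
    return [0.5 * (m * m + s * s - 1.0 - math.log(s * s))
            for m, s in zip(mu, sigma)]

kl_dims = kl_per_dimension([0.8, -1.2], [0.5, 0.9])
kl_total = sum(kl_dims)
print(round(kl_total, 2))  # 1.37

# A posterior that matches the prior exactly costs nothing.
kl_prior_like = kl_per_dimension([0.0], [1.0])[0]
```

Keeping the terms separate per dimension, rather than summing immediately, is what later makes it possible to see which latent coordinates are actually carrying information.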
Combining the reconstruction term and the KL term gives the single-sample Monte Carlo estimate of the ELBO:
$$\tilde{\mathcal{L}}(\theta,\phi;x) = \underbrace{\log p_{\theta}(x\mid\hat{z})}_{-89.2} - \underbrace{D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))}_{1.37} = -90.57 \text{ nats}.$$
Training maximizes this quantity, or equivalently minimizes its negative. The gradients split cleanly. Parameters $\theta$ receive gradients through the decoder likelihood. Parameters $\phi$ receive gradients in two ways: directly through the analytic KL term, and indirectly through the reconstruction term via
$$\phi \longrightarrow (\mu_{\phi}(x),\sigma_{\phi}(x)) \longrightarrow \hat{z} \longrightarrow \log p_{\theta}(x\mid\hat{z}).$$
This is why the reparameterization trick is not a cosmetic rewrite; it is what makes the reconstruction part of the objective train the encoder using standard backpropagation.
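The direct KL path can be made explicit. Differentiating the closed-form KL gives $\partial D_{\mathrm{KL}}/\partial \mu_k = \mu_k$ and $\partial D_{\mathrm{KL}}/\partial \sigma_k = \sigma_k - 1/\sigma_k$. The sketch below checks these expressions against central finite differences and assembles the single-sample ELBO estimate. The helper names are illustrative assumptions, and the reconstruction value $-89.2$ is taken as given from the text rather than computed by a decoder.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL between a diagonal Gaussian and N(0, I), in nats."""
    return sum(0.5 * (m * m + s * s - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

def kl_gradients(mu, sigma):
    """Analytic partials: dKL/dmu_k = mu_k, dKL/dsigma_k = sigma_k - 1/sigma_k."""
    return list(mu), [s - 1.0 / s for s in sigma]

def finite_difference(fn, values, k, h=1e-6):
    """Central-difference estimate of d fn / d values[k]."""
    up = list(values)
    down = list(values)
    up[k] += h
    down[k] -= h
    return (fn(up) - fn(down)) / (2 * h)

mu, sigma = [0.8, -1.2], [0.5, 0.9]
d_mu, d_sigma = kl_gradients(mu, sigma)
num_d_mu = [finite_difference(lambda m: kl_to_standard_normal(m, sigma), mu, k)
            for k in range(len(mu))]
num_d_sigma = [finite_difference(lambda s: kl_to_standard_normal(mu, s), sigma, k)
               for k in range(len(sigma))]

# Single-sample ELBO estimate, using the reconstruction value quoted in the text.
elbo_estimate = -89.2 - kl_to_standard_normal(mu, sigma)
```

Both $\sigma$ partials are negative here because both standard deviations are below 1, so the KL term on its own would push them back up toward the prior's unit variance.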
There are already hints of common VAE failure modes in this tiny example. If the KL pressure is too strong, the encoder may learn $\mu_{\phi}(x)\approx 0$ and $\sigma_{\phi}(x)\approx 1$ for most inputs, making $q_{\phi}(z\mid x)$ nearly equal to the prior; then the decoder may ignore $z$, a phenomenon known as posterior collapse. Conversely, if the decoder likelihood is too simple or the latent representation is too compressed, reconstructions may average over plausible outputs, producing the characteristic blurriness associated with VAEs on continuous-valued images. Even in binarized MNIST, where Bernoulli likelihoods are relatively well matched to the data, the balance between reconstruction accuracy and latent regularity is the central tension.
The visual below condenses this entire forward pass into one computation graph: a binarized image enters the encoder, the encoder emits $\mu_{\phi}$ and $\sigma_{\phi}$, external Gaussian noise is injected through the reparameterization equation, and the resulting $\hat{z}$ is decoded into a Bernoulli reconstruction score. The KL calculation is separated because it does not require sampling; once $\mu_{\phi}$ and $\sigma_{\phi}$ are known, its value follows analytically.
It also highlights the gradient story. The decoder parameters $\theta$ are trained through the reconstruction likelihood, while the encoder parameters $\phi$ are trained both by the closed-form KL and by the decoder loss through $\hat{z}$. That combination (one stochastic but reparameterized reconstruction path plus one deterministic Gaussian KL path) is the practical core of VAE training.

21. Empirical Results: Learned Latent Space and Generation

After walking through a single numerical forward pass, the natural next question is: what does all this optimization actually buy us? A VAE is not trained merely to reconstruct individual inputs. It is trained to shape a latent space in which encoding, sampling, decoding, and interpolation all behave coherently. The empirical test is therefore twofold: whether the latent variables organize semantic information, and whether samples drawn from the prior decode into plausible observations.
A useful diagnostic is to look at the encoder means $\mu_{\phi}(x)$. Recall that the encoder produces an approximate posterior
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\right),$$
so $\mu_{\phi}(x)$ is the “central” latent representation assigned to $x$. If we train a VAE on MNIST with only two latent dimensions, we can scatter-plot $\mu_{\phi}(x^{(n)})$ for thousands of test images. This is not just a visualization trick: in $K=2$, the entire learned latent geometry is visible.
What we typically see is a semantically structured but overlapping latent space. Digits of the same class tend to occupy nearby regions, because the decoder can more easily reconstruct similar digits from nearby latent codes. But the clusters are not perfectly separated. That is expected, and in fact desirable: the VAE objective does not ask for a discriminative classifier. It balances reconstruction quality against proximity to the prior $p(z)=\mathcal{N}(0,I)$. The KL term discourages the encoder from carving the latent space into isolated islands far from the origin.
This is one of the central empirical signatures of VAEs: nearby latent points decode to nearby-looking outputs. If two codes $z_1$ and $z_2$ are close in latent space, then the decoder $f_{\theta}$ usually maps them to visually similar digits. This continuity comes from both the neural decoder and the regularization imposed by the prior. Without the KL penalty, an autoencoder can learn a jagged, disconnected latent representation where interpolation may pass through meaningless regions. With the VAE objective, the model is pressured to make the latent space usable under samples from a simple distribution.
Generation then becomes conceptually simple. We sample
$$z \sim p(z) = \mathcal{N}(0,I),$$
and decode it using $p_{\theta}(x \mid z)$, often parameterized by a neural network $f_{\theta}(z)$. For MNIST, the generated samples are usually recognizable as digits. However, they often look soft or blurry, especially compared with real handwritten digits. This is not merely an implementation flaw; it reflects a modeling assumption.
In the common Gaussian decoder setting,
$$p_{\theta}(x \mid z) = \mathcal{N}\!\left(x; f_{\theta}(z), \sigma^2 I\right),$$
maximizing likelihood encourages the decoder output $f_{\theta}(z)$ to approximate a conditional mean. If several plausible sharp images correspond to nearby latent explanations, the mean of those possibilities may average edges, strokes, and fine details. The result is a digit that is semantically correct but visually smoothed. This is one reason VAEs are often said to trade sample sharpness for a well-behaved likelihood objective and structured latent space.
Interpolation is another revealing experiment. Given two test images $x^{(a)}$ and $x^{(b)}$, we encode them to their posterior means and linearly interpolate:
$$\hat{z}(\lambda) = (1-\lambda)\,\mu_{\phi}(x^{(a)}) + \lambda\,\mu_{\phi}(x^{(b)}), \qquad \lambda \in [0,1].$$
Decoding $f_{\theta}(\hat{z}(\lambda))$ for increasing $\lambda$ often produces a smooth morph, such as a digit 3 gradually becoming an 8. This matters because it suggests that the model has learned more than memorized examples. It has learned a continuous latent manifold where semantic transformations correspond to smooth movement in $\mathcal{Z}$.
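The interpolation path itself needs no trained model to compute. The sketch below generates evenly spaced points between two posterior means; the second mean `mu_b` is a hypothetical value chosen for illustration, and the decoding step is omitted because it would require a trained decoder.

```python
def interpolate_latents(z_a, z_b, steps):
    """Return `steps` latent codes (1 - lam) * z_a + lam * z_b, lam from 0 to 1."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        lam = i / (steps - 1)
        path.append([(1.0 - lam) * a + lam * b for a, b in zip(z_a, z_b)])
    return path

mu_a = [0.8, -1.2]   # posterior mean from the earlier worked example
mu_b = [-1.0, 0.6]   # hypothetical posterior mean of a second image

path = interpolate_latents(mu_a, mu_b, steps=5)
midpoint = path[2]   # lam = 0.5, halfway between the two codes
```

Each entry of `path` would then be passed through the decoder, and the resulting images inspected in order.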
Quantitatively, we still judge the model through the ELBO. Kingma and Welling reported a test-set bound around
$$\mathcal{L}_{\text{test}} \approx -86.6 \text{ nats}$$
for a stronger VAE configuration with a 500-unit MLP and 20 latent dimensions. A two-dimensional latent space is excellent for visualization, but it is capacity-limited: many details needed to explain the data must be compressed away. With $K_{\text{latent}}=20$, the approximate posterior can retain more information while still being regularized toward the prior, typically yielding a better ELBO.
The main takeaways are:
Latent organization: encoder means form semantically meaningful regions.
Continuity: interpolations decode into smooth transformations.
Generative ability: prior samples decode into recognizable digits.
Trade-off: Gaussian likelihoods and ELBO optimization often produce blurry samples.
Capacity matters: higher-dimensional latent spaces usually improve likelihood, even if they are harder to visualize.
The visual below consolidates these empirical checks into one picture: a two-dimensional latent scatter to expose structure, generated samples to test the prior-to-decoder path, interpolation to test continuity, and an ELBO curve to connect the qualitative behavior back to the optimization objective.
It is important to read the panels together. A clean latent scatter alone does not guarantee good generation, and good-looking reconstructions alone do not guarantee a useful prior. The strength of the VAE is that these behaviors are coupled by the ELBO: reconstruction encourages informativeness, KL regularization encourages global structure, and the resulting model supports both sampling and smooth latent manipulation.

22. Failure Mode 1: Posterior Collapse

The latent traversals and generated samples we just looked at are the success case for VAEs: the encoder has learned a meaningful map from data $x$ into latent variables $z$, and the decoder has learned to use those variables to control semantic properties of the output. But this behavior is not guaranteed by the VAE objective. In fact, one of the most important VAE failure modes is that the model can optimize the ELBO while learning a latent space that carries almost no information at all.
Recall that for a single datapoint $x$, the ELBO is
$$\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))}_{\text{regularization}}.$$
This objective creates a negotiation between two incentives. The reconstruction term rewards the encoder for placing information about $x$ into $z$, because a useful $z$ helps the decoder assign high likelihood to the observed data. The KL term pushes in the opposite direction: it rewards the approximate posterior $q_{\phi}(z|x)$ for staying close to the prior $p(z)$, usually a standard Gaussian. If a latent dimension is not useful enough to improve reconstruction, the cheapest solution is to make that dimension look exactly like prior noise.
Posterior collapse occurs when this cheap solution becomes the dominant one. For some latent dimension $k$, collapse means
$$q_{\phi}(z_k|x) \approx p(z_k) = \mathcal{N}(0,1) \quad \Longrightarrow \quad D_{\mathrm{KL}}^{(k)} \approx 0.$$
When this happens, $z_k$ carries essentially no information about $x$. Sampling $z_k$ from the encoder is no different from sampling it from the prior. The encoder may still output a mean and variance, but if those parameters match the prior for every input, then the latent coordinate has become statistically independent of the data. In practice, one often observes the KL term quickly shrinking toward zero early in training, long before the decoder has learned a useful conditional generative model.
The most dangerous case arises when the decoder is powerful enough to model the data distribution without using $z$. Suppose the decoder becomes effectively $z$-independent:
$$p_{\theta}(x|z) = p_{\theta}(x) \quad \text{for all } z.$$
Then the reconstruction term no longer depends on the encoder distribution:
$$\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] = \log p_{\theta}(x).$$
At that point, there is no reconstruction benefit to encoding information in $z$. The only remaining pressure on the encoder is the KL penalty, and the KL is minimized by setting
$$q_{\phi}(z|x) = p(z).$$
This is why posterior collapse is not merely a transient optimization accident; it can be a stable fixed point of the ELBO. Once the decoder can explain $x$ on its own, the encoder is actively rewarded for becoming uninformative.
This failure mode is especially common with highly expressive decoders: autoregressive PixelCNN-style image decoders, strong Transformer decoders, or LSTM decoders for language. These models can often predict local structure from previously generated pixels or tokens so well that the global latent variable becomes unnecessary. The decoder learns a strong marginal model $p_{\theta}(x)$, while the intended conditional model $p_{\theta}(x|z)$ effectively ignores its conditioning signal.
There is a subtle but important distinction here. Posterior collapse does not necessarily mean the model has poor likelihood. A collapsed VAE may still assign decent likelihood to data if the decoder is strong enough. The failure is that the model no longer behaves like a useful latent-variable model. The latent space will not organize examples semantically, interpolations become meaningless, and downstream uses of $z$ (compression, representation learning, controllable generation) break down.
A useful diagnostic is to monitor the per-dimension KL terms during training. In a healthy VAE, some latent dimensions usually develop positive KL: the model is “paying” nats of KL to transmit information through $z$. In a collapsed model, many or all dimensions have KL near zero:
Healthy latent dimension: positive KL, carries information about $x$.
Collapsed latent dimension: KL $\approx 0$, behaves like prior noise.
Fully collapsed model: total KL $\approx 0$, decoder acts as a marginal model.
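This diagnostic can be sketched as follows: average the closed-form Gaussian KL over a batch, one latent dimension at a time, and flag dimensions whose average falls below a small threshold. The batch values and the threshold of 0.01 nats are illustrative assumptions, not recommended settings.

```python
import math

def mean_kl_per_dimension(mus, sigmas):
    """Batch-average of KL(N(mu_k, sigma_k^2) || N(0, 1)) for each dimension k."""
    n, k = len(mus), len(mus[0])
    totals = [0.0] * k
    for mu, sigma in zip(mus, sigmas):
        for d in range(k):
            var = sigma[d] ** 2
            totals[d] += 0.5 * (mu[d] ** 2 + var - 1.0 - math.log(var))
    return [t / n for t in totals]

def collapsed_dimensions(mean_kl, threshold=0.01):
    """Indices of dimensions whose average KL is below the (assumed) threshold."""
    return [d for d, value in enumerate(mean_kl) if value < threshold]

# Hypothetical encoder outputs for a batch of three inputs with K = 3.
mus = [[0.8, 0.01, -1.2], [-0.5, -0.02, 0.9], [1.1, 0.00, 0.3]]
sigmas = [[0.5, 1.00, 0.9], [0.6, 0.99, 0.7], [0.4, 1.01, 0.8]]

mean_kl = mean_kl_per_dimension(mus, sigmas)
print(collapsed_dimensions(mean_kl))  # [1]
```

In this made-up batch, dimension 1 has mean near 0 and standard deviation near 1 for every input, so its average KL is close to zero, while the other two dimensions vary with the input and keep a clearly positive KL.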
The visual below compactly summarizes both the training symptom and the mechanism. On the left, the key signal is the KL trajectory: a useful latent space requires the KL to rise above zero, while collapse appears as a flat line near zero. On the right, the information-flow picture emphasizes the cause: the path through $z$ is no longer used, so the encoder has no reason to encode input-specific information.
Read the diagram as a warning about the ELBO’s incentives. The KL term is not just a harmless regularizer; when the decoder can reconstruct or predict without $z$, the KL penalty can make “ignore the latent variable” the easiest optimum. This sets up the next failure mode as well: even when VAEs do use $z$, the probabilistic reconstruction objective can still produce overly smooth or blurry samples.

23. Failure Mode 2: Blurry Reconstructions

After posterior collapse, it is tempting to think the main danger in VAEs is that the latent variable becomes unused. But even when the latent code is used, VAEs often exhibit another characteristic failure mode: reconstructions look smooth, washed out, or “average-looking.” This is not merely an implementation artifact. It follows directly from the likelihood model we usually choose and from the way the ELBO trades reconstruction fidelity against latent regularity.
The standard VAE decoder is often written as a Gaussian likelihood,
$$p_{\theta}(x \mid z) = \mathcal{N}(x; f_{\theta}(z), \sigma^2 I),$$
where $f_{\theta}(z)$ is the decoder’s predicted image mean. If $\sigma^2$ is fixed, maximizing the reconstruction log-likelihood is equivalent, up to constants and scaling, to minimizing a pixel-wise squared error:
$$\mathbb{E}_{q_{\phi}(z|x)} \left[ \|x - f_{\theta}(z)\|^2 \right].$$
This is the key point: squared error rewards conditional averages. If there are multiple plausible sharp reconstructions compatible with the latent uncertainty, the prediction that minimizes expected squared error is not necessarily one of those sharp possibilities. It is their mean.
More precisely, under an MSE objective, the optimal point estimate is a conditional expectation: for a fixed latent code $z$, squared error is minimized by the mean of the inputs that are plausible given $z$. Seen from a single input $x$, whose reconstructions are averaged over samples from the approximate posterior $q_{\phi}(z \mid x)$, the effective reconstruction behaves like
$$\hat{x}_{\text{opt}} = \mathbb{E}_{q_{\phi}(z|x)}[f_{\theta}(z)].$$
This is harmless when the posterior mass corresponds to a narrow set of nearly identical decodings. But it becomes visually damaging when the plausible decodings are multimodal. For example, if a digit could plausibly have a stroke slightly to the left or slightly to the right, the pixel-wise average puts mass in both places weakly. The resulting image is not a realistic digit from either mode; it is a blurred compromise between them.
This is why VAE blurriness is often described as an “averaging” phenomenon. The decoder is not explicitly trying to blur images. Rather, the Gaussian likelihood says that deviations in pixel space are penalized quadratically, independently across pixels. Under this geometry, a soft gray pixel halfway between black and white can be preferable to committing to the wrong sharp value. The model is rewarded for being safe under uncertainty, not for producing a sample that lies on the manifold of visually crisp images.
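A two-pixel toy case makes the averaging effect concrete. Assume, purely for illustration, that two sharp images are equally plausible: a stroke in the left pixel or a stroke in the right pixel. The sketch compares the expected squared error of committing to one sharp image against predicting their pixel-wise mean.

```python
def expected_squared_error(prediction, targets):
    """Average over equally likely targets of the summed squared pixel error."""
    total = 0.0
    for target in targets:
        total += sum((t - p) ** 2 for t, p in zip(target, prediction))
    return total / len(targets)

stroke_left = [1.0, 0.0]    # hypothetical sharp image, stroke on the left
stroke_right = [0.0, 1.0]   # hypothetical sharp image, stroke on the right
targets = [stroke_left, stroke_right]

pixel_mean = [0.5, 0.5]     # the blurry compromise between the two

err_sharp = expected_squared_error(stroke_left, targets)  # 1.0
err_mean = expected_squared_error(pixel_mean, targets)    # 0.5
```

The gray prediction halves the expected error even though it matches neither plausible image, which is the behavior the squared-error geometry rewards.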
The KL term compounds the problem. Recall the single-example ELBO:
$$\mathcal{L}(\theta,\phi;x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\text{encourages reconstruction}} - \underbrace{D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))}_{\text{regularizes latent distribution}}.$$
The reconstruction term wants $q_{\phi}(z \mid x)$ to encode enough information to reconstruct $x$ sharply. The KL term pushes $q_{\phi}(z \mid x)$ toward the prior, typically $p(z)=\mathcal{N}(0,I)$. This pressure discourages highly concentrated, highly informative posteriors. In effect, it can broaden $q_{\phi}(z \mid x)$, increasing the range of latent samples that the decoder must handle for the same input.
So even before collapse, the model is caught between two forces:
Reconstruction pressure wants precise latent codes and sharp outputs.
KL regularization wants broad, prior-like posteriors and smooth latent space structure.
Gaussian/MSE decoding turns uncertainty over plausible outputs into pixel-wise averages.
This tradeoff is part of what makes VAEs attractive as generative models: their latent spaces are usually smooth, interpolatable, and mode-covering. But the price is that their samples and reconstructions can look less sharp than those from models trained with perceptual or adversarial criteria.
GANs fail in almost the opposite direction. A GAN discriminator does not ask whether each pixel is close to a target under squared error. It asks whether the generated sample looks like it came from the real data distribution. A blurry average of several plausible images is often easy for a discriminator to reject, so adversarial training strongly penalizes unrealistic smooth compromises. This is why GAN samples are frequently sharper. However, that sharpness comes with a different failure mode: mode dropping. The generator can learn to produce a subset of highly realistic samples while ignoring other regions of the data distribution.
This comparison is useful because it separates two notions that are often conflated: sharpness and coverage. VAEs tend to cover the data distribution more systematically because the likelihood objective assigns probability mass broadly and the latent prior encourages global organization. GANs tend to produce sharper samples because adversarial feedback rewards realism, but they may cover fewer modes. Neither objective is universally better; they encode different preferences.
Within the VAE framework, there are several ways to reduce blurriness, each changing the balance of this tradeoff. One can replace pixel-wise MSE with a perceptual loss, measuring reconstruction error in the feature space of a pretrained network rather than raw pixels. One can add an adversarial term, producing hybrid models such as VAE-GANs. Or one can reduce the decoder variance $\sigma^2$, effectively increasing the weight of reconstruction accuracy relative to the KL term. But these interventions are not free: they can destabilize training, weaken likelihood interpretation, or reduce the smoothness and coverage properties that motivated VAEs in the first place.
The visual below compresses this story into a side-by-side comparison. On the VAE side, a broad approximate posterior sends multiple latent samples through the decoder; averaging their plausible outputs produces a blurry mean. In the center, the loss comparison highlights the mechanism: Gaussian likelihood/MSE encourages mode coverage but tolerates visual averaging, while adversarial training penalizes unrealistic blur.
The GAN side then illustrates the complementary behavior: samples can be crisp because the discriminator enforces perceptual realism, but the generated examples may become repetitive, signaling missing modes. This is the central takeaway: VAE blurriness is not accidental; it is a consequence of averaging under latent uncertainty, amplified by KL regularization. GAN sharpness solves that visual problem by changing the objective, but introduces its own coverage failure mode.

24. Extension: Beta-VAE and Disentanglement

After seeing why VAEs can produce blurry reconstructions, it is tempting to ask whether we can tune the objective to prefer a different kind of representation. The standard VAE is already balancing two competing goals: reconstruct the input well, while keeping the approximate posterior close to a simple prior. Beta-VAE makes this trade-off explicit by introducing a single scalar knob, $\beta$, that controls how strongly we penalize information stored in the latent variable.
Recall the standard VAE objective for one data point $x$:
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)} \bigl[ \log p_{\theta}(x|z) \bigr] - D_{\mathrm{KL}} \!\bigl( q_{\phi}(z|x)\,\|\,p(z) \bigr).$$
The first term rewards good reconstruction. The second term regularizes the encoder distribution $q_{\phi}(z|x)$, usually pushing it toward a simple prior such as
$$p(z)=\mathcal{N}(0,I).$$
Beta-VAE modifies only one thing: it multiplies the KL term by $\beta$:
$$\mathcal{L}_{\beta}(\theta, \phi;\, x) = \mathbb{E}_{q_{\phi}(z|x)} [\log p_{\theta}(x|z)] - \beta\, D_{\mathrm{KL}} \!\bigl( q_{\phi}(z|x) \,\|\, p(z) \bigr), \qquad \beta \geq 1.$$
When $\beta=1$, this is exactly the usual VAE ELBO:
$$\beta = 1 \implies \mathcal{L}_{\beta} = \mathcal{L}(\theta,\phi;x).$$
For $\beta>1$, however, the objective is no longer the standard ELBO on $\log p_{\theta}(x)$. It is better understood as a rate–distortion trade-off: the negative reconstruction term plays the role of the distortion, while the KL term controls the rate, meaning how much information about $x$ can be transmitted through $z$. Increasing $\beta$ makes latent information more expensive.
This pressure can encourage disentanglement. The prior $p(z)=\mathcal{N}(0,I)$ is factorized:
$$p(z) = \prod_{k=1}^{d} \mathcal{N}(z_k;0,1).$$
So if the encoder is heavily penalized for deviating from this prior, it is encouraged to use latent dimensions in a way that remains close to independent, standardized Gaussian coordinates. Under the common diagonal Gaussian encoder,
$$q_{\phi}(z|x) = \mathcal{N} \bigl( \mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x)) \bigr),$$
the KL term decomposes across dimensions:
$$D_{\mathrm{KL}} \!\bigl( q_{\phi}(z|x)\,\|\,p(z) \bigr) = \frac{1}{2} \sum_{k=1}^{d} \left( \mu_{\phi,k}(x)^2 + \sigma_{\phi,k}(x)^2 - \log \sigma_{\phi,k}(x)^2 - 1 \right).$$
This expression makes the effect of $\beta$ concrete. Unless a latent dimension $z_k$ helps explain real variation in the data, the model is rewarded for keeping
$$\mu_{\phi,k}(x) \approx 0, \qquad \sigma_{\phi,k}(x) \approx 1.$$
In other words, unused or weakly useful dimensions are pushed back toward pure prior noise. Dimensions that remain active must “earn their keep” by improving reconstruction enough to compensate for the larger KL penalty.
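The "earn their keep" condition can be written as a simple inequality: using a latent dimension raises the objective only when the reconstruction gain exceeds $\beta$ times its KL cost. The sketch below applies the $\beta$-weighted objective to the numbers from the earlier worked example and then checks that inequality on hypothetical gain and cost values. Both function names and the gain and cost figures are illustrative assumptions.

```python
def beta_vae_objective(reconstruction_log_lik, kl, beta=1.0):
    """Reconstruction log-likelihood minus beta times the KL term, in nats."""
    return reconstruction_log_lik - beta * kl

def latent_dimension_pays_off(reconstruction_gain, kl_cost, beta):
    """True when the reconstruction gain outweighs the beta-weighted KL cost."""
    return reconstruction_gain > beta * kl_cost

recon, kl = -89.2, 1.37   # values from the earlier single-sample example
objectives = {beta: beta_vae_objective(recon, kl, beta)
              for beta in (1.0, 4.0, 10.0)}
# beta = 1 recovers the ELBO estimate of about -90.57; larger beta lowers the
# objective for the same encoder outputs (about -94.68 and -102.90).

# Hypothetical dimension: 2.0 nats of reconstruction gain for 0.7 nats of KL.
keeps_at_beta_1 = latent_dimension_pays_off(2.0, 0.7, beta=1.0)   # True
keeps_at_beta_4 = latent_dimension_pays_off(2.0, 0.7, beta=4.0)   # False
```

The same dimension that is worth using under the standard ELBO becomes a net loss at $\beta=4$, which is the mechanism by which larger $\beta$ prunes weakly useful coordinates.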
The intuitive hope is that each active latent coordinate will specialize in a distinct factor of variation: one dimension for shape, another for scale, another for rotation, another for lighting, and so on. Empirically, this often happens on controlled datasets such as dSprites or 3DShapes, where the underlying generative factors are relatively clean and independent. Values such as $\beta \in [4,10]$ can produce latents where traversing one coordinate changes one semantic factor while leaving others mostly fixed.
But this is not magic, and it is important not to overstate the guarantee. Disentanglement depends on several assumptions:
the prior is factorized, usually isotropic Gaussian;
the encoder often has a diagonal covariance structure;
the data actually contains relatively separable factors of variation;
the decoder architecture and training procedure do not hide information elsewhere;
unsupervised disentanglement is not identifiable in general without inductive biases.
So Beta-VAE is best understood as a useful bias, not a theorem that semantic factors will automatically emerge.
The cost is also direct: as $\beta$ increases, reconstruction quality usually decreases. The decoder receives less information about the particular input $x$, because the encoder is more constrained to keep $q_{\phi}(z|x)$ close to $p(z)$. This can worsen the same reconstruction problems we discussed earlier: less detail, more averaging, and potentially blurrier outputs. Beta-VAE therefore trades fidelity for structure.
A useful way to summarize the idea is to imagine $\beta$ as a dial. At $\beta=1$, we have the ordinary VAE objective: better reconstructions, but latent dimensions may be entangled. As $\beta$ grows, the KL term becomes more dominant, pushing the posterior closer to the independent Gaussian prior. The latent space may become more axis-aligned and interpretable, but the reconstructions become less faithful.
The visual below consolidates this trade-off: the left side represents the standard VAE regime, where latent coordinates can mix multiple factors; the center emphasizes the modified objective and the $\beta$-controlled tension between reconstruction and regularization; the right side represents the Beta-VAE regime, where dimensions are encouraged to separate factors such as shape and scale, but with degraded reconstruction quality as the price.

25. VAEs vs GANs vs Normalizing Flows

After seeing how $\beta$-VAE deliberately reshapes the latent space by changing the strength of the KL penalty, it is useful to step back and ask a broader question: what kind of generative model is a VAE, and what does it trade away compared with other families? VAEs are often introduced alongside GANs and normalizing flows because all three learn to generate samples from data, but they make very different compromises about likelihoods, latent variables, optimization, and sample quality.
The VAE starts from an explicitly probabilistic story. We assume a latent variable $z \sim p(z)$, usually $p(z)=\mathcal{N}(0,I)$, and a decoder likelihood $p_{\theta}(x|z)$. The marginal likelihood is
$$p_{\theta}(x)=\int p_{\theta}(x|z)p(z)\,dz,$$
but this integral is generally intractable for neural decoders. The VAE therefore optimizes the Evidence Lower Bound:
$$\log p_{\theta}(x) \geq \mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\|p(z)\right).$$
This objective is not merely a training trick. It encodes the core VAE design philosophy: learn to reconstruct data through a latent bottleneck while keeping the approximate posterior $q_{\phi}(z|x)$ close to a simple prior. That KL term is exactly what gives VAEs their unusually useful latent geometry. Nearby points in latent space tend to decode to semantically related outputs, interpolation is meaningful, and the latent representation can support downstream tasks such as clustering, semi-supervised learning, and controllable generation.
But the same structure also explains common VAE weaknesses. Because the decoder is trained through a likelihood term, the model is often rewarded for predicting averages under uncertainty. With Gaussian pixel likelihoods, this can produce blurry reconstructions: if multiple sharp images are plausible, the conditional mean may lie between them. And when the decoder is too powerful, the model may ignore $z$ entirely, producing posterior collapse, where $q_{\phi}(z|x)\approx p(z)$ and the latent code carries little information about $x$. In other words, the VAE’s principled probabilistic objective is also the source of its characteristic compromises.
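The lower-bound relationship is easiest to see in a model where the marginal likelihood is tractable. The sketch below assumes a one-dimensional linear-Gaussian toy model, which is not a model from this article: $p(z)=\mathcal{N}(0,1)$ and $p(x\mid z)=\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2,\,1/2)$. With a Gaussian $q(z\mid x)=\mathcal{N}(m,s^2)$, both ELBO terms have closed forms.

```python
import math

def log_marginal(x):
    """Exact log p(x) for the toy model, where p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s):
    """Closed-form ELBO for q(z|x) = N(m, s^2) in the toy model."""
    # E_q[log N(x; z, 1)] uses E_q[(x - z)^2] = (x - m)^2 + s^2.
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
    kl = 0.5 * (m * m + s * s - 1.0 - math.log(s * s))
    return reconstruction - kl

x = 1.5
loose_bound = elbo(x, m=0.0, s=1.0)                    # q equal to the prior
tight_bound = elbo(x, m=x / 2.0, s=math.sqrt(0.5))     # q equal to the true posterior
exact = log_marginal(x)
gap = exact - loose_bound   # equals KL(q || true posterior), always >= 0
```

When `q` matches the true posterior, `tight_bound` coincides with `exact`; any other choice of `m` and `s` leaves a positive gap, which is the sense in which the ELBO is a lower bound.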
GANs sit at almost the opposite end of the spectrum. A GAN generator maps noise $z\sim p(z)$ to a sample $f_{\theta}(z)$, while a discriminator $D_{\phi}$ tries to distinguish generated samples from real data. Here $\phi$ denotes the discriminator's parameters (not an encoder's), and $D_{\phi}(x)\in(0,1)$ is its estimated probability that $x$ is real. The classical minimax objective is
$$\min_{\theta}\max_{\phi}\; \mathbb{E}_{x\sim p_{\mathrm{data}}}[\log D_{\phi}(x)] + \mathbb{E}_{z\sim p(z)}\left[\log\big(1-D_{\phi}(f_{\theta}(z))\big)\right].$$
This adversarial formulation can produce extremely sharp and realistic samples because the generator is trained to fool a learned critic rather than to maximize a pixelwise likelihood. However, GANs usually do not provide tractable likelihoods, and their latent space is not regularized by an inference model in the VAE sense. There is no built-in encoder $q_{\phi}(z|x)$, no ELBO, and no direct estimate of $\log p_{\theta}(x)$. As a result, GANs are powerful image synthesizers but less naturally suited to probabilistic inference or structured representation learning.
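To make the two expectation terms of the minimax objective concrete, here is a minimal sketch that evaluates the value function from lists of discriminator outputs. The function name `gan_value` and the example numbers are illustrative assumptions, not part of any library.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function.

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates the two sets achieves a higher value.
confident = gan_value([0.9, 0.8], [0.1, 0.2])
```

When the discriminator outputs 0.5 everywhere, the value equals $-2\log 2$; the discriminator tries to push the value up, while the generator tries to push it back down.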
Normalizing flows take a third route: they preserve exact likelihoods by making the generative map invertible. A flow defines an invertible transformation between data $x$ and latent variables $z=f_{\theta}(x)$, allowing density evaluation through the change-of-variables formula:
$$\log p_{\theta}(x) = \log p\big(f_{\theta}(x)\big) + \log\left|\det\frac{\partial f_{\theta}(x)}{\partial x}\right|.$$
Here $p$ is the latent prior evaluated at $z=f_{\theta}(x)$, and the Jacobian is that of the data-to-latent map. This is a major advantage: unlike VAEs, flows do not optimize a lower bound; unlike GANs, they can evaluate exact densities. But this exactness comes with architectural constraints. The transformation must be bijective, which usually forces the latent dimensionality to match the data dimensionality, $K=D$. The model cannot freely compress data into a lower-dimensional semantic bottleneck in the way a VAE can. Flows are excellent density models, but their latent spaces are often less directly aligned with compact representation learning.
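The change-of-variables computation can be checked by hand in one dimension. The sketch below assumes a hypothetical affine flow $z=f(x)=(x-\text{shift})/\text{scale}$ with a standard normal prior on $z$; the resulting density should match that of $\mathcal{N}(\text{shift},\text{scale}^2)$. All function names are illustrative.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (math.log(2.0 * math.pi) + z * z)

def affine_flow_logpdf(x, scale, shift):
    """log p(x) for the flow z = f(x) = (x - shift) / scale, z ~ N(0, 1)."""
    z = (x - shift) / scale
    # Jacobian of the data-to-latent map: df/dx = 1 / scale.
    log_abs_det = -math.log(abs(scale))
    return standard_normal_logpdf(z) + log_abs_det

def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

flow_value = affine_flow_logpdf(1.3, scale=2.0, shift=0.5)
exact_value = gaussian_logpdf(1.3, mean=0.5, std=2.0)
```

The two values agree up to floating-point error, which is the sense in which a flow's likelihood is exact rather than a bound.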
A useful way to summarize the comparison is to ask what each model gives you “for free” and what it makes difficult:
VAEs give a regularized encoder-decoder latent space and a principled likelihood lower bound, but may sacrifice sharpness and exact density evaluation.
GANs often give the sharpest samples, but provide no explicit likelihood and can be unstable to train.
Normalizing flows give exact likelihoods and exact inference through invertibility, but require constrained architectures and typically do not compress into a low-dimensional latent representation.
The key point is that there is no universally dominant model family. The right choice depends on whether the priority is sample realism, density evaluation, or latent representation structure. If the task is pure image synthesis, GAN-like models may be attractive. If exact log-likelihood is central, flows are compelling. If we care about a learned latent space that is smooth, regularized, and useful for inference or downstream prediction, VAEs remain especially important.
The visual comparison that follows condenses these trade-offs into the axes that matter most: objective, sampling procedure, density evaluation, latent structure, and sample quality. Notice in particular the contrast between the VAE’s ELBO-based density estimate and the flow’s exact likelihood, as well as the contrast between the VAE’s KL-regularized latent space and the more weakly structured latent variables used by GANs.
The highlighted cells emphasize the central lesson: VAEs are not simply “worse GANs” because their samples can be blurrier, nor are they merely “approximate flows” because they optimize a bound rather than exact likelihood. Their distinctive strength is the combination of a probabilistic training objective with a structured latent representation, which is precisely why they remain a foundational tool for representation learning, semi-supervised modeling, and latent-variable inference.

26. Empirical Anchor: VAE on CelebA Faces

After comparing VAEs with GANs and normalizing flows in the abstract, it is useful to ground the discussion in a concrete empirical example. CelebA faces are a particularly good testbed: the images are structured enough that we can see semantic attributes—pose, hair, glasses, gender presentation, expression—but constrained enough that a moderately sized convolutional VAE can learn a meaningful latent representation.
Consider a convolutional VAE trained on roughly 200k CelebA images resized to $64 \times 64$ RGB. The encoder $g_{\phi}$ maps an image $x$ through several convolutional layers and a fully connected layer to the parameters of a diagonal Gaussian posterior approximation,
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\right),$$
with latent dimension $K=128$. The decoder $f_{\theta}$ maps a latent vector back through a fully connected layer and transposed convolutional layers to a Gaussian mean image. In other words, the decoder is not directly producing a sharp image distribution in pixel space; it is producing the mean of a conditional Gaussian likelihood.
At training and reconstruction time, the latent sample is drawn using the reparameterization trick,
$$\hat{z} = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I).$$
This matters because the encoder is not merely assigning each image to a deterministic code. It is learning a local distribution over plausible codes, regularized toward the prior $p(z)=\mathcal{N}(0,I)$. The VAE therefore has to balance two goals: preserving enough information to reconstruct the face, while keeping the aggregate latent geometry compatible with sampling from a simple Gaussian prior.
The first qualitative result is reconstruction. If we encode a real face, sample $\hat{z}$, and decode $f_{\theta}(\hat{z})$, the output is usually recognizable: identity, hair color, pose, and broad facial structure are often preserved. But the image is noticeably soft. This is not an incidental flaw of convolutional networks; it is tightly connected to the Gaussian likelihood and pixelwise squared-error reconstruction term. When multiple plausible high-frequency details could explain an input (exact hair strands, skin texture, eye highlights), the conditional mean averages over them. The result is the familiar VAE blur.
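The averaging effect can be seen in a toy calculation. Suppose, purely for illustration, that two equally likely "sharp" four-pixel patterns are consistent with the same latent code. Under squared error, the best single prediction is their pixelwise mean, which is a flat gray pattern that matches neither. The names and pixel values below are made up for the sketch.

```python
def expected_squared_error(prediction, candidates, probs):
    """Expected squared error of one prediction against weighted candidates."""
    return sum(
        p * sum((a - b) ** 2 for a, b in zip(prediction, candidate))
        for candidate, p in zip(candidates, probs)
    )

# Two equally plausible sharp patterns (e.g. alternative fine textures).
sharp_a = [0.0, 1.0, 0.0, 1.0]
sharp_b = [1.0, 0.0, 1.0, 0.0]
candidates, probs = [sharp_a, sharp_b], [0.5, 0.5]

# The pixelwise mean is "blurry": every pixel is 0.5.
mean_image = [0.5 * a + 0.5 * b for a, b in zip(sharp_a, sharp_b)]

error_mean = expected_squared_error(mean_image, candidates, probs)
error_sharp = expected_squared_error(sharp_a, candidates, probs)
```

The blurry mean incurs half the expected error of committing to either sharp pattern, so a Gaussian-mean decoder is rewarded for producing it.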
That blur is the qualitative price paid for a likelihood-based, mode-covering generative model with a simple reconstruction distribution. Unlike a GAN, which may learn sharp samples by matching an implicit distribution adversarially, the standard Gaussian VAE is explicitly rewarded for predicting an average pixel value under uncertainty. This is why simply training longer or increasing the latent dimension often does not remove the softness. To substantially change the visual character, one usually modifies the likelihood, the reconstruction loss, the decoder distribution, or introduces more expressive generative structure.
The second result is unconditional generation. If we sample
$$z \sim p(z)=\mathcal{N}(0,I)$$
and decode $f_{\theta}(z)$, we obtain plausible faces rather than random noise. This is a key empirical validation of the VAE objective: the KL term has shaped the encoder’s latent codes so that regions likely under the prior correspond to meaningful decoded images. In a poorly regularized autoencoder, sampling a random latent vector would often land in “holes” between encoded training examples. In a well-trained VAE, the KL term pulls the encoded posteriors toward the prior, so high-probability regions of the prior are largely covered by codes the decoder has learned to handle.
The third result is more subtle: the latent space often supports semantic arithmetic. Using posterior means as deterministic summaries of images, we may find relationships like
$$\mu_{\phi}(x_{\text{man+glasses}}) - \mu_{\phi}(x_{\text{man}}) + \mu_{\phi}(x_{\text{woman}}) \approx z_{\text{woman+glasses}}.$$
Decoding the resulting vector can produce a face that preserves the “woman” identity direction while adding the “glasses” attribute. This should not be interpreted as exact symbolic reasoning. The latent space is not guaranteed to contain perfectly linear, disentangled factors. But because the prior is smooth and the decoder is trained over a continuous latent domain, common attributes can become approximately represented as directions or subspaces.
A related phenomenon appears in interpolation. Given two images $x_1$ and $x_2$, we can linearly interpolate between their posterior means,
$$z(t) = (1-t)\,\mu_{\phi}(x_1) + t\,\mu_{\phi}(x_2), \qquad t\in[0,1].$$
Decoding points along this path often gives a smooth morph from one face to another. This is an important distinction between a VAE and a plain autoencoder: the latent space is not merely a lookup table of compressed examples. The ELBO’s KL term encourages neighboring latent regions to decode coherently, so linear paths can remain on or near the learned face manifold.
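Both latent arithmetic and interpolation reduce to simple vector operations on posterior means. The sketch below uses plain Python lists as stand-ins for encoder outputs; the helper names and the three-dimensional example vectors are hypothetical, and decoding the results would still require a trained decoder.

```python
def add_attribute(mu_with, mu_without, mu_target):
    """Add the direction (mu_with - mu_without) to mu_target."""
    return [t + (w - wo) for w, wo, t in zip(mu_with, mu_without, mu_target)]

def interpolate(mu_1, mu_2, t):
    """Point z(t) = (1 - t) * mu_1 + t * mu_2 on the straight line."""
    return [(1.0 - t) * a + t * b for a, b in zip(mu_1, mu_2)]

# Stand-ins for posterior means of three encoded images.
mu_man_glasses = [1.0, 2.0, 0.5]
mu_man = [1.0, 0.0, 0.5]
mu_woman = [-1.0, 0.0, 0.25]

z_woman_glasses = add_attribute(mu_man_glasses, mu_man, mu_woman)
path = [interpolate(mu_man, mu_woman, k / 4) for k in range(5)]
```

Each point in `path` would be passed through the decoder to render one frame of the morph.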
The visual below consolidates these four empirical anchors: reconstruction, unconditional generation, latent arithmetic, and interpolation. Read it as evidence for both sides of the VAE story. On one hand, the model has learned a semantically organized latent space where sampling, arithmetic, and smooth traversal are meaningful. On the other hand, the reconstructions and samples remain visibly soft, reminding us that the Gaussian decoder’s averaging behavior is not a cosmetic detail but a central modeling limitation.
This example is therefore a natural bridge into VAE extensions. Once we accept that the basic framework gives us a useful latent geometry but imperfect image fidelity, the motivation for richer decoders, hierarchical latents, discrete variables, perceptual losses, adversarial hybrids, and diffusion-style decoders becomes much clearer.

27. Hierarchy of VAE Extensions

After seeing a standard VAE trained on CelebA, it is tempting to think of “the VAE” as one fixed architecture: an encoder produces a Gaussian latent distribution, a decoder reconstructs an image, and the ELBO balances reconstruction against KL regularization. But in practice, many of the most useful VAE variants keep this encode–sample–decode skeleton while changing one structural assumption in the probabilistic model or in the variational bound.
A useful way to organize the family is to ask: what exactly are we modifying?
Are we adding side information, such as a class label or attribute?
Are we tightening the variational lower bound without changing the model family?
Are we replacing the continuous latent variable with a discrete one?
These questions lead naturally to three important extensions: Conditional VAEs, Importance Weighted Autoencoders, and Vector-Quantized VAEs. They look different architecturally, but they are all still variations on the same principle: introduce latent structure, define a tractable training objective, and optimize an encoder–decoder system end-to-end.
The Conditional VAE, or CVAE, modifies the generative story by conditioning both inference and generation on observed side information $y$. Instead of modeling $p_\theta(x,z)$, we model something closer to $p_\theta(x,z \mid y)$. The encoder becomes $q_\phi(z \mid x,y)$, the decoder becomes $p_\theta(x \mid z,y)$, and the prior may also depend on $y$, giving the conditional ELBO
$$\mathcal{L}(\theta,\phi;x,y) = \mathbb{E}_{q_{\phi}(z|x,y)}\big[\log p_{\theta}(x|z,y)\big] - D_{\mathrm{KL}}\big(q_{\phi}(z|x,y)\,\|\,p(z|y)\big).$$
The intuition is simple but powerful: $y$ explains the part of the variation we already know, while $z$ captures the residual variation. For faces, $y$ might encode identity, expression, hair color, or a binary attribute such as “smiling.” At generation time, we fix $y$, sample $z \sim p(z \mid y)$, and decode $x \sim p_\theta(x \mid z,y)$. This gives controlled generation: “generate a face with this attribute, while allowing the remaining details to vary.” The subtle assumption is that $y$ is available and meaningful during training; without paired $(x,y)$ supervision, the conditional structure cannot be learned directly.
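One common way to feed $y$ into the decoder is to concatenate a one-hot label with the latent vector. The sketch below illustrates only that input construction, under two assumptions that are not stated in the text: the label is categorical, and the prior does not depend on the label, so $p(z \mid y)=\mathcal{N}(0,I)$. All names are illustrative.

```python
import random

def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def cvae_decoder_input(z, label, num_classes):
    """Concatenate the latent code with a one-hot encoding of y."""
    return list(z) + one_hot(label, num_classes)

def sample_conditional_input(label, latent_dim, num_classes, rng):
    """Fix y, draw z ~ N(0, I) (assumed label-independent prior),
    and build the vector a decoder network would receive."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return cvae_decoder_input(z, label, num_classes)

decoder_input = sample_conditional_input(
    label=2, latent_dim=4, num_classes=3, rng=random.Random(0)
)
```

Resampling `z` while holding `label` fixed is what produces varied outputs that share the requested attribute.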
The Importance Weighted Autoencoder, or IWAE, changes something different. It does not primarily add labels or alter the latent space type. Instead, it improves the variational objective. Recall that the ordinary ELBO is a lower bound on $\log p_\theta(x)$, derived by introducing an approximate posterior $q_\phi(z \mid x)$. The gap between the ELBO and the true log evidence is a KL divergence between $q_\phi(z \mid x)$ and the exact posterior $p_\theta(z \mid x)$. IWAE tightens this bound by drawing $L$ samples from the encoder and forming an importance-weighted estimate:
$$\mathcal{L}_{\mathrm{IWAE}} = \mathbb{E}_{z^{(1)},\ldots,z^{(L)}\sim q_{\phi}(z|x)}\left[\log \frac{1}{L}\sum_{l=1}^{L}\frac{p_{\theta}(x,z^{(l)})}{q_{\phi}(z^{(l)}|x)}\right].$$
For $L=1$, this recovers the ordinary VAE ELBO. As $L$ increases, the bound becomes tighter, and under appropriate regularity assumptions,
$$\mathcal{L}_{\mathrm{IWAE}} \xrightarrow{L\to\infty} \log p_{\theta}(x).$$
This matters because the standard ELBO can be loose when the variational posterior is too simple. IWAE partially compensates by using multiple latent samples and letting high-importance samples dominate the estimate. But this improvement is not free. The tighter objective can produce more difficult optimization dynamics for the encoder parameters $\phi$: as $L$ grows, the learning signal for the proposal distribution can degrade, because the signal-to-noise ratio of the standard gradient estimator for $\phi$ shrinks. So IWAE improves the statistical bound, but it introduces a new optimization trade-off.
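The tightening effect is easy to observe on a toy model where the evidence is known exactly. The sketch below assumes the one-dimensional model $p(z)=\mathcal{N}(0,1)$, $p(x \mid z)=\mathcal{N}(z,1)$, for which $p(x)=\mathcal{N}(0,2)$, and deliberately uses the prior as a poor proposal $q$. The model, the function names, and the sample counts are illustrative choices, not taken from the text.

```python
import math
import random

def log_normal(v, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, q_mean, q_var, num_samples, rng):
    """One Monte Carlo draw of the importance-weighted bound."""
    log_w = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        log_w.append(log_joint - log_normal(z, q_mean, q_var))
    # Numerically stable log of the average importance weight.
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / num_samples)

def average_bound(x, q_mean, q_var, num_samples, repeats, seed=0):
    rng = random.Random(seed)
    draws = [iwae_bound(x, q_mean, q_var, num_samples, rng) for _ in range(repeats)]
    return sum(draws) / repeats

x = 2.0
exact_log_evidence = log_normal(x, 0.0, 2.0)
elbo_avg = average_bound(x, 0.0, 1.0, num_samples=1, repeats=500)
iwae_avg = average_bound(x, 0.0, 1.0, num_samples=50, repeats=500)
```

With one sample the average sits well below the exact log evidence; with fifty samples it moves much closer. If the proposal equals the true posterior $\mathcal{N}(x/2, 1/2)$, every importance weight equals $p(x)$ and the bound is exact for any number of samples.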
The VQ-VAE moves in a third direction: it replaces the usual continuous Gaussian latent variable with a discrete latent code. The encoder first outputs a continuous vector, but that vector is then snapped to the nearest entry in a learned codebook $\{e_k\}$. The decoder receives the selected codebook vector rather than a sample from a continuous posterior. This is especially useful when the data has naturally discrete or symbolic structure: phonemes in speech, tokens in language-like representations, or repeated visual parts in images.
The difficulty is that nearest-neighbor lookup is not differentiable. A standard VAE relies on the reparameterization trick, for example $z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon$, to backpropagate through stochastic sampling. VQ-VAE cannot use that trick directly because the latent choice is discrete. Instead, it typically uses a straight-through estimator: in the forward pass, the model uses the nearest codebook vector; in the backward pass, gradients are copied through the quantization operation as if it were approximately identity. This estimator is biased, but often effective. Another consequence is that, with a uniform prior over discrete codes, the KL term can collapse to a constant, so the objective is usually expressed through reconstruction plus codebook and commitment losses rather than the ordinary Gaussian KL penalty.
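The forward-pass quantization step is just a nearest-neighbor search over the codebook. The sketch below shows that lookup in plain Python with a made-up three-entry codebook; it does not implement gradients. In an automatic-differentiation framework, the straight-through pass is commonly written as the encoder output plus a gradient-stopped difference between the quantized and continuous vectors, which is mentioned here only as a comment.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]

# Hypothetical codebook with three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_output = [0.8, 1.3]

index, z_q = quantize(encoder_output, codebook)

# Straight-through idea (conceptual, needs autodiff to be meaningful):
#   decoder_input = z_e + stop_gradient(z_q - z_e)
# The forward value equals z_q, while the backward pass treats it as z_e.
```

The decoder sees `z_q`, and only the integer `index` needs to be stored or modeled by a downstream prior.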
These three variants therefore form a compact hierarchy of interventions on the base VAE:
CVAE modifies the conditioning structure: generate $x$ given $y$.
IWAE modifies the bound: use multiple importance samples to tighten the estimate of $\log p_\theta(x)$.
VQ-VAE modifies the latent representation: replace continuous $z$ with discrete codebook indices.
The visual below is meant to consolidate this taxonomy rather than introduce a new derivation. The table separates each model by the part of the VAE it changes: objective, key innovation, and trade-off. That framing is important because these extensions are not arbitrary architectural hacks; each corresponds to a different mathematical pressure point in the original VAE formulation.
The small diagrams beneath the table reinforce the same idea operationally. CVAE routes the label $y$ into both encoder and decoder, IWAE fans out multiple latent samples before combining them through an importance-weighted log average, and VQ-VAE inserts a codebook lookup between encoder and decoder. The common thread is that all three preserve the VAE’s central encode–latent–decode logic, while changing what the latent variable means or how its objective is optimized.

28. Summary: VAEs — Equivalent Forms and Unified View

After seeing the hierarchy of VAE extensions, it is useful to step back and notice that most of the machinery we have introduced is not a collection of unrelated tricks. The latent-variable model, Jensen’s inequality, the KL gap, the reconstruction–regularization tradeoff, the reparameterization trick, and the Gaussian closed form are all different views of the same central object: the Evidence Lower Bound, or ELBO. The extensions change priors, posteriors, decoders, objectives, or training schedules, but the conceptual spine remains the same.
The starting point is the latent-variable likelihood
$$p_{\theta}(x)=\int p_{\theta}(x,z)\,dz =\int p_{\theta}(x|z)\,p(z)\,dz.$$
The difficulty is that this integral is usually intractable for a neural decoder $p_{\theta}(x|z)$. VAEs solve this not by evaluating the marginal likelihood directly, but by introducing an approximate posterior $q_{\phi}(z|x)$. This distribution is not part of the original generative model; it is an inference model, or encoder, trained to approximate the true posterior $p_{\theta}(z|x)$. Once we multiply and divide by $q_{\phi}(z|x)$, Jensen’s inequality gives the canonical ELBO form:
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_{\theta}(x,z)-\log q_{\phi}(z|x)\right].$$
This is the most compact variational-inference statement of the VAE objective. It says: sample latent variables from the encoder, score them under the joint model, and subtract the cost of using the encoder distribution. Since it is a lower bound,
$$\log p_{\theta}(x)\geq \mathcal{L}(\theta,\phi;x),$$
maximizing the ELBO is a surrogate for maximizing likelihood.
The same expression becomes more interpretable when we use the factorization $p_{\theta}(x,z)=p_{\theta}(x|z)\,p(z)$. Then the ELBO decomposes as
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z)).$$
This is the familiar reconstruction minus regularization form. The first term rewards latent codes that allow the decoder to explain $x$. The second term penalizes the encoder posterior for drifting too far from the prior. This penalty matters because generation later samples $z\sim p(z)$, not $z\sim q_{\phi}(z|x)$. If the aggregate latent codes used during training live in regions of latent space that the prior rarely visits, generation will be poor even if reconstruction is good.
A third equivalent rewriting separates the KL term into entropy and prior energy:
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \mathcal{H}[q_{\phi}(z|x)] - \mathbb{E}_{q_{\phi}(z|x)}[-\log p(z)].$$
This form emphasizes the connection to negative free energy and rate–distortion. The reconstruction term is a distortion-like quantity: how accurately can the latent representation explain the data? The KL-related terms control the information content and geometry of the representation. A low-entropy posterior can carry precise information about $x$, but it pays a cost if it concentrates too sharply or moves away from the prior. A high-entropy posterior is cheaper and smoother, but may discard information needed for accurate reconstructions.
The fourth form reveals exactly when the bound is tight:
$$\log p_{\theta}(x) = \mathcal{L}(\theta,\phi;x) + D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)).$$
Equivalently,
$$\mathcal{L}(\theta,\phi;x) = \log p_{\theta}(x) - D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)).$$
So the gap between the ELBO and the true log evidence is not mysterious: it is exactly the KL divergence from the approximate posterior to the true posterior. The bound is tight if and only if
$$q_{\phi}(z|x)=p_{\theta}(z|x)$$
almost everywhere. This condition is important but subtle. In practice, the true posterior changes as $\theta$ changes, and the variational family may be too limited to represent it exactly. Thus the VAE optimizes a coupled problem: improve the generative model while simultaneously learning an amortized approximation to its posterior.
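The gap identity can be verified numerically on a model where everything has a closed form. The sketch below assumes the one-dimensional model $p(z)=\mathcal{N}(0,1)$, $p(x \mid z)=\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$, together with a Gaussian $q(z \mid x)=\mathcal{N}(m, s^2)$. The model and all names are illustrative choices.

```python
import math

def log_evidence(x):
    """log p(x) with p(x) = N(0, 2) for the toy model."""
    return -0.5 * (math.log(2.0 * math.pi * 2.0) + x * x / 2.0)

def elbo(x, m, s2):
    """Closed-form ELBO for q(z|x) = N(m, s2)."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (m * m + s2 - 1.0 - math.log(s2))
    return expected_log_lik - kl_to_prior

def kl_to_posterior(x, m, s2):
    """KL( N(m, s2) || N(x/2, 1/2) ), the variational gap."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (math.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

x = 2.0
# A mismatched q leaves a positive gap; the exact posterior closes it.
gap_mismatched = log_evidence(x) - elbo(x, m=0.0, s2=1.0)
gap_exact = log_evidence(x) - elbo(x, m=1.0, s2=0.5)
```

For any choice of `m` and `s2`, the quantity `elbo + kl_to_posterior` reproduces `log_evidence`, and the gap vanishes only when $q$ is the true posterior.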
The fifth form is the one we actually implement for the standard Gaussian VAE. If
$$q_{\phi}(z|x)=\mathcal{N}\!\left(\mu_{\phi}(x),\operatorname{diag}(\sigma_{\phi}(x)^2)\right), \qquad p(z)=\mathcal{N}(0,I),$$
then we sample using the reparameterization trick
$$z=\mu_{\phi}(x)+\sigma_{\phi}(x)\odot \epsilon, \qquad \epsilon\sim\mathcal{N}(0,I),$$
and use the closed-form Gaussian KL:
$$D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z)) = \frac{1}{2}\sum_{k=1}^{K}\left[\mu_{\phi,k}^2+\sigma_{\phi,k}^2-1-\log\sigma_{\phi,k}^2\right].$$
The practical training objective is therefore
$$\tilde{\mathcal{L}}(\theta,\phi;x) = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[\log p_{\theta}\!\left(x\mid \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right] - \frac{1}{2}\sum_{k=1}^{K}\left[\mu_{\phi,k}(x)^2+\sigma_{\phi,k}(x)^2-1-\log\sigma_{\phi,k}(x)^2\right].$$
This last expression is not a new objective; it is the computationally usable version of the same ELBO under Gaussian assumptions. The reparameterization trick is what makes gradients with respect to $\phi$ low-variance and compatible with backpropagation: randomness is pushed into $\epsilon$, while $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ remain differentiable functions of the encoder parameters.
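The pieces of the practical objective can be written out in a few lines. The sketch below is a plain-Python, single-sample estimate of the Gaussian ELBO; it computes values only, not gradients. It assumes the encoder outputs a mean and a log-variance per latent dimension, a unit-variance Gaussian decoder, and a hypothetical linear `toy_decoder` standing in for a neural network.

```python
import math
import random

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); randomness lives in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_log_likelihood(x, mean):
    """log N(x; mean, I), assuming unit observation variance."""
    return sum(-0.5 * (math.log(2.0 * math.pi) + (xi - mi) ** 2) for xi, mi in zip(x, mean))

def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample estimate: reconstruction term minus closed-form KL."""
    z = reparameterize(mu, log_var, rng)
    return gaussian_log_likelihood(x, decode(z)) - gaussian_kl_to_standard_normal(mu, log_var)

def toy_decoder(z):
    # Hypothetical stand-in for the decoder mean function.
    return [z[0] + z[1], z[0] - z[1]]

rng = random.Random(0)
value = elbo_estimate([1.0, 0.0], mu=[0.5, 0.5], log_var=[-1.0, -1.0],
                      decode=toy_decoder, rng=rng)
```

Parameterizing the encoder by log-variance rather than by $\sigma$ directly is a common design choice that keeps the variance positive without constraints. In a real implementation the same computation would run inside an automatic-differentiation framework, and the training loss would be the negative of this estimate averaged over a minibatch.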
Seen this way, the VAE is a unification of two tasks:
Amortized variational inference: learn $q_{\phi}(z|x)$ so that inference for each datapoint is fast and approximate.
Deep generative modeling: learn $p_{\theta}(x|z)$ so that samples from a simple prior can be decoded into realistic observations.
The strengths and failure modes follow directly from this unified objective. Posterior collapse occurs when the decoder becomes strong enough that the optimum can ignore $z$, pushing $q_{\phi}(z|x)$ close to $p(z)$ and making the KL small but the latent code uninformative. Blurry reconstructions often arise when the likelihood model, such as a factorized Gaussian decoder, rewards averaging over plausible outputs. Extensions such as $\beta$-VAEs, hierarchical VAEs, richer priors, normalizing-flow posteriors, and discrete latents can all be understood as attempts to reshape one part of this same bound.
The visual below condenses this entire path into a single unified map. The upper portion organizes the five ELBO forms side by side: Jensen’s lower bound, reconstruction–regularization, entropy/free-energy, the exact variational gap, and the reparameterized Gaussian estimator. They are not competing formulas; they are algebraically equivalent perspectives, each useful for answering a different question.
The lower portion traces the lecture arc from latent-variable modeling through intractability, ELBO derivation, decomposition, tightness, reparameterization, closed-form KL terms, training, failure modes, and extensions. Read together, the table and timeline reinforce the main lesson: the VAE is best understood as an end-to-end differentiable implementation of variational inference inside a neural generative model.

2. Latent Variable Models: The Core Idea

The previous discussion framed generative modeling as a density-learning problem: we want a model that assigns high probability to realistic data and can also produce new samples. But high-dimensional observations—images, audio, text embeddings—rarely vary freely in all ambient dimensions. A handwritten digit image may have thousands of pixels, yet much of its variation can be described by a smaller set of factors: which digit it is, how thick the stroke is, whether it is tilted, how centered it is, and so on. Latent variable models make this intuition explicit.
The core assumption is that each observed data point $x$ is generated from an unobserved, lower-dimensional variable $z$. We do not get to see $z$ in the dataset; it is a hidden explanation for the observation. Instead of modeling the distribution over $x$ directly, we define a two-step generative process:
$$z \sim p(z) = \mathcal{N}(0, I), \qquad x \sim p_{\theta}(x \mid z).$$
Here, $p(z)$ is a simple prior over latent codes, usually chosen to be a standard Gaussian. The conditional distribution $p_{\theta}(x \mid z)$ is the decoder or generative model: given a latent code, it describes a distribution over possible observations. In modern VAEs, this conditional distribution is parameterized by a neural network $f_{\theta}(z)$, which maps latent coordinates into the parameters of a likelihood over data space.
This setup separates two kinds of complexity. The prior $p(z)$ is deliberately simple: sampling from $\mathcal{N}(0,I)$ is easy, and its geometry is well behaved. The decoder carries the burden of learning how simple latent variation becomes rich observed structure. For example, in an idealized MNIST model, nearby values of $z$ might correspond to visually similar digits, while different directions in latent space could control properties such as digit identity, stroke width, or tilt.
The probability assigned to an observation $x$ is obtained by considering all possible latent explanations that could have generated it. This gives the marginal likelihood, also called the evidence:
$$p_{\theta}(x) = \int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, dz.$$
This integral is the mathematical heart of latent variable modeling. It says: to evaluate how likely $x$ is, average the likelihood $p_{\theta}(x \mid z)$ over every possible latent code $z$, weighted by how plausible that code was under the prior. A particular $z$ might reconstruct $x$ very well, but if that $z$ lies in an extremely unlikely region of the prior, its contribution is limited. Conversely, common latent codes contribute more, but only if the decoder can plausibly produce $x$ from them.
This is why latent variable models are so appealing. They offer a structured way to generate data:
sample a simple latent code $z$,
pass it through a learned decoder,
obtain a complex observation $x$.
They also offer a conceptual interpretation of representation learning: the model is encouraged to organize meaningful variation in data through the latent space. If the learned representation is smooth, moving through $z$-space should produce coherent changes in generated samples rather than abrupt jumps.
But the same integral that makes the model principled also creates the central computational difficulty. When $p_{\theta}(x \mid z)$ is parameterized by a neural network, the integral
$$\int_{\mathcal{Z}} p_{\theta}(x \mid z)\, p(z)\, dz$$
usually has no closed form. The decoder can be highly nonlinear, and the latent space may have many dimensions. Direct numerical integration becomes infeasible, and naïve Monte Carlo estimates can be too noisy or inefficient for maximum likelihood training. Therefore, although the model defines $p_{\theta}(x)$ formally, we cannot usually evaluate or maximize $\log p_{\theta}(x)$ directly.
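To see what a naïve Monte Carlo estimate of the evidence looks like, consider a toy case where the answer is known. The sketch below assumes a one-dimensional model with $p(z)=\mathcal{N}(0,1)$ and $p(x \mid z)=\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(0,2)$ exactly; the model, names, and sample counts are illustrative choices.

```python
import math
import random

def normal_pdf(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def naive_evidence_estimate(x, num_samples, rng):
    """p(x) ~= average of p(x | z_s) over samples z_s drawn from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, 1.0)
    return total / num_samples

rng = random.Random(0)
x = 2.0
exact = normal_pdf(x, 0.0, 2.0)

# Small budgets give visibly scattered estimates; a large budget settles down.
small_budget = [naive_evidence_estimate(x, 10, rng) for _ in range(5)]
large_budget = naive_evidence_estimate(x, 20000, rng)
```

In one dimension a large sample budget suffices. With a high-dimensional latent space and a sharply peaked posterior, very few prior samples explain a given $x$ well, which is why this estimator becomes impractical for training.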
This is the key tension that motivates variational inference and, ultimately, the VAE objective. We have a clean generative story and a meaningful likelihood, but the marginalization over hidden causes is intractable. The rest of the VAE framework can be understood as a way to optimize a tractable surrogate for this inaccessible log-likelihood.
The visual below compactly summarizes this idea. The left side uses a graphical-model view: a latent variable $z$ is drawn from the prior, then an observed variable $x$ is generated through $p_{\theta}(x \mid z)$, repeated independently across data points. The shaded observed node emphasizes that $x$ is in the dataset, while $z$ remains hidden.
The right side grounds the abstraction in an MNIST-style example: a latent code can be thought of as controlling semantic or stylistic factors that the decoder turns into an image. The important warning is the intractable integral over $z$: even though the sampling story is simple, evaluating the probability of a given observation requires summing over all latent explanations, which is precisely what we cannot do directly with a neural decoder.

3. Failure Case: Why Not Just Use EM or MAP?

Having introduced latent variable models, we now run into the first serious computational obstacle: the model is easy to write down, but hard to fit. The whole appeal was to define a simple prior $p(z)$, pass $z$ through a decoder, and obtain a flexible distribution over observations $x$. But maximum likelihood training asks us to evaluate
$$p_{\theta}(x) = \int p_{\theta}(x \mid z)\,p(z)\,dz,$$
and that integral is exactly where the trouble begins. For simple decoders, the integral and the posterior over latents may be analytically tractable. For neural network decoders, they usually are not.
The classical tool for latent variable models is Expectation-Maximization. EM alternates between inferring the latent posterior under the current parameters and then updating the parameters using expectations under that posterior. The E-step requires
$$p_{\theta}(z \mid x) = \frac{p_{\theta}(x \mid z)\,p(z)}{p_{\theta}(x)}.$$
This equation is innocent-looking but deceptive. The denominator $p_{\theta}(x)$ is the very marginal likelihood integral we were trying to avoid. In models with conjugacy or linear-Gaussian structure, the posterior has a known form and EM is elegant. But once the decoder becomes nonlinear, the posterior can become highly warped, multimodal, and unavailable in closed form.
A useful contrast is probabilistic PCA. Suppose
p(z)=N(0,I),pθ(x∣z)=N(Wz+b,σ2I).p(z)=\mathcal{N}(0,I),
\qquad
p_{\theta}(x \mid z)=\mathcal{N}(Wz+b,\sigma^2 I).p(z)=N(0,I),pθ​(x∣z)=N(Wz+b,σ2I).
Because everything is linear and Gaussian, the posterior pθ(z∣x)p_{\theta}(z \mid x)pθ​(z∣x) is also Gaussian. EM can compute its mean and covariance exactly. The latent space remains geometrically well-behaved: observing xxx carves out an elliptical Gaussian belief over possible zzz's.
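As a small sanity check of the linear-Gaussian case, the sketch below assumes a scalar version of probabilistic PCA ($K = D = 1$) with illustrative values for `w`, `b`, and `sigma` that are not taken from the text. It computes the posterior mean and variance from the standard Gaussian conditioning formulas and compares them with brute-force normalization of $p_{\theta}(x \mid z)\,p(z)$ on a grid.

```python
import math

def ppca_posterior_1d(x, w, b, sigma):
    """Exact posterior N(mean, var) over z when z ~ N(0, 1) and x | z ~ N(w*z + b, sigma^2)."""
    precision = 1.0 + w * w / (sigma * sigma)   # prior precision + likelihood precision
    var = 1.0 / precision
    mean = var * w * (x - b) / (sigma * sigma)
    return mean, var

def grid_posterior_moments(x, w, b, sigma, lo=-10.0, hi=10.0, n=20001):
    """Brute-force posterior moments obtained by normalizing p(x|z) p(z) on a grid."""
    step = (hi - lo) / (n - 1)
    zs = [lo + i * step for i in range(n)]
    unnorm = [math.exp(-0.5 * ((x - w * z - b) / sigma) ** 2 - 0.5 * z * z) for z in zs]
    total = sum(unnorm)
    mean = sum(z * u for z, u in zip(zs, unnorm)) / total
    var = sum((z - mean) ** 2 * u for z, u in zip(zs, unnorm)) / total
    return mean, var

# Illustrative (assumed) parameter values.
exact = ppca_posterior_1d(x=1.5, w=2.0, b=0.5, sigma=0.7)
numeric = grid_posterior_moments(x=1.5, w=2.0, b=0.5, sigma=0.7)
print(exact, numeric)
```

The two results should agree to many decimal places, which is the point: in this model the E-step needs no approximation at all.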
Now replace the linear map $Wz+b$ with a neural network $f_{\theta}(z)$:
$$p_{\theta}(x \mid z)=\mathcal{N}(f_{\theta}(z),\sigma^2 I).$$
The prior is still simple, and the observation noise may still be Gaussian, but the posterior is no longer Gaussian. The inverse image of a given observation $x$ under $f_{\theta}$ may contain many disconnected regions. Different latent codes can decode to similar observations. The posterior may have sharp ridges, separated modes, and strong nonlinear dependencies between latent dimensions. In other words, the generative direction $z \mapsto x$ may be easy to evaluate, while the inference direction $x \mapsto z$ is hard.
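To make the multimodality concrete, here is a minimal sketch that swaps the neural network for a hypothetical scalar decoder $f(z) = z^2$ (an assumption chosen only because it is the simplest non-invertible map). With $x = 1$, both $z \approx +1$ and $z \approx -1$ explain the observation, so the unnormalized posterior has two separated modes.

```python
import math

def log_unnorm_posterior(z, x, sigma):
    """log p(x|z) + log p(z) up to constants, for the assumed decoder f(z) = z**2."""
    return -0.5 * ((x - z * z) / sigma) ** 2 - 0.5 * z * z

def find_modes(x, sigma, lo=-3.0, hi=3.0, n=601):
    """Return grid points that are strict local maxima of the unnormalized posterior."""
    step = (hi - lo) / (n - 1)
    zs = [lo + i * step for i in range(n)]
    vals = [log_unnorm_posterior(z, x, sigma) for z in zs]
    return [zs[i] for i in range(1, n - 1)
            if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]]

modes = find_modes(x=1.0, sigma=0.3)
print(modes)  # two symmetric modes, one near z = -1 and one near z = +1
```

A single Gaussian cannot cover both modes without also covering the low-probability valley around $z = 0$ between them.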
One tempting fallback is MAP inference: instead of representing the whole posterior, choose the most likely latent point,
$$\hat{z} = \arg\max_z \log p_{\theta}(z \mid x).$$
This can be useful in some settings, but it is not a satisfactory replacement for posterior inference. A point estimate throws away uncertainty. If the posterior has several plausible modes, MAP picks one and ignores the rest. It also introduces an inner optimization problem for every datapoint, which is expensive during training. Worse, if we need gradients through that optimization procedure, the computation becomes cumbersome and brittle, especially when latent variables are discrete or when the optimization landscape is poorly conditioned.
Another seemingly generic solution is Monte Carlo EM. We might sample latent candidates from the prior,
$$z^{(l)} \sim p(z),$$
and weight them according to how well they explain $x$:
$$w^{(l)} \propto p_{\theta}(x \mid z^{(l)}).$$
This is importance sampling with the prior as the proposal distribution. The problem is that the prior is usually a terrible proposal for the posterior. In high dimensions, most samples from $p(z)$ land in regions that explain a specific $x$ extremely poorly. A tiny number of samples may receive almost all the weight, while the rest contribute essentially nothing.
This phenomenon is often called importance weight collapse. As the latent dimension $K$ grows, the effective number of useful samples can collapse toward one:
$$\text{effective sample size} \approx 1 \qquad \text{as } K \to \infty.$$
The phrase “effective sample size” is important: even if we draw thousands of samples, the estimator may behave as though it had only one meaningful sample. The resulting estimates of the E-step quantities have very high variance, which can grow exponentially with the latent dimension, and this makes naive Monte Carlo learning impractical for expressive latent-variable models.
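The collapse is easy to reproduce numerically. The sketch below uses the common estimate $\mathrm{ESS} = (\sum_l w^{(l)})^2 / \sum_l (w^{(l)})^2$ and an assumed toy model in which the decoder is the identity, $x = z + \text{noise}$; the model, the noise level, and the sample count are illustrative choices, not values from the text.

```python
import math
import random

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)

def prior_proposal_ess(K, num_samples=2000, sigma=0.5, seed=0):
    """ESS of prior samples for one observation from the assumed model x = z + noise."""
    rng = random.Random(seed)
    z_true = [rng.gauss(0.0, 1.0) for _ in range(K)]
    x = [zt + sigma * rng.gauss(0.0, 1.0) for zt in z_true]
    log_w = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(K)]        # z ~ p(z)
        sq_err = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
        log_w.append(-0.5 * sq_err / sigma ** 2)            # log p(x|z) up to a constant
    return effective_sample_size(log_w)

for K in (1, 5, 20, 50):
    print(K, round(prior_proposal_ess(K), 1))
```

Even in this benign linear setting, the ESS should fall from a sizeable fraction of the 2000 samples at $K = 1$ to roughly a single sample at $K = 50$.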
So the failure is not that EM, MAP, or Monte Carlo are conceptually wrong. Each is reasonable under the right assumptions. The failure is a mismatch between those assumptions and neural generative models:
Exact EM needs a tractable posterior.
MAP replaces uncertainty with a single point.
Prior-based Monte Carlo wastes samples in high-dimensional latent spaces.
Neural decoders make the true posterior complex and hard to normalize.
The visual below compresses this comparison into the key geometric intuition. In the linear-Gaussian case, posterior inference is clean: the posterior over $z$ is a single Gaussian-shaped region, and EM can proceed exactly. In the nonlinear case, the posterior becomes irregular and multimodal, so the E-step no longer has a closed-form solution.
The weight-collapse sketch at the bottom emphasizes why simply sampling many $z$'s from the prior does not rescue us. In high-dimensional latent spaces, almost all prior samples are irrelevant for a particular observation $x$, and the importance weights concentrate on one lucky sample. This is the motivation for the next idea: instead of solving a separate hard inference problem from scratch for every datapoint, VAEs introduce amortized approximate inference, a learned encoder that predicts an approximate posterior directly.

4. The VAE Framework: Three Distributions

The failure of direct EM or per-example MAP inference points us toward a different compromise: instead of solving a fresh optimization problem for every datapoint, we will learn a function that performs inference. This is the central move in a variational autoencoder. A VAE is not just “an autoencoder with noise”; it is a probabilistic latent-variable model equipped with a trainable approximation to the posterior.
The generative story begins with a latent variable $z$, typically chosen to live in a relatively low-dimensional continuous space:
$$p(z) = \mathcal{N}(0, I), \qquad z \in \mathbb{R}^K.$$
This distribution is called the prior. It says what kinds of latent codes are plausible before seeing any data. The standard Gaussian prior is not chosen because we believe the true hidden factors of images, text, or molecules are literally independent standard normal variables. Rather, it is chosen because it gives us a simple, smooth, sampleable reference distribution. Later, when we generate new data, we will sample $z \sim p(z)$ and decode it into an observation.
The second ingredient is the decoder, also called the generative model or likelihood model. It specifies how an observed datapoint $x$ is produced from a latent code $z$:
$$p_{\theta}(x \mid z).$$
In a neural VAE, this conditional distribution is parameterized by a neural network $f_{\theta}(z)$. The network does not directly output “the reconstruction” in a purely deterministic sense; rather, it outputs the parameters of a probability distribution over possible observations. For continuous data, a common choice is an isotropic Gaussian likelihood,
$$p_{\theta}(x \mid z) = \mathcal{N}\!\left(f_{\theta}(z),\, \sigma^2 I\right),$$
where $f_{\theta}(z)$ is the mean of the conditional distribution. For binary data, such as binarized MNIST pixels, a common choice is a Bernoulli likelihood,
$$p_{\theta}(x \mid z) = \mathrm{Bernoulli}\!\left(\mathrm{sigmoid}(f_{\theta}(z))\right).$$
Together, the prior and decoder define the actual generative model:
$$p_{\theta}(x, z) = p_{\theta}(x \mid z)\,p(z).$$
This joint distribution is the model’s claim about how data and latent variables co-occur. If we could compute the posterior exactly,
$$p_{\theta}(z \mid x) = \frac{p_{\theta}(x \mid z)\,p(z)}{p_{\theta}(x)},$$
then inference would be straightforward: given an observed $x$, infer which latent codes $z$ plausibly generated it. But the denominator,
$$p_{\theta}(x) = \int p_{\theta}(x \mid z)\,p(z)\,dz,$$
is generally intractable for a nonlinear neural decoder. This is the same obstacle we encountered when trying to use exact EM: the posterior is the object we need, but it is not available in closed form.
The VAE introduces a third distribution to address this: the encoder, or approximate posterior,
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(\mu_{\phi}(x),\ \mathrm{diag}(\sigma_{\phi}(x)^2)\right).$$
Here another neural network, often denoted $g_{\phi}(x)$, maps an input datapoint to the parameters of a Gaussian distribution over latent codes:
$$g_{\phi}(x) \longrightarrow \left(\mu_{\phi}(x), \sigma_{\phi}(x)\right).$$
This is an approximation to $p_{\theta}(z \mid x)$, not part of the generative model itself. The generative model is still $p(z)\,p_{\theta}(x \mid z)$. The encoder is an inference mechanism: it gives us a tractable distribution from which we can sample latent codes likely to explain $x$.
The key innovation is amortized inference. In classical variational inference, we might introduce separate variational parameters for each datapoint, for example $q_{\lambda_n}(z \mid x^{(n)})$ with its own $\lambda_n$. That would mean every new datapoint requires its own inference optimization. VAEs instead use a single shared network $q_{\phi}(z \mid x)$ for all datapoints:
$$x^{(n)} \mapsto \left(\mu_{\phi}(x^{(n)}), \sigma_{\phi}(x^{(n)})\right).$$
The cost of inference is therefore amortized over the dataset. We pay once to learn $\phi$, and then inference for a new $x$ is just a forward pass. This is why VAEs scale well to large datasets and why they look architecturally like autoencoders: data goes through an encoder into a latent representation, then through a decoder back into data space. But probabilistically, the encoder and decoder play asymmetric roles.
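A minimal sketch of what "one shared network" means in code is given below. The encoder here is a hypothetical single-hidden-unit network with a scalar latent ($K = 1$), and every parameter value is an arbitrary placeholder. Predicting the log-variance rather than $\sigma$ directly is a common convention (assumed here) that keeps $\sigma$ positive without constraints.

```python
import math

def encoder(x, phi):
    """Hypothetical g_phi: maps a datapoint x to (mu, sigma) of q(z | x), with K = 1."""
    h = math.tanh(sum(w * xi for w, xi in zip(phi["w_hidden"], x)) + phi["b_hidden"])
    mu = phi["w_mu"] * h + phi["b_mu"]
    log_var = phi["w_logvar"] * h + phi["b_logvar"]   # predict log-variance so sigma > 0
    sigma = math.exp(0.5 * log_var)
    return mu, sigma

# One shared parameter set phi (arbitrary illustrative values) ...
phi = {"w_hidden": [0.5, -1.0, 0.25], "b_hidden": 0.1,
       "w_mu": 1.5, "b_mu": 0.0, "w_logvar": -0.7, "b_logvar": -1.0}

# ... serves every datapoint: inference is a forward pass, not a per-example optimization.
dataset = [[0.2, 0.1, -0.4], [1.0, -0.5, 0.3], [-0.8, 0.9, 0.0]]
posteriors = [encoder(x, phi) for x in dataset]
print(posteriors)
```

In the per-datapoint alternative, each element of `dataset` would instead carry its own $(\mu_n, \sigma_n)$ pair fitted by a separate optimization.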
There is also an important modeling assumption hidden in the standard encoder form. By choosing
$$q_{\phi}(z \mid x) = \mathcal{N}\!\left(\mu_{\phi}(x),\ \mathrm{diag}(\sigma_{\phi}(x)^2)\right),$$
we assume the approximate posterior is Gaussian with diagonal covariance. This makes sampling and KL-divergence computations convenient, but it can be restrictive. The true posterior $p_{\theta}(z \mid x)$ may be multimodal, skewed, or highly correlated across latent dimensions. Much of the behavior of VAEs, including some failure modes we will discuss later, comes from the tension between a flexible neural decoder and a relatively simple approximate posterior family.
The three distributions therefore have distinct responsibilities:
Prior $p(z)$: defines the latent space we can sample from.
Decoder $p_{\theta}(x \mid z)$: defines how latent codes generate observations.
Encoder $q_{\phi}(z \mid x)$: approximates the intractable posterior for efficient inference.
The visual below consolidates this separation by showing two complementary paths. The generative path starts from $z \sim p(z)$ and moves downward through $p_{\theta}(x \mid z)$ to produce $x$. The inference path goes in the opposite direction: given $x$, the encoder $q_{\phi}(z \mid x)$ produces a distribution over plausible latent codes.
The most important detail to keep in mind is that the right-hand path is not a separate model of the data. It is a learned approximation used to make training and inference tractable. The shared parameter vector $\phi$ is what makes the inference procedure amortized: instead of optimizing a new posterior approximation for every $x^{(n)}$, the VAE learns one encoder network that serves the entire dataset.

5. Objective: Maximize the Log-Evidence

Now that we have separated the VAE into its three distributions (the prior $p(z)$, the decoder $p_\theta(x \mid z)$, and the encoder-like approximation $q_\phi(z \mid x)$), we can ask the most basic statistical question: what objective should train the generative model? If the decoder is supposed to define a probability model over observations, then the natural answer is maximum likelihood. We want parameters $\theta$ that assign high probability to the observed dataset:
$$\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}(x^{(n)}).$$
This is the same principle used in ordinary probabilistic modeling: choose the model under which the data would have been most likely. The complication is that in a latent-variable model, $p_\theta(x)$ is not given directly. It is the probability of observing $x$ after averaging over all possible latent explanations $z$.
For a single datapoint, the marginal likelihood, also called the evidence, is
$$p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x \mid z)\,p(z)\,dz,$$
so the log-evidence is
$$\log p_{\theta}(x) = \log \int_{\mathcal{Z}} p_{\theta}(x \mid z)\,p(z)\,dz.$$
Conceptually, this integral says: sample a latent code $z$ from the prior, decode it into a distribution over $x$, and average the probability assigned to the observed $x$ across all possible latent codes. If many plausible latent codes explain $x$, the evidence should be high. If almost no latent code decodes near $x$, the evidence should be low.
The problem is that this integral is almost never analytically tractable for a neural decoder. In a simple linear-Gaussian latent-variable model, the integral may have a closed form. But in a VAE, $p_\theta(x \mid z)$ is parameterized by a deep network $f_\theta(z)$. The latent space is typically continuous, often $\mathcal{Z} = \mathbb{R}^K$, and the integrand can be nonzero over an enormous region. We are therefore trying to integrate a nonlinear neural-network-shaped function over a high-dimensional space:
$$\int_{\mathbb{R}^K} p_\theta(x \mid z)\,p(z)\,dz.$$
There is no general symbolic simplification available.
A first instinct is to use Monte Carlo sampling from the prior. Draw
$$z^{(1)}, \dots, z^{(L)} \sim p(z),$$
approximate the expectation, and then take the logarithm:
$$\log p_{\theta}(x) = \log \mathbb{E}_{p(z)}\bigl[p_\theta(x \mid z)\bigr] \approx \log \frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)}).$$
This looks reasonable, and the estimator is consistent: it converges to $\log p_\theta(x)$ as $L \to \infty$. But as a training objective, it has two serious issues. First, prior samples are usually a terrible way to find the latent codes that explain a specific datapoint $x$. In high dimensions, most $z \sim p(z)$ will decode to samples unrelated to $x$, so the average can be dominated by rare lucky samples. This leads to high variance.
Second, and more fundamentally, the log of a Monte Carlo average is a biased estimator of the log-evidence. The logarithm is concave, so Jensen’s inequality gives
$$\mathbb{E}\left[\log \frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)})\right] \leq \log \mathbb{E}\left[\frac{1}{L}\sum_{l=1}^{L} p_\theta(x \mid z^{(l)})\right] = \log p_\theta(x).$$
The inequality is strict unless the random quantity inside the log is essentially constant. The single-sample case $L = 1$ makes the same point most plainly:
$$\log \mathbb{E}_{p(z)}\bigl[p_\theta(x \mid z)\bigr] \geq \mathbb{E}_{p(z)}\bigl[\log p_\theta(x \mid z)\bigr].$$
This is the key obstruction: moving the logarithm outside or inside an expectation changes the objective. The log-evidence is what we want, but it contains an intractable integral. The expectation of the log-likelihood is tractable by sampling, but it is generally a lower quantity and, by itself, is not the right maximum-likelihood objective.
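The bias can be observed directly in a model where the answer is known. The sketch below assumes a scalar linear-Gaussian toy model, $z \sim \mathcal{N}(0,1)$ and $x \mid z \sim \mathcal{N}(z, \sigma^2)$, for which $p(x) = \mathcal{N}(0, 1 + \sigma^2)$ exactly. The model, the observation `x_obs`, and the value of `SIGMA` are illustrative assumptions.

```python
import math
import random

SIGMA = 0.5   # assumed decoder noise; toy model: z ~ N(0,1), x | z ~ N(z, SIGMA^2)

def log_lik(x, z):
    return -0.5 * math.log(2 * math.pi * SIGMA ** 2) - 0.5 * ((x - z) / SIGMA) ** 2

def true_log_evidence(x):
    """For this linear-Gaussian toy model, p(x) = N(0, 1 + SIGMA^2) exactly."""
    var = 1.0 + SIGMA ** 2
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * x * x / var

def naive_log_estimate(x, L, rng):
    """Log of the prior-sample Monte Carlo average of p(x | z), via log-sum-exp."""
    logs = [log_lik(x, rng.gauss(0.0, 1.0)) for _ in range(L)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs) / L)

def average_estimate(x, L, repeats=2000, seed=0):
    """Average the naive estimator over many independent repetitions."""
    rng = random.Random(seed)
    return sum(naive_log_estimate(x, L, rng) for _ in range(repeats)) / repeats

x_obs = 2.0
print("true log p(x):", true_log_evidence(x_obs))
for L in (1, 10, 100):
    print(L, average_estimate(x_obs, L))
```

The averaged estimates should sit below the true log-evidence and creep upward as $L$ grows, which is Jensen's inequality showing up as a systematic underestimate rather than as noise.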
This is where the VAE’s central idea begins to emerge. We need a surrogate objective that is:
tractable by sampling, so it avoids exact integration over all $z$;
a principled lower bound on $\log p_\theta(x)$, so optimizing it still pushes up the evidence;
differentiable with low-variance gradients, so neural networks can be trained efficiently;
aware of the datapoint $x$, so we sample useful latent codes rather than blind samples from the prior.
That surrogate will be the Evidence Lower Bound, or ELBO, written schematically as
$$\mathcal{L}(\theta,\phi; x) \leq \log p_\theta(x).$$
The additional parameter $\phi$ appears because we introduce an inference network $q_\phi(z \mid x)$, whose role is to propose latent codes likely to explain the observed datapoint. Instead of sampling $z$ blindly from the prior, the VAE learns where in latent space to look.
The visual below compresses this motivation into three layers: the maximum-likelihood goal, the intractable log-evidence integral, and the failed naive Monte Carlo shortcut. The red warning around Jensen’s inequality highlights the precise mathematical reason we cannot simply replace the integral with a sampled average inside a logarithm and proceed as if nothing changed.
The bottom layer then points to the resolution: rather than estimating the log-evidence directly, we construct a computable lower bound with good gradient behavior. The next step is to derive that bound carefully and see exactly why it is lower, when it becomes tight, and how it decomposes into the reconstruction and KL terms used in VAE training.

6. Deriving the ELBO: Jensen's Inequality Route

Having written the evidence as an integral over the latent variable, we now face the central obstacle of latent-variable maximum likelihood: the quantity we want,
$$\log p_{\theta}(x)=\log \int p_{\theta}(x,z)\,dz,$$
is usually not computable in closed form. The integral sums over all possible latent explanations $z$ of the observation $x$. For expressive neural decoders $p_{\theta}(x\mid z)$, this marginalization is precisely what makes the model powerful, and precisely what makes direct optimization difficult.
The key variational idea is to introduce an auxiliary distribution $q_{\phi}(z\mid x)$, which we will later implement as the encoder. At this point, however, $q_{\phi}$ is just a mathematical device: a distribution over latent variables conditioned on the observed datapoint. We use it to rewrite the intractable integral as an expectation under a distribution from which we can sample.
Starting from the evidence,
$$\log p_{\theta}(x) = \log \int p_{\theta}(x,z)\,dz,$$
we multiply and divide by $q_{\phi}(z\mid x)$:
$$\log p_{\theta}(x) = \log \int q_{\phi}(z\mid x)\,\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\,dz.$$
This is an exact algebraic identity, not an approximation. It is the same logic as importance sampling: instead of integrating directly with respect to the latent variable measure, we express the integral as an expectation under a chosen proposal distribution $q_{\phi}(z\mid x)$:
$$\log p_{\theta}(x) = \log \mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\right].$$
There is an important support condition hidden in this step. The ratio $p_{\theta}(x,z)/q_{\phi}(z\mid x)$ must be well-defined wherever $p_{\theta}(x,z)$ contributes mass. Informally, $q_{\phi}(z\mid x)$ must not assign zero probability to latent regions that the model considers possible explanations of $x$. Otherwise, the importance-weighted expression can become undefined or miss parts of the integral entirely.
Now comes the only inequality in the derivation. Since $\log$ is concave, Jensen’s inequality tells us that for a positive random variable $Y$,
$$\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y].$$
Here the random variable is
$$Y = \frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}, \qquad z\sim q_{\phi}(z\mid x).$$
Applying Jensen’s inequality gives
$$\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z\mid x)}\left[\log \frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\right],$$
or equivalently,
$$\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\log p_{\theta}(x,z) - \log q_{\phi}(z\mid x)\right].$$
This lower bound is the evidence lower bound, or ELBO:
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\log p_{\theta}(x,z) - \log q_{\phi}(z\mid x)\right].$$
The name is literal: it is a lower bound on the log-evidence $\log p_{\theta}(x)$. For any valid choice of $q_{\phi}(z\mid x)$,
$$\mathcal{L}(\theta,\phi;x) \leq \log p_{\theta}(x).$$
This is the foundational move in VAEs. We replace an intractable objective with a tractable lower bound that can be estimated by sampling from the encoder.
The bound becomes tight exactly when Jensen’s inequality becomes equality. For a strictly concave function like $\log$, equality occurs exactly when the random variable inside the expectation is constant almost surely under $q_{\phi}(z\mid x)$. Since that constant must equal the expectation $\mathbb{E}_{q_\phi}[Y] = p_{\theta}(x)$, we need
$$\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)} = p_{\theta}(x) \quad \text{for } q_{\phi}\text{-almost every } z.$$
Rearranging,
$$q_{\phi}(z\mid x) = \frac{p_{\theta}(x,z)}{p_{\theta}(x)} = p_{\theta}(z\mid x).$$
So the ELBO is tight when the variational distribution $q_{\phi}(z\mid x)$ exactly matches the true posterior $p_{\theta}(z\mid x)$. This is also why the encoder is often described as an approximate posterior: it is trained to behave like the posterior distribution that would make the bound exact, but which is usually unavailable because it depends on the intractable evidence $p_{\theta}(x)$.
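The tightness condition can be checked by sampling. The sketch below again assumes the scalar toy model $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, \sigma^2)$, where both the log-evidence and the posterior are available in closed form; the function names and numerical values are illustrative. It estimates the ELBO as a plain Monte Carlo average of $\log p_\theta(x,z) - \log q(z \mid x)$ over $z \sim q$.

```python
import math
import random

SIGMA = 0.5   # toy model (assumed): z ~ N(0,1), x | z ~ N(z, SIGMA^2)

def log_normal(v, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def elbo_monte_carlo(x, q_mean, q_var, num_samples=20000, seed=0):
    """Estimate E_q[log p(x, z) - log q(z | x)] by sampling z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(x, z, SIGMA ** 2) + log_normal(z, 0.0, 1.0)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / num_samples

x_obs = 2.0
log_evidence = log_normal(x_obs, 0.0, 1.0 + SIGMA ** 2)     # exact for this toy model
post_mean = x_obs / (1.0 + SIGMA ** 2)                      # exact posterior mean
post_var = SIGMA ** 2 / (1.0 + SIGMA ** 2)                  # exact posterior variance

print("log p(x)            :", log_evidence)
print("ELBO, q = posterior :", elbo_monte_carlo(x_obs, post_mean, post_var))
print("ELBO, q = N(0, 1)   :", elbo_monte_carlo(x_obs, 0.0, 1.0))
```

When $q$ is the exact posterior, every sample contributes the same value $\log p_\theta(x)$, so the estimate matches the log-evidence up to floating-point error and has zero variance. With $q$ set to the prior, the estimate lands well below it.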
This derivation matters because it gives us a practical optimization target with a clear interpretation:
maximizing the ELBO with respect to $\theta$ improves the generative model;
maximizing it with respect to $\phi$ improves the approximate posterior;
the gap between $\log p_{\theta}(x)$ and the ELBO measures how far $q_{\phi}(z\mid x)$ is from the true posterior.
There is a subtle but important caveat: maximizing a lower bound is not automatically the same as maximizing the original objective. If the bound is loose, we may improve $\mathcal{L}$ without significantly improving $\log p_{\theta}(x)$. The success of VAEs depends on making the variational family expressive enough, and the optimization stable enough, that this lower bound remains a useful surrogate for maximum likelihood.
The visual below condenses the derivation into its three essential moves: start from the marginal likelihood, rewrite it as an expectation using $q_{\phi}(z\mid x)$, then apply Jensen’s inequality to move the logarithm inside the expectation. The boxed expression is the ELBO, the quantity we can optimize in place of the inaccessible log-evidence.
It also highlights the geometric meaning of the inequality: $\mathcal{L}(\theta,\phi;x)$ sits below $\log p_{\theta}(x)$, with a gap determined by how imperfect the variational posterior is. In the next section, we will decompose this same ELBO into the two terms that make VAEs operational: a reconstruction term and a KL regularization term.

7. ELBO Decomposition: Reconstruction + KL

Having obtained the ELBO through Jensen’s inequality, the next question is: what exactly are we optimizing? The expression
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_{\theta}(x,z) - \log q_{\phi}(z|x)\right]$$
is already a valid lower bound on $\log p_\theta(x)$, but in this form it hides the two competing forces that make VAEs work. To understand the training objective operationally, we want to rewrite it in terms of the decoder likelihood and a penalty on the encoder distribution.
The key move is simply to expand the joint model. In a VAE, the generative story is assumed to factor as: first sample a latent variable $z$ from a prior $p(z)$, then sample the observation $x$ from a decoder likelihood $p_\theta(x|z)$. Therefore,
$$p_\theta(x,z) = p_\theta(x|z)\,p(z),$$
and taking logs gives
$$\log p_\theta(x,z) = \log p_\theta(x|z) + \log p(z).$$
Substituting this into the ELBO gives
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)\right].$$
Because expectation is linear, we can separate the terms:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_\theta(x|z)\right] + \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p(z) - \log q_\phi(z|x)\right].$$
The first term has a direct modeling interpretation. We draw latent codes $z$ from the encoder distribution $q_\phi(z|x)$, then ask how much probability the decoder assigns back to the original observation $x$. This is the reconstruction term:
$$\mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_\theta(x|z)\right].$$
It rewards latent representations that preserve information useful for explaining the input. If $p_\theta(x|z)$ is a Bernoulli distribution, this term is the negative of a binary cross-entropy reconstruction loss; if it is a Gaussian with fixed variance $\sigma^2$, it is a negative squared-error loss scaled by $1/(2\sigma^2)$, plus a constant that does not depend on the reconstruction. This is why VAEs often look, at implementation time, like autoencoders with a probabilistic reconstruction loss.
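The correspondence between decoder log-likelihoods and familiar reconstruction losses can be written out for a single sampled $z$. In the sketch below, `logits` and `mean` stand in for hypothetical decoder outputs $f_\theta(z)$, and the numbers are arbitrary. The Bernoulli case is computed from logits through a stable softplus, a standard numerical choice assumed here rather than taken from the text.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def bernoulli_log_lik(x, logits):
    """log p(x | z) = sum_i [x_i * a_i - softplus(a_i)], i.e. minus the summed BCE."""
    return sum(xi * a - softplus(a) for xi, a in zip(x, logits))

def gaussian_log_lik(x, mean, sigma):
    """log N(x; mean, sigma^2 I) = -SSE / (2 sigma^2) - (D/2) log(2 pi sigma^2)."""
    sse = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sse / sigma ** 2 - 0.5 * len(x) * math.log(2 * math.pi * sigma ** 2)

# Hypothetical decoder outputs for one sampled z.
x_binary, logits = [1, 0, 1, 1], [2.0, -1.0, 0.5, 3.0]
x_real, mean = [0.2, -0.1, 0.4], [0.25, 0.0, 0.1]
print(bernoulli_log_lik(x_binary, logits))
print(gaussian_log_lik(x_real, mean, sigma=1.0))
```

Averaging either quantity over a few samples $z \sim q_\phi(z|x)$ gives a Monte Carlo estimate of the reconstruction term. Note that the choice of $\sigma$ in the Gaussian case rescales the squared error and therefore changes its weight relative to any other term in the objective.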
The second term is more subtle. Recall that the KL divergence is
$$D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log q_\phi(z|x) - \log p(z)\right].$$
Our ELBO contains the negative of this quantity:
$$\mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p(z) - \log q_\phi(z|x)\right] = - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right).$$
So the ELBO decomposes into the now-famous form
$$\boxed{\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_\theta(x|z)\right]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)}_{\text{regularisation term}}}$$
This decomposition explains the central tension in VAE training. The reconstruction term wants $q_\phi(z|x)$ to encode enough information about $x$ that the decoder can reconstruct it accurately. The KL term wants $q_\phi(z|x)$ to stay close to the prior $p(z)$, usually a simple standard Gaussian such as $\mathcal{N}(0,I)$. In other words, the encoder is not free to assign every datapoint an arbitrary isolated latent code; its codes must remain compatible with a shared latent space from which we can later sample.
This regularization is not merely aesthetic. Without it, the model could behave like an ordinary deterministic autoencoder: excellent reconstructions, but a latent space with holes, disconnected regions, and no reliable way to generate new samples by drawing $z \sim p(z)$. The KL penalty pushes the aggregate structure of the latent codes toward something smooth and sampleable. At the same time, if the KL pressure is too strong, the encoder may ignore $x$, producing $q_\phi(z|x) \approx p(z)$ for every input. That failure mode is known as posterior collapse, and it causes the latent variable to carry little or no information.
A useful way to remember the objective is:
Reconstruction term: “Can the decoder explain this datapoint using latents from the encoder?”
KL term: “Is the encoder’s posterior still close to the prior distribution we intend to sample from?”
ELBO maximization: “Find a compromise between faithful reconstruction and a well-structured latent space.”
For the common Gaussian case, this tradeoff becomes especially convenient computationally. If
$$q_\phi(z|x) = \mathcal{N}\!\left(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\right)$$
and
$$p(z) = \mathcal{N}(0,I),$$
then the KL term has a closed-form expression. That means VAE training usually combines a Monte Carlo estimate of the reconstruction expectation with an analytic KL penalty, giving a stable and differentiable objective once we introduce the reparameterization trick.
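For reference, the closed form for a diagonal Gaussian against a standard normal prior is the standard result $D_{\mathrm{KL}} = \tfrac{1}{2}\sum_{k=1}^{K}\bigl(\mu_k^2 + \sigma_k^2 - 1 - \log \sigma_k^2\bigr)$. The sketch below implements it and compares it with a sampling-based estimate of $\mathbb{E}_q[\log q - \log p]$; the values of `mu` and `sigma` are illustrative stand-ins for encoder outputs.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))

def kl_monte_carlo(mu, sigma, num_samples=50000, seed=0):
    """Sampling-based check: average of log q(z) - log p(z) over z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):          # dimensions are independent under q and p
            z = rng.gauss(m, s)
            log_q = -0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((z - m) / s) ** 2
            log_p = -0.5 * math.log(2 * math.pi) - 0.5 * z * z
            total += log_q - log_p
    return total / num_samples

mu, sigma = [0.5, -1.0], [0.8, 1.2]   # illustrative encoder outputs
print(kl_diag_gaussian_to_standard_normal(mu, sigma), kl_monte_carlo(mu, sigma))
```

The analytic value is exact and free of sampling noise, which is why it is usually preferred over the Monte Carlo estimate whenever the encoder and prior are both Gaussian.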
The visual below condenses this derivation into its algebraic skeleton: start from the Jensen-derived ELBO, expand the joint distribution using the generative factorization, separate the expectation, and recognize the second piece as a negative KL divergence. The color split emphasizes that the final objective is not one monolithic loss, but a sum of two interpretable forces with opposite pressures.
Read the final boxed equation as the practical training objective for a single datapoint $x$: maximize expected decoder log-likelihood while minimizing deviation from the prior. This compact decomposition is the bridge between the abstract variational bound and the loss function we will soon implement in an actual VAE.

8. Theorem: ELBO–Evidence Gap is the KL Divergence

Having split the ELBO into a reconstruction term and a KL-to-prior regularizer, we can now ask a deeper question: what exactly did we lose when we replaced the true log-evidence $\log p_\theta(x)$ by the ELBO? The answer is one of the most important identities in variational inference: the missing quantity is not mysterious approximation error, nor a loose heuristic penalty. It is exactly a KL divergence between the variational encoder and the true Bayesian posterior.
For a latent-variable model, the evidence is
$$p_\theta(x)=\int p_\theta(x|z)\,p(z)\,dz,$$
and the true posterior over latents is
$$p_\theta(z|x)=\frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}.$$
This posterior is the distribution we would ideally use for inference: after observing $x$, it tells us which latent explanations $z$ are plausible. The difficulty is that $p_\theta(z|x)$ depends on $p_\theta(x)$, the very integral we cannot usually compute. VAEs therefore introduce an encoder $q_\phi(z|x)$, a tractable approximation to this intractable posterior.
The central theorem says that, for a suitable variational distribution $q_\phi(z|x)$,
$$\log p_{\theta}(x)=\mathcal{L}(\theta,\phi;x)+D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right).$$
Here the ELBO is
$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_{\theta}(x|z)\right]-D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p(z)\right).$$
So the ELBO is not merely inspired by the evidence. It is the evidence minus a very specific nonnegative gap.
Because KL divergence is always nonnegative,
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)\geq 0,$$
we immediately recover the lower-bound property:
$$\mathcal{L}(\theta,\phi;x)\leq\log p_\theta(x).$$
This also explains the name Evidence Lower Bound. The ELBO sits below the log-evidence, and the vertical distance between them is precisely the mismatch between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$.
The theorem also tells us exactly when the bound is tight. We have
$$\mathcal{L}(\theta,\phi;x)=\log p_\theta(x)$$
if and only if
$$q_\phi(z|x)=p_\theta(z|x)\quad \text{almost everywhere}.$$
In words: the ELBO becomes the true log-evidence exactly when the encoder recovers the true posterior. There is no remaining variational gap. This is why the encoder is not just an auxiliary neural network used for amortized sampling; it is performing approximate Bayesian inference.
There is a subtle support assumption hiding here. For the KL
$$D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)$$
to be finite, $q_\phi$ should not place probability mass where the true posterior has zero mass. More formally, we usually require $q_\phi(z|x)$ to be absolutely continuous with respect to $p_\theta(z|x)$. At the same time, if the variational family is too restrictive and cannot represent the true posterior’s important regions, the gap cannot close. This is one reason why posterior approximation quality depends not only on optimization, but also on the expressiveness of the encoder family.
The identity has an important optimization consequence. Suppose $\theta$ is fixed. Then $\log p_\theta(x)$ is constant with respect to $\phi$, so maximizing the ELBO over $\phi$ is exactly equivalent to minimizing
$$D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right).$$
That is the variational inference interpretation of the VAE encoder: training $q_\phi(z|x)$ by maximizing the ELBO pushes it toward the true posterior. When we also optimize $\theta$, we are doing two things at once: learning a generative model that assigns high probability to the data, and learning an approximate inference model that explains each datapoint in latent space.
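The identity can be checked numerically in a setting where nothing is intractable. The sketch below is an illustration only: it assumes a hypothetical model with a discrete latent $z\in\{0,1,2\}$, so the integral over $z$ becomes a finite sum, and the probability values are made up for the example. It computes the evidence, the true posterior, the ELBO, and the KL gap, and prints their sum next to $\log p_\theta(x)$.

```python
import math

# Hypothetical toy model with a discrete latent z in {0, 1, 2}, so the
# integral over z becomes a finite sum and every quantity is exact.
prior = [0.5, 0.3, 0.2]            # p(z), made-up values
likelihood = [0.10, 0.60, 0.30]    # p(x | z) for one fixed observation x


def evidence(prior, likelihood):
    """p(x) = sum_z p(x | z) p(z)."""
    return sum(l * p for l, p in zip(likelihood, prior))


def posterior(prior, likelihood):
    """True posterior p(z | x) via Bayes' rule."""
    px = evidence(prior, likelihood)
    return [l * p / px for l, p in zip(likelihood, prior)]


def kl(q, p):
    """KL(q || p) for discrete distributions; terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def elbo(q, prior, likelihood):
    """E_q[log p(x | z)] - KL(q || p(z))."""
    recon = sum(qi * math.log(l) for qi, l in zip(q, likelihood) if qi > 0)
    return recon - kl(q, prior)


log_px = math.log(evidence(prior, likelihood))
post = posterior(prior, likelihood)

for name, q in [("uniform q", [1 / 3, 1 / 3, 1 / 3]), ("q = posterior", post)]:
    bound = elbo(q, prior, likelihood)
    gap = kl(q, post)
    print(f"{name}: ELBO={bound:.4f}  gap={gap:.4f}  "
          f"ELBO+gap={bound + gap:.4f}  log p(x)={log_px:.4f}")
```

For any valid choice of `q`, the printed `ELBO+gap` column should agree with `log p(x)`, and the gap should vanish when `q` is set to the true posterior.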
A useful way to remember the theorem is:
Evidence is the target quantity we wish we could maximize directly.
ELBO is the tractable surrogate we can optimize.
KL gap is the price paid for using $q_\phi(z|x)$ instead of the true posterior.
The visual below compresses this relationship into a geometric picture. The log-evidence is represented as a fixed point on a “nats” axis, while the ELBO lies to its left. The red interval between them is the posterior KL divergence. If the encoder improves, that interval shrinks; if $q_\phi(z|x)$ matches $p_\theta(z|x)$, the two points coincide.
The three corollaries in the visual are therefore not separate facts, but direct consequences of the same decomposition: the ELBO is a lower bound because KL is nonnegative; it is tight exactly when the approximate and true posteriors agree; and maximizing it with respect to $\phi$ is variational inference. The next step is to prove the identity algebraically, showing how Bayes’ rule turns the apparently inaccessible evidence into the sum of an optimizable lower bound and an explicit posterior mismatch term.

9. Proof: ELBO + KL Gap = Log Evidence

Having stated that the gap between the log evidence and the ELBO is a KL divergence, we should now verify that this is not a heuristic slogan. It is an exact identity. The derivation is short, but it is worth reading carefully because it explains why variational inference works: we are not inventing an arbitrary surrogate objective; we are decomposing the true marginal log-likelihood into a tractable lower bound plus a nonnegative error term.
Fix one observed datapoint $x$. The model defines a latent-variable joint distribution
$$p_{\theta}(x,z) = p_{\theta}(x|z)\,p(z),$$
and the true posterior is
$$p_{\theta}(z|x) = \frac{p_{\theta}(x,z)}{p_{\theta}(x)}.$$
The problem is that $p_{\theta}(x)$, the evidence, usually requires integrating out $z$:
$$p_{\theta}(x) = \int p_{\theta}(x,z)\,dz.$$
In a VAE, this integral is generally intractable because the decoder $p_{\theta}(x|z)$ is represented by a neural network. So instead of working directly with the true posterior $p_{\theta}(z|x)$, we introduce an approximate posterior, or encoder distribution,
$$q_{\phi}(z|x).$$
The key question is: how does optimizing an objective involving $q_{\phi}(z|x)$ relate to maximizing the true log evidence $\log p_{\theta}(x)$?
Start from the KL divergence between the variational posterior and the true posterior:
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p_{\theta}(z|x)\right].$$
Now substitute Bayes’ rule into the true posterior term:
$$p_{\theta}(z|x)=\frac{p_{\theta}(x,z)}{p_{\theta}(x)}.$$
Taking logs gives
$$\log p_{\theta}(z|x)=\log p_{\theta}(x,z) - \log p_{\theta}(x).$$
Therefore,
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x)-\log p_{\theta}(x,z)+\log p_{\theta}(x)\right].$$
The subtle but crucial observation is that $\log p_{\theta}(x)$ does not depend on $z$. The expectation is over $z \sim q_{\phi}(z|x)$, while $x$ is fixed. So we can pull $\log p_{\theta}(x)$ outside the expectation:
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x)-\log p_{\theta}(x,z)\right]+\log p_{\theta}(x).$$
Equivalently,
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)=-\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x,z)-\log q_{\phi}(z|x)\right]+\log p_{\theta}(x).$$
The expectation inside the negative sign is precisely the evidence lower bound:
$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x,z)-\log q_{\phi}(z|x)\right].$$
So the KL divergence becomes
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)=-\mathcal{L}(\theta,\phi;x)+\log p_{\theta}(x).$$
Rearranging gives the central decomposition:
$$\boxed{\log p_{\theta}(x)=\mathcal{L}(\theta,\phi;x)+D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right)}$$
This is the whole proof. No approximation has been made. We used only the definition of KL divergence, Bayes’ theorem, and the fact that constants can be moved outside expectations.
The consequence is immediate. Since KL divergence is always nonnegative,
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right) \geq 0,$$
we have
$$\log p_{\theta}(x) \geq \mathcal{L}(\theta,\phi;x).$$
That is why $\mathcal{L}$ is called a lower bound on the log evidence. The bound is tight exactly when
$$D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)\right) = 0,$$
which occurs when
$$q_{\phi}(z|x) = p_{\theta}(z|x)$$
almost everywhere under $q_{\phi}$. In words: the ELBO equals the true log evidence when the encoder distribution exactly matches the model’s true posterior.
This identity also clarifies a common failure mode in variational modeling. Maximizing the ELBO improves a lower bound on $\log p_{\theta}(x)$, but it does so through the restricted family $q_{\phi}(z|x)$. If the encoder family is too limited, the KL gap may remain large even after optimization. The objective is still valid, but the approximation may be poor. Conversely, if $q_{\phi}$ is expressive enough and optimization succeeds, the ELBO can become a very accurate proxy for the true marginal likelihood.
The visual below compresses the derivation into a chain of equalities: start with the KL divergence to the true posterior, replace the posterior using Bayes’ theorem, separate the constant $\log p_{\theta}(x)$, recognize the ELBO, and rearrange. The boxed final identity is not an additional assumption; it is simply the same expression written so that the evidence appears on the left.
It is useful to keep this picture in mind as we move forward. The “reconstruction plus regularization” reading of the ELBO comes from expanding the joint model; this proof did not need it. At this stage, the most important fact is structural: log evidence decomposes exactly into an optimizable lower bound plus a nonnegative posterior-approximation gap.

10. ELBO as Negative Free Energy

Having just seen that the ELBO differs from the true log evidence by a nonnegative KL gap, we can now reinterpret the bound in a more operational way. The ELBO is not merely an algebraic trick for lower-bounding $\log p_\theta(x)$; it is the objective that tells a VAE how to trade off using latent information against staying compatible with the prior.
For a single datapoint $x$, the VAE objective is
$$\mathcal{L}(\theta, \phi; x)=\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]-D_{\mathrm{KL}}(q_{\phi}(z|x) \| p(z)).$$
The first term is the reconstruction term. It asks: if the encoder distribution $q_\phi(z|x)$ samples a latent code $z$, how much probability does the decoder $p_\theta(x|z)$ assign back to the original observation $x$? Maximizing this term encourages the latent representation to preserve information that the decoder needs. In an image VAE, this means encoding shape, pose, color, texture, or any features that help predict the input under the chosen likelihood model.
The second term is the regularization term, but that phrase can be slightly misleading if we think of it as a generic penalty. More precisely, it penalizes the encoder distribution $q_\phi(z|x)$ for moving too far away from the prior $p(z)$, usually $p(z)=\mathcal{N}(0,I)$. This matters because generation will later sample $z \sim p(z)$, not $z \sim q_\phi(z|x)$. If the encoder learns latent codes that live in strange isolated regions far from the prior, then samples drawn from the prior may decode poorly. The KL term counteracts this by pulling each per-datapoint posterior toward the prior, which keeps the aggregate geometry of the latent space usable for generation.
We can unpack the KL penalty further using
$$D_{\mathrm{KL}}(q\|p)=-\mathcal{H}[q]+\mathbb{E}_{q}[-\log p(z)].$$
Therefore,
$$-D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))=\mathcal{H}[q_{\phi}(z|x)]-\mathbb{E}_{q_{\phi}(z|x)}[-\log p(z)].$$
This decomposition reveals two forces hidden inside the regularization term. The entropy term $\mathcal{H}[q_\phi(z|x)]$ rewards the encoder for being uncertain, or spread out, rather than collapsing to a deterministic point code. Meanwhile, the expected negative log prior term penalizes codes that fall in regions where $p(z)$ assigns low probability. For a standard Gaussian prior, this means the model is discouraged from placing latent mass far away from the origin.
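As a quick sanity check of this decomposition, the sketch below evaluates both sides for a one-dimensional Gaussian $q=\mathcal{N}(\mu,\sigma^2)$ against a standard normal prior. The function names are illustrative, and the formulas used are the standard closed forms for Gaussian entropy, Gaussian cross-entropy, and Gaussian KL divergence.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2), in nats (independent of mu)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)


def expected_neg_log_prior(mu, sigma):
    """E_q[-log p(z)] for q = N(mu, sigma^2) and p = N(0, 1)."""
    return 0.5 * math.log(2 * math.pi) + 0.5 * (mu**2 + sigma**2)


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1))."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))


for mu, sigma in [(0.0, 1.0), (0.9, 0.3), (2.0, 0.05)]:
    h = gaussian_entropy(sigma)
    ce = expected_neg_log_prior(mu, sigma)
    print(f"mu={mu}, sigma={sigma}: -H + E_q[-log p] = {-h + ce:.6f}, "
          f"KL = {kl_to_standard_normal(mu, sigma):.6f}")
```

Shrinking `sigma` lowers the entropy term and so raises the KL, while moving `mu` away from the origin raises the expected negative log prior; the two columns should agree in every row.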
So the ELBO is balancing two competing desires:
Reconstruction: encode enough information in $z$ to explain $x$ well.
Regularization: keep $q_\phi(z|x)$ close enough to $p(z)$ that latent samples remain meaningful.
This tension is fundamental. If the KL penalty is too weak, the encoder may memorize examples using highly specialized latent codes, producing good reconstructions but a poorly organized latent space. If the KL penalty is too strong, the encoder may ignore $x$ and make $q_\phi(z|x)\approx p(z)$, which can lead to posterior collapse: the decoder receives little useful information from $z$, and the latent variables stop carrying semantic structure.
This is why the ELBO also has a natural rate–distortion interpretation. The KL term acts like a rate: it measures how many nats, or bits up to a constant conversion, are required to encode $z$ using a distribution different from the prior. The reconstruction error acts like a distortion: poor reconstructions correspond to high distortion, while high log-likelihood corresponds to low distortion. Maximizing the ELBO means finding a good operating point between compression cost and reconstruction fidelity.
The connection to physics and variational inference comes from writing the training problem as minimization of the negative ELBO:
$$-\mathcal{L}(\theta,\phi;x).$$
This quantity is often called the variational free energy or negative evidence lower bound. Since we previously established
$$\log p_{\theta}(x)=\mathcal{L}(\theta, \phi; x)+D_{\mathrm{KL}}(q_{\phi}(z|x)\|p_{\theta}(z|x)),$$
minimizing free energy pushes the ELBO upward; for fixed $\theta$, this is the same as reducing the gap between the approximate posterior $q_\phi(z|x)$ and the true posterior $p_\theta(z|x)$. The bound becomes tight exactly when
$$q_\phi(z|x)=p_\theta(z|x),$$
which is attainable only if the variational family is expressive enough to represent the true posterior.
The visual below condenses this interpretation into a balance: reconstruction pulls the objective toward faithful recovery of $x$, while KL regularization pulls the encoder distribution back toward the prior. The scale metaphor is useful because neither side is “bad”; a good VAE needs both. Reconstruction gives the latent variable meaning, while the KL term gives the latent space global structure.
It also highlights the rate–distortion view: moving toward lower distortion usually requires spending more rate, while enforcing a very low rate often sacrifices detail. Much of practical VAE design (choosing decoder likelihoods, KL schedules, $\beta$-VAE objectives, or more expressive priors) is about controlling exactly where this operating point lies.
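To make the operating-point idea concrete, here is a minimal sketch under stated assumptions: a one-dimensional Gaussian decoder $p(x|z)=\mathcal{N}(z,0.1)$, a standard normal prior, a Gaussian $q=\mathcal{N}(\mu,\sigma^2)$, and a $\beta$-weighted objective $\mathbb{E}_q[\log p(x|z)]-\beta\,D_{\mathrm{KL}}(q\|p(z))$. The closed-form maximizer in `optimal_q` comes from setting the derivatives of that objective to zero; it is specific to this toy setup and is not a general VAE result.

```python
import math

NOISE_VAR = 0.1  # assumed decoder variance in p(x | z) = N(z, NOISE_VAR)


def optimal_q(x, beta):
    """Maximizer (mu, sigma^2) of E_q[log p(x|z)] - beta * KL(q || N(0, 1)).

    Stationarity in mu:    (x - mu) / v - beta * mu = 0
    Stationarity in sigma: -sigma / v - beta * (sigma - 1 / sigma) = 0
    with v = NOISE_VAR.
    """
    denom = 1.0 + beta * NOISE_VAR
    return x / denom, beta * NOISE_VAR / denom


def rate(mu, var):
    """Rate: KL(N(mu, var) || N(0, 1)) in nats."""
    return 0.5 * (mu**2 + var - 1.0 - math.log(var))


def distortion(x, mu, var):
    """Distortion: -E_q[log p(x | z)] in nats."""
    return 0.5 * math.log(2 * math.pi * NOISE_VAR) + ((x - mu) ** 2 + var) / (2 * NOISE_VAR)


x = 1.0
betas = [0.25, 0.5, 1.0, 2.0, 4.0]
rates, distortions = [], []
for beta in betas:
    mu, var = optimal_q(x, beta)
    rates.append(rate(mu, var))
    distortions.append(distortion(x, mu, var))
    print(f"beta={beta:>4}: rate={rates[-1]:.3f} nats "
          f"({rates[-1] / math.log(2):.3f} bits), distortion={distortions[-1]:.3f} nats")
```

In this toy setting, increasing `beta` lowers the rate and raises the distortion, which is the trade-off described above. The distortion here is a negative log-density, so it can be negative; only its ordering across `beta` values matters.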

11. Worked Example: ELBO on a 1D Gaussian

Having seen the ELBO as a negative free energy, it is worth pausing on a case where nothing is hidden behind approximation. In most VAE settings, the posterior $p_\theta(z \mid x)$ is intractable, which is exactly why we introduce a variational distribution $q_\phi(z \mid x)$. But in a one-dimensional linear-Gaussian model, the posterior, marginal likelihood, KL terms, and ELBO can all be written down exactly. That makes it a useful sanity check: if our interpretation of the ELBO is right, then optimizing it over a sufficiently expressive variational family should recover the true posterior and make the ELBO equal to the evidence.
Consider the toy generative model
$$p(z) = \mathcal{N}(0,1),\qquad p_\theta(x \mid z) = \mathcal{N}(z, 0.1).$$
Here the latent variable $z$ is drawn from a standard normal prior, and the observation $x$ is a noisy version of $z$ with small Gaussian noise variance $0.1$. Because both the prior and likelihood are Gaussian, the marginal distribution of $x$ is also Gaussian:
$$p_\theta(x) = \mathcal{N}(0, 1.1),$$
so the log evidence is available in closed form:
$$\log p_\theta(x)=-\frac{1}{2}\log(2\pi \cdot 1.1)-\frac{x^2}{2.2}.$$
The true posterior is also Gaussian. Combining the prior precision $1$ with the likelihood precision $1/0.1 = 10$, the posterior variance is
$$(\sigma^*)^2 = \frac{1}{1 + 10} = \frac{0.1}{1.1},$$
and the posterior mean is the precision-weighted estimate
$$\mu^* = \frac{x}{1.1}.$$
Thus
$$p_\theta(z \mid x)=\mathcal{N}\!\left(\mu^*,(\sigma^*)^2\right),\qquad\mu^* = \frac{x}{1.1},\qquad(\sigma^*)^2 = \frac{0.1}{1.1}.$$
This already contains an important intuition. The observation $x$ pulls the posterior mean toward $x$, but not all the way: the prior still shrinks the estimate toward zero. Since the observation noise is small, the posterior variance is much smaller than the prior variance. For $x=1$, we get
$$\mu^* \approx 0.91,\qquad\sigma^* \approx 0.30.$$
Now suppose our variational family is
$$q_\phi(z \mid x) = \mathcal{N}(\mu,\sigma^2).$$
This family is expressive enough to contain the true posterior exactly, because the true posterior is also a one-dimensional Gaussian. The ELBO is
$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]-D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z)).$$
In this Gaussian case, both terms can be evaluated analytically. The expected reconstruction term is
$$\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]=-\frac{1}{2}\log(2\pi \cdot 0.1)-\frac{(x-\mu)^2+\sigma^2}{2\cdot 0.1},$$
because under $q_\phi$,
$$\mathbb{E}_{q_\phi}[(x-z)^2] = (x-\mu)^2+\sigma^2.$$
The KL regularizer against the standard normal prior is
$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\right)=\frac{1}{2}\left(\mu^2+\sigma^2-1-\log\sigma^2\right).$$
So the ELBO is a deterministic function of only two variational parameters, $\mu$ and $\sigma$. There is no Monte Carlo noise, no neural network approximation, and no optimization mystery. We can inspect the entire objective surface directly.
The key identity from the previous section was
$$\log p_\theta(x)=\mathcal{L}(\theta,\phi;x)+D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
Because the evidence $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO over $\phi$ is equivalent to minimizing
$$D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
In this toy example, the minimum possible KL gap is exactly zero, since the variational family contains the true posterior. Therefore the global maximizer is
$$q_\phi(z \mid x) = p_\theta(z \mid x),$$
or equivalently,
$$\mu = \mu^*,\qquad\sigma = \sigma^*.$$
At that point,
$$D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)) = 0,$$
and the ELBO becomes tight:
$$\mathcal{L}(\theta,\phi;x) = \log p_\theta(x).$$
This is the tightness corollary in its cleanest possible form.
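The same maximizer can be recovered without appealing to the KL identity, by differentiating the closed-form ELBO directly. The following worked derivation is a cross-check built from the two analytic terms given above for $q_\phi(z\mid x)=\mathcal{N}(\mu,\sigma^2)$; it introduces nothing beyond ordinary calculus.

```latex
\begin{aligned}
\mathcal{L}(\mu,\sigma)
&= -\tfrac{1}{2}\log(2\pi\cdot 0.1)
   - \frac{(x-\mu)^2+\sigma^2}{2\cdot 0.1}
   - \tfrac{1}{2}\left(\mu^2+\sigma^2-1-\log\sigma^2\right), \\
\frac{\partial \mathcal{L}}{\partial \mu}
&= \frac{x-\mu}{0.1} - \mu = 0
   \;\Longrightarrow\; \mu = \frac{x}{1.1} = \mu^*, \\
\frac{\partial \mathcal{L}}{\partial \sigma}
&= -\frac{\sigma}{0.1} - \sigma + \frac{1}{\sigma} = 0
   \;\Longrightarrow\; \sigma^2 = \frac{1}{11} = \frac{0.1}{1.1} = (\sigma^*)^2.
\end{aligned}
```

Both second derivatives are negative ($-11$ in $\mu$, and $-11-1/\sigma^2$ in $\sigma$), so this stationary point is the unique maximum, in agreement with the posterior parameters derived from precisions.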
It is useful to compare three operating points for $x=1$:
Prior proposal: $q(z \mid x)=\mathcal{N}(0,1)$. This ignores the observation entirely. It has no KL cost relative to the prior, but it reconstructs $x=1$ poorly, so the ELBO is low.
Centered but too wide: $q(z \mid x)=\mathcal{N}(\mu^*, (3\sigma^*)^2)$. This puts its mean in the right place but spreads too much probability mass over implausible latent values. The ELBO improves, but the KL gap to the true posterior remains positive.
Exact posterior: $q(z \mid x)=\mathcal{N}(\mu^*,(\sigma^*)^2)$. This matches both the mean and uncertainty of the true posterior, so the KL gap vanishes and the ELBO reaches $\log p_\theta(x)$.
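These three operating points can be evaluated directly from the closed forms. The sketch below is illustrative: the helper names are hypothetical, and `kl_gauss` uses the standard KL formula between two one-dimensional Gaussians, which reduces to the prior-KL expression given earlier when the second argument is $\mathcal{N}(0,1)$.

```python
import math

X = 1.0
NOISE_VAR = 0.1                            # variance of p(x | z) = N(z, 0.1)
POST_MEAN = X / (1.0 + NOISE_VAR)          # mu* = x / 1.1
POST_VAR = NOISE_VAR / (1.0 + NOISE_VAR)   # (sigma*)^2 = 0.1 / 1.1


def log_evidence(x):
    """log N(x; 0, 1.1), the exact log marginal likelihood."""
    var = 1.0 + NOISE_VAR
    return -0.5 * math.log(2 * math.pi * var) - x**2 / (2 * var)


def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for 1D Gaussians."""
    return 0.5 * (var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                  - 1.0 - math.log(var_q / var_p))


def elbo(x, mu, var):
    """Expected reconstruction term minus KL to the N(0, 1) prior."""
    recon = (-0.5 * math.log(2 * math.pi * NOISE_VAR)
             - ((x - mu) ** 2 + var) / (2 * NOISE_VAR))
    return recon - kl_gauss(mu, var, 0.0, 1.0)


candidates = {
    "prior proposal": (0.0, 1.0),
    "centered but too wide": (POST_MEAN, 9.0 * POST_VAR),  # std = 3 * sigma*
    "exact posterior": (POST_MEAN, POST_VAR),
}

results = {}
for name, (mu, var) in candidates.items():
    bound = elbo(X, mu, var)
    gap = kl_gauss(mu, var, POST_MEAN, POST_VAR)
    results[name] = (bound, gap)
    print(f"{name:>22}: ELBO={bound:8.4f}  gap={gap:7.4f}  ELBO+gap={bound + gap:8.4f}")
print(f"log p(x) = {log_evidence(X):.4f}")
```

With these settings the ELBO comes out at roughly $-9.77$, $-4.32$, and $-1.42$ nats for the three candidates, and in each row the ELBO plus the gap should reproduce the same log evidence.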
The visual below consolidates these facts in two complementary ways. The contour plot treats the ELBO as a surface over $(\mu,\sigma)$: the maximum occurs exactly at the true posterior parameters, not merely near them. The prior-like approximation sits far away from the optimum, while the centered-but-wide approximation lands on the correct mean but remains suboptimal because its uncertainty is wrong.
The companion bar chart emphasizes the decomposition
$$\log p_\theta(x)=\mathcal{L}(\theta,\phi;x)+D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)).$$
For each candidate $q_\phi$, the ELBO plus the KL gap reaches the same evidence level. In the exact posterior case, the red “gap” disappears entirely, making the lower bound not just a bound, but the true log marginal likelihood.

12. The Gradient Problem: Why We Cannot Backprop Through Sampling

The one-dimensional Gaussian example makes the ELBO feel almost deceptively friendly: the KL term can be written down, the expectation can sometimes be evaluated or estimated, and the whole objective looks like something we should be able to optimize with ordinary backpropagation. But the moment we replace that toy setup with a neural encoder and decoder, a subtle obstruction appears. The problem is not that the ELBO is undefined, nor that Monte Carlo estimation is impossible. The problem is that the Monte Carlo samples themselves sit in the middle of the computation graph.
Recall that for a single datapoint $x$, the VAE objective is
$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]-D_{\mathrm{KL}}\bigl(q_{\phi}(z|x)\|p(z)\bigr).$$
When we differentiate with respect to the encoder parameters $\phi$, the gradient decomposes as
$$\nabla_{\phi}\,\mathcal{L}(\theta,\phi;x)=\underbrace{\nabla_{\phi}\,\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]}_{\text{reconstruction term: hard}}-\underbrace{\nabla_{\phi}\,D_{\mathrm{KL}}\bigl(q_{\phi}(z|x)\|p(z)\bigr)}_{\text{regularization term: usually easy}}.$$
The KL term is usually not the source of trouble in the standard VAE. If $q_{\phi}(z|x)$ is a diagonal Gaussian,
$$q_{\phi}(z|x)=\mathcal{N}\bigl(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\bigr),$$
and the prior is $p(z)=\mathcal{N}(0,I)$, then the KL divergence has a closed form. It is just a differentiable expression involving $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$. Autodiff can handle this directly.
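To illustrate why this term is easy, the sketch below writes out the diagonal-Gaussian KL as a sum over latent dimensions together with its analytic derivatives, and compares them with central finite differences. The vectors `mu` and `sigma` are hypothetical stand-ins for the encoder outputs $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$; in a real VAE an autodiff framework would produce these derivatives automatically.

```python
import math


def kl_diag_gaussian(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m**2 + s**2 - 1.0 - math.log(s**2) for m, s in zip(mu, sigma))


def kl_grad(mu, sigma):
    """Analytic gradients: dKL/dmu_k = mu_k and dKL/dsigma_k = sigma_k - 1/sigma_k."""
    return list(mu), [s - 1.0 / s for s in sigma]


def finite_difference(f, values, k, eps=1e-6):
    """Central finite difference of f with respect to values[k]."""
    up = list(values)
    down = list(values)
    up[k] += eps
    down[k] -= eps
    return (f(up) - f(down)) / (2 * eps)


mu = [0.5, -1.2, 0.0]      # stand-ins for encoder outputs mu_phi(x)
sigma = [0.8, 1.5, 1.0]    # stand-ins for encoder outputs sigma_phi(x)

grad_mu, grad_sigma = kl_grad(mu, sigma)
for k in range(len(mu)):
    fd_mu = finite_difference(lambda m: kl_diag_gaussian(m, sigma), mu, k)
    fd_sigma = finite_difference(lambda s: kl_diag_gaussian(mu, s), sigma, k)
    print(f"k={k}: dKL/dmu={grad_mu[k]:+.4f} (fd {fd_mu:+.4f}), "
          f"dKL/dsigma={grad_sigma[k]:+.4f} (fd {fd_sigma:+.4f})")
```

The third dimension, with `mu = 0` and `sigma = 1`, already matches the prior, so both of its derivatives are zero: the KL term exerts no pull on a coordinate that agrees with $\mathcal{N}(0,1)$.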
The reconstruction term is different because the distribution inside the expectation depends on $\phi$:
$$\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]=\int q_{\phi}(z|x)\,\log p_{\theta}(x|z)\,dz.$$
Here $\phi$ controls where we sample from. Changing $\phi$ changes the density $q_{\phi}(z|x)$, which changes the regions of latent space that contribute to the expectation. This is not the same as differentiating an ordinary deterministic function. The parameter $\phi$ affects the objective through the sampling distribution itself.
A natural first attempt is to use Monte Carlo sampling. Draw
$$z^{(l)} \sim q_{\phi}(z|x),$$
and approximate the reconstruction expectation by
$$\tilde{\mathcal{L}}=\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)}).$$
This gives an unbiased estimate of the expectation. For optimizing $\theta$, it is fine: once $z^{(l)}$ is sampled, the decoder likelihood $\log p_{\theta}(x|z^{(l)})$ is an ordinary differentiable function of $\theta$. But for optimizing $\phi$, naive backpropagation runs into a wall. The sampled value $z^{(l)}$ is not represented as a differentiable deterministic function of $\phi$. In the computation graph, the operation “sample from $q_{\phi}(z|x)$” behaves like a stochastic node, not like a smooth layer.
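The Monte Carlo estimate itself is easy to compute, which is part of what makes the gradient problem easy to overlook. The sketch below assumes the one-dimensional Gaussian decoder $p(x|z)=\mathcal{N}(z,0.1)$ from the worked example, in place of a neural decoder, so that the estimate can be compared against the known closed form. The values of `mu` and `sigma` are arbitrary stand-ins for encoder outputs.

```python
import math
import random

NOISE_VAR = 0.1  # assumed decoder variance, p(x | z) = N(z, 0.1)


def log_likelihood(x, z):
    """log p(x | z) for the Gaussian decoder N(z, NOISE_VAR)."""
    return -0.5 * math.log(2 * math.pi * NOISE_VAR) - (x - z) ** 2 / (2 * NOISE_VAR)


def mc_reconstruction(x, mu, sigma, num_samples, rng):
    """Average of log p(x | z^(l)) over samples z^(l) ~ N(mu, sigma^2)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        total += log_likelihood(x, z)
    return total / num_samples


def exact_reconstruction(x, mu, sigma):
    """Closed form of E_q[log p(x | z)] for this Gaussian pair."""
    return (-0.5 * math.log(2 * math.pi * NOISE_VAR)
            - ((x - mu) ** 2 + sigma**2) / (2 * NOISE_VAR))


rng = random.Random(0)
x, mu, sigma = 1.0, 0.9, 0.3
exact = exact_reconstruction(x, mu, sigma)
for num in (1, 10, 100, 10_000):
    estimate = mc_reconstruction(x, mu, sigma, num, rng)
    print(f"L={num:>6}: estimate={estimate:+.4f}  exact={exact:+.4f}")
```

Note that `rng.gauss(mu, sigma)` returns a plain number: nothing in the returned value records how it would change if `mu` or `sigma` changed, which is the obstruction described above.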
This is the key failure mode of naive Monte Carlo:
$$\nabla_{\phi}\left[\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(x|z^{(l)})\right]$$
does not correctly estimate
$$\nabla_{\phi}\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right].$$
If we treat the sampled $z^{(l)}$ values as fixed constants after sampling, then the reconstruction term has no differentiable path back to $\phi$. Informally, the gradient is blocked at the sampling operation. More precisely, the particular sampled numerical value may change if we rerun the sampler with a different $\phi$, but that dependence is not available to standard backpropagation as a local derivative through the realized sample.
There is a general-purpose workaround called the score-function estimator, often known in this context as REINFORCE. It uses the identity
$$\nabla_{\phi}\,\mathbb{E}_{q_{\phi}}[f(z)]=\mathbb{E}_{q_{\phi}}\left[f(z)\,\nabla_{\phi}\log q_{\phi}(z|x)\right].$$
This identity is valid under mild regularity assumptions and does not require differentiating through the sampled value $z$. Instead, it differentiates the log-density of the sampling distribution. That is powerful: it works even for discrete random variables, where ordinary pathwise derivatives are unavailable.
But for VAEs with neural decoders, this estimator is usually too noisy to be practical without substantial variance reduction. The scalar reward-like term $f(z)=\log p_{\theta}(x|z)$ can vary dramatically across samples, especially early in training. Multiplying it by $\nabla_{\phi}\log q_{\phi}(z|x)$ often produces gradients with very high variance. In principle the estimator is unbiased; in practice, its updates can be so unstable that learning becomes slow, erratic, or ineffective.
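The variance issue shows up even in a tiny example. The sketch below is a toy illustration, not a VAE: it assumes a one-dimensional Gaussian decoder $p(x|z)=\mathcal{N}(z,0.1)$, takes $q=\mathcal{N}(\mu,\sigma^2)$ with $\mu$ playing the role of the encoder parameter, and applies the score-function identity using $\nabla_\mu \log q(z) = (z-\mu)/\sigma^2$. The exact gradient is available here because the expectation has a closed form.

```python
import math
import random
import statistics

NOISE_VAR = 0.1  # assumed decoder variance, p(x | z) = N(z, 0.1)


def f(x, z):
    """Reward-like term f(z) = log p(x | z)."""
    return -0.5 * math.log(2 * math.pi * NOISE_VAR) - (x - z) ** 2 / (2 * NOISE_VAR)


def score_function_samples(x, mu, sigma, num_samples, rng):
    """Per-sample terms f(z) * d/dmu log q(z), with q = N(mu, sigma^2).

    For a Gaussian, d/dmu log q(z) = (z - mu) / sigma^2.
    """
    samples = []
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        samples.append(f(x, z) * (z - mu) / sigma**2)
    return samples


rng = random.Random(0)
x, mu, sigma = 1.0, 0.0, 1.0
terms = score_function_samples(x, mu, sigma, 50_000, rng)

# d/dmu of E_q[log p(x|z)] = const - ((x - mu)^2 + sigma^2) / (2 * NOISE_VAR)
exact_grad = (x - mu) / NOISE_VAR

print(f"exact gradient      : {exact_grad:.3f}")
print(f"score-function mean : {statistics.fmean(terms):.3f}")
print(f"per-sample std      : {statistics.stdev(terms):.3f}")
```

The sample mean should land near the exact gradient, consistent with unbiasedness, while the per-sample standard deviation is noticeably larger than the gradient itself. A single-sample update, as used in typical minibatch training, would therefore be dominated by noise.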
So the issue is not merely “sampling is random.” The deeper issue is that we need a low-variance differentiable estimator of how the reconstruction objective changes when the encoder distribution changes. Naive Monte Carlo gives samples but no usable pathwise gradient. REINFORCE gives a gradient but often with prohibitive variance. This tension is exactly what motivates the reparameterization trick.
The visual below compresses this obstruction into a computation graph. The forward pass is straightforward: $x$ is encoded into parameters of $q_{\phi}(z|x)$, a latent sample $z^{(l)}$ is drawn, and the decoder evaluates $\log p_{\theta}(x|z^{(l)})$. The backward pass is where the asymmetry appears: gradients flow cleanly through the decoder, but the attempted gradient path back through the stochastic sampling node is blocked.
The small REINFORCE annotation is the important caveat. There is a mathematical gradient estimator that avoids differentiating through the sample, but it pays for that generality with high variance. The next step is to replace the problematic sampling operation with an equivalent differentiable construction: one that preserves the same distribution over $z$ while allowing backpropagation to reach $\phi$.

13. The Reparameterization Trick: Core Identity

The obstruction we just ran into was not that the encoder distribution $q_{\phi}(z \mid x)$ is mysterious, nor that Gaussian sampling is impossible to differentiate in some philosophical sense. The issue is more specific: if we treat the operation “draw $z \sim q_{\phi}(z \mid x)$” as an opaque stochastic node, then the sampled value changes with $\phi$ only through the distribution from which it was drawn. Standard backpropagation expects deterministic computational paths; it does not know how to assign a pathwise derivative through the act of sampling itself.
The reparameterization trick resolves this by changing how we generate the same random variable. Instead of sampling directly from a $\phi$-dependent Gaussian, we sample from a fixed, parameter-free source of randomness and then transform that noise deterministically using the encoder outputs. For the diagonal Gaussian encoder used in the standard VAE,
$$q_{\phi}(z \mid x)=\mathcal{N}\!\left(\mu_{\phi}(x),\operatorname{diag}(\sigma_{\phi}(x)^2)\right),$$
we write a sample as
$$\hat{z}=\mu_{\phi}(x)+\sigma_{\phi}(x) \odot \epsilon,\qquad\epsilon \sim \mathcal{N}(0,I).$$
This is the core identity. The randomness now lives entirely in $\epsilon$, whose distribution does not depend on $\phi$. The encoder parameters only determine the deterministic map that takes $(x,\epsilon)$ to $\hat{z}$.
It is worth verifying that this has not changed the distribution being sampled. Componentwise,
$$\hat{z}_k=\mu_{\phi,k}(x)+\sigma_{\phi,k}(x)\epsilon_k,\qquad\epsilon_k \sim \mathcal{N}(0,1).$$
An affine transformation of a standard normal random variable is normal, with shifted mean and rescaled variance:
$$\hat{z}_k\sim\mathcal{N}\!\left(\mu_{\phi,k}(x),\sigma_{\phi,k}(x)^2\right).$$
Since the $\epsilon_k$ are independent under $\mathcal{N}(0,I)$, the resulting vector has diagonal covariance:
$$\hat{z}\sim\mathcal{N}\!\left(\mu_{\phi}(x),\operatorname{diag}(\sigma_{\phi}(x)^2)\right)=q_{\phi}(z \mid x).$$
So the reparameterized sample $\hat{z}$ is distributionally identical to a direct sample from the encoder. The trick is not an approximation to the Gaussian; it is an exact change of representation.
The important shift is computational. Before reparameterization, we had a sample $z \sim q_{\phi}(z \mid x)$, and the dependence of $z$ on $\phi$ was hidden inside the sampling operation. After reparameterization, we have
$$
\hat{z} = g_{\phi}(x,\epsilon) = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon,
$$
where $\epsilon$ is treated as an external random input. For a fixed draw of $\epsilon$, the map from $\phi$ to $\hat{z}$ is just an ordinary differentiable computation graph. Gradients can flow through $\mu_{\phi}(x)$, through $\sigma_{\phi}(x)$, and then through the decoder likelihood term $\log p_{\theta}(x \mid \hat{z})$.
This is why the reparameterization trick is sometimes called a pathwise gradient estimator. Instead of asking how the probability density changes with $\phi$, we ask how the sampled point moves in latent space when $\phi$ changes while holding the underlying noise fixed. Intuitively, $\epsilon$ chooses a location in “standard normal coordinates,” and the encoder stretches and shifts that location into the latent space used by the decoder.
For example, if increasing one encoder parameter increases $\mu_{\phi,k}(x)$, then every sampled $\hat{z}_k$ shifts upward by the corresponding amount. If increasing another parameter increases $\sigma_{\phi,k}(x)$, then samples with positive $\epsilon_k$ move upward and samples with negative $\epsilon_k$ move downward. These are ordinary derivatives:
$$
\frac{\partial \hat{z}_k}{\partial \mu_{\phi,k}} = 1, \qquad \frac{\partial \hat{z}_k}{\partial \sigma_{\phi,k}} = \epsilon_k.
$$
Backpropagation can now see exactly how changing the encoder changes the latent sample and, through the decoder, changes the reconstruction term in the ELBO.
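These two derivatives can be checked by hand on a toy problem. The sketch below holds one noise draw fixed, applies the chain rule with $\partial \hat{z}_k/\partial \mu_{\phi,k} = 1$ and $\partial \hat{z}_k/\partial \sigma_{\phi,k} = \epsilon_k$, and compares against finite differences. The function `f` is a hypothetical smooth stand-in for the decoder term, and `mu`, `sigma` are made-up values rather than outputs of a real encoder.

```python
import numpy as np

def sample_z(mu, sigma, eps):
    """Deterministic map (mu, sigma; eps) -> mu + sigma * eps for fixed noise."""
    return mu + sigma * eps

def f(z):
    """Hypothetical smooth objective standing in for log p(x | z)."""
    return -0.5 * np.sum((z - 1.0) ** 2)

rng = np.random.default_rng(0)
mu = np.array([0.3, -1.2])
sigma = np.array([0.8, 1.5])
eps = rng.standard_normal(2)          # one fixed noise draw

z = sample_z(mu, sigma, eps)
df_dz = -(z - 1.0)                    # gradient of f with respect to z
grad_mu = df_dz * 1.0                 # chain rule with dz/dmu = 1
grad_sigma = df_dz * eps              # chain rule with dz/dsigma = eps

def finite_diff(fun, v, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(v)
    for k in range(v.size):
        step = np.zeros_like(v)
        step[k] = h
        g[k] = (fun(v + step) - fun(v - step)) / (2.0 * h)
    return g

fd_mu = finite_diff(lambda m: f(sample_z(m, sigma, eps)), mu)
fd_sigma = finite_diff(lambda s: f(sample_z(mu, s, eps)), sigma)

print("grad wrt mu:   ", grad_mu, "finite diff:", fd_mu)
print("grad wrt sigma:", grad_sigma, "finite diff:", fd_sigma)
```

In an autodiff framework the same chain rule is applied automatically; the point of writing it out is only to show that nothing beyond ordinary differentiation is involved once the noise is fixed.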
There are a few assumptions hiding here. First, we need a distribution that can be expressed as a deterministic transformation of parameter-free noise. This is easy for location-scale families such as Gaussians, where samples can be written as “mean plus scale times noise.” Second, the transformation should be differentiable with respect to the parameters we want to optimize. Third, the base noise distribution $p(\epsilon)$ must not depend on $\phi$; otherwise the original problem reappears inside the supposedly fixed randomness.
In its most general form, the idea is
$$
z = g_{\phi}(x, \epsilon), \qquad \epsilon \sim p(\epsilon),
$$
where $p(\epsilon)$ is fixed and $g_{\phi}$ is differentiable in $\phi$. The diagonal Gaussian VAE is the canonical example, but the principle extends beyond it whenever such a transformation is available. When it is not available (for example, for many discrete latent variables), we need different gradient estimators or relaxations, and those usually come with higher variance or bias.
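As an illustration of the general form outside the Gaussian case, here is a sketch using an exponential distribution, which is an example chosen for this note and not something the VAE itself uses. The base noise is uniform on $[0,1)$, the map is the inverse CDF, and the parameter `rate` plays the role of $\phi$. Because $\mathbb{E}[z] = 1/\text{rate}$, the pathwise gradient estimate can be compared with the exact value $-1/\text{rate}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0
n = 200_000

u = rng.uniform(size=n)              # base noise u ~ Uniform[0, 1), independent of rate
z = -np.log1p(-u) / rate             # inverse-CDF map: z ~ Exponential(rate)

dz_drate = -z / rate                 # derivative of the map with u held fixed
grad_estimate = dz_drate.mean()      # pathwise estimate of d/d(rate) E[z]
exact_grad = -1.0 / rate**2          # since E[z] = 1 / rate

print("sample mean of z:", z.mean(), "exact:", 1.0 / rate)
print("pathwise gradient:", grad_estimate, "exact:", exact_grad)
```

The same three ingredients appear as in the Gaussian case: parameter-free noise, a differentiable map, and an ordinary derivative of that map with the noise held fixed.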
The visual below compresses this logic into two computation graphs. The left side represents the problematic formulation: $\phi$ determines a distribution, a stochastic sample is drawn, and the gradient path is interrupted at that sampling node. The right side rewrites the same sample as a deterministic function of encoder outputs and independent noise. The stochasticity has not disappeared; it has simply been moved to a place where it no longer carries $\phi$-dependence.
The key message to carry forward is that reparameterization does not make the VAE deterministic. Each forward pass can still use a fresh $\epsilon$, so the latent code remains random. What changes is that, conditional on that noise draw, the ELBO becomes an ordinary differentiable objective with respect to $\phi$. That is the bridge from stochastic latent-variable learning to standard neural-network optimization.

14. Theorem: Reparameterized Gradient Estimator is Unbiased

Having rewritten sampling as $z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon$, the next question is not merely whether this is a convenient implementation trick. The real question is whether it gives us the right gradient. If we replace a draw from $q_\phi(z\mid x)$ with a deterministic transformation of noise, are we still estimating
$$
\nabla_\phi \, \mathbb{E}_{q_\phi(z\mid x)}[f(z)]
$$
without bias?
This matters because the VAE objective contains expectations over latent variables sampled from the encoder distribution. For a single datapoint $x$, the reconstruction-related part of the ELBO often has the form
$$
\mathbb{E}_{q_\phi(z\mid x)}[f(z)],
$$
where, for example, $f(z)$ may be $\log p_\theta(x\mid z)$. The encoder parameters $\phi$ affect this expectation indirectly: changing $\phi$ changes the distribution over $z$. Naively sampling $z \sim q_\phi(z\mid x)$ creates a problem for backpropagation because the sampling operation itself is not a deterministic differentiable map of $\phi$.
The reparameterization trick resolves this by expressing the random variable as
$$
\hat z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I).
$$
Now all randomness lives in $\epsilon$, whose distribution does not depend on $\phi$. The dependence on $\phi$ has been moved into a deterministic function:
$$
z = T_\phi(x,\epsilon) = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon.
$$
So instead of differentiating through the act of sampling from $q_\phi$, we differentiate through the deterministic path
$$
\phi \longrightarrow \mu_\phi(x), \sigma_\phi(x) \longrightarrow z \longrightarrow f(z).
$$
The theorem states that, under the usual smoothness and integrability assumptions that allow us to interchange differentiation and expectation,
$$
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)]
=
\mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi}\, f\!\left(\mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon\right)\right].
$$
This is the central identity. The left side is the gradient we actually want: the derivative of an expectation under the encoder distribution. The right side is something we can estimate by ordinary backpropagation after sampling $\epsilon$. Importantly, the expectation is now over $p(\epsilon)$, a fixed distribution independent of $\phi$.
Given $L$ independent samples $\epsilon^{(1)}, \ldots, \epsilon^{(L)} \sim \mathcal{N}(0,I)$, the Monte Carlo estimator is
$$
\hat{g} = \frac{1}{L}\sum_{l=1}^{L} \nabla_{\phi}\, f\!\left(\mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon^{(l)}\right).
$$
Because each term in the average has expectation equal to the right-hand side of the theorem, linearity of expectation gives
$$
\mathbb{E}[\hat g] = \nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)].
$$
So the estimator is unbiased: averaged over repeated draws of $\epsilon$, it recovers the exact gradient of the expected objective.
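A toy case makes the unbiasedness statement concrete. In the sketch below, $f(z) = \sum_k z_k^2$ is an assumption chosen because its expectation is known in closed form, $\mathbb{E}[f(z)] = \sum_k (\mu_k^2 + \sigma_k^2)$, so the exact gradients are $2\mu$ and $2\sigma$. The gradients are taken with respect to $\mu$ and $\sigma$ directly, standing in for the encoder parameters $\phi$; the function name and parameter values are hypothetical.

```python
import numpy as np

def reparam_grad(mu, sigma, L, rng):
    """L-sample pathwise gradient estimate of E[f(z)] for the toy f(z) = sum(z**2)."""
    eps = rng.standard_normal((L, mu.size))
    z = mu + sigma * eps
    df_dz = 2.0 * z                         # gradient of f at each sample
    g_mu = df_dz.mean(axis=0)               # uses dz/dmu = 1
    g_sigma = (df_dz * eps).mean(axis=0)    # uses dz/dsigma = eps
    return g_mu, g_sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
sigma = np.array([0.7, 1.2])

exact_mu, exact_sigma = 2.0 * mu, 2.0 * sigma

g1_mu, g1_sigma = reparam_grad(mu, sigma, 1, rng)        # one noisy draw
g_mu, g_sigma = reparam_grad(mu, sigma, 200_000, rng)    # average of many draws

print("L=1 estimate:     ", g1_mu, g1_sigma)
print("L=200000 estimate:", g_mu, g_sigma)
print("exact gradient:   ", exact_mu, exact_sigma)
```

A single draw is noisy, but the large-$L$ average settles near the exact gradient, which is what unbiasedness predicts.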
There are a few assumptions hiding inside this clean statement. First, $f$ must be differentiable with respect to $z$, and the transformation from $\phi$ to $z$ must also be differentiable. This is why the standard Gaussian VAE encoder, parameterized by $\mu_\phi(x)$ and $\log \sigma_\phi^2(x)$, is so convenient. Second, the support of the distribution should not change in a pathological way as $\phi$ changes; otherwise, differentiating under the integral can become delicate. Third, the estimator is unbiased for the gradient of the reparameterized expectation, but finite-sample estimates still have variance. Unbiased does not mean noiseless.
The reason this estimator is so effective is not only that it is unbiased, but that it tends to have low variance. Compare it to the score-function or REINFORCE estimator:
$$
\hat g_{\text{RF}} = f(z)\,\nabla_\phi \log q_\phi(z\mid x), \qquad z \sim q_\phi(z\mid x).
$$
REINFORCE is also unbiased and applies more generally, including to discrete random variables. But it usually has much higher variance because it does not exploit the local derivative $\nabla_z f(z)$. It treats $f(z)$ more like a black-box reward. The reparameterized estimator, by contrast, uses the geometry of $f$: gradients flow through the sampled latent value back into $\mu_\phi(x)$ and $\sigma_\phi(x)$.
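The variance gap can be seen on a one-dimensional toy problem. The sketch below assumes $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$ and $\mu = \sigma = 1$, and estimates the gradient with respect to $\mu$ both ways. For this particular setup the per-sample variances can be worked out by hand as $4$ for the pathwise estimator and $30$ for REINFORCE; other choices of $f$ and parameters will give different ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

eps = rng.standard_normal(n)
z = mu + sigma * eps                      # z ~ N(mu, sigma^2)

f = z**2                                  # toy objective; E[f] = mu^2 + sigma^2
pathwise = 2.0 * z                        # f'(z) * dz/dmu, one value per sample
score = (z - mu) / sigma**2               # d/dmu of log N(z; mu, sigma^2)
reinforce = f * score                     # score-function estimate, one value per sample

print("exact gradient wrt mu:", 2.0 * mu)
print("pathwise  mean, var:", pathwise.mean(), pathwise.var())
print("REINFORCE mean, var:", reinforce.mean(), reinforce.var())
```

Both sample means should sit near the exact gradient $2\mu$, while the REINFORCE samples are spread far more widely around it.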
This explains a practical fact about VAEs: we often use $L=1$ latent sample per datapoint during training. At first this may seem surprisingly crude, but minibatch stochastic optimization already averages gradients across many datapoints. The total gradient noise comes from both minibatch sampling and latent-variable sampling; in practice, the low variance of the pathwise gradient makes a single latent draw per example sufficient for stable learning.
The visual below condenses this theorem into two linked ideas. The top part emphasizes the algebraic identity: a gradient of an expectation under $q_\phi(z\mid x)$ can be rewritten as an expectation over fixed noise $\epsilon$, with gradients traveling through the deterministic map $\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon$. The key annotation is that $\epsilon$ is independent of $\phi$, which is precisely what makes ordinary backpropagation valid.
The bottom part gives the statistical intuition. Both REINFORCE and the reparameterized estimator are centered on the true gradient, so both are unbiased. But the REINFORCE distribution is much wider, representing high-variance gradient estimates, while the reparameterized estimator concentrates much more tightly around the true value. That difference in variance is what turns the theorem from a formal identity into a practical training method for VAEs.

15. Proof: Reparameterized Gradient via Change of Variables

Now that we know the reparameterized gradient estimator is unbiased, it is worth slowing down and asking why the identity is true. The key point is not mysterious: we are simply rewriting a $\phi$-dependent distribution as a deterministic transformation of noise drawn from a fixed distribution. Once the randomness no longer depends on $\phi$, differentiation becomes an ordinary backpropagation problem through a deterministic computation graph.
Start with the quantity we care about. For a fixed datapoint $x$, suppose the encoder distribution is a diagonal Gaussian,
$$
q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^{2}(x))\right),
$$
and suppose $f(z)$ is some downstream scalar objective term, such as a decoder log-likelihood contribution. We want the gradient
$$
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)].
$$
The difficulty is that the distribution inside the expectation depends on $\phi$. If we sample $z \sim q_{\phi}(z \mid x)$, the sample itself changes as the encoder parameters change, but this dependence is not explicit in the notation. Naively differentiating through a random draw is not well-defined in the usual computational graph sense.
Writing the expectation as an integral makes the dependency explicit:
$$
\mathbb{E}_{q_{\phi}(z|x)}[f(z)] = \int f(z)\, q_{\phi}(z \mid x)\, dz.
$$
Here, $\phi$ appears in the density $q_{\phi}(z \mid x)$. A score-function estimator would differentiate the density directly, producing terms involving $\nabla_{\phi}\log q_{\phi}(z \mid x)$. That approach is general, but often high-variance. The reparameterization trick instead asks whether we can move the $\phi$-dependence out of the density and into the argument of $f$.
For a diagonal Gaussian, we can. Let
$$
\epsilon \sim p(\epsilon) = \mathcal{N}(0,I),
$$
and define
$$
z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon.
$$
This deterministic transformation maps standard normal noise into a sample from $q_{\phi}(z \mid x)$. Componentwise,
$$
z_k = \mu_{\phi,k}(x) + \sigma_{\phi,k}(x)\,\epsilon_k.
$$
Assuming $\sigma_{\phi,k}(x) > 0$, this map is invertible in each coordinate:
$$
\epsilon_k = \frac{z_k - \mu_{\phi,k}(x)}{\sigma_{\phi,k}(x)}.
$$
The Jacobian is diagonal, with entries $\sigma_{\phi,k}(x)$, so the volume element transforms as
$$
dz = \prod_{k=1}^{K}\sigma_{\phi,k}(x)\; d\epsilon,
$$
where $K$ is the latent dimension. This is the whole mathematical engine of the proof. Under the change of variables $z = \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon$, the original density-weighted measure becomes
$$
q_{\phi}(z \mid x)\,dz = p(\epsilon)\,d\epsilon.
$$
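It may help to carry out that substitution explicitly. The following worked step uses only the diagonal Gaussian density and the Jacobian factor stated above, with $K$ denoting the latent dimension:
$$
\begin{aligned}
q_{\phi}(z \mid x)\,dz
&= \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_{\phi,k}(x)}
\exp\!\left(-\frac{(z_k-\mu_{\phi,k}(x))^2}{2\,\sigma_{\phi,k}(x)^2}\right) dz \\
&= \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\,\sigma_{\phi,k}(x)}
\exp\!\left(-\frac{\epsilon_k^2}{2}\right)
\cdot \prod_{k=1}^{K}\sigma_{\phi,k}(x)\; d\epsilon \\
&= \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}}
\exp\!\left(-\frac{\epsilon_k^2}{2}\right) d\epsilon
= p(\epsilon)\,d\epsilon.
\end{aligned}
$$
The scale factors from the Jacobian cancel the $1/\sigma_{\phi,k}(x)$ normalizers of the Gaussian density, which is why no trace of $\phi$ remains in the measure.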
Therefore,
$$
\int f(z)\, q_{\phi}(z|x)\, dz
=
\int f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right) p(\epsilon)\,d\epsilon.
$$
Equivalently,
$$
\mathbb{E}_{q_{\phi}(z|x)}[f(z)]
=
\mathbb{E}_{p(\epsilon)}\!\left[f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].
$$
Notice what changed. The distribution $p(\epsilon)$ is now fixed: it does not depend on $\phi$. All dependence on $\phi$ has moved into the deterministic map
$$
\epsilon \mapsto \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon.
$$
That is why backpropagation becomes possible. We are no longer differentiating “through sampling” from a moving distribution; we are differentiating through a deterministic function of fixed external noise.
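The equality of the two integrals can also be checked numerically in one dimension. This sketch uses a plain Riemann sum on a wide grid; the test function `f` and the values of `mu` and `sigma` are arbitrary choices made for illustration, picked so that a closed-form answer is available for comparison.

```python
import numpy as np

mu, sigma = 0.7, 1.3

def f(z):
    """Hypothetical smooth test function (not a real decoder likelihood)."""
    return np.sin(z) + z**2

def normal_pdf(t, mean=0.0, std=1.0):
    return np.exp(-0.5 * ((t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Left side: integrate f(z) q(z) over z.
z = np.linspace(mu - 12.0 * sigma, mu + 12.0 * sigma, 200_001)
lhs = np.sum(f(z) * normal_pdf(z, mu, sigma)) * (z[1] - z[0])

# Right side: integrate f(mu + sigma * eps) p(eps) over eps.
eps = np.linspace(-12.0, 12.0, 200_001)
rhs = np.sum(f(mu + sigma * eps) * normal_pdf(eps)) * (eps[1] - eps[0])

# Closed form for this f: E[sin z] = sin(mu) * exp(-sigma^2 / 2), E[z^2] = mu^2 + sigma^2.
exact = np.sin(mu) * np.exp(-0.5 * sigma**2) + mu**2 + sigma**2

print("integral over z:  ", lhs)
print("integral over eps:", rhs)
print("closed form:      ", exact)
```

The grid spacing in $z$ is $\sigma$ times the spacing in $\epsilon$, which is the one-dimensional version of the Jacobian factor in the proof.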
Under the usual regularity assumptions (smooth enough $f$, differentiable encoder outputs, and an integrable dominating function allowing us to exchange differentiation and integration), we can pass the gradient through the expectation:
$$
\nabla_{\phi}\,
\mathbb{E}_{p(\epsilon)}\!\left[f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right]
=
\mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi}\, f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].
$$
Combining the equality of expectations with this differentiation step gives the reparameterized gradient identity:
$$
\nabla_{\phi}\, \mathbb{E}_{q_{\phi}(z|x)}[f(z)]
=
\mathbb{E}_{p(\epsilon)}\!\left[\nabla_{\phi}\, f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right].
$$
This is exactly the statement that the Monte Carlo estimator
$$
\nabla_{\phi}\, f\!\left(\mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon^{(s)}\right),
\qquad
\epsilon^{(s)} \sim \mathcal{N}(0,I),
$$
is unbiased for the desired gradient.
There are two subtle assumptions hiding here. First, the transformation must be valid: for the diagonal Gaussian case, this means the scale parameters are positive and the mapping between $\epsilon$ and $z$ is invertible almost everywhere. In practice, VAEs often parameterize $\log\sigma_{\phi}^{2}(x)$ or use a softplus transformation to guarantee positivity. Second, exchanging $\nabla_{\phi}$ and the integral requires mild analytic conditions. Neural networks used in practice are usually treated as satisfying these conditions almost everywhere, but this step is still a mathematical assumption, not magic.
The practical takeaway is compact:
Before reparameterization: randomness comes from $q_{\phi}(z \mid x)$, whose parameters depend on $\phi$.
After reparameterization: randomness comes from $p(\epsilon)$, which is fixed.
Optimization benefit: gradients flow through $\mu_{\phi}(x)$, $\sigma_{\phi}(x)$, and $f$ by ordinary backpropagation.
The visual summary below condenses the proof into its four logical moves: write the expectation as an integral, perform the Gaussian change of variables, observe that the transformed density becomes $p(\epsilon)$, and finally move the gradient through the expectation because $p(\epsilon)$ is independent of $\phi$.
The highlighted substitution is the pivotal step. Once $q_{\phi}(z \mid x)\,dz$ has been rewritten as $p(\epsilon)\,d\epsilon$, the proof is essentially finished: the parameter dependence has migrated from the probability measure into a differentiable deterministic path, which is precisely what makes the VAE encoder trainable with low-variance gradient estimates.

16. Gaussian Encoder: Closed-Form KL Divergence

Now that we have a pathwise gradient estimator for samples from the encoder, it is tempting to think that every part of the VAE objective needs Monte Carlo estimation. But one of the conveniences of the standard VAE is that this is not true. The reconstruction term usually requires samples $z \sim q_{\phi}(z \mid x)$, because it passes through a nonlinear decoder. The KL regularizer, however, has a closed-form expression when the encoder is Gaussian and the prior is standard normal.
Recall the per-example ELBO:
$$
\mathcal{L}(x;\theta,\phi)
=
\mathbb{E}_{q_{\phi}(z\mid x)}\left[\log p_{\theta}(x\mid z)\right]
-
D_{\mathrm{KL}}\!\left(q_{\phi}(z\mid x)\,\|\,p(z)\right).
$$
In the common diagonal Gaussian VAE, the encoder outputs two vectors:
$$
\mu_{\phi}(x) \in \mathbb{R}^K, \qquad \sigma_{\phi}(x) \in \mathbb{R}_{>0}^K,
$$
and defines
$$
q_{\phi}(z\mid x) = \mathcal{N}\left(\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}(x)^2)\right).
$$
The prior is usually chosen as
$$
p(z)=\mathcal{N}(0,I).
$$
This particular pairing is not accidental. A diagonal Gaussian compared against a standard Gaussian gives a KL term that decomposes dimension-by-dimension. That means the model can penalize each latent coordinate independently for moving its posterior mean away from $0$, or its posterior variance away from $1$.
Start from the definition:
$$
D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z))
=
\mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(z)\right].
$$
Because both distributions factorize across latent dimensions, their log-densities are sums over coordinates. The $-\tfrac{1}{2}\log(2\pi)$ normalization terms appear once per coordinate in both log-densities and cancel exactly in the difference, so we can write
$$
D_{\mathrm{KL}}
=
\mathbb{E}_{q}\left[
\sum_{k=1}^{K}\left(-\frac{(z_k-\mu_k)^2}{2\sigma_k^2} - \log\sigma_k\right)
-
\sum_{k=1}^{K}\left(-\frac{z_k^2}{2}\right)
\right].
$$
Here $\mu_k$ and $\sigma_k$ mean the $k$-th encoder outputs for the current datapoint $x$. The expectation is still with respect to $q_{\phi}(z\mid x)$, but now only simple Gaussian moments are needed. Under $z_k \sim \mathcal{N}(\mu_k,\sigma_k^2)$,
$$
\mathbb{E}_q[(z_k-\mu_k)^2] = \sigma_k^2,
\qquad
\mathbb{E}_q[z_k^2] = \mu_k^2+\sigma_k^2.
$$
Substituting these moments and simplifying gives the exact closed form:
$$
D_{\mathrm{KL}}(q_{\phi}(z|x)\,\|\,p(z))
=
\frac{1}{2}\sum_{k=1}^{K}\left(\mu_k^2 + \sigma_k^2 - 1 - \log\sigma_k^2\right).
$$
This expression is worth reading term by term. The $\mu_k^2$ penalty discourages the approximate posterior from shifting its mean far away from the prior mean $0$. The combination
$$
\sigma_k^2 - 1 - \log\sigma_k^2
$$
penalizes the posterior variance for deviating from the prior variance $1$. It is minimized at $\sigma_k^2=1$, where it equals $0$. Thus each latent coordinate pays no KL cost only when
$$
\mu_k=0, \qquad \sigma_k^2=1,
$$
meaning $q_{\phi}(z\mid x)$ matches the prior along that coordinate.
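The closed form is short enough to write down directly and compare against a Monte Carlo estimate of the defining expectation. This is a minimal sketch: the function name is hypothetical, and `mu` and `sigma` are made-up encoder outputs for one datapoint.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for one datapoint."""
    var = sigma**2
    return 0.5 * np.sum(mu**2 + var - 1.0 - np.log(var))

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
sigma = np.array([0.5, 1.5])

kl_exact = kl_diag_gaussian_to_std_normal(mu, sigma)

# Monte Carlo estimate of E_q[log q(z|x) - log p(z)]; the log(2*pi) terms cancel.
eps = rng.standard_normal((200_000, mu.size))
z = mu + sigma * eps
log_q = np.sum(-0.5 * eps**2 - np.log(sigma), axis=1)   # (z - mu) / sigma equals eps
log_p = np.sum(-0.5 * z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

# The KL vanishes when the encoder output matches the prior exactly.
kl_at_prior = kl_diag_gaussian_to_std_normal(np.zeros(2), np.ones(2))

print("closed form:", kl_exact)
print("Monte Carlo:", kl_mc)
print("at prior:   ", kl_at_prior)
```

The two estimates should agree up to sampling noise, but only the closed form is noise-free, which is the reason implementations prefer it.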
A crucial practical point is that this KL term is fully differentiable and sampling-free. We do not need to draw $z$ to estimate it, and we do not need the reparameterization trick for this part of the objective. Gradients can flow directly through the encoder outputs $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$, or more commonly through $\mu_{\phi}(x)$ and $\log\sigma_{\phi}(x)^2$, since neural networks typically output log-variance for numerical stability.
This exactness also clarifies an important failure mode. Because the KL term pushes $q_{\phi}(z\mid x)$ toward $\mathcal{N}(0,I)$, a powerful decoder may learn to reconstruct well while ignoring $z$. In that case the encoder can choose $\mu_{\phi}(x)\approx 0$ and $\sigma_{\phi}(x)^2\approx 1$, making the KL nearly zero. This is one view of posterior collapse: the approximate posterior becomes almost identical to the prior, so the latent code carries little information about $x$.
There are a few assumptions hiding inside the neat formula:
The encoder covariance is diagonal, so the KL separates over coordinates.
The prior is exactly standard Gaussian, $p(z)=\mathcal{N}(0,I)$.
The variance parameters must remain positive, which is why implementations often parameterize $\log\sigma_k^2$ rather than $\sigma_k$ directly.
The formula is per datapoint; minibatch training averages or sums it across examples depending on the convention used for the loss.
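The last two points can be sketched together. The code below assumes the encoder emits an unconstrained log-variance, so positivity holds by construction, and computes the KL per example for a small batch before reducing it by either convention. The array values and the function name are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def kl_from_log_var(mu, log_var):
    """Per-example KL to N(0, I) from (mu, log sigma^2); inputs have shape (B, K)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=1)

# Hypothetical unconstrained encoder outputs for a batch of B = 3 examples, K = 2.
mu = np.array([[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]])
log_var = np.array([[0.0, 0.0], [-2.0, 0.3], [1.0, -0.7]])

kl_per_example = kl_from_log_var(mu, log_var)   # shape (B,)
sigma = np.exp(0.5 * log_var)                   # positive for any real log_var

kl_sum = kl_per_example.sum()      # "sum over the batch" convention
kl_mean = kl_per_example.mean()    # "mean over the batch" convention

print("per-example KL:", kl_per_example)
print("batch sum:", kl_sum, "batch mean:", kl_mean)
```

The two reductions differ only by the factor $B$, but that factor changes the effective learning rate and the balance against a reconstruction term reduced differently, so it is worth keeping the convention consistent across both ELBO terms.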
The visual below condenses the derivation into the three essential moves: expand the KL definition, substitute the diagonal Gaussian log-densities, and apply the two Gaussian moment identities. The boxed expression is the quantity that appears directly in a VAE implementation.
It also separates the computational roles of the two ELBO terms. The KL side is analytic and deterministic; the reconstruction side still requires samples from $q_{\phi}(z\mid x)$, which is where the reparameterization trick from the previous section becomes necessary.

17. Gaussian Decoder: Reconstruction Term as MSE

Having handled the KL term for a Gaussian encoder, the remaining piece of the ELBO is the part that asks a simple but crucial question: if we sample a latent code from the encoder, how well can the decoder explain the original observation? This is the reconstruction term,
$$
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)].
$$
It is often implemented as a mean-squared error loss, but that is not merely a convenient engineering choice. Under a Gaussian decoder assumption, MSE falls directly out of the likelihood model.
Assume the decoder defines a conditional Gaussian distribution over observations:
$$
p_{\theta}(x|z) = \mathcal{N}(f_{\theta}(z), \sigma^2 I).
$$
Here $f_{\theta}(z)$ is the decoder network output, interpreted as the mean of the conditional distribution over $x$. The variance $\sigma^2 I$ says that, given $z$, each observed dimension is corrupted by independent isotropic Gaussian noise with the same variance $\sigma^2$. This is a modeling assumption: it is reasonable for continuous-valued data after suitable normalization, but it is not automatically appropriate for binary pixels, counts, categorical variables, or perceptually structured image data.
For one latent sample $z$, the Gaussian log-likelihood is
$$
\log p_{\theta}(x|z)
=
-\frac{D}{2}\log(2\pi\sigma^2)
-
\frac{1}{2\sigma^2}\|x - f_{\theta}(z)\|^2,
$$
where $D$ is the dimensionality of $x$. The first term is the Gaussian normalization constant. If $\sigma^2$ is fixed, it does not depend on the decoder parameters $\theta$. The second term is the squared reconstruction error, scaled by $-1/(2\sigma^2)$. Therefore, maximizing the Gaussian likelihood pushes $f_{\theta}(z)$ toward $x$ in squared-error distance.
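The link between the log-likelihood and squared error can be made concrete with a small sketch. The two candidate outputs below are made-up vectors standing in for $f_{\theta}(z)$ at two different parameter settings, and `noise_var` plays the role of the fixed decoder variance $\sigma^2$.

```python
import numpy as np

def gaussian_log_likelihood(x, x_mean, noise_var):
    """log N(x; x_mean, noise_var * I) for a single observation vector x."""
    D = x.size
    sq_err = np.sum((x - x_mean) ** 2)
    return -0.5 * D * np.log(2.0 * np.pi * noise_var) - sq_err / (2.0 * noise_var)

x = np.array([0.2, -1.0, 0.7, 1.5])
good = np.array([0.25, -0.9, 0.6, 1.4])    # hypothetical decoder output close to x
bad = np.array([1.0, 0.0, 0.0, 0.0])       # hypothetical decoder output far from x
noise_var = 0.5

ll_good = gaussian_log_likelihood(x, good, noise_var)
ll_bad = gaussian_log_likelihood(x, bad, noise_var)

se_good = np.sum((x - good) ** 2)
se_bad = np.sum((x - bad) ** 2)

print("log-likelihoods:", ll_good, ll_bad)
print("squared errors: ", se_good, se_bad)
# With noise_var fixed, the normalization constant cancels in the difference.
print("ll difference:", ll_good - ll_bad, "=", (se_bad - se_good) / (2.0 * noise_var))
```

Because the normalization term is identical for both candidates, ranking decoder outputs by likelihood is the same as ranking them by squared error whenever `noise_var` is held fixed.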
Inside the VAE, however, $z$ is not a single deterministic code. It is drawn from the approximate posterior $q_{\phi}(z|x)$, so the ELBO uses the expected log-likelihood:
$$
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
=
-\frac{D}{2}\log(2\pi\sigma^2)
-
\frac{1}{2\sigma^2}\,
\mathbb{E}_{q_{\phi}(z|x)}\left[\|x - f_{\theta}(z)\|^2\right].
$$
Using the reparameterization trick, we usually write the sampled latent variable as
$$
\hat{z} = \mu_{\phi}(x) + \sigma_{\phi}(x)\odot \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,I),
$$
so the reconstruction term becomes an expectation over noise injected through a differentiable transformation. This is what lets gradients flow not only into the decoder parameters $\theta$, but also back into the encoder parameters $\phi$.
If $\sigma^2$ is treated as a fixed hyperparameter, then maximizing the reconstruction term with respect to $\theta$ is equivalent to minimizing
$$
\mathbb{E}_{p(\epsilon)}\left[\|x - f_{\theta}(\hat{z})\|^2\right],
$$
up to a constant and a positive scaling factor. This is the precise sense in which the Gaussian decoder gives rise to expected MSE as the reconstruction loss. In practice, this expectation is usually approximated with one or a few Monte Carlo samples of $\epsilon$ per datapoint.
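To see what the Monte Carlo approximation is estimating, it helps to pick a case where the expectation has a closed form. The sketch below assumes a linear decoder $f(z) = Wz + b$, which is a simplification made only for this check; real VAE decoders are nonlinear and have no such formula. All weights and encoder outputs are randomly generated or made up.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 2, 3

# Hypothetical linear decoder f(z) = W z + b and a made-up observation x.
W = rng.standard_normal((D, K))
b = rng.standard_normal(D)
x = rng.standard_normal(D)

# Hypothetical encoder outputs for this x.
mu = np.array([0.4, -0.3])
sigma = np.array([0.6, 1.1])

def recon_sq_error(eps):
    """Squared reconstruction error for each row of eps, via z_hat = mu + sigma * eps."""
    z_hat = mu + sigma * eps              # shape (n, K)
    x_hat = z_hat @ W.T + b               # shape (n, D)
    return np.sum((x - x_hat) ** 2, axis=1)

single = recon_sq_error(rng.standard_normal((1, K)))[0]          # one-sample estimate
many = recon_sq_error(rng.standard_normal((200_000, K))).mean()  # large-sample average

# Linear-decoder closed form: error at the mean plus a variance-dependent penalty.
exact = np.sum((x - (W @ mu + b)) ** 2) + np.sum(sigma**2 * np.sum(W**2, axis=0))

print("one-sample estimate:", single)
print("large-sample mean:  ", many)
print("closed form:        ", exact)
```

The closed form also shows why the reconstruction term pushes encoder variances down: every unit of $\sigma_k^2$ adds squared error in proportion to how strongly the decoder depends on coordinate $k$, while the KL term pulls the other way.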
There are two subtle points worth keeping separate. First, the constant term
$$
-\frac{D}{2}\log(2\pi\sigma^2)
$$
can be dropped only when $\sigma^2$ is fixed. If $\sigma^2$ is learned, that term matters; otherwise the model could change its variance without paying the correct likelihood normalization cost. Second, although dropping constants is harmless for optimizing $\theta$, the reconstruction expectation still depends on $\phi$ through $\hat{z}$. So the encoder is trained not only to match the prior through the KL term, but also to choose latent distributions that allow accurate reconstructions.
The variance $\sigma^2$ also has an important practical interpretation. In the full ELBO,
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
-
\mathrm{KL}(q_{\phi}(z|x)\,\|\,p(z)),
$$
the reconstruction penalty is scaled by $1/(2\sigma^2)$. Smaller $\sigma^2$ makes reconstruction errors more expensive relative to the KL regularizer; larger $\sigma^2$ softens the reconstruction pressure and makes the KL term comparatively more influential. Thus $\sigma^2$ acts like a reconstruction–regularization trade-off knob, much like the weighting coefficient in a $\beta$-VAE, though it arises from the likelihood model itself.
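One way to make the analogy concrete, assuming a fixed $\sigma^2$ and a single reparameterized sample $\hat{z}$ in place of the expectation, is to multiply the negative ELBO estimate by $2\sigma^2$:
$$
-2\sigma^2\left[\log p_{\theta}(x|\hat{z}) - \mathrm{KL}(q_{\phi}(z|x)\,\|\,p(z))\right]
=
\|x - f_{\theta}(\hat{z})\|^2
+
2\sigma^2\,\mathrm{KL}(q_{\phi}(z|x)\,\|\,p(z))
+
D\sigma^2\log(2\pi\sigma^2).
$$
The last term does not depend on $\theta$ or $\phi$, so minimizing this quantity is the same as minimizing raw squared error plus a KL penalty with weight $2\sigma^2$. Under this reading, choosing $\sigma^2 = 1/2$ corresponds to the familiar "sum of squared errors plus KL" loss with unit weight on the KL term.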
This likelihood choice also helps explain a classic VAE failure mode: blurry reconstructions. A Gaussian decoder trained with squared error is rewarded for predicting conditional means. When the posterior or decoder is uncertain among several plausible outputs, averaging them can produce visually smooth or blurry samples. This is not just an optimization artifact; it is partly a consequence of the assumed observation model. For binary data, a more appropriate decoder is often Bernoulli,
$$
p_{\theta}(x|z)
=
\prod_d \mathrm{Bernoulli}\left(x_d;\ \operatorname{sigmoid}(f_{\theta}(z)_d)\right),
$$
where $\operatorname{sigmoid}(a) = 1/(1+e^{-a})$ is the logistic function (not to be confused with the noise scale $\sigma$) and the decoder output $f_{\theta}(z)_d$ is read as a logit. This leads to a binary cross-entropy reconstruction term rather than MSE.
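A sketch of the Bernoulli log-likelihood computed from logits follows. It uses the identity $\log p = x\,\ell - \operatorname{softplus}(\ell)$ per dimension, written in a form that avoids overflow for large $|\ell|$; the function name and the example logits are assumptions made for illustration.

```python
import numpy as np

def bernoulli_log_likelihood(x, logits):
    """Sum over d of log Bernoulli(x_d; sigmoid(logits_d)), in a stable form."""
    # softplus(l) = log(1 + exp(l)) = max(l, 0) + log1p(exp(-|l|))
    softplus = np.maximum(logits, 0.0) + np.log1p(np.exp(-np.abs(logits)))
    return np.sum(x * logits - softplus)

x = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.5, 0.3, 4.0])     # hypothetical decoder outputs

log_lik = bernoulli_log_likelihood(x, logits)
bce = -log_lik                                # binary cross-entropy, summed over dimensions

# Naive reference formula: fine for moderate logits, fragile for extreme ones.
p = 1.0 / (1.0 + np.exp(-logits))
naive = np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

# The stable form stays finite even for a badly wrong, very confident prediction.
extreme = bernoulli_log_likelihood(np.array([1.0]), np.array([-1000.0]))

print("stable log-likelihood:", log_lik)
print("naive log-likelihood: ", naive)
print("extreme case:         ", extreme)
```

Negating the log-likelihood gives the binary cross-entropy that takes the place of squared error in the reconstruction term.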
The visual below condenses this derivation into a chain: start from the Gaussian decoder, expand the log-likelihood, take the expectation under the encoder distribution, and then drop the $\theta$-constant terms to reveal the expected MSE objective. The key algebraic move is that the squared Euclidean error appears directly inside the Gaussian log-density.
It also highlights the modeling choices surrounding the derivation: the reparameterized sample $\hat{z}$ is what makes the expected reconstruction loss trainable by backpropagation, the Bernoulli alternative reminds us that MSE is not universal, and the note about $\sigma^2$ previews how the reconstruction term will be balanced against the KL divergence in the full VAE objective.

18. The Full VAE Objective: Putting It All Together

Now that the Gaussian decoder has turned the reconstruction term into something as familiar as squared error, we can finally assemble the pieces into the objective that is actually optimized in a VAE. The important point is that the VAE loss is not just reconstruction error with noise injected into the latent space. It is a variational lower bound with two coupled responsibilities: explain each datapoint well through the decoder, while keeping the encoder’s approximate posterior close enough to the prior that generation remains possible.
For a single datapoint $x$, the ELBO is
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
-
D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right).
$$
The first term rewards latent samples $z$ that allow the decoder $p_\theta(x|z)$ to assign high probability to the observed datapoint. The second term penalizes the encoder distribution $q_\phi(z|x)$ for drifting too far from the prior $p(z)$, usually $p(z)=\mathcal{N}(0,I)$. This is what makes the latent space usable at generation time: after training, we want to sample $z\sim p(z)$, not from some fragmented collection of unrelated encoder distributions.
In practice, the expectation in the reconstruction term is estimated with samples. With the reparameterization trick, we write
$$
\hat{z} = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I).
$$
This separates the randomness $\epsilon$ from the encoder parameters $\phi$. Instead of sampling $z$ from a distribution whose parameters depend on $\phi$ in a way that blocks ordinary backpropagation, we sample parameter-free noise and transform it differentiably. That is the key technical move that allows gradients from the decoder’s reconstruction likelihood to flow backward through $\hat{z}$, then into $\mu_\phi(x)$ and $\sigma_\phi(x)$.
Using one Monte Carlo sample $\hat{z}$, the per-datapoint stochastic ELBO estimate becomes
$$
\tilde{\mathcal{L}}(\theta,\phi;x)
=
\log p_\theta(x|\hat{z})
-
\frac{1}{2}
\sum_{k=1}^{K}
\left[
\mu_{\phi,k}(x)^2
+
\sigma_{\phi,k}(x)^2
-
1
-
\log \sigma_{\phi,k}(x)^2
\right].
$$
Here the KL term has been written in closed form for the common diagonal Gaussian encoder
$$
q_\phi(z|x)
=
\mathcal{N}
\left(
\mu_\phi(x),
\operatorname{diag}(\sigma_\phi(x)^2)
\right),
$$
with standard normal prior $p(z)=\mathcal{N}(0,I)$. This analytic KL is one of the conveniences of the classical VAE setup. It avoids adding more sampling noise to the objective and gives a direct regularizing signal to the encoder.
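To make the "no extra sampling noise" point concrete, the sketch below compares the closed-form KL with a sample-based estimate of the same quantity. It is a minimal illustration using only the Python standard library; the function names, the sample count, and the values of `mu` and `sigma` are illustrative choices, not part of any particular implementation.

```python
import math
import random

def kl_closed_form(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def kl_monte_carlo(mu, sigma, num_samples, rng):
    """Sample-based estimate: average of log q(z|x) - log p(z) with z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        for m, s in zip(mu, sigma):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps
            # log N(z; m, s^2) written in terms of eps = (z - m) / s
            log_q = -0.5 * eps * eps - math.log(s) - 0.5 * math.log(2 * math.pi)
            log_p = -0.5 * z * z - 0.5 * math.log(2 * math.pi)
            total += log_q - log_p
    return total / num_samples

rng = random.Random(0)
mu, sigma = [0.8, -1.2], [0.5, 0.9]   # illustrative encoder outputs
exact = kl_closed_form(mu, sigma)
noisy = kl_monte_carlo(mu, sigma, num_samples=20000, rng=rng)
print(f"closed form: {exact:.4f}, Monte Carlo: {noisy:.4f}")
```

The Monte Carlo value fluctuates with the random seed and the number of samples, while the closed-form value is deterministic given $\mu_\phi(x)$ and $\sigma_\phi(x)$.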
A useful way to read the objective is as a controlled compromise:
Reconstruction term: make $\hat{z}$ informative enough that the decoder can recover $x$.
KL term: prevent $q_\phi(z|x)$ from becoming an arbitrary lookup table for each datapoint.
Reparameterization: make the sampled latent path differentiable with respect to $\phi$.
Closed-form KL: regularize the encoder without needing a Monte Carlo estimate.
This compromise is also where many VAE failure modes originate. If the KL penalty dominates too early, the encoder may learn $q_\phi(z|x)\approx p(z)$ for all $x$, causing posterior collapse: the latent variable carries little information, and the decoder behaves almost like an unconditional model. If the reconstruction term dominates, the model may encode datapoints very precisely but produce a latent space that does not match the prior, making prior samples poor. With Gaussian decoders, another common issue is blurry reconstructions, because maximizing a pixelwise Gaussian likelihood often encourages conditional means rather than sharp, multimodal outputs.
For a dataset $\{x^{(n)}\}_{n=1}^N$, training maximizes the sum of per-datapoint estimates:
$$
\mathcal{L}(\theta,\phi)
=
\sum_{n=1}^{N}
\tilde{\mathcal{L}}(\theta,\phi;x^{(n)}).
$$
Since full-dataset optimization is usually impractical, we use minibatches. For a batch of size $B$, an unbiased estimate of the full objective is
$$
\mathcal{L}_{\text{batch}}
=
\frac{N}{B}
\sum_{b=1}^{B}
\tilde{\mathcal{L}}(\theta,\phi;x^{(b)}).
$$
Many implementations omit the factor $N/B$ when optimizing an average loss, because it only rescales the gradient for a fixed dataset and can be absorbed into the learning rate. What matters conceptually is that the minibatch objective estimates the same dataset-level ELBO. Also note the sign convention: derivations usually maximize the ELBO, while code often minimizes the negative ELBO, sometimes reported as reconstruction loss plus KL penalty.
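The unbiasedness of the minibatch estimate can be checked in one line. The derivation assumes the minibatch indices $b_1,\ldots,b_B$ are each marginally uniform on $\{1,\ldots,N\}$ (true for uniform sampling with or without replacement); this is an assumption about the sampling scheme, and the index notation $b_j$ is introduced here only for the derivation:
$$
\mathbb{E}\!\left[\frac{N}{B}\sum_{j=1}^{B}\tilde{\mathcal{L}}(\theta,\phi;x^{(b_j)})\right]
=
\frac{N}{B}\sum_{j=1}^{B}\frac{1}{N}\sum_{n=1}^{N}\tilde{\mathcal{L}}(\theta,\phi;x^{(n)})
=
\sum_{n=1}^{N}\tilde{\mathcal{L}}(\theta,\phi;x^{(n)}).
$$
Dropping the factor $N$ gives the minibatch mean, which is an unbiased estimate of the average per-datapoint objective instead of the sum.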
The gradient dependencies are asymmetric but tightly coordinated. The decoder parameters $\theta$ appear only inside the likelihood term $\log p_\theta(x|\hat{z})$, so $\nabla_\theta$ is ordinary decoder backpropagation. The encoder parameters $\phi$, however, receive two kinds of signal: an analytic KL gradient that pushes $\mu_\phi(x)$ toward zero and $\sigma_\phi(x)$ toward one, and a reconstruction gradient that flows through the reparameterized latent sample $\hat{z}$. This is why the full VAE objective trains encoder and decoder jointly rather than treating inference and generation as separate procedures.
The visual below condenses this entire training objective into its two main colored pathways. The blue reconstruction path represents the stochastic, reparameterized route from encoder outputs through $\hat{z}$ into the decoder likelihood. The red KL path represents the closed-form regularizer acting directly on the encoder’s Gaussian parameters.
It also separates the three levels at which the same idea appears: the per-datapoint ELBO estimate, the full-dataset sum, and the minibatch estimator used in SGD. That hierarchy is worth keeping in mind before moving to the training algorithm: implementation is mostly bookkeeping, but the bookkeeping must preserve these two terms and their gradient paths.

19. Algorithm: VAE Training (Minibatch SGD)

Now that the ELBO has been assembled into a reconstruction term minus a regularization term, the remaining question is operational: what exactly happens during one training iteration? A VAE can look conceptually complicated because it contains an encoder, a decoder, a latent random variable, and a variational objective. But the actual training loop is quite close to ordinary minibatch SGD once we express the stochastic latent sample in a differentiable way.
For each datapoint $x^{(b)}$ in a minibatch, the encoder produces the parameters of an approximate posterior distribution,
$$
q_{\phi}(z \mid x^{(b)}) = \mathcal{N}\!\left(
z;\mu_{\phi}(x^{(b)}), \operatorname{diag}(\sigma_{\phi}(x^{(b)})^2)
\right).
$$
The diagonal Gaussian assumption is doing a lot of work here. It makes sampling cheap, makes the KL divergence against the standard normal prior analytic, and gives us a simple parameterization for uncertainty in each latent dimension. In implementations, the network often predicts $\log \sigma^2$, or logvar, rather than $\sigma$ directly, because the variance must remain positive and because log-variance is numerically more stable.
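A minimal sketch of the log-variance parameterization follows. The helper names and the example numbers are hypothetical; the point is only that any real-valued network output is a valid `logvar`, and that the KL can be written without ever taking the logarithm of a network output.

```python
import math

def sigma_from_logvar(logvar):
    """Recover sigma = exp(0.5 * log sigma^2); positive for any real input."""
    return [math.exp(0.5 * lv) for lv in logvar]

def kl_from_logvar(mu, logvar):
    """Closed-form KL to N(0, I), expressed directly in terms of logvar."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# Hypothetical raw encoder outputs for K = 3 latent dimensions.
mu = [0.3, -0.5, 0.0]
logvar = [-4.0, 0.0, 2.5]
sigma = sigma_from_logvar(logvar)
kl = kl_from_logvar(mu, logvar)
print(sigma, kl)
```

Because $\sigma^2 = \exp(\text{logvar})$, the term $-\log\sigma^2$ in the KL becomes simply `-lv`, which avoids evaluating a logarithm near zero.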
The central step is the reparameterized forward pass. Instead of sampling directly from $q_{\phi}(z \mid x)$, we sample parameter-free noise
$$
\epsilon^{(b)} \sim \mathcal{N}(0,I),
$$
and construct
$$
\hat{z}^{(b)}
=
\mu_{\phi}(x^{(b)})
+
\sigma_{\phi}(x^{(b)}) \odot \epsilon^{(b)}.
$$
This turns the random draw into a deterministic differentiable function of $\mu_{\phi}$, $\sigma_{\phi}$, and external noise $\epsilon$. That distinction is essential: gradients cannot flow through “sample from this distribution” in the usual backpropagation sense, but they can flow through addition, multiplication, and neural network outputs. The randomness remains, but it has been isolated from the parameters.
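As a sanity check that the reparameterized draw really has the intended distribution, the sketch below samples many values of $\hat{z}$ and compares their empirical mean and standard deviation with $\mu$ and $\sigma$. The function name and the numeric values are illustrative assumptions.

```python
import random
import statistics

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); mu and sigma enter only through arithmetic."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu, sigma = [0.8, -1.2], [0.5, 0.9]   # illustrative encoder outputs
samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]

emp_mean, emp_std = [], []
for k in range(len(mu)):
    column = [z[k] for z in samples]
    emp_mean.append(statistics.fmean(column))
    emp_std.append(statistics.stdev(column))
print(emp_mean, emp_std)
```

The empirical moments should land close to `mu` and `sigma`, confirming that transforming standard normal noise is equivalent to sampling from the diagonal Gaussian directly.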
Once $\hat{z}^{(b)}$ is sampled, the decoder evaluates the likelihood of reconstructing $x^{(b)}$ from that latent code:
$$
\log p_{\theta}(x^{(b)} \mid \hat{z}^{(b)}).
$$
This term depends on the chosen observation model. For binarized MNIST, it is usually a Bernoulli log-likelihood. For continuous data, one often uses a Gaussian likelihood, sometimes with fixed variance. This modeling choice matters: a simple Gaussian decoder trained with squared-error-like losses often encourages averaged reconstructions, which is one reason VAEs can produce blurry samples compared with adversarial models.
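For the Bernoulli case, a decoder that outputs logits $a_j$ with pixel probabilities $\pi_j = \mathrm{sigmoid}(a_j)$ can be scored without computing $\pi_j$ explicitly. The sketch below is one numerically stable way to do this; the function names and the toy inputs are assumptions made for illustration.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def bernoulli_log_likelihood(x, logits):
    """Sum over pixels of x_j*log(pi_j) + (1-x_j)*log(1-pi_j), with pi_j = sigmoid(logit_j)."""
    # Identities used: log(pi) = -softplus(-a) and log(1 - pi) = -softplus(a).
    return sum(-xj * softplus(-a) - (1 - xj) * softplus(a)
               for xj, a in zip(x, logits))

x = [1, 0, 1, 1]                  # toy 4-pixel binary "image"
logits = [2.0, -3.0, 0.0, 40.0]   # hypothetical decoder outputs
print(bernoulli_log_likelihood(x, logits))
```

Working in logit space keeps the result finite even when the decoder is extremely confident, where a naive `log(1 - sigmoid(a))` would evaluate `log(0)`.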
The KL term is computed analytically for each example because both $q_{\phi}(z \mid x)$ and the prior $p(z)=\mathcal{N}(0,I)$ are Gaussian. For a $K$-dimensional diagonal posterior,
$$
\mathrm{KL}^{(b)}
=
\frac{1}{2}
\sum_{k=1}^{K}
\left[
\mu_{\phi,k}(x^{(b)})^2
+
\sigma_{\phi,k}(x^{(b)})^2
-
1
-
\log \sigma_{\phi,k}(x^{(b)})^2
\right].
$$
This term penalizes approximate posteriors that drift too far from the prior. Intuitively, it asks the encoder to use latent space economically: encode information only when it improves reconstruction enough to justify the cost. That tradeoff is powerful, but it also creates one of the most important VAE failure modes: posterior collapse. If the decoder is expressive enough to model the data without using $z$, optimization may drive $q_{\phi}(z \mid x)$ close to $p(z)$, making the latent code nearly uninformative.
Putting the two pieces together, a single-sample Monte Carlo estimate of the per-example ELBO is
$$
\tilde{\mathcal{L}}^{(b)}
=
\log p_{\theta}\!\left(x^{(b)} \mid \hat{z}^{(b)}\right)
-
\mathrm{KL}^{(b)}.
$$
Using one latent sample per datapoint is common because the minibatch itself already provides stochasticity, and the reparameterized gradient estimator usually has manageable variance. More samples can reduce variance, but they also increase computation; in standard VAE training, one sample is often a good tradeoff.
For a minibatch of size $B$, the objective is typically written as an unbiased estimate of the full-data ELBO:
$$
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
The factor $N/B$ appears if we are estimating the sum of ELBOs over the dataset. Many software implementations instead optimize the minibatch mean and omit $N$; this changes the scale of the loss and therefore the effective learning rate, but not the location of the optimum. The sign convention is another common source of bugs: mathematically we maximize the ELBO, while most deep learning libraries minimize losses, so implementations often minimize
$$
-\frac{1}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
Both parameter sets are updated from the same scalar objective. The decoder parameters $\theta$ receive gradients through the reconstruction likelihood. The encoder parameters $\phi$ receive gradients through two paths: directly through the analytic KL term, and indirectly through $\hat{z}^{(b)}$ into the reconstruction term because of the reparameterization trick. In gradient-ascent form,
$$
\theta
\leftarrow
\theta
+
\alpha \nabla_{\theta}
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)},
\qquad
\phi
\leftarrow
\phi
+
\alpha \nabla_{\phi}
\frac{N}{B}\sum_{b=1}^{B}\tilde{\mathcal{L}}^{(b)}.
$$
In practice, these are simultaneous optimizer updates, usually performed by Adam or a related adaptive method.
The key algorithmic pattern is therefore:
Encode $x$ into $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$.
Sample $\epsilon$, then form $\hat{z}$ using the reparameterization trick.
Decode $\hat{z}$ and evaluate the reconstruction log-likelihood.
Compute the KL term analytically.
Optimize the reconstruction-minus-KL objective by backpropagating through both networks.
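The five steps can be sketched as a single loss computation. This is a forward pass only: `encode` and `decode` are hypothetical toy stand-ins for the two neural networks, the data are four-pixel binary vectors, and the gradient step is left to whatever automatic-differentiation framework is in use. It assumes a Bernoulli decoder and a log-variance encoder output.

```python
import math
import random

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def encode(x):
    """Hypothetical stand-in for the encoder network: returns (mu, logvar) with K = 2."""
    s = sum(x) / len(x)
    return [s - 0.5, 0.5 - s], [-1.0, -0.5]

def decode(z):
    """Hypothetical stand-in for the decoder network: returns one logit per pixel."""
    return [z[0] + z[1] * j for j in range(4)]

def negative_elbo(batch, rng):
    """Minibatch mean of -(log p(x | z_hat) - KL); this is the loss to minimize."""
    total = 0.0
    for x in batch:
        mu, logvar = encode(x)                                    # 1. encode
        eps = [rng.gauss(0.0, 1.0) for _ in mu]                   # 2. sample noise
        z_hat = [m + math.exp(0.5 * lv) * e
                 for m, lv, e in zip(mu, logvar, eps)]            #    reparameterize
        logits = decode(z_hat)                                    # 3. decode
        log_lik = sum(-xj * softplus(-a) - (1 - xj) * softplus(a)
                      for xj, a in zip(x, logits))                #    Bernoulli log-likelihood
        kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                       for m, lv in zip(mu, logvar))              # 4. analytic KL
        total += -(log_lik - kl)
    return total / len(batch)                                     # 5. differentiate this scalar

rng = random.Random(0)
batch = [[1, 0, 1, 1], [0, 0, 1, 0]]
loss = negative_elbo(batch, rng)
print(loss)
```

Note that the sketch returns the minibatch mean of the negative ELBO, so it follows the "minimize a loss" convention and omits the factor $N$.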
The visual that follows compresses this whole training iteration into a pseudocode-style view. The highlighted latent-sampling line is the computational heart of the VAE: it is what converts a stochastic latent-variable model into something compatible with backpropagation. The highlighted KL line emphasizes the other major simplification: for the standard Gaussian prior and diagonal Gaussian encoder, this part of the ELBO does not need Monte Carlo estimation.
The update lines at the bottom summarize the most important implementation detail: $\theta$ and $\phi$ are optimized together using the same minibatch ELBO estimate. The gradient notes make explicit where each derivative flows: $\nabla_{\theta}$ through the decoder likelihood, and $\nabla_{\phi}$ through both the analytic KL and the reparameterized reconstruction path.

20. Worked Example: VAE on Binarized MNIST

After the training loop is written down, the VAE can still feel slightly abstract: we say “encode, sample, decode, add KL, backpropagate,” but it is worth seeing what those words mean for one concrete datapoint. Let us trace a single minibatch element: a binarized MNIST digit, say a handwritten 3, represented as $x \in \{0,1\}^{784}$. The pixels are binary, so the natural decoder likelihood is a product of Bernoulli distributions over pixels, with the decoder producing logits or probabilities for each of the 784 coordinates.
For this worked example, suppose the latent space is only two-dimensional, $K=2$. This is much smaller than we would typically use for high-quality generation, but it is ideal for understanding the mechanics. The encoder network $g_{\phi}$ maps the image to the parameters of a diagonal Gaussian approximate posterior,
$$
q_{\phi}(z \mid x)
=
\mathcal{N}\!\left(
z;\ \mu_{\phi}(x),\ \operatorname{diag}(\sigma_{\phi}^2(x))
\right).
$$
For our particular image, assume the encoder outputs
$$
\mu_{\phi}(x) = [0.8,\ -1.2],
\qquad
\sigma_{\phi}(x) = [0.5,\ 0.9].
$$
These numbers have a simple interpretation. The encoder believes that, for this digit, plausible latent codes are centered near $(0.8,-1.2)$, with less uncertainty in the first coordinate than the second. The diagonal covariance assumption means the two latent dimensions are conditionally independent under $q_{\phi}(z \mid x)$, which makes both sampling and the KL term cheap. This assumption is computationally convenient, but it is also restrictive: if the true posterior has strong correlations between latent dimensions, a diagonal Gaussian encoder cannot represent them directly.
Now we need a latent sample. A naive expression would be
$$
\hat{z} \sim q_{\phi}(z \mid x),
$$
but this hides a problem: the sampling operation depends on $\phi$, and we want gradients to flow through the sampled latent code into the encoder. The reparameterization trick rewrites the sample as a deterministic differentiable function of $\mu_{\phi}(x)$, $\sigma_{\phi}(x)$, and external noise $\epsilon$ that does not depend on $\phi$:
$$
\epsilon \sim \mathcal{N}(0,I_2),
\qquad
\hat{z}
=
\mu_{\phi}(x)
+
\sigma_{\phi}(x) \odot \epsilon.
$$
If, for this forward pass, we draw
$$
\epsilon = [0.3,\ -0.7],
$$
then the latent sample is
$$
\hat{z}
=
[0.8,\ -1.2]
+
[0.5,\ 0.9] \odot [0.3,\ -0.7]
=
[0.95,\ -1.83].
$$
The important point is not just the numerical value. It is the computational path: $\hat{z}$ is now differentiable with respect to both $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$. The randomness has been isolated in $\epsilon$, while the dependence on encoder parameters remains inside ordinary differentiable arithmetic.
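The arithmetic and the differentiability claim can both be checked numerically. The sketch below recomputes $\hat{z}$ from the numbers in the text and uses finite differences, with $\epsilon$ held fixed, to confirm that $\partial \hat{z}_k / \partial \mu_k = 1$ and $\partial \hat{z}_k / \partial \sigma_k = \epsilon_k$. The helper name `z_hat` and the step size `h` are illustrative choices.

```python
mu = [0.8, -1.2]
sigma = [0.5, 0.9]
eps = [0.3, -0.7]   # the fixed noise draw from the text

def z_hat(mu, sigma, eps):
    """Reparameterized latent sample: elementwise mu + sigma * eps."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

z = z_hat(mu, sigma, eps)

# Finite-difference derivatives with eps held fixed.
h = 1e-6
dz0_dmu0 = (z_hat([mu[0] + h, mu[1]], sigma, eps)[0] - z[0]) / h
dz1_dsigma1 = (z_hat(mu, [sigma[0], sigma[1] + h], eps)[1] - z[1]) / h
print([round(v, 2) for v in z], round(dz0_dmu0, 3), round(dz1_dsigma1, 3))
```

The derivative with respect to $\sigma$ equals the sampled noise value, which is why a different draw of $\epsilon$ gives a different (but still well-defined) gradient for the encoder.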
The decoder $f_{\theta}$ then maps $\hat{z}$ back to Bernoulli parameters for the 784 pixels. For a binary image, the reconstruction log-likelihood has the form
$$
\log p_{\theta}(x \mid \hat{z})
=
\sum_{j=1}^{784}
\left[
x_j \log \pi_{\theta,j}(\hat{z})
+
(1-x_j)\log(1-\pi_{\theta,j}(\hat{z}))
\right],
$$
where $\pi_{\theta,j}(\hat{z})$ is the decoder’s predicted probability that pixel $j$ is on. Suppose this particular decoder evaluation gives
$$
\log p_{\theta}(x \mid \hat{z}) = -89.2 \text{ nats}.
$$
This term rewards reconstructions that assign high probability to the observed binary pixels. Since it is a log probability over 784 dimensions, a negative value is expected. What matters during optimization is whether changing $\theta$ and $\phi$ makes this quantity less negative on average while maintaining a reasonable latent posterior.
The regularizer is the KL divergence from the approximate posterior to the standard normal prior $p(z)=\mathcal{N}(0,I)$. Because both distributions are diagonal Gaussians, the KL has a closed form:
$$
D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))
=
\frac{1}{2}
\sum_{k=1}^{2}
\left[
\mu_{\phi,k}^2
+
\sigma_{\phi,k}^2
-
1
-
\log\sigma_{\phi,k}^2
\right].
$$
Plugging in the two latent dimensions, with $-\log 0.25 \approx 1.386$ and $-\log 0.81 \approx 0.211$, gives
$$
D_{\mathrm{KL}}
=
\frac{1}{2}
\left[
(0.64+0.25-1+1.386)
+
(1.44+0.81-1+0.211)
\right]
=
\frac{1}{2}[1.276+1.461]
\approx
1.37.
$$
This number measures how much the encoder’s posterior for this example deviates from the prior. Large means, very small variances, or very large variances all tend to increase the KL. Intuitively, the KL penalizes the encoder for using latent codes that are too specialized or too far from the standard normal geometry that the decoder will later sample from at generation time.
Combining the reconstruction term and the KL term gives the single-sample Monte Carlo estimate of the ELBO:
$$
\tilde{\mathcal{L}}(\theta,\phi;x)
=
\underbrace{\log p_{\theta}(x\mid\hat{z})}_{-89.2}
-
\underbrace{D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))}_{1.37}
=
-90.57 \text{ nats}.
$$
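The numbers in this worked example can be reproduced in a few lines. The reconstruction value of $-89.2$ nats is taken as given from the text (it depends on a decoder that is not specified here); only the KL and the final subtraction are actually computed.

```python
import math

mu = [0.8, -1.2]
sigma = [0.5, 0.9]
log_lik = -89.2   # reconstruction log-likelihood assumed in the text, in nats

# Per-dimension closed-form KL to the standard normal prior.
kl_terms = [0.5 * (m * m + s * s - 1.0 - math.log(s * s))
            for m, s in zip(mu, sigma)]
kl = sum(kl_terms)
elbo = log_lik - kl
print([round(t, 3) for t in kl_terms], round(kl, 2), round(elbo, 2))
```

The per-dimension terms show that the second latent coordinate contributes slightly more to the KL, mainly because its mean is further from zero.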
Training maximizes this quantity, or equivalently minimizes its negative. The gradients split cleanly. Parameters $\theta$ receive gradients through the decoder likelihood. Parameters $\phi$ receive gradients in two ways: directly through the analytic KL term, and indirectly through the reconstruction term via
$$
\phi
\longrightarrow
(\mu_{\phi}(x),\sigma_{\phi}(x))
\longrightarrow
\hat{z}
\longrightarrow
\log p_{\theta}(x\mid\hat{z}).
$$
This is why the reparameterization trick is not a cosmetic rewrite; it is what makes the reconstruction part of the objective train the encoder using standard backpropagation.
There are already hints of common VAE failure modes in this tiny example. If the KL pressure is too strong, the encoder may learn $\mu_{\phi}(x)\approx 0$ and $\sigma_{\phi}(x)\approx 1$ for most inputs, making $q_{\phi}(z\mid x)$ nearly equal to the prior; then the decoder may ignore $z$, a phenomenon known as posterior collapse. Conversely, if the decoder likelihood is too simple or the latent representation is too compressed, reconstructions may average over plausible outputs, producing the characteristic blurriness associated with VAEs on continuous-valued images. Even in binarized MNIST, where Bernoulli likelihoods are relatively well matched to the data, the balance between reconstruction accuracy and latent regularity is the central tension.
The visual below condenses this entire forward pass into one computation graph: a binarized image enters the encoder, the encoder emits $\mu_{\phi}$ and $\sigma_{\phi}$, external Gaussian noise is injected through the reparameterization equation, and the resulting $\hat{z}$ is decoded into a Bernoulli reconstruction score. The KL calculation is separated because it does not require sampling; once $\mu_{\phi}$ and $\sigma_{\phi}$ are known, its value follows analytically.
It also highlights the gradient story. The decoder parameters $\theta$ are trained through the reconstruction likelihood, while the encoder parameters $\phi$ are trained both by the closed-form KL and by the decoder loss through $\hat{z}$. That combination, one stochastic but reparameterized reconstruction path plus one deterministic Gaussian KL path, is the practical core of VAE training.

21. Empirical Results: Learned Latent Space and Generation

After walking through a single numerical forward pass, the natural next question is: what does all this optimization actually buy us? A VAE is not trained merely to reconstruct individual inputs. It is trained to shape a latent space in which encoding, sampling, decoding, and interpolation all behave coherently. The empirical test is therefore twofold: whether the latent variables organize semantic information, and whether samples drawn from the prior decode into plausible observations.
A useful diagnostic is to look at the encoder means $\mu_{\phi}(x)$. Recall that the encoder produces an approximate posterior
$$
q_{\phi}(z \mid x) = \mathcal{N}\!\left(z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\right),
$$
so $\mu_{\phi}(x)$ is the “central” latent representation assigned to $x$. If we train a VAE on MNIST with only two latent dimensions, we can scatter-plot $\mu_{\phi}(x^{(n)})$ for thousands of test images. This is not just a visualization trick: in $K=2$, the entire learned latent geometry is visible.
What we typically see is a semantically structured but overlapping latent space. Digits of the same class tend to occupy nearby regions, because the decoder can more easily reconstruct similar digits from nearby latent codes. But the clusters are not perfectly separated. That is expected, and in fact desirable: the VAE objective does not ask for a discriminative classifier. It balances reconstruction quality against proximity to the prior $p(z)=\mathcal{N}(0,I)$. The KL term discourages the encoder from carving the latent space into isolated islands far from the origin.
This is one of the central empirical signatures of VAEs: nearby latent points decode to nearby-looking outputs. If two codes $z_1$ and $z_2$ are close in latent space, then the decoder $f_{\theta}$ usually maps them to visually similar digits. This continuity comes from both the neural decoder and the regularization imposed by the prior. Without the KL penalty, an autoencoder can learn a jagged, disconnected latent representation where interpolation may pass through meaningless regions. With the VAE objective, the model is pressured to make the latent space usable under samples from a simple distribution.
Generation then becomes conceptually simple. We sample
$$
z \sim p(z) = \mathcal{N}(0,I),
$$
and decode it using $p_{\theta}(x \mid z)$, often parameterized by a neural network $f_{\theta}(z)$. For MNIST, the generated samples are usually recognizable as digits. However, they often look soft or blurry, especially compared with real handwritten digits. This is not merely an implementation flaw; it reflects a modeling assumption.
In the common Gaussian decoder setting,
$$
p_{\theta}(x \mid z) = \mathcal{N}\!\left(x; f_{\theta}(z), \sigma^2 I\right),
$$
maximizing likelihood encourages the decoder output $f_{\theta}(z)$ to approximate a conditional mean. If several plausible sharp images correspond to nearby latent explanations, the mean of those possibilities may average edges, strokes, and fine details. The result is a digit that is semantically correct but visually smoothed. This is one reason VAEs are often said to trade sample sharpness for a well-behaved likelihood objective and structured latent space.
Interpolation is another revealing experiment. Given two test images $x^{(a)}$ and $x^{(b)}$, we encode them to their posterior means and linearly interpolate:
$$
\hat{z}(\lambda)
=
(1-\lambda)\,\mu_{\phi}(x^{(a)})
+
\lambda\,\mu_{\phi}(x^{(b)}),
\qquad
\lambda \in [0,1].
$$
Decoding $f_{\theta}(\hat{z}(\lambda))$ for increasing $\lambda$ often produces a smooth morph, such as a digit 3 gradually becoming an 8. This matters because it suggests that the model has learned more than memorized examples. It has learned a continuous latent manifold where semantic transformations correspond to smooth movement in $\mathcal{Z}$.
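A minimal sketch of the interpolation path follows. The two posterior means are made-up placeholders for $\mu_{\phi}(x^{(a)})$ and $\mu_{\phi}(x^{(b)})$, and the decoding step is omitted because it requires a trained decoder; the function assumes `num_steps` is at least 2.

```python
def interpolate(mu_a, mu_b, num_steps):
    """Latent codes on the straight line from mu_a to mu_b, endpoints included (num_steps >= 2)."""
    path = []
    for i in range(num_steps):
        lam = i / (num_steps - 1)
        path.append([(1 - lam) * a + lam * b for a, b in zip(mu_a, mu_b)])
    return path

mu_a = [0.8, -1.2]   # hypothetical posterior mean of x^(a)
mu_b = [-0.5, 1.0]   # hypothetical posterior mean of x^(b)
path = interpolate(mu_a, mu_b, num_steps=5)
for z in path:
    print([round(v, 3) for v in z])   # each z would be passed to the decoder
```

Each row is one value of $\hat{z}(\lambda)$; feeding the rows to the decoder in order would produce the frames of the morph.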
Quantitatively, we still judge the model through the ELBO. Kingma and Welling reported a test-set bound around
$$
\mathcal{L}_{\text{test}} \approx -86.6 \text{ nats}
$$
for a stronger VAE configuration with a 500-unit MLP and 20 latent dimensions. A two-dimensional latent space is excellent for visualization, but it is capacity-limited: many details needed to explain the data must be compressed away. With $K_{\text{latent}}=20$, the approximate posterior can retain more information while still being regularized toward the prior, typically yielding a better ELBO.
The main takeaways are:
Latent organization: encoder means form semantically meaningful regions.
Continuity: interpolations decode into smooth transformations.
Generative ability: prior samples decode into recognizable digits.
Trade-off: Gaussian likelihoods and ELBO optimization often produce blurry samples.
Capacity matters: higher-dimensional latent spaces usually improve likelihood, even if they are harder to visualize.
The visual below consolidates these empirical checks into one picture: a two-dimensional latent scatter to expose structure, generated samples to test the prior-to-decoder path, interpolation to test continuity, and an ELBO curve to connect the qualitative behavior back to the optimization objective.
It is important to read the panels together. A clean latent scatter alone does not guarantee good generation, and good-looking reconstructions alone do not guarantee a useful prior. The strength of the VAE is that these behaviors are coupled by the ELBO: reconstruction encourages informativeness, KL regularization encourages global structure, and the resulting model supports both sampling and smooth latent manipulation.

22. Failure Mode 1: Posterior Collapse

The latent traversals and generated samples we just looked at are the success case for VAEs: the encoder has learned a meaningful map from data $x$ into latent variables $z$, and the decoder has learned to use those variables to control semantic properties of the output. But this behavior is not guaranteed by the VAE objective. In fact, one of the most important VAE failure modes is that the model can optimize the ELBO while learning a latent space that carries almost no information at all.
Recall that for a single datapoint $x$, the ELBO is
$$
\mathcal{L}(\theta, \phi; x)
=
\underbrace{\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]}_{\text{reconstruction}}
-
\underbrace{D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))}_{\text{regularization}}.
$$
This objective creates a negotiation between two incentives. The reconstruction term rewards the encoder for placing information about $x$ into $z$, because a useful $z$ helps the decoder assign high likelihood to the observed data. The KL term pushes in the opposite direction: it rewards the approximate posterior $q_{\phi}(z|x)$ for staying close to the prior $p(z)$, usually a standard Gaussian. If a latent dimension is not useful enough to improve reconstruction, the cheapest solution is to make that dimension look exactly like prior noise.
Posterior collapse occurs when this cheap solution becomes the dominant one. For some latent dimension $k$, collapse means
$$
q_{\phi}(z_k|x) \approx p(z_k) = \mathcal{N}(0,1)
\quad \Longrightarrow \quad
D_{\mathrm{KL}}^{(k)} \approx 0.
$$
When this happens, $z_k$ carries essentially no information about $x$. Sampling $z_k$ from the encoder is no different from sampling it from the prior. The encoder may still output a mean and variance, but if those parameters match the prior for every input, then the latent coordinate has become statistically independent of the data. In practice, one often observes the KL term quickly shrinking toward zero early in training, long before the decoder has learned a useful conditional generative model.
The most dangerous case arises when the decoder is powerful enough to model the data distribution without using $z$. Suppose the decoder becomes effectively $z$-independent:
$$
p_{\theta}(x|z) = p_{\theta}(x)
\quad \text{for all } z.
$$
Then the reconstruction term no longer depends on the encoder distribution:
$$
\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]
=
\log p_{\theta}(x).
$$
At that point, there is no reconstruction benefit to encoding information in $z$. The only remaining pressure on the encoder is the KL penalty, and the KL is minimized by setting
$$
q_{\phi}(z|x) = p(z).
$$
This is why posterior collapse is not merely a transient optimization accident; it can be a stable fixed point of the ELBO. Once the decoder can explain $x$ on its own, the encoder is actively rewarded for becoming uninformative.
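Substituting the $z$-independent reconstruction term into the ELBO makes the fixed point explicit. This is only a restatement of the equations above under the same assumption $p_{\theta}(x|z)=p_{\theta}(x)$:
$$
\mathcal{L}(\theta,\phi;x)
=
\log p_{\theta}(x)
-
D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z))
\;\le\;
\log p_{\theta}(x),
\qquad
\nabla_{\phi}\,\mathcal{L}(\theta,\phi;x)
=
-\nabla_{\phi}\,D_{\mathrm{KL}}(q_{\phi}(z|x)\|p(z)).
$$
The bound is attained exactly when $q_{\phi}(z|x)=p(z)$, so in this regime every encoder gradient points toward the prior and none points toward encoding $x$.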
This failure mode is especially common with highly expressive decoders: autoregressive PixelCNN-style image decoders, strong Transformer decoders, or LSTM decoders for language. These models can often predict local structure from previously generated pixels or tokens so well that the global latent variable becomes unnecessary. The decoder learns a strong marginal model $p_{\theta}(x)$, while the intended conditional model $p_{\theta}(x|z)$ effectively ignores its conditioning signal.
There is a subtle but important distinction here. Posterior collapse does not necessarily mean the model has poor likelihood. A collapsed VAE may still assign decent likelihood to data if the decoder is strong enough. The failure is that the model no longer behaves like a useful latent-variable model. The latent space will not organize examples semantically, interpolations become meaningless, and downstream uses of $z$, such as compression, representation learning, and controllable generation, break down.
A useful diagnostic is to monitor the per-dimension KL terms during training. In a healthy VAE, some latent dimensions usually develop positive KL: the model is “paying” bits to transmit information through $z$. In a collapsed model, many or all dimensions have KL near zero:
Healthy latent dimension: positive KL, carries information about $x$.
Collapsed latent dimension: KL $\approx 0$, behaves like prior noise.
Fully collapsed model: total KL $\approx 0$, decoder acts as a marginal model.
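The per-dimension diagnostic can be sketched as follows. The encoder outputs are invented for illustration, and the threshold of 0.01 nats is an arbitrary assumption that would need tuning for a real model; the code assumes the encoder emits means and log-variances.

```python
import math

def per_dimension_kl(mu_batch, logvar_batch):
    """Batch-averaged KL to N(0, 1) for each latent dimension, in nats."""
    num_dims = len(mu_batch[0])
    totals = [0.0] * num_dims
    for mu, logvar in zip(mu_batch, logvar_batch):
        for k in range(num_dims):
            totals[k] += 0.5 * (mu[k] ** 2 + math.exp(logvar[k]) - 1.0 - logvar[k])
    return [t / len(mu_batch) for t in totals]

def collapsed_dimensions(kl_per_dim, threshold=0.01):
    """Indices whose average KL is below a hand-picked threshold (an assumption, not a standard)."""
    return [k for k, v in enumerate(kl_per_dim) if v < threshold]

# Hypothetical encoder outputs for three inputs and K = 3; dimension 2 barely reacts to the input.
mu_batch = [[1.1, -0.4, 0.0], [-0.9, 0.6, 0.01], [0.3, -1.2, -0.01]]
logvar_batch = [[-2.0, -1.0, 0.0], [-1.5, -0.8, 0.0], [-2.2, -1.1, 0.0]]
kl_per_dim = per_dimension_kl(mu_batch, logvar_batch)
print([round(v, 4) for v in kl_per_dim], collapsed_dimensions(kl_per_dim))
```

Logging these values at intervals during training gives the per-dimension KL trajectories described above, which makes it visible whether collapse happens early and in which coordinates.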
The visual below compactly summarizes both the training symptom and the mechanism. On the left, the key signal is the KL trajectory: a useful latent space requires the KL to rise above zero, while collapse appears as a flat line near zero. On the right, the information-flow picture emphasizes the cause: the path through $z$ is no longer used, so the encoder has no reason to encode input-specific information.
Read the diagram as a warning about the ELBO’s incentives. The KL term is not just a harmless regularizer; when the decoder can reconstruct or predict without $z$, the KL penalty can make “ignore the latent variable” the easiest optimum. This sets up the next failure mode as well: even when VAEs do use $z$, the probabilistic reconstruction objective can still produce overly smooth or blurry samples.

23. Failure Mode 2: Blurry Reconstructions

After posterior collapse, it is tempting to think the main danger in VAEs is that the latent variable becomes unused. But even when the latent code is used, VAEs often exhibit another characteristic failure mode: reconstructions look smooth, washed out, or “average-looking.” This is not merely an implementation artifact. It follows directly from the likelihood model we usually choose and from the way the ELBO trades reconstruction fidelity against latent regularity.
The standard VAE decoder is often written as a Gaussian likelihood,
$$
p_{\theta}(x \mid z) = \mathcal{N}(x; f_{\theta}(z), \sigma^2 I),
$$
where $f_{\theta}(z)$ is the decoder’s predicted image mean. If $\sigma^2$ is fixed, maximizing the reconstruction log-likelihood is equivalent, up to constants and scaling, to minimizing a pixel-wise squared error:
$$
\mathbb{E}_{q_{\phi}(z|x)}
\left[
\|x - f_{\theta}(z)\|^2
\right].
$$
This is the key point: squared error rewards conditional averages. If there are multiple plausible sharp reconstructions compatible with the latent uncertainty, the prediction that minimizes expected squared error is not necessarily one of those sharp possibilities. It is their mean.
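For reference, the equivalence with squared error invoked above comes from writing out the Gaussian log-density, where $D$ is the dimension of $x$:
$$
\log \mathcal{N}(x; f_{\theta}(z), \sigma^2 I)
=
-\frac{1}{2\sigma^2}\,\|x - f_{\theta}(z)\|^2
-
\frac{D}{2}\log(2\pi\sigma^2).
$$
With $\sigma^2$ fixed, the second term is a constant and the first is the squared error scaled by $1/(2\sigma^2)$. That scale also sets how heavily reconstruction is weighted against the KL term, so the choice of $\sigma^2$ is not neutral.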
More precisely, under an MSE objective, the optimal point estimate is a conditional expectation. In the VAE setting, because reconstructions are averaged over samples from the approximate posterior $q_{\phi}(z \mid x)$, the effective optimum behaves like
$$
\hat{x}_{\text{opt}} = \mathbb{E}_{q_{\phi}(z \mid x)}[f_{\theta}(z)].
$$
This is harmless when the posterior mass corresponds to a narrow set of nearly identical decodings. But it becomes visually damaging when the plausible decodings are multimodal. For example, if a digit could plausibly have a stroke slightly to the left or slightly to the right, the pixel-wise average puts mass in both places weakly. The resulting image is not a realistic digit from either mode; it is a blurred compromise between them.
This is why VAE blurriness is often described as an “averaging” phenomenon. The decoder is not explicitly trying to blur images. Rather, the Gaussian likelihood says that deviations in pixel space are penalized quadratically, independently across pixels. Under this geometry, a soft gray pixel halfway between black and white can be preferable to committing to the wrong sharp value. The model is rewarded for being safe under uncertainty, not for producing a sample that lies on the manifold of visually crisp images.
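The averaging argument can be checked on a toy case. The sketch below is only an illustration, assuming two equally likely one-dimensional "images" whose stroke sits one pixel apart; the pixel values and helper name are made up for the example. It compares the expected squared error of the pixel-wise mean with that of committing to one sharp candidate.

```python
# Two equally plausible sharp "images" (1-D rows of pixels): the stroke
# sits either one pixel to the left or one pixel to the right.
stroke_left = [0.0, 1.0, 0.0, 0.0]
stroke_right = [0.0, 0.0, 1.0, 0.0]
targets = [stroke_left, stroke_right]


def expected_squared_error(prediction, targets):
    """Average squared pixel error of one prediction over all targets."""
    total = 0.0
    for target in targets:
        total += sum((p - t) ** 2 for p, t in zip(prediction, target))
    return total / len(targets)


# The pixel-wise mean of the plausible targets.
blurry_mean = [sum(pixels) / len(targets) for pixels in zip(*targets)]

loss_mean = expected_squared_error(blurry_mean, targets)
loss_sharp = expected_squared_error(stroke_left, targets)

print(blurry_mean)   # [0.0, 0.5, 0.5, 0.0]
print(loss_mean)     # 0.5
print(loss_sharp)    # 1.0
```

The gray compromise halves the expected loss relative to either sharp choice, even though it matches neither plausible image.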
The KL term compounds the problem. Recall the single-example ELBO:
$$
\mathcal{L}(\theta,\phi;x)
=
\underbrace{\mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)]}_{\text{encourages reconstruction}}
-
\underbrace{D_{\mathrm{KL}}(q_{\phi}(z \mid x)\,\|\,p(z))}_{\text{regularizes latent distribution}}.
$$
The reconstruction term wants $q_{\phi}(z \mid x)$ to encode enough information to reconstruct $x$ sharply. The KL term pushes $q_{\phi}(z \mid x)$ toward the prior, typically $p(z)=\mathcal{N}(0,I)$. This pressure discourages highly concentrated, highly informative posteriors. In effect, it can broaden $q_{\phi}(z \mid x)$, increasing the range of latent samples that the decoder must handle for the same input.
So even before collapse, the model is shaped by three interacting pressures:
Reconstruction pressure wants precise latent codes and sharp outputs.
KL regularization wants broad, prior-like posteriors and smooth latent space structure.
Gaussian/MSE decoding turns uncertainty over plausible outputs into pixel-wise averages.
This tradeoff is part of what makes VAEs attractive as generative models: their latent spaces are usually smooth, interpolatable, and mode-covering. But the price is that their samples and reconstructions can look less sharp than those from models trained with perceptual or adversarial criteria.
GANs fail in almost the opposite direction. A GAN discriminator does not ask whether each pixel is close to a target under squared error. It asks whether the generated sample looks like it came from the real data distribution. A blurry average of several plausible images is often easy for a discriminator to reject, so adversarial training strongly penalizes unrealistic smooth compromises. This is why GAN samples are frequently sharper. However, that sharpness comes with a different failure mode: mode dropping. The generator can learn to produce a subset of highly realistic samples while ignoring other regions of the data distribution.
This comparison is useful because it separates two notions that are often conflated: sharpness and coverage. VAEs tend to cover the data distribution more systematically because the likelihood objective assigns probability mass broadly and the latent prior encourages global organization. GANs tend to produce sharper samples because adversarial feedback rewards realism, but they may cover fewer modes. Neither objective is universally better; they encode different preferences.
Within the VAE framework, there are several ways to reduce blurriness, each changing the balance of this tradeoff. One can replace pixel-wise MSE with a perceptual loss, measuring reconstruction error in the feature space of a pretrained network rather than raw pixels. One can add an adversarial term, producing hybrid models such as VAE-GANs. Or one can reduce the decoder variance $\sigma^2$, effectively increasing the weight of reconstruction accuracy relative to the KL term. But these interventions are not free: they can destabilize training, weaken likelihood interpretation, or reduce the smoothness and coverage properties that motivated VAEs in the first place.
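To see why shrinking the decoder variance acts as a reweighting, it may help to write out the Gaussian log-likelihood explicitly. This is a short worked step, assuming the fixed isotropic variance used above and $x \in \mathbb{R}^D$:

$$
\log p_{\theta}(x \mid z) = -\frac{1}{2\sigma^2}\,\|x - f_{\theta}(z)\|^2 - \frac{D}{2}\log(2\pi\sigma^2).
$$

Substituting into the ELBO and multiplying through by $2\sigma^2$ gives

$$
2\sigma^2\,\mathcal{L}(\theta,\phi;x)
= -\,\mathbb{E}_{q_{\phi}(z \mid x)}\left[\|x - f_{\theta}(z)\|^2\right]
- 2\sigma^2\, D_{\mathrm{KL}}(q_{\phi}(z \mid x)\,\|\,p(z))
- \sigma^2 D \log(2\pi\sigma^2).
$$

With $\sigma^2$ held fixed, the last term is a constant, so maximizing the ELBO amounts to minimizing squared error plus $2\sigma^2$ times the KL term. Halving $\sigma^2$ therefore halves the weight of the KL penalty relative to reconstruction error.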
The visual below compresses this story into a side-by-side comparison. On the VAE side, a broad approximate posterior sends multiple latent samples through the decoder; averaging their plausible outputs produces a blurry mean. In the center, the loss comparison highlights the mechanism: Gaussian likelihood/MSE encourages mode coverage but tolerates visual averaging, while adversarial training penalizes unrealistic blur.
The GAN side then illustrates the complementary behavior: samples can be crisp because the discriminator enforces perceptual realism, but the generated examples may become repetitive, signaling missing modes. This is the central takeaway: VAE blurriness is not accidental; it is a consequence of averaging under latent uncertainty, amplified by KL regularization. GAN sharpness solves that visual problem by changing the objective, but introduces its own coverage failure mode.

24. Extension: Beta-VAE and Disentanglement

After seeing why VAEs can produce blurry reconstructions, it is tempting to ask whether we can tune the objective to prefer a different kind of representation. The standard VAE is already balancing two competing goals: reconstruct the input well, while keeping the approximate posterior close to a simple prior. Beta-VAE makes this trade-off explicit by introducing a single scalar knob, $\beta$, that controls how strongly we penalize information stored in the latent variable.
Recall the standard VAE objective for one data point $x$:
$$
\mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z \mid x)}\bigl[\log p_{\theta}(x \mid z)\bigr]
-
D_{\mathrm{KL}}\!\bigl(q_{\phi}(z \mid x)\,\|\,p(z)\bigr).
$$
The first term rewards good reconstruction. The second term regularizes the encoder distribution $q_{\phi}(z \mid x)$, usually pushing it toward a simple prior such as
$$
p(z)=\mathcal{N}(0,I).
$$
Beta-VAE modifies only one thing: it multiplies the KL term by $\beta$:
$$
\mathcal{L}_{\beta}(\theta, \phi;\, x)
=
\mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)]
-
\beta\, D_{\mathrm{KL}}\!\bigl(q_{\phi}(z \mid x)\,\|\,p(z)\bigr),
\qquad
\beta \geq 1.
$$
When $\beta=1$, this is exactly the usual VAE ELBO:
$$
\beta = 1 \implies \mathcal{L}_{\beta} = \mathcal{L}(\theta,\phi;x).
$$
For $\beta>1$, however, the objective is no longer the standard ELBO on $\log p_{\theta}(x)$. It is better understood as a rate–distortion trade-off: the negative reconstruction term plays the role of distortion, while the KL term measures the rate, meaning how much information about $x$ can be transmitted through $z$. Increasing $\beta$ makes latent information more expensive.
This pressure can encourage disentanglement. The prior $p(z)=\mathcal{N}(0,I)$ is factorized:
$$
p(z) = \prod_{k=1}^{d} \mathcal{N}(z_k;0,1).
$$
So if the encoder is heavily penalized for deviating from this prior, it is encouraged to use latent dimensions in a way that remains close to independent, standardized Gaussian coordinates. Under the common diagonal Gaussian encoder,
$$
q_{\phi}(z \mid x)
=
\mathcal{N}\bigl(\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))\bigr),
$$
the KL term decomposes across dimensions:
$$
D_{\mathrm{KL}}\!\bigl(q_{\phi}(z \mid x)\,\|\,p(z)\bigr)
=
\frac{1}{2}
\sum_{k=1}^{d}
\left(
\mu_{\phi,k}(x)^2
+
\sigma_{\phi,k}(x)^2
-
\log \sigma_{\phi,k}(x)^2
-
1
\right).
$$
This expression makes the effect of $\beta$ concrete. Unless a latent dimension $z_k$ helps explain real variation in the data, the model is rewarded for keeping
$$
\mu_{\phi,k}(x) \approx 0,
\qquad
\sigma_{\phi,k}(x) \approx 1.
$$
In other words, unused or weakly useful dimensions are pushed back toward pure prior noise. Dimensions that remain active must “earn their keep” by improving reconstruction enough to compensate for the larger KL penalty.
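The closed-form KL for a diagonal Gaussian encoder against a standard normal prior is easy to evaluate by hand. The following is a minimal sketch using only the Python standard library; the encoder outputs and the value of `beta` are hypothetical numbers chosen to show that a dimension with mean 0 and standard deviation 1 contributes exactly zero, while an informative dimension pays a cost that is then scaled by $\beta$.

```python
import math


def diagonal_gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m ** 2 + s ** 2 - math.log(s ** 2) - 1.0
        for m, s in zip(mu, sigma)
    )


def per_dimension_kl(mu, sigma):
    """The same KL, kept as one term per latent dimension."""
    return [0.5 * (m ** 2 + s ** 2 - math.log(s ** 2) - 1.0)
            for m, s in zip(mu, sigma)]


# Hypothetical encoder output for one input: dimension 0 is informative,
# dimensions 1 and 2 have fallen back to the prior.
mu = [1.5, 0.0, 0.0]
sigma = [0.2, 1.0, 1.0]

print(per_dimension_kl(mu, sigma))

beta = 4.0  # illustrative value only
print(beta * diagonal_gaussian_kl(mu, sigma))
```

Only the first coordinate contributes to the penalty here, which is the sense in which active dimensions must justify their cost.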
The intuitive hope is that each active latent coordinate will specialize in a distinct factor of variation: one dimension for shape, another for scale, another for rotation, another for lighting, and so on. Empirically, this often happens on controlled datasets such as dSprites or 3DShapes, where the underlying generative factors are relatively clean and independent. Values such as $\beta \in [4,10]$ can produce latents where traversing one coordinate changes one semantic factor while leaving others mostly fixed.
But this is not magic, and it is important not to overstate the guarantee. Disentanglement depends on several assumptions:
the prior is factorized, usually isotropic Gaussian;
the encoder often has a diagonal covariance structure;
the data actually contains relatively separable factors of variation;
the decoder architecture and training procedure do not hide information elsewhere;
unsupervised disentanglement is not identifiable in general without inductive biases.
So Beta-VAE is best understood as a useful bias, not a theorem that semantic factors will automatically emerge.
The cost is also direct: as $\beta$ increases, reconstruction quality usually decreases. The decoder receives less information about the particular input $x$, because the encoder is more constrained to keep $q_{\phi}(z \mid x)$ close to $p(z)$. This can worsen the same reconstruction problems we discussed earlier: less detail, more averaging, and potentially blurrier outputs. Beta-VAE therefore trades fidelity for structure.
A useful way to summarize the idea is to imagine $\beta$ as a dial. At $\beta=1$, we have the ordinary VAE objective: better reconstructions, but latent dimensions may be entangled. As $\beta$ grows, the KL term becomes more dominant, pushing the posterior closer to the independent Gaussian prior. The latent space may become more axis-aligned and interpretable, but the reconstructions become less faithful.
The visual below consolidates this trade-off: the left side represents the standard VAE regime, where latent coordinates can mix multiple factors; the center emphasizes the modified objective and the $\beta$-controlled tension between reconstruction and regularization; the right side represents the Beta-VAE regime, where dimensions are encouraged to separate factors such as shape and scale, but with degraded reconstruction quality as the price.

25. VAEs vs GANs vs Normalizing Flows

After seeing how $\beta$-VAE deliberately reshapes the latent space by changing the strength of the KL penalty, it is useful to step back and ask a broader question: what kind of generative model is a VAE, and what does it trade away compared with other families? VAEs are often introduced alongside GANs and normalizing flows because all three learn to generate samples from data, but they make very different compromises about likelihoods, latent variables, optimization, and sample quality.
The VAE starts from an explicitly probabilistic story. We assume a latent variable $z \sim p(z)$, usually $p(z)=\mathcal{N}(0,I)$, and a decoder likelihood $p_{\theta}(x \mid z)$. The marginal likelihood is
$$
p_{\theta}(x)=\int p_{\theta}(x \mid z)\,p(z)\,dz,
$$
but this integral is generally intractable for neural decoders. The VAE therefore optimizes the Evidence Lower Bound:
$$
\log p_{\theta}(x) \geq \mathcal{L}(\theta,\phi;x)
=
\mathbb{E}_{q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)]
-
D_{\mathrm{KL}}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right).
$$
This objective is not merely a training trick. It encodes the core VAE design philosophy: learn to reconstruct data through a latent bottleneck while keeping the approximate posterior $q_{\phi}(z \mid x)$ close to a simple prior. That KL term is exactly what gives VAEs their unusually useful latent geometry. Nearby points in latent space tend to decode to semantically related outputs, interpolation is meaningful, and the latent representation can support downstream tasks such as clustering, semi-supervised learning, and controllable generation.
But the same structure also explains common VAE weaknesses. Because the decoder is trained through a likelihood term, the model is often rewarded for predicting averages under uncertainty. With Gaussian pixel likelihoods, this can produce blurry reconstructions: if multiple sharp images are plausible, the conditional mean may lie between them. And when the decoder is too powerful, the model may ignore $z$ entirely, producing posterior collapse, where $q_{\phi}(z \mid x)\approx p(z)$ and the latent code carries little information about $x$. In other words, the VAE’s principled probabilistic objective is also the source of its characteristic compromises.
GANs sit at almost the opposite end of the spectrum. A GAN generator maps noise $z\sim p(z)$ to a sample $f_{\theta}(z)$, while a discriminator $D_{\phi}$ tries to distinguish generated samples from real data. The classical minimax objective is
$$
\min_{\theta}\max_{\phi}
\;
\mathbb{E}_{x\sim p_{\mathrm{data}}}[\log D_{\phi}(x)]
+
\mathbb{E}_{z\sim p(z)}\left[\log\bigl(1-D_{\phi}(f_{\theta}(z))\bigr)\right],
$$
where $D_{\phi}(x)$ is the discriminator’s estimated probability that $x$ is real; in this formula $\phi$ denotes the discriminator’s parameters, not an encoder’s. This adversarial formulation can produce extremely sharp and realistic samples because the generator is trained to fool a learned critic rather than to maximize a pixelwise likelihood. However, GANs usually do not provide tractable likelihoods, and their latent space is not regularized by an inference model in the VAE sense. There is no built-in encoder $q_{\phi}(z \mid x)$, no ELBO, and no direct estimate of $\log p_{\theta}(x)$. As a result, GANs are powerful image synthesizers but less naturally suited to probabilistic inference or structured representation learning.
Normalizing flows take a third route: they preserve exact likelihoods by making the generative map invertible. A flow defines an invertible transformation between data $x$ and latent variables $z=f_{\theta}(x)$, allowing density evaluation through the change-of-variables formula:
$$
\log p_{\theta}(x)
=
\log p(z)
+
\log \left|
\det \frac{\partial f_{\theta}(x)}{\partial x}
\right|,
\qquad
z = f_{\theta}(x).
$$
This is a major advantage: unlike VAEs, flows do not optimize a lower bound; unlike GANs, they can evaluate exact densities. But this exactness comes with architectural constraints. The transformation must be bijective, which usually forces the latent dimensionality to match the data dimensionality, $K=D$. The model cannot freely compress data into a lower-dimensional semantic bottleneck in the way a VAE can. Flows are excellent density models, but their latent spaces are often less directly aligned with compact representation learning.
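A one-dimensional affine flow makes the change-of-variables computation concrete. In the sketch below, the map $z = f(x) = (x - \text{shift})/\text{scale}$ and its parameter values are assumptions chosen for illustration. With $z = f(x)$ as the data-to-latent map, the Jacobian term is $\log|df/dx| = -\log|\text{scale}|$, and the result can be checked against the known density of $\mathcal{N}(\text{shift}, \text{scale}^2)$.

```python
import math


def standard_normal_logpdf(z):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def affine_flow_logpdf(x, scale, shift):
    """log p(x) for the 1-D flow z = f(x) = (x - shift) / scale, z ~ N(0, 1)."""
    z = (x - shift) / scale
    log_abs_det_jacobian = -math.log(abs(scale))   # df/dx = 1 / scale
    return standard_normal_logpdf(z) + log_abs_det_jacobian


def gaussian_logpdf(x, mean, std):
    """Reference density: x = shift + scale * z is N(shift, scale^2)."""
    return (-0.5 * math.log(2.0 * math.pi * std * std)
            - 0.5 * ((x - mean) / std) ** 2)


x, scale, shift = 1.3, 2.0, 0.5   # arbitrary illustrative values
print(affine_flow_logpdf(x, scale, shift))
print(gaussian_logpdf(x, shift, scale))   # should agree with the line above
```

Real flows stack many such invertible layers and sum their log-determinants, but the bookkeeping is the same as in this single-layer case.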
A useful way to summarize the comparison is to ask what each model gives you “for free” and what it makes difficult:
VAEs give a regularized encoder-decoder latent space and a principled likelihood lower bound, but may sacrifice sharpness and exact density evaluation.
GANs often give the sharpest samples, but provide no explicit likelihood and can be unstable to train.
Normalizing flows give exact likelihoods and exact inference through invertibility, but require constrained architectures and typically do not compress into a low-dimensional latent representation.
The key point is that there is no universally dominant model family. The right choice depends on whether the priority is sample realism, density evaluation, or latent representation structure. If the task is pure image synthesis, GAN-like models may be attractive. If exact log-likelihood is central, flows are compelling. If we care about a learned latent space that is smooth, regularized, and useful for inference or downstream prediction, VAEs remain especially important.
The visual comparison that follows condenses these trade-offs into the axes that matter most: objective, sampling procedure, density evaluation, latent structure, and sample quality. Notice in particular the contrast between the VAE’s ELBO-based density estimate and the flow’s exact likelihood, as well as the contrast between the VAE’s KL-regularized latent space and the more weakly structured latent variables used by GANs.
The highlighted cells emphasize the central lesson: VAEs are not simply “worse GANs” because their samples can be blurrier, nor are they merely “approximate flows” because they optimize a bound rather than exact likelihood. Their distinctive strength is the combination of a probabilistic training objective with a structured latent representation, which is precisely why they remain a foundational tool for representation learning, semi-supervised modeling, and latent-variable inference.

26. Empirical Anchor: VAE on CelebA Faces

After comparing VAEs with GANs and normalizing flows in the abstract, it is useful to ground the discussion in a concrete empirical example. CelebA faces are a particularly good testbed: the images are structured enough that we can see semantic attributes—pose, hair, glasses, gender presentation, expression—but constrained enough that a moderately sized convolutional VAE can learn a meaningful latent representation.
Consider a convolutional VAE trained on roughly 200k CelebA images resized to $64 \times 64$ RGB. The encoder $g_{\phi}$ maps an image $x$ through several convolutional layers and a fully connected layer to the parameters of a diagonal Gaussian posterior approximation,
$$
q_{\phi}(z \mid x)
=
\mathcal{N}\!\left(
z;\mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^2(x))
\right),
$$
with latent dimension $K=128$. The decoder $f_{\theta}$ maps a latent vector back through a fully connected layer and transposed convolutional layers to a Gaussian mean image. In other words, the decoder is not directly producing a sharp image distribution in pixel space; it is producing the mean of a conditional Gaussian likelihood.
At training and reconstruction time, the latent sample is drawn using the reparameterization trick,
$$
\hat{z}
=
\mu_{\phi}(x)
+
\sigma_{\phi}(x)\odot \epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,I).
$$
This matters because the encoder is not merely assigning each image to a deterministic code. It is learning a local distribution over plausible codes, regularized toward the prior $p(z)=\mathcal{N}(0,I)$. The VAE therefore has to balance two goals: preserving enough information to reconstruct the face, while keeping the aggregate latent geometry compatible with sampling from a simple Gaussian prior.
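The sampling step itself is small. Here is a minimal standard-library sketch of the reparameterized draw, assuming the encoder has already produced a mean and a standard deviation per latent coordinate; the three-dimensional values below are hypothetical stand-ins for the 128-dimensional case.

```python
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one coordinate at a time."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]         # hypothetical encoder means
sigma = [0.1, 0.3, 1.0]       # hypothetical encoder standard deviations

z_hat = reparameterize(mu, sigma, rng)
print(z_hat)

# With sigma = 0 the noise has no effect and the sample is exactly the mean.
print(reparameterize(mu, [0.0, 0.0, 0.0], rng))
```

In an autodiff framework the same expression lets gradients reach `mu` and `sigma`, because the randomness enters only through `eps`.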
The first qualitative result is reconstruction. If we encode a real face, sample $\hat{z}$, and decode $f_{\theta}(\hat{z})$, the output is usually recognizable: identity, hair color, pose, and broad facial structure are often preserved. But the image is noticeably soft. This is not an incidental flaw of convolutional networks; it is tightly connected to the Gaussian likelihood and pixelwise squared-error reconstruction term. When multiple plausible high-frequency details could explain an input (exact hair strands, skin texture, eye highlights), the conditional mean averages over them. The result is the familiar VAE blur.
That blur is the qualitative price paid for a likelihood-based, mode-covering generative model with a simple reconstruction distribution. Unlike a GAN, which may learn sharp samples by matching an implicit distribution adversarially, the standard Gaussian VAE is explicitly rewarded for predicting an average pixel value under uncertainty. This is why simply training longer or increasing the latent dimension often does not remove the softness. To substantially change the visual character, one usually modifies the likelihood, the reconstruction loss, the decoder distribution, or introduces more expressive generative structure.
The second result is unconditional generation. If we sample
$$
z \sim p(z)=\mathcal{N}(0,I)
$$
and decode $f_{\theta}(z)$, we obtain plausible faces rather than random noise. This is a key empirical validation of the VAE objective: the KL term has shaped the encoder’s latent codes so that regions likely under the prior correspond to meaningful decoded images. In a poorly regularized autoencoder, sampling a random latent vector would often land in “holes” between encoded training examples. In a well-trained VAE, the KL term encourages the encoded codes to fill the region where the prior places its mass, leaving fewer such holes.
The third result is more subtle: the latent space often supports semantic arithmetic. Using posterior means as deterministic summaries of images, we may find relationships like
$$
\mu_{\phi}(x_{\text{man+glasses}})
-
\mu_{\phi}(x_{\text{man}})
+
\mu_{\phi}(x_{\text{woman}})
\approx
z_{\text{woman+glasses}}.
$$
Decoding the resulting vector can produce a face that preserves the “woman” identity direction while adding the “glasses” attribute. This should not be interpreted as exact symbolic reasoning. The latent space is not guaranteed to contain perfectly linear, disentangled factors. But because the prior is smooth and the decoder is trained over a continuous latent domain, common attributes can become approximately represented as directions or subspaces.
A related phenomenon appears in interpolation. Given two images $x_1$ and $x_2$, we can linearly interpolate between their posterior means,
$$
z(t)
=
(1-t)\,\mu_{\phi}(x_1)
+
t\,\mu_{\phi}(x_2),
\qquad
t\in[0,1].
$$
Decoding points along this path often gives a smooth morph from one face to another. This is an important distinction between a VAE and a plain autoencoder: the latent space is not merely a lookup table of compressed examples. The ELBO’s KL term encourages neighboring latent regions to decode coherently, so linear paths can remain on or near the learned face manifold.
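Both the attribute arithmetic and the interpolation path are plain vector operations on posterior means. The sketch below uses made-up two-dimensional means purely to make the arithmetic visible; in the CelebA setting these would be 128-dimensional encoder outputs, and each resulting vector would be passed to the decoder.

```python
def latent_arithmetic(mu_a, mu_b, mu_c):
    """Return mu_a - mu_b + mu_c, e.g. (man+glasses) - man + woman."""
    return [a - b + c for a, b, c in zip(mu_a, mu_b, mu_c)]


def interpolate(mu_1, mu_2, t):
    """Linear path z(t) = (1 - t) * mu_1 + t * mu_2 for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(mu_1, mu_2)]


# Hypothetical 2-D posterior means, chosen only to make the arithmetic visible.
mu_man_glasses = [1.0, 1.0]
mu_man = [1.0, 0.0]
mu_woman = [-1.0, 0.0]

print(latent_arithmetic(mu_man_glasses, mu_man, mu_woman))  # [-1.0, 1.0]

# Five evenly spaced points from mu_man to mu_woman.
path = [interpolate(mu_man, mu_woman, k / 4) for k in range(5)]
print(path[0], path[2], path[4])  # [1.0, 0.0] [0.0, 0.0] [-1.0, 0.0]
```

Whether the decoded results look like clean attribute edits or smooth morphs depends on the trained model; the vector operations only propose where to look in latent space.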
The visual below consolidates these four empirical anchors: reconstruction, unconditional generation, latent arithmetic, and interpolation. Read it as evidence for both sides of the VAE story. On one hand, the model has learned a semantically organized latent space where sampling, arithmetic, and smooth traversal are meaningful. On the other hand, the reconstructions and samples remain visibly soft, reminding us that the Gaussian decoder’s averaging behavior is not a cosmetic detail but a central modeling limitation.
This example is therefore a natural bridge into VAE extensions. Once we accept that the basic framework gives us a useful latent geometry but imperfect image fidelity, the motivation for richer decoders, hierarchical latents, discrete variables, perceptual losses, adversarial hybrids, and diffusion-style decoders becomes much clearer.

27. Hierarchy of VAE Extensions

After seeing a standard VAE trained on CelebA, it is tempting to think of “the VAE” as one fixed architecture: an encoder produces a Gaussian latent distribution, a decoder reconstructs an image, and the ELBO balances reconstruction against KL regularization. But in practice, many of the most useful VAE variants keep this encode–sample–decode skeleton while changing one structural assumption in the probabilistic model or in the variational bound.
A useful way to organize the family is to ask: what exactly are we modifying?
Are we adding side information, such as a class label or attribute?
Are we tightening the variational lower bound without changing the model family?
Are we replacing the continuous latent variable with a discrete one?
These questions lead naturally to three important extensions: Conditional VAEs, Importance Weighted Autoencoders, and Vector-Quantized VAEs. They look different architecturally, but they are all still variations on the same principle: introduce latent structure, define a tractable training objective, and optimize an encoder–decoder system end-to-end.
The Conditional VAE, or CVAE, modifies the generative story by conditioning both inference and generation on observed side information $y$. Instead of modeling $p_\theta(x,z)$, we model something closer to $p_\theta(x,z \mid y)$. The encoder becomes $q_\phi(z \mid x,y)$, the decoder becomes $p_\theta(x \mid z,y)$, and the prior may also depend on $y$, giving the conditional ELBO
$$
\mathcal{L}(\theta,\phi;x,y)
=
\mathbb{E}_{q_{\phi}(z \mid x,y)}
\big[\log p_{\theta}(x \mid z,y)\big]
-
D_{\mathrm{KL}}
\big(q_{\phi}(z \mid x,y)\,\|\,p(z \mid y)\big).
$$
The intuition is simple but powerful: $y$ explains the part of the variation we already know, while $z$ captures the residual variation. For faces, $y$ might encode identity, expression, hair color, or a binary attribute such as “smiling.” At generation time, we fix $y$, sample $z \sim p(z \mid y)$, and decode $x \sim p_\theta(x \mid z,y)$. This gives controlled generation: “generate a face with this attribute, while allowing the remaining details to vary.” The subtle assumption is that $y$ is available and meaningful during training; without paired $(x,y)$ supervision, the conditional structure cannot be learned directly.
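One common way to route $y$ into a decoder is to concatenate a one-hot encoding of the label with the latent sample. The text does not say how the conditioning is implemented, so treat the concatenation, the helper names, and the label values below as assumptions made for illustration.

```python
def one_hot(label, num_classes):
    """Encode an integer label y as a one-hot list."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def decoder_input(z, label, num_classes):
    """Concatenate z with one-hot y: one common way to feed y to a decoder."""
    return list(z) + one_hot(label, num_classes)


z = [0.3, -1.2]                 # a latent sample (hypothetical values)
smiling, not_smiling = 1, 0     # hypothetical binary attribute labels

# Same z, different y: the decoder sees the same residual variation but a
# different known attribute.
print(decoder_input(z, smiling, 2))      # [0.3, -1.2, 0.0, 1.0]
print(decoder_input(z, not_smiling, 2))  # [0.3, -1.2, 1.0, 0.0]
```

Holding `z` fixed while changing the label is the operational meaning of controlled generation in this setup.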
The Importance Weighted Autoencoder, or IWAE, changes something different. It does not primarily add labels or alter the latent space type. Instead, it improves the variational objective. Recall that the ordinary ELBO is a lower bound on $\log p_\theta(x)$, derived by introducing an approximate posterior $q_\phi(z \mid x)$. The gap between the ELBO and the true log evidence is a KL divergence between $q_\phi(z \mid x)$ and the exact posterior $p_\theta(z \mid x)$. IWAE tightens this bound by drawing $L$ samples from the encoder and forming an importance-weighted estimate:
$$
\mathcal{L}_{\mathrm{IWAE}}
=
\mathbb{E}_{z^{(1)},\ldots,z^{(L)}\sim q_{\phi}(z \mid x)}
\left[
\log
\frac{1}{L}
\sum_{l=1}^{L}
\frac{p_{\theta}(x,z^{(l)})}{q_{\phi}(z^{(l)} \mid x)}
\right].
$$
For $L=1$, this recovers the ordinary VAE ELBO. As $L$ increases, the bound becomes tighter, and under appropriate regularity assumptions,
$$
\mathcal{L}_{\mathrm{IWAE}}
\xrightarrow{L\to\infty}
\log p_{\theta}(x).
$$
This matters because the standard ELBO can be loose when the variational posterior is too simple. IWAE partially compensates by using multiple latent samples and letting high-importance samples dominate the estimate. But this improvement is not free. The tighter objective can produce more difficult optimization dynamics for the encoder parameters $\phi$: as $L$ grows, the signal-to-noise ratio of the usual gradient estimator for the proposal distribution can fall, so the encoder receives a weaker learning signal. So IWAE improves the statistical bound, but it introduces a new optimization trade-off.
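The importance-weighted estimate can be tried on a toy model where the true evidence is known. The sketch below assumes a hypothetical one-dimensional model, $p(z)=\mathcal{N}(0,1)$ and $p(x \mid z)=\mathcal{N}(z,1)$, whose marginal is $\mathcal{N}(0,2)$; the proposal parameters are free arguments. It uses a log-mean-exp for numerical stability and averages many draws of the bound for several values of $L$.

```python
import math
import random


def normal_logpdf(value, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def log_mean_exp(values):
    """Numerically stable log( mean( exp(values) ) )."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def iwae_estimate(x, q_mean, q_var, num_samples, rng):
    """One Monte Carlo draw of the L-sample bound for the toy model
    p(z) = N(0, 1), p(x|z) = N(z, 1), with proposal q(z|x) = N(q_mean, q_var)."""
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, 1.0)
        log_weights.append(log_joint - normal_logpdf(z, q_mean, q_var))
    return log_mean_exp(log_weights)


x = 1.0
true_log_px = normal_logpdf(x, 0.0, 2.0)   # marginal of the toy model is N(0, 2)
rng = random.Random(0)

for L in (1, 10, 100):
    # Average many draws of the bound, using a deliberately poor proposal.
    draws = [iwae_estimate(x, 0.0, 1.0, L, rng) for _ in range(200)]
    print(L, sum(draws) / len(draws), true_log_px)
```

If the proposal equals the exact posterior, here $\mathcal{N}(x/2, 1/2)$, every importance weight equals $p(x)$ and the estimate matches $\log p(x)$ for any $L$, which is a useful sanity check on an implementation.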
The VQ-VAE moves in a third direction: it replaces the usual continuous Gaussian latent variable with a discrete latent code. The encoder first outputs a continuous vector, but that vector is then snapped to the nearest entry in a learned codebook $\{e_k\}$. The decoder receives the selected codebook vector rather than a sample from a continuous posterior. This is especially useful when the data has naturally discrete or symbolic structure: phonemes in speech, tokens in language-like representations, or repeated visual parts in images.
The difficulty is that nearest-neighbor lookup is not differentiable. A standard VAE relies on the reparameterization trick, for example $z = \mu_\phi(x) + \sigma_\phi(x)\odot \epsilon$, to backpropagate through stochastic sampling. VQ-VAE cannot use that trick directly because the latent choice is discrete. Instead, it typically uses a straight-through estimator: in the forward pass, the model uses the nearest codebook vector; in the backward pass, gradients are copied through the quantization operation as if it were approximately identity. This estimator is biased, but often effective. Another consequence is that, with a uniform prior over discrete codes and a deterministic code assignment, the KL term reduces to a constant, so the objective is usually expressed through reconstruction plus codebook and commitment losses rather than the ordinary Gaussian KL penalty.
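The quantization step is a nearest-neighbor search over the codebook. The sketch below shows only that forward lookup in plain Python, with a hypothetical three-entry codebook; the straight-through gradient needs an autodiff framework, so it appears as a comment rather than executable code.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to encoder output z_e."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.8, 1.3]                      # hypothetical continuous encoder output

index, z_q = quantize(z_e, codebook)
print(index, z_q)                     # 1 [1.0, 1.0]

# Straight-through idea, in autodiff terms (not executed here):
#   z_q_st = z_e + stop_gradient(z_q - z_e)
# The forward value is z_q, while the backward pass treats d z_q_st / d z_e
# as the identity.
```

The decoder would consume `z_q`, and the selected `index` is the discrete latent code.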
These three variants therefore form a compact hierarchy of interventions on the base VAE:
CVAE modifies the conditioning structure: generate $x$ given $y$.
IWAE modifies the bound: use multiple importance samples to tighten the estimate of $\log p_\theta(x)$.
VQ-VAE modifies the latent representation: replace continuous $z$ with discrete codebook indices.
The visual below is meant to consolidate this taxonomy rather than introduce a new derivation. The table separates each model by the part of the VAE it changes: objective, key innovation, and trade-off. That framing is important because these extensions are not arbitrary architectural hacks; each corresponds to a different mathematical pressure point in the original VAE formulation.
The small diagrams beneath the table reinforce the same idea operationally. CVAE routes the label $y$ into both encoder and decoder, IWAE fans out multiple latent samples before combining them through an importance-weighted log average, and VQ-VAE inserts a codebook lookup between encoder and decoder. The common thread is that all three preserve the VAE’s central encode–latent–decode logic, while changing what the latent variable means or how its objective is optimized.

28. Summary: VAEs — Equivalent Forms and Unified View

After seeing the hierarchy of VAE extensions, it is useful to step back and notice that most of the machinery we have introduced is not a collection of unrelated tricks. The latent-variable model, Jensen’s inequality, the KL gap, the reconstruction–regularization tradeoff, the reparameterization trick, and the Gaussian closed form are all different views of the same central object: the Evidence Lower Bound, or ELBO. The extensions change priors, posteriors, decoders, objectives, or training schedules, but the conceptual spine remains the same.
The starting point is the latent-variable likelihood
pθ(x)=∫pθ(x,z) dz=∫pθ(x∣z)p(z) dz.p_{\theta}(x)=\int p_{\theta}(x,z)\,dz
=\int p_{\theta}(x|z)p(z)\,dz.pθ​(x)=∫pθ​(x,z)dz=∫pθ​(x∣z)p(z)dz.
The difficulty is that this integral is usually intractable for a neural decoder $p_{\theta}(x\mid z)$. VAEs solve this not by evaluating the marginal likelihood directly, but by introducing an approximate posterior $q_{\phi}(z\mid x)$. This distribution is not part of the original generative model; it is an inference model, or encoder, trained to approximate the true posterior $p_{\theta}(z\mid x)$. Once we multiply and divide by $q_{\phi}(z\mid x)$, Jensen’s inequality gives the canonical ELBO form:
$$
\mathcal{L}(\theta,\phi;x)
=\mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\log p_{\theta}(x,z)-\log q_{\phi}(z\mid x)\right].
$$
This is the most compact variational-inference statement of the VAE objective. It says: sample latent variables from the encoder, score them under the joint model, and subtract the cost of using the encoder distribution. Since it is a lower bound,
$$
\log p_{\theta}(x)\geq \mathcal{L}(\theta,\phi;x),
$$
maximizing the ELBO is a surrogate for maximizing likelihood.
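The multiply-and-divide step can be spelled out explicitly. The following is a sketch of the standard argument, using only the symbols already defined and assuming $q_{\phi}(z\mid x)>0$ wherever $p_{\theta}(x,z)>0$:

$$
\begin{aligned}
\log p_{\theta}(x)
&=\log\int q_{\phi}(z\mid x)\,\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\,dz
=\log\mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\right]\\
&\geq\mathbb{E}_{q_{\phi}(z\mid x)}\!\left[\log\frac{p_{\theta}(x,z)}{q_{\phi}(z\mid x)}\right]
=\mathcal{L}(\theta,\phi;x).
\end{aligned}
$$

The inequality is Jensen’s, applied to the concave function $\log$: the logarithm of an expectation is at least the expectation of the logarithm.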
The same expression becomes more interpretable when we use the factorization $p_{\theta}(x,z)=p_{\theta}(x\mid z)\,p(z)$. Then the ELBO decomposes as
$$
\mathcal{L}(\theta,\phi;x)
=\mathbb{E}_{q_{\phi}(z\mid x)}[\log p_{\theta}(x\mid z)]
-D_{\mathrm{KL}}(q_{\phi}(z\mid x)\,\|\,p(z)).
$$
This is the familiar reconstruction minus regularization form. The first term rewards latent codes that allow the decoder to explain $x$. The second term penalizes the encoder posterior for drifting too far from the prior. This penalty matters because generation later samples $z\sim p(z)$, not $z\sim q_{\phi}(z\mid x)$. If the aggregate latent codes used during training live in regions of latent space that the prior rarely visits, generation will be poor even if reconstruction is good.
A third equivalent rewriting separates the KL term into entropy and prior energy:
$$
\mathcal{L}(\theta,\phi;x)
=\mathbb{E}_{q_{\phi}(z\mid x)}[\log p_{\theta}(x\mid z)]
+\mathcal{H}[q_{\phi}(z\mid x)]
-\mathbb{E}_{q_{\phi}(z\mid x)}[-\log p(z)].
$$
This form emphasizes the connection to negative free energy and rate–distortion. The reconstruction term is a distortion-like quantity: how accurately can the latent representation explain the data? The KL-related terms control the information content and geometry of the representation. A low-entropy posterior can carry precise information about $x$, but it pays a cost if it concentrates too sharply or moves away from the prior. A high-entropy posterior is cheaper and smoother, but may discard information needed for accurate reconstructions.
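The split of the KL term into entropy and prior energy can be checked numerically for the standard Gaussian case. The sketch below assumes a diagonal-Gaussian posterior with mean `mu` and standard deviation `sigma` and a standard-normal prior; the function names and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_entropy(sigma):
    """Entropy H[q] of a diagonal Gaussian with standard deviations `sigma`."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma ** 2))

def prior_energy(mu, sigma):
    """E_q[-log p(z)] for q = N(mu, diag(sigma^2)) and p(z) = N(0, I)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi) + mu ** 2 + sigma ** 2)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian q."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

# Arbitrary example values for a 3-dimensional latent.
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.3, 1.0, 1.7])

# KL = prior energy - entropy, so -KL = H[q] - E_q[-log p(z)].
kl_from_parts = prior_energy(mu, sigma) - gaussian_entropy(sigma)
kl_direct = kl_to_standard_normal(mu, sigma)
```

The two values agree up to floating-point error, which is the statement that the entropy form and the reconstruction minus KL form are the same objective.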
The fourth form reveals exactly when the bound is tight:
$$
\log p_{\theta}(x)
=\mathcal{L}(\theta,\phi;x)
+D_{\mathrm{KL}}(q_{\phi}(z\mid x)\,\|\,p_{\theta}(z\mid x)).
$$
Equivalently,
$$
\mathcal{L}(\theta,\phi;x)
=\log p_{\theta}(x)
-D_{\mathrm{KL}}(q_{\phi}(z\mid x)\,\|\,p_{\theta}(z\mid x)).
$$
So the gap between the ELBO and the true log evidence is not mysterious: it is exactly the KL divergence from the approximate posterior to the true posterior. The bound is tight if and only if
$$
q_{\phi}(z\mid x)=p_{\theta}(z\mid x)
$$
almost everywhere. This condition is important but subtle. In practice, the true posterior changes as $\theta$ changes, and the variational family may be too limited to represent it exactly. Thus the VAE optimizes a coupled problem: improve the generative model while simultaneously learning an amortized approximation to its posterior.
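The gap identity can be verified exactly in a model simple enough that every quantity has a closed form. The sketch below assumes a hypothetical scalar linear-Gaussian model, $p(z)=\mathcal{N}(0,1)$ and $p(x\mid z)=\mathcal{N}(z,1)$, for which $p(x)=\mathcal{N}(0,2)$ and the true posterior is $\mathcal{N}(x/2,\,1/2)$; the function names and the particular values of `x`, `m`, and `v` are illustrative assumptions.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a scalar Gaussian N(mean, var), evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians (v = variance)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    """Exact ELBO for q(z|x) = N(m, v) under the assumed toy model."""
    # E_q[log N(x; z, 1)] = -0.5 * (log(2*pi) + (x - m)^2 + v)
    recon = -0.5 * (np.log(2.0 * np.pi) + (x - m) ** 2 + v)
    return recon - kl_gauss(m, v, 0.0, 1.0)

x = 1.5
log_px = log_normal(x, 0.0, 2.0)            # exact log evidence
post_mean, post_var = x / 2.0, 0.5          # exact posterior p(z|x)

m, v = 0.2, 0.8                             # an arbitrary, imperfect encoder output
elbo_value = elbo(x, m, v)
gap = kl_gauss(m, v, post_mean, post_var)   # KL(q || true posterior)
elbo_at_posterior = elbo(x, post_mean, post_var)
```

Here `elbo_value + gap` reproduces `log_px`, and setting the encoder output to the true posterior closes the gap entirely.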
The fifth form is the one we actually implement for the standard Gaussian VAE. If
$$
q_{\phi}(z\mid x)=\mathcal{N}\!\left(\mu_{\phi}(x),\operatorname{diag}(\sigma_{\phi}(x)^2)\right),
\qquad
p(z)=\mathcal{N}(0,I),
$$
then we sample using the reparameterization trick
$$
z=\mu_{\phi}(x)+\sigma_{\phi}(x)\odot \epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I),
$$
and use the closed-form Gaussian KL:
$$
D_{\mathrm{KL}}(q_{\phi}(z\mid x)\,\|\,p(z))
=\frac{1}{2}\sum_{k=1}^{K}
\left[\mu_{\phi,k}^2+\sigma_{\phi,k}^2-1-\log\sigma_{\phi,k}^2\right].
$$
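The closed-form KL can be sanity-checked against a Monte Carlo estimate built from reparameterized samples. This is a minimal sketch rather than a training-ready implementation: the function names, the sample count, and the example values of `mu` and `sigma` are assumptions made for illustration.

```python
import numpy as np

def reparameterize(mu, sigma, rng, n):
    """Draw n samples z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal((n, mu.size))
    return mu + sigma * eps

def kl_closed_form(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

def kl_monte_carlo(mu, sigma, rng, n=200_000):
    """Estimate the same KL as the sample mean of log q(z|x) - log p(z)."""
    z = reparameterize(mu, sigma, rng, n)
    log_q = -0.5 * np.sum(np.log(2.0 * np.pi * sigma ** 2) + ((z - mu) / sigma) ** 2, axis=1)
    log_p = -0.5 * np.sum(np.log(2.0 * np.pi) + z ** 2, axis=1)
    return np.mean(log_q - log_p)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])      # example encoder means
sigma = np.array([0.3, 1.2])    # example encoder standard deviations

kl_exact = kl_closed_form(mu, sigma)
kl_estimate = kl_monte_carlo(mu, sigma, rng)
```

The estimate should match the closed form to within sampling noise, which is why the analytic expression is preferred in practice: it removes one source of variance from the objective at no cost.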
The practical training objective is therefore
$$
\tilde{\mathcal{L}}(\theta,\phi;x)
=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[\log p_{\theta}\!\left(x\mid \mu_{\phi}(x)+\sigma_{\phi}(x)\odot\epsilon\right)\right]
-\frac{1}{2}\sum_{k=1}^{K}\left[\mu_{\phi,k}(x)^2+\sigma_{\phi,k}(x)^2-1-\log\sigma_{\phi,k}(x)^2\right].
$$
This last expression is not a new objective; it is the computationally usable version of the same ELBO under Gaussian assumptions. The reparameterization trick is what makes gradients with respect to $\phi$ low-variance and compatible with backpropagation: randomness is pushed into $\epsilon$, while $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ remain differentiable functions of the encoder parameters.
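The low-variance claim can be illustrated on a one-dimensional toy problem. The sketch below is not part of the VAE itself: it assumes a hypothetical integrand $f(z)=z^2$ standing in for the log-likelihood, a scalar $q=\mathcal{N}(\mu,\sigma^2)$, and it compares the reparameterized (pathwise) gradient with respect to $\mu$ against the score-function (REINFORCE) estimator, which is introduced here only as a point of comparison. Both target the same quantity, $\partial_\mu\,\mathbb{E}_q[z^2]=2\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000

# Reparameterized samples: the randomness lives in eps, not in mu or sigma.
eps = rng.standard_normal(n)
z = mu + sigma * eps

# Pathwise estimator: differentiate f(mu + sigma * eps) = z^2 with respect to mu.
grad_pathwise = 2.0 * z

# Score-function estimator: f(z) * d/dmu log q(z), with d/dmu log q = (z - mu) / sigma^2.
grad_score = z ** 2 * (z - mu) / sigma ** 2

true_grad = 2.0 * mu
pathwise_mean, pathwise_var = grad_pathwise.mean(), grad_pathwise.var()
score_mean, score_var = grad_score.mean(), grad_score.var()
```

Both sample means land near `true_grad`, but the per-sample variance of the pathwise estimator is several times smaller in this example, which is the practical reason the reparameterized form is preferred when it is available.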
Seen this way, the VAE is a unification of two tasks:
Amortized variational inference: learn $q_{\phi}(z\mid x)$ so that inference for each datapoint is fast and approximate.
Deep generative modeling: learn $p_{\theta}(x\mid z)$ so that samples from a simple prior can be decoded into realistic observations.
The strengths and failure modes follow directly from this unified objective. Posterior collapse occurs when the decoder becomes strong enough that the optimum can ignore $z$, pushing $q_{\phi}(z\mid x)$ close to $p(z)$ and making the KL small but the latent code uninformative. Blurry reconstructions often arise when the likelihood model, such as a factorized Gaussian decoder, rewards averaging over plausible outputs. Extensions such as $\beta$-VAEs, hierarchical VAEs, richer priors, normalizing-flow posteriors, and discrete latents can all be understood as attempts to reshape one part of this same bound.
The visual below condenses this entire path into a single unified map. The upper portion organizes the five ELBO forms side by side: Jensen’s lower bound, reconstruction–regularization, entropy/free-energy, the exact variational gap, and the reparameterized Gaussian estimator. They are not competing formulas; they are algebraically equivalent perspectives, each useful for answering a different question.
The lower portion traces the lecture arc from latent-variable modeling through intractability, ELBO derivation, decomposition, tightness, reparameterization, closed-form KL terms, training, failure modes, and extensions. Read together, the table and timeline reinforce the main lesson: the VAE is best understood as an end-to-end differentiable implementation of variational inference inside a neural generative model.