Variational Autoencoders: Principles, Derivations, and Applications

1. The Generative Modeling Problem
Before we can understand what a variational autoencoder is optimizing, we need to be precise about the problem it is trying to solve. A VAE is not merely an “autoencoder with noise.” It is a generative model: a model whose goal is to learn something about the probability distribution that produced the observed data.
Suppose we are given a dataset $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$, where each observation $x_i \in \mathbb{R}^D$. For images, $D$ might be the number of pixels; for audio, it might be the number of waveform samples or spectrogram coefficients; for text embeddings, it might be the embedding dimension. The generative modeling problem is to learn a distribution $p_\theta(x)$ that approximates the unknown data-generating distribution $p_{\text{data}}(x)$.
This goal has three closely related interpretations. First, we may want density estimation: given a new point $x$, how likely is it under the model? That is, can we evaluate or approximate $p_\theta(x)$? Second, we may want synthesis: can we draw new samples $x \sim p_\theta(x)$ that look like plausible members of the dataset, but are not exact copies? Third, we may want representation learning: can the model discover compact, structured variables that explain meaningful factors of variation in the data?
These goals are connected, but they are not identical. A model may produce sharp-looking samples while assigning poor likelihoods, or it may achieve good likelihood while generating visually mediocre samples. VAEs are especially interesting because they try to tie these goals together through a probabilistic latent-variable framework: they define a likelihood, support sampling, and learn internal representations.
A first naive idea is to estimate the density directly from the observed data. For example, kernel density estimation places a small “bump” of probability mass around each training example:

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i),$$

where $K_h$ is a kernel (for instance a Gaussian) with bandwidth $h$.
In low dimensions, this can work surprisingly well. If the data points densely cover the relevant region of space, then nearby kernels overlap and form a smooth estimate of the distribution. But in high dimensions, this intuition breaks down catastrophically.
The reason is the curse of dimensionality. The volume of a ball of radius $r$, or more generally the number of distinguishable regions in a $D$-dimensional space, grows roughly exponentially with $D$. For MNIST, a $28 \times 28$ grayscale image lives in $\mathbb{R}^{784}$. Even the full training set of 60,000 images is vanishingly sparse in such a space. Almost every possible pixel vector is not a digit at all; it is visual noise. Worse, many distance-based intuitions fail: in very high dimensions, points tend to become nearly equidistant, making “nearest neighbors” and local smoothing much less reliable.
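This distance-concentration effect is easy to observe numerically. The sketch below (sample count and dimensions chosen arbitrarily for illustration) compares the relative spread of pairwise distances among random points in 2 dimensions versus 784 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=100):
    """Relative spread of pairwise distances: (max - min) / min.
    Small values mean all points are nearly equidistant."""
    pts = rng.standard_normal((n_points, dim))
    # All pairwise Euclidean distances via broadcasting.
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    d = dists[np.triu_indices(n_points, k=1)]  # upper triangle, no diagonal
    return (d.max() - d.min()) / d.min()

low = distance_contrast(2)     # low-dimensional space
high = distance_contrast(784)  # MNIST-sized ambient space
print(f"contrast in 2-D:   {low:.2f}")
print(f"contrast in 784-D: {high:.2f}")
```

In 2-D, some pairs are far closer than others, so local smoothing is meaningful; in 784-D, the closest and farthest pairs differ by only a small fraction, which is why nearest-neighbor reasoning degrades.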
So the issue is not just that we need more data. The ambient space is overwhelmingly large, and the data occupies only a tiny, highly structured subset of it. A kernel density estimator spreads probability around observed examples in the ambient space, but most of that space is irrelevant. Unless the kernel bandwidth is extremely small, it assigns mass to unrealistic regions; if the bandwidth is extremely small, it assigns nearly zero density almost everywhere. Either way, the method fails to capture the actual structure of the data distribution.
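The bandwidth dilemma can also be seen directly. The following sketch implements the Gaussian KDE from the formula above (sample sizes and bandwidth are illustrative choices, not tuned values): in 1-D it recovers the true density well, while in 784-D the same estimator assigns essentially zero density even to a point drawn from the same distribution as the training data.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kde(x, samples, h):
    """Evaluate p_hat(x) = (1/N) sum_i K_h(x - x_i) with an
    isotropic Gaussian kernel of bandwidth h."""
    n, dim = samples.shape
    sq = ((x - samples) ** 2).sum(-1)          # squared distance to each sample
    norm = (2 * np.pi * h ** 2) ** (dim / 2)   # Gaussian normalizing constant
    return np.exp(-sq / (2 * h ** 2)).sum() / (n * norm)

# 1-D: 500 samples from N(0, 1); the KDE at the mode is close to the truth.
samples_1d = rng.standard_normal((500, 1))
est = gaussian_kde(np.zeros(1), samples_1d, h=0.3)
true = 1 / np.sqrt(2 * np.pi)                  # N(0, 1) density at 0
print(f"KDE estimate at 0: {est:.3f}, true density: {true:.3f}")

# 784-D: the same estimator, queried at a fresh point from the same
# distribution, underflows to ~0 -- every training sample is too far away.
samples_hd = rng.standard_normal((500, 784))
query = rng.standard_normal(784)
est_hd = gaussian_kde(query, samples_hd, h=0.3)
print("KDE estimate in 784-D:", est_hd)
```

Increasing the bandwidth rescues the numbers but only by smearing mass over unrealistic regions, which is exactly the failure mode described above.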
This motivates the data manifold hypothesis. Although observations may be represented as vectors in a high-dimensional ambient space, the meaningful variation in the data often has much lower intrinsic dimension. For example, handwritten digits vary by stroke thickness, slant, rotation, identity, local style, and other factors. These factors are far fewer than 784 independent pixel degrees of freedom. Informally, the data may lie near a low-dimensional manifold embedded in the high-dimensional observation space.
This is where latent variables enter the story. Instead of modeling $x$ directly as an arbitrary point in the ambient space, we introduce a lower-dimensional variable

$$z \in \mathbb{R}^d, \qquad d \ll D,$$

and imagine that observations are generated from these latent coordinates. The latent variable $z$ should capture the compact explanatory factors, while the model maps from latent space into data space. This does not mean the true data manifold is literally linear, smooth everywhere, or exactly $d$-dimensional. Rather, it is a modeling assumption: useful data distributions often have enough low-dimensional structure that exploiting it is far better than treating all directions in $\mathbb{R}^D$ equally.
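A toy sketch makes the generative picture concrete. Here a fixed, untrained nonlinear map plays the role of a decoder (the network and its sizes are purely illustrative, not a trained model): sampling 2-D latents and pushing them through the map produces points in $\mathbb{R}^{784}$ that nevertheless concentrate near a low-dimensional surface, as the rapidly decaying singular value spectrum shows.

```python
import numpy as np

rng = np.random.default_rng(2)

D, d = 784, 2                            # ambient and latent dimensions
W1 = 0.3 * rng.standard_normal((d, 64))  # fixed, untrained "decoder" weights,
W2 = rng.standard_normal((64, D))        # scaled so the map is gently nonlinear

def decode(z):
    """Map latent coordinates z in R^d to observations x in R^D
    through a simple nonlinear function (a stand-in for a learned decoder)."""
    return np.tanh(z @ W1) @ W2

# Generate observations by sampling latents from a standard normal prior.
z = rng.standard_normal((1000, d))
x = decode(z)

# Although x lives in R^784, its variation is dominated by ~d directions:
# the singular values of the centered data decay sharply.
s = np.linalg.svd(x - x.mean(0), compute_uv=False)
print("top 5 singular values:", np.round(s[:5], 1))
```

The data fills only a thin, curved sliver of the 784-dimensional space, which is precisely the structure a latent variable model is designed to exploit.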
The visual below condenses this motivation into a geometric picture. On the left, the high-dimensional ambient space is mostly empty: even very large training sets provide negligible coverage when $D$ is large, which is why direct kernel-style density estimation becomes ineffective. On the right, the same data becomes much more intelligible when viewed as concentrated near a lower-dimensional manifold.
The key transition is from “estimate density everywhere in $\mathbb{R}^D$” to “model how low-dimensional latent structure gives rise to high-dimensional observations.” That transition is the conceptual starting point for latent variable models, and it is exactly the path that will lead us to variational autoencoders.