Variational autoencoder


In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It belongs to the families of probabilistic graphical models and variational Bayesian methods.[1]

Variational autoencoders are often associated with the autoencoder model because of their architectural affinity, but there are significant differences in the goal and the mathematical formulation. Variational autoencoders are probabilistic generative models that require neural networks as only a part of their overall structure. The two neural network components are typically referred to as the encoder and the decoder. The encoder maps the input variable to a latent space that corresponds to the parameters of a variational distribution. In this way, the encoder can produce multiple different samples that all come from the same distribution. The decoder has the opposite function, which is to map from the latent space to the input space in order to produce or generate data points. Both networks are typically trained together with the use of the reparameterization trick, although the variance of the noise model can be learned separately.

Although this type of model was initially designed for unsupervised learning,[2][3] its effectiveness has been proven for semi-supervised learning[4][5] and supervised learning.[6]

Overview of architecture and operation

A variational autoencoder is a generative model with a prior distribution and a noise distribution. Usually such models are trained using the Expectation-Maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. This neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder.

The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent.
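To make this division of labour concrete, the following is a minimal sketch of such an encoder/decoder pair in PyTorch. The layer sizes, the diagonal-Gaussian posterior parameterization, and all names are illustrative assumptions, not a canonical implementation.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps an input x to the parameters (mean, log-variance) of a
    diagonal-Gaussian variational posterior q(z|x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mean = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mean(h), self.log_var(h)

class Decoder(nn.Module):
    """Maps a latent code z back to the input space, e.g. to the mean of
    the noise distribution p(x|z)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return self.net(z)
```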

To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data. For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian-distributed noise, whereas tasks such as binarized MNIST require Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q distribution that overlaps with the p distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value.[7]
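As a small illustration of how the assumed noise distribution changes the reconstruction term (a sketch under the assumptions above, with hypothetical helper names): Gaussian noise makes the negative log-likelihood, up to an additive constant, a squared error, while Bernoulli noise makes it a binary cross-entropy.

```python
import torch.nn.functional as F

def gaussian_reconstruction(x, x_mean):
    # Negative log-likelihood of x under N(x_mean, I), up to an additive
    # constant: proportional to the squared error between input and output.
    return 0.5 * ((x - x_mean) ** 2).sum(dim=-1)

def bernoulli_reconstruction(x, x_logits):
    # Negative log-likelihood of binarized data under a Bernoulli noise model,
    # i.e. a binary cross-entropy on the decoder's logits.
    return F.binary_cross_entropy_with_logits(x_logits, x,
                                              reduction='none').sum(dim=-1)
```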

Formulation

File:VAE Basic.png
The basic scheme of a variational autoencoder. The model receives x as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces x' as similar as possible to x.

From the point of view of probabilistic modelling, one wants to maximize the likelihood of the data x under their chosen parameterized probability distribution p_{\theta}(x) = p(x|\theta). This distribution is usually chosen to be a Gaussian N(x|\mu,\sigma), parameterized by \mu and \sigma, which as a member of the exponential family is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions in which a prior is assumed over the latents z result in intractable integrals. Let us find p_\theta(x) via marginalizing over z.

p_\theta(x) = \int_{z}p_\theta({x,z}) \, dz,

where p_\theta({x,z}) represents the joint distribution under p_\theta of the observable data  x and its latent representation or encoding  z . According to the chain rule, the equation can be rewritten as

p_\theta(x) = \int_{z}p_\theta({x| z})p_\theta(z) \, dz

In the vanilla variational autoencoder, z is usually taken to be a finite-dimensional vector of real numbers, and p_\theta({x|z}) to be a Gaussian distribution. Then p_\theta(x) is a mixture of Gaussian distributions.
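The "mixture of Gaussians" reading can be made concrete with a naive Monte Carlo estimate of the marginal likelihood, sampling z from the prior. This is only a sketch (decoder, dimensions and sample count are placeholders), and in practice such an estimator has far too high a variance to be useful, which is precisely why the variational machinery below is introduced.

```python
import math
import torch

def naive_log_marginal_likelihood(x, decoder, z_dim=20, n_samples=10_000):
    """Log-domain Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)], with
    z drawn from a standard-normal prior and p(x|z) = N(x | decoder(z), I).
    Purely illustrative: the variance of this estimator is enormous."""
    z = torch.randn(n_samples, z_dim)          # z ~ N(0, I)
    x_mean = decoder(z)                        # means of the Gaussian p(x|z)
    log_p_x_given_z = (-0.5 * ((x - x_mean) ** 2).sum(dim=-1)
                       - 0.5 * x.shape[-1] * math.log(2 * math.pi))
    # log of the sample mean, computed stably with log-sum-exp
    return torch.logsumexp(log_p_x_given_z, dim=0) - math.log(n_samples)
```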

It is now possible to define the set of the relationships between the input data and its latent representation as

  • Prior p_\theta(z)
  • Likelihood p_\theta(x|z)
  • Posterior p_\theta(z|x)

Unfortunately, the computation of p_\theta(x) is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce a further function to approximate the posterior distribution as

q_\phi({z| x}) \approx p_\theta({z| x})

with \phi defined as the set of real values that parametrize q. This is sometimes called amortized inference, since by "investing" in finding a good q_\phi, one can later infer z from x quickly without doing any integrals.

In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution p_\theta(x|z) is computed by the probabilistic decoder, and the approximated posterior distribution q_\phi(z|x) is computed by the probabilistic encoder.

Parametrize the encoder as E_\phi, and the decoder as D_\theta.

Evidence lower bound (ELBO)


As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters \theta to reduce the reconstruction error between the input and the output, and \phi to make q_\phi({z| x}) as close as possible to p_\theta(z|x). As reconstruction loss, mean squared error and cross entropy are often used.

As the distance loss between the two distributions, the reverse Kullback–Leibler divergence D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)) is a good choice to squeeze q_\phi(z|x) under p_\theta(z|x).[7][8]

The distance loss just defined is expanded as

\begin{align}
D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)) &= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]\\
&= \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi(z|x)\,p_\theta(x)}{p_\theta(x, z)}\right]\\
&= \ln p_\theta(x) + \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{q_\phi(z|x)}{p_\theta(x, z)}\right]
\end{align}


Now define the evidence lower bound (ELBO):

L_{\theta,\phi}(x) := \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x)\parallel p_\theta(\cdot|x))

Maximizing the ELBO

\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}} \, L_{\theta,\phi}(x)

is equivalent to simultaneously maximizing \ln p_\theta(x) and minimizing D_{KL}(q_\phi(z|x)\parallel p_\theta(z|x)). That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior q_\phi(\cdot|x) from the exact posterior p_\theta(\cdot|x).

The form given is not very convenient for maximization, but the following, equivalent form, is:

L_{\theta,\phi}(x) = \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln p_\theta(x|z)\right] - D_{KL}(q_\phi(\cdot|x)\parallel p_\theta(\cdot))

where \ln p_\theta(x|z) is implemented as -\frac{1}{2}\|x - D_\theta(z)\|_2^2, since that is, up to an additive constant, what x \sim \mathcal N(D_\theta(z), I) yields. That is, we model the distribution of x conditional on z as a Gaussian distribution centered on D_\theta(z). The distributions q_\phi(z|x) and p_\theta(z) are often also chosen to be Gaussians, as z|x \sim \mathcal N(E_\phi(x), \sigma_\phi(x)^2 I) and z \sim \mathcal N(0, I), with which we obtain, by the formula for the KL divergence of Gaussians,

L_{\theta,\phi}(x) = -\frac 12 \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\|x - D_\theta(z)\|_2^2\right] - \frac 12 \left(N\sigma_\phi(x)^2 + \|E_\phi(x)\|_2^2 - 2N\ln\sigma_\phi(x)\right) + \mathrm{Const}

Here N is the dimension of z. For a more detailed derivation and more interpretations of the ELBO and its maximization, see its main page.
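A hedged sketch of this objective as a training loss, written in PyTorch: it assumes the diagonal-Gaussian encoder/decoder modules sketched earlier, uses a per-dimension log-variance rather than the single scalar \sigma_\phi(x) of the formula above, and estimates the reconstruction expectation with a single reparameterized sample.

```python
import torch

def negative_elbo(x, encoder, decoder):
    """Single-sample estimate of -L_{theta,phi}(x) for a diagonal-Gaussian
    q(z|x), a standard-normal prior p(z), and Gaussian p(x|z) with identity
    covariance. Minimizing this quantity maximizes the ELBO."""
    mu, log_var = encoder(x)                   # parameters of q(z|x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)       # reparameterized sample
    x_mean = decoder(z)
    # Reconstruction term: E_q[-ln p(x|z)], up to an additive constant.
    reconstruction = 0.5 * ((x - x_mean) ** 2).sum(dim=-1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = 0.5 * (log_var.exp() + mu.pow(2) - 1.0 - log_var).sum(dim=-1)
    return (reconstruction + kl).mean()
```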

Reparameterization

File:Reparameterization Trick.png
The scheme of the reparameterization trick. The random variable \varepsilon is injected into the latent space z as external input. In this way, it is possible to backpropagate the gradient without involving the stochastic variable during the update.

To efficiently search for

\theta^*,\phi^* = \underset{\theta,\phi}{\operatorname{arg\,max}} \, L_{\theta,\phi}(x)

the typical method is gradient descent.

It is straightforward to find

\nabla_\theta \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\nabla_\theta \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]

However,

\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]

does not allow one to put the \nabla_\phi inside the expectation, since \phi appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation[9]) bypasses this difficulty.[7][10][11]

The most important example is when z \sim q_\phi(\cdot | x)  is normally distributed, as \mathcal N(\mu_\phi(x), \Sigma_\phi(x))  .

This can be reparametrized by letting \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \boldsymbol{I}) be a "standard random number generator", and constructing z as z = \mu_\phi(x) + L_\phi(x)\epsilon. Here, L_\phi(x) is obtained by the Cholesky decomposition:

\Sigma_\phi(x) = L_\phi(x)L_\phi(x)^T

Then we have

\nabla_\phi \mathbb E_{z \sim q_\phi(\cdot | x)} \left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{\epsilon}\left[\nabla_\phi \ln \frac{p_\theta(x, \mu_\phi(x) + L_\phi(x)\epsilon)}{q_\phi(\mu_\phi(x) + L_\phi(x)\epsilon | x)}\right]

and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.

Since we reparametrized z, we need to find q_\phi(z|x). Let q_0 be the probability density function for \epsilon. Then

\ln q_\phi(z | x) = \ln q_0(\epsilon) - \ln|\det(\partial_\epsilon z)|

where \partial_\epsilon z is the Jacobian matrix of z with respect to \epsilon. Since z = \mu_\phi(x) + L_\phi(x)\epsilon, this is

\ln q_\phi(z | x) = -\frac 12 \|\epsilon\|^2 - \ln|\det L_\phi(x)| - \frac n2 \ln(2\pi)
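In the common diagonal-covariance case, L_\phi(x) is simply the vector of standard deviations, and both the trick and the density above take a particularly simple form. A sketch in PyTorch, with illustrative names and a per-dimension log-variance as the assumed encoder output:

```python
import math
import torch

def sample_with_reparameterization(mu, log_var):
    """Draws z ~ N(mu, diag(exp(log_var))) so that the sample stays
    differentiable with respect to mu and log_var: the randomness enters
    only through eps, which does not depend on the encoder parameters phi."""
    std = torch.exp(0.5 * log_var)   # diagonal Cholesky factor L_phi(x)
    eps = torch.randn_like(std)      # eps ~ N(0, I), the external noise source
    return mu + std * eps            # z = mu_phi(x) + L_phi(x) eps

def log_q(z, mu, log_var):
    """ln q_phi(z|x) for the diagonal-Gaussian case, matching the
    change-of-variables formula above (here ln|det L_phi(x)| = sum ln std)."""
    std = torch.exp(0.5 * log_var)
    eps = (z - mu) / std
    n = z.shape[-1]
    return (-0.5 * (eps ** 2).sum(dim=-1)
            - torch.log(std).sum(dim=-1)
            - 0.5 * n * math.log(2 * math.pi))
```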


Variations

Many applications and extensions of variational autoencoders have been used to adapt the architecture to other domains and improve its performance.

\beta-VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for \beta values greater than one. This architecture can discover disentangled latent factors without supervision.[12][13]
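A minimal sketch of this weighting, reusing the per-example reconstruction and KL terms from the Gaussian ELBO sketch above (the default \beta value here is purely illustrative):

```python
def beta_vae_loss(reconstruction, kl, beta=4.0):
    """Negative beta-weighted ELBO: for beta > 1 the KL term is weighted more
    heavily, which encourages factorised (disentangled) latent representations."""
    return (reconstruction + beta * kl).mean()
```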

The conditional VAE (CVAE) inserts label information into the latent space to force a deterministic, constrained representation of the learned data.[14]

Some structures directly deal with the quality of the generated samples[15][16] or implement more than one latent space to further improve the representation learning.[17][18]

Some architectures mix VAE and generative adversarial networks to obtain hybrid models.[19][20][21]


References

  1. Lua error in package.lua at line 80: module 'strict' not found.
  2. Lua error in package.lua at line 80: module 'strict' not found.
  3. Lua error in package.lua at line 80: module 'strict' not found.
  4. Lua error in package.lua at line 80: module 'strict' not found.
  5. Lua error in package.lua at line 80: module 'strict' not found.
  6. Lua error in package.lua at line 80: module 'strict' not found.
  7. 7.0 7.1 7.2 Lua error in package.lua at line 80: module 'strict' not found.
  8. Lua error in package.lua at line 80: module 'strict' not found.
  9. Lua error in package.lua at line 80: module 'strict' not found.
  10. Lua error in package.lua at line 80: module 'strict' not found.
  11. Lua error in package.lua at line 80: module 'strict' not found.
  12. Lua error in package.lua at line 80: module 'strict' not found.
  13. Lua error in package.lua at line 80: module 'strict' not found.
  14. Lua error in package.lua at line 80: module 'strict' not found.
  15. Lua error in package.lua at line 80: module 'strict' not found.
  16. Lua error in package.lua at line 80: module 'strict' not found.
  17. Lua error in package.lua at line 80: module 'strict' not found.
  18. Lua error in package.lua at line 80: module 'strict' not found.
  19. Lua error in package.lua at line 80: module 'strict' not found.
  20. Lua error in package.lua at line 80: module 'strict' not found.
  21. Lua error in package.lua at line 80: module 'strict' not found.