A comparison study between VAEs and GANs
EL-KADDOURY Mohamed1, Abdelhak MAHMOUDI2, and Mohammed
Majid Himmi1
1 LIMIARF, Faculty of sciences, Mohammed V University, Rabat, Morocco
{mh.kadouri,himmi.fsr}@gmail.com
2 LIMIARF, Ecole Normale Supérieure, Mohammed V University, Rabat, Morocco
abdelhak.mahmoudi@um5.ac.ma
Abstract. Generative models have shown huge improvements in recent years. In particular, generative adversarial networks (GANs) have proven useful for many different problems. In this paper we compare two kinds of generative models, GANs and variational autoencoders (VAEs). We apply these methods to different data sets to point out their differences and to assess their capabilities and limits. We find that while VAEs are easier as well as faster to train, their results are in general more blurry than the images generated by GANs. The latter, on the other hand, contain more details, which may be realistic but are often just noise.
Keywords: Generative Adversarial Networks · Image Generation · Generative Models.
1 Introduction
Unsupervised learning from large unlabeled datasets is an active research area. In
practice, millions of images and videos are unlabeled and one can leverage them
to learn good intermediate feature representations via approaches in unsuper-
vised learning, which can then be used for other supervised or semi-supervised
learning tasks such as classification. One approach for unsupervised learning is
to learn a generative model. Two popular methods in computer vision are varia-
tional auto-encoders (VAEs) [1] and generative adversarial networks (GANs) [2].
Variational auto-encoders are a class of deep generative models based on
variational methods. With sophisticated VAE models, one can not only generate
realistic images, but also replicate consistent style. For example, DRAW [5] was
able to generate images of house numbers with number combinations not seen
in the training set, but with a consistent style/color/font of street sign in each
image. Additionally, as models learn to generate realistic output, they learn im-
portant features along the way, which potentially can be used for classification;
we consider this in the Conditional VAE and semi-supervised learning [11] mod-
els. However, one main criticism of VAE models is that their generated output
is often blurry.
The GANs framework was first proposed by Ian Goodfellow et al. in 2014 [2].
A generator model and a discriminator model both built by multilayer percep-
trons are the basic modules of vanilla GANs. The goal of GANs is to estimate
generative models that can capture the distribution of real data with the adver-
sarial assistance of a paired discriminator based on min-max game theory. After
the birth of GANs, a great many variants of GANs have been widely researched
to generate effective synthetic samples, such as image generation[2], image inpainting[12], image translation[13], super-resolution[14], image de-occlusion[15], natural language generation[16], and text generation[17]. These powerful learning capabilities have achieved great success in many fields.
Since generation problems have no concrete target vector compared to normal
supervised learning, new methods have had to be found for these tasks. In this
paper, we want to compare two of these architectures, variational autoencoders
(VAEs) and generative adversarial networks (GANs), on different datasets. Sec-
tions 2 and 3 present the theory and architecture of the two models, which are then compared on different datasets in Section 4.
2 Variational Auto-encoders
Let x be a vector of D observable variables and z ∈ RM a vector of stochastic
latent variables. Further, let pθ(x, z) be a parametric model of the joint distribu-
tion. Given data X = {x1, …, xN} we typically aim at maximizing the average
marginal log-likelihood, $\frac{1}{N}\ln p(X) = \frac{1}{N}\sum_{i=1}^{N} \ln p(x_i)$, with respect to the parameters. However, when the model is parameterized by a neural network (NN), the
optimization could be difficult due to the intractability of the marginal likeli-
hood. One possible way of overcoming this issue is to apply variational inference
and optimize the following lower bound:
$$\mathbb{E}_{x\sim q(x)}[\ln p(x)] \;\geq\; \mathbb{E}_{x\sim q(x)}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\ln p_\theta(x|z) + \ln p_\lambda(z) - \ln q_\phi(z|x)\big]\Big] \;\triangleq\; \mathcal{L}(\phi, \theta, \lambda) \qquad (1)$$
where $q(x) = \frac{1}{N}\sum_{n=1}^{N}\delta(x - x_n)$ is the empirical distribution, $q_\phi(z|x)$ is the variational posterior (the encoder), $p_\theta(x|z)$ is the generative model (the decoder), $p_\lambda(z)$ is the prior, and $\phi, \theta, \lambda$ are their parameters, respectively.
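For completeness, the bound in (1) follows for a single data point from Jensen's inequality applied to an importance-weighted form of the marginal likelihood,
$$\ln p(x) = \ln \int p_\theta(x|z)\, p_\lambda(z)\, dz = \ln \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\, p_\lambda(z)}{q_\phi(z|x)}\right] \;\geq\; \mathbb{E}_{q_\phi(z|x)}\big[\ln p_\theta(x|z) + \ln p_\lambda(z) - \ln q_\phi(z|x)\big],$$
and taking the expectation over the empirical distribution $q(x)$ then yields (1).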
There are various ways of optimizing this lower bound but for continuous z
this could be done efficiently through the re-parameterization of $q_\phi(z|x)$ [3][4],
which yields a variational auto-encoder architecture (VAE). Therefore, during
learning we consider a Monte Carlo estimate of the second expectation in (1)
using L sample points:
$$\begin{aligned}
\tilde{\mathcal{L}}(\phi, \theta, \lambda) = \mathbb{E}_{x\sim q(x)}\Bigg[\frac{1}{L}\sum_{l=1}^{L}\Big(&\ln p_\theta(x|z_\phi^{(l)}) && (2)\\
&+ \ln p_\lambda(z_\phi^{(l)}) - \ln q_\phi(z_\phi^{(l)}|x)\Big)\Bigg], && (3)
\end{aligned}$$
where $z_\phi^{(l)}$ are sampled from $q_\phi(z|x)$ through the reparameterization trick.
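To make the reparameterization trick concrete, the following is a minimal NumPy sketch, assuming the diagonal Gaussian posterior introduced at the end of this section; the names (`mu`, `log_var`, `n_samples`) are illustrative and not taken from the paper.

```python
import numpy as np

def sample_z(mu, log_var, n_samples=1):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients with respect to
    mu and log_var (and hence the encoder weights phi) can flow through
    this deterministic transformation.
    """
    eps = np.random.randn(n_samples, *np.shape(mu))  # noise, independent of phi
    sigma = np.exp(0.5 * np.asarray(log_var))        # standard deviation
    return np.asarray(mu) + sigma * eps

# Example: a 2-dimensional latent code with L = 3 Monte Carlo samples.
z_samples = sample_z(mu=[0.0, 1.0], log_var=[0.0, -1.0], n_samples=3)
print(z_samples.shape)  # (3, 2)
```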
The first component of the objective function can be seen as the expectation
of the negative reconstruction error that forces the hidden representation for
each data case to be peaked at its specific MAP value. In contrast, the second and third components constitute a kind of regularization that drives the encoder (see Fig. 1) to match the prior.
We can get more insight into the role of the prior by inspecting the gradient
of L̃(φ, θ, λ) in (2) and (3) with respect to a single weight φi for a single data
point x, see Eq. (4) and (5) for details. We notice that the prior plays a role of
an anchor that keeps the posterior close to it, i.e., the term in round brackets in
Eq. (5) is 0 if the posterior matches the prior.
Fig. 1. VAE Architecture.
$$\begin{aligned}
\frac{\partial}{\partial \phi_i}\tilde{\mathcal{L}}(x;\phi,\theta,\lambda) = \frac{1}{L}\sum_{l=1}^{L}\Bigg[\; &\frac{1}{p_\theta(x|z_\phi^{(l)})}\,\frac{\partial p_\theta(x|z_\phi^{(l)})}{\partial z_\phi}\,\frac{\partial z_\phi^{(l)}}{\partial \phi_i} \;-\; \frac{1}{q_\phi(z_\phi^{(l)}|x)}\,\frac{\partial q_\phi(z_\phi^{(l)}|x)}{\partial \phi_i} && (4)\\
+\; &\frac{1}{p_\lambda(z_\phi^{(l)})\, q_\phi(z_\phi^{(l)}|x)}\left(q_\phi(z_\phi^{(l)}|x)\,\frac{\partial p_\lambda(z_\phi^{(l)})}{\partial z_\phi} \;-\; p_\lambda(z_\phi^{(l)})\,\frac{\partial q_\phi(z_\phi^{(l)}|x)}{\partial z_\phi}\right)\frac{\partial z_\phi^{(l)}}{\partial \phi_i}\;\Bigg] && (5)
\end{aligned}$$
Typically, the encoder is assumed to have a diagonal covariance matrix, i.e., $q_\phi(z|x) = \mathcal{N}(z\,|\,\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x)))$, where $\mu_\phi(x)$ and $\sigma^2_\phi(x)$ are parameterized by a NN with weights $\phi$, and the prior is expressed using the standard normal distribution, $p_\lambda(z) = \mathcal{N}(z\,|\,0, I)$. The decoder utilizes a suitable distribution for the data under consideration, e.g., the Bernoulli distribution for binary data or the normal distribution for continuous data, and it is parameterized by a NN with weights $\theta$.
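As a rough illustration of this parameterization, the sketch below assembles a small fully connected VAE in Keras with a diagonal Gaussian encoder, a standard normal prior, and a Bernoulli decoder; for simplicity the KL term is computed in closed form instead of the Monte Carlo estimate of Eqs. (2)-(3). It is only a minimal sketch: the layer sizes, latent dimension, and variable names are assumptions, not the architecture used in the experiments of Section 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

input_dim, latent_dim = 784, 2   # assumed sizes (e.g. flattened 28x28 images)

# Encoder q_phi(z|x): diagonal Gaussian with mean mu_phi(x) and variance sigma^2_phi(x).
x_in = layers.Input(shape=(input_dim,))
h = layers.Dense(256, activation="relu")(x_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

def sampling(args):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    mu, log_var = args
    eps = K.random_normal(shape=(K.shape(mu)[0], latent_dim))
    return mu + K.exp(0.5 * log_var) * eps

z = layers.Lambda(sampling)([z_mean, z_log_var])

# Decoder p_theta(x|z): Bernoulli likelihood for binary data (sigmoid outputs).
h_dec = layers.Dense(256, activation="relu")(z)
x_out = layers.Dense(input_dim, activation="sigmoid")(h_dec)

vae = Model(x_in, x_out)

# Negative ELBO = expected reconstruction error + KL(q_phi(z|x) || N(0, I)).
recon = input_dim * K.mean(K.binary_crossentropy(x_in, x_out), axis=-1)
kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae.add_loss(K.mean(recon + kl))
vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=30, batch_size=128)  # x_train: images scaled to [0, 1]
```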
3 Generative Adversarial Nets
The GANs [2] framework is composed of a generator G(z) and a discriminator
D(x), where z is random noise. The generator G(z) tries to generate more and
more verisimilar data to fool the discriminator D(x), while the discriminator
D(x) aims to tell apart the fake data from the real data. These two adversarial
opponents are optimized to overpower each other and play a zero-sum game
(also called the min-max game) throughout the training process (see Fig. 2). The
random noise vectors $z \in \mathbb{R}^N$ (usually drawn from a uniform or Gaussian distribution)
are provided as the input of the generator G(z). The generator G(z) then produces synthetic data, x̃ = G(z). The real data x and the fake data x̃ are both fed to the discriminator D(x), which outputs a scalar representing the probability that the input comes from the real data distribution p(x) rather than from the generator G(z). The two adversarial players are optimized by the adversarial training process. The value function of this process is as follows (GANs learn the generator G(z) and the discriminator D(x) by solving a Nash equilibrium problem):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \qquad (6)$$
where $p_z(z)$ is the distribution of the random noise (a uniform distribution in most early GANs, $p_z(z) = U(0, 1)$). Both the generator G(z) and the discriminator D(x) in the original GANs are built from multilayer perceptrons. They are both trained using stochastic gradient descent (SGD) [6] according to Equation 6.
From the perspective of the generator G(z):
$$\min_G V_G(D, G) = \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \qquad (7)$$
– $x_G = G(z)$: the generator is modelled to transform a random vector z into a target sample $x_G$.
– $p_{data}(x_G)$, the probability that the generated samples belong to the distribution of the real data, is maximized when training G.
– $p_z(z)$ is a fixed prior distribution, assumed by GANs to be easy to sample from.
From the perspective of the discriminator D(x):
$$\max_D V_D(D, G) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \qquad (8)$$
– The GANs framework uses a sigmoid neuron at the last layer of the discriminator D(x), so its output lies in [0, 1].
– The discriminator tries to assign a high value (the upper limit is 1) to real data, while assigning a low value (the lower limit is 0) to fake data from the generator.
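To illustrate how Equations (6)-(8) translate into training, the following is a minimal Keras sketch of the alternating optimization: one discriminator step on real and fake batches (Eq. (8)), followed by one generator step through a stacked model with the discriminator frozen. As is common in practice, the generator is trained to make D output 1 on fakes (a non-saturating variant of Eq. (7)). Network sizes, the noise dimension, and the names used here are assumptions for illustration, not the models used in the experiments.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

noise_dim, data_dim = 100, 784   # assumed sizes (e.g. flattened 28x28 images)

# Discriminator D(x): sigmoid output in [0, 1], trained according to Eq. (8).
D = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
D.compile(optimizer=optimizers.Adam(2e-4), loss="binary_crossentropy")

# Generator G(z): maps a noise vector z to a synthetic sample x_tilde = G(z).
G = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(noise_dim,)),
    layers.Dense(data_dim, activation="sigmoid"),
])

# Stacked model D(G(z)) used to update G while D's weights stay frozen.
D.trainable = False
gan = models.Sequential([G, D])
gan.compile(optimizer=optimizers.Adam(2e-4), loss="binary_crossentropy")

def train_step(x_real, batch_size=64):
    # 1) Discriminator step: push D(x) towards 1 on real data and 0 on fakes.
    z = np.random.normal(size=(batch_size, noise_dim))
    x_fake = G.predict(z, verbose=0)
    D.train_on_batch(x_real, np.ones((batch_size, 1)))
    D.train_on_batch(x_fake, np.zeros((batch_size, 1)))
    # 2) Generator step: ask the frozen D to label the generated samples as real.
    z = np.random.normal(size=(batch_size, noise_dim))
    gan.train_on_batch(z, np.ones((batch_size, 1)))
```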
Fig. 2. GAN Architecture.
4 Experiments and Results
The described methods were applied to three different data sets, namely MNIST [8], CIFAR10 [7] and CelebA [9]. To reduce the required computational effort we resized the images to a maximum size of 72 × 72 pixels. Implementations were done with Python and Keras [10].
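The paper does not detail the resizing procedure, so the following is only an assumed sketch of how images could be loaded and capped at 72 × 72 pixels with Keras utilities; the function name and the choice to scale pixel values to [0, 1] are illustrative.

```python
import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array

def resize_to_max(images, max_size=72):
    """Resize each image so that neither side exceeds max_size pixels."""
    out = []
    for img in images:
        h, w = img.shape[:2]
        scale = min(max_size / h, max_size / w, 1.0)        # never upscale
        new_w, new_h = int(w * scale), int(h * scale)
        pil_img = array_to_img(img).resize((new_w, new_h))  # PIL expects (w, h)
        out.append(img_to_array(pil_img) / 255.0)           # pixel values in [0, 1]
    return np.asarray(out)

(x_train, _), _ = cifar10.load_data()   # 32x32 RGB images, already below 72x72
x_small = resize_to_max(x_train[:1000])
print(x_small.shape)                    # (1000, 32, 32, 3)
```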
4.1 MNIST Dataset
As the MNIST data set has the least variance of the examined datasets, one ex-
pects the two models to generate realistic images. Therefore, this data set is the
first one to be examined. Example images, generated by a conditioned VAE and an auxiliary GAN, can be seen in Figure 3. Both models generate im-
ages which can easily be recognized as digits. While the GAN generates sharper
images, the VAE tends to smooth the edges of the digits.
Fig. 3. Comparison of sampled images of the two models based on the MNIST dataset: (a) samples generated by the GAN; (b) samples generated by the VAE.
4.2 CIFAR10 Dataset
The CIFAR10 data set has a great variance of motifs and camera angles. Therefore, one expects these images to be harder to generate than the previous examples. This is confirmed by the resulting images in Figure 4. The images of the VAE are once again blurry and no realistic objects can be recognized. The GAN generates images with sharper edges; nevertheless, most of the generated objects cannot be uniquely identified.
Fig. 4. Comparison of sampled images of the two models based on the CIFAR10 dataset: (a) samples generated by the GAN; (b) samples generated by the VAE.
4.3 CelebA Dataset
Lastly, the generative models have been used to generate portrait images of
humans using the CelebA dataset. Figure 5 shows exemplary generated portraits, comparing the GAN to the results of the VAE. The GAN again produces much sharper images than the VAE. Nevertheless, the faces produced by the VAE have a more natural appearance. Apart from the blurry earth-colored background, some VAE-generated images resemble realistic faces. In Figures 5a and 5b, conditioning on male and female persons is demonstrated.
5 Conclusion
The main difference between the methods examined here is their learning pro-
cess. VAEs minimize a loss for reproducing a certain image and can, therefore,
Fig. 5. Comparison of sampled images of the two models based on the CelebA dataset: (a) samples generated by the GAN; (b) samples generated by the VAE.
be considered as semi-supervised learning. GANs, on the other hand, are con-
sidered unsupervised learning because they do not use labeled pixels. The most important difference found in this work was the training time of the two methods. GANs generally took much longer to train, both in terms of the number of epochs and in terms of run time. Because the experiments were run on different hardware, a quantitative comparison is not given here. Another advantage of VAEs is their stability. For GANs, highly oscillating image quality could be observed in the course of training: clear pictures turned into purely grey images within only a few epochs. The use of VAEs therefore proved a lot more stable. The problem with VAEs is that, with increasing diversity of the data set, the generated images become more and more blurry and a lot of details get lost (see the CIFAR10 results). With GANs this does not necessarily occur. We conclude that for low-diversity datasets like MNIST, both methods give sufficiently realistic images. For more complex data sets this was not the case in this work, but prior work like [8, 9] shows that it is possible to generate realistic images with both of the techniques used here.
Finally, using VAEs one can achieve results in less time, but with decreased image quality compared to the results obtained with GANs.
References
1. P. Goyal, Z. Hu, X. Liang, C. Wang, and E. Xing. Nonparametric Variational Au-
toencoders for Hierarchical Representation Learning. arXiv:1703.07027, 2017.
2. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.
Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
3. D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114,
2013.
4. D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and
Approximate Inference in Deep Generative Models. ICML, pages 1278-1286, 2014.
5. Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan
Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint
arXiv:1502.04623, 2015.
6. A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737-1754, 2009.
7. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny
images. Technical report, University of Toronto, 2009.
8. Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
9. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face at-
tributes in the wild. In Proceedings of International Conference on Computer Vision
(ICCV), December 2015.
10. Francois Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
11. M. E. Abbasnejad, A. Dick, and A. van den Hengel. Infinite variational autoencoder for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
12. R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N.
Do, Semantic image inpainting with deep generative models, in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5485-5493.
13. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with
conditional adversarial networks, arXiv preprint arXiv:1611.07004, 2016.
14. C. Ledig, Z. Wang, W. Shi, L. Theis, F. Huszar, J. Caballero, A. Cunningham,
A. Acosta, A. Aitken, and A. Tejani, Photo-realistic single image super-resolution
using a generative adversarial network, 2016.
15. F. Zhao, J. Feng, J. Zhao, W. Yang, and S. Yan. Robust LSTM-autoencoders for face de-occlusion in the wild. IEEE Transactions on Image Processing, vol. 27, no. 2, pp. 778-790, 2018.
16. O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf, Language generation with
recurrent generative adversarial networks without pre-training, arXiv preprint
arXiv:1706.01399, 2017.
17. L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017, pp. 2852-2858.