A comparison study between VAEs and GANs

EL-KADDOURY Mohamed1, Abdelhak MAHMOUDI2, and Mohammed Majid Himmi1

1 LIMIARF, Faculty of sciences, Mohammed V University, Rabat, Morocco

{mh.kadouri,himmi.fsr}@gmail.com

2 LIMIARF, Ecole Normale Supérieure, Mohammed V University, Rabat, Morocco

abdelhak.mahmoudi@um5.ac.ma

Abstract. Generative models have shown huge improvements in recent years. In particular, Generative Adversarial Networks (GANs) have proven useful for many different problems. In this paper we compare two kinds of generative models, GANs and Variational Autoencoders (VAEs). We apply both methods to different data sets to point out their differences and to examine their capabilities and limits. We find that while VAEs are easier as well as faster to train, their results are in general more blurry than the images generated by GANs. The latter, on the other hand, contain more details, which may be realistic but are often just noise.

Keywords: Generative Adversarial Network · Image Generation · Generative models.

1 Introduction

Unsupervised learning from large unlabeled datasets is an active research area. In

practice, millions of images and videos are unlabeled and one can leverage them

to learn good intermediate feature representations via approaches in unsuper-

vised learning, which can then be used for other supervised or semi-supervised

learning tasks such as classification. One approach for unsupervised learning is

to learn a generative model. Two popular methods in computer vision are varia-

tional auto-encoders (VAEs) [1] and generative adversarial networks (GANs) [2].

Variational auto-encoders are a class of deep generative models based on

variational methods. With sophisticated VAE models, one can not only generate

realistic images, but also replicate consistent style. For example, DRAW [5] was

able to generate images of house numbers with number combinations not seen

in the training set, but with a consistent style/color/font of street sign in each

image. Additionally, as models learn to generate realistic output, they learn im-

portant features along the way, which potentially can be used for classification;

we consider this in the Conditional VAE and semi-supervised learning [11] mod-

els. However, one main criticism of VAE models is that their generated output

is often blurry.

The GANs framework was first proposed by Ian Goodfellow et al. in 2014 [2]. A generator model and a discriminator model, both built from multilayer perceptrons, are the basic modules of vanilla GANs. The goal of GANs is to estimate generative models that capture the distribution of real data with the adversarial assistance of a paired discriminator, based on a min-max game. Since the introduction of GANs, a great many variants have been researched for tasks such as image generation [2], image inpainting [12], image translation [13], super-resolution [14], image de-occlusion [15], natural language generation [16], and text generation [17]. These powerful learning capabilities have brought great success in many fields.

Since generation problems have no concrete target vector compared to normal

supervised learning, new methods have had to be found for these tasks. In this

paper, we want to compare two of these architectures, variational autoencoders

(VAEs) and generative adversarial networks (GANs), on different datasets. Sections 2 and 3 present the theory and architecture of the two models, and Section 4 compares them on different datasets.

2 Variational Auto-encoders

Let $x$ be a vector of $D$ observable variables and $z \in \mathbb{R}^M$ a vector of stochastic latent variables. Further, let $p_\theta(x, z)$ be a parametric model of the joint distribution. Given data $X = \{x_1, \dots, x_N\}$, we typically aim at maximizing the average marginal log-likelihood, $\frac{1}{N} \ln p(X) = \frac{1}{N} \sum_{i=1}^{N} \ln p(x_i)$, with respect to the parameters. However, when the model is parameterized by a neural network (NN), the optimization can be difficult due to the intractability of the marginal likelihood. One possible way of overcoming this issue is to apply variational inference and optimize the following lower bound:

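$$
\mathbb{E}_{x \sim q(x)}[\ln p(x)] \geq \mathbb{E}_{x \sim q(x)}\big[\mathbb{E}_{q_\phi(z|x)}[\ln p_\theta(x|z) + \ln p_\lambda(z) - \ln q_\phi(z|x)]\big] \triangleq L(\phi, \theta, \lambda) \tag{1}
$$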

where $q(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n)$ is the empirical distribution, $q_\phi(z|x)$ is the variational posterior (the encoder), $p_\theta(x|z)$ is the generative model (the decoder), $p_\lambda(z)$ is the prior, and $\phi$, $\theta$, $\lambda$ are their respective parameters.
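A brief sketch of where this bound comes from, for a single data point $x$ and using only the encoder, decoder and prior defined above: writing the marginal likelihood as an expectation under the encoder and applying Jensen's inequality gives

$$
\ln p(x) = \ln \int p_\theta(x|z)\, p_\lambda(z)\, dz = \ln \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x|z)\, p_\lambda(z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi(z|x)}\left[\ln p_\theta(x|z) + \ln p_\lambda(z) - \ln q_\phi(z|x)\right].
$$

Averaging both sides over the empirical distribution $q(x)$ yields the bound in (1). The gap between the two sides is the KL divergence between $q_\phi(z|x)$ and the true posterior $p(z|x)$, so the bound is tight when the two coincide.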

There are various ways of optimizing this lower bound, but for continuous $z$ this can be done efficiently through the re-parameterization of $q_\phi(z|x)$ [3][4], which yields a variational auto-encoder (VAE) architecture. Therefore, during learning we consider a Monte Carlo estimate of the second expectation in (1) using $L$ sample points:

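\begin{align}
\tilde{L}(\phi, \theta, \lambda) = \mathbb{E}_{x \sim q(x)}\Big[ \frac{1}{L} \sum_{l=1}^{L} \big( & \ln p_\theta(x|z_\phi^{(l)}) \tag{2} \\
& + \ln p_\lambda(z_\phi^{(l)}) - \ln q_\phi(z_\phi^{(l)}|x) \big) \Big], \tag{3}
\end{align}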

where $z_\phi^{(l)}$ are sampled from $q_\phi(z|x)$ through the reparameterization trick.

The first component of the objective function can be seen as the expectation of the negative reconstruction error, which forces the hidden representation for each data case to be peaked at its specific MAP value. In contrast, the second and third components constitute a kind of regularization that drives the encoder (see Fig. 1) to match the prior.
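The Monte Carlo estimate in (2) and (3) can be sketched for a single data point in Python with NumPy. This is a minimal illustration, not the implementation used in the experiments: it assumes a diagonal Gaussian encoder, a standard normal prior and a Bernoulli decoder, and the helper names, the toy decoder weights `W` and the example values of `mu` and `log_var` are all hypothetical.

```python
import numpy as np

def log_normal_diag(z, mu, log_var):
    """Log-density of N(z | mu, diag(exp(log_var))), summed over latent dims."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )

def log_bernoulli(x, probs, eps=1e-7):
    """Log-likelihood of binary x under independent Bernoulli(probs)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return np.sum(x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs), axis=-1)

def elbo_estimate(x, mu, log_var, decode, n_samples, rng):
    """Average of ln p(x|z) + ln p(z) - ln q(z|x) over n_samples draws of z."""
    total = 0.0
    for _ in range(n_samples):
        noise = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * noise  # reparameterization trick
        log_px_z = log_bernoulli(x, decode(z))
        log_pz = log_normal_diag(z, np.zeros_like(z), np.zeros_like(z))
        log_qz_x = log_normal_diag(z, mu, log_var)
        total += log_px_z + log_pz - log_qz_x
    return total / n_samples

# Toy setup: 2 latent dimensions, 4 binary pixels, a fixed linear-sigmoid decoder.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))  # hypothetical decoder weights
decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W)))
x = np.array([1.0, 0.0, 1.0, 1.0])
mu = np.array([0.3, -0.2])        # stand-in for the encoder mean
log_var = np.array([-1.0, -0.5])  # stand-in for the encoder log-variance
value = elbo_estimate(x, mu, log_var, decode, n_samples=64, rng=rng)
```

Because `z` is written as a deterministic function of `mu`, `log_var` and independent noise, gradients with respect to the encoder parameters can flow through the sampling step, which is the point of the reparameterization.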

We can get more insight into the role of the prior by inspecting the gradient of $\tilde{L}(\phi, \theta, \lambda)$ in (2) and (3) with respect to a single weight $\phi_i$ for a single data point $x$; see Eqs. (4) and (5) for details. We notice that the prior plays the role of an anchor that keeps the posterior close to it, i.e., the term in round brackets in Eq. (5) is 0 if the posterior matches the prior.

Fig. 1. VAE Architecture.

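\begin{align}
\frac{\partial}{\partial \phi_i} \tilde{L}(x; \phi, \theta, \lambda) = \frac{1}{L} \sum_{l=1}^{L} \Big[ & \frac{1}{p_\theta(x|z_\phi^{(l)})} \frac{\partial}{\partial z_\phi} p_\theta(x|z_\phi^{(l)}) \frac{\partial}{\partial \phi_i} z_\phi^{(l)} - \frac{1}{q_\phi(z_\phi^{(l)}|x)} \frac{\partial}{\partial \phi_i} q_\phi(z_\phi^{(l)}|x) \tag{4} \\
& + \frac{1}{p_\lambda(z_\phi^{(l)})\, q_\phi(z_\phi^{(l)}|x)} \Big( q_\phi(z_\phi^{(l)}|x) \frac{\partial}{\partial z_\phi} p_\lambda(z_\phi^{(l)}) - p_\lambda(z_\phi^{(l)}) \frac{\partial}{\partial z_\phi} q_\phi(z_\phi^{(l)}|x) \Big) \frac{\partial}{\partial \phi_i} z_\phi^{(l)} \Big] \tag{5}
\end{align}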

Typically, the encoder is assumed to have a diagonal covariance matrix, i.e., $q_\phi(z|x) = \mathcal{N}(z|\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))$, where $\mu_\phi(x)$ and $\sigma_\phi^2(x)$ are parameterized by a NN with weights $\phi$, and the prior is expressed using the standard normal distribution, $p_\lambda(z) = \mathcal{N}(z|0, I)$. The decoder utilizes a suitable distribution for the data under consideration, e.g., the Bernoulli distribution for binary data or the normal distribution for continuous data, and it is parameterized by a NN with weights $\theta$.
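With this Gaussian encoder and standard normal prior, the regularization part of the bound has a well-known closed form, $\mathrm{KL}(q_\phi(z|x)\,\|\,p_\lambda(z)) = \frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \ln \sigma_j^2)$. The Python sketch below evaluates it with NumPy. The function name and the log-variance parameterization are illustrative assumptions, and whether the experiments used this closed form or a sampled estimate is not stated here.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# The penalty vanishes when the posterior equals the prior.
kl_match = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # 0.0
# Shifting one mean by 1 (unit variance) costs 0.5 nats.
kl_shift = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])  # 0.5
```

The first usage line mirrors the anchor role of the prior described above: the regularizer contributes nothing once the posterior matches the prior.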

3 Generative Adversarial Nets

The GANs [2] framework is composed of a generator G(z) and a discriminator D(x), where z is random noise. The generator G(z) tries to generate increasingly realistic data to fool the discriminator D(x), while the discriminator D(x) aims to tell the fake data apart from the real data. These two adversarial opponents are optimized to overpower each other and play a zero-sum game (also called the min-max game) throughout the training process (see Fig. 2). The random noise $z \in \mathbb{R}^N$ (usually drawn from a uniform or Gaussian distribution) is provided as the input of the generator G(z), which then generates synthetic data $\tilde{x} = G(z)$. The real data $x$ and the fake data $\tilde{x}$ are both fed to the discriminator D(x), which outputs a scalar representing the probability that its input comes from the real data distribution p(x) rather than from the generator G(z). The two adversarial players are optimized by the adversarial training process; GANs learn the generator G(z) and the discriminator D(x) by solving a Nash equilibrium problem. The value function of this adversarial process is as follows:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{6}
$$

where $p_z(z)$ is the distribution of the random noise (a uniform distribution in most early GANs, $p_z(z) = U(0, 1)$). Both the generator G(z) and the discriminator D(x) in the original GANs are built from multilayer perceptrons. They are both trained using stochastic gradient descent (SGD) [6] according to Eq. (6).

From the perspective of the generator G(z):

$$
\min_G V_G(D, G) = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{7}
$$

– $x_G = G(z)$ means that the generator is modelled to transform a random vector $z$ into a target sample $x_G$.

– $p_{data}(x_G)$, the probability that the generated samples belong to the distribution of real data, is maximized when training G.

– $p_z(z)$ is a fixed, easy-to-sample prior distribution that GANs assume.

From the perspective of the discriminator D(x):

$$
\max_D V_D(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{8}
$$

– The GANs framework uses a sigmoid neuron at the last layer of the discriminator D(x), so its output is in [0, 1].

– The discriminator tries to assign a high value (the upper limit is 1) to real data, while assigning a low value (the lower limit is 0) to fake data from the generator.
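The two objectives in Eqs. (7) and (8) can be estimated on a batch of discriminator outputs, as in the following Python sketch with NumPy. It only evaluates the value function and is not a training loop. The function names and the example logits are hypothetical, and the clipping constant `eps` is an added numerical safeguard rather than part of the formulation above.

```python
import numpy as np

def sigmoid(a):
    """Sigmoid output unit, mapping discriminator logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def value_function(d_real, d_fake, eps=1e-7):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_objective(d_fake, eps=1e-7):
    """Batch estimate of E[log(1 - D(G(z)))], which the generator minimizes."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator logits for a batch of real and generated samples.
real_logits = np.array([2.0, 1.5, 3.0])
fake_logits = np.array([-2.0, -1.0, -2.5])
v_confident = value_function(sigmoid(real_logits), sigmoid(fake_logits))

# A discriminator that cannot tell the two apart outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
v_confused = value_function(np.full(3, 0.5), np.full(3, 0.5))
```

The discriminator pushes the value towards 0 by scoring real data near 1 and fake data near 0, while the generator lowers its objective by producing samples that the discriminator scores close to 1.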

Fig. 2. GAN Architecture.

4 Experiments and Results

The described methods were applied to three different data sets, namely MNIST [8], CIFAR10 [7] and CelebA [9]. To reduce the required computational effort, we resized the images to a maximum size of 72 × 72 pixels. Implementations were done with Python and Keras [10].
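The exact resizing procedure is not specified. One plausible reading, shown in the Python sketch below, is that images larger than the limit are downscaled with their aspect ratio preserved while smaller images are left unchanged. The helper name `fit_within` and this interpretation are assumptions; the sizes in the usage lines are the standard image sizes of the three datasets (28 × 28 for MNIST, 32 × 32 for CIFAR10, 178 × 218 for the aligned CelebA images).

```python
def fit_within(width, height, max_side=72):
    """Return (width, height) scaled down so that neither side exceeds max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))

mnist_size = fit_within(28, 28)      # (28, 28): already within the limit
cifar_size = fit_within(32, 32)      # (32, 32): already within the limit
celeba_size = fit_within(178, 218)   # (59, 72): longer side reduced to 72
```

Under this reading only the CelebA images are actually resized, which matches the motivation of reducing computational effort for the largest dataset.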

4.1 MNIST Dataset

As the MNIST data set has the least variance of the examined datasets, one expects the two models to generate realistic images. Therefore, this data set is the first one to be examined. Some example images, generated by a conditional VAE and an auxiliary GAN, can be seen in Fig. 3. Both models generate images which can easily be recognized as digits. While the GAN generates sharper images, the VAE tends to smooth the edges of the digits.

a. Samples generated by GAN b. Samples generated by VAE

Fig. 3. Comparison of sampled images of the two models based on the MNIST dataset.

4.2 CIFAR10 Dataset

The CIFAR10 data set has a great variety of subjects and camera angles. Therefore, one expects these images to be harder to generate than the previous examples. This is confirmed by the resulting images in Fig. 4. The images of the VAE are once again blurry and no realistic objects can be recognized. The GAN generates images with sharper edges; nevertheless, most of the generated objects cannot be uniquely identified.

a. Samples generated by GAN b. Samples generated by VAE

Fig. 4. Comparison of sampled images of the two models based on the CIFAR10

dataset.

4.3 CelebA Dataset

Lastly, the generative models were used to generate portrait images of humans using the CelebA dataset. Figure 5 shows example portraits from the data set and generated images. The GAN again produces much sharper images than the VAE. Nevertheless, the faces produced by the VAE have a more natural appearance. Apart from the blurry earth-colored background, some VAE-generated images resemble realistic faces. Figures 5a and 5b also demonstrate the conditioning on male and female persons.

a. Samples generated by GAN b. Samples generated by VAE

Fig. 5. Comparison of sampled images of the two models based on the CelebA dataset.

5 Conclusion

The main difference between the methods examined here is their learning process. VAEs minimize a loss for reproducing a given image and can, therefore, be considered a form of semi-supervised learning. GANs, on the other hand, are considered unsupervised learning because they do not use labeled pixels. The most important difference found in this work was the training time of the two methods: GANs mostly took a lot longer to train, both in number of epochs and in run time. Because the experiments were run on different hardware, no quantitative comparison is given here. Another advantage of VAEs is their stability. For GANs, highly oscillating image quality could be observed in the course of training: clear pictures turned into purely grey images within only a few epochs. VAEs, by comparison, proved a lot more stable. The problem with VAEs is that, with increasing diversity of the data set, the generated images become more and more blurry and a lot of detail gets lost (see the CIFAR10 results). With GANs this does not necessarily occur. We conclude that for low-diversity datasets like MNIST, both methods give sufficiently realistic images. For more complex data sets this was not the case in this work, but prior work like [8, 9] shows that it is possible to generate realistic images with both of the techniques used here.

Finally, using VAEs one can achieve results in less time, but with decreased image quality compared to the results of GANs.

### References

1. P. Goyal, Z. Hu, X. Liang, C. Wang, and E. Xing. Nonparametric Variational Autoencoders for Hierarchical Representation Learning. arXiv:1703.07027, 2017.

2. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

3. D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.

4. D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML, pages 1278-1286, 2014.

5. K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

6. A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737-1754, 2009.

7. A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

8. Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.

9. Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

10. F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

11. M. E. Abbasnejad, A. Dick, and A. van den Hengel. 2017 IEEE Conference on Computer Vision a, 2017.

12. R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485-5493, 2017.

13. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

14. C. Ledig, Z. Wang, W. Shi, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, and A. Tejani. Photo-realistic single image super-resolution using a generative adversarial network. 2016.

15. F. Zhao, J. Feng, J. Zhao, W. Yang, and S. Yan. Robust LSTM-autoencoders for face de-occlusion in the wild. IEEE Transactions on Image Processing, 27(2):778-790, 2018.

16. O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. Language generation with recurrent generative adversarial networks without pre-training. arXiv preprint arXiv:1706.01399, 2017.

17. L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852-2858, 2017.
