Serhii HavrylovPhD student at the University of Edinburgh
https://serhii-havrylov.github.io/
Mon, 29 Nov 2021 12:07:26 +0000
Jekyll v3.9.0
Emergence of Language with Multi-agent Games
<meta http-equiv="Refresh" content="0; url=https://medium.com/sap-machine-learning-research/emergence-of-language-with-multi-agent-games-learning-to-communicate-with-sequences-of-symbols-d4aff8474909" />
Thu, 15 Feb 2018 09:27:21 +0000
https://serhii-havrylov.github.io/blog/sap/
https://serhii-havrylov.github.io/blog/sap/
Emergence of Language with Multi-agent Games
<h1 id="intro">Intro</h1>
<p>Communication is by far one of the most impressive human abilities. Human civilisation has been able to accumulate an enormous amount of knowledge and pass it on to the next generations because we can understand and use language. The origin of language is a mystery that has captivated people’s minds for centuries, and the more general question of how communication arises has been studied for a long time as well. Yet, due to algorithmic and computational limitations, until recently almost all mathematical models of this miraculous process had to be restricted to simple, low-dimensional observation spaces. In the past few years, however, the deep learning community has shown considerable interest in this problem. In this post, we would like to present our contribution to the field.</p>
<h1 id="referential-game">Referential game</h1>
<p>One of the most basic challenges of using a language is to refer to things. Thus, it is not surprising that a <a href="http://onlinelibrary.wiley.com/book/10.1002/9780470693711">referential game</a> is a go-to setting in the learning-to-communicate field. Many extensions of the basic referential game are possible. In our case, we decided to proceed with the following setup:</p>
<p><a name="rg"></a><img src="/res/rg.png" alt="image-rg" class="align-image-center" /></p>
<!-- 1. There is a collection of images \\(\\{i_n\\}\_{n=1}^N\\) from which a target image \\(t\\) is sampled as well as \\(K\\) distracting images \\(\\{d_k\\}\_{k=1}^K\\).
2. There are two agents: a sender \\(S\_{\phi}\\) and a receiver \\(R\_{\theta}\\).
3. After seeing the target image \\( t \\), the sender has to come up with a message \\(m_t\\), which is represented by a sequence of symbols from the vocabulary \\( V \\) of a size \\( \|V\| \\). The maximum possible length of a sequence is \\( L \\).
4. Given the message \\(m\_t\\) and a set of images, which consists of distracting images and the target image, the goal of the receiver is to identify the target image correctly. -->
<ol>
<li>There is a collection of images from which a target image is sampled as well as \(K\) distracting images.</li>
<li>There are two agents: a sender and a receiver.</li>
<li>After seeing the target image, the sender has to come up with a message, represented by a sequence of symbols from a vocabulary of fixed size. The length of the sequence is bounded by a fixed maximum.</li>
<li>Given the generated message and a set of images that consists of distracting images and the target image, the goal of the receiver is to identify the target image correctly.</li>
</ol>
<p>Therefore, to succeed in this referential game, the sender has to choose the words carefully and arrange them in a sequence that makes it easy for the receiver to identify which image the sender was shown. Compared to the <a href="http://iopscience.iop.org/article/10.1088/1742-5468/2006/06/P06014">previous</a> <a href="https://arxiv.org/abs/1612.07182">work</a>, there is an essential difference: we use sequences rather than single symbols to generate messages. This makes our setting arguably more realistic and more challenging from the learning perspective.</p>
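<p>To make the four steps of the setup concrete, here is a minimal sketch of one round of the game in plain Python. Everything in it is an assumption made for illustration: images are just integer indices, and <code>toy_sender</code> and <code>toy_receiver</code> are hand-written stand-ins for the learned agents (the sender simply spells out the image index digit by digit). None of these names come from the paper.</p>

```python
import random

def sample_round(num_images, num_distractors, rng):
    """Sample a target and K distinct distractors, then shuffle the candidate set."""
    chosen = rng.sample(range(num_images), num_distractors + 1)
    target = chosen[0]
    candidates = chosen[:]
    rng.shuffle(candidates)  # the receiver must not learn the target from its position
    return target, candidates

def toy_sender(image_index, vocab_size=10, max_len=3):
    """Hypothetical sender: encode the index as a fixed-length sequence of symbols."""
    symbols = []
    for _ in range(max_len):
        symbols.append(image_index % vocab_size)
        image_index //= vocab_size
    return tuple(symbols)

def toy_receiver(message, candidates):
    """Hypothetical receiver: pick the candidate whose encoding matches the message."""
    for candidate in candidates:
        if toy_sender(candidate) == message:
            return candidate
    return candidates[0]

def play_round(sender, receiver, target, candidates):
    """Communication succeeds when the receiver identifies the target."""
    message = sender(target)
    return receiver(message, candidates) == target

rng = random.Random(0)
target, candidates = sample_round(num_images=1000, num_distractors=4, rng=rng)
success = play_round(toy_sender, toy_receiver, target, candidates)
```

<p>With a vocabulary of 10 symbols and messages of length 3, this toy code can name each of the 1000 images uniquely, so the round always succeeds. The learned agents have no such hand-crafted code and must discover a protocol on their own.</p>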
<h1 id="agents">Agents</h1>
<p>Both agents, the sender and the receiver, are implemented as <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">recurrent</a> neural networks, namely <a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735">long</a> short-term memory networks. This is one of the standard tools for generating and processing sequences. The figure below shows a sketch of the model. Solid arrows represent deterministic computations. Dashed arrows depict copying a previously obtained word. Lastly, diamond-shaped arrows denote sampling a word from the vocabulary.</p>
<p><img src="/res/SR.gif" alt="image-sr" class="align-image-center" /></p>
<p>Sampling is probably the most important and the most troublesome part of the model. It is crucial because this is where the sender decides what to say next. It is troublesome because it is stochastic. The ubiquitous backpropagation algorithm relies on every layer of the neural network being a continuous, differentiable function, but this architecture contains nondifferentiable sampling from a discrete probability distribution. So, we can’t use backpropagation right away. The visual system of the sender is implemented as a convolutional neural network (CNN). In our case, it is a pretrained network with the <a href="https://arxiv.org/abs/1409.1556">VGG-16</a> architecture. Images are represented by the outputs of the penultimate hidden layer of the CNN. It is a low-dimensional dense vector that summarises semantic information about a particular image. As you can see from the figure above, a message is obtained by sampling sequentially until the maximum possible length is reached or the special token “<em>end</em> <em>of</em> <em>a</em> <em>message</em>” is generated.</p>
<h2 id="learning">Learning</h2>
<p>It is relatively easy to train the receiver agent. It is end-to-end differentiable, so gradients of the loss function with respect to its parameters can be estimated efficiently. The real challenge is to train the sender agent. Its computational graph contains sampling, which makes it nondifferentiable. As a baseline, we implemented the <a href="https://link.springer.com/chapter/10.1007/978-1-4615-3618-5_2">REINFORCE</a> algorithm. This method provides a simple way of estimating gradients of the loss function with respect to the parameters of a stochastic policy. Even though it is unbiased, it usually has a huge variance, which slows down learning. Fortunately, last year two groups independently discovered a biased but low-variance estimator: the <a href="https://arxiv.org/abs/1611.01144">Gumbel</a>-Softmax <a href="https://arxiv.org/abs/1611.00712">estimator</a> (GS estimator). It allows us to replace the original discrete variable with a continuous relaxation. This makes everything differentiable, so the backpropagation algorithm can be applied. This topic is quite big and deserves its own post, so, if you are interested, I encourage you to read a <a href="https://blog.evjang.com/2016/11/tutorial-categorical-variational.html">blog post</a> from one of the authors of this method.</p>
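<p>As a rough illustration of the relaxation, the sketch below draws a Gumbel-Softmax sample using only the Python standard library. It is not the training code from the paper: the logits, the temperature and the function name <code>gumbel_softmax_sample</code> are assumptions chosen for this example, and no gradients are computed here.</p>

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [1.0, 0.5, -1.0]
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)

# The argmax of a relaxed sample follows the original categorical distribution
# (the Gumbel-max trick), so its frequency should approach softmax(logits).
num_draws = 20000
first_wins = 0
for _ in range(num_draws):
    draw = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
    if draw.index(max(draw)) == 0:
        first_wins += 1
argmax_freq = first_wins / num_draws
softmax_first = math.exp(logits[0]) / sum(math.exp(v) for v in logits)
```

<p>Lower temperatures push the sample towards a one-hot vector, and higher temperatures make it smoother. In an automatic-differentiation framework, the relaxed vector is what gradients flow through.</p>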
<h1 id="our-findings">Our findings</h1>
<p>The first thing we examined after training the model was the communication success rate. Communication between two agents is successful if the target image is identified correctly. As one can see from the figure below, the Gumbel-Softmax estimator (red and blue curves) is better than REINFORCE (yellow and green curves) in all cases except when the agents are allowed to communicate with only one word.
<img src="/res/comac.png" alt="image-comac" class="align-image-center-smaller" />
Probably, in this relatively simple situation, the variance of REINFORCE is not an issue and its unbiasedness pays off, whereas the bias of the GS estimator pulls it away from the optimal solution. This plot also agrees with intuition and clearly shows that by using more words one can describe an image more precisely. We also investigated how many interactions between the agents have to be performed to learn the communication protocol (left image in the figure below). Much to our surprise, the number of updates required to reach training convergence with the Gumbel-Softmax estimator (green curve) decreases when we let the sender use longer messages. This behaviour is slightly unintuitive, as one could expect that it is harder to learn the protocol when the search space of communication protocols is larger. In other words, using longer sequences helps to learn a communication protocol faster. However, this is not the case for the REINFORCE estimator (red curve): it usually needs five times as many updates to converge as the GS estimator, and there is no clear dependency between the number of updates needed to converge and the maximum possible length of a message.</p>
<div class="photo-center-images">
<img src="/res/nupdates.png" />
<img src="/res/perp.png" />
</div>
<p>We also plot the perplexity of the encoder (right image in the figure above), which arguably measures how many options the sender has to choose from at each time step while sampling from the probability distribution over the vocabulary. It is relatively high and increases with sentence length for the GS estimator (green curve), whereas for REINFORCE (red curve) the perplexity does not increase as rapidly. This implies redundancy in the encodings: there exist multiple paraphrases that encode the same semantic content.</p>
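<p>To see why perplexity can be read as “the number of options”, here is a small sketch that computes the perplexity of a single categorical distribution as the exponential of its entropy. This is the textbook definition and an assumption on my side. The exact averaging used for the plot is described in the paper.</p>

```python
import math

def perplexity(probs):
    """exp(entropy in nats) of a categorical distribution over the vocabulary."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])  # four equally likely words
deterministic = perplexity([1.0, 0.0, 0.0, 0.0])          # a single possible word
skewed = perplexity([0.7, 0.1, 0.1, 0.1])                 # somewhere in between
```

<p>A uniform choice among four words has perplexity 4, a deterministic choice has perplexity 1, and a skewed distribution lands in between.</p>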
<p>So, what does the learned language look like? To better understand its nature, we inspected a small subset of sentences produced by the model with the maximum possible message length equal to 5. First, we took a random photo of an object and generated a message. Then we iterated over the dataset and randomly selected images whose messages share prefixes of 1, 2 and 3 symbols with the generated message. For example, the first row of the left image shows some samples that correspond to the <strong>(5747 * * * *)</strong> code. Here “*” means any word from the vocabulary or end-of-sentence padding. Images in this subset depict animals.
<img src="/res/prot.png" alt="image-comac" class="align-image-center" /></p>
<p>On the other hand, it seems that images for the <strong>(* * * 5747 *)</strong> code do not correspond to any predefined category. This suggests that word order is crucial in the developed language. In particular, word 5747 in the first position encodes the presence of an animal in the image. The same figure shows that the message <strong>(5747 5747 7125 * *)</strong> corresponds to a particular species of bears. This suggests that the developed language implements some kind of hierarchical coding, even though the model was not explicitly constrained to use any hierarchical encoding scheme. Presumably, this can help the model describe unseen images efficiently. Nevertheless, natural language uses other principles to ensure compositionality. The model shows similar behaviour for images in the food domain (right image in the figure above) as well.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In our work, we have shown that agents modelled using neural networks can successfully invent a language that consists of sequences of discrete tokens. We also found that agents develop a communication protocol faster when we allow them to use longer sequences of symbols. We observed that the induced language implements a hierarchical encoding scheme and that there exist multiple paraphrases that encode the same semantic content. In future work, we would like to extend this approach to modelling practical dialogues. The <em>“game”</em> can be played between two agents rather than between an agent and a human, while human interpretability would be ensured by integrating a supervised loss into the learning objective. Hopefully, this will reduce the amount of necessary human supervision. You can find more information and the technical details of our research in this paper: <a href="https://papers.nips.cc/paper/6810-emergence-of-language-with-multi-agent-games-learning-to-communicate-with-sequences-of-symbols">Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols</a>.</p>
Thu, 15 Feb 2018 09:27:21 +0000
https://serhii-havrylov.github.io/blog/sap/
https://serhii-havrylov.github.io/blog/sap/
Notes on Controllable Text Generation
<p>Recently, I came across the <a href="https://arxiv.org/abs/1703.00955">Controllable Text Generation</a> paper. This paper aims at generating natural language sentences whose attributes are dynamically controlled by representations with designated semantics. I liked this paper a lot. If you haven’t read it yet, go ahead and do it now. Alright, assuming that you have just read it, are you as confused as I was regarding <em>“efficient collaborative learning of generator and discriminator”</em>, <em>“efficient mutual bootstrapping”</em> or <em>“explicit enforcement of the independence property”</em>? I believe that there is a more rigorous way to get the same loss function that is discussed in the paper. So, if you are wondering how to do that, you are welcome to read these notes.</p>
<h1 id="outline">Outline</h1>
<ul>
<li>Short story that I would like to use throughout these notes</li>
<li>Directed graphical model for text generation</li>
<li>A couple of words regarding variational autoencoder</li>
<li>A couple of thoughts regarding mutual information</li>
<li>Derivation of lower bound of mutual information</li>
<li>Discussion of some VAE issues</li>
<li>Some comments on Controllable Text Generation paper</li>
</ul>
<h1 id="short-story">Short story</h1>
<p>Let me start with a short story that I would like to use throughout this post. I remember how, during my school literature classes, the teacher always asked pupils about the main idea of a novel. “What message 📜 is he or she trying to communicate?” the teacher was constantly asking. We often needed to figure out the author’s most important points and provide the pieces of evidence contained in the text 📖. The teacher always encouraged us to come up with more than one theory of why the book had been written. Once in a while, someone gave a very unexpected interpretation of the story. In such situations, the teacher usually wasn’t very persuaded and replied that the author was a rather “crazy” person but not that “crazy,” or that he or she couldn’t have thought of X because at that time there wasn’t any X.
<a name="kl"></a><img src="/res/kl.png" alt="image-kl" class="align-image-center" />
Every now and then, we had to write an essay on a given topic. I can’t say that it was a very pleasant experience for me, but one time I immersed myself in the task and wrote a very solid essay. I thought it would be <em>a masterpiece</em>: the punctuation and spelling were correct, and the grammar was top notch. Even my teacher found it quite startling. The language was very fluent, and everything in that essay made a lot of sense. The only problem was that it was completely off topic. I got a bad mark that day. I remember I was a little bit disappointed about that. In hindsight, I can say I would not have been so upset if I had realized that during our literature classes we had been optimizing:</p>
<div class="notice--info">
<ul>
<li>a variational lower bound of marginal likelihood of the generative model for novels with latent themes</li>
<li>a variational lower bound of mutual information between novels and themes</li>
</ul>
</div>
<h1 id="graphical-model">Graphical model</h1>
<p><img src="/res/graph_model.png" alt="image-gm" height="200px" class="align-image-right" /> There are already enough well-written, high-quality explanations of variational autoencoders and variational inference in general. I am not aiming to give yet another explanation. Instead, I would like to use the next two sections to establish notation and to make sure that we are on the same page. If you are already familiar with VAEs, you can jump straight to the <a href="#bad_pupil">Mutual Information</a> section.</p>
<p>We consider a directed graphical model for generating a novel \(x\) 📖 with a latent theme \(z\) 📜. Let’s use the same model for generating a novel given a topic, \(p(📖\vert📜)\), as in the <a href="https://arxiv.org/abs/1511.06349">Generating Sentences from a Continuous Space</a> paper. More specifically, it is an <a href="https://arxiv.org/abs/1503.04069">LSTM</a> recurrent neural network where the first hidden state is initialized with \(z\): <a name="lstm"></a><img src="/res/lstm.png" alt="image-lstm" height="180px" class="align-image-center" /> Let’s pause for a second and think about what it means in terms of the short story. In the figure above, you can see the model of the pupil who wants to write a book \(x\) 📖 given a topic \(z\) 📜. The variable \(h_t\) corresponds to the state of mind after reading \(t\) words. A good pupil is one who assigns high probability to the written books (training data). No one really knows what the author wanted to say. That is why we have to consider all possible human thoughts (topics) to get the probability of a book, \(p(📖)=\int p(📖\vert📜)p(📜)d📜\). So, now the tough question arises: how do we evaluate the marginal likelihood of the data? Typically, there is no analytical expression for the integral, even for such a simple model as the one shown in the figure.</p>
<p class="notice--warning"><strong>Watch out!</strong> Evaluating marginal likelihood \(p(x)\) is hard.</p>
<p>Alright, if we can’t integrate it analytically, let’s estimate it using the Monte Carlo method. One possible estimator, and the most naïve one, is \(\frac{1}{K}\sum_{k=1}^{K}p(x\vert z_k)\), where \(z_k\) is a sample from \(p(z)\). Although it is unbiased, it can have very high variance, especially in models where most latent configurations can’t explain a given observation well. In terms of our story, it is the same as if the teacher randomly gave pupils a topic and asked them to evaluate the probability of the book given this topic. Of course, most of the provided topics will not have any relation to the book, so most of the time the probability is going to be hugely underestimated.</p>
<p class="notice--warning"><strong>Watch out!</strong> Naïve Monte Carlo estimator \(\frac{1}{K}\sum_{k=1}^{K}p(x\vert z_k)\) can have high variance.</p>
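<p>A toy example helps to see how bad the naïve estimator can be. The sketch below assumes a one-dimensional model that is not part of the original story: \(z \sim \mathcal{N}(0, 1)\) and \(x \vert z \sim \mathcal{N}(z, \sigma^2)\) with a small \(\sigma\). The marginal is then known in closed form, \(p(x)=\mathcal{N}(x; 0, 1+\sigma^2)\), which lets us compare the estimates with the exact value.</p>

```python
import math
import random

def normal_pdf(value, mean, std):
    """Density of N(mean, std^2) evaluated at value."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def naive_mc_marginal(x, noise_std, num_samples, rng):
    """Estimate p(x) as the average of p(x|z_k) with z_k drawn from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)              # a random "topic" from p(z)
        total += normal_pdf(x, z, noise_std)  # how well this topic explains x
    return total / num_samples

rng = random.Random(0)
x, noise_std = 2.0, 0.1
exact = normal_pdf(x, 0.0, math.sqrt(1.0 + noise_std ** 2))

# Repeat the K = 100 estimator many times to look at its spread.
estimates = [naive_mc_marginal(x, noise_std, 100, rng) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)
spread = math.sqrt(sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates))
```

<p>Averaged over many repetitions, the estimates agree with the exact value (the estimator is unbiased), but the standard deviation of a single \(K=100\) estimate is a large fraction of the quantity being estimated, because only the rare topics close to \(x\) contribute anything.</p>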
<p>But why does the teacher think that this kind of graphical model for books even makes sense in the first place? I believe that it does. Typically, the dimensionality of the variable \(z\) is much smaller than the dimensionality of the variable \(x\). If you can perform inference over the latent variable, \(p(z\vert x)\), then instead of learning a book by heart you can just remember the inferred \(z\) (a compressed book). When you need the details, you can use your generative model \(p(x\vert z)\) to reconstruct the book. This is very convenient: it frees up a lot of space in your head, so you can read even more. Also, if you choose an appropriate structure for \(z\), so that two values are easy to compare, you can compute the similarity between two books without going through both of them and comparing everything word by word. This approach is pretty handy. But the problem is the same as with the marginal likelihood: there is no analytical solution for \(p(z\vert x)\), and naïve Monte Carlo estimators will have high variance. So, as you can guess, the answer to this problem is the variational autoencoder.</p>
<p class="notice--warning"><strong>Watch out!</strong> Evaluating posterior distribution \(p(z \vert x)\) is hard.</p>
<h1 id="variational-autoencoder">Variational autoencoder</h1>
<p>A variational autoencoder is a latent variable model equipped with an inference network. As we already know, it is not straightforward to evaluate the marginal likelihood, so maximizing it directly is not easy either. Instead, we will use its lower bound.
\[p(x) = \int p(x\vert z)p(z)dz = \int p(x\vert z)p(z)\frac{q(z\vert x)}{q(z\vert x)}dz = \int q(z\vert x)p(x\vert z)\frac{p(z)}{q(z\vert x)}dz\]
Using <a href="https://en.wikipedia.org/wiki/Jensen's_inequality">Jensen’s inequality</a>:
\[\log{p(x)} \ge \int q(z\vert x)\log{\left(p(x\vert z)\frac{p(z)}{q(z\vert x)}\right)}dz = \int q(z\vert x)\log{p(x\vert z)}dz - \int q(z\vert x)\log{\frac{q(z\vert x)}{p(z)}}dz\]
Finally, we have:
\[\log{p(x)} \ge \mathbb{E}_{q(z\vert x)}[\log{p(x\vert z)}] -D_{KL}\left(q(z\vert x)\|p(z)\right) = \mathcal{L}(x)\]
Since \(\mathcal{L}(x)\) bounds \(\log{p(x)}\) from below, maximizing the lower bound also pushes up the marginal likelihood: whatever value the bound reaches, the marginal log-likelihood is at least as large. The lower bound contains the approximate posterior distribution \(q(z\vert x)\). It has this name because, as you have already guessed, it is an approximation of the true posterior. If you keep the parameters of \(p(x\vert z)\) fixed and optimize the lower bound with respect to \(q\), you are minimizing \(D_{KL}\left(q(z\vert x)\|p(z\vert x)\right)\). Commonly, \(q(z\vert x)\) belongs to a family of distributions that can be reparametrized, so the gradients of the first term of the lower bound can be efficiently estimated with Monte Carlo methods (<a href="http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/">pathwise derivatives</a>).</p>
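<p>The bound is easy to check numerically on a model where everything is Gaussian. The sketch below is a toy example under my own assumptions, not an actual VAE: \(p(z)=\mathcal{N}(0,1)\), \(p(x\vert z)=\mathcal{N}(z,1)\) and \(q(z\vert x)=\mathcal{N}(\mu,\sigma^2)\) with hand-picked \(\mu\) and \(\sigma\). In this model \(\log p(x)\) and the true posterior \(\mathcal{N}(x/2, 1/2)\) are known exactly.</p>

```python
import math
import random

def log_normal_pdf(value, mean, var):
    """Log-density of N(mean, var) evaluated at value."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (value - mean) ** 2 / var

def elbo_estimate(x, mu, sigma, num_samples, rng):
    """Monte Carlo reconstruction term minus the closed-form KL(q(z|x) || p(z))."""
    reconstruction = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                  # reparametrized sample from q(z|x)
        reconstruction += log_normal_pdf(x, z, 1.0)
    reconstruction /= num_samples
    kl = 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - math.log(sigma ** 2))
    return reconstruction - kl

rng = random.Random(0)
x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)          # exact marginal: x ~ N(0, 2)

# q equal to the prior: a valid but loose bound.
elbo_prior = elbo_estimate(x, mu=0.0, sigma=1.0, num_samples=20000, rng=rng)
# q equal to the true posterior N(x/2, 1/2): the bound becomes tight.
elbo_posterior = elbo_estimate(x, mu=x / 2.0, sigma=math.sqrt(0.5), num_samples=20000, rng=rng)
```

<p>The gap between \(\log p(x)\) and the bound is exactly \(D_{KL}\left(q(z\vert x)\|p(z\vert x)\right)\), so it vanishes (up to Monte Carlo noise) when \(q\) matches the true posterior and is about 0.4 nats when \(q\) is set to the prior.</p>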
<p>Surprisingly enough, the lower bound is exactly what pupils use to learn during literature classes.
<a name="expected_reconstruction"></a> <img src="/res/recons.png" alt="image-recons" class="align-image-center" />At first, the teacher asks you to read a book and try to guess what the author had in mind while writing it. Making an assumption about the theme is equivalent to sampling from \(q(📜\vert📖)\). Then, the teacher provides feedback to the students about how probable the guess is with respect to all possible human thoughts \(p(📜)\). At the same time, the teacher encourages high entropy of \(q(📜\vert📖)\) by demanding several different interpretations from the class. To cut a long story short, the teacher evaluates the second term of the lower bound, \(D_{KL}(q(📜\vert📖)\|p(📜))\).
In terms of our story, the reconstruction term corresponds to the process depicted in the image above. At first, you have to read a book and make your guess about the theme of the novel. Then, you have to write the whole book back using your guess. If you can provide a reasonable variety of guesses to make your teacher happy</p>
<p><em>small</em> \(D_{KL}(q(📜\vert📖)\|p(📜))\)</p>
<p>and you can write reasonable reconstruction of the text given your guess</p>
<p><em>big</em> \(\mathbb{E}_{q(📜\vert📖)}[\log{p(📖\vert📜)}]\)</p>
<p>you will have a good lower bound and after all, you will be a good pupil, won’t you?</p>
<h1 id="-mutual-information"><a name="bad_pupil"></a> Mutual Information</h1>
<p>If you think that a high evidence lower bound is enough to be a good pupil, you are missing one important thing. Unfortunately, so was I at school.
<img src="/res/MI.png" alt="image-mi" class="align-image-center" />
The key thing is writing an essay on a given topic. At first, the teacher has to give you a sample from \(p(📜)\). Then, you have to write an essay (a sample from \(p(📖\vert📜)\)). Assuming that the teacher is a good teacher, he or she knows \(p(📜\vert📖)\), which measures how well your text corresponds to the topic. Hence, the mark for the homework will be proportional to this probability. Does this process remind you of something? It reminds me of the evaluation of mutual information:
\[I(x, z)=D_{KL}(p(x, z)\|p(x)p(z))=\int_z\int_x p(x,z) \log{\frac{p(x,z)}{p(x)p(z)}}dxdz\]
As you can see from the definition, the mutual information measures the dependence between two variables. They are independent if and only if \(I(x,z)=0\). The definition can be rewritten as follows:</p>
\[\begin{aligned}
I(x, z)&=\int_z\int_x p(x,z) \log{\frac{p(x,z)}{p(x)}}dxdz - \int_z\int_x p(x,z) \log{p(z)}dxdz \\
&=\int_z\int_x p(x\vert z)p(z) \log{p(z\vert x)}dxdz - \int_z p(z) \log{p(z)}dz \\
&=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{p(z\vert x)}\right] + H(z)
\end{aligned}\]
<p>The first term corresponds exactly to the described process of essay writing and evaluation. It measures how well you can preserve the topic in your text. Unfortunately, mutual information requires computing the intractable \(p(z\vert x)\).</p>
<p class="notice--warning"><strong>Watch out!</strong> Evaluating mutual information \(I(x, z)\) is hard.</p>
<h1 id="variational-lower-bound-for-mutual-information">Variational lower bound for mutual information</h1>
<p>As you can see, to be a really good pupil, you also have to have high mutual information \(I(📖,📜)\). But a pupil does not have access to the true posterior \(p(📜\vert📖)\). Is he or she doomed to fail? Of course not. He or she can use the approximate posterior \(q(📜\vert📖)\) to evaluate an approximation to the true mutual information. More specifically:</p>
\[\begin{aligned}
I(x, z) - H(z) &=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{p(z\vert x)}\right] \\
&=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{\left(p(z\vert x)\frac{q(z\vert x)}{q(z\vert x)}\right)}\right] \\
&=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{q(z\vert x)}\right] + \mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{\frac{p(z\vert x)}{q(z\vert x)}}\right] \\
&=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{q(z\vert x)}\right] + \mathbb{E}_{p(x)}\left[\mathbb{E}_{p(z\vert x)}\log{\frac{p(z\vert x)}{q(z\vert x)}}\right] \\
&=\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{q(z\vert x)}\right] + \mathbb{E}_{p(x)}\left[D_{KL}\left(p(z\vert x)\|q(z\vert x)\right)\right]
\end{aligned}\]
<p>Using the non-negativity property of KL-divergence, we can derive:</p>
\[I(x, z)\ge\mathbb{E}_{p(z)}\left[\mathbb{E}_{p(x\vert z)}\log{q(z\vert x)}\right] + H(z)\]
<p>This result seems neat to me. We managed to boil down the essay-writing task to its core – evaluating mutual information. Also, there is a lower bound of it that is relatively cheap to evaluate. Unfortunately, a problem still remains. If we want to learn the model, we have to estimate the gradients of the mutual information lower bound. The root of the problem is the discrete nature of \(x\). We can’t use pathwise derivatives because \(p(x\vert z)\) does not belong to any reparametrizable family. The <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">score function estimator</a> is applicable, but it can have huge variance, especially in the case of language, and that makes learning impractical. Fortunately, thanks to <a href="https://arxiv.org/abs/1611.00712">recent</a> <a href="https://arxiv.org/abs/1611.01144">developments</a> in deep learning, we can use low-variance biased gradient estimators that work incredibly well in practice. For example, I used the straight-through Gumbel-Softmax estimator in our recent work (<a href="https://openreview.net/pdf?id=SkaxnKEYg">Havrylov &amp; Titov 2017</a>) to learn a language between two cooperative agents. By the way, it is not a coincidence that the loss function for communication looks very much like the variational lower bound of mutual information. After all, language is by definition something that we use to communicate information about the external world.</p>
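<p>For discrete variables the bound can be verified exactly, without any sampling. The sketch below assumes a tiny hypothetical model with two topics and two possible texts. The probability tables are made up for illustration. It compares the exact mutual information with the lower bound for two choices of \(q(z\vert x)\): the true posterior and a cruder guess.</p>

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def marginal_x(p_z, p_x_given_z):
    """p(x) = sum_z p(z) p(x|z)."""
    num_x = len(p_x_given_z[0])
    return [sum(p_z[z] * p_x_given_z[z][x] for z in range(len(p_z))) for x in range(num_x)]

def mutual_information(p_z, p_x_given_z):
    """Exact I(x, z) = sum_{x,z} p(x,z) log(p(x|z) / p(x))."""
    p_x = marginal_x(p_z, p_x_given_z)
    total = 0.0
    for z, pz in enumerate(p_z):
        for x, pxz in enumerate(p_x_given_z[z]):
            if pz * pxz > 0.0:
                total += pz * pxz * math.log(pxz / p_x[x])
    return total

def mi_lower_bound(p_z, p_x_given_z, q_z_given_x):
    """E_{p(z)} E_{p(x|z)} log q(z|x) + H(z), with q indexed as q_z_given_x[x][z]."""
    expected_log_q = 0.0
    for z, pz in enumerate(p_z):
        for x, pxz in enumerate(p_x_given_z[z]):
            if pz * pxz > 0.0:
                expected_log_q += pz * pxz * math.log(q_z_given_x[x][z])
    return expected_log_q + entropy(p_z)

p_z = [0.5, 0.5]                              # prior over two topics
p_x_given_z = [[0.9, 0.1], [0.2, 0.8]]        # p(x|z), rows indexed by z
p_x = marginal_x(p_z, p_x_given_z)
true_posterior = [[p_z[z] * p_x_given_z[z][x] / p_x[x] for z in range(2)] for x in range(2)]
crude_q = [[0.7, 0.3], [0.3, 0.7]]            # a rougher guess of p(z|x)

exact_mi = mutual_information(p_z, p_x_given_z)
bound_with_true_posterior = mi_lower_bound(p_z, p_x_given_z, true_posterior)
bound_with_crude_q = mi_lower_bound(p_z, p_x_given_z, crude_q)
```

<p>With the true posterior the bound coincides with the mutual information, and with the cruder \(q\) it falls below it by the expected KL divergence from the derivation above.</p>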
<h1 id="vae-issues">VAE issues</h1>
<p>Now we have all the building blocks for a more rigorous explanation of the proposed model for controllable text generation. But before we dive into that, let me point out several important issues that arise when using a VAE.</p>
<h2 id="mean-field-approximation">Mean field approximation</h2>
<p>Typically, when using a VAE, we make strong assumptions about the posterior distribution, for instance that it is approximately factorial (the mean field approximation). Within the VAE framework, the generator and the inference network are trained jointly, so they are encouraged to learn representations where these assumptions are satisfied (<a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders</a>, <a href="https://arxiv.org/abs/1602.06725">Variational inference for Monte Carlo objectives</a>, <a href="https://arxiv.org/abs/1606.04934">Improving Variational Inference with Inverse Autoregressive Flow</a>). In other words, training a powerful model with an insufficiently expressive variational posterior can cause the model to use only a small fraction of its capacity.</p>
<p class="notice--info"><a name="vae_feature"></a>
Using a factorial variational posterior in a VAE may make the true posterior follow this constraint as well.</p>
<h2 id="kl-weight-annealing-hack">KL weight annealing hack</h2>
<p>Another important feature of the learning dynamics is that, at the start of training, the generator model \(p(x\vert z)\) is weak. That makes the state where \(q(z\vert x) \approx p(z) \) the most attractive one. In this state, the inference network gradients have a relatively low signal-to-noise ratio, resulting in a stable equilibrium from which it is difficult to escape. The solution proposed in the <a href="https://arxiv.org/abs/1511.06349">Generating Sentences from a Continuous Space</a> paper is to anneal the weight of the \(D_{KL}(q(z\vert x)\|p(z))\) term, slowly increasing it from 0 to 1 over parameter updates. To be honest, I am not a big fan of this solution. It looks like a hack to me. I think we can do better.</p>
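<p>For reference, the annealing trick amounts to a couple of lines. The sketch below assumes a linear ramp, and <code>warmup_steps</code> is a hypothetical hyperparameter introduced only for this example. The shape of the schedule used in the cited paper may differ.</p>

```python
def kl_weight(step, warmup_steps):
    """Weight of the KL term: ramps linearly from 0 to 1, then stays at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))

def annealed_lower_bound(reconstruction, kl, step, warmup_steps):
    """Reconstruction term minus the down-weighted KL term."""
    return reconstruction - kl_weight(step, warmup_steps) * kl

start = kl_weight(0, 100)       # KL term switched off at the first update
halfway = kl_weight(50, 100)    # half weight in the middle of the warm-up
finished = kl_weight(200, 100)  # the usual lower bound after the warm-up
objective = annealed_lower_bound(reconstruction=-10.0, kl=4.0, step=50, warmup_steps=100)
```

<p>Note that while the weight is below 1, the objective is no longer a lower bound on \(\log{p(x)}\), which is one reason to see it as a hack.</p>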
<h2 id="how-to-deal-with-powerful-pxvert-z">How to deal with powerful \(p(x\vert z)\)</h2>
<p>Alright, let’s say that we are now using this annealing trick. Did we manage to avoid the trivial \(q(z\vert x) \approx p(z)\) state? Not at all. When you have a very powerful model for the likelihood \(p(x\vert z)\), you will encounter this trivial solution most of the time. Let’s illustrate it with my story. It could seem that the <a href="#lstm">proposed model</a> for the student is very simple, but in fact an LSTM RNN is a very powerful model that can fit sequential data pretty well. Let’s say that the pupil is in the phase of optimizing the lower bound of the marginal log-likelihood. It consists of the <a href="#expected_reconstruction">expected reconstruction</a> term and the <a href="#kl">KL</a> term. To optimize the expected reconstruction term, the pupil has to read a book, make a guess about the theme, and then maximize the probability of reconstructing the book using this theme. It could seem that reconstructing the whole book is impossible. But, actually, this process is not so complicated, because at each time step you have access to the whole context that precedes the current word. The only thing that you have to predict at each time step is just one word. It is not surprising that the pupil can get carried away by this process and completely forget about the theme of the book. In the end, he or she can still write a decent essay, but it will be completely off topic (this is exactly what happened to me). To fight this problem, <a href="https://arxiv.org/abs/1511.06349">Bowman et al.</a> proposed to make the generative model weaker by removing some or all of the conditioning words during learning (taking the book away from the pupil during the reconstruction phase). But this solution also looks like a hack to me. I believe that using the mutual information in the loss function can solve this problem in a more principled way, without making the generative model less expressive.</p>
<p class="notice--info">Using the variational lower bound of \(I(x, z)\) can solve the problem with trivial solution \(q(z\vert x) \approx p(z)\) for powerful likelihood \(p(x\vert z)\) in a more principled way.</p>
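<p>For comparison, the “taking the book away” remedy mentioned above can be sketched as random replacement of conditioning words with a placeholder. The function name, the <code>&lt;unk&gt;</code> token and the drop probability below are assumptions for illustration, not details taken from the cited paper.</p>

```python
import random

def word_dropout(tokens, drop_prob, rng, unk_token="<unk>"):
    """Replace each conditioning token with unk_token with probability drop_prob."""
    return [unk_token if rng.random() < drop_prob else token for token in tokens]

rng = random.Random(0)
sentence = ["the", "bear", "sleeps", "in", "the", "forest"]

untouched = word_dropout(sentence, 0.0, rng)     # the pupil keeps the whole book
partly_hidden = word_dropout(sentence, 0.5, rng) # roughly every second word is hidden
fully_hidden = word_dropout(sentence, 1.0, rng)  # only the theme z is left to rely on
```

<p>The more words are hidden, the more the decoder has to rely on \(z\), but the price is a weaker generative model, which is exactly what the mutual information term tries to avoid.</p>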
<h1 id="controllable-text-generation">Controllable text generation</h1>
<p>Finally, we can get into the <a href="https://arxiv.org/abs/1703.00955">Controllable Text Generation</a> paper. As you may already know, the paper aims to generate natural language sentences whose attributes are dynamically controlled by representations with designated semantics. The authors propose a <em>“neural generative model which combines variational auto-encoders and holistic attribute discriminators for effective imposition of semantic structures.”</em> I hope that after reading this post you can clearly see that a <em>“holistic attribute discriminator”</em> is nothing more than an approximate posterior inference network, and that <em>“collaborative learning of generator and discriminators”</em> is just optimization of the variational lower bound of mutual information. Even though both descriptions are a mouthful, I am convinced that the second one is more mathematically rigorous and hence more useful.</p>
<p>One thing that I am still confused about is the claim: <em>“proposed deep text generative model improves model interpretability by explicitly enforcing
the constraints on independent attribute controls;”</em>. The authors are saying that optimizing the loss in equation 7 explicitly imposes an independence constraint on the posterior distributions. But if you look closely, you will see that it is just the variational lower bound of the mutual information between the sentence \(x\) and the latent code \(z\). I am not convinced that the two posteriors \(p(z\vert x)\) and \(p(c\vert x)\) will be independent if you maximize the mutual information \(I(x,z)\) and \(I(x,c)\). Using the definitions:
\[I(z,c\vert x)=H(x,z) + H(x, c) - H(x,z,c)-H(x)\]
\[H(x,z) = -I(x,z) + H(x) + H(z)\]</p>
<p>one can express the conditional mutual information through the mutual information:</p>
\[\begin{aligned}
I(z,c\vert x)=-I(x,z) -I(x,c) + H(x) + H(z) + H(c) - H(x,z,c) \\
=-I(x,z) -I(x,c) + H(x) - H(x\vert z,c)
\end{aligned}\]
<p>The last step uses the chain rule \(H(x,z,c) = H(z,c) + H(x\vert z,c)\) together with \(H(z,c) = H(z) + H(c)\), which holds when \(z\) and \(c\) are independent under the prior. Independence of the posteriors is equivalent to \(I(z,c\vert x)=0\). To tell the truth, I can’t see how maximizing the mutual information \(I(x,z)\) and \(I(x,c)\) will make \(I(z,c\vert x)\) smaller. By doing so you do lower the first two terms in the equation, but you also increase the combined value of the last two. If you think I am missing something, please let me know. But for the time being, I think that saying that the loss in equation 7 of the paper explicitly encourages independence is not technically correct. Moreover, I am not sure that the column names in Table 2 are correct either. I would like to remind you of the VAE <a href="#vae_feature">property</a> that if you make independence assumptions in your approximate posterior network, the true posterior will probably follow this constraint as well. It means that it is highly likely that both models, with and without mutual information regularization, have almost independent posteriors.</p>
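<p>If you would like to check the algebra above numerically, here is a small sketch in plain Python. It draws an arbitrary discrete joint distribution over \((x, z, c)\) and compares both sides of the identity. All names (<code>entropy</code>, <code>random_joint</code>, the variable sizes) are made up for this illustration, and the joint is random, so \(z\) and \(c\) are in general not independent here.</p>

```python
import itertools
import math
import random


def entropy(joint, axes):
    """Shannon entropy (in nats) of the marginal of `joint` over `axes`."""
    marginal = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[a] for a in axes)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)


def random_joint(rng, sizes=(3, 2, 2)):
    """Arbitrary strictly positive joint distribution over (x, z, c)."""
    outcomes = list(itertools.product(*(range(n) for n in sizes)))
    weights = [rng.random() + 1e-3 for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


X, Z, C = 0, 1, 2  # positions of x, z and c inside an outcome tuple
joint = random_joint(random.Random(0))


def H(*axes):
    return entropy(joint, axes)


I_xz = H(X) + H(Z) - H(X, Z)
I_xc = H(X) + H(C) - H(X, C)
I_zc = H(Z) + H(C) - H(Z, C)

# Definition of the conditional mutual information used in the post.
I_zc_given_x = H(X, Z) + H(X, C) - H(X, Z, C) - H(X)

# First line of the derivation: holds for any joint distribution.
first_line = -I_xz - I_xc + H(X) + H(Z) + H(C) - H(X, Z, C)

# Second line: H(x | z, c) = H(x, z, c) - H(z, c).
second_line = -I_xz - I_xc + H(X) - (H(X, Z, C) - H(Z, C))

print("I(z,c|x)                 =", I_zc_given_x)
print("first line               =", first_line)
print("first line - second line =", first_line - second_line)
print("I(z,c)                   =", I_zc)
```

<p>The first line matches the definition for any joint distribution. The two lines of the derivation differ by exactly \(I(z,c)\), so the second one is exact only when \(z\) and \(c\) are marginally independent.</p>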
<p>Another thing that totally confused me was: <em>“To avoid vanishingly small KL term in the VAE we use a KL term weight linearly annealing from
0 to 1 during training.”</em> Why should you do that? If you have a variational lower bound of the mutual information in your loss, it already solves this problem in a more principled way. Why do you need the annealing hack on top of it?</p>
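<p>For reference, a linear schedule of this kind could look roughly like the sketch below. This is my own guess at a minimal version, not the code from the paper: <code>kl_weight</code>, <code>annealed_loss</code> and <code>warmup_steps</code> are hypothetical names, and the loss is written as a negative lower bound to be minimized.</p>

```python
def kl_weight(step, warmup_steps):
    """Linear KL weight: 0 at step 0, growing to 1 at `warmup_steps`, then constant."""
    if warmup_steps <= 0:
        return 1.0  # no warm-up requested: use the full KL term from the start
    return min(1.0, max(0.0, step / warmup_steps))


def annealed_loss(reconstruction_nll, kl, step, warmup_steps):
    """Negative lower bound with the KL term scaled by the annealing weight."""
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl


# The KL penalty is switched on gradually over the first 100 steps.
for step in (0, 25, 50, 100, 200):
    print(step, kl_weight(step, warmup_steps=100))
```

<p>Early in training the weight is close to zero, so the model is barely penalized for moving \(q(z\vert x)\) away from the prior. That is the whole trick.</p>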
<h1 id="conclusion">Conclusion</h1>
<p>In conclusion, I would like to say one more time that I really liked the <a href="https://arxiv.org/abs/1703.00955">Controllable Text Generation</a> paper, even though I am a little bit disappointed that the authors have chosen what is, in my opinion, not the best way to explain the proposed model. If you think that I am missing something, feel free to let me know in the comments. I would be more than happy to discuss it.</p>
<p>But for now, I would like to conclude with this quote:</p>
<p class="notice--primary"><em>“…I thought my symbols were just as good, if not better, than the regular
symbols - it doesn’t make any difference what symbols you use - but I
discovered later that it does make a difference. Once when I was explaining
something to another kid in high school, without thinking I started to make
these symbols, and he said, “What the hell are those?” I realized then that
if I’m going to talk to anybody else, I’ll have to use the standard symbols,
so I eventually gave up my own symbols…”</em></p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I would like to thank <a href="https://twitter.com/iatitov">Ivan Titov</a> and <a href="https://twitter.com/wilkeraziz">Wilker Aziz</a> for interesting discussions. Special thanks to Liza Smirnova for providing illustrations and proofreading the post.</p>
Tue, 14 Mar 2017 22:27:21 +0000
https://serhii-havrylov.github.io/blog/mutual_info/