Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization
TL;DR: Compressed sensing techniques enable efficient acquisition and recovery of sparse, high-dimensional data signals via low-dimensional projections. In our AISTATS 2019 paper, we introduce uncertainty autoencoders (UAE) where we treat the low-dimensional projections as noisy latent representations of an autoencoder and directly learn both the acquisition (i.e., encoding) and amortized recovery (i.e., decoding) procedures via a tractable variational information maximization objective. Empirically, we obtain on average a 32% improvement over competing methods on the task of statistical compressed sensing of high-dimensional data.
The broad goal of unsupervised representation learning is to learn transformations of the input data which succinctly capture the statistics of an underlying data distribution. A plethora of learning objectives and algorithms have been proposed in prior work, motivated from the perspectives of latent variable generative modeling, dimensionality reduction, and others. In this post, we will describe a new framework for unsupervised representation learning inspired by compressed sensing. We begin with a primer on statistical compressed sensing.
Statistical Compressed Sensing
Systems that can efficiently acquire and accurately recover high-dimensional signals form the basis of compressed sensing. These systems enjoy widespread use: compressed sensing has been successfully applied to designing power-efficient single-pixel cameras and accelerating MRI scanning times for medical imaging, among many other applications.
A compressed sensing pipeline consists of two components:
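 Acquisition: A mapping $f: \mathbb{R}^n \to \mathbb{R}^m$ from high-dimensional signals $x \in \mathbb{R}^n$ to measurements $y \in \mathbb{R}^m$,
$$y = f(x) + \epsilon,$$
where $\epsilon$ is any external noise in the measurement process. The acquisition process is said to be efficient when $m \ll n$.
 Recovery: A mapping $g: \mathbb{R}^m \to \mathbb{R}^n$ from the measurements $y$ to the recovered data signals $\hat{x}$. Recovery is accurate if a normed loss, e.g., $\|\hat{x} - x\|_2$, is small.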
In standard compressed sensing, the acquisition mapping is typically linear in $x$ (i.e., $f(x) = Wx$ for some matrix $W \in \mathbb{R}^{m \times n}$). In such a case, the system is underdetermined since we have more variables ($n$) than constraints ($m$). To guarantee unique, nontrivial recovery, we assume the signals are sparse in an appropriate basis (e.g., Fourier basis for audio, wavelet basis for images). Thereafter, acquisition via certain classes of random matrices and recovery by solving a LASSO optimization problem guarantees unique recovery with high probability using only a few measurements (roughly logarithmic in the data dimensionality) (Candès and Tao 2005; Candès, Romberg, and Tao 2006; Donoho 2006).
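The classical pipeline described above can be sketched in a few lines of NumPy. This is a toy illustration rather than anything from the paper: the dimensions, the regularization weight `lam`, the helper names, and the choice of ISTA (iterative soft-thresholding) as the LASSO solver are all assumptions made for the example, and the signal is taken to be sparse in the standard basis.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(W, y, lam=0.01, n_iters=2000):
    """Minimize 0.5 * ||W x - y||_2^2 + lam * ||x||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    x = np.zeros(W.shape[1])
    for _ in range(n_iters):
        grad = W.T @ (W @ x - y)             # gradient of the least-squares term
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                          # signal dim, measurements, sparsity (toy values)
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.normal(size=k)          # k-sparse signal in the standard basis

W = rng.normal(size=(m, n)) / np.sqrt(m)      # random Gaussian acquisition matrix
y = W @ x_true                                # noiseless measurements with m << n

x_hat = ista_lasso(W, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative recovery error: {rel_err:.3f}")
```

Note that recovery here requires running an iterative solver for every new measurement vector, which is the cost that amortized recovery is meant to avoid.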
In this work, we consider the setting of statistical compressed sensing where we have access to a dataset $\mathcal{D}$ of training data signals. We assume that every signal $x \sim q_{\text{data}}$ for some unknown data distribution $q_{\text{data}}$. One way to think about acquisition and recovery in this setting is to consider a game between an agent and nature.
At training time:
 Nature shows the agent a finite dataset $\mathcal{D}$ of high-dimensional signals.
 Agent learns the acquisition and recovery mappings $f$ and $g$ by optimizing a suitable objective.
At test time:
 Nature shows the agent the compressed measurements $y = f(x) + \epsilon$ for one or more test signals $x \sim q_{\text{data}}$.
 Agent recovers the signal as $\hat{x} = g(y)$ and incurs an $\ell_2$-norm loss $\|x - \hat{x}\|_2$.
To play this game, the agent’s task is to choose the acquisition and recovery mappings $f$ and $g$ such that the test loss is minimized.
Uncertainty Autoencoders
In practice, there are two sources of uncertainty in recovering the signal $x$ from the measurements $y$ alone, even if the agent is allowed to pick an acquisition mapping $f$. One is due to the stochastic measurement noise $\epsilon$. Second, the acquisition mapping is typically parameterized by a restricted family of finite-precision mappings (e.g., linear mappings as in standard compressed sensing or more generally neural networks). Given that the dimensionality of the measurements $m$ is smaller than that of the signal $n$, such restrictions would prohibit learning a bijective mapping even in the absence of noise.
For the illustrative case where the mapping is linear, we established that exact recovery is not possible. Then what are some other ways to efficiently acquire data? In the figure below, we consider a toy setting where the true data distribution is an equally weighted mixture of two 2D Gaussians stretched along orthogonal directions. We sample 100 points (black) from this mixture and consider two methods to reduce the dimensionality of these points to one dimension.
One option is to project the data along directions that account most for the variability in the data using principal component analysis (PCA). For the 2D example above, this is shown via the blue points on the magenta line. This line captures a large fraction of the variance in the data but collapses data sampled from the bottom right Gaussian into a narrow region. When multiple datapoints are collapsed into overlapping, densely clustered regions in the low-dimensional space, disambiguating the association between the low-dimensional projections and the original datapoints is difficult during recovery.
Alternatively, we can consider the projections (red points) on the green axis. These projections are more spread out and suggest that recovery is easier, even though the projected space captures less total variance than the PCA direction, which by construction maximizes it. Next, we present the UAE framework which learns precisely the aforementioned low-dimensional projections that make recovery more accurate^{1}.
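The collapse of one mixture component under PCA can be reproduced numerically. The sketch below is not the figure from the post: the component scales, the zero means, the sample size, and the hand-picked diagonal direction `alt_dir` (a stand-in for the green axis, not a learned UAE direction) are all assumptions chosen to make the effect easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 500                                              # samples per component (toy value)
a = rng.normal(size=(n_per, 2)) * np.array([3.0, 0.2])   # component stretched along axis 0
b = rng.normal(size=(n_per, 2)) * np.array([0.2, 1.0])   # component stretched along axis 1
data = np.vstack([a, b])

# PCA direction: top eigenvector of the sample covariance (eigh sorts ascending).
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pca_dir = eigvecs[:, -1]

# Hand-picked alternative: the diagonal, which favors neither component.
alt_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)

def component_spread(direction):
    """Standard deviation of each component's 1D projections onto `direction`."""
    return (a @ direction).std(), (b @ direction).std()

std_a_pca, std_b_pca = component_spread(pca_dir)
std_a_alt, std_b_alt = component_spread(alt_dir)
print(f"PCA direction:  spread A = {std_a_pca:.2f}, spread B = {std_b_pca:.2f}")
print(f"diagonal:       spread A = {std_a_alt:.2f}, spread B = {std_b_alt:.2f}")
```

Under the PCA direction, the projections of the second component land in a narrow band, so distinct points become hard to tell apart once measurement noise is added; the diagonal keeps both components spread out.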
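Probabilistically, the joint distribution of the signal $x$ and measurements $y$ is given as $q_\phi(x, y) = q_{\text{data}}(x)\, q_\phi(y \mid x)$, where $\phi$ denotes the parameters of the acquisition mapping $f_\phi$. E.g., if we model the noise as centered isotropic Gaussian with variance $\sigma^2$, the likelihood can be expressed as $q_\phi(y \mid x) = \mathcal{N}(y; f_\phi(x), \sigma^2 I_m)$. To learn the parameters $\phi$ that best facilitate recovery in the presence of uncertainty, consider the following objective
$$\max_\phi \; \mathbb{E}_{q_\phi(x, y)}\left[\log q_\phi(x \mid y)\right].$$
The above objective maximizes the log-posterior probability of recovering $x$ from the measurements $y$, consistent with the agent’s goal at test time as mentioned above.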
Variational Information Maximization
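Alternatively, one can interpret the above as maximizing the mutual information between the signals $X$ and the measurements $Y$. To see the connection, note that the data entropy $H(X)$ is a constant and does not affect the optima. Hence, we can rewrite the objective as
$$\arg\max_\phi \; \mathbb{E}_{q_\phi(x, y)}\left[\log q_\phi(x \mid y)\right] = \arg\max_\phi \; -H_\phi(X \mid Y) = \arg\max_\phi \; H(X) - H_\phi(X \mid Y) = \arg\max_\phi \; I_\phi(X; Y).$$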
Evaluating (and optimizing) the mutual information is unfortunately nontrivial and intractable in the current setting. To get around this difficulty while also permitting fast recovery, we propose to use an amortized variational lower bound on mutual information due to Barber and Agakov (2004).
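In particular, we consider a parameterized, variational approximation $p_\theta(x \mid y)$ to the true posterior $q_\phi(x \mid y)$. Here, $\theta$ denotes the variational parameters. Substituting the variational distribution gives us the following lower bound to the original objective, which we maximize jointly over $\phi$ and $\theta$:
$$-H_\phi(X \mid Y) \;\geq\; \mathbb{E}_{q_\phi(x, y)}\left[\log p_\theta(x \mid y)\right].$$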
The above expression defines the learning objective for uncertainty autoencoders, where acquisition can be seen as encoding the data signals and the recovery corresponds to decoding the signals from the measurements.
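For readers who want the missing step, the bound follows from the non-negativity of the KL divergence. The short derivation below uses $q_\phi(x \mid y)$ for the true posterior and $p_\theta(x \mid y)$ for its variational approximation; it is a standard argument written out here for convenience, not a quotation from the paper.

$$
\begin{aligned}
-H_\phi(X \mid Y)
&= \mathbb{E}_{q_\phi(x, y)}\left[\log q_\phi(x \mid y)\right] \\
&= \mathbb{E}_{q_\phi(x, y)}\left[\log p_\theta(x \mid y)\right]
 + \mathbb{E}_{q_\phi(y)}\left[ D_{\mathrm{KL}}\big(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y)\big) \right] \\
&\geq \mathbb{E}_{q_\phi(x, y)}\left[\log p_\theta(x \mid y)\right].
\end{aligned}
$$

The gap is exactly the expected KL term, so the bound is tight when the variational distribution matches the true posterior.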
Example
In practice, the expectation in the UAE objective is evaluated via Monte Carlo: the data signal $x$ is sampled from the training dataset $\mathcal{D}$, and the measurements $y$ are sampled from an assumed noise model that permits reparameterization (e.g., isotropic Gaussian). Depending on the accuracy metric of interest for recovery, we can make a distributional assumption on the amortized variational distribution $p_\theta(x \mid y)$ (e.g., Gaussian with fixed variance for $\ell_2$, Laplacian for $\ell_1$) and map the measurements $y$ to the sufficient statistics of $p_\theta(x \mid y)$ via the recovery mapping $g_\theta$.
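As an illustration, consider an isotropic Gaussian noise model with known scalar variance $\sigma^2$. If we also let the variational distribution $p_\theta(x \mid y)$ be an isotropic Gaussian with mean $g_\theta(y)$ and fixed scalar variance, we obtain, up to an additive constant, the following objective maximized by an uncertainty autoencoder (UAE)
$$\max_{\phi, \theta} \; -c \sum_{x \in \mathcal{D}} \mathbb{E}_{y \sim \mathcal{N}(f_\phi(x), \sigma^2 I_m)}\left[ \|x - g_\theta(y)\|_2^2 \right]$$
for some positive normalization constant $c$ that is independent of $\phi$ and $\theta$.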
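A minimal NumPy sketch of this Gaussian special case follows. To keep it dependency-free, both the encoder and the decoder are assumed to be linear (matrices `W` and `V`), gradients are written out by hand, and the data, learning rate, and dimensions are toy values; the function name `uae_loss_and_grads` is hypothetical. The only ingredient taken from the text is the objective itself: a squared recovery error on measurements perturbed by reparameterized Gaussian noise.

```python
import numpy as np

def uae_loss_and_grads(W, V, x, sigma, rng):
    """One-sample Monte Carlo estimate of E||x - g(f(x) + eps)||^2 for linear f, g."""
    batch = x.shape[0]
    eps = rng.normal(size=(batch, W.shape[0]))    # standard normal draw (reparameterization)
    y = x @ W.T + sigma * eps                     # noisy measurements, shape (batch, m)
    resid = y @ V.T - x                           # recovery error, shape (batch, n)
    loss = np.mean(np.sum(resid ** 2, axis=1))
    grad_V = 2.0 * resid.T @ y / batch            # d loss / d V, shape (n, m)
    grad_W = 2.0 * (resid @ V).T @ x / batch      # d loss / d W, shape (m, n)
    return loss, grad_W, grad_V

rng = np.random.default_rng(0)
n, m, sigma, lr = 20, 5, 0.1, 0.005               # toy dimensions and step size
data = rng.normal(size=(512, 3)) @ rng.normal(size=(3, n))  # signals near a 3-dim subspace
data /= data.std()

W = 0.1 * rng.normal(size=(m, n))                 # acquisition (encoder) parameters
V = 0.1 * rng.normal(size=(n, m))                 # recovery (decoder) parameters
losses = []
for step in range(1000):
    batch = data[rng.choice(len(data), size=64, replace=False)]
    loss, grad_W, grad_V = uae_loss_and_grads(W, V, batch, sigma, rng)
    W -= lr * grad_W                              # joint gradient step on both mappings
    V -= lr * grad_V
    losses.append(loss)

print(f"first loss: {losses[0]:.2f}, mean of last 50: {np.mean(losses[-50:]):.2f}")
```

With neural network mappings one would typically replace the hand-written gradients with automatic differentiation; the structure of the loop stays the same.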
Comparison with commonly used autoencoders
Even beyond statistical compressive sensing, UAEs present an alternate framework for unsupervised representation learning where the compressed measurements can be interpreted as the latent representations. Below, we discuss how UAEs computationally differ and relate to commonly used autoencoders.
 Standard autoencoders (AE): In the absence of any noise in the latent space, the UAE learning objective reduces to that of an AE.
 Denoising autoencoders (DAE): A DAE (Vincent et al. 2008) adds noise in the observed space (i.e., to the data signals), whereas a UAE models the uncertainty in the latent space.
 Variational autoencoders (VAE): A VAE (Kingma and Welling 2014) regularizes the latent space to follow a prior distribution. There is no explicit prior in a UAE, and consequently no KL divergence regularization of the distribution over the latent space^{2}. This avoids pitfalls of representation learning with VAEs where the latent representations are ignored in the presence of powerful decoders (Chen et al. 2016).
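The computational difference between the first three model classes comes down to where noise is injected before decoding. The sketch below makes that explicit with a single hypothetical helper, `noisy_reconstruction`; the tied-weight linear encoder and decoder are stand-ins chosen for brevity. A VAE is left out because it additionally needs a prior and a KL term.

```python
import numpy as np

def noisy_reconstruction(x, encode, decode, rng, input_noise=0.0, latent_noise=0.0):
    """Encode then decode x, optionally adding Gaussian noise to the input or the latent code."""
    x_in = x + input_noise * rng.normal(size=x.shape)    # DAE: corrupt the observed signal
    z = encode(x_in)
    z = z + latent_noise * rng.normal(size=z.shape)      # UAE: noisy measurements / latents
    return decode(z)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8)) / np.sqrt(8)                 # hypothetical linear encoder (8 -> 3)
encode = lambda x: x @ W.T
decode = lambda z: z @ W                                 # hypothetical tied-weight decoder
x = rng.normal(size=(4, 8))

x_ae = noisy_reconstruction(x, encode, decode, rng)                       # standard AE
x_dae = noisy_reconstruction(x, encode, decode, rng, input_noise=0.1)     # denoising AE
x_uae = noisy_reconstruction(x, encode, decode, rng, latent_noise=0.1)    # uncertainty AE

def sq_loss(x_hat):
    """Squared recovery error, always measured against the clean signal x."""
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))

print(sq_loss(x_ae), sq_loss(x_dae), sq_loss(x_uae))
```

Setting `latent_noise=0.0` recovers the standard autoencoder path exactly, which mirrors the first bullet above.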
Does a UAE permit out-of-sample generalization, like a DAE or a VAE? Yes! Under suitable assumptions, we show that a UAE learns an implicit generative model of the data signal distribution and can be used to define a Markov chain Monte Carlo sampler. See Theorem 1 and Corollary 1 in the paper for more details.
Illustration of the Markov chain sampler based on UAE.
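As a rough sketch of what such a sampler could look like, the code below alternates between sampling measurements given the current signal and sampling a signal given those measurements. This is one reading of the construction, with Gaussian noise assumed in both steps; the stand-in mappings `f` and `g`, the noise scales, and the function name `uae_markov_chain` are hypothetical, and the conditions under which the chain targets the data distribution are given in the paper, not here.

```python
import numpy as np

def uae_markov_chain(x0, f, g, sigma_y, sigma_x, n_steps, rng):
    """Alternate y_t ~ N(f(x_t), sigma_y^2 I) and x_{t+1} ~ N(g(y_t), sigma_x^2 I)."""
    x = np.asarray(x0, dtype=float)
    chain = [x]
    for _ in range(n_steps):
        mean_y = f(x)
        y = mean_y + sigma_y * rng.normal(size=mean_y.shape)   # sample noisy measurements
        mean_x = g(y)
        x = mean_x + sigma_x * rng.normal(size=mean_x.shape)   # sample a recovered signal
        chain.append(x)
    return np.stack(chain)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6)) / np.sqrt(6)       # stand-in for a trained linear encoder
W_pinv = np.linalg.pinv(W)
f = lambda x: W @ x
g = lambda y: 0.5 * (W_pinv @ y)               # stand-in decoder, damped so the toy chain is stable

chain = uae_markov_chain(np.zeros(6), f, g, sigma_y=0.1, sigma_x=0.1, n_steps=100, rng=rng)
print(chain.shape)
```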
Overview of experimental results
We present some experimental results on statistical compressive sensing of image datasets below for varying numbers of measurements and random Gaussian noise. We compare against two baselines:
 LASSO in an appropriate sparsity-inducing basis
 CS-VAE/DCGAN, a recently proposed compressed sensing method that searches the latent space of pretrained generative models such as VAEs and GANs for the latent vectors that minimize the recovery loss (Bora et al. 2017).
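To make the second baseline concrete, the sketch below searches the latent space of a fixed generator for the code whose measurements best match the observed ones. It is a deliberately simplified stand-in: the "generator" is an assumed random linear map `G` rather than a pretrained VAE or GAN, so the search reduces to least squares solved by gradient descent, whereas the actual method optimizes through a nonlinear network. All names and dimensions are hypothetical.

```python
import numpy as np

def latent_search(A, G, y, n_iters=2000):
    """Gradient descent on z for 0.5 * ||A G z - y||^2 with a fixed linear 'generator' G."""
    M = A @ G                                   # measurements as a function of the latent code
    step = 1.0 / np.linalg.norm(M, 2) ** 2      # 1 / Lipschitz constant of the gradient
    z = np.zeros(G.shape[1])
    for _ in range(n_iters):
        z -= step * (M.T @ (M @ z - y))
    return z

rng = np.random.default_rng(0)
n, m, k = 50, 10, 3                             # signal dim, measurements, latent dim (toy values)
G = rng.normal(size=(n, k))                     # stand-in for a pretrained generator
A = rng.normal(size=(m, n)) / np.sqrt(m)        # random Gaussian measurement matrix
z_true = rng.normal(size=k)
x_true = G @ z_true                             # signal lies in the range of the generator
y = A @ x_true                                  # noiseless measurements

z_hat = latent_search(A, G, y)
x_hat = G @ z_hat
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative recovery error: {rel_err:.4f}")
```

Like LASSO, this baseline runs an optimization per test signal, in contrast to the single forward pass of an amortized recovery mapping.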
MNIST
Test reconstruction error (per image) for varying $m$
Reconstructions for m=25 measurements
CelebA
Test reconstruction error (per image) for varying $m$
Reconstructions for m=50 measurements
On average, we observe a 32% improvement across all datasets and measurements. For results on more datasets and tasks involving applications of UAE to transfer learning and supervised learning, check out our paper below!
Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization
Aditya Grover, Stefano Ermon
AISTATS, 2019.
Footnotes

1. We show in Theorem 2 in the paper that in the case of a Gaussian noise model, PCA is a special case of the information maximizing objective for a linear encoder and optimal (potentially nonlinear) decoder under suitable assumptions.

2. While not discussed in the original paper, the UAE objective can be seen as a special case of the $\beta$-VAE objective (Higgins et al. 2016) for $\beta = 0$.
References
 Candès, Emmanuel J, and Terence Tao. 2005. “Decoding by Linear Programming.” IEEE Transactions on Information Theory 51 (12): 4203–15.
 Candès, Emmanuel J, Justin Romberg, and Terence Tao. 2006. “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information.” IEEE Transactions on Information Theory 52 (2): 489–509.
 Donoho, David L. 2006. “Compressed Sensing.” IEEE Transactions on Information Theory 52 (4): 1289–1306.
 Barber, David, and Felix Agakov. 2004. “The IM Algorithm: A Variational Approach to Information Maximization.” Advances in Neural Information Processing Systems 16: 201.
 Vincent, Pascal, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. “Extracting and Composing Robust Features with Denoising Autoencoders.” In Proceedings of the 25th International Conference on Machine Learning, 1096–1103. ACM.
 Kingma, Diederik P, and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In International Conference on Learning Representations.
 Chen, Xi, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. “Variational Lossy Autoencoder.” ArXiv Preprint ArXiv:1611.02731.
 Bora, Ashish, Ajil Jalal, Eric Price, and Alexandros G Dimakis. 2017. “Compressed Sensing Using Generative Models.” In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 537–46. JMLR.org.
 Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2016. “Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.”