Sliced Score Matching: A Scalable Approach to Density and Score Estimation
TL;DR: We show how to use random projections to scale up score matching—a classic method to learn unnormalized probabilisic models—to high-dimensional data. Theoretically, sliced score matching produces a consistent and asymptotic normal estimator under some regularity conditions. We apply sliced score matching to training deep energy-based models, learning VAEs with implicit encoders and training Wasserstein Auto-Encoders (WAEs).
Unnormalized probability models
Let \(f_\theta(\mathbf{x})\) be a real-valued function on \(\mathbf{x} \in \mathbb{R}^D\). We can define a probability model from \(f_\theta(\mathbf{x})\), using the following formula
\[p_\theta(\mathbf{x}) = \dfrac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}.\]Here we assume that \(Z_\theta = \int e^{-f_\theta(\mathbf{x})} \text{d}\mathbf{x}\) exists and call it the partition function. Typically, it is intractable to evaluate the partition function, and \(Z_\theta\) is an unknown quantity. The probability model \(p_\theta(\mathbf{x})\) is called an unnormalized probability model, because the normalizing constant \(Z_\theta\) is unknown.
Learning unnormalized models with score matching
The de facto standard for learning a probability model is maximum likelihood (MLE). However, it is not suitable for learning unnormalized probability models, because the likelihood function
\[\log p_\theta(\mathbf{x}) = -f_\theta(\mathbf{x}) - \log Z_\theta\]depends on the intractable partition function \(Z_\theta\).
Score matching (Hyvärinen 2005) bypasses the intractable \(Z_\theta\) by only considering the scores, which is defined as the gradient of the log-density w.r.t. the random variable. For example, the score of \(p_\theta(\mathbf{x})\) is
\[\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = -\nabla_{\mathbf{x}} f_\theta(\mathbf{x}).\]Note that the score of an unnormalized probability model \(\nabla_{\mathbf{x}} \log p_\theta(\mathbf{x})\) does not depend on the intractable partition function \(Z_\theta\). Instead of using likelihood as the objective, score matching is based on the following objective
\[\frac{1}{2}\mathbb{E}_{p_\text{data}}[\|\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) \|_2^2],\]where \(p_\text{data}(\mathbf{x})\) is the data distribution. The above objective is widely known as the Fisher divergence. Since it only involves \(\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\) and does not require \(Z_\theta\), it is ideal for training unnormalized models.
However, Fisher divergence is not directly computable, because the score of the data distribution \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\) is unknown. Score matching eliminates the data score using integration by parts. To simplify our discussion, we consider the Fisher divergence between distributions of 1-D random variables. We have
\[\begin{aligned} &\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2] \\ =& \frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2 \text{d}x\\ =& \underbrace{\frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x))^2 \text{d}x}_{\text{const}} + \frac{1}{2} \int p_\text{data}(x) (\nabla_x \log p_\theta(x))^2 \text{d} x \\ &- \int p_\text{data}(x) \nabla_x \log p_\theta(x) \nabla_x \log p_\text{data}(x)\text{d}x. \end{aligned}\]By integration by parts, we have
\[\begin{aligned} &- \int p_\text{data}(x) \nabla_x \log p_\theta(x) \nabla_x \log p_\text{data}(x) \text{d}x\\ =& - \int \nabla_x \log p_\theta(x) \nabla_x p_\text{data}(x)\text{d} x\\ =& - p_\text{data}(x) \nabla_x \log p_\theta(x)\bigg|_{-\infty}^{\infty} + \int p_\text{data}(x) \nabla_x^2 \log p_\theta(x) \text{d} x\\ \stackrel{(i)}{=}& \mathbb{E}_{p_\text{data}}[\nabla_x^2 \log p_\theta(x)], \end{aligned}\]where \((i)\) holds if we assume \(p_\text{data}(x) \rightarrow 0\) when \(\vert x \vert \rightarrow 0\). Now, substituting the results of integration by parts into the 1-D Fisher divergence, we obtain
\[\begin{aligned} &\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\text{data}(x) - \nabla_x \log p_\theta(x))^2]\\ =& \mathbb{E}_{p_\text{data}}[\nabla_x^2 \log p_\theta(x)] + \frac{1}{2} \mathbb{E}_{p_\text{data}}[(\nabla_x \log p_\theta(x))^2] + \text{const}. \end{aligned}\]Therefore, the equivalent form of 1-D Fisher divergence does not involve \(\nabla_x \log p_\text{data}(x)\).
Generalizing the integration by parts argument to muti-dimensional data, we have the following objective equivalent to Fisher divergence (see (Hyvärinen 2005) for details of proof)
\[\mathbb{E}_{p_\text{data}}\bigg[\operatorname{tr}( \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})) + \frac{1}{2} \| \nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|_2^2 \bigg] + \text{const},\]where \(\nabla_\mathbf{x}^2\) denotes the Hessian with respect to \(\mathbf{x}\). This objective is known as the score matching objective. Since it only involves functions of \(\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\), it does not depend on the intractable partition function and therefore is ideal for learning unnormalized probability models.
Sliced score matching
So far, we know that score matching can be used to learn unnormalized models. We are particularly interested in learning deep energy-based models, a special kind of unnormalized models where \(f_\theta(\mathbf{x})\) is parameterized by a deep neural network. In order to use score matching for learning deep energy-based models, we have to compute \(\| \nabla_\mathbf{x} \log p_\theta(\mathbf{x})\|_2^2 = \| \nabla_\mathbf{x} f_\theta(\mathbf{x})\|^2_2\) and \(\operatorname{tr}( \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})) = -\operatorname{tr}( \nabla_{\mathbf{x}}^2 f_\theta(\mathbf{x}))\). The former can be computed by one simple backpropagation of \(f_\theta(\mathbf{x})\). The later, however, requires much more number of backpropagations to compute. As studied in (Martens, Sutskever, and Swersky 2012), computing \(\operatorname{tr}( \nabla_{\mathbf{x}}^2 f_\theta(\mathbf{x}))\) requires a number of backpropagation that is proportional to the data dimension \(D\). Therefore, score matching is not scalable when learning deep energy-based models on high-dimensional data.
We propose sliced score matching to greatly scale up the computation of score matching. The motivating idea is that one dimensional data distribution is much easier to estimate for score matching. We propose to project the scores onto random directions, such that the vector fields of scores of the data and model distribution become scalar fields. We then compare the scalar fields to determine how far the model distribution is from the data distribution. It is clear to see that the two vector fields are equivalent if and only if their scalar fields corresponding to projections onto all directions are the same.
Mathematically, we denote \(\mathbf{v}\) as the random projection direction, and \(p_\mathbf{v}\) as its distribution. The random projected version of Fisher divergence is
\[\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) )^2],\]which we name as the sliced Fisher divergence. Unfortunately, sliced Fisher divergence has the same problem as Fisher divergence, due to the unknown data score function \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\). We therefore play the same trick of integration by parts, as done in score matching, to obtain the following tractable alternative form
\[\mathbb{E}_{p_\text{data}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}))^2 \bigg] + \text{const},\]which is our sliced score matching objective. Again, \(\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x})\) can be computed by one backpropagation for deep energy-based models. The first term of the objective again involves Hessian, but it is in the form of Hessian-vector products, which can be computed within \(O(1)\) backpropagations. Therefore, the computation of sliced score matching does not depend on the dimension of data, and is much more scalable for training deep energy-based models on high dimensional datasets.
Theoretical guarantees of learning with sliced score matching
Suppose our data points \(\{ \mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}\) are i.i.d. samples from the data distribution \(p_\text{data}\). For each data point \(\mathbf{x}_i\), we randomly draw \(M\) random projection directions \(\{ \mathbf{v}_{i1}, \mathbf{v}_{i2}, \cdots, \mathbf{v}_{iM}\} \sim p_\mathbf{v}\). The sliced score matching objective can be estimated with empirical averages, giving rise to the following finite-sample estimator:
\[\frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \bigg[\mathbf{v}_{ij}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x}_i)\mathbf{v}_{ij} + \frac{1}{2} (\mathbf{v}_{ij}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}_i))^2 \bigg]\]We denote \(\hat{\theta}_{N,M}\) as the minimizer of the above empirical estimator, and let \(\theta^*\) denote the true parameter corresponding to the data distribution such that \(p_{\theta^*} = p_\text{data}\). We can prove that under some regularity conditions, \(\hat{\theta}_{N,M}\) is consistent and asymptotically normal. More rigorously, for any \(M \in \mathbb{N}^+\), when \(N \rightarrow \infty\), we have
\[\hat{\theta}_{N,M} \stackrel{p}{\rightarrow} \theta^*\]and
\[\sqrt{N}(\hat{\theta}_{N,M} - \theta^*) \stackrel{d}{\rightarrow} \mathcal{N}(0,\Sigma)\]where \(\Sigma\) is some covariance matrix.
Experiments
Sliced score mathing has numerous applications. For example,
- Density estimation. It can be used to scalably train deep energy-based models on high dimensional data.
- Score estimation. The sliced score matching objective can be used to estimate the score of any distribution from which samples can be efficiently obtained, which we call score estimation. It can be used to learn Variational Auto-Encoders (VAE) (Kingma and Welling 2014) where the encoder is an implicit distribution. It can also be used to learn Wasserstein Auto-Encoders (WAEs) (Tolstikhin et al. 2017)
- Score-based generative modeling (Song and Ermon 2019). It can be used to estimate scores of the data distribution and produce samples directly with Langevin dynamics.
References
- Hyvärinen, Aapo. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” Journal of Machine Learning Research 6 (Apr): 695–709.
- Martens, James, Ilya Sutskever, and Kevin Swersky. 2012. “Estimating the Hessian by Back-Propagating Curvature.” ArXiv Preprint ArXiv:1206.6464.
- Kingma, Diederik P, and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In International Conference on Learning Representations.
- Tolstikhin, Ilya, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. 2017. “Wasserstein Auto-Encoders.” ArXiv Preprint ArXiv:1711.01558.
- Song, Yang, and Stefano Ermon. 2019. “Generative Modeling by Estimating Gradients of the Data Distribution.” ArXiv Preprint ArXiv:1907.05600.