On the geometry of learning neural quantum states

Chae-Yeun Park

University of Cologne

CYP and M. J. Kastoryano, arXiv:1910.11163.

Machine learning and physics

Turing award in 2019 is given to 3 AI pioneers (Y. Lecun, G. Hinton, and Y. Bengio).

So far, physics and ML is interplayed for

• Using physics to investigate the behaviors of neural networks (e.g. statistical learning theory, look NN as Gaussian process).
• Using ML to solve a specific optimization problems in physics (detecting phase transitions, decoding QEC).
• Using physics problems to investigate how ML works.

1. Machine learning and quantum many-body physics

Generative Machine Learning

We want to model $p_{\rm data}(\pmb{x})$.
Choose an Ansatz based on neural network.
Restricted Boltzmann Machine:
$$p_{\pmb{\theta}}(\pmb{x}) = \sum_{\pmb{y}} e^{-E(\pmb{x},\pmb{y})}$$ where $\pmb{\theta} = [\pmb{a},\pmb{b},\pmb{W}]$.
Maximize the logarithmic likelihood (LL) \begin{align} \mathcal{L}(\pmb{\theta}) &= \sum_{\pmb{x}} p_{\rm data} (\pmb{x}) \log p_{\pmb{\theta}} (\pmb{x}) \\ &\approx \langle \log p_{\pmb{\theta}} (\pmb{x}) \rangle_{x\sim p_{\rm data}(x)} \end{align} Solve $\mathop{arg\,max}_{\pmb{\theta}} \mathcal{L}(\pmb{\theta})$ using stochastic gradient ascent.

Algorithm for generative model

1. Choose an Ansatz based on neural network.
2. For samples from data we want to model:
3.     $\pmb{\theta}_{t+1} = \pmb{\theta}_t + \eta \nabla_{\pmb{\theta}} \mathcal{L}(\pmb{\theta})$

In practice, several approximations are used as the gradient of logarithmic likelihood $\mathcal{L}$ is not directly computable (three typical generative models VAE, GAN, and normalizing flow use different techniques).

Variational quantum Monte-Carlo (VQMC)

We want to find the ground state of local Hamiltonians:

$$H = \sum_{\langle i,j \rangle } h_{i,j}$$ Examples (1D, 2-local)

Transverse field Ising model : $H = \sum_{i=1}^N Z_i Z_{i+1} + h X_i$
Heisenberg model: $H = \sum_{i=1}^N X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1}$

Find $E_g = \min_{\psi} \frac{\langle \psi | H | \psi \rangle }{\langle \psi | \psi \rangle}$.

Is it difficult?

Naive answer: Probably yes as the dimension of $H$ increase exponentially with $N$.

CS Answer: Probably yes as it is in QMA-Complete.

VQMC approach

First, we choose an Ansatz $\psi_{\pmb{\theta}}(\pmb{x})$ that describes the ground state.
Second, we iteratively minimize the energy using gradient descent type of algorithm.

Requirements:

1. # of parameters $\pmb{\theta}$ should be $poly(N)$.
2. Sampling from $|\psi_{\pmb{\theta}}(\pmb{x})|^2$ should be possible.
3. Calculating observable should be efficient.

Algorithm for VQMC

1. Choose an Ansatz for the ground state.
2. For samples from the model:
3.     $\pmb{\theta}_{t+1} = \pmb{\theta}_t - \eta \mathcal{F}^{-1} \nabla_{\pmb{\theta}} \langle H \rangle$

Comparison between ML (generative) and VQMC

Generative ML VQMC
What to represent? Data obtained from the real world (MNIST, CIFAR-10, CIFAR-100, ImageNet, etc.). GS of the Hamiltonian with a parameter.
Sample from? Data we already have (from our model when we generate). Always from the model (as we don't know the data).
Optimization? Stochastic gradient descent and its variations (RMSProp, Adam, etc.). Such a variant does not work well. We usually need full geometry information.

Three open problems

1. Expressibility: Which Ansatz describes the ground state accurately?
2. Sampling: Is it possible to sample from the Ansatz distribution efficiently?
3. Optimization: How can we optimize the parameters of the Ansatz well?

2. Information geometry in VQMC

RBM Ansatz

\begin{align} \psi_\theta(\pmb{x}) &= \sum_{\pmb{y}=\{\pm 1\}^N} e^{\pmb{a} \cdot \pmb{x} + \pmb{b} \cdot \pmb{y} + \langle \pmb{y}, W \pmb{x} \rangle } = e^{\pmb{a} \cdot \pmb{x}} \prod_{j} 2 \cosh [(W\pmb{x})_j+b_j] \end{align}

This Ansatz reported the state-of-the-art accuracy for TFI, Heisenberg model.

G. Carleo and M. Troyer, Science 355, 602 (2017).

Some properties of the RBM Ansatz

• For fixed $\alpha = |\pmb{h}|/|\pmb{x}|$, # of parameters is $\alpha N^2 + (\alpha+1)N$.
• Can sample from the distribution using Markov Chain Monte-Carlo (MCMC). Still, worst-case NP-Hard as it contains spin-glass problems.
• All parameters are now complex valued.
• When we restrict zero biases $\pmb{a}=\pmb{b}=0$, the system is $\mathbb{Z}_2$ symmetric.
• It is closed under local Clifford operations $\{X, Y, Z\}$ but not under the Hadamard operation (basis dependent).

Information geometry

For a probability distribution $p_\theta(x)$, the Fisher information is defined as $$F_{ij} = \sum_x p_\theta(x) \frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}.$$ This matrix defines a Riemannian metric that is invariant under sufficient statistics for exponential family. It can be also considered as infinitesimal form of KL divergence.

Quantum case

For quantum states, the Fubini-Study metric can be used to measure distance between two quantum states:
$D(\psi, \phi) = \mathrm{arccos}\sqrt{\frac{|\langle \psi | \phi \rangle |^2}{\langle \psi |\psi \rangle \langle \phi | \phi \rangle}}.$
In infinitesimal form, we have
$ds^2 \approx \left[ \frac{\langle \delta \psi|\delta \psi \rangle}{\langle \psi | \psi \rangle} - \frac{\langle \delta \psi | \psi \rangle \langle \psi | \delta \psi \rangle}{\langle \psi | \psi \rangle^2}\right]$.

For parametrized quantum states, we have
$\mathcal{F}_{ij} = \left\langle O_i(x)^* O_j(x) \right\rangle - \left\langle O_i(x)^* \right\rangle\left\langle O_j(x) \right\rangle$
where $\langle A(x) \rangle = \sum_x |\psi_\theta(x)|^2 A(x)$ and $O(x) = \frac{\partial \log \psi_\theta(x)}{\partial \theta_i}$.

$\mathcal{F}$ is the quantum Fisher information matrix and our main study object.

This matrix is widely used for optimization. $$\theta_{t+1} = \theta_t - \eta \mathcal{F}^{-1} \nabla{\langle H \rangle}$$ This optimization method is known as natural gradient descent (in ML) and stochastic reconfiguration (in VQMC).

ML community is also interested in this matrix as it is related to the optimization (directly) and generalization (indirectly).

Is it related to the ground state property?

Study using the transverse field Ising model

$$H = -\sum_i \sigma^i_z \sigma^{i+1}_z + h\sigma^i_x$$
• Ferromagnetic, close to a product state in $\sigma_z$ basis $|0\rangle^N$ and $|1\rangle^N$ when $h < 1.0$.
• Critical when $h=1.0$.
• Paramagnetic, close to a product state in $\sigma_x$ basis $|+\rangle^N$when $h > 1.0$.
$h = 0.0$
$h = 1.0$
$h = 2.0$
• Universal behavior at the beginning of the learning.
• In the ferromagnetic phase, the converged spectrum is singular.
• At the critical point, the converged spectrum is dense.
• In the paramagnetic phase, the converged spectrum is step-wise.

RBM representation of coherent thermal states for Ising model

For classical Hamiltonian $H(x)$, we define the coherent thermal state with the inverse temperature $\beta$ as $$|\psi \rangle = \sum_x \frac{e^{-\beta H(x)/2}}{\sqrt{Z}}|x\rangle.$$

An important property of this state is that any observable that is a function of $\sigma_z$ is the same to that of the classical thermal state. e.g. $$\langle \psi| \sigma^i_z \sigma^j_z | \psi \rangle = \frac{1}{Z} \sum_x e^{-\beta H(x)} x_i x_j$$

For two-dimensional Ising model, the coherent thermal state is used to show that PEPS can have polynomially decaying correlation. At $\beta = \beta_c$, the correlation decay polynomially but PEPS can exactly represent such a state.

Two dimensional Ising-RBM
 (a) Spectrum of the Fisher information matrix for different $\beta$ when $L=10$. (b) Rank of the Fisher information matrix and (c) the Fisher information density for different system size $L$ as a function of $\beta$. Shows steps regardless of $\beta$. In the paramagnetic ($\beta > \beta_c \approx 0.44$) phase, step is at $N(N+1)/2$. In the ferromagnetic ($\beta < \beta_c$) phase, rank decreases with $\beta$. Not dense when $\beta \approx \beta_c$ but the trace of the Fisher matrix signals the phase transition.

What we conjecture

• Singular spectrum may imply the ground state is close to a product state in computational basis.
• Step at $N(N+1)/2$ may imply the state is close to a randomly initialized RBM, which is $\approx | +\rangle^N$.
• Dense spectrum is may due to quantum criticality or due to the learning process.

Can weights be used to infer the ground state?

Not really. Many of these variables are redundant.

3. Implication to optimization

In our simulation, we use natural gradient descent (NGD) also known as stochastic reconfiguration (SR).

$$\theta_{t+1} = \theta_t - \eta (\mathcal{F} + \epsilon)^{-1} \nabla \langle H \rangle$$

However, calculating $\mathcal{F}^{-1}$ requires $O(D^{2 \sim 3})$. Never used in ML as it involves $10^5 \sim 10^8$ parameters (e.g. Vgg-19 has 138M parameters).

RMSProp

For each mini batch:
$v_{t+1} = \beta v_t +(1-\beta) \langle (\nabla_\theta f )^2 \rangle$
$w_{t+1} = w_t - \eta \langle \nabla_\theta f \rangle /\sqrt{v_{t+1} + \epsilon}$

It normalizes each component of the gradient based on its norm history.

How it works?

Work in higher dimension?

Yes for real world data but No for physics problems.

RMSProp as a diagonal approximation of the natural gradient

In usual ML, we want to optimize the cross entropy between the data, i.e.,
$\mathcal{L} \approx \left\langle \log p_\theta(x) \right\rangle_{x \sim p_{\rm data}(x)}$
then we have $\nabla_\theta f_\theta(x) = \frac{\partial \log p_\theta(x) }{\partial \theta_i}$.

Thus, $v_t$ is the running average of $\langle \frac{\partial \log p_\theta(x) }{\partial \theta_i}^2 \rangle$
which is the diagonal elements of the Fisher matrix.

Thus our update is close to $\theta_{t+1} = \theta_t - \eta (\mathcal{F}_{\rm diag} + \epsilon)^{-1/2} \nabla_\theta f$.

J. Martens, "New insights and perspectives on the natural gradient method", arXiv:1412.1193.

In our setting, $\nabla_\theta \langle H \rangle$ is used for gradient thus the square of it irrelevant to the diagonal of the Fisher information matrix. We may instead use to following:

For each mini batch:
$v_{t+1} = \beta v_t +(1-\beta) \langle (O(x) )^2 \rangle$
$w_{t+1} = w_t - \eta \langle \nabla_\theta H \rangle /\sqrt{v_{t+1} + \epsilon}$

where $O(x) = \frac{\partial \log \psi_\theta(x)}{\partial \theta_i}$.

Does it work better?

In general, we do not expect that RMSProp works well when the Fisher matrix is low-rank.

Why it works for ML?

In ML, it is typical that the rank of Fisher matrix $\gg$ number of samples. Long-range correlation in image data might be the reason.

Conclusion

• There are several differences between classical ML and ML-inspired VQMC.
• Information geometry contains information of ground states.
• Information geometry can be a reason why modern first-order optimizers do not work.