University of Cologne
CYP and M. J. Kastoryano, arXiv:1910.11163.
The Turing Award in 2019 was given to three AI pioneers (Y. LeCun, G. Hinton, and Y. Bengio).
So far, physics and ML have interplayed in several ways.
Algorithm for generative model
1. Choose an Ansatz based on a neural network.
2. For samples from the data we want to model:
3. $\pmb{\theta}_{t+1} = \pmb{\theta}_t + \eta \nabla_{\pmb{\theta}} \mathcal{L}(\pmb{\theta})$
In practice, several approximations are used, as the gradient of the log-likelihood $\mathcal{L}$ is not directly computable (three typical generative models, the VAE, the GAN, and the normalizing flow, use different techniques).
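As a concrete illustration of the update rule above, here is a minimal sketch of gradient ascent on the log-likelihood. The factorized Bernoulli model, the data, and all variable names are assumptions chosen so that the gradient has a closed form; real generative models need the approximations mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data drawn from an unknown product-Bernoulli distribution.
data = (rng.random((1000, 8)) < 0.3).astype(float)

theta = np.zeros(8)  # model: p_i = sigmoid(theta_i), independent per component
eta = 0.5            # learning rate

for t in range(200):
    p = 1.0 / (1.0 + np.exp(-theta))   # model probabilities
    # Closed-form gradient of the mean log-likelihood: <x - p>
    grad = np.mean(data - p, axis=0)
    theta = theta + eta * grad         # ascent: theta_{t+1} = theta_t + eta * grad

p_final = 1.0 / (1.0 + np.exp(-theta))  # approaches the empirical means
```

The ascent step is exactly $\pmb{\theta}_{t+1} = \pmb{\theta}_t + \eta \nabla_{\pmb{\theta}} \mathcal{L}(\pmb{\theta})$; only the tractable gradient is special to this toy model.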
We want to find the ground state of local Hamiltonians:
$$H = \sum_{\langle i,j \rangle } h_{i,j}$$

Examples (1D, 2-local):
Transverse-field Ising model: $H = \sum_{i=1}^N \left( Z_i Z_{i+1} + h X_i \right)$
Heisenberg model: $H = \sum_{i=1}^N \left( X_i X_{i+1} + Y_i Y_{i+1} + Z_i Z_{i+1} \right)$
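For small chains, these Hamiltonians can be built explicitly as dense matrices and diagonalized exactly. A sketch for the transverse-field Ising model; the function names are illustrative and periodic boundary conditions are assumed:

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)

def op_at(op, i, n):
    """Embed the single-site operator `op` at site i of an n-site chain."""
    out = np.array([[1.0]])
    for j in range(n):
        out = np.kron(out, op if j == i else I2)
    return out

def tfi_hamiltonian(n, h):
    """H = sum_i Z_i Z_{i+1} + h X_i, periodic boundaries (sign convention as in the text)."""
    H = np.zeros((2**n, 2**n))
    for i in range(n):
        H += op_at(Z, i, n) @ op_at(Z, (i + 1) % n, n)
        H += h * op_at(X, i, n)
    return H

H = tfi_hamiltonian(6, h=1.0)
e0 = np.linalg.eigvalsh(H)[0]   # ground-state energy by exact diagonalization
```

The $2^N$ growth of the matrix dimension is visible directly: already at $N = 6$ the Hamiltonian is a $64 \times 64$ matrix, which motivates the variational approach below.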
Naive answer: Probably yes, as the dimension of $H$ increases exponentially with $N$.
CS answer: Probably yes, as it is QMA-complete.
First, we choose an Ansatz $\psi_{\pmb{\theta}}(\pmb{x})$ that describes the ground state.
Second, we iteratively minimize the energy using a gradient-descent-type algorithm.
Requirements:
Algorithm for VQMC
1. Choose an Ansatz for the ground state.
2. For samples from the model:
3. $\pmb{\theta}_{t+1} = \pmb{\theta}_t - \eta \mathcal{F}^{-1} \nabla_{\pmb{\theta}} \langle H \rangle$
|  | Generative ML | VQMC |
| --- | --- | --- |
| What to represent? | Data obtained from the real world (MNIST, CIFAR-10, CIFAR-100, ImageNet, etc.). | GS of the Hamiltonian with a parameter. |
| Sample from? | Data we already have (from our model when we generate). | Always from the model (as we don't know the data). |
| Optimization? | Stochastic gradient descent and its variations (RMSProp, Adam, etc.). | Such a variant does not work well; we usually need full geometric information. |
This Ansatz reported state-of-the-art accuracy for the TFI and Heisenberg models.
G. Carleo and M. Troyer, Science 355, 602 (2017).

For quantum states, the Fubini-Study metric can be used to measure the distance between two quantum states:
$D(\psi, \phi) = \arccos\sqrt{\frac{|\langle \psi | \phi \rangle|^2}{\langle \psi | \psi \rangle \langle \phi | \phi \rangle}}.$
In infinitesimal form, we have
$ds^2 \approx \left[ \frac{\langle \delta \psi | \delta \psi \rangle}{\langle \psi | \psi \rangle} - \frac{\langle \delta \psi | \psi \rangle \langle \psi | \delta \psi \rangle}{\langle \psi | \psi \rangle^2}\right]$.
For parametrized quantum states, we have
$\mathcal{F}_{ij} = \left\langle O_i(x)^* O_j(x) \right\rangle - \left\langle O_i(x)^* \right\rangle\left\langle O_j(x) \right\rangle$
where $\langle A(x) \rangle = \sum_x |\psi_\theta(x)|^2 A(x)$ and $O_i(x) = \frac{\partial \log \psi_\theta(x)}{\partial \theta_i}$.
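Given samples $x \sim |\psi_\theta(x)|^2$ and their log-derivatives $O_i(x)$, the matrix above is a covariance and can be estimated straightforwardly. A sketch in which mock log-derivative vectors stand in for a real model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Mock log-derivatives: O[k, i] = d log psi_theta(x_k) / d theta_i
# for n_samples configurations x_k drawn from |psi_theta|^2.
n_samples, n_params = 500, 4
O = rng.normal(size=(n_samples, n_params))

def fisher_matrix(O):
    """F_ij = <O_i* O_j> - <O_i*><O_j>, estimated over the samples."""
    centered = O - O.mean(axis=0)
    return (centered.conj().T @ centered) / O.shape[0]

F = fisher_matrix(O)   # (n_params, n_params), positive semidefinite
```

For real amplitudes this is exactly the sample covariance of the $O_i$; for complex wavefunctions the conjugation in the first factor matters.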
The ML community is also interested in this matrix, as it is related to optimization (directly) and generalization (indirectly).
Is it related to the ground state property?
Study using the transverse field Ising model
$$H = \sum_i \sigma^i_z \sigma^{i+1}_z + h\sigma^i_x$$

For a classical Hamiltonian $H(x)$, we define the coherent thermal state at inverse temperature $\beta$ as $$|\psi \rangle = \sum_x \frac{e^{-\beta H(x)/2}}{\sqrt{Z}}|x\rangle. $$
An important property of this state is that the expectation value of any observable that is a function of $\sigma_z$ is the same as in the classical thermal state, e.g., $$\langle \psi| \sigma^i_z \sigma^j_z |\psi \rangle = \frac{1}{Z} \sum_x e^{-\beta H(x)} x_i x_j. $$
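For a small chain this identity can be checked by brute force. A sketch assuming a periodic classical Ising energy $H(x) = -\sum_i x_i x_{i+1}$ as the example Hamiltonian (the choice of model and all names are illustrative):

```python
import numpy as np
from itertools import product

n, beta = 4, 0.7

configs = np.array(list(product([1, -1], repeat=n)))  # all 2^n spin configurations

def energy(x):
    return -np.sum(x * np.roll(x, -1))                # periodic Ising chain H(x)

E = np.array([energy(x) for x in configs])
weights = np.exp(-beta * E)
Z = weights.sum()

# Coherent thermal state amplitudes: psi(x) = exp(-beta H(x)/2) / sqrt(Z).
psi = np.sqrt(weights / Z)

# Quantum expectation <psi| s^z_0 s^z_1 |psi>, diagonal in the z basis...
quantum = np.sum(psi**2 * configs[:, 0] * configs[:, 1])

# ...equals the classical thermal correlator (1/Z) sum_x e^{-beta H(x)} x_0 x_1.
classical = np.sum(weights * configs[:, 0] * configs[:, 1]) / Z
```

The agreement is exact because $|\psi(x)|^2 = e^{-\beta H(x)}/Z$ by construction, which is the content of the identity above.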
For the two-dimensional Ising model, the coherent thermal state has been used to show that PEPS can have polynomially decaying correlations: at $\beta = \beta_c$, the correlations decay polynomially, yet PEPS can exactly represent this state.
(a) Spectrum of the Fisher information matrix for different $\beta$ when $L=10$. (b) Rank of the Fisher information matrix and (c) the Fisher information density for different system size $L$ as a function of $\beta$.


Not really. Many of these variables are redundant.
In our simulations, we use natural gradient descent (NGD), also known as stochastic reconfiguration (SR).
$$\theta_{t+1} = \theta_t - \eta (\mathcal{F} + \epsilon)^{-1} \nabla \langle H \rangle $$

However, calculating $\mathcal{F}^{-1}$ requires $O(D^{2 \sim 3})$ operations. This is never used in ML, as models involve $10^5 \sim 10^8$ parameters (e.g., VGG-19 has 138M parameters).
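In practice one can at least avoid forming the inverse explicitly by solving the regularized linear system for the update direction. A sketch of one SR/NGD step with mock quantities; the PSD matrix and the random gradient stand in for the real $\mathcal{F}$ and $\nabla \langle H \rangle$:

```python
import numpy as np

rng = np.random.default_rng(3)

D = 50
A = rng.normal(size=(D, D))
F = A @ A.T / D                # PSD stand-in for the Fisher matrix
grad = rng.normal(size=D)      # stand-in for grad_theta <H>
theta = np.zeros(D)
eta, eps = 0.05, 1e-3

# Solve (F + eps*I) delta = grad instead of computing (F + eps*I)^{-1}.
delta = np.linalg.solve(F + eps * np.eye(D), grad)
theta = theta - eta * delta
```

For large $D$, one would replace the dense solve by an iterative method (e.g. conjugate gradient) that only needs matrix-vector products with $\mathcal{F}$; the cost of the explicit solve is what rules this update out for ML-scale models.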
For each mini-batch:
$v_{t+1} = \beta v_t + (1-\beta) \langle (\nabla_\theta f )^2 \rangle$
$w_{t+1} = w_t - \eta \langle \nabla_\theta f \rangle /\sqrt{v_{t+1} + \epsilon}$
It normalizes each component of the gradient based on its norm history.
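The two update equations above translate directly into code. A minimal NumPy sketch, applied here to a toy quadratic objective (the objective and all names are illustrative):

```python
import numpy as np

def rmsprop_step(w, v, grad, eta=1e-2, beta=0.9, eps=1e-8):
    """One RMSProp update, following the equations above:
       v <- beta*v + (1-beta)*grad^2 ;  w <- w - eta*grad/sqrt(v + eps)."""
    v = beta * v + (1 - beta) * grad**2
    w = w - eta * grad / np.sqrt(v + eps)
    return w, v

# Usage: minimize f(w) = sum(w^2) from a random start.
rng = np.random.default_rng(4)
w = rng.normal(size=5)
v = np.zeros(5)
for _ in range(500):
    grad = 2 * w                # gradient of sum(w^2)
    w, v = rmsprop_step(w, v, grad)
```

Note how the division by $\sqrt{v + \epsilon}$ makes the effective step size roughly $\eta$ per component, independent of the raw gradient scale; this is the per-component normalization described above.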
Does it work in higher dimensions?
Yes for real-world data, but no for physics problems.
In usual ML, we want to minimize the cross entropy with the data distribution, i.e.,
$\mathcal{L} \approx -\left\langle \log p_\theta(x) \right\rangle_{x \sim p_{\rm data}(x)}$
then we have $[\nabla_\theta f_\theta(x)]_i = -\frac{\partial \log p_\theta(x) }{\partial \theta_i}$.
Thus, $v_t$ is the running average of
$\left\langle \left( \frac{\partial \log p_\theta(x) }{\partial \theta_i} \right)^2 \right\rangle$
which gives the diagonal elements of the Fisher matrix.
Thus our update is close to $\theta_{t+1} = \theta_t - \eta (\mathcal{F}_{\rm diag} + \epsilon)^{-1/2} \nabla_\theta f$.
J. Martens, "New insights and perspectives on the natural gradient method", arXiv:1412.1193.
In our setting, $\nabla_\theta \langle H \rangle$ is used as the gradient, so its square is unrelated to the diagonal of the Fisher information matrix. We may instead use the following:
For each mini-batch:
$v_{t+1} = \beta v_t + (1-\beta) \langle (O_i(x))^2 \rangle$
$w_{t+1} = w_t - \eta \nabla_\theta \langle H \rangle /\sqrt{v_{t+1} + \epsilon}$
where $O_i(x) = \frac{\partial \log \psi_\theta(x)}{\partial \theta_i}$.
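A sketch of one step of this modified update, with mock log-derivatives and energy gradient; shapes and values are placeholders for a real VQMC run:

```python
import numpy as np

rng = np.random.default_rng(5)

# Mock per-sample log-derivatives O[k, i] = d log psi_theta(x_k) / d theta_i
# and an energy gradient; both stand in for real Monte Carlo estimates.
n_samples, D = 200, 10
O = rng.normal(size=(n_samples, D))
energy_grad = rng.normal(size=D)       # stand-in for grad_theta <H>

theta = np.zeros(D)
v = np.zeros(D)
eta, beta, eps = 1e-2, 0.9, 1e-8

# Normalize the energy gradient by the running average of <O_i(x)^2>,
# i.e. the (uncentered) diagonal of the Fisher matrix, rather than by
# the squared gradient itself as in vanilla RMSProp.
v = beta * v + (1 - beta) * np.mean(O**2, axis=0)
theta = theta - eta * energy_grad / np.sqrt(v + eps)
```

The only change from standard RMSProp is the source of the second-moment estimate; the gradient being normalized is still $\nabla_\theta \langle H \rangle$.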
In general, we do not expect RMSProp to work well when the Fisher matrix is low-rank.
The spectrum of Fisher information from MNIST
In ML, it is typical that the rank of the Fisher matrix is $\gg$ the number of samples. Long-range correlations in image data might be the reason.