Optimization
Landscape Symmetry, Saddle Points and Beyond
Non-convex Optimization
Probabilistic Models,
Deep Neural Nets
Theory:
NP-hard. Better to avoid, or use convex relaxation.
Practice: Easy! Just run SGD.
In practice, these algorithms converge to good solutions.
Gradient Descent
$$
x_{t+1} = x_t - \eta\nabla f(x_t)
$$
Converges to stationary point ($\nabla f(x_t)=0$)
[Nesterov ‘98]
or a “local minimum” [Ge et al. ’15] [GHJY15]
GD cannot escape from a locally optimal solution.
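A minimal NumPy sketch of this update rule (the quadratic test function, step size, and stopping tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=1000, tol=1e-8):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # (approximately) a stationary point
            break
        x = x - eta * g
    return x

# Toy example: f(x) = ||x||^2 has gradient 2x; GD converges to the global minimum 0.
x_hat = gradient_descent(lambda x: 2 * x, x0=np.array([3.0, -4.0]))
print(x_hat)
```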
Landscape
Convex Functions
Simple: 0 gradient → Global Minimum
Can be optimized efficiently
Non-Convex Functions
Complicated: local minima, saddle points
GD can only find a local minimum
What special properties make a non-convex function easy?
Why are the objectives always non-convex?
Symmetry → Non-Convexity
The problem asks for multiple components, but the components have no ordering.
The neurons in a layer of neural network can be permuted and still compute the same function.
Solution (a)
Solution (b)
Convex Combination (a+b)/2
If the objective were convex, the convex combination (a+b)/2 would also be an optimal solution; since it is usually not a valid solution, the objective cannot be convex.
10 min mark
Optimization algorithms need to break the symmetry and converge to one of the (equivalent) local minima.
Symmetry → Non-Convexity
$$f(x) = -\|x\|^2 + \|x\|_4^4$$
This construction is used in independent component analysis
Four local/global minima (symmetric)
Connections between two adjacent local minima
Can we add constraints to break the symmetry?
Rotate and restrict to the first quadrant.
This will add new local minima.
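A quick numerical check of the 2-D instance of this example (the claimed minima at $(\pm 1/\sqrt2, \pm 1/\sqrt2)$ follow from setting the gradient $-2x_i+4x_i^3$ to zero and inspecting the diagonal Hessian; the code only verifies this):

```python
import numpy as np

def f(x):
    """f(x) = -||x||_2^2 + ||x||_4^4, the symmetric non-convex example."""
    return -np.sum(x**2) + np.sum(x**4)

def grad_f(x):
    return -2 * x + 4 * x**3

# In 2-D, critical points have coordinates in {0, +-1/sqrt(2)}.  The Hessian is
# diag(-2 + 12 x_i^2), so the four symmetric points (+-1/sqrt(2), +-1/sqrt(2)) are
# the equivalent local/global minima (value -1/2); points with a zero coordinate
# are saddle points, and the origin is a local maximum.
s = 1 / np.sqrt(2)
for p in [np.array([a, b]) for a in (s, -s) for b in (s, -s)]:
    assert np.allclose(grad_f(p), 0) and np.isclose(f(p), -0.5)
```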
Locally Optimizable Functions
Local minima are symmetric versions of the global minimum
No higher-order saddle points
(higher-order saddle points have zero gradient and a p.s.d. Hessian)
SGD is guaranteed to find a global minimum
The first condition seems very strong
SVD/PCA
Generalized Linear Model [KKKS’11] [HLS’14]
Synchronization [BVS’16]
Dictionary Learning [SQW’17]
Matrix Completion [GLM16] [GJZ17]
Matrix Sensing [BNS’16] [PKCS’16]
MAX-CUT [MMMO’17]
Tensor Decomposition [GHJY’15] [GM’16]
2-Layer Neural Net [GLM17]
Matrix completion as a simple example
Landscape of neural networks
Matrix Completion
Low rank matrix $M$
Observations: a subset of the entries of $M$
Goal: recover the remaining entries
Typical application: recommendation systems [RennieSrebro05]
Assume the matrix is a product of two $n\times r$ matrices
Hope to recover $M$ using $\tilde O(nr)$ observations.
Non-Convex Objective
Idea: Try to find the low rank factors directly $$M=U\cdot V^T$$
Variables $X$, $Y$. Hope $X = U, Y = V, M = XY^T$
Observed entries $(i, j)\in\Omega$, sampled uniformly at random
Minimize “loss” on observed entries $$\min f(X, Y) = \sum_{(i,j)\in\Omega}(M_{i, j} - (XY^T)_{i,j})^2$$
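A minimal sketch of running gradient descent on this objective from a random start (dimensions, rank, sampling rate, step size, and iteration count are untuned, illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 50, 3, 0.3                       # size, rank, observation probability

U, V = rng.normal(size=(n, r)), rng.normal(size=(n, r))
M = U @ V.T                                # ground-truth low-rank matrix
mask = rng.random((n, n)) < p              # observed entries Omega

X, Y = rng.normal(size=(n, r)), rng.normal(size=(n, r))
eta = 1e-3
for _ in range(5000):
    res = mask * (X @ Y.T - M)             # residual on observed entries only
    X, Y = X - eta * (res @ Y), Y - eta * (res.T @ X)

print("relative error:", np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M))
```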
Symmetry and Solutions
$$M=UV^T\qquad \min(f(X, Y):=\|M-XY^T\|^2_\Omega)$$
Hope: $X=U, Y=V$
Not true: there are many equivalent solutions, e.g. $X=UR$, $Y=VR$ for any orthogonal $R$: $$UV^T=URR^TV^T$$
Saddle points: e.g. $X=Y=0$
The objective behaves similarly to a norm (restricted to the observed entries).
Theorem: when the number of observations is at least $\tilde{\Omega}(nr^6)$, all local minima of $f(X, Y)$* are global minima: they satisfy $XY^T=M$.
[GLM16]: symmetric case; [GJZ17]: asymmetric case
Corollary : Simple SGD can solve matrix completion from an arbitrary starting point.
Prior work:
convex relaxation
non-convex optimization with carefully chosen starting point
Convex relaxation: tight dependence on $r$; the number of observations needed is $\Omega(nr\,\mathrm{poly}(\log d))$
Non-convex with a carefully chosen starting point: $\sim nr^2$ observations
27 min mark
$X$ is a local minimum of $f(X)$ → $\nabla f(X)=0, \nabla^2f(X)\succcurlyeq 0$
$$\nabla f(X)\ne 0$$
Following the gradient reduces $f(X)$.
$$\lambda_{\min}(\nabla^2f(X))<0$$
Moving along the minimum eigendirection of the Hessian reduces $f(X)$.
Direction of Improvement Exists
If $X$ is not a global minimum,
there exists $\Delta$ such that $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2f(X)[\Delta]<0$ (the Hessian quadratic form evaluated at $\Delta$)
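These two conditions can be checked numerically for a candidate direction $\Delta$ via finite differences; a small sketch (the test function, point, step size, and tolerance are illustrative assumptions):

```python
import numpy as np

def is_improvement_direction(f, X, Delta, t=1e-4, tol=1e-6):
    """Check <grad f(X), Delta> != 0 or the Hessian quadratic form at Delta < 0."""
    f0, fp, fm = f(X), f(X + t * Delta), f(X - t * Delta)
    directional_grad = (fp - fm) / (2 * t)          # ~ <grad f(X), Delta>
    directional_curv = (fp - 2 * f0 + fm) / t**2    # ~ Hessian quadratic form at Delta
    return abs(directional_grad) > tol or directional_curv < -tol

# Toy example: at the saddle point X = 0 of f(x) = -x_1^2 + x_2^2, the direction
# e_1 has zero directional gradient but negative curvature, so it improves f.
f = lambda x: -x[0]**2 + x[1]**2
print(is_improvement_direction(f, np.zeros(2), np.array([1.0, 0.0])))  # True
```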
Matrix Factorization
Every entry is observed, want to write $M=UV^T$
Consider symmetric case: $M=UU^T$ $$g(X):=\|M-XX^T\|_F^2$$
Goal: prove local minima satisfy $XX^T=M$.
$\nabla g(X)=0$
→ $MX=XX^TX$
→ If $\text{span}(X)=\text{span}(M)$, $M=XX^T$
$\nabla^2 g(X)\succcurlyeq0$
→ $\text{span}(X)=\text{span}(M)$
Approach in [GLM16 ]. Need more cases to work for Matrix Completion.
$X$ being a full-rank matrix corresponds to $\text{span}(X)=\text{span}(M)$ (span of the column vectors).
When $X$ is not full rank (for example $X=0$), the gradient can still be zero, so those cases need a separate argument.
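For reference, a short standard derivation of the first implication (not on the slide): since $M$ and $XX^T$ are symmetric,
$$
\nabla g(X) = -4\,(M - XX^T)\,X, \qquad\text{so}\qquad \nabla g(X)=0 \;\Longrightarrow\; MX = XX^TX.
$$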
Intuitively, want $X$ to go to the optimal solution $U$.
Recall: many equivalent optimal solutions!
Idea: find the “closest” among all optimal solutions $$\Delta = X-UR, \qquad R=\arg\min_{R:\,R^TR=I}\|X-UR\|_F$$
Nice property: $||\Delta\Delta^T||_F^2\le2||M-XX^T||_F^2$
Alternative approach: use $\Delta$ as the candidate direction of improvement.
If $\Delta$ is nonzero, $X$ is not globally optimal.
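Computing the minimizing $R$ is an orthogonal Procrustes problem; a small NumPy sketch (the matrices in the usage example are random placeholders):

```python
import numpy as np

def align(X, U):
    """Return Delta = X - U R with R = argmin over orthogonal R of ||X - U R||_F.

    Orthogonal Procrustes: the minimizer is R = P Q^T, where U^T X = P S Q^T is an SVD.
    """
    P, _, QT = np.linalg.svd(U.T @ X)
    R = P @ QT
    return X - U @ R, R

# Toy usage: X is a rotated copy of U, so the aligned difference Delta is ~0.
rng = np.random.default_rng(1)
U = rng.normal(size=(10, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix
Delta, R = align(U @ Q, U)
print(np.linalg.norm(Delta))                   # ~0
```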
Main Lemma [GJZ17]
Lemma: If $\|\Delta\Delta^T\|_\Omega^2<3\|M-XX^T\|_\Omega^2$, then either $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2 f(X)[\Delta]<0$. ($\Delta$ is a direction of improvement.)
$||\Delta\Delta^T||_F^2\le2||M-XX^T||_F^2$
Immediate proof for matrix factorization
For completion, the proof works as long as $\|A\|_\Omega\approx \|A\|_F$ for $A=\Delta\Delta^T$ and $A=M-XX^T$
$\Delta\Delta^T$ and $M-XX^T$ are both low rank.
This is a restricted-isometry-style condition: it characterizes sampling operators that are nearly norm-preserving, analogous to matrices that are nearly orthonormal when operating on sparse vectors.
The argument also applies to the asymmetric case, matrix sensing, and robust PCA.
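A quick numerical illustration of the concentration being used: for a random low-rank matrix $A$ and uniform sampling at rate $p$, $\frac1p\|A\|_\Omega^2\approx\|A\|_F^2$ (the sizes, rank, and sampling rate below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, p = 200, 3, 0.2
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # random rank-r matrix
mask = rng.random((n, n)) < p                            # uniform sampling Omega

norm_Omega_sq = np.sum((mask * A) ** 2) / p              # (1/p) * ||A||_Omega^2
norm_F_sq = np.sum(A ** 2)                               # ||A||_F^2
print(norm_Omega_sq / norm_F_sq)                         # close to 1 for low-rank A
```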
Locally optimizable problems
Low rank matrix
SVD/PCA, Matrix Completion, Synchronization, Matrix Sensing, GLM, MAX-CUT
Low rank tensor
Dictionary Learning, Tensor Decomposition, 2-Layer Neural Net
Optimization landscape for neural networks
$$
g_W(x) = \sigma(W_p\sigma(W_{p-1}\sigma(\cdots \sigma(W_1x)\cdots)))
$$
$$
f(W) = \mathbb E[\|y - g_W(x)\|^2]
$$
Fully connected with ReLU
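The model and loss above written out in NumPy (shapes, depth, and the random data are placeholder assumptions; ReLU is applied after every layer, as in the formula):

```python
import numpy as np

def g(W_list, x):
    """g_W(x) = sigma(W_p sigma(... sigma(W_1 x) ...)) with ReLU sigma."""
    h = x
    for W in W_list:
        h = np.maximum(W @ h, 0.0)         # ReLU after each layer
    return h

def f(W_list, X, Y):
    """Empirical version of f(W) = E[ ||y - g_W(x)||^2 ]."""
    preds = np.stack([g(W_list, x) for x in X])
    return np.mean(np.sum((Y - preds) ** 2, axis=1))

# Placeholder 3-layer fully connected network on random data.
rng = np.random.default_rng(0)
W_list = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(4, 16))]
X, Y = rng.normal(size=(32, 8)), rng.normal(size=(32, 4))
print(f(W_list, X, Y))
```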
40 min mark
Teacher/Student Setting
Goal: prove something about optimization
Assume there is already a good network ($W^*$) and enough samples
Good solution: the student mimics the teacher
Make sure the hypothesis class can recover the solution
Focus on optimization
Linear Networks
$$
g_W(x) = W_pW_{p-1}\cdots W_1x
$$
[Kawaguchi 16], [YSJ18]
All local minima of a linear neural network (with squared loss) are global*
With 2 layers, there are no higher-order saddle points.
With 3 or more layers, there are higher-order saddle points.
These results rely on different assumptions; they are not directly comparable, and it is not clear whether the assumptions are necessary.
For a critical point, consider the rank $r$ of the product of all layers: it is a local (and global) minimum if $r = \min(m, n)$; it can also be an ordinary saddle point if $r < \min(m, n)$.
Open problem: does local search actually find a global minimum?
$\min(m, n)$ is the maximum rank we can get
Algorithms can get trapped in higher-order saddle points
Two-Layer Neural Network
$$
g(x) = a^T\sigma(Bx)
$$
WLOG: rows of $B$ ($b_i$) have unit norm.
Data: $x\sim N(0, I)$, $y$ from a teacher.
Some additional technical assumptions on $a^*$, $B^*$;
most importantly, $B^*$ is full rank
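A sketch of the teacher/student data-generating process in this setting (input dimension, width, sample count, and the random teacher weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 4, 5000                        # input dim, hidden units, samples

# Teacher network y = a*^T relu(B* x), with rows of B* normalized to unit norm.
a_star = rng.normal(size=k)
B_star = rng.normal(size=(k, d))
B_star /= np.linalg.norm(B_star, axis=1, keepdims=True)

X = rng.normal(size=(n, d))                  # x ~ N(0, I)
y = np.maximum(X @ B_star.T, 0.0) @ a_star   # labels generated by the teacher

# The student trains g(x) = a^T relu(B x) on (X, y); in the teacher/student setting
# success means recovering (a*, B*) up to permutation of the hidden units.
```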
Bad local minima
Claim: The objective function $f(a, B)$ has local minima that are not equivalent to the ground truth.
Over-parametrization appears to drastically reduce such local minima
Landscape Design
Idea: Design a new objective with no bad local min.
Implicit in many previous techniques
Regularization
Method of moments instead of MLE
Provable New Objective
Theorem [GLM17]: Can construct an objective for a two-layer neural network such that all local minima are global*
Objective inspired by tensor decomposition
Relies on Gaussian distribution
Extended to symmetric input distributions* by [GKLW18]
(some assumptions are needed for the “global*” guarantee to hold)
Theorem [GKLW18]: For a two-layer neural network with more outputs than hidden units, if the input distribution is symmetric, there is a polynomial-time algorithm that learns the neural network.
Theorem [GWZ19]: For a 2-/3-layer neural network with quadratic/polynomial activations, if the inputs are in general position and
#parameters = $O(1)\cdot$ #training samples,
then GD can memorize the training data.
Spin Glass Model
Claim [CHMBL15]: all the local minima of a neural network have approximately equal function values
$$
\mathbb E[\#\text{local min in } Z] = \int_x\mathbb E\big[|\det(\nabla^2 f(x))|\cdot \mathbf 1(\nabla^2 f(x)\succeq0)\,\mathbf 1(x\in Z)\,\big|\,\nabla f(x)=0\big]\,p_{\nabla f(x)}(0)\,dx
$$
Proof idea: Count number of local min directly
Evaluate the formula using random matrix theory
Formal for the spin-glass model (random polynomials); informal connection with NNs.
Formal results for overcomplete tensor and tensor PCA
The Kac-Rice formula can be used to compute the expected number of local minima in any region
In some cases, the Hessian and gradient are well-behaved random matrices, so the formula can be evaluated
Dynamics/Trajectory
Main idea: instead of analyzing the global landscape, analyze the path from a random initialization.
Observation: path can be very short
Empirical Risk/Training Error
[Du, Zhai, Poczos, Singh] Two-layer
[Allen-Zhu, Li, Song] [Du, Lee, Li, Wang, Zhai] [Zou, Cao, Zhu, Gu] Multi-layer/ResNet
[Allen-Zhu, Li, Song] Recurrent Neural Network
Population Risk/Test Error
[Li, Liang] Special multiclass classification
[Allen-Zhu, Li, Liang] 2 or 3 layer neural network. “Kernel-like” setting, special activation
If the model is over-parametrized, it can overfit the data,
so handling generalization is harder.
“Kernel-like” means the neural net is approximable using low-degree polynomials,
and requires a special activation.
Accelerated methods are faster but can lead to slightly worse generalization.
GD has an innate bias towards finding solutions with good generalization
Methods to speed up gradient descent (e.g., acceleration or adaptive regularization) can sometimes lead to worse generalization.
We need to develop a new vocabulary (and mathematics) to reason about trajectories. This goes beyond the usual “landscape view” of stationary points, gradient norms, Hessian norms, smoothness, etc.
Finding a global minimum is not always good enough (trajectory/implicit bias)
Optimization is easy (linear) in highly overparametrized regime
How many parameters do we need?