Optimization

Landscape Symmetry, Saddle Points and Beyond

Non-convex Optimization

Probabilistic Models, Deep Neural Nets

  • Theory: NP-hard in general; better to avoid it or use a convex relaxation.
  • Practice: Easy! Just run SGD.

Gradient Descent

$$ x_{t+1} = x_t - \eta\nabla f(x_t) $$

Converges to a stationary point ($\nabla f(x_t)=0$) [Nesterov ‘98] or a “local minimum” [Ge et al. ‘15] [GHJY15]

GD cannot escape from a locally optimal solution.
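To make the update concrete, here is a minimal sketch (my own toy example; the function and step size are arbitrary choices) showing GD settling at a stationary point that depends on the starting point:

```python
# Minimal gradient-descent sketch on a toy non-convex function (illustrative only).
import numpy as np

def grad_f(x):
    # gradient of f(x) = 0.25*sum(x**4) - 0.5*sum(x**2), which has minima at x_i = +-1
    return x**3 - x

x = np.array([0.3, -2.0])            # arbitrary starting point
eta = 0.1                            # step size
for _ in range(1000):
    x = x - eta * grad_f(x)          # x_{t+1} = x_t - eta * grad f(x_t)

print(x, np.linalg.norm(grad_f(x)))  # gradient norm ~ 0: a stationary point, here (1, -1)
```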

Landscape

Convex Functions

  • Simple: 0 gradient → Global Minimum
  • Can be optimized efficiently

Non-Convex Functions

  • Complicated: local minima, saddle points
  • GD can only find a local minimum

What special properties make a non-convex function easy?

Why are the objectives always non-convex?

Symmetry → Non-Convexity

Problem asks for multiple components, but the components have no ordering.

(Figure: several equivalent solutions; the convex combination (a+b)/2 of two solutions (a) and (b) is not a solution.)

Optimization algorithms need to break the symmetry and converge to one of the (equivalent) local minima.

Saddle Points

$$z = x^2 - y^2 + y^4 + 0.1\cdot y$$
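A quick numerical check I added for this example (Newton's method on the gradient, with the exact Hessian): the stationary point near $(0, 0.05)$ has a Hessian with one positive and one negative eigenvalue, i.e. it is a saddle.

```python
# Locate the stationary point of z = x^2 - y^2 + y^4 + 0.1*y and inspect its Hessian.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y + 4 * y**3 + 0.1])

def hess(p):
    _, y = p
    return np.array([[2.0, 0.0], [0.0, -2.0 + 12 * y**2]])

p = np.array([0.0, 0.0])
for _ in range(50):                          # Newton iterations on grad(p) = 0
    p = p - np.linalg.solve(hess(p), grad(p))

print(p)                                     # ~ (0, 0.0503): a stationary point
print(np.linalg.eigvalsh(hess(p)))           # ~ (-1.97, 2.0): indefinite, so a saddle
```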

Symmetry → Non-Convexity

$$f(x) = -\|x\|^2 + \|x\|_4^4$$
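Here the symmetry is explicit: the function decomposes coordinate-wise as $\sum_i(-x_i^2+x_i^4)$, so every sign pattern $(\pm 1/\sqrt2,\dots,\pm1/\sqrt2)$ is an equally good global minimum, while the average of two of them is strictly worse. A small check (my own sketch):

```python
# Symmetry -> non-convexity for f(x) = -||x||_2^2 + ||x||_4^4.
import numpy as np

def f(x):
    return -np.sum(x**2) + np.sum(x**4)

d = 4
a = np.full(d, 1 / np.sqrt(2))     # one global minimum
b = a.copy()
b[0] *= -1                         # an equivalent (symmetric) global minimum

print(f(a), f(b))                  # both equal -d/4 = -1.0
print(f((a + b) / 2))              # the convex combination is worse: -0.75
```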

Locally Optimizable Functions

  • Local minima are symmetric versions of global minima
  • No higher-order saddle points
  • SVD/PCA
  • Generalized Linear Model [KKKS’11] [HLS’14]
  • Synchronization [BVS’16]
  • Dictionary Learning [SQW’17]
  • Matrix Completion [GLM16] [GJZ17]
  • Matrix Sensing [BNS’16] [PKCS’16]
  • MAX-CUT [MMMO’17]
  • Tensor Decomposition [GHJY’15] [GM’16]
  • 2-Layer Neural Net [GLM17]

Matrix Completion

  • Low rank matrix $M$
  • Observations: entries of $M$
  • Goal: recover remaining entries
(Figure: an incomplete matrix with only some entries observed)

Non-Convex Objective

  • Idea: Try to find the low rank factors directly $$M=U\cdot V^T$$
  • Variables $X$, $Y$. Hope $X = U, Y = V, M = XY^T$
  • Uniform observations $(i, j)\in\Omega$
  • Minimize “loss” on observed entries $$\min f(X, Y) = \sum_{(i,j)\in\Omega}(M_{i, j} - (XY^T)_{i,j})^2$$
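A minimal sketch of this objective (my own code, not the papers'; the sizes, sampling rate, and step size are arbitrary choices): plain gradient descent on $f(X, Y)$ from a random start typically drives $XY^T$ to $M$.

```python
# Non-convex matrix completion: minimize sum over observed (i,j) of (M_ij - (X Y^T)_ij)^2.
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 50, 3, 0.5                           # size, rank, observation probability
U, V = rng.normal(size=(n, r)), rng.normal(size=(n, r))
M = U @ V.T                                    # ground-truth low-rank matrix
mask = rng.random((n, n)) < p                  # uniformly observed entries Omega

X, Y = rng.normal(size=(n, r)), rng.normal(size=(n, r))   # arbitrary starting point
eta = 0.002
for _ in range(20000):
    R = mask * (X @ Y.T - M)                   # residual on observed entries only
    X, Y = X - eta * 2 * R @ Y, Y - eta * 2 * R.T @ X      # gradient step on f(X, Y)

print(np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M))     # should be small: M recovered
```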

Symmetry and Solutions

$$M=UV^T\qquad \min(f(X, Y):=\|M-XY^T\|^2_\Omega)$$

  • Hope: $X=U, Y=V$
  • Not true: many equivalent solutions: $$UV^T=(UR)(VR)^T\quad\text{for any orthogonal }R$$
  • Saddle points: e.g. $X=Y=0$

Theorem: when the number of observations is at least $\tilde{\Omega}(nr^6)$, all local minima of $f(X, Y)$* are global minima: they satisfy $XY^T=M$.

[GLM16]: symmetric case; [GJZ17]: asymmetric case

Corollary: Simple SGD can solve matrix completion from an arbitrary starting point.

Prior work:

  • convex relaxation
  • non-convex optimization with carefully chosen starting point

$X$ is a local minimum of $f(X)$ → $\nabla f(X)=0, \nabla^2f(X)\succcurlyeq 0$

$$\nabla f(X)\ne 0$$

Following the gradient reduces $f(X)$.

$$\lambda_{\min}(\nabla^2f(X))<0$$

Moving along the minimum eigendirection of the Hessian reduces $f(X)$.

Direction of Improvement Exists

If $X$ is not a global minimum,

there exists $\Delta$ with $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2f(X)[\Delta]<0$.

Matrix Factorization

  • Every entry is observed, want to write $M=UV^T$
  • Consider symmetric case: $M=UU^T$ $$g(X):=\|M-XX^T\|_F^2$$
  • Goal: prove local minima satisfy $XX^T=M$.

$\nabla g(X)=0$ → $MX=XX^TX$ → If $\text{span}(X)=\text{span}(M)$, $M=XX^T$

$\nabla^2 g(X)\succcurlyeq0$ → $\text{span}(X)=\text{span}(M)$

This is the approach in [GLM16]; a more involved case analysis is needed for matrix completion.
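A small numerical sketch of the factorization case (my own; sizes and step size are arbitrary): gradient descent on $g$ from a random start reaches a point where $\nabla g(X)=4(XX^T-M)X\approx 0$, the condition $MX=XX^TX$ holds, and indeed $XX^T=M$.

```python
# Symmetric matrix factorization: g(X) = ||M - X X^T||_F^2, all local minima are global.
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
U = rng.normal(size=(n, r))
M = U @ U.T                                     # symmetric PSD ground truth, rank r

X = rng.normal(size=(n, r))                     # random starting point
eta = 2e-4
for _ in range(50000):
    X = X - eta * 4 * (X @ X.T - M) @ X         # gradient step: grad g(X) = 4 (X X^T - M) X

print(np.linalg.norm(M @ X - X @ X.T @ X))      # ~ 0: first-order condition M X = X X^T X
print(np.linalg.norm(X @ X.T - M))              # ~ 0: the local minimum found is global
```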

  • Intuitively, want $X$ to go to the optimal solution $U$.
  • Recall: many equivalent optimal solutions!
  • Idea: find the “closest” among all optimal solutions $$\Delta = X-UR, \qquad R=\arg\min_{R:\,R^TR=I}\|X-UR\|_F$$
  • Nice property: $\|\Delta\Delta^T\|_F^2\le2\|M-XX^T\|_F^2$
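A quick numerical check of the property above (my own sketch; it takes $R$ to range over orthogonal matrices, so the minimizer is given by orthogonal Procrustes via an SVD of $U^TX$):

```python
# Check ||Delta Delta^T||_F^2 <= 2 ||M - X X^T||_F^2 with Delta = X - U R, R the
# orthogonal matrix minimizing ||X - U R||_F (orthogonal Procrustes).
import numpy as np

rng = np.random.default_rng(2)
n, r = 20, 4
U = rng.normal(size=(n, r))
M = U @ U.T

for _ in range(5):
    X = rng.normal(size=(n, r))
    A, _, Bt = np.linalg.svd(U.T @ X)           # U^T X = A diag(s) Bt
    R = A @ Bt                                  # Procrustes solution
    Delta = X - U @ R
    lhs = np.linalg.norm(Delta @ Delta.T) ** 2
    rhs = 2 * np.linalg.norm(M - X @ X.T) ** 2
    print(lhs <= rhs)                           # expected: True in every trial
```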

Main Lemma [GJZ17]

Lemma: If $\|\Delta\Delta^T\|_\Omega^2<3\|M-XX^T\|_\Omega^2$, then either $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2 f(X)[\Delta]<0$. ($\Delta$ is a direction of improvement.)

  • $\|\Delta\Delta^T\|_F^2\le2\|M-XX^T\|_F^2$
  • Immediate proof for matrix factorization
  • For completion, proof works as long as $\|A\|_\Omega\approx \|A\|_F$ for $\Delta\Delta^T$ and $M-XX^T$
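The reason $\|A\|_\Omega$ can stand in for $\|A\|_F$: under uniform sampling with rate $p$, $\mathbb E[\|A\|_\Omega^2]=p\|A\|_F^2$, and for well-spread (incoherent) low-rank matrices the two concentrate tightly. A small check (my own sketch):

```python
# For a "spread out" low-rank matrix A, ||A||_Omega^2 concentrates around p * ||A||_F^2.
import numpy as np

rng = np.random.default_rng(3)
n, r, p = 200, 3, 0.2
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))    # incoherent low-rank matrix

for _ in range(3):
    mask = rng.random((n, n)) < p                        # fresh uniform sample Omega
    ratio = np.sum(mask * A**2) / (p * np.linalg.norm(A)**2)
    print(ratio)                                         # close to 1 in each trial
```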

Restricted isometry property

characterizes matrices which are nearly orthonormal, at least when operating on sparse vectors.

The same analysis applies to the asymmetric case, matrix sensing, and robust PCA.

Locally optimizable problems

  • Low rank matrix
    • SVD/PCA, Matrix Completion, Synchronization, Matrix Sensing, GLM, MAX-CUT
  • Low rank tensor
    • Dictionary Learning, Tensor Decomposition, 2-Layer Neural Net

Optimization landscape for neural networks

$$ g_W(x) = \sigma(W_p\sigma(W_{p-1}\sigma(\cdots \sigma(W_1x)\cdots))) $$

$$ f(W) = \mathbb E[\|y - g_W(x)\|^2] $$

Teacher/Student Setting

  • Goal: prove something about optimization
  • Assume there is already a good network ($W^*$) and enough samples
  • Good solution: the student mimics the teacher

Linear Networks

$$ g_W(x) = W_pW_{p-1}\cdots W_1x $$

[Kawaguchi 16], [YSJ18]

  • All local minima of a linear neural network (with squared loss) are global* (a sketch follows this list)
  • With 2 layers, there are no higher-order saddle points.
  • With 3 or more layers, there are higher-order saddles.
  • For a critical point, consider the rank $r$ of the product of all layers: if $r = \min(m, n)$, it is a (local and global) minimum; if $r < \min(m, n)$, it can also be an ordinary saddle point.
  • Open problem: does local search actually find a global minimum?
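A sketch of the first bullet (my own code; a three-layer linear network on synthetic near-linear data, with arbitrary sizes, initialization scale, and step size): gradient descent from a generic random start reaches roughly the same loss as the best single linear map, i.e. it is not trapped at a spurious local minimum.

```python
# Deep linear network g_W(x) = W3 W2 W1 x trained with the (mean) squared loss.
import numpy as np

rng = np.random.default_rng(4)
d, h, m, N = 10, 10, 5, 100
X = rng.normal(size=(d, N))
Y = rng.normal(size=(m, d)) @ X + 0.1 * rng.normal(size=(m, N))   # near-linear teacher

W1 = 0.3 * rng.normal(size=(h, d))
W2 = 0.3 * rng.normal(size=(h, h))
W3 = 0.3 * rng.normal(size=(m, h))
eta = 2e-3
for _ in range(50000):
    E = (W3 @ W2 @ W1 @ X - Y) / N                    # residual, scaled by 1/N
    G1 = 2 * (W3 @ W2).T @ E @ X.T                    # gradients of the mean squared loss
    G2 = 2 * W3.T @ E @ (W1 @ X).T
    G3 = 2 * E @ (W2 @ W1 @ X).T
    W1, W2, W3 = W1 - eta * G1, W2 - eta * G2, W3 - eta * G3

best = Y @ X.T @ np.linalg.inv(X @ X.T)               # unconstrained least-squares map
print(np.linalg.norm(W3 @ W2 @ W1 @ X - Y)**2 / N)    # these two losses should be close
print(np.linalg.norm(best @ X - Y)**2 / N)
```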

Two-Layer Neural Network

$$ g(x) = a^T\sigma(Bx) $$

  • Wlog: rows of $B$ ($b_i$) are unit norm.
  • Data: $x\sim N(0, I)$, $y$ from a teacher.
  • Some more technical assumptions on $a^*$, $B^*$
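A minimal sketch of this setting (my own; ReLU activation and the sizes are assumptions on my part): the teacher $(a^*, B^*)$ generates labels on Gaussian inputs, and a student $(a, B)$ is scored by the empirical version of $f(a, B)=\mathbb E[(y-a^T\sigma(Bx))^2]$. The next slide's point is that this landscape can contain bad local minima.

```python
# Teacher/student setup for a two-layer network g(x) = a^T sigma(B x), sigma = ReLU.
import numpy as np

rng = np.random.default_rng(5)
d, k, N = 10, 5, 10000
relu = lambda z: np.maximum(z, 0.0)

B_star = rng.normal(size=(k, d))
B_star /= np.linalg.norm(B_star, axis=1, keepdims=True)   # wlog unit-norm rows
a_star = rng.normal(size=k)

X = rng.normal(size=(N, d))                               # x ~ N(0, I)
y = relu(X @ B_star.T) @ a_star                           # labels from the teacher

def f(a, B):
    # empirical risk of a student (a, B)
    return np.mean((y - relu(X @ B.T) @ a) ** 2)

print(f(a_star, B_star))                                  # 0: the teacher fits itself
print(f(rng.normal(size=k), rng.normal(size=(k, d))))     # much larger for a random student
```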

Bad local minima

Claim: The objective function $f(a, B)$ has local minima that are not equivalent to the ground truth.

Landscape Design

  • Idea: Design a new objective with no bad local min.
  • Implicit in many previous techniques
    • Regularization
    • Method of moments instead of MLE

Provable New Objective

Theorem[GLM17]: One can construct an objective for a two-layer neural network such that all local minima are global.*

  • Objective inspired by tensor decomposition
  • Relies on Gaussian distribution
  • Extended to symmetric input distribution* by [GKLW18]

Theorem[GKLW18]: For a two-layer neural network with more outputs than hidden units, if the input distribution is symmetric, there is a polynomial-time algorithm that learns the neural network.

Theorem[GWZ19]: For a 2/3-layer neural network with quadratic/polynomial activations, if the inputs are in general position and

#parameters = O(1) · #training samples,

then GD can memorize the training data.

Spin Glass Model

Claim[CHMBL15]: all the local minima of a neural network have approximately equal function values.


Kac-Rice Formula

$$ \mathbb E[\#\{\text{local min in } Z\}] = \int_x\mathbb E\big[\,|\det(\nabla^2 f(x))|\cdot \mathbf 1(\nabla^2 f(x)\succeq0)\,\mathbf 1(x\in Z)\,\big|\,\nabla f(x)=0\big]\, p_{\nabla f(x)}(0)\,dx $$

  • Proof idea: Count number of local min directly
  • Evaluate the formula using random matrix theory
  • Formal for the spin-glass model (random polynomials); informal connection with neural nets
  • Formal results for overcomplete tensor decomposition and tensor PCA

Dynamics/Trajectory

  • Main idea: instead of analyzing the global landscape, analyze the path from a random initialization.
  • Observation: path can be very short
  • Empirical Risk/Training Error
    • [Du, Zhai, Poczos, Singh] Two-layer
    • [Allen-Zhu, Li, Song] [Du, Lee, Li, Wang, Zhai] [Zou, Cao, Zhu, Gu] Multi-layer/ResNet
    • [Allen-Zhu, Li, Song] Recurrent Neural Network
  • Population Risk/Test Error
    • [Li, Liang] Special multiclass classification
    • [Allen-Zhu, Li, Liang] 2 or 3 layer neural network. “Kernel-like” setting, special activation

Beyond Optimization

Accelerated methods are faster but lead to slightly worse generalization.

Finding a global minimum is not always good enough (trajectory/implicit bias)

Neural Tangent Kernels

Optimization is easy (effectively linear) in the highly overparametrized regime.
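A sketch of why (my own illustration, for a width-$m$ two-layer ReLU network $f(x)=\frac{1}{\sqrt m}\sum_r a_r\sigma(w_r^\top x)$ in the standard NTK scaling): at random initialization the empirical tangent-kernel Gram matrix on the training inputs is positive definite, so gradient descent in this regime behaves like linear regression with a fixed kernel.

```python
# Empirical neural tangent kernel of a wide two-layer ReLU network at initialization.
import numpy as np

rng = np.random.default_rng(6)
d, m, N = 10, 5000, 50                       # heavily overparametrized: m >> N
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W = rng.normal(size=(m, d))                  # first-layer weights w_r
a = rng.choice([-1.0, 1.0], size=m)          # second-layer weights a_r

Z = X @ W.T                                  # pre-activations, shape (N, m)
act, der = np.maximum(Z, 0.0), (Z > 0).astype(float)

# Gram matrix of the Jacobian: contribution of a-gradients plus that of w-gradients
K = (act @ act.T) / m + (((der * a) @ (der * a).T) / m) * (X @ X.T)

print(np.linalg.eigvalsh(K).min())           # expected: strictly positive (well-posed kernel)
```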

How many parameters do we need?