Optimization on Manifolds
MIT EECS 6838
1. Optimization on Manifolds
A common optimization problem is:
\begin{equation*} \min_{x \in \mathcal{M}} f(x) \end{equation*}where the manifold \(\mathcal{M}\) can be
- Euclidean Space \(\mathbb{R}^n\)
- Unit Sphere \(S^{n-1}\)
- Rotation Group \(SO(n)\)
E.g., imaging a protein frozen in solution gives samples of protein images from different rotations, and an optimization problem over the rotation group can be used to reconstruct the full 3D structure of the protein. (00:04:08 Cryo-EM)
2. Embedding makes optimization difficult
00:06:59 A typical approach is to embed the manifold in a larger space, and write the problem as a constrained optimization problem.
E.g.
\begin{equation*} \min_{R \in SO(3)} f(R) \to \min_{R^TR = I_3; \det(R) = 1} f(R) \end{equation*}00:08:48 But embedding it in \(\mathbb{R}^{3 \times 3}\) increases the difficulty of the problem: the constraints cut out a nonconvex, curved set in the embedding space.
Figure 1: Using constraints makes optimization difficult
00:10:27 Also, optimizing this way is not straightforward: the objective wants to move in one direction, but the constraint wants to keep the variables in another.
00:10:57 We could take a step and then project. But projecting a point onto a complicated constraint set is not always straightforward.
3. Intrinsic Approach
00:12:04 Instead, optimize the problem in the intrinsic space of the manifold.
3.1. Gradient Descent
There are two main aspects to Gradient Descent:
- Gradient direction \(\nabla f(x_k)\)
- Walking along the domain (i.e. taking the difference)
On a sphere, we can take a derivative (i.e. find a tangent vector) using the Laplace operator. (?)
00:17:25 On the sphere, the exponential map takes a point on the manifold and a tangent vector at that point, and moves the point along the geodesic in that direction.
This allows us to use a gradient-descent-like algorithm in an intrinsic way:
\begin{equation*} x_{k+1} = \exp_{x_k}(-\alpha_k \nabla f(x_k)) \end{equation*}
Figure 2: First-Order Manifold Optimization
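A minimal sketch of this update rule in Python; the names (`riemannian_gd`, `riem_grad`, `retract`) are mine, not from the lecture, and the gradient and retraction are left as pluggable functions:

```python
def riemannian_gd(riem_grad, retract, x0, step=0.1, iters=100):
    """First-order manifold optimization: x_{k+1} = R_{x_k}(-step * grad f(x_k)).

    riem_grad(x)  -- Riemannian gradient at x (a tangent vector at x)
    retract(x, v) -- exponential map, or any retraction, at x applied to v
    """
    x = x0
    for _ in range(iters):
        x = retract(x, -step * riem_grad(x))
    return x
```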
00:19:00 The benefits of this approach are:
- Better algorithms
- Automatic constraint satisfaction
- Specialized to the space
- Theoretical perspective:
- Generalize convexity
- Gradient Descent and other methods have analogues
00:19:45 Books and Resources:
- Optimization Algorithms on Matrix Manifolds
- https://manopt.org
- An introduction to Optimization on Smooth Manifolds (https://nicolasboumal.net/#book)
00:26:41 Gradient descent is a first-order algorithm, so instead of the difficult-to-compute exponential map, we can use a retraction map, which agrees with the exponential map up to first order.
4. Unit Sphere
The tangent space of the sphere at a point \(p\) is the space of vectors orthogonal to \(p\):
\begin{equation*} T_p S^{n-1} = \{ v \in \mathbb{R}^n : v \cdot p = 0 \} \end{equation*}Our optimization problem is:
\begin{equation*} \min_{x \in S^{n-1}} x^T A x \end{equation*}
4.1. Computing gradient
We can view the objective function on the sphere as a restriction of an objective function on all of \(\mathbb{R}^n\).
In this case (the sphere), if we want the intrinsic gradient of a function \(f\), we can compute the extrinsic gradient in \(\mathbb{R}^n\) and project it onto the tangent space:
\begin{equation*} \nabla_{S^{n-1}} f (p) = (I_{n \times n} - pp^T) \nabla_{\mathbb{R}^n} f(p) \end{equation*}
4.2. Retraction
For gradient descent, we need a gradient in the intrinsic space and also a retraction. In this case there are two (typical) options for a retraction: Exponential Map and Projection.
4.2.1. Exponential Map
This function maps a tangent vector \(v\) at \(p\) to the point on the sphere at distance \(||v||\) from \(p\) along the geodesic in that direction.
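Concretely, for the unit sphere the exponential map has the standard closed form (for \(v \neq 0\)):
\begin{equation*} \exp_p(v) = \cos(||v||)\, p + \sin(||v||)\, \frac{v}{||v||} \end{equation*}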
4.2.2. Projection
Take a gradient step off of the sphere and project it back orthogonally.
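Written out (a standard choice of projection retraction, spelled out here for concreteness):
\begin{equation*} R_p(v) = \frac{p + v}{||p + v||} \end{equation*}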
\(R_p(v)\) and \(\exp_p(v)\) don't match when \(v\) is large, but they are the same locally (they agree to first order).
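Putting the pieces together, here is a minimal NumPy sketch of gradient descent on the sphere for the quadratic objective above, using the projected gradient and the projection retraction; the function names (`sphere_grad`, `sphere_retract`, `minimize_on_sphere`) and step sizes are illustrative, not from the lecture:

```python
import numpy as np

def sphere_grad(A, x):
    g = A @ x                      # extrinsic gradient of f(x) = 1/2 x^T A x
    return g - (x @ g) * x         # project onto T_x S^{n-1}: (I - x x^T) g

def sphere_retract(x, v):
    y = x + v
    return y / np.linalg.norm(y)   # step off the sphere, then normalize back

def minimize_on_sphere(A, x0, step=0.1, iters=500):
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        x = sphere_retract(x, -step * sphere_grad(A, x))
    return x
```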
5. Stiefel Manifold
The manifold of \(n \times k\) matrices with orthonormal columns.
If \(k=1\), this is the same as \(S^{n-1}\).
00:53:32 For manifold optimization we need to define (see the sketch after Figure 3):
- Tangent Space
- Projection on the tangent space
- Retraction
Figure 3: Tangent Space and Retraction for Stiefel Manifold
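A minimal sketch of these ingredients for the Stiefel manifold, assuming the standard embedded (Euclidean) metric tangent projection and the QR-based retraction (as found in the books listed above); the function names are mine:

```python
import numpy as np

# St(n, k) = { X in R^{n x k} : X^T X = I_k }

def stiefel_project(X, Z):
    """Project an ambient matrix Z onto the tangent space at X."""
    sym = 0.5 * (X.T @ Z + Z.T @ X)
    return Z - X @ sym

def stiefel_retract(X, V):
    """QR-based retraction: orthonormalize X + V."""
    Q, R = np.linalg.qr(X + V)
    # flip signs so the diagonal of R is positive, making the factor unique
    return Q * np.sign(np.diag(R))
```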
6. Rayleigh Quotient Minimization
01:17:36 Minimizing \(\frac 1 2 x^T A x\) over the unit sphere is equivalent to computing the eigenvector of the minimum eigenvalue of \(A\), as the stationarity condition of the Lagrangian below indicates:
\begin{equation*} \Lambda(x; \lambda) = \frac 1 2 x^T A x + \lambda (\frac 1 2 - \frac 1 2 ||x||^2_2) \end{equation*} \begin{equation*} 0 = \nabla_x \Lambda = Ax - \lambda x \implies Ax = \lambda x \end{equation*}We then define the tangent space and retraction and iterate to find the solution.
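As a quick sanity check, reusing the hypothetical `minimize_on_sphere` sketch from the unit-sphere section, the Rayleigh quotient at the result should approach the smallest eigenvalue of \(A\):

```python
import numpy as np

# Assumes minimize_on_sphere from the unit-sphere sketch above.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B + B.T                          # symmetric test matrix
x = minimize_on_sphere(A, rng.standard_normal(5), step=0.05, iters=2000)
print(x @ A @ x)                     # Rayleigh quotient at the result
print(np.linalg.eigh(A)[0][0])       # smallest eigenvalue, for comparison
```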
Also, to find the eigenvector of any other eigenvalue, we can formulate another problem:
\begin{equation*} \widetilde{f}(x) = ||\nabla_{S^{n-1}} (\frac 1 2 x^T A x)||_2^2 \end{equation*}because the Riemannian gradient is zero at every eigenvector, and the squared \(\ell_2\) norm makes those zeros the minima of \(\widetilde{f}\).
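To see why, expanding \(\widetilde{f}\) with the projected gradient from Section 4.1 (a short derivation of my own), for \(x \in S^{n-1}\):
\begin{equation*} \widetilde{f}(x) = ||(I - xx^T) A x||_2^2 = x^T A^2 x - (x^T A x)^2, \end{equation*}
which is zero exactly when \(Ax\) is parallel to \(x\), i.e. when \(x\) is an eigenvector of \(A\).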
7. Different Variations of PCA as objective functions
01:21:18 Optimizing on the Stiefel manifold corresponds to computing PCA.
Figure 4: Table of PCA Variations
01:22:13 For each variant of PCA, we could devise a custom optimizer. Or we could just use the generic gradient descent algorithm on the Stiefel manifold, with the objective function regularized in various ways.
8. Maths
8.1. Differential
00:21:21 Analogue of derivative
\begin{equation*} df_{x_0} (v) = \lim_{h \to 0} \frac {f(x_0 + h v) - f(x_0)} h \end{equation*} \begin{equation*} df_{x_0}(v) = Df(x_0) \cdot v \end{equation*}
8.2. Tangent Space
00:21:49 Set of tangent vectors at a point.
8.3. Exponential Map
\begin{equation*} \exp_p(v) = \gamma_v(1) \end{equation*}where \(\gamma_v\) is the (unique) geodesic from \(p\) with velocity \(v\).
8.4. Retraction
00:23:50 In general, geodesic curves are difficult to compute. But we don't always need a perfect geodesic.
We can replace the exponential map with a retraction, which is a weaker notion of the exponential map; they are the same only up to first order.
Figure 5: Retraction
8.5. Riemannian Manifold
00:28:16 A pair \((M,g)\) of a differentiable manifold \(M\) and a positive definite inner product \(g_p\) at each point \(p\).
\(g_p(\cdot, \cdot): T_p M \times T_p M \to \mathbb{R}\)
\(g_p\) is a symmetric, bilinear, & positive definite form.
8.6. Riemannian Gradient
- Metric tensor \(g \in \mathbb{R}^{n \times n}\)
- Gradient in coordinates \(\nabla f \in \mathbb{R}^n\)
\begin{equation*} \nabla_g f = g^{-1} \nabla f \end{equation*}
\(\phi\) maps a parametrization of \(M\) in \(\mathbb{R}^n\) to \(M\)
Figure 6: Gradient in Riemannian Manifold
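A small worked example of this formula (my own, not from the lecture): in polar coordinates \((r, \theta)\) on the plane, the metric is \(g = \mathrm{diag}(1, r^2)\), so in the coordinate basis
\begin{equation*} \nabla_g f = g^{-1} \nabla f = \left( \frac{\partial f}{\partial r},\ \frac{1}{r^2} \frac{\partial f}{\partial \theta} \right), \end{equation*}
which recovers the usual formula for the gradient in polar coordinates.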