Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
Table of Contents
- 1. Gradient Descent => Kernel Machines
- 2. Kernel Machines => 1 Layer NN
- 3. Deep NNs seem more expressive, but the learned function depends on the learning algorithm
- 4. Kernel
- 5. Do Deep NNs learn representations?
- 6. Using derivatives in the kernel helps combat the curse of dimensionality
- 7. Path kernels save storage space and matching time
- 8. Gradient descent can be viewed as a Boosting algorithm
- 9. Probabilistic Models are Kernel Density Estimation
- 10. Notes from the talk
A paper by Pedro Domingos.
- Link to Paper
- Link to talk: Deep Networks Are Kernel Machines, University of Lisbon, 2024 (from the author's website)
1. Gradient Descent => Kernel Machines
Deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples.
2. Kernel Machines => 1 Layer NN
Kernel machines can be viewed as neural networks with one hidden layer, with the kernel as the nonlinearity.
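To make the correspondence concrete, here is a minimal sketch (Gaussian RBF kernel and made-up data, not taken from the paper) of a kernel machine read as a one-hidden-layer network: each hidden unit applies the kernel to the query and one stored training example, and the output layer holds the example weights \(a_i\) and the bias \(b\).

```python
import numpy as np

def rbf_kernel(x, x_i, gamma=1.0):
    """Kernel nonlinearity of one 'hidden unit' centered on training example x_i."""
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def kernel_machine_predict(x, X_train, a, b, gamma=1.0):
    """y = sum_i a_i * K(x, x_i) + b, read as a one-hidden-layer network:
    hidden layer = kernel responses to the stored examples,
    output layer = example weights a and bias b."""
    hidden = np.array([rbf_kernel(x, x_i, gamma) for x_i in X_train])
    return hidden @ a + b

# Made-up data for illustration only.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
a = np.array([0.5, -1.0, 2.0])   # one example weight per training point
b = 0.1
print(kernel_machine_predict(np.array([1.0, 0.5]), X_train, a, b))
```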
3. Deep NNs seem more expressive, but the learned function depends on the learning algorithm
But a deep network would seem to be irreducible to a kernel machine, since it can represent some functions exponentially more compactly than a shallow one. (pg. 2)
Whether a representable function is actually learned, however, depends on the learning algorithm.
Learning by gradient descent is a strong enough constraint that the end result is guaranteed to be approximately a kernel machine (pg. 2)
4. Kernel
The function \(y\) learned by gradient descent is the same as:
\begin{equation} y = y_0 - \int_{c(t)} \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} K_{y, w}^g(x, x_i) \, dt \end{equation}

which can be written as a kernel machine:

\begin{equation} y = \sum_{i=1}^m a_i K(x, x_i) + b \end{equation}

Here,
- \(m\) is the number of samples
- \(y_0\) is the initial model
- \(c(t)\) is the path followed by weights \(w\) during training
- \(K_{y, w}^g(x, x_i) = \nabla_w y_w(x) \cdot \nabla_w y_w(x_i)\) is the tangent kernel: the similarity (dot product) of the model's gradients at the query \(x\) and the training sample \(x_i\). The path kernel \(K(x, x_i)\) in the second equation is this tangent kernel integrated along the path \(c(t)\).
This differs from typical kernel machines in that the \(a_i\) ’s and \(b\) depend on \(x\). Nevertheless, the \(a_i\) ’s play a role similar to the example weights in ordinary SVMs and the perceptron algorithm (pg. 5)
The proof above is for batch gradient descent, which uses all training data points at each step. To extend it to stochastic gradient descent, which uses a subsample, it suffices to multiply each term in the summation over data points by an indicator function \(I_i(t)\) (pg. 6)
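As an illustration of the kernel itself, here is a small sketch (toy model, made-up weights and inputs, none of it from the paper) that computes the tangent kernel \(K_{y, w}^g(x, x_i)\) at a fixed weight vector \(w\) using finite-difference gradients; the path kernel in the first equation is this quantity accumulated along the training trajectory \(c(t)\).

```python
import numpy as np

def model(w, x):
    """A tiny differentiable model y_w(x), used only to illustrate the kernel."""
    return np.tanh(x @ w[:3]) * w[3] + w[4]

def grad_w(w, x, eps=1e-5):
    """Finite-difference approximation of the gradient of y_w(x) w.r.t. the weights w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        g[j] = (model(w + dw, x) - model(w - dw, x)) / (2 * eps)
    return g

def tangent_kernel(w, x, x_i):
    """K^g_{y,w}(x, x_i) = grad_w y_w(x) . grad_w y_w(x_i)."""
    return grad_w(w, x) @ grad_w(w, x_i)

w = np.array([0.3, -0.2, 0.5, 1.0, 0.1])          # current weights (made up)
x, x_i = np.array([1.0, 2.0, -1.0]), np.array([0.9, 2.1, -0.8])
print(tangent_kernel(w, x, x_i))                   # large value => the model "sees" the two points as similar
```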
5. Do Deep NNs learn representations?
The most significant implication of our result for deep learning is that it casts doubt on the common view that it works by automatically discovering new representations of the data, in contrast with other machine learning methods, which rely on predefined features.
6. Using derivatives in the kernel helps combat the curse of dimensionality
A key property of path kernels is that they combat the curse of dimensionality by incorporating derivatives into the kernel: two data points are similar if the candidate function’s derivatives at them are similar, rather than if they are close in the input space.
Points that are far in Euclidean space can be close in gradient space, potentially improving the ability to model complex functions. (For example, the maxima of a sine wave are all close in gradient space, even though they can be arbitrarily far apart in the input space.) (pg. 8)
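A quick numeric check of the sine-wave example, assuming a simple parametric model \(y_w(x) = a \sin(bx + c)\) (chosen here for illustration, not specified in the paper): two maxima that are far apart in input space have essentially identical model gradients, so their tangent-kernel similarity is high.

```python
import numpy as np

def grad_w(w, x):
    """Analytic gradient of y_w(x) = a*sin(b*x + c) w.r.t. w = (a, b, c)."""
    a, b, c = w
    return np.array([np.sin(b * x + c),
                     a * x * np.cos(b * x + c),
                     a * np.cos(b * x + c)])

w = np.array([1.0, 1.0, 0.0])                 # y_w(x) = sin(x)
x1, x2 = np.pi / 2, np.pi / 2 + 10 * np.pi    # two maxima of sin(x), far apart in input space

g1, g2 = grad_w(w, x1), grad_w(w, x2)
print("input distance:   ", abs(x2 - x1))             # ~31.4
print("gradient distance:", np.linalg.norm(g1 - g2))  # ~0: the maxima coincide in gradient space
print("tangent kernel:   ", g1 @ g2)                  # ~1: high similarity despite the input distance
```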
7. Path kernels save storage space and matching time
Kernel machines need to store all the training examples for matching, but deep neural networks do not: the examples are effectively all stored and matched simultaneously via their superposition in the model parameters, so storage space and matching time are independent of the number of examples.
8. Gradient descent can be viewed as a Boosting algorithm
Gradient descent can be viewed as a boosting algorithm, with tangent kernel machines as the weak learner and path kernel machines as the strong learner obtained by boosting it.
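A small sketch of this view, using a toy model, squared loss, and made-up data (all illustrative assumptions): to first order, the change in the prediction caused by one gradient-descent step is a tangent kernel machine, i.e. one "weak learner" added to the current model.

```python
import numpy as np

def model(w, x):
    """Tiny example model y_w(x) (illustrative only)."""
    return np.tanh(w[0] * x) * w[1] + w[2]

def grad_w(w, x, eps=1e-6):
    """Finite-difference gradient of y_w(x) w.r.t. w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        g[j] = (model(w + dw, x) - model(w - dw, x)) / (2 * eps)
    return g

# Made-up data; squared loss L = 0.5 * sum_i (y_i - t_i)^2, so dL/dy_i = y_i - t_i.
X, T = np.array([-1.0, 0.5, 2.0]), np.array([0.0, 1.0, -1.0])
w, lr = np.array([0.8, 1.2, -0.1]), 0.01

dL_dy = np.array([model(w, x) - t for x, t in zip(X, T)])
w_new = w - lr * sum(d * grad_w(w, x) for d, x in zip(dL_dy, X))   # one gradient-descent step

# To first order, the resulting change in prediction at a query point is a tangent kernel machine.
x_q = 1.0
actual = model(w_new, x_q) - model(w, x_q)
weak = -lr * sum(d * (grad_w(w, x_q) @ grad_w(w, x)) for d, x in zip(dL_dy, X))
print(actual, weak)   # nearly equal: each step adds one tangent-kernel-machine "weak learner"
```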
9. Probabilistic Models are Kernel Density Estimation
Every probabilistic model learned by gradient descent, including Bayesian networks (Koller and Friedman, 2009), is a form of kernel density estimation (Parzen, 1962).
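For reference, a minimal sketch of what kernel density estimation computes (a Parzen-window estimator with a Gaussian kernel and made-up 1-D samples, chosen here only to illustrate the term): a density built by placing one kernel on each training example and averaging.

```python
import numpy as np

def parzen_density(x, X_train, h=0.5):
    """Parzen-window estimate p(x) = (1/m) * sum_i K_h(x - x_i), Gaussian kernel of bandwidth h."""
    z = (x - X_train) / h
    return np.mean(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi)))

X_train = np.array([-1.2, -0.8, 0.1, 1.5, 1.7])   # made-up samples
print(parzen_density(0.0, X_train))                # density estimate at x = 0
```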
10. Notes from the talk
- We know all these theorems showing that some functions can be represented exponentially more compactly when we use Deep Neural Networks. (00:07:16)
- But there is a difference between being able to represent something and being able to learn it. Gradient descent is a simple procedure; maybe there is a better learning method than backprop that can learn it. (00:07:38)