What if Neural Networks had SVDs?
Table of Contents
1. Weights in SVD Representation
2. U and V in Householder Representation
3. Parallelizing Householder products
[Summary of the paper: What if Neural Networks had SVDs? by Alexander Mathiasen et al.]
1. Weights in SVD Representation
Sometimes we might need the SVD of the weight matrices of neural network layers. A few such cases are:
- to reduce the rank of the weight matrix [1], which is useful for model compression,
- to stabilize the training of RNNs [2],
- or to perform other matrix operations (e.g. computing the inverse) on the weight matrix.
To achieve this we could train the network normally and compute the SVD afterwards. But computing the SVD is costly, and this may not make sense for our use case.
Another approach is to represent the weight matrix of a layer (\(W \in \mathbb{R}^{d \times d}\)) in decomposed form.
Let \(W = U \Sigma V^T\) be its SVD, with \(U, V \in \mathbb{R}^{d \times d}\) orthogonal and \(\Sigma \in \mathbb{R}^{d \times d}\) diagonal, and treat \(U\), \(\Sigma\), \(V\) as the parameters. The goal is then to perform gradient descent on \(W\) while preserving the SVD structure. Consider the update:
\begin{equation*} U' = U - \eta \nabla_{U}, \Sigma' = \Sigma - \eta \nabla_{\Sigma}, V' = V - \eta \nabla_{V} \end{equation*}The updates preserve the diagonal property of \(\Sigma\), but the orthogonality of \(U\) and \(V\) is not preserved.
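To see this concretely, here is a minimal NumPy sketch (not from the paper; random matrices stand in for the gradients \(\nabla_U, \nabla_\Sigma, \nabla_V\)): \(\Sigma\) stays diagonal after the step, but \(U'^T U'\) is no longer the identity.

```python
# Minimal sketch: a plain gradient-style step on the SVD factors keeps Sigma
# diagonal but destroys the orthogonality of U (and likewise V).
import numpy as np

rng = np.random.default_rng(0)
d, eta = 8, 0.1

# Start from an exact SVD of a random weight matrix.
U, s, Vt = np.linalg.svd(rng.standard_normal((d, d)))
Sigma, V = np.diag(s), Vt.T

# Random stand-ins for the gradients dL/dU, dL/dSigma, dL/dV.
gU, gV = rng.standard_normal((d, d)), rng.standard_normal((d, d))
gSigma = np.diag(rng.standard_normal(d))  # only the diagonal of Sigma is a parameter

U2, Sigma2, V2 = U - eta * gU, Sigma - eta * gSigma, V - eta * gV

print(np.allclose(Sigma2, np.diag(np.diag(Sigma2))))  # True: Sigma' is still diagonal
print(np.linalg.norm(U2.T @ U2 - np.eye(d)))          # clearly nonzero: U' is not orthogonal
```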
One approach for training would then be to use regularization [1] to keep \(U\) and \(V\) orthogonal. But this only enforces orthogonality approximately. The approach adopted in this paper, called FastH, is exact and fast.
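For concreteness, the regularization alternative mentioned above usually adds a soft orthogonality penalty to the training loss; the helper below is an illustrative sketch, not the exact loss used in [1].

```python
# Illustrative soft orthogonality penalty (a sketch, not the exact formulation of [1]).
import numpy as np

def orthogonality_penalty(U, V, lam=1e-2):
    """lam * (||U^T U - I||_F^2 + ||V^T V - I||_F^2), added to the task loss."""
    I = np.eye(U.shape[1])
    return lam * (np.linalg.norm(U.T @ U - I, "fro") ** 2
                  + np.linalg.norm(V.T @ V - I, "fro") ** 2)
```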
2. U and V in Householder Representation
If we represent the orthogonal matrices \(U\) and \(V\) as products of Householder matrices:
\begin{equation*} U = \Pi_{i=1}^{d} H_i \ ; \quad H_i = I - 2 \frac {v_i v_i^T} {||v_i||_2^2} \end{equation*}then the updates \(v_i \leftarrow v_i - \eta \nabla_{v_i}\) preserve the orthogonality of \(U\) exactly, since every \(H_i\) is orthogonal for any nonzero \(v_i\).
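A small NumPy sketch of this property (the updates to the \(v_i\) are random stand-ins for gradient steps): \(U\) built from Householder matrices stays orthogonal to machine precision both before and after the update.

```python
# Sketch: U parameterized as a product of Householder matrices remains orthogonal
# under arbitrary updates of the Householder vectors v_i.
import numpy as np

rng = np.random.default_rng(0)
d, eta = 8, 0.1

def householder(v):
    """H = I - 2 v v^T / ||v||^2, orthogonal for any nonzero v."""
    return np.eye(len(v)) - 2.0 * np.outer(v, v) / (v @ v)

def build_U(vs):
    U = np.eye(len(vs[0]))
    for v in vs:
        U = U @ householder(v)
    return U

vs = [rng.standard_normal(d) for _ in range(d)]
vs_new = [v - eta * rng.standard_normal(d) for v in vs]  # stand-in for v_i <- v_i - eta * grad

for U in (build_U(vs), build_U(vs_new)):
    print(np.linalg.norm(U.T @ U - np.eye(d)))  # ~1e-15 both before and after the update
```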
Now, however, we have a different problem. The time to compute the forward pass on a mini-batch \(X \in \mathbb{R}^{d \times m}\) has increased: previously we only needed to compute a single matrix product \(UX\), but now we need to compute the product with \(d\) Householder matrices, \((\Pi_i H_i) X\), sequentially.
Note that the computational complexity stays the same: a single product with a Householder matrix (\(H_i X\)) can be done in \(O(dm)\) time, so \((\Pi_i H_i) X\) takes \(O(d^2m)\) time, the same as \(UX\). But the run time on GPUs increases because of the sequential nature of the products.
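The sequential computation looks like the following sketch (assuming a mini-batch \(X \in \mathbb{R}^{d \times m}\)): each reflection costs only \(O(dm)\) and never forms \(H_i\) explicitly, but the \(d\) applications must happen one after another.

```python
# Sketch of the sequential forward pass (Pi_i H_i) X: O(dm) per Householder
# reflection, O(d^2 m) in total, but d strictly sequential steps.
import numpy as np

def apply_householder(v, X):
    """Return H X = X - 2 v (v^T X) / ||v||^2 without forming H."""
    return X - 2.0 * np.outer(v, v @ X) / (v @ v)

def sequential_forward(vs, X):
    # U X with U = H_1 H_2 ... H_d, so H_d is applied to X first.
    for v in reversed(vs):
        X = apply_householder(v, X)
    return X

rng = np.random.default_rng(0)
d, m = 64, 8
vs = [rng.standard_normal(d) for _ in range(d)]
X = rng.standard_normal((d, m))
Y = sequential_forward(vs, X)  # d sequential O(dm) steps
```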
3. Parallelizing Householder products
The FastH algorithm solves this problem with the same \(O(d^2m)\) complexity, but with \(O(d/m + m)\) sequential matrix-matrix operations instead of \(O(d)\) sequential vector-vector operations.
The approach is to divide the \(d\) Householder matrices into \(d/m\) groups of \(m\) matrices each (assume that \(m\) divides \(d\) evenly), then:
- For each group \(k = 1, \dots, d/m\), compute the product of the Householder matrices in that group:
\begin{equation*} P_k = \Pi_{i=(k-1)m+1}^{km} H_i \end{equation*}Each \(P_k\) can be computed in \(O(dm^2)\) time with \(O(m)\) sequential steps, by keeping \(P_k\) in a compact decomposed form rather than forming it as an explicit \(d \times d\) matrix. Since the \(d/m\) groups are independent, they can be processed in parallel: \(O(d^2m)\) time in total, but still only \(O(m)\) sequential steps.
- Sequentially multiply the \(P_k\) matrices with \(X\). Each multiplication takes \(O(dm^2)\) time, thus \(O(d^2m)\) in total, and since there are only \(d/m\) matrices \(P_k\), this requires only \(O(d/m)\) sequential steps.
Thus, although the total complexity remains \(O(d^2m)\), the number of sequential steps is only \(O(d/m + m)\).
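Below is a structural sketch of the two steps (illustration only: NumPy runs everything serially, and the group products are formed as explicit \(d \times d\) matrices for readability, whereas FastH keeps them in a compact form so that each step stays within the \(O(dm^2)\) bound stated above).

```python
# Structural sketch of the FastH grouping idea: d/m independent group products
# (parallel in FastH), followed by only d/m sequential multiplications with X.
import numpy as np

def householder(v):
    return np.eye(len(v)) - 2.0 * np.outer(v, v) / (v @ v)

def grouped_forward(vs, X, m):
    d = X.shape[0]
    assert len(vs) == d and d % m == 0
    # Step 1 (parallel across groups in FastH): P_k = H_{(k-1)m+1} ... H_{km}.
    P = []
    for k in range(d // m):
        Pk = np.eye(d)
        for v in vs[k * m:(k + 1) * m]:
            Pk = Pk @ householder(v)
        P.append(Pk)
    # Step 2 (sequential): only d/m matrix products with X, applied right-to-left
    # so that H_d touches X first.
    for Pk in reversed(P):
        X = Pk @ X
    return X  # equals (H_1 H_2 ... H_d) X

rng = np.random.default_rng(0)
d, m = 64, 8
vs = [rng.standard_normal(d) for _ in range(d)]
X = rng.standard_normal((d, m))
Y = grouped_forward(vs, X, m)
```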
Similarly, the backward pass can also be computed with the same complexity.
Footnotes:
[1] Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification, CVPR Workshops. This paper uses the SVD representation of weights to train CNNs and obtain low-rank networks.