Neural Acceleration for Incomplete Factorization Preconditioning
by Joshua Booth
1. Objective
- Use neural acceleration (NA) to speed up the computation of a preconditioner
- A preconditioner is used by iterative solvers to speed up convergence
- We present an NA method for incomplete Cholesky factorization to be used as a preconditioner
In this work, we demonstrate that a simple artificial neural network trained either at compile time or in parallel to the running application on a GPU can provide an incomplete \(LL^T\) factorization that can be used as a preconditioner. [pg. 1]
we explore a neural acceleration method for generating an incomplete Cholesky factorization with zero fill-in that performs as good as or better than a tuned incomplete Cholesky factorization without the overhead of trying different techniques [pg. 2]
This generated preconditioner is as good as or better than the ones found using multiple preconditioning techniques such as scaling and shifting in terms of reduction in number of iterations. [pg. 1]
- Iterative solvers (like PCG) are a common method for SPD systems
A common iterative method for symmetric positive definite (SPD) systems is the Preconditioned Conjugate Gradient (PCG) method as it only relies on sparse matrix-vector multiplication (SpMV) and sparse triangular solve (Stri). [pg. 2]
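As a reminder of how the factor is used, here is a minimal sketch (my own, not from the paper) assuming SciPy: a lower-triangular factor \(L\) from an incomplete \(LL^T\) factorization is applied as a preconditioner inside PCG, where each preconditioner application is just two sparse triangular solves (Stri) and the rest of the iteration is SpMV.

```python
import scipy.sparse.linalg as spla

def pcg_with_ic_factor(A, b, L):
    """A: SPD sparse matrix (CSR); L: sparse lower-triangular factor (CSR)."""
    def apply_preconditioner(r):
        # Preconditioner solve M^{-1} r = L^{-T} (L^{-1} r): two triangular solves.
        y = spla.spsolve_triangular(L, r, lower=True)                # solve L y = r
        return spla.spsolve_triangular(L.T.tocsr(), y, lower=False)  # solve L^T z = y

    M = spla.LinearOperator(A.shape, matvec=apply_preconditioner)
    return spla.cg(A, b, M=M)  # returns (x, info)
```

A better incomplete factor only changes how many iterations `cg` needs; the per-iteration kernels stay SpMV and Stri.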
2. Traditional Methods
2.1. Incomplete Cholesky is difficult to parallelize
Incomplete Cholesky are black box methods that are typically used to generate these preconditioners. These methods normally require trying techniques such as scaling, shifting, and identifying fill-in to achieve the desired reduction in iterations. However, the algorithm of incomplete factorization tends to be difficult to parallelize due to the low computational intensity. [pg. 2]
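For reference, a textbook right-looking IC(0) on a dense copy (my own illustrative code, not the paper's): the factor is restricted to A's nonzero pattern, and the strict dependence on the outer index k is part of why the factorization is hard to parallelize.

```python
import numpy as np

def ic0_dense(A):
    """Illustrative IC(0): incomplete Cholesky with zero fill-in on a dense copy.
    Only entries already nonzero in A are updated, so L keeps A's lower pattern.
    May break down (negative pivot) for general SPD matrices, which is why
    shifting/scaling are tried in practice."""
    n = A.shape[0]
    pattern = A != 0
    L = np.tril(np.asarray(A, dtype=float))
    for k in range(n):                       # sequential dependence on k
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if pattern[i, k]:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):            # update the remaining lower triangle,
            for i in range(j, n):            # but only inside A's nonzero pattern
                if pattern[i, j]:
                    L[i, j] -= L[i, k] * L[j, k]
    return L
```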
2.2. Traditional NN methods try computing the inverse
Traditional NN methods for solving linear systems try to compute the inverse, but that is error-prone.
Many traditional neural network-inspired approaches try to construct a preconditioner M such that \(M \approx A^{-1}\). In particular, the network itself becomes the output. However, constructing an inverse directly is an error-prone task. [pg. 6]
3. Our Approach
3.1. We train an NN to approximate the matrix in factorized form
- Construct an NN with a hidden layer such that it computes \(Ax\) in the form \(LL^Tx\)
- Sample some \(x\) and \(y\), then train the parameters of the NN that represent \(L\)
- The obtained \(L\) is the preconditioner (see the sketch at the end of this subsection)
The neural acceleration method could take in the sparse matrix A and generate samples X and Y in order to train L. In our experimental results, we demonstrate that the number of samples needed is relatively small (i.e., \(\sqrt{N}\) where \(N = \text{dimension}(A)\)). For the output, the method could either output L to be used by the problem in its iterative solver package or function pointers to apply sparse triangular solve for this on the GPU where it was generated. [pg. 7]
- Limiting the preconditioner \(L\) to the nonzero pattern of \(A\) is a common technique
Limiting the nonzero pattern of the preconditioner to that of the nonzero pattern of the input matrix is commonly done in many traditional preconditioner methods. Therefore, we do not perceive this as a major limitation as many traditional methods also use it. [pg. 7]
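A minimal sketch of the setup described above (the function names and the dense representation are mine, not the paper's): the "network" is a single parameter matrix L restricted to the lower-triangular nonzero pattern of A, and the forward pass computes \(L(L^Tx)\), which is trained to match \(Ax\) on sampled vectors.

```python
import numpy as np

def lower_pattern_mask(A_dense):
    """Boolean mask of A's lower-triangular nonzero pattern (zero fill-in)."""
    return np.tril(A_dense != 0)

def make_training_data(A_dense, num_samples, rng):
    """Sample random vectors X (columns) and targets Y = A X; the paper reports
    that roughly sqrt(N) samples suffice, where N = dimension(A)."""
    n = A_dense.shape[0]
    X = rng.standard_normal((n, num_samples))
    Y = A_dense @ X
    return X, Y

def forward(L, x):
    """Two-layer linear 'network': hidden layer computes L^T x,
    output layer computes L (L^T x), i.e. an approximation of A x."""
    return L @ (L.T @ x)
```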
3.2. Training Parameters
- Depends on the number of iterations and the optimization algorithm
The training cost would depend on both the numerical optimization method used and the number of iterations needed to construct a good approximation. [pg. 8]
- We use AdaGrad
AdaGrad is commonly used for training deep learning models with sparse gradients [pg. 9]
- We sample \(X\) from a random distribution [pg. 9]
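Continuing the sketch from Section 3.1 (again my own code under the same assumptions): a plain AdaGrad loop over one random vector per iteration, with α = 0.1 as quoted in the results section. For loss \(\tfrac{1}{2}\|LL^Tx - y\|^2\), with \(z = L^Tx\) and residual \(r = LL^Tx - y\), the gradient with respect to \(L\) is \(rz^T + x(L^Tr)^T\), masked to A's lower-triangular pattern.

```python
import numpy as np

def train_L_adagrad(A_dense, num_iters, alpha=0.1, eps=1e-8, seed=0):
    """Fit L on the lower-triangular pattern of A with AdaGrad,
    one sampled vector per iteration (batch size 1, as in the paper)."""
    rng = np.random.default_rng(seed)
    n = A_dense.shape[0]
    mask = np.tril(A_dense != 0)
    L = np.tril(np.asarray(A_dense, dtype=float))  # one possible initialization: A's lower triangle
    G = np.zeros_like(L)                           # AdaGrad accumulator of squared gradients
    for _ in range(num_iters):
        x = rng.standard_normal(n)                 # X sampled from a random (normal) distribution
        y = A_dense @ x
        z = L.T @ x
        r = L @ z - y                              # residual of the L L^T x ~ A x fit
        grad = (np.outer(r, z) + np.outer(x, L.T @ r)) * mask
        G += grad ** 2
        L -= alpha * grad / (np.sqrt(G) + eps)
    return L
```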
Question 1:
What is the effect of different network initializations and of different sparse matrices (with different ranges of entries)?
E.g., He, uniform, …
Question 2:
What is the effect of sampling X from different distributions?
Normalization is tried in this paper. I don't have other suggestions. Were other distributions tried? Uniform, normal, …
It probably didn't matter.
3.3. Question: Typo?
Even with a sparse matrix that is not full rank, the image remains difficult to see after 25 samples and the error is very high (i.e., > 1). Therefore, the number of samples should be ≤ N to be a successful method. [pg. 9]
Should the number of samples be ≤ N or ≈ N?
3.4. For \(LU\)
- We construct an NN of the form \(LU\)
- We avoid permutation similarly to other ILU codes (see the sketch below)
We avoid the permutation in a similar manner as many other ILU codes by trying to permute large values to the diagonal first and excluding matrices that fail due to stability along the way. [pg. 9]
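The nonsymmetric variant follows the same pattern as the \(LL^T\) sketch; the snippet below (mine, not the paper's code) only shows what changes: two masked parameter matrices L and U with forward pass \(L(Ux)\), trained against \(Ax\). Any permutation of large entries to the diagonal is assumed to have been applied to A beforehand.

```python
import numpy as np

def lu_masks(A_dense):
    """Lower/upper nonzero patterns of A (zero fill-in) for the L and U parameters."""
    return np.tril(A_dense != 0), np.triu(A_dense != 0)

def lu_gradients(L, U, x, y):
    """Gradients of 0.5*||L U x - y||^2 with respect to L and U
    (to be masked to the patterns above before the AdaGrad update)."""
    z = U @ x
    r = L @ z - y
    return np.outer(r, z), np.outer(L.T @ r, x)
```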
4. Results
4.1. Normalization of X doesn't do better
- We use two varieties of X:
- randomly selected X, not normalized (NN), and
- randomly selected X, normalized (NNN) [pg. 12]
- Non-normalized X (i.e., NN) does better
one set of randomly selected X with normalized samples (denoted as NNN)
NN generally does better than NNN. This is a nice finding for two particular reasons. First, normalization does not need to be done, thus saving time in training. Second, the training of NN is much less sensitive. In particular, a simple parameter of α = 0.1 works well with AdaGrad for training, while the training of NNN is much more difficult. [pg. 16]
4.2. \(LL^T\) is ideal even for non-symmetric matrices
4.3. Question: Why just 1 vector per iteration?
A total of \(\sqrt N\) training vectors were generated, and the models were trained iteratively utilizing a batch of 1 vector per iteration for all \(LL^T\) models. [pg. 12]
5. Conclusion
In this work, we developed a neural network modeling method for incomplete factorization that can be utilized for the neural acceleration of preconditioners for sparse iterative solver methods of PCG and GMRES. [pg. 21]
- It is hyperparameter-free
Contrast with ICHOL(k), ICHOL(\(\tau\)), and ILU(k), which have \(k\) or \(\tau\)
- It is as good as ICHOL
- It is 69 times slower, but still within the range of compile time [pg. 18]
But it is more stable,
i.e., it always produces a preconditioner that reduces the iteration count [Abstract]
- And speed might improve as GPU support for sparse multiplication gets better (which is the trend, given the ML community's focus on Graph Neural Networks)
- So it can be used as a black-box method without user intervention
5.1. Question:
In the "Neural Acceleration of Graph Based Utility Functions for Sparse Matrices" paper, the NN was computed at compile time, but here it is done at run time.
How is neural acceleration defined? No compiler tricks or compile-time computation are utilized in this case.
So, is this neural acceleration? (Only when the matrix to factorize is available at compile time can we compute the preconditioner at compile time.)
6. Miscellaneous
6.1. Fill-in is affected by ordering
- Certain orderings (ND, RCM) are known to reduce fill-in.
The amount of fill-in during factorization is a factor of the nonzero pattern. Certain orderings, e.g., nested-dissection (ND) [11] and reverse Cuthill-McKee (RCM) [12], are known to reduce fill-in. In theory, the reduction in fill-in should result in better preconditioners. [pg. 4]
Typo in the paper: "factor or the nonzero" should be "factor of the nonzero".
- And a reduction in fill-in should result in a better preconditioner, but not always (e.g., AMD).
While this idea holds in general, it does not hold for all orderings. An example of this is approximate minimal degree (AMD) ordering which reduces fill-in but may not reduce iteration count to the same degree as RCM. [pg. 4]
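A small example of what applying such an ordering looks like in practice (my own snippet with SciPy, not from the paper): compute the RCM permutation and apply it symmetrically before factorization.

```python
from scipy.sparse.csgraph import reverse_cuthill_mckee

def rcm_reorder(A_csr):
    """Symmetrically permute a sparse CSR matrix by the reverse Cuthill-McKee
    ordering; the reordered matrix typically has a narrower band, which tends
    to reduce fill-in when it is factorized."""
    perm = reverse_cuthill_mckee(A_csr, symmetric_mode=True)
    return A_csr[perm, :][:, perm], perm
```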