PR proposal: AutoBatch-batch/vectorized AD backend #991

@mattsignorelli

Description

I would like to propose an implementation I have that makes it easy to compute the Jacobians of many different (e.g. 100,000) primals in parallel on the CPU/GPU. The only requirement is that the function through which Jacobians are computed must be "vectorized", in the sense that:

f(x) = ... # (GPU) vectorized function accepting SoA memory layout
a = f([1 2 3 4 5 6])  # gives some values
b = f([7 8 9 10 11 12]) # gives some other values

# But instead faster to do it in one evaluation thanks to SIMD/SIMT:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)

In this case, each "batch element" is a row. I will refer to this as the batchdim=1 case.
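To make the batchdim=1 case concrete, here is a toy example (the particular function is just my illustration, any elementwise/row-batched function works the same way):

```julia
# Toy SoA-style function: purely elementwise, so stacking batch elements
# as rows lets one call process the whole batch via SIMD.
f(x) = x .^ 2 .+ 2 .* x

a = f([1 2 3 4 5 6])         # values for batch element 1 (a 1×6 row)
b = f([7 8 9 10 11 12])      # values for batch element 2

# One evaluation over the stacked batch reproduces both results:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)  # → true
```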

Another option is the AoS memory layout, which is not really vectorized but can still be much, much faster in certain cases than evaluating the same function for each primal separately (for example, if there are layers of type instability in the function):

g(x) = ... # AoS memory layout 
a = g([1, 2, 3, 4, 5, 6])  # gives some values
b = g([7, 8, 9, 10, 11, 12]) # gives some other values

# In many cases this is still faster:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)

In this case, each "batch element" is a column, so we can let this case be the batchdim=2 case.
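A toy example of the batchdim=2 case (again, the function here is just my illustration; the point is that it acts on each column independently):

```julia
# Toy AoS-style function: each batch element is a column; processing all
# columns in one call can still beat many separate evaluations.
g(x) = cumsum(x; dims=1)

a = g([1, 2, 3, 4, 5, 6])        # one batch element (a Vector)
b = g([7, 8, 9, 10, 11, 12])     # another batch element

# One evaluation over the column-stacked batch reproduces both results:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)  # → true
```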

For both of these cases, I can get the Jacobians for all "batch elements"/primals together by reshaping the input/output arrays to be 1-dimensional and then using AutoSparse. For the batchdim=2 case, the sparse Jacobian matrix will be block-diagonal, and for the batchdim=1 case, it will have diagonal bands. This works well, and in theory you could just use AutoSparse on this problem and be done. However, a couple of important observations make me believe that a separate type, which I call AutoBatch, distinct from AutoSparse, is truly warranted for this special case:

  1. I found the automatic sparsity detector and coloring algorithm can be really slow when there are e.g. 10,000-100,000 batch elements/primals. Furthermore, for batchdim=1 and batchdim=2 we know the pattern and coloring of the sparse Jacobian beforehand, so we can simply construct the pattern and use a ConstantColoringAlgorithm instead.
  2. Backend-specific optimizations can be made. Immediately, I am thinking of the massive speedup finite differences could achieve if the function is known to be vectorized, or if it is faster to evaluate a batch all at once than to re-evaluate the function many times for each tangent. So, e.g., FiniteDiff could provide special overrides for the AutoBatch{AutoFiniteDiff} type. This would be a more general way to solve "Feature request: vectorized finite difference evaluation" FiniteDiff.jl#210
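To sketch the finite-difference idea in point 2 (this is my own illustration of the principle, not FiniteDiff.jl's actual API): for a row-batched function, perturbing input coordinate j across all batch rows at once costs a single evaluation, so computing the whole batch of Jacobians needs m + 1 evaluations instead of B*(m + 1):

```julia
# Forward-difference Jacobians for all B batch elements of a row-batched
# (batchdim=1) function f, using one perturbed evaluation per input
# coordinate rather than per (batch element, coordinate) pair.
function batched_fd_jacobians(f, X::AbstractMatrix; h=1e-6)
    B, m = size(X)
    Y0 = f(X)                       # one unperturbed batch evaluation
    n = size(Y0, 2)
    J = zeros(B, n, m)              # one n×m Jacobian per batch element
    for j in 1:m
        Xp = copy(X)
        Xp[:, j] .+= h              # perturb coordinate j in EVERY row
        J[:, :, j] = (f(Xp) .- Y0) ./ h
    end
    return J
end

# e.g. for sq(x) = x .^ 2, each batch element's Jacobian is diag(2x):
sq(x) = x .^ 2
J = batched_fd_jacobians(sq, [1.0 2.0; 3.0 4.0])
```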

I currently have a working implementation of AutoBatch in my BatchSolve.jl package. I would be happy to work with you on merging this functionality into DI if there is interest! Feedback is also very welcome.
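For reference, the known sparsity patterns from the reshape trick described above can be constructed directly, e.g. with a Kronecker product (a sketch of the structure only; AutoBatch may build them differently):

```julia
using SparseArrays, LinearAlgebra

B, m, n = 4, 3, 2                  # batch elements; inputs and outputs per element
Ib  = sparse(I, B, B)              # B×B identity pattern
blk = sparse(trues(n, m))          # dense n×m block for one batch element

# batchdim=2: vec() keeps each column contiguous → block-diagonal pattern
pat2 = kron(Ib, blk)               # (n*B) × (m*B), B dense n×m blocks

# batchdim=1: vec() interleaves the batch rows → diagonal bands
pat1 = kron(blk, Ib)
```

Each batch element contributes exactly one dense n×m block, so both patterns have B*n*m structural nonzeros, and the coloring is known ahead of time.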

For example, I use these Jacobians together with cuBLAS's batched routines to do batched Newton solves on 100,000 6x6 systems. It is really fast.
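For anyone curious, the batched Newton idea looks roughly like this on the CPU (a plain-loop sketch of the concept; the actual implementation offloads the per-element linear solves to cuBLAS's batched routines on the GPU):

```julia
using LinearAlgebra

# One Newton step for B independent systems: each column of X is one
# batch element, with its own residual vector and n×n Jacobian.
function batched_newton_step!(X, residuals, jacobians)
    for b in 1:size(X, 2)
        X[:, b] .-= jacobians[b] \ residuals[b]
    end
    return X
end

# Usage: solve x.^2 .= c for two 2-dimensional batch elements.
c = [4.0 9.0; 1.0 16.0]
X = fill(3.0, 2, 2)
for _ in 1:40
    res  = [X[:, b] .^ 2 .- c[:, b]  for b in 1:size(X, 2)]
    jacs = [Diagonal(2 .* X[:, b])   for b in 1:size(X, 2)]
    batched_newton_step!(X, res, jacs)
end
# X converges to the elementwise square roots [2 3; 1 4]
```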

Let me know what you think!
