PR proposal: AutoBatch-batch/vectorized AD backend #991

@mattsignorelli

Description

I would like to propose an implementation I have that makes it easy to compute the Jacobians of many different (e.g. 100,000) primals in parallel on the CPU/GPU. The only requirement is that the function through which Jacobians are computed must be "vectorized", in the sense that:

f(x) = ... # (GPU) vectorized function accepting SoA memory layout
a = f([1 2 3 4 5 6])  # gives some values
b = f([7 8 9 10 11 12]) # gives some other values

# But instead faster to do it in one evaluation thanks to SIMD/SIMT:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)

In this case, each "batch element" is a row. I will refer to this as the batchdim=1 case.
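To make the batchdim=1 case concrete, here is a toy example (the particular function is just my illustration, any elementwise/row-batched function works the same way):

```julia
# Toy SoA-style function: purely elementwise, so stacking batch elements
# as rows lets one call process the whole batch via SIMD.
f(x) = x .^ 2 .+ 2 .* x

a = f([1 2 3 4 5 6])         # values for batch element 1 (a 1×6 row)
b = f([7 8 9 10 11 12])      # values for batch element 2

# One evaluation over the stacked batch reproduces both results:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)  # → true
```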

Another option is the AoS memory layout, which is not really vectorized but can still be much, much faster in certain cases than evaluating the same function for each primal separately (for example, if there are layers of type instability in the function):

g(x) = ... # AoS memory layout 
a = g([1, 2, 3, 4, 5, 6])  # gives some values
b = g([7, 8, 9, 10, 11, 12]) # gives some other values

# In many cases this is still faster:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)

In this case, each "batch element" is a column, so we can let this case be the batchdim=2 case.
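A toy example of the batchdim=2 case (again, the function here is just my illustration; the point is that it acts on each column independently):

```julia
# Toy AoS-style function: each batch element is a column; processing all
# columns in one call can still beat many separate evaluations.
g(x) = cumsum(x; dims=1)

a = g([1, 2, 3, 4, 5, 6])        # one batch element (a Vector)
b = g([7, 8, 9, 10, 11, 12])     # another batch element

# One evaluation over the column-stacked batch reproduces both results:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)  # → true
```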

For both of these cases, I can get the Jacobians for all "batch elements"/primals together by reshaping the input/output arrays to be 1-dimensional and then using AutoSparse. For the batchdim=2 case, the sparse Jacobian matrix will be block-diagonal, and for the batchdim=1 case, it will have diagonal bands. This works well, and in theory you could just use AutoSparse on this problem and be done. However, a couple of important observations make me believe that a separate type, which I call AutoBatch, distinct from AutoSparse, is truly warranted for this special case:

  1. I found the automatic sparsity detector and coloring algorithm can be really slow when there are e.g. 10,000-100,000 batch elements/primals. Furthermore, for batchdim=1 and batchdim=2 we know the pattern and coloring of the sparse Jacobian beforehand, so we can simply construct the pattern and use a ConstantColoringAlgorithm instead.
  2. Backend-specific optimizations can be made. Immediately, I am thinking of the massive speedup finite differences could achieve if the function is known to be vectorized, or if it is faster to evaluate a batch all at once than to re-evaluate the function many times for each tangent. So, e.g., FiniteDiff could provide special overrides for the AutoBatch{AutoFiniteDiff} type. This would be a more general way to solve "Feature request: vectorized finite difference evaluation" FiniteDiff.jl#210
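To sketch the finite-difference idea in point 2 (this is my own illustration of the principle, not FiniteDiff.jl's actual API): for a row-batched function, perturbing input coordinate j across all batch rows at once costs a single evaluation, so computing the whole batch of Jacobians needs m + 1 evaluations instead of B*(m + 1):

```julia
# Forward-difference Jacobians for all B batch elements of a row-batched
# (batchdim=1) function f, using one perturbed evaluation per input
# coordinate rather than per (batch element, coordinate) pair.
function batched_fd_jacobians(f, X::AbstractMatrix; h=1e-6)
    B, m = size(X)
    Y0 = f(X)                       # one unperturbed batch evaluation
    n = size(Y0, 2)
    J = zeros(B, n, m)              # one n×m Jacobian per batch element
    for j in 1:m
        Xp = copy(X)
        Xp[:, j] .+= h              # perturb coordinate j in EVERY row
        J[:, :, j] = (f(Xp) .- Y0) ./ h
    end
    return J
end

# e.g. for sq(x) = x .^ 2, each batch element's Jacobian is diag(2x):
sq(x) = x .^ 2
J = batched_fd_jacobians(sq, [1.0 2.0; 3.0 4.0])
```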

I currently have a working implementation of AutoBatch in my BatchSolve.jl package. I would be happy to work with you on merging this functionality into DI if there is interest! Feedback is also very welcome.
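For reference, the known sparsity patterns from the reshape trick described above can be constructed directly, e.g. with a Kronecker product (a sketch of the structure only; AutoBatch may build them differently):

```julia
using SparseArrays, LinearAlgebra

B, m, n = 4, 3, 2                  # batch elements; inputs and outputs per element
Ib  = sparse(I, B, B)              # B×B identity pattern
blk = sparse(trues(n, m))          # dense n×m block for one batch element

# batchdim=2: vec() keeps each column contiguous → block-diagonal pattern
pat2 = kron(Ib, blk)               # (n*B) × (m*B), B dense n×m blocks

# batchdim=1: vec() interleaves the batch rows → diagonal bands
pat1 = kron(blk, Ib)
```

Each batch element contributes exactly one dense n×m block, so both patterns have B*n*m structural nonzeros, and the coloring is known ahead of time.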

For example, I use these Jacobians together with cuBLAS's batched routines to do batched Newton solves on 100,000 6x6 systems. It is really fast.
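For anyone curious, the batched Newton idea looks roughly like this on the CPU (a plain-loop sketch of the concept; the actual implementation offloads the per-element linear solves to cuBLAS's batched routines on the GPU):

```julia
using LinearAlgebra

# One Newton step for B independent systems: each column of X is one
# batch element, with its own residual vector and n×n Jacobian.
function batched_newton_step!(X, residuals, jacobians)
    for b in 1:size(X, 2)
        X[:, b] .-= jacobians[b] \ residuals[b]
    end
    return X
end

# Usage: solve x.^2 .= c for two 2-dimensional batch elements.
c = [4.0 9.0; 1.0 16.0]
X = fill(3.0, 2, 2)
for _ in 1:40
    res  = [X[:, b] .^ 2 .- c[:, b]  for b in 1:size(X, 2)]
    jacs = [Diagonal(2 .* X[:, b])   for b in 1:size(X, 2)]
    batched_newton_step!(X, res, jacs)
end
# X converges to the elementwise square roots [2 3; 1 4]
```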

Let me know what you think!
