I would like to propose an implementation I have that makes it easy to compute the Jacobians of many different (e.g. 100,000) primals in parallel on the CPU/GPU. The only requirement is that the function to evaluate Jacobians through must be "vectorized", in that:
```julia
f(x) = ...  # (GPU) vectorized function accepting SoA memory layout

a = f([1 2 3 4 5 6])      # gives some values
b = f([7 8 9 10 11 12])   # gives some other values

# But instead it is faster to do it in one evaluation thanks to SIMD/SIMT:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)
```
In this case, each "batch element" is a row. I will refer to this as the `batchdim=1` case.
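To make the SoA property concrete, here is a minimal sketch with a hypothetical row-wise function (this particular `f` is an assumption for illustration, not part of the proposal; any function satisfying the `vcat` property above works the same way):

```julia
# Hypothetical row-wise vectorized function: each row is one batch element.
f(x) = x .^ 2 .+ sum(x; dims=2)

a = f([1 2 3 4 5 6])
b = f([7 8 9 10 11 12])

# One batched evaluation agrees with the two separate ones:
f([1 2 3 4 5 6; 7 8 9 10 11 12]) == vcat(a, b)  # true
```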
Another option is the AoS memory layout, which is not really vectorized but can still be much, much faster in certain cases than evaluating the same function for each primal separately (if there are layers of type instability in the function, for example):
```julia
g(x) = ...  # AoS memory layout

a = g([1, 2, 3, 4, 5, 6])      # gives some values
b = g([7, 8, 9, 10, 11, 12])   # gives some other values

# Many cases where this is still faster:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)
```
In this case, each "batch element" is a column, so we can call this the `batchdim=2` case.
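Again, a minimal sketch with a hypothetical column-wise function makes the AoS property concrete (this `g` is an assumption for illustration only):

```julia
# Hypothetical column-wise function: each column is one batch element.
g(x) = x .* 2 .+ sum(x; dims=1)

a = g([1, 2, 3, 4, 5, 6])
b = g([7, 8, 9, 10, 11, 12])

# One batched evaluation agrees with the two separate ones:
g([1 7; 2 8; 3 9; 4 10; 5 11; 6 12]) == hcat(a, b)  # true
</imports>true
```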
For both of these cases, I can get the Jacobians for all "batch elements"/primals together by reshaping the input/output arrays to be 1-dimensional and then using `AutoSparse`. For the `batchdim=2` case, the sparse Jacobian matrix will be block-diagonal, and for the `batchdim=1` case, it will have diagonal bands. This works well, and in theory you could just use `AutoSparse` on this problem and be done. However, a couple of important remarks make me believe that a separate type, which I call `AutoBatch`, distinct from `AutoSparse`, is truly warranted for this special case:
- I found that the automatic sparsity detector and coloring algorithm can be really slow when there are e.g. 10,000-100,000 batch elements/primals. Furthermore, for `batchdim=1` and `batchdim=2`, we know the pattern and coloring of the sparse Jacobian beforehand, so we can simply construct the pattern and use a `ConstantColoringAlgorithm` instead.
- Backend-specific optimizations can be made. Immediately, I am thinking about the massive speedup that finite differences could achieve if the function is known to be vectorized, or if it is faster to evaluate a batch all at once than to re-evaluate the function many times for each tangent. So, e.g., FiniteDiff could provide special overrides for the `AutoBatch{AutoFiniteDiff}` type. This would be a more general way to solve "Feature request: vectorized finite difference evaluation" FiniteDiff.jl#210.
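To make the "known pattern and coloring" point concrete, here is a sketch of how both could be constructed a priori. The dimensions and variable names are illustrative (not DI's API), and it assumes column-major flattening of the batched arrays:

```julia
using SparseArrays, LinearAlgebra

B, n, m = 4, 3, 2  # illustrative: B batch elements, n inputs / m outputs each

# batchdim=2 (AoS): flattening the n×B input and m×B output column-major
# gives a block-diagonal Jacobian pattern with B dense m×n blocks.
pattern_aos = kron(sparse(I, B, B), sparse(trues(m, n)))

# batchdim=1 (SoA): flattening the B×n input and B×m output column-major
# interleaves the batches, giving diagonal bands with stride B.
pattern_soa = kron(sparse(trues(m, n)), sparse(I, B, B))

# The column colorings are likewise known a priori: n colors suffice in both
# layouts, independent of B, since columns conflict only within one batch element.
colors_aos = [mod1(c, n) for c in 1:(n * B)]  # same within-element input index shares a color
colors_soa = [cld(c, B) for c in 1:(n * B)]
# These could then be handed to a ConstantColoringAlgorithm instead of
# running detection and coloring over 100,000 batch elements.
```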
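And a hedged sketch of the kind of override a batched finite-difference backend could provide for `batchdim=1` (forward differences; the function name and signature here are illustrative, not FiniteDiff's actual API):

```julia
# Compute the Jacobians of all B batch elements with n extra vectorized
# calls instead of n*B per-element calls (illustrative sketch).
function batched_fd_jacobians(f, X::AbstractMatrix; h=sqrt(eps(eltype(X))))
    B, n = size(X)
    Y = f(X)                              # one batched primal evaluation (B×m)
    m = size(Y, 2)
    J = Array{eltype(Y)}(undef, B, m, n)  # J[b, :, :] is batch element b's Jacobian
    for j in 1:n
        Xp = copy(X)
        Xp[:, j] .+= h                    # perturb input j for every batch element at once
        J[:, :, j] .= (f(Xp) .- Y) ./ h
    end
    return J
end

# For the row-wise f(x) = x .^ 2, each batch element's Jacobian is diagonal:
fsq(x) = x .^ 2
J = batched_fd_jacobians(fsq, [1.0 2.0; 3.0 4.0])
# J[1, :, :] ≈ [2 0; 0 4] and J[2, :, :] ≈ [6 0; 0 8], up to O(h) error
```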
I currently have a working implementation of `AutoBatch` in my `BatchSolve.jl` package. I would be happy to work with you on merging this functionality into DI if there is interest! Feedback is also very welcome.

I use the Jacobians I compute this way with the cuBLAS library to do batched Newton solves, for example on 100,000 6x6 systems. It is really fast.
Let me know what you think!