Multi-Vector Distance Benchmarks #1027
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new diskann-benchmark-multi-vector crate to benchmark and regression-check multi-vector distance ops (Chamfer, MaxSim) across f32/f16 and optimized/reference implementations, integrated with diskann-benchmark-runner.
Changes:
- Introduces benchmark library with dispatcher registration, benchmark execution, and regression checking logic.
- Adds `benchmark-multi-vector` CLI + integration tests and example job/tolerance configs used by the runner workflow.
- Updates workspace to include the new crate.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| diskann-benchmark-multi-vector/src/lib.rs | Core benchmark/regression logic, dispatch rules, fixtures, and unit tests. |
| diskann-benchmark-multi-vector/src/bin.rs | CLI entrypoint wiring registry + basic integration tests. |
| diskann-benchmark-multi-vector/examples/tolerance.json | Default tolerance config consumed by check verify/run. |
| diskann-benchmark-multi-vector/examples/test.json | Small smoke input for integration tests. |
| diskann-benchmark-multi-vector/examples/multi-vector.json | Full benchmark matrix input example. |
| diskann-benchmark-multi-vector/README.md | Documents kernels, metric normalization, and runner workflows. |
| diskann-benchmark-multi-vector/Cargo.toml | New crate manifest + deps/bin target. |
| Cargo.toml | Adds new crate to workspace members. |
```rust
#[derive(Debug, Error)]
#[error("this kernel handles a different implementation than {0}")]
pub(crate) struct ImplementationMismatch(Implementation);
```
```rust
match change {
    Ok(change) => {
        row.insert(format!("{:.3} %", change * 100.0), 6);
        if change > c.tolerance.min_time_regression.get() {
            row.insert("FAIL", 7);
        }
    }
    Err(err) => {
        row.insert("invalid", 6);
        row.insert(err, 7);
    }
}
```
```rust
match relative_change(before_min, after_min) {
    Ok(change) => {
        if change > tolerance.min_time_regression.get() {
            passed = false;
        }
    }
    Err(_) => passed = false,
};
```
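For context, a minimal sketch of the `relative_change` helper both snippets above rely on. The signature here is an assumption for illustration; the crate's actual helper may differ in error type and edge-case handling:

```rust
// Hypothetical sketch: relative change of a latency measurement.
// Returns (after - before) / before, erroring on a non-positive baseline
// so the regression check can flag the comparison as invalid.
fn relative_change(before: f64, after: f64) -> Result<f64, &'static str> {
    if !before.is_finite() || before <= 0.0 {
        return Err("baseline must be a positive, finite latency");
    }
    Ok((after - before) / before)
}

fn main() {
    // A 10% slowdown against a 5% tolerance would fail the check.
    let change = relative_change(100.0, 110.0).unwrap();
    let tolerance = 0.05;
    println!("change = {change:.3}, regression = {}", change > tolerance);
}
```
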
```rust
let min_latency = r
    .latencies
    .iter()
    .min()
    .copied()
    .unwrap_or(MicroSeconds::new(u64::MAX));
```
```rust
fn run_loops<F>(run: &Run, mut body: F) -> RunResult
where
    F: FnMut(),
{
    let mut latencies = Vec::with_capacity(run.num_measurements.get());

    for _ in 0..run.num_measurements.get() {
        let start = std::time::Instant::now();
        for _ in 0..run.loops_per_measurement.get() {
            body();
        }
        latencies.push(start.elapsed().into());
    }

    let percentiles = percentiles::compute_percentiles(&mut latencies).unwrap();
    RunResult {
        run: run.clone(),
        latencies,
        percentiles,
    }
}
```
```rust
    .join("tolerance.json");

let stdout = run_check_test(&input_path, &tolerance_path);
println!("stdout = {}", stdout);
```
Codecov Report
❌ Patch coverage is 80.03% with 116 lines in your changes missing coverage.
Additional details and impacted files

```
@@ Coverage Diff @@
##             main    #1027      +/-   ##
==========================================
- Coverage   89.51%   89.45%   -0.07%
==========================================
  Files         460      463       +3
  Lines       85466    86047     +581
==========================================
+ Hits        76508    76973     +465
- Misses       8958     9074     +116
```

Flags with carried forward coverage won't be shown.
hildebrandmw left a comment:
Thanks Suryansh, this will be useful for performance optimization. One big question I had is whether this needs to be its own crate, or whether it can be folded into `diskann-benchmark` (which will need some work to put the normal indexing code behind a feature gate to keep compile times low) or shipped as a binary in `diskann-quantization` directly.

The pattern established by `diskann-benchmark-simd` is a little unfortunate, so now is the time to think about our approach before we have to deal with crate proliferation. We might also be able to move all benchmark-related binary crates into a subdirectory to keep the top level organized.

In either case, I don't think we should publish this to crates.io, and we should probably stop publishing `diskann-benchmark-simd` as well.
```rust
    ($f:ident, $field:tt, $($expr:tt)*) => {
        writeln!($f, "{:>18}: {}", $field, $($expr)*)
    }
}
```
Since #996, manual alignment shouldn't be needed any more.
```rust
macro_rules! register {
    ($impl:ident, $t:ty, $tag:literal) => {
        dispatcher.register_regression($tag, Kernel::<$impl, $t>::new());
    };
```
Nit: The macro is offering only a moderate benefit in terseness. Might be better off without it?
```rust
}

// Optimized (architecture-dispatched QueryComputer).
register!(Optimized, f32, "multi-vector-op-f32-optimized");
```
We should probably provide a way of overriding the runtime architecture to test both AVX512 and AVX2 kernels on the same machine. Otherwise, we have no way to compare such implementations against each other ...
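A hedged sketch of what such an override could look like: an explicit knob that forces a kernel regardless of what host detection reports. All names here are hypothetical and do not correspond to diskann APIs:

```rust
// Hypothetical architecture-override knob for benchmarking; the real
// dispatch machinery in diskann lives elsewhere and may look different.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ArchOverride {
    Auto,   // use whatever the host CPU supports
    Avx2,   // force the AVX2 kernel even if AVX-512 is available
    Avx512, // force the AVX-512 kernel (errors on unsupported hosts)
}

fn select_kernel(host_has_avx512: bool, over: ArchOverride) -> Result<&'static str, &'static str> {
    match over {
        ArchOverride::Auto => Ok(if host_has_avx512 { "avx512" } else { "avx2" }),
        ArchOverride::Avx2 => Ok("avx2"),
        ArchOverride::Avx512 if host_has_avx512 => Ok("avx512"),
        ArchOverride::Avx512 => Err("host does not support AVX-512"),
    }
}

fn main() {
    // On an AVX-512 machine, forcing AVX2 lets both kernels be benchmarked.
    assert_eq!(select_kernel(true, ArchOverride::Avx2), Ok("avx2"));
    assert_eq!(select_kernel(true, ArchOverride::Auto), Ok("avx512"));
}
```

A job-file field or CLI flag mapping onto such an enum would let a single AVX-512 host compare both code paths.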
```rust
        }
    }
};
}
```
It might be cleaner to get rid of `DispatchRule` and use an entirely bespoke, smaller trait, like:

```rust
impl TryFrom<Implementation> for Optimized {
    type Error = FailureScore;

    fn try_from(i: Implementation) -> Result<Self, Self::Error> {
        match i {
            Implementation::Optimized => Ok(Self),
            _ => Err(FailureScore(1)),
        }
    }
}
```

and write matching via `TryFrom`.
```rust
const RNG_SEED: u64 = 0x12345;

struct Data<T> {
    query_data: Box<[T]>,
```
We spend so much time writing matrix abstractions. Why not use them? If the constructors are insufficient, that's a sign we have work to do!
```rust
        results.push(result);
    }
    Ok(results)
}
```
These two run loops are nearly identical. Perhaps a tweak to factor out the common parts? Maybe consider keeping the loop nest the same, but taking a trait object (to reduce compile times) for the distance implementation?
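As a rough illustration of this suggestion (all names here are invented for the sketch and do not match the crate), the shared loop nest could accept a trait object so the distance kernels stop being monomorphised into every copy of the run loop:

```rust
// Sketch: an object-safe distance trait lets one run loop serve all
// implementations via dynamic dispatch, trading a virtual call per
// evaluation for fewer monomorphised copies and faster compiles.
trait Distance<T> {
    fn eval(&self, query: &[T], doc: &[T]) -> f32;
}

struct Reference;
impl Distance<f32> for Reference {
    fn eval(&self, query: &[f32], doc: &[f32]) -> f32 {
        // Illustrative inner product standing in for a real kernel.
        query.iter().zip(doc).map(|(a, b)| a * b).sum()
    }
}

// The common loop takes `&dyn Distance<T>`, so it is instantiated once
// per element type rather than once per (implementation, operation) pair.
fn run_with_distance<T>(query: &[T], doc: &[T], dist: &dyn Distance<T>) -> f32 {
    dist.eval(query, doc)
}

fn main() {
    let q = [1.0_f32, 2.0, 3.0];
    let d = [4.0_f32, 5.0, 6.0];
    assert_eq!(run_with_distance(&q, &d, &Reference), 32.0);
}
```

Since the trait-object call sits outside the hot inner loop being timed, the dispatch overhead should not distort measurements.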
```rust
}

impl_kernel_for!(f32);
impl_kernel_for!(f16);
```
Since the abstractions these macros are stamping were also introduced in this PR, consider whether or not there's a simpler way to express things via generics that doesn't require macros.
```
| `multi-vector-op-f16-reference` | `f16` | `Chamfer` / `MaxSim` |

The **optimized** path constructs a `QueryComputer` once per shape (which
internally selects the best available SIMD kernel for the host) and calls
```
If we're using this as a development tool ... it's probably a good idea to allow an override for the architecture.
````
bench run --input-file examples/multi-vector.json --output-file before.json
```

### Regression check workflow
````
The section on regression checks seems to be just a copy+paste of how diskann-benchmark-runner works. Does this need to be spelled out in the README like this?
```
The crate registers four kernels — one per `(element_type, implementation)`
pair:

| Tag | Element | Implementation |
```
While it's fine to have a list like this, the chances of it getting updated as kernels are added are nearly zero. Maybe instead rely on the inputs behavior of the runner app and ensure that the returned descriptions are sufficient?
```toml
    "diskann-benchmark-runner",
    "diskann-benchmark-core",
    "diskann-benchmark-simd",
    "diskann-benchmark-multi-vector",
```
Do we want to make this a high-level crate? I think it would be good if we had this added as a 'benchmark' under `diskann-benchmark`.
How many top-level crates do we actually want to have here? If we end up with 10–20 diskann-benchmark-* crates, is that going to be a problem?
As an alternative, could we keep benchmarks inside each crate: e.g., add a bench subfolder and use [[bench]] with harness = false, pulling in something like diskann-benchmark-runner as a dev-dependency? That feels more idiomatic in Rust.
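For illustration, a hedged sketch of that idiomatic layout; the section names are standard Cargo, but the bench name and dependency path are assumptions for this example:

```toml
# Hypothetical layout: keep the benchmark inside the crate it exercises
# (e.g. diskann-quantization) as a custom-harness bench target, pulling the
# runner in only as a dev-dependency so it never taxes normal builds.
[[bench]]
name = "multi_vector"   # runs benches/multi_vector.rs
harness = false          # opt out of libtest; use the runner's own main()

[dev-dependencies]
diskann-benchmark-runner = { path = "../diskann-benchmark-runner" }
```

With `harness = false`, `cargo bench --bench multi_vector` simply executes the bench binary's `main`, which could register kernels with the runner exactly as the standalone crate does today.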
arkrishn94 left a comment:
Thanks Suryansh, I left some relatively minor comments. Can we run this benchmark and post some example numbers in the PR description?
```
Per-measurement latency is normalized to **nanoseconds per inner-product
call**, abbreviated `ns/IP`:
```
Maybe I'm overthinking this but some CPU cycles are spent on reduction here too for Chamfer.
Why not just report the total time here? Users can profile themselves if they want to.
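For reference, the normalization under discussion can be sketched from the formula given in the PR description (`ns/IP = min_latency_µs * 1000 / (Q * D * loops)`); as the comment above notes, for Chamfer the reduction time is folded into this per-IP figure too:

```rust
// Sketch of the ns/IP normalization from the PR description. For Chamfer,
// the min-reduction cost is absorbed into the per-inner-product number.
fn ns_per_ip(min_latency_us: f64, q: u64, d: u64, loops: u64) -> f64 {
    (min_latency_us * 1_000.0) / (q * d * loops) as f64
}

fn main() {
    // 2_000 µs for 32 query vectors x 128 doc vectors x 10 loops per measurement.
    let v = ns_per_ip(2_000.0, 32, 128, 10);
    println!("{v:.4} ns/IP"); // 2_000_000 ns / 40_960 IPs
}
```

Reporting raw total time alongside this would cost nothing and sidestep the normalization debate entirely.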
```rust
#[derive(Debug, Serialize)]
struct Comparison {
    run: Run,
    tolerance: MultiVectorTolerance,
```
nit: Tolerance here should do? The path should allow us to distinguish?
```rust
struct Comparison {
    run: Run,
    tolerance: MultiVectorTolerance,
    before_min: f64,
```
```rust
if datatype::Type::<T>::try_match(&from.element_type).is_err() {
    *failscore.get_or_insert(0) += 10;
}
```
I forget - how did we come up with a failure score of 10 for this? Should this be something large?
```rust
// initialized by it is non-null and properly aligned to the underlying type.
unsafe impl<T, F> NewOwned<Init<F>> for Standard<T>
where
    T: Copy,
    F: FnMut() -> T,
{
    type Error = crate::error::Infallible;
```
Does it make sense to fold the default initialization into this in the spirit of minimizing the number of constructor paths we expose to users?
```rust
pub(crate) enum Operation {
    Chamfer,
    MaxSim,
}
```
I'm wondering if we should just expose a single operation here? I'm not sure folks will use MaxSim much and it might simplify some of your code here?
Adds Chamfer / MaxSim benchmarks from `diskann-quantization::multi_vector` to `diskann-benchmark`, behind a new `multi-vector` Cargo feature (off by default).

Four registered regression kernels — `(f32, f16) × (optimized, reference)`:

- Optimized: `QueryComputer` (architecture-dispatched SIMD).
- Reference: the `Chamfer`/`MaxSim` fallback used by the `multi_vector` unit tests — numerical ground truth and a baseline for measuring SIMD speedups.

With the feature off, a `multi-vector-op` job emits the standard `Requires the "multi-vector" feature` diagnostic via `stub_impl!`.

Design

- `OptimizedDistance<T>` / `ReferenceDistance<'a, T>` implement an object-safe `Distance<T>` trait; a shared `run_with_distance<T>(run, doc, &dyn Distance<T>)` carries the loop nest. `run_loops<F: FnMut()>` monomorphises over `(T, Operation)` only.
- `QueryComputer::new` is hoisted out of the timed loop; the reference path doesn't need a parallel hoist (`MaxSim::new` is a non-empty check + pointer wrap).
- No `DispatchRule`; the implementation axis is matched via `TryFrom<Implementation>` plus a tiny `ImplementationMatcher` trait for diagnostic consts.
- Fixtures are `(Mat<Standard<T>>, Mat<Standard<T>>)`, built through a new `NewOwned<Init<F>>` constructor (closure-per-element fill) added to `diskann-quantization::multi_vector::matrix` alongside the existing `NewOwned<T>` / `NewOwned<Defaulted>` variants.
- `ns/IP @ Dim = min_latency_µs * 1000 / (Q * D * loops)`. Invariant under `Q`, `D`, and loop budget; approximately linear in `Dim`. Per-shape regression detection is robust because the `Dim` factor cancels.
- `NewFromMatRef` impls cover the inherent `QueryComputer::<T>::new`; everything else is generic.

Tests

6 unit tests in `backend::multi_vector::imp::tests` (regression-check edges) + 2 integration tests in `main.rs` (`run` and `check verify`). 1 unit test + 1 doctest in `diskann-quantization::multi_vector::matrix` cover the new `NewOwned<Init<F>>`.

Follow-up PRs (deferred)

- Reduce `diskann-benchmark` compile times by feature-gating the existing indexing code.
- Fold `diskann-benchmark-simd` into `diskann-benchmark` similarly.
- Stop publishing benchmark crates (`publish = false`).
- Move benchmark crates into a `benchmarks/` workspace subdirectory.