TurboQuant KV cache port to v2#620
Open
jrajala6 wants to merge 10 commits into
Open
Conversation
…val harnesses - Gemma4 TQ wiring in attention build (per-layer rotation, sliding/global head dims) - Gemma4MmModel: forward score functions to language model so cache delegation works - ops_cache: TQ state/append/attention compute nodes; skip kernel window mask when cache size already capped (fixes NaN at ctx >= 2 * window) - Window-scoring softcap parity with v1 - TQ kernel: per-dim LUT cache for dot_4bit/6bit (~12% speedup, bit-identical) - TQ attention: K=8 angle-bits dispatch + dot_8bit/accumulate_8bit declarations - Seed via CACTUS_KV_TQ_SEED env (default 42 to match v1) - New tests: test_seed_sweep (per-position bucketed ppl + top-1 argmax agreement) and test_bfcl_eval (TQ config x tool-calling smoke)
# Conflicts: # .gitignore # cactus-engine/models/gemma4/model_gemma4.cpp # cactus-graph/src/builder.cpp
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Engine KV cache is now TurboQuant unconditionally with no INT8 path. Default config: K=6, V=4, single Hadamard, seed=42. Graph-level INT8 primitives stay in
cactus-graph(still exercised bytest_cache) but unreachable from the engine.Headline: 43% smaller cache vs INT8, 1.6% faster decode, ≤5% perplexity vs INT8 on 8 of 9 measured cells.
Changes
cactus-engine/{src/engine.h, src/model.cpp, models/gemma4/model_gemma4.cpp}— TQ unconditional; INT8 fallback removed.cactus-kernels/cactus_kernels.h—TURBOQUANT_ROTATION_LAYERS3 → 1 (single Hadamard; 3× cheaper rotation, equal/better quality per prior data).cactus-engine/tests/{test_perplexity_context_sweep.cpp, test_decode_latency.cpp}— new. Multi-corpus / multi-seed / multi-ctx sweep + real decode benchmark.cactus-engine/tests/test_perplexity.cpp— drops INT8 column.python/src/tensor_io.py— restoresINTERLEAVE_BLOCK = 4(upstreamb99db766removed the constant but left two functions referencing it in default args, breaking module import).Validation — gemma-4-e2b vs INT8 cache (
|TQ/INT8 − 1| ≤ 0.05)INT8 reference numbers were captured during development against the in-tree INT8 cache before this PR removed it; reproducible from the prior commit.
Quality at seed=42 across context lengths:
Seed robustness (K6V4, ctx=1024):
seed=42 is the only seed passing on every corpus → ship pinned to 42; other seeds remain reachable via
CACTUS_KV_TQ_SEED.Decode latency (gen=64): INT8 = 115.48 ms/tok, TQ K6V4 = 113.66 ms/tok (−1.6%, faster).
Memory (head_dim=256): FP16 1024 / INT8 576 / K6V4 328 bytes per token. K6V4 is 3.12× vs FP16, −43% vs INT8. At ctx=32k across 15 own-KV layers: 276 MB → 158 MB (saves 118 MB).
Why K6V4 over alternatives
Configurability
CACTUS_KV_TQ_K_BITS,CACTUS_KV_TQ_V_BITS,CACTUS_KV_TQ_SEED,CACTUS_KV_TQ_BITS(unified).CACTUS_KV_WINDOW_SIZE,CACTUS_KV_SINK_SIZEretained from prior config.Test plan