
Make Android scheduling load-aware #666

Merged: HenryNdubuaku merged 2 commits into v2 from feature/android-load-aware-scheduling-v2 on May 27, 2026

ncylich commented May 26, 2026

Summary

Make Android CPU scheduling load-aware for production inference on top of current v2.

Key changes:

  • Pin Android threadpool workers one-per-selected performance core instead of giving every worker the same broad performance-core mask.
  • Split blocking Android parallel_for work into smaller fixed internal chunks so faster cores can pull additional work.
  • Prepare production inference entrypoints by selecting a load-aware performance core for the caller thread.
  • Initialize the worker pool before narrowing caller affinity so worker threads do not inherit a single-core caller mask.
  • Preserve each caller thread's original allowed affinity and restore it before each load-aware selection, so repeated calls can move away from a newly busy core.
  • Keep the policy internal and deterministic: no public API changes, no required environment variables, and no debug logging.
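The restore-before-select point can be illustrated with a small model. This is only a sketch, not the code in this PR: CPU sets are modeled as 64-bit masks instead of real affinity calls, and `CallerAffinityState`, `least_busy_core`, and `prepare_caller` are hypothetical names. It shows why the original allowed set has to be restored first: without that step, the second call would only ever see the single core chosen by the first call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: bit i set means core i is allowed.
using CpuMask = std::uint64_t;

// Hypothetical per-thread state (the PR keeps its real state thread-local).
struct CallerAffinityState {
    CpuMask original = 0;   // allowed set captured before the first pin
    CpuMask current = 0;    // set the thread is restricted to right now
    bool captured = false;
};

// Lowest busy fraction among allowed cores; ties go to the lower index.
// Returns -1 if no allowed core has a load entry.
inline int least_busy_core(CpuMask allowed, const std::vector<double>& busy) {
    int best = -1;
    for (int core = 0; core < static_cast<int>(busy.size()) && core < 64; ++core) {
        if (!(allowed & (CpuMask{1} << core))) continue;
        if (best < 0 || busy[core] < busy[best]) best = core;
    }
    return best;
}

// One "prepare" call: capture the original set once, restore it, then narrow.
inline int prepare_caller(CallerAffinityState& st, CpuMask os_mask,
                          const std::vector<double>& busy) {
    assert(os_mask != 0);
    if (!st.captured) {
        st.original = os_mask;
        st.captured = true;
    }
    st.current = st.original;                        // restore before selecting
    int core = least_busy_core(st.current, busy);
    if (core >= 0) st.current = CpuMask{1} << core;  // narrow to the chosen core
    return core;
}
```

With four allowed cores, a first call under loads {0.9, 0.1, 0.5, 0.2} picks core 1. If core 1 then becomes busy, the next call can still move to core 3 because selection starts again from the saved four-core set.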

Base

This branch is based directly on current v2 after the orthogonal CQ4 LM-head layout PR merged. The PR diff against v2 is limited to:

  • cactus-engine/src/complete.cpp
  • cactus-kernels/src/threading.h

Rationale

The LM-head kernel optimization recovered the kernel-level cost, but Android end-to-end performance was still sensitive to where the OS placed the caller thread and how evenly worker work was distributed across asymmetric cores.

The previous default behavior could leave important work on a slow or contended core. A naive main-thread pin helped in benchmarks, but it was not robust under contention. This PR makes the production path use the same general idea safely: workers stay distributed across selected performance cores, blocking work is chunked so faster cores naturally take more work, and the caller thread is placed using recent per-core load rather than a fixed core choice.
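A toy model of why chunking helps on asymmetric cores, as a sketch only. The core speeds and work amounts are made up and no real scheduler behaves this cleanly. `static_makespan` gives every core an equal share, while `chunked_makespan` hands the next equal-sized chunk to whichever core becomes free first. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Equal share per core: the slowest core decides when the call finishes.
inline double static_makespan(double total_work, const std::vector<double>& speeds) {
    assert(!speeds.empty());
    double share = total_work / static_cast<double>(speeds.size());
    double worst = 0.0;
    for (double s : speeds) worst = std::max(worst, share / s);
    return worst;
}

// Equal-sized chunks pulled on demand: each chunk goes to the core that is
// idle first, so faster cores end up processing more chunks.
inline double chunked_makespan(double total_work, const std::vector<double>& speeds,
                               std::size_t chunks) {
    assert(!speeds.empty() && chunks > 0);
    double chunk = total_work / static_cast<double>(chunks);
    std::vector<double> free_at(speeds.size(), 0.0);
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t idx = static_cast<std::size_t>(
            std::min_element(free_at.begin(), free_at.end()) - free_at.begin());
        free_at[idx] += chunk / speeds[idx];
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

For 12 units of work on two cores with speeds 2 and 1, the static split finishes at time 6 (the slow core needs 6 for its half), while six chunks of 2 units finish at time 4, because the fast core takes four chunks and the slow core takes two.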

Implementation Details

Worker scheduling:

  • Detect Android core capacities from sysfs using the existing topology detection path.
  • Limit the worker pool to selected performance cores.
  • Pin each worker to an individual selected performance core.
  • Use the existing threadpool for blocking Android parallel_for calls and divide each requested worker assignment into a fixed number of smaller tasks.
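The worker-side steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in threading.h: `performance_cores`, `worker_cores`, and `split_range` are hypothetical helpers, the capacity threshold is an assumed selection rule, and the number of chunks per worker is passed in as a parameter because the PR does not state the fixed value it uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical rule: a core is a performance core if its capacity is at
// least `min_capacity`. The real rule lives in the topology detection path.
inline std::vector<int> performance_cores(const std::vector<int>& capacities,
                                          int min_capacity) {
    std::vector<int> selected;
    for (std::size_t i = 0; i < capacities.size(); ++i) {
        if (capacities[i] >= min_capacity) selected.push_back(static_cast<int>(i));
    }
    return selected;
}

// One core per worker: worker i maps to selected[i], and the pool never
// grows past the number of selected cores.
inline std::vector<int> worker_cores(const std::vector<int>& selected,
                                     std::size_t requested_workers) {
    std::size_t n = std::min(requested_workers, selected.size());
    return std::vector<int>(selected.begin(),
                            selected.begin() + static_cast<std::ptrdiff_t>(n));
}

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Split [begin, end) into workers * chunks_per_worker tasks of near-equal size.
inline std::vector<Range> split_range(std::size_t begin, std::size_t end,
                                      std::size_t workers,
                                      std::size_t chunks_per_worker) {
    assert(begin <= end && workers > 0 && chunks_per_worker > 0);
    std::size_t total = end - begin;
    std::size_t tasks = std::min(total, workers * chunks_per_worker);  // no empty tasks
    std::vector<Range> out;
    if (tasks == 0) return out;
    std::size_t base = total / tasks;
    std::size_t extra = total % tasks;
    std::size_t cursor = begin;
    for (std::size_t t = 0; t < tasks; ++t) {
        std::size_t len = base + (t < extra ? 1 : 0);  // first `extra` tasks get one more item
        out.push_back(Range(cursor, cursor + len));
        cursor += len;
    }
    return out;
}
```

For example, with made-up capacities {250, 250, 250, 250, 620, 620, 1024, 1024} and a threshold of 600, cores 4 to 7 are selected, and `split_range(0, 10, 2, 2)` yields [0,3), [3,6), [6,8), [8,10). In a real pool, idle workers would pull these tasks from a shared queue.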

Caller scheduling:

  • Keep each caller thread's original allowed affinity in a thread-local state object.
  • Initialize the worker pool before caller pinning to avoid worker threads inheriting the caller's narrowed affinity mask.
  • Restore the caller thread's original allowed affinity before each selection so prior single-core pinning does not restrict future choices.
  • Take two /proc/stat samples about 20 ms apart to estimate each core's recent busy fraction.
  • Score candidate performance cores by capacity adjusted for recent load, then pin the caller to the best allowed core.
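The sampling and scoring steps might look roughly like the sketch below. It parses /proc/stat text passed in as a string, so the two samples can come from anywhere. A real caller would read the file, wait about 20 ms, and read it again. Several things here are assumptions rather than details from the PR: the names `CpuTimes`, `parse_proc_stat`, `busy_fraction`, `Candidate`, and `pick_core`, the score formula `capacity * (1 - busy)`, counting iowait as idle time, and treating a core with no sample as idle.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct CpuTimes { std::uint64_t busy; std::uint64_t total; };

// Parse per-core "cpuN user nice system idle iowait irq softirq steal ..." lines.
inline std::map<int, CpuTimes> parse_proc_stat(const std::string& text) {
    std::map<int, CpuTimes> out;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        // Keep "cpu3 ..." lines; skip the aggregate "cpu  ..." line and everything else.
        if (line.size() < 4 || line.compare(0, 3, "cpu") != 0 ||
            line[3] < '0' || line[3] > '9') continue;
        std::istringstream fields(line.substr(3));
        int core = 0;
        fields >> core;
        std::uint64_t v = 0, total = 0, idle = 0;
        for (int i = 0; i < 8 && (fields >> v); ++i) {
            total += v;
            if (i == 3 || i == 4) idle += v;  // idle and iowait columns
        }
        out[core] = CpuTimes{total - idle, total};
    }
    return out;
}

// Fraction of the interval between two samples that the core was busy.
inline double busy_fraction(const CpuTimes& before, const CpuTimes& after) {
    if (after.total <= before.total || after.busy < before.busy) return 0.0;
    return static_cast<double>(after.busy - before.busy) /
           static_cast<double>(after.total - before.total);
}

struct Candidate { int core; double capacity; };

// Assumed score: capacity scaled by the idle share of the last interval.
inline int pick_core(const std::vector<Candidate>& candidates,
                     const std::map<int, CpuTimes>& before,
                     const std::map<int, CpuTimes>& after) {
    int best = -1;
    double best_score = -1.0;
    for (const Candidate& c : candidates) {
        assert(c.capacity >= 0.0);
        auto b = before.find(c.core);
        auto a = after.find(c.core);
        double busy = (b != before.end() && a != after.end())
                          ? busy_fraction(b->second, a->second)
                          : 0.0;  // assumption: no sample means idle
        double score = c.capacity * (1.0 - busy);
        if (score > best_score) { best_score = score; best = c.core; }
    }
    return best;
}
```

With this scoring, a core of capacity 1024 that was 50% busy scores 512 and loses to an idle core of capacity 800, which is the kind of move away from a busy big core that the caller placement is meant to allow.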

Validation

Build and native tests after the cleanup:

  • cactus build
  • cactus build --android
  • ./cactus-graph/build/cactus-kernels/test_matmul
  • ./cactus-graph/build/test_ops
  • git diff --check

Pixel sanity check after the cleanup:

  • Generation on the prompt "explain shrodingers cat" produced coherent output.
  • Gemma 512 benchmark: 10.73 tok/s decode, 69.06 tok/s prefill.
  • Verified no leftover Pixel cactus_chat, cactus_llm_bench, or cpu_contention processes after the run.

Earlier contention testing also verified the main bug: pinning the caller before worker-pool initialization could cause all workers to inherit a one-core affinity mask. Initializing the pool first fixes that collapse while preserving load-aware caller placement.

ncylich added 2 commits May 26, 2026 13:38
Pin Android threadpool workers to individual selected performance cores instead of giving every worker the full performance-core mask. This avoids workers collapsing onto the same core under default Android scheduling while preserving the existing topology detection and non-Android behavior.

Route blocking Android parallel_for calls through the threadpool chunking path and split each worker request into a fixed set of smaller tasks. This lets faster cores pull additional chunks and prevents a single slow worker assignment from dominating decode and prefill latency.

Keep the policy internal and deterministic: no environment variables, no debug logging, and no public API changes.

Validation:

- cactus build
- cactus build --android
- ./cactus-graph/build/cactus-kernels/test_matmul
- ./cactus-graph/build/test_ops

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

Make the Android production entrypoints prepare the caller thread with load-aware performance-core affinity instead of relying on benchmark-only environment hooks.

The caller affinity setup now initializes the worker pool before narrowing the caller mask, preserves the original allowed affinity per calling thread, and refreshes CPU load samples before selecting a target performance core. This prevents worker threads from inheriting a single-core mask while still letting repeated calls move away from a busy core.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich force-pushed the feature/android-load-aware-scheduling-v2 branch from 1b46e26 to e587035 on May 26, 2026 20:44
HenryNdubuaku merged commit fb0adae into v2 on May 27, 2026
4 checks passed