Validation pass: Emily 0.4.0 on Apple Silicon (37/37 decision-stable; route_hash + one margin-floor near-miss) #1

Draft
ausimian wants to merge 2 commits into nshkrdotcom:main from ausimian:emily-0.4-validation-pass

@ausimian

What this is

Draft / validation-only PR. Posted because you asked for a validation pass against current main from someone with Apple hardware before you shape the upstream :emily integration. Nothing here is intended for merge as-is.

The branch adds a four-line :emily runtime profile (mirroring :emlx's intent) plus the minimum plumbing needed to make the export and prompt-eval suites actually exercise the Emily.Backend lane on this host. It does not touch global Nx defaults, snapshots, margin-floor ratchets, or docs — those are the calls I'm leaving to you.

Result

mix trinity.sakana.export_adapted --force --svd-compute-type f32 --runtime-profile emily --out tmp/emily_adapted_qwen3_0_6b_layer26

  • manifest["status"] = "complete", export_complete: true, 9/9 tensors "status":"complete".
  • Every tensor's u_backend / s_backend / v_backend / adapted_backend recorded as Emily.Backend.
  • No OOM on the {151_936, 1024} embedder or lm_head. Thin-SVD path holds.
  • Wall clock 6.6s total (warm BEAM + Bumblebee load + 9 SVDs + writes).

mix run examples/qwen_router_prompt_eval.exs --runtime-profile emily --artifact-dir tmp/... --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json --determinism-runs 2

| Axis | Emily vs CUDA snapshot |
| --- | --- |
| Decision-stable (agent_id, role_id, token_count, transcript_hash) | 0/37 drift. All match exactly. |
| route_hash (decision + 6dp logits) | 37/37 differ. Backend-sensitive logit drift on every case. |
| In-process determinism (--determinism-runs 2) | 0/37 mismatches. Emily is bytewise deterministic across runs in one BEAM. |
| min_agent_margin = 0.24 floor | All 37 pass. Worst case two_assistant_turns at 0.417. |
| min_role_margin = 1.06 floor | 1/37 fails. escalate_to_human at 1.029 (CUDA snapshot had 1.461). Next-worst is root_cause at 1.526. |

Verbose dump for the failing case:

[1/1] escalate_to_human - FAIL
Expected route: agent 0 (gpt-5), role 1 (Thinker)
Router returned: agent 0 (gpt-5), role 1 (Thinker)
Router input tokens: 15
Debug:
  agent_margin: 2.75645   (snapshot: 2.7706)
  role_margin:  1.02906   (snapshot: 1.4611)   ← below floor 1.06
  agent_logits: [15.93655, 5.65491, -3.6192, -1.06062, 13.1801, -26.01429, 11.04126]
  role_logits:  [0.96171, 5.46475, 4.4357]

Re-running with --min-agent-margin 0.0 --min-role-margin 0.0 passes 37/37 cleanly with no determinism mismatches.
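
For readers following the dump above, the margin arithmetic can be sketched as below. This is a minimal illustration, assuming a margin is the gap between the largest and second-largest logit (which is consistent with the logged values) and that a case passes when the margin is at or above the floor. `MarginSketch` and its function names are hypothetical and not part of the repository.

```elixir
defmodule MarginSketch do
  # Hypothetical helper: margin = largest logit minus second-largest logit.
  def margin(logits) when length(logits) >= 2 do
    [top, second | _rest] = Enum.sort(logits, :desc)
    top - second
  end

  # Assumption: a case passes when its margin is at or above the floor.
  def passes_floor?(logits, floor), do: margin(logits) >= floor
end

# Logits copied from the escalate_to_human dump (already rounded to 5dp there).
role_logits = [0.96171, 5.46475, 4.4357]
agent_logits = [15.93655, 5.65491, -3.6192, -1.06062, 13.1801, -26.01429, 11.04126]

IO.inspect(MarginSketch.passes_floor?(role_logits, 1.06), label: "role passes 1.06 floor")
IO.inspect(MarginSketch.passes_floor?(agent_logits, 0.24), label: "agent passes 0.24 floor")
```

Because the dump prints rounded logits, the recomputed role margin can differ from the logged 1.02906 in the last digit.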

How that maps to your decision tree

The diagnosis in the original PR holds — the Gram-level fix in Emily 0.4.0 is doing its job for argmax. The only behaviour that's actually backend-sensitive here is the gap between the top and second logits, which is exactly what you'd expect from float32 matmul on a different kernel stack.

  • route_hash drift: backend-sensitive. Tells you Emily needs its own snapshot fixture, not that anything deeper has moved.
  • Margin floor near-miss: backend-sensitive too — the existing floors are "80% of CUDA's empirical worst," and Emily's empirical worst differs.
  • Decision-stable fields: identical. transcript_hash matches on all 37 (prompt construction unchanged), token_count matches on all 37 (tokenizer unchanged), agent_id/role_id match on all 37 (router decisions unchanged).
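
To illustrate why route_hash can differ on every case while the decisions stay identical, here is a rough sketch. It assumes the hash is some digest over the decision plus logits rounded to 6 decimal places; the actual hashing scheme is not shown in this PR, so `RouteHashSketch`, the use of `:erlang.md5/1`, and the logit values are all stand-ins for illustration only.

```elixir
defmodule RouteHashSketch do
  # Hypothetical digest over {decision, logits rounded to 6dp}.
  # :erlang.md5/1 is only a stand-in for whatever digest the real code uses.
  def route_hash(agent_id, role_id, logits) do
    rounded = Enum.map(logits, &Float.round(&1, 6))

    {agent_id, role_id, rounded}
    |> :erlang.term_to_binary()
    |> :erlang.md5()
    |> Base.encode16(case: :lower)
  end

  # Index of the largest logit, i.e. the routing decision.
  def decision(logits) do
    logits
    |> Enum.with_index()
    |> Enum.max_by(fn {value, _index} -> value end)
    |> elem(1)
  end
end

# Made-up logits: the second list drifts by 1.0e-5 on one entry.
backend_a = [0.5, 2.0, 1.0]
backend_b = [0.5, 2.00001, 1.0]

IO.inspect(RouteHashSketch.decision(backend_a) == RouteHashSketch.decision(backend_b),
  label: "same decision"
)

IO.inspect(
  RouteHashSketch.route_hash(0, 1, backend_a) == RouteHashSketch.route_hash(0, 1, backend_b),
  label: "same route_hash"
)
```

Any drift that survives rounding to 6dp changes the digest, even when the argmax is untouched.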

For the per-profile ratchet, the data here gives you the numbers if you want them:

  • Emily worst agent_margin: 0.417 (two_assistant_turns) → 80% floor ≈ 0.33
  • Emily worst role_margin: 1.029 (escalate_to_human) → 80% floor ≈ 0.82
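
The 80% rule above is simple enough to check by hand; a small sketch, assuming floors are rounded to two decimal places. `MarginFloorSketch` is a hypothetical name used only for this illustration.

```elixir
defmodule MarginFloorSketch do
  # Floor = ratio * empirical worst margin, rounded to 2 decimal places.
  # The 2dp rounding is an assumption made for this sketch.
  def floor_from_worst(worst_margin, ratio \\ 0.8) do
    Float.round(worst_margin * ratio, 2)
  end
end

IO.inspect(MarginFloorSketch.floor_from_worst(0.417), label: "agent floor")
IO.inspect(MarginFloorSketch.floor_from_worst(1.029), label: "role floor")
```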

What's actually in this branch

| File | Change |
| --- | --- |
| lib/trinity_coordinator/runtime_profile.ex | New :emily clause on resolve/1 mirroring :emlx's intent (no CUDA, qwen_runtime?, export_svd?, default SLM :qwen_coordinator); nx_backend: Emily.Backend. Added :emily to builtin_names/0. |
| lib/trinity_coordinator/runtime.ex | tensor_backend/1 now uses tensor.data.__struct__ for the Emily / BinaryBackend label cases. Emily's Inspect impl drops the module name, so the previous inspect-string match returned "unknown" and the exporter's ensure_export_backend/3 rejected every SVD output. EXLA branches are kept on the inspect-based label form for back-compat. |
| mix.exs | {:emily, "~> 0.4", only: [:dev, :test]}, the same optional/load-only pattern as the existing EMLX comment. |
| mix.lock | Hex-resolved Emily 0.4.0. |

What's NOT in this branch (and where you said you wanted the call)

  • Nothing sets config :nx, default_backend: Emily.Backend globally.
  • Nothing makes "emily" a valid XLA_TARGET.
  • No changes to Coordinator.load/1 semantics — :emily flows through the same path as :emlx.
  • No new snapshot fixture under examples/fixtures/. I did write tmp/emily_router_prompt_eval_logits.json locally (37 cases, matching schema) if you want to use it as the seed for a future emily_* snapshot lane.
  • No MixHelpers, doc, or CHANGELOG updates.

Per-tensor SVD time — read with a caveat

The manifest reports decompose_elapsed_ms = 2 for the embedder and 0 for all others, and reconstruct_elapsed_ms = 0 across the board. That's an artefact of MLX being lazy: Nx.LinAlg.svd/2 returns a future, the host-side timer closes immediately, and the actual GPU work happens later during the Nx.backend_transfer(tensor, Nx.BinaryBackend) inside write_checkpoint — which the exporter doesn't count as part of decompose/reconstruct.

Realistic per-tensor wall-clock from the stdout [debug] writing checkpoint timestamps (ms resolution):

| Tensor | Shape | ~ms (decompose + reconstruct + safetensors write) |
| --- | --- | --- |
| embedder.token_embedding.kernel | {151_936, 1024} bf16 | ~362 |
| language_modeling_head.output.kernel | {151_936, 1024} bf16 | ~354 |
| decoder.blocks.26.ffn.gate.kernel | {1024, 3072} | ~235 |
| decoder.blocks.26.ffn.intermediate.kernel | {1024, 3072} | ~230 |
| decoder.blocks.26.self_attention.{query,key,value,output}.kernel | various 1024/2048 | 74–90 |
| decoder.blocks.26.ffn.output.kernel | {3072, 1024} | ~74 |

If you want the exporter to report honest per-tensor SVD time on lazy backends, dropping an Nx.backend_transfer(_, Nx.BinaryBackend) (or any other sync point) inside the timed_decompose / timed_reconstruct brackets would fix it. Same caveat will apply to :emlx.
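
The timing artefact can be reproduced without MLX. The sketch below simulates a lazy backend with a plain closure: the "SVD" returns a thunk immediately and the work only runs when the thunk is forced. `LazyTimingSketch`, `timed/2`, and the 50 ms sleep are stand-ins invented for illustration; in the real exporter the sync point would be something like a backend transfer on the actual tensors.

```elixir
defmodule LazyTimingSketch do
  # Runs `fun`, passes its result through `sync`, and returns {elapsed_ms, result}.
  # With the default identity sync, a lazy result is never forced inside the timer.
  def timed(fun, sync \\ &Function.identity/1) do
    {micros, result} = :timer.tc(fn -> sync.(fun.()) end)
    {div(micros, 1000), result}
  end
end

# Stand-in for a lazy op: returns immediately with a thunk; the work runs when forced.
lazy_svd = fn ->
  fn ->
    Process.sleep(50)
    :svd_result
  end
end

{unsynced_ms, _thunk} = LazyTimingSketch.timed(lazy_svd)
{synced_ms, :svd_result} = LazyTimingSketch.timed(lazy_svd, fn thunk -> thunk.() end)

IO.puts("without sync: #{unsynced_ms} ms, with sync: #{synced_ms} ms")
```

Without a sync point the timer only sees dispatch cost, which matches the near-zero decompose_elapsed_ms values in the manifest.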

Test plan

  • You read the diff and decide whether to fold any of it into your eventual :emily profile branch — or whether you'd rather rewrite from scratch given the snapshot/ratchet decisions you have to make anyway.
  • Confirm the tensor_backend/1 switch to tensor.data.__struct__ for Emily/BinaryBackend doesn't bother you. The EXLA branches deliberately stayed on the inspect-based form so existing behaviour is unchanged on CUDA / host EXLA.
  • Tell me what shape the :emily snapshot lane should take — I have the Emily-side rounded logits and route_hashes for all 37 cases if you want them.

🤖 Generated with Claude Code

Scaffolds an :emily runtime profile mirroring :emlx's intent but
routing through Emily.Backend (Apple MLX). Lets the export pipeline
and the prompt eval run end-to-end against Emily 0.4.0 on Apple
Silicon for validation passes ahead of upstream integration.

  * runtime_profile.ex: new :emily resolve clause and builtin entry.
  * runtime.ex: tensor_backend/1 now reads tensor.data.__struct__
    so Emily.Backend (whose Inspect impl drops the module name) and
    Nx.BinaryBackend get correct labels; EXLA branches unchanged.
  * mix.exs: {:emily, "~> 0.4", only: [:dev, :test]} pulled from
    Hex; matches the existing EMLX optional-dep pattern.

This is a validation branch, not for upstream merge. The canonical
Apple lane remains :emlx; this profile exists so an Apple host
without EMLX can still exercise the export + prompt eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ausimian
Author

Re-ran the export with a temporary patch that forces graph evaluation inside timed_decompose / timed_reconstruct: Nx.to_number(Nx.sum(_)) on U/S/V after the SVD and on adapted after the reconstruct. Backend-agnostic sync, cheap to transfer, doesn't perturb the EXLA path. Patch is uncommitted and not on the branch in this PR, so the timing caveat in the PR body still applies to anyone running this branch as-is.

Honest per-tensor SVD timings from the same Apple Silicon host, Emily 0.4.0, --svd-compute-type f32:

 #  path                                                shape               decomp_ms  recon_ms
----------------------------------------------------------------------------------------------------
 1  embedder.token_embedding.kernel                     151936x1024               225       128
 2  decoder.blocks.26.self_attention.query.kernel       1024x2048                 166         5
 3  decoder.blocks.26.self_attention.key.kernel         1024x1024                  82         2
 4  decoder.blocks.26.self_attention.value.kernel       1024x1024                  84         3
 5  decoder.blocks.26.self_attention.output.kernel      2048x1024                  55         5
 6  decoder.blocks.26.ffn.gate.kernel                   1024x3072                 210         8
 7  decoder.blocks.26.ffn.intermediate.kernel           1024x3072                 211         7
 8  decoder.blocks.26.ffn.output.kernel                 3072x1024                  83         4
 9  language_modeling_head.output.kernel                151936x1024               235        87
----------------------------------------------------------------------------------------------------
                                                totals                          1351       249  (ms)

Total wall-clock for the task: 5.7s (slightly faster than the first uninstrumented run; MLX's kernel cache was warm from the earlier export).

Shape of the curve:

  • Embedder vs lm_head are symmetric (225 vs 235 ms decompose) — same {151_936, 1024} shape, same thin-SVD cost. No surprise.
  • FFN gate/intermediate (210 / 211 ms) outweigh the embedder per-element because thin-SVD on {1024, 3072} gives k=1024 and the cost is dominated by O(n·k²) = O(3072·1024²). The embedder is taller but skinnier; k=1024 already saturates its smaller dim.
  • Reconstruct cost is tiny (2–8 ms) for everything except embedder (128 ms) and lm_head (87 ms), which is the bf16 cast over the {151_936, 1024} tensor.
  • Total SVD-only cost ≈ 1.6 s; the remaining ~4 s of wall-clock is BEAM startup + Bumblebee load + safetensors writes (the {151_936, 1024} bf16 tensors are ≈ 297 MB each on disk).
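
The totals and the on-disk size quoted above can be re-derived with a few lines. This is only a back-of-envelope check, assuming 2 bytes per bf16 element and MB meaning 2^20 bytes; `ExportCostSketch` is a hypothetical name.

```elixir
defmodule ExportCostSketch do
  # Size of a bf16 tensor in MiB, assuming 2 bytes per element.
  def bf16_mib({rows, cols}), do: rows * cols * 2 / (1024 * 1024)

  # Sum a column of per-tensor timings (milliseconds).
  def total_ms(timings), do: Enum.sum(timings)
end

# Per-tensor values copied from the timing table in this comment.
decomp_ms = [225, 166, 82, 84, 55, 210, 211, 83, 235]
recon_ms = [128, 5, 2, 3, 5, 8, 7, 4, 87]

IO.inspect(ExportCostSketch.total_ms(decomp_ms) + ExportCostSketch.total_ms(recon_ms),
  label: "SVD-only ms"
)

IO.inspect(ExportCostSketch.bf16_mib({151_936, 1024}), label: "bf16 MiB per large tensor")
```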

If you decide it's worth shipping honest timings on lazy backends as part of the :emily integration, the patch is two lines per timed block — Nx.to_number(Nx.sum(_)) on each output before the timer closes. EXLA timings stay essentially unchanged because EXLA already blocks the BEAM caller on Nx.LinAlg.svd/2.

🤖 Generated with Claude Code

Comment on lines +45 to 56
def tensor_backend(%Nx.Tensor{data: %backend_struct{}} = tensor) do
  inspected = inspect(tensor)

  cond do
    String.contains?(inspected, "EXLA.Backend<cuda") -> "EXLA.Backend<cuda:"
    String.contains?(inspected, "EXLA.Backend<host") -> "EXLA.Backend<host:"
    String.contains?(inspected, "EXLA.Backend<") -> "EXLA.Backend"
    backend_struct == Emily.Backend -> "Emily.Backend"
    backend_struct == Nx.BinaryBackend -> "Nx.BinaryBackend"
    String.contains?(inspected, "Nx.BinaryBackend") -> "Nx.BinaryBackend"
    true -> "unknown"
  end

@nshkrdotcom what's the intention behind this selection?
I think it's a bit brittle in that any backend that you want to support ends up having to be explicitly supported.

Owner

@polvalente You're right that the cond was brittle, and that brittleness goes deeper than the one site you flagged. The same shape was duplicated in three near-identical private backend_from_label/1 clauses in Sakana.{Artifact, Head, PythonImporter}, and each silently fell back to Nx.BinaryBackend for any label the cond didn't enumerate. Once tensor_backend/1 started producing generic labels for backends like "EMLX.Backend" or "Emily.Backend" (which it now does in 21c3088), the silent coerce-to-BinaryBackend would have been a correctness hazard in the alignment / transfer call sites, not cosmetics.

Landed on main:

  • lib/trinity_coordinator/runtime.ex: tensor_backend/1 now binds %Nx.Tensor{data: %backend_struct{}} and the default returns backend_struct |> Module.split() |> Enum.join("."). EXLA's <cuda:N> / <host:N> device-info prefixes are preserved on the inspect-based path (they encode device identity into the inspect string, so the inspect form is still the right tool there). Added in 21c3088, hardened in the same commit.
  • New lib/trinity_coordinator/runtime/backend_label.ex with from_label/1 ({:ok, backend_spec} | {:error, {:unknown_backend_label, label}}) and from_label!/1 (logs Logger.warning and falls back to Nx.BinaryBackend for unknown labels — preserves the prior behaviour but makes it audible instead of silent). The three Sakana modules now call the helper; their private cond chains are gone.
  • Phase 2 test coverage in test/trinity_coordinator/runtime_backend_label_test.exs pins the generic-default contract using synthesised fixture backend modules so it runs without a CUDA host. Phase 3 coverage in test/trinity_coordinator/runtime/backend_label_test.exs pins the EMLX label round-trip (the lane that would have been silently broken under the old fallback) plus the Logger.warning on unknown labels.
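
A rough sketch of the generic-label idea described in the bullets above, written without Nx so it runs standalone. `FakeTensor`, `Fake.Emily.Backend`, and `BackendLabelSketch` are fixture names made up for this example, and the `known` list passed to `from_label/2` is an assumption; the real `from_label/1` in the repository may resolve labels differently.

```elixir
defmodule FakeTensor do
  # Stand-in for %Nx.Tensor{}: only the :data field matters here.
  defstruct [:data]
end

defmodule Fake.Emily.Backend do
  # Stand-in backend struct; any module with a struct would do.
  defstruct []
end

defmodule BackendLabelSketch do
  # Generic default: derive the label from the struct module of `data`.
  def label(%{data: %backend_struct{}}) do
    backend_struct |> Module.split() |> Enum.join(".")
  end

  # Hypothetical inverse: look the label up in a list of known backend modules.
  def from_label(label, known) when is_binary(label) do
    case Enum.find(known, fn mod -> inspect(mod) == label end) do
      nil -> {:error, {:unknown_backend_label, label}}
      mod -> {:ok, mod}
    end
  end
end

tensor = struct(FakeTensor, data: struct(Fake.Emily.Backend))
label = BackendLabelSketch.label(tensor)

IO.inspect(label, label: "label")
IO.inspect(BackendLabelSketch.from_label(label, [Fake.Emily.Backend]), label: "round-trip")
IO.inspect(BackendLabelSketch.from_label("Nope.Backend", [Fake.Emily.Backend]), label: "unknown")
```

Returning a tagged error for unknown labels keeps the fallback decision at the call site, which is the property the bang variant then makes audible through a warning.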

The same generalization is what unblocked landing a first-class :emily profile (93cbcae): the four-line def resolve(:emily) clause mirrors :emlx plus ships an empirical margin override, and accepts_backend_label?/2 accepts "Emily.Backend" for free via the generic prefix path you suggested. No per-backend code edits needed.

Thanks for the catch.

Hooks Emily.Bumblebee.FastKernels.apply/1 into Coordinator.load/1
when the :emily runtime profile is selected. Rewrites RMSNorm /
LayerNorm / RoPE / SDPA Axon layers in the loaded Bumblebee model
to call `Emily.Fast.*` helpers, which dispatch to `mx::fast::*`
kernels under Emily.Backend (and fall through to composed-defn
equivalents on any other backend, so the model remains evaluable
on Nx.BinaryBackend / EXLA for conformance).

Validation pass result (qwen_router_prompt_eval, 37 cases,
--determinism-runs 2, Apple Silicon, Emily 0.4.0):

  * Decisions vs CUDA snapshot: 37/37 match (agent_id, role_id,
    token_count, transcript_hash).
  * route_hash drift fast-vs-bare-Emily: 25/37 differ. The two
    implementations are not bitwise-identical; they're equivalent
    at the argmax + margin-floor level.
  * In-process determinism: 37/37 stable across 2 runs.
  * Margin floors: same single near-miss as bare-Emily on
    escalate_to_human (role_margin 1.029, identical to bare).
    Indicates the tight role-margin on that case is structural to
    MLX's matmul/softmax, not a fused-vs-composed artifact.
  * Wall-clock (37 cases × 2 determinism runs, warm cache): 10.57s
    vs ~12.4s bare (~15% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ausimian
Author

Wired Emily.Bumblebee.FastKernels.apply/1 into Coordinator.load/1 under the :emily profile and re-ran the same eval. Commit 45272f1 on this branch. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for Emily.Fast.* helpers; under Emily.Backend those dispatch to mx::fast::* kernels via Nx.block/4, and on any other backend they fall through to composed-defn equivalents (so the rewritten model still evaluates correctly on Nx.BinaryBackend / EXLA for conformance).

| | bare Emily | Emily + FastKernels |
| --- | --- | --- |
| Decisions vs CUDA snapshot (agent_id / role_id / token_count / transcript_hash) | 37/37 match | 37/37 match |
| route_hash drift vs CUDA | 37/37 differ | 37/37 differ |
| route_hash drift fast-vs-bare | n/a | 25/37 differ |
| In-process determinism (--determinism-runs 2) | 37/37 stable | 37/37 stable |
| min_agent_margin = 0.24 floor | 37/37 pass | 37/37 pass |
| min_role_margin = 1.06 floor | 36/37 (escalate_to_human at 1.029) | 36/37 (escalate_to_human at 1.029, bitwise identical to bare) |
| Wall-clock, 37 cases × 2 determinism runs, warm cache | ~12.4s | ~10.57s (~15% faster) |

A couple of observations worth pulling out:

The role-margin near-miss is structural, not a fused-vs-composed artefact. escalate_to_human fails the floor at role_margin = 1.029057502746582 on both lanes — not the same to two decimal places, the literal same float. The rewrite moves logits on 25/37 other cases but happens to leave that one's role logits untouched. That's a strong signal it's MLX's matmul/softmax behaviour on the specific tensor shapes in that prompt that's tighter than CUDA's, not anything the rewriter can or should fix.

No correctness regression. Fast rewrites don't change any decision on any case. Reading it against Paulo's bare-EMLX 37/37, the comparable matrix now looks like:

  • bare EMLX, no Fast → 37/37 (Paulo)
  • bare Emily 0.4.0 → 36/37 (this branch, commit 35af2f0)
  • Emily 0.4.0 + Bumblebee.FastKernels → 36/37 (this branch, commit 45272f1)

The Emily lanes fail identically on escalate_to_human for the same role-margin reason; both reach the same (agent_id, role_id) decision on all 37. The pattern is consistent with "MLX-family backend produces tighter role logits on one borderline case, regardless of whether the per-layer kernels are fused" — i.e. exactly the kind of thing the per-profile margin-floor ratchet you'd write for the upstream :emily lane would absorb.

Throughput. ~15% wall-clock win on this workload, which is the worst case for showcasing FastKernels because the eval is single-forward-pass — no decode loop, no KV cache, no autoregressive sampling. Generative workloads on the same model should see a much bigger relative gain since attention / RoPE / RMSNorm dominate the per-token cost there. For reference, Emily's Emily.Bumblebee.FastKernels covers the same operator set as emlx_axon (RMSNorm / LayerNorm / RoPE — including linear / dynamic / longrope / llama3 scaling — and SDPA with causal + window + key + head + bias masks coalesced to a single additive mask).

What's in commit 45272f1. 23 lines net: three lines to RuntimeProfile.resolve(:emily)'s notes: clarifying that the rewrite now applies under this profile, and a maybe_apply_fast_kernels/2 private helper in Coordinator.load/1 that calls update_in(model_info.model, &Emily.Bumblebee.FastKernels.apply/1) when the profile name is :emily, gated on Code.ensure_loaded?/1 for graceful failure when Emily / Axon / Bumblebee aren't all loaded. Same defn fallbacks the FastKernels shim provides mean this is safe to leave applied even if you were to flip the backend to something else mid-run.

Happy to split this into its own profile (:emily_fast) if you'd rather keep the bare lane available as a baseline upstream — I went with single-profile + rewrite-by-default per your nudge to keep the surface narrow, but the two-profile shape is a one-line revert.

🤖 Generated with Claude Code

nshkrdotcom added a commit that referenced this pull request May 22, 2026
…profile defaults

Lands the five-phase response to ausimian's draft validation PR
(github.com/#1) and polvalente's review
comment on `runtime.ex:56`. The PR itself stays unmerged; what lands here
addresses the underlying issues directly without bundling Emily as a
hard dependency or rewriting the canonical CUDA snapshot/floors.

Sequencing (smallest blast radius first):

Phase 1 — Lazy-backend timing sync.

  `Exporter.sync_tensor!/1` pulls a tensor through `Nx.sum |> to_number`
  inside `timed_decompose/2` and `timed_reconstruct/3` so EMLX/Emily
  futures are materialised before `elapsed_ms` is captured. EXLA was
  already eager; numerics unchanged. Honest per-tensor SVD wall time
  now reports on lazy backends — closes the
  `decompose_elapsed_ms: 2` artefact ausimian's PR called out.

Phase 2 — Generic `tensor_backend/1` default.

  `lib/trinity_coordinator/runtime.ex` keeps the EXLA `<cuda:` / `<host:`
  device-info prefixes intact but replaces the `true -> "unknown"`
  cond arm with `tensor.data.__struct__ |> Module.split() |> Enum.join(".")`.
  New backends (EMLX, Emily, future) round-trip cleanly through
  `accepts_backend_label?/2` without per-backend code edits. Directly
  answers polvalente's review.

Phase 3 — `Runtime.BackendLabel.from_label/1` + `from_label!/1`.

  New helper module replaces three near-duplicate private
  `backend_from_label/1` cond chains in
  `Sakana.{Artifact, Head, PythonImporter}`. EMLX label is recognised
  natively; unknown labels go through `Logger.warning + Nx.BinaryBackend`
  via `from_label!/1` (audible, not silent — preserves prior behaviour
  while making it visible). EMLX label coverage now exists via tests.

Phase 4 — Defer `:emily` as a built-in profile.

  The canonical Apple lane is `:emlx`. Emily is a research backend used
  for validation; Apple operators run it through
  `{:custom, Emily.Backend, []}` plus `override_default_margins/2`.
  `lib/**` carries zero references to Emily — documentation-only in
  `guides/runtime_profiles.md`.

Phase 5 — Per-profile snapshot fixture + margin floors.

  `%RuntimeProfile{}` gains `default_min_agent_margin` /
  `default_min_role_margin` (canonical `0.24` / `1.06` for every
  built-in) plus accessor `default_margins/1` and
  `override_default_margins/2`. The eval entry point picks
  `examples/fixtures/runtime_profiles/<profile>/qwen_router_prompt_eval_logits.json`
  when present, falls through to `nil` (no snapshot drift check)
  otherwise — same default behaviour as pre-Phase-5 unless an explicit
  `--snapshot` is supplied. Explicit `--snapshot` still wins. CUDA
  fixture + floors are bytewise unchanged.

Gates: all green.

  mix format --check-formatted             ✅
  mix compile --warnings-as-errors         ✅
  mix test                                 ✅ 1 doctest, 293 / 0 (24 excluded)
                                              (30 new tests across the phases)
  mix credo --strict                       ✅ 1595 mods/funs, 0 issues
  mix dialyzer                             ✅ 0 errors
  mix docs --warnings-as-errors            ✅ clean
  XLA_TARGET=cuda12 ... mix run examples/qwen_router_prompt_eval.exs
    --determinism-runs 2                   ✅ 37/37 PASS
  Same eval with explicit --snapshot       ✅ 37/37 PASS

See ~/jb/docs/20260521/sakana/pv/emily_validation_response_checklist.md
for the implementation plan + rationale.
@nshkrdotcom
Owner

@ausimian thanks for the validation pass and the FastKernels follow-up; both materially shaped what landed.

What did NOT get merged: the PR itself. The branch is [ahead] of main on mix.lock (stale hf_hub 0.2.0 vs current 0.3.1) and on lib/trinity_coordinator/runtime.ex (your data: %backend_struct{} shape is the right idea, but polvalente's review pushed it to the more general Module.split() |> Enum.join(".") default, which makes the same change cover EMLX / Emily / anything future without the explicit cond arms).

What DID land, attributed:

  1. First-class :emily profile (93cbcae) — your validation pass made the case. The clause mirrors :emlx's Apple-shaped flags but routes to Emily.Backend and pre-seeds the per-profile margin floors at 0.33 / 0.82. Those numbers come straight from your run — agent worst 0.417 on two_assistant_turns, role worst 1.029 on escalate_to_human, 80% rule. So --runtime-profile emily now Just Works on a fresh Apple checkout without --min-agent-margin / --min-role-margin operator overrides.

  2. Lazy-backend timing sync (21c3088 Phase 1) — directly lifted from your "honest per-tensor SVD timings" follow-up comment. Exporter.sync_tensor!/1 is a one-liner tensor |> Nx.sum() |> Nx.to_number() called on u/s/v inside timed_decompose/2 and adapted inside timed_reconstruct/3. EXLA was already eager so its numbers don't move; EMLX/Emily now report real GPU wall time instead of host-side dispatch cost.

  3. Generic backend labeling (21c3088 Phase 2) + BackendLabel.from_label/1 (Phase 3) — polvalente's review on runtime.ex:56 flagged the brittle cond; my fix touched not just tensor_backend/1 but also the three private backend_from_label/1 cond chains in Sakana.{Artifact, Head, PythonImporter} that would have silently coerced Apple-resident tensors to BinaryBackend once Emily labels started flowing through. Tests pin the EMLX label path so the existing :emlx lane is now actually validated, not just :emily.

  4. Per-profile snapshot fixtures (21c3088 Phase 5) — RuntimeProfile.default_margins/1 + override_default_margins/2. The new Examples.QwenRouterPromptEval.SnapshotResolver looks for examples/fixtures/runtime_profiles/<profile>/qwen_router_prompt_eval_logits.json before falling through to nil (no snapshot drift check). Critical detail: we deliberately do NOT auto-fall-back to the legacy CUDA fixture, because that would quietly enable a strict 6dp logit byte-equivalence check for operators who didn't opt in. Existing CI flows pinning the CUDA snapshot keep working unchanged via explicit --snapshot.
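
The resolution order described in item 4 can be sketched as follows. This is an illustration only, not the actual SnapshotResolver: the module name `SnapshotResolverSketch`, the `resolve/3` signature, and the injectable `exists?` function (used so the sketch runs without the fixture files) are assumptions.

```elixir
defmodule SnapshotResolverSketch do
  @fixture_root "examples/fixtures/runtime_profiles"
  @fixture_name "qwen_router_prompt_eval_logits.json"

  # `exists?` is injectable so the lookup can be exercised without real files.
  def resolve(profile, explicit_snapshot, exists? \\ &File.exists?/1)

  # An explicit --snapshot always wins.
  def resolve(_profile, explicit, _exists?) when is_binary(explicit), do: explicit

  # Otherwise use the per-profile fixture if present; nil means "skip the drift check".
  # There is deliberately no fallback to the legacy CUDA fixture.
  def resolve(profile, nil, exists?) do
    path = Path.join([@fixture_root, to_string(profile), @fixture_name])
    if exists?.(path), do: path, else: nil
  end
end

IO.inspect(SnapshotResolverSketch.resolve(:emily, nil, fn _path -> true end))
IO.inspect(SnapshotResolverSketch.resolve(:emily, nil, fn _path -> false end))
IO.inspect(SnapshotResolverSketch.resolve(:emily, "custom.json", fn _path -> false end))
```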

What we still need from you (please)

The Emily snapshot lane is wired but unseeded — we need the actual Emily-rounded logits + route_hashes for all 37 cases. Two options:

  • A. Send the JSON. Run mix run examples/qwen_router_prompt_eval.exs --runtime-profile emily --snapshot-out /tmp/emily_router_prompt_eval_logits.json --determinism-runs 2 on Apple, attach the resulting JSON here (or open a tiny PR adding it at examples/fixtures/runtime_profiles/emily/qwen_router_prompt_eval_logits.json). The file is ~30 KB.

  • B. We don't seed one. :emily works without a snapshot file (the resolver returns nil and the snapshot drift assertion is skipped, just like CUDA today when no --snapshot is passed). It would just be nice for someone re-running the validation later to have a stable reference.

FastKernels (commit 45272f1) — yes please, separately

The +15% wall-clock + zero-decision-drift result is great. But the wiring needs to live on the :emily profile resolution path on our side rather than as a Coordinator.load/1 hot-patch keyed off the profile name — that way :emily_fast and :emily can both exist as built-ins (your "happy to split this into its own profile" offer is what I'd take). Two reasons:

  1. Keep :emily bare-Emily as the reference Apple-research lane (matches Paulo's bare-EMLX 37/37 baseline). That's the lane I want to be able to attribute regressions to upstream Emily / Nx changes without having to subtract a FastKernels effect.
  2. :emily_fast is the operator-pick for production-shaped (generative) workloads. The single forward pass in the prompt eval is the worst case for showcasing FastKernels — you said it yourself; on a decode loop with attention / RoPE / RMSNorm dominating per-token cost the relative gain should be much bigger.

The split is the one-line :emily_fast clause that mirrors :emily plus a :apply_fast_kernels? flag on the profile struct that Coordinator.load/1 reads (so the call site doesn't have to know the profile name). I'd love a PR or branch for that — happy to take it as-is if you'd rather just point me at Emily.Bumblebee.FastKernels.apply/1 and the Nx.block/4 shim and I'll wire the integration myself.
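
One possible shape for that flag-driven wiring, sketched without Emily, Axon, or Bumblebee so it runs standalone. Everything here is hypothetical: `ProfileSketch`, the `:fast_kernels_module` field, `LoadSketch`, and `FakeFastKernels` (a stand-in for Emily.Bumblebee.FastKernels) are names invented for the example, and the real profile struct may carry the flag differently.

```elixir
defmodule ProfileSketch do
  # Hypothetical profile struct: the flag decides, not the profile name.
  defstruct name: nil, apply_fast_kernels?: false, fast_kernels_module: nil
end

defmodule FakeFastKernels do
  # Stand-in rewriter; tags the model so the effect is visible.
  def apply(model), do: {:rewritten, model}
end

defmodule LoadSketch do
  # Applies the rewrite only when the flag is set and the rewriter module is loadable.
  def maybe_apply_fast_kernels(model_info, %ProfileSketch{
        apply_fast_kernels?: true,
        fast_kernels_module: rewriter
      })
      when is_atom(rewriter) and not is_nil(rewriter) do
    if Code.ensure_loaded?(rewriter) and function_exported?(rewriter, :apply, 1) do
      update_in(model_info.model, fn model -> Kernel.apply(rewriter, :apply, [model]) end)
    else
      model_info
    end
  end

  def maybe_apply_fast_kernels(model_info, _profile), do: model_info
end

bare = %ProfileSketch{name: :emily}
fast = %ProfileSketch{name: :emily_fast, apply_fast_kernels?: true, fast_kernels_module: FakeFastKernels}

IO.inspect(LoadSketch.maybe_apply_fast_kernels(%{model: :qwen}, bare))
IO.inspect(LoadSketch.maybe_apply_fast_kernels(%{model: :qwen}, fast))
```

Keying on the flag keeps the load path unaware of profile names, so adding another fast lane later would not require touching the call site.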

The escalate_to_human role-margin near-miss at 1.029057502746582 being bitwise identical between bare-Emily and FastKernels is also a really clean signal — that says the kernel-rewrite layer doesn't touch the specific tensors driving that decision, which is exactly what the per-profile margin floor (now 0.82 for :emily) absorbs without lowering the CUDA floors. Nice diagnosis.

Either way: thank you. Will leave the PR open and close it as wont-merge later. The four commits on main carry your attribution in the messages.
