Validation pass: Emily 0.4.0 on Apple Silicon (37/37 decision-stable; route_hash + one margin-floor near-miss) by ausimian · Pull Request #1 · nshkrdotcom/trinity_coordinator

ausimian · 2026-05-21T23:53:45Z

What this is

Draft / validation-only PR. Posted because you asked for a validation pass against current main from someone with Apple hardware before you shape the upstream :emily integration. Nothing here is intended for merge as-is.

The branch adds a four-line :emily runtime profile (mirroring :emlx's intent) plus the minimum plumbing needed to make the export and prompt-eval suites actually exercise the Emily.Backend lane on this host. It does not touch global Nx defaults, snapshots, margin-floor ratchets, or docs — those are the calls I'm leaving to you.

Result

mix trinity.sakana.export_adapted --force --svd-compute-type f32 --runtime-profile emily --out tmp/emily_adapted_qwen3_0_6b_layer26

manifest["status"] = "complete", export_complete: true, 9/9 tensors "status":"complete".
Every tensor's u_backend / s_backend / v_backend / adapted_backend recorded as Emily.Backend.
No OOM on the {151_936, 1024} embedder or lm_head. Thin-SVD path holds.
Wall clock 6.6s total (warm BEAM + Bumblebee load + 9 SVDs + writes).

mix run examples/qwen_router_prompt_eval.exs --runtime-profile emily --artifact-dir tmp/... --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json --determinism-runs 2

| Axis | Emily vs CUDA snapshot |
| --- | --- |
| Decision-stable (`agent_id`, `role_id`, `token_count`, `transcript_hash`) | 0/37 drift. All match exactly. |
| `route_hash` (decision + 6dp logits) | 37/37 differ. Backend-sensitive logit drift on every case. |
| In-process determinism (`--determinism-runs 2`) | 0/37 mismatches. Emily is bytewise deterministic across runs in one BEAM. |
| `min_agent_margin = 0.24` floor | All 37 pass. Worst case `two_assistant_turns` at 0.417. |
| `min_role_margin = 1.06` floor | 1/37 fails. `escalate_to_human` at 1.029 (CUDA snapshot had 1.461). Next-worst is `root_cause` at 1.526. |

Verbose dump for the failing case:

[1/1] escalate_to_human - FAIL
Expected route: agent 0 (gpt-5), role 1 (Thinker)
Router returned: agent 0 (gpt-5), role 1 (Thinker)
Router input tokens: 15
Debug:
  agent_margin: 2.75645   (snapshot: 2.7706)
  role_margin:  1.02906   (snapshot: 1.4611)   ← below floor 1.06
  agent_logits: [15.93655, 5.65491, -3.6192, -1.06062, 13.1801, -26.01429, 11.04126]
  role_logits:  [0.96171, 5.46475, 4.4357]

Re-running with --min-agent-margin 0.0 --min-role-margin 0.0 passes 37/37 cleanly with no determinism mismatches.
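As a quick check on the dump above, here is a minimal sketch of how a margin can be read off a logit vector, assuming a margin is simply the gap between the largest and second-largest logit (the thread describes it that way later, but the eval script's own code is not shown here). `MarginSketch` is a hypothetical name.

```elixir
defmodule MarginSketch do
  # Gap between the largest and the second-largest logit.
  def margin(logits) do
    [top, second | _rest] = Enum.sort(logits, :desc)
    top - second
  end

  # Zero-based index of the largest logit (the routed agent or role).
  def argmax(logits) do
    {_value, index} =
      logits
      |> Enum.with_index()
      |> Enum.max_by(fn {value, _index} -> value end)

    index
  end
end

# Logits copied from the escalate_to_human dump (already rounded to 5 places).
agent_logits = [15.93655, 5.65491, -3.6192, -1.06062, 13.1801, -26.01429, 11.04126]
role_logits = [0.96171, 5.46475, 4.4357]

IO.inspect(MarginSketch.argmax(agent_logits), label: "agent_id")
IO.inspect(MarginSketch.margin(agent_logits), label: "agent_margin")
IO.inspect(MarginSketch.argmax(role_logits), label: "role_id")
IO.inspect(MarginSketch.margin(role_logits), label: "role_margin")
IO.inspect(MarginSketch.margin(role_logits) < 1.06, label: "below role floor?")
```

Because the dumped logits are already rounded, the recomputed role margin lands within about 1e-5 of the reported 1.02906 rather than matching it digit for digit. The decision (agent 0, role 1) is unaffected.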

How that maps to your decision tree

The diagnosis in the original PR holds — the Gram-level fix in Emily 0.4.0 is doing its job for argmax. The only behaviour that's actually backend-sensitive here is the gap between the top and second logits, which is exactly what you'd expect from float32 matmul on a different kernel stack.

route_hash drift: backend-sensitive. Tells you Emily needs its own snapshot fixture, not that anything deeper has moved.
Margin floor near-miss: backend-sensitive too — the existing floors are "80% of CUDA's empirical worst," and Emily's empirical worst differs.
Decision-stable fields: identical. transcript_hash matches on all 37 (prompt construction unchanged), token_count matches on all 37 (tokenizer unchanged), agent_id/role_id match on all 37 (router decisions unchanged).
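To illustrate why `route_hash` can differ on every case while the decisions stay identical, here is a hedged sketch. The real hash construction is not shown in this thread; `RouteHashSketch` and its payload layout are assumptions that only mirror the "decision + 6dp logits" description. The second logit vector is made up for illustration.

```elixir
defmodule RouteHashSketch do
  # Hypothetical construction: hash the decision together with logits
  # rounded to 6 decimal places. The project's real format may differ.
  def route_hash(agent_id, role_id, logits) do
    rounded = Enum.map(logits, &Float.round(&1, 6))
    payload = :erlang.term_to_binary({agent_id, role_id, rounded})

    :sha256
    |> :crypto.hash(payload)
    |> Base.encode16(case: :lower)
  end
end

# Role logits from the Emily dump for escalate_to_human.
emily_logits = [0.96171, 5.46475, 4.4357]
# Made-up logits standing in for another backend: same argmax, wider gap.
other_logits = [0.95, 5.5, 4.04]

emily_hash = RouteHashSketch.route_hash(0, 1, emily_logits)
other_hash = RouteHashSketch.route_hash(0, 1, other_logits)

IO.puts("same decision, same hash? #{emily_hash == other_hash}")
```

Any hash that folds rounded logits into its input behaves this way: it is stable within one backend and changes as soon as a different kernel stack shifts a logit past the sixth decimal place.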

For the per-profile ratchet, the data here gives you the numbers if you want them:

Emily worst agent_margin: 0.417 (two_assistant_turns) → 80% floor ≈ 0.33
Emily worst role_margin: 1.029 (escalate_to_human) → 80% floor ≈ 0.82
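The arithmetic behind those two floors can be sketched as below, assuming the "80% of empirical worst" rule quoted above with rounding to two decimal places. `FloorSketch` is a hypothetical helper, not code from the repository.

```elixir
defmodule FloorSketch do
  # Floor = ratio * worst observed margin, rounded to 2 decimal places.
  def floor_from_worst(worst_margin, ratio \\ 0.8) do
    Float.round(worst_margin * ratio, 2)
  end
end

IO.inspect(FloorSketch.floor_from_worst(0.417), label: "agent floor")
IO.inspect(FloorSketch.floor_from_worst(1.029), label: "role floor")
```

With the Emily worst cases from this run, this gives 0.33 for the agent floor and 0.82 for the role floor.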

What's actually in this branch

| File | Change |
| --- | --- |
| `lib/trinity_coordinator/runtime_profile.ex` | New `:emily` clause on `resolve/1` mirroring `:emlx`'s intent (no CUDA, qwen_runtime?, export_svd?, default SLM `:qwen_coordinator`); `nx_backend: Emily.Backend`. Added `:emily` to `builtin_names/0`. |
| `lib/trinity_coordinator/runtime.ex` | `tensor_backend/1` now uses `tensor.data.__struct__` for the Emily / BinaryBackend label cases. Emily's `Inspect` impl drops the module name, so the previous inspect-string match returned `"unknown"` and the exporter's `ensure_export_backend/3` rejected every SVD output. EXLA branches are kept on the inspect-based label form for back-compat. |
| `mix.exs` | `{:emily, "~> 0.4", only: [:dev, :test]}`, the same optional/load-only pattern as the existing EMLX comment. |
| `mix.lock` | Hex-resolved Emily 0.4.0. |

What's NOT in this branch (and where you said you wanted the call)

Nothing sets config :nx, default_backend: Emily.Backend globally.
Nothing makes \"emily\" a valid XLA_TARGET.
No changes to Coordinator.load/1 semantics — :emily flows through the same path as :emlx.
No new snapshot fixture under examples/fixtures/. I did write tmp/emily_router_prompt_eval_logits.json locally (37 cases, matching schema) if you want to use it as the seed for a future emily_* snapshot lane.
No MixHelpers, doc, or CHANGELOG updates.

Per-tensor SVD time — read with a caveat

The manifest reports decompose_elapsed_ms = 2 for the embedder and 0 for all others, and reconstruct_elapsed_ms = 0 across the board. That's an artefact of MLX being lazy: Nx.LinAlg.svd/2 returns a future, the host-side timer closes immediately, and the actual GPU work happens later during the Nx.backend_transfer(tensor, Nx.BinaryBackend) inside write_checkpoint — which the exporter doesn't count as part of decompose/reconstruct.

Realistic per-tensor wall-clock from the stdout [debug] writing checkpoint timestamps (ms resolution):

| Tensor | Shape | ~ms (decompose + reconstruct + safetensors write) |
| --- | --- | --- |
| `embedder.token_embedding.kernel` | {151_936, 1024} bf16 | ~362 |
| `language_modeling_head.output.kernel` | {151_936, 1024} bf16 | ~354 |
| `decoder.blocks.26.ffn.gate.kernel` | {1024, 3072} | ~235 |
| `decoder.blocks.26.ffn.intermediate.kernel` | {1024, 3072} | ~230 |
| `decoder.blocks.26.self_attention.{query,key,value,output}.kernel` | various 1024/2048 | 74–90 |
| `decoder.blocks.26.ffn.output.kernel` | {3072, 1024} | ~74 |

If you want the exporter to report honest per-tensor SVD time on lazy backends, adding an Nx.backend_transfer(_, Nx.BinaryBackend) (or any other sync point) inside the timed_decompose / timed_reconstruct brackets would fix it. The same caveat will apply to :emlx.
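The effect is easy to reproduce without a GPU. The sketch below is an analogy only: a `Task` stands in for a lazy backend that hands back a handle immediately, and `LazyTimingSketch.timed/2` is a hypothetical helper, not the exporter's `timed_decompose`. The point it shows is that the timer is only honest if something forces the result before the clock stops.

```elixir
defmodule LazyTimingSketch do
  # Runs fun, optionally forces its result with sync, then stops the clock.
  def timed(fun, sync \\ fn result -> result end) do
    started = System.monotonic_time(:millisecond)
    result = fun.() |> sync.()
    {result, System.monotonic_time(:millisecond) - started}
  end
end

# Stand-in for a lazy op: returns at once, the real work finishes later.
lazy_op = fn ->
  Task.async(fn ->
    Process.sleep(50)
    :done
  end)
end

# Without a sync point the timer only sees the dispatch cost.
{task, unsynced_ms} = LazyTimingSketch.timed(lazy_op)
Task.await(task)

# With a sync point inside the bracket the timer sees the real work.
{:done, synced_ms} = LazyTimingSketch.timed(lazy_op, &Task.await/1)

IO.puts("unsynced: #{unsynced_ms} ms, synced: #{synced_ms} ms")
```

In the exporter the `sync` argument would be whatever forces evaluation on the backend in use, such as the transfer to `Nx.BinaryBackend` mentioned above.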

Test plan

You read the diff and decide whether to fold any of it into your eventual :emily profile branch — or whether you'd rather rewrite from scratch given the snapshot/ratchet decisions you have to make anyway.
Confirm the tensor_backend/1 switch to tensor.data.__struct__ for Emily/BinaryBackend doesn't bother you. The EXLA branches deliberately stayed on the inspect-based form so existing behaviour is unchanged on CUDA / host EXLA.
Tell me what shape the :emily snapshot lane should take — I have the Emily-side rounded logits and route_hashes for all 37 cases if you want them.

🤖 Generated with Claude Code

Scaffolds an :emily runtime profile mirroring :emlx's intent but routing through Emily.Backend (Apple MLX). Lets the export pipeline and the prompt eval run end-to-end against Emily 0.4.0 on Apple Silicon for validation passes ahead of upstream integration.

* runtime_profile.ex: new :emily resolve clause and builtin entry.
* runtime.ex: tensor_backend/1 now reads tensor.data.__struct__ so Emily.Backend (whose Inspect impl drops the module name) and Nx.BinaryBackend get correct labels; EXLA branches unchanged.
* mix.exs: {:emily, "~> 0.4", only: [:dev, :test]} pulled from Hex; matches the existing EMLX optional-dep pattern.

This is a validation branch, not for upstream merge. The canonical Apple lane remains :emlx; this profile exists so an Apple host without EMLX can still exercise the export + prompt eval suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ausimian · 2026-05-22T00:00:21Z

Re-ran the export with a temporary patch that forces graph evaluation inside timed_decompose / timed_reconstruct — Nx.to_number(Nx.sum(_)) on U/S/V after the SVD and on adapted after the reconstruct. Backend-agnostic sync, cheap to transfer, doesn't perturb the EXLA path. Patch is uncommitted and not on the branch in this PR — the timing caveat in the PR body still applies to anyone running this branch as-is.

Honest per-tensor SVD timings from the same Apple Silicon host, Emily 0.4.0, --svd-compute-type f32:

 #  path                                                shape               decomp_ms  recon_ms
----------------------------------------------------------------------------------------------------
 1  embedder.token_embedding.kernel                     151936x1024               225       128
 2  decoder.blocks.26.self_attention.query.kernel       1024x2048                 166         5
 3  decoder.blocks.26.self_attention.key.kernel         1024x1024                  82         2
 4  decoder.blocks.26.self_attention.value.kernel       1024x1024                  84         3
 5  decoder.blocks.26.self_attention.output.kernel      2048x1024                  55         5
 6  decoder.blocks.26.ffn.gate.kernel                   1024x3072                 210         8
 7  decoder.blocks.26.ffn.intermediate.kernel           1024x3072                 211         7
 8  decoder.blocks.26.ffn.output.kernel                 3072x1024                  83         4
 9  language_modeling_head.output.kernel                151936x1024               235        87
----------------------------------------------------------------------------------------------------
                                                totals                          1351       249  (ms)

Total wall-clock for the task: 5.7s (slightly faster than the first uninstrumented run; MLX's kernel cache was warm from the earlier export).

Shape of the curve:

Embedder vs lm_head are symmetric (225 vs 235 ms decompose) — same {151_936, 1024} shape, same thin-SVD cost. No surprise.
FFN gate/intermediate (210 / 211 ms) outweigh the embedder per-element because thin-SVD on {1024, 3072} gives k=1024 and the cost is dominated by O(n·k²) = O(3072·1024²). The embedder is taller but skinnier; k=1024 already saturates its smaller dim.
Reconstruct cost is tiny (2–8 ms) for everything except embedder (128 ms) and lm_head (87 ms), which is the bf16 cast over the {151_936, 1024} tensor.
Total SVD-only cost ≈ 1.6 s; the remaining ~4 s of wall-clock is BEAM startup + Bumblebee load + safetensors writes (the {151_936, 1024} bf16 tensors are ≈ 297 MiB each on disk).
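The totals and the on-disk size quoted above can be re-derived from the table with a few lines of arithmetic. This is only a cross-check of the reported numbers; the per-element figures are a rough ratio of measured time to tensor size, not a cost model.

```elixir
# Per-tensor timings copied from the table above, in row order.
decomp_ms = [225, 166, 82, 84, 55, 210, 211, 83, 235]
recon_ms = [128, 5, 2, 3, 5, 8, 7, 4, 87]

svd_total_ms = Enum.sum(decomp_ms) + Enum.sum(recon_ms)
IO.inspect(svd_total_ms, label: "SVD-only total (ms)")

# bf16 stores 2 bytes per element.
bf16_bytes = 151_936 * 1024 * 2
IO.inspect(bf16_bytes / (1024 * 1024), label: "bf16 {151_936, 1024} size (MiB)")

# Measured decompose time per element, in nanoseconds.
embedder_ns_per_elem = 225 * 1.0e6 / (151_936 * 1024)
ffn_gate_ns_per_elem = 210 * 1.0e6 / (1024 * 3072)
IO.inspect({embedder_ns_per_elem, ffn_gate_ns_per_elem}, label: "ns per element")
```

The sums come out at 1351 ms and 249 ms, for 1600 ms in total, and each large bf16 tensor is 296.75 MiB.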

If you decide it's worth shipping honest timings on lazy backends as part of the :emily integration, the patch is two lines per timed block — Nx.to_number(Nx.sum(_)) on each output before the timer closes. EXLA timings stay essentially unchanged because EXLA already blocks the BEAM caller on Nx.LinAlg.svd/2.

🤖 Generated with Claude Code

polvalente · 2026-05-22T01:24:15Z

+  def tensor_backend(%Nx.Tensor{data: %backend_struct{}} = tensor) do
    inspected = inspect(tensor)

    cond do
      String.contains?(inspected, "EXLA.Backend<cuda") -> "EXLA.Backend<cuda:"
      String.contains?(inspected, "EXLA.Backend<host") -> "EXLA.Backend<host:"
      String.contains?(inspected, "EXLA.Backend<") -> "EXLA.Backend"
+      backend_struct == Emily.Backend -> "Emily.Backend"
+      backend_struct == Nx.BinaryBackend -> "Nx.BinaryBackend"
      String.contains?(inspected, "Nx.BinaryBackend") -> "Nx.BinaryBackend"
      true -> "unknown"
    end


@nshkrdotcom what's the intention behind this selection?
I think it's a bit brittle in that any backend that you want to support ends up having to be explicitly supported.

@polvalente You're right that the cond was brittle, and that brittleness goes deeper than the one site you flagged. The same shape was duplicated in three near-identical private backend_from_label/1 clauses in Sakana.{Artifact, Head, PythonImporter}, and each silently fell back to Nx.BinaryBackend for any label the cond didn't enumerate. Once tensor_backend/1 started producing generic labels for backends like "EMLX.Backend" or "Emily.Backend" (which it now does in 21c3088), the silent coerce-to-BinaryBackend would have been a correctness hazard in the alignment / transfer call sites, not cosmetics.

Landed on main:

lib/trinity_coordinator/runtime.ex — tensor_backend/1 now binds %Nx.Tensor{data: %backend_struct{}} and the default returns backend_struct |> Module.split() |> Enum.join("."). EXLA's <cuda:N> / <host:N> device-info prefixes are preserved on the inspect-based path (they encode device identity into the inspect string, so the inspect form is still the right tool there). Added in 21c3088, hardened in the same commit.

New lib/trinity_coordinator/runtime/backend_label.ex with from_label/1 ({:ok, backend_spec} | {:error, {:unknown_backend_label, label}}) and from_label!/1 (logs Logger.warning and falls back to Nx.BinaryBackend for unknown labels — preserves the prior behaviour but makes it audible instead of silent). The three Sakana modules now call the helper; their private cond chains are gone.

Phase 2 test coverage in test/trinity_coordinator/runtime_backend_label_test.exs pins the generic-default contract using synthesised fixture backend modules so it runs without a CUDA host. Phase 3 coverage in test/trinity_coordinator/runtime/backend_label_test.exs pins the EMLX label round-trip (the lane that would have been silently broken under the old fallback) plus the Logger.warning on unknown labels.
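For readers without the repository open, the shape of these two pieces can be sketched roughly as follows. This is an illustration under assumptions: `BackendLabelSketch`, its `@known` table and `Demo.FakeBackend` are hypothetical, and only the return shapes and the `Module.split() |> Enum.join(".")` default are taken from the description above.

```elixir
defmodule BackendLabelSketch do
  require Logger

  # Hypothetical lookup table; the real helper's contents are not shown here.
  @known %{
    "Nx.BinaryBackend" => Nx.BinaryBackend,
    "EMLX.Backend" => EMLX.Backend,
    "Emily.Backend" => Emily.Backend
  }

  # Generic default: derive the label from the struct stored under :data.
  def label(%{data: %backend_struct{}}) do
    backend_struct |> Module.split() |> Enum.join(".")
  end

  # Tagged-tuple variant for callers that want to handle unknown labels.
  def from_label(label) do
    case Map.fetch(@known, label) do
      {:ok, backend} -> {:ok, backend}
      :error -> {:error, {:unknown_backend_label, label}}
    end
  end

  # Bang variant: warns and falls back instead of coercing silently.
  def from_label!(label) do
    case from_label(label) do
      {:ok, backend} ->
        backend

      {:error, _reason} ->
        Logger.warning("unknown backend label #{inspect(label)}, using Nx.BinaryBackend")
        Nx.BinaryBackend
    end
  end
end

# Stand-in backend struct so the sketch runs without Nx installed.
defmodule Demo.FakeBackend do
  defstruct []
end

fake_tensor = %{data: struct(Demo.FakeBackend)}
IO.inspect(BackendLabelSketch.label(fake_tensor), label: "label")
IO.inspect(BackendLabelSketch.from_label("Emily.Backend"), label: "known")
IO.inspect(BackendLabelSketch.from_label("Mystery.Backend"), label: "unknown")
```

The generic `label/1` needs no per-backend clause, which is the property the review asked for; only the reverse mapping has to decide what to do with labels it has never seen.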

The same generalization is what unblocked landing a first-class :emily profile (93cbcae): the four-line def resolve(:emily) clause mirrors :emlx plus ships an empirical margin override, and accepts_backend_label?/2 accepts "Emily.Backend" for free via the generic prefix path you suggested. No per-backend code edits needed.

Thanks for the catch.

Hooks Emily.Bumblebee.FastKernels.apply/1 into Coordinator.load/1 when the :emily runtime profile is selected. Rewrites RMSNorm / LayerNorm / RoPE / SDPA Axon layers in the loaded Bumblebee model to call `Emily.Fast.*` helpers, which dispatch to `mx::fast::*` kernels under Emily.Backend (and fall through to composed-defn equivalents on any other backend, so the model remains evaluable on Nx.BinaryBackend / EXLA for conformance).

Validation pass result (qwen_router_prompt_eval, 37 cases, --determinism-runs 2, Apple Silicon, Emily 0.4.0):

* Decisions vs CUDA snapshot: 37/37 match (agent_id, role_id, token_count, transcript_hash).
* route_hash drift fast-vs-bare-Emily: 25/37 differ. The two implementations are not bitwise-identical; they're equivalent at the argmax + margin-floor level.
* In-process determinism: 37/37 stable across 2 runs.
* Margin floors: same single near-miss as bare-Emily on escalate_to_human (role_margin 1.029, identical to bare). Indicates the tight role-margin on that case is structural to MLX's matmul/softmax, not a fused-vs-composed artifact.
* Wall-clock (37 cases × 2 determinism runs, warm cache): 10.57s vs ~12.4s bare (~15% faster).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ausimian · 2026-05-22T02:36:16Z

Wired Emily.Bumblebee.FastKernels.apply/1 into Coordinator.load/1 under the :emily profile and re-ran the same eval. Commit 45272f1 on this branch. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for Emily.Fast.* helpers; under Emily.Backend those dispatch to mx::fast::* kernels via Nx.block/4, and on any other backend they fall through to composed-defn equivalents (so the rewritten model still evaluates correctly on Nx.BinaryBackend / EXLA for conformance).

| Check | bare Emily | Emily + FastKernels |
| --- | --- | --- |
| Decisions vs CUDA snapshot (`agent_id` / `role_id` / `token_count` / `transcript_hash`) | 37/37 match | 37/37 match |
| `route_hash` drift vs CUDA | 37/37 differ | 37/37 differ |
| `route_hash` drift fast-vs-bare | n/a | 25/37 differ |
| In-process determinism (`--determinism-runs 2`) | 37/37 stable | 37/37 stable |
| `min_agent_margin = 0.24` floor | 37/37 pass | 37/37 pass |
| `min_role_margin = 1.06` floor | 36/37 (escalate_to_human at 1.029) | 36/37 (escalate_to_human at 1.029, bitwise identical to bare) |
| Wall-clock, 37 cases × 2 determinism runs, warm cache | ~12.4s | ~10.57s (~15% faster) |

A couple of observations worth pulling out:

The role-margin near-miss is structural, not a fused-vs-composed artefact. escalate_to_human fails the floor at role_margin = 1.029057502746582 on both lanes: not merely equal to two decimal places, but the identical float. The rewrite moves logits on 25 of the 37 cases but happens to leave this case's role logits untouched. That's a strong signal that the tighter margin comes from MLX's matmul/softmax behaviour on the specific tensor shapes in that prompt, not from anything the rewriter can or should fix.

No correctness regression. Fast rewrites don't change any decision on any case. Reading it against Paulo's bare-EMLX 37/37, the comparable matrix now looks like:

bare EMLX, no Fast → 37/37 (Paulo)
bare Emily 0.4.0 → 36/37 (this branch, commit 35af2f0)
Emily 0.4.0 + Bumblebee.FastKernels → 36/37 (this branch, commit 45272f1)

The Emily lanes fail identically on escalate_to_human for the same role-margin reason; both reach the same (agent_id, role_id) decision on all 37. The pattern is consistent with "MLX-family backend produces tighter role logits on one borderline case, regardless of whether the per-layer kernels are fused" — i.e. exactly the kind of thing the per-profile margin-floor ratchet you'd write for the upstream :emily lane would absorb.

Throughput. ~15% wall-clock win on this workload, which is the worst case for showcasing FastKernels because the eval is single-forward-pass — no decode loop, no KV cache, no autoregressive sampling. Generative workloads on the same model should see a much bigger relative gain since attention / RoPE / RMSNorm dominate the per-token cost there. For reference, Emily's Emily.Bumblebee.FastKernels covers the same operator set as emlx_axon (RMSNorm / LayerNorm / RoPE — including linear / dynamic / longrope / llama3 scaling — and SDPA with causal + window + key + head + bias masks coalesced to a single additive mask).

What's in commit 45272f1. 23 lines net: three lines added to the notes: of RuntimeProfile.resolve(:emily), clarifying that the rewrite now applies under this profile, and a maybe_apply_fast_kernels/2 private helper in Coordinator.load/1 that calls update_in(model_info.model, &Emily.Bumblebee.FastKernels.apply/1) when the profile name is :emily, gated on Code.ensure_loaded?/1 so it degrades gracefully when Emily / Axon / Bumblebee aren't all loaded. Because the FastKernels shim provides defn fallbacks, the rewrite is safe to leave applied even if you were to flip the backend to something else mid-run.
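The gating described above could look roughly like this. It is a sketch, not the committed helper: `FastKernelGateSketch` is a hypothetical module, `model_info` is reduced to a plain map, and `Kernel.apply/3` is used so the sketch compiles on a host where Emily is not installed.

```elixir
defmodule FastKernelGateSketch do
  # Module named in the thread; it may be absent on the current host.
  @rewriter Emily.Bumblebee.FastKernels

  # Only the :emily profile attempts the rewrite, and only if the
  # rewriter module can actually be loaded.
  def maybe_apply_fast_kernels(model_info, :emily) do
    if Code.ensure_loaded?(@rewriter) do
      update_in(model_info.model, fn model -> apply(@rewriter, :apply, [model]) end)
    else
      model_info
    end
  end

  # Every other profile passes the model through untouched.
  def maybe_apply_fast_kernels(model_info, _other_profile), do: model_info
end

model_info = %{model: :axon_model_placeholder, params: %{}}
IO.inspect(FastKernelGateSketch.maybe_apply_fast_kernels(model_info, :emlx))
```

On a machine without Emily the `:emily` clause also returns `model_info` unchanged, which is the graceful-failure behaviour the comment describes.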

Happy to split this into its own profile (:emily_fast) if you'd rather keep the bare lane available as a baseline upstream — I went with single-profile + rewrite-by-default per your nudge to keep the surface narrow, but the two-profile shape is a one-line revert.

🤖 Generated with Claude Code

…profile defaults

Lands the five-phase response to ausimian's draft validation PR (github.com/#1) and polvalente's review comment on `runtime.ex:56`. The PR itself stays unmerged; what lands here addresses the underlying issues directly without bundling Emily as a hard dependency or rewriting the canonical CUDA snapshot/floors.

Sequencing (smallest blast radius first):

Phase 1: Lazy-backend timing sync. `Exporter.sync_tensor!/1` pulls a tensor through `Nx.sum |> to_number` inside `timed_decompose/2` and `timed_reconstruct/3` so EMLX/Emily futures are materialised before `elapsed_ms` is captured. EXLA was already eager; numerics unchanged. Honest per-tensor SVD wall time now reports on lazy backends. Closes the `decompose_elapsed_ms: 2` artefact ausimian's PR called out.

Phase 2: Generic `tensor_backend/1` default. `lib/trinity_coordinator/runtime.ex` keeps the EXLA `<cuda:` / `<host:` device-info prefixes intact but replaces the `true -> "unknown"` cond arm with `tensor.data.__struct__ |> Module.split() |> Enum.join(".")`. New backends (EMLX, Emily, future) round-trip cleanly through `accepts_backend_label?/2` without per-backend code edits. Directly answers polvalente's review.

Phase 3: `Runtime.BackendLabel.from_label/1` + `from_label!/1`. New helper module replaces three near-duplicate private `backend_from_label/1` cond chains in `Sakana.{Artifact, Head, PythonImporter}`. EMLX label is recognised natively; unknown labels go through `Logger.warning + Nx.BinaryBackend` via `from_label!/1` (audible, not silent; preserves prior behaviour while making it visible). EMLX label coverage now exists via tests.

Phase 4: Defer `:emily` as a built-in profile. The canonical Apple lane is `:emlx`. Emily is a research backend used for validation; Apple operators run it through `{:custom, Emily.Backend, []}` plus `override_default_margins/2`. `lib/**` carries zero references to Emily; it is documentation-only in `guides/runtime_profiles.md`.

Phase 5: Per-profile snapshot fixture + margin floors. `%RuntimeProfile{}` gains `default_min_agent_margin` / `default_min_role_margin` (canonical `0.24` / `1.06` for every built-in) plus accessor `default_margins/1` and `override_default_margins/2`. The eval entry point picks `examples/fixtures/runtime_profiles/<profile>/qwen_router_prompt_eval_logits.json` when present and falls through to `nil` (no snapshot drift check) otherwise, which is the same default behaviour as pre-Phase-5 unless an explicit `--snapshot` is supplied. Explicit `--snapshot` still wins. CUDA fixture + floors are bytewise unchanged.

Gates: all green.

mix format --check-formatted ✅
mix compile --warnings-as-errors ✅
mix test ✅ 1 doctest, 293 / 0 (24 excluded) (30 new tests across the phases)
mix credo --strict ✅ 1595 mods/funs, 0 issues
mix dialyzer ✅ 0 errors
mix docs --warnings-as-errors ✅ clean
XLA_TARGET=cuda12 ... mix run examples/qwen_router_prompt_eval.exs --determinism-runs 2 ✅ 37/37 PASS
Same eval with explicit --snapshot ✅ 37/37 PASS

See ~/jb/docs/20260521/sakana/pv/emily_validation_response_checklist.md for the implementation plan + rationale.

nshkrdotcom · 2026-05-22T03:50:30Z

@ausimian Thanks for the validation pass and the FastKernels follow-up. Both materially shaped what landed.

What did NOT get merged: the PR itself. The branch is [ahead] of main on mix.lock (stale hf_hub 0.2.0 vs current 0.3.1) and on lib/trinity_coordinator/runtime.ex (your data: %backend_struct{} shape is the right idea, but polvalente's review pushed it to the more general Module.split() |> Enum.join(".") default, which makes the same change cover EMLX / Emily / anything future without the explicit cond arms).

What DID land, attributed:

First-class :emily profile (93cbcae) — your validation pass made the case. The clause mirrors :emlx's Apple-shaped flags but routes to Emily.Backend and pre-seeds the per-profile margin floors at 0.33 / 0.82. Those numbers come straight from your run — agent worst 0.417 on two_assistant_turns, role worst 1.029 on escalate_to_human, 80% rule. So --runtime-profile emily now Just Works on a fresh Apple checkout without --min-agent-margin / --min-role-margin operator overrides.
Lazy-backend timing sync (21c3088 Phase 1) — directly lifted from your "honest per-tensor SVD timings" follow-up comment. Exporter.sync_tensor!/1 is a one-liner tensor |> Nx.sum() |> Nx.to_number() called on u/s/v inside timed_decompose/2 and adapted inside timed_reconstruct/3. EXLA was already eager so its numbers don't move; EMLX/Emily now report real GPU wall time instead of host-side dispatch cost.
Generic backend labeling (21c3088 Phase 2) + BackendLabel.from_label/1 (Phase 3) — polvalente's review on runtime.ex:56 flagged the brittle cond; my fix touched not just tensor_backend/1 but also the three private backend_from_label/1 cond chains in Sakana.{Artifact, Head, PythonImporter} that would have silently coerced Apple-resident tensors to BinaryBackend once Emily labels started flowing through. Tests pin the EMLX label path so the existing :emlx lane is now actually validated, not just :emily.
Per-profile snapshot fixtures (21c3088 Phase 5) — RuntimeProfile.default_margins/1 + override_default_margins/2. The new Examples.QwenRouterPromptEval.SnapshotResolver looks for examples/fixtures/runtime_profiles/<profile>/qwen_router_prompt_eval_logits.json before falling through to nil (no snapshot drift check). Critical detail: we deliberately do NOT auto-fall-back to the legacy CUDA fixture, because that would quietly enable a strict 6dp logit byte-equivalence check for operators who didn't opt in. Existing CI flows pinning the CUDA snapshot keep working unchanged via explicit --snapshot.
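The lookup order described in the last item can be sketched as below. `SnapshotResolverSketch` is a hypothetical stand-in for `Examples.QwenRouterPromptEval.SnapshotResolver`, whose real signature is not shown in this thread; only the fixture path layout and the "explicit wins, otherwise per-profile file, otherwise nil" order come from the text.

```elixir
defmodule SnapshotResolverSketch do
  @fixture "qwen_router_prompt_eval_logits.json"
  @default_root "examples/fixtures/runtime_profiles"

  def resolve(profile, explicit_snapshot, root \\ @default_root)

  # An explicit --snapshot path always wins.
  def resolve(_profile, explicit_snapshot, _root) when is_binary(explicit_snapshot) do
    explicit_snapshot
  end

  # Otherwise use the per-profile fixture if it exists, else nil, which
  # means the snapshot drift check is skipped. There is deliberately no
  # fallback to another profile's fixture.
  def resolve(profile, nil, root) do
    path = Path.join([root, to_string(profile), @fixture])
    if File.exists?(path), do: path, else: nil
  end
end

IO.inspect(SnapshotResolverSketch.resolve(:emily, nil), label: "emily")
IO.inspect(SnapshotResolverSketch.resolve(:emily, "custom.json"), label: "explicit")
```

Returning `nil` rather than a legacy path keeps the strict 6dp comparison opt-in, which is the design choice called out above.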

What we still need from you (please)

The Emily snapshot lane is wired but unseeded — we need the actual Emily-rounded logits + route_hashes for all 37 cases. Two options:

A. Send the JSON. Run mix run examples/qwen_router_prompt_eval.exs --runtime-profile emily --snapshot-out /tmp/emily_router_prompt_eval_logits.json --determinism-runs 2 on Apple, attach the resulting JSON here (or open a tiny PR adding it at examples/fixtures/runtime_profiles/emily/qwen_router_prompt_eval_logits.json). The file is ~30 KB.
B. We don't seed one. :emily works without a snapshot file (the resolver returns nil and the snapshot drift assertion is skipped, just like CUDA today when no --snapshot is passed). It would just be nice for someone re-running the validation later to have a stable reference.

FastKernels (commit `45272f1`) — yes please, separately

The +15% wall-clock + zero-decision-drift result is great. But the wiring needs to live on the :emily profile resolution path on our side rather than as a Coordinator.load/1 hot-patch keyed off the profile name — that way :emily_fast and :emily can both exist as built-ins (your "happy to split this into its own profile" offer is what I'd take). Two reasons:

Keep :emily bare-Emily as the reference Apple-research lane (matches Paulo's bare-EMLX 37/37 baseline). That's the lane I want to be able to attribute regressions to upstream Emily / Nx changes without having to subtract a FastKernels effect.
:emily_fast is the operator-pick for production-shaped (generative) workloads. The single forward pass in the prompt eval is the worst case for showcasing FastKernels — you said it yourself; on a decode loop with attention / RoPE / RMSNorm dominating per-token cost the relative gain should be much bigger.

The split is the one-line :emily_fast clause that mirrors :emily plus a :apply_fast_kernels? flag on the profile struct that Coordinator.load/1 reads (so the call site doesn't have to know the profile name). I'd love a PR or branch for that — happy to take it as-is if you'd rather just point me at Emily.Bumblebee.FastKernels.apply/1 and the Nx.block/4 shim and I'll wire the integration myself.
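One possible shape for that split, as a sketch only: `ProfileSketch` and `LoadSketch` are hypothetical names, the struct carries just enough fields to show the idea, and the rewrite is passed in as a function so the example runs without Emily installed.

```elixir
defmodule ProfileSketch do
  # Reduced stand-in for the runtime profile struct.
  defstruct name: nil, nx_backend: nil, apply_fast_kernels?: false

  # Bare lane: the reference profile with no kernel rewrite.
  def resolve(:emily), do: %__MODULE__{name: :emily, nx_backend: Emily.Backend}

  # Fast lane: same as :emily, with the rewrite flag switched on.
  def resolve(:emily_fast) do
    %{resolve(:emily) | name: :emily_fast, apply_fast_kernels?: true}
  end
end

defmodule LoadSketch do
  # The call site reads the flag and never inspects the profile name.
  def load(%ProfileSketch{apply_fast_kernels?: true}, model_info, rewrite) do
    update_in(model_info.model, rewrite)
  end

  def load(%ProfileSketch{}, model_info, _rewrite), do: model_info
end

rewrite = fn :bare_model -> :rewritten_model end
model_info = %{model: :bare_model}

IO.inspect(LoadSketch.load(ProfileSketch.resolve(:emily), model_info, rewrite))
IO.inspect(LoadSketch.load(ProfileSketch.resolve(:emily_fast), model_info, rewrite))
```

Keeping the decision in a struct field means a future profile can opt in to the rewrite without touching the loader.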

The escalate_to_human role-margin near-miss at 1.029057502746582 being bitwise identical between bare-Emily and FastKernels is also a really clean signal — that says the kernel-rewrite layer doesn't touch the specific tensors driving that decision, which is exactly what the per-profile margin floor (now 0.82 for :emily) absorbs without lowering the CUDA floors. Nice diagnosis.

Either way: thank you. Will leave the PR open and close it as wont-merge later. The four commits on main carry your attribution in the messages.

polvalente reviewed May 22, 2026

ausimian wants to merge 2 commits into nshkrdotcom:main from ausimian:emily-0.4-validation-pass
