Validation pass: Emily 0.4.0 on Apple Silicon (37/37 decision-stable; route_hash + one margin-floor near-miss) #1
Scaffolds an :emily runtime profile mirroring :emlx's intent but
routing through Emily.Backend (Apple MLX). Lets the export pipeline
and the prompt eval run end-to-end against Emily 0.4.0 on Apple
Silicon for validation passes ahead of upstream integration.
* runtime_profile.ex: new :emily resolve clause and builtin entry.
* runtime.ex: tensor_backend/1 now reads tensor.data.__struct__
so Emily.Backend (whose Inspect impl drops the module name) and
Nx.BinaryBackend get correct labels; EXLA branches unchanged.
* mix.exs: {:emily, "~> 0.4", only: [:dev, :test]} pulled from
Hex; matches the existing EMLX optional-dep pattern.
This is a validation branch, not for upstream merge. The canonical
Apple lane remains :emlx; this profile exists so an Apple host
without EMLX can still exercise the export + prompt eval suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-ran the export with a temporary patch that forces graph evaluation inside

Honest per-tensor SVD timings from the same Apple Silicon host, Emily 0.4.0,

Total wall-clock for the task: 5.7s (slightly faster than the first uninstrumented run; MLX's kernel cache was warm from the earlier export). Shape of the curve:

If you decide it's worth shipping honest timings on lazy backends as part of the

🤖 Generated with Claude Code
    def tensor_backend(%Nx.Tensor{data: %backend_struct{}} = tensor) do
      inspected = inspect(tensor)

      cond do
        String.contains?(inspected, "EXLA.Backend<cuda") -> "EXLA.Backend<cuda:"
        String.contains?(inspected, "EXLA.Backend<host") -> "EXLA.Backend<host:"
        String.contains?(inspected, "EXLA.Backend<") -> "EXLA.Backend"
        backend_struct == Emily.Backend -> "Emily.Backend"
        backend_struct == Nx.BinaryBackend -> "Nx.BinaryBackend"
        String.contains?(inspected, "Nx.BinaryBackend") -> "Nx.BinaryBackend"
        true -> "unknown"
      end
    end
@nshkrdotcom what's the intention behind this selection?
I think it's a bit brittle in that any backend that you want to support ends up having to be explicitly supported.
@polvalente You're right that the cond was brittle, and that brittleness goes deeper than the one site you flagged. The same shape was duplicated in three near-identical private backend_from_label/1 clauses in Sakana.{Artifact, Head, PythonImporter}, and each silently fell back to Nx.BinaryBackend for any label the cond didn't enumerate. Once tensor_backend/1 started producing generic labels for backends like "EMLX.Backend" or "Emily.Backend" (which it now does in 21c3088), the silent coerce-to-BinaryBackend would have been a correctness hazard in the alignment / transfer call sites, not cosmetics.
Landed on main:
- lib/trinity_coordinator/runtime.ex: tensor_backend/1 now binds %Nx.Tensor{data: %backend_struct{}} and the default returns backend_struct |> Module.split() |> Enum.join("."). EXLA's <cuda:N> / <host:N> device-info prefixes are preserved on the inspect-based path (they encode device identity into the inspect string, so the inspect form is still the right tool there). Added in 21c3088, hardened in the same commit.
- New lib/trinity_coordinator/runtime/backend_label.ex with from_label/1 ({:ok, backend_spec} | {:error, {:unknown_backend_label, label}}) and from_label!/1 (logs Logger.warning and falls back to Nx.BinaryBackend for unknown labels, which preserves the prior behaviour but makes it audible instead of silent). The three Sakana modules now call the helper; their private cond chains are gone.
- Phase 2 test coverage in test/trinity_coordinator/runtime_backend_label_test.exs pins the generic-default contract using synthesised fixture backend modules so it runs without a CUDA host. Phase 3 coverage in test/trinity_coordinator/runtime/backend_label_test.exs pins the EMLX label round-trip (the lane that would have been silently broken under the old fallback) plus the Logger.warning on unknown labels.
The same generalization is what unblocked landing a first-class :emily profile (93cbcae): the four-line def resolve(:emily) clause mirrors :emlx plus ships an empirical margin override, and accepts_backend_label?/2 accepts "Emily.Backend" for free via the generic prefix path you suggested. No per-backend code edits needed.
Thanks for the catch.
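A minimal sketch of the generic-default idea described in this reply, assuming nothing about the real modules beyond what the thread states. `Sketch.Tensor`, `Sketch.FixtureBackend`, and `Sketch.Label` are hypothetical stand-ins (no Nx dependency), and resolving a label through `Module.concat/1` plus `Code.ensure_loaded?/1` is an assumption about one possible approach, not the landed `Runtime.BackendLabel` code.

```elixir
# Hypothetical stand-ins so the sketch runs without Nx or any real backend.
defmodule Sketch.Tensor do
  defstruct [:data]
end

defmodule Sketch.FixtureBackend do
  defstruct []
end

defmodule Sketch.Label do
  # Generic default: bind the backend struct's module in the pattern and
  # render it without the "Elixir." prefix.
  def tensor_backend(%Sketch.Tensor{data: %backend_struct{}}) do
    backend_struct |> Module.split() |> Enum.join(".")
  end

  # Assumed resolution strategy: turn the label back into a module and accept
  # it only if that module is loaded and defines a struct.
  # Note: Module.concat/1 creates atoms, so only feed it trusted labels.
  def from_label(label) when is_binary(label) do
    module = Module.concat([label])

    if Code.ensure_loaded?(module) and function_exported?(module, :__struct__, 0) do
      {:ok, module}
    else
      {:error, {:unknown_backend_label, label}}
    end
  end
end
```

With this shape, a new backend struct gets a usable label without a dedicated `cond` arm, and an unrecognised label surfaces as an explicit error tuple rather than a silent fallback.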
Hooks Emily.Bumblebee.FastKernels.apply/1 into Coordinator.load/1
when the :emily runtime profile is selected. Rewrites RMSNorm /
LayerNorm / RoPE / SDPA Axon layers in the loaded Bumblebee model
to call `Emily.Fast.*` helpers, which dispatch to `mx::fast::*`
kernels under Emily.Backend (and fall through to composed-defn
equivalents on any other backend, so the model remains evaluable
on Nx.BinaryBackend / EXLA for conformance).
Validation pass result (qwen_router_prompt_eval, 37 cases,
--determinism-runs 2, Apple Silicon, Emily 0.4.0):
* Decisions vs CUDA snapshot: 37/37 match (agent_id, role_id,
token_count, transcript_hash).
* route_hash drift fast-vs-bare-Emily: 25/37 differ. The two
implementations are not bitwise-identical; they're equivalent
at the argmax + margin-floor level.
* In-process determinism: 37/37 stable across 2 runs.
* Margin floors: same single near-miss as bare-Emily on
escalate_to_human (role_margin 1.029, identical to bare).
Indicates the tight role-margin on that case is structural to
MLX's matmul/softmax, not a fused-vs-composed artifact.
* Wall-clock (37 cases × 2 determinism runs, warm cache): 10.57s
vs ~12.4s bare (~15% faster).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wired
A couple of observations worth pulling out:

- The role-margin near-miss is structural, not a fused-vs-composed artefact.
- No correctness regression. Fast rewrites don't change any decision on any case.

Reading it against Paulo's bare-EMLX 37/37, the comparable matrix now looks like:

The Emily lanes fail identically on

Throughput. ~15% wall-clock win on this workload, which is the worst case for showcasing FastKernels because the eval is single-forward-pass: no decode loop, no KV cache, no autoregressive sampling. Generative workloads on the same model should see a much bigger relative gain since attention / RoPE / RMSNorm dominate the per-token cost there.

For reference, Emily's

What's in commit

Happy to split this into its own profile (
…profile defaults

Lands the five-phase response to ausimian's draft validation PR (github.com/#1) and polvalente's review comment on `runtime.ex:56`. The PR itself stays unmerged; what lands here addresses the underlying issues directly without bundling Emily as a hard dependency or rewriting the canonical CUDA snapshot/floors.

Sequencing (smallest blast radius first):

Phase 1: Lazy-backend timing sync. `Exporter.sync_tensor!/1` pulls a tensor through `Nx.sum |> to_number` inside `timed_decompose/2` and `timed_reconstruct/3` so EMLX/Emily futures are materialised before `elapsed_ms` is captured. EXLA was already eager; numerics unchanged. Honest per-tensor SVD wall time is now reported on lazy backends, which closes the `decompose_elapsed_ms: 2` artefact ausimian's PR called out.

Phase 2: Generic `tensor_backend/1` default. `lib/trinity_coordinator/runtime.ex` keeps the EXLA `<cuda:` / `<host:` device-info prefixes intact but replaces the `true -> "unknown"` cond arm with `tensor.data.__struct__ |> Module.split() |> Enum.join(".")`. New backends (EMLX, Emily, future) round-trip cleanly through `accepts_backend_label?/2` without per-backend code edits. Directly answers polvalente's review.

Phase 3: `Runtime.BackendLabel.from_label/1` + `from_label!/1`. New helper module replaces three near-duplicate private `backend_from_label/1` cond chains in `Sakana.{Artifact, Head, PythonImporter}`. EMLX label is recognised natively; unknown labels go through `Logger.warning + Nx.BinaryBackend` via `from_label!/1` (audible, not silent; preserves prior behaviour while making it visible). EMLX label coverage now exists via tests.

Phase 4: Defer `:emily` as a built-in profile. The canonical Apple lane is `:emlx`. Emily is a research backend used for validation; Apple operators run it through `{:custom, Emily.Backend, []}` plus `override_default_margins/2`. `lib/**` carries zero references to Emily; documentation-only in `guides/runtime_profiles.md`.

Phase 5: Per-profile snapshot fixture + margin floors. `%RuntimeProfile{}` gains `default_min_agent_margin` / `default_min_role_margin` (canonical `0.24` / `1.06` for every built-in) plus accessor `default_margins/1` and `override_default_margins/2`. The eval entry point picks `examples/fixtures/runtime_profiles/<profile>/qwen_router_prompt_eval_logits.json` when present, falls through to `nil` (no snapshot drift check) otherwise; same default behaviour as pre-Phase-5 unless an explicit `--snapshot` is supplied. Explicit `--snapshot` still wins. CUDA fixture + floors are bytewise unchanged.

Gates: all green.
  mix format --check-formatted ✅
  mix compile --warnings-as-errors ✅
  mix test ✅ 1 doctest, 293 / 0 (24 excluded) (30 new tests across the phases)
  mix credo --strict ✅ 1595 mods/funs, 0 issues
  mix dialyzer ✅ 0 errors
  mix docs --warnings-as-errors ✅ clean
  XLA_TARGET=cuda12 ... mix run examples/qwen_router_prompt_eval.exs --determinism-runs 2 ✅ 37/37 PASS
  Same eval with explicit --snapshot ✅ 37/37 PASS

See ~/jb/docs/20260521/sakana/pv/emily_validation_response_checklist.md for the implementation plan + rationale.
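The Phase 1 timing problem can be illustrated without Nx. This is a minimal sketch, assuming a zero-arity function as a stand-in for a lazy backend's future and `Process.sleep/1` as a stand-in for the deferred GPU work. `LazyTimingSketch` and its functions are hypothetical names for illustration and are not the exporter's `sync_tensor!/1`.

```elixir
defmodule LazyTimingSketch do
  # Stand-in for a lazy operation: returns immediately with a "future"
  # (a zero-arity function) that does the real work only when called.
  def lazy_svd(work_ms) do
    fn ->
      Process.sleep(work_ms)
      :decomposed
    end
  end

  # The timer closes as soon as the future is returned, so the reported
  # time excludes the deferred work.
  def timed_without_sync(work_ms) do
    {micros, _future} = :timer.tc(fn -> lazy_svd(work_ms) end)
    div(micros, 1000)
  end

  # Forcing the future inside the timed bracket makes the measurement
  # include the deferred work.
  def timed_with_sync(work_ms) do
    {micros, _value} = :timer.tc(fn -> lazy_svd(work_ms).() end)
    div(micros, 1000)
  end
end
```

On a real lazy backend the forcing step would be whatever materialises the tensor; the commit message above describes pulling it through a sum and a conversion to a number for that purpose.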
@ausimian Thanks for the validation pass and the FastKernels follow-up; both materially shaped what landed.

What did NOT get merged: the PR itself. The branch is

What DID land, attributed:

What we still need from you (please)
The Emily snapshot lane is wired but unseeded; we need the actual Emily-rounded logits +

FastKernels (commit 45272f1): yes please, separately
The +15% wall-clock + zero-decision-drift result is great. But the wiring needs to live on the

The split is the one-line

The

Either way: thank you. Will leave the PR open and close it as
What this is
Draft / validation-only PR. Posted because you asked for a validation pass against current main from someone with Apple hardware before you shape the upstream :emily integration. Nothing here is intended for merge as-is.

The branch adds a four-line :emily runtime profile (mirroring :emlx's intent) plus the minimum plumbing needed to make the export and prompt-eval suites actually exercise the Emily.Backend lane on this host. It does not touch global Nx defaults, snapshots, margin-floor ratchets, or docs; those are the calls I'm leaving to you.

Result

mix trinity.sakana.export_adapted --force --svd-compute-type f32 --runtime-profile emily --out tmp/emily_adapted_qwen3_0_6b_layer26
- manifest["status"] = "complete", export_complete: true, 9/9 tensors "status":"complete".
- u_backend / s_backend / v_backend / adapted_backend recorded as Emily.Backend.

mix run examples/qwen_router_prompt_eval.exs --runtime-profile emily --artifact-dir tmp/... --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json --determinism-runs 2
- (agent_id, role_id, token_count, transcript_hash)
- route_hash (decision + 6dp logits)
- (--determinism-runs 2)
- min_agent_margin = 0.24 floor: two_assistant_turns at 0.417.
- min_role_margin = 1.06 floor: escalate_to_human at 1.029 (CUDA snapshot had 1.461). Next-worst is root_cause at 1.526.

Verbose dump for the failing case:
Re-running with --min-agent-margin 0.0 --min-role-margin 0.0 passes 37/37 cleanly with no determinism mismatches.

How that maps to your decision tree
The diagnosis in the original PR holds — the Gram-level fix in Emily 0.4.0 is doing its job for argmax. The only behaviour that's actually backend-sensitive here is the gap between the top and second logits, which is exactly what you'd expect from float32 matmul on a different kernel stack.
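As a sketch of what a margin-floor check of this kind looks like, assuming the margin is simply the top logit minus the runner-up (the thread describes it as the gap between the top and second logits, but the eval's exact definition is not shown here). `MarginSketch` and the example logits in the usage note are made up for illustration.

```elixir
defmodule MarginSketch do
  # Assumed definition: gap between the largest and second-largest logit.
  def margin(logits) when is_list(logits) and length(logits) >= 2 do
    [top, second | _rest] = Enum.sort(logits, :desc)
    top - second
  end

  # A case passes when its margin is at or above the configured floor.
  def passes_floor?(logits, floor) do
    margin(logits) >= floor
  end
end
```

For example, `MarginSketch.passes_floor?([0.5, 2.5, -1.0], 1.06)` passes with a gap of 2.0, while a gap just under the floor fails even though the argmax, and therefore the decision, is unchanged.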
For the per-profile ratchet, the data here gives you the numbers if you want them:
- agent_margin: 0.417 (two_assistant_turns) → 80% floor ≈ 0.33
- role_margin: 1.029 (escalate_to_human) → 80% floor ≈ 0.82

What's actually in this branch
- lib/trinity_coordinator/runtime_profile.ex: :emily clause on resolve/1 mirroring :emlx's intent (no CUDA, qwen_runtime?, export_svd?, default SLM :qwen_coordinator); nx_backend: Emily.Backend. Added :emily to builtin_names/0.
- lib/trinity_coordinator/runtime.ex: tensor_backend/1 now uses tensor.data.__struct__ for the Emily / BinaryBackend label cases. Emily's Inspect impl drops the module name, so the previous inspect-string match returned "unknown" and the exporter's ensure_export_backend/3 rejected every SVD output. EXLA branches are kept on the inspect-based label form for back-compat.
- mix.exs: {:emily, "~> 0.4", only: [:dev, :test]}, the same optional/load-only pattern as the existing EMLX comment.
- mix.lock

What's NOT in this branch (and where you said you wanted the call)
- config :nx, default_backend: Emily.Backend globally.
- "emily" a valid XLA_TARGET.
- Coordinator.load/1 semantics; :emily flows through the same path as :emlx.
- examples/fixtures/. I did write tmp/emily_router_prompt_eval_logits.json locally (37 cases, matching schema) if you want to use it as the seed for a future emily_* snapshot lane.
- MixHelpers, doc, or CHANGELOG updates.

Per-tensor SVD time: read with a caveat
The manifest reports decompose_elapsed_ms = 2 for the embedder and 0 for all others, and reconstruct_elapsed_ms = 0 across the board. That's an artefact of MLX being lazy: Nx.LinAlg.svd/2 returns a future, the host-side timer closes immediately, and the actual GPU work happens later during the Nx.backend_transfer(tensor, Nx.BinaryBackend) inside write_checkpoint, which the exporter doesn't count as part of decompose/reconstruct.

Realistic per-tensor wall-clock from the stdout [debug] writing checkpoint timestamps (ms resolution):
- embedder.token_embedding.kernel
- language_modeling_head.output.kernel
- decoder.blocks.26.ffn.gate.kernel
- decoder.blocks.26.ffn.intermediate.kernel
- decoder.blocks.26.self_attention.{query,key,value,output}.kernel
- decoder.blocks.26.ffn.output.kernel

If you want the exporter to report honest per-tensor SVD time on lazy backends, dropping an Nx.backend_transfer(_, Nx.BinaryBackend) (or any other sync point) inside the timed_decompose / timed_reconstruct brackets would fix it. Same caveat will apply to :emlx.

Test plan
- :emily profile branch, or whether you'd rather rewrite from scratch given the snapshot/ratchet decisions you have to make anyway.
- tensor_backend/1 switch to tensor.data.__struct__ for Emily/BinaryBackend doesn't bother you. The EXLA branches deliberately stayed on the inspect-based form so existing behaviour is unchanged on CUDA / host EXLA.
- :emily snapshot lane should take. I have the Emily-side rounded logits and route_hashes for all 37 cases if you want them.