From fae3c923f13c8c6608cf27df2f22c2e7b69d8006 Mon Sep 17 00:00:00 2001
From: Benjamin Demaille
Date: Sat, 25 Apr 2026 22:53:59 +0200
Subject: [PATCH 001/282] phase-0: bootstrap v2 native port
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fresh branch from origin/r1.10 for Apple Silicon native port: pure C++ runtime with Core ML inference, Homebrew distribution, zero Python at runtime.

- PORT_LOG.md: system snapshot (M4 Max / 128 GB / macOS 26.4.1) + key decisions (Bazel→CMake, CLT-only Xcode is sufficient, ship .mlpackage uncompiled and let Core ML compile at first load).
- CLAUDE.md: project memory — hard constraints, working rules, stop conditions, known pitfalls (TF 2.20 + coremltools 9 hang, ANE rejects >4-channel, Metal nondeterminism, etc.).
- docs/architecture.md: Phase 0 ADR scaffold (filled by bench results).
- docs/packaging.md: Phase 5 distribution doc — single Mach-O + separate deepvariant-models formula, sign+notarize via Developer ID, .mlpackage ship/runtime-compile strategy.
- .gitignore: ignore venvs, build dirs, model artefacts, large fixtures.
Plan: ~/.claude/plans/prompt-deepvariant-apple-idempotent-peacock.md

Co-Authored-By: Claude Opus 4.7
---
 .gitignore           | 27 ++++++++++++++
 CLAUDE.md            | 84 ++++++++++++++++++++++++++++++++++++++++++++
 PORT_LOG.md          | 71 +++++++++++++++++++++++++++++++++++++
 docs/architecture.md | 68 +++++++++++++++++++++++++++++++++++
 docs/packaging.md    | 83 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 333 insertions(+)
 create mode 100644 CLAUDE.md
 create mode 100644 PORT_LOG.md
 create mode 100644 docs/architecture.md
 create mode 100644 docs/packaging.md

diff --git a/.gitignore b/.gitignore
index 12c8b0546..f6fba89d3 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,30 @@ bazel-*
 **/.ipynb_checkpoints
+
+# v2 — Apple Silicon native port
+build/
+build-*/
+.cache/
+**/__pycache__/
+tools/conversion/venv-*/
+tools/conversion/.cache/
+tools/conversion/models/
+tools/reference/cache/
+tools/reference/output/
+benchmarks/runs/
+benchmarks/*.log
+testdata/reference/large/
+*.mlpackage
+*.mlmodelc
+*.tfrecord
+*.bam
+*.bai
+*.fa
+*.fai
+*.fa.gz
+*.vcf
+*.vcf.gz
+*.tbi
+*.bed
+.DS_Store
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..403415185
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,84 @@
+# CLAUDE.md — DeepVariant Apple Silicon Native Port (v2)
+
+Project memory for AI-assisted work on `feature/apple-silicon-native-v2`.
+
+## What this branch is
+
+A fresh-start port of Google DeepVariant (and DeepTrio, DeepSomatic, pangenome-aware DV) to a single, fully native arm64 binary on Apple Silicon, distributed via Homebrew, with Apple Metal GPU + ANE inference and **zero Python interpreter at runtime**.
+
+Authoritative plan: `~/.claude/plans/prompt-deepvariant-apple-idempotent-peacock.md`.
+Running log: `PORT_LOG.md`.
+
+## Hard constraints (non-negotiable)
+
+- macOS ≥ 14, arm64 only.
+- No Docker / no Rosetta / no CUDA / no embedded Python interpreter shipped to the user.
+- Build is reproducible. User installs in one Homebrew command, no compilation on their box.
+- **Scientific accuracy preserved**: SNP F1 ≥ reference − 0.05 %, INDEL F1 ≥ reference − 0.10 %.
+- **GPU truly engaged**: verified by `powermetrics --samplers gpu_power,ane_power` showing non-zero residency.
+- **Speedup ≥ 2.5×** vs published Linux x86 reference.
+
+## Working rules
+
+1. **Test before commit.** Every commit must leave the build green: `cmake --build build && ctest -V` (after Phase 1) or, for Phase 0, the conversion + parity-check scripts must run end-to-end.
+2. **Never degrade scientific precision.** F1 thresholds are gates, not goals. If we slip below, we fix the root cause — we do not lower the bar.
+3. **Never bypass an error.** No `--no-verify`, no swallowed exceptions, no commenting out of failing tests. Diagnose the root cause.
+4. **Document every critical decision** in `PORT_LOG.md` with date, context, alternatives considered, and rationale.
+5. **Don't touch the v1 worktree** at `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/`. v1 is a separate clone retained as research; v2 is its own fresh history.
+6. **Don't modify upstream `BUILD` / Bazel rules.** They stay as a Linux/Bazel reference. v2 builds via CMake on macOS only.
+7. **No half-finished implementations.** Each phase has a success gate; do not cross it without meeting the gate.
+
+## Stop conditions (per spec)
+
+If any of the following happen, stop, write a report in `PORT_LOG.md`, and surface to the user:
+
+- Scientific precision regresses below the F1 thresholds and cannot be recovered.
+- The GPU/ANE cannot be engaged in a way that's stable and verifiable.
+- A required dependency cannot be made portable (e.g., a transitive lib that won't build statically on arm64).
+
+## Priority order (when trade-offs collide)
+
+1. Scientific exactness.
+2. Robustness.
+3. User simplicity (one-command install, no setup).
+4. Performance.
+
+## Phase stop-points (mandatory user review)
+
+- After **Phase 0 ADR** — framework choice (Core ML vs MLX vs tf-metal). Irreversible without large rework.
+- After **Phase 1** green CMake build — confirms TF detangling worked.
+- After **Phase 3** first end-to-end native run — first real VCF produced.
+- After **Phase 4** validation — release go/no-go.
+
+## Pitfalls already known (mine before re-discovering)
+
+- **TF 2.20 + coremltools 9 hangs** during WGS SavedModel conversion (21 min / 3 GB RSS). Pin TF 2.16.x in the conversion venv.
+- **tensorflow-metal 1.2.0** is frozen at TF 2.16; M-series ReLU bugs reported. Include in bench, expect to lose.
+- **ANE prefers 4-channel image-shaped tensors.** Our model is 7- or 12-channel. ANE may refuse — accept GPU-only fallback.
+- **Metal compute is not bitwise reproducible** across some ops/reboots. Validate via softmax tolerance (≤1e-3) + argmax agreement, not bit-equality.
+- **`build-prereq.sh` is Linux-only.** v2 ships `scripts/build-prereq-macos.sh`.
+- **8.5 GB of model artifacts** can't fit in a single Homebrew bottle alongside the binary. Split into `deepvariant-models` formula.
+- **Xcode CLT is enough — no full Xcode required.** Ship `.mlpackage` uncompiled; runtime compiles on first load via `MLModel compileModelAtURL:error:`. Avoid `xcrun coremlcompiler` (full Xcode only).
+
+## Key file paths
+
+- Plan: `~/.claude/plans/prompt-deepvariant-apple-idempotent-peacock.md`
+- v2 root: `/Users/benjamin/deepvariant`
+- v1 reference clone: `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/` (read-only)
+- Native runtime (Phases 2-3): `deepvariant/native/`
+- Build (Phase 1): `CMakeLists.txt` + `cmake/*.cmake`
+- Conversion (Phase 0, dev-time): `tools/conversion/`
+- Linux ref capture (Phase 0): `tools/reference/`
+- Release tooling (Phase 5): `release/`
+- Homebrew formulas (Phase 6): separate repo `homebrew-deepvariant/`
+
+## Reused upstream C++ (do not rewrite)
+
+These are the multipliers that make v2 feasible. Wrap, don't rewrite:
+
+- `deepvariant/make_examples_native.cc`
+- `deepvariant/pileup_image_native.cc`
+- `deepvariant/allelecounter.cc`
+- `deepvariant/realigner/{fast_pass_aligner,debruijn_graph,ssw,window_selector}.cc`
+- `deepvariant/{direct_phasing,merge_variants,merge_phased_reads,postprocess_variants}.cc`
+- `third_party/nucleus/io/{sam_reader,vcf_reader,vcf_writer,reference,gbz_reader}.cc`
diff --git a/PORT_LOG.md b/PORT_LOG.md
new file mode 100644
index 000000000..085b7e295
--- /dev/null
+++ b/PORT_LOG.md
@@ -0,0 +1,71 @@
+# DeepVariant Apple Silicon Native Port — v2 PORT_LOG
+
+Running log of decisions, gotchas, and progress on `feature/apple-silicon-native-v2`.
+
+Plan reference: `~/.claude/plans/prompt-deepvariant-apple-idempotent-peacock.md`.
+
+## 2026-04-25 — Phase 0 bootstrap
+
+Branch `feature/apple-silicon-native-v2` created from `origin/r1.10` at commit `45f26275`.
+
+Scaffolding directories created:
+- `patches/` — local patches against vendored deps and upstream sources.
+- `benchmarks/` — Phase 0 latency / GPU residency captures.
+- `packaging/` — release artifacts and bottle staging.
+- `tools/conversion/` — dev-time TF→{Core ML, MLX, tf-metal} converter scripts.
+- `tools/reference/` — one-time Linux x86 reference capture under Docker emulation.
+- `release/` — sign, notarize, model-conversion CI scripts.
+- `cmake/` — CMake module files (Phase 1).
+- `deepvariant/native/` — new pure-C++/Obj-C++ runtime (Phases 2-3).
+- `validation/` — GIAB hap.py harness (Phase 4) and virgin-machine checklist (Phase 7).
+
+### System snapshot
+
+| Item | Value |
+|---|---|
+| Date | 2026-04-25T22:49:07+0200 |
+| OS | macOS 26.4.1 (build 25E253) |
+| Arch | arm64 |
+| CPU | Apple M4 Max |
+| RAM | 128 GB unified |
+| Xcode | **CLT only** (`/Library/Developer/CommandLineTools`) — sufficient (see decision below). |
+| Apple Clang | 21.0.0 (clang-2100.0.123.102) |
+| CMake | 4.3.2 (Homebrew) |
+| protoc (system) | libprotoc 34.1 — informational only; we vendor protobuf 21.9 statically per plan |
+| Python (system) | 3.12.13 |
+| pyenv | 2.6.27 — used for the three Phase 0 conversion venvs |
+| Docker | 29.2.1 — used **dev-time only** to capture Linux x86 reference under qemu emulation; never shipped |
+| Homebrew | 5.1.7 (`/opt/homebrew/bin/brew`) |
+
+### Notes from prior v1 attempt
+
+A previous v1 worktree exists at `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/` (separate clone, not a worktree of this repo). v1 reached a Phase 0 ADR favoring Core ML and bumped Bazel/Python/TF toolchain pins. v2 is a fresh start per the user's choice. v1 findings retained for reference only:
+- TF 2.20 + coremltools 9 hangs at 21 min / 3 GB RSS during real WGS SavedModel→Core ML conversion. **v2 will pin TF 2.16.x in the conversion venv.**
+- `tensorflow-metal` is frozen at TF 2.16 since mid-2024 and reports M-series ReLU bugs. v2 includes it in the bench for completeness but expectation is Core ML or MLX wins.
+- `make_examples_native.cc`, `pileup_image_native.cc`, `allelecounter.cc`, the realigner C++, and `direct_phasing.cc` are all reusable — they form the multipliers that make v2 feasible.
+
+### Build system: Bazel → CMake (decided)
+
+v2 abandons Bazel for the native build. Upstream's Bazel rules transitively require `@org_tensorflow`, which we do not want at runtime. CMake is ~equivalent effort and produces a self-contained TF-free graph. Upstream `BUILD` files are left untouched as a Linux/Bazel reference for cross-checking.
+
+### Xcode CLT only — no full Xcode needed (decided)
+
+The plan flagged full Xcode as a possible Phase 5 requirement (for `xcrun coremlcompiler`). Re-evaluated: not needed.
+
+- `coremlcompiler` is bundled with the full Xcode app (`Xcode.app/Contents/Developer/usr/bin/coremlcompiler`) and pre-compiles `.mlpackage` → `.mlmodelc`.
+- Alternative: ship `.mlpackage` uncompiled in `deepvariant-models`; the binary calls `[MLModel compileModelAtURL:url error:&err]` at first load. Result is cached by Core ML in `~/Library/Caches/com.apple.CoreML/`. No Xcode needed anywhere.
+- Cost: first run after install adds ~few seconds per model used while Core ML compiles. Subsequent runs are unaffected. We log a clear `Compiling Core ML model for first run…` line.
+- Everything else (`clang`, `codesign`, `xcrun notarytool`, `xcrun stapler`, `MacOSX.sdk` with `CoreML.framework` headers) ships with CLT.
+
+This keeps the build/release machine on CLT only, which is also more reproducible (CLT versions are easier to pin than Xcode versions).
+
+### Next milestone — Phase 0 step 1
+
+Create three pinned conversion venvs in `tools/conversion/`:
+- `venv-coreml` (Python 3.11, TF 2.16.2, coremltools 7.2)
+- `venv-metal` (Python 3.11, TF 2.16.2, tensorflow-metal 1.2)
+- `venv-mlx` (Python 3.11, MLX 0.21+)
+
+Pull `gs://deepvariant/models/DeepVariant/1.10.0/wgs/` SavedModel into a pinned cache.
+
+Then proceed with `convert_coreml.py`, `convert_metal.py`, `convert_mlx.py`.
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 000000000..16e12130a
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,68 @@
+# Architecture Decision Record — Inference Framework on Apple Silicon
+
+**Status:** Draft (Phase 0 in progress).
+**Branch:** `feature/apple-silicon-native-v2`.
+
+## Context
+
+DeepVariant's inference stage (`call_variants`) loads a TensorFlow SavedModel and runs Inception-v3 inference. Input shape `(N, 100, 221, 7)` for germline / `(N, 100, 221, 12)` for pangenome, output `(N, 3)` softmax. 14 stock TF ops, no custom ops.
+
+For the Apple Silicon native port, we must pick a GPU runtime that:
+- Runs natively on arm64 macOS without TensorFlow at runtime.
+- Engages Metal (and ideally ANE) verifiably.
+- Preserves softmax accuracy within ≤1e-3 of Linux x86 reference.
+- Achieves ≥2.5× throughput vs published Linux x86 reference.
+- Has a viable conversion path from the existing TF SavedModel.
+
+Three candidates per the user's prompt:
+- **Voie A:** `tensorflow-metal`.
+- **Voie B:** Core ML via `coremltools`.
+- **Voie C:** Apple MLX.
+
+## Decision
+
+_To be filled in after Phase 0 measurements (target: end of week 1)._
+
+## Bench plan
+
+For each candidate:
+
+1. Convert the real `gs://deepvariant/models/DeepVariant/1.10.0/wgs/` SavedModel.
+2. Bench inference at batch 1 / 8 / 32 / 128 / 1024 on M4 Max:
+   - Wall-clock latency per batch.
+   - Throughput (examples/sec).
+   - Peak RSS.
+   - GPU power & ANE power residency from `powermetrics --samplers gpu_power,ane_power -i 500` captured on a side thread.
+3. Parity vs Linux x86 reference (one-time captured under Docker emulation in Phase 0):
+   - Max-abs softmax difference on a 1000-example chr20 set.
+   - Argmax disagreement rate (must be 0).
+4. Repeat for the 12-channel pangenome model (different input shape — may force a different conversion path).
+
+## Constraints to verify against
+
+- **Conversion stability:** v1 documented that TF 2.20 + coremltools 9 hangs at 21 min / 3 GB during WGS conversion. v2 pins TF 2.16.2 + coremltools 7.2 for Voie B. If conversion still hangs, fall back to SavedModel→ONNX→Core ML.
+- **ANE compatibility:** ANE prefers 4-ch image tensors; 7-ch / 12-ch may force GPU-only.
+- **Metal determinism:** softmax tolerance (≤1e-3), not bit-equality.
+- **Pangenome 12-ch path:** if Core ML rejects, this voie loses for pangenome and we may end up with a hybrid (Core ML for germline, MLX for pangenome) or fall to MLX entirely.
+
+## Initial framework health snapshot (2026-04-25)
+
+| Framework | Latest | Last release | TF version cap | Initial verdict |
+|---|---|---|---|---|
+| `tensorflow-metal` | 1.2.0 | mid-2024 | TF 2.16 only | Stale; M-series ReLU bugs reported. Risky as a production target. |
+| `coremltools` | 9.0 | active | n/a (own runtime) | Solid path. macOS 14+ → ANE + GPU. v2 pins to 7.2 because of the v1 hang with 9.0. |
+| `mlx` | 0.31.2 | monthly | n/a (own runtime) | Apple's strategic ML framework. Best perf on M3/M4 in published benchmarks. SavedModel ingestion via custom converter. |
+
+## Result
+
+_Numbers and decision pending Phase 0 execution._
+
+| Voie | Latency b=128 (ms) | Throughput (ex/s) | GPU residency | ANE engaged | softmax max-abs | argmax disagree | Notes |
+|---|---|---|---|---|---|---|---|
+| A — tf-metal | TBD | TBD | TBD | n/a | TBD | TBD | |
+| B — Core ML | TBD | TBD | TBD | TBD | TBD | TBD | |
+| C — MLX | TBD | TBD | TBD | n/a (Metal only) | TBD | TBD | |
+
+## Consequences (will be filled in)
+
+_The chosen framework determines the runtime layout in Phase 2 (`deepvariant/native/coreml_inference.{h,mm}` vs `mlx_inference.{h,cpp}`). It also determines the model format shipped in `deepvariant-models` (`.mlpackage` vs MLX checkpoint)._
diff --git a/docs/packaging.md b/docs/packaging.md
new file mode 100644
index 000000000..320719f33
--- /dev/null
+++ b/docs/packaging.md
@@ -0,0 +1,83 @@
+# Packaging — Single-Binary Distribution on Homebrew
+
+**Status:** Draft (will be filled in during Phase 5).
+**Branch:** `feature/apple-silicon-native-v2`.
+
+## Goal
+
+One signed/notarized arm64 Mach-O at ~150-300 MB, plus a separate ~8.5 GB `deepvariant-models` formula. Both installed via:
+
+```sh
+brew tap benjamindemaille/deepvariant
+brew install deepvariant deepvariant-models
+deepvariant run --model_type=WGS --reads=in.bam --ref=ref.fa --output_vcf=out.vcf
+```
+
+No compilation on the user's machine. Cold-cache `brew install deepvariant` < 60 s.
+
+## Binary layout (planned)
+
+```text
+$HOMEBREW_PREFIX/
+├── Cellar/deepvariant//
+│   └── bin/deepvariant (single signed Mach-O, all deps static)
+├── share/deepvariant-models//
+│   ├── wgs.mlpackage
+│   ├── wes.mlpackage
+│   ├── pacbio.mlpackage
+│   ├── ont.mlpackage
+│   ├── trio_parent.mlpackage
+│   ├── trio_child.mlpackage
+│   ├── ...
+│   ├── somatic_*.mlpackage
+│   └── pangenome_*.mlpackage (~15-20 mlpackages total, ~8.5 GB)
+```
+
+## Static linking inventory
+
+| Lib | Source | Static? |
+| --- | --- | --- |
+| htslib 1.18 | FetchContent / submodule | Yes |
+| libssw 1.2.5 | submodule | Yes |
+| abseil-cpp 20240722 | FetchContent | Yes |
+| protobuf 21.9 | FetchContent | Yes |
+| gbwt / gbwtgraph / sdsl-lite / libdivsufsort / libhandlegraph | submodules | Yes |
+| CoreML.framework | system | Dynamic (system) |
+| Foundation / Metal | system | Dynamic (system) |
+
+Verification: `otool -L bin/deepvariant` should show only `/usr/lib/*` and `/System/*` paths.
+
+## Code signing & notarization
+
+- Sign with Apple Developer ID Application certificate via `codesign --options=runtime --timestamp`.
+- Notarize via `xcrun notarytool submit ... --wait`.
+- Staple ticket with `xcrun stapler staple`.
+- Verify with `spctl --assess --verbose ./deepvariant` (must pass).
+
+All four tools are in Xcode CLT — **no full Xcode required** on the build/release machine.
+
+## Core ML model compilation strategy
+
+We ship `.mlpackage` files **uncompiled**. The binary calls `[MLModel compileModelAtURL:url error:&err]` at first load; Core ML caches the resulting `.mlmodelc` in `~/Library/Caches/com.apple.CoreML/`. Subsequent runs are unaffected.
+
+- Avoids requiring full Xcode (which bundles `xcrun coremlcompiler` for ahead-of-time compilation).
+- Cost: first run adds a few seconds per model used. Logged as `Compiling Core ML model for first run…`.
+- Cache invalidation is handled by Core ML (it re-compiles if the `.mlpackage` mtime changes).
+
+## Bottle build flow (CI)
+
+A self-hosted M-series GitHub Actions runner triggered on tag:
+
+1. Build `deepvariant` static-linked.
+2. Sign + notarize + staple.
+3. Run conversion pipeline (for models bottle): produces all `.mlpackage`s, signs them, packs.
+4. Upload bottles to GitHub Release.
+5. Update tap formula sha256s.
+
+Reproducibility: every dep pinned with sha256 in CMake `FetchContent_Declare`.
+
+## Open questions (deferred to Phase 5)
+
+- Bottle hosting beyond GitHub Releases? (Cloudflare R2 mirror if downloads scale.)
+- Hardened runtime entitlements: do we need any? (Probably none — no JIT, no Metal capture.)
+- Per-macOS-version bottle tags: `arm64_sequoia` (macOS 15) and `arm64_sonoma` (macOS 14) at minimum.

From 6d27a5b69a0125d1210af84f41734570741250c1 Mon Sep 17 00:00:00 2001
From: Benjamin Demaille
Date: Sat, 25 Apr 2026 23:27:55 +0200
Subject: [PATCH 002/282] phase-0 step-1: TF-free conversion tooling skeleton
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drop tensorflow-metal entirely (unmaintained since mid-2024) and ban TensorFlow from all of our dev-time venvs. The two-way bench is now Core ML vs MLX.
SavedModel reading uses a pure-protobuf parser (TF .proto files compiled via system protoc — no TF runtime), and Core ML emit goes through PyTorch (coremltools.convert(traced, source="pytorch")) to avoid TF entirely. User-facing constraints unchanged: zero Python at runtime, zero TF anywhere, native Mach-O via Homebrew. Cost: +1-2 PW added to Phase 0 for the SavedModel reader + PyTorch weight-name bridge; benefit: TF nowhere in requirements*.txt and ~600 MB lighter venvs each.

Concrete content:

- tools/conversion/: TF-free Python skeleton.
  - .python-version (3.11.10), requirements-{coreml,mlx}.txt (no TF).
  - setup_venvs.sh enforces `import tensorflow` failing in both venvs.
  - savedmodel_reader.py: stub that documents the protoc bindings + BundleReader replication plan.
  - convert_coreml.py / convert_mlx.py: stubs that surface a clear NotImplementedError until savedmodel_reader lands.
  - bench.py: working raw-protobuf TFRecord and tf.train.Example reader; Core ML backend complete; MLX backend stub.
  - parity_check.py: complete (decodes upstream's CallVariantsOutput proto from the wire, falls back to bench.py's minimal format).
  - fetch_savedmodel.sh: pulls models from gs://deepvariant via HTTPS.
  - Protos/README.md: enumerates the .proto files to vendor.
- tools/reference/: dev-time Linux x86 reference capture under Docker + qemu emulation (chr20 fixture + capture_linux_x86.sh).
- tools/verify_gpu.sh: pure shell + awk powermetrics summariser.
- scripts/build-prereq-macos.sh: macOS arm64 brew-based build prereq.
- CLAUDE.md / PORT_LOG.md: updated to reflect TF-banned policy and bio-results / performance commitments (F1 thresholds, 2.5× speedup target, ANE → GPU fallback via MLComputeUnits.all).
Co-Authored-By: Claude Opus 4.7
---
 CLAUDE.md                                |  30 +-
 PORT_LOG.md                              | 114 +++++--
 scripts/build-prereq-macos.sh            |  64 ++++
 tools/conversion/.python-version         |   1 +
 tools/conversion/Protos/README.md        |  31 ++
 tools/conversion/README.md               |  80 +++++
 tools/conversion/bench.py                | 410 +++++++++++++++++++++++
 tools/conversion/convert_coreml.py       |  69 ++++
 tools/conversion/convert_mlx.py          |  67 ++++
 tools/conversion/fetch_savedmodel.sh     |  52 +++
 tools/conversion/parity_check.py         | 174 ++++++++++
 tools/conversion/requirements-coreml.txt |  17 +
 tools/conversion/requirements-mlx.txt    |  11 +
 tools/conversion/savedmodel_reader.py    |  59 ++++
 tools/conversion/setup_venvs.sh          |  62 ++++
 tools/reference/README.md                |  45 +++
 tools/reference/capture_linux_x86.sh     | 103 ++++++
 tools/reference/fetch_chr20_fixture.sh   |  33 ++
 tools/verify_gpu.sh                      |  68 ++++
 19 files changed, 1445 insertions(+), 45 deletions(-)
 create mode 100755 scripts/build-prereq-macos.sh
 create mode 100644 tools/conversion/.python-version
 create mode 100644 tools/conversion/Protos/README.md
 create mode 100644 tools/conversion/README.md
 create mode 100644 tools/conversion/bench.py
 create mode 100644 tools/conversion/convert_coreml.py
 create mode 100644 tools/conversion/convert_mlx.py
 create mode 100755 tools/conversion/fetch_savedmodel.sh
 create mode 100644 tools/conversion/parity_check.py
 create mode 100644 tools/conversion/requirements-coreml.txt
 create mode 100644 tools/conversion/requirements-mlx.txt
 create mode 100644 tools/conversion/savedmodel_reader.py
 create mode 100755 tools/conversion/setup_venvs.sh
 create mode 100644 tools/reference/README.md
 create mode 100755 tools/reference/capture_linux_x86.sh
 create mode 100755 tools/reference/fetch_chr20_fixture.sh
 create mode 100755 tools/verify_gpu.sh

diff --git a/CLAUDE.md b/CLAUDE.md
index 403415185..84f3df618 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -12,21 +12,22 @@ Running log: `PORT_LOG.md`.
 ## Hard constraints (non-negotiable)
 
 - macOS ≥ 14, arm64 only.
-- No Docker / no Rosetta / no CUDA / no embedded Python interpreter shipped to the user.
+- No Docker / no Rosetta / no CUDA at runtime. **No Python anywhere in the project we add** (Voie A strict — dev-time tools are Swift/C++, not Python).
 - Build is reproducible. User installs in one Homebrew command, no compilation on their box.
-- **Scientific accuracy preserved**: SNP F1 ≥ reference − 0.05 %, INDEL F1 ≥ reference − 0.10 %.
+- **Scientific accuracy preserved**: SNP F1 ≥ reference − 0.05 %, INDEL F1 ≥ reference − 0.10 %. Argmax 100 % agreement on the 1000-example Phase 0 bench. Max-abs softmax ≤ 1e-3.
 - **GPU truly engaged**: verified by `powermetrics --samplers gpu_power,ane_power` showing non-zero residency.
-- **Speedup ≥ 2.5×** vs published Linux x86 reference.
+- **Speedup ≥ 2.5×** vs published Linux x86 reference (`call_variants` stage, Phase 0 gate).
 
 ## Working rules
 
-1. **Test before commit.** Every commit must leave the build green: `cmake --build build && ctest -V` (after Phase 1) or, for Phase 0, the conversion + parity-check scripts must run end-to-end.
+1. **Test before commit.** Every commit must leave the build green: `swift build && swift test` in `tools/conversion/` for Phase 0 work; `cmake --build build && ctest -V` for Phases 1+.
 2. **Never degrade scientific precision.** F1 thresholds are gates, not goals. If we slip below, we fix the root cause — we do not lower the bar.
 3. **Never bypass an error.** No `--no-verify`, no swallowed exceptions, no commenting out of failing tests. Diagnose the root cause.
 4. **Document every critical decision** in `PORT_LOG.md` with date, context, alternatives considered, and rationale.
 5. **Don't touch the v1 worktree** at `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/`. v1 is a separate clone retained as research; v2 is its own fresh history.
-6. **Don't modify upstream `BUILD` / Bazel rules.** They stay as a Linux/Bazel reference. v2 builds via CMake on macOS only.
-7. **No half-finished implementations.** Each phase has a success gate; do not cross it without meeting the gate.
+6. **Don't modify upstream `BUILD` / Bazel rules or upstream Python files.** They stay as a Linux/Bazel reference. v2 builds via CMake on macOS only and contains zero Python files of our own.
+7. **No half-finished implementations.** Each phase has a success gate; do not cross it without meeting the gate. Stubs are allowed but must error out with `not yet implemented` rather than silently no-op.
+8. **No Python in our code, ever.** All dev-time tooling is Swift (`tools/conversion/`, a Swift Package) or shell (`tools/reference/`, `release/`). The only Python in the repo is upstream's pre-existing tools/*.py from r1.10 — left untouched.
 
 ## Stop conditions (per spec)
 
@@ -52,13 +53,14 @@ If any of the following happen, stop, write a report in `PORT_LOG.md`, and surfa
 
 ## Pitfalls already known (mine before re-discovering)
 
-- **TF 2.20 + coremltools 9 hangs** during WGS SavedModel conversion (21 min / 3 GB RSS). Pin TF 2.16.x in the conversion venv.
-- **tensorflow-metal 1.2.0** is frozen at TF 2.16; M-series ReLU bugs reported. Include in bench, expect to lose.
-- **ANE prefers 4-channel image-shaped tensors.** Our model is 7- or 12-channel. ANE may refuse — accept GPU-only fallback.
-- **Metal compute is not bitwise reproducible** across some ops/reboots. Validate via softmax tolerance (≤1e-3) + argmax agreement, not bit-equality.
+- **`tensorflow-metal` is dead** — unmaintained since mid-2024, frozen at TF 2.16, M-series ReLU bugs. Dropped from the v2 bench.
+- **TensorFlow is banned in our venvs.** `setup_venvs.sh` enforces `import tensorflow` failing. SavedModel reading uses a pure-protobuf parser in `tools/conversion/savedmodel_reader.py` (vendored TF `.proto` files compiled via `protoc --python_out`). Core ML emit goes through PyTorch (`coremltools.convert(traced_torch_model, source="pytorch")`) instead of the TF path.
+- **ANE prefers 4-channel image-shaped tensors.** Our model is 7- or 12-channel. ANE may refuse — accept GPU-only fallback. Core ML's `.all` compute units do this fallback automatically op-by-op.
+- **Metal compute is not bitwise reproducible** across some ops/reboots. Validate via softmax tolerance (≤1e-3) + argmax agreement (100 %), not bit-equality.
 - **`build-prereq.sh` is Linux-only.** v2 ships `scripts/build-prereq-macos.sh`.
 - **8.5 GB of model artifacts** can't fit in a single Homebrew bottle alongside the binary. Split into `deepvariant-models` formula.
 - **Xcode CLT is enough — no full Xcode required.** Ship `.mlpackage` uncompiled; runtime compiles on first load via `MLModel compileModelAtURL:error:`. Avoid `xcrun coremlcompiler` (full Xcode only).
+- **TF v2 checkpoint format** (the `variables/variables.{index, data-*}` layout) is documented at `tensorflow/core/util/tensor_bundle/tensor_bundle.h` — we replicate `BundleReader` in pure Python.
 
 ## Key file paths
 
@@ -67,10 +69,10 @@ If any of the following happen, stop, write a report in `PORT_LOG.md`, and surfa
 - v1 reference clone: `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/` (read-only)
 - Native runtime (Phases 2-3): `deepvariant/native/`
 - Build (Phase 1): `CMakeLists.txt` + `cmake/*.cmake`
-- Conversion (Phase 0, dev-time): `tools/conversion/`
-- Linux ref capture (Phase 0): `tools/reference/`
-- Release tooling (Phase 5): `release/`
-- Homebrew formulas (Phase 6): separate repo `homebrew-deepvariant/`
+- Conversion (Phase 0, dev-time, Swift Package): `tools/conversion/` — produces the `dv-tools` CLI.
+- Linux ref capture (Phase 0): `tools/reference/` (shell + Docker, no Python).
+- Release tooling (Phase 5): `release/` (shell + `codesign` + `xcrun notarytool`).
+- Homebrew formulas (Phase 6): separate repo `homebrew-deepvariant/`.
 
 ## Reused upstream C++ (do not rewrite)
 
diff --git a/PORT_LOG.md b/PORT_LOG.md
index 085b7e295..c4b211692 100644
--- a/PORT_LOG.md
+++ b/PORT_LOG.md
@@ -9,63 +9,115 @@ Plan reference: `~/.claude/plans/prompt-deepvariant-apple-idempotent-peacock.md`
 Branch `feature/apple-silicon-native-v2` created from `origin/r1.10` at commit `45f26275`.
 
 Scaffolding directories created:
+
 - `patches/` — local patches against vendored deps and upstream sources.
 - `benchmarks/` — Phase 0 latency / GPU residency captures.
 - `packaging/` — release artifacts and bottle staging.
-- `tools/conversion/` — dev-time TF→{Core ML, MLX, tf-metal} converter scripts.
-- `tools/reference/` — one-time Linux x86 reference capture under Docker emulation.
-- `release/` — sign, notarize, model-conversion CI scripts.
+- `tools/conversion/` — dev-time Python (TF-free) for SavedModel → Core ML / MLX. Two pinned venvs (`venv-coreml`, `venv-mlx`); enforced `import tensorflow` fails in `setup_venvs.sh`.
+- `tools/reference/` — one-time Linux x86 reference capture under Docker emulation (shell + Docker; uses upstream's bundled binary, doesn't import TF in our scripts).
+- `release/` — sign, notarize, model-conversion CI scripts (shell + `codesign` + `xcrun notarytool`).
 - `cmake/` — CMake module files (Phase 1).
-- `deepvariant/native/` — new pure-C++/Obj-C++ runtime (Phases 2-3).
+- `deepvariant/native/` — pure C++/Obj-C++ runtime (Phases 2-3).
 - `validation/` — GIAB hap.py harness (Phase 4) and virgin-machine checklist (Phase 7).
 
 ### System snapshot
 
 | Item | Value |
-|---|---|
+| --- | --- |
 | Date | 2026-04-25T22:49:07+0200 |
 | OS | macOS 26.4.1 (build 25E253) |
 | Arch | arm64 |
 | CPU | Apple M4 Max |
 | RAM | 128 GB unified |
-| Xcode | **CLT only** (`/Library/Developer/CommandLineTools`) — sufficient (see decision below). |
-| Apple Clang | 21.0.0 (clang-2100.0.123.102) |
+| Xcode | **CLT only** — sufficient (see decision below). |
+| Apple Clang | 21.0.0 |
+| Swift | 6.3.1 (CLT) |
 | CMake | 4.3.2 (Homebrew) |
-| protoc (system) | libprotoc 34.1 — informational only; we vendor protobuf 21.9 statically per plan |
-| Python (system) | 3.12.13 |
-| pyenv | 2.6.27 — used for the three Phase 0 conversion venvs |
-| Docker | 29.2.1 — used **dev-time only** to capture Linux x86 reference under qemu emulation; never shipped |
-| Homebrew | 5.1.7 (`/opt/homebrew/bin/brew`) |
+| protoc (system) | 34.1 — used to generate Python bindings from TF .proto files (no TF runtime needed) |
+| Python | 3.12.13 (system); 3.11.x via pyenv for the conversion venvs |
+| pyenv | 2.6.27 |
+| Docker | 29.2.1 — dev-time only, qemu emulation for Linux x86 reference; never shipped |
+| Homebrew | 5.1.7 |
+
+### Bio-results & performance commitments
+
+These are the contractual gates the project lives or dies by.
+
+**Bio results (scientific accuracy).** Same trained weights as upstream, same `make_examples` algorithm, same pileup images, same model architecture. Sources of numerical drift vs. upstream's CUDA reference:
+
+- Apple Metal vs CUDA accumulation order in Conv / BatchNorm (~1e-5 drift).
+- ANE FP16 reduced-precision path if used (~1e-3 drift).
+- Our reimplemented SavedModel reader → PyTorch / MLX bridge (must produce numerically equivalent weights).
+
+Hard gates:
+
+| Metric | Threshold | Source |
+| --- | --- | --- |
+| Argmax agreement on 1000-example bench vs Linux reference | **100 %** (no exceptions) | Phase 0 stop condition |
+| Max-abs softmax difference vs Linux reference | **≤ 1e-3** | Phase 0 ADR gate |
+| SNP F1 on HG002 WGS | **≥ Google reference − 0.05 %** | Spec §4 |
+| INDEL F1 on HG002 WGS | **≥ Google reference − 0.10 %** | Spec §4 |
+
+Compute-unit fallback at runtime (Core ML's automatic routing):
+
+1. `MLComputeUnits.all` — Core ML tries ANE first, falls back op-by-op to GPU when ANE rejects. **No custom logic to write — Core ML handles it.**
+2. If powermetrics shows zero ANE residency for our 7-channel input (likely — ANE prefers 4-channel image-shaped tensors), the production binary explicitly sets `.cpuAndGPU` to skip ANE entirely (FP32 throughout, eliminates FP16 drift risk).
+3. If even GPU-only drifts past the gate: we don't ship.
+
+**Performance commitments.**
+
+| Comparison | Expected v2 perf |
+| --- | --- |
+| vs Docker DeepVariant on Mac (qemu linux/amd64) | 20-50× faster on inference |
+| vs Linux x86 + NVIDIA T4 (Google's published reference) | **≥ 2.5×** speedup on `call_variants` (Phase 0 gate, spec §6) |
+| HG002 WGS end-to-end | ~1-2 h on M4 Max (vs ~3-4 h on AWS Linux+T4) |
+| Install time | `brew install` < 60 s vs `docker pull` 5-10 min |
+| Per-run startup | Mach-O instant vs Docker spin-up ~3-5 s |
+| First run after install | +few seconds for Core ML to compile each `.mlpackage` (one-time, cached) |
 
 ### Notes from prior v1 attempt
 
-A previous v1 worktree exists at `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/` (separate clone, not a worktree of this repo). v1 reached a Phase 0 ADR favoring Core ML and bumped Bazel/Python/TF toolchain pins. v2 is a fresh start per the user's choice. v1 findings retained for reference only:
-- TF 2.20 + coremltools 9 hangs at 21 min / 3 GB RSS during real WGS SavedModel→Core ML conversion. **v2 will pin TF 2.16.x in the conversion venv.**
-- `tensorflow-metal` is frozen at TF 2.16 since mid-2024 and reports M-series ReLU bugs. v2 includes it in the bench for completeness but expectation is Core ML or MLX wins.
-- `make_examples_native.cc`, `pileup_image_native.cc`, `allelecounter.cc`, the realigner C++, and `direct_phasing.cc` are all reusable — they form the multipliers that make v2 feasible.
+Previous v1 worktree at `/Users/benjamin/projects/deepvariant-apple-silicon/.worktrees/apple-silicon-native/` (separate clone, retained for reference only). v1 picked Core ML in the ADR. Findings carried over:
+
+- `tensorflow-metal` is dead — frozen at TF 2.16 since mid-2024, M-series ReLU bugs reported. **v2 dropped it from the bench entirely.**
+- `make_examples_native.cc`, `pileup_image_native.cc`, `allelecounter.cc`, the realigner C++, and `direct_phasing.cc` are reusable — they form the multipliers that make v2 feasible.
 
 ### Build system: Bazel → CMake (decided)
 
-v2 abandons Bazel for the native build. Upstream's Bazel rules transitively require `@org_tensorflow`, which we do not want at runtime. CMake is ~equivalent effort and produces a self-contained TF-free graph. Upstream `BUILD` files are left untouched as a Linux/Bazel reference for cross-checking.
+Upstream's Bazel rules transitively require `@org_tensorflow`. CMake gives a self-contained TF-free graph. Upstream `BUILD` files left untouched as reference.
+
+### Voie B refined — Python tolerated dev-time, **TF banned everywhere** (decided 2026-04-25)
+
+Original plan tolerated `tensorflow` in dev-time tooling. Reversed:
+
+- **No TensorFlow in any of our venvs.** `setup_venvs.sh` fails hard if `import tensorflow` works in `venv-coreml` or `venv-mlx`.
+- **No tensorflow-metal** — it's unmaintained since mid-2024 and dropping TF removes its reason to exist.
+- **Bench A/B = Core ML vs MLX** (no third voie).
 
-### Xcode CLT only — no full Xcode needed (decided)
+Replacement strategy:
 
-The plan flagged full Xcode as a possible Phase 5 requirement (for `xcrun coremlcompiler`). Re-evaluated: not needed.
+- **SavedModel reading**: pure-protobuf parser in `tools/conversion/savedmodel_reader.py`. Vendor TF's public `.proto` files under `tools/conversion/Protos/tensorflow/` and generate Python bindings via system `protoc --python_out`. No TF runtime — the protobuf package is enough.
+- **Weight extraction**: read `variables/variables.{index, data-*}` files via the `BundleEntryProto`-based format documented at `tensorflow/core/util/tensor_bundle/tensor_bundle.h`.
Implement once in Python, use everywhere. +- **Core ML emit**: convert via `coremltools.convert(traced_torch_model, source="pytorch")`. Skips TF entirely. Manual Keras→torchvision weight name mapping. +- **MLX emit**: hand-write Inception-v3 in MLX, load weights from the same parsed bundle. +- **TFRecord I/O at bench time**: raw protobuf parser in `bench.py` (already done — handles `tf.train.Example` without TF). -- `coremlcompiler` is bundled with the full Xcode app (`Xcode.app/Contents/Developer/usr/bin/coremlcompiler`) and pre-compiles `.mlpackage` → `.mlmodelc`. -- Alternative: ship `.mlpackage` uncompiled in `deepvariant-models`; the binary calls `[MLModel compileModelAtURL:url error:&err]` at first load. Result is cached by Core ML in `~/Library/Caches/com.apple.CoreML/`. No Xcode needed anywhere. -- Cost: first run after install adds ~few seconds per model used while Core ML compiles. Subsequent runs are unaffected. We log a clear `Compiling Core ML model for first run…` line. -- Everything else (`clang`, `codesign`, `xcrun notarytool`, `xcrun stapler`, `MacOSX.sdk` with `CoreML.framework` headers) ships with CLT. +Cost: ~1-2 PW added to Phase 0 (the SavedModel reader + the PyTorch weight-name bridge). -This keeps the build/release machine on CLT only, which is also more reproducible (CLT versions are easier to pin than Xcode versions). +Benefit: TF nowhere in the project's `requirements*.txt`. Smaller, more reproducible venvs (~600 MB lighter each). Avoids the v1 `TF 2.20 + coremltools 9.0` hang issue entirely (we never load a SavedModel via TF). 
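The TF-free TFRecord I/O named in the replacement strategy relies only on the record framing: an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the payload, and a 4-byte masked CRC32C of the payload. The sketch below is a minimal illustration of that framing, not the code in `bench.py`; the names `iter_tfrecord_payloads` and `frame_record` are hypothetical, and CRCs are skipped rather than verified, which is only reasonable for a dev-time tool.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_tfrecord_payloads(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each record payload from a TFRecord stream; CRCs are not checked."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) != 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)  # uint64, little-endian
        stream.read(4)  # masked CRC32C of the length field (ignored here)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated payload")
        stream.read(4)  # masked CRC32C of the payload (ignored here)
        yield payload


def frame_record(payload: bytes) -> bytes:
    """Frame one payload for local experiments.

    Assumption for illustration: both CRC slots are zero-filled placeholders,
    so the result is NOT a valid TFRecord for readers that verify checksums.
    """
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


if __name__ == "__main__":
    demo = io.BytesIO(frame_record(b"abc") + frame_record(b"hello"))
    for record in iter_tfrecord_payloads(demo):
        print(len(record), record)
```

Each payload would then be handed to the raw `tf.train.Example` parser; a writer that must interoperate with upstream tools needs real masked CRC32C values in both slots.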
-### Next milestone — Phase 0 step 1 +### Xcode CLT only — no full Xcode needed (decided 2026-04-25) -Create three pinned conversion venvs in `tools/conversion/`: -- `venv-coreml` (Python 3.11, TF 2.16.2, coremltools 7.2) -- `venv-metal` (Python 3.11, TF 2.16.2, tensorflow-metal 1.2) -- `venv-mlx` (Python 3.11, MLX 0.21+) +Ship `.mlpackage` uncompiled; runtime compiles on first load via `MLModel compileModelAtURL:error:`. Cache lives in `~/Library/Caches/com.apple.CoreML/`. No need for `xcrun coremlcompiler` (full Xcode only). -Pull `gs://deepvariant/models/DeepVariant/1.10.0/wgs/` SavedModel into a pinned cache. +### Phase 0 step 1 milestones -Then proceed with `convert_coreml.py`, `convert_metal.py`, `convert_mlx.py`. +- [x] Bootstrap commit (`fae3c923`): branch + scaffolding + bio/perf commitments. +- [x] Voie B refined — TF banned policy adopted; tooling skeleton committed (TF-free venvs, PyTorch bridge stubs, raw protobuf TFRecord/Example parsers). +- [ ] Vendor TF + Core ML `.proto` files under `tools/conversion/Protos/`; generate Python bindings via `protoc --python_out`. +- [ ] Implement `savedmodel_reader.py` (graph + weights, no TF). +- [ ] Build chr20 reference fixture: `tools/reference/fetch_chr20_fixture.sh` then `tools/reference/capture_linux_x86.sh wgs`. +- [ ] Implement `convert_coreml.py` end-to-end (PyTorch Inception-v3, weight name remap, coremltools convert). +- [ ] Implement `convert_mlx.py` (MLX Inception-v3, weight bind). +- [ ] Run bench: Core ML at `compute_units=ALL`, then `CPU_AND_GPU`; MLX. Capture latency, throughput, GPU/ANE residency, per-channel parity vs Linux reference. +- [ ] Phase 0 ADR (`docs/architecture.md`) signed off. diff --git a/scripts/build-prereq-macos.sh b/scripts/build-prereq-macos.sh new file mode 100755 index 000000000..82304ff19 --- /dev/null +++ b/scripts/build-prereq-macos.sh @@ -0,0 +1,64 @@ +#!/usr/bin/env bash +# macOS arm64 build prerequisites for the DeepVariant Apple Silicon native port. 
+# Replaces upstream build-prereq.sh (which is Linux/Ubuntu-only). +# +# Idempotent: safe to re-run. +set -euo pipefail + +if [[ "$(uname)" != "Darwin" || "$(uname -m)" != "arm64" ]]; then + echo "error: this script targets macOS arm64 only" >&2 + exit 1 +fi + +if (( $(sw_vers -productVersion | cut -d. -f1) < 14 )); then + echo "error: macOS 14 (Sonoma) or newer required" >&2 + exit 1 +fi + +echo "==> Xcode Command Line Tools" +if ! xcode-select -p >/dev/null 2>&1; then + echo " not installed; running xcode-select --install" + xcode-select --install + echo " re-run this script after the CLT installer finishes" + exit 1 +fi + +echo "==> Homebrew" +if ! command -v brew >/dev/null 2>&1; then + echo " Homebrew not found; install from https://brew.sh and re-run" >&2 + exit 1 +fi + +echo "==> brew dependencies" +# Build-time only; none of these end up in the shipped binary. +BREW_DEPS=( + cmake + ninja + pkg-config + pyenv + git-lfs + bash # /bin/bash on macOS is too old (3.2) for some scripts +) +for dep in "${BREW_DEPS[@]}"; do + if brew list "${dep}" >/dev/null 2>&1; then + echo " ${dep}: ok" + else + echo " installing ${dep}" + brew install "${dep}" + fi +done + +echo "==> git-lfs hook" +git lfs install --skip-repo + +echo "==> environment" +echo " cmake: $(cmake --version | head -1)" +echo " ninja: $(ninja --version)" +echo " clang: $(clang --version | head -1)" +echo " pyenv: $(pyenv --version)" +echo " brew: $(brew --version | head -1)" + +echo "==> ready. next:" +echo " cmake -S .
-B build -G Ninja" +echo " cmake --build build --parallel" +echo " ctest --test-dir build --output-on-failure" diff --git a/tools/conversion/.python-version b/tools/conversion/.python-version new file mode 100644 index 000000000..3e72aa698 --- /dev/null +++ b/tools/conversion/.python-version @@ -0,0 +1 @@ +3.11.10 diff --git a/tools/conversion/Protos/README.md b/tools/conversion/Protos/README.md new file mode 100644 index 000000000..6233e4acf --- /dev/null +++ b/tools/conversion/Protos/README.md @@ -0,0 +1,31 @@ +# Vendored TF `.proto` files (TF-free SavedModel reading) + +(Empty in this scaffold commit.) + +To enable the TF-free SavedModel reader in `savedmodel_reader.py`, vendor these public protobuf schemas verbatim from `github.com/tensorflow/tensorflow` at `r2.16` (Apache-2.0): + +``` +Protos/tensorflow/core/protobuf/saved_model.proto +Protos/tensorflow/core/protobuf/meta_graph.proto +Protos/tensorflow/core/framework/graph.proto +Protos/tensorflow/core/framework/node_def.proto +Protos/tensorflow/core/framework/attr_value.proto +Protos/tensorflow/core/framework/tensor.proto +Protos/tensorflow/core/framework/tensor_shape.proto +Protos/tensorflow/core/framework/types.proto +Protos/tensorflow/core/framework/op_def.proto +Protos/tensorflow/core/framework/function.proto +Protos/tensorflow/core/framework/versions.proto +Protos/tensorflow/core/framework/resource_handle.proto +Protos/tensorflow/core/framework/variable.proto +Protos/tensorflow/core/protobuf/tensor_bundle.proto +``` + +Then generate Python bindings (no TF runtime). The include root is `Protos`, not `Protos/tensorflow`, because the vendored files import each other by their full `tensorflow/core/...` path: + +```sh +brew install protobuf +protoc --python_out=Generated/ -I=Protos Protos/tensorflow/... +``` + +`SOURCES.md` (next to this file) records the exact upstream commit hash for each vendored file.
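Because `Protos/tensorflow/...` is shorthand rather than a literal glob, one way to drive the binding generation is to enumerate the vendored files and build the `protoc` argument list explicitly. The sketch below is a hypothetical helper, not part of this commit. It assumes the include root is the `Protos/` directory itself, since TensorFlow's schemas import each other by `tensorflow/core/...` paths, and it uses only the standard `--proto_path` and `--python_out` flags.

```python
from pathlib import Path


def build_protoc_command(proto_root: Path, out_dir: Path) -> list[str]:
    """Return the protoc argv that generates Python bindings for every
    .proto file found under proto_root (hypothetical helper)."""
    protos = sorted(str(p) for p in proto_root.rglob("*.proto"))
    if not protos:
        raise FileNotFoundError(f"no .proto files under {proto_root}")
    return [
        "protoc",
        f"--proto_path={proto_root}",  # root that 'tensorflow/core/...' imports resolve against
        f"--python_out={out_dir}",     # where the *_pb2.py modules are written
        *protos,
    ]


if __name__ == "__main__":
    # Assumed layout: run from tools/conversion/ with schemas under Protos/.
    # Pass the printed command to subprocess.run(cmd, check=True) once
    # Generated/ exists and protoc is on PATH.
    print(" ".join(build_protoc_command(Path("Protos"), Path("Generated"))))
```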
diff --git a/tools/conversion/README.md b/tools/conversion/README.md new file mode 100644 index 000000000..6b4f7046d --- /dev/null +++ b/tools/conversion/README.md @@ -0,0 +1,80 @@ +# Phase 0 — Inference framework bench (dev-time Python tooling) + +> **Dev-time only — never shipped to users.** Python lives here because `coremltools`, `tf2onnx`, `tensorflow-metal`, and the SavedModel reader for MLX are all Python-only Apple/Google packages. Re-implementing them in Swift would re-introduce 7+ years of edge-case bug fixes (BatchNorm fused vs not, ConcatV2 axis handling, NHWC↔NCHW layout, FusedBatchNormV3 epsilon, etc.). Voie B accepts dev-time Python to keep this maturity, while the user-facing `deepvariant` binary stays 100 % C++/Obj-C++ with no embedded Python interpreter (verified by `otool -L` in Phase 5). See `CLAUDE.md` for the dev-time vs runtime split. + +Three-way A/B/C bench (tensorflow-metal vs Core ML vs MLX) on the real WGS SavedModel: produces latency, throughput, GPU/ANE residency, and parity-vs-Linux measurements that feed the Phase 0 ADR. 
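The parity-vs-Linux measurement comes down to two numbers per backend: how many examples change their argmax class, and the largest absolute softmax difference. The following is a minimal pure-Python sketch of that gate under the thresholds stated in this patch (zero argmax disagreements, max-abs difference at most 1e-3). The name `parity_report` is hypothetical, and `parity_check.py` remains the actual tool.

```python
from typing import Sequence


def _argmax(row: Sequence[float]) -> int:
    # First index of the largest value (ties resolve to the lowest index).
    return max(range(len(row)), key=row.__getitem__)


def parity_report(
    reference: Sequence[Sequence[float]],
    candidate: Sequence[Sequence[float]],
    max_abs_tol: float = 1e-3,
) -> dict:
    """Compare per-example softmax rows from a candidate backend to the reference."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("reference and candidate must be non-empty and equally long")
    max_abs = 0.0
    disagreements = 0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("row width mismatch")
        row_diff = max(abs(r - c) for r, c in zip(ref_row, cand_row))
        max_abs = max(max_abs, row_diff)
        if _argmax(ref_row) != _argmax(cand_row):
            disagreements += 1
    return {
        "n": len(reference),
        "max_abs_softmax": max_abs,
        "argmax_disagreements": disagreements,
        # Both gates must hold for the backend to pass.
        "ok": disagreements == 0 and max_abs <= max_abs_tol,
    }


if __name__ == "__main__":
    ref = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]]
    cand = [[0.9004, 0.0498, 0.0498], [0.10, 0.80, 0.10]]
    print(parity_report(ref, cand))
```

A single flipped argmax fails the report regardless of how small the softmax difference is, which matches the "no exceptions" rejection rule.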
+ +## Layout + +```text +tools/conversion/ +├── .python-version # pyenv pin: 3.11.x +├── requirements-coreml.txt # TF 2.16.2 + coremltools 7.2 +├── requirements-metal.txt # TF 2.16.2 + tensorflow-metal 1.2.0 +├── requirements-mlx.txt # MLX 0.21+ (TF only for SavedModel weight ingest) +├── setup_venvs.sh # creates venv-coreml / venv-metal / venv-mlx +├── fetch_savedmodel.sh # pulls gs://deepvariant/models/DeepVariant/1.10.0/ +├── convert_coreml.py # SavedModel -> .mlpackage (coremltools, with tf2onnx fallback) +├── convert_metal.py # passthrough (records TF/metal env metadata) +├── convert_mlx.py # SavedModel weights -> safetensors (MLX-friendly) +├── bench.py # latency + powermetrics GPU residency + softmax capture +├── parity_check.py # softmax max-abs / argmax disagreement vs reference +└── models/ # gitignored, where SavedModels and outputs live +``` + +## Pipeline + +```sh +# one-time setup (~30 min: pyenv install 3.11 + 3 venv pip installs) +./setup_venvs.sh + +# pull the real WGS SavedModel (~700 MB) +./fetch_savedmodel.sh wgs + +# convert each voie +source venv-coreml/bin/activate +python convert_coreml.py --saved-model models/wgs --output models/wgs.mlpackage +deactivate + +source venv-metal/bin/activate +python convert_metal.py --saved-model models/wgs --output models/wgs.metal.json +deactivate + +source venv-mlx/bin/activate +python convert_mlx.py --saved-model models/wgs --output models/wgs.mlx.safetensors +deactivate + +# bench each on the same 1000-example reference set +source venv-coreml/bin/activate +python bench.py --backend coreml --model models/wgs.mlpackage \ + --examples ../reference/cache/wgs_chr20_1000.tfrecord \ + --output ../../benchmarks/coreml_wgs.json \ + --output-cv ../../benchmarks/coreml_wgs.cv.tfrecord +deactivate +# (repeat for metal, mlx) + +# parity vs Linux reference +python parity_check.py \ + --reference ../reference/output/wgs/call_variants_chr20.tfrecord \ + --candidates ../../benchmarks/coreml_wgs.cv.tfrecord \ + 
../../benchmarks/metal_wgs.cv.tfrecord \ + ../../benchmarks/mlx_wgs.cv.tfrecord +``` + +## Why three pinned venvs + +These three frameworks have **incompatible** Python/TF/numpy version requirements: + +| venv | Python | TF | Other | +| --- | --- | --- | --- | +| venv-coreml | 3.11 | 2.16.2 | coremltools 7.2 (v1 found 9.0 + TF 2.20 hangs at 21 min / 3 GB RSS) | +| venv-metal | 3.11 | 2.16.2 | tensorflow-metal 1.2.0 (frozen at TF 2.16 since mid-2024) | +| venv-mlx | 3.11 | 2.16.2 (read-only for SavedModel ingest) | MLX 0.21+ | + +We pin `numpy < 2` everywhere because TF 2.16 was built against numpy 1. + +## Stop conditions + +- If `convert_coreml.py` hangs > 5 min on the WGS SavedModel, bail and try `convert_coreml.py --via-onnx` (fallback path: SavedModel → ONNX → Core ML via tf2onnx). +- If the parity check shows argmax disagreement on **any** of the 1000 examples, the framework is rejected — no exceptions. +- If `bench.py` reports `gpu_power=0` AND `ane_power=0` for a backend, that backend has a config bug, not a perf result. diff --git a/tools/conversion/bench.py b/tools/conversion/bench.py new file mode 100644 index 000000000..bf282dbab --- /dev/null +++ b/tools/conversion/bench.py @@ -0,0 +1,410 @@ +"""Bench inference latency, throughput, and GPU/ANE residency — TF-free. + +Reads input examples from a TFRecord (raw protobuf, no TF runtime), feeds +them into the chosen backend (Core ML or MLX), captures softmax outputs and +powermetrics GPU/ANE residency, writes one JSON metrics file and one +TFRecord of softmax records (for parity_check.py). 
+ +Usage: + python bench.py --backend coreml \\ + --model models/wgs.mlpackage \\ + --examples ../reference/cache/wgs_chr20_1000.tfrecord \\ + --output ../../benchmarks/coreml_wgs.json \\ + --output-cv ../../benchmarks/coreml_wgs.cv.tfrecord +""" + +from __future__ import annotations + +import argparse +import json +import signal +import struct +import subprocess +import sys +import threading +import time +from dataclasses import asdict, dataclass +from pathlib import Path +from typing import Iterator + +import numpy as np + + +# --------------------------------------------------------------------------- +# TFRecord I/O — raw, no TF +# --------------------------------------------------------------------------- + +_CRC_MASK = 0xA282EAD8 + + +def _crc32c(data: bytes) -> int: + import google_crc32c + + c = google_crc32c.Checksum() + c.update(data) + return int.from_bytes(c.digest(), "big") + + +def _masked_crc32(data: bytes) -> int: + crc = _crc32c(data) + return ((crc >> 15) | ((crc << 17) & 0xFFFFFFFF)) + _CRC_MASK & 0xFFFFFFFF + + +def write_tfrecord(path: str, payloads: Iterator[bytes]) -> int: + n = 0 + with open(path, "wb") as f: + for payload in payloads: + length = struct.pack(" Iterator[bytes]: + """Yield each record's payload bytes. 
CRCs are not verified (dev tool).""" + with open(path, "rb") as f: + while True: + ln_b = f.read(8) + if not ln_b: + return + if len(ln_b) != 8: + raise RuntimeError(f"truncated tfrecord {path}") + (length,) = struct.unpack(" tuple[int, int]: + val = 0 + shift = 0 + while True: + b = buf[i] + i += 1 + val |= (b & 0x7F) << shift + if not (b & 0x80): + return val, i + shift += 7 + + +def parse_tf_example(payload: bytes, h: int, w: int, c: int) -> np.ndarray: + """Parse a tf.train.Example proto and return its image as (H,W,C) float32.""" + target_keys = {"image/encoded", "image"} + i = 0 + while i < len(payload): + tag = payload[i] + i += 1 + field, wire = tag >> 3, tag & 7 + if wire == 2: + ln, i = _read_varint(payload, i) + seg = payload[i:i + ln] + i += ln + if field == 1: # Example.features + arr = _scan_features(seg, target_keys, h, w, c) + if arr is not None: + return arr + elif wire == 0: + _, i = _read_varint(payload, i) + elif wire == 1: + i += 8 + elif wire == 5: + i += 4 + else: + raise RuntimeError(f"unsupported wire type {wire}") + raise RuntimeError("no image feature in example") + + +def _scan_features( + seg: bytes, + keys: set[str], + h: int, + w: int, + c: int, +) -> np.ndarray | None: + """Walk Features { map feature = 1; } looking for `keys`.""" + i = 0 + while i < len(seg): + tag = seg[i] + i += 1 + if tag & 7 != 2: + raise RuntimeError("Features map entries must be length-delimited") + ln, i = _read_varint(seg, i) + entry = seg[i:i + ln] + i += ln + key, value = None, None + j = 0 + while j < len(entry): + t = entry[j] + j += 1 + f, wire = t >> 3, t & 7 + if wire != 2: + raise RuntimeError("MapEntry fields are length-delimited") + l2, j = _read_varint(entry, j) + buf = entry[j:j + l2] + j += l2 + if f == 1: + key = buf.decode("utf-8") + elif f == 2: + value = buf + if key in keys and value is not None: + return _decode_feature( + value, h, w, c, prefer_bytes=(key == "image/encoded"), + ) + return None + + +def _decode_feature( + buf: bytes, + h: 
int, + w: int, + c: int, + prefer_bytes: bool, +) -> np.ndarray: + """Feature is a oneof of {bytes_list=1, float_list=2, int64_list=3}.""" + i = 0 + while i < len(buf): + tag = buf[i] + i += 1 + field, wire = tag >> 3, tag & 7 + if wire != 2: + raise RuntimeError("Feature list fields are length-delimited") + ln, i = _read_varint(buf, i) + seg = buf[i:i + ln] + i += ln + if field == 1 and prefer_bytes: # BytesList + j = 1 # skip BytesList.value tag + l2, j = _read_varint(seg, j) + data = seg[j:j + l2] + return np.frombuffer(data, dtype=np.uint8).astype(np.float32).reshape(h, w, c) + if field == 2 and not prefer_bytes: # FloatList (packed) + return np.frombuffer(seg, dtype=np.float32).reshape(h, w, c) + raise RuntimeError("Feature has no recognized list") + + +def read_examples(path: str, input_shape: tuple[int, int, int]) -> np.ndarray: + h, w, c = input_shape + arrs = [parse_tf_example(p, h, w, c) for p in read_tfrecord(path)] + if not arrs: + raise RuntimeError(f"no examples in {path}") + return np.stack(arrs, axis=0) + + +# --------------------------------------------------------------------------- +# powermetrics side-thread — captures GPU and ANE power residency. +# --------------------------------------------------------------------------- + +@dataclass +class GpuStats: + samples: int + gpu_power_mean_mw: float + gpu_power_max_mw: float + ane_power_mean_mw: float + ane_power_max_mw: float + + +class PowerSampler: + """Run powermetrics in background. 
Requires sudo (-n).""" + + def __init__(self, interval_ms: int = 500) -> None: + self.interval_ms = interval_ms + self.proc: subprocess.Popen | None = None + self.samples: list[tuple[float, float]] = [] + self._stop = threading.Event() + self._thread: threading.Thread | None = None + + def __enter__(self) -> "PowerSampler": + cmd = [ + "sudo", "-n", "powermetrics", + "--samplers", "gpu_power,ane_power", + "-i", str(self.interval_ms), + "-f", "text", + ] + try: + self.proc = subprocess.Popen( + cmd, + stdout=subprocess.PIPE, + stderr=subprocess.DEVNULL, + text=True, + ) + except FileNotFoundError: + print("warning: powermetrics not found", file=sys.stderr) + self.proc = None + return self + self._thread = threading.Thread(target=self._reader, daemon=True) + self._thread.start() + return self + + def _reader(self) -> None: + assert self.proc is not None + block: list[str] = [] + for line in self.proc.stdout: # type: ignore[union-attr] + if self._stop.is_set(): + break + block.append(line) + if line.startswith("**") and len(block) > 5: + gpu = ane = 0.0 + for ln in block: + if ln.startswith("GPU Power:"): + try: + gpu = float(ln.split(":", 1)[1].strip().split()[0]) + except Exception: + pass + elif ln.startswith("ANE Power:"): + try: + ane = float(ln.split(":", 1)[1].strip().split()[0]) + except Exception: + pass + if gpu or ane: + self.samples.append((gpu, ane)) + block = [] + + def __exit__(self, *exc) -> None: + self._stop.set() + if self.proc is not None: + try: + self.proc.send_signal(signal.SIGINT) + self.proc.wait(timeout=2) + except Exception: + self.proc.kill() + if self._thread is not None: + self._thread.join(timeout=2) + + def stats(self) -> GpuStats: + if not self.samples: + return GpuStats(0, 0.0, 0.0, 0.0, 0.0) + g = np.array([s[0] for s in self.samples]) + a = np.array([s[1] for s in self.samples]) + return GpuStats( + samples=len(self.samples), + gpu_power_mean_mw=float(g.mean()), + gpu_power_max_mw=float(g.max()), + ane_power_mean_mw=float(a.mean()), 
+ ane_power_max_mw=float(a.max()), + ) + + +# --------------------------------------------------------------------------- +# Backends +# --------------------------------------------------------------------------- + +def bench_coreml( + model_path: str, x: np.ndarray, batch: int, +) -> tuple[np.ndarray, float]: + import coremltools as ct + + m = ct.models.MLModel(model_path, compute_units=ct.ComputeUnit.ALL) + in_name = m.get_spec().description.input[0].name + out_chunks: list[np.ndarray] = [] + t0 = time.time() + for i in range(0, x.shape[0], batch): + chunk = x[i:i + batch] + result = m.predict({in_name: chunk}) + out_chunks.append( + np.asarray(next(iter(result.values())), dtype=np.float32), + ) + return np.concatenate(out_chunks, axis=0), time.time() - t0 + + +def bench_mlx( + model_path: str, x: np.ndarray, batch: int, +) -> tuple[np.ndarray, float]: + raise NotImplementedError( + "mlx bench: needs hand-built MLX Inception-v3 module loaded from " + "the safetensors weights produced by convert_mlx.py. 
" + "Phase 0 sub-step pending.", + ) + + +def write_softmax_records(path: str, softmax: np.ndarray) -> int: + def gen() -> Iterator[bytes]: + for i, row in enumerate(softmax): + yield struct.pack(" int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--backend", required=True, choices=["coreml", "mlx"]) + p.add_argument("--model", required=True) + p.add_argument("--examples", required=True) + p.add_argument("--output", required=True) + p.add_argument("--output-cv", required=True) + p.add_argument("--batch", type=int, default=128) + p.add_argument("--input-shape", default="100,221,7") + p.add_argument("--warmup-batches", type=int, default=2) + p.add_argument("--no-powermetrics", action="store_true") + args = p.parse_args() + + h, w, c = (int(s) for s in args.input_shape.split(",")) + print(f"loading examples from {args.examples}") + x = read_examples(args.examples, (h, w, c)) + print(f" shape={x.shape}, dtype={x.dtype}") + + print(f"warmup {args.warmup_batches} batches @ batch={args.batch}") + warm_n = min(args.warmup_batches * args.batch, x.shape[0]) + if args.backend == "coreml": + bench_coreml(args.model, x[:warm_n], args.batch) + elif args.backend == "mlx": + bench_mlx(args.model, x[:warm_n], args.batch) + + print("benching...") + sampler = PowerSampler() if not args.no_powermetrics else None + if sampler is not None: + sampler.__enter__() + try: + if args.backend == "coreml": + softmax, elapsed = bench_coreml(args.model, x, args.batch) + elif args.backend == "mlx": + softmax, elapsed = bench_mlx(args.model, x, args.batch) + finally: + if sampler is not None: + sampler.__exit__(None, None, None) + + n = x.shape[0] + print(f" {n} examples in {elapsed:.3f}s = {n / elapsed:.1f} ex/s") + + Path(args.output_cv).parent.mkdir(parents=True, exist_ok=True) + n_written = write_softmax_records(args.output_cv, softmax) + print(f" wrote {n_written} softmax records to {args.output_cv}") + + metrics = { + "backend": args.backend, + "model": args.model, + 
"examples": args.examples, + "n_examples": n, + "batch_size": args.batch, + "elapsed_seconds": elapsed, + "examples_per_second": n / elapsed, + "input_shape": [h, w, c], + "softmax_shape": list(softmax.shape), + "softmax_dtype": str(softmax.dtype), + } + if sampler is not None: + metrics["gpu_stats"] = asdict(sampler.stats()) + + Path(args.output).parent.mkdir(parents=True, exist_ok=True) + with open(args.output, "w") as f: + json.dump(metrics, f, indent=2) + print(f"wrote metrics to {args.output}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tools/conversion/convert_coreml.py b/tools/conversion/convert_coreml.py new file mode 100644 index 000000000..64cb8aa9a --- /dev/null +++ b/tools/conversion/convert_coreml.py @@ -0,0 +1,69 @@ +"""Convert a DeepVariant TF SavedModel to a Core ML .mlpackage — TF-free. + +The conversion path: + 1. Read SavedModel weights via savedmodel_reader (pure protobuf, no TF). + 2. Construct torchvision.models.inception_v3 with 7-channel input. + 3. Load parsed weights into the torchvision module (manual name map). + 4. Trace the PyTorch module with an example input. + 5. coremltools.convert(traced_model, source="pytorch") -> .mlpackage. + +Run inside `venv-coreml` (PyTorch + coremltools, no TF). + +Usage: + python convert_coreml.py \\ + --saved-model models/wgs --output models/wgs.mlpackage + +Status: STUB. Steps 1-3 require completing savedmodel_reader.py first. 
+""" + +from __future__ import annotations + +import argparse +import sys + +from savedmodel_reader import SavedModelReader + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--saved-model", required=True) + p.add_argument("--output", required=True) + p.add_argument("--input-shape", default="100,221,7") + p.add_argument("--input-name", default="input_1") + p.add_argument( + "--compute-units", + default="ALL", + choices=["ALL", "CPU_AND_GPU", "CPU_AND_NE", "CPU_ONLY"], + ) + p.add_argument( + "--minimum-deployment-target", + default="macOS14", + choices=["macOS14", "macOS15"], + ) + args = p.parse_args() + + reader = SavedModelReader(args.saved_model) + try: + graph = reader.graph_summary() + weights = reader.weights() + except NotImplementedError as e: + print(f"error: {e}", file=sys.stderr) + return 2 + + # Steps 2-5 (instantiate Inception-v3 in torchvision, load weights with + # name remapping, trace, coremltools.convert) depend on the parsed graph + # and weight map produced above. Implementation pending. + print( + f"parsed graph with {len(graph.get('nodes', []))} nodes, " + f"{len(weights)} tensors", + file=sys.stderr, + ) + print( + "error: PyTorch bridge + coremltools.convert not yet implemented", + file=sys.stderr, + ) + return 2 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tools/conversion/convert_mlx.py b/tools/conversion/convert_mlx.py new file mode 100644 index 000000000..314740c00 --- /dev/null +++ b/tools/conversion/convert_mlx.py @@ -0,0 +1,67 @@ +"""SavedModel weights -> MLX-friendly safetensors. TF-free. + +Reads the SavedModel via savedmodel_reader (pure protobuf, no TF), strips +the Keras name suffixes, and writes a safetensors bundle. The MLX +Inception-v3 architecture is rebuilt at bench time in +`bench.py --backend mlx`. + +Run inside `venv-mlx` (MLX + safetensors, no TF). 
+ +Usage: + python convert_mlx.py \\ + --saved-model models/wgs --output models/wgs.mlx.safetensors + +Status: STUB. Depends on savedmodel_reader.py being implemented first. +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +import time + +from savedmodel_reader import SavedModelReader + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--saved-model", required=True) + p.add_argument("--output", required=True, help="output safetensors file") + args = p.parse_args() + + reader = SavedModelReader(args.saved_model) + try: + weights = reader.weights() + except NotImplementedError as e: + print(f"error: {e}", file=sys.stderr) + return 2 + + from safetensors.numpy import save_file # noqa: WPS433 + + metadata = { + "captured_at": time.strftime("%Y-%m-%dT%H:%M:%S"), + "saved_model": os.path.abspath(args.saved_model), + "n_tensors": str(len(weights)), + } + save_file(weights, args.output, metadata=metadata) + sidecar = args.output + ".manifest.json" + with open(sidecar, "w") as f: + json.dump( + { + "metadata": metadata, + "tensors": { + k: {"shape": list(v.shape), "dtype": str(v.dtype)} + for k, v in weights.items() + }, + }, + f, + indent=2, + ) + print(f"wrote {args.output} ({len(weights)} tensors) and {sidecar}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tools/conversion/fetch_savedmodel.sh b/tools/conversion/fetch_savedmodel.sh new file mode 100755 index 000000000..9fd4bfecb --- /dev/null +++ b/tools/conversion/fetch_savedmodel.sh @@ -0,0 +1,52 @@ +#!/usr/bin/env bash +# Pull a DeepVariant SavedModel from gs://deepvariant/models/DeepVariant/1.10.0// +# Usage: ./fetch_savedmodel.sh +set -euo pipefail + +cd "$(dirname "$0")" +mkdir -p models + +NAME="${1:?usage: $0 }" +DST="models/${NAME}" + +if [[ -f "${DST}/saved_model.pb" || -f "${DST}/saved_model.pbtxt" ]]; then + echo "==> ${DST} already populated, skipping" + exit 0 +fi + +mkdir -p 
"${DST}/variables" + +# Public bucket; HTTPS works without credentials. +BASE="https://storage.googleapis.com/deepvariant/models/DeepVariant/1.10.0/${NAME}" + +# Standard SavedModel layout. Some files are optional depending on TF version. +FILES=( + "saved_model.pb" + "fingerprint.pb" + "variables/variables.data-00000-of-00001" + "variables/variables.index" +) + +for f in "${FILES[@]}"; do + url="${BASE}/${f}" + out="${DST}/${f}" + echo "==> fetching ${url}" + if ! curl -fL --retry 3 --connect-timeout 15 -o "${out}" "${url}"; then + if [[ "${f}" == "fingerprint.pb" ]]; then + echo " (fingerprint.pb optional, skipping)" + rm -f "${out}" + else + echo "error: failed to fetch ${url}" >&2 + exit 1 + fi + fi +done + +# Some bundles include an example_info.json or assets — try to grab them but don't fail. +for opt in "example_info.json" "assets/extra_options.json"; do + curl -fL --retry 1 --connect-timeout 5 -o "${DST}/${opt}" \ + "${BASE}/${opt}" 2>/dev/null || true +done + +echo "==> ${NAME} SavedModel ready at ${DST}" +ls -la "${DST}" "${DST}/variables" diff --git a/tools/conversion/parity_check.py b/tools/conversion/parity_check.py new file mode 100644 index 000000000..98aa6e3a9 --- /dev/null +++ b/tools/conversion/parity_check.py @@ -0,0 +1,174 @@ +"""Compare candidate softmax TFRecords to a reference, emit a diff report. 
+ +Usage: + python parity_check.py \ + --reference ../reference/output/wgs/call_variants_chr20.tfrecord \ + --candidates ../../benchmarks/coreml_wgs.cv.tfrecord \ + ../../benchmarks/metal_wgs.cv.tfrecord +""" + +from __future__ import annotations + +import argparse +import json +import struct +import sys +from pathlib import Path + +import numpy as np + + +def _read_tfrecord_payloads(path: str): + with open(path, "rb") as f: + while True: + length_bytes = f.read(8) + if not length_bytes: + return + if len(length_bytes) != 8: + raise RuntimeError(f"truncated tfrecord {path}") + (length,) = struct.unpack(" tuple[int, np.ndarray]: + """Decode the minimal {idx:u32, softmax:[3]f32} record written by bench.py.""" + if len(payload) != 4 + 12: + raise RuntimeError(f"expected 16-byte payload, got {len(payload)}") + idx = struct.unpack(" np.ndarray: + rows: dict[int, np.ndarray] = {} + for payload in _read_tfrecord_payloads(path): + idx, sm = _decode_minimal(payload) + rows[idx] = sm + if not rows: + raise RuntimeError(f"no records in {path}") + return np.stack([rows[i] for i in sorted(rows)], axis=0) + + +def load_softmax_dv(path: str) -> np.ndarray: + """Decode upstream's CallVariantsOutput proto to extract the genotype probabilities. + + CallVariantsOutput.genotype_probabilities is a repeated double field. + We use a minimal protobuf decoder rather than depending on the generated + Python module — this script must work outside the conversion venv. + """ + rows: list[np.ndarray] = [] + for payload in _read_tfrecord_payloads(path): + # Walk the wire format looking for field 3 (genotype_probabilities, varint-tag 0x1A). + # This is intentionally minimal — fragile if the proto schema changes. 
+ i = 0 + probs: list[float] = [] + while i < len(payload): + tag = payload[i] + i += 1 + field_no = tag >> 3 + wire = tag & 0x7 + if wire == 0: # varint + while payload[i] & 0x80: + i += 1 + i += 1 + elif wire == 1: # 64-bit + i += 8 + elif wire == 2: # length-delimited + ln = 0 + shift = 0 + while True: + b = payload[i] + i += 1 + ln |= (b & 0x7F) << shift + if not (b & 0x80): + break + shift += 7 + seg = payload[i : i + ln] + i += ln + if field_no == 3: # genotype_probabilities, packed doubles + probs.extend(np.frombuffer(seg, dtype=np.float64).tolist()) + elif wire == 5: # 32-bit + i += 4 + else: + raise RuntimeError(f"unsupported wire type {wire}") + if probs: + rows.append(np.asarray(probs[:3], dtype=np.float32)) + if not rows: + raise RuntimeError(f"no CallVariantsOutput records in {path}") + return np.stack(rows, axis=0) + + +def compare(reference: np.ndarray, candidate: np.ndarray) -> dict: + if reference.shape != candidate.shape: + return { + "ok": False, + "reason": f"shape mismatch: ref {reference.shape} vs cand {candidate.shape}", + } + diff = np.abs(reference - candidate) + max_abs = float(diff.max()) + mean_abs = float(diff.mean()) + ref_arg = reference.argmax(axis=1) + cand_arg = candidate.argmax(axis=1) + arg_disagree = int((ref_arg != cand_arg).sum()) + return { + "ok": arg_disagree == 0 and max_abs <= 1e-3, + "n": int(reference.shape[0]), + "max_abs_softmax": max_abs, + "mean_abs_softmax": mean_abs, + "argmax_disagreements": arg_disagree, + "argmax_disagreement_rate": arg_disagree / reference.shape[0], + } + + +def autoload(path: str) -> np.ndarray: + """Try the upstream CallVariantsOutput format first, fall back to minimal.""" + try: + return load_softmax_dv(path) + except Exception: + return load_softmax_minimal(path) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--reference", required=True) + p.add_argument("--candidates", nargs="+", required=True) + p.add_argument("--output", default=None) + 
p.add_argument("--max-abs-tol", type=float, default=1e-3) + args = p.parse_args() + + print(f"reference: {args.reference}") + ref = autoload(args.reference) + print(f" shape={ref.shape}") + + results: dict[str, dict] = {} + overall_ok = True + for c in args.candidates: + print(f"candidate: {c}") + cand = autoload(c) + report = compare(ref, cand) + report["ok"] = report.get("argmax_disagreements", 1) == 0 and report.get("max_abs_softmax", 1.0) <= args.max_abs_tol + results[c] = report + line = ( + f" n={report.get('n', '?')} max|Δ|={report.get('max_abs_softmax', 0):.2e} " + f"mean|Δ|={report.get('mean_abs_softmax', 0):.2e} " + f"argmax_disagree={report.get('argmax_disagreements', 0)}/{report.get('n', 0)} " + f"({100 * report.get('argmax_disagreement_rate', 0):.3f}%) " + f"{'OK' if report['ok'] else 'FAIL'}" + ) + print(line) + overall_ok &= report["ok"] + + if args.output: + Path(args.output).parent.mkdir(parents=True, exist_ok=True) + with open(args.output, "w") as f: + json.dump({"reference": args.reference, "results": results}, f, indent=2) + print(f"wrote {args.output}") + + return 0 if overall_ok else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tools/conversion/requirements-coreml.txt b/tools/conversion/requirements-coreml.txt new file mode 100644 index 000000000..a32947408 --- /dev/null +++ b/tools/conversion/requirements-coreml.txt @@ -0,0 +1,17 @@ +# Phase 0 Voie A — Core ML conversion venv (TF-free) +# +# We bridge the SavedModel via PyTorch + torchvision instead of TF. +# coremltools accepts a traced PyTorch model (`source="pytorch"`) without +# importing tensorflow. +# +# SavedModel reading is done via `dv-tools` Python helpers under +# Sources/savedmodel_reader.py (planned), which generate Python from +# TF's public .proto files via protoc — no TF runtime. 
+torch==2.4.1 +torchvision==0.19.1 +coremltools==7.2 +numpy<2 +protobuf==4.25.3 +absl-py==2.1.0 +google-crc32c==1.5.0 +safetensors>=0.4.5 diff --git a/tools/conversion/requirements-mlx.txt b/tools/conversion/requirements-mlx.txt new file mode 100644 index 000000000..fcbef9160 --- /dev/null +++ b/tools/conversion/requirements-mlx.txt @@ -0,0 +1,11 @@ +# Phase 0 Voie B — MLX venv (TF-free) +# +# Inception-v3 is rebuilt in MLX from the safetensors weight bundle that +# convert_savedmodel.py emits (no TF). Bench uses MLX directly. +mlx>=0.21.0 +mlx-data>=0.0.4 +numpy<2 +protobuf==4.25.3 +absl-py==2.1.0 +safetensors>=0.4.5 +google-crc32c==1.5.0 diff --git a/tools/conversion/savedmodel_reader.py b/tools/conversion/savedmodel_reader.py new file mode 100644 index 000000000..7cb934235 --- /dev/null +++ b/tools/conversion/savedmodel_reader.py @@ -0,0 +1,59 @@ +"""TF-free SavedModel reader. + +Status: STUB. Implementation plan: + +1. Vendor TF's public `.proto` files under `tools/conversion/Protos/tensorflow/`: + saved_model.proto, meta_graph.proto, graph.proto, node_def.proto, + attr_value.proto, tensor.proto, tensor_shape.proto, types.proto, + op_def.proto, function.proto, versions.proto, resource_handle.proto, + variable.proto, tensor_bundle.proto. + Source: github.com/tensorflow/tensorflow @ r2.16 (Apache-2.0). + +2. Generate Python bindings via system protoc: + protoc --python_out=Generated/ -I=Protos/tensorflow Protos/tensorflow/... + This needs `protobuf>=4.25` runtime; no tensorflow at all. + +3. Parse `saved_model.pb` -> `SavedModel` proto -> first MetaGraphDef + -> SignatureDef + GraphDef. Walk GraphDef.node, extract op types and + inputs. + +4. Read weights from `variables/variables.{index, data-00000-of-00001}`: + - The index is a header (BundleHeaderProto) followed by sorted + BundleEntryProto records keyed by variable name. 
We can replicate + TF's `BundleReader` by reading the leveldb-style log format + described at tensorflow/core/util/tensor_bundle/tensor_bundle.h. + - For each entry: offset, length, dtype, shape -> mmap the data file + and slice out the bytes -> numpy.ndarray with the right dtype/shape. + +5. Return a dict[str, numpy.ndarray] of named tensors plus a graph + summary that callers can use to build the equivalent torchvision / + MLX architecture. + +Until step 1 lands, every public function here raises NotImplementedError. +""" + +from __future__ import annotations + +from pathlib import Path +from typing import Any + + +class SavedModelReader: + def __init__(self, directory: str | Path) -> None: + self.directory = Path(directory) + if not (self.directory / "saved_model.pb").exists(): + raise FileNotFoundError(f"{self.directory}/saved_model.pb not found") + + def graph_summary(self) -> dict[str, Any]: + raise NotImplementedError( + "SavedModelReader.graph_summary: vendor TF .proto files under Protos/tensorflow, " + "generate Python via `protoc --python_out`, then parse saved_model.pb." + ) + + def weights(self) -> dict[str, Any]: + raise NotImplementedError( + "SavedModelReader.weights: implement the BundleReader equivalent " + "(see tensorflow/core/util/tensor_bundle/tensor_bundle.{h,cc}). " + "This needs to read variables/variables.index (leveldb-style log) " + "and slice variables/variables.data-* by offset+length." + ) diff --git a/tools/conversion/setup_venvs.sh b/tools/conversion/setup_venvs.sh new file mode 100755 index 000000000..ddfd60b42 --- /dev/null +++ b/tools/conversion/setup_venvs.sh @@ -0,0 +1,62 @@ +#!/usr/bin/env bash +# Set up the two Phase 0 conversion venvs (coreml / mlx). TF-free. +# tf-metal voie was dropped: tensorflow-metal 1.2.0 is unmaintained since mid-2024. +# Idempotent: safe to re-run. +set -euo pipefail + +cd "$(dirname "$0")" + +PYTHON_VERSION="$(cat .python-version)" + +if ! 
command -v pyenv >/dev/null 2>&1; then + echo "error: pyenv not found. brew install pyenv" >&2 + exit 1 +fi + +if ! pyenv versions --bare | grep -qx "${PYTHON_VERSION}"; then + echo "==> installing Python ${PYTHON_VERSION} via pyenv" + pyenv install "${PYTHON_VERSION}" +fi + +PYBIN="$(pyenv root)/versions/${PYTHON_VERSION}/bin/python3" + +setup_venv() { + local name="$1" + local req="requirements-${name}.txt" + local venv="venv-${name}" + + if [[ -d "${venv}" ]]; then + echo "==> ${venv} exists, refreshing pinned deps" + else + echo "==> creating ${venv}" + "${PYBIN}" -m venv "${venv}" + fi + + # shellcheck disable=SC1091 + source "${venv}/bin/activate" + python -m pip install --upgrade pip wheel + python -m pip install -r "${req}" + python -c "import sys; print('python', sys.version)" + + # Hard guard: tensorflow MUST NOT be importable from any of our venvs. + if python -c "import tensorflow" 2>/dev/null; then + echo "FATAL: tensorflow is importable in ${venv} — TF-free policy violated" >&2 + deactivate + exit 1 + fi + + case "${name}" in + coreml) + python -c "import torch, coremltools as ct; print('torch', torch.__version__, 'coremltools', ct.__version__)" + ;; + mlx) + python -c "import mlx.core as mx; print('mlx default device:', mx.default_device())" + ;; + esac + deactivate +} + +setup_venv coreml +setup_venv mlx + +echo "==> both venvs ready (TF-free)" diff --git a/tools/reference/README.md b/tools/reference/README.md new file mode 100644 index 000000000..878ee4d4c --- /dev/null +++ b/tools/reference/README.md @@ -0,0 +1,45 @@ +# Linux x86 reference capture + +One-time dev capture of upstream DeepVariant's outputs on a small chr20 fixture, +used as the parity reference for our Apple Silicon native rewrite. + +This **uses Docker** (`google/deepvariant:1.10.0`) under qemu emulation on the +Mac. Docker appears only here, in a one-time dev-time pipeline. It is **not** +in the user product, which the user-facing constraints already guarantee. 
+ +## What gets captured + +For each model variant we want to bench (wgs / wes / pacbio / ont_r104 / pangenome): + +```text +tools/reference/output/<variant>/ +├── examples_chr20.tfrecord # output of make_examples +├── call_variants_chr20.tfrecord # output of call_variants (the parity reference) +├── output.vcf.gz # output of postprocess_variants +└── manifest.json # exact upstream version, args, sha256s +``` + +The 1000-example slice in `cache/wgs_chr20_1000.tfrecord` is what `bench.py` +reads as input. + +## Pipeline + +```sh +# one-time downloads +./fetch_chr20_fixture.sh # ~120 MB BAM, ~80 MB ref FASTA + +# per-variant +./capture_linux_x86.sh wgs +``` + +`capture_linux_x86.sh` runs upstream DeepVariant's three stages under qemu +linux/amd64 emulation, capturing the inputs and outputs at each stage +boundary. + +## Caveats + +- qemu emulation is slow — expect 30-60 minutes per variant on the M4 Max + (vs 10 minutes natively). One-time cost. +- Storage: ~5 GB total per variant after capture. +- The captured TFRecords are committed under `testdata/reference/` via Git LFS + (set up with `git lfs track "testdata/reference/**"` before the first commit). diff --git a/tools/reference/capture_linux_x86.sh b/tools/reference/capture_linux_x86.sh new file mode 100755 index 000000000..e7aea368a --- /dev/null +++ b/tools/reference/capture_linux_x86.sh @@ -0,0 +1,103 @@ +#!/usr/bin/env bash +# Run upstream google/deepvariant Docker image under qemu linux/amd64 to +# capture reference outputs on a small chr20 region.
+# +# Usage: ./capture_linux_x86.sh <wgs|wes|pacbio|ont_r104> +set -euo pipefail + +cd "$(dirname "$0")" + +VARIANT="${1:?usage: $0 <wgs|wes|pacbio|ont_r104>}" +DV_VERSION="1.10.0" +REGION="${REGION:-chr20:10000000-10100000}" +N_SHARDS="${N_SHARDS:-1}" + +# Map variant -> upstream model_type flag +case "${VARIANT}" in + wgs) MODEL_TYPE="WGS" ;; + wes) MODEL_TYPE="WES" ;; + pacbio) MODEL_TYPE="PACBIO" ;; + ont_r104) MODEL_TYPE="ONT_R104" ;; + *) echo "unknown variant: ${VARIANT}" >&2; exit 2 ;; +esac + +OUT="output/${VARIANT}" +mkdir -p "${OUT}" + +if [[ ! -s cache/HG002.chr20.bam ]]; then + echo "==> chr20 fixture not present; running fetch_chr20_fixture.sh" + ./fetch_chr20_fixture.sh +fi + +if ! command -v docker >/dev/null 2>&1; then + echo "error: docker not found" >&2 + exit 1 +fi + +# Make sure the linux/amd64 platform is available for qemu emulation. +docker buildx inspect default >/dev/null 2>&1 || docker buildx create --use --name dv-x86 || true + +# Pull the upstream image (linux/amd64 explicit so qemu kicks in). +docker pull --platform linux/amd64 "google/deepvariant:${DV_VERSION}" + +# We run the bundled run_deepvariant wrapper once and keep its intermediate +# TFRecords (examples + call_variants outputs) via --intermediate_results_dir. +WORK="/work" +docker run --rm \ + --platform linux/amd64 \ + -v "${PWD}/cache:${WORK}/cache:ro" \ + -v "${PWD}/${OUT}:${WORK}/output" \ + "google/deepvariant:${DV_VERSION}" \ + /opt/deepvariant/bin/run_deepvariant \ + --model_type="${MODEL_TYPE}" \ + --ref="${WORK}/cache/grch38_chr20.fasta" \ + --reads="${WORK}/cache/HG002.chr20.bam" \ + --output_vcf="${WORK}/output/output.vcf.gz" \ + --output_gvcf="${WORK}/output/output.g.vcf.gz" \ + --num_shards="${N_SHARDS}" \ + --regions="${REGION}" \ + --intermediate_results_dir="${WORK}/output/intermediate" \ + 2>&1 | tee "${OUT}/run.log" + +# Extract the intermediate TFRecords for parity bench.
+EXAMPLES_GLOB=("${OUT}"/intermediate/make_examples.tfrecord*) +CV_GLOB=("${OUT}"/intermediate/call_variants_output.tfrecord*) +if (( ${#EXAMPLES_GLOB[@]} > 0 && ${#CV_GLOB[@]} > 0 )); then + cp "${EXAMPLES_GLOB[0]}" "${OUT}/examples_chr20.tfrecord" + cp "${CV_GLOB[0]}" "${OUT}/call_variants_chr20.tfrecord" +fi + +# Build a 1000-example slice for fast bench iteration. +mkdir -p cache +python3 - < "${OUT}/manifest.json" + +echo "==> done — ${OUT}" +ls -lh "${OUT}/" diff --git a/tools/reference/fetch_chr20_fixture.sh b/tools/reference/fetch_chr20_fixture.sh new file mode 100755 index 000000000..2e4787e66 --- /dev/null +++ b/tools/reference/fetch_chr20_fixture.sh @@ -0,0 +1,33 @@ +#!/usr/bin/env bash +# Pull the small GIAB chr20 fixture used by the reference capture. +# This is the same dataset Google uses in the public DeepVariant case study. +set -euo pipefail + +cd "$(dirname "$0")" +mkdir -p cache + +REF_URL="https://storage.googleapis.com/deepvariant/case-study-testdata/grch38_chr20.fasta" +REF_FAI_URL="https://storage.googleapis.com/deepvariant/case-study-testdata/grch38_chr20.fasta.fai" +BAM_URL="https://storage.googleapis.com/deepvariant/case-study-testdata/HG002_NIST_150bp_50x.chr20.bam" +BAI_URL="https://storage.googleapis.com/deepvariant/case-study-testdata/HG002_NIST_150bp_50x.chr20.bam.bai" + +fetch() { + local url="$1" + local out="$2" + if [[ -s "${out}" ]]; then + echo " ${out} present, skip" + return + fi + echo " fetching ${url}" + curl -fL --retry 3 --connect-timeout 15 -o "${out}.partial" "${url}" + mv "${out}.partial" "${out}" +} + +echo "==> chr20 reference fixture" +fetch "${REF_URL}" cache/grch38_chr20.fasta +fetch "${REF_FAI_URL}" cache/grch38_chr20.fasta.fai +fetch "${BAM_URL}" cache/HG002.chr20.bam +fetch "${BAI_URL}" cache/HG002.chr20.bam.bai + +echo "==> done" +ls -lh cache/ diff --git a/tools/verify_gpu.sh b/tools/verify_gpu.sh new file mode 100755 index 000000000..4e0669e87 --- /dev/null +++ b/tools/verify_gpu.sh @@ -0,0 +1,68 @@ 
+#!/usr/bin/env bash +# Wrap a command with `powermetrics` GPU/ANE residency capture. +# +# Usage: +# sudo ./tools/verify_gpu.sh path/to/log.json -- ./build/deepvariant call_variants ... +# +# powermetrics requires root, so this is intended to be invoked under sudo +# during validation runs. It runs the rest of the command line, then writes a +# JSON summary to the first arg. Pure shell + awk — no Python. +set -euo pipefail + +if [[ $# -lt 3 ]]; then + echo "usage: $0 <log.json> -- <command> [args...]" >&2 + exit 2 +fi + +OUT="$1"; shift +[[ "$1" == "--" ]] || { echo "expected '--' separator" >&2; exit 2; } +shift + +TMP_PM="$(mktemp -t verify_gpu.pm.XXXXXX)" +trap 'rm -f "${TMP_PM}"' EXIT + +if [[ "$(id -u)" -ne 0 ]]; then + echo "warning: not running as root; powermetrics will fail. re-run with sudo." >&2 +fi + +# Start powermetrics in background. +powermetrics --samplers gpu_power,ane_power -i 500 -f text > "${TMP_PM}" 2>/dev/null & +PM_PID=$! + +# Run the workload. +RC=0 +"$@" || RC=$? + +# Stop powermetrics. +kill -INT "${PM_PID}" 2>/dev/null || true +wait "${PM_PID}" 2>/dev/null || true + +# Summarise with awk — no interpreter required beyond /usr/bin/awk. +awk ' + function blk(label, n, sum, mx, act, mean, pct) { + mean = (n > 0) ? sum / n : 0 + pct = (n > 0) ? 100.0 * act / n : 0 + printf " \"%s\": {\"n\": %d, \"mean_mw\": %.1f, \"max_mw\": %d, \"active_pct\": %.1f}", \ + label, n, mean, mx, pct + } + /^GPU Power:/ { + n_gpu++; sum_gpu += $3; + if ($3 > max_gpu) max_gpu = $3; + if ($3 > 50) act_gpu++; + next + } + /^ANE Power:/ { + n_ane++; sum_ane += $3; + if ($3 > max_ane) max_ane = $3; + if ($3 > 50) act_ane++; + next + } + END { + print "{" + blk("gpu", n_gpu, sum_gpu, max_gpu, act_gpu); print "," + blk("ane", n_ane, sum_ane, max_ane, act_ane); print "" + print "}" + } +' "${TMP_PM}" | tee "${OUT}" + +exit "${RC}"
From 27906e0a33e26528a24d63235bdb0e034cbf62b8 Mon Sep 17 00:00:00 2001 From: Benjamin Demaille Date: Sat, 25 Apr 2026 23:30:25 +0200 Subject: [PATCH 003/282] phase-0 step-2: vendor 25 TF .proto files at r2.16 All TF .proto schemas (~110 KB total) needed for SavedModel parsing without TensorFlow runtime, sourced verbatim from github.com/tensorflow/tensorflow @ r2.16. SOURCES.md records the upstream branch and exact list; re-fetch is a single curl loop. Files cover: SavedModel + MetaGraphDef + Saver + GraphDef + NodeDef + AttrValue + Tensor / TensorShape / DataType + OpDef + Function + ResourceHandle + Variable + TensorBundle (BundleEntryProto for the variables/variables.{index, data-*} format). Generated/ Python bindings will be produced by `protoc --python_out` in setup_venvs.sh and gitignored.
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 1 + tools/conversion/Protos/SOURCES.md | 67 ++++ .../framework/allocation_description.proto | 29 ++ .../core/framework/attr_value.proto | 64 ++++ .../core/framework/cost_graph.proto | 89 +++++ .../core/framework/device_attributes.proto | 58 +++ .../tensorflow/core/framework/full_type.proto | 310 ++++++++++++++++ .../tensorflow/core/framework/function.proto | 136 +++++++ .../tensorflow/core/framework/graph.proto | 60 +++ .../tensorflow/core/framework/node_def.proto | 95 +++++ .../tensorflow/core/framework/op_def.proto | 193 ++++++++++ .../core/framework/resource_handle.proto | 45 +++ .../core/framework/step_stats.proto | 88 +++++ .../tensorflow/core/framework/tensor.proto | 100 +++++ .../core/framework/tensor_description.proto | 24 ++ .../core/framework/tensor_shape.proto | 46 +++ .../tensorflow/core/framework/types.proto | 100 +++++ .../tensorflow/core/framework/variable.proto | 84 +++++ .../tensorflow/core/framework/versions.proto | 33 ++ .../core/protobuf/debug_event.proto | 300 +++++++++++++++ .../core/protobuf/error_codes.proto | 11 + .../tensorflow/core/protobuf/meta_graph.proto | 342 ++++++++++++++++++ .../core/protobuf/saved_model.proto | 23 ++ .../tensorflow/core/protobuf/saver.proto | 48 +++ .../tensorflow/core/protobuf/struct.proto | 164 +++++++++ .../core/protobuf/tensor_bundle.proto | 66 ++++ .../protobuf/trackable_object_graph.proto | 80 ++++ 27 files changed, 2656 insertions(+) create mode 100644 tools/conversion/Protos/SOURCES.md create mode 100644 tools/conversion/Protos/tensorflow/core/framework/allocation_description.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/attr_value.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/cost_graph.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/device_attributes.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/full_type.proto create mode 100644 
tools/conversion/Protos/tensorflow/core/framework/function.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/graph.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/node_def.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/op_def.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/resource_handle.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/step_stats.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/tensor.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/tensor_description.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/tensor_shape.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/types.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/variable.proto create mode 100644 tools/conversion/Protos/tensorflow/core/framework/versions.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/debug_event.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/error_codes.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/meta_graph.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/saved_model.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/saver.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/struct.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/tensor_bundle.proto create mode 100644 tools/conversion/Protos/tensorflow/core/protobuf/trackable_object_graph.proto diff --git a/.gitignore b/.gitignore index f6fba89d3..cd5621a8f 100644 --- a/.gitignore +++ b/.gitignore @@ -19,6 +19,7 @@ build-*/ tools/conversion/venv-*/ tools/conversion/.cache/ tools/conversion/models/ +tools/conversion/Generated/ tools/reference/cache/ tools/reference/output/ benchmarks/runs/ diff 
--git a/tools/conversion/Protos/SOURCES.md b/tools/conversion/Protos/SOURCES.md new file mode 100644 index 000000000..c95cac3d9 --- /dev/null +++ b/tools/conversion/Protos/SOURCES.md @@ -0,0 +1,67 @@ +# Vendored protobuf sources + +All `.proto` files under `Protos/tensorflow/` are vendored verbatim from upstream and **not patched**. Their license is each upstream project's own. Re-fetch via the commands below if anything changes upstream. + +## TensorFlow — `tensorflow/r2.16` branch (Apache-2.0) + +Source: https://github.com/tensorflow/tensorflow +Branch: `r2.16` (matches the TF version that wrote the DeepVariant 1.10 SavedModels). +Fetched: 2026-04-25. + +Files (25): + +```text +core/framework/allocation_description.proto +core/framework/attr_value.proto +core/framework/cost_graph.proto +core/framework/device_attributes.proto +core/framework/full_type.proto +core/framework/function.proto +core/framework/graph.proto +core/framework/node_def.proto +core/framework/op_def.proto +core/framework/resource_handle.proto +core/framework/step_stats.proto +core/framework/tensor.proto +core/framework/tensor_description.proto +core/framework/tensor_shape.proto +core/framework/types.proto +core/framework/variable.proto +core/framework/versions.proto +core/protobuf/debug_event.proto +core/protobuf/error_codes.proto +core/protobuf/meta_graph.proto +core/protobuf/saved_model.proto +core/protobuf/saver.proto +core/protobuf/struct.proto +core/protobuf/tensor_bundle.proto +core/protobuf/trackable_object_graph.proto +``` + +Re-fetch: + +```sh +TF_REF="r2.16" +BASE="https://raw.githubusercontent.com/tensorflow/tensorflow/${TF_REF}/tensorflow" +cd tools/conversion/Protos/tensorflow +for f in <files listed above>; do + curl -fsSL -o "${f}" "${BASE}/${f}" +done +``` + +## Generation + +Python bindings are generated under `tools/conversion/Generated/` (gitignored): + +```sh +cd tools/conversion +protoc --python_out=Generated/ \ + --proto_path=Protos \ + Protos/tensorflow/core/**/*.proto +``` + +Bindings are regenerated on demand by
`setup_venvs.sh` (TBD). + +## Why we vendor instead of pip-install + +The natural way to get TF's `.proto` definitions is `pip install tensorflow`, which we explicitly forbid (Voie B refined — TF banned in v2). Vendoring the schema files alone is ~110 KB and gives us proto bindings via `protoc --python_out` with no TF runtime. diff --git a/tools/conversion/Protos/tensorflow/core/framework/allocation_description.proto b/tools/conversion/Protos/tensorflow/core/framework/allocation_description.proto new file mode 100644 index 000000000..f18caa40b --- /dev/null +++ b/tools/conversion/Protos/tensorflow/core/framework/allocation_description.proto @@ -0,0 +1,29 @@ +syntax = "proto3"; + +package tensorflow; + +option cc_enable_arenas = true; +option java_outer_classname = "AllocationDescriptionProtos"; +option java_multiple_files = true; +option java_package = "org.tensorflow.framework"; +option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/framework/allocation_description_go_proto"; + +message AllocationDescription { + // Total number of bytes requested + int64 requested_bytes = 1; + + // Total number of bytes allocated if known + int64 allocated_bytes = 2; + + // Name of the allocator used + string allocator_name = 3; + + // Identifier of the allocated buffer if known + int64 allocation_id = 4; + + // Set if this tensor only has one remaining reference + bool has_single_reference = 5; + + // Address of the allocation. 
+ uint64 ptr = 6; +} diff --git a/tools/conversion/Protos/tensorflow/core/framework/attr_value.proto b/tools/conversion/Protos/tensorflow/core/framework/attr_value.proto new file mode 100644 index 000000000..2bd5b552a --- /dev/null +++ b/tools/conversion/Protos/tensorflow/core/framework/attr_value.proto @@ -0,0 +1,64 @@ +syntax = "proto3"; + +package tensorflow; + +import "tensorflow/core/framework/tensor.proto"; +import "tensorflow/core/framework/tensor_shape.proto"; +import "tensorflow/core/framework/types.proto"; + +option cc_enable_arenas = true; +option java_outer_classname = "AttrValueProtos"; +option java_multiple_files = true; +option java_package = "org.tensorflow.framework"; +option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/framework/attr_value_go_proto"; + +// Protocol buffer representing the value for an attr used to configure an Op. +// Comment indicates the corresponding attr type. Only the field matching the +// attr type may be filled. +message AttrValue { + // LINT.IfChange + message ListValue { + repeated bytes s = 2; // "list(string)" + repeated int64 i = 3 [packed = true]; // "list(int)" + repeated float f = 4 [packed = true]; // "list(float)" + repeated bool b = 5 [packed = true]; // "list(bool)" + repeated DataType type = 6 [packed = true]; // "list(type)" + repeated TensorShapeProto shape = 7; // "list(shape)" + repeated TensorProto tensor = 8; // "list(tensor)" + repeated NameAttrList func = 9; // "list(attr)" + } + // LINT.ThenChange(//tensorflow/c/c_api.cc) + + oneof value { + bytes s = 2; // "string" + int64 i = 3; // "int" + float f = 4; // "float" + bool b = 5; // "bool" + DataType type = 6; // "type" + TensorShapeProto shape = 7; // "shape" + TensorProto tensor = 8; // "tensor" + ListValue list = 1; // any "list(...)" + + // "func" represents a function. func.name is a function's name or + // a primitive op's name. func.attr.first is the name of an attr + // defined for that function. 
func.attr.second is the value for + // that attr in the instantiation. + NameAttrList func = 10; + + // This is a placeholder only used in nodes defined inside a + // function. It indicates the attr value will be supplied when + // the function is instantiated. For example, let us suppose a + // node "N" in function "FN". "N" has an attr "A" with value + // placeholder = "foo". When FN is instantiated with attr "foo" + // set to "bar", the instantiated node N's attr A will have been + // given the value "bar". + string placeholder = 9; + } +} + +// A list of attr names and their values. The whole list is attached +// with a string name. E.g., MatMul[T=float]. +message NameAttrList { + string name = 1; + map attr = 2; +} diff --git a/tools/conversion/Protos/tensorflow/core/framework/cost_graph.proto b/tools/conversion/Protos/tensorflow/core/framework/cost_graph.proto new file mode 100644 index 000000000..42c9e23cf --- /dev/null +++ b/tools/conversion/Protos/tensorflow/core/framework/cost_graph.proto @@ -0,0 +1,89 @@ +syntax = "proto3"; + +package tensorflow; + +import "tensorflow/core/framework/tensor_shape.proto"; +import "tensorflow/core/framework/types.proto"; + +option cc_enable_arenas = true; +option java_outer_classname = "CostGraphProtos"; +option java_multiple_files = true; +option java_package = "org.tensorflow.framework"; +option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/framework/cost_graph_go_proto"; + +message CostGraphDef { + message Node { + // The name of the node. Names are globally unique. + string name = 1; + + // The device of the node. Can be empty if the node is mapped to the + // default partition or partitioning hasn't been run yet. + string device = 2; + + // The id of the node. Node ids are only unique inside a partition. + int32 id = 3; + + // Inputs of this node. They must be executed before this node can be + // executed. 
An input is a particular output of another node, specified + // by the node id and the output index. + message InputInfo { + int32 preceding_node = 1; + int32 preceding_port = 2; + } + repeated InputInfo input_info = 4; + + // Outputs of this node. + message OutputInfo { + int64 size = 1; + // If >= 0, the output is an alias of an input. Note that an alias input + // may itself be an alias. The algorithm will therefore need to follow + // those pointers. + int64 alias_input_port = 2; + TensorShapeProto shape = 3; + DataType dtype = 4; + } + repeated OutputInfo output_info = 5; + + // Temporary memory used by this node. + int64 temporary_memory_size = 6; + + // Persistent memory used by this node. + int64 persistent_memory_size = 12; + + int64 host_temp_memory_size = 10 [deprecated = true]; + int64 device_temp_memory_size = 11 [deprecated = true]; + int64 device_persistent_memory_size = 16 [deprecated = true]; + + // Estimate of the computational cost of this node, in microseconds. + int64 compute_cost = 9; + + // Analytical estimate of the computational cost of this node, in + // microseconds. + int64 compute_time = 14; + + // Analytical estimate of the memory access cost of this node, in + // microseconds. + int64 memory_time = 15; + + // If true, the output is permanent: it can't be discarded, because this + // node is part of the "final output". Nodes may depend on final nodes. + bool is_final = 7; + + // Ids of the control inputs for this node. + repeated int32 control_input = 8; + + // Are the costs inaccurate? + bool inaccurate = 17; + } + repeated Node node = 1; + + // Total cost of this graph, typically used for balancing decisions. + message AggregatedCost { + // Aggregated cost value. + float cost = 1; + + // Aggregated cost dimension (e.g. 'memory', 'compute', 'network'). 
+ string dimension = 2; + } + repeated AggregatedCost cost = 2; +} diff --git a/tools/conversion/Protos/tensorflow/core/framework/device_attributes.proto b/tools/conversion/Protos/tensorflow/core/framework/device_attributes.proto new file mode 100644 index 000000000..5f568e255 --- /dev/null +++ b/tools/conversion/Protos/tensorflow/core/framework/device_attributes.proto @@ -0,0 +1,58 @@ +syntax = "proto3"; + +package tensorflow; + +option cc_enable_arenas = true; +option java_outer_classname = "DeviceAttributesProtos"; +option java_multiple_files = true; +option java_package = "org.tensorflow.framework"; +option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/framework/device_attributes_go_proto"; + +message InterconnectLink { + int32 device_id = 1; + string type = 2; + int32 strength = 3; +} + +message LocalLinks { + repeated InterconnectLink link = 1; +} + +message DeviceLocality { + // Optional bus locality of device. Default value of 0 means + // no specific locality. Specific localities are indexed from 1. + int32 bus_id = 1; + + // Optional NUMA locality of device. + int32 numa_node = 2; + + // Optional local interconnect links to other devices. + LocalLinks links = 3; +} + +message DeviceAttributes { + // Fully specified name of the device within a cluster. + string name = 1; + + // String representation of device_type. + string device_type = 2; + + // Memory capacity of device in bytes. + int64 memory_limit = 4; + + // Platform-specific data about device that may be useful + // for supporting efficient data transfers. + DeviceLocality locality = 5; + + // A device is assigned a global unique number each time it is + // initialized. "incarnation" should never be 0. + fixed64 incarnation = 6; + + // String representation of the physical device that this device maps to. + string physical_device_desc = 7; + + // A physical device ID for use in XLA DeviceAssignments, unique across + // clients in a multi-client setup. 
Set to -1 if unavailable, non-negative + // otherwise. + int64 xla_global_id = 8; +} diff --git a/tools/conversion/Protos/tensorflow/core/framework/full_type.proto b/tools/conversion/Protos/tensorflow/core/framework/full_type.proto new file mode 100644 index 000000000..19e8da5ab --- /dev/null +++ b/tools/conversion/Protos/tensorflow/core/framework/full_type.proto @@ -0,0 +1,310 @@ +syntax = "proto3"; + +package tensorflow; + +option cc_enable_arenas = true; +option java_outer_classname = "FullTypeProtos"; +option java_multiple_files = true; +option java_package = "org.tensorflow.framework"; +option go_package = "github.com/tensorflow/tensorflow/tensorflow/go/core/framework/full_type_go_proto"; + +// LINT.IfChange +// Experimental. Represents the complete type information of a TensorFlow value. +enum FullTypeId { + // The default represents an uninitialized value. + TFT_UNSET = 0; + + // Type symbols. Used to construct more complex type expressions like + // algebraic data types. + + // Type variables may serve as placeholder for any other type ID in type + // templates. + // + // Examples: + // TFT_DATASET[TFT_VAR["T"]] is a Dataset returning a type indicated by "T". + // TFT_TENSOR[TFT_VAR["T"]] is a Tensor of an element type indicated by "T". + // TFT_TENSOR[TFT_VAR["T"]], TFT_TENSOR[TFT_VAR["T"]] are two tensors of + // identical element types. + // TFT_TENSOR[TFT_VAR["P"]], TFT_TENSOR[TFT_VAR["Q"]] are two tensors of + // independent element types. + // + TFT_VAR = 1; + + // Wildcard type. Describes a parameter of unknown type. In TensorFlow, that + // can mean either a "Top" type (accepts any type), or a dynamically typed + // object whose type is unknown in context. + // Important: "unknown" does not necessarily mean undeterminable! + TFT_ANY = 2; + + // The algebraic product type. This is an algebraic type that may be used just + // for logical grouping. Not to be confused with TFT_TUPLE which describes a + // concrete object of several elements. 
+ // + // Example: + // TFT_DATASET[TFT_PRODUCT[TFT_TENSOR[TFT_INT32], TFT_TENSOR[TFT_FLOAT64]]] + // is a Dataset producing two tensors, an integer one and a float one. + // + TFT_PRODUCT = 3; + + // Represents a named field, with the name stored in the attribute. + // + // Parametrization: + // TFT_NAMED[<type>]{<name>} + // * <type> is the type of the field + // * <name> is the field name, as string (though can theoretically be an int + // as well) + // + // Example: + // TFT_RECORD[ + // TFT_NAMED[TFT_TENSOR[TFT_INT32]]{'foo'}, + // TFT_NAMED[TFT_TENSOR[TFT_FLOAT32]]{'bar'}, + // ] + // is a structure with two fields, an int tensor "foo" and a float tensor + // "bar". + TFT_NAMED = 4; + + // Template definition. Expands the variables by repeating a template as + // arguments of container. + // + // Parametrization: + // TFT_FOR_EACH[,