Apple Silicon native port (v2) — native arm64 DeepVariant/DeepTrio/DeepSomatic/pangenome#1085
BenjaminDEMAILLE wants to merge 11 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Force-pushed from c1a9c63 to d92874b.
Thanks for the PR! Because of the way our project is set up, we aren't able to merge GitHub PRs directly. Additionally, since this is a very large PR, reviewing it in its entirety might be difficult. However, I can look through it to see if there are specific components we can review, test, and adopt. If we end up incorporating any of your changes, I'll make sure to credit your GitHub username and reference this PR in the commit description. Let me know if that works for you, or if you'd prefer that I close this PR.

-pichuan
Hi @pichuan, thanks a lot for the thoughtful reply; that works perfectly for me. Please keep the PR open as a reference; I'm happy for you to lift whatever is useful, and crediting the GitHub username + PR reference is more than enough. I'll sign the CLA so that's not a blocker.

To make cherry-picking easier, here is the PR sliced into independent, separately-reviewable units. Most of it is Apple/Metal-specific by nature (CMake build, MPSGraph/BNNS inference, Homebrew packaging) and probably not relevant to the upstream Bazel/Linux tree. But a few slices carry platform-neutral value:

Likely useful to upstream (platform-neutral):

Apple-specific (probably reference-only for you): the CMake/macOS build detangling, Metal MPSGraph + BNNS inference path, C++ reimplementations of the Python postprocess steps (haplotype conflict resolution, gVCF banding,

Happy to open smaller, focused PRs for any specific slice you'd like to review in isolation, or to answer questions on any of the above. Whatever is least work for your team.
Force-pushed from 0cd41f8 to 2724a87.
Detailed engineering report: how the Apple Silicon native port was built

Following up on the discussion above, here is a full walkthrough of the porting methodology and the development history, for anyone on the team who wants to cherry-pick specific pieces. The PR branch itself was squashed to a single commit to fit the CLA scanner's 250-commit limit, but every milestone commit referenced below is still intact and browsable in my fork. I cite them inline as

Note on authorship: the bulk of the work is mine; a set of correctness, build, and feature commits were contributed by @nh13 through fork PRs and are attributed explicitly in the dedicated section near the end.

Methodology and constraints

The port was built under a few self-imposed hard constraints, which shaped every decision:
Phase 0 to 3: foundation and first native VCF

The foundation (build detangling, weight conversion scaffolding, Core ML PoC, release + Homebrew scaffolding) landed as

Candidate-set parity

Two upstream defaults turned out to be load-bearing for matching Docker's candidate set: partitioning the calling region into 1000 bp chunks (

Phase 4: GIAB F1 gate

Weight packing moved to a custom

Phase 5.5: Metal inference and the determinism campaign

The Inception-v3 backbone was rebuilt on MPSGraph (

Parallel sharding via
End state on chr20: 100% site-set parity, 0 FILTER mismatches, all PASS variants identical to Docker, with the residual byte differences confined to QUAL/PL FP32 drift.

Phase 6: DeepTrio, DeepSomatic, pangenome-aware DV

Each tool was held to the same chr20 FILTER-parity gate.
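To make the gate concrete, here is a minimal sketch of how site-set parity and FILTER mismatches (FM) could be counted between a native and a Docker call set. It assumes each call set has already been reduced to a mapping from a site key to its FILTER string; the key layout and the function name `filter_parity` are illustrative assumptions, not the port's actual harness.

```python
from typing import Dict, Tuple

# Assumed site key: (chrom, pos, ref, alt). The value is the FILTER string.
Site = Tuple[str, int, str, str]


def filter_parity(native: Dict[Site, str], docker: Dict[Site, str]) -> Dict[str, int]:
    """Count shared sites, sites unique to either side, and FILTER mismatches."""
    shared = native.keys() & docker.keys()
    mismatches = [site for site in shared if native[site] != docker[site]]
    return {
        "shared": len(shared),
        "native_only": len(native.keys() - docker.keys()),
        "docker_only": len(docker.keys() - native.keys()),
        "filter_mismatches": len(mismatches),
    }


# Toy data: same two sites on both sides, one FILTER disagreement.
native_calls = {
    ("chr20", 10_000_100, "A", "G"): "PASS",
    ("chr20", 10_000_250, "C", "T"): "RefCall",
}
docker_calls = {
    ("chr20", 10_000_100, "A", "G"): "PASS",
    ("chr20", 10_000_250, "C", "T"): "PASS",
}
report = filter_parity(native_calls, docker_calls)
```

Under this reading, "100% site-set parity" corresponds to `native_only == docker_only == 0`, and "0 FM" to `filter_mismatches == 0`.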
Phase 8: deterministic-conv research and opt-in toggles

A full-network deterministic Metal conv path (per-thread sequential FMA, no SIMD-group reduction, guaranteed cross-chip determinism) was built and validated bit-exact per block, including a Kahan-compensated variant (

Phase 9: feature completion

Alt-aligned pileup (

Whole-genome validation and the late parity fixes

Scaling to whole-genome surfaced a few more real bugs, several of which are arguably relevant beyond this port:
The CoreML-vs-Metal backend matrix (

Real-data long-read validation

First real PacBio + ONT runs (GIAB FTP BAMs streamed by byte-range) were initially ~5% below Docker on SNP F1. Root cause: an empty

Final all-mode re-regression (pre-PR), and two bugs it caught

A full every-model-type re-regression against independently generated Docker references (
Result: all Illumina short-read modes at 0 FM (germline WGS/WES, trio WGS/WES, somatic WGS/WES/FFPE T+N and tumor-only, pangenome WGS), and all long-read/RNA modes within the documented < 5% FP32-drift tolerance.

Performance

NEON kernels for normalization and base-color/M-block classification (

Contributions by @nh13

@nh13 reviewed the port and contributed the following, merged via fork PRs. These are well-isolated and may be among the easier pieces to evaluate:
Further work from @nh13 (single-sample multi-sample VariantCaller alignment +

A note on the squash

Because this branch is the head of this PR, and the CLA bot cannot auto-scan beyond 250 commits, I squashed the 273-commit history into one content-identical commit (same tree hash) so the scan can run over the whole thing. The full per-commit development history remains intact and linkable in my fork, with each contributor's original authorship preserved on their own commits, as cited throughout this report.

Happy to provide any of these as small standalone PRs on request.
…pSomatic/pangenome

Fresh-start port of DeepVariant (+ DeepTrio, DeepSomatic, pangenome-aware DV) to a single fully-native arm64 binary on Apple Silicon: Metal GPU / ANE inference, zero Python interpreter at runtime, one-command Homebrew install. Builds via CMake on macOS only; upstream Bazel/Python left untouched as reference.

History squashed from 273 commits into a single change to keep the PR within the CLA scanner's commit limit. Full development history and per-decision rationale are preserved in PORT_LOG.md.

Release gates (HG002 whole-genome, chr20 fixtures, vs Docker 1.10.0): SNP/INDEL F1 delta = 0; FILTER parity 0 FM on chr20:10M-10.1M, 0.027% on full chr20; GPU engagement verified via powermetrics. All Illumina short-read modes at 0 FM; long-read/RNA modes within the documented < 5% FP32-drift tolerance.
Force-pushed from 2724a87 to a21e4d4.
Gate MPSGraph precision on DV_METAL_FP16: when set, use optimizationLevel1 + reducedPrecisionFastMath=Default (FP16 Winograd / operand conversion) instead of the forced Level0/FP32 path. Default (unset) is unchanged and remains contractual full FP32. Validated on chr20 (HG002, 4731 examples): metadata and record order are identical to FP32, and fp16-vs-fp32 genotype-call disagreement is 60/4731 (1.27%), exactly the FP32 backend's own run-to-run nondeterminism baseline; that is, reduced precision adds no divergence beyond what the default metal backend already exhibits. A true ANE path is not added: the 7-channel Inception-v3 input is rejected by the ANE, so MPSGraph stays on the GPU. Any default flip should still be gated behind GIAB concordance.
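As a quick check of the arithmetic in the message above, the sketch below computes a per-example divergence rate between two runs. It assumes the 60/4731 figure counts examples whose genotype call differs between the runs; the function name `divergence_rate` and the synthetic call lists are illustrative assumptions, not the port's validation code.

```python
from typing import List, Tuple


def divergence_rate(calls_a: List[str], calls_b: List[str]) -> Tuple[int, float]:
    """Return (number of differing calls, fraction of differing calls)."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call lists must cover the same examples in the same order")
    differing = sum(a != b for a, b in zip(calls_a, calls_b))
    return differing, differing / len(calls_a)


# Synthetic stand-in for the chr20 run: 4731 examples, 60 of which differ.
fp32_calls = ["0/1"] * 4731
fp16_calls = ["1/1"] * 60 + ["0/1"] * (4731 - 60)
count, rate = divergence_rate(fp32_calls, fp16_calls)
percent = round(rate * 100, 2)  # 60 / 4731 rounds to 1.27 percent
```

The same function applied to two FP32 runs would give the nondeterminism baseline that the reduced-precision result is compared against.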
Reword the nonstandard 'FP19/TF32' to 'TF32 (19-bit)' — TF32 is the 19-bit operand format; it is not FP16 (already named as the Winograd path).
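For readers unfamiliar with the format, TF32 keeps float32's sign bit and 8-bit exponent but only 10 mantissa bits, giving 1 + 8 + 10 = 19 bits. The sketch below emulates that precision by truncating a float32's low 13 mantissa bits; it is an illustration of the bit layout only (hardware typically rounds rather than truncates), and `truncate_to_tf32` is a name invented here.

```python
import struct


def truncate_to_tf32(value: float) -> float:
    """Drop the low 13 of float32's 23 mantissa bits, leaving TF32's 10."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits &= 0xFFFFE000  # keep sign (1) + exponent (8) + top 10 mantissa bits
    (result,) = struct.unpack("<f", struct.pack("<I", bits))
    return result


kept = truncate_to_tf32(1.0 + 2.0 ** -10)     # smallest step TF32 can represent above 1.0
dropped = truncate_to_tf32(1.0 + 2.0 ** -11)  # finer than TF32 resolution, collapses to 1.0
```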
- parity_check.autoload silently swallowed any decode error and switched formats via a bare except; fall back to the minimal decoder only on a raised error or zero rows, log which decoder is used, and print a shape-mismatch reason instead of indexing absent numeric keys.
- inception_v3_mil adds build-time shape asserts on the stem conv and the final linear so an HWIO->OIHW perm or dropped .T fails at convert time.
- bench._decode_feature asserts the decoded element count matches (h,w,c).
- test_savedmodel_reader asserted weights() raises NotImplementedError, but it is implemented now; assert it raises on a malformed bundle instead.
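The fallback behaviour described in the first bullet can be sketched as follows. This is a minimal illustration, not the actual parity_check code: the `autoload` signature, the decoder callables, and the logger name are all assumptions made for the example.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("parity_check_sketch")  # hypothetical logger name

Decoder = Callable[[str], List[dict]]


def autoload(path: str, full_decoder: Decoder, minimal_decoder: Decoder) -> Tuple[str, List[dict]]:
    """Use the full decoder; fall back only on a raised error or zero rows."""
    try:
        rows = full_decoder(path)
    except Exception as exc:  # not a bare except: KeyboardInterrupt and SystemExit still propagate
        logger.warning("full decoder failed on %s (%s); using minimal decoder", path, exc)
        return "minimal", minimal_decoder(path)
    if not rows:
        logger.warning("full decoder returned zero rows for %s; using minimal decoder", path)
        return "minimal", minimal_decoder(path)
    logger.info("using full decoder for %s", path)
    return "full", rows


def _broken_decoder(path: str) -> List[dict]:
    raise ValueError("unsupported record layout")


used, rows = autoload("examples.tfrecord", _broken_decoder, lambda path: [{"row": 1}])
```

Returning the decoder name alongside the rows keeps the choice visible to callers as well as in the log.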
architecture.md's Phase-0 ADR named Core ML/MIL as the inference backend, but the shipped default is Metal MPSGraph FP32 (call_variants_main.cc --inference_backend=metal; coreml debug-only; ANE not engaged), matching validation.md and scientific_report.md. Mark it SUPERSEDED and correct the Decision, preserving the historical Phase-0 numbers. Fix the conversion README (nonexistent requirements-metal.txt; coremltools>=9.0 TF-free) and the vendored proto count (25->26, matching SOURCES.md).
Determinism-neutral CPU wins on the make_examples hot loop (output unchanged):
- std::move the read vector into working_reads on the no-realigner default path instead of deep-copying every Read proto (repointing the one later use).
- Hoist loop-invariant absl::GetFlag reads and the RealignerOptions proto to const locals computed once before the region loop.
- Emit a periodic 'processed X/Y regions' progress line.
- Attach a 1 MiB streambuf to the reader before open (the writer already coalesces into 1 MiB), cutting per-record read() syscalls. Pure buffering; format and decoded bytes unchanged.
- Add a WriteRecord(std::string_view) overload and pass the example payload by view from ExampleWriter::Add instead of materializing a std::string; on-disk framing is byte-identical.
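The two claims above (a larger read buffer does not change decoded bytes, and writing from a view does not change framing) can be illustrated with a toy record format. The length-prefixed framing below is a deliberate simplification and is not the port's on-disk format; `write_record` and `read_records` are names invented for this sketch.

```python
import io
import struct

BUFFER_SIZE = 1 << 20  # 1 MiB, matching the buffer size named in the commit message


def write_record(out, payload) -> None:
    """Write one record as an 8-byte little-endian length followed by the payload.

    The payload may be bytes or a memoryview; no intermediate copy is made.
    """
    view = memoryview(payload)
    out.write(struct.pack("<Q", view.nbytes))
    out.write(view)


def read_records(raw):
    """Yield payloads from a stream through a 1 MiB buffered reader."""
    reader = io.BufferedReader(raw, buffer_size=BUFFER_SIZE)
    while True:
        header = reader.read(8)
        if len(header) < 8:  # end of stream
            return
        (length,) = struct.unpack("<Q", header)
        yield reader.read(length)


payloads = [b"example-1", b"", b"x" * 100]
from_bytes, from_views = io.BytesIO(), io.BytesIO()
for payload in payloads:
    write_record(from_bytes, payload)
    write_record(from_views, memoryview(payload))
decoded = list(read_records(io.BytesIO(from_bytes.getvalue())))
```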
- Add -mcpu=apple-m1 (ISA/scheduling only; fast-math stays off so the deterministic conv/bnns reductions remain bit-identical, guarded by the determinism microtests).
- Delete neon_base_color.h + its microtest: built-but-unused in production and semantically divergent from the live FillBaseColorBatch (a latent trap).
- Refresh the stale 'NOT YET WIRED' banner on neon_cigar_classify.h, which is in fact wired into allelecounter.cc.
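Fast-math matters for bit-identical reductions because floating-point addition is not associative: a compiler that is allowed to reorder a sum can change the low bits of the result. A minimal illustration with double-precision values (the helper `sequential_sum` is invented for this sketch):

```python
from typing import Iterable


def sequential_sum(values: Iterable[float]) -> float:
    """Sum strictly left to right, so the result depends only on input order."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]
forward = sequential_sum(values)             # (0.1 + 0.2) + 0.3
backward = sequential_sum(reversed(values))  # (0.3 + 0.2) + 0.1
reordering_changes_result = forward != backward
```

A fixed evaluation order gives the same bits on every run; permitting reassociation gives that guarantee up.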
Add --enable_inference_pipelining (default off) that overlaps the GPU MPSGraph backbone of batch N+1 with the CPU BNNS finalize + CVO build of batch N, for the two-stage Metal path only. A single finalize worker drains a FIFO queue and is the sole CVO producer (shared emit_cvos lambda), with a 2-slot double buffer (features/probs) and an at-most-one-in-flight invariant, so output order is identical to the serial path. Validated on chr20 (HG002, 4731 examples): record set, metadata, and order are byte-identical to serial, and the genotype-prob differences are exactly the default metal backend's own run-to-run nondeterminism envelope (serial-vs-serial == serial-vs-pipelined).
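The ordering argument can be sketched with a single consumer thread and a bounded FIFO. This is a simplified model under stated assumptions, not the port's implementation: `backbone` and `finalize` stand in for the GPU and CPU stages, and a `queue.Queue(maxsize=1)` approximates the 2-slot double buffer (one batch being finalized, at most one waiting).

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the batch stream


def run_pipelined(batches: Iterable[Any],
                  backbone: Callable[[Any], Any],
                  finalize: Callable[[Any], Any]) -> List[Any]:
    """Overlap backbone(batch N+1) with finalize(batch N), preserving order."""
    handoff: "queue.Queue[Any]" = queue.Queue(maxsize=1)
    results: List[Any] = []

    def finalize_worker() -> None:
        # The single worker is the only producer of results, and it drains
        # a FIFO queue, so results appear in submission order.
        while True:
            features = handoff.get()
            if features is _DONE:
                return
            results.append(finalize(features))

    worker = threading.Thread(target=finalize_worker)
    worker.start()
    for batch in batches:
        handoff.put(backbone(batch))  # blocks while a finished batch is still waiting
    handoff.put(_DONE)
    worker.join()
    return results


pipelined = run_pipelined(range(5), lambda b: b * 2, lambda f: f + 1)
serial = [(b * 2) + 1 for b in range(5)]
```

Because only one thread appends to `results`, no locking or reordering step is needed to match the serial path.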
perf: determinism-neutral speedups + opt-in inference pipelining (and docs/tests)
perf(metal): opt-in DV_METAL_FP16 reduced-precision inference
DeepVariant — Apple Silicon native port (v2)
Fresh-start port of DeepVariant (+ DeepTrio, DeepSomatic, pangenome-aware DV) to a single fully-native arm64 binary on Apple Silicon: Metal GPU / ANE inference, zero Python interpreter at runtime, one-command Homebrew install. Builds via CMake on macOS only; upstream Bazel/Python left untouched as reference.
Built clean at HEAD; ctest green (7/7).

Release gates
Pre-PR full all-mode re-regression — EVERY model_type vs Docker on public data (2026-06-21)
Apples-to-apples FILTER parity (native vs the matching Docker 1.10.0 image, same input BAM + model), chr20 fixtures. Bundles re-extracted via Docker; long-read / RNA data streamed from public GIAB + deepvariant/brain-genomics-public buckets.

All Illumina short-read modes: 0 FM (perfect FILTER parity, PASS+GT identical). All long-read / RNA modes within the documented < 5% tolerance (small-model dispatch + FP32 GPU-vs-CPU drift + homopolymer: the explicit non-goal class).
Two bugs found and fixed during this re-regression
- partition_size (cc1d35de): cli.cc hardcoded --partition_size=25000, so per-chunk reservoir sampling over-downsampled high-coverage windows and dropped low-coverage candidate clusters. Reverted to Docker's default 1000 → 254→309 shared, 0 FM, PASS-identical. (The earlier "322/322" was a harness artifact, now corrected in docs.)
- split_skip_reads (af59d3de): the flag was plumbed but the implementation (split spliced N-CIGAR reads into per-exon sub-reads) was never ported from upstream realigner.py:split_reads, so intron-spanning reads degraded the RNA pileup → model emitted homref → NoCall. Ported as SplitReadsOnSkip() (gated by --split_skip_reads, RNASEQ-only; other modes byte-identical) → 73→2 FM, PASS 41→72 = Docker.

Known follow-ups (not blockers)
fcmmdep). Workaround: Docker GLnexus under Rosetta.

🤖 Generated with Claude Code
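For the split_skip_reads fix described above, the core idea of splitting a spliced read at each N (reference skip) CIGAR operation can be sketched as follows. This is a generic illustration, not the upstream realigner.py:split_reads logic or the port's SplitReadsOnSkip(); the tuple-based read representation and the name `split_on_skip` are assumptions for the example.

```python
from typing import List, Tuple

Cigar = List[Tuple[str, int]]  # e.g. [("M", 4), ("N", 1000), ("M", 4)]

REF_CONSUMING = set("MDN=X")    # operations that advance the reference position
QUERY_CONSUMING = set("MIS=X")  # operations that advance the read sequence


def split_on_skip(start: int, seq: str, cigar: Cigar) -> List[Tuple[int, str, Cigar]]:
    """Split a read at every N operation into (ref_start, sub_seq, sub_cigar) parts."""
    parts: List[Tuple[int, str, Cigar]] = []
    ref_pos, query_pos = start, 0
    part_start, part_query, part_cigar = start, 0, []
    for op, length in cigar:
        if op == "N":
            if part_cigar:  # close the exon segment that ends at this intron
                parts.append((part_start, seq[part_query:query_pos], part_cigar))
            ref_pos += length  # jump over the intron on the reference
            part_start, part_query, part_cigar = ref_pos, query_pos, []
            continue
        part_cigar.append((op, length))
        if op in REF_CONSUMING:
            ref_pos += length
        if op in QUERY_CONSUMING:
            query_pos += length
    if part_cigar:
        parts.append((part_start, seq[part_query:query_pos], part_cigar))
    return parts


sub_reads = split_on_skip(100, "ACGTTTGA", [("M", 4), ("N", 1000), ("M", 4)])
```

Each sub-read then piles up only over its own exon, so the intron no longer appears as a long gap in the pileup.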