Apple Silicon native port (v2) — native arm64 DeepVariant/DeepTrio/DeepSomatic/pangenome#1085

Open
BenjaminDEMAILLE wants to merge 11 commits into google:r1.10 from IPNP-BIPN:apple-silicon-native-v2-pr

@BenjaminDEMAILLE

DeepVariant — Apple Silicon native port (v2)

Fresh-start port of DeepVariant (+ DeepTrio, DeepSomatic, pangenome-aware DV) to a single fully-native arm64 binary on Apple Silicon: Metal GPU / ANE inference, zero Python interpreter at runtime, one-command Homebrew install. Builds via CMake on macOS only; upstream Bazel/Python left untouched as reference.

Built clean at HEAD; ctest green (7/7).

Release gates

| Gate | Threshold | Status |
|---|---|---|
| SNP F1 vs Docker (HG002 WG) | ≥ Docker − 0.05 % | ✅ Δ = 0 (0.996440) |
| INDEL F1 vs Docker (HG002 WG) | ≥ Docker − 0.10 % | ✅ Δ = 0 (0.995766) |
| FILTER parity chr20:10M-10.1M | 0 FM | |
| FILTER parity full chr20 | ≤ 0.25 % FM | ✅ 0.027 % |
| GPU truly engaged | powermetrics > 0 | |

Pre-PR full all-mode re-regression — EVERY model_type vs Docker on public data (2026-06-21)

Apples-to-apples FILTER parity (native vs the matching Docker 1.10.0 image, same input BAM + model), chr20 fixtures. Bundles re-extracted via Docker; long-read / RNA data streamed from public GIAB + deepvariant / brain-genomics-public buckets.

| Tool | Mode | FM | Verdict |
|---|---|---|---|
| DeepVariant | WGS / WES | 0 / 0 | |
| DeepVariant | PACBIO / ONT / HYBRID | 3 / 14 / 4 | ✅ < 5 % LR tol |
| DeepVariant | MAS-seq (real HG004) | 11 (4.6 %) | ✅ LR tol |
| DeepVariant | RNASEQ (real HG005) | 2 | ✅ (after fix) |
| DeepTrio | WGS (×3) | 1 / 2 / 0 | (RefCall↔NoCall, PASS+GT identical) |
| DeepTrio | WES (×3) | 0 / 0 / 0 | |
| DeepTrio | PACBIO (×3, GIAB) | 3 / 4 / 3 | ✅ LR tol |
| DeepTrio | ONT (×3, R104) | 15 / 15 / 16 | ✅ LR tol |
| DeepSomatic | WGS-TN / WES-TN / FFPE_WGS-TN / FFPE_WES-TN | 0 / 0 / 0 / 0 | |
| DeepSomatic | WGS-TO | 0 | |
| DeepSomatic | PACBIO-TO / ONT-TO | 20 / 17 | ✅ LR tol |
| Pangenome | WGS | 0 | (after fix) |

All Illumina short-read modes: 0 FM (perfect FILTER parity, PASS+GT identical). All long-read / RNA modes within the documented < 5 % tolerance (small-model dispatch + FP32 GPU-vs-CPU drift + homopolymer — the explicit non-goal class).

Two bugs found and fixed during this re-regression

  • Pangenome partition_size (cc1d35de): cli.cc hardcoded --partition_size=25000, so per-chunk reservoir sampling over-downsampled high-coverage windows and dropped low-coverage candidate clusters. Reverted to Docker's default 1000 → 254→309 shared, 0 FM, PASS-identical. (The earlier "322/322" was a harness artifact, now corrected in docs.)
  • RNASEQ split_skip_reads (af59d3de): the flag was plumbed but the implementation (split spliced N-CIGAR reads into per-exon sub-reads) was never ported from upstream realigner.py:split_reads, so intron-spanning reads degraded the RNA pileup → model emitted homref → NoCall. Ported as SplitReadsOnSkip() (gated by --split_skip_reads, RNASEQ-only; other modes byte-identical) → 73→2 FM, PASS 41→72 = Docker.
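
To make the RNASEQ fix concrete, here is a minimal Python sketch of what splitting a spliced read at its N-CIGAR operations involves. It is neither the port's SplitReadsOnSkip() nor upstream's split_reads: the function name `split_on_skip`, the `(start, cigar, seq)` tuple shape, and the 0-based start coordinate are assumptions made purely for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_OPS = set("MDN=X")     # operations that advance the reference position
QUERY_OPS = set("MIS=X")   # operations that consume read bases


def split_on_skip(pos, cigar, seq):
    """Split one alignment at every N (reference skip) operation.

    Returns a list of (start, cigar, seq) tuples, one per exon segment.
    """
    parts = []
    seg_start, seg_ops, seg_seq = pos, [], []
    ref_pos, q = pos, 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "N":
            # Close the current exon segment and jump over the intron.
            if seg_ops:
                parts.append((seg_start, "".join(seg_ops), "".join(seg_seq)))
            ref_pos += n
            seg_start, seg_ops, seg_seq = ref_pos, [], []
            continue
        seg_ops.append(f"{n}{op}")
        if op in QUERY_OPS:
            seg_seq.append(seq[q:q + n])
            q += n
        if op in REF_OPS:
            ref_pos += n
    if seg_ops:
        parts.append((seg_start, "".join(seg_ops), "".join(seg_seq)))
    return parts


print(split_on_skip(100, "3M1000N4M", "ACGTTTT"))
```

Each returned segment covers one exon only, so a pileup built from the segments no longer treats the intron as part of the read. A real implementation would also have to carry base qualities, flags, and mate information, which this sketch omits.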

Known follow-ups (not blockers)

  • Virgin-machine matrix (Phase 7): needs clean M1/M2/M3/M4 hardware to validate the one-command Homebrew install.
  • Code signing + notarization: needs an Apple Developer account.
  • Native GLnexus packaging: blocked upstream (deleted fcmm dep). Workaround: Docker GLnexus under Rosetta.
  • Minor residuals (non-PASS, no variant-call impact): pangenome 1 RefCall (chr20:10029259); RNASEQ ~24 RefCall/NoCall + 1 homopolymer indel in RNA repeat regions.

🤖 Generated with Claude Code

@google-cla

google-cla Bot commented Jun 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@pichuan

pichuan commented Jun 24, 2026

Collaborator

Hi @BenjaminDEMAILLE,

Thanks for the PR!

Because of the way our project is set up, we aren't able to merge GitHub PRs directly.

Additionally, since this is a very large PR, reviewing it in its entirety might be difficult. However, I can look through it to see if there are specific components we can review, test, and adopt. If we end up incorporating any of your changes, I'll make sure to credit your GitHub username and reference this PR in the commit description.

Let me know if that works for you, or if you'd prefer that I close this PR.

-pichuan

pichuan self-assigned this on Jun 24, 2026

@BenjaminDEMAILLE

Author

Hi @pichuan, thanks a lot for the thoughtful reply; that works perfectly for me. Please keep the PR open as a reference; I'm happy for you to lift whatever is useful, and crediting the GitHub username + PR reference is more than enough.

I'll sign the CLA so that's not a blocker.

To make cherry-picking easier, here is the PR sliced into independent, separately-reviewable units. Most of it is Apple/Metal-specific by nature (CMake build, MPSGraph/BNNS inference, Homebrew packaging) and probably not relevant to the upstream Bazel/Linux tree. But a few slices carry platform-neutral value:

Likely useful to upstream (platform-neutral):

  1. Cross-platform non-determinism findings. While chasing bit-for-bit Docker parity I hit several spots where DeepVariant output depends on the C++ standard-library or NumPy implementation, not just the seed:

    • std::shuffle differs between libc++ and libstdc++ for the same mt19937_64 state (read subsampling in pileup_image).
    • np.random.RandomState.randint uses bitmask-rejection, not Lemire, in the reservoir sampling path.
    • PL is computed in log-space with truncation in the writer, which diverges by 1 unit from a PHRED-space rounding at boundaries.

    These are documented with exact reproductions in my PORT_LOG; they may be worth a known-reproducibility note even on Linux.

  2. Two faithful bug repros found during a full all-mode re-regression (these are about my port, but the diagnosis may flag real upstream edge cases): pangenome partition_size vs per-partition reservoir downsampling interaction, and RNASEQ N-CIGAR split_skip_reads.
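
The PL boundary effect from the first item can be reproduced in a few lines. This is a sketch under assumptions, not the code in vcf_conversion.cc: the helper name `phred_pls` and the choice to normalize against the best genotype before quantizing are illustrative only.

```python
import math


def phred_pls(probs, quantize):
    """Normalized PLs from genotype probabilities.

    `quantize` turns the scaled value into an integer (for example int or round).
    """
    log10_p = [math.log10(p) for p in probs]
    best = max(log10_p)
    return [quantize(-10.0 * (lp - best)) for lp in log10_p]


probs = [0.90, 0.07, 0.03]
truncated = phred_pls(probs, int)    # drop the fractional part
rounded = phred_pls(probs, round)    # round to nearest
print(truncated, rounded)
```

With these inputs the third genotype lands at about 14.77, so truncation reports 14 while rounding reports 15: a one-unit PL difference from identical probabilities.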

Apple-specific (probably reference-only for you): the CMake/macOS build detangling, Metal MPSGraph + BNNS inference path, C++ reimplementations of the Python postprocess steps (haplotype conflict resolution, gVCF banding, simplify_variant_alleles), and the Homebrew distribution tooling.

Happy to open smaller, focused PRs for any specific slice you'd like to review in isolation, or to answer questions on any of the above. Whatever is least work for your team.

@BenjaminDEMAILLE

BenjaminDEMAILLE commented Jun 24, 2026

Author

Detailed engineering report: how the Apple Silicon native port was built

Following up on the discussion above, here is a full walkthrough of the porting methodology and the development history, for anyone on the team who wants to cherry-pick specific pieces. The PR branch itself was squashed to a single commit to fit the CLA scanner's 250-commit limit, but every milestone commit referenced below is still intact and browsable in my fork. I cite them inline as IPNP-BIPN/deepvariant@<sha> so you can jump straight to the relevant change.

Note on authorship: the bulk of the work is mine; a set of correctness, build, and feature commits was contributed by @nh13 through fork PRs and is attributed explicitly in the dedicated section near the end.

Methodology and constraints

The port was built under a few self-imposed hard constraints, which shaped every decision:

  • Zero Python at runtime. The shipped artifact is a single native arm64 binary. The only Python that remains is upstream's own tools/*.py, left untouched as reference. Model conversion runs inside the google/deepvariant:1.10.0 Docker image (where TF is available) and emits a packed weight file; nothing in the runtime path imports TF or Python.
  • CMake-on-macOS only. Upstream Bazel/BUILD rules and Python are never modified. The native build wraps the reused C++ (make_examples, pileup_image, allelecounter, realigner, direct_phasing, postprocess, nucleus IO) and replaces the inference + orchestration layers.
  • Docker-parity as the gate, not a goal. The acceptance bar throughout was bit-for-bit FILTER-class parity against the matching Docker 1.10.0 image on chr20 fixtures (bcftools isec shared sites, 0 FILTER mismatches, identical PASS set + GT). Matching PL/QUAL byte-for-byte is the one explicit non-goal: FP32 addition is non-associative, so a parallel reduction on the Apple GPU cannot be expected to reproduce the summation order of Docker's CPU path.
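
The FP32 non-associativity behind the PL/QUAL non-goal is easy to demonstrate. The sketch below emulates single-precision addition in plain Python via `struct`; the helper names `f32` and `add32` are made up for the example, and nothing here touches Metal or the port's code.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add32(a, b):
    """Single-precision addition emulated with Python doubles."""
    return f32(f32(a) + f32(b))


a, b, c = 1.0e8, -1.0e8, 1.0
left = add32(add32(a, b), c)    # (a + b) + c
right = add32(a, add32(b, c))   # a + (b + c)
print(left, right)
```

The same three operands give 1.0 in one grouping and 0.0 in the other, because single precision cannot represent -99999999 and rounds it back to -1e8. A reduction that sums in a different order can therefore legitimately differ in the last bits.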

Phase 0 to 3: foundation and first native VCF

The foundation (build detangling, weight conversion scaffolding, Core ML PoC, release + Homebrew scaffolding) landed as IPNP-BIPN/deepvariant@b8f494f4. The first end-to-end native VCF reached 91.8% exact-match in IPNP-BIPN/deepvariant@39f510ca, then the native realigner port recovered the upstream-matching candidate set (IPNP-BIPN/deepvariant@f12576ea). Batched Core ML inference gave the first GPU win at 1.8x (IPNP-BIPN/deepvariant@9721b177), and all 23 model variants were converted to .mlpackage (IPNP-BIPN/deepvariant@344bfa91). The postprocess path was brought to 99.93% bit-parity on identical CVOs in IPNP-BIPN/deepvariant@0bf0d34b.

Candidate-set parity

Two upstream defaults turned out to be load-bearing for matching Docker's candidate set: partitioning the calling region into 1000 bp chunks (IPNP-BIPN/deepvariant@e8cbf9c8, confirmed in IPNP-BIPN/deepvariant@b002b273) and --min_mapping_quality defaulting to 5 (IPNP-BIPN/deepvariant@423f957f). Together these reached 100% candidate-set parity (IPNP-BIPN/deepvariant@60f4d2a6).

Phase 4: GIAB F1 gate

Weight packing moved to a custom .dvw format with an mmap C++ loader (IPNP-BIPN/deepvariant@dbc2e5ca, IPNP-BIPN/deepvariant@685f34ba), and the GIAB hap.py F1 gate on HG002 chr20 passed against Docker (IPNP-BIPN/deepvariant@36b4a884, IPNP-BIPN/deepvariant@54abf889).

Phase 5.5: Metal inference and the determinism campaign

The Inception-v3 backbone was rebuilt on MPSGraph (IPNP-BIPN/deepvariant@94990ab5) with a deterministic CPU dense + softmax tail in BNNS (IPNP-BIPN/deepvariant@7ac9c618). Two structural bugs initially masqueraded as "channel permutation": a stale weight file, and incorrect (conv, bn) layer pairings (Keras InceptionV3 does not enumerate layers in strict conv/bn order). Both were fixed by byte-matching each frozen-graph kernel against the bundle (IPNP-BIPN/deepvariant@15bc0480, IPNP-BIPN/deepvariant@133c66dc, IPNP-BIPN/deepvariant@12be1ca5), bringing all 19 layer taps within 1 ULP of the TF reference. Two key lessons: MPSGraph convolution2DWithSourceTensor with NHWC + HWIO is bit-exact, and Keras BatchNormalization defaults to epsilon 1e-3, not 1e-4.
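
To show why the BatchNormalization epsilon matters, here is a small scalar sketch of inference-time batch norm. The function `batch_norm` and the sample values are illustrative assumptions; only the Keras default of 1e-3 comes from the text above.

```python
import math


def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-3):
    """Inference-time batch normalization for a single scalar activation."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


x, mean, var = 0.5, 0.1, 0.01
y_keras = batch_norm(x, mean, var, eps=1e-3)   # Keras default epsilon
y_wrong = batch_norm(x, mean, var, eps=1e-4)   # mistaken epsilon
relative_gap = abs(y_keras - y_wrong) / abs(y_keras)
print(y_keras, y_wrong, relative_gap)
```

For a channel with small variance the two epsilons differ by several percent, which is far outside a 1 ULP tolerance and would compound across layers.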

Parallel sharding via posix_spawn and then true intra-process threading brought chr20 make_examples to a few minutes (IPNP-BIPN/deepvariant@752f5b69, IPNP-BIPN/deepvariant@518bf1e5). The first full-chr20 measurement showed 1.13% FILTER drift vs Docker (IPNP-BIPN/deepvariant@7929a880), which kicked off the determinism campaign. This is the part most likely to be of general interest, because the root causes are cross-platform reproducibility issues, not Apple-specific:

  • std::shuffle is implementation-defined (IPNP-BIPN/deepvariant@a6f5b0ea). libc++ and libstdc++ produce different sequences for the same mt19937_64 state, so the pileup read-subsampling shuffle picked different reads than Docker. Fixed by porting libstdc++'s exact paired Fisher-Yates + Lemire 128-bit uniform into the native path. FILTER drift: 1.13% to 0.54%.
  • Pruned-allele CVOs leaking into the multi-allelic likelihood product (IPNP-BIPN/deepvariant@2808d053). Upstream's merge_predictions skips CVOs for pruned alleles; the port did not. FILTER drift: 0.54% to 0.33%.
  • NumPy RandomState.randint uses bitmask-rejection, not Lemire (IPNP-BIPN/deepvariant@21f6b083). The per-partition Algorithm-R reservoir sampling (max_reads_per_partition) has to match this exact code path, including per-1000 bp granularity. This was the single biggest mover: FILTER drift 0.33% to 0.01%.
  • Haplotype conflict resolution (IPNP-BIPN/deepvariant@43224487). Ported maybe_resolve_conflicting_variants (joint-likelihood maximization under the ploidy-2 constraint) which collapses a SNP overlapping a 1/2 indel to homref. 27 to 2 PASS-flips.
  • simplify_variant_alleles (IPNP-BIPN/deepvariant@92cce813). Stripping the longest common postfix prevents a tandem-repeat substitution from falsely overlapping a neighbouring SNP. 2 to 0 PASS-flips.
  • Small-model determinism: replaced Core ML small-model inference with a scalar FP32 sequential MLP in BNNS-CPU, plus per-alt-set dispatch matching upstream's biallelic + combinations enumeration (IPNP-BIPN/deepvariant@d87b21f6, IPNP-BIPN/deepvariant@68c1f5e4).
  • AltAlleleQual and PL formatting (IPNP-BIPN/deepvariant@30991c42, IPNP-BIPN/deepvariant@dbacdcac). phred(1 - sum_alt) rounded to 7 decimals, and PL computed in log-space with truncation (matching vcf_conversion.cc, which truncates rather than rounds). The PL fix alone took record-level byte-identity from 88.3% to 97.2%.
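
The difference between the two bounded-integer strategies named in the list above can be sketched as follows. This is a simplified illustration on 32-bit words, not NumPy's or libstdc++'s actual code; the function names and the fixed word stream are assumptions chosen so the result is reproducible.

```python
MASK32 = 0xFFFFFFFF


def masked_rejection(words, n):
    """Uniform draw in [0, n): mask to the next power of two, retry if >= n."""
    mask = (1 << (n - 1).bit_length()) - 1
    for w in words:
        v = w & mask
        if v < n:
            return v
    raise RuntimeError("ran out of random words")


def lemire(words, n):
    """Uniform draw in [0, n) via Lemire's multiply-shift with rejection."""
    threshold = (2**32 - n) % n
    for w in words:
        m = (w & MASK32) * n
        if (m & MASK32) >= threshold:
            return m >> 32      # high 32 bits of the 64-bit product
    raise RuntimeError("ran out of random words")


stream = [0xC0000003, 0x12345678]
print(masked_rejection(iter(stream), 10), lemire(iter(stream), 10))
```

Both methods are unbiased, yet the same raw word maps to 3 under masking and 7 under multiply-shift. A port that feeds an identical generator state through the other mapping will therefore pick different reads, which is why the exact code path has to be matched and not just the seed.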

End state on chr20: 100% site-set parity, 0 FILTER mismatches, all PASS variants identical to Docker, with the residual byte differences confined to QUAL/PL FP32 drift.
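
As an illustration of the common-postfix stripping mentioned for simplify_variant_alleles, the sketch below trims the longest shared suffix while keeping at least one base per allele. It is written from the one-line description above, not from upstream's source; the function name `simplify_alleles` and its signature are assumptions.

```python
def simplify_alleles(ref, alts):
    """Strip the longest suffix shared by REF and every ALT.

    At least one base is always kept in each allele.
    """
    alleles = [ref, *alts]
    shortest = min(len(a) for a in alleles)
    n = 0
    # Extend the shared suffix while every allele agrees on the next base.
    while n < shortest - 1 and len({a[-1 - n] for a in alleles}) == 1:
        n += 1
    if n == 0:
        return ref, list(alts)
    return ref[:-n], [a[:-n] for a in alts]


print(simplify_alleles("CAGAG", ["CAG"]))
print(simplify_alleles("AT", ["GT"]))
```

After trimming, the first variant spans three reference bases instead of five, so it no longer reaches into the position of a SNP two bases downstream.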

Phase 6: DeepTrio, DeepSomatic, pangenome-aware DV

Each tool was held to the same chr20 FILTER-parity gate.

  • DeepTrio (IPNP-BIPN/deepvariant@51e60aae, IPNP-BIPN/deepvariant@37b24bf0, IPNP-BIPN/deepvariant@d5647f45). Two trio-specific root causes: per-sample candidate_positions rather than a union (IPNP-BIPN/deepvariant@6c40ab79), and parameterizing the Metal Inception input height/channels, since the trio pileup is 140 rows, not 100 (IPNP-BIPN/deepvariant@807e8dc8). 0 FILTER mismatches on all three samples.
  • DeepSomatic (IPNP-BIPN/deepvariant@2fe4503b). The GERMLINE reclassification filter (IPNP-BIPN/deepvariant@892ba919), somatic threshold overrides (IPNP-BIPN/deepvariant@a5799cf9), and the final sort_by_alt_allele_support pic-level option (IPNP-BIPN/deepvariant@32c5f6ea) closed it to 100% across 693 sites, T+N and tumor-only.
  • Pangenome-aware DV (IPNP-BIPN/deepvariant@706fcaf2). Skipping the realigner for the pangenome sample, per-mode aln_* params, and dbg_disable_graph_pruning (PruneLite preserves low-weight alt-haplotype paths) brought it to parity (IPNP-BIPN/deepvariant@6ed1680d, IPNP-BIPN/deepvariant@8e387531, IPNP-BIPN/deepvariant@3d686750). See the correction note below on partition_size.

Phase 8: deterministic-conv research and opt-in toggles

A full-network deterministic Metal conv path (per-thread sequential FMA, no SIMD-group reduction, guaranteed cross-chip determinism) was built and validated bit-exact per block, including a Kahan-compensated variant (IPNP-BIPN/deepvariant@980b7ce9, IPNP-BIPN/deepvariant@29d8307e, IPNP-BIPN/deepvariant@6293cd52). It is roughly 3x slower and does not change F1, so it ships off by default behind DV_METAL_SERIAL_FULL (IPNP-BIPN/deepvariant@05b91adc). Literature-driven F1 toggles (temperature scaling, multi-seed TTA) landed opt-in (IPNP-BIPN/deepvariant@63b996b9, IPNP-BIPN/deepvariant@95b9ff5f). Native GLnexus packaging is blocked upstream by the deleted fcmm dependency (IPNP-BIPN/deepvariant@d2f16f4b).
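
For readers unfamiliar with Kahan compensation, here is a minimal CPU sketch of the idea in double precision. It is not the Metal kernel from the commits above; it only shows how carrying the rounding error forward recovers terms that a naive sequential sum loses.

```python
import math


def naive_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carry the rounding error of each add forward."""
    total = 0.0
    comp = 0.0                      # running compensation for lost low bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y      # what the addition just rounded away
        total = t
    return total


values = [1.0] + [1e-16] * 10
print(naive_sum(values), kahan_sum(values), math.fsum(values))
```

The naive loop returns exactly 1.0 because each 1e-16 term is below half a unit in the last place, while the compensated loop lands within a couple of ULPs of the correctly rounded `math.fsum` result. Both loops are sequential, so each is deterministic on its own; compensation changes accuracy, not reproducibility.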

Phase 9: feature completion

This phase added an alt-aligned pileup (IPNP-BIPN/deepvariant@5e8d1578), a methylation calling channel (IPNP-BIPN/deepvariant@b2b16641), DirectPhasing per-region orchestration + PS info field (IPNP-BIPN/deepvariant@db22e6cb, IPNP-BIPN/deepvariant@d88dbe59, IPNP-BIPN/deepvariant@999f1522), and a full native gVCF block-emission implementation with _quantize_gq banding and Docker-matching FORMAT order (IPNP-BIPN/deepvariant@0b43f3d3, IPNP-BIPN/deepvariant@f4100ce1). The reference-block rows are byte-identical to Docker's gVCF.

Whole-genome validation and the late parity fixes

Scaling to whole-genome surfaced a few more real bugs, several of which are arguably relevant beyond this port:

  • TFRecord truncation: F_NOCACHE could silently truncate partial writes, and the reader needed to tolerate a truncated last record per shard (IPNP-BIPN/deepvariant@d9994d2e, IPNP-BIPN/deepvariant@9261e17d).
  • Canonical-contig filter to match Docker's default region handling (IPNP-BIPN/deepvariant@80d97a59).
  • Removing a pre-reservoir sort: Docker samples in BAM-natural order (IPNP-BIPN/deepvariant@aab576f1), which took WG FILTER parity from 99.91% to 99.9993% (IPNP-BIPN/deepvariant@6b4245ca).
  • normalize_reads propagation onto the realigner (IPNP-BIPN/deepvariant@dd31b9e7). This was the Path D fix: the FastPassAligner was not getting set_normalize_reads(true), which clustered FILTER drift at the pericentromere. It cut full-chr20 FM by 87% to 0.027% (IPNP-BIPN/deepvariant@4e3687c8), an order of magnitude under the ship gate, with F1 bit-identical to Docker.
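
The "tolerate a truncated last record" behaviour from the TFRecord bullet can be sketched like this. The framing (uint64 length, uint32 length CRC, payload, uint32 payload CRC) is the standard TFRecord layout, but the reader below is an assumption-laden illustration: it skips CRC verification entirely, and `read_tfrecords` and `frame` are made-up names, not functions from the port.

```python
import io
import struct


def read_tfrecords(stream):
    """Collect record payloads; stop cleanly if the final record is cut short.

    CRC fields are skipped, not verified (simplification for this sketch).
    Returns (records, truncated).
    """
    records = []
    while True:
        header = stream.read(12)            # uint64 length + uint32 length CRC
        if not header:
            return records, False           # clean EOF on a record boundary
        if len(header) < 12:
            return records, True            # truncated inside the header
        (length,) = struct.unpack("<Q", header[:8])
        body = stream.read(length + 4)      # payload + uint32 payload CRC
        if len(body) < length + 4:
            return records, True            # truncated inside the payload
        records.append(body[:length])


def frame(payload):
    """Frame a payload TFRecord-style with placeholder (zero) CRC fields."""
    return struct.pack("<QI", len(payload), 0) + payload + struct.pack("<I", 0)


data = frame(b"first") + frame(b"second")
print(read_tfrecords(io.BytesIO(data)))
print(read_tfrecords(io.BytesIO(data[:-3])))
```

Returning the `truncated` flag alongside the records lets the caller decide whether a short tail is acceptable or should be surfaced as an error, which is the distinction the CRC32C verification commit listed later in this report draws.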

The CoreML-vs-Metal backend matrix (IPNP-BIPN/deepvariant@7e7f963c, IPNP-BIPN/deepvariant@c64715a2) settled on Metal MPSGraph FP32 as the shipped default. Whole-genome HG002 F1 matched Docker exactly (IPNP-BIPN/deepvariant@08b42a8d).

Real-data long-read validation

First real PacBio + ONT runs (GIAB FTP BAMs streamed by byte-range) were initially ~5% below Docker on SNP F1. Root cause: an empty --small_model_path silently disabled small-model dispatch (IPNP-BIPN/deepvariant@1e524ebc, IPNP-BIPN/deepvariant@6a3cc749). Fixed with a warning when the bundle declares a small model but the flag is empty, plus auto-discovery of the conventional sibling weights dir (IPNP-BIPN/deepvariant@63e4428a, IPNP-BIPN/deepvariant@769d72cd).
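
One way to express the "warn, then auto-discover" behaviour is sketched below. Everything in it is hypothetical: the function `resolve_small_model_path`, its parameters, and in particular the sibling directory name `small_model` are placeholders for illustration, and the order of checks is an assumption, not the port's actual logic.

```python
import os
import tempfile
import warnings


def resolve_small_model_path(flag_value, bundle_dir, declares_small_model,
                             sibling_name="small_model"):
    """Return the small-model weights path, or None if dispatch stays off.

    `sibling_name` is a hypothetical directory name used only for this sketch.
    """
    if flag_value:
        return flag_value                       # explicit flag always wins
    if not declares_small_model:
        return None                             # nothing to dispatch to
    candidate = os.path.join(bundle_dir, sibling_name)
    if os.path.isdir(candidate):
        return candidate                        # auto-discovered sibling dir
    warnings.warn("bundle declares a small model but --small_model_path is "
                  "empty; small-model dispatch is disabled")
    return None


with tempfile.TemporaryDirectory() as bundle:
    missing = resolve_small_model_path("", bundle, True)      # warns
    os.mkdir(os.path.join(bundle, "small_model"))
    found = resolve_small_model_path("", bundle, True)        # discovered
    print(missing, found == os.path.join(bundle, "small_model"))
```

The point of the design is that an empty flag no longer fails silently: either the weights are found next to the bundle, or the user is told that dispatch is off.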

Final all-mode re-regression (pre-PR), and two bugs it caught

A full every-model-type re-regression against independently generated Docker references (IPNP-BIPN/deepvariant@d92874b2) caught two genuine bugs:

  • Pangenome partition_size (IPNP-BIPN/deepvariant@53887cf2). The earlier "322/322" pangenome result was a harness artifact: cli.cc had hardcoded --partition_size=25000, so the per-partition reservoir cap over-downsampled high-coverage windows and dropped low-coverage candidate clusters. Reverted to Docker's default 1000. (Upstream pangenome does not pass --partition_size at all.) This is a good cautionary example: raising partition_size silently changes the reservoir-sampling rate.
  • RNASEQ split_skip_reads (IPNP-BIPN/deepvariant@ad197f3c). The flag was plumbed but the implementation (splitting N-CIGAR spliced reads into per-exon sub-reads) was never ported, so intron-spanning reads degraded the RNA pileup. Ported as SplitReadsOnSkip(), RNASEQ-only.
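
The partition_size interaction is mostly arithmetic, and a toy simulation makes it visible. The sketch below applies a textbook Algorithm R reservoir per partition; the cap of 1500 reads, the evenly spaced read starts, and the helper names are assumptions for illustration, and Python's `random` stands in for the NumPy generator the real pipeline has to match.

```python
import random


def reservoir_sample(items, k, rng):
    """Algorithm R: keep a uniform random sample of at most k items."""
    kept = []
    for i, item in enumerate(items):
        if i < k:
            kept.append(item)
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                kept[j] = item
    return kept


def reads_kept(read_starts, region_len, partition_size, cap, seed=0):
    """Count reads surviving a per-partition reservoir cap."""
    rng = random.Random(seed)
    total = 0
    for start in range(0, region_len, partition_size):
        in_part = [p for p in read_starts if start <= p < start + partition_size]
        total += len(reservoir_sample(in_part, cap, rng))
    return total


# 25 kb region, one read starting every 5 bp (5000 reads); the cap is an assumed value.
starts = list(range(0, 25_000, 5))
small = reads_kept(starts, 25_000, 1_000, cap=1_500)     # 25 partitions
large = reads_kept(starts, 25_000, 25_000, cap=1_500)    # one partition
print(small, large)
```

With 1000 bp partitions no partition reaches the cap and all 5000 reads survive; with a single 25000 bp partition the same cap keeps only 1500, a 70% reduction from changing one flag. Because the sample is uniform over the whole partition, sparse clusters inside it are thinned at the same rate as dense ones.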

Result: all Illumina short-read modes at 0 FM (germline WGS/WES, trio WGS/WES, somatic WGS/WES/FFPE T+N and tumor-only, pangenome WGS), and all long-read/RNA modes within the documented < 5% FP32-drift tolerance.

Performance

Performance work comprised NEON kernels for normalization and base-color/M-block classification (IPNP-BIPN/deepvariant@60e7bd09, IPNP-BIPN/deepvariant@7cfbceea, IPNP-BIPN/deepvariant@2a14083b), pipeline parallelism with an async writer + pre-fetch reader (IPNP-BIPN/deepvariant@667cca60), and an opt-in ane_speculate backend (FP16 ANE speculation with GPU FP32 rerun on borderline softmax) (IPNP-BIPN/deepvariant@eee48e54). Whole-genome wall-time is ~1.84x vs Docker-under-Rosetta (IPNP-BIPN/deepvariant@bb9f5bd3); a native-Linux-x86 comparison is still TBD.

Contributions by @nh13

@nh13 reviewed the port and contributed the following, merged via fork PRs. These are well-isolated and may be among the easier pieces to evaluate:

  • Correctness + build hardening: GL normalization and gVCF/multiallelic fidelity (IPNP-BIPN/deepvariant@4f4356b2), TFRecord CRC32C verification surfacing truncation instead of silent EOF (IPNP-BIPN/deepvariant@43b8b4cd), postprocess writer flush/close status + total site ordering (IPNP-BIPN/deepvariant@329f45bb), Metal command-buffer status check before consuming inference output (IPNP-BIPN/deepvariant@db043a20), DT_BFLOAT16 weight decoding (IPNP-BIPN/deepvariant@c8659ba9), CMake conv link-cycle break (IPNP-BIPN/deepvariant@564dda13), and Core ML MLMultiArray stride handling (IPNP-BIPN/deepvariant@f2a31cf6).
  • Feature parity: select_variant_types candidate filtering (IPNP-BIPN/deepvariant@be845cbb), sex-chromosome haploid calling with haploid_contigs/par_regions (IPNP-BIPN/deepvariant@d45752c2), gzipped PAR BED support (IPNP-BIPN/deepvariant@ed834dd1), methylation_aware_phasing compiled + unit-tested against the native build (IPNP-BIPN/deepvariant@6e038003), and phasing the full candidate set in the trio DirectPhasing path (IPNP-BIPN/deepvariant@c8d3924f).

Further work from @nh13 (single-sample multi-sample VariantCaller alignment + --parse_sam_aux_fields fix, native methylation-aware phasing wiring, FP16, and pipelining) is in flight as additional fork PRs.

A note on the squash

Because this branch is the head of this PR, and the CLA bot cannot auto-scan beyond 250 commits, I squashed the 273-commit history into one content-identical commit (same tree hash) so the scan can run over the whole thing. The full per-commit development history remains intact and linkable in my fork, with each contributor's original authorship preserved on their own commits, as cited throughout this report. Happy to provide any of these as small standalone PRs on request.

…pSomatic/pangenome

Fresh-start port of DeepVariant (+ DeepTrio, DeepSomatic, pangenome-aware DV)
to a single fully-native arm64 binary on Apple Silicon: Metal GPU / ANE
inference, zero Python interpreter at runtime, one-command Homebrew install.
Builds via CMake on macOS only; upstream Bazel/Python left untouched as
reference.

History squashed from 273 commits into a single change to keep the PR within
the CLA scanner's commit limit. Full development history and per-decision
rationale are preserved in PORT_LOG.md.

Release gates (HG002 whole-genome, chr20 fixtures, vs Docker 1.10.0):
SNP/INDEL F1 delta = 0; FILTER parity 0 FM on chr20:10M-10.1M, 0.027% on
full chr20; GPU engagement verified via powermetrics. All Illumina
short-read modes at 0 FM; long-read/RNA modes within the documented < 5%
FP32-drift tolerance.

@BenjaminDEMAILLE force-pushed the apple-silicon-native-v2-pr branch from 2724a87 to a21e4d4 on June 24, 2026 08:12

nh13 and others added 10 commits June 24, 2026 11:16

Gate MPSGraph precision on DV_METAL_FP16: when set, use optimizationLevel1
+ reducedPrecisionFastMath=Default (FP16 Winograd / operand conversion)
instead of the forced Level0/FP32 path. Default (unset) is unchanged and
remains contractual full FP32.

Validated on chr20 (HG002, 4731 examples): metadata and record order are
identical to FP32, and fp16-vs-fp32 genotype-call disagreement is 60/4731
(1.27%), exactly the FP32 backend's own run-to-run nondeterminism
baseline, i.e. reduced precision adds no divergence beyond what the
default metal backend already exhibits. A true ANE path is not added: the
7-channel Inception-v3 input is rejected by the ANE, so MPSGraph stays on
the GPU. Any default flip should still be gated behind GIAB concordance.

Reword the nonstandard 'FP19/TF32' to 'TF32 (19-bit)' — TF32 is the 19-bit
operand format; it is not FP16 (already named as the Winograd path).

- parity_check.autoload silently swallowed any decode error and switched
  formats via a bare except; fall back to the minimal decoder only on a
  raised error or zero rows, log which decoder is used, and print a
  shape-mismatch reason instead of indexing absent numeric keys.
- inception_v3_mil adds build-time shape asserts on the stem conv and the
  final linear so an HWIO->OIHW perm or dropped .T fails at convert time.
- bench._decode_feature asserts the decoded element count matches (h,w,c).
- test_savedmodel_reader asserted weights() raises NotImplementedError, but
  it is implemented now; assert it raises on a malformed bundle instead.

architecture.md's Phase-0 ADR named Core ML/MIL as the inference backend, but
the shipped default is Metal MPSGraph FP32 (call_variants_main.cc
--inference_backend=metal; coreml debug-only; ANE not engaged), matching
validation.md and scientific_report.md. Mark it SUPERSEDED and correct the
Decision, preserving the historical Phase-0 numbers. Fix the conversion README
(nonexistent requirements-metal.txt; coremltools>=9.0 TF-free) and the vendored
proto count (25->26, matching SOURCES.md).

Determinism-neutral CPU wins on the make_examples hot loop (output unchanged):
- std::move the read vector into working_reads on the no-realigner default
  path instead of deep-copying every Read proto (repointing the one later use).
- Hoist loop-invariant absl::GetFlag reads and the RealignerOptions proto to
  const locals computed once before the region loop.
- Emit a periodic 'processed X/Y regions' progress line.
- Attach a 1 MiB streambuf to the reader before open (the writer already
  coalesces into 1 MiB), cutting per-record read() syscalls. Pure buffering;
  format and decoded bytes unchanged.
- Add a WriteRecord(std::string_view) overload and pass the example payload
  by view from ExampleWriter::Add instead of materializing a std::string;
  on-disk framing is byte-identical.
- Add -mcpu=apple-m1 (ISA/scheduling only; fast-math stays off so the
  deterministic conv/bnns reductions remain bit-identical, guarded by the
  determinism microtests).
- Delete neon_base_color.h + its microtest: built-but-unused in production and
  semantically divergent from the live FillBaseColorBatch (a latent trap).
- Refresh the stale 'NOT YET WIRED' banner on neon_cigar_classify.h, which is
  in fact wired into allelecounter.cc.

Add --enable_inference_pipelining (default off) that overlaps the GPU
MPSGraph backbone of batch N+1 with the CPU BNNS finalize + CVO build of
batch N, for the two-stage Metal path only. A single finalize worker
drains a FIFO queue and is the sole CVO producer (shared emit_cvos
lambda), with a 2-slot double buffer (features/probs) and an
at-most-one-in-flight invariant, so output order is identical to the
serial path. Validated on chr20 (HG002, 4731 examples): record set,
metadata, and order are byte-identical to serial, and the genotype-prob
differences are exactly the default metal backend's own run-to-run
nondeterminism envelope (serial-vs-serial == serial-vs-pipelined).
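
The ordering argument in the pipelining commit message above can be sketched with a bounded queue and a single consumer thread. This is a toy model under assumptions, not the Metal implementation: `backbone` and `finalize` are stand-ins for the GPU and CPU stages, and `queue.Queue(maxsize=1)` only approximates the 2-slot double buffer described there.

```python
import queue
import threading


def run_serial(batches, backbone, finalize):
    out = []
    for b in batches:
        out.extend(finalize(backbone(b)))
    return out


def run_pipelined(batches, backbone, finalize):
    """Overlap backbone(batch N+1) with finalize(batch N), preserving order."""
    handoff = queue.Queue(maxsize=1)    # bounds work in flight (back-pressure)
    out = []

    def finalize_worker():
        while True:
            features = handoff.get()
            if features is None:        # sentinel: no more batches
                return
            out.extend(finalize(features))   # sole producer of output records

    worker = threading.Thread(target=finalize_worker)
    worker.start()
    for b in batches:
        handoff.put(backbone(b))        # blocks while the slot is occupied
    handoff.put(None)
    worker.join()
    return out


batches = [[1, 2], [3, 4], [5]]
backbone = lambda batch: [x * 10 for x in batch]      # stand-in for the GPU stage
finalize = lambda feats: [f"cvo-{x}" for x in feats]  # stand-in for the CPU stage
pipelined = run_pipelined(batches, backbone, finalize)
print(pipelined == run_serial(batches, backbone, finalize))
```

Because one worker drains a FIFO queue and is the only writer to `out`, the record order cannot differ from the serial loop regardless of how the two stages interleave in time.
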
perf: determinism-neutral speedups + opt-in inference pipelining (and docs/tests)
perf(metal): opt-in DV_METAL_FP16 reduced-precision inference