perf: use jemalloc for make_examples in the runtime image #1087
make_examples spends a large share of its time in allocation-heavy pileup and local-realignment work. Preloading jemalloc in place of glibc malloc measurably reduces its wall-clock time; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. Installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.
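To illustrate what scoping the preload to one wrapper means, here is a minimal Python sketch. It is not the Dockerfile change itself: `with_preload` and `run_with_jemalloc` are hypothetical helpers, and the sketch assumes a Linux host where the dynamic loader honours `LD_PRELOAD`. Only the launched child sees the variable; the parent's environment is left alone.

```python
import os
import subprocess

# Bare soname, as named in the commit message (no hardcoded directory).
JEMALLOC_SONAME = "libjemalloc.so.2"


def with_preload(env, soname=JEMALLOC_SONAME):
    """Return a copy of env with soname placed first in LD_PRELOAD."""
    child_env = dict(env)  # copy, so the caller's mapping is never mutated
    existing = child_env.get("LD_PRELOAD", "")
    # The loader accepts a colon-separated list; keep entries already present.
    child_env["LD_PRELOAD"] = f"{soname}:{existing}" if existing else soname
    return child_env


def run_with_jemalloc(argv, env=None):
    """Hypothetical launcher: only this child process gets the preload."""
    base = os.environ if env is None else env
    return subprocess.run(argv, env=with_preload(base), check=False)
```

A launcher for another stage would call `subprocess.run` without `with_preload`, which mirrors leaving the call_variants wrapper untouched.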
… pangenome images
The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload. This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.
Hi @tfenne, thanks for the PR! Since I believe you're already familiar with our process, I'll go ahead and start the review. As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description. Please let me know if you have any concerns with this approach. -pichuan
Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.
I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a baseline runtime: 113.1 |
What
Preload jemalloc as the allocator for the make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why
make_examples is a CPU-bound stage, and a large share of its work (pileup construction and local realignment) is allocation-heavy: many short-lived objects across many worker shards. jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact
~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).

Verified the wrapper actually engages it: the make_examples Python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.

Changes (1 file)
Dockerfile: libjemalloc2 package in the runtime image; LD_PRELOAD=libjemalloc.so.2 on the make_examples, multisample_make_examples, and make_examples_somatic wrappers.

The bare soname (libjemalloc.so.2, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.

Correctness
This is an allocator swap only: no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.

Notes
I also tested mimalloc and rpmalloc; the former was slower than glibc malloc, while rpmalloc was faster but not as fast as jemalloc.
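The engagement check described under Impact (the process maps libjemalloc.so.2 and carries the LD_PRELOAD variable) could be reproduced along the following lines. This is a sketch under assumptions, not the procedure the author used: it relies on the Linux `/proc/<pid>/maps` and `/proc/<pid>/environ` files, and the function names are hypothetical.

```python
from pathlib import Path

JEMALLOC_SONAME = "libjemalloc.so.2"


def maps_has_library(maps_text, soname=JEMALLOC_SONAME):
    """True if any file-backed mapping in /proc/<pid>/maps text names soname."""
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and soname in fields[5]:
            return True
    return False


def environ_preload(environ_bytes):
    """Extract LD_PRELOAD from /proc/<pid>/environ (NUL-separated KEY=VALUE)."""
    for entry in environ_bytes.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"LD_PRELOAD":
            return value.decode(errors="replace")
    return None


def jemalloc_engaged(pid):
    """Linux-only: require both the mapping and the environment variable."""
    proc = Path("/proc") / str(pid)
    mapped = maps_has_library((proc / "maps").read_text())
    preload = environ_preload((proc / "environ").read_bytes())
    return mapped and preload is not None and JEMALLOC_SONAME in preload
```

Running `jemalloc_engaged` against a process started through the wrapped entrypoint and again against one started through the unwrapped entrypoint should give different answers if the wrapper behaves as described. Note that `/proc/<pid>/environ` reflects the environment at process start, which is the moment the preload takes effect.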