perf: use jemalloc for make_examples in the runtime image by tfenne · Pull Request #1087 · google/deepvariant

tfenne · 2026-06-22T22:56:52Z

What

Preload jemalloc as the allocator for the make_examples family of binaries in the published Docker image, instead of the default glibc malloc.

Why

make_examples is a CPU-bound stage, and a large share of its work — pileup construction and local realignment — is allocation-heavy (many short-lived objects across many worker shards). jemalloc handles that allocation pattern substantially better than glibc malloc.

The preload is scoped to the make_examples family only. call_variants is TensorFlow inference, not allocation-bound, and showed no measurable change with jemalloc, so there's no reason to apply it there.

Impact

~7.3% faster make_examples on a 30× WGS HG003 chr20 run with the production wgs config (7-channel model + --call_small_model_examples): 231.0s → 214.2s (mean of 2 reps each, same host, back-to-back; c8a.4xlarge / 16 vCPU, 16 shards).
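The headline percentage follows directly from the two mean wall-clock times quoted above. A minimal sketch of that arithmetic, where the function name is illustrative and not part of DeepVariant:

```python
def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Return the wall-clock reduction of candidate vs. baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Mean make_examples wall-clock times (seconds) from the chr20 run above.
reduction = percent_faster(231.0, 214.2)
print(f"{reduction:.1f}% faster")
```

With only two repetitions per arm, the result is best read as a point estimate rather than a tight bound.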

Verified the wrapper actually engages it: the make_examples python process launched via the wrapped entrypoint maps libjemalloc.so.2 and has LD_PRELOAD=libjemalloc.so.2 in its environment; the unwrapped entrypoint does not.
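One way to reproduce this kind of check on Linux is to read the process's `/proc/<pid>/maps` and `/proc/<pid>/environ` files. The sketch below is an assumption about how such a check could be scripted, not the procedure used for this PR; all function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def maps_mention(maps_text: str, soname: str) -> bool:
    """True if any mapping line in a /proc/<pid>/maps dump names the library."""
    return any(soname in line for line in maps_text.splitlines())


def environ_get(environ_bytes: bytes, key: str) -> Optional[str]:
    """Look up one variable in a NUL-separated /proc/<pid>/environ dump."""
    for entry in environ_bytes.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep and name == key.encode():
            return value.decode(errors="replace")
    return None


def jemalloc_engaged(pid: int, soname: str = "libjemalloc.so.2") -> bool:
    """Check both signals for a live process (Linux only; needs read access to /proc/<pid>)."""
    proc = Path("/proc") / str(pid)
    mapped = maps_mention((proc / "maps").read_text(), soname)
    preload = environ_get((proc / "environ").read_bytes(), "LD_PRELOAD") or ""
    return mapped and soname in preload
```

Checking the memory map as well as the environment matters because the variable can be set without the library actually being loaded.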

Changes (1 file)

Dockerfile:
- install the distro libjemalloc2 package in the runtime image;
- prefix LD_PRELOAD=libjemalloc.so.2 on the make_examples,
  multisample_make_examples, and make_examples_somatic wrappers.

The bare so name (libjemalloc.so.2, not a hardcoded path) is resolved from the default linker search path, so it stays correct on any future supported architecture.
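The wrappers themselves are defined in the Dockerfile and are not reproduced here. As an illustration of the scoping idea only, the Python sketch below sets `LD_PRELOAD` in the environment of a single child process, leaving the parent environment and any other commands untouched. The helper names are hypothetical.

```python
import os
import subprocess
from typing import Mapping, Sequence

# Bare soname: the dynamic linker resolves it from its default search path.
JEMALLOC_SONAME = "libjemalloc.so.2"


def with_preload(env: Mapping[str, str], soname: str = JEMALLOC_SONAME) -> dict:
    """Return a copy of env with soname prepended to LD_PRELOAD."""
    child_env = dict(env)
    existing = child_env.get("LD_PRELOAD", "")
    # LD_PRELOAD entries may be separated by colons or spaces.
    child_env["LD_PRELOAD"] = f"{soname}:{existing}" if existing else soname
    return child_env


def run_preloaded(cmd: Sequence[str]) -> int:
    """Run one command with the preload applied to that child process only."""
    return subprocess.run(list(cmd), env=with_preload(os.environ)).returncode
```

Because the variable is set per command rather than image-wide, binaries such as call_variants keep running on the default allocator.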

Correctness

This is an allocator swap only — no algorithmic or output change. jemalloc is a drop-in malloc/free replacement; make_examples output is bit-identical. No source files are touched.
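The PR does not say how bit-identity was checked. One straightforward approach, sketched below under the assumption that the baseline and jemalloc runs each wrote their outputs to a separate directory, is to compare SHA-256 digests file by file. The directory layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(baseline_dir: Path, candidate_dir: Path) -> list:
    """Return relative paths whose bytes differ or that exist in only one directory."""
    def digests(root: Path) -> dict:
        return {
            p.relative_to(root).as_posix(): sha256_of(p)
            for p in root.rglob("*")
            if p.is_file()
        }

    base, cand = digests(baseline_dir), digests(candidate_dir)
    return sorted(n for n in base.keys() | cand.keys() if base.get(n) != cand.get(n))
```

An empty list from `differing_files` means every output file matched byte for byte.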

Notes

I also tested mimalloc and rpmalloc; mimalloc was slower than glibc malloc, while rpmalloc was faster than glibc malloc but not as fast as jemalloc.

make_examples spends a large share of time in allocation-heavy pileup and local-realignment work. Preloading jemalloc (vs glibc malloc) measurably reduces its wall-clock; it has no measurable effect on call_variants (TF inference), so the LD_PRELOAD is scoped to the make_examples-family wrappers only. Installed via the distro libjemalloc2 package; the bare soname (libjemalloc.so.2) keeps it architecture-portable.

… pangenome images

The runtime image already preloads jemalloc for its make_examples wrappers, but the DeepTrio, DeepSomatic, and pangenome-aware images build FROM ubuntu:22.04 rather than from that image, so they inherited neither the libjemalloc2 install nor the preload and ran make_examples on glibc malloc. Their make_examples does the same allocation-heavy pileup/realignment work, so they benefit from the same preload. This applies the identical pattern to each of the three Dockerfiles: install libjemalloc2 and prepend LD_PRELOAD=libjemalloc.so.2 to that image's make_examples wrapper only (deeptrio/make_examples, make_examples_somatic, make_examples_pangenome_aware_dv). call_variants and the other wrappers are left untouched, matching the runtime image.

pichuan · 2026-06-24T04:32:22Z

Hi @tfenne ,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

tfenne · 2026-06-24T04:49:35Z

Thanks @pichuan - understood about the contribution/merge process, and thanks for taking a look at this and my other PRs.

tfenne · 2026-06-26T15:14:20Z

I re-benchmarked this PR's changes vs. r1.10 through the standard docker build process, running the baseline and modified versions in the resulting docker containers on the chr20 short-read WGS set on a c8a.4xlarge at AWS with 16 cores / 16 shards. In that setup:

baseline runtime: 113.1
pr runtime: 105.7
% change: ~6.5% improvement ((113.1 - 105.7) / 113.1)

tfenne force-pushed the tf_jemalloc-image branch from f5129f6 to da4f616 · June 23, 2026 04:26

pichuan self-assigned this Jun 24, 2026

pichuan self-requested a review June 26, 2026 16:37

tfenne wants to merge 2 commits into google:r1.10 from tfenne:tf_jemalloc-image
