perf: build htslib with libdeflate for faster BGZF/BAM decoding by nh13 · Pull Request #1086 · google/deepvariant

nh13 · 2026-06-22T19:02:16Z

What

Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression uses libdeflate instead of zlib — the same optimization samtools/htslib ship by default.

Why

Profiling make_examples (the CPU-bound stage) shows a large share of time in zlib inflate, decompressing BGZF blocks while reading the input BAM. libdeflate's decoder is substantially faster.

Impact

~7% faster make_examples on a 30× WGS chr22:20,000,000-30,000,000 slice (5:33.7 → 5:10.4, best of 2 reps, same host, back-to-back). perf confirms the decode moves to libdeflate (deflate_decompress_bmi2) and the zlib share drops from ~22% to ~14% (the remainder is the TensorFlow TFRecord output codec, out of scope here).

Changes (3 files)

WORKSPACE: vendor libdeflate 1.20 via http_archive (hermetic, pinned sha256), mirroring how htslib is vendored.
third_party/libdeflate.BUILD (new): minimal cc_library.
third_party/htslib.BUILD: define HAVE_LIBDEFLATE and add @libdeflate to htslib's deps.

Correctness

DEFLATE decompression is bit-exact per RFC 1951; libdeflate is MIT-licensed and is htslib's supported codec. No behavior change — only decode speed.

🤖 Generated with Claude Code

@libdeflate

make_examples spends a large fraction of wall time in zlib decompressing BGZF blocks while reading the input BAM. Building htslib against libdeflate (its supported fast DEFLATE codec, as used by samtools) moves BGZF decode off zlib and onto libdeflate's BMI2-optimized decompressor. Measured ~7% faster make_examples on a 30x WGS chr22:20-30Mb slice (5:33.7 -> 5:10.4, best of 2 reps, same host, back-to-back). libdeflate 1.20 is vendored hermetically via http_archive (matching how htslib itself is vendored); HAVE_LIBDEFLATE is enabled in the htslib config and @libdeflate is added to its deps. libdeflate is MIT-licensed. DEFLATE decompression is bit-exact per RFC 1951, so output is unchanged.

google-cla · 2026-06-22T19:02:28Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

pichuan · 2026-06-24T04:32:09Z

Hi @nh13 ,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

nh13 · 2026-06-24T18:01:34Z

Thank-you @pichuan for responding so quickly. I am grateful that you're taking the time to review this work, and I am hopefully that if this and other changes go in, it'll benefit the community, and the environment too!

pichuan · 2026-06-26T06:34:14Z

Hi @nh13, here is an update on my testing result for this PR:

Testing methodology:

I built a Docker with code in this PR. The Docker image is named gh1086.
As a baseline, I built another Docker with the exact same base code (before this PR). The Docker image is named head937500229.

Then, I first tested this PR with n2-standard-96, with just one run per each of the 6 types in metrics.

In that test, I didn't find significant runtime difference. And some of the runtimes are different enough but that might be due to run variations.

So I decided to test again with two more differences:

Instead of n2-standard-96, i tested on c3d-standard-16 , which has 16 vCPUs and 64 GB memory. See: https://docs.cloud.google.com/compute/docs/general-purpose-machines#c3d_series: "C3D VMs are powered by the 4th generation AMD EPYC™ (Genoa) processor with a maximum frequency of 3.7 Ghz. "
I ran 5 trials of each of the 6 types in metrics.

This is actually what we do for every release. We look out for:

Making sure all 5 runs of the same thing have exactly the same md5sum (VCFs and gVCFs).
We check the changes in hap.py accuracy and see if it's expected.
We check for average runtime differences and see if it's expected. (With 5 runs, we hope that helps with the runtime variation.)

md5sum and file size observation:

Confirmed that gh1086 has consistent md5sum: All 5 runs of the same type have exactly the same md5sum (VCFs and gVCFs)
head937500229 has consistent md5sum as well. But they're all different from gh1086. This is because gh1086 changed the way compression is done.
gh1086 produced smaller compressed .vcf.gz and .g.vcf.gz files across all case studies. Compared to head937500229, the file size reduction ranges from 3.6% to 8.9% per file, with an overall reduction of 6.9% (264.5 MiB saved across the 12 output files).

Runtime summary:

TL;DR: gh1086 (libdeflate) is faster than head937500229 (zlib) on make_examples across all 6 case studies, with no regressions on any other stage:

Case Study	make_examples Speedup	Significance
hybrid-pacbio-illumina	-9.9% (~25 min saved)	✅ p<0.001
exome	-3.0%	✅ p<0.05
rnaseq	-2.8%	✅ p<0.05
wgs	-2.5% (~4.6 min saved)	✅ p<0.001
ont-r104	-1.3%	ns (high variance)
pacbio	-1.3%	ns (high variance)

The call_variants, postprocess_variants, and vcf_stats stages are statistically unchanged, confirming the improvement is isolated to the BAM I/O path as expected. The previous single-run comparison had noisy results — with 5 trials, the signal is now clear: gh1086 is consistently faster, never slower.

Click for `gh1086` runtime table

`gh1086` runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	466.96	9.642	5	7m 46s
exome	HG003	call_variants	162.51	27.648	5	2m 42s
exome	HG003	postprocess_variants	29.43	0.177	5	29s
exome	HG003	vcf_stats	5.5	0.036	5	5s
exome	HG003	total	658.9	37.322	5	10m 58s
hybrid-pacbio-illumina	HG003	make_examples	13971.72	73.979	5	3h 52m 51s
hybrid-pacbio-illumina	HG003	call_variants	23544.69	43.602	5	6h 32m 24s
hybrid-pacbio-illumina	HG003	postprocess_variants	293.53	9.006	5	4m 53s
hybrid-pacbio-illumina	HG003	vcf_stats	240.81	1.289	5	4m 0s
hybrid-pacbio-illumina	HG003	total	37809.95	87.255	5	10h 30m 9s
ont-r104	HG003	make_examples	12897.07	112.368	5	3h 34m 57s
ont-r104	HG003	call_variants	11831.9	27.586	5	3h 17m 11s
ont-r104	HG003	postprocess_variants	1013.3	9.045	5	16m 53s
ont-r104	HG003	vcf_stats	357.66	4.702	5	5m 57s
ont-r104	HG003	total	25742.27	102.278	5	7h 9m 2s
pacbio	HG003	make_examples	8507.56	46.34	5	2h 21m 47s
pacbio	HG003	call_variants	6211.3	6.875	5	1h 43m 31s
pacbio	HG003	postprocess_variants	521.27	6.698	5	8m 41s
pacbio	HG003	vcf_stats	276.86	2.36	5	4m 36s
pacbio	HG003	total	15240.12	44.084	5	4h 14m 0s
rnaseq	HG005	make_examples	1464.11	37.572	5	24m 24s
rnaseq	HG005	call_variants	96.11	1.692	5	1m 36s
rnaseq	HG005	postprocess_variants	205.63	3.738	5	3m 25s
rnaseq	HG005	vcf_stats	4.95	0.089	5	4s
rnaseq	HG005	total	1765.84	42.737	5	29m 25s
wgs	HG003	make_examples	10901.99	100.289	5	3h 1m 41s
wgs	HG003	call_variants	5834.68	14.477	5	1h 37m 14s
wgs	HG003	postprocess_variants	396.08	3.54	5	6m 36s
wgs	HG003	vcf_stats	253.03	2.444	5	4m 13s
wgs	HG003	total	17132.76	108.833	5	4h 45m 32s

Click for `head937500229` runtime table

`head937500229` runtime table

uid	sample	stage	mean_runtime	std_runtime	n_trials	mean_hruntime
exome	HG003	make_examples	481.59	8.884	5	8m 1s
exome	HG003	call_variants	150.94	0.808	5	2m 30s
exome	HG003	postprocess_variants	27.28	0.182	5	27s
exome	HG003	vcf_stats	5.54	0.058	5	5s
exome	HG003	total	659.81	9.358	5	10m 59s
hybrid-pacbio-illumina	HG003	make_examples	15513.45	51.654	5	4h 18m 33s
hybrid-pacbio-illumina	HG003	call_variants	23520.08	29.808	5	6h 32m 0s
hybrid-pacbio-illumina	HG003	postprocess_variants	290.26	4.803	5	4m 50s
hybrid-pacbio-illumina	HG003	vcf_stats	243.47	2.018	5	4m 3s
hybrid-pacbio-illumina	HG003	total	39323.79	67.138	5	10h 55m 23s
ont-r104	HG003	make_examples	13062.19	156.502	5	3h 37m 42s
ont-r104	HG003	call_variants	11814.9	7.661	5	3h 16m 54s
ont-r104	HG003	postprocess_variants	1033.73	8.6	5	17m 13s
ont-r104	HG003	vcf_stats	361.09	3.496	5	6m 1s
ont-r104	HG003	total	25910.83	158.26	5	7h 11m 50s
pacbio	HG003	make_examples	8615.63	155.868	5	2h 23m 35s
pacbio	HG003	call_variants	6237.02	32.756	5	1h 43m 57s
pacbio	HG003	postprocess_variants	521.13	8.723	5	8m 41s
pacbio	HG003	vcf_stats	280.33	1.387	5	4m 40s
pacbio	HG003	total	15373.77	182.451	5	4h 16m 13s
rnaseq	HG005	make_examples	1506.26	9.424	5	25m 6s
rnaseq	HG005	call_variants	94.85	0.066	5	1m 34s
rnaseq	HG005	postprocess_variants	204.45	1.424	5	3m 24s
rnaseq	HG005	vcf_stats	4.94	0.057	5	4s
rnaseq	HG005	total	1805.56	10.267	5	30m 5s
wgs	HG003	make_examples	11180.41	111.767	5	3h 6m 20s
wgs	HG003	call_variants	5825.96	5.848	5	1h 37m 5s
wgs	HG003	postprocess_variants	406.19	1.954	5	6m 46s
wgs	HG003	vcf_stats	252.8	1.256	5	4m 12s
wgs	HG003	total	17412.56	110.714	5	4h 50m 12s

Based on this finding, I will recommend to my team that we incorporate this PR. I will send it for internal review in case there are more feedback. I plan to use this as the commit message:

Commit message draft:

perf: Build htslib with libdeflate for faster BGZF/BAM decoding

Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression
uses libdeflate instead of zlib — the same optimization samtools/htslib
ship by default.

Profiling make_examples shows that BAM bgzf decoding is the dominant
codec cost. libdeflate provides significantly faster DEFLATE/gzip
compression and decompression compared to zlib.

Based on #1086
Credit: GitHub user @nh13

I will give another update once it's internally reviewed and submitted.

pichuan self-assigned this Jun 24, 2026

pichuan mentioned this pull request Jun 26, 2026

perf: Speed up the small-model caller: call the model directly instead of Model.predict() #1089

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: build htslib with libdeflate for faster BGZF/BAM decoding#1086

perf: build htslib with libdeflate for faster BGZF/BAM decoding#1086
nh13 wants to merge 1 commit into
google:r1.10from
nh13:nh_htslib-libdeflate

nh13 commented Jun 22, 2026

Uh oh!

google-cla Bot commented Jun 22, 2026

Uh oh!

pichuan commented Jun 24, 2026

Uh oh!

nh13 commented Jun 24, 2026

Uh oh!

pichuan commented Jun 26, 2026

`gh1086` runtime table

`head937500229` runtime table

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nh13 commented Jun 22, 2026

What

Why

Impact

Changes (3 files)

Correctness

Uh oh!

google-cla Bot commented Jun 22, 2026

Uh oh!

pichuan commented Jun 24, 2026

Uh oh!

nh13 commented Jun 24, 2026

Uh oh!

pichuan commented Jun 26, 2026

md5sum and file size observation:

Runtime summary:

gh1086 runtime table

head937500229 runtime table

Commit message draft:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`gh1086` runtime table

`head937500229` runtime table