
perf: build htslib with libdeflate for faster BGZF/BAM decoding #1086

Open
nh13 wants to merge 1 commit into google:r1.10 from nh13:nh_htslib-libdeflate

nh13 commented Jun 22, 2026

What

Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression uses libdeflate instead of zlib — the same optimization samtools/htslib ship by default.

Why

Profiling make_examples (the CPU-bound stage) shows a large share of time in zlib inflate, decompressing BGZF blocks while reading the input BAM. libdeflate's decoder is substantially faster.
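
For readers unfamiliar with the format: a BGZF block is a gzip member that carries a `BC` extra subfield recording the block size, wrapped around a raw DEFLATE stream. The sketch below is illustrative only. It uses Python's standard `zlib` module rather than htslib, the helper names `make_bgzf_block` and `read_bgzf_block` are hypothetical, and it assumes `BC` is the only extra subfield. The `zlib.decompress` call marks the inflate step that, inside htslib, this PR hands to libdeflate.

```python
import struct
import zlib


def make_bgzf_block(payload: bytes, level: int = 6) -> bytes:
    """Wrap `payload` (at most 64 KiB) in a single BGZF block."""
    # wbits=-15 produces a raw DEFLATE stream without a zlib/gzip wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    cdata = compressor.compress(payload) + compressor.flush()
    # BSIZE is the total block length minus one:
    # 12-byte gzip header + 6-byte extra field + data + 8-byte trailer.
    bsize = 12 + 6 + len(cdata) + 8 - 1
    header = struct.pack(
        "<4BI2BH2BHH",
        31, 139, 8, 4,   # gzip magic, CM=deflate, FLG=FEXTRA
        0,               # MTIME
        0, 255,          # XFL, OS=unknown
        6,               # XLEN: length of the extra field
        66, 67, 2,       # subfield 'B', 'C' with a 2-byte payload
        bsize,
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def read_bgzf_block(block: bytes) -> bytes:
    """Decode one BGZF block, checking its CRC32 and uncompressed size."""
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from("<4BI2BH", block, 0)
    if (id1, id2, cm, flg) != (31, 139, 8, 4):
        raise ValueError("not a BGZF block")
    # Assumption: the BC subfield is the first (and only) extra subfield.
    si1, si2, _slen, bsize = struct.unpack_from("<2BHH", block, 12)
    if (si1, si2) != (66, 67):
        raise ValueError("missing BC subfield")
    trailer_offset = bsize + 1 - 8
    cdata = block[12 + xlen:trailer_offset]
    crc, isize = struct.unpack_from("<II", block, trailer_offset)
    # This inflate call is the hot spot that a faster decoder speeds up.
    payload = zlib.decompress(cdata, wbits=-15)
    if zlib.crc32(payload) != crc or len(payload) != isize:
        raise ValueError("corrupt BGZF block")
    return payload


if __name__ == "__main__":
    record = b"@HD\tVN:1.6\tSO:coordinate\n" * 50
    block = make_bgzf_block(record)
    print(len(record), "bytes in,", len(block), "bytes per block")
    print("round trip ok:", read_bgzf_block(block) == record)
```

Because every block is an independent gzip member, the per-block inflate cost is paid once for each block read from the BAM, which is why the choice of decoder shows up in the profile.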

Impact

~7% faster make_examples on a 30× WGS chr22:20,000,000-30,000,000 slice (5:33.7 → 5:10.4, best of 2 reps, same host, back-to-back). perf confirms the decode moves to libdeflate (deflate_decompress_bmi2) and the zlib share drops from ~22% to ~14% (the remainder is the TensorFlow TFRecord output codec, out of scope here).

Changes (3 files)

  • WORKSPACE: vendor libdeflate 1.20 via http_archive (hermetic, pinned sha256), mirroring how htslib is vendored.
  • third_party/libdeflate.BUILD (new): minimal cc_library.
  • third_party/htslib.BUILD: define HAVE_LIBDEFLATE and add @libdeflate to htslib's deps.

Correctness

DEFLATE decompression is bit-exact per RFC 1951; libdeflate is MIT-licensed and is htslib's supported codec. No behavior change — only decode speed.

🤖 Generated with Claude Code

make_examples spends a large fraction of wall time in zlib decompressing
BGZF blocks while reading the input BAM. Building htslib against libdeflate
(its supported fast DEFLATE codec, as used by samtools) moves BGZF decode
off zlib and onto libdeflate's BMI2-optimized decompressor.

Measured ~7% faster make_examples on a 30x WGS chr22:20-30Mb slice
(5:33.7 -> 5:10.4, best of 2 reps, same host, back-to-back).

libdeflate 1.20 is vendored hermetically via http_archive (matching how
htslib itself is vendored); HAVE_LIBDEFLATE is enabled in the htslib config
and @libdeflate is added to its deps. libdeflate is MIT-licensed. DEFLATE
decompression is bit-exact per RFC 1951, so output is unchanged.

google-cla Bot commented Jun 22, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

pichuan commented Jun 24, 2026
Collaborator

Hi @nh13,

Thanks for the PR!

Since I believe you're already familiar with our process, I'll go ahead and start the review.

As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description.

Please let me know if you have any concerns with this approach.

-pichuan

pichuan self-assigned this Jun 24, 2026

nh13 commented Jun 24, 2026

Author

Thank you, @pichuan, for responding so quickly. I am grateful that you're taking the time to review this work, and I am hopeful that if this and other changes go in, they'll benefit the community, and the environment too!

pichuan commented Jun 26, 2026
Collaborator

Hi @nh13, here is an update on my testing result for this PR:

Testing methodology:

  1. I built a Docker image with the code in this PR. The Docker image is named gh1086.
  2. As a baseline, I built another Docker image with the exact same base code (before this PR). The Docker image is named head937500229.

I first tested this PR on n2-standard-96, with just one run for each of the 6 types in metrics.

In that test, I didn't find a significant runtime difference. Some of the runtimes differed noticeably, but that might be due to run-to-run variation.

So I decided to test again with two changes:

  1. Instead of n2-standard-96, I tested on c3d-standard-16, which has 16 vCPUs and 64 GB memory. See: https://docs.cloud.google.com/compute/docs/general-purpose-machines#c3d_series: "C3D VMs are powered by the 4th generation AMD EPYC™ (Genoa) processor with a maximum frequency of 3.7 Ghz."
  2. I ran 5 trials of each of the 6 types in metrics.

This is actually what we do for every release. We look out for:

  1. We make sure all 5 runs of the same type have exactly the same md5sum (VCFs and gVCFs).
  2. We check the changes in hap.py accuracy and see if they're expected.
  3. We check the average runtime differences and see if they're expected. (We hope that having 5 runs helps with the runtime variation.)
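
The first check in the list above can be sketched in a few lines of Python. This is not the project's release tooling: the helper names `md5_of_file` and `runs_are_reproducible`, and the file names in the usage example, are hypothetical and exist only to illustrate the idea of comparing digests across repeated runs.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_are_reproducible(paths):
    """Return True when every file in `paths` has the same MD5 digest.

    An empty list returns False, since there is nothing to confirm.
    """
    digests = {md5_of_file(path) for path in paths}
    return len(digests) == 1


if __name__ == "__main__":
    # Stand-in for five trial outputs of one case study (hypothetical names).
    with tempfile.TemporaryDirectory() as workdir:
        outputs = []
        for trial in range(5):
            path = os.path.join(workdir, f"trial{trial}.vcf.gz")
            with open(path, "wb") as handle:
                handle.write(b"identical bytes in every trial")
            outputs.append(path)
        print("reproducible:", runs_are_reproducible(outputs))
```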

md5sum and file size observation:

  • Confirmed that gh1086 has consistent md5sum: All 5 runs of the same type have exactly the same md5sum (VCFs and gVCFs)
  • head937500229 has consistent md5sum as well. But they're all different from gh1086. This is because gh1086 changed the way compression is done.
  • gh1086 produced smaller compressed .vcf.gz and .g.vcf.gz files across all case studies. Compared to head937500229, the file size reduction ranges from 3.6% to 8.9% per file, with an overall reduction of 6.9% (264.5 MiB saved across the 12 output files).
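
The md5sum difference between the two images is what one would expect when only the encoder changes: the DEFLATE format allows many valid compressed encodings of the same data, while decompression always recovers the same bytes. The sketch below illustrates this with Python's standard `zlib` at two compression levels. The two levels are only a stand-in for "two different encoders" (Python's standard library has no libdeflate binding), and the VCF-like records are made up.

```python
import hashlib
import zlib

# Made-up VCF-like records, used only to give the compressor realistic text.
records = [
    f"chr22\t{20_000_000 + 37 * i}\t.\tA\tG\t{(i * 7919) % 100}\tPASS\t.\n"
    for i in range(5000)
]
data = "".join(records).encode()

# Two encoders' worth of output: same input, different compressed bytes.
encoded_a = zlib.compress(data, 1)
encoded_b = zlib.compress(data, 9)

compressed_md5_differs = (
    hashlib.md5(encoded_a).hexdigest() != hashlib.md5(encoded_b).hexdigest()
)
payload_identical = zlib.decompress(encoded_a) == zlib.decompress(encoded_b) == data

if __name__ == "__main__":
    print("compressed sizes:", len(encoded_a), len(encoded_b))
    print("compressed md5 differs:", compressed_md5_differs)
    print("decompressed payload identical:", payload_identical)
```

Comparing digests of the decompressed VCF content, rather than of the `.vcf.gz` files, is one way to confirm that the variant calls themselves are unchanged between the two images.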

Runtime summary:

TL;DR: gh1086 (libdeflate) is faster than head937500229 (zlib) on make_examples across all 6 case studies, with no regressions on any other stage:

| Case Study | make_examples Speedup | Significance |
|---|---|---|
| hybrid-pacbio-illumina | -9.9% (~25 min saved) | ✅ p<0.001 |
| exome | -3.0% | ✅ p<0.05 |
| rnaseq | -2.8% | ✅ p<0.05 |
| wgs | -2.5% (~4.6 min saved) | ✅ p<0.001 |
| ont-r104 | -1.3% | ns (high variance) |
| pacbio | -1.3% | ns (high variance) |

The call_variants, postprocess_variants, and vcf_stats stages are statistically unchanged, confirming the improvement is isolated to the BAM I/O path as expected. The previous single-run comparison had noisy results — with 5 trials, the signal is now clear: gh1086 is consistently faster, never slower.
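
The percentages in the summary can be reproduced from the mean `make_examples` runtimes in the two runtime tables. The sketch below does only that arithmetic; the function name `relative_change_percent` is hypothetical, and it does not attempt to reproduce the significance column, since the statistical test used is not stated here.

```python
# Mean make_examples runtime in seconds: (head937500229 baseline, gh1086),
# copied from the two runtime tables in this comment.
MAKE_EXAMPLES_MEAN_SECONDS = {
    "hybrid-pacbio-illumina": (15513.45, 13971.72),
    "exome": (481.59, 466.96),
    "rnaseq": (1506.26, 1464.11),
    "wgs": (11180.41, 10901.99),
    "ont-r104": (13062.19, 12897.07),
    "pacbio": (8615.63, 8507.56),
}


def relative_change_percent(baseline, candidate):
    """Runtime change of `candidate` relative to `baseline`, in percent.

    Negative values mean the candidate is faster.
    """
    return 100.0 * (candidate - baseline) / baseline


changes = {
    case: round(relative_change_percent(baseline, candidate), 1)
    for case, (baseline, candidate) in MAKE_EXAMPLES_MEAN_SECONDS.items()
}
minutes_saved = {
    case: round((baseline - candidate) / 60.0, 1)
    for case, (baseline, candidate) in MAKE_EXAMPLES_MEAN_SECONDS.items()
}

if __name__ == "__main__":
    for case in MAKE_EXAMPLES_MEAN_SECONDS:
        print(f"{case}: {changes[case]:+.1f}% ({minutes_saved[case]} min saved)")
```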

gh1086 runtime table

| uid | sample | stage | mean_runtime (s) | std_runtime (s) | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 466.96 | 9.642 | 5 | 7m 46s |
| exome | HG003 | call_variants | 162.51 | 27.648 | 5 | 2m 42s |
| exome | HG003 | postprocess_variants | 29.43 | 0.177 | 5 | 29s |
| exome | HG003 | vcf_stats | 5.5 | 0.036 | 5 | 5s |
| exome | HG003 | total | 658.9 | 37.322 | 5 | 10m 58s |
| hybrid-pacbio-illumina | HG003 | make_examples | 13971.72 | 73.979 | 5 | 3h 52m 51s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23544.69 | 43.602 | 5 | 6h 32m 24s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 293.53 | 9.006 | 5 | 4m 53s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 240.81 | 1.289 | 5 | 4m 0s |
| hybrid-pacbio-illumina | HG003 | total | 37809.95 | 87.255 | 5 | 10h 30m 9s |
| ont-r104 | HG003 | make_examples | 12897.07 | 112.368 | 5 | 3h 34m 57s |
| ont-r104 | HG003 | call_variants | 11831.9 | 27.586 | 5 | 3h 17m 11s |
| ont-r104 | HG003 | postprocess_variants | 1013.3 | 9.045 | 5 | 16m 53s |
| ont-r104 | HG003 | vcf_stats | 357.66 | 4.702 | 5 | 5m 57s |
| ont-r104 | HG003 | total | 25742.27 | 102.278 | 5 | 7h 9m 2s |
| pacbio | HG003 | make_examples | 8507.56 | 46.34 | 5 | 2h 21m 47s |
| pacbio | HG003 | call_variants | 6211.3 | 6.875 | 5 | 1h 43m 31s |
| pacbio | HG003 | postprocess_variants | 521.27 | 6.698 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 276.86 | 2.36 | 5 | 4m 36s |
| pacbio | HG003 | total | 15240.12 | 44.084 | 5 | 4h 14m 0s |
| rnaseq | HG005 | make_examples | 1464.11 | 37.572 | 5 | 24m 24s |
| rnaseq | HG005 | call_variants | 96.11 | 1.692 | 5 | 1m 36s |
| rnaseq | HG005 | postprocess_variants | 205.63 | 3.738 | 5 | 3m 25s |
| rnaseq | HG005 | vcf_stats | 4.95 | 0.089 | 5 | 4s |
| rnaseq | HG005 | total | 1765.84 | 42.737 | 5 | 29m 25s |
| wgs | HG003 | make_examples | 10901.99 | 100.289 | 5 | 3h 1m 41s |
| wgs | HG003 | call_variants | 5834.68 | 14.477 | 5 | 1h 37m 14s |
| wgs | HG003 | postprocess_variants | 396.08 | 3.54 | 5 | 6m 36s |
| wgs | HG003 | vcf_stats | 253.03 | 2.444 | 5 | 4m 13s |
| wgs | HG003 | total | 17132.76 | 108.833 | 5 | 4h 45m 32s |

head937500229 runtime table

| uid | sample | stage | mean_runtime (s) | std_runtime (s) | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 481.59 | 8.884 | 5 | 8m 1s |
| exome | HG003 | call_variants | 150.94 | 0.808 | 5 | 2m 30s |
| exome | HG003 | postprocess_variants | 27.28 | 0.182 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.54 | 0.058 | 5 | 5s |
| exome | HG003 | total | 659.81 | 9.358 | 5 | 10m 59s |
| hybrid-pacbio-illumina | HG003 | make_examples | 15513.45 | 51.654 | 5 | 4h 18m 33s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23520.08 | 29.808 | 5 | 6h 32m 0s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 290.26 | 4.803 | 5 | 4m 50s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 243.47 | 2.018 | 5 | 4m 3s |
| hybrid-pacbio-illumina | HG003 | total | 39323.79 | 67.138 | 5 | 10h 55m 23s |
| ont-r104 | HG003 | make_examples | 13062.19 | 156.502 | 5 | 3h 37m 42s |
| ont-r104 | HG003 | call_variants | 11814.9 | 7.661 | 5 | 3h 16m 54s |
| ont-r104 | HG003 | postprocess_variants | 1033.73 | 8.6 | 5 | 17m 13s |
| ont-r104 | HG003 | vcf_stats | 361.09 | 3.496 | 5 | 6m 1s |
| ont-r104 | HG003 | total | 25910.83 | 158.26 | 5 | 7h 11m 50s |
| pacbio | HG003 | make_examples | 8615.63 | 155.868 | 5 | 2h 23m 35s |
| pacbio | HG003 | call_variants | 6237.02 | 32.756 | 5 | 1h 43m 57s |
| pacbio | HG003 | postprocess_variants | 521.13 | 8.723 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 280.33 | 1.387 | 5 | 4m 40s |
| pacbio | HG003 | total | 15373.77 | 182.451 | 5 | 4h 16m 13s |
| rnaseq | HG005 | make_examples | 1506.26 | 9.424 | 5 | 25m 6s |
| rnaseq | HG005 | call_variants | 94.85 | 0.066 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.45 | 1.424 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.057 | 5 | 4s |
| rnaseq | HG005 | total | 1805.56 | 10.267 | 5 | 30m 5s |
| wgs | HG003 | make_examples | 11180.41 | 111.767 | 5 | 3h 6m 20s |
| wgs | HG003 | call_variants | 5825.96 | 5.848 | 5 | 1h 37m 5s |
| wgs | HG003 | postprocess_variants | 406.19 | 1.954 | 5 | 6m 46s |
| wgs | HG003 | vcf_stats | 252.8 | 1.256 | 5 | 4m 12s |
| wgs | HG003 | total | 17412.56 | 110.714 | 5 | 4h 50m 12s |

Based on this finding, I will recommend to my team that we incorporate this PR. I will send it for internal review in case there is more feedback. I plan to use this as the commit message:

Commit message draft:

perf: Build htslib with libdeflate for faster BGZF/BAM decoding

Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression
uses libdeflate instead of zlib — the same optimization samtools/htslib
ship by default.

Profiling make_examples shows that BAM bgzf decoding is the dominant
codec cost. libdeflate provides significantly faster DEFLATE/gzip
compression and decompression compared to zlib.

Based on #1086
Credit: GitHub user @nh13


I will give another update once it's internally reviewed and submitted.
