perf: build htslib with libdeflate for faster BGZF/BAM decoding #1086
nh13 wants to merge 1 commit
make_examples spends a large fraction of wall time in zlib decompressing BGZF blocks while reading the input BAM. Building htslib against libdeflate (its supported fast DEFLATE codec, as used by samtools) moves BGZF decode off zlib and onto libdeflate's BMI2-optimized decompressor.

Measured ~7% faster make_examples on a 30x WGS chr22:20-30Mb slice (5:33.7 -> 5:10.4, best of 2 reps, same host, back-to-back).

libdeflate 1.20 is vendored hermetically via http_archive (matching how htslib itself is vendored); HAVE_LIBDEFLATE is enabled in the htslib config and @libdeflate is added to its deps.

libdeflate is MIT-licensed. DEFLATE decompression is bit-exact per RFC 1951, so output is unchanged.
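For readers who want to check the "~7%" figure, here is a minimal sketch of the arithmetic, using only the two wall-clock times quoted above. The helper name `parse_mmss` is hypothetical and introduced only for illustration.

```python
def parse_mmss(text: str) -> float:
    """Convert an 'M:SS.s' wall-clock string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


# The two timings quoted in the description (best of 2 reps each).
baseline_s = parse_mmss("5:33.7")     # zlib build
libdeflate_s = parse_mmss("5:10.4")   # libdeflate build

# Fraction of wall time removed, relative to the baseline.
time_saved_pct = 100.0 * (baseline_s - libdeflate_s) / baseline_s

# Equivalent throughput gain (same work finished in less time).
throughput_gain_pct = 100.0 * (baseline_s / libdeflate_s - 1.0)

print(f"time saved: {time_saved_pct:.2f}%")
print(f"throughput gain: {throughput_gain_pct:.2f}%")
```

Either way of expressing the change rounds to roughly 7%, which matches the claim. With only two repetitions per build, the figure is best read as indicative rather than as a precise estimate.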
Hi @nh13, thanks for the PR! Since I believe you're already familiar with our process, I'll go ahead and start the review. As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description. Please let me know if you have any concerns with this approach. -pichuan
Thank you @pichuan for responding so quickly. I am grateful that you're taking the time to review this work, and I am hopeful that if this and other changes go in, they'll benefit the community, and the environment too!
Hi @nh13, here is an update on my testing results for this PR:
Testing methodology:
Then, I first tested this PR with
In that test, I didn't find a significant runtime difference. Some of the runtimes differed noticeably, but that might be due to run-to-run variation. So I decided to test again with two more differences:
This is actually what we do for every release. We look out for:
md5sum and file size observation:
Runtime summary:
TL;DR:

`gh1086` runtime table

| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 466.96 | 9.642 | 5 | 7m 46s |
| exome | HG003 | call_variants | 162.51 | 27.648 | 5 | 2m 42s |
| exome | HG003 | postprocess_variants | 29.43 | 0.177 | 5 | 29s |
| exome | HG003 | vcf_stats | 5.5 | 0.036 | 5 | 5s |
| exome | HG003 | total | 658.9 | 37.322 | 5 | 10m 58s |
| hybrid-pacbio-illumina | HG003 | make_examples | 13971.72 | 73.979 | 5 | 3h 52m 51s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23544.69 | 43.602 | 5 | 6h 32m 24s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 293.53 | 9.006 | 5 | 4m 53s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 240.81 | 1.289 | 5 | 4m 0s |
| hybrid-pacbio-illumina | HG003 | total | 37809.95 | 87.255 | 5 | 10h 30m 9s |
| ont-r104 | HG003 | make_examples | 12897.07 | 112.368 | 5 | 3h 34m 57s |
| ont-r104 | HG003 | call_variants | 11831.9 | 27.586 | 5 | 3h 17m 11s |
| ont-r104 | HG003 | postprocess_variants | 1013.3 | 9.045 | 5 | 16m 53s |
| ont-r104 | HG003 | vcf_stats | 357.66 | 4.702 | 5 | 5m 57s |
| ont-r104 | HG003 | total | 25742.27 | 102.278 | 5 | 7h 9m 2s |
| pacbio | HG003 | make_examples | 8507.56 | 46.34 | 5 | 2h 21m 47s |
| pacbio | HG003 | call_variants | 6211.3 | 6.875 | 5 | 1h 43m 31s |
| pacbio | HG003 | postprocess_variants | 521.27 | 6.698 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 276.86 | 2.36 | 5 | 4m 36s |
| pacbio | HG003 | total | 15240.12 | 44.084 | 5 | 4h 14m 0s |
| rnaseq | HG005 | make_examples | 1464.11 | 37.572 | 5 | 24m 24s |
| rnaseq | HG005 | call_variants | 96.11 | 1.692 | 5 | 1m 36s |
| rnaseq | HG005 | postprocess_variants | 205.63 | 3.738 | 5 | 3m 25s |
| rnaseq | HG005 | vcf_stats | 4.95 | 0.089 | 5 | 4s |
| rnaseq | HG005 | total | 1765.84 | 42.737 | 5 | 29m 25s |
| wgs | HG003 | make_examples | 10901.99 | 100.289 | 5 | 3h 1m 41s |
| wgs | HG003 | call_variants | 5834.68 | 14.477 | 5 | 1h 37m 14s |
| wgs | HG003 | postprocess_variants | 396.08 | 3.54 | 5 | 6m 36s |
| wgs | HG003 | vcf_stats | 253.03 | 2.444 | 5 | 4m 13s |
| wgs | HG003 | total | 17132.76 | 108.833 | 5 | 4h 45m 32s |

`head937500229` runtime table

| uid | sample | stage | mean_runtime | std_runtime | n_trials | mean_hruntime |
|---|---|---|---|---|---|---|
| exome | HG003 | make_examples | 481.59 | 8.884 | 5 | 8m 1s |
| exome | HG003 | call_variants | 150.94 | 0.808 | 5 | 2m 30s |
| exome | HG003 | postprocess_variants | 27.28 | 0.182 | 5 | 27s |
| exome | HG003 | vcf_stats | 5.54 | 0.058 | 5 | 5s |
| exome | HG003 | total | 659.81 | 9.358 | 5 | 10m 59s |
| hybrid-pacbio-illumina | HG003 | make_examples | 15513.45 | 51.654 | 5 | 4h 18m 33s |
| hybrid-pacbio-illumina | HG003 | call_variants | 23520.08 | 29.808 | 5 | 6h 32m 0s |
| hybrid-pacbio-illumina | HG003 | postprocess_variants | 290.26 | 4.803 | 5 | 4m 50s |
| hybrid-pacbio-illumina | HG003 | vcf_stats | 243.47 | 2.018 | 5 | 4m 3s |
| hybrid-pacbio-illumina | HG003 | total | 39323.79 | 67.138 | 5 | 10h 55m 23s |
| ont-r104 | HG003 | make_examples | 13062.19 | 156.502 | 5 | 3h 37m 42s |
| ont-r104 | HG003 | call_variants | 11814.9 | 7.661 | 5 | 3h 16m 54s |
| ont-r104 | HG003 | postprocess_variants | 1033.73 | 8.6 | 5 | 17m 13s |
| ont-r104 | HG003 | vcf_stats | 361.09 | 3.496 | 5 | 6m 1s |
| ont-r104 | HG003 | total | 25910.83 | 158.26 | 5 | 7h 11m 50s |
| pacbio | HG003 | make_examples | 8615.63 | 155.868 | 5 | 2h 23m 35s |
| pacbio | HG003 | call_variants | 6237.02 | 32.756 | 5 | 1h 43m 57s |
| pacbio | HG003 | postprocess_variants | 521.13 | 8.723 | 5 | 8m 41s |
| pacbio | HG003 | vcf_stats | 280.33 | 1.387 | 5 | 4m 40s |
| pacbio | HG003 | total | 15373.77 | 182.451 | 5 | 4h 16m 13s |
| rnaseq | HG005 | make_examples | 1506.26 | 9.424 | 5 | 25m 6s |
| rnaseq | HG005 | call_variants | 94.85 | 0.066 | 5 | 1m 34s |
| rnaseq | HG005 | postprocess_variants | 204.45 | 1.424 | 5 | 3m 24s |
| rnaseq | HG005 | vcf_stats | 4.94 | 0.057 | 5 | 4s |
| rnaseq | HG005 | total | 1805.56 | 10.267 | 5 | 30m 5s |
| wgs | HG003 | make_examples | 11180.41 | 111.767 | 5 | 3h 6m 20s |
| wgs | HG003 | call_variants | 5825.96 | 5.848 | 5 | 1h 37m 5s |
| wgs | HG003 | postprocess_variants | 406.19 | 1.954 | 5 | 6m 46s |
| wgs | HG003 | vcf_stats | 252.8 | 1.256 | 5 | 4m 12s |
| wgs | HG003 | total | 17412.56 | 110.714 | 5 | 4h 50m 12s |
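
The two tables can be compared stage by stage. The sketch below does this for `make_examples` only, with the mean runtimes (in seconds) copied from the rows above. It assumes, and this is an assumption rather than something the tables state, that `gh1086` is the build with this PR applied and `head937500229` is the baseline build.

```python
# Mean make_examples runtimes in seconds, copied from the two tables.
# Assumption: gh1086 = build with this PR, head937500229 = baseline.
gh1086 = {
    "exome": 466.96,
    "hybrid-pacbio-illumina": 13971.72,
    "ont-r104": 12897.07,
    "pacbio": 8507.56,
    "rnaseq": 1464.11,
    "wgs": 10901.99,
}
head937500229 = {
    "exome": 481.59,
    "hybrid-pacbio-illumina": 15513.45,
    "ont-r104": 13062.19,
    "pacbio": 8615.63,
    "rnaseq": 1506.26,
    "wgs": 11180.41,
}

# Percent reduction in mean runtime relative to the assumed baseline.
reduction_pct = {
    uid: 100.0 * (head937500229[uid] - gh1086[uid]) / head937500229[uid]
    for uid in gh1086
}

for uid, pct in sorted(reduction_pct.items(), key=lambda item: -item[1]):
    print(f"{uid:<24} {pct:5.2f}% lower mean make_examples runtime")
```

Under that assumption, every data type shows a lower mean `make_examples` runtime in the `gh1086` table, from a little over 1% (ont-r104, pacbio) to just under 10% (hybrid-pacbio-illumina). The differences at the low end are close to the reported standard deviations, so they should be read with that in mind.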
Based on this finding, I will recommend to my team that we incorporate this PR. I will send it for internal review in case there is more feedback. I plan to use this as the commit message:
Commit message draft:
perf: Build htslib with libdeflate for faster BGZF/BAM decoding
Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression
uses libdeflate instead of zlib — the same optimization samtools/htslib
ship by default.
Profiling make_examples shows that BAM bgzf decoding is the dominant
codec cost. libdeflate provides significantly faster DEFLATE/gzip
compression and decompression compared to zlib.
Based on #1086
Credit: GitHub user @nh13
I will give another update once it's internally reviewed and submitted.
What
Build the vendored htslib with libdeflate so BGZF (BAM) (de)compression uses libdeflate instead of zlib, the same optimization samtools/htslib ship by default.

Why
Profiling `make_examples` (the CPU-bound stage) shows a large share of time in zlib inflate, decompressing BGZF blocks while reading the input BAM. libdeflate's decoder is substantially faster.

Impact
~7% faster `make_examples` on a 30× WGS `chr22:20,000,000-30,000,000` slice (5:33.7 → 5:10.4, best of 2 reps, same host, back-to-back). `perf` confirms the decode moves to libdeflate (`deflate_decompress_bmi2`) and the zlib share drops from ~22% to ~14% (the remainder is the TensorFlow TFRecord output codec, out of scope here).

Changes (3 files)
- `WORKSPACE`: vendor `libdeflate` 1.20 via `http_archive` (hermetic, pinned sha256), mirroring how htslib is vendored.
- `third_party/libdeflate.BUILD` (new): minimal `cc_library`.
- `third_party/htslib.BUILD`: define `HAVE_LIBDEFLATE` and add `@libdeflate` to htslib's deps.

Correctness
DEFLATE decompression is bit-exact per RFC 1951; libdeflate is MIT-licensed and is htslib's supported codec. No behavior change, only decode speed.
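
To illustrate why swapping the decoder cannot change the output, here is a minimal sketch of one BGZF block being built and decoded. It uses Python's standard `zlib` module as a stand-in decoder, not libdeflate and not htslib's actual code path. The helper names `make_bgzf_block` and `decode_bgzf_block` are hypothetical, and the block layout follows the BGZF description in the SAM/BAM specification as I understand it.

```python
import struct
import zlib


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block: a gzip member with a 'BC' extra subfield."""
    # Negative wbits selects raw DEFLATE (RFC 1951) with no zlib wrapper.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = compressor.compress(payload) + compressor.flush()
    # BSIZE = total block size minus 1 (18-byte header + data + 8-byte trailer).
    bsize = len(cdata) + 25
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B, 8, 4,          # ID1, ID2, CM=deflate, FLG=FEXTRA
        0, 0, 0xFF,                # MTIME, XFL, OS=unknown
        6,                         # XLEN: one 6-byte extra subfield
        ord("B"), ord("C"), 2,     # subfield id 'BC', SLEN=2
        bsize,
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def decode_bgzf_block(block: bytes) -> bytes:
    """Decode one block, assuming 'BC' is the only extra subfield."""
    (xlen,) = struct.unpack_from("<H", block, 10)
    (bsize,) = struct.unpack_from("<H", block, 16)
    trailer_at = bsize + 1 - 8
    cdata = block[12 + xlen:trailer_at]
    crc, isize = struct.unpack_from("<II", block, trailer_at)
    data = zlib.decompress(cdata, -15)
    # The trailer pins the exact decoded bytes: any conforming DEFLATE
    # decoder must reproduce this CRC32 and length.
    if zlib.crc32(data) != crc or len(data) != isize:
        raise ValueError("BGZF block failed CRC32/ISIZE check")
    return data


payload = b"ACGT" * 1000
block = make_bgzf_block(payload)
decoded = decode_bgzf_block(block)
print(len(block), len(decoded), decoded == payload)
```

The point of the sketch is the trailer check: each block carries the CRC32 and length of the uncompressed data, so a decoder that produced different bytes would fail the check rather than silently change downstream results.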
🤖 Generated with Claude Code