Support Snappy compression and a configurable GZIP level for make_examples output #1088
Open
tfenne wants to merge 2 commits into
Conversation
…amples

make_examples has always written its tf.Example output as GZIP-compressed TFRecords at the zlib default level. On many-core machines the GZIP step is a meaningful slice of make_examples CPU, and when output throughput is provisioned generously the cheaper Snappy codec is the better trade. This adds that control.

The compression codec is inferred from the examples file-name suffix, so the file name is the single source of truth for both writing and reading: a ".snappy" examples path selects Snappy, any other suffix keeps the historical GZIP behaviour. The C++ writer (nucleus::ExampleWriter, via the new CompressionTypeForPath) and every examples reader (call_variants, data_providers, show_examples, and the shape probe in dv_utils, via dv_utils.compression_type_for_examples_path) derive the codec the same way, so a file can never be written with one codec and read with another.

A new --examples_compression_level flag (MakeExamplesOptions.examples_compression_level) exposes the GZIP level (-1 or 0..9; -1 = library default). It only affects GZIP output; the proto field is declared optional so an unset value is distinguishable from a deliberate level 0, and a level supplied alongside a Snappy path is ignored with a warning.

The auxiliary make_examples outputs (call_variant_outputs, small_model_examples) are written through the Python TFRecord writer, which cannot emit Snappy, so they remain GZIP and their file names are forced to ".gz" to stay consistent with their bytes.

Tested: C++ example_writer_test (codec detection, Snappy round-trip, GZIP level applied), dv_utils_test (suffix detection), make_examples_core_test (side-output renaming), and an end-to-end make_examples -> call_variants round-trip for both Snappy and GZIP examples.
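As a rough illustration of the suffix rule described in this commit message, the detection could look something like the sketch below. The function name `codec_for_examples_path` and the regular expression are hypothetical and are not taken from the PR; the assumption that a comma-separated list shares one codec (so only the first entry is inspected) is also mine.

```python
import re

# Hypothetical helper for illustration only; it is not the PR's
# CompressionTypeForPath or dv_utils.compression_type_for_examples_path.

# A shard spec that trails the file name, e.g. "@16" or "-00003-of-00016".
_TRAILING_SHARD_SPEC = re.compile(r"(@\d+|-\d{5}-of-\d{5})$")


def codec_for_examples_path(path: str) -> str:
    """Return "SNAPPY" for a .snappy examples path, otherwise "GZIP"."""
    # Assumption: every entry of a comma-separated list uses the same codec,
    # so inspecting the first entry is enough.
    first = path.split(",")[0].strip()
    # Drop a shard spec that follows the suffix (e.g. "x.tfrecord.snappy@16").
    first = _TRAILING_SHARD_SPEC.sub("", first)
    # Case-insensitive suffix match; any other suffix keeps GZIP.
    return "SNAPPY" if first.lower().endswith(".snappy") else "GZIP"
```

Because the writer and every reader would call the same rule on the same path, there is no second setting that could drift out of sync with the file name.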
d9fe074 to
65cf535
Compare
The compression work added a suffix-inferred codec and --examples_compression_level to make_examples, but the run_deepvariant wrapper still hardcoded .gz example names and did not surface the level, so Snappy was reachable only by invoking make_examples directly. This wires both controls through the orchestration wrapper.
--examples_compression {GZIP,SNAPPY} selects the suffix of the single shared examples path, which feeds both make_examples (writer) and call_variants (reader) so they stay aligned; the auxiliary outputs (gVCF, call_variant_outputs, small_model) remain GZIP. --examples_compression_level forwards the GZIP level to make_examples for GZIP output only, and check_flags validates the range and warns if a level is supplied alongside SNAPPY.
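A minimal sketch of how the wrapper-side logic described above might fit together: one helper maps the codec choice to the suffix of the shared examples path, and another validates the level and drops it (with a warning) for Snappy. The names `examples_path_for` and `resolve_compression_level` are hypothetical, and the real `check_flags` in `run_deepvariant` may be structured differently.

```python
import logging
from typing import Optional

# Hypothetical mapping from the --examples_compression choice to a suffix.
_SUFFIX_FOR_CODEC = {"GZIP": ".gz", "SNAPPY": ".snappy"}


def examples_path_for(base: str, codec: str) -> str:
    """Build the single examples path shared by the writer and the reader."""
    try:
        return base + _SUFFIX_FOR_CODEC[codec.upper()]
    except KeyError:
        raise ValueError(f"Unsupported examples compression: {codec!r}")


def resolve_compression_level(codec: str, level: Optional[int]) -> Optional[int]:
    """Validate the GZIP level and return the value to forward, if any."""
    if level is None:
        return None  # Unset: the writer falls back to the library default.
    if not -1 <= level <= 9:
        raise ValueError(f"Compression level must be -1 or 0..9, got {level}")
    if codec.upper() == "SNAPPY":
        # Snappy has no levels, so the value is ignored rather than forwarded.
        logging.warning("Compression level %d is ignored for SNAPPY output.", level)
        return None
    return level
```

Keeping the level as `None` when unset mirrors the `optional` proto field: "not provided" stays distinguishable from a deliberate level 0.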
Author
Pushed an extra commit this morning that exposes the compression options in
Collaborator
Hi @tfenne, thanks for the PR! Since I believe you're already familiar with our process, I'll go ahead and start the review. As a reminder, because of the way our project is set up, we aren't able to merge GitHub PRs directly. If the changes look good, I will commit them, crediting your GitHub username and referencing this PR in the commit description. Please let me know if you have any concerns with this approach. -pichuan
Summary
`make_examples` has always written its `tf.Example` output as GZIP-compressed TFRecords at zlib's default level, with the codec hardcoded on both the write and read sides. On many-core machines the GZIP step is a meaningful slice of `make_examples` CPU time, and when output throughput is provisioned generously the cheaper Snappy codec is often the better trade. This PR makes the examples codec selectable and exposes the GZIP level: the codec is keyed off the examples file name, and the level is set by a new option.

Going from gzip level 6 to level 1 on chr20 saved ~7% of `make_examples` runtime, at the cost of about doubling the output size. Going from gzip level 6 to Snappy saved 10% of runtime, at the cost of about 4x more storage. Users can choose the trade-off that fits their available storage, and the trade-off is much more modest now that the small model handles a large fraction of calls.
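To get a feel for the level-versus-size trade-off described above, one could time zlib at a few levels on a sample payload. The sketch below uses Python's standard `zlib` module on synthetic data; the helper names and the payload are assumptions for illustration, and the numbers it prints say nothing about real `tf.Example` records or the chr20 figures quoted in this PR.

```python
import random
import time
import zlib


def measure_levels(payload: bytes, levels=(0, 1, 6, 9)):
    """Return {level: (compressed_size_bytes, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        results[level] = (len(compressed), elapsed)
    return results


def synthetic_payload(n_bytes: int = 200_000, seed: int = 0) -> bytes:
    """Synthetic stand-in data; real example records compress differently."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(n_bytes))


if __name__ == "__main__":
    for level, (size, seconds) in measure_levels(synthetic_payload()).items():
        print(f"level={level} size={size} bytes time={seconds * 1e3:.1f} ms")
```

Level 0 stores the data without deflating it, which is why an unset level must not silently collapse to 0.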
Design: the file name is the single source of truth
The compression codec is inferred from the examples file-name suffix rather than from a separate flag:
- a `*.snappy` examples path selects Snappy;
- any other suffix keeps the historical GZIP behaviour.

The C++ writer (`nucleus::ExampleWriter`, via the new `CompressionTypeForPath`) and every examples reader (`call_variants`, `data_providers`, `show_examples`, and the shape probe in `dv_utils`, via the new `dv_utils.compression_type_for_examples_path`) derive the codec the same way. Detection is case-insensitive and handles the usual sharded (`@N`, `-ddddd-of-ddddd`) and comma-separated path forms.

A new `--examples_compression_level` flag (proto `MakeExamplesOptions.examples_compression_level`) exposes the GZIP level (`-1` or `0..9`; `-1` = library default). It only affects GZIP output. The proto field is declared `optional` so an unset value is distinguishable from a deliberate level 0 (otherwise a non-flag caller would silently get level 0 / no deflate), and a level supplied alongside a `.snappy` path is ignored with a warning (Snappy has no levels).

Backward compatibility
Default behaviour is unchanged: with no `.snappy` suffix and no level flag, output is GZIP at the library default level, exactly as before. The auxiliary `make_examples` outputs (`call_variant_outputs`, `small_model_examples`) are written through the Python TFRecord writer, which cannot emit Snappy, so they remain GZIP, and their file names are forced to `.gz` even when the main examples are Snappy, so each file's name matches its actual bytes.

What changed
- `third_party/nucleus/io/example_writer.{h,cc}`: codec inferred from the path; `compression_level` plumbed into `RecordWriterOptions.zlib_options`; new public `CompressionTypeForPath`.
- `deepvariant/protos/deepvariant.proto`: `optional int32 examples_compression_level`.
- `deepvariant/make_examples_options.py`: `--examples_compression_level` flag + range validator.
- `deepvariant/make_examples_native.cc`: passes the level (library default when unset).
- `deepvariant/dv_utils.py`, `call_variants.py`, `data_providers.py`, `show_examples.py`: readers autodetect the codec from the suffix.
- `deepvariant/make_examples_core.py`: side outputs forced to `.gz`.

Testing
- `third_party/nucleus/io:example_writer_test`: codec detection, Snappy write/read round-trip, and GZIP level actually applied (level 0 file larger than level 9).
- `deepvariant:dv_utils_test`: suffix detection across casing and sharding forms.
- `deepvariant:make_examples_core_test`: side-output `.snappy` → `.gz` renaming (incl. uppercase).
- `make_examples` → `call_variants` round-trip on a chr20 region for both Snappy and GZIP examples (and an explicit GZIP level), confirming `call_variants` reads Snappy output and that an out-of-range level is rejected.
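For readers who want a concrete picture of the side-output renaming that the tests cover, a minimal sketch follows. The helper name `force_gzip_suffix` is hypothetical; the actual logic in `make_examples_core.py` may differ in name and detail.

```python
import re

# Hypothetical helper for illustration; not the PR's actual function.
_SNAPPY_SUFFIX = re.compile(r"\.snappy$", re.IGNORECASE)


def force_gzip_suffix(path: str) -> str:
    """Rewrite a trailing .snappy (any casing) to .gz; leave other paths alone.

    Side outputs are always written as GZIP, so the name is adjusted to match
    the bytes that will actually be on disk.
    """
    return _SNAPPY_SUFFIX.sub(".gz", path)


if __name__ == "__main__":
    for name in ("cvo.tfrecord@8.snappy", "cvo.tfrecord.SNAPPY", "cvo.tfrecord.gz"):
        print(name, "->", force_gzip_suffix(name))
```

Anchoring the match at the end of the path keeps a directory or file stem that merely contains "snappy" untouched.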