Skip to content

fix: compress uncompressed read files before upload#84

Merged
igboyes merged 1 commit into
virtool:mainfrom
ReeceHoffmann:reecehoffmann/vir-2348-sample-creation-fails-with-file-is-not-compressed-when-input
May 22, 2026
Merged

fix: compress uncompressed read files before upload#84
igboyes merged 1 commit into
virtool:mainfrom
ReeceHoffmann:reecehoffmann/vir-2348-sample-creation-fails-with-file-is-not-compressed-when-input

Conversation

@ReeceHoffmann
Copy link
Copy Markdown
Member

  • Check if each input file is gzipped; if not, compress it before upload
  • Adapts finalize step to handle both compressed and uncompressed user inputs
  • Resolves VIR-2348: sample creation failures when input FASTQ is not compressed
  • Adds proc parameter to finalize for pigz parallelization during compression

- Check if each input file is gzipped; if not, compress it before upload
- Adapts finalize step to handle both compressed and uncompressed user inputs
- Resolves VIR-2348: sample creation failures when input FASTQ is not compressed
- Adds `proc` parameter to finalize for pigz parallelization during compression
@ReeceHoffmann ReeceHoffmann requested a review from igboyes as a code owner May 22, 2026 16:05
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the finalize step in workflow.py to handle uncompressed read files by compressing them before upload. It introduces the proc parameter and uses is_gzipped to check file status. Feedback suggests parallelizing the compression and upload processes using asyncio.gather to improve efficiency, along with adding a check to prevent renaming a file to itself.

Comment thread workflow.py
Comment on lines 67 to +73
for i, path in enumerate(read_paths):
new_path = await asyncio.to_thread(path.rename, f"reads_{i + 1}.fq.gz")
await new_sample.upload(new_path)
target = path.with_name(f"reads_{i + 1}.fq.gz")
if is_gzipped(path):
await asyncio.to_thread(path.rename, target)
else:
await asyncio.to_thread(compress_file, path, target, proc)
await new_sample.upload(target)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation processes each read file (compression and upload) sequentially. For paired-end samples, this means the second file's compression only starts after the first file has finished its upload. Since compression is CPU-bound and uploading is network-bound, these operations can be overlapped to improve performance. Using asyncio.gather would allow the workflow to process both files in parallel, which is more efficient and safe given that Virtool samples typically have at most two read files. Additionally, adding a check to ensure path != target before renaming prevents potential issues on systems where renaming a file to itself might cause errors or unnecessary I/O.

Suggested change
for i, path in enumerate(read_paths):
new_path = await asyncio.to_thread(path.rename, f"reads_{i + 1}.fq.gz")
await new_sample.upload(new_path)
target = path.with_name(f"reads_{i + 1}.fq.gz")
if is_gzipped(path):
await asyncio.to_thread(path.rename, target)
else:
await asyncio.to_thread(compress_file, path, target, proc)
await new_sample.upload(target)
async def process_path(i, path):
target = path.with_name(f"reads_{i + 1}.fq.gz")
if is_gzipped(path):
if path != target:
await asyncio.to_thread(path.rename, target)
else:
await asyncio.to_thread(compress_file, path, target, proc)
await new_sample.upload(target)
await asyncio.gather(
*(process_path(i, path) for i, path in enumerate(read_paths))
)

Copy link
Copy Markdown
Member Author

@ReeceHoffmann ReeceHoffmann May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. re gather: Technically there is a core of a good idea here given that these steps are rate limited by different resources. As written though it will just try to spin up to compression instances using all cores for both. If we want the performance benefit we can introduce a pipelined approach that actaully separates the compression and upload steps, but imo the performance is probably not worth the complexity here. esp given that most samples are only a single file
  2. re path!= target guard. In linux I think think this just prevents a no-op. I'm okay with this change, just not sure it really does anything

@igboyes igboyes merged commit 4bc7caa into virtool:main May 22, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants