fix(marketplace): correct nllb/parakeet model packaging and sniff archive magic bytes #565
The nllb model archive is XZ-compressed but published as .tar.bz2, so the installer routed it into BzDecoder and failed mid-extraction. The parakeet package shipped its model files flat into models/, breaking the model_dir expected by the official sample and just download-parakeet-models.

- Sniff leading magic bytes (xz/bzip2/gzip/zstd) when extracting model archives and prefer the content kind over the extension-derived guess.
- Add .tar.xz/.txz archive support (liblzma).
- Rename the nllb model artifact to .tar.xz (same bytes, same sha256) and bump nllb to 0.3.1.
- Repackage parakeet models as a tarball extracting to sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/ and bump parakeet to 0.3.1.

Fixes #549
Fixes #548

Signed-off-by: streamkit-devin <devin@streamkit.dev>
🚩 Justfile download-parakeet-models recipe not updated for archive-based packaging
The parakeet model packaging changed from individual files to a single archive, sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2, in plugin.yml and official-plugins.json, but the justfile download-parakeet-models recipe (lines 810-823) still downloads individual files (encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx, tokens.txt) from HuggingFace. This is likely intentional if the individual files remain available on HuggingFace alongside the archive, but if they are removed in the future, just download-parakeet-models would fail. It is worth confirming that the individual files still exist on the HuggingFace repo.
(Refers to lines 810-823)
Intentional — the flat files stay on the Hugging Face repo (the upload steps in the PR description don't remove them), so older registry versions (≤0.3.0) and this recipe keep working. Happy to switch the recipe to the tarball in a follow-up once the archive is uploaded, if preferred.
            archive_path = archive_path.display()
        )
    })?;
    let kind = resolve_archive_kind(&archive_path, ext_kind)?;
📝 Info: resolve_archive_kind opens the archive file a second time
In maybe_extract_model_archive, the archive file is opened at line 1633 (std::fs::File::open), then resolve_archive_kind at line 1639 opens the same file again internally (apps/skit/src/marketplace_installer.rs:1988) to read 6 bytes for magic-byte sniffing. The original file handle remains at position 0, so there's no correctness issue — but it does mean two file descriptors are opened for the same file. This could be avoided by reading 6 bytes from the original file and then seeking back to 0, or by restructuring to pass the already-opened file. Minor inefficiency, not a bug.
Acknowledged — kept the second short-lived open to keep resolve_archive_kind self-contained and avoid threading a Seek bound through the extraction path; it reads 6 bytes once per install so the cost is negligible.
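For comparison, the single-handle alternative mentioned in the review could look roughly like the sketch below. It is not the PR's code: `peek_header` and `HEADER_LEN` are hypothetical names, and the sketch only assumes a reader that implements `Read + Seek` (the bound the reply prefers not to thread through the extraction path).

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Number of leading bytes inspected (hypothetical constant; 6 covers the
/// longest magic discussed in this PR, the xz one).
const HEADER_LEN: usize = 6;

/// Reads up to `HEADER_LEN` leading bytes from `reader`, then rewinds to the
/// start so the same handle can be handed to a decoder afterwards.
/// Returns the buffer and the number of bytes actually read.
fn peek_header<R: Read + Seek>(reader: &mut R) -> io::Result<([u8; HEADER_LEN], usize)> {
    let mut buf = [0u8; HEADER_LEN];
    let mut filled = 0;
    // `read` may return short counts, so loop until the buffer is full or EOF.
    while filled < HEADER_LEN {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    // Rewind so the caller sees the stream from byte 0 again.
    reader.seek(SeekFrom::Start(0))?;
    Ok((buf, filled))
}
```

The trade-off is the one described above: this avoids a second open, at the price of requiring `Seek` on whatever reader the extraction path receives.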
        "nllb-200-distilled-600M-ct2-int8.tar.xz"
      ],
      "expected_size_bytes": 1135260128,
      "license": "CC-BY-NC-4.0",
📝 Info: NLLB sha256 and expected_size_bytes unchanged despite filename change — verified intentional
The NLLB model file reference changed from nllb-200-distilled-600M-ct2-int8.tar.bz2 to .tar.xz but the sha256 and expected_size_bytes values are identical to the base commit. This initially appeared to be an oversight (Parakeet's metadata WAS updated when its format changed). However, the commit message explicitly states "same bytes, same sha256" — the file on HuggingFace was always XZ-compressed data that was mislabeled with a .tar.bz2 extension. The rename corrects the extension to match the actual content. The resolve_archive_kind magic-byte sniffing (apps/skit/src/marketplace_installer.rs:1985-2006) was specifically added to handle this class of mislabeled archive, providing resilience during any transition period where the extension might not match the content.
The unchanged size/sha is intentional, but for the opposite reason: the file on Hugging Face is already XZ (verified its magic bytes: fd 37 7a 58 5a 00) — only its name is wrong. The fix is to upload the same bytes under the corrected .tar.xz name, so expected_size_bytes and sha256 stay identical. No recompression is planned.
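The magic-byte check described in this reply can be reproduced locally with a few lines of Rust. This is a minimal sketch using only the standard library; `file_starts_with_xz_magic` is a hypothetical helper, not a function from the PR, and the magic value is the one quoted above.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// xz stream magic: FD 37 7A 58 5A 00.
const XZ_MAGIC: [u8; 6] = [0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];

/// Returns true if the file at `path` begins with the xz magic bytes,
/// regardless of what its extension claims.
fn file_starts_with_xz_magic(path: &Path) -> io::Result<bool> {
    let mut header = Vec::with_capacity(XZ_MAGIC.len());
    // `take` caps the read at 6 bytes; files shorter than that simply
    // yield a shorter header, which cannot equal the magic.
    File::open(path)?
        .take(XZ_MAGIC.len() as u64)
        .read_to_end(&mut header)?;
    Ok(header == XZ_MAGIC)
}
```

Because the check looks only at content, it gives the same answer for the file under its old .tar.bz2 name and under the corrected .tar.xz name.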
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
+ Coverage   82.22%   82.24%   +0.01%
==========================================
  Files         236      236
  Lines       70183    70258      +75
  Branches     1846     1846
==========================================
+ Hits        57711    57786      +75
  Misses      12466    12466
  Partials        6        6
Signed-off-by: streamkit-devin <devin@streamkit.dev>
    fn sniff_compression_kind(header: &[u8]) -> Option<ModelArchiveKind> {
        if header.starts_with(&XZ_MAGIC) {
            return Some(ModelArchiveKind::TarXz);
        }
        if header.starts_with(&BZIP2_MAGIC) && header.get(3).is_some_and(u8::is_ascii_digit) {
            return Some(ModelArchiveKind::TarBz2);
        }
        if header.starts_with(&GZIP_MAGIC) {
            return Some(ModelArchiveKind::TarGz);
        }
        if header.starts_with(&ZSTD_MAGIC) {
            return Some(ModelArchiveKind::TarZst);
        }
        None
📝 Info: Plain tar files have negligible false-positive risk from magic sniffing
When ext_kind is Tar (uncompressed), sniff_compression_kind examines the first 6 bytes of the file. In a plain tar archive, these bytes are the start of the first entry's filename. A false positive would require the filename to start with compression magic bytes (e.g., \xfd7zXZ\x00 for XZ, BZh + digit for bz2, \x1f\x8b for gzip, or \x28\xb5\x2f\xfd for zstd). All of these except the bzip2 prefix contain non-printable bytes that are extremely unlikely in model filenames. The bzip2 check at apps/skit/src/marketplace_installer.rs:1969 additionally requires the byte at index 3 to be an ASCII digit, which reduces the false-positive risk for that printable prefix. This design is sound.
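The behaviour discussed in this thread can be illustrated with a self-contained sketch. It is a simplification under stated assumptions, not the code in marketplace_installer.rs: `ArchiveKind`, `kind_from_name`, and `resolve` are hypothetical stand-ins (the real `resolve_archive_kind` takes a path and returns a `Result`), the extension list is illustrative, and the magic constants are filled in from the byte values quoted in the PR description.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArchiveKind {
    Tar,
    TarGz,
    TarBz2,
    TarXz,
    TarZst,
}

// Magic values as quoted in the PR description.
const XZ_MAGIC: [u8; 6] = [0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];
const BZIP2_MAGIC: [u8; 3] = *b"BZh";
const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Content-based detection, mirroring the shape of the function under review.
fn sniff(header: &[u8]) -> Option<ArchiveKind> {
    if header.starts_with(&XZ_MAGIC) {
        return Some(ArchiveKind::TarXz);
    }
    // "BZh" is printable, so also require the block-size digit that follows it.
    if header.starts_with(&BZIP2_MAGIC) && header.get(3).is_some_and(u8::is_ascii_digit) {
        return Some(ArchiveKind::TarBz2);
    }
    if header.starts_with(&GZIP_MAGIC) {
        return Some(ArchiveKind::TarGz);
    }
    if header.starts_with(&ZSTD_MAGIC) {
        return Some(ArchiveKind::TarZst);
    }
    None
}

/// Extension-derived guess (hypothetical mapping; the real extension list may differ).
fn kind_from_name(name: &str) -> Option<ArchiveKind> {
    let lower = name.to_ascii_lowercase();
    if lower.ends_with(".tar.xz") || lower.ends_with(".txz") {
        Some(ArchiveKind::TarXz)
    } else if lower.ends_with(".tar.bz2") {
        Some(ArchiveKind::TarBz2)
    } else if lower.ends_with(".tar.gz") || lower.ends_with(".tgz") {
        Some(ArchiveKind::TarGz)
    } else if lower.ends_with(".tar.zst") {
        Some(ArchiveKind::TarZst)
    } else if lower.ends_with(".tar") {
        Some(ArchiveKind::Tar)
    } else {
        None
    }
}

/// Prefer what the bytes say; fall back to the extension when nothing matches.
fn resolve(ext_kind: ArchiveKind, header: &[u8]) -> ArchiveKind {
    sniff(header).unwrap_or(ext_kind)
}
```

With this shape, an archive named .tar.bz2 whose header is the xz magic resolves to `TarXz`, while a plain tar whose first entry name starts with ordinary text such as "models" matches no magic and keeps its extension-derived kind.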
Summary
- The nllb model archive is XZ-compressed but published as .tar.bz2, so the installer routed it into BzDecoder and failed with "Failed to read archive entry". The installer now sniffs the leading magic bytes (xz FD 37 7A 58 5A 00, bzip2 BZh, gzip 1F 8B, zstd 28 B5 2F FD) and prefers the content kind over the extension-derived guess, with unit tests covering the mapping.
- .tar.xz/.txz archives are now supported via liblzma (statically linked).
- The nllb model artifact is renamed to nllb-200-distilled-600M-ct2-int8.tar.xz: the same bytes as the existing file, just correctly named, so the sha256 (6c95a9bc…) is unchanged. nllb bumped to 0.3.1 (append-only registry).
- The parakeet package shipped its model files flat into models/, breaking the model_dir: models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8 expected by the official sample and just download-parakeet-models. The manifest now references a single tarball that extracts to that directory, matching every other ML plugin. parakeet bumped to 0.3.1.

Required out-of-band registry steps (maintainer)
The model artifacts live on Hugging Face, which I cannot write to. Before (or together with) merging:
    just download-nllb-models   # on main: fetches the old .tar.bz2 name
    mv models/nllb-200-distilled-600M-ct2-int8.tar.bz2 models/nllb-200-distilled-600M-ct2-int8.tar.xz
    HF_TOKEN=… python3 scripts/marketplace/upload_models_to_hf.py --repo streamkit/nllb-models

Leave the old .tar.bz2 in place so existing registry versions (≤0.3.0) and old checkouts keep working.

The parakeet tarball's sha256 181f96d0dcac111d68c2ee5843655bb08ecb5fe26f845a915c3e4fca7915b9f8 (486,660,829 bytes) corresponds to a reproducible packaging (GNU tar + bzip2 -9).

Alternatively, provide me an HF write token and I can perform the uploads.
Review & Validation
cargo test -p streamkit-server marketplace_installer (sniffing + extraction tests pass)

Notes

scripts/marketplace/test_append_only.py is broken on main (it invokes build_registry.py without the now-required --bundle-url-template arg); this is preexisting and unrelated to this change.

Fixes #549
Fixes #548
Link to Devin session: https://staging.itsdev.in/sessions/bff20f8daa8045f6a5725cfd58666e2a
Requested by: @streamer45
Devin Review
0e9bd94