
fix(marketplace): correct nllb/parakeet model packaging and sniff archive magic bytes #565

Open
staging-devin-ai-integration[bot] wants to merge 2 commits into main from devin/1780751954-fix-marketplace-model-packaging

@staging-devin-ai-integration staging-devin-ai-integration Bot commented Jun 6, 2026

Summary

  • The nllb model archive on Hugging Face is actually XZ-compressed but published as .tar.bz2, so the installer routed it into BzDecoder and failed with "Failed to read archive entry". The installer now sniffs the leading magic bytes (xz FD 37 7A 58 5A 00, bzip2 BZh, gzip 1F 8B, zstd 28 B5 2F FD) and prefers the content kind over the extension-derived guess, with unit tests covering the mapping. .tar.xz/.txz archives are now supported via liblzma (statically linked).
  • The nllb manifest now points at nllb-200-distilled-600M-ct2-int8.tar.xz — the same bytes as the existing file, just correctly named, so the sha256 (6c95a9bc…) is unchanged. nllb bumped to 0.3.1 (append-only registry).
  • The parakeet package shipped its 4 model files flat into models/, breaking the model_dir: models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8 expected by the official sample and just download-parakeet-models. The manifest now references a single tarball that extracts to that directory, matching every other ML plugin. parakeet bumped to 0.3.1.
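The sniff-then-override rule from the first bullet can be sketched roughly as below. The names `ModelArchiveKind` and `sniff_compression_kind` follow the excerpts quoted later in this review; the `kind_from_extension` and `resolve_kind` helpers, their signatures (a byte slice instead of a path), and the exact extension list are assumptions made for illustration, not the PR's actual code.

```rust
/// Archive kinds the installer distinguishes (variant names taken from the review excerpt).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelArchiveKind {
    Tar,
    TarGz,
    TarBz2,
    TarXz,
    TarZst,
}

// Magic bytes as listed in the PR summary.
const XZ_MAGIC: [u8; 6] = [0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];
const BZIP2_MAGIC: [u8; 3] = *b"BZh";
const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Hypothetical helper: guess the kind from the file name alone.
fn kind_from_extension(name: &str) -> Option<ModelArchiveKind> {
    let lower = name.to_ascii_lowercase();
    if lower.ends_with(".tar.xz") || lower.ends_with(".txz") {
        Some(ModelArchiveKind::TarXz)
    } else if lower.ends_with(".tar.bz2") {
        Some(ModelArchiveKind::TarBz2)
    } else if lower.ends_with(".tar.gz") || lower.ends_with(".tgz") {
        Some(ModelArchiveKind::TarGz)
    } else if lower.ends_with(".tar.zst") {
        Some(ModelArchiveKind::TarZst)
    } else if lower.ends_with(".tar") {
        Some(ModelArchiveKind::Tar)
    } else {
        None
    }
}

/// Map the leading bytes of a file to a compression kind, if recognised.
fn sniff_compression_kind(header: &[u8]) -> Option<ModelArchiveKind> {
    if header.starts_with(&XZ_MAGIC) {
        return Some(ModelArchiveKind::TarXz);
    }
    // bzip2 streams start with "BZh" followed by a block-size digit.
    if header.starts_with(&BZIP2_MAGIC) && header.get(3).is_some_and(u8::is_ascii_digit) {
        return Some(ModelArchiveKind::TarBz2);
    }
    if header.starts_with(&GZIP_MAGIC) {
        return Some(ModelArchiveKind::TarGz);
    }
    if header.starts_with(&ZSTD_MAGIC) {
        return Some(ModelArchiveKind::TarZst);
    }
    None
}

/// Hypothetical resolver: content wins; the extension is only a fallback.
fn resolve_kind(name: &str, header: &[u8]) -> Option<ModelArchiveKind> {
    sniff_compression_kind(header).or_else(|| kind_from_extension(name))
}
```

With this shape, the mislabeled nllb archive (a `.tar.bz2` name with xz magic bytes) resolves to `TarXz`, while a file with no recognised magic keeps its extension-derived kind.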

Required out-of-band registry steps (maintainer)

The model artifacts live on Hugging Face, which I cannot write to. Before (or together with) merging:

  1. nllb — upload the existing archive under the corrected name (same bytes, no recompression):
    just download-nllb-models   # on main: fetches the old .tar.bz2 name
    mv models/nllb-200-distilled-600M-ct2-int8.tar.bz2 models/nllb-200-distilled-600M-ct2-int8.tar.xz
    HF_TOKEN=… python3 scripts/marketplace/upload_models_to_hf.py --repo streamkit/nllb-models
    Keep the old .tar.bz2 in place so existing registry versions (≤0.3.0) and old checkouts keep working.
  2. parakeet — build and upload the tarball. The manifest sha256 181f96d0dcac111d68c2ee5843655bb08ecb5fe26f845a915c3e4fca7915b9f8 (486,660,829 bytes) corresponds to this reproducible packaging (GNU tar + bzip2 -9):
    just download-parakeet-models
    cd models && tar --sort=name --owner=0 --group=0 --numeric-owner \
        --mtime='2026-01-01 00:00:00 UTC' \
        -cf sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8 \
      && bzip2 -9 sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar && cd ..
    sha256sum models/sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2  # must print 181f96d0…
    HF_TOKEN=… python3 scripts/marketplace/upload_models_to_hf.py --repo streamkit/parakeet-models
    The flat files stay in place for older registry versions.
  3. Run the Marketplace Release workflow to publish the 0.3.1 manifests.

Alternatively, provide me an HF write token and I can perform the uploads.

Review & Validation

  • cargo test -p streamkit-server marketplace_installer (sniffing + extraction tests pass)
  • Confirm the HF artifacts above are uploaded before the registry release; until then 0.3.1 installs will fail with a 404/hash mismatch
  • Note: even without the registry republish, the magic-byte sniffing alone fixes nllb installs of the current mislabeled archive

Notes

  • scripts/marketplace/test_append_only.py is broken on main (it invokes build_registry.py without the now-required --bundle-url-template arg) — preexisting, unrelated to this change.

Fixes #549
Fixes #548

Link to Devin session: https://staging.itsdev.in/sessions/bff20f8daa8045f6a5725cfd58666e2a
Requested by: @streamer45


Devin Review

Status Commit
🟢 Reviewed 0e9bd94

…hive magic bytes

The nllb model archive is XZ-compressed but published as .tar.bz2, so the
installer routed it into BzDecoder and failed mid-extraction. The parakeet
package shipped its model files flat into models/, breaking the model_dir
expected by the official sample and just download-parakeet-models.

- Sniff leading magic bytes (xz/bzip2/gzip/zstd) when extracting model
  archives and prefer the content kind over the extension-derived guess.
- Add .tar.xz/.txz archive support (liblzma).
- Rename the nllb model artifact to .tar.xz (same bytes, same sha256) and
  bump nllb to 0.3.1.
- Repackage parakeet models as a tarball extracting to
  sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/ and bump parakeet to 0.3.1.

Fixes #549
Fixes #548

Signed-off-by: streamkit-devin <devin@streamkit.dev>
@staging-devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.



@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment


Devin Review found 4 potential issues.


Comment thread justfile

🚩 Justfile download-parakeet-models recipe not updated for archive-based packaging

The parakeet model packaging changed from individual files to a single archive sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8.tar.bz2 in plugin.yml and official-plugins.json, but the justfile download-parakeet-models recipe (lines 810-823) still downloads the individual files (encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx, tokens.txt) from Hugging Face. This is likely intentional if the individual files remain available on Hugging Face alongside the archive, but if they are removed in the future, just download-parakeet-models would fail. Worth confirming that the individual files still exist in the Hugging Face repo.

(Refers to lines 810-823)


Intentional — the flat files stay on the Hugging Face repo (the upload steps in the PR description don't remove them), so older registry versions (≤0.3.0) and this recipe keep working. Happy to switch the recipe to the tarball in a follow-up once the archive is uploaded, if preferred.

archive_path = archive_path.display()
)
})?;
let kind = resolve_archive_kind(&archive_path, ext_kind)?;

📝 Info: resolve_archive_kind opens the archive file a second time

In maybe_extract_model_archive, the archive file is opened at line 1633 (std::fs::File::open), then resolve_archive_kind at line 1639 opens the same file again internally (apps/skit/src/marketplace_installer.rs:1988) to read 6 bytes for magic-byte sniffing. The original file handle remains at position 0, so there's no correctness issue — but it does mean two file descriptors are opened for the same file. This could be avoided by reading 6 bytes from the original file and then seeking back to 0, or by restructuring to pass the already-opened file. Minor inefficiency, not a bug.


Acknowledged — kept the second short-lived open to keep resolve_archive_kind self-contained and avoid threading a Seek bound through the extraction path; it reads 6 bytes once per install so the cost is negligible.
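For comparison, the single-handle alternative mentioned in the review (read the leading bytes, then seek back to 0) might look like the sketch below. The function name `sniff_header_and_rewind` and the `SNIFF_LEN` constant are hypothetical; the PR itself keeps the second short-lived open, so this is not what was merged.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Bytes needed to recognise the longest magic in the PR (xz, 6 bytes).
const SNIFF_LEN: usize = 6;

/// Hypothetical helper: read up to SNIFF_LEN leading bytes from an
/// already-open reader, then rewind so extraction starts at offset 0.
fn sniff_header_and_rewind<R: Read + Seek>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut header = Vec::with_capacity(SNIFF_LEN);
    // `take` caps the read at SNIFF_LEN; `read_to_end` handles short files
    // and partial reads without erroring.
    reader
        .by_ref()
        .take(SNIFF_LEN as u64)
        .read_to_end(&mut header)?;
    // Put the cursor back where the decoder expects it.
    reader.seek(SeekFrom::Start(0))?;
    Ok(header)
}
```

The `Read + Seek` bound in the signature is exactly the cost the reply refers to: every caller on the extraction path would need to supply a seekable reader, whereas a separate open only needs the path.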

Comment on lines +137 to 140
"nllb-200-distilled-600M-ct2-int8.tar.xz"
],
"expected_size_bytes": 1135260128,
"license": "CC-BY-NC-4.0",

@staging-devin-ai-integration staging-devin-ai-integration Bot Jun 6, 2026


📝 Info: NLLB sha256 and expected_size_bytes unchanged despite filename change — verified intentional

The NLLB model file reference changed from nllb-200-distilled-600M-ct2-int8.tar.bz2 to .tar.xz but the sha256 and expected_size_bytes values are identical to the base commit. This initially appeared to be an oversight (Parakeet's metadata WAS updated when its format changed). However, the commit message explicitly states "same bytes, same sha256" — the file on HuggingFace was always XZ-compressed data that was mislabeled with a .tar.bz2 extension. The rename corrects the extension to match the actual content. The resolve_archive_kind magic-byte sniffing (apps/skit/src/marketplace_installer.rs:1985-2006) was specifically added to handle this class of mislabeled archive, providing resilience during any transition period where the extension might not match the content.


The unchanged size/sha is intentional, but for the opposite reason: the file on Hugging Face is already XZ (verified its magic bytes: fd 37 7a 58 5a 00) — only its name is wrong. The fix is to upload the same bytes under the corrected .tar.xz name, so expected_size_bytes and sha256 stay identical. No recompression is planned.

Comment thread apps/skit/src/marketplace_installer.rs
@codecov

codecov Bot commented Jun 6, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.24%. Comparing base (efd8469) to head (0e9bd94).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                  Patch %   Lines
apps/skit/src/marketplace_installer.rs    94.73%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
+ Coverage   82.22%   82.24%   +0.01%     
==========================================
  Files         236      236              
  Lines       70183    70258      +75     
  Branches     1846     1846              
==========================================
+ Hits        57711    57786      +75     
  Misses      12466    12466              
  Partials        6        6              
Flag      Coverage Δ
backend   82.24% <94.73%> (+0.02%) ⬆️


Components      Coverage Δ
core            85.61% <ø> (ø)
engine          83.34% <ø> (-0.09%) ⬇️
api             90.06% <ø> (ø)
nodes           82.52% <ø> (+0.01%) ⬆️
server          80.58% <94.73%> (+0.05%) ⬆️
plugin-native   83.70% <ø> (ø)
plugin-wasm     92.20% <ø> (ø)
ui-services     84.69% <ø> (ø)
ui-components   60.49% <ø> (ø)

Files with missing lines                  Coverage Δ
apps/skit/src/marketplace_installer.rs    46.90% <94.73%> (+1.95%) ⬆️

... and 3 files with indirect coverage changes


Signed-off-by: streamkit-devin <devin@streamkit.dev>

@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +1965 to +1978
fn sniff_compression_kind(header: &[u8]) -> Option<ModelArchiveKind> {
    if header.starts_with(&XZ_MAGIC) {
        return Some(ModelArchiveKind::TarXz);
    }
    if header.starts_with(&BZIP2_MAGIC) && header.get(3).is_some_and(u8::is_ascii_digit) {
        return Some(ModelArchiveKind::TarBz2);
    }
    if header.starts_with(&GZIP_MAGIC) {
        return Some(ModelArchiveKind::TarGz);
    }
    if header.starts_with(&ZSTD_MAGIC) {
        return Some(ModelArchiveKind::TarZst);
    }
    None

📝 Info: Plain tar files have negligible false-positive risk from magic sniffing

When ext_kind is Tar (uncompressed), sniff_compression_kind examines the first 6 bytes of the file. In a plain tar archive, these bytes are the start of the first entry's filename. A false positive would require the filename to start with compression magic bytes (e.g., \xfd7zXZ\x00 for XZ, BZh + digit for bz2, \x1f\x8b for gzip, or \x28\xb5\x2f\xfd for zstd). All of these contain non-printable or unusual byte sequences that are extremely unlikely in model filenames. The bzip2 check at apps/skit/src/marketplace_installer.rs:1969 additionally requires byte 3 to be an ASCII digit, further reducing false-positive risk. This design is sound.
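The false-positive argument can be checked with a small experiment. The sketch below builds the first header block of a plain tar archive, where the NUL-padded entry name occupies the first 100 bytes, and tests its first 6 bytes against the same four magics. Both `looks_compressed` and `tar_header_for` are hypothetical helpers written for this illustration; they are not part of the PR.

```rust
/// Hypothetical check: do the leading bytes match any of the four
/// compression magics used by the installer's sniffing?
fn looks_compressed(header: &[u8]) -> bool {
    const XZ: &[u8] = &[0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];
    const GZIP: &[u8] = &[0x1F, 0x8B];
    const ZSTD: &[u8] = &[0x28, 0xB5, 0x2F, 0xFD];
    // bzip2 needs "BZh" plus an ASCII digit in byte 3.
    let bzip2 = header.starts_with(b"BZh") && header.get(3).is_some_and(u8::is_ascii_digit);
    header.starts_with(XZ) || bzip2 || header.starts_with(GZIP) || header.starts_with(ZSTD)
}

/// Hypothetical helper: a 512-byte tar header block containing only the
/// entry name field (bytes 0..100); all other fields are left zeroed.
fn tar_header_for(name: &str) -> [u8; 512] {
    let mut block = [0u8; 512];
    let bytes = name.as_bytes();
    let len = bytes.len().min(100);
    block[..len].copy_from_slice(&bytes[..len]);
    block
}
```

A realistic first entry such as `sherpa-onnx-nemo-parakeet-tdt-0.6b-v3-int8/encoder.int8.onnx` does not match. The only printable-ASCII collision is a name beginning with `BZh` followed by a digit (for example `BZh9-weights.bin`), which is the residual risk the digit check narrows but cannot remove.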

