
Commit a0cbc46

refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer (#9380)
Drop the 295-line vendor/llama.py fork in favor of `tinygrad.apps.llm`, which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8 quantization), KV cache, and generate loop we were maintaining ourselves.

What changed:

- New vendor/appsllm_adapter.py (~90 LOC) — HF -> GGUF-native state-dict keymap, Transformer kwargs builder, `_embed_hidden` helper, and a hard rejection of qkv_bias models (Qwen2 / 2.5 are no longer supported; the apps.llm Transformer ties `bias=False` on Q/K/V projections).
- backend.py routes both safetensors and GGUF paths through apps.llm.Transformer. Generation now delegates to its (greedy-only) `generate()`; Temperature / TopK / TopP / RepetitionPenalty are still accepted on the wire but ignored — documented in the module docstring.
- Jinja chat render now passes `enable_thinking=False` so Qwen3's reasoning preamble doesn't eat the tool-call token budget on small models.
- Embedding path uses `_embed_hidden` (block stack + output_norm) rather than the custom `embed()` method we were carrying on the vendored Transformer.
- test.py gains TestAppsLLMAdapter covering the keymap rename, tied embedding fallback, unknown-key skipping, and qkv_bias rejection.
- Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B (apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the HF chat template emits hermes-style JSON tool calls).

Verified with the docker-backed targets:

- test-extra-backend-tinygrad 5/5 PASS
- test-extra-backend-tinygrad-embeddings 3/3 PASS
- test-extra-backend-tinygrad-whisper 4/4 PASS
- test-extra-backend-tinygrad-sd 3/3 PASS
1 parent b4e3069 commit a0cbc46

5 files changed

Lines changed: 345 additions & 433 deletions

File tree

Makefile

Lines changed: 5 additions & 5 deletions
```diff
@@ -560,18 +560,18 @@ test-extra-backend-vllm: docker-build-vllm
 ## the `test-extra-backend-tinygrad-all` aggregate.
 test-extra-backend-tinygrad: docker-build-tinygrad
 	BACKEND_IMAGE=local-ai-backend:tinygrad \
-	BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
+	BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
 	BACKEND_TEST_CAPS=health,load,predict,stream,tools \
 	BACKEND_TEST_OPTIONS=tool_parser:hermes \
 	$(MAKE) test-extra-backend

 ## tinygrad — embeddings via LLM last-hidden-state pooling. Reuses the same
-## Qwen2.5-0.5B-Instruct as the chat target so we don't need a separate BERT
-## vendor; the Embedding RPC mean-pools and L2-normalizes the last-layer
-## hidden state.
+## Qwen3-0.6B as the chat target so we don't need a separate BERT vendor;
+## the Embedding RPC mean-pools and L2-normalizes the last-layer hidden
+## state.
 test-extra-backend-tinygrad-embeddings: docker-build-tinygrad
 	BACKEND_IMAGE=local-ai-backend:tinygrad \
-	BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
+	BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
 	BACKEND_TEST_CAPS=health,load,embeddings \
 	$(MAKE) test-extra-backend
```
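The pooling described in the Makefile comment (mean-pool the last-layer hidden state, then L2-normalize) can be sketched with NumPy. The backend itself operates on tinygrad tensors inside `_embed_hidden`; the function name, shapes, and masking here are illustrative:

```python
import numpy as np

def pool_embedding(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool a (seq_len, dim) last-layer hidden state over
    non-padding tokens (mask of 0/1), then L2-normalize the result."""
    summed = (hidden * mask[:, None]).sum(axis=0)
    mean = summed / np.maximum(mask.sum(), 1.0)  # guard against empty mask
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean
```

L2-normalizing up front means downstream cosine similarity reduces to a plain dot product between embeddings.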
