Add a speculative-decoding serving example for Qwen3-8B #307
Merged
dennis-upbound merged 1 commit on Jun 25, 2026
Conversation
negz reviewed on Jun 24, 2026
Comment on lines +1 to +4
# Exposes the qwen3-8b-base deployment as an OpenAI-compatible URL, so the
# benchmark can hit the baseline engine the same way it hits the speculative one.
# Read its address from status.address:
#   kubectl get ms qwen3-8b-base -n ml-team -o jsonpath='{.status.address}'
Collaborator
Does this duplicate the existing example?
Do we need to include everything required here to actually run the experiment, or should we just include the final result? I would prefer the latter.
036b471 to 31f83cf
31f83cf to 86783c8
negz reviewed on Jun 24, 2026
@@ -0,0 +1,38 @@
# No-speculation baseline for the benchmark: Qwen3-8B on a single L4 without
Collaborator
@pluna I think this should be removed too. 😄 Once this is gone we can merge.
Collaborator
Author
@negz The existing Qwen3-8B example has a different deployment configuration. It sets the `tool-call-parser` and `reasoning-parser` args, which I didn't want in the speculative-decoding example, so I created new base and spec deployments that differ in only one thing: `--speculative-config`.
The Examples page had no recipe for speculative decoding, a common way to cut single-GPU decode latency. This adds one for n-gram (prompt-lookup) speculation, which proposes tokens by matching the prompt and so runs as a single Standalone vLLM engine with no draft model or extra cached weights.

The page leads with the measured impact: on a copy-heavy edit workload the speculative engine roughly doubles output-token throughput (16 to 39 tok/s) and halves the time per output token, accepting 65% of drafted tokens. I ran it end to end on GKE: the cluster provisioned and the deployment served a valid completion, and the numbers come from that run.

The platform side reuses the Qwen3-8B example rather than restating it, so the example carries only what is unique to speculative decoding: the engine's `--speculative-config`, its ModelDeployment, and its ModelService.

Signed-off-by: Pablo Luna <pablo@upbound.io>
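For readers unfamiliar with prompt-lookup speculation, the idea of "proposing tokens by matching the prompt" can be sketched roughly as below. This is an illustrative toy, not vLLM's implementation: the function names, the parameters (`max_ngram`, `min_ngram`, `num_draft`), and the greedy acceptance rule are all assumptions made for the sketch.

```python
from typing import List, Sequence


def propose_ngram_draft(
    tokens: Sequence[int],
    max_ngram: int = 3,   # hypothetical default, not taken from the PR
    min_ngram: int = 1,   # hypothetical default
    num_draft: int = 4,   # hypothetical default
) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the sequence.

    `tokens` is the prompt plus everything generated so far. If the last n
    tokens also occur earlier, the tokens that followed that earlier
    occurrence are returned as the draft.
    """
    n_tokens = len(tokens)
    # Try longer suffixes first: a longer match is stronger evidence of copying.
    for n in range(min(max_ngram, n_tokens - 1), min_ngram - 1, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan earlier start positions, most recent first, skipping the suffix itself.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = start + n
                return list(tokens[follow:follow + num_draft])
    return []  # no match: fall back to ordinary one-token decoding


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Under greedy decoding, accept draft tokens up to the first mismatch
    with the tokens the target model would have produced itself."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 1, 2, 3]
    draft = propose_ngram_draft(history)
    print(draft)
    print(count_accepted(draft, [4, 5, 9, 9]))
```

No second model is involved: the draft comes from a lookup over tokens already in context, which is why the approach suits copy-heavy workloads such as edits.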
86783c8 to db0c670
dennis-upbound approved these changes on Jun 25, 2026
Description of your changes
The Examples page covers single-engine, multi-node, and disaggregated serving, but has nothing for speculative decoding, a common way to cut single-GPU decode latency. This adds an example for n-gram (prompt-lookup) speculation, which proposes tokens by matching the prompt and so runs as one `Standalone` vLLM engine with no draft model or extra cached weights.

The page serves Qwen3-8B on a single L4 and benchmarks the speculative engine against an identical baseline without `--speculative-config`. On a copy-heavy edit workload the speculative engine reaches 2.4 times the output-token throughput at about half the time per output token, with a 65% draft acceptance rate. I ran it end to end on GKE: the cluster provisioned, the deployment served a valid completion, and the benchmark numbers in the page come from that run.

I have:
- Run `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- Added or updated tests covering any composition function changes. No composition functions changed; this is docs only.
- Signed off the commit with `git commit -s`.
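
As a quick sanity check, the "2.4 times" figure here lines up with the 16 and 39 tok/s quoted in the commit message. The sketch below only does that arithmetic. The per-token times assume a single request stream, where time per output token is roughly the inverse of throughput; that is an assumption of this sketch, not something the PR states, and the helper names are hypothetical.

```python
def speedup(baseline_tok_s: float, speculative_tok_s: float) -> float:
    """Ratio of speculative to baseline output-token throughput."""
    return speculative_tok_s / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Approximate time per output token, ASSUMING one request stream."""
    return 1000.0 / tok_s


if __name__ == "__main__":
    baseline, speculative = 16.0, 39.0  # tok/s, as quoted in the commit message
    print(f"speedup: {speedup(baseline, speculative):.2f}x")
    print(f"baseline: {ms_per_token(baseline):.1f} ms/token")
    print(f"speculative: {ms_per_token(speculative):.1f} ms/token")
```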