Add a speculative-decoding serving example for Qwen3-8B #307

Merged
dennis-upbound merged 1 commit into modelplaneai:main from pluna:docs-example-qwen3-speculative-decoding
Jun 25, 2026
Conversation

pluna commented Jun 24, 2026

Collaborator

Description of your changes

The Examples page covers single-engine, multi-node, and disaggregated serving, but has nothing for speculative decoding, a common way to cut single-GPU decode latency. This adds an example for n-gram (prompt-lookup) speculation, which proposes draft tokens by matching n-grams in the prompt, so it runs as one Standalone vLLM engine with no draft model or extra cached weights.
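
For readers new to prompt lookup, here is a minimal sketch of the proposal step. It is an illustration only, not vLLM's implementation: the function name `propose_ngram` and the parameters `max_ngram` and `num_draft` are hypothetical, and tokens are plain integers.

```python
from typing import List, Sequence


def propose_ngram(tokens: Sequence[int], max_ngram: int = 3, num_draft: int = 4) -> List[int]:
    """Propose draft tokens by finding an earlier copy of the current suffix.

    Hypothetical helper for illustration; not part of vLLM or this PR.
    """
    # Try the longest suffix first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier match wins.
        # Starting at len - n - 1 guarantees at least one token follows the match.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # The tokens that followed the earlier match become the draft.
                return list(tokens[start + n:start + n + num_draft])
    return []  # no match: nothing to propose for this step


# The sequence ends in [1, 2, 3], which also appears at the start,
# so the draft is whatever followed it there.
draft = propose_ngram([1, 2, 3, 4, 5, 1, 2, 3])
print(draft)  # [4, 5, 1, 2]
```

The target model then checks the drafted tokens in a single forward pass and keeps the prefix it agrees with, which is why copy-heavy workloads, where the output repeats the prompt, tend to benefit most.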

The page serves Qwen3-8B on a single L4 and benchmarks the speculative engine against an identical baseline without `--speculative-config`. On a copy-heavy edit workload the speculative engine reaches 2.4 times the output-token throughput at about half the time per output token, with a 65% draft acceptance rate. I ran it end to end on GKE: the cluster provisioned, the deployment served a valid completion, and the benchmark numbers in the page come from that run.

I have:

  • Read and followed Modelplane's contribution process.
  • Run `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
  • Added or updated tests covering any composition function changes. No composition functions changed; this is docs only.
  • Signed off every commit with `git commit -s`.

Comment on lines +1 to +4
# Exposes the qwen3-8b-base deployment as an OpenAI-compatible URL, so the
# benchmark can hit the baseline engine the same way it hits the speculative one.
# Read its address from status.address:
# kubectl get ms qwen3-8b-base -n ml-team -o jsonpath='{.status.address}'

Collaborator

Does this duplicate the existing example?

Do we need to include everything required here to actually run the experiment, or should we just include the final result? I would prefer the latter.

Collaborator Author

Fixed

@@ -0,0 +1,38 @@
# No-speculation baseline for the benchmark: Qwen3-8B on a single L4 without

Collaborator

@pluna I think this should be removed too. 😄 Once this is gone we can merge.

Collaborator Author

@negz The existing Qwen3-8B example uses a different deployment configuration. It passes the tool-call-parser and reasoning-parser args, which I didn't want in the speculative decoding example, so I created new deployments for base and spec that differ in only one argument: --speculative-config.
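
To make the "only one difference" point concrete, here is a rough sketch of two engine argument lists that differ solely in `--speculative-config`. The JSON values (`num_speculative_tokens`, `prompt_lookup_max`) and the `Qwen/Qwen3-8B` model ID are assumptions for illustration, not the settings used in this PR, and the way Modelplane passes engine args to vLLM is not shown here.

```python
import json
import shlex
from typing import List

MODEL = "Qwen/Qwen3-8B"  # assumed Hugging Face model ID

# Assumed n-gram settings; the PR's actual values are not quoted in this thread.
SPEC_CONFIG = {
    "method": "ngram",
    "num_speculative_tokens": 4,
    "prompt_lookup_max": 4,
}


def engine_args(speculative: bool) -> List[str]:
    """Build a vLLM serve command line; the flag is the only variable."""
    args = ["vllm", "serve", MODEL]
    if speculative:
        # vLLM takes the speculative settings as a single JSON string.
        args += ["--speculative-config", json.dumps(SPEC_CONFIG)]
    return args


baseline = engine_args(speculative=False)
spec = engine_args(speculative=True)

# Everything before the flag is identical, so any measured gap
# can be attributed to speculation alone.
extra = spec[len(baseline):]
print(shlex.join(baseline))
print(shlex.join(spec))
```

Keeping the parser args out of both deployments means the benchmark compares like with like.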

The Examples page had no recipe for speculative decoding, a common way to
cut single-GPU decode latency. This adds one for n-gram (prompt-lookup)
speculation, which proposes tokens by matching the prompt and so runs as a
single Standalone vLLM engine with no draft model or extra cached weights.

The page leads with the measured impact: on a copy-heavy edit workload the
speculative engine more than doubles output-token throughput (16 to 39 tok/s)
and halves the time per output token, accepting 65% of drafted tokens. I ran
it end to end on GKE: the cluster provisioned and the deployment served a valid
completion, and the numbers come from that run.
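
As a quick check on the throughput figures quoted above, the ratio can be worked out directly. The variable names are illustrative; only the 16 and 39 tok/s values come from this message.

```python
baseline_tps = 16.0  # output tokens per second, baseline engine
spec_tps = 39.0      # output tokens per second, speculative engine

# Throughput ratio between the two engines.
speedup = spec_tps / baseline_tps
print(f"{speedup:.2f}x")  # 2.44x
```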

The platform side reuses the Qwen3-8B example rather than restating it, so the
example carries only what is unique to speculative decoding: the engine's
`--speculative-config`, its ModelDeployment, and its ModelService.

Signed-off-by: Pablo Luna <pablo@upbound.io>
dennis-upbound merged commit 55bfb80 into modelplaneai:main Jun 25, 2026
4 checks passed

3 participants