Add a speculative-decoding serving example for Qwen3-8B by pluna · Pull Request #307 · modelplaneai/modelplane

pluna · 2026-06-24T00:48:30Z

Description of your changes

The Examples page covers single-engine, multi-node, and disaggregated serving, but has nothing on speculative decoding, a common way to cut single-GPU decode latency. This adds an example for n-gram (prompt-lookup) speculation, which proposes draft tokens by matching text already in the prompt and so runs as one Standalone vLLM engine with no draft model or extra cached weights.
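For readers new to prompt-lookup speculation, here is a minimal sketch of the proposal step. It is not vLLM's implementation; the function name, the `max_ngram` and `num_draft` parameters, and the choice to prefer the most recent match are all assumptions made for illustration. The target model would still verify every drafted token and keep only the accepted prefix.

```python
from typing import List, Sequence


def propose_ngram(tokens: Sequence[int], max_ngram: int = 3, num_draft: int = 4) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context.

    Hypothetical helper: tries the longest n-gram first, then falls back to
    shorter ones, and returns the tokens that followed the earlier match.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy up to num_draft tokens that followed the match.
                return list(tokens[start + n:start + n + num_draft])
    # No earlier occurrence: propose nothing and decode normally.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(propose_ngram(context))
```

This also suggests why a copy-heavy edit workload suits the method: the output repeats long spans of the prompt, so the copied continuation is often what the model would have generated anyway.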

The page serves Qwen3-8B on a single L4 and benchmarks the speculative engine against an identical baseline without --speculative-config. On a copy-heavy edit workload the speculative engine reaches 2.4 times the output-token throughput at about half the time per output token, with a 65% draft acceptance rate. I ran it end to end on GKE: the cluster provisioned, the deployment served a valid completion, and the benchmark numbers in the page come from that run.
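As a quick sanity check on the headline figure, the sketch below recomputes the throughput ratio from the 16 and 39 tok/s values quoted in the commit message, assuming both come from the same benchmark run. The helper name is hypothetical.

```python
# Figures quoted in this PR; assumed here to come from the same benchmark run.
BASELINE_TOK_S = 16.0
SPECULATIVE_TOK_S = 39.0


def throughput_ratio(baseline: float, candidate: float) -> float:
    """Return how many times the candidate's throughput exceeds the baseline's."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


ratio = throughput_ratio(BASELINE_TOK_S, SPECULATIVE_TOK_S)
print(f"{ratio:.1f}x")  # prints "2.4x"
```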

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
~~Added or updated tests covering any composition function changes.~~ No composition functions changed; this is docs only.
Signed off every commit with git commit -s.

negz · 2026-06-24T00:56:17Z

+# Exposes the qwen3-8b-base deployment as an OpenAI-compatible URL, so the
+# benchmark can hit the baseline engine the same way it hits the speculative one.
+# Read its address from status.address:
+#   kubectl get ms qwen3-8b-base -n ml-team -o jsonpath='{.status.address}'


Does this duplicate the existing example?

Do we need to include everything required here to actually run the experiment, or should we just include the final result? I would prefer the latter.

negz · 2026-06-24T20:17:19Z

@@ -0,0 +1,38 @@
+# No-speculation baseline for the benchmark: Qwen3-8B on a single L4 without


@pluna I think this should be removed too. 😄 Once this is gone we can merge.

@negz The existing Qwen3-8B example uses a different deployment configuration. It sets the tool-call-parser and reasoning-parser args, which I didn't want in the speculative-decoding example, so I created new baseline and speculative deployments that differ in only one thing: --speculative-config.
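The "only one difference" property can be checked mechanically. The sketch below uses hypothetical argument lists, not the manifests from this PR: the `--max-model-len` value and every value inside the speculative config are assumptions, and the JSON keys follow my understanding of vLLM's n-gram settings rather than anything confirmed here.

```python
import json
from typing import List, Sequence, Set

# Hypothetical engine arguments; the PR's actual manifests are not reproduced here.
BASE_ARGS: List[str] = ["--model", "Qwen/Qwen3-8B", "--max-model-len", "8192"]

# Assumed n-gram settings, for illustration only.
SPEC_CONFIG = {"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 3}
SPEC_ARGS: List[str] = BASE_ARGS + ["--speculative-config", json.dumps(SPEC_CONFIG)]


def flag_names(args: Sequence[str]) -> Set[str]:
    """Collect the flag names (tokens starting with '--') from an argument list."""
    return {arg for arg in args if arg.startswith("--")}


def extra_flags(baseline: Sequence[str], candidate: Sequence[str]) -> List[str]:
    """Return flags present in the candidate but absent from the baseline."""
    return sorted(flag_names(candidate) - flag_names(baseline))


if __name__ == "__main__":
    print(extra_flags(BASE_ARGS, SPEC_ARGS))
```

Keeping the two engines identical apart from this flag means any throughput difference in the benchmark can be attributed to speculation rather than to parser or other engine settings.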

The Examples page had no recipe for speculative decoding, a common way to cut single-GPU decode latency. This adds one for n-gram (prompt-lookup) speculation, which proposes tokens by matching the prompt and so runs as a single Standalone vLLM engine with no draft model or extra cached weights.

The page leads with the measured impact: on a copy-heavy edit workload the speculative engine roughly doubles output-token throughput (16 to 39 tok/s) and halves the time per output token, accepting 65% of drafted tokens. I ran it end to end on GKE: the cluster provisioned and the deployment served a valid completion, and the numbers come from that run.

The platform side reuses the Qwen3-8B example rather than restating it, so the example carries only what is unique to speculative decoding: the engine's `--speculative-config`, its ModelDeployment, and its ModelService.

Signed-off-by: Pablo Luna <pablo@upbound.io>

vercel Bot deployed to Preview – modelplane-docs June 24, 2026 00:49

negz reviewed Jun 24, 2026

pluna force-pushed the docs-example-qwen3-speculative-decoding branch from 036b471 to 31f83cf Compare June 24, 2026 18:07

vercel Bot deployed to Preview – modelplane-docs June 24, 2026 18:08

pluna force-pushed the docs-example-qwen3-speculative-decoding branch from 31f83cf to 86783c8 Compare June 24, 2026 19:18

vercel Bot deployed to Preview – modelplane-docs June 24, 2026 19:19

negz reviewed Jun 24, 2026

pluna force-pushed the docs-example-qwen3-speculative-decoding branch from 86783c8 to db0c670 Compare June 25, 2026 19:16

vercel Bot deployed to Preview – modelplane-docs June 25, 2026 19:17

dennis-upbound approved these changes Jun 25, 2026

dennis-upbound merged commit 55bfb80 into modelplaneai:main Jun 25, 2026
4 checks passed

dennis-upbound merged 1 commit into modelplaneai:main from pluna:docs-example-qwen3-speculative-decoding

3 participants
