Support add_generation_prompt request parameter for chat completions (#3877) by exzile · Pull Request #4331 · openvinotoolkit/model_server

exzile · 2026-06-26T20:07:36Z

Summary

Closes #3877.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted — the building block for assistant prefill.

Changes

Parse and validate add_generation_prompt in the request handler; store it on the request struct.
Honor it at every chat-template application site: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and passed into the template render).
Replaces the previously hardcoded add_generation_prompt = true.

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

default → prompt ends with <|im_start|>assistant
add_generation_prompt: false → that trailing generation prompt is omitted
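
To illustrate the two outcomes above, here is a minimal sketch of a ChatML-style renderer. It is not the model's real chat template or the server's code path: `render_chatml` is a hypothetical helper, and the exact whitespace around the special tokens is an assumption.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a simplified ChatML-like layout (hypothetical stand-in for a real template)."""
    # Every message is rendered as a closed turn.
    prompt = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # The generation prompt opens a new assistant turn for the model to fill in.
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


messages = [{"role": "user", "content": "Hi"}]
with_prompt = render_chatml(messages)
without_prompt = render_chatml(messages, add_generation_prompt=False)
```

In this sketch the only difference between the two strings is the trailing `<|im_start|>assistant` header, which is what the flag controls.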

Scope note

This implements add_generation_prompt only. Full assistant prefill — continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM) — is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.
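
The difference between the two controls can be sketched as follows. This is an illustrative toy renderer, not the transformers, vLLM, or genai implementation; `render` and its ChatML-like layout are assumptions made for the example.

```python
from typing import Dict, List


def render(
    messages: List[Dict[str, str]],
    add_generation_prompt: bool = True,
    continue_final_message: bool = False,
) -> str:
    """Toy ChatML-like renderer showing both controls (hypothetical helper)."""
    if add_generation_prompt and continue_final_message:
        # In this sketch the two options are mutually exclusive:
        # one opens a new turn, the other keeps the last turn open.
        raise ValueError("add_generation_prompt and continue_final_message conflict")
    out = ""
    for index, message in enumerate(messages):
        out += f"<|im_start|>{message['role']}\n{message['content']}"
        is_last = index == len(messages) - 1
        # With continue_final_message, the last turn is left unclosed.
        if not (is_last and continue_final_message):
            out += "<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out


chat = [
    {"role": "user", "content": "List three colors."},
    {"role": "assistant", "content": "Sure, here"},
]
closed_turn = render(chat, add_generation_prompt=False)
open_turn = render(chat, add_generation_prompt=False, continue_final_message=True)
```

With `add_generation_prompt=False` alone, the assistant message is still closed, so the model does not continue it; only the unclosed variant corresponds to full prefill.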

🤖 Generated with Claude Code

mzegla · 2026-06-30T12:58:23Z

Looks like a rebase with conflict resolution is needed.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on the /v3/chat/completions request, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and pass into the template render).
- Add tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct: default renders a trailing "<|im_start|>assistant", add_generation_prompt=false omits it.

Note: true assistant prefill (continue_final_message - continuing from the final assistant message without closing it) is a separate control and is left as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per review: add_generation_prompt is valid and functional for the Responses API - the Endpoint::RESPONSES path in servable.cpp reads request.addGenerationPrompt and passes it to apply_chat_template (and the Jinja path reads it from the request JSON), same as chat completions. Document it in the responses REST API parameters table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per review feedback: add_generation_prompt is only ever consumed by chat template rendering, so route it through chat_template_kwargs instead of a dedicated InputRequest field. The MINJA path extracts it back out for genai's dedicated apply_chat_template argument; the Python-Jinja path pops it from the kwargs dict before splatting into render() to avoid a duplicate-keyword collision.

Also fixes the Python-Jinja path never having wired add_generation_prompt through at all, and two llmtemplate_test.cpp tests that referenced a stale 4-arg applyChatTemplate signature and never compiled.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
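
The duplicate-keyword collision mentioned in this commit is plain Python behavior and can be reproduced in isolation. The sketch below assumes a hypothetical `render` function standing in for the template render call; the real server code is not shown on this page.

```python
from typing import Any, Dict, List, Optional


def render(messages: List[Dict[str, str]], add_generation_prompt: bool = True, **extra: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a template render call; it just echoes what it received."""
    return {"messages": messages, "add_generation_prompt": add_generation_prompt, "extra": extra}


def apply_template(messages: List[Dict[str, str]], chat_template_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Copy so the caller's dict is not mutated, then pop the dedicated argument
    # so it is not passed twice (once explicitly, once via **kwargs).
    kwargs = dict(chat_template_kwargs or {})
    add_generation_prompt = kwargs.pop("add_generation_prompt", True)
    return render(messages=messages, add_generation_prompt=add_generation_prompt, **kwargs)


request_kwargs = {"add_generation_prompt": False, "enable_thinking": True}
result = apply_template([], request_kwargs)

# Without the pop, the same key arrives twice and Python raises TypeError.
try:
    render(messages=[], add_generation_prompt=True, **request_kwargs)
    collision = None
except TypeError as error:
    collision = error
```

Here `enable_thinking` is only an example of some other template variable a caller might pass; it is not taken from the PR.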

mzegla

Please have a look at those few comments. Once we have it resolved I think we will be good to run it through the CI and merge.

…validate kwargs type

- Remove the now-redundant top-level add_generation_prompt request field/parsing and the OpenAIRequest.addGenerationPrompt member: the parameter lives entirely in chat_template_kwargs now, and both template paths already default it to true when absent.
- Fix MINJA path to use JsonContainer::as_bool() instead of get_bool(), which threw an unhandled ov::Exception (outside the surrounding try/catch) when add_generation_prompt was present but not a boolean; now returns a clean InvalidArgumentError, matching the Python-Jinja path's existing validation.
- Trim a leftover comment line in input_request.hpp per review suggestion.
- Update chat/responses REST docs to document add_generation_prompt as part of chat_template_kwargs instead of a separate top-level parameter.
- Add unit tests for add_generation_prompt=false and the invalid-type rejection on the MINJA path.
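
The validation behavior described in this commit (default to true when absent, reject non-boolean values with a clean error) can be sketched in Python. This mirrors the described behavior only; `read_add_generation_prompt` and the `InvalidArgumentError` class below are hypothetical names for the example, not the server's C++ code.

```python
from typing import Any, Mapping, Optional


class InvalidArgumentError(ValueError):
    """Hypothetical stand-in for the server's invalid-argument error."""


def read_add_generation_prompt(chat_template_kwargs: Optional[Mapping[str, Any]]) -> bool:
    """Return the flag from chat_template_kwargs, defaulting to True when absent."""
    if not chat_template_kwargs or "add_generation_prompt" not in chat_template_kwargs:
        return True
    value = chat_template_kwargs["add_generation_prompt"]
    # Strict check: strings such as "false" and integers such as 0 are rejected
    # instead of being coerced to a boolean.
    if not isinstance(value, bool):
        raise InvalidArgumentError("add_generation_prompt must be a boolean")
    return value
```

Rejecting rather than coercing keeps a malformed request from silently changing the rendered prompt.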

exzile · 2026-07-03T16:26:44Z

@mzegla attended to all those comments in the latest commit:
Removed the leftover top-level add_generation_prompt field/parsing (now lives solely in chat_template_kwargs).
Fixed a real bug: the MINJA path used get_bool() outside its try/catch, which would throw an unhandled ov::Exception on a non-boolean value instead of a clean error — now uses as_bool() and returns InvalidArgumentError.
Trimmed the leftover comment line in input_request.hpp per the inline suggestion.
Updated both REST docs pages to document add_generation_prompt under chat_template_kwargs instead of as a separate parameter.
Added two unit tests (false-case and invalid-type rejection).
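
For reference, a request using the final shape of the parameter (nested under chat_template_kwargs) might be built as below. This is a sketch under assumptions: the base URL, port, and served model name depend on the deployment, and the request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local deployment, port is illustrative

payload = {
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",  # assumption: name the model is served under
    "messages": [
        {"role": "user", "content": "Name one prime number."},
        {"role": "assistant", "content": "Sure, here is one:"},
    ],
    # The flag lives inside chat_template_kwargs, not at the top level.
    "chat_template_kwargs": {"add_generation_prompt": False},
}

request = urllib.request.Request(
    BASE_URL + "/v3/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```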

exzile mentioned this pull request Jun 26, 2026

Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill) #3877

Open

exzile force-pushed the feature/assistant-prefill branch from 91043be to 6bb8bd4 Compare June 27, 2026 01:04

mzegla reviewed Jun 29, 2026

Comment thread src/llm/apis/openai_api_handler.cpp Outdated

exzile and others added 2 commits June 30, 2026 09:26

exzile force-pushed the feature/assistant-prefill branch from 8c850e7 to de8b330 Compare June 30, 2026 13:27

mzegla reviewed Jul 1, 2026

Comment thread src/llm/io_processing/input_request.hpp Outdated

exzile and others added 3 commits July 1, 2026 09:32

Merge branch 'main' into feature/assistant-prefill

16da220

Merge remote-tracking branch 'fork/feature/assistant-prefill' into fe…

709fb43

mzegla reviewed Jul 3, 2026

exzile added 3 commits July 3, 2026 12:16

Merge branch 'main' into feature/assistant-prefill

c7d15d4

Merge remote-tracking branch 'fork/feature/assistant-prefill' into fe…

6230277

exzile wants to merge 8 commits into openvinotoolkit:main from exzile:feature/assistant-prefill

2 participants
