Support add_generation_prompt request parameter for chat completions (#3877) #4331

Open
exzile wants to merge 8 commits into openvinotoolkit:main from exzile:feature/assistant-prefill

exzile commented Jun 26, 2026

Summary

Closes #3877.

The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted — the building block for assistant prefill.

Changes

  • Parse and validate add_generation_prompt in the request handler; store it on the request struct.
  • Honor it at every chat-template application site: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and passed into the template render).
  • Replaces the previously hardcoded add_generation_prompt = true.

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

  • default → prompt ends with <|im_start|>assistant
  • add_generation_prompt: false → that trailing generation prompt is omitted
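
To make the two cases concrete, here is a minimal sketch of ChatML-style rendering in Python. It is a hand-written stand-in for the model's Jinja chat template, not OVMS or genai code: `render_chatml` is a hypothetical helper, and it ignores anything else the real SmolLM2 template may do (for example, inserting a default system message).

```python
# Hypothetical stand-in for a ChatML chat template; not OVMS code.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form, optionally opening an assistant turn."""
    prompt = ""
    for message in messages:
        prompt += f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
    if add_generation_prompt:
        # The generation prompt: an opened, still-empty assistant turn.
        prompt += f"{IM_START}assistant\n"
    return prompt


messages = [{"role": "user", "content": "What is OpenVINO?"}]

with_prompt = render_chatml(messages)
without_prompt = render_chatml(messages, add_generation_prompt=False)

print(repr(with_prompt))
print(repr(without_prompt))
```

The only difference between the two strings is the trailing `<|im_start|>assistant` line, which is what the flag controls.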

Scope note

This implements add_generation_prompt only. Full assistant prefill — continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM) — is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.
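
The gap described in this scope note can be illustrated with a small Python sketch. With add_generation_prompt set to false, a trailing assistant message is kept but its turn is still closed; continuing that message would require the prompt to end right after the message text. The `continue_final_message` helper below is hypothetical and only illustrates the semantics on a ChatML string. It is not how genai or OVMS would implement the follow-up.

```python
IM_END = "<|im_end|>"

# A ChatML prompt rendered with add_generation_prompt=False: the final
# assistant message is preserved, but its turn is closed with <|im_end|>.
rendered = (
    "<|im_start|>user\nList three colors.<|im_end|>\n"
    "<|im_start|>assistant\n1. Red<|im_end|>\n"
)


def continue_final_message(rendered_prompt, final_content):
    """Cut the prompt right after the final message text, leaving the turn open."""
    cut = rendered_prompt.rindex(final_content) + len(final_content)
    return rendered_prompt[:cut]


prefill = continue_final_message(rendered, "1. Red")
print(repr(prefill))
```

A model given `prefill` would keep writing the same assistant message, whereas a model given `rendered` sees a finished turn.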

🤖 Generated with Claude Code

Comment thread src/llm/apis/openai_api_handler.cpp Outdated
mzegla commented Jun 30, 2026

Collaborator

Looks like a rebase with conflict resolution is needed.

exzile and others added 2 commits June 30, 2026 09:26
The chat template was always rendered with add_generation_prompt=true,
hardcoded in every servable. This exposes an optional add_generation_prompt
field (bool, default true) on the /v3/chat/completions request, matching
HF transformers and vLLM. When false, the trailing generation prompt is
omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and
  store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and
  VLM continuous batching, legacy) and the Python-Jinja path (read from the
  request body and pass into the template render).
- Add tests covering default (generation prompt added) and false
  (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:
default renders a trailing "<|im_start|>assistant", add_generation_prompt=false
omits it.

Note: true assistant prefill (continue_final_message - continuing from the
final assistant message without closing it) is a separate control and is left
as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per review: add_generation_prompt is valid and functional for the
Responses API - the Endpoint::RESPONSES path in servable.cpp reads
request.addGenerationPrompt and passes it to apply_chat_template (and
the Jinja path reads it from the request JSON), same as chat
completions. Document it in the responses REST API parameters table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
exzile force-pushed the feature/assistant-prefill branch from 8c850e7 to de8b330 on June 30, 2026 13:27
Comment thread src/llm/io_processing/input_request.hpp Outdated
exzile and others added 3 commits July 1, 2026 09:32
Per review feedback: add_generation_prompt is only ever consumed by
chat template rendering, so route it through chat_template_kwargs
instead of a dedicated InputRequest field. The MINJA path extracts it
back out for genai's dedicated apply_chat_template argument; the
Python-Jinja path pops it from the kwargs dict before splatting into
render() to avoid a duplicate-keyword collision. Also fixes the
Python-Jinja path never having wired add_generation_prompt through at
all, and two llmtemplate_test.cpp tests that referenced a stale 4-arg
applyChatTemplate signature and never compiled.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
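
The duplicate-keyword collision mentioned in this commit message is a general Python behavior, sketched below. `render` is a hypothetical stand-in for the template's render() call, and `custom_flag` is an invented extra kwarg; neither is taken from the OVMS sources.

```python
def render(**context):
    """Hypothetical stand-in for a Jinja template's render(); echoes its context."""
    return context


def apply_template_naive(messages, chat_template_kwargs):
    flag = chat_template_kwargs.get("add_generation_prompt", True)
    # If the dict also holds add_generation_prompt, the keyword is passed twice.
    return render(messages=messages, add_generation_prompt=flag, **chat_template_kwargs)


def apply_template(messages, chat_template_kwargs):
    kwargs = dict(chat_template_kwargs)  # copy so the caller's dict is untouched
    # Pop the flag so it reaches render() exactly once.
    flag = kwargs.pop("add_generation_prompt", True)
    return render(messages=messages, add_generation_prompt=flag, **kwargs)


user_kwargs = {"add_generation_prompt": False, "custom_flag": 1}

try:
    apply_template_naive([], user_kwargs)
except TypeError as error:
    print("naive call failed:", error)

print(apply_template([], user_kwargs))
```

Popping from a copy keeps the default of true when the key is absent and leaves the remaining kwargs free to be splatted into the render call.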

mzegla left a comment

Collaborator

Please have a look at these few comments. Once they are resolved, I think we will be good to run it through CI and merge.

Comment thread src/llm/io_processing/input_request.hpp Outdated
Comment thread src/llm/io_processing/input_processors/chat_template_processor.cpp Outdated
Comment thread src/llm/apis/openai_api_handler.cpp Outdated
Comment thread src/llm/apis/openai_api_handler.cpp Outdated
Comment thread docs/model_server_rest_api_chat.md Outdated
Comment thread docs/model_server_rest_api_responses.md Outdated
exzile added 3 commits July 3, 2026 12:16
…validate kwargs type

- Remove the now-redundant top-level add_generation_prompt request field/parsing
  and the OpenAIRequest.addGenerationPrompt member: the parameter lives entirely
  in chat_template_kwargs now, and both template paths already default it to
  true when absent.
- Fix MINJA path to use JsonContainer::as_bool() instead of get_bool(), which
  threw an unhandled ov::Exception (outside the surrounding try/catch) when
  add_generation_prompt was present but not a boolean; now returns a clean
  InvalidArgumentError, matching the Python-Jinja path's existing validation.
- Trim a leftover comment line in input_request.hpp per review suggestion.
- Update chat/responses REST docs to document add_generation_prompt as part of
  chat_template_kwargs instead of a separate top-level parameter.
- Add unit tests for add_generation_prompt=false and the invalid-type rejection
  on the MINJA path.
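
The validation behavior described in this commit (default to true when absent, reject non-boolean values with a clean error) can be sketched in Python as follows. This is an illustration of the intended behavior only: the real MINJA path is C++, and both `read_add_generation_prompt` and this `InvalidArgumentError` class are hypothetical names for the sketch.

```python
class InvalidArgumentError(ValueError):
    """Hypothetical stand-in for the server's invalid-argument error."""


def read_add_generation_prompt(chat_template_kwargs):
    """Return the flag from chat_template_kwargs, defaulting to True."""
    value = chat_template_kwargs.get("add_generation_prompt", True)
    # Strict type check: strings such as "false" and ints such as 0 are rejected
    # rather than coerced.
    if not isinstance(value, bool):
        raise InvalidArgumentError("add_generation_prompt must be a boolean")
    return value


print(read_add_generation_prompt({}))
print(read_add_generation_prompt({"add_generation_prompt": False}))

try:
    read_add_generation_prompt({"add_generation_prompt": "false"})
except InvalidArgumentError as error:
    print("rejected:", error)
```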
exzile commented Jul 3, 2026

Author

@mzegla I addressed all of those comments in the latest commit:

  • Removed the leftover top-level add_generation_prompt field and its parsing; the parameter now lives solely in chat_template_kwargs.
  • Fixed a real bug: the MINJA path called get_bool() outside its try/catch, so a non-boolean value threw an unhandled ov::Exception instead of producing a clean error. It now uses as_bool() and returns InvalidArgumentError.
  • Trimmed the leftover comment line in input_request.hpp per the inline suggestion.
  • Updated both REST docs pages to document add_generation_prompt under chat_template_kwargs instead of as a separate parameter.
  • Added two unit tests (the false case and invalid-type rejection).
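
For reference, a request body in the final shape might look like the Python sketch below. The `model` value and the message contents are assumptions for illustration; the only parts taken from this PR are the /v3/chat/completions endpoint and the add_generation_prompt key under chat_template_kwargs. The snippet only builds the JSON and does not send it.

```python
import json

# Assumed example values; only chat_template_kwargs.add_generation_prompt
# comes from this PR.
request_body = {
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
    "messages": [
        {"role": "user", "content": "List three colors."},
        {"role": "assistant", "content": "1. Red"},
    ],
    "chat_template_kwargs": {"add_generation_prompt": False},
}

# Body to POST to /v3/chat/completions with Content-Type: application/json.
payload = json.dumps(request_body)
print(payload)
```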

Development

Successfully merging this pull request may close these issues.

Feature Request: Allow Assistant Messages to Act as Generation Prefix (Prefill)

2 participants