Support add_generation_prompt request parameter for chat completions (#3877) #4331
exzile wants to merge 8 commits into
91043be to 6bb8bd4
Looks like a rebase with conflict resolution is needed.
The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on the /v3/chat/completions request, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

- Parse add_generation_prompt in the request (openai_api_handler.cpp) and store it on the request struct (openai_request.hpp).
- Honor it in all chat-template application sites: the MINJA path (LLM and VLM continuous batching, legacy) and the Python-Jinja path (read from the request body and pass into the template render).
- Add tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct: default renders a trailing "<|im_start|>assistant", add_generation_prompt=false omits it.

Note: true assistant prefill (continue_final_message - continuing from the final assistant message without closing it) is a separate control and is left as a follow-up.

Implements openvinotoolkit#3877

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
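To illustrate what the flag controls, here is a minimal sketch using a hand-written ChatML-style renderer. The function `render_chatml` is hypothetical and only approximates the kind of template SmolLM2 ships; it is not the server's MINJA or Jinja code path.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Toy ChatML-style renderer (assumption: a stand-in for the real chat template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open a fresh assistant turn so the model starts a new reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]

with_prompt = render_chatml(messages)  # default: generation prompt appended
without_prompt = render_chatml(messages, add_generation_prompt=False)
```

In this sketch the two renderings differ only by the trailing `<|im_start|>assistant` line, and the existing assistant message is preserved in both.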
Per review: add_generation_prompt is valid and functional for the Responses API - the Endpoint::RESPONSES path in servable.cpp reads request.addGenerationPrompt and passes it to apply_chat_template (and the Jinja path reads it from the request JSON), same as chat completions. Document it in the responses REST API parameters table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
8c850e7 to de8b330
Per review feedback: add_generation_prompt is only ever consumed by chat template rendering, so route it through chat_template_kwargs instead of a dedicated InputRequest field. The MINJA path extracts it back out for genai's dedicated apply_chat_template argument; the Python-Jinja path pops it from the kwargs dict before splatting into render() to avoid a duplicate-keyword collision.

Also fixes the Python-Jinja path never having wired add_generation_prompt through at all, and two llmtemplate_test.cpp tests that referenced a stale 4-arg applyChatTemplate signature and never compiled.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
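The duplicate-keyword collision mentioned above is a general Python behavior and can be reproduced without any template library. The sketch below is illustrative only: `render`, `render_with_kwargs`, `render_naive`, and the extra key `enable_thinking` are hypothetical names, not the server's actual code.

```python
from typing import Any, Dict, List


def render(messages: List[Dict[str, str]],
           add_generation_prompt: bool = True,
           **extra: Any) -> str:
    """Hypothetical stand-in for a template render(); only the call shape matters."""
    suffix = " +generation_prompt" if add_generation_prompt else ""
    return f"{len(messages)} message(s), extras={sorted(extra)}{suffix}"


def render_with_kwargs(messages: List[Dict[str, str]],
                       chat_template_kwargs: Dict[str, Any]) -> str:
    # Copy so the caller's dict is not mutated, then pop the flag
    # (defaulting to True when absent) before splatting the rest.
    kwargs = dict(chat_template_kwargs)
    add_generation_prompt = kwargs.pop("add_generation_prompt", True)
    return render(messages, add_generation_prompt=add_generation_prompt, **kwargs)


def render_naive(messages: List[Dict[str, str]],
                 chat_template_kwargs: Dict[str, Any]) -> str:
    # Passing the flag explicitly AND inside **kwargs raises TypeError
    # whenever the dict also contains "add_generation_prompt".
    return render(messages, add_generation_prompt=True, **chat_template_kwargs)


msgs = [{"role": "user", "content": "Hi"}]
kwargs = {"add_generation_prompt": False, "enable_thinking": True}
safe_output = render_with_kwargs(msgs, kwargs)
```

Popping from a copy keeps the request's kwargs intact for any later consumer, which is a design choice of this sketch rather than something the PR states.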
…ature/assistant-prefill
mzegla left a comment
Please have a look at those few comments. Once we have it resolved I think we will be good to run it through the CI and merge.
…validate kwargs type

- Remove the now-redundant top-level add_generation_prompt request field/parsing and the OpenAIRequest.addGenerationPrompt member: the parameter lives entirely in chat_template_kwargs now, and both template paths already default it to true when absent.
- Fix MINJA path to use JsonContainer::as_bool() instead of get_bool(), which threw an unhandled ov::Exception (outside the surrounding try/catch) when add_generation_prompt was present but not a boolean; now returns a clean InvalidArgumentError, matching the Python-Jinja path's existing validation.
- Trim a leftover comment line in input_request.hpp per review suggestion.
- Update chat/responses REST docs to document add_generation_prompt as part of chat_template_kwargs instead of a separate top-level parameter.
- Add unit tests for add_generation_prompt=false and the invalid-type rejection on the MINJA path.
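The behavior described in this commit (default to true when absent, reject non-boolean values cleanly) can be sketched as follows. This is an illustration in Python, not the C++ implementation: `extract_add_generation_prompt` is a hypothetical helper, `ValueError` stands in for the server's InvalidArgumentError, and the request body shape is assumed from the commit message.

```python
import json
from typing import Any, Dict


def extract_add_generation_prompt(body: Dict[str, Any]) -> bool:
    """Read add_generation_prompt from chat_template_kwargs, defaulting to True."""
    kwargs = body.get("chat_template_kwargs", {})
    if not isinstance(kwargs, dict):
        raise ValueError("chat_template_kwargs must be an object")
    value = kwargs.get("add_generation_prompt", True)
    # Strict check: JSON numbers and strings such as 1 or "false" are rejected.
    if not isinstance(value, bool):
        raise ValueError("add_generation_prompt must be a boolean")
    return value


# Assumed request body for /v3/chat/completions after this change.
body = json.loads("""
{
  "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
  "messages": [{"role": "user", "content": "Hi"}],
  "chat_template_kwargs": {"add_generation_prompt": false}
}
""")

flag = extract_add_generation_prompt(body)
```

Validating the type before the template is rendered means a malformed request fails with a client-facing error instead of an exception escaping the handler.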
…ature/assistant-prefill
@mzegla tended to all those comments in the latest commit.
Summary
Closes #3877.
The chat template was always rendered with add_generation_prompt=true, hardcoded in every servable. This exposes an optional add_generation_prompt field (bool, default true) on /v3/chat/completions, matching HF transformers and vLLM. When false, the trailing generation prompt is omitted, which is the building block for assistant prefill.

Changes

- Parse add_generation_prompt in the request handler; store it on the request struct.
- … add_generation_prompt = true.

Testing

Added tests covering default (generation prompt added) and false (generation prompt omitted, assistant message preserved).

Verified end-to-end on the MINJA path with HuggingFaceTB/SmolLM2-360M-Instruct:

- default → renders a trailing <|im_start|>assistant
- add_generation_prompt: false → that trailing generation prompt is omitted

Scope note

This implements add_generation_prompt only. Full assistant prefill, that is, continuing from the final assistant message without closing it (continue_final_message in transformers/vLLM), is a separate control that the genai C++ apply_chat_template does not currently expose, and is left as a follow-up.

🤖 Generated with Claude Code
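To make the scope distinction concrete, the sketch below contrasts the two controls using a toy ChatML-style renderer. It is an illustration under assumptions: `render_chatml` is hypothetical, the `continue_final_message` handling only mimics the idea of leaving the last message open, and rejecting both flags together is a choice made for this sketch.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True,
                  continue_final_message: bool = False) -> str:
    """Toy renderer: a stand-in for a real chat template, not any library's API."""
    if add_generation_prompt and continue_final_message:
        raise ValueError("the two flags conflict in this sketch")
    parts = []
    last = len(messages) - 1
    for i, m in enumerate(messages):
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}")
        # Leave the final message open only when asked to continue it.
        if not (continue_final_message and i == last):
            parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


msgs = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "Sure, the answer is"},
]

# What this PR enables: no new assistant turn, but the last message is closed.
closed = render_chatml(msgs, add_generation_prompt=False)
# The follow-up control: the last assistant message stays open for continuation.
open_ended = render_chatml(msgs, add_generation_prompt=False,
                           continue_final_message=True)
```

With add_generation_prompt=false alone, the prompt still ends with the end-of-turn marker, so a model is not guaranteed to continue the assistant text; leaving the message open is what the follow-up would add.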