2 changes: 1 addition & 1 deletion docs/model_server_rest_api_chat.md
@@ -220,7 +220,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
| tools | ✅ | ✅ | ✅ | array | A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) for more details. |
| tool_choice | ✅ | ✅ | ✅ | string or object | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular tool via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice) for more details. |
| response_format | ✅ | ✅ | ✅ | object | An object specifying the format that the model must output. Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will match your supplied JSON schema according to [OpenAI reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). Learn more in the [Structured Outputs demo](../demos/continuous_batching/structured_output/README.md). Additionally, `response_format` can accept [XGrammar structural tags format](https://github.com/mlc-ai/xgrammar/blob/v0.1.26/docs/tutorials/structural_tag.md#format-types) (not part of OpenAI API). For example: `{ "type": "const_string", "value": "Hello World!" }`. **Note** that if model server fails to process the format, the request will still be processed, but the format will not be imposed. |
-| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to chat template engine. Example `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause error. |
+| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to the chat template engine. Example: `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause an error. Also accepts `add_generation_prompt` (bool, default: `true`), which controls whether to append the chat template's generation prompt (the marker that signals the model to start a new assistant turn). Set it to `false` to render the conversation without a trailing generation prompt, e.g. `{"add_generation_prompt": false}`. This is useful for assistant prefill, where the final `assistant` message should be continued rather than treated as a completed turn. Applies to both the Python-Jinja and MINJA chat template paths. |
| skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |

#### Beam search sampling specific
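A minimal sketch of the assistant-prefill request that the updated `chat_template_kwargs` row describes. The model name and message texts are placeholders, and `build_prefill_request` is a hypothetical helper, not part of OVMS. The snippet only builds the JSON body, since the endpoint URL depends on the deployment.

```python
import json


def build_prefill_request(model, user_text, assistant_prefix):
    """Build a chat completions body asking the server to continue assistant_prefix."""
    return {
        "model": model,  # placeholder; use the name of a model served by your instance
        "messages": [
            {"role": "user", "content": user_text},
            # The final assistant message is the prefill the model should continue.
            {"role": "assistant", "content": assistant_prefix},
        ],
        # Per the docs row: false renders the conversation without a trailing
        # generation prompt, so the assistant turn is left open.
        "chat_template_kwargs": {"add_generation_prompt": False},
    }


body = build_prefill_request("my-model", "What is OpenVINO?", "OpenVINO is")
print(json.dumps(body, indent=2))
```

Omitting `add_generation_prompt` keeps the documented default of `true`, so existing clients should see no change in behavior.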
2 changes: 1 addition & 1 deletion docs/model_server_rest_api_responses.md
@@ -104,7 +104,7 @@ curl http://localhost/v3/responses \
| tools | ⚠️ | ✅ | array (optional) | A list of tools the model may call. Currently, only **function** tools are supported. OpenAI also supports built-in tools (web_search, file_search, code_interpreter, etc.) and MCP tools. OVMS additionally accepts a flat `{type, name, parameters}` format alongside the nested `{type, function: {name, parameters}}` format. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) for more details. |
| tool_choice | ✅ | ✅ | string or object (optional) | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular function via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. |
| reasoning | ⚠️ | ✅ | object (optional) | Configuration for reasoning/thinking mode. The `effort` field accepts `"low"`, `"medium"`, or `"high"` — any value enables thinking mode (`enable_thinking: true` is injected into chat template kwargs). The `summary` field is accepted but ignored. |
-| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. |
+| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. Also accepts `add_generation_prompt` (bool, default: `true`), which controls whether to append the chat template's generation prompt (the marker that signals the model to start a new assistant turn). Set it to `false` to render the conversation without a trailing generation prompt, e.g. `{"add_generation_prompt": false}`. This is useful for assistant prefill, where the final `assistant` message should be continued rather than treated as a completed turn. Applies to both the Python-Jinja and MINJA chat template paths. |
| skip_special_tokens | ✅ | ❌ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
| stream_options | ❌ | ❌ | | Not supported in Responses API. Usage statistics are always included in the `response.completed` event. |

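A rough model of how the `reasoning` and `chat_template_kwargs` rows interact in the Responses API, as the table describes it. `effective_template_kwargs` is a hypothetical name used only for illustration. The sketch assumes the injected `enable_thinking: true` overrides any client-supplied value, which the docs do not state explicitly.

```python
def effective_template_kwargs(request):
    """Return the kwargs the chat template would see for a Responses API request."""
    # Copy so the caller's request dict is not mutated.
    kwargs = dict(request.get("chat_template_kwargs") or {})
    # Per the docs: when `reasoning` is provided, enable_thinking: true is merged in.
    # Assumption: the merged value takes precedence over a client-supplied one.
    if request.get("reasoning") is not None:
        kwargs["enable_thinking"] = True
    return kwargs


request = {
    "reasoning": {"effort": "low"},
    "chat_template_kwargs": {"add_generation_prompt": False},
}
print(effective_template_kwargs(request))
```

Under this model, `add_generation_prompt` passes through untouched alongside the injected `enable_thinking` key.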
3 changes: 3 additions & 0 deletions src/llm/apis/openai_api_handler.cpp
@@ -358,6 +358,9 @@ absl::StatusOr<InputRequest> OpenAIApiHandler::extractInputRequest(GenerationCon
if (!kwargsResult.ok()) {
return kwargsResult.status();
}
// add_generation_prompt is only ever consumed by chat template rendering, so it is
// read directly out of chat_template_kwargs (both the MINJA and Python-Jinja rendering
// paths default it to true when absent) rather than kept as a separate request field.
if (kwargsResult.value().has_value()) {
chatHistory.set_extra_context(kwargsResult.value().value());
}
17 changes: 14 additions & 3 deletions src/llm/io_processing/input_processors/chat_template_processor.cpp
@@ -67,7 +67,6 @@ absl::Status ChatTemplateProcessor::process(InputRequest& req) {
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "chatHistory.get_extra_context(): {}", chatHistory.get_extra_context().to_json_string());
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "tools: {}", chatHistory.get_tools().empty() ? std::string("<none>") : chatHistory.get_tools().to_json_string());
SPDLOG_LOGGER_TRACE(llm_calculator_logger, "chatTemplateKwargs: {}", chatHistory.get_extra_context().empty() ? std::string("<none>") : chatHistory.get_extra_context().to_json_string());
-SPDLOG_LOGGER_TRACE(llm_calculator_logger, "addGenerationPrompt: {}", true);
}

#if (PYTHON_DISABLE == 0)
@@ -82,9 +81,21 @@ absl::Status ChatTemplateProcessor::process(InputRequest& req) {
req.promptText = std::move(promptText);
} else {
#endif
-constexpr bool addGenerationPrompt = true;
const auto& tools = chatHistory.get_tools();
-const auto& kwargs = chatHistory.get_extra_context();
+// add_generation_prompt lives inside chat_template_kwargs; MINJA's apply_chat_template
+// takes it as a dedicated argument, so extract it here and drop it from the kwargs map
+// passed alongside so it isn't supplied twice.
+ov::genai::JsonContainer kwargs = chatHistory.get_extra_context();
+bool addGenerationPrompt = true;
+if (kwargs.contains("add_generation_prompt")) {
+const auto asBool = kwargs["add_generation_prompt"].as_bool();
+if (!asBool.has_value()) {
+return absl::Status(absl::StatusCode::kInvalidArgument,
+"add_generation_prompt accepts values true or false");
+}
+addGenerationPrompt = asBool.value();
+kwargs.erase("add_generation_prompt");
+}
const std::optional<ov::genai::JsonContainer> optTools =
tools.empty() ? std::nullopt : std::make_optional(tools);
const std::optional<ov::genai::JsonContainer> optKwargs =
1 change: 1 addition & 0 deletions src/llm/io_processing/input_request.hpp
@@ -37,6 +37,7 @@ using InputPayload = std::variant<ov::genai::ChatHistory, std::string>;
struct InputRequest {
InputPayload input; // set in parseRequest()
ov::genai::GenerationConfig generationConfig; // set in parseRequest()
// add_generation_prompt is folded into ChatHistory's extra_context (chat_template_kwargs).

std::string promptText; // written by ChatTemplateProcessor / RawPromptExtractor
ov::Tensor inputIds; // written by TokenizationProcessor (all paths)
10 changes: 8 additions & 2 deletions src/llm/py_jinja_template_processor.cpp
@@ -58,11 +58,17 @@ bool PyJinjaTemplateProcessor::applyChatTemplate(PyJinjaTemplateProcessor& templ
elif not isinstance(chat_template_kwargs, dict):
raise Exception("chat_template_kwargs must be an object")

+# add_generation_prompt is passed as part of chat_template_kwargs; pop it out so
+# it is not also supplied via **chat_template_kwargs below (duplicate keyword).
+add_generation_prompt = chat_template_kwargs.pop("add_generation_prompt", True)
+if not isinstance(add_generation_prompt, bool):
+raise Exception("add_generation_prompt accepts values true or false")
+
tools = request_json["tools"] if "tools" in request_json else None
if tools is None:
-output = chat_template.render(messages=messages, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=True, **chat_template_kwargs)
+output = chat_template.render(messages=messages, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=add_generation_prompt, **chat_template_kwargs)
else:
-output = tool_chat_template.render(messages=messages, tools=tools, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=True, **chat_template_kwargs)
+output = tool_chat_template.render(messages=messages, tools=tools, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=add_generation_prompt, **chat_template_kwargs)
except Exception as e:
error = str(e)
)",
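To illustrate why the Python-Jinja path pops `add_generation_prompt` before rendering, here is a small standalone sketch. `render` is a stand-in for a Jinja template's `render()` and `split_generation_prompt` is a hypothetical helper; neither exists in the PR. Unlike the diff, the sketch copies the dict and raises `ValueError` rather than a bare `Exception`.

```python
def render(**context):
    """Stand-in for a template's render(); it just returns the context it received."""
    return context


def split_generation_prompt(chat_template_kwargs):
    """Extract add_generation_prompt (default True) and return it with the remaining kwargs."""
    kwargs = dict(chat_template_kwargs)  # copy; the diff pops from the dict in place
    add_generation_prompt = kwargs.pop("add_generation_prompt", True)
    # Strict bool check: strings such as "yes" and ints such as 1 are rejected.
    if not isinstance(add_generation_prompt, bool):
        raise ValueError("add_generation_prompt accepts values true or false")
    return add_generation_prompt, kwargs


user_kwargs = {"add_generation_prompt": False, "enable_thinking": False}

# Without the pop, the key is supplied twice and Python rejects the call.
try:
    render(add_generation_prompt=True, **user_kwargs)
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True

# With the pop, the flag is passed exactly once, alongside the remaining kwargs.
flag, rest = split_generation_prompt(user_kwargs)
context = render(add_generation_prompt=flag, **rest)
```

The same extract-then-erase pattern appears in the MINJA path, where `apply_chat_template` takes the flag as a dedicated argument.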
37 changes: 37 additions & 0 deletions src/test/llm/input_processing/chat_template_processor_test.cpp
@@ -141,6 +141,43 @@ TEST_F(ChatTemplateProcessorTest, MultiTurnConversation_AllTurnsRendered) {
EXPECT_EQ(req.promptText, expected);
}

// add_generation_prompt=false (read out of chat_template_kwargs) must omit the
// trailing generation-prompt suffix while otherwise rendering normally.
TEST_F(ChatTemplateProcessorTest, AddGenerationPromptFalse_OmitsGenerationPromptSuffix) {
ov::genai::ChatHistory history;
history.push_back({{"role", "user"}, {"content", "What is OpenVINO?"}});
history.set_extra_context(ov::genai::JsonContainer::from_json_string(R"({"add_generation_prompt": false})"));

InputRequest req = makeChatRequest(std::move(history));
ChatTemplateProcessor processor(*sharedTokenizer);
const auto status = processor.process(req);

ASSERT_TRUE(status.ok()) << status.message();

const std::string expected =
std::string(SMOL_DEFAULT_SYSTEM) +
"<|im_start|>user\nWhat is OpenVINO?<|im_end|>\n";
EXPECT_EQ(req.promptText, expected);
EXPECT_EQ(req.promptText.find("<|im_start|>assistant"), std::string::npos)
<< "add_generation_prompt=false must omit the trailing generation prompt";
}

// A non-boolean add_generation_prompt must be rejected with a clear error
// instead of throwing an unhandled JsonContainer type-mismatch exception.
TEST_F(ChatTemplateProcessorTest, AddGenerationPromptNonBoolean_ReturnsInvalidArgument) {
ov::genai::ChatHistory history;
history.push_back({{"role", "user"}, {"content", "Hi."}});
history.set_extra_context(ov::genai::JsonContainer::from_json_string(R"({"add_generation_prompt": "yes"})"));

InputRequest req = makeChatRequest(std::move(history));
ChatTemplateProcessor processor(*sharedTokenizer);
const auto status = processor.process(req);

ASSERT_FALSE(status.ok());
EXPECT_EQ(status.code(), absl::StatusCode::kInvalidArgument);
EXPECT_NE(status.message().find("add_generation_prompt"), std::string::npos) << status.message();
}

// The processor must populate req.promptText and leave the ChatHistory
// variant in req.input intact (it does not replace the input variant).
TEST_F(ChatTemplateProcessorTest, PromptTextPopulated_ChatHistoryVariantPreserved) {
34 changes: 34 additions & 0 deletions src/test/llm/llmtemplate_test.cpp
@@ -167,6 +167,40 @@ TEST_F(LLMChatTemplateTest, ChatTemplateDefault) {
ASSERT_EQ(finalPrompt, expectedOutput);
}

// add_generation_prompt (passed via chat_template_kwargs) controls whether the
// trailing generation prompt is rendered (assistant prefill support, issue #3877).
TEST_F(LLMChatTemplateTest, ChatTemplateAddGenerationPromptDefaultsTrue) {
std::string jinja = "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}{% endfor %}{% if add_generation_prompt %}<|GEN|>{% endif %}";
ASSERT_TRUE(CreateJinjaConfig(jinja));
LoadTemplateProcessor();
std::string finalPrompt = "";
std::string payloadBody = R"(
{
"messages": [{ "role": "user", "content": "hi" }]
}
)";
ASSERT_EQ(PyJinjaTemplateProcessor::applyChatTemplate(servable->getProperties()->templateProcessor, payloadBody, finalPrompt), true);
ASSERT_NE(finalPrompt.find("<|GEN|>"), std::string::npos) << "default should add generation prompt, got: " << finalPrompt;
}

TEST_F(LLMChatTemplateTest, ChatTemplateAddGenerationPromptFalse) {
std::string jinja = "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}{% endfor %}{% if add_generation_prompt %}<|GEN|>{% endif %}";
ASSERT_TRUE(CreateJinjaConfig(jinja));
LoadTemplateProcessor();
std::string finalPrompt = "";
// add_generation_prompt is folded into chat_template_kwargs by the OpenAI API handler
// before applyChatTemplate is ever called; reflect that contract here.
std::string payloadBody = R"(
{
"messages": [{ "role": "user", "content": "hi" }, { "role": "assistant", "content": "partial" }],
"chat_template_kwargs": { "add_generation_prompt": false }
}
)";
ASSERT_EQ(PyJinjaTemplateProcessor::applyChatTemplate(servable->getProperties()->templateProcessor, payloadBody, finalPrompt), true);
ASSERT_EQ(finalPrompt.find("<|GEN|>"), std::string::npos) << "add_generation_prompt=false should omit generation prompt, got: " << finalPrompt;
ASSERT_NE(finalPrompt.find("partial"), std::string::npos) << "assistant prefill content should be present, got: " << finalPrompt;
}

TEST_F(LLMChatTemplateTest, ChatTemplateMultiMessage) {
CopyDefaultChatTemplate();
LoadTemplateProcessor();
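For readers without a Jinja environment at hand, the inline template used by the two new `LLMChatTemplateTest` cases can be approximated in plain Python. `render_test_template` is a hypothetical stand-in that mirrors only that one template string, not the Jinja engine or the model's real chat template.

```python
def render_test_template(messages, add_generation_prompt=True):
    """Mimic the test template: 'role: content' per message, then an optional <|GEN|> marker."""
    body = "".join(f"{m['role']}: {m['content']}" for m in messages)
    return body + ("<|GEN|>" if add_generation_prompt else "")


# Default case: the generation prompt marker is appended.
default_prompt = render_test_template([{"role": "user", "content": "hi"}])
print(default_prompt)  # user: hi<|GEN|>

# Prefill case: the marker is omitted and the partial assistant text ends the prompt.
prefill_prompt = render_test_template(
    [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "partial"},
    ],
    add_generation_prompt=False,
)
print(prefill_prompt)  # user: hiassistant: partial
```

The second prompt ends with the assistant's partial content, which is what lets the model continue that turn instead of opening a new one.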