diff --git a/docs/model_server_rest_api_chat.md b/docs/model_server_rest_api_chat.md index 951694f8f4..2888d61dc0 100644 --- a/docs/model_server_rest_api_chat.md +++ b/docs/model_server_rest_api_chat.md @@ -220,7 +220,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc | tools | ✅ | ✅ | ✅ | array | A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) for more details. | | tool_choice | ✅ | ✅ | ✅ | string or object | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular tool via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice) for more details. | | response_format | ✅ | ✅ | ✅ | object | An object specifying the format that the model must output. Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will match your supplied JSON schema according to [OpenAI reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). Learn more in the [Structured Outputs demo](../demos/continuous_batching/structured_output/README.md). Additionally, `response_format` can accept [XGrammar structural tags format](https://github.com/mlc-ai/xgrammar/blob/v0.1.26/docs/tutorials/structural_tag.md#format-types) (not part of OpenAI API). For example: `{ "type": "const_string", "value": "Hello World!" }`. 
**Note** that if model server fails to process the format, the request will still be processed, but the format will not be imposed. | -| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to chat template engine. Example `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause error. | +| chat_template_kwargs | ✅ | ❌ | ✅ | object | Enables passing additional parameters to the chat template engine. Example: `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause an error. Also accepts `add_generation_prompt` (bool, default: `true`), which controls whether to append the chat template's generation prompt (the marker that signals the model to start a new assistant turn). Set to `false` to render the conversation without a trailing generation prompt, e.g. `{"add_generation_prompt": false}`. This is useful for assistant prefill, where the final `assistant` message should be continued rather than treated as a completed turn. Applies to both the Python-Jinja and MINJA chat template paths. | | skip_special_tokens | ✅ | ❌ | ✅ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. 
| #### Beam search sampling specific diff --git a/docs/model_server_rest_api_responses.md b/docs/model_server_rest_api_responses.md index b097ca69c7..2ff2f2c246 100644 --- a/docs/model_server_rest_api_responses.md +++ b/docs/model_server_rest_api_responses.md @@ -104,7 +104,7 @@ curl http://localhost/v3/responses \ | tools | ⚠️ | ✅ | array (optional) | A list of tools the model may call. Currently, only **function** tools are supported. OpenAI also supports built-in tools (web_search, file_search, code_interpreter, etc.) and MCP tools. OVMS additionally accepts a flat `{type, name, parameters}` format alongside the nested `{type, function: {name, parameters}}` format. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) for more details. | | tool_choice | ✅ | ✅ | string or object (optional) | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular function via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. | | reasoning | ⚠️ | ✅ | object (optional) | Configuration for reasoning/thinking mode. The `effort` field accepts `"low"`, `"medium"`, or `"high"` — any value enables thinking mode (`enable_thinking: true` is injected into chat template kwargs). The `summary` field is accepted but ignored. | -| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. | +| chat_template_kwargs | ✅ | ❌ | object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. 
Also accepts `add_generation_prompt` (bool, default: `true`) — whether to append the chat template's generation prompt (the marker that signals the model to start a new assistant turn). Set to `false` to render the conversation without a trailing generation prompt, e.g. `{"add_generation_prompt": false}` — useful for assistant prefill where the final `assistant` message should be continued rather than treated as a completed turn. Applies to both the Python-Jinja and MINJA chat template paths. | | skip_special_tokens | ✅ | ❌ | bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. | | stream_options | ❌ | ❌ | | Not supported in Responses API. Usage statistics are always included in the `response.completed` event. | diff --git a/src/llm/apis/openai_api_handler.cpp b/src/llm/apis/openai_api_handler.cpp index c7dc3a93cc..7a66ec3246 100644 --- a/src/llm/apis/openai_api_handler.cpp +++ b/src/llm/apis/openai_api_handler.cpp @@ -358,6 +358,9 @@ absl::StatusOr OpenAIApiHandler::extractInputRequest(GenerationCon if (!kwargsResult.ok()) { return kwargsResult.status(); } + // add_generation_prompt is only ever consumed by chat template rendering, so it is + // read directly out of chat_template_kwargs (both the MINJA and Python-Jinja rendering + // paths default it to true when absent) rather than kept as a separate request field. 
if (kwargsResult.value().has_value()) { chatHistory.set_extra_context(kwargsResult.value().value()); } diff --git a/src/llm/io_processing/input_processors/chat_template_processor.cpp b/src/llm/io_processing/input_processors/chat_template_processor.cpp index 8966cb33f8..25373b451d 100644 --- a/src/llm/io_processing/input_processors/chat_template_processor.cpp +++ b/src/llm/io_processing/input_processors/chat_template_processor.cpp @@ -67,7 +67,6 @@ absl::Status ChatTemplateProcessor::process(InputRequest& req) { SPDLOG_LOGGER_TRACE(llm_calculator_logger, "chatHistory.get_extra_context(): {}", chatHistory.get_extra_context().to_json_string()); SPDLOG_LOGGER_TRACE(llm_calculator_logger, "tools: {}", chatHistory.get_tools().empty() ? std::string("") : chatHistory.get_tools().to_json_string()); SPDLOG_LOGGER_TRACE(llm_calculator_logger, "chatTemplateKwargs: {}", chatHistory.get_extra_context().empty() ? std::string("") : chatHistory.get_extra_context().to_json_string()); - SPDLOG_LOGGER_TRACE(llm_calculator_logger, "addGenerationPrompt: {}", true); } #if (PYTHON_DISABLE == 0) @@ -82,9 +81,21 @@ absl::Status ChatTemplateProcessor::process(InputRequest& req) { req.promptText = std::move(promptText); } else { #endif - constexpr bool addGenerationPrompt = true; const auto& tools = chatHistory.get_tools(); - const auto& kwargs = chatHistory.get_extra_context(); + // add_generation_prompt lives inside chat_template_kwargs; MINJA's apply_chat_template + // takes it as a dedicated argument, so extract it here and drop it from the kwargs map + // passed alongside so it isn't supplied twice. 
+ ov::genai::JsonContainer kwargs = chatHistory.get_extra_context(); + bool addGenerationPrompt = true; + if (kwargs.contains("add_generation_prompt")) { + const auto asBool = kwargs["add_generation_prompt"].as_bool(); + if (!asBool.has_value()) { + return absl::Status(absl::StatusCode::kInvalidArgument, + "add_generation_prompt accepts values true or false"); + } + addGenerationPrompt = asBool.value(); + kwargs.erase("add_generation_prompt"); + } const std::optional optTools = tools.empty() ? std::nullopt : std::make_optional(tools); const std::optional optKwargs = diff --git a/src/llm/io_processing/input_request.hpp b/src/llm/io_processing/input_request.hpp index af382ee400..ec2fce300f 100644 --- a/src/llm/io_processing/input_request.hpp +++ b/src/llm/io_processing/input_request.hpp @@ -37,6 +37,7 @@ using InputPayload = std::variant; struct InputRequest { InputPayload input; // set in parseRequest() ov::genai::GenerationConfig generationConfig; // set in parseRequest() + // add_generation_prompt is folded into ChatHistory's extra_context (chat_template_kwargs). std::string promptText; // written by ChatTemplateProcessor / RawPromptExtractor ov::Tensor inputIds; // written by TokenizationProcessor (all paths) diff --git a/src/llm/py_jinja_template_processor.cpp b/src/llm/py_jinja_template_processor.cpp index 235e374ea8..abd4c16de8 100644 --- a/src/llm/py_jinja_template_processor.cpp +++ b/src/llm/py_jinja_template_processor.cpp @@ -58,11 +58,17 @@ bool PyJinjaTemplateProcessor::applyChatTemplate(PyJinjaTemplateProcessor& templ elif not isinstance(chat_template_kwargs, dict): raise Exception("chat_template_kwargs must be an object") + # add_generation_prompt is passed as part of chat_template_kwargs; pop it out so + # it is not also supplied via **chat_template_kwargs below (duplicate keyword). 
+ add_generation_prompt = chat_template_kwargs.pop("add_generation_prompt", True) + if not isinstance(add_generation_prompt, bool): + raise Exception("add_generation_prompt accepts values true or false") + tools = request_json["tools"] if "tools" in request_json else None if tools is None: - output = chat_template.render(messages=messages, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=True, **chat_template_kwargs) + output = chat_template.render(messages=messages, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=add_generation_prompt, **chat_template_kwargs) else: - output = tool_chat_template.render(messages=messages, tools=tools, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=True, **chat_template_kwargs) + output = tool_chat_template.render(messages=messages, tools=tools, bos_token=bos_token, eos_token=eos_token, add_generation_prompt=add_generation_prompt, **chat_template_kwargs) except Exception as e: error = str(e) )", diff --git a/src/test/llm/input_processing/chat_template_processor_test.cpp b/src/test/llm/input_processing/chat_template_processor_test.cpp index e61060173e..1cbee1ae65 100644 --- a/src/test/llm/input_processing/chat_template_processor_test.cpp +++ b/src/test/llm/input_processing/chat_template_processor_test.cpp @@ -141,6 +141,43 @@ TEST_F(ChatTemplateProcessorTest, MultiTurnConversation_AllTurnsRendered) { EXPECT_EQ(req.promptText, expected); } +// add_generation_prompt=false (read out of chat_template_kwargs) must omit the +// trailing generation-prompt suffix while otherwise rendering normally. 
+TEST_F(ChatTemplateProcessorTest, AddGenerationPromptFalse_OmitsGenerationPromptSuffix) { + ov::genai::ChatHistory history; + history.push_back({{"role", "user"}, {"content", "What is OpenVINO?"}}); + history.set_extra_context(ov::genai::JsonContainer::from_json_string(R"({"add_generation_prompt": false})")); + + InputRequest req = makeChatRequest(std::move(history)); + ChatTemplateProcessor processor(*sharedTokenizer); + const auto status = processor.process(req); + + ASSERT_TRUE(status.ok()) << status.message(); + + const std::string expected = + std::string(SMOL_DEFAULT_SYSTEM) + + "<|im_start|>user\nWhat is OpenVINO?<|im_end|>\n"; + EXPECT_EQ(req.promptText, expected); + EXPECT_EQ(req.promptText.find("<|im_start|>assistant"), std::string::npos) + << "add_generation_prompt=false must omit the trailing generation prompt"; +} + +// A non-boolean add_generation_prompt must be rejected with a clear error +// instead of throwing an unhandled JsonContainer type-mismatch exception. +TEST_F(ChatTemplateProcessorTest, AddGenerationPromptNonBoolean_ReturnsInvalidArgument) { + ov::genai::ChatHistory history; + history.push_back({{"role", "user"}, {"content", "Hi."}}); + history.set_extra_context(ov::genai::JsonContainer::from_json_string(R"({"add_generation_prompt": "yes"})")); + + InputRequest req = makeChatRequest(std::move(history)); + ChatTemplateProcessor processor(*sharedTokenizer); + const auto status = processor.process(req); + + ASSERT_FALSE(status.ok()); + EXPECT_EQ(status.code(), absl::StatusCode::kInvalidArgument); + EXPECT_NE(status.message().find("add_generation_prompt"), std::string::npos) << status.message(); +} + // The processor must populate req.promptText and leave the ChatHistory // variant in req.input intact (it does not replace the input variant). 
TEST_F(ChatTemplateProcessorTest, PromptTextPopulated_ChatHistoryVariantPreserved) { diff --git a/src/test/llm/llmtemplate_test.cpp b/src/test/llm/llmtemplate_test.cpp index e51833cb3b..15168d49cb 100644 --- a/src/test/llm/llmtemplate_test.cpp +++ b/src/test/llm/llmtemplate_test.cpp @@ -167,6 +167,40 @@ TEST_F(LLMChatTemplateTest, ChatTemplateDefault) { ASSERT_EQ(finalPrompt, expectedOutput); } +// add_generation_prompt (passed via chat_template_kwargs) controls whether the trailing +// generation prompt is rendered (assistant prefill support, issue #3877). +TEST_F(LLMChatTemplateTest, ChatTemplateAddGenerationPromptDefaultsTrue) { + std::string jinja = "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}{% endfor %}{% if add_generation_prompt %}<|GEN|>{% endif %}"; + ASSERT_TRUE(CreateJinjaConfig(jinja)); + LoadTemplateProcessor(); + std::string finalPrompt = ""; + std::string payloadBody = R"( + { + "messages": [{ "role": "user", "content": "hi" }] + } + )"; + ASSERT_EQ(PyJinjaTemplateProcessor::applyChatTemplate(servable->getProperties()->templateProcessor, payloadBody, finalPrompt), true); + ASSERT_NE(finalPrompt.find("<|GEN|>"), std::string::npos) << "default should add generation prompt, got: " << finalPrompt; +} + +TEST_F(LLMChatTemplateTest, ChatTemplateAddGenerationPromptFalse) { + std::string jinja = "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}{% endfor %}{% if add_generation_prompt %}<|GEN|>{% endif %}"; + ASSERT_TRUE(CreateJinjaConfig(jinja)); + LoadTemplateProcessor(); + std::string finalPrompt = ""; + // add_generation_prompt is supplied inside chat_template_kwargs rather than as a + // separate request field; reflect that contract here. 
+ std::string payloadBody = R"( + { + "messages": [{ "role": "user", "content": "hi" }, { "role": "assistant", "content": "partial" }], + "chat_template_kwargs": { "add_generation_prompt": false } + } + )"; + ASSERT_EQ(PyJinjaTemplateProcessor::applyChatTemplate(servable->getProperties()->templateProcessor, payloadBody, finalPrompt), true); + ASSERT_EQ(finalPrompt.find("<|GEN|>"), std::string::npos) << "add_generation_prompt=false should omit generation prompt, got: " << finalPrompt; + ASSERT_NE(finalPrompt.find("partial"), std::string::npos) << "assistant prefill content should be present, got: " << finalPrompt; +} + TEST_F(LLMChatTemplateTest, ChatTemplateMultiMessage) { CopyDefaultChatTemplate(); LoadTemplateProcessor();