feat(instrumentation): Add litellm package instrumentation#4322
Conversation
📝 WalkthroughWalkthroughAdds a new LiteLLM OpenTelemetry instrumentation package with span, event, and metric handling, plus SDK integration, tests, and sample tool-calling apps. ChangesLiteLLM OpenTelemetry Instrumentation Package
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.23.0)packages/opentelemetry-instrumentation-litellm/tests/test_completion.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.12][ERROR]: unable to find a config; path packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.14][ERROR]: unable to find a config; path packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.py┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.26][ERROR]: unable to find a config; path
Comment |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cdbc173 to
839b787
Compare
max-deygin-traceloop
left a comment
There was a problem hiding this comment.
Semconv + Finish Reason Compliance Review
Overall this is a well-structured instrumentation. The parts-based message schema, provider normalization, streaming accumulation, and nesting suppression are all solid. Below are findings organized by severity.
P1 -- Embedding input uses deprecated indexed gen_ai.prompt.{i}.* pattern
File: __init__.py, _set_prompts() (embedding branch)
for index, prompt in enumerate(embedding_input or []):
_set_span_attribute(
span, f"{GenAIAttributes.GEN_AI_PROMPT}.{index}.role", "user"
)
_set_span_attribute(
span, f"{GenAIAttributes.GEN_AI_PROMPT}.{index}.content", prompt
)GEN_AI_PROMPT is explicitly deprecated upstream (Deprecated: Removed, no replacement at this time). The OpenAI instrumentation in this same repo already uses gen_ai.input.messages JSON for embeddings:
# opentelemetry-instrumentation-openai/shared/embeddings_wrappers.py
messages = [
{"role": "user", "parts": [{"type": "text", "content": p}]}
for p in prompts
]
_set_span_attribute(span, GenAIAttributes.GEN_AI_INPUT_MESSAGES, json.dumps(messages))Fix: Replace the indexed pattern with gen_ai.input.messages JSON, matching the OpenAI embeddings instrumentation. Update test_embeddings.py to assert the JSON structure.
P2 -- Events path uses "unknown" fallback instead of ""
File: __init__.py, _emit_choice_events()
finish_reason=_map_finish_reason(choice.get("finish_reason")) or "unknown",When _map_finish_reason returns "" (for None input), the or "unknown" kicks in. This diverges from the repo-wide convention of "" for unavailable finish reasons (used by OpenAI, Anthropic, Google GenAI, and OpenAI Agents instrumentors in this repo). The ChoiceEvent dataclass also defaults to "unknown".
The attribute path (_set_output_messages) correctly uses "" -- the event path should match.
Fix: Change or "unknown" to just pass through the "", and update ChoiceEvent.finish_reason default from "unknown" to "".
P2 -- gen_ai.tool.definitions not captured
LiteLLM's completion() accepts a tools kwarg. The instrumentation doesn't capture tool definitions in gen_ai.tool.definitions (opt-in, gated by should_send_prompts()). The OpenAI instrumentation sets this. Not a blocker but worth adding for parity -- tool-calling is a key LiteLLM use-case.
P2 -- Streaming tests don't verify finish_reason
test_streaming_completion and test_astreaming_completion verify output message content but don't assert:
output[0]["finish_reason"](per-message)attrs[GenAIAttributes.GEN_AI_RESPONSE_FINISH_REASONS](top-level span attr)
The non-streaming test_completion does verify both. Add the same assertions to streaming tests to catch any streaming-specific finish_reason accumulation regressions.
P3 -- Minor: key order inconsistency in _map_content_parts
# str branch -- content first
return [{"content": content, "type": "text"}]
# dict branch in _map_content_block -- type first
return {"type": "text", "content": block.get("text", "")}Semantically identical but inconsistent. Prefer {"type": "text", "content": ...} everywhere (type-first, matching the OTel schema examples).
Finish Reason Checklist
| Check | Status |
|---|---|
_map_finish_reason(None) returns "" (per-message fallback) |
Pass |
_map_finish_reason("tool_calls") returns "tool_call" (plural to singular) |
Pass |
_map_finish_reason("function_call") returns "tool_call" (legacy) |
Pass |
Per-message finish_reason always present in output JSON (required by schema) |
Pass |
Per-message finish_reason is always a string, never null |
Pass |
Top-level gen_ai.response.finish_reasons omitted when no meaningful reasons |
Pass |
Top-level attr NOT gated by should_send_prompts() |
Pass |
| Top-level attr deduplicates while preserving order | Pass |
| Streaming accumulates finish_reason correctly | Pass |
| All code paths (sync, async, streaming sync, streaming async, error) covered | Pass |
Semconv Checklist
| Check | Status |
|---|---|
gen_ai.operation.name set at span creation |
Pass (chat / embeddings) |
gen_ai.provider.name set (not deprecated gen_ai.system) |
Pass |
| Provider aliases normalized to OTel well-known values | Pass |
gen_ai.input.messages uses parts-based JSON (chat path) |
Pass |
gen_ai.output.messages uses parts-based JSON with finish_reason |
Pass |
Text parts use {"type": "text", "content": ...} (not "text" key) |
Pass |
Tool calls use {"type": "tool_call", ...} with parsed arguments |
Pass |
Tool responses use {"type": "tool_call_response", ...} |
Pass |
Image content mapped to BlobPart/UriPart |
Pass |
gen_ai.input.messages for embeddings (not deprecated indexed) |
Fail (P1 above) |
No indexed gen_ai.prompt.{i}.* / gen_ai.completion.{i}.* emitted alongside JSON |
N/A (only embedding path uses indexed) |
semconv-ai version range >=0.5.1,<0.6.0 compatible with other packages |
Pass |
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1)
826-827: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPreserve the LiteLLM init traceback.
Line 827 only logs
str(e), which hides whether the failure came from import, constructor wiring, or patching.logging.exception(...)orexc_info=Truewould make broken instrumentation much easier to debug.Proposed fix
- except Exception as e: - logging.error(f"Error initializing LiteLLM instrumentor: {e}") + except Exception: + logging.exception("Error initializing LiteLLM instrumentor")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py` around lines 826 - 827, The LiteLLM instrumentor initialization error handling in tracing.py only logs the exception message, which drops the traceback and makes the failure source unclear. Update the exception handler around the LiteLLM initialization path in the tracing module to log with traceback information using the existing logging call site, so failures during import, constructor setup, or patching are all visible when debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py`:
- Around line 831-834: The exception counter in the Litellm instrumentation is
mislabeled with a time unit even though it records occurrences. Update the
metric declaration in the counter setup near the exception_counter creation to
use a count-like unit instead of unit="time", keeping the existing
Meters.LLM_COMPLETIONS_EXCEPTIONS counter and its description intact.
- Around line 748-754: The streaming error path in _finish_error currently
ignores the provider already resolved by _accumulate_chunk and re-derives it
from kwargs, which can lose the real provider for exception telemetry. Update
_finish_error in the streaming helper to preserve and pass through the cached
resolved provider state from the chunk accumulator when calling
_handle_exception, so failures after the first chunk report under the correct
provider instead of falling back to generic litellm.
- Around line 645-657: The duration timing in the LiteLLM instrumentation
currently uses a wall-clock source, which can produce skewed or negative values
when the system clock changes. Update the timing in the wrapped call path and
the streaming/direct histogram paths to use a monotonic clock instead of
time.time(), keeping the existing start/end measurement flow in __init__.py and
the related span/metrics handling intact.
- Around line 882-883: The _uninstrument() method in the LiteLLM instrumentor
still assumes litellm is installed, which can raise ModuleNotFoundError during
cleanup even though _instrument() treats it as optional. Update _uninstrument()
to guard the import of litellm and return early when the package is missing,
keeping the cleanup path safe in environments without LiteLLM.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.py`:
- Around line 5-7: The _FunctionToolCall TypedDict does not match the runtime
tool-call payload shape used by ToolCall and downstream consumers. Update the
typed contract in event_models.py so the nested function data is represented
with a name field matching tool_call["function"]["name"], and keep arguments as
the raw emitted value instead of a dict-only shape. Make sure any related
ToolCall typing or helpers in this module stay consistent with the actual
emitter schema.
- Around line 18-23: `CompletionMessage.role` is currently annotated like it has
a default, but `TypedDict` does not enforce or supply runtime defaults. Update
`CompletionMessage` in `event_models.py` to make `role` a required field, and
then verify the Litellm emitter/consumer code that reads `event.message["role"]`
is aligned with that contract. If any call sites can still omit the field,
adjust them to set `role` explicitly or switch those reads to a safe
`get("role")` access.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/utils.py`:
- Around line 40-41: The dont_throw path in the utils helper should remain
non-throwing even if Config.exception_logger itself fails. Update the exception
handling around the Config.exception_logger call so any error from that callback
is caught in its own try/except and only logged, while preserving the original
swallowed behavior in the surrounding wrapper. Use the dont_throw logic and
Config.exception_logger reference to locate the fix.
In `@packages/opentelemetry-instrumentation-litellm/README.md`:
- Around line 3-5: Add alt text to the PyPI badge image in the README by
updating the inline badge markup so the img element includes a descriptive alt
attribute; locate the badge link near the top of the README and keep the
existing badge URL while adding accessible text that identifies the
package/version badge for screen readers.
In `@packages/opentelemetry-instrumentation-litellm/tests/conftest.py`:
- Around line 56-67: The fixture teardown in instrument_legacy is not protected
against failures during LiteLLMInstrumentor.instrument, which can leave partial
patches behind and affect later tests. Update instrument_legacy to follow the
same try/finally pattern used by instrument_with_content: create the
LiteLLMInstrumentor, run instrument(...) in the protected block, yield the
instrumentor, and ensure instrumentor.uninstrument() always runs in the finally
path even if instrumentation raises.
In `@packages/opentelemetry-instrumentation-litellm/tests/test_nesting.py`:
- Around line 66-87: The nesting test only verifies the top-level LiteLLM span
because mock_response="hello" may bypass the OpenAI SDK entirely. Update
test_both_openai_and_litellm_instrumented to drive a stubbed provider path that
actually reaches OpenAIInstrumentor, and then assert the downstream OpenAI
boundary behavior as well as the litellm.chat span count so the test proves both
instrumentors interact correctly without double-counting.
In `@packages/sample-app/sample_app/litellm_async_streaming.py`:
- Around line 98-102: The get_current_weather tool in litellm_async_streaming.py
currently returns the fake DB temperature unchanged even when unit is
"fahrenheit". Update get_current_weather to detect Fahrenheit requests and
convert the Celsius value before building the json.dumps response, while keeping
the existing behavior for Celsius and preserving the same function signature and
output shape.
In `@packages/sample-app/sample_app/litellm_completion.py`:
- Around line 117-121: The issue is that issue_refund always returns a
successful refund for any model-supplied order_id and amount. Update
issue_refund to validate the order against known refund-eligible records first,
including that the order exists, was delivered/paid, and the requested amount is
within the refundable limit; only then return the refunded response. If the
order is unknown or ineligible, have issue_refund return a non-success result
instead of confirming the refund.
In `@packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py`:
- Around line 151-158: The span post-processing in tracing.py can prevent export
when span_postprocess_callback() raises because original_on_end() is skipped.
Update the on_end wrapper to catch and log callback failures in the
span_postprocess_callback path, always restore attributes._immutable in the
finally block, and then unconditionally call original_on_end(span) so the real
processor still receives every span.
---
Nitpick comments:
In `@packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py`:
- Around line 826-827: The LiteLLM instrumentor initialization error handling in
tracing.py only logs the exception message, which drops the traceback and makes
the failure source unclear. Update the exception handler around the LiteLLM
initialization path in the tracing module to log with traceback information
using the existing logging call site, so failures during import, constructor
setup, or patching are all visible when debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4ac35d93-7407-45c9-b4b0-0be118e912aa
⛔ Files ignored due to path filters (3)
packages/opentelemetry-instrumentation-litellm/uv.lockis excluded by!**/*.lockpackages/sample-app/uv.lockis excluded by!**/*.lockpackages/traceloop-sdk/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (23)
packages/opentelemetry-instrumentation-litellm/README.mdpackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/config.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_emitter.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/utils.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/version.pypackages/opentelemetry-instrumentation-litellm/project.jsonpackages/opentelemetry-instrumentation-litellm/pyproject.tomlpackages/opentelemetry-instrumentation-litellm/pytest.inipackages/opentelemetry-instrumentation-litellm/tests/conftest.pypackages/opentelemetry-instrumentation-litellm/tests/test_completion.pypackages/opentelemetry-instrumentation-litellm/tests/test_custom_provider.pypackages/opentelemetry-instrumentation-litellm/tests/test_embeddings.pypackages/opentelemetry-instrumentation-litellm/tests/test_events.pypackages/opentelemetry-instrumentation-litellm/tests/test_nesting.pypackages/opentelemetry-instrumentation-litellm/tests/test_provider_resolution.pypackages/opentelemetry-instrumentation-litellm/tests/test_streaming.pypackages/sample-app/sample_app/litellm_async_streaming.pypackages/sample-app/sample_app/litellm_completion.pypackages/traceloop-sdk/pyproject.tomlpackages/traceloop-sdk/traceloop/sdk/instruments.pypackages/traceloop-sdk/traceloop/sdk/tracing/tracing.py
| start = time.time() | ||
| token = context_api.attach( | ||
| context_api.set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True) | ||
| ) | ||
| try: | ||
| response = wrapped(*args, **kwargs) | ||
| except Exception as error: | ||
| _handle_exception(span, metrics, error, model, kwargs) | ||
| raise | ||
| finally: | ||
| context_api.detach(token) | ||
|
|
||
| duration = time.time() - start |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Use a monotonic clock for duration measurement.
time.time() can jump on wall-clock adjustments, so both the direct and streaming histograms can record skewed or even negative durations. Capture elapsed time with a monotonic source instead.
Proposed fix
- start = time.time()
+ start = time.perf_counter()
@@
- duration = time.time() - start
+ duration = time.perf_counter() - start
@@
- start = time.time()
+ start = time.perf_counter()
@@
- duration = time.time() - start
+ duration = time.perf_counter() - start
@@
- time.time() - start, accumulated_provider=accumulated.get("provider"),
+ time.perf_counter() - start, accumulated_provider=accumulated.get("provider"),Also applies to: 681-693, 706-710
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py`
around lines 645 - 657, The duration timing in the LiteLLM instrumentation
currently uses a wall-clock source, which can produce skewed or negative values
when the system clock changes. Update the timing in the wrapped call path and
the streaming/direct histogram paths to use a monotonic clock instead of
time.time(), keeping the existing start/end measurement flow in __init__.py and
the related span/metrics handling intact.
| def _finish_error(self, error): | ||
| if self._self_done: | ||
| return | ||
| self._self_done = True | ||
| _handle_exception( | ||
| self._self_span, self._self_metrics, error, self._self_model, self._self_kwargs | ||
| ) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Preserve the resolved provider on streaming error paths.
_accumulate_chunk() already captures the provider once a chunk exposes it, but _finish_error() ignores that state and falls back to kwargs-only resolution. A stream that fails after the first chunk can therefore emit exception telemetry under generic litellm instead of the real provider.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py`
around lines 748 - 754, The streaming error path in _finish_error currently
ignores the provider already resolved by _accumulate_chunk and re-derives it
from kwargs, which can lose the real provider for exception telemetry. Update
_finish_error in the streaming helper to preserve and pass through the cached
resolved provider state from the chunk accumulator when calling
_handle_exception, so failures after the first chunk report under the correct
provider instead of falling back to generic litellm.
| "exception_counter": meter.create_counter( | ||
| name=Meters.LLM_COMPLETIONS_EXCEPTIONS, | ||
| unit="time", | ||
| description="Number of exceptions occurred during chat completions", |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Fix the exception counter unit.
This counter tracks occurrences, but it is declared with unit="time". That mislabels the metric in downstream backends and dashboards. Use a count-like unit instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py`
around lines 831 - 834, The exception counter in the Litellm instrumentation is
mislabeled with a time unit even though it records occurrences. Update the
metric declaration in the counter setup near the exception_counter creation to
use a count-like unit instead of unit="time", keeping the existing
Meters.LLM_COMPLETIONS_EXCEPTIONS counter and its description intact.
| def _uninstrument(self, **kwargs): | ||
| import litellm |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Make _uninstrument() tolerate a missing optional LiteLLM install.
_instrument() already treats litellm as optional, but this unconditional import will raise ModuleNotFoundError during cleanup in the same environment. Guard the import and return early.
Proposed fix
def _uninstrument(self, **kwargs):
- import litellm
+ try:
+ import litellm
+ except ModuleNotFoundError:
+ return📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _uninstrument(self, **kwargs): | |
| import litellm | |
| def _uninstrument(self, **kwargs): | |
| try: | |
| import litellm | |
| except ModuleNotFoundError: | |
| return |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.py`
around lines 882 - 883, The _uninstrument() method in the LiteLLM instrumentor
still assumes litellm is installed, which can raise ModuleNotFoundError during
cleanup even though _instrument() treats it as optional. Update _uninstrument()
to guard the import of litellm and return early when the package is missing,
keeping the cleanup path safe in environments without LiteLLM.
| class _FunctionToolCall(TypedDict): | ||
| function_name: str | ||
| arguments: Optional[dict[str, Any]] |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Align ToolCall.function with the actual payload shape.
Downstream code and tests use tool_call["function"]["name"] and preserve arguments as a raw value, but this TypedDict advertises function_name plus dict arguments. That makes the public type contract diverge from the runtime schema and will mislead any typed caller integrating with the emitter.
Proposed fix
class _FunctionToolCall(TypedDict):
- function_name: str
- arguments: Optional[dict[str, Any]]
+ name: str
+ arguments: Optional[Any]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class _FunctionToolCall(TypedDict): | |
| function_name: str | |
| arguments: Optional[dict[str, Any]] | |
| class _FunctionToolCall(TypedDict): | |
| name: str | |
| arguments: Optional[Any] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.py`
around lines 5 - 7, The _FunctionToolCall TypedDict does not match the runtime
tool-call payload shape used by ToolCall and downstream consumers. Update the
typed contract in event_models.py so the nested function data is represented
with a name field matching tool_call["function"]["name"], and keep arguments as
the raw emitted value instead of a dict-only shape. Make sure any related
ToolCall typing or helpers in this module stay consistent with the actual
emitter schema.
| @pytest.fixture(scope="function") | ||
| def instrument_legacy(tracer_provider, meter_provider): | ||
| instrumentor = LiteLLMInstrumentor() | ||
| instrumentor.instrument( | ||
| tracer_provider=tracer_provider, | ||
| meter_provider=meter_provider, | ||
| ) | ||
|
|
||
| yield instrumentor | ||
|
|
||
| instrumentor.uninstrument() | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Wrap instrument_legacy teardown in try/finally.
If instrumentor.instrument(...) fails after partially patching LiteLLM, this fixture skips uninstrument() and can leak wrappers into later tests. Mirror the instrument_with_content pattern here.
Suggested fix
`@pytest.fixture`(scope="function")
def instrument_legacy(tracer_provider, meter_provider):
instrumentor = LiteLLMInstrumentor()
- instrumentor.instrument(
- tracer_provider=tracer_provider,
- meter_provider=meter_provider,
- )
-
- yield instrumentor
-
- instrumentor.uninstrument()
+ try:
+ instrumentor.instrument(
+ tracer_provider=tracer_provider,
+ meter_provider=meter_provider,
+ )
+ yield instrumentor
+ finally:
+ instrumentor.uninstrument()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @pytest.fixture(scope="function") | |
| def instrument_legacy(tracer_provider, meter_provider): | |
| instrumentor = LiteLLMInstrumentor() | |
| instrumentor.instrument( | |
| tracer_provider=tracer_provider, | |
| meter_provider=meter_provider, | |
| ) | |
| yield instrumentor | |
| instrumentor.uninstrument() | |
| `@pytest.fixture`(scope="function") | |
| def instrument_legacy(tracer_provider, meter_provider): | |
| instrumentor = LiteLLMInstrumentor() | |
| try: | |
| instrumentor.instrument( | |
| tracer_provider=tracer_provider, | |
| meter_provider=meter_provider, | |
| ) | |
| yield instrumentor | |
| finally: | |
| instrumentor.uninstrument() |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/opentelemetry-instrumentation-litellm/tests/conftest.py` around
lines 56 - 67, The fixture teardown in instrument_legacy is not protected
against failures during LiteLLMInstrumentor.instrument, which can leave partial
patches behind and affect later tests. Update instrument_legacy to follow the
same try/finally pattern used by instrument_with_content: create the
LiteLLMInstrumentor, run instrument(...) in the protected block, yield the
instrumentor, and ensure instrumentor.uninstrument() always runs in the finally
path even if instrumentation raises.
| def test_both_openai_and_litellm_instrumented(span_exporter, tracer_provider, meter_provider): | ||
| """Instrumenting both must not raise and must not double-count.""" | ||
| pytest.importorskip("opentelemetry.instrumentation.openai") | ||
| from opentelemetry.instrumentation.litellm import LiteLLMInstrumentor | ||
| from opentelemetry.instrumentation.openai import OpenAIInstrumentor | ||
|
|
||
| openai_instrumentor = OpenAIInstrumentor() | ||
| litellm_instrumentor = LiteLLMInstrumentor() | ||
| try: | ||
| openai_instrumentor.instrument( | ||
| tracer_provider=tracer_provider, meter_provider=meter_provider | ||
| ) | ||
| litellm_instrumentor.instrument( | ||
| tracer_provider=tracer_provider, meter_provider=meter_provider | ||
| ) | ||
| litellm.completion( | ||
| model="gpt-3.5-turbo", | ||
| messages=MESSAGES, | ||
| mock_response="hello", | ||
| ) | ||
| spans = span_exporter.get_finished_spans() | ||
| assert [span.name for span in spans] == ["litellm.chat"] |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Exercise the OpenAI boundary here
mock_response="hello" can short-circuit before the OpenAI SDK is invoked, so this only proves the litellm.chat span. Use a stubbed provider call that actually reaches OpenAIInstrumentor, or assert suppression at that downstream boundary too.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/opentelemetry-instrumentation-litellm/tests/test_nesting.py` around
lines 66 - 87, The nesting test only verifies the top-level LiteLLM span because
mock_response="hello" may bypass the OpenAI SDK entirely. Update
test_both_openai_and_litellm_instrumented to drive a stubbed provider path that
actually reaches OpenAIInstrumentor, and then assert the downstream OpenAI
boundary behavior as well as the litellm.chat span count so the test proves both
instrumentors interact correctly without double-counting.
| def get_current_weather(location: str, unit: str = "celsius") -> str: | ||
| """Pretend tool implementation — a real one would call a weather API.""" | ||
| fake_db = {"paris": 18, "london": 14, "new york": 22} | ||
| temp = fake_db.get(location.lower(), 20) | ||
| return json.dumps({"location": location, "temperature": temp, "unit": unit}) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Convert temperatures when unit="fahrenheit".
The tool schema allows Fahrenheit, but the fake DB values are returned unchanged, so 18°C can be reported as 18°F.
Proposed fix
def get_current_weather(location: str, unit: str = "celsius") -> str:
"""Pretend tool implementation — a real one would call a weather API."""
fake_db = {"paris": 18, "london": 14, "new york": 22}
temp = fake_db.get(location.lower(), 20)
+ if unit == "fahrenheit":
+ temp = temp * 9 / 5 + 32
+ elif unit != "celsius":
+ return json.dumps({"error": f"unsupported unit {unit}"})
return json.dumps({"location": location, "temperature": temp, "unit": unit})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_current_weather(location: str, unit: str = "celsius") -> str: | |
| """Pretend tool implementation — a real one would call a weather API.""" | |
| fake_db = {"paris": 18, "london": 14, "new york": 22} | |
| temp = fake_db.get(location.lower(), 20) | |
| return json.dumps({"location": location, "temperature": temp, "unit": unit}) | |
| def get_current_weather(location: str, unit: str = "celsius") -> str: | |
| """Pretend tool implementation — a real one would call a weather API.""" | |
| fake_db = {"paris": 18, "london": 14, "new york": 22} | |
| temp = fake_db.get(location.lower(), 20) | |
| if unit == "fahrenheit": | |
| temp = temp * 9 / 5 + 32 | |
| elif unit != "celsius": | |
| return json.dumps({"error": f"unsupported unit {unit}"}) | |
| return json.dumps({"location": location, "temperature": temp, "unit": unit}) |
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 101-101: use jsonify instead of json.dumps for JSON output
Context: json.dumps({"location": location, "temperature": temp, "unit": unit})
Note: [CWE-116] Improper Encoding or Escaping of Output.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/sample-app/sample_app/litellm_async_streaming.py` around lines 98 -
102, The get_current_weather tool in litellm_async_streaming.py currently
returns the fake DB temperature unchanged even when unit is "fahrenheit". Update
get_current_weather to detect Fahrenheit requests and convert the Celsius value
before building the json.dumps response, while keeping the existing behavior for
Celsius and preserving the same function signature and output shape.
| def issue_refund(order_id: str, amount: float) -> str: | ||
| # A real implementation would call a payments API; here we just confirm. | ||
| return json.dumps( | ||
| {"order_id": order_id, "amount": amount, "refund_id": f"RF-{order_id}", "status": "refunded"} | ||
| ) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Validate refund eligibility before confirming.
issue_refund currently confirms any model-supplied order_id/amount, even for unknown or undelivered orders. That can make the sample report an invalid refund as successful.
Proposed fix
def issue_refund(order_id: str, amount: float) -> str:
# A real implementation would call a payments API; here we just confirm.
+ order = _ORDERS.get(order_id)
+ if order is None:
+ return json.dumps({"error": f"no order {order_id}"})
+ if order["status"] != "delivered":
+ return json.dumps({"error": f"order {order_id} is not eligible for refund"})
+ if amount <= 0 or amount > order["total"]:
+ return json.dumps({"error": f"invalid refund amount {amount}"})
return json.dumps(
{"order_id": order_id, "amount": amount, "refund_id": f"RF-{order_id}", "status": "refunded"}
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def issue_refund(order_id: str, amount: float) -> str: | |
| # A real implementation would call a payments API; here we just confirm. | |
| return json.dumps( | |
| {"order_id": order_id, "amount": amount, "refund_id": f"RF-{order_id}", "status": "refunded"} | |
| ) | |
| def issue_refund(order_id: str, amount: float) -> str: | |
| # A real implementation would call a payments API; here we just confirm. | |
| order = _ORDERS.get(order_id) | |
| if order is None: | |
| return json.dumps({"error": f"no order {order_id}"}) | |
| if order["status"] != "delivered": | |
| return json.dumps({"error": f"order {order_id} is not eligible for refund"}) | |
| if amount <= 0 or amount > order["total"]: | |
| return json.dumps({"error": f"invalid refund amount {amount}"}) | |
| return json.dumps( | |
| {"order_id": order_id, "amount": amount, "refund_id": f"RF-{order_id}", "status": "refunded"} | |
| ) |
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 118-120: use jsonify instead of json.dumps for JSON output
Context: json.dumps(
{"order_id": order_id, "amount": amount, "refund_id": f"RF-{order_id}", "status": "refunded"}
)
Note: [CWE-116] Improper Encoding or Escaping of Output.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/sample-app/sample_app/litellm_completion.py` around lines 117 - 121,
The issue is that issue_refund always returns a successful refund for any
model-supplied order_id and amount. Update issue_refund to validate the order
against known refund-eligible records first, including that the order exists,
was delivered/paid, and the requested amount is within the refundable limit;
only then return the refunded response. If the order is unknown or ineligible,
have issue_refund return a non-success result instead of confirming the refund.
| try: | ||
| # Call the custom on_end first | ||
| span_postprocess_callback(span) | ||
| finally: | ||
| if was_immutable: | ||
| attributes._immutable = True | ||
| # Then call the original to ensure normal processing | ||
| original_on_end(span) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Always forward spans to the real processor.
If span_postprocess_callback() raises on Line 153, original_on_end() on Line 158 never runs, so that span is never exported. Catch/log callback failures after restoring immutability, then always invoke the original on_end.
Proposed fix
def wrapped_on_end(span):
# OTel freezes a span's attributes (sets `_immutable`) in
# Span.end() before on_end runs, so the postprocess callback
# can't modify them. Temporarily unfreeze around the callback
# so it can redact/mutate attributes, then restore the flag.
attributes = getattr(span, "_attributes", None)
was_immutable = getattr(attributes, "_immutable", False)
if was_immutable:
attributes._immutable = False
try:
- # Call the custom on_end first
- span_postprocess_callback(span)
+ span_postprocess_callback(span)
+ except Exception:
+ logging.exception("span_postprocess_callback failed")
finally:
if was_immutable:
attributes._immutable = True
# Then call the original to ensure normal processing
original_on_end(span)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| # Call the custom on_end first | |
| span_postprocess_callback(span) | |
| finally: | |
| if was_immutable: | |
| attributes._immutable = True | |
| # Then call the original to ensure normal processing | |
| original_on_end(span) | |
| try: | |
| span_postprocess_callback(span) | |
| except Exception: | |
| logging.exception("span_postprocess_callback failed") | |
| finally: | |
| if was_immutable: | |
| attributes._immutable = True | |
| # Then call the original to ensure normal processing | |
| original_on_end(span) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py` around lines 151 -
158, The span post-processing in tracing.py can prevent export when
span_postprocess_callback() raises because original_on_end() is skipped. Update
the on_end wrapper to catch and log callback failures in the
span_postprocess_callback path, always restore attributes._immutable in the
finally block, and then unconditionally call original_on_end(span) so the real
processor still receives every span.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opentelemetry-instrumentation-litellm/tests/test_embeddings.py`:
- Around line 42-44: The list-input embeddings test currently reads attributes
from spans[0] without verifying the emitted span contract, so it can miss extra
wrapper or internal spans. Update the test around
span_exporter.get_finished_spans() in the list-input case to assert there is
exactly one span and that its name is litellm.embeddings before accessing attrs
or GenAIAttributes.GEN_AI_INPUT_MESSAGES, mirroring the existing test_embedding
assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e96e6aa2-0e1d-4ce9-84cd-5d903d2c53cf
📒 Files selected for processing (5)
packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/__init__.pypackages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.pypackages/opentelemetry-instrumentation-litellm/tests/test_completion.pypackages/opentelemetry-instrumentation-litellm/tests/test_embeddings.pypackages/opentelemetry-instrumentation-litellm/tests/test_streaming.py
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/opentelemetry-instrumentation-litellm/opentelemetry/instrumentation/litellm/event_models.py
- packages/opentelemetry-instrumentation-litellm/tests/test_streaming.py
- packages/opentelemetry-instrumentation-litellm/tests/test_completion.py
| spans = span_exporter.get_finished_spans() | ||
| attrs = spans[0].attributes | ||
| messages = json.loads(attrs[GenAIAttributes.GEN_AI_INPUT_MESSAGES]) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Assert the emitted span contract in the list-input test.
This test indexes spans[0] but never checks that the call still emits exactly one litellm.embeddings span. If list-input handling starts producing an extra wrapper/internal span, this case will still pass while missing the regression. Mirror the test_embedding span-count/name assertion here before reading attributes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/opentelemetry-instrumentation-litellm/tests/test_embeddings.py`
around lines 42 - 44, The list-input embeddings test currently reads attributes
from spans[0] without verifying the emitted span contract, so it can miss extra
wrapper or internal spans. Update the test around
span_exporter.get_finished_spans() in the list-input case to assert there is
exactly one span and that its name is litellm.embeddings before accessing attrs
or GenAIAttributes.GEN_AI_INPUT_MESSAGES, mirroring the existing test_embedding
assertion.
max-deygin-traceloop
left a comment
There was a problem hiding this comment.
All five findings from the previous review are addressed in fdbe72a:
- P1 fixed: Embedding input now uses
gen_ai.input.messagesJSON (matching OpenAI instrumentation). Newtest_embedding_list_inputcovers multi-input case. - P2 fixed: Events path finish_reason fallback changed from
"unknown"to"",ChoiceEventdefault updated to match. - P2 fixed:
gen_ai.tool.definitionsnow captured via_build_tool_def/_set_tool_definitions. Implementation matches the OpenAI instrumentation's flattened format. Test added. - P2 fixed: Streaming tests now assert both per-message
finish_reasonand top-levelgen_ai.response.finish_reasons. - P3 fixed: Key order in
_map_content_partsis now type-first everywhere.
All 30 tests pass locally. Semconv and finish reason compliance verified.
feat(instrumentation): ...orfix(instrumentation): ....Summary by CodeRabbit