Bug: end_of_turn_delay collapses to min_endpointing_delay when running without VAD in turn_detection="stt" mode #5669

@miguelmoralai

Summary

When an AgentSession is configured with vad=NOT_GIVEN and turn_detection="stt", the end_of_turn_delay (and transcription_delay) values reported on user ChatMessage metrics no longer measure what their names suggest. They collapse to roughly endpointing.min_delay regardless of how long the STT model actually took to detect end of turn.

This is a problem for users who rely on these metrics to monitor real end of utterance latency or to compare STT providers.

Repro

  1. Build an AgentSession with a streaming STT that emits END_OF_SPEECH (for example, Soniox or Deepgram Flux).
  2. Pass turn_detection="stt" and vad=NOT_GIVEN.
  3. Run a normal turn and inspect the metrics dict on the resulting user ChatMessage.

Observed: end_of_turn_delay is approximately equal to endpointing.min_delay on every turn, even on ambiguous turns where the STT model spends close to its MAX_ENDPOINT_DELAY ceiling before firing.

Expected: end_of_turn_delay should reflect the time from the user actually stopping to the EOU pipeline committing, including the STT model decision time.

Root cause

In livekit/agents/voice/audio_recognition.py the _last_speaking_time anchor is set in three places when there is no VAD attached:

# line 846 (FINAL_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 903 (PREFLIGHT_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

# line 944 (END_OF_SPEECH, stt mode)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()

Without VAD, the first event handler that runs stamps _last_speaking_time = time.time(). For Soniox and Flux this is essentially the moment END_OF_SPEECH arrives, that is, after the model has already decided end of turn. The bounce task then runs and sleeps for endpointing_delay, then computes:

end_of_turn_delay = time.time() - last_speaking_time  # line 1121

Since last_speaking_time was just set to "now" before the sleep, the resulting value is approximately endpointing_delay and hides the real STT detection time.
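To make the collapse concrete, here is a minimal model of one turn on a virtual clock. It is not the code from audio_recognition.py; simulate_turn and its parameters are hypothetical names that only mirror the order of operations described above (stamp the anchor when END_OF_SPEECH arrives, sleep for the endpointing delay, then subtract).

```python
def simulate_turn(stt_decision_time: float, endpointing_delay: float) -> dict:
    """Model one no-VAD turn on a virtual clock (all values in seconds).

    Hypothetical helper for illustration; it does not call livekit-agents.
    """
    user_stopped_at = 0.0

    # The STT model emits END_OF_SPEECH only after it has decided the turn ended.
    now = user_stopped_at + stt_decision_time

    # No-VAD path: the anchor is stamped with the arrival time of that event.
    last_speaking_time = now

    # The bounce task then sleeps for the endpointing delay.
    now += endpointing_delay

    return {
        "reported": now - last_speaking_time,  # what the metric shows
        "actual": now - user_stopped_at,       # what the user experienced
    }


if __name__ == "__main__":
    for decision_time in (0.2, 1.5, 3.0):
        result = simulate_turn(decision_time, endpointing_delay=0.5)
        print(f"decision={decision_time:.1f}s "
              f"reported={result['reported']:.2f}s actual={result['actual']:.2f}s")
```

In this model the reported value stays at the endpointing delay for every decision time, while the actual delay grows with it, which matches the observed behavior.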

The TODO already in the file at line 848 acknowledges this:

# vad disabled, use stt timestamp
# TODO: this would screw up transcription latency metrics
# but we'll live with it for now.
# the correct way is to ensure STT fires SpeechEventType.END_OF_SPEECH
# and using that timestamp for _last_speaking_time

Suggested fix

Compute a wall clock acoustic stop time from two values: SpeechData.end_time on the last FINAL_TRANSCRIPT alternative (which both Soniox and Flux populate as a stream relative timestamp), and a stream start anchor captured on the first event of the turn. Then assign that value to _last_speaking_time instead of time.time().

That would close the TODO and make end_of_turn_delay and transcription_delay reflect real model behavior in STT driven mode without depending on VAD being attached.
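A rough sketch of the arithmetic, under stated assumptions: TurnAnchor, observe_event, and acoustic_stop are hypothetical names, not part of livekit-agents, and the sketch assumes the first event of the turn carries some stream relative timestamp that can be subtracted from its arrival time. Network and model latency on that first event make the anchor an estimate, so the result is clamped to the current wall clock time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnAnchor:
    """Hypothetical helper: maps stream relative STT timestamps to wall clock time."""

    stream_start_wall: Optional[float] = None

    def observe_event(self, wall_now: float, stream_time: float) -> None:
        # Capture the anchor once, on the first event seen in the turn.
        # Assumption: stream_time is that event's stream relative timestamp.
        if self.stream_start_wall is None:
            self.stream_start_wall = wall_now - stream_time

    def acoustic_stop(self, end_time: float, wall_now: float) -> float:
        # Fall back to "now" when no anchor exists or end_time is not populated.
        if self.stream_start_wall is None or end_time <= 0:
            return wall_now
        # Never report a stop time in the future.
        return min(wall_now, self.stream_start_wall + end_time)

    def reset(self) -> None:
        # Call between turns so the next turn captures a fresh anchor.
        self.stream_start_wall = None


if __name__ == "__main__":
    anchor = TurnAnchor()
    # First event arrives at wall time 1000.0 with stream timestamp 2.0.
    anchor.observe_event(wall_now=1000.0, stream_time=2.0)
    # Final transcript reports end_time=5.0 and arrives at wall time 1006.0.
    print(anchor.acoustic_stop(end_time=5.0, wall_now=1006.0))
```

The value returned by acoustic_stop is what would be assigned to _last_speaking_time in place of time.time(). The fallback keeps today's behavior for STT plugins that leave end_time unset.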

Environment

  • livekit-agents==1.5.8
  • Python 3.12
  • Plugins reproduced with: livekit-plugins-soniox, livekit-plugins-deepgram (Flux v2)
