Summary
When an AgentSession is configured with vad=NOT_GIVEN and turn_detection="stt", the end_of_turn_delay (and transcription_delay) values reported on user ChatMessage metrics no longer measure what their names suggest. They collapse to roughly endpointing.min_delay regardless of how long the STT model actually took to detect end of turn.
This is a problem for users who rely on these metrics to monitor real end of utterance latency or to compare STT providers.
Repro
- Build an AgentSession with a streaming STT that emits END_OF_SPEECH (Soniox or Deepgram Flux for example).
- Pass turn_detection="stt" and vad=NOT_GIVEN.
- Run a normal turn and inspect the metrics dict on the resulting user ChatMessage.
Observed: end_of_turn_delay is approximately equal to endpointing.min_delay on every turn, even on ambiguous turns where the STT model spends close to its MAX_ENDPOINT_DELAY ceiling before firing.
Expected: end_of_turn_delay should reflect the time from the user actually stopping to the EOU pipeline committing, including the STT model decision time.
Root cause
In livekit/agents/voice/audio_recognition.py the _last_speaking_time anchor is set in three places when there is no VAD attached:
# line 846 (FINAL_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()
# line 903 (PREFLIGHT_TRANSCRIPT)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()
# line 944 (END_OF_SPEECH, stt mode)
if not self._vad or self._last_speaking_time is None:
    self._last_speaking_time = time.time()
Without VAD, the first event handler that runs stamps _last_speaking_time = time.time(). For Soniox and Flux this is essentially the moment END_OF_SPEECH arrives, that is, after the model has already decided end of turn. The bounce task then runs and sleeps for endpointing_delay, then computes:
end_of_turn_delay = time.time() - last_speaking_time # line 1121
Since last_speaking_time was just set to "now" before the sleep, the resulting value is approximately endpointing_delay and hides the real STT detection time.
The TODO already in the file at line 848 acknowledges this:
# vad disabled, use stt timestamp
# TODO: this would screw up transcription latency metrics
# but we'll live with it for now.
# the correct way is to ensure STT fires SpeechEventType.END_OF_SPEECH
# and using that timestamp for _last_speaking_time
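The arithmetic behind the collapse can be illustrated with a small standalone model. This is only a sketch of the timeline described in this section, not code from livekit-agents: the function name and its parameters are hypothetical, and t = 0 is taken to be the moment the user actually stops speaking.

```python
def measured_end_of_turn_delay(
    stt_decision_time: float,
    endpointing_delay: float,
    anchor_at_acoustic_stop: bool,
) -> float:
    """Return the end_of_turn_delay (seconds) a pipeline would report.

    Hypothetical model of one turn:
      t = 0                                      user stops speaking
      t = stt_decision_time                      END_OF_SPEECH arrives
      t = stt_decision_time + endpointing_delay  turn is committed
    """
    user_stopped_at = 0.0
    eos_arrives_at = user_stopped_at + stt_decision_time
    committed_at = eos_arrives_at + endpointing_delay

    # Without VAD the anchor is stamped when the STT event arrives,
    # i.e. after the model has already spent its decision time.
    if anchor_at_acoustic_stop:
        last_speaking_time = user_stopped_at
    else:
        last_speaking_time = eos_arrives_at

    return committed_at - last_speaking_time


# Anchor stamped on event arrival: the decision time disappears.
print(measured_end_of_turn_delay(1.5, 0.5, anchor_at_acoustic_stop=False))
# Anchor at the acoustic stop: the decision time is included.
print(measured_end_of_turn_delay(1.5, 0.5, anchor_at_acoustic_stop=True))
```

With the anchor stamped on arrival, the result equals endpointing_delay for any stt_decision_time, which matches the observed behavior.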
Suggested fix
Use SpeechData.end_time from the last FINAL_TRANSCRIPT alternative (which both Soniox and Flux populate as a stream relative timestamp) plus a stream start anchor captured on the first event of the turn, to compute a wall clock acoustic stop time. Then assign that to _last_speaking_time instead of time.time().
That would close the TODO and make end_of_turn_delay and transcription_delay reflect real model behavior in STT driven mode without depending on VAD being attached.
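A minimal sketch of that conversion, under stated assumptions: StreamClock and its methods are hypothetical names, not part of livekit-agents, and the sketch assumes that end_time is expressed in seconds since the start of the STT stream. How the stream start anchor should be captured in audio_recognition.py is left open here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamClock:
    """Hypothetical helper mapping stream-relative times to wall clock times."""

    stream_start_wall: Optional[float] = None

    def observe(self, now_wall: float, stream_offset: float = 0.0) -> None:
        """Capture the anchor once, on the first event of the turn.

        stream_offset is the stream-relative time that corresponds to
        now_wall (assumed 0.0 if the event marks the start of the stream).
        """
        if self.stream_start_wall is None:
            self.stream_start_wall = now_wall - stream_offset

    def to_wall(self, end_time: float) -> float:
        """Convert a stream-relative end_time into a wall clock timestamp."""
        if self.stream_start_wall is None:
            raise ValueError("no stream start anchor captured yet")
        return self.stream_start_wall + end_time


clock = StreamClock()
clock.observe(now_wall=1000.0, stream_offset=2.0)  # anchor = 998.0
acoustic_stop = clock.to_wall(5.0)  # candidate value for _last_speaking_time
```

The value returned by to_wall would take the place of time.time() in the three assignments listed under Root cause.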
Environment
- livekit-agents==1.5.8
- Python 3.12
- Plugins reproduced with: livekit-plugins-soniox, livekit-plugins-deepgram (Flux v2)