realtime(observability): forward diagnostics + stats over the signaling WS#148
realtime(observability): forward diagnostics + stats over the signaling WS#148nagar-decart wants to merge 7 commits into
Conversation
…ng WS
Pipes client-side WebRTC / ICE / networking observability events through
the existing realtime WebSocket as `{type: "observability", data}`
messages. Bouncer logs them under the session's structured-log context
in Datadog (DecartAI/api#1882), so SDK-side observations land
correlated with our server-side LiveKit / inference traces without any
new endpoints, auth, or transport.
What flows over the WS now:
- Every diagnostic emitted from `RealtimeObservability.diagnostic()`
(e.g. `client-session-connection-breakdown`, `reconnect`, `videoStall`)
- Every periodic WebRTC stats snapshot collected by the in-process
stats collector
Wiring:
- `SignalingChannel.sendObservability(data)` — new public, fire-and-forget
method that writes `{type: "observability", data}` (drops silently if
the socket isn't open; never throws).
- `RealtimeObservability.setObservabilityForwarder(fn | null)` — sink
set by `StreamSession.createTransport()` after the signaling channel
is constructed, and cleared on `tearDown()`.
- `ObservabilityMessage` added to `OutgoingRealtimeMessage` union.
This is intentionally additive to existing observability: the local
`onDiagnostic`/`onStats` callbacks and the platform telemetry POSTs
keep working as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit: |
The existing observability sink only emitted three diagnostic types (client-session-connection-breakdown, reconnect, videoStall) plus periodic WebRTC stats — useful for metrics, useless for actually debugging an ICE failure. This adds raw event-level instrumentation that mirrors the iOS PR DecartAI/decart-ios#28's DecartLiveKitLogger + NetworkPathObserver coverage. New file: observability/network-instrumentation.ts Attaches addEventListener-based listeners to: - The LiveKit Room (Connected/Disconnected/Reconnecting/Reconnected, SignalReconnecting, ConnectionStateChanged, ConnectionQualityChanged, MediaDevicesError, LocalTrackPublished, ParticipantConnected, TrackSubscribed/Muted/Unmuted). - The underlying publisher + subscriber RTCPeerConnections (private access via room.engine.pcManager, with addEventListener so we never displace LiveKit's own handlers): icecandidate, icecandidateerror, iceconnectionstatechange, connectionstatechange, icegatheringstatechange, signalingstatechange, negotiationneeded, datachannel, track. ICE candidates carry full address/port/type /priority/foundation. Selected candidate pair (with RTT + addresses) is snapshotted from getStats() when ICE settles. - Browser network state (initial snapshot of navigator.connection + online/offline + visibilitychange + NetworkInformation 'change'). New methods: - RealtimeObservability.emitInstrumentationEvent(name, data) bypasses the strict DiagnosticEvents type union so per-candidate events can flow without bloating the public type surface. Signaling traffic: - SignalingChannel.writeMessage emits 'signaling-sent' (skips 'observability' to avoid recursing). - SignalingChannel.handleMessage emits 'signaling-received'. All events flow through the same observabilityForwarder -> realtime WS -> bouncer -> Datadog @event.* path that the previous change established. No new transport, no new endpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| } | ||
|
|
||
| // Re-export for convenience so consumers know what's available. | ||
| export { Track }; |
There was a problem hiding this comment.
Unused Track import and re-export
Low Severity
Track is imported from livekit-client and re-exported at line 370, but it's never used within this file, and no other file imports Track from network-instrumentation. This is dead code — the only import from this module is attachRoomInstrumentation.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit f58f9c7. Configure here.
…ebug signal Earlier instrumentation pass dumped everything over the WS — periodic WebRTC stats (every 1s), every signaling frame, every signalingstatechange, every track/visibility/connection-quality event. In a 15-min Datadog window with 3 successful sessions, stats alone produced 141 of 250 events (56%) while actual ICE-level data was ~7%. That signal-to-noise ratio makes the stream useless for debugging connection failures. Forward only what helps diagnose WHY a connection fails or stalls: DROPPED (noise during a healthy session, not useful for failure debug): - Periodic WebRTC stats (still local via onStats + platform telemetry) - signaling-sent for every outbound frame - signaling-received for routine generation_tick / prompt_ack / set_image_ack - signaling-state on PC (SDP renegotiation chatter) - negotiation-needed, track-received, data-channel-opened on PC - track-subscribed/-muted/-unmuted, local-track-published, participant-connected, connection-quality on Room - page-visibility KEPT (the ICE / connection debug picture): - ice-candidate, ice-candidate-past, ice-candidate-error - ice-connection-state, ice-gathering-state, pc-connection-state - pc-attached (with snapshot of current state) - selected-candidate-pair (with local/remote addresses + RTT) - connected-address (LiveKit's getter for the chosen address) - room-connected/-disconnected/-reconnecting/-signal-reconnecting/-reconnected - room-connection-state - media-devices-error - network-state (initial + on change), browser-online/-offline - signaling-received for livekit_room_info / session_id (the only signaling messages that matter for connection setup) - client-session-connection-breakdown (the breakdown diagnostic) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| this.options.onStats?.(stats); | ||
| this.telemetryReporter.addStats(stats); | ||
| this.detectVideoStall(stats); | ||
| // Stats intentionally not forwarded over the WS. They fire every 1s |
There was a problem hiding this comment.
stop() doesn't clear observabilityForwarder like other fields
Low Severity
The stop() method resets every other private field (statsCollector, telemetryReporter, connectionBreakdown, etc.) but does not clear observabilityForwarder. After stop() is called, diagnostic() and emitInstrumentationEvent() still forward payloads over the WebSocket. In client.ts, observability.stop() is called before session.disconnect(), creating a window where the "stopped" observability instance continues sending data.
Reviewed by Cursor Bugbot for commit 4518087. Configure here.
- pc-attached now carries the PC's iceServers config (URLs only; credentials redacted to presence-bools) + iceTransportPolicy. Directly answers "did the JoinResponse give us STUN/TURN at all?", which was the core unknown in the 29CM ICE-failure investigation. - room-disconnected now decodes the numeric DisconnectReason enum into a readable string (signal_close, connection_timeout, media_failure, etc.). Mirrored in the test mock. - livekit_room_info signaling event now carries livekitUrl / roomName / sessionId — tells us which SFU node owns the room without needing to grep server-side logs. - Dropped room-connection-state (duplicated room-connected / -disconnected / -reconnecting). - Dropped connected-address (redundant with selected-candidate-pair.remote.address). - ice-candidate trimmed from 15 fields to 12: drop raw SDP `candidate` string, sdpMid, sdpMLineIndex, usernameFragment. Keep type, protocol, address, port, priority, foundation, component, tcpType, relatedAddress/Port, networkType, url. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TS lib used in CI typecheck (different from the DOM types vitest uses locally) doesn't expose RTCIceCandidateStats. The RTCIceCandidatePairStats reference also tripped because the same lib exposes it under a slightly different shape — narrowing failed and field accesses landed on 'never'. Replaced with two minimal structural types (IceCandidateStat, IceCandidatePairStat) covering only the fields we actually read from getStats(). Functional behavior is unchanged. Also cast the session_id signaling case through `unknown` — it's sent by the bouncer but not in the typed IncomingRealtimeMessage union, so TS narrowed the comparison to never. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that together close the visibility gap for failed `room.connect()` attempts (the .251-style ICE-fails-without-warning case): 1. Attach PC instrumentation via engine's 'transportsCreated' event. Previously we polled `room.engine.pcManager` every 100ms for up to 5s. That works for sessions that connect, but for connections that hang inside ICE (PCs created, gathering started, never converges) the polling window can race the early candidates. RTCEngine emits 'transportsCreated' the instant publisher/subscriber PCs exist, inside `room.connect()` before ICE gathering starts. Hooking into that guarantees we attach before the first icecandidate fires. Polling is kept as a fallback for engine builds that don't expose the event. 2. Buffer observability events until the WS is OPEN. `sendObservability` previously silently dropped events when `ws.readyState !== OPEN`. But `attachRoomInstrumentation` emits network-state + pc-attached synchronously, which can happen before the WS has finished opening on cold-start. The result was that exactly the events we most need for diagnosing a failed connection were the ones silently lost. Now we queue (bounded at 256) and flush on `ws.onopen`. Overflow is reported as a single event so we know if anything was actually dropped. 3. Emit `livekit-connect-start` / `livekit-connect-error` from media-channel.connect(). When `room.connect()` rejects, the stream-session catch tears down the channel before any failure signal can reach the WS forwarder. Emitting bracket events around the LiveKit call ensures we record the error on the wire. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 215aa74. Configure here.
| } | ||
|
|
||
| private tearDown(): void { | ||
| this.config.observability?.setObservabilityForwarder(null); |
There was a problem hiding this comment.
Forwarder cleared on transport reuse
Medium Severity
tearDown clears the observability WebSocket forwarder, but connect retries and a second connect after disconnect reuse the same SignalingChannel without calling createTransport again. After the first failed or ended attempt, diagnostics and instrumentation no longer reach sendObservability even when the session connects successfully on a later try.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 215aa74. Configure here.
| logger: this.logger, | ||
| videoCodec: this.config.videoCodec, | ||
| }); | ||
| // Forward client-side diagnostics + WebRTC stats back over the |
There was a problem hiding this comment.
Breakdown not forwarded after media fail
Medium Severity
When media.connect or publishLocalTracks fails, the inner catch runs tearDown (which clears the forwarder) before the outer catch runs finishConnectionBreakdown. The client-session-connection-breakdown diagnostic is therefore not sent over the WebSocket for the failure mode this change targets, even though earlier livekit-connect-error instrumentation may have been forwarded.
Reviewed by Cursor Bugbot for commit 215aa74. Configure here.


Summary
Pipes client-side WebRTC / ICE / networking observability events through the existing realtime WebSocket as
{type: "observability", data}messages. Bouncer logs them under the session's structured-log context in Datadog (DecartAI/api#1882), so SDK-side observations land correlated with our server-side LiveKit / inference traces without any new endpoints, auth, or transport.Context: while debugging the 29CM connection failures we had rich server-side LiveKit / ICE logs but no symmetric visibility into what the SDK's own WebRTC engine was producing on the client. This closes the gap.
What flows over the WS now
RealtimeObservability.diagnostic():client-session-connection-breakdown(with per-phase timings + error)reconnectvideoStallWiring
SignalingChannel.sendObservability(data)— new public, fire-and-forget method that writes{type: "observability", data}(drops silently if socket isn't open, never throws).RealtimeObservability.setObservabilityForwarder(fn | null)— sink set byStreamSession.createTransport()after the signaling channel is constructed, and cleared ontearDown().ObservabilityMessageadded to theOutgoingRealtimeMessageunion.Intentionally additive to existing observability: the local
onDiagnostic/onStatscallbacks and the platform telemetry POSTs keep working as before.Diff size
4 files, +49 / -1 lines. No new dependencies, no new config.
Test plan
pnpm -F @decartai/sdk build— passespnpm -F @decartai/sdk test— 206 tests passservice:bouncer-realtime "RTAPI: client observability"log lines in Datadog us5🤖 Generated with Claude Code
Note
Medium Risk
Touches connection setup/teardown and relies on LiveKit private engine APIs (with fallbacks), but paths are best-effort, bounded-buffered, and avoid forwarding tokens or high-volume stats.
Overview
Adds client-side WebRTC/ICE debugging telemetry on the existing realtime signaling WebSocket as
{type: "observability", data}, so bouncer can log it next to server traces (no new transport).New
network-instrumentationhooks LiveKitRoom(including privateengine.pcManager/transportsCreated+ short poll fallback) and both publisher/subscriberRTCPeerConnections: ICE candidates (live +getStatsbackfill), ICE/PC/gathering states, candidate errors, selected pair, redacted ICE servers, browser online/offline/navigator.connection, and connection-focused room events. Steady-state noise (renegotiation, routine ticks) is deliberately omitted.MediaChannelattaches instrumentation beforeroom.connect(), emitslivekit-connect-start/livekit-connect-errorand failedwebrtc-handshakephase on connect rejection, and tears listeners down on disconnect.RealtimeObservabilitygains a forwarder sink andemitInstrumentationEvent; typeddiagnostic()events are also forwarded. Periodic WebRTC stats are not sent over the WS (still local callback + telemetry POST).SignalingChannel.sendObservabilityis best-effort with a 256-event pre-open buffer flushed on WS open (plus overflow instrumentation).StreamSessionwires observability → signaling on transport create and clears on teardown.ObservabilityMessageextends outgoing message types; join-timelivekit_room_info/session_idare traced without logging the LiveKit token.Reviewed by Cursor Bugbot for commit 215aa74. Bugbot is set up for automated code reviews on this repo. Configure here.