Skip to content

realtime(observability): forward diagnostics + stats over the signaling WS#148

Open
nagar-decart wants to merge 7 commits into
mainfrom
eyal/observability-bouncer-ingest
Open

realtime(observability): forward diagnostics + stats over the signaling WS#148
nagar-decart wants to merge 7 commits into
mainfrom
eyal/observability-bouncer-ingest

Conversation

@nagar-decart
Copy link
Copy Markdown
Contributor

@nagar-decart nagar-decart commented May 29, 2026

Summary

Pipes client-side WebRTC / ICE / networking observability events through the existing realtime WebSocket as {type: "observability", data} messages. Bouncer logs them under the session's structured-log context in Datadog (DecartAI/api#1882), so SDK-side observations land correlated with our server-side LiveKit / inference traces without any new endpoints, auth, or transport.

Context: while debugging the 29CM connection failures we had rich server-side LiveKit / ICE logs but no symmetric visibility into what the SDK's own WebRTC engine was producing on the client. This closes the gap.

What flows over the WS now

  • Every diagnostic emitted from RealtimeObservability.diagnostic():
    • client-session-connection-breakdown (with per-phase timings + error)
    • reconnect
    • videoStall
  • Every periodic WebRTC stats snapshot collected by the in-process stats collector (video / outbound / candidate pairs / etc.)

Wiring

  • SignalingChannel.sendObservability(data) — new public, fire-and-forget method that writes {type: "observability", data} (drops silently if socket isn't open, never throws).
  • RealtimeObservability.setObservabilityForwarder(fn | null) — sink set by StreamSession.createTransport() after the signaling channel is constructed, and cleared on tearDown().
  • ObservabilityMessage added to the OutgoingRealtimeMessage union.

Intentionally additive to existing observability: the local onDiagnostic / onStats callbacks and the platform telemetry POSTs keep working as before.

Diff size

4 files, +49 / -1 lines. No new dependencies, no new config.

Test plan

  • pnpm -F @decartai/sdk build — passes
  • pnpm -F @decartai/sdk test — 206 tests pass
  • Verify in webrtc-connect-test that a session with this SDK build produces service:bouncer-realtime "RTAPI: client observability" log lines in Datadog us5

🤖 Generated with Claude Code


Note

Medium Risk
Touches connection setup/teardown and relies on LiveKit private engine APIs (with fallbacks), but paths are best-effort, bounded-buffered, and avoid forwarding tokens or high-volume stats.

Overview
Adds client-side WebRTC/ICE debugging telemetry on the existing realtime signaling WebSocket as {type: "observability", data}, so bouncer can log it next to server traces (no new transport).

New network-instrumentation hooks LiveKit Room (including private engine.pcManager / transportsCreated + short poll fallback) and both publisher/subscriber RTCPeerConnections: ICE candidates (live + getStats backfill), ICE/PC/gathering states, candidate errors, selected pair, redacted ICE servers, browser online/offline/navigator.connection, and connection-focused room events. Steady-state noise (renegotiation, routine ticks) is deliberately omitted.

MediaChannel attaches instrumentation before room.connect(), emits livekit-connect-start / livekit-connect-error and failed webrtc-handshake phase on connect rejection, and tears listeners down on disconnect.

RealtimeObservability gains a forwarder sink and emitInstrumentationEvent; typed diagnostic() events are also forwarded. Periodic WebRTC stats are not sent over the WS (still local callback + telemetry POST).

SignalingChannel.sendObservability is best-effort with a 256-event pre-open buffer flushed on WS open (plus overflow instrumentation). StreamSession wires observability → signaling on transport create and clears on teardown. ObservabilityMessage extends outgoing message types; join-time livekit_room_info / session_id are traced without logging the LiveKit token.

Reviewed by Cursor Bugbot for commit 215aa74. Bugbot is set up for automated code reviews on this repo. Configure here.

nagar-decart and others added 2 commits May 29, 2026 15:19
…ng WS

Pipes client-side WebRTC / ICE / networking observability events through
the existing realtime WebSocket as `{type: "observability", data}`
messages. Bouncer logs them under the session's structured-log context
in Datadog (DecartAI/api#1882), so SDK-side observations land
correlated with our server-side LiveKit / inference traces without any
new endpoints, auth, or transport.

What flows over the WS now:
- Every diagnostic emitted from `RealtimeObservability.diagnostic()`
  (e.g. `client-session-connection-breakdown`, `reconnect`, `videoStall`)
- Every periodic WebRTC stats snapshot collected by the in-process
  stats collector

Wiring:
- `SignalingChannel.sendObservability(data)` — new public, fire-and-forget
  method that writes `{type: "observability", data}` (drops silently if
  the socket isn't open; never throws).
- `RealtimeObservability.setObservabilityForwarder(fn | null)` — sink
  set by `StreamSession.createTransport()` after the signaling channel
  is constructed, and cleared on `tearDown()`.
- `ObservabilityMessage` added to `OutgoingRealtimeMessage` union.

This is intentionally additive to existing observability: the local
`onDiagnostic`/`onStats` callbacks and the platform telemetry POSTs
keep working as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 29, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@decartai/sdk@148

commit: 215aa74

Comment thread packages/sdk/src/realtime/observability/realtime-observability.ts Outdated
The existing observability sink only emitted three diagnostic types
(client-session-connection-breakdown, reconnect, videoStall) plus
periodic WebRTC stats — useful for metrics, useless for actually
debugging an ICE failure. This adds raw event-level instrumentation
that mirrors the iOS PR DecartAI/decart-ios#28's
DecartLiveKitLogger + NetworkPathObserver coverage.

New file: observability/network-instrumentation.ts

Attaches addEventListener-based listeners to:
- The LiveKit Room (Connected/Disconnected/Reconnecting/Reconnected,
  SignalReconnecting, ConnectionStateChanged, ConnectionQualityChanged,
  MediaDevicesError, LocalTrackPublished, ParticipantConnected,
  TrackSubscribed/Muted/Unmuted).
- The underlying publisher + subscriber RTCPeerConnections (private
  access via room.engine.pcManager, with addEventListener so we never
  displace LiveKit's own handlers): icecandidate, icecandidateerror,
  iceconnectionstatechange, connectionstatechange,
  icegatheringstatechange, signalingstatechange, negotiationneeded,
  datachannel, track. ICE candidates carry full address/port/type
  /priority/foundation. Selected candidate pair (with RTT + addresses)
  is snapshotted from getStats() when ICE settles.
- Browser network state (initial snapshot of navigator.connection +
  online/offline + visibilitychange + NetworkInformation 'change').

New methods:
- RealtimeObservability.emitInstrumentationEvent(name, data)
  bypasses the strict DiagnosticEvents type union so per-candidate
  events can flow without bloating the public type surface.

Signaling traffic:
- SignalingChannel.writeMessage emits 'signaling-sent' (skips
  'observability' to avoid recursing).
- SignalingChannel.handleMessage emits 'signaling-received'.

All events flow through the same observabilityForwarder ->
realtime WS -> bouncer -> Datadog @event.* path that the previous
change established. No new transport, no new endpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/sdk/src/realtime/signaling-channel.ts Outdated
}

// Re-export for convenience so consumers know what's available.
export { Track };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused Track import and re-export

Low Severity

Track is imported from livekit-client and re-exported at line 370, but it's never used within this file, and no other file imports Track from network-instrumentation. This is dead code — the only import from this module is attachRoomInstrumentation.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f58f9c7. Configure here.

…ebug signal

Earlier instrumentation pass dumped everything over the WS — periodic
WebRTC stats (every 1s), every signaling frame, every signalingstatechange,
every track/visibility/connection-quality event. In a 15-min Datadog
window with 3 successful sessions, stats alone produced 141 of 250 events
(56%) while actual ICE-level data was ~7%. That signal-to-noise ratio
makes the stream useless for debugging connection failures.

Forward only what helps diagnose WHY a connection fails or stalls:

DROPPED (noise during a healthy session, not useful for failure debug):
- Periodic WebRTC stats (still local via onStats + platform telemetry)
- signaling-sent for every outbound frame
- signaling-received for routine generation_tick / prompt_ack / set_image_ack
- signaling-state on PC (SDP renegotiation chatter)
- negotiation-needed, track-received, data-channel-opened on PC
- track-subscribed/-muted/-unmuted, local-track-published,
  participant-connected, connection-quality on Room
- page-visibility

KEPT (the ICE / connection debug picture):
- ice-candidate, ice-candidate-past, ice-candidate-error
- ice-connection-state, ice-gathering-state, pc-connection-state
- pc-attached (with snapshot of current state)
- selected-candidate-pair (with local/remote addresses + RTT)
- connected-address (LiveKit's getter for the chosen address)
- room-connected/-disconnected/-reconnecting/-signal-reconnecting/-reconnected
- room-connection-state
- media-devices-error
- network-state (initial + on change), browser-online/-offline
- signaling-received for livekit_room_info / session_id (the only
  signaling messages that matter for connection setup)
- client-session-connection-breakdown (the breakdown diagnostic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
this.options.onStats?.(stats);
this.telemetryReporter.addStats(stats);
this.detectVideoStall(stats);
// Stats intentionally not forwarded over the WS. They fire every 1s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop() doesn't clear observabilityForwarder like other fields

Low Severity

The stop() method resets every other private field (statsCollector, telemetryReporter, connectionBreakdown, etc.) but does not clear observabilityForwarder. After stop() is called, diagnostic() and emitInstrumentationEvent() still forward payloads over the WebSocket. In client.ts, observability.stop() is called before session.disconnect(), creating a window where the "stopped" observability instance continues sending data.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4518087. Configure here.

nagar-decart and others added 3 commits May 29, 2026 16:40
- pc-attached now carries the PC's iceServers config (URLs only;
  credentials redacted to presence-bools) + iceTransportPolicy.
  Directly answers "did the JoinResponse give us STUN/TURN at all?",
  which was the core unknown in the 29CM ICE-failure investigation.
- room-disconnected now decodes the numeric DisconnectReason enum
  into a readable string (signal_close, connection_timeout,
  media_failure, etc.). Mirrored in the test mock.
- livekit_room_info signaling event now carries livekitUrl /
  roomName / sessionId — tells us which SFU node owns the room
  without needing to grep server-side logs.
- Dropped room-connection-state (duplicated room-connected /
  -disconnected / -reconnecting).
- Dropped connected-address (redundant with
  selected-candidate-pair.remote.address).
- ice-candidate trimmed from 15 fields to 12: drop raw SDP `candidate`
  string, sdpMid, sdpMLineIndex, usernameFragment. Keep type, protocol,
  address, port, priority, foundation, component, tcpType,
  relatedAddress/Port, networkType, url.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TS lib used in CI typecheck (different from the DOM types vitest
uses locally) doesn't expose RTCIceCandidateStats. The
RTCIceCandidatePairStats reference also tripped because the same lib
exposes it under a slightly different shape — narrowing failed and
field accesses landed on 'never'.

Replaced with two minimal structural types (IceCandidateStat,
IceCandidatePairStat) covering only the fields we actually read from
getStats(). Functional behavior is unchanged.

Also cast the session_id signaling case through `unknown` — it's sent
by the bouncer but not in the typed IncomingRealtimeMessage union, so
TS narrowed the comparison to never.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that together close the visibility gap for failed `room.connect()`
attempts (the .251-style ICE-fails-without-warning case):

1. Attach PC instrumentation via engine's 'transportsCreated' event.
   Previously we polled `room.engine.pcManager` every 100ms for up to 5s.
   That works for sessions that connect, but for connections that hang
   inside ICE (PCs created, gathering started, never converges) the
   polling window can race the early candidates. RTCEngine emits
   'transportsCreated' the instant publisher/subscriber PCs exist,
   inside `room.connect()` before ICE gathering starts. Hooking into
   that guarantees we attach before the first icecandidate fires.
   Polling is kept as a fallback for engine builds that don't expose the
   event.

2. Buffer observability events until the WS is OPEN.
   `sendObservability` previously silently dropped events when
   `ws.readyState !== OPEN`. But `attachRoomInstrumentation` emits
   network-state + pc-attached synchronously, which can happen before
   the WS has finished opening on cold-start. The result was that
   exactly the events we most need for diagnosing a failed connection
   were the ones silently lost. Now we queue (bounded at 256) and flush
   on `ws.onopen`. Overflow is reported as a single event so we know
   if anything was actually dropped.

3. Emit `livekit-connect-start` / `livekit-connect-error` from
   media-channel.connect(). When `room.connect()` rejects, the
   stream-session catch tears down the channel before any failure
   signal can reach the WS forwarder. Emitting bracket events around
   the LiveKit call ensures we record the error on the wire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 215aa74. Configure here.

}

private tearDown(): void {
this.config.observability?.setObservabilityForwarder(null);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forwarder cleared on transport reuse

Medium Severity

tearDown clears the observability WebSocket forwarder, but connect retries and a second connect after disconnect reuse the same SignalingChannel without calling createTransport again. After the first failed or ended attempt, diagnostics and instrumentation no longer reach sendObservability even when the session connects successfully on a later try.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 215aa74. Configure here.

logger: this.logger,
videoCodec: this.config.videoCodec,
});
// Forward client-side diagnostics + WebRTC stats back over the
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breakdown not forwarded after media fail

Medium Severity

When media.connect or publishLocalTracks fails, the inner catch runs tearDown (which clears the forwarder) before the outer catch runs finishConnectionBreakdown. The client-session-connection-breakdown diagnostic is therefore not sent over the WebSocket for the failure mode this change targets, even though earlier livekit-connect-error instrumentation may have been forwarded.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 215aa74. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant