
[4/7] Telemetry Event Emission and Aggregation #327

Open
samikshya-db wants to merge 131 commits into main from telemetry-4-event-aggregation

Conversation

@samikshya-db
Collaborator

Part 4 of 7-part Telemetry Implementation Stack

This layer adds event emission and per-statement aggregation with smart batching.

Summary

Implements TelemetryEventEmitter for event-driven telemetry and MetricsAggregator for efficient per-statement aggregation with smart flushing.

Components

TelemetryEventEmitter (lib/telemetry/TelemetryEventEmitter.ts)

Event-driven architecture using Node.js EventEmitter:

  • Type-safe emission methods for each event type
  • Respects telemetryEnabled configuration flag
  • All exceptions swallowed and logged at debug level only
  • Zero performance impact when disabled (early return)

Event Types:

  • connection.open - Successful connection establishment
  • statement.start - Statement execution begins
  • statement.complete - Statement execution completes
  • cloudfetch.chunk - CloudFetch chunk downloaded
  • error - Exception occurred with terminal classification
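The emission pattern above can be sketched as follows. This is a minimal illustration, not the actual API in `lib/telemetry/TelemetryEventEmitter.ts`; the class, method, and payload names here are assumptions:

```typescript
import { EventEmitter } from 'events';

// Illustrative event payload -- field names are assumptions, not the PR's types.
interface StatementStartEvent {
  statementId: string;
  startedAt: number;
}

class TelemetryEventEmitterSketch extends EventEmitter {
  constructor(private telemetryEnabled: boolean) {
    super();
  }

  // One type-safe method per event type: early return when telemetry is
  // disabled (zero cost), and every exception swallowed so telemetry can
  // never break the driver.
  emitStatementStart(event: StatementStartEvent): void {
    if (!this.telemetryEnabled) return;
    try {
      this.emit('statement.start', event);
    } catch (e) {
      // swallow; the real implementation logs at debug level only
    }
  }
}
```

Listeners subscribe with the plain `EventEmitter` API, e.g. `emitter.on('statement.start', handler)`, which is what makes the aggregator pluggable downstream.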

MetricsAggregator (lib/telemetry/MetricsAggregator.ts)

Per-statement aggregation with smart batching:

Aggregation Strategy:

  • Connection events → emit immediately (no aggregation)
  • Statement events → buffer until completeStatement() called
  • Terminal errors → flush immediately (critical failures)
  • Retryable errors → buffer until statement complete (optimize batching)

Flush Triggers:

  • Batch size reached (default: 100 metrics)
  • Periodic timer fired (default: 5000ms)
  • Terminal exception occurred (immediate flush)
  • Manual flush() called

Memory Management:

  • Bounded buffers prevent memory leaks
  • Completed statements removed from memory
  • Periodic timer cleanup

Smart Batching Benefits

  • Reduces HTTP overhead: Fewer export calls
  • Optimizes bandwidth: Batch multiple metrics
  • Critical error priority: Terminal errors flushed immediately
  • Efficient aggregation: Per-statement grouping reduces data size

Testing

  • 31 unit tests for TelemetryEventEmitter (100% function coverage)
  • 32 unit tests for MetricsAggregator (94% line, 82% branch coverage)
  • Tests verify exception swallowing (CRITICAL requirement)
  • Tests verify debug-only logging (CRITICAL requirement)
  • Tests verify batch size and timer triggers
  • Tests verify terminal vs retryable error handling

Next Steps

This PR is followed by:

  • [5/7] Export: DatabricksTelemetryExporter
  • [6/7] Integration: Wire into DBSQLClient
  • [7/7] Testing & Documentation

Dependencies

Depends on:

@samikshya-db
Collaborator Author

The emission format conforms to the telemetry proto; marking this ready for review.

samikshya-db and others added 11 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure
@samikshya-db force-pushed the telemetry-3-client-management branch from 87d1e85 to 32003e9 on January 29, 2026 20:21
samikshya-db and others added 11 commits January 29, 2026 20:21
This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy:
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)
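The retry schedule above (100ms * 2^attempt plus 0-100ms jitter, terminal statuses never retried) can be sketched as below; function and constant names are illustrative, not the exporter's internals:

```typescript
// Status classification matching the lists above.
const TERMINAL = new Set([400, 401, 403, 404]);
const RETRYABLE = new Set([429, 500, 502, 503, 504]);

// Exponential backoff with 0-100ms jitter; the jitter source is injectable
// so the schedule is testable.
function backoffMs(attempt: number, jitter: () => number = Math.random): number {
  return 100 * 2 ** attempt + Math.floor(jitter() * 100);
}

function shouldRetry(status: number, attempt: number, maxRetries = 3): boolean {
  return attempt < maxRetries && RETRYABLE.has(status) && !TERMINAL.has(status);
}
```

The jitter spreads retries from many clients over time, which is what prevents the thundering-herd effect after a shared outage.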

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]
samikshya-db and others added 14 commits February 5, 2026 16:11
Feature flags now use the same circuit breaker protection as telemetry
for resilience against endpoint failures.

**Changes:**
- FeatureFlagCache now accepts optional CircuitBreakerRegistry
- Feature flag fetches wrapped in circuit breaker execution
- Shared circuit breaker registry between feature flags and telemetry
- Per-host circuit breaker isolation maintained
- Falls back to cached values when circuit is OPEN

**Benefits:**
- Protects against repeated failures to feature flag endpoint
- Fails fast when endpoint is down (circuit OPEN)
- Auto-recovery after timeout (60s default)
- Same resilience patterns as telemetry export

**Configuration:**
- Failure threshold: 5 consecutive failures
- Timeout: 60 seconds
- Per-host isolation (failures on one host don't affect others)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ad format

- Update FeatureFlagCache tests to use new extensible flags Map
- Fix DatabricksTelemetryExporter tests to use protoLogs format
- Verify telemetry endpoints use correct paths (/telemetry-ext, /telemetry-unauth)
- 213 passing, 13 logging assertion tests need investigation
…or infrastructure

Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter,
MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening,
overflow protection, async flush/close, errorStack redaction, and IAuthentication-based
auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct
authProvider injection pattern from main.

Co-authored-by: samikshya-chand_data
…e, MetricsAggregator

Picks up #362 (shared connection stack + close() flush race fix) merged to main.
Resolves add/add conflicts by preferring main's versions of the three infrastructure
files, consistent with the prior merge strategy.

Co-authored-by: samikshya-chand_data
…race

If getOrCreateClient ran after refCount hit 0 but before clients.delete(host),
it would receive a closing TelemetryClient. Deleting synchronously first means
any concurrent caller gets a fresh instance instead.
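The delete-before-close ordering described here can be sketched as follows; class and method names are illustrative, not the exact `TelemetryClientProvider` code:

```typescript
class ClientSketch {
  closed = false;
  close(): void {
    this.closed = true;
  }
}

class ProviderSketch {
  private clients = new Map<string, { client: ClientSketch; refCount: number }>();

  getOrCreateClient(host: string): ClientSketch {
    let entry = this.clients.get(host);
    if (!entry) {
      entry = { client: new ClientSketch(), refCount: 0 };
      this.clients.set(host, entry);
    }
    entry.refCount += 1;
    return entry.client;
  }

  releaseClient(host: string): void {
    const entry = this.clients.get(host);
    if (!entry) return;
    entry.refCount -= 1;
    if (entry.refCount <= 0) {
      // Delete first: any caller arriving between these two lines creates
      // a fresh client instead of receiving the closing one.
      this.clients.delete(host);
      entry.client.close();
    }
  }
}
```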

Co-authored-by: samikshya-chand_data
- close() made synchronous (no I/O, no reason for async); uses finally to
  guarantee this.closed=true even when logger throws
- TelemetryClientProvider.releaseClient() made synchronous for the same reason
- Misleading ConcurrentHashMap reference removed from docstring
- getRefCount/getActiveClients marked @internal (test-only surface)
- Update tests to match: rejects→throws stubs, remove await on sync calls

Co-authored-by: samikshya-chand_data
…x auth

- TelemetryClient.close() is now synchronous with try/catch/finally
- TelemetryClientProvider.releaseClient() is synchronous; map entry
  deleted before close() to prevent stale-client async race
- Provider and client log at debug level with @internal test helpers
- Fix getAuthHeaders() compile error: add getAuthProvider() to
  IClientContext and update DatabricksTelemetryExporter / FeatureFlagCache
- Remove undefined TelemetryTerminalError re-export from index.ts
- types.ts DEFAULT_TELEMETRY_CONFIG now Readonly+Object.freeze

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Base automatically changed from telemetry-3-client-management to main April 22, 2026 05:09
Resolve add/add conflicts on TelemetryClientProvider.ts and its test file
plus TelemetryClient.test.ts by taking main's version. These files were
finalized in PR #326 — main has the reviewed/merged final form (refCount
underflow guard, host normalization, narrowed non-Error catch, corrected
sync-close docstring, regex log-matchers, dropped tautology test). The
[4/7] branch only carried the pre-review copies of those files; no new
callers or functionality here depend on the old internal signatures, so
main wins.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

Remove docs/TELEMETRY.md, spec/telemetry-design.md,
spec/telemetry-sprint-plan.md, spec/telemetry-test-completion-summary.md
and revert README.md — these ~4.9k lines of markdown are being
split into a stacked docs-only PR on top of this branch to keep the
[4/7] diff focused on code.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

samikshya-db added a commit that referenced this pull request Apr 22, 2026
Stacked on top of [4/7] code PR (#327). Contains the markdown that was
originally part of the [4/7] branch and was pulled out to keep the code
PR reviewable:

- docs/TELEMETRY.md — user-facing telemetry guide
- spec/telemetry-design.md — detailed design spec
- spec/telemetry-sprint-plan.md — 2-week sprint plan
- spec/telemetry-test-completion-summary.md — test coverage summary
- README.md — telemetry section

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Local-only e2e harness that duplicates what tests/e2e/telemetry/telemetry-integration.test.ts covers in CI.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

