[4/7] Telemetry Event Emission and Aggregation#327
[4/7] Telemetry Event Emission and Aggregation#327samikshya-db wants to merge 131 commits intomainfrom
Conversation
|
The emission format confirms to the telemetry proto, marked this ready for review. |
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
87d1e85 to
32003e9
Compare
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
Implements getAuthHeaders() method for authenticated REST API requests: - Added getAuthHeaders() to IClientContext interface - Implemented in DBSQLClient using authProvider.authenticate() - Updated FeatureFlagCache to fetch from connector-service API with auth - Added driver version support for version-specific feature flags - Replaced placeholder implementation with actual REST API calls Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type - Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth - Update payload to match proto with system_configuration - Add shared buildUrl utility for protocol handling Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change payload structure to match JDBC: uploadTime, items, protoLogs - protoLogs contains JSON-stringified TelemetryFrontendLog objects - Remove workspace_id (JDBC doesn't populate it) - Remove debug logs added during testing
- Fix import order in FeatureFlagCache - Replace require() with import for driverVersion - Fix variable shadowing - Disable prefer-default-export for urlUtils
This is part 2 of 7 in the telemetry implementation stack. Components: - CircuitBreaker: Per-host endpoint protection with state management - FeatureFlagCache: Per-host feature flag caching with reference counting - CircuitBreakerRegistry: Manages circuit breakers per host Circuit Breaker: - States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery) - Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE - Per-host isolation prevents cascade failures - All state transitions logged at debug level Feature Flag Cache: - Per-host caching with 15-minute TTL - Reference counting for connection lifecycle management - Automatic cache expiration and refetch - Context removed when refCount reaches zero Testing: - 32 comprehensive unit tests for CircuitBreaker - 29 comprehensive unit tests for FeatureFlagCache - 100% function coverage, >80% line/branch coverage - CircuitBreakerStub for testing other components Dependencies: - Builds on [1/7] Types and Exception Classifier
This is part 3 of 7 in the telemetry implementation stack. Components: - TelemetryClient: HTTP client for telemetry export per host - TelemetryClientProvider: Manages per-host client lifecycle with reference counting TelemetryClient: - Placeholder HTTP client for telemetry export - Per-host isolation for connection pooling - Lifecycle management (open/close) - Ready for future HTTP implementation TelemetryClientProvider: - Reference counting tracks connections per host - Automatically creates clients on first connection - Closes and removes clients when refCount reaches zero - Thread-safe per-host management Design Pattern: - Follows JDBC driver pattern for resource management - One client per host, shared across connections - Efficient resource utilization - Clean lifecycle management Testing: - 31 comprehensive unit tests for TelemetryClient - 31 comprehensive unit tests for TelemetryClientProvider - 100% function coverage, >80% line/branch coverage - Tests verify reference counting and lifecycle Dependencies: - Builds on [1/7] Types and [2/7] Infrastructure
This is part 4 of 7 in the telemetry implementation stack. Components: - TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter - MetricsAggregator: Per-statement aggregation with batch processing TelemetryEventEmitter: - Event-driven architecture using Node.js EventEmitter - Type-safe event emission methods - Respects telemetryEnabled configuration flag - All exceptions swallowed and logged at debug level - Zero impact when disabled Event Types: - connection.open: On successful connection - statement.start: On statement execution - statement.complete: On statement finish - cloudfetch.chunk: On chunk download - error: On exception with terminal classification MetricsAggregator: - Per-statement aggregation by statement_id - Connection events emitted immediately (no aggregation) - Statement events buffered until completeStatement() called - Terminal exceptions flushed immediately - Retryable exceptions buffered until statement complete - Batch size (default 100) triggers flush - Periodic timer (default 5s) triggers flush Batching Strategy: - Optimizes export efficiency - Reduces HTTP overhead - Smart flushing based on error criticality - Memory efficient with bounded buffers Testing: - 31 comprehensive unit tests for TelemetryEventEmitter - 32 comprehensive unit tests for MetricsAggregator - 100% function coverage, >90% line/branch coverage - Tests verify exception swallowing - Tests verify debug-only logging Dependencies: - Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
This is part 5 of 7 in the telemetry implementation stack. Components: - DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker - TelemetryExporterStub: Test stub for integration tests DatabricksTelemetryExporter: - Exports telemetry metrics to Databricks via HTTP POST - Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth) - Integrates with CircuitBreaker for per-host endpoint protection - Retry logic with exponential backoff and jitter - Exception classification (terminal vs retryable) Export Flow: 1. Check circuit breaker state (skip if OPEN) 2. Execute with circuit breaker protection 3. Retry on retryable errors with backoff 4. Circuit breaker tracks success/failure 5. All exceptions swallowed and logged at debug level Retry Strategy: - Max retries: 3 (default, configurable) - Exponential backoff: 100ms * 2^attempt - Jitter: Random 0-100ms to prevent thundering herd - Terminal errors: No retry (401, 403, 404, 400) - Retryable errors: Retry with backoff (429, 500, 502, 503, 504) Circuit Breaker Integration: - Success → Record success with circuit breaker - Failure → Record failure with circuit breaker - Circuit OPEN → Skip export, log at debug - Automatic recovery via HALF_OPEN state Critical Requirements: - All exceptions swallowed (NEVER throws) - All logging at LogLevel.debug ONLY - No console logging - Driver continues when telemetry fails Testing: - 24 comprehensive unit tests - 96% statement coverage, 84% branch coverage - Tests verify exception swallowing - Tests verify retry logic - Tests verify circuit breaker integration - TelemetryExporterStub for integration tests Dependencies: - Builds on all previous layers [1/7] through [4/7]
Feature flags now use the same circuit breaker protection as telemetry for resilience against endpoint failures. **Changes:** - FeatureFlagCache now accepts optional CircuitBreakerRegistry - Feature flag fetches wrapped in circuit breaker execution - Shared circuit breaker registry between feature flags and telemetry - Per-host circuit breaker isolation maintained - Falls back to cached values when circuit is OPEN **Benefits:** - Protects against repeated failures to feature flag endpoint - Fails fast when endpoint is down (circuit OPEN) - Auto-recovery after timeout (60s default) - Same resilience patterns as telemetry export **Configuration:** - Failure threshold: 5 consecutive failures - Timeout: 60 seconds - Per-host isolation (failures on one host don't affect others) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ad format - Update FeatureFlagCache tests to use new extensible flags Map - Fix DatabricksTelemetryExporter tests to use protoLogs format - Verify telemetry endpoints use correct paths (/telemetry-ext, /telemetry-unauth) - 213 passing, 13 logging assertion tests need investigation
…or infrastructure Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter, MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening, overflow protection, async flush/close, errorStack redaction, and IAuthentication-based auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct authProvider injection pattern from main. Co-authored-by: samikshya-chand_data
…e, MetricsAggregator Picks up #362 (shared connection stack + close() flush race fix) merged to main. Resolves add/add conflicts by preferring main's versions of the three infrastructure files, consistent with the prior merge strategy. Co-authored-by: samikshya-chand_data
Co-authored-by: samikshya-chand_data
…race If getOrCreateClient ran after refCount hit 0 but before clients.delete(host), it would receive a closing TelemetryClient. Deleting synchronously first means any concurrent caller gets a fresh instance instead. Co-authored-by: samikshya-chand_data
- close() made synchronous (no I/O, no reason for async); uses finally to guarantee this.closed=true even when logger throws - TelemetryClientProvider.releaseClient() made synchronous for the same reason - Misleading ConcurrentHashMap reference removed from docstring - getRefCount/getActiveClients marked @internal (test-only surface) - Update tests to match: rejects→throws stubs, remove await on sync calls Co-authored-by: samikshya-chand_data
…x auth - TelemetryClient.close() is now synchronous with try/catch/finally - TelemetryClientProvider.releaseClient() is synchronous; map entry deleted before close() to prevent stale-client async race - Provider and client log at debug level with @internal test helpers - Fix getAuthHeaders() compile error: add getAuthProvider() to IClientContext and update DatabricksTelemetryExporter / FeatureFlagCache - Remove undefined TelemetryTerminalError re-export from index.ts - types.ts DEFAULT_TELEMETRY_CONFIG now Readonly+Object.freeze Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Resolve add/add conflicts on TelemetryClientProvider.ts and its test file plus TelemetryClient.test.ts by taking main's version. These files were finalized in PR #326 — main has the reviewed/merged final form (refCount underflow guard, host normalization, narrowed non-Error catch, corrected sync-close docstring, regex log-matchers, dropped tautology test). The [4/7] branch only carried the pre-review copies of those files; no new callers or functionality here depend on the old internal signatures, so main wins. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Remove docs/TELEMETRY.md, spec/telemetry-design.md, spec/telemetry-sprint-plan.md, spec/telemetry-test-completion-summary.md and revert README.md — these ~4.9k lines of markdown are being split into a stacked docs-only PR on top of this branch to keep the [4/7] diff focused on code. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Stacked on top of [4/7] code PR (#327). Contains the markdown that was originally part of the [4/7] branch and was pulled out to keep the code PR reviewable: - docs/TELEMETRY.md — user-facing telemetry guide - spec/telemetry-design.md — detailed design spec - spec/telemetry-sprint-plan.md — 2-week sprint plan - spec/telemetry-test-completion-summary.md — test coverage summary - README.md — telemetry section Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Local-only e2e harness that duplicates what tests/e2e/telemetry/telemetry-integration.test.ts covers in CI. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Co-authored-by: Isaac
|
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase ( |
Part 4 of 7-part Telemetry Implementation Stack
This layer adds event emission and per-statement aggregation with smart batching.
Summary
Implements TelemetryEventEmitter for event-driven telemetry and MetricsAggregator for efficient per-statement aggregation with smart flushing.
Components
TelemetryEventEmitter (
lib/telemetry/TelemetryEventEmitter.ts)Event-driven architecture using Node.js EventEmitter:
telemetryEnabledconfiguration flagEvent Types:
connection.open- Successful connection establishmentstatement.start- Statement execution beginsstatement.complete- Statement execution completescloudfetch.chunk- CloudFetch chunk downloadederror- Exception occurred with terminal classificationMetricsAggregator (
lib/telemetry/MetricsAggregator.ts)Per-statement aggregation with smart batching:
Aggregation Strategy:
completeStatement()calledFlush Triggers:
flush()calledMemory Management:
Smart Batching Benefits
Testing
Next Steps
This PR is followed by:
Dependencies
Depends on: