# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SQLMesh is a next-generation data transformation framework that enables:
- Virtual data environments for isolated development without warehouse costs
- Plan/apply workflow (like Terraform) for safe deployments
- Multi-dialect SQL support with automatic transpilation
- Incremental processing to run only necessary transformations
- Built-in testing and CI/CD integration

**Requirements**: Python >= 3.9 (Note: Python 3.13+ is not yet supported)

## Essential Commands

### Environment setup
```bash
# Create and activate a Python virtual environment (Python >= 3.9, < 3.13)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install development dependencies
make install-dev

# Set up pre-commit hooks (important for code quality)
make install-pre-commit
```

### Common Development Tasks
```bash
# Run linters and formatters (ALWAYS run before committing)
make style

# Fast tests for quick feedback during development
make fast-test

# Slow tests for comprehensive coverage
make slow-test

# Run specific test file
pytest tests/core/test_context.py -v

# Run tests with specific marker
pytest -m "not slow and not docker" -v

# Build package
make package

# Serve documentation locally
make docs-serve
```

### Engine-Specific Testing
```bash
# DuckDB (default, no setup required)
make duckdb-test

# Other engines require credentials/Docker
make snowflake-test # Needs SNOWFLAKE_* env vars
make bigquery-test # Needs GOOGLE_APPLICATION_CREDENTIALS
make databricks-test # Needs DATABRICKS_* env vars
```

### UI Development
```bash
# In web/client directory
pnpm run dev # Start development server
pnpm run build # Production build
pnpm run test # Run tests

# Docker-based UI
make ui-up # Start UI in Docker
make ui-down # Stop UI
```

## Architecture Overview

### Core Components

**sqlmesh/core/context.py**: The main Context class orchestrates all SQLMesh operations. This is the entry point for understanding how models are loaded, plans are created, and executions are carried out.

**sqlmesh/core/model/**: Model definitions and kinds (FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, etc.). Each model kind has specific behaviors for how data is processed.

**sqlmesh/core/snapshot/**: The versioning system. Snapshots are immutable versions of models identified by fingerprints. Understanding snapshots is crucial for how SQLMesh tracks changes.

**sqlmesh/core/plan/**: Plan building and evaluation logic. Plans determine what changes need to be applied and in what order.

**sqlmesh/core/engine_adapter/**: Database engine adapters provide a unified interface across 16+ SQL engines. Each adapter handles engine-specific SQL generation and execution.

### Key Concepts

1. **Virtual Environments**: Lightweight branches that share unchanged data between environments, reducing storage costs and deployment time.

2. **Fingerprinting**: Models are versioned using content-based fingerprints. Any change to a model's logic creates a new version.

3. **State Sync**: Manages metadata across different backends (can be stored in the data warehouse or external databases).

4. **Intervals**: Time-based partitioning system for incremental models, tracking what data has been processed.
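The fingerprinting idea in particular can be sketched in a few lines. This is an illustrative stand-in only; SQLMesh's real fingerprints (in `sqlmesh/core/snapshot/`) also cover macros and upstream dependencies:

```python
import hashlib
import json

def fingerprint(model_definition: dict) -> str:
    # Content-based version: serialize the model deterministically and hash it.
    # Any change to the model's logic produces a new fingerprint, and therefore
    # a new snapshot version. (Sketch only; not SQLMesh's actual algorithm.)
    canonical = json.dumps(model_definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```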

## Testing Philosophy

- Tests are marked with pytest markers:
- **Type markers**: `fast`, `slow`, `docker`, `remote`, `cicdonly`, `isolated`, `registry_isolation`
- **Domain markers**: `cli`, `dbt`, `github`, `jupyter`, `web`
- **Engine markers**: `engine`, `athena`, `bigquery`, `clickhouse`, `databricks`, `duckdb`, `motherduck`, `mssql`, `mysql`, `postgres`, `redshift`, `snowflake`, `spark`, `trino`, `risingwave`
- Default to `fast` tests during development
- Engine tests use real connections when available, mocks otherwise
- The `sushi` example project is used extensively in tests
- Use `DuckDBMetadata` helper for validating table metadata in tests
- Tests run in parallel by default (`pytest -n auto`)

## Code Style Guidelines

- Python: Black formatting, isort for imports, mypy for type checking, Ruff for linting
- TypeScript/React: ESLint + Prettier configuration
- SQL: SQLGlot handles parsing/formatting
- All style checks run via `make style`
- Pre-commit hooks enforce all style rules automatically
- Important: Some modules (duckdb, numpy, pandas) are banned at module level to prevent import-time side effects
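In practice the banned-module rule means deferring heavy imports into the function that needs them. A minimal sketch of the pattern, using a stdlib module as a stand-in for pandas/numpy/duckdb:

```python
def summarize(rows):
    # Deferred import: the dependency is loaded on first call rather than at
    # module import time, keeping `import sqlmesh` fast for code paths that
    # never need it. (statistics stands in for pandas/numpy/duckdb here.)
    import statistics
    return statistics.mean(rows)
```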

## Important Files

- `sqlmesh/core/context.py`: Main orchestration class
- `examples/sushi/`: Reference implementation used in tests
- `web/server/main.py`: Web UI backend entry point
- `web/client/src/App.tsx`: Web UI frontend entry point
- `vscode/extension/src/extension.ts`: VSCode extension entry point

## Common Pitfalls

1. **Engine Tests**: Many tests require specific database credentials or Docker. Check test markers before running.

2. **Path Handling**: Be careful with Windows paths - use `pathlib.Path` for cross-platform compatibility.

3. **State Management**: Understanding the state sync mechanism is crucial for debugging environment issues.

4. **Snapshot Versioning**: Changes to model logic create new versions - this is by design for safe deployments.

5. **Module Imports**: Avoid importing duckdb, numpy, or pandas at module level - these are banned by Ruff to prevent long load times in cases where the libraries aren't used.
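For pitfall 2, a small illustration of the `pathlib` approach (the helper name is hypothetical, not an actual SQLMesh function):

```python
from pathlib import Path

def model_file(project_root: str, model_name: str) -> str:
    # Hypothetical helper: Path joins are separator-safe on both Windows and
    # POSIX, and as_posix() yields a stable, normalized string suitable for
    # state keys and logs.
    return (Path(project_root) / "models" / f"{model_name}.sql").as_posix()
```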

## GitHub CI/CD Bot Architecture

SQLMesh includes a GitHub CI/CD bot integration that automates data transformation workflows. The implementation is located in `sqlmesh/integrations/github/` and is organized as a layered command-pattern architecture.

### Code Organization

**Core Integration Files:**
- `sqlmesh/cicd/bot.py`: Main CLI entry point (`sqlmesh_cicd` command)
- `sqlmesh/integrations/github/cicd/controller.py`: Core bot orchestration logic
- `sqlmesh/integrations/github/cicd/command.py`: Individual command implementations
- `sqlmesh/integrations/github/cicd/config.py`: Configuration classes and validation

### Architecture Pattern

The bot follows a **Command Pattern** architecture:

1. **CLI Layer** (`bot.py`): Handles argument parsing and delegates to controllers
2. **Controller Layer** (`controller.py`): Orchestrates workflow execution and manages state
3. **Command Layer** (`command.py`): Implements individual operations (test, deploy, plan, etc.)
4. **Configuration Layer** (`config.py`): Manages bot configuration and validation
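A toy sketch of this command-pattern layering (the names and registry are illustrative, not the bot's actual API):

```python
from typing import Callable, Dict

# Command registry: the CLI layer resolves a name to a callable,
# the controller layer invokes it.
COMMANDS: Dict[str, Callable[[], str]] = {}

def command(name: str):
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        COMMANDS[name] = fn
        return fn
    return register

@command("run_tests")
def run_tests() -> str:
    # Stand-in for a real command implementation.
    return "unit tests: ok"

def dispatch(name: str) -> str:
    # Controller-style dispatch point.
    return COMMANDS[name]()
```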

### Key Components

**GitHubCICDController**: Main orchestrator that:
- Manages GitHub API interactions via PyGithub
- Coordinates workflow execution across different commands
- Handles error reporting through GitHub Check Runs
- Manages PR comment interactions and status updates

**Command Implementations**:
- `run_tests()`: Executes unit tests with detailed reporting
- `update_pr_environment()`: Creates/updates virtual PR environments
- `gen_prod_plan()`: Generates production deployment plans
- `deploy_production()`: Handles production deployments
- `check_required_approvers()`: Validates approval requirements

**Configuration Management**:
- Uses Pydantic models for type-safe configuration
- Supports both YAML config files and environment variables
- Validates bot settings and user permissions
- Handles approval workflows and deployment triggers
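The shape of such a config can be sketched as follows. The real code uses Pydantic models in `sqlmesh/integrations/github/cicd/config.py`; this stdlib dataclass stand-in, and its field names, are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BotConfig:
    # Hypothetical fields, shown only to illustrate typed config with
    # validation; not the actual bot configuration schema.
    required_approvers: List[str] = field(default_factory=list)
    auto_deploy: bool = False

    def __post_init__(self) -> None:
        if self.auto_deploy and not self.required_approvers:
            raise ValueError("auto_deploy requires at least one approver")
```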

### Integration with Core SQLMesh

The bot leverages core SQLMesh components:
- **Context**: Uses SQLMesh Context for project operations
- **Plan/Apply**: Integrates with SQLMesh's plan generation and application
- **Virtual Environments**: Creates isolated PR environments using SQLMesh's virtual data environments
- **State Sync**: Manages metadata synchronization across environments
- **Testing Framework**: Executes SQLMesh unit tests and reports results

### Error Handling and Reporting

- **GitHub Check Runs**: Creates detailed status reports for each workflow step
- **PR Comments**: Provides user-friendly feedback on failures and successes
- **Structured Logging**: Uses SQLMesh's logging framework for debugging
- **Exception Handling**: Graceful handling of GitHub API failures and SQLMesh errors

## Environment Variables for Engine Testing

When running engine-specific tests, these environment variables are required:

- **Snowflake**: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_WAREHOUSE`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`
- **BigQuery**: `BIGQUERY_KEYFILE` or `GOOGLE_APPLICATION_CREDENTIALS`
- **Databricks**: `DATABRICKS_CATALOG`, `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH`, `DATABRICKS_ACCESS_TOKEN`, `DATABRICKS_CONNECT_VERSION`
- **Redshift**: `REDSHIFT_HOST`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, `REDSHIFT_DATABASE`
- **Athena**: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `ATHENA_S3_WAREHOUSE_LOCATION`
- **ClickHouse Cloud**: `CLICKHOUSE_CLOUD_HOST`, `CLICKHOUSE_CLOUD_USERNAME`, `CLICKHOUSE_CLOUD_PASSWORD`

## Migrations System

SQLMesh uses a migration system to evolve its internal state database schema and metadata format. The migrations handle changes to SQLMesh's internal structure, not user data transformations.

### Migration Structure

**Location**: `sqlmesh/migrations/` - Contains 80+ migration files from v0001 to v0083+

**Naming Convention**: `v{XXXX}_{descriptive_name}.py` (e.g., `v0001_init.py`, `v0083_use_sql_for_scd_time_data_type_data_hash.py`)

**Core Infrastructure**:
- `sqlmesh/core/state_sync/db/migrator.py`: Main migration orchestrator
- `sqlmesh/utils/migration.py`: Cross-database compatibility utilities
- `sqlmesh/core/state_sync/base.py`: Auto-discovery and loading logic

### Migration Categories

**Schema Evolution**:
- State table creation/modification (snapshots, environments, intervals)
- Column additions/removals and index management
- Database engine compatibility fixes (MySQL/MSSQL field size limits)

**Data Format Migrations**:
- JSON metadata structure updates (snapshot serialization changes)
- Path normalization (Windows compatibility)
- Fingerprint recalculation when SQLGlot parsing changes

**Cleanup Operations**:
- Removing obsolete tables and unused data
- Metadata optimization and attribute cleanup

### Key Migration Patterns

```python
# Standard migration function signature
def migrate(state_sync, **kwargs):  # type: ignore
    engine_adapter = state_sync.engine_adapter
    schema = state_sync.schema
    # Migration logic here

# Common operations
engine_adapter.create_state_table(table_name, columns_dict)
engine_adapter.alter_table(alter_expression)
engine_adapter.drop_table(table_name)
```

### State Management Integration

**Core State Tables**:
- `_snapshots`: Model version metadata (most frequently migrated)
- `_environments`: Environment definitions
- `_versions`: Schema/SQLGlot/SQLMesh version tracking
- `_intervals`: Incremental processing metadata

**Migration Safety**:
- Automatic backups before migration (unless `skip_backup=True`)
- Atomic database transactions for consistency
- Snapshot count validation before/after migrations
- Automatic rollback on failures
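The transaction guarantee can be sketched like this (illustrative, using sqlite3 as the engine; SQLMesh additionally backs up state tables before migrating):

```python
import sqlite3

def run_migration_safely(conn: sqlite3.Connection, migration) -> None:
    # Each migration runs inside an explicit transaction: if it raises,
    # the partial changes are rolled back and the error is re-raised.
    try:
        conn.execute("BEGIN")
        migration(conn)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```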

### Migration Execution

**Auto-Discovery**: Migrations are automatically loaded using `pkgutil.iter_modules()`

**Triggers**: Migrations run automatically when:
- Schema version mismatch detected
- SQLGlot version changes require fingerprint recalculation
- Manual `sqlmesh migrate` command execution

**Execution Flow**:
1. Version comparison (local vs remote schema)
2. Backup creation of state tables
3. Sequential migration execution (numerical order)
4. Snapshot fingerprint recalculation if needed
5. Environment updates with new snapshot references
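The ordering in step 3 can be sketched as a sort on the numeric version embedded in each `v{XXXX}_{name}` module name (illustrative; the real loader discovers the modules with `pkgutil.iter_modules()` over `sqlmesh/migrations/`):

```python
def order_migrations(module_names):
    # Sort v{XXXX}_{name} modules by their numeric version so migrations
    # apply sequentially in the intended order.
    def version(name: str) -> int:
        return int(name.split("_", 1)[0].lstrip("v"))
    return sorted(module_names, key=version)
```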

## dbt Integration

SQLMesh provides native support for dbt projects, allowing users to run existing dbt projects while gaining access to SQLMesh's advanced features like virtual environments and plan/apply workflows.

### Core dbt Integration

**Location**: `sqlmesh/dbt/` - Complete dbt integration architecture

**Key Components**:
- `sqlmesh/dbt/loader.py`: Main dbt project loader extending SQLMesh's base loader
- `sqlmesh/dbt/manifest.py`: dbt manifest parsing and project discovery
- `sqlmesh/dbt/adapter.py`: dbt adapter system for SQL execution and schema operations
- `sqlmesh/dbt/model.py`: dbt model configurations and materialization mapping
- `sqlmesh/dbt/context.py`: dbt project context and environment management

### Project Conversion

**dbt Converter**: `sqlmesh/dbt/converter/` - Tools for migrating dbt projects to SQLMesh

**Key Features**:
- `convert.py`: Main conversion orchestration
- `jinja.py` & `jinja_transforms.py`: Jinja template and macro conversion
- Full support for dbt assets (models, seeds, sources, tests, snapshots, macros)

**CLI Commands**:
```bash
# Initialize SQLMesh in existing dbt project
sqlmesh init -t dbt

# Convert dbt project to SQLMesh format
sqlmesh dbt convert
```

### Supported dbt Features

**Project Structure**:
- Full dbt project support (models, seeds, sources, tests, snapshots, macros)
- dbt package dependencies and version management
- Profile integration using existing `profiles.yml` for connections

**Materializations**:
- All standard dbt materializations (table, view, incremental, ephemeral)
- Incremental model strategies (delete+insert, merge, insert_overwrite)
- SCD Type 2 support and snapshot strategies

**Advanced Features**:
- Jinja templating with full macro support
- Runtime variable passing and configuration
- dbt test integration and execution
- Cross-database compatibility with SQLMesh's multi-dialect support

### Example Projects

**sushi_dbt**: `examples/sushi_dbt/` - Complete dbt project running with SQLMesh
**Test Fixtures**: `tests/fixtures/dbt/sushi_test/` - Comprehensive test dbt project with all asset types

### Integration Benefits

When using dbt with SQLMesh, you gain:
- **Virtual Environments**: Isolated development without warehouse costs
- **Plan/Apply Workflow**: Safe deployments with change previews
- **Multi-Dialect Support**: Run the same dbt project across different SQL engines
- **Advanced Testing**: Enhanced testing capabilities beyond standard dbt tests
- **State Management**: Sophisticated metadata and versioning system