# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SQLMesh is a next-generation data transformation framework that enables:
- Virtual data environments for isolated development without warehouse costs
- Plan/apply workflow (like Terraform) for safe deployments
- Multi-dialect SQL support with automatic transpilation
- Incremental processing to run only necessary transformations
- Built-in testing and CI/CD integration

**Requirements**: Python >= 3.9 (Note: Python 3.13+ is not yet supported)

## Essential Commands

### Environment setup
```bash
# Create and activate a Python virtual environment (Python >= 3.9, < 3.13)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install development dependencies
make install-dev

# Set up pre-commit hooks (important for code quality)
make install-pre-commit
```

### Common Development Tasks
```bash
# Run linters and formatters (ALWAYS run before committing)
make style

# Fast tests for quick feedback during development
make fast-test

# Slow tests for comprehensive coverage
make slow-test

# Run specific test file
pytest tests/core/test_context.py -v

# Run tests with specific marker
pytest -m "not slow and not docker" -v

# Build package
make package

# Serve documentation locally
make docs-serve
```

### Engine-Specific Testing
```bash
# DuckDB (default, no setup required)
make duckdb-test

# Other engines require credentials/Docker
make snowflake-test # Needs SNOWFLAKE_* env vars
make bigquery-test # Needs GOOGLE_APPLICATION_CREDENTIALS
make databricks-test # Needs DATABRICKS_* env vars
```

### UI Development
```bash
# In web/client directory
pnpm run dev # Start development server
pnpm run build # Production build
pnpm run test # Run tests

# Docker-based UI
make ui-up # Start UI in Docker
make ui-down # Stop UI
```

## Architecture Overview

### Core Components

**sqlmesh/core/context.py**: The main Context class orchestrates all SQLMesh operations. This is the entry point for understanding how models are loaded, plans are created, and executions are carried out.

**sqlmesh/core/model/**: Model definitions and kinds (FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, etc.). Each model kind has specific behaviors for how data is processed.

**sqlmesh/core/snapshot/**: The versioning system. Snapshots are immutable versions of models identified by fingerprints. Understanding snapshots is crucial for how SQLMesh tracks changes.

**sqlmesh/core/plan/**: Plan building and evaluation logic. Plans determine what changes need to be applied and in what order.

**sqlmesh/core/engine_adapter/**: Database engine adapters provide a unified interface across 16+ SQL engines. Each adapter handles engine-specific SQL generation and execution.

### Key Concepts

1. **Virtual Environments**: Lightweight branches that share unchanged data between environments, reducing storage costs and deployment time.

2. **Fingerprinting**: Models are versioned using content-based fingerprints. Any change to a model's logic creates a new version.

3. **State Sync**: Manages metadata across different backends (can be stored in the data warehouse or external databases).

4. **Intervals**: Time-based partitioning system for incremental models, tracking what data has been processed.
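The fingerprinting idea in particular can be sketched in a few lines. This is an illustrative stand-in only; SQLMesh's real fingerprints (in `sqlmesh/core/snapshot/`) also cover macros and upstream dependencies:

```python
import hashlib
import json

def fingerprint(model_definition: dict) -> str:
    # Content-based version: serialize the model deterministically and hash it.
    # Any change to the model's logic produces a new fingerprint, and therefore
    # a new snapshot version. (Sketch only; not SQLMesh's actual algorithm.)
    canonical = json.dumps(model_definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```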

## Testing Philosophy

- Tests are marked with pytest markers:
- **Type markers**: `fast`, `slow`, `docker`, `remote`, `cicdonly`, `isolated`, `registry_isolation`
- **Domain markers**: `cli`, `dbt`, `github`, `jupyter`, `web`
- **Engine markers**: `engine`, `athena`, `bigquery`, `clickhouse`, `databricks`, `duckdb`, `motherduck`, `mssql`, `mysql`, `postgres`, `redshift`, `snowflake`, `spark`, `trino`, `risingwave`
- Default to `fast` tests during development
- Engine tests use real connections when available, mocks otherwise
- The `sushi` example project is used extensively in tests
- Use `DuckDBMetadata` helper for validating table metadata in tests
- Tests run in parallel by default (`pytest -n auto`)

## Code Style Guidelines

- Python: Black formatting, isort for imports, mypy for type checking, Ruff for linting
- TypeScript/React: ESLint + Prettier configuration
- SQL: SQLGlot handles parsing/formatting
- All style checks run via `make style`
- Pre-commit hooks enforce all style rules automatically
- Important: Some modules (duckdb, numpy, pandas) are banned at module level to prevent import-time side effects
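In practice the banned-module rule means deferring heavy imports into the function that needs them. A minimal sketch of the pattern, using a stdlib module as a stand-in for pandas/numpy/duckdb:

```python
def summarize(rows):
    # Deferred import: the dependency is loaded on first call rather than at
    # module import time, keeping `import sqlmesh` fast for code paths that
    # never need it. (statistics stands in for pandas/numpy/duckdb here.)
    import statistics
    return statistics.mean(rows)
```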

## Important Files

- `sqlmesh/core/context.py`: Main orchestration class
- `examples/sushi/`: Reference implementation used in tests
- `web/server/main.py`: Web UI backend entry point
- `web/client/src/App.tsx`: Web UI frontend entry point
- `vscode/extension/src/extension.ts`: VSCode extension entry point

## Common Pitfalls

1. **Engine Tests**: Many tests require specific database credentials or Docker. Check test markers before running.

2. **Path Handling**: Be careful with Windows paths - use `pathlib.Path` for cross-platform compatibility.

3. **State Management**: Understanding the state sync mechanism is crucial for debugging environment issues.

4. **Snapshot Versioning**: Changes to model logic create new versions - this is by design for safe deployments.

5. **Module Imports**: Avoid importing duckdb, numpy, or pandas at module level - these are banned by Ruff to prevent long load times in cases where the libraries aren't used.
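For pitfall 2, a small illustration of the `pathlib` approach (the helper name is hypothetical, not an actual SQLMesh function):

```python
from pathlib import Path

def model_file(project_root: str, model_name: str) -> str:
    # Hypothetical helper: Path joins are separator-safe on both Windows and
    # POSIX, and as_posix() yields a stable, normalized string suitable for
    # state keys and logs.
    return (Path(project_root) / "models" / f"{model_name}.sql").as_posix()
```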

## GitHub CI/CD Bot Architecture

SQLMesh includes a GitHub CI/CD bot integration that automates data transformation workflows. The implementation is located in `sqlmesh/integrations/github/` and is organized as a layered command-pattern architecture.

### Code Organization

**Core Integration Files:**
- `sqlmesh/cicd/bot.py`: Main CLI entry point (`sqlmesh_cicd` command)
- `sqlmesh/integrations/github/cicd/controller.py`: Core bot orchestration logic
- `sqlmesh/integrations/github/cicd/command.py`: Individual command implementations
- `sqlmesh/integrations/github/cicd/config.py`: Configuration classes and validation

### Architecture Pattern

The bot follows a **Command Pattern** architecture:

1. **CLI Layer** (`bot.py`): Handles argument parsing and delegates to controllers
2. **Controller Layer** (`controller.py`): Orchestrates workflow execution and manages state
3. **Command Layer** (`command.py`): Implements individual operations (test, deploy, plan, etc.)
4. **Configuration Layer** (`config.py`): Manages bot configuration and validation
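A toy sketch of this command-pattern layering (the names and registry are illustrative, not the bot's actual API):

```python
from typing import Callable, Dict

# Command registry: the CLI layer resolves a name to a callable,
# the controller layer invokes it.
COMMANDS: Dict[str, Callable[[], str]] = {}

def command(name: str):
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        COMMANDS[name] = fn
        return fn
    return register

@command("run_tests")
def run_tests() -> str:
    # Stand-in for a real command implementation.
    return "unit tests: ok"

def dispatch(name: str) -> str:
    # Controller-style dispatch point.
    return COMMANDS[name]()
```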

### Key Components

**GitHubCICDController**: Main orchestrator that:
- Manages GitHub API interactions via PyGithub
- Coordinates workflow execution across different commands
- Handles error reporting through GitHub Check Runs
- Manages PR comment interactions and status updates

**Command Implementations**:
- `run_tests()`: Executes unit tests with detailed reporting
- `update_pr_environment()`: Creates/updates virtual PR environments
- `gen_prod_plan()`: Generates production deployment plans
- `deploy_production()`: Handles production deployments
- `check_required_approvers()`: Validates approval requirements

**Configuration Management**:
- Uses Pydantic models for type-safe configuration
- Supports both YAML config files and environment variables
- Validates bot settings and user permissions
- Handles approval workflows and deployment triggers
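The shape of such a config can be sketched as follows. The real code uses Pydantic models in `sqlmesh/integrations/github/cicd/config.py`; this stdlib dataclass stand-in, and its field names, are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BotConfig:
    # Hypothetical fields, shown only to illustrate typed config with
    # validation; not the actual bot configuration schema.
    required_approvers: List[str] = field(default_factory=list)
    auto_deploy: bool = False

    def __post_init__(self) -> None:
        if self.auto_deploy and not self.required_approvers:
            raise ValueError("auto_deploy requires at least one approver")
```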

### Integration with Core SQLMesh

The bot leverages core SQLMesh components:
- **Context**: Uses SQLMesh Context for project operations
- **Plan/Apply**: Integrates with SQLMesh's plan generation and application
- **Virtual Environments**: Creates isolated PR environments using SQLMesh's virtual data environments
- **State Sync**: Manages metadata synchronization across environments
- **Testing Framework**: Executes SQLMesh unit tests and reports results

### Error Handling and Reporting

- **GitHub Check Runs**: Creates detailed status reports for each workflow step
- **PR Comments**: Provides user-friendly feedback on failures and successes
- **Structured Logging**: Uses SQLMesh's logging framework for debugging
- **Exception Handling**: Graceful handling of GitHub API failures and SQLMesh errors

## Environment Variables for Engine Testing

When running engine-specific tests, these environment variables are required:

- **Snowflake**: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_WAREHOUSE`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`
- **BigQuery**: `BIGQUERY_KEYFILE` or `GOOGLE_APPLICATION_CREDENTIALS`
- **Databricks**: `DATABRICKS_CATALOG`, `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH`, `DATABRICKS_ACCESS_TOKEN`, `DATABRICKS_CONNECT_VERSION`
- **Redshift**: `REDSHIFT_HOST`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, `REDSHIFT_DATABASE`
- **Athena**: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `ATHENA_S3_WAREHOUSE_LOCATION`
- **ClickHouse Cloud**: `CLICKHOUSE_CLOUD_HOST`, `CLICKHOUSE_CLOUD_USERNAME`, `CLICKHOUSE_CLOUD_PASSWORD`

## Migrations System

SQLMesh uses a migration system to evolve its internal state database schema and metadata format. The migrations handle changes to SQLMesh's internal structure, not user data transformations.

### Migration Structure

**Location**: `sqlmesh/migrations/` - Contains 80+ migration files from v0001 to v0083+

**Naming Convention**: `v{XXXX}_{descriptive_name}.py` (e.g., `v0001_init.py`, `v0083_use_sql_for_scd_time_data_type_data_hash.py`)

**Core Infrastructure**:
- `sqlmesh/core/state_sync/db/migrator.py`: Main migration orchestrator
- `sqlmesh/utils/migration.py`: Cross-database compatibility utilities
- `sqlmesh/core/state_sync/base.py`: Auto-discovery and loading logic

### Migration Categories

**Schema Evolution**:
- State table creation/modification (snapshots, environments, intervals)
- Column additions/removals and index management
- Database engine compatibility fixes (MySQL/MSSQL field size limits)

**Data Format Migrations**:
- JSON metadata structure updates (snapshot serialization changes)
- Path normalization (Windows compatibility)
- Fingerprint recalculation when SQLGlot parsing changes

**Cleanup Operations**:
- Removing obsolete tables and unused data
- Metadata optimization and attribute cleanup

### Key Migration Patterns

```python
# Standard migration function signature
def migrate(state_sync, **kwargs):  # type: ignore
    engine_adapter = state_sync.engine_adapter
    schema = state_sync.schema
    # Migration logic here

# Common operations
engine_adapter.create_state_table(table_name, columns_dict)
engine_adapter.alter_table(alter_expression)
engine_adapter.drop_table(table_name)
```

### State Management Integration

**Core State Tables**:
- `_snapshots`: Model version metadata (most frequently migrated)
- `_environments`: Environment definitions
- `_versions`: Schema/SQLGlot/SQLMesh version tracking
- `_intervals`: Incremental processing metadata

**Migration Safety**:
- Automatic backups before migration (unless `skip_backup=True`)
- Atomic database transactions for consistency
- Snapshot count validation before/after migrations
- Automatic rollback on failures
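The transaction guarantee can be sketched like this (illustrative, using sqlite3 as the engine; SQLMesh additionally backs up state tables before migrating):

```python
import sqlite3

def run_migration_safely(conn: sqlite3.Connection, migration) -> None:
    # Each migration runs inside an explicit transaction: if it raises,
    # the partial changes are rolled back and the error is re-raised.
    try:
        conn.execute("BEGIN")
        migration(conn)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```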

### Migration Execution

**Auto-Discovery**: Migrations are automatically loaded using `pkgutil.iter_modules()`

**Triggers**: Migrations run automatically when:
- Schema version mismatch detected
- SQLGlot version changes require fingerprint recalculation
- Manual `sqlmesh migrate` command execution

**Execution Flow**:
1. Version comparison (local vs remote schema)
2. Backup creation of state tables
3. Sequential migration execution (numerical order)
4. Snapshot fingerprint recalculation if needed
5. Environment updates with new snapshot references
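The ordering in step 3 can be sketched as a sort on the numeric version embedded in each `v{XXXX}_{name}` module name (illustrative; the real loader discovers the modules with `pkgutil.iter_modules()` over `sqlmesh/migrations/`):

```python
def order_migrations(module_names):
    # Sort v{XXXX}_{name} modules by their numeric version so migrations
    # apply sequentially in the intended order.
    def version(name: str) -> int:
        return int(name.split("_", 1)[0].lstrip("v"))
    return sorted(module_names, key=version)
```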

## dbt Integration

SQLMesh provides native support for dbt projects, allowing users to run existing dbt projects while gaining access to SQLMesh's advanced features like virtual environments and plan/apply workflows.

### Core dbt Integration

**Location**: `sqlmesh/dbt/` - Complete dbt integration architecture

**Key Components**:
- `sqlmesh/dbt/loader.py`: Main dbt project loader extending SQLMesh's base loader
- `sqlmesh/dbt/manifest.py`: dbt manifest parsing and project discovery
- `sqlmesh/dbt/adapter.py`: dbt adapter system for SQL execution and schema operations
- `sqlmesh/dbt/model.py`: dbt model configurations and materialization mapping
- `sqlmesh/dbt/context.py`: dbt project context and environment management

### Project Conversion

**dbt Converter**: `sqlmesh/dbt/converter/` - Tools for migrating dbt projects to SQLMesh

**Key Features**:
- `convert.py`: Main conversion orchestration
- `jinja.py` & `jinja_transforms.py`: Jinja template and macro conversion
- Full support for dbt assets (models, seeds, sources, tests, snapshots, macros)

**CLI Commands**:
```bash
# Initialize SQLMesh in existing dbt project
sqlmesh init -t dbt

# Convert dbt project to SQLMesh format
sqlmesh dbt convert
```

### Supported dbt Features

**Project Structure**:
- Full dbt project support (models, seeds, sources, tests, snapshots, macros)
- dbt package dependencies and version management
- Profile integration using existing `profiles.yml` for connections

**Materializations**:
- All standard dbt materializations (table, view, incremental, ephemeral)
- Incremental model strategies (delete+insert, merge, insert_overwrite)
- SCD Type 2 support and snapshot strategies

**Advanced Features**:
- Jinja templating with full macro support
- Runtime variable passing and configuration
- dbt test integration and execution
- Cross-database compatibility with SQLMesh's multi-dialect support

### Example Projects

**sushi_dbt**: `examples/sushi_dbt/` - Complete dbt project running with SQLMesh
**Test Fixtures**: `tests/fixtures/dbt/sushi_test/` - Comprehensive test dbt project with all asset types

### Integration Benefits

When using dbt with SQLMesh, you gain:
- **Virtual Environments**: Isolated development without warehouse costs
- **Plan/Apply Workflow**: Safe deployments with change previews
- **Multi-Dialect Support**: Run the same dbt project across different SQL engines
- **Advanced Testing**: Enhanced testing capabilities beyond standard dbt tests
- **State Management**: Sophisticated metadata and versioning system