| 1 | +# Add AUDIT_ONLY Model Kind for Multi-Table Validation |
| 2 | + |
| 3 | +## Summary |
| 4 | +This PR introduces a new `AUDIT_ONLY` model kind to SQLMesh for validating relationships between multiple tables without materializing a table solely to host the check. It combines the benefits of models (DAG participation, dependencies, scheduling) with audit behavior (validation without materialization). |
| 5 | + |
| 6 | +## Problem Statement |
| 7 | +Previously, SQLMesh users had to choose between: |
| 8 | +- Creating wasteful materialized models just to run cross-table validations |
| 9 | +- Using standalone audits that don't integrate well with model dependencies |
| 10 | +- Building external validation systems outside SQLMesh |
| 11 | + |
| 12 | +## Solution |
| 13 | +The `AUDIT_ONLY` model kind enables users to: |
| 14 | +- Validate relationships across multiple tables (e.g., referential integrity) |
| 15 | +- Run complex validation queries that don't belong to a single model |
| 16 | +- Participate in the model DAG with proper dependencies |
| 17 | +- Avoid creating unnecessary materialized tables |
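To make the DAG point concrete, here is a small sketch of how an audit-only node would be ordered relative to the tables it validates. It uses Python's standard-library `graphlib` rather than SQLMesh's own scheduler, so it illustrates the dependency semantics only; the node names are borrowed from the usage example in this description.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
# The audit-only model is an ordinary node in the graph; it simply
# produces no table of its own (an assumption of this sketch).
dependencies = {
    "orders": set(),
    "customers": set(),
    "data_quality.order_validation": {"orders", "customers"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

Because the validation node declares both tables as dependencies, any valid ordering places it after `orders` and `customers`, which is the property a standalone check outside the DAG cannot guarantee.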
| 18 | + |
| 19 | +## Implementation Details |
| 20 | + |
| 21 | +### Core Changes |
| 22 | + |
| 23 | +#### 1. Model Kind Definition (`sqlmesh/core/model/kind.py`) |
| 24 | +- Added `AUDIT_ONLY` to `ModelKindName` enum |
| 25 | +- Created `AuditOnlyKind` class with configuration: |
| 26 | + - `blocking` (default: `True`): Whether failures stop the pipeline |
| 27 | + - `max_failing_rows` (default: `10`): Number of sample rows in error messages |
| 28 | +- Marked as `is_symbolic=True` (no materialization) |
| 29 | + |
| 30 | +#### 2. Execution Strategy (`sqlmesh/core/snapshot/evaluator.py`) |
| 31 | +- Created `AuditOnlyStrategy` extending `SymbolicStrategy` |
| 32 | +- Executes validation query and checks for returned rows |
| 33 | +- Raises `AuditError` with sample data if validation fails |
| 34 | +- Registered in the `_evaluation_strategy` routing so audit-only models dispatch to `AuditOnlyStrategy` |
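The evaluation flow in the bullets above can be illustrated with a minimal, self-contained sketch. This is not the `AuditOnlyStrategy` code from the PR: `run_audit_only`, `AuditFailure`, and the use of `sqlite3` are hypothetical stand-ins chosen so the example runs on its own. It assumes the semantics described in this PR: any returned row is a violation, `max_failing_rows` caps the sample included in the message, and `blocking` decides between raising and logging a warning.

```python
import logging
import sqlite3

logger = logging.getLogger("audit_only_sketch")


class AuditFailure(Exception):
    """Hypothetical stand-in for SQLMesh's AuditError."""


def run_audit_only(conn, query, blocking=True, max_failing_rows=10):
    """Run a validation query; any returned row counts as a violation.

    Returns the sampled failing rows (an empty list means the audit passed).
    """
    cursor = conn.execute(query)
    # Fetch only a bounded sample so the error message stays readable.
    sample = cursor.fetchmany(max_failing_rows)
    if not sample:
        return []
    message = f"Audit failed; sample of failing rows: {sample}"
    if blocking:
        # Blocking audits stop the pipeline by raising.
        raise AuditFailure(message)
    # Non-blocking audits only report the problem.
    logger.warning(message)
    return sample


# Tiny in-memory dataset: orders 101 and 102 reference missing customers.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 3);
    """
)
ORPHANS = """
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
    ORDER BY o.order_id
"""
non_blocking_sample = run_audit_only(
    conn, ORPHANS, blocking=False, max_failing_rows=1
)
```

With `blocking=False` the call returns the capped sample and logs a warning; with the default `blocking=True` the same query raises `AuditFailure` instead.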
| 35 | + |
| 36 | +#### 3. Parser Support (`sqlmesh/core/dialect.py`) |
| 37 | +- Added `AUDIT_ONLY` to list of model kinds that accept properties |
| 38 | + |
| 39 | +#### 4. Snapshot Definition (`sqlmesh/core/snapshot/definition.py`) |
| 40 | +- Fixed `evaluatable` property to include audit-only models |
| 41 | +- Ensures proper interval tracking for validation execution |
| 42 | + |
| 43 | +### Testing |
| 44 | + |
| 45 | +#### Unit Tests (`tests/core/test_model.py`) |
| 46 | +- 6 unit tests covering: |
| 47 | + - Basic parsing and properties |
| 48 | + - Blocking/non-blocking configuration |
| 49 | + - Max failing rows configuration |
| 50 | + - Python model support |
| 51 | + - Full configuration scenarios |
| 52 | + - Serialization/deserialization |
| 53 | + |
| 54 | +#### Integration Tests (`tests/core/test_integration.py`) |
| 55 | +- 6 integration tests validating: |
| 56 | + - Validation success/failure scenarios |
| 57 | + - Blocking vs non-blocking behavior |
| 58 | + - Dependency tracking |
| 59 | + - Scheduling with cron |
| 60 | + - Metadata changes |
| 61 | + |
| 62 | +### Documentation |
| 63 | + |
| 64 | +#### User Documentation Updates |
| 65 | +- **`docs/concepts/audits.md`**: Added comprehensive AUDIT_ONLY section under Advanced Usage |
| 66 | +- **`docs/concepts/models/model_kinds.md`**: Added detailed AUDIT_ONLY section with examples |
| 67 | +- **`docs/reference/model_configuration.md`**: Added AUDIT_ONLY configuration reference |
| 68 | + |
| 69 | +#### Example Models (`examples/sushi/models/`) |
| 70 | +Added 3 demonstration models (all non-blocking for demo purposes): |
| 71 | +- `audit_order_integrity.sql`: Validates referential integrity |
| 72 | +- `audit_waiter_revenue_anomalies.sql`: Detects revenue anomalies |
| 73 | +- `audit_duplicate_orders.sql`: Identifies duplicate orders |
| 74 | + |
| 75 | +## Usage Example |
| 76 | + |
| 77 | +```sql |
| 78 | +MODEL ( |
| 79 | + name data_quality.order_validation, |
| 80 | + kind AUDIT_ONLY ( |
| 81 | + blocking TRUE, |
| 82 | + max_failing_rows 20 |
| 83 | + ), |
| 84 | + depends_on [orders, customers], |
| 85 | + cron '@daily' |
| 86 | +); |
| 87 | + |
| 88 | +-- The audit passes when this query returns 0 rows; each returned row is a violation |
| 89 | +SELECT |
| 90 | + o.order_id, |
| 91 | + o.customer_id, |
| 92 | + 'Missing customer record' AS issue |
| 93 | +FROM orders o |
| 94 | +LEFT JOIN customers c ON o.customer_id = c.customer_id |
| 95 | +WHERE c.customer_id IS NULL; |
| 96 | +``` |
| 97 | + |
| 98 | +## Key Differences from Traditional Audits |
| 99 | + |
| 100 | +| Feature | Traditional Audits | AUDIT_ONLY Models | |
| 101 | +|---------|-------------------|-------------------| |
| 102 | +| **Scope** | Single model | Multiple models | |
| 103 | +| **Dependencies** | Implicit | Explicit via `depends_on` | |
| 104 | +| **Materialization** | N/A | Never materializes | |
| 105 | +| **Location** | `audits/` directory | `models/` directory | |
| 106 | +| **Scheduling** | With parent model | Independent cron | |
| 107 | +| **DAG Participation** | Attached to model | Full model in DAG | |
| 108 | + |
| 109 | +## Migration Path |
| 110 | +- No breaking changes to existing models or audits |
| 111 | +- Optional feature: use it only when needed |
| 112 | +- Can gradually migrate complex audits to audit-only models |
| 113 | + |
| 114 | +## Testing Instructions |
| 115 | + |
| 116 | +1. **Run unit tests:** |
| 117 | + ```bash |
| 118 | + pytest tests/core/test_model.py -k audit_only -xvs |
| 119 | + ``` |
| 120 | + |
| 121 | +2. **Run integration tests:** |
| 122 | + ```bash |
| 123 | + pytest tests/core/test_integration.py -k audit_only -xvs |
| 124 | + ``` |
| 125 | + |
| 126 | +3. **Try the sushi examples:** |
| 127 | + ```bash |
| 128 | + cd examples/sushi |
| 129 | + sqlmesh plan |
| 130 | + # Note: Example models are non-blocking so they won't fail the pipeline |
| 131 | + ``` |
| 132 | + |
| 133 | +4. **Create a test AUDIT_ONLY model:** |
| 134 | + ```sql |
| 135 | + -- Save as models/test_audit.sql |
| 136 | + MODEL ( |
| 137 | + name test.audit_validation, |
| 138 | + kind AUDIT_ONLY, |
| 139 | + depends_on [your_table1, your_table2] |
| 140 | + ); |
| 141 | + |
| 142 | + -- This should return 0 rows for success |
| 143 | + SELECT * FROM your_table1 |
| 144 | + WHERE some_condition_that_indicates_invalid_data; |
| 145 | + ``` |
| 146 | + |
| 147 | +## Checklist |
| 148 | +- [x] Add `AUDIT_ONLY` to `ModelKindName` enum |
| 149 | +- [x] Create `AuditOnlyKind` class |
| 150 | +- [x] Update `ModelKind` Union type |
| 151 | +- [x] Update `MODEL_KIND_NAME_TO_TYPE` mapping |
| 152 | +- [x] Create `AuditOnlyStrategy` class |
| 153 | +- [x] Update `_evaluation_strategy` routing |
| 154 | +- [x] Add `is_audit_only` properties |
| 155 | +- [x] Write unit tests |
| 156 | +- [x] Write integration tests |
| 157 | +- [x] Update documentation |
| 158 | +- [x] Add examples to sushi demo project |
| 159 | + |
| 160 | +## Related Issues |
| 161 | +Addresses the need for multi-table validation without materialization as described in the RFC. |
| 162 | + |
| 163 | +## Notes for Reviewers |
| 164 | +- The feature is designed to be non-intrusive and backward compatible |
| 165 | +- Example models in sushi are set to non-blocking to avoid disrupting tests |
| 166 | +- Documentation emphasizes when to use AUDIT_ONLY vs traditional audits |
| 167 | +- The implementation follows existing SQLMesh patterns for symbolic models |
| 168 | + |
| 169 | +## Future Enhancements (Not in this PR) |
| 170 | +- Support for incremental validation by time range |
| 171 | +- Configurable number of failing rows to capture |
| 172 | +- Warning mode that logs issues without failing |
| 173 | +- Different visualization in UI/lineage graph |