|
1 | | -# Dataframely - Coding Agent Instructions |
| 1 | +# Dataframely |
2 | 2 |
|
3 | | -## Project Overview |
| 3 | +## Package Management |
4 | 4 |
|
5 | | -Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in |
6 | | -polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It |
7 | | -supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes. |
| 5 | +This repository uses the Pixi package manager. When editing `pixi.toml`, run `pixi lock` afterwards. |
8 | 6 |
|
9 | | -## Tech Stack |
| 7 | +When running any commands (like `pytest`), prepend them with `pixi run`. |
10 | 8 |
|
11 | | -### Core Technologies |
| 9 | +## Code Style |
12 | 10 |
|
13 | | -- **Python**: Primary language for the public API |
14 | | -- **Rust**: Backend for polars plugin and custom regex operations |
15 | | -- **Polars**: Only supported data frame library |
16 | | -- **pyo3 & maturin**: Rust-Python bindings and build system |
17 | | -- **pixi**: Primary environment and task manager (NOT pip/conda directly) |
| 11 | +### Documentation |
18 | 12 |
|
19 | | -### Build System |
| 13 | +- Document all public functions/methods and classes using docstrings |
| 14 | + - For functions & methods, use Google Docstrings and include `Args` (if there are any arguments) and `Returns` (if |
| 15 | + there is a return type). |
| 16 | + - Do not include type hints in the docstrings |
| 17 | + - Do not mention default values in the docstrings |
| 18 | +- Do not write docstrings for private functions/methods unless the function is highly complex |
20 | 19 |
|
21 | | -- **maturin**: Builds the Rust extension module `dataframely._native` |
22 | | -- **Cargo**: Rust dependency management |
23 | | -- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components |
| 20 | +### License Headers |
24 | 21 |
|
25 | | -## Environment Setup |
| 22 | +Do not manually adjust or add license headers. A pre-commit hook will take care of this. |
26 | 23 |
|
27 | | -**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically |
28 | | -required for Rust-only operations. |
| 24 | +## Testing |
29 | 25 |
|
30 | | -### Initial Setup |
| 26 | +- Never use test classes with pytest; write tests as free functions only
| 27 | +- Do not put `__init__.py` files into test directories |
| 28 | +- Tests should not have docstrings unless they are very complicated or very specific, i.e. warrant a description beyond |
| 29 | + the test's name |
| 30 | +- All tests should follow the arrange-act-assert pattern. The respective logical blocks should be distinguished via |
| 31 | + code comments as follows: |
31 | 32 |
|
32 | | -Unless already performed via external setup steps: |
| 33 | + ```python |
| 34 | + def test_method() -> None: |
| 35 | + # Arrange |
| 36 | + ... |
33 | 37 |
|
34 | | -```bash |
35 | | -# Install Rust toolchain |
36 | | -rustup show |
| 38 | + # Act |
| 39 | + ... |
37 | 40 |
|
38 | | -# Install pixi environment and dependencies |
39 | | -pixi install |
| 41 | + # Assert |
| 42 | + ... |
| 43 | + ``` |
40 | 44 |
|
41 | | -# Build and install the package locally (REQUIRED after Rust changes) |
42 | | -pixi run postinstall |
43 | | -``` |
| 45 | +- If two or more tests are structurally equivalent, they should be merged into a single test and parametrized with |
| 46 | + `@pytest.mark.parametrize` |
| 47 | +- If at least two tests share the same logic in the "arrange" step, the respective logic should be extracted into a |
| 48 | + fixture |
44 | 49 |
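The parametrization and fixture rules added above could be applied roughly as in the sketch below. The function `normalize` and the fixture `messy_text` are hypothetical names invented for this illustration; only `pytest.fixture` and `pytest.mark.parametrize` come from pytest itself.

```python
import pytest


def normalize(text: str) -> str:
    # Hypothetical function under test, defined here only for illustration.
    return " ".join(text.split()).lower()


@pytest.fixture
def messy_text() -> str:
    # Shared "arrange" logic, extracted because two tests need the same input.
    return "  Hello   World  "


# Structurally equivalent cases are merged into one parametrized test.
@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("A  B", "a b"),
        ("  x ", "x"),
        ("", ""),
    ],
)
def test_normalize(raw: str, expected: str) -> None:
    # Act
    actual = normalize(raw)

    # Assert
    assert actual == expected


def test_normalize_is_idempotent(messy_text: str) -> None:
    # Act
    once = normalize(messy_text)
    twice = normalize(once)

    # Assert
    assert once == twice


def test_normalize_strips_outer_whitespace(messy_text: str) -> None:
    # Act
    actual = normalize(messy_text)

    # Assert
    assert actual == "hello world"
```

In this sketch the "arrange" step lives entirely in the parameters or the fixture, so the test bodies contain only the "act" and "assert" blocks. All tests are free functions, in line with the rule against test classes.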
|
45 | | -### After Rust Code Changes |
| 50 | +## Reviewing |
46 | 51 |
|
47 | | -**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension. |
48 | | - |
49 | | -## Development Workflow |
50 | | - |
51 | | -### Running Tests |
52 | | - |
53 | | -```bash |
54 | | -# Run all tests (excludes S3 tests by default) |
55 | | -pixi run test |
56 | | - |
57 | | -# Run tests with S3 backend (requires moto server) |
58 | | -pixi run test -m s3 |
59 | | - |
60 | | -# Run specific test file or directory |
61 | | -pixi run test tests/schema/ |
62 | | - |
63 | | -# Run with coverage |
64 | | -pixi run test-coverage |
65 | | - |
66 | | -# Run benchmarks |
67 | | -pixi run test-bench |
68 | | -``` |
69 | | - |
70 | | -### Code Quality |
71 | | - |
72 | | -**NEVER** run linters/formatters directly. Use pre-commit: |
73 | | - |
74 | | -```bash |
75 | | -# Run all pre-commit hooks |
76 | | -pixi run pre-commit run |
77 | | -``` |
78 | | - |
79 | | -Pre-commit handles: |
80 | | - |
81 | | -- **Python**: ruff (lint & format), mypy (type checking), docformatter |
82 | | -- **Rust**: cargo fmt, cargo clippy |
83 | | -- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace |
84 | | - |
85 | | -### Building Documentation |
86 | | - |
87 | | -```bash |
88 | | -# Build documentation |
89 | | -pixi run -e docs postinstall |
90 | | -pixi run docs |
91 | | - |
92 | | -# Open in browser (macOS) |
93 | | -open docs/_build/html/index.html |
94 | | -``` |
95 | | - |
96 | | -## Project Structure |
97 | | - |
98 | | -``` |
99 | | -dataframely/ # Python package |
100 | | - schema.py # Core Schema class for DataFrame validation |
101 | | - collection/ # Collection class for validating multiple interconnected DataFrames |
102 | | - columns/ # Column type definitions (String, Integer, Float, etc.) |
103 | | - testing/ # Testing utilities (factories, masks, storage mocks) |
104 | | - _storage/ # Storage backends (Parquet, Delta Lake) |
105 | | - _rule.py # Rule decorator for validation rules |
106 | | - _plugin.py # Polars plugin registration |
107 | | - _native.pyi # Type stubs for Rust extension |
108 | | -
|
109 | | -src/ # Rust source code |
110 | | - lib.rs # PyO3 module definition |
111 | | - polars_plugin/ # Custom polars plugin for validation |
112 | | - regex/ # Custom regex operations |
113 | | -
|
114 | | -tests/ # Unit tests (mirrors dataframely/ structure) |
115 | | - benches/ # Benchmark tests |
116 | | - conftest.py # Shared pytest fixtures (including s3_server) |
117 | | -
|
118 | | -docs/ # Sphinx documentation |
119 | | - guides/ # User guides and examples |
120 | | - api/ # Auto-generated API reference |
121 | | -``` |
122 | | - |
123 | | -## Pixi Environments |
124 | | - |
125 | | -Multiple environments for different purposes: |
126 | | - |
127 | | -- **default**: Base Python + core dependencies |
128 | | -- **dev**: Includes jupyter for notebooks |
129 | | -- **test**: Testing dependencies (pytest, moto, boto3, etc.) |
130 | | -- **docs**: Documentation building (sphinx, myst-parser, etc.) |
131 | | -- **lint**: Linting and formatting tools |
132 | | -- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy) |
133 | | -- **py310-py314**: Python version-specific environments |
134 | | - |
135 | | -Use `-e <env>` to run commands in specific environments: |
136 | | - |
137 | | -```bash |
138 | | -pixi run -e test test |
139 | | -pixi run -e docs docs |
140 | | -``` |
141 | | - |
142 | | -## API Design Principles |
143 | | - |
144 | | -### Critical Guidelines |
145 | | - |
146 | | -1. **NO BREAKING CHANGES**: Public API must remain backward compatible |
147 | | -2. **100% Test Coverage**: All new code requires tests |
148 | | -3. **Documentation Required**: All public features need docstrings + API docs |
149 | | -4. **Cautious API Extension**: Avoid adding to public API unless necessary |
150 | | - |
151 | | -### Public API |
152 | | - |
153 | | -Public exports are in `dataframely/__init__.py`. Main components: |
154 | | - |
155 | | -- **Schema classes**: `Schema` for DataFrame validation |
156 | | -- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation |
157 | | -- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc. |
158 | | -- **Decorators**: `@rule()`, `@filter()` |
159 | | -- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation` |
160 | | - |
161 | | -## Common Pitfalls & Solutions |
162 | | - |
163 | | -### S3 Testing |
164 | | - |
165 | | -The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start moto_server on port 9999. This is a **workaround** for a polars issue with ThreadedMotoServer. When the polars issue is fixed, it should be replaced with ThreadedMotoServer (code is commented in the file). |
166 | | - |
167 | | -**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends. |
168 | | - |
169 | | -## Testing Strategy |
170 | | - |
171 | | -- Tests are organized by module, mirroring the `dataframely/` structure |
172 | | -- Use `dy.Schema.sample()` for generating test data |
173 | | -- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution |
174 | | -- S3 tests use moto server fixture from `conftest.py` |
175 | | -- Benchmark tests in `tests/benches/` use pytest-benchmark |
176 | | - |
177 | | -## Validation Pattern |
178 | | - |
179 | | -Typical usage pattern: |
180 | | - |
181 | | -```python |
182 | | -class MySchema(dy.Schema): |
183 | | - col = dy.String(nullable=False) |
184 | | - |
185 | | - @dy.rule() |
186 | | - def my_rule(cls) -> pl.Expr: |
187 | | - return pl.col("col").str.len_chars() > 0 |
188 | | - |
189 | | -# Validate and cast |
190 | | -validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True) |
191 | | -``` |
192 | | - |
193 | | -## Key Configuration Files |
194 | | - |
195 | | -- `pixi.toml`: Environment and task definitions |
196 | | -- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest) |
197 | | -- `Cargo.toml`: Rust dependencies and build settings |
198 | | -- `.pre-commit-config.yaml`: All code quality checks |
199 | | -- `rust-toolchain.toml`: Rust nightly version specification |
200 | | - |
201 | | -## When Making Changes |
202 | | - |
203 | | -1. **Python code**: Run `pixi run pre-commit run` before committing |
204 | | -2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests |
205 | | -3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`. |
206 | | -4. **Documentation**: Update docstrings |
207 | | -5. **API changes**: Ensure backward compatibility or document migration path |
208 | | - |
209 | | -### Pull request titles (required) |
210 | | - |
211 | | -Pull request titles must follow the Conventional Commits format: `<type>[!]: <Subject>` |
212 | | - |
213 | | -Allowed `type` values: |
214 | | - |
215 | | -- `feat`: A new feature |
216 | | -- `fix`: A bug fix |
217 | | -- `docs`: Documentation only changes |
218 | | -- `style`: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc) |
219 | | -- `refactor`: A code change that neither fixes a bug nor adds a feature |
220 | | -- `perf`: A code change that improves performance |
221 | | -- `test`: Adding missing tests or correcting existing tests |
222 | | -- `build`: Changes that affect the build system or external dependencies |
223 | | -- `ci`: Changes to our CI configuration files and scripts |
224 | | -- `chore`: Other changes that don't modify src or test files |
225 | | -- `revert`: Reverts a previous commit |
226 | | - |
227 | | -Additional rules: |
228 | | - |
229 | | -- Use `!` only for **breaking changes** |
230 | | -- `Subject` must start with an **uppercase** letter and must **not** end with `.` or a trailing space |
231 | | - |
232 | | -## Performance Considerations |
233 | | - |
234 | | -- Validation uses native polars expressions for performance |
235 | | -- Custom Rust plugin for advanced validation logic |
236 | | -- Lazy evaluation supported via `LazyFrame` for large datasets |
237 | | -- Avoid materializing data unnecessarily in validation rules |
| 52 | +When reviewing code changes, make sure that the `SKILL.md` is up-to-date and in line with the public API of this |
| 53 | +package. |