Commit c059946

docs: Add coding agent docs, SKILL.md, and update copilot instructions (#290)
Co-authored-by: Oliver Borchert <oliver.borchert@quantco.com>
1 parent ca50937 commit c059946

5 files changed

+351
-222
lines changed

.github/copilot-instructions.md

Lines changed: 37 additions & 221 deletions
@@ -1,237 +1,53 @@
1-
# Dataframely - Coding Agent Instructions
1+
# Dataframely
22

3-
## Project Overview
3+
## Package Management
44

5-
Dataframely is a declarative, polars-native data frame validation library. It validates schemas and data content in
6-
polars DataFrames using native polars expressions and a custom Rust-based polars plugin for high performance. It
7-
supports validating individual data frames via `Schema` classes and interconnected data frames via `Collection` classes.
5+
This repository uses the Pixi package manager. When editing `pixi.toml`, run `pixi lock` afterwards.
86

9-
## Tech Stack
7+
When running any commands (like `pytest`), prepend them with `pixi run`.
108

11-
### Core Technologies
9+
## Code Style
1210

13-
- **Python**: Primary language for the public API
14-
- **Rust**: Backend for polars plugin and custom regex operations
15-
- **Polars**: Only supported data frame library
16-
- **pyo3 & maturin**: Rust-Python bindings and build system
17-
- **pixi**: Primary environment and task manager (NOT pip/conda directly)
11+
### Documentation
1812

19-
### Build System
13+
- Document all public functions/methods and classes using docstrings
14+
- For functions & methods, use Google Docstrings and include `Args` (if there are any arguments) and `Returns` (if
15+
there is a return type).
16+
- Do not include type hints in the docstrings
17+
- Do not mention default values in the docstrings
18+
- Do not write docstrings for private functions/methods unless the function is highly complex
2019

21-
- **maturin**: Builds the Rust extension module `dataframely._native`
22-
- **Cargo**: Rust dependency management
23-
- Rust toolchain specified in `rust-toolchain.toml` with clippy and rustfmt components
20+
### License Headers
2421

25-
## Environment Setup
22+
Do not manually adjust or add license headers. A pre-commit hook will take care of this.
2623

27-
**CRITICAL**: Always use `pixi` commands - never run `pip`, `conda`, `python`, or `cargo` directly unless specifically
28-
required for Rust-only operations.
24+
## Testing
2925

30-
### Initial Setup
26+
- Never use classes for pytest, but only free functions
27+
- Do not put `__init__.py` files into test directories
28+
- Tests should not have docstrings unless they are very complicated or very specific, i.e. warrant a description beyond
29+
the test's name
30+
- All tests should follow the arrange-act-assert pattern. The respective logical blocks should be distinguished via
31+
code comments as follows:
3132

32-
Unless already performed via external setup steps:
33+
```python
34+
def test_method() -> None:
35+
    # Arrange
36+
    ...
3337

34-
```bash
35-
# Install Rust toolchain
36-
rustup show
38+
    # Act
39+
    ...
3740

38-
# Install pixi environment and dependencies
39-
pixi install
41+
    # Assert
42+
    ...
43+
```
4044

41-
# Build and install the package locally (REQUIRED after Rust changes)
42-
pixi run postinstall
43-
```
45+
- If two or more tests are structurally equivalent, they should be merged into a single test and parametrized with
46+
`@pytest.mark.parametrize`
47+
- If at least two tests share the same logic in the "arrange" step, the respective logic should be extracted into a
48+
fixture
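
The two rules above (merging structurally equivalent tests via `@pytest.mark.parametrize` and extracting shared "arrange" logic into a fixture) could look roughly like the following sketch. The function under test, `word_lengths`, and the fixture `words` are hypothetical names made up for this illustration.

```python
import pytest


def word_lengths(words: list[str]) -> list[int]:
    # Hypothetical function under test, defined here only to keep the sketch self-contained.
    return [len(word) for word in words]


@pytest.fixture
def words() -> list[str]:
    # "Arrange" logic shared by both tests below, extracted into a fixture.
    return ["polars", "pixi", ""]


@pytest.mark.parametrize(
    ("index", "expected"),
    [(0, 6), (1, 4), (2, 0)],
)
def test_word_lengths_single_word(words: list[str], index: int, expected: int) -> None:
    # Arrange
    selected = [words[index]]

    # Act
    actual = word_lengths(selected)

    # Assert
    assert actual == [expected]


def test_word_lengths_keeps_input_order(words: list[str]) -> None:
    # Arrange
    reversed_words = list(reversed(words))

    # Act
    actual = word_lengths(reversed_words)

    # Assert
    assert actual == [0, 4, 6]
```

The three parametrized cases replace three near-identical test functions, and both tests are free functions rather than methods of a test class.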
4449

45-
### After Rust Code Changes
50+
## Reviewing
4651

47-
**Always run** `pixi run postinstall` after modifying any Rust code in `src/` to rebuild the native extension.
48-
49-
## Development Workflow
50-
51-
### Running Tests
52-
53-
```bash
54-
# Run all tests (excludes S3 tests by default)
55-
pixi run test
56-
57-
# Run tests with S3 backend (requires moto server)
58-
pixi run test -m s3
59-
60-
# Run specific test file or directory
61-
pixi run test tests/schema/
62-
63-
# Run with coverage
64-
pixi run test-coverage
65-
66-
# Run benchmarks
67-
pixi run test-bench
68-
```
69-
70-
### Code Quality
71-
72-
**NEVER** run linters/formatters directly. Use pre-commit:
73-
74-
```bash
75-
# Run all pre-commit hooks
76-
pixi run pre-commit run
77-
```
78-
79-
Pre-commit handles:
80-
81-
- **Python**: ruff (lint & format), mypy (type checking), docformatter
82-
- **Rust**: cargo fmt, cargo clippy
83-
- **Other**: prettier (md/yml), taplo (toml), license headers, trailing whitespace
84-
85-
### Building Documentation
86-
87-
```bash
88-
# Build documentation
89-
pixi run -e docs postinstall
90-
pixi run docs
91-
92-
# Open in browser (macOS)
93-
open docs/_build/html/index.html
94-
```
95-
96-
## Project Structure
97-
98-
```
99-
dataframely/ # Python package
100-
  schema.py # Core Schema class for DataFrame validation
101-
  collection/ # Collection class for validating multiple interconnected DataFrames
102-
  columns/ # Column type definitions (String, Integer, Float, etc.)
103-
  testing/ # Testing utilities (factories, masks, storage mocks)
104-
  _storage/ # Storage backends (Parquet, Delta Lake)
105-
  _rule.py # Rule decorator for validation rules
106-
  _plugin.py # Polars plugin registration
107-
  _native.pyi # Type stubs for Rust extension
108-
109-
src/ # Rust source code
110-
  lib.rs # PyO3 module definition
111-
  polars_plugin/ # Custom polars plugin for validation
112-
  regex/ # Custom regex operations
113-
114-
tests/ # Unit tests (mirrors dataframely/ structure)
115-
  benches/ # Benchmark tests
116-
  conftest.py # Shared pytest fixtures (including s3_server)
117-
118-
docs/ # Sphinx documentation
119-
  guides/ # User guides and examples
120-
  api/ # Auto-generated API reference
121-
```
122-
123-
## Pixi Environments
124-
125-
Multiple environments for different purposes:
126-
127-
- **default**: Base Python + core dependencies
128-
- **dev**: Includes jupyter for notebooks
129-
- **test**: Testing dependencies (pytest, moto, boto3, etc.)
130-
- **docs**: Documentation building (sphinx, myst-parser, etc.)
131-
- **lint**: Linting and formatting tools
132-
- **optionals**: Optional dependencies (pydantic, deltalake, pyarrow, sqlalchemy)
133-
- **py310-py314**: Python version-specific environments
134-
135-
Use `-e <env>` to run commands in specific environments:
136-
137-
```bash
138-
pixi run -e test test
139-
pixi run -e docs docs
140-
```
141-
142-
## API Design Principles
143-
144-
### Critical Guidelines
145-
146-
1. **NO BREAKING CHANGES**: Public API must remain backward compatible
147-
2. **100% Test Coverage**: All new code requires tests
148-
3. **Documentation Required**: All public features need docstrings + API docs
149-
4. **Cautious API Extension**: Avoid adding to public API unless necessary
150-
151-
### Public API
152-
153-
Public exports are in `dataframely/__init__.py`. Main components:
154-
155-
- **Schema classes**: `Schema` for DataFrame validation
156-
- **Collection classes**: `Collection`, `CollectionMember` for multi-DataFrame validation
157-
- **Column types**: `String`, `Integer`, `Float`, `Bool`, `Date`, `Datetime`, etc.
158-
- **Decorators**: `@rule()`, `@filter()`
159-
- **Type hints**: `DataFrame[Schema]`, `LazyFrame[Schema]`, `Validation`
160-
161-
## Common Pitfalls & Solutions
162-
163-
### S3 Testing
164-
165-
The `s3_server` fixture in `tests/conftest.py` uses `subprocess.Popen` to start moto_server on port 9999. This is a **workaround** for a polars issue with ThreadedMotoServer. When the polars issue is fixed, it should be replaced with ThreadedMotoServer (code is commented in the file).
166-
167-
**Note**: CI skips S3 tests by default. Run with `pixi run test -m s3` when modifying storage backends.
168-
169-
## Testing Strategy
170-
171-
- Tests are organized by module, mirroring the `dataframely/` structure
172-
- Use `dy.Schema.sample()` for generating test data
173-
- Test both eager (`DataFrame`) and lazy (`LazyFrame`) execution
174-
- S3 tests use moto server fixture from `conftest.py`
175-
- Benchmark tests in `tests/benches/` use pytest-benchmark
176-
177-
## Validation Pattern
178-
179-
Typical usage pattern:
180-
181-
```python
182-
class MySchema(dy.Schema):
183-
    col = dy.String(nullable=False)
184-
185-
    @dy.rule()
186-
    def my_rule(cls) -> pl.Expr:
187-
        return pl.col("col").str.len_chars() > 0
188-
189-
# Validate and cast
190-
validated_df: dy.DataFrame[MySchema] = MySchema.validate(df, cast=True)
191-
```
192-
193-
## Key Configuration Files
194-
195-
- `pixi.toml`: Environment and task definitions
196-
- `pyproject.toml`: Python package metadata, tool configurations (ruff, mypy, pytest)
197-
- `Cargo.toml`: Rust dependencies and build settings
198-
- `.pre-commit-config.yaml`: All code quality checks
199-
- `rust-toolchain.toml`: Rust nightly version specification
200-
201-
## When Making Changes
202-
203-
1. **Python code**: Run `pixi run pre-commit run` before committing
204-
2. **Rust code**: Run `pixi run postinstall` to rebuild, then run tests
205-
3. **Tests**: Ensure `pixi run test` passes. If changes might affect storage backends, use `pixi run test -m s3`.
206-
4. **Documentation**: Update docstrings
207-
5. **API changes**: Ensure backward compatibility or document migration path
208-
209-
### Pull request titles (required)
210-
211-
Pull request titles must follow the Conventional Commits format: `<type>[!]: <Subject>`
212-
213-
Allowed `type` values:
214-
215-
- `feat`: A new feature
216-
- `fix`: A bug fix
217-
- `docs`: Documentation only changes
218-
- `style`: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
219-
- `refactor`: A code change that neither fixes a bug nor adds a feature
220-
- `perf`: A code change that improves performance
221-
- `test`: Adding missing tests or correcting existing tests
222-
- `build`: Changes that affect the build system or external dependencies
223-
- `ci`: Changes to our CI configuration files and scripts
224-
- `chore`: Other changes that don't modify src or test files
225-
- `revert`: Reverts a previous commit
226-
227-
Additional rules:
228-
229-
- Use `!` only for **breaking changes**
230-
- `Subject` must start with an **uppercase** letter and must **not** end with `.` or a trailing space
231-
232-
## Performance Considerations
233-
234-
- Validation uses native polars expressions for performance
235-
- Custom Rust plugin for advanced validation logic
236-
- Lazy evaluation supported via `LazyFrame` for large datasets
237-
- Avoid materializing data unnecessarily in validation rules
52+
When reviewing code changes, make sure that the `SKILL.md` is up-to-date and in line with the public API of this
53+
package.
