Guidance for AI agents (e.g., Claude Code, Cursor, GPT-based tools) working in this repository.
- ALL tests MUST pass for work to be complete
- Never claim code "works" if any test fails
- Failing tests anywhere mean the codebase is broken and must be fixed
- Changes that break existing tests are not done until fixed
- A finished change passes linting, type checking, and the full test suite
gp-libs (this repo packages the unihan_db library) provides SQLAlchemy models and helpers to load the Unicode UNIHAN dataset. It is part of the cihai/gp-libs family and is powered by unihan-etl.
Key features:
- Bootstrap UNIHAN data from official sources (via `unihan-etl`) into SQLAlchemy models
- Default SQLite database stored in the user's XDG data dir (see `unihan_db.dirs`)
- Helper utilities to convert ORM rows to dictionaries (`bootstrap.to_dict`)
- Ships fixtures and example script (`examples/01_bootstrap.py`) to seed and query data
This project uses:
- Python 3.10+ (<4.0)
- uv for dependency management (see `uv.lock` / dependency-groups)
- ruff for linting/formatting
- mypy for type checking
- pytest for tests (with doctest support in fixtures)
- Sphinx for documentation
# Install runtime deps
uv pip install --editable .
# Install with dev tools (ruff, mypy, pytest, docs, etc.)
uv pip install --editable . -G dev
# Sync exactly to lockfile
uv pip sync

# Run full suite
just test # wraps uv run py.test
uv run pytest # equivalent
# Single file / test
uv run pytest tests/test_bootstrap.py
uv run pytest tests/test_bootstrap.py::test_import_unihan_raw
# Continuous testing
just start # just test then uv run ptw .
uv run ptw . # pytest-watcher

# Ruff lint
just ruff
uv run ruff check . --fix --show-fixes
# Format with ruff
just ruff-format
uv run ruff format .
# mypy
just mypy
uv run mypy src tests
# Watchers (requires entr)
just watch-ruff
just watch-mypy

just build-docs # sphinx html
just start-docs # sphinx autobuild server
just design-docs # rebuild CSS/JS assets

- Format: `uv run ruff format .`
- Run tests: `uv run pytest`
- Lint: `uv run ruff check . --fix --show-fixes`
- Type-check: `uv run mypy`
- Re-run tests to confirm green
- `src/unihan_db/bootstrap.py` – logging setup, default UNIHAN file/field lists, ETL options merge, data download via `unihan-etl`, `bootstrap_unihan` to populate the database, `to_dict` helpers, and `get_session` to create a scoped SQLAlchemy session (defaults to SQLite in XDG data dir).
- `src/unihan_db/importer.py` – transforms normalized UNIHAN records into ORM objects, wiring relations for readings, locations, variants, and indexes before commit.
- `src/unihan_db/tables.py` – SQLAlchemy ORM models (`Base`, `Unhn`, and many related tables for readings, strokes, variants, etc.).
- `src/unihan_db/__init__.py` – establishes XDG directories via `appdirs`/`unihan_etl.AppDirs`, creating the data dir on import.
- `src/unihan_db/__about__.py` – package metadata (version).
- Examples – `examples/01_bootstrap.py` demonstrates bootstrapping and querying random rows.
- `bootstrap.bootstrap_data` calls `unihan_etl.core.Packager` with default UNIHAN file list (`UNIHAN_FILES`) and field list (`UNIHAN_FIELDS`).
- `bootstrap.bootstrap_unihan` inserts rows only when `Unhn` is empty, batching for speed and committing once.
- Default database URL template is `sqlite:///{user_data_dir}/unihan_db.db`; `dirs.user_data_dir` comes from XDG on the current OS.
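The insert-only-when-empty behaviour described above can be sketched with the standard library alone. This is not the repository's implementation: it uses `sqlite3` rather than SQLAlchemy, and the `unhn` table layout, `bootstrap_if_empty`, and `BATCH_SIZE` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import sqlite3
import typing as t

BATCH_SIZE = 2  # hypothetical; kept tiny so the batching is visible


def bootstrap_if_empty(
    conn: sqlite3.Connection,
    rows: t.Iterable[tuple[str, str]],
) -> int:
    """Insert rows only when the table is empty; return the number inserted."""
    conn.execute("CREATE TABLE IF NOT EXISTS unhn (char TEXT PRIMARY KEY, ucn TEXT)")
    (existing,) = conn.execute("SELECT COUNT(*) FROM unhn").fetchone()
    if existing:
        return 0  # already bootstrapped; leave the data untouched

    inserted = 0
    batch: list[tuple[str, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            conn.executemany("INSERT INTO unhn VALUES (?, ?)", batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO unhn VALUES (?, ?)", batch)
        inserted += len(batch)
    conn.commit()  # single commit for the whole load
    return inserted
```

Calling the function a second time on the same connection returns `0`, which is the property that makes repeated bootstrap calls safe.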
- Pytest fixtures live in `tests/conftest.py` and `conftest.py` (root) to support doctests.
- Fixtures provide in-memory SQLite engine, scoped sessions, and a zipped UNIHAN fixture built from `tests/fixtures/` using `UNIHAN_FILES` list.
- Tests avoid network by zipping local fixture files; rely on `unihan_options` fixture for ETL options.
- Example script is executed in tests (`tests/test_example.py`) to ensure docs stay runnable.
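As a rough illustration of how a zipped fixture can be built from local files without touching the network, the sketch below uses only `zipfile` and `pathlib`. The helper name `build_fixture_zip` and its parameters are hypothetical; the real fixture in `tests/conftest.py` may be structured differently.

```python
from __future__ import annotations

import pathlib
import zipfile


def build_fixture_zip(
    fixture_dir: pathlib.Path,
    zip_path: pathlib.Path,
    filenames: list[str],
) -> pathlib.Path:
    """Zip the named files from fixture_dir into zip_path and return it."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name in filenames:
            # arcname stores each file at the archive root
            zf.write(fixture_dir / name, arcname=name)
    return zip_path
```

In a test, `zip_path` would typically live under `tmp_path` so nothing is written into the repository.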
- Prefer provided fixtures (`engine`, `session`, `unihan_options`, `zip_file`, `project_root`) over ad-hoc setup.
- Keep tests deterministic: no external downloads; use fixtures in `tests/fixtures`.
- Use `tmp_path`/`tmpdir` fixtures for filesystem writes; root conftest auto-sets `HOME` and cwd to temp paths.
- If adding doctests, ensure fixtures are wired via `add_doctest_fixtures` in root conftest.
def test_can_round_trip_char(session, engine):
    from unihan_db.tables import Base, Unhn

    Base.metadata.create_all(engine)
    session.add(Unhn(char="好", ucn="U+597D"))
    session.commit()
    assert session.query(Unhn).count() == 1

- Include `from __future__ import annotations` in Python modules.
- Prefer namespace imports for stdlib (`import typing as t`) to keep type usage explicit; third-party packages may use `from X import Y`.
- Use Ruff for style/formatting; keep docstrings in NumPy/reST sections (`Parameters`, `Returns`).
- Align with existing patterns: SQLAlchemy ORM models, scoped sessions, and helper functions in `bootstrap.py`.
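A small module following these conventions might look like the sketch below. The function `count_fields` is a hypothetical example, not something that exists in the package.

```python
from __future__ import annotations

import typing as t


def count_fields(record: t.Mapping[str, t.Any]) -> int:
    """Count the populated fields in a record.

    Parameters
    ----------
    record : Mapping[str, Any]
        Field name to value mapping; ``None`` values are treated as missing.

    Returns
    -------
    int
        Number of fields whose value is not ``None``.
    """
    return sum(1 for value in record.values() if value is not None)
```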
These rules guide future logging changes; existing code may not yet conform.
- Use `logging.getLogger(__name__)` in every module
- Add `NullHandler` in library `__init__.py` files
- Never configure handlers, levels, or formatters in library code; that's the application's job
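A minimal sketch of the setup these rules describe, assuming the top-level package logger is named `unihan_db`. Both statements are shown in one block for brevity; in practice the first belongs in the package `__init__.py` and the second in each module.

```python
import logging

# In the library package's __init__.py: attach a NullHandler so importing
# the library never produces output unless the application configures logging.
logging.getLogger("unihan_db").addHandler(logging.NullHandler())

# In each module: a module-level logger named after the module.
logger = logging.getLogger(__name__)
```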
Pass structured data on every log call where useful for filtering, searching, or test assertions.
Core keys (stable, scalar, safe at any log level):
| Key | Type | Context |
|---|---|---|
| `unihan_field` | `str` | UNIHAN field name |
| `unihan_source_file` | `str` | source data file path |
| `unihan_record_count` | `int` | records processed |
| `unihan_db_table` | `str` | database table name |
| `unihan_db_rows` | `int` | rows affected |
Heavy/optional keys (DEBUG only, potentially large):
| Key | Type | Context |
|---|---|---|
| `unihan_stdout` | `list[str]` | subprocess stdout lines (truncate or cap; `%(unihan_stdout)s` produces repr) |
| `unihan_stderr` | `list[str]` | subprocess stderr lines (same caveats) |
Treat established keys as compatibility-sensitive — downstream users may build dashboards and alerts on them. Change deliberately.
- snake_case, not dotted; `unihan_` prefix
- Prefer stable scalars; avoid ad-hoc objects
- Heavy keys (`unihan_stdout`, `unihan_stderr`) are DEBUG-only; consider companion `unihan_stdout_len` fields or hard truncation (e.g. `stdout[:100]`)
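The key conventions above can be sketched as follows. The function `log_import_finished` and the constant `MAX_STDOUT_LINES` are hypothetical; only the `unihan_*` key names come from the tables above.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)

MAX_STDOUT_LINES = 100  # hypothetical cap for heavy keys


def log_import_finished(table: str, rows: int, stdout: list[str]) -> None:
    """Log an import result with core keys, plus capped heavy keys at DEBUG."""
    # Core keys: stable scalars, safe at any level.
    logger.info(
        "import completed",
        extra={"unihan_db_table": table, "unihan_db_rows": rows},
    )
    # Heavy keys: DEBUG only, truncated, with a companion length field.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(
            "import output captured",
            extra={
                "unihan_stdout": stdout[:MAX_STDOUT_LINES],
                "unihan_stdout_len": len(stdout),
            },
        )
```

The companion `unihan_stdout_len` field preserves the true size even when the list itself has been truncated.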
Use `logger.debug("msg %s", val)`, not f-strings. Two rationales:
- Deferred string interpolation: skipped entirely when level is filtered
- Aggregator message template grouping: `"Running %s"` is one signature grouped ×10,000; f-strings make each line unique

When computing `val` itself is expensive, guard with `if logger.isEnabledFor(logging.DEBUG)`.
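Both points can be seen in a short sketch. Here `summarize` is a hypothetical stand-in for any costly computation, and `process` is an illustrative caller.

```python
import logging

logger = logging.getLogger(__name__)


def summarize(records: list) -> str:
    """Stand-in for an expensive computation (hypothetical helper)."""
    return ", ".join(sorted(str(r) for r in records))


def process(records: list) -> None:
    # Lazy: the message template stays constant, and interpolation only
    # happens if the record is actually emitted.
    logger.debug("processing %d records", len(records))

    # Guard: summarize() itself is costly, so skip it when DEBUG is off.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("record summary %s", summarize(records))
```

The record keeps the raw template in `record.msg` and the arguments in `record.args`, which is what lets aggregators group on the template.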
Increment `stacklevel` for each wrapper layer so `%(filename)s:%(lineno)d` and OTel `code.filepath` point to the real caller. Verify whenever call depth changes.
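A minimal sketch of one wrapper layer, using the standard `stacklevel` keyword of the logging methods. The wrapper `log_step` and the caller `load_file` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


def log_step(msg: str, *args: object) -> None:
    """Thin wrapper (hypothetical) around logger.info.

    stacklevel=2 skips this wrapper frame, so funcName and lineno on the
    record describe the function that called log_step.
    """
    logger.info(msg, *args, stacklevel=2)


def load_file() -> None:
    log_step("load started")
```

With the default `stacklevel=1`, the record would instead report `log_step` as the source. A second wrapper layer on top of `log_step` would need `stacklevel=3`.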
For objects with stable identity (Dataset, Reader, Exporter), use LoggerAdapter to avoid repeating the same extra on every call. Lead with the portable pattern (override process() to merge); merge_extra=True simplifies this on Python 3.13+.
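The portable pattern might look like the sketch below. `ContextAdapter` and `reader_log` are hypothetical names, and the file name is only an example value.

```python
from __future__ import annotations

import logging
import typing as t

logger = logging.getLogger(__name__)


class ContextAdapter(logging.LoggerAdapter):
    """Merge the adapter's fixed extra with any per-call extra."""

    def process(
        self, msg: t.Any, kwargs: t.MutableMapping[str, t.Any]
    ) -> tuple[t.Any, t.MutableMapping[str, t.Any]]:
        # Per-call keys win over the adapter's defaults on conflict.
        kwargs["extra"] = {**(self.extra or {}), **(kwargs.get("extra") or {})}
        return msg, kwargs


# Hypothetical usage: bind the source file once, add counts per call.
reader_log = ContextAdapter(logger, {"unihan_source_file": "Unihan_Readings.txt"})
reader_log.info("parse completed", extra={"unihan_record_count": 100})
```

On Python 3.13+, `logging.LoggerAdapter(logger, extra, merge_extra=True)` gives comparable merging without a subclass.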
| Level | Use for | Examples |
|---|---|---|
| `DEBUG` | Internal mechanics, data I/O | Field parsing, record transformation steps |
| `INFO` | Data lifecycle, user-visible operations | Download completed, export finished, database bootstrapped |
| `WARNING` | Recoverable issues, deprecation, user-actionable config | Missing optional field, deprecated data format |
| `ERROR` | Failures that stop an operation | Download failed, parse error, database write failed |
Config discovery noise belongs in DEBUG; only surprising/user-actionable config issues → WARNING.
- Lowercase, past tense for events: `"download completed"`, `"parse error"`
- No trailing punctuation
- Keep messages short; put details in `extra`, not the message string
- Use `logger.exception()` only inside `except` blocks when you are not re-raising
- Use `logger.error(..., exc_info=True)` when you need the traceback outside an `except` block
- Avoid `logger.exception()` followed by `raise`: this duplicates the traceback. Either add context via `extra` that would otherwise be lost, or let the exception propagate
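The two acceptable shapes, handle-and-log versus propagate-without-logging, can be sketched as below. Both functions are hypothetical examples and are not part of the package.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def parse_codepoint(ucn: str) -> int | None:
    """Parse 'U+597D' into an int; log and return None on bad input."""
    try:
        return int(ucn.removeprefix("U+"), 16)
    except ValueError:
        # Handled here and not re-raised, so logger.exception() fits:
        # it logs at ERROR with the traceback attached.
        logger.exception("parse error")
        return None


def parse_codepoint_strict(ucn: str) -> int:
    """Same parse, but let the error propagate without logging it."""
    return int(ucn.removeprefix("U+"), 16)
```

In the strict variant the caller decides whether and where to log, so the traceback appears once.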
Assert on caplog.records attributes, not string matching on caplog.text:
- Scope capture: `caplog.at_level(logging.DEBUG, logger="unihan_db.bootstrap")`
- Filter records rather than index by position: `[r for r in caplog.records if hasattr(r, "unihan_field")]`
- Assert on schema: `record.unihan_record_count == 100` not `"100 records" in caplog.text`
- `caplog.record_tuples` cannot access extra fields; always use `caplog.records`
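Put together, a test following these points might look like the sketch below. `import_records` is a hypothetical function under test; `caplog` is pytest's built-in fixture.

```python
import logging

logger = logging.getLogger("unihan_db.bootstrap")


def import_records(count: int) -> None:
    """Hypothetical function under test; logs one structured record."""
    logger.info("import completed", extra={"unihan_record_count": count})


def test_import_logs_record_count(caplog) -> None:
    # Scope capture to the logger under test.
    with caplog.at_level(logging.DEBUG, logger="unihan_db.bootstrap"):
        import_records(100)

    # Filter by schema instead of indexing by position.
    records = [r for r in caplog.records if hasattr(r, "unihan_record_count")]
    assert len(records) == 1
    assert records[0].unihan_record_count == 100
```

Filtering on the presence of the key keeps the test stable if unrelated log lines are added later.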
- f-strings/`.format()` in log calls
- Unguarded logging in hot loops (guard with `isEnabledFor()`)
- Catch-log-reraise without adding new context
- `print()` for diagnostics
- Logging secret env var values (log key names only)
- Non-scalar ad-hoc objects in `extra`
- Requiring custom `extra` fields in format strings without safe defaults (missing keys raise `KeyError`)
Commit subjects: Scope(type[detail]): concise description
Body template:
why: Reason or impact.
what:
- Key technical changes
- Single topic only
Guidelines:
- Subject ≤50 chars; body lines ≤72 chars; imperative mood.
- One topic per commit; separate subject and body with a blank line.
Common commit types:
- feat: New features or enhancements
- fix: Bug fixes
- refactor: Code restructuring without functional change
- docs: Documentation updates
- chore: Maintenance (dependencies, tooling, config)
- test: Test-related updates
- style: Code style and formatting
- py(deps): Dependencies
- py(deps[dev]): Dev dependencies
- ai(rules[AGENTS]): AI rule updates
- ai(claude[rules]): Claude Code rules (CLAUDE.md)
- ai(claude[command]): Claude Code command changes
When writing documentation (README, CHANGES, docs/), follow these rules for code blocks:
One command per code block. This makes commands individually copyable.
Put explanations outside the code block, not as comments inside.
Good:
Run the tests:
$ uv run pytest

Run with coverage:
$ uv run pytest --cov

Bad:
# Run the tests
$ uv run pytest
# Run with coverage
$ uv run pytest --cov

- When ETL runs slowly, log level INFO in `bootstrap` already streams progress; avoid adding noisy prints.
- If objects look stale, recreate sessions or re-run `Base.metadata.create_all` against your engine in tests.
- Use the provided fixtures to keep paths isolated; tests assume `HOME` and cwd are temporary.