Data-Forge is an asynchronous service for generating reference files (starting with Kerchunk) for large climate datasets. It converts local NetCDF-style inputs into cloud-friendly reference files and writes them to local filesystem or S3 destinations.
- Asynchronous job handling and monitoring
- Kerchunk reference file generation (NetCDF to Zarr references)
- Kerchunk output: local filesystem or S3
- REST API and CLI for job submission and tracking
- No internal storage: the service writes directly to user-managed destinations
- Submit a job with local NetCDF files and parameters (e.g., chunking).
- Monitor job status and progress asynchronously via API/CLI.
- Download or access the generated Kerchunk reference files at your storage endpoint.
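The monitoring step above can also be scripted. The following is a minimal polling sketch, not the service's client: `wait_for_job`, the `fetch_status` callable, and the status names in `TERMINAL_STATES` are all hypothetical. The source does not specify the status values or the API routes, so the status lookup is injected by the caller.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real service may use different names.
TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(
    fetch_status: Callable[[str], str],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status(job_id) until it reports a terminal state.

    fetch_status is supplied by the caller (for example, a thin wrapper
    around the REST API or the CLI); this sketch assumes nothing about it
    beyond "takes a job id, returns a status string".
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        # Wait before asking again so the API is not hammered.
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned statuses.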
# Submit a local NetCDF-to-Kerchunk job
$ data-forge submit \
--input ./data/dataset/*.nc \
--concat-dims time \
--metadata '{"project": "CMIP6"}'
# Monitor job progress
$ data-forge status <job-id> --watch
# Get reference file URL or download
$ data-forge get-url <job-id>
$ data-forge download <job-id> --output ./local_refs/

uv venv
uv sync --all-groups --extra server
uv run pytest -vvv

For a lightweight CLI-only install, the base package is enough:

uv sync
uv run data-forge --help

Install the server extra for the API, worker, conversion backends, STAC publish support, and monitoring stack.
- API: FastAPI (REST endpoints), job monitoring/status, OpenAPI docs
- Job Queue: Dramatiq + Redis (asynchronous processing)
- Workers: Process Kerchunk conversion and write outputs directly to user-managed destinations
- Output: Reference files are written to local filesystem or S3
- No Internal Storage: Reference files are written directly to user-managed destinations
- Remote input support
- STAC / ESGF publish integration
- Globus Auth
- Dask-based scaling
- Docker Compose for local/single-node deployment
  - Copy `.env.example` to `.env` and set `DATAFORGE_LOCAL_INPUT_MAPPINGS` to map host local input prefixes onto mounted container prefixes
  - Base stack: `docker compose up --build`
  - Local output overlay: `docker compose -f compose.yaml -f compose.local.yaml up --build`
  - S3 output overlay: `docker compose -f compose.yaml -f compose.s3.yaml up --build`
- Helm chart for Kubernetes (production, scalable)
- Minimal required services: API, worker(s), Redis
- Release/versioning workflow: docs/release-versioning.md
For the default repo-local sample data mount, configure:
DATAFORGE_LOCAL_INPUT_MAPPINGS=[{"host_prefix":"/home/user/data-forge/data","container_prefix":"/inputs/repo-data"}]

If you mount additional local input directories into the API and worker containers, add more mapping entries. Local job submissions are rewritten by the longest matching host_prefix, while s3:// inputs pass through unchanged.
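To illustrate the rewrite rule described above, here is a minimal sketch of longest-prefix path mapping. It is not Data-Forge's actual code: `InputMapping`, `parse_mappings`, and `rewrite_input` are hypothetical names, and returning unmatched local paths unchanged is an assumption (the real service may reject them instead).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InputMapping:
    host_prefix: str
    container_prefix: str


def parse_mappings(raw: str) -> list[InputMapping]:
    """Parse the JSON list stored in DATAFORGE_LOCAL_INPUT_MAPPINGS."""
    return [
        InputMapping(
            entry["host_prefix"].rstrip("/"),
            entry["container_prefix"].rstrip("/"),
        )
        for entry in json.loads(raw)
    ]


def rewrite_input(path: str, mappings: list[InputMapping]) -> str:
    """Map a host path onto its container path using the longest matching prefix."""
    # s3:// inputs pass through unchanged.
    if path.startswith("s3://"):
        return path
    # Try the longest host_prefix first so nested mounts win over their parents.
    for m in sorted(mappings, key=lambda m: len(m.host_prefix), reverse=True):
        # Match on a path-segment boundary so /data does not match /database.
        if path == m.host_prefix or path.startswith(m.host_prefix + "/"):
            return m.container_prefix + path[len(m.host_prefix):]
    # Assumption: unmatched paths are returned as-is.
    return path
```

With the sample mapping above, `/home/user/data-forge/data/a.nc` would be rewritten to `/inputs/repo-data/a.nc`. Matching on a segment boundary is a design choice of this sketch; it keeps a prefix such as `/home/user/data-forge/data` from capturing a sibling directory like `/home/user/data-forge/database`.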
See the docs/ directory for:
- Full user guide and CLI reference
- API specification (OpenAPI/Swagger)
- Deployment guides (Docker, Kubernetes)
- Architecture/design docs
- Contribution instructions
Data-Forge aims to make FAIR, cloud-optimized data publishing simple and scalable for the global climate data community.