Data-Forge

Data-Forge is an asynchronous service for generating reference files (starting with Kerchunk) for large climate datasets. It converts local NetCDF-style inputs into cloud-friendly reference files and writes them to a local filesystem or S3 destination.

Features

Asynchronous job handling and monitoring
Kerchunk reference file generation (NetCDF to Zarr references)
Kerchunk output: local filesystem or S3
REST API and CLI for job submission and tracking
No internal storage: the service writes directly to user-managed destinations
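
To make "Zarr references" concrete, here is a minimal sketch of what a reference file contains and how a reader resolves it. It assumes the Kerchunk version 1 layout (a "refs" mapping whose values are either inline strings or [target, offset, length] triples). The file name, variable name, and byte offsets are made up for illustration, a plain binary file stands in for a NetCDF input, and base64-encoded inline values are not handled. The files Data-Forge actually writes may carry more fields.

```python
import json
import tempfile
from pathlib import Path


def resolve(references: dict, key: str) -> bytes:
    """Return the bytes that one reference entry points at."""
    entry = references["refs"][key]
    if isinstance(entry, str):
        return entry.encode()  # small metadata is stored inline as text
    target, offset, length = entry  # [target file, byte offset, byte length]
    with open(target, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Stand-in for a NetCDF file: an 8-byte header followed by one 11-byte chunk.
workdir = Path(tempfile.mkdtemp())
source = workdir / "sample.bin"
source.write_bytes(b"HEADER--" + b"chunk-bytes")

# Hypothetical reference set: one inline metadata key and one chunk pointer.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "tas/0.0": [str(source), 8, 11],
    },
}

print(json.loads(resolve(references, ".zgroup")))
print(resolve(references, "tas/0.0"))
```

The point of the format is that the original file is never copied or rewritten: the reference file only records where each chunk lives, so a Zarr-aware reader can fetch byte ranges directly.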

Typical Workflow

Submit a job with local NetCDF files and parameters (e.g., chunking).
Monitor job status and progress asynchronously via API/CLI.
Download or access the generated Kerchunk reference files at your storage endpoint.
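
The monitoring step can be sketched as a polling loop. This is an illustration only: the state names ("queued", "running", "succeeded", "failed") and the fetch_status callable are hypothetical stand-ins, since the actual status values and API endpoints are defined in the API specification, not here.

```python
import time
from typing import Callable

# Hypothetical state names; the real service may use different ones.
TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 600.0,
) -> str:
    """Poll fetch_status until it reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout} seconds")
        time.sleep(interval)


# Simulated status source standing in for an API or CLI call.
fake_states = iter(["queued", "running", "running", "succeeded"])
result = wait_for_job(lambda: next(fake_states), interval=0.0)
print(result)
```

In practice fetch_status would wrap whatever status call the API or CLI provides. Keeping it as a parameter lets the loop be tested without a running service.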

Example CLI Usage

# Submit a local NetCDF-to-Kerchunk job
$ data-forge submit \
  --input ./data/dataset/*.nc \
  --concat-dims time \
  --metadata '{"project": "CMIP6"}'

# Monitor job progress
$ data-forge status <job-id> --watch

# Get reference file URL or download
$ data-forge get-url <job-id>
$ data-forge download <job-id> --output ./local_refs/

uv Setup

uv venv
uv sync --all-groups --extra server
uv run pytest -vvv

For a lightweight CLI-only install, the base package is enough:

uv sync
uv run data-forge --help

Install the server extra for the API, worker, conversion backends, STAC publish support, and monitoring stack.

High-Level Architecture

API: FastAPI (REST endpoints), job monitoring/status, OpenAPI docs
Job Queue: Dramatiq + Redis (asynchronous processing)
Workers: Process Kerchunk conversion and write outputs directly to user-managed destinations
Output: Reference files are written to local filesystem or S3
No Internal Storage: The service keeps no copy of the reference files; they exist only at the user-managed destination
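
The division of labour between the API, the queue, and the workers can be illustrated with a small in-process stand-in. This sketch deliberately uses Python's standard queue and threading modules in place of Dramatiq and Redis, and the submit and worker functions, the job dictionary, and the status values are all hypothetical. It shows the shape of the flow, not the project's code.

```python
import queue
import tempfile
import threading
from pathlib import Path

jobs: queue.Queue = queue.Queue()  # stand-in for the Redis-backed job queue
status: dict[str, str] = {}        # stand-in for job status tracking


def submit(job_id: str, destination: Path) -> None:
    """API side: record the job and enqueue it without waiting for the result."""
    status[job_id] = "queued"
    jobs.put({"id": job_id, "destination": destination})


def worker() -> None:
    """Worker side: take jobs off the queue and write output to the destination."""
    while (job := jobs.get()) is not None:
        status[job["id"]] = "running"
        # Placeholder for the conversion step; only an empty reference set is written.
        output = job["destination"] / f"{job['id']}.json"
        output.write_text('{"version": 1, "refs": {}}')
        status[job["id"]] = "succeeded"


destination = Path(tempfile.mkdtemp())  # stands in for a user-managed destination
thread = threading.Thread(target=worker)
thread.start()
submit("job-1", destination)
jobs.put(None)  # sentinel telling the worker to stop
thread.join()
print(status["job-1"], sorted(p.name for p in destination.iterdir()))
```

Note that the only place the output ever lands is the destination directory passed in with the job, which mirrors the no-internal-storage design.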

Roadmap

Remote input support
STAC / ESGF publish integration
Globus Auth
Dask-based scaling

Deployment

Docker Compose for local/single-node deployment
Copy .env.example to .env and set DATAFORGE_LOCAL_INPUT_MAPPINGS, which maps host input path prefixes to the prefixes where those directories are mounted inside the containers
Base stack: docker compose up --build
Local output overlay: docker compose -f compose.yaml -f compose.local.yaml up --build
S3 output overlay: docker compose -f compose.yaml -f compose.s3.yaml up --build
Helm chart for Kubernetes (production, scalable)
Minimal required services: API, worker(s), Redis
Release/versioning workflow: docs/release-versioning.md

For the default repo-local sample data mount, configure:

DATAFORGE_LOCAL_INPUT_MAPPINGS=[{"host_prefix":"/home/user/data-forge/data","container_prefix":"/inputs/repo-data"}]

If you mount additional local input directories into the API and worker containers, add one mapping entry per mount. When a job is submitted, each local input path is rewritten using the entry with the longest matching host_prefix; s3:// inputs pass through unchanged.
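
The longest-prefix rule can be sketched as follows. This is not the project's implementation: the rewrite_input function is hypothetical, the second mapping entry is invented to show how a more specific prefix wins, and two behaviours are assumptions (prefixes are matched on whole path components, and paths that match no entry are returned unchanged).

```python
import json
from pathlib import PurePosixPath

# Same JSON shape as DATAFORGE_LOCAL_INPUT_MAPPINGS.
# The second entry is hypothetical, added to show longest-prefix selection.
RAW_MAPPINGS = json.dumps([
    {"host_prefix": "/home/user/data-forge/data",
     "container_prefix": "/inputs/repo-data"},
    {"host_prefix": "/home/user/data-forge/data/cmip6",
     "container_prefix": "/inputs/cmip6"},
])


def rewrite_input(path: str, mappings: list[dict[str, str]]) -> str:
    """Rewrite a host path to its container path; leave other inputs alone."""
    if path.startswith("s3://"):
        return path  # remote inputs pass through unchanged
    candidate = PurePosixPath(path)
    # Try the longest host_prefix first so the most specific mount wins.
    ordered = sorted(mappings, key=lambda m: len(m["host_prefix"]), reverse=True)
    for entry in ordered:
        host = PurePosixPath(entry["host_prefix"])
        # Match on whole path components, not on raw string prefixes.
        if candidate == host or host in candidate.parents:
            relative = candidate.relative_to(host)
            return str(PurePosixPath(entry["container_prefix"]) / relative)
    return path  # assumption: unmatched paths are returned as-is


mappings = json.loads(RAW_MAPPINGS)
print(rewrite_input("/home/user/data-forge/data/cmip6/tas.nc", mappings))
print(rewrite_input("/home/user/data-forge/data/sample.nc", mappings))
print(rewrite_input("s3://bucket/key.nc", mappings))
```

Matching on path components means a sibling directory such as /home/user/data-forge/database is not mistaken for a child of /home/user/data-forge/data, which a plain string prefix check would get wrong.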

Documentation

See the docs/ directory for:

Full user guide and CLI reference
API specification (OpenAPI/Swagger)
Deployment guides (Docker, Kubernetes)
Architecture/design docs
Contribution instructions

Data-Forge aims to make FAIR, cloud-optimized data publishing simple and scalable for the global climate data community.
