Retrieval/Ingestion feature branch merge#1129

Draft
Amir-R25 wants to merge 9 commits into
mainfrom
feature-branch-rag

@Amir-R25
Collaborator

This'll be a big one, draft to see how big

Amir-R25 and others added 8 commits April 28, 2026 15:57
**RAG ingestion: document loaders, data models, and docs**

Builds out the ingestion layer of the new RAG pipeline, the first stage
in the `load → chunk → embed → store` design.


Note: Ignore the docs that contain inline code for now; that will be
handled before merging into main.

## What's included

**Loaders** (`railtracks.retrieval.loaders`)

- `BaseDocumentLoader`: streaming-first abstract base; `astream()` is
the single primitive, `aload()` and `load()` derive from it
automatically
- `TextLoader`: `.txt` and `.md` files and directories
- `CSVLoader`: one `Document` per row; `content_columns`,
`ignore_columns`, and `content_separator` control how columns map to
content vs metadata
- `JSONLoader`: one `Document` per JSON object; `content_keys` (`"*"` by
default) and `ignore_keys` give the same field-level control
- `PyPDFLoader`: per-page or whole-document breakdown strategies;
requires `railtracks[pdf]`

All loaders handle both single files and recursive directory loading.
The interface is always a flat stream of `Document` objects regardless
of source.
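
To make the streaming-first shape concrete, here is a minimal sketch of a base class where `astream()` is the only abstract method and `aload()` / `load()` are derived from it. `Doc`, `StreamingLoader`, and `LinesLoader` are hypothetical stand-ins written for illustration, not the actual `railtracks.retrieval.loaders` classes.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class Doc:
    """Hypothetical stand-in for the PR's `Document` model."""
    content: str
    source: str
    metadata: dict = field(default_factory=dict)


class StreamingLoader(ABC):
    """Illustrative base: subclasses implement only `astream()`."""

    @abstractmethod
    def astream(self) -> AsyncIterator[Doc]:
        """Yield documents one at a time."""

    async def aload(self) -> list[Doc]:
        # Derived: drain the stream into a list.
        return [doc async for doc in self.astream()]

    def load(self) -> list[Doc]:
        # Derived: sync wrapper; not callable from inside a running event loop.
        return asyncio.run(self.aload())


class LinesLoader(StreamingLoader):
    """Toy loader that yields one document per input line."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    async def astream(self) -> AsyncIterator[Doc]:
        for i, line in enumerate(self.text.splitlines()):
            yield Doc(content=line, source=self.source, metadata={"row_index": i})


docs = LinesLoader("alpha\nbeta", source="memory://demo").load()
```

A custom loader in this style only has to write the `astream()` generator; the list-returning variants come for free.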

**Data models** (`railtracks.retrieval.models`)

- `Document`: typed (`DocumentType` enum), with UUID and source metadata
- `Chunk`: carries `document_id` so lineage back to the source document
is never lost

**Tests**: 82 unit tests covering all four loaders and the base class.

**Docs**: reference pages for each loader, custom loader guide, and
ingestion overview. Design docs updated to reflect the actual
`astream()`-based interface.
## What does this add?

This PR introduces the **retrieval chunking subsystem** under
`railtracks.retrieval.chunking`: a small, composable API for turning
`Document` instances into `Chunk` instances before embedding and
storage.

It adds:

- **Core abstractions**: `Chunker` (ABC), `Splitter` (protocol),
`Tokenizer` (protocol), plus a single protected `_make_chunks` path so
`document_id`, dense `index`, metadata inheritance, and optional
character `offsets` stay consistent.
- **Concrete chunkers**:
- `RecursiveCharacterChunker` — recursive separator splitting + overlap
merge with offset-aware spans.
- `MarkdownHeaderChunker` — heading-aware sections with header
breadcrumbs in metadata and optional fallback splitting for oversized
bodies.
- `SentenceChunker` — fixed-size **sentence windows** with
sentence-level overlap (default regex sentence boundaries; pluggable
splitter).
- `FixedTokenChunker` — fixed **token** windows with overlap (default
`TiktokenTokenizer`); `Chunk.offsets` intentionally unset in v1 for this
chunker.
- **Unit tests** under
`packages/railtracks/tests/unit_tests/retrieval/chunking/` covering
chunkers, tokenization, and cross-chunker invariants.
- **User-facing docs**: `docs/rag/components/chunking/overview.md` and
`docs/rag/components/chunking/methods.md`, aligned with the existing
ingestion docs layout and the Chunking entries already listed in
`mkdocs.yml`.

If this branch also includes **ingestion package import fixes** (e.g.
trimming broken `loaders` exports and exposing `JSONLoader`), call that
out explicitly in the final PR body so reviewers see it as a small
packaging fix bundled with retrieval work.

**Problem solved:** downstream RAG pieces need stable, backend-agnostic
`Chunk` objects with predictable metadata and (where supported) source
offsets, without ad hoc splitting logic scattered across the codebase.
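
As a rough illustration of the invariants listed above (shared `document_id`, dense `index`, inherited metadata, character `offsets`), here is a simplified fixed-size character chunker with overlap. It is a sketch only: `SimpleChunk` and `chunk_fixed_chars` are hypothetical names, and the real chunkers in this PR split on separators, headings, sentences, or tokens rather than raw character windows.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for the PR's `Chunk` model."""
    document_id: str
    index: int
    content: str
    offsets: Optional[tuple[int, int]] = None
    metadata: dict = field(default_factory=dict)


def chunk_fixed_chars(document_id, text, size, overlap, metadata=None):
    """Split `text` into windows of `size` chars, each sharing `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(
            SimpleChunk(
                document_id=document_id,
                index=len(chunks),             # dense, zero-based index
                content=text[start:end],
                offsets=(start, end),          # span into the source text
                metadata=dict(metadata or {}), # copy of the inherited metadata
            )
        )
        if end == len(text):
            break
        start += step
    return chunks


chunks = chunk_fixed_chars("doc-1", "abcdefghij", size=4, overlap=1)
```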

## Type of changes

Please check the type of change your PR introduces:

- [ ] 🐛 Bug fix (non-breaking change which fixes an issue) 
- [x] ✨ New feature (non-breaking change which adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to
documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of
the code)
- [ ] ♻️ Refactoring (code change that neither fixes a bug nor adds a
feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous
integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)

## Background context

Retrieval is being built out in layers: ingestion produces `Document`
objects; chunking is the next step before embeddings and vector stores.
This PR lands the **chunking slice** of that pipeline as an explicit
module with tests and public docs, while keeping chunking imports under
`railtracks.retrieval.chunking` (domain models remain on
`railtracks.retrieval`).

## Checklist for Author

### Code Quality
- [x] Code follows the project's style guidelines (run `ruff check .`
and `ruff format .`)
- [x] Code is commented, particularly in hard-to-understand areas

### Testing
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Test coverage maintained

### Documentation
- [x] Documentation updated if needed (bot will verify)

### Git & PR Management
- [ ] PR title clearly describes the change

### Breaking Changes
- [ ] Breaking changes are documented
- [ ] Migration guide provided in documentation (step-by-step
instructions for users to update their code/config)

---

## Final Product

**Import surface**

```python
from railtracks.retrieval import Document
from railtracks.retrieval.chunking import (
    RecursiveCharacterChunker,
    MarkdownHeaderChunker,
    SentenceChunker,
    FixedTokenChunker,
)
```

Embedding Module

Summary:
- `Embedding` abstract base class:
  - `aembed`: subclasses must implement this
  - `embed`: defaults to `asyncio.run(aembed)` unless overridden by
    subclasses
  - `astream_batches`: streams embedding batches of chunks for the
    upcoming RAG pipeline
- `SyncEmbedding`: mixin for providers that do not offer an async
  embedding API
- `LiteLLMEmbedding`: implements the `Embedding` base class for use
  with LiteLLM
  - Point of concern: LiteLLM does not directly expose embedding costs,
    so we read an internal field of theirs called `_hidden_params`,
    which provides the cost. This is potentially problematic because it
    is an internal, undocumented parameter.
  - This could be an argument for implementing each provider's logic
    directly for embeddings. The current design future-proofs that
    decision, since `OpenAIEmbedding` can inherit from `Embedding`
    directly rather than from `LiteLLMEmbedding`.
- `OpenAIEmbedding`, `AzureEmbedding`, and `OllamaEmbedding`: child
  classes extending `LiteLLMEmbedding` for now
- AI-generated documentation (placeholder for now; needs
  editing/verification before merging the feature branch into main)
- Unit tests
- Current data model (please provide feedback)
<img width="1323" height="861" alt="image"
src="https://github.com/user-attachments/assets/f5a7af72-aab3-41b6-b778-deffdb0b5121"
/>



Please focus your review mainly on the `Embedding` base class and check
whether it makes sense from the following two perspectives:
- Extensible embedding wrapper for advanced users
- Part of an event-based ingestion/retrieval pipeline
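
For reviewers, a minimal sketch of the base-class shape described in the summary: `aembed` is abstract, `embed` falls back to `asyncio.run(aembed)`, and `astream_batches` yields one batch of vectors at a time. The class names, the `batch_size` attribute, and the signatures are assumptions made for illustration; the real `Embedding` class may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, Sequence


class EmbeddingSketch(ABC):
    """Illustrative base class; not the actual `Embedding` implementation."""

    batch_size: int = 2  # hypothetical default, kept small for the demo

    @abstractmethod
    async def aembed(self, texts: Sequence[str]) -> list[list[float]]:
        """Subclasses implement the async embedding call."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        # Default sync path; a sync-only provider could override this instead.
        return asyncio.run(self.aembed(texts))

    async def astream_batches(
        self, texts: Sequence[str]
    ) -> AsyncIterator[list[list[float]]]:
        # Embed and yield one batch at a time so callers can react per batch.
        for start in range(0, len(texts), self.batch_size):
            yield await self.aembed(texts[start:start + self.batch_size])


class LengthEmbedding(EmbeddingSketch):
    """Toy provider: embeds each text as a 1-d vector of its length."""

    async def aembed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


async def collect(embedder: EmbeddingSketch, texts: Sequence[str]):
    return [batch async for batch in embedder.astream_batches(texts)]


vectors = LengthEmbedding().embed(["a", "bb", "ccc"])
batches = asyncio.run(collect(LengthEmbedding(), ["a", "bb", "ccc"]))
```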

---------

Co-authored-by: Copilot <copilot@github.com>
## What does this add?

Adds the **Stores** layer to the RAG pipeline — the persistence and
retrieval layer that sits after the embedding stage.

**Core data models**

- `StoreEntry` — the atomic unit written to and retrieved from a store.
Required fields map directly from `EmbeddedChunk`; all enrichment fields
(`abstract`, `summary`, `scope`, `store_category`, `entities`,
`valid_from`, `valid_until`) are optional.
- `StoreEntry.from_chunk(embedded_chunk, ...)` — classmethod that builds
an entry from an `EmbeddedChunk` automatically; all field mapping is
handled, enrichment fields are keyword-only overrides.
- `StoreScope` — frozen dataclass for hard-filter namespacing. Any
non-`None` field becomes a mandatory equality filter on every read and
clear — a query scoped to `user_id="alice"` will never return entries
written under `user_id="bob"`.
- `StoreCategory` — string enum (`EPISODIC`, `SEMANTIC`, `SKILL`,
`PROCEDURAL`) for classifying entry knowledge type.
- `DetailLevel` — enum controlling retrieval verbosity: `L0` returns
abstract only, `L1` adds summary, `L2` returns full content (default).
Enables cheap first-pass ranking before an expensive full-content pass.
- `StoreQuery` / `RetrievedStoreEntry` — query input and scored result
wrappers.

**`Store` protocol**

An async `write` / `read` / `delete` / `clear` interface. Any class
satisfying it can be used wherever a `Store` is expected.
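
Structural typing is what makes "any class satisfying it" work. The sketch below shows the idea with `typing.Protocol`; `StoreLike`, `DictStore`, and the dict-based entries and queries are hypothetical simplifications, not the real `Store`, `StoreEntry`, or `StoreQuery` types.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class StoreLike(Protocol):
    """Illustrative protocol mirroring the write/read/delete/clear shape."""

    async def write(self, entry: dict) -> None: ...
    async def read(self, query: dict) -> list[dict]: ...
    async def delete(self, entry_id: str) -> None: ...
    async def clear(self) -> None: ...


class DictStore:
    """Toy store; it never inherits from StoreLike but still satisfies it."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    async def write(self, entry: dict) -> None:
        self._entries[entry["id"]] = entry

    async def read(self, query: dict) -> list[dict]:
        # Substring match stands in for vector search in this sketch.
        needle = query.get("text", "")
        return [e for e in self._entries.values() if needle in e["content"]]

    async def delete(self, entry_id: str) -> None:
        self._entries.pop(entry_id, None)

    async def clear(self) -> None:
        self._entries.clear()


async def demo(store: StoreLike):
    await store.write({"id": "1", "content": "refund policy"})
    await store.write({"id": "2", "content": "shipping times"})
    hits = await store.read({"text": "refund"})
    await store.delete("1")
    remaining = await store.read({"text": ""})
    return hits, remaining


store = DictStore()
hits, remaining = asyncio.run(demo(store))
```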

**`VectorStore`**

The built-in implementation. Delegates low-level index I/O to a
swappable `VectorBackend` while owning payload serialization, scope
filter construction, and detail-level projection.

**Three backends**

| Backend | Persistence | Install |
|---|---|---|
| `InMemoryBackend` | Optional JSON snapshot | None |
| `ChromaBackend` | Via client mode (ephemeral / on-disk / HTTP) | `railtracks[stores-chroma]` |
| `PgvectorBackend` | Full Postgres durability | `railtracks[stores-vector]` |

All backends accept a `DistanceMetric` (`COSINE`, `L2`, `IP`).
`ChromaBackend` and `PgvectorBackend` expose an async `create(...)`
factory that initialises the backend in one call.
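
For readers unfamiliar with the three metric names, here is a small pure-Python sketch of one common convention, where a smaller value means "closer". The `Metric` enum and `distance` function are hypothetical; each real backend may define or scale these values differently (for example squared L2, or `1 - dot` for inner product).

```python
import math
from enum import Enum


class Metric(Enum):
    """Hypothetical mirror of the three metric names in this PR."""
    COSINE = "cosine"
    L2 = "l2"
    IP = "ip"


def distance(a, b, metric):
    """Return a distance where smaller means more similar (one convention)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    if metric is Metric.IP:
        # Negate the inner product so that "smaller is closer" still holds.
        return -dot
    if metric is Metric.L2:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


cosine_orthogonal = distance([1.0, 0.0], [0.0, 1.0], Metric.COSINE)
l2_three_four = distance([0.0, 0.0], [3.0, 4.0], Metric.L2)
ip_value = distance([1.0, 2.0], [3.0, 4.0], Metric.IP)
```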

**Documentation**

Full docs added under `docs/rag/components/stores/`: an overview page
covering the data model, querying, the `Store` protocol, and
`VectorStore` usage, plus a backends page covering all three backends
and the custom-backend recipe. Wired into `mkdocs.yml`.

**Model rename: `Memory*` → `Store*`**

All public model names were changed from `Memory`-prefixed to
`Store`-prefixed (`MemoryEntry` → `StoreEntry`, `MemoryScope` →
`StoreScope`, etc.) because "memory" is overloaded in the LLM agent
literature (LangMem, MemGPT) and these stores are general-purpose
retrieval stores, not agent memory systems.

---

## Type of changes

- [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
- [x] ✨ New feature (non-breaking change which adds functionality)
- [x] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to
documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of
the code)
- [x] ♻️ Refactoring (code change that neither fixes a bug nor adds a
feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous
integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)

---

## Background context

The stores layer completes the core RAG pipeline (`ingestion → chunking
→ embedding → stores`). It was designed backend-agnostic from the start:
the `VectorBackend` protocol is the only contract between `VectorStore`
and the underlying index, so swapping from `InMemoryBackend` to
`PgvectorBackend` requires changing one line.

`StoreScope` was modelled as a hard filter (not a soft hint) to make
multi-tenant data isolation a first-class concern. The scope fields are
serialised as `scope_*` prefixed keys in the payload so they are always
filterable at the backend level, not post-hoc in Python.
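
A minimal sketch of the hard-filter idea, assuming a frozen dataclass whose non-`None` fields become `scope_*` equality filters. `ScopeSketch`, `scope_filters`, and `matches` are hypothetical helpers; only the `user_id` / `session_id` fields and the `scope_` prefix come from this PR description.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ScopeSketch:
    """Hypothetical stand-in for `StoreScope`."""
    user_id: Optional[str] = None
    session_id: Optional[str] = None


def scope_filters(scope: ScopeSketch) -> dict:
    """Turn every non-None field into a mandatory `scope_*` equality filter."""
    return {
        f"scope_{f.name}": getattr(scope, f.name)
        for f in fields(scope)
        if getattr(scope, f.name) is not None
    }


def matches(payload: dict, filters: dict) -> bool:
    """Equality check a backend would apply before returning an entry."""
    return all(payload.get(key) == value for key, value in filters.items())


alice = scope_filters(ScopeSketch(user_id="alice"))
alice_entry = {"scope_user_id": "alice", "scope_session_id": "s-001"}
bob_entry = {"scope_user_id": "bob", "scope_session_id": "s-001"}
```

Because `session_id` is `None` in the query scope, it adds no filter, so Alice's entries from any session match while Bob's never do.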

`DetailLevel` exists to enable multi-stage retrieval pipelines that are
common in production RAG: a cheap `L0` pass over abstracts can narrow a
large corpus to a short candidate list before a more expensive `L2` pass
fetches full content for re-ranking.
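
The projection step can be sketched as below. This assumes the levels are cumulative (L1 includes the abstract, L2 includes abstract and summary), which is one reading of the description; `Detail` and `project` are hypothetical names and entries are plain dicts for brevity.

```python
from enum import Enum


class Detail(Enum):
    """Hypothetical mirror of `DetailLevel`."""
    L0 = 0  # abstract only
    L1 = 1  # abstract + summary
    L2 = 2  # abstract + summary + full content


def project(entry: dict, level: Detail) -> dict:
    """Keep only the fields allowed at the requested detail level."""
    projected = {"id": entry["id"], "abstract": entry.get("abstract")}
    if level.value >= Detail.L1.value:
        projected["summary"] = entry.get("summary")
    if level.value >= Detail.L2.value:
        projected["content"] = entry.get("content")
    return projected


entry = {
    "id": "e1",
    "abstract": "short",
    "summary": "a bit longer",
    "content": "the full chunk text",
}
cheap = project(entry, Detail.L0)
full = project(entry, Detail.L2)
```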

The `Memory*` → `Store*` rename was made early to avoid a harder
migration later. The term "memory" is already claimed by agent memory
systems (LangMem, MemGPT, semantic/episodic agent memory); using it for
a retrieval store creates confusing overlap in docs and user mental
models.

---

## Checklist for Author

### Code Quality
- [x] Code follows the project's style guidelines (run `ruff check .`
and `ruff format .`)
- [ ] Code is commented, particularly in hard-to-understand areas

### Testing
- [x] Tests added/updated and pass locally (`pytest tests`)
- [x] Test coverage maintained

### Documentation
- [x] Documentation updated if needed (bot will verify)

### Git & PR Management
- [x] PR title clearly describes the change

### Breaking Changes
- [x] Breaking changes are documented
- [x] Migration guide provided in documentation (step-by-step
instructions for users to update their code/config)

---

## Final Product

**Creating and indexing entries from an embedding pipeline:**

```python
from railtracks.retrieval.stores import (
    VectorStore, StoreEntry, StoreScope, StoreCategory,
    StoreQuery, DetailLevel,
)
from railtracks.retrieval.stores.vector.backends import InMemoryBackend

store = VectorStore(InMemoryBackend())
scope = StoreScope(user_id="alice", session_id="s-001")

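# Index (assumes `embedded_chunks` comes from the embedding stage and that
# this snippet runs in an async context, e.g. a coroutine or a notebook)
for embedded_chunk in embedded_chunks: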
    entry = StoreEntry.from_chunk(
        embedded_chunk,
        scope=scope,
        store_category=StoreCategory.SEMANTIC,
        abstract=embedded_chunk.chunk.content[:100],
    )
    await store.write(entry)

# Retrieve with detail-level control (`query_vector` is the embedded query)
query = StoreQuery(
    text="What is the refund policy?",
    scope=scope,
    embedding=query_vector,
    top_k=5,
    detail_level=DetailLevel.L2,
)
results = await store.read(query)
for r in results:
    print(f"(rank={r.rank}, score={r.score:.3f}) {r.entry.content}")
```

**Production backend (Pgvector):**

```python
from railtracks.retrieval.stores import VectorStore, PgvectorBackend

backend = await PgvectorBackend.create(
    dsn="postgresql://user:pass@localhost/mydb",
    table="my_index",
    dim=1536,
)
store = VectorStore(backend)
```

**Persistent development backend (Chroma on-disk):**

```python
from railtracks.retrieval.stores import VectorStore, ChromaBackend

backend = ChromaBackend("my-collection", path="/data/chroma")
await backend.initialize()
store = VectorStore(backend)
```
## Summary

Adds `HuggingFaceDatasetLoader`, a streaming document loader that wraps
`datasets.load_dataset` in streaming mode and yields one `Document` per
row through the existing `BaseDocumentLoader.astream()` contract.
Unlocks hundreds of HF Hub datasets (squad, ms_marco, trivia_qa,
wiki_dpr, c4, …) for the RAG ingestion pipeline without per-dataset
code.

Adds a new optional extra: `huggingface = ["datasets >= 2.14.0"]`.

Closes #1092


## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [x] Docs
- [x] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`): 16 new
tests, full loaders suite 98/98 green
- [x] Docs updated; new page
`docs/rag/components/ingestion/huggingface.md`, overview table row,
mkdocs nav entry
- [x] Breaking changes include migration notes; N/A, adds new public
name + new optional extra only; no existing exports or behaviour changed

## Notes

### Acceptance criteria

- [x] Extends `BaseDocumentLoader`
- [x] Constructor accepts `dataset_name`, `split`, `content_columns:
list[str]`, optional `metadata_columns` and `dataset_kwargs`
- [x] `astream()` is truly lazy — verified by a test that confirms only
one row is consumed per `__anext__` call (no full-dataset buffering)
- [x] Each row yields a `Document` with `source =
"{dataset_name}/{split}"` and metadata populated from `metadata_columns`
+ `row_index`
- [x] Raises `ValueError` if any `content_columns` are not present in
the dataset schema
- [x] Unit tests use a mocked `IterableDataset` — no network calls
during the test suite
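
The laziness check in the acceptance criteria can be sketched without any network access or the `datasets` package. Below, `CountingRows` is a hypothetical iterator that counts how many rows have been pulled, and `astream_rows` is a simplified stand-in for the loader's `astream()`.

```python
import asyncio


class CountingRows:
    """Iterator that records how many rows have been pulled from it."""

    def __init__(self, rows):
        self._it = iter(rows)
        self.pulled = 0

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self._it)
        self.pulled += 1
        return row


async def astream_rows(rows, content_columns, separator="\n\n"):
    """Yield one document-like dict per row without buffering the dataset."""
    for row_index, row in enumerate(rows):
        content = separator.join(str(row[col]) for col in content_columns)
        yield {"content": content, "metadata": {"row_index": row_index}}


async def demo():
    rows = CountingRows(
        [
            {"question": "q1", "context": "c1"},
            {"question": "q2", "context": "c2"},
            {"question": "q3", "context": "c3"},
        ]
    )
    stream = astream_rows(rows, ["question", "context"])
    first = await stream.__anext__()
    pulled_after_first = rows.pulled
    await stream.__anext__()
    pulled_after_second = rows.pulled
    await stream.aclose()
    return first, pulled_after_first, pulled_after_second


first, pulled_after_first, pulled_after_second = asyncio.run(demo())
```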

### Basic usage

```python
from railtracks.retrieval.loaders.huggingface_loader import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(
    dataset_name="squad",
    split="validation",
    content_columns=["question", "context"],
    metadata_columns=["title", "id"],
    content_separator="\n\n",
)

# Rows are streamed lazily from the Hub, one at a time
async for doc in loader.astream():
    # doc.source   == "squad/validation"
    # doc.content  == "<question>\n\n<context>"
    # doc.metadata == {"title": "...", "id": "...", "row_index": N}
    ...
```

For datasets with subsets, pinned revisions, or auth tokens, forward via `dataset_kwargs`:

```python
HuggingFaceDatasetLoader(..., dataset_kwargs={"name": "v2.1", "revision": "abc"})
```

`streaming=True` is set by default and can be overridden through the same dict.

## Summary

Adds `BaseOCRLoader` (abstract) and `PyPDFOCRLoader` (concrete) to the
retrieval module. Closes #1079.

`PyPDFOCRLoader` handles both text-based and scanned PDFs in a single
loader:
- Tries `pypdf` text extraction per page first
- Pages where extraction returns empty are rasterized with `pypdfium2`
  and recognized via Tesseract (`pytesseract`)
- `force_ocr=True` flag OCRs every page unconditionally
- Metadata records whether each page used OCR (`metadata["ocr"]` for
  page strategy; `metadata["ocr_pages"]` for document strategy)

**Design decisions:**
- **pypdfium2** (Apache-2.0 / BSD-3) over Poppler (GPL-2), PyMuPDF
  (AGPL), and Wand/ImageMagick (transitively AGPL via Ghostscript). It
  was already a transitive dependency via `pdfplumber`.
- **Tesseract** as the default OCR engine: open-source, no API key, the
  obvious default for an OSS library. Subclassing `BaseOCRLoader` with a
  different engine (EasyOCR, Textract, LLM vision) is a one-method
  override.
- New `[ocr]` optional-dependency extra in `pyproject.toml`. Users also
  need the Tesseract binary on PATH, which is surfaced in the
  `ImportError` message.
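
The text-first, OCR-fallback flow can be sketched with the extraction and OCR steps injected as plain callables, so no PDF or Tesseract dependency is needed here. `extract_pages` and the toy page dicts are hypothetical; treating whitespace-only text as "empty" is an assumption of this sketch.

```python
from typing import Callable, Sequence


def extract_pages(
    pages: Sequence[dict],
    extract_text: Callable[[dict], str],
    ocr: Callable[[dict], str],
    force_ocr: bool = False,
) -> list[dict]:
    """Try text extraction per page; fall back to OCR when it yields nothing."""
    results = []
    for page in pages:
        text = "" if force_ocr else (extract_text(page) or "")
        used_ocr = force_ocr or not text.strip()
        if used_ocr:
            text = ocr(page)
        results.append({"content": text, "metadata": {"ocr": used_ocr}})
    return results


# Toy pages: "text" is what a text layer would return, "scan" what OCR would read.
pages = [
    {"text": "digital page", "scan": "digital page (ocr)"},
    {"text": "", "scan": "scanned page"},
]
mixed = extract_pages(pages, lambda p: p["text"], lambda p: p["scan"])
forced = extract_pages(pages, lambda p: p["text"], lambda p: p["scan"], force_ocr=True)
```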

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Test coverage:**

- **29 unit tests** (mocked, no real binaries needed in CI):
  - `BaseOCRLoader` contract + inheritance (6 tests)
  - `PyPDFOCRLoader` init / errors / text-fast-path / OCR-fallback /
    `force_ocr` / page strategy / document strategy / directory loading
    (23 tests)
- **End-to-end smoke test** in
  `packages/railtracks/tests/end_to_end/rag/`, excluded from regular CI
  per `pyproject.toml`'s existing `addopts`. Generates a synthetic
  image-only PDF with Pillow, runs the loader, asserts text recovery.
  Verified locally on Windows with Tesseract v5.5.0.20241111.
- **Manual verification** against a real scanned typed document (clean
  recovery) and a handwritten document (expected Tesseract limitation,
  intended to be addressed in future via additional `BaseOCRLoader`
  subclasses for handwriting-capable engines).
- Full loaders test suite: 111 passed, no regressions.

**Installation:** users need to `pip install "railtracks[ocr]"` for the
Python packages, and separately install the Tesseract binary and put it
on PATH. The loader's `ImportError` message points to the official
Tesseract install docs.

---------

Co-authored-by: Pooria Ashrafian <61358399+Pooria90@users.noreply.github.com>
…1109)

## Summary

Adds the `RetrievalRuntime` orchestrator and the supporting `Store` /
loader changes needed to drive the full ingest → retrieve flow
end-to-end. Also removes the legacy `railtracks.vector_stores` package
now that `railtracks.retrieval` supersedes it.

```
┌────────┐   ┌─────────┐   ┌──────────┐   ┌────────┐   ┌───────────┐
│ Loader │ → │ Chunker │ → │ Embedder │ → │ Store  │ → │ Retrieval │
└────────┘   └─────────┘   └──────────┘   └────────┘   └───────────┘
     ▲             ▲             ▲             ▲              ▲
     └─────────────┴─────────────┴─────────────┴──────────────┘
                                 │
                       ┌─────────────────────┐
                       │  RetrievalRuntime   │
                       │  (the orchestrator) │
                       └─────────────────────┘
```

### `RetrievalRuntime`

The orchestrator that wires a chunker + embedder + `Store` (+ optional
`scope`) into the ingest/retrieve flow.

* **Loader is passed to `ingest()`, not the constructor**: one runtime
  captures *how* to process (chunker/embedder/store/scope); the loader
  decides *what*. A single runtime can ingest from many sources and
  re-ingest to update.
* **Streaming + aggregate APIs:** `ingest(loader)` is an async generator
  yielding per-batch events; `ingest_all(loader)` drains it and returns
  `IngestionStats`.
* **Events:** `BatchIngested` (carries per-batch `EmbeddingMetrics`:
  tokens, cost, latency, vector count), `EmbeddingFailure`,
  `DocumentFailed`, `DocumentSkipped`. `batch_index` is
  **per-document**, not run-global.
* **Upsert semantics:** before writing the *first* chunk of a document
  the runtime fires `store.delete_where({"document_id": str(doc.id)})`
  to clear the prior version. The delete only runs once a batch
  succeeds, so a total embedding failure preserves the previous version.
  Writes are per-chunk and **not transactional**: a crash mid-write
  leaves a partial document (recovered on the next ingest, see below).
* **Count-aware staleness (skip unchanged docs):** a document is skipped
  only when the store already holds a *complete* copy, matched on
  `source_path` + `content_hash` and the persisted `doc_chunk_count`. A
  partially-written document (fewer chunks than expected after an
  interrupted run) is re-ingested rather than left broken. Counting is
  done via `find()` rather than a `count()` call so the runtime depends
  only on the `Store` protocol.
* **Token-size guard:** when `max_tokens` is set, chunks over the
  per-item limit are dropped *before* embedding and surfaced as
  `EmbeddingFailure` instead of causing provider 4xx errors. Uses
  `TiktokenTokenizer` by default. (Partial fix for the embedding
  per-item token-cap gap; see Known limitations.)
* **Embedding-model consistency:** the model is captured from the first
  successful batch; a later `retrieve()` with a different embedder
  raises `EmbeddingModelMismatchError` (cross-model similarity scores
  are meaningless). Note: capture is **in-process only**, so a fresh
  runtime over an existing store won't enforce until its first ingest.
* `on_ingest` / `on_retrieve` callbacks for logging/observability;
  `delete_document(id)` convenience wrapper.
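
The count-aware staleness rule from the list above can be sketched as a small pure function. This is a simplified model under assumptions: `should_skip` is a hypothetical helper, entries are plain dicts, and `existing_entries` stands in for whatever a `find()` lookup would return.

```python
def should_skip(existing_entries, source_path, content_hash):
    """Skip only if a complete copy of this exact content is already stored."""
    matching = [
        entry
        for entry in existing_entries
        if entry.get("source_path") == source_path
        and entry.get("content_hash") == content_hash
    ]
    if not matching:
        return False  # new or modified document: ingest it
    expected = matching[0].get("doc_chunk_count")
    # A partial write (fewer chunks than expected) must be re-ingested.
    return expected is not None and len(matching) == expected


def make_entries(n_written, n_expected, content_hash="h1"):
    """Build toy stored entries for one document."""
    return [
        {"source_path": "a.md", "content_hash": content_hash, "doc_chunk_count": n_expected}
        for _ in range(n_written)
    ]


complete = should_skip(make_entries(3, 3), "a.md", "h1")
partial = should_skip(make_entries(2, 3), "a.md", "h1")
modified = should_skip(make_entries(3, 3), "a.md", "h2")
```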

### `stores` module

`Store` protocol:
* added `delete_where(filters)` and `find(filters, limit=1)`
  (metadata-only lookup, no vector search), both required by the
  runtime's upsert/staleness paths.

`StoreEntry`:
* `vector` is now `list[float] | None`. Read results no longer
  round-trip the vector (was `[]`, now `None`): the backend owns the
  stored vector; callers must not rely on this field on retrieved
  entries.

`StoreQuery`:
* `scope` is now optional (`StoreScope | None`) for single-tenant
  callers.
* `metadata_filters` retyped `dict[str, Any]` (was `dict[str, str]`).
* removed the unused `strategies` field and the `RetrievalStrategy`
  enum.

`VectorStore` (base) / `VectorBackend`:
* `VectorBackend` protocol gained `list_where(filters, limit)` and
  `count(filters)`. **`count` lives on the backend, not `Store`**, which
  keeps the runtime's dependency surface to the `Store` protocol alone.
* `VectorStore` now implements `find`, `delete_where`, and `count`.
* Payload encoding spreads scalar `chunk_metadata` values to the top
  level (in addition to the JSON-encoded blob) so flat-equality
  `metadata_filters` / `find` work against them.
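
A minimal sketch of the payload-spreading idea: keep the full metadata as a JSON blob and additionally copy scalar values to the top level so flat equality filters can see them. `encode_payload` is a hypothetical helper, and which types count as "scalar" (here `str`, `int`, `float`, `bool`, `None`) is an assumption.

```python
import json

_SCALARS = (str, int, float, bool)


def encode_payload(chunk_metadata: dict) -> dict:
    """Store metadata as a JSON blob plus top-level copies of scalar values."""
    payload = {"chunk_metadata": json.dumps(chunk_metadata, sort_keys=True)}
    for key, value in chunk_metadata.items():
        if value is None or isinstance(value, _SCALARS):
            # setdefault avoids clobbering the reserved blob key.
            payload.setdefault(key, value)
    return payload


payload = encode_payload({"page": 3, "lang": "en", "tags": ["a", "b"]})
restored = json.loads(payload["chunk_metadata"])
```

With this shape, a filter such as `{"page": 3}` can be evaluated directly against the payload, while nested values like `tags` survive only inside the blob.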

Backend implementations (`chroma`, `in_memory`, `pgvector`) all
implement `list_where` + `count`. Plus:
* **pgvector** `_build_where` now compares JSONB-to-JSONB
  (`payload->$k::text = $v::jsonb`) so non-string scalars
  (int/bool/None) keep their JSON type instead of being stringified.
  Filters are parameterized; `LIMIT` is `int`-cast before interpolation.
  Added `pool_kwargs` passthrough to `asyncpg.create_pool` for tuning
  `min_size`/`max_size`/etc.
* **in_memory** `_flush` is now async: JSON encode happens under the
  lock, the disk write is offloaded to a thread so the event loop isn't
  blocked. Search now sanitizes non-finite scores (NaN/inf from a
  misbehaving embedder): they're logged and sorted/dropped to the end
  instead of corrupting the ranking.
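
The score sanitization can be sketched as follows. This version keeps non-finite entries but moves them to the end of the ranking and logs a warning; `rank_results` is a hypothetical helper, and whether the real backend keeps or drops such entries in each case is not something this sketch settles.

```python
import logging
import math

logger = logging.getLogger("retrieval.sketch")


def rank_results(scored):
    """Sort (entry_id, score) pairs by score, pushing NaN/inf to the end."""
    finite = [pair for pair in scored if math.isfinite(pair[1])]
    non_finite = [pair for pair in scored if not math.isfinite(pair[1])]
    for entry_id, score in non_finite:
        logger.warning("non-finite score %r for entry %s", score, entry_id)
    # Higher score first; NaN would otherwise make the sort order undefined.
    finite.sort(key=lambda pair: pair[1], reverse=True)
    return finite + non_finite


ranked = rank_results(
    [("a", 0.2), ("b", float("nan")), ("c", 0.9), ("d", float("inf"))]
)
ranked_ids = [entry_id for entry_id, _ in ranked]
```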

### `loaders` module

`Document`:
* **`id` is now derived deterministically from `source` via
  `uuid5(NAMESPACE_URL, source)`** so re-ingesting the same source
  yields the same id across processes. Fixes a silent upsert bug where
  modified files left their prior chunks orphaned in the store, because
  `delete_where({"document_id": ...})` was keyed on a fresh random UUID
  each pass. Sourceless documents fall back to `uuid4()` (no stable
  identity ⇒ no upsert semantics).
* added `content_hash` (SHA-256, computed by the runtime at ingest time;
  loaders leave it `None`) used by staleness detection. `type` now
  defaults to `DocumentType.TEXT`.

* `Sanitizer` protocol for PII redaction (sync **or** async `sanitize`;
  errors propagate, no logic baked into the framework).
* `SanitizingLoader` wraps any `BaseDocumentLoader` + a `Sanitizer`,
  running every yielded document through it.
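
The deterministic id and the content hash both map onto standard-library calls. The sketch below shows the idea; `document_id` and `content_hash` are hypothetical helper names, and UTF-8 encoding before hashing is an assumption.

```python
import hashlib
import uuid
from typing import Optional


def document_id(source: Optional[str]) -> uuid.UUID:
    """Stable id for a known source; random id when there is no source."""
    if source:
        # Same source string -> same UUID, across runs and processes.
        return uuid.uuid5(uuid.NAMESPACE_URL, source)
    return uuid.uuid4()


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the document content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


first_pass = document_id("docs/guide.md")
second_pass = document_id("docs/guide.md")
unchanged = content_hash("hello") == content_hash("hello")
changed = content_hash("hello") != content_hash("hello!")
```

Because the id stays fixed while the hash changes, a modified file maps to the same `document_id`, so its old chunks can be found and deleted before the new ones are written.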

### Removals / cleanup

* Deleted the legacy `railtracks.vector_stores` package (`chroma`,
  `chunking/`, `filter`, `vector_store_base`) and its tests, fully
  superseded by `railtracks.retrieval.stores` and
  `railtracks.retrieval.chunking` (~7.5k lines).
* `retrieval.__init__` now exports the public surface:
  `RetrievalRuntime`, the ingestion event/stats types, `Store`,
  `StoreEntry`, `StoreQuery`, `StoreScope`, `VectorStore`,
  `EmbeddingFailure`, `EmbeddingModelMismatchError`.

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Review callouts**
* Ingest upsert is not transactional (per-chunk writes); count-aware
  staleness is what makes an interrupted ingest self-heal on the next
  run.
* `_captured_model` is in-process only, so model-mismatch enforcement
  doesn't survive a fresh runtime over a pre-populated store until its
  first ingest.
* pgvector `list_where` interpolates `LIMIT {int(limit)}` (int-cast,
  safe); all filter values stay parameterized.

**Known limitations / follow-ups**
* The `max_tokens` guard enforces a **per-item** token cap (drops
  oversized chunks pre-embedding); it does **not** do batch-level
  token-budget packing. Batches are still sized by count
  (`default_batch_size`), so a batch of in-spec chunks can still exceed
  a provider's per-request token limit (e.g. OpenAI's 8191). Worth a
  follow-up for token-aware batch packing.
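
To illustrate the gap described above, here is a small sketch of a per-item cap followed by count-based batching. Whitespace splitting stands in for the tokenizer, and `split_by_token_cap` / `batch_by_count` are hypothetical helpers, not the runtime's actual code.

```python
def count_tokens(text: str) -> int:
    """Toy tokenizer: whitespace split (stand-in for a real tokenizer)."""
    return len(text.split())


def split_by_token_cap(chunks, max_tokens):
    """Per-item guard: separate chunks that fit from those that exceed the cap."""
    kept = [c for c in chunks if count_tokens(c) <= max_tokens]
    dropped = [c for c in chunks if count_tokens(c) > max_tokens]
    return kept, dropped


def batch_by_count(items, batch_size):
    """Count-based batching: ignores how many tokens each batch carries."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


chunks = ["a b c", "d e f g h i", "j k l", "m n"]
kept, dropped = split_by_token_cap(chunks, max_tokens=4)
batches = batch_by_count(kept, batch_size=3)
batch_totals = [sum(count_tokens(c) for c in batch) for batch in batches]
```

Every kept chunk is within the 4-token cap, yet the single batch carries 8 tokens in total, which is how a count-sized batch can still exceed a request-level budget.
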
Dropping stale imports
@Amir-R25 Amir-R25 linked an issue May 28, 2026 that may be closed by this pull request
…L) (#1102)

## Summary

- Adds `S3Loader` / `GCSLoader` / `AzureBlobLoader` / `SQLLoader`
- All providers use optional extras (`railtracks[aws]`,
`railtracks[gcp]`, `railtracks[azure-blob]`, `railtracks[sql]`) so the
core package stays lean
- All loaders expose sync and async interfaces (`load`/`aload`)
- SQL classes include a context-manager (`with SQLLoader(...) as l`) and
explicit `close()` for engine lifecycle management
- SQL identifier arguments validated against a strict allowlist at
construction time to prevent injection
- Full unit test coverage across all classes
- Comprehensive developer docs
- Writers are not included in this PR; they will land in a separate module

## Security hardening

- SQL table/column names validated at `__init__` time — raises
`ValueError` on any metacharacter (`[A-Za-z_][A-Za-z0-9_$]*` allowlist,
supports `schema.table`)
- Helpful `ValueError` when `content_column` is missing from query
results (was a bare `KeyError`)
- `__repr__` on all classes exposes only non-sensitive fields
(bucket/container name); credentials never appear in repr
- All `ImportError` messages include both `pip install` and `uv add`
forms
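
The identifier allowlist can be sketched with a single regular expression applied to each dot-separated part. `validate_identifier` is a hypothetical helper; the pattern is the one quoted above, and allowing at most one dot (for `schema.table`) is an assumption of this sketch.

```python
import re

# Allowlist from the PR description: letter/underscore, then word chars or `$`.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def validate_identifier(name: str) -> str:
    """Return `name` if it is a safe `table` or `schema.table`, else raise."""
    parts = name.split(".")
    if not 1 <= len(parts) <= 2 or not all(_IDENTIFIER.fullmatch(p) for p in parts):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name


def is_valid(name: str) -> bool:
    try:
        validate_identifier(name)
    except ValueError:
        return False
    return True


results = {
    name: is_valid(name)
    for name in ["users", "public.users", "users; DROP TABLE users", "1table", "a.b.c", ""]
}
```

Validating at construction time means a bad name fails before any SQL string is built, which matters because identifiers cannot be passed as bound parameters.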

## Limitations documented

- CTE (`WITH …`) queries not supported as `table_or_query`; workaround
shown in docs
- `aload` methods are thread-backed (`asyncio.to_thread`), not
true-async; noted in docs with guidance for high-concurrency cases
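
For context, "thread-backed" here means roughly the pattern below: the blocking `load()` runs in a worker thread via `asyncio.to_thread`, so the event loop stays free but each concurrent call still occupies a thread. `BlockingLoader` is a hypothetical class written for illustration.

```python
import asyncio
import threading


class BlockingLoader:
    """Toy loader whose sync `load()` is wrapped by a thread-backed `aload()`."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.last_thread_id = None

    def load(self):
        # Record which thread did the work (a real loader would hit the SDK/DB).
        self.last_thread_id = threading.get_ident()
        return list(self.rows)

    async def aload(self):
        # Offload the blocking call so the event loop is not stalled.
        return await asyncio.to_thread(self.load)


loader = BlockingLoader(["row-1", "row-2"])
rows = asyncio.run(loader.aload())
ran_off_main_thread = loader.last_thread_id != threading.get_ident()
```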

## Test plan

- [x] 127 unit tests passing across all 4 providers × loader + writer
- [x] SQL tests use real in-memory SQLite (no mocks for correctness)
- [x] Cloud tests (S3/GCS/Azure) use provider SDK mocks
- [x] Async variants covered for all classes

## Additional file
Tutorial files were added to an ipynb notebook on Colab and are not
included here, per the requirement for single-location consistency.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[Feature][Retrieval] Retrieval Module

5 participants