Retrieval/Ingestion feature branch merge #1129
Draft
Amir-R25 wants to merge 9 commits into
**RAG ingestion: document loaders, data models, and docs**

Builds out the ingestion layer of the new RAG pipeline, the first stage in the `load → chunk → embed → store` design.

Note: Ignore docs having inline code for now, that'll be handled before merging with main

## What's included

**Loaders** (`railtracks.retrieval.loaders`)

- `BaseDocumentLoader`: streaming-first abstract base; `astream()` is the single primitive, `aload()` and `load()` derive from it automatically
- `TextLoader`: `.txt` and `.md` files and directories
- `CSVLoader`: one `Document` per row; `content_columns`, `ignore_columns`, and `content_separator` control how columns map to content vs metadata
- `JSONLoader`: one `Document` per JSON object; `content_keys` (`"*"` by default) and `ignore_keys` give the same field-level control
- `PyPDFLoader`: per-page or whole-document breakdown strategies; requires `railtracks[pdf]`

All loaders handle both single files and recursive directory loading. The interface is always a flat stream of `Document` objects regardless of source.

**Data models** (`railtracks.retrieval.models`)

- `Document`: typed (`DocumentType` enum), with UUID and source metadata
- `Chunk`: carries `document_id` so lineage back to the source document is never lost

**Tests**: 82 unit tests covering all four loaders and the base class.

**Docs**: reference pages for each loader, custom loader guide, and ingestion overview. Design docs updated to reflect the actual `astream()`-based interface.
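The streaming-first design described for `BaseDocumentLoader` can be sketched roughly as follows. This is a minimal illustration, not the railtracks implementation: `Doc`, `StreamingLoader`, and `InlineTextLoader` are hypothetical stand-ins, and only the idea that `aload()` and `load()` derive from `astream()` comes from the description above.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class Doc:
    """Hypothetical stand-in for the real Document model."""

    content: str
    source: str
    metadata: dict = field(default_factory=dict)


class StreamingLoader(ABC):
    """Hypothetical base class: subclasses implement only astream()."""

    @abstractmethod
    def astream(self) -> AsyncIterator[Doc]:
        """Yield documents one at a time."""

    async def aload(self) -> list[Doc]:
        # Derived: drain the stream into a list.
        return [doc async for doc in self.astream()]

    def load(self) -> list[Doc]:
        # Derived: blocking wrapper. Must not be called from a running event loop.
        return asyncio.run(self.aload())


class InlineTextLoader(StreamingLoader):
    """Toy loader that streams documents from in-memory strings."""

    def __init__(self, texts: dict[str, str]) -> None:
        self._texts = texts

    async def astream(self) -> AsyncIterator[Doc]:
        for source, text in self._texts.items():
            yield Doc(content=text, source=source)
```

Keeping a single primitive means a custom loader only has to get one method right; the list-returning variants cannot drift out of sync with it.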
## What does this add?
This PR introduces the **retrieval chunking subsystem** under
`railtracks.retrieval.chunking`: a small, composable API for turning
`Document` instances into `Chunk` instances before embedding and
storage.
It adds:
- **Core abstractions**: `Chunker` (ABC), `Splitter` (protocol),
  `Tokenizer` (protocol), plus a single protected `_make_chunks` path so
  `document_id`, dense `index`, metadata inheritance, and optional
  character `offsets` stay consistent.
- **Concrete chunkers**:
  - `RecursiveCharacterChunker`: recursive separator splitting + overlap
    merge with offset-aware spans.
  - `MarkdownHeaderChunker`: heading-aware sections with header
    breadcrumbs in metadata and optional fallback splitting for oversized
    bodies.
  - `SentenceChunker`: fixed-size **sentence windows** with
    sentence-level overlap (default regex sentence boundaries; pluggable
    splitter).
  - `FixedTokenChunker`: fixed **token** windows with overlap (default
    `TiktokenTokenizer`); `Chunk.offsets` intentionally unset in v1 for this
    chunker.
- **Unit tests** under
  `packages/railtracks/tests/unit_tests/retrieval/chunking/` covering
  chunkers, tokenization, and cross-chunker invariants.
- **User-facing docs**: `docs/rag/components/chunking/overview.md` and
  `docs/rag/components/chunking/methods.md`, aligned with the existing
  ingestion docs layout and the Chunking entries already listed in
  `mkdocs.yml`.
If this branch also includes **ingestion package import fixes** (e.g.
trimming broken `loaders` exports and exposing `JSONLoader`), call that
out explicitly in the final PR body so reviewers see it as a small
packaging fix bundled with retrieval work.
**Problem solved:** downstream RAG pieces need stable, backend-agnostic
`Chunk` objects with predictable metadata and (where supported) source
offsets, without ad hoc splitting logic scattered across the codebase.
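To make the "sentence windows with sentence-level overlap" idea concrete, here is a minimal sketch under stated assumptions. `SketchChunk` and `sentence_windows` are hypothetical names, the regex is a naive boundary rule rather than the default used by `SentenceChunker`, and the `window`/`overlap` parameters are illustrative.

```python
import re
from dataclasses import dataclass

# Naive sentence rule (an assumption): a run of non-terminators plus its terminators.
_SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")


@dataclass
class SketchChunk:
    """Hypothetical stand-in for Chunk, keeping lineage, dense index, and offsets."""

    document_id: str
    index: int
    content: str
    offsets: tuple[int, int]


def sentence_windows(
    text: str, document_id: str, window: int = 3, overlap: int = 1
) -> list[SketchChunk]:
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")

    # Collect (start, end) character spans of each sentence, trimmed of whitespace.
    spans: list[tuple[int, int]] = []
    for match in _SENTENCE_RE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        start = match.start() + (len(raw) - len(raw.lstrip()))
        spans.append((start, start + len(stripped)))

    chunks: list[SketchChunk] = []
    step = window - overlap
    for i in range(0, len(spans), step):
        group = spans[i : i + window]
        start, end = group[0][0], group[-1][1]
        # Content is sliced from the source, so offsets always round-trip.
        chunks.append(SketchChunk(document_id, len(chunks), text[start:end], (start, end)))
        if i + window >= len(spans):
            break  # the last window already reached the final sentence
    return chunks
```

Slicing the chunk text directly out of the source string is what keeps `offsets` trustworthy: `text[start:end] == chunk.content` holds by construction.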
## Type of changes
Please check the type of change your PR introduces:
- [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
- [x] ✨ New feature (non-breaking change which adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to
documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of
the code)
- [ ] ♻️ Refactoring (code change that neither fixes a bug nor adds a
feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous
integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)
## Background context
Retrieval is being built out in layers: ingestion produces `Document`
objects; chunking is the next step before embeddings and vector stores.
This PR lands the **chunking slice** of that pipeline as an explicit
module with tests and public docs, while keeping chunking imports under
`railtracks.retrieval.chunking` (domain models remain on
`railtracks.retrieval`).
## Checklist for Author
### Code Quality
- [x] Code follows the project's style guidelines (run `ruff check .`
and `ruff format .`)
- [x] Code is commented, particularly in hard-to-understand areas
### Testing
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Test coverage maintained
### Documentation
- [x] Documentation updated if needed (bot will verify)
### Git & PR Management
- [ ] PR title clearly describes the change
### Breaking Changes
- [ ] Breaking changes are documented
- [ ] Migration guide provided in documentation (step-by-step
instructions for users to update their code/config)
---
## Final Product
**Import surface**
```python
from railtracks.retrieval import Document
from railtracks.retrieval.chunking import (
    RecursiveCharacterChunker,
    MarkdownHeaderChunker,
    SentenceChunker,
    FixedTokenChunker,
)
```
Embedding Module Summary:

- `Embedding` abstract base class:
  - `aembed`: subclasses need to implement this
  - `embed`: defaults to `asyncio.run(aembed)` if not overridden by subclasses
  - `astream_batches`: streams embedding batches of chunks for the upcoming RAG pipeline
- `SyncEmbedding`: mixin for providers that do not provide an async embedding API
- `LiteLLMEmbedding`: implements the `Embedding` base class to be used with LiteLLM
  - Point of concern: LiteLLM does not directly expose embedding costs, so we are using an internal field of theirs called `_hidden_params`, which provides cost. This is potentially problematic as it's an internal, undocumented param
  - This may be an argument that, for embedding, it's best to implement each provider's logic directly. The current design future-proofs that decision, as `OpenAIEmbedding` can inherit from `Embedding` directly rather than from `LiteLLMEmbedding`
- `OpenAIEmbedding`, `AzureEmbedding` and `OllamaEmbedding` child classes extending `LiteLLMEmbedding` for now
- AI-generated documentation (placeholder for now, will need to edit/verify before merging feature branch to main)
- Unit tests
- Current data model (please provide feedback)

<img width="1323" height="861" alt="image" src="https://github.com/user-attachments/assets/f5a7af72-aab3-41b6-b778-deffdb0b5121" />

Please focus your review mainly on the `Embedding` base class and whether it makes sense from the following two paradigms:

- Extensible embedding wrapper for advanced users
- Part of an event-based ingestion/retrieval pipeline

---------

Co-authored-by: Copilot <copilot@github.com>
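For reviewers weighing the base-class shape, the relationship between `aembed`, `embed`, `astream_batches`, and the sync mixin might look roughly like this. It is a sketch only: the class names, signatures, and the use of `asyncio.to_thread` in the sync variant are assumptions for illustration, not the code in this PR.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, Iterable


class EmbeddingSketch(ABC):
    """Hypothetical base: aembed() is the one method subclasses must provide."""

    @abstractmethod
    async def aembed(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of texts."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Default sync path, as described above: run the async method to completion.
        return asyncio.run(self.aembed(texts))

    async def astream_batches(
        self, texts: Iterable[str], batch_size: int = 2
    ) -> AsyncIterator[list[list[float]]]:
        # Group inputs into fixed-size batches and yield one embedded batch at a time.
        batch: list[str] = []
        for text in texts:
            batch.append(text)
            if len(batch) == batch_size:
                yield await self.aembed(batch)
                batch = []
        if batch:
            yield await self.aembed(batch)


class SyncEmbeddingSketch(EmbeddingSketch):
    """Hypothetical sync variant: the provider only offers a blocking call."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]:
        """Blocking embed supplied by the provider wrapper."""

    async def aembed(self, texts: list[str]) -> list[list[float]]:
        # Assumption: offload the blocking call so the event loop stays responsive.
        return await asyncio.to_thread(self.embed, texts)


class LengthEmbedding(SyncEmbeddingSketch):
    """Toy provider: a one-dimensional 'embedding' equal to the text length."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]
```

In this shape a provider-specific class can subclass the base directly, which is the flexibility the LiteLLM concern above is asking for.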
## What does this add?
Adds the **Stores** layer to the RAG pipeline — the persistence and
retrieval layer that sits after the embedding stage.
**Core data models**
- `StoreEntry` — the atomic unit written to and retrieved from a store.
Required fields map directly from `EmbeddedChunk`; all enrichment fields
(`abstract`, `summary`, `scope`, `store_category`, `entities`,
`valid_from`, `valid_until`) are optional.
- `StoreEntry.from_chunk(embedded_chunk, ...)` — classmethod that builds
an entry from an `EmbeddedChunk` automatically; all field mapping is
handled, enrichment fields are keyword-only overrides.
- `StoreScope` — frozen dataclass for hard-filter namespacing. Any
non-`None` field becomes a mandatory equality filter on every read and
clear — a query scoped to `user_id="alice"` will never return entries
written under `user_id="bob"`.
- `StoreCategory` — string enum (`EPISODIC`, `SEMANTIC`, `SKILL`,
`PROCEDURAL`) for classifying entry knowledge type.
- `DetailLevel` — enum controlling retrieval verbosity: `L0` returns
abstract only, `L1` adds summary, `L2` returns full content (default).
Enables cheap first-pass ranking before an expensive full-content pass.
- `StoreQuery` / `RetrievedStoreEntry` — query input and scored result
wrappers.
**`Store` protocol**
An async `write` / `read` / `delete` / `clear` interface. Any class
satisfying it can be used wherever a `Store` is expected.
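"Any class satisfying it" refers to structural typing. A minimal sketch of that idea, using plain dicts in place of `StoreEntry` and `StoreQuery` (the method signatures here are assumptions; only the four method names come from the description):

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class StoreLike(Protocol):
    """Hypothetical protocol with the four async methods named above."""

    async def write(self, entry: dict) -> None: ...
    async def read(self, query: dict) -> list[dict]: ...
    async def delete(self, entry_id: str) -> None: ...
    async def clear(self) -> None: ...


class DictStore:
    """Does not inherit from StoreLike; it only provides matching methods."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    async def write(self, entry: dict) -> None:
        self._entries[entry["id"]] = entry

    async def read(self, query: dict) -> list[dict]:
        # Toy retrieval: case-insensitive substring match instead of vector search.
        needle = query["text"].lower()
        return [e for e in self._entries.values() if needle in e["content"].lower()]

    async def delete(self, entry_id: str) -> None:
        self._entries.pop(entry_id, None)

    async def clear(self) -> None:
        self._entries.clear()


async def demo(store: StoreLike) -> list[dict]:
    await store.write({"id": "1", "content": "Refunds are issued within 30 days."})
    await store.write({"id": "2", "content": "Shipping takes 5 days."})
    await store.delete("2")
    return await store.read({"text": "refund"})
```

Note that `isinstance` against a `runtime_checkable` protocol only checks that the methods exist, not their signatures.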
**`VectorStore`**
The built-in implementation. Delegates low-level index I/O to a
swappable `VectorBackend` while owning payload serialization, scope
filter construction, and detail-level projection.
**Three backends**
| Backend | Persistence | Install |
|---|---|---|
| `InMemoryBackend` | Optional JSON snapshot | None |
| `ChromaBackend` | Via client mode (ephemeral / on-disk / HTTP) | `railtracks[stores-chroma]` |
| `PgvectorBackend` | Full Postgres durability | `railtracks[stores-vector]` |
All backends accept a `DistanceMetric` (`COSINE`, `L2`, `IP`).
`ChromaBackend` and `PgvectorBackend` expose an async `create(...)`
factory that initialises the backend in one call.
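As a reminder of what the three `DistanceMetric` options usually mean, here are the textbook definitions in plain Python. The function names are hypothetical, and individual backends may use different sign or normalisation conventions (for example a negated inner product), so treat this as a reference sketch rather than the backends' scoring code.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 for identical directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Raw dot product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))
```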
**Documentation**
Full docs added under `docs/rag/components/stores/`: an overview page
covering the data model, querying, the `Store` protocol, and
`VectorStore` usage, plus a backends page covering all three backends
and the custom-backend recipe. Wired into `mkdocs.yml`.
**Model rename: `Memory*` → `Store*`**
All public model names were changed from `Memory`-prefixed to
`Store`-prefixed (`MemoryEntry` → `StoreEntry`, `MemoryScope` →
`StoreScope`, etc.) because "memory" is overloaded in the LLM agent
literature (LangMem, MemGPT) and these stores are general-purpose
retrieval stores, not agent memory systems.
---
## Type of changes
- [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
- [x] ✨ New feature (non-breaking change which adds functionality)
- [x] 💥 Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to
documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of
the code)
- [x] ♻️ Refactoring (code change that neither fixes a bug nor adds a
feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous
integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)
---
## Background context
The stores layer completes the core RAG pipeline (`ingestion → chunking
→ embedding → stores`). It was designed backend-agnostic from the start:
the `VectorBackend` protocol is the only contract between `VectorStore`
and the underlying index, so swapping from `InMemoryBackend` to
`PgvectorBackend` requires changing one line.
`StoreScope` was modelled as a hard filter (not a soft hint) to make
multi-tenant data isolation a first-class concern. The scope fields are
serialised as `scope_*` prefixed keys in the payload so they are always
filterable at the backend level, not post-hoc in Python.
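A small sketch of the hard-filter idea: every non-`None` scope field becomes a `scope_*` equality filter. `ScopeSketch`, `to_filter`, and `matches` are hypothetical names, and the in-Python `matches` check only simulates what a backend-level filter would enforce.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ScopeSketch:
    """Hypothetical stand-in for StoreScope with two example fields."""

    user_id: Optional[str] = None
    session_id: Optional[str] = None

    def to_filter(self) -> dict[str, str]:
        # Only fields that are set turn into mandatory equality filters.
        return {
            f"scope_{f.name}": getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


def matches(payload: dict, scope: ScopeSketch) -> bool:
    """Simulate the backend filter: every scope key must match exactly."""
    return all(payload.get(key) == value for key, value in scope.to_filter().items())
```

An empty scope produces an empty filter and therefore matches everything, which is the single-tenant case.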
`DetailLevel` exists to enable multi-stage retrieval pipelines that are
common in production RAG: a cheap `L0` pass over abstracts can narrow a
large corpus to a short candidate list before a more expensive `L2` pass
fetches full content for re-ranking.
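Detail-level projection can be pictured as trimming an entry before it is returned. The sketch below assumes the levels are cumulative (`L1` includes the abstract, `L2` includes everything); that reading, and the names `Level` and `project`, are assumptions for illustration.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical stand-in for DetailLevel."""

    L0 = 0  # abstract only
    L1 = 1  # abstract + summary
    L2 = 2  # abstract + summary + full content


def project(entry: dict, level: Level) -> dict:
    """Return only the fields allowed at the requested detail level."""
    projected = {"id": entry["id"], "abstract": entry.get("abstract")}
    if level.value >= Level.L1.value:
        projected["summary"] = entry.get("summary")
    if level.value >= Level.L2.value:
        projected["content"] = entry["content"]
    return projected
```

A two-stage pipeline would call the cheap projection over many candidates, then request `L2` only for the short list it wants to re-rank.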
The `Memory*` → `Store*` rename was made early to avoid a harder
migration later. The term "memory" is already claimed by agent memory
systems (LangMem, MemGPT, semantic/episodic agent memory); using it for
a retrieval store creates confusing overlap in docs and user mental
models.
---
## Checklist for Author
### Code Quality
- [x] Code follows the project's style guidelines (run `ruff check .`
and `ruff format .`)
- [ ] Code is commented, particularly in hard-to-understand areas
### Testing
- [x] Tests added/updated and pass locally (`pytest tests`)
- [x] Test coverage maintained
### Documentation
- [x] Documentation updated if needed (bot will verify)
### Git & PR Management
- [x] PR title clearly describes the change
### Breaking Changes
- [x] Breaking changes are documented
- [x] Migration guide provided in documentation (step-by-step
instructions for users to update their code/config)
---
## Final Product
**Creating and indexing entries from an embedding pipeline:**
```python
from railtracks.retrieval.stores import (
    VectorStore, StoreEntry, StoreScope, StoreCategory,
    StoreQuery, DetailLevel,
)
from railtracks.retrieval.stores.vector.backends import InMemoryBackend

# Run inside an async function (or a notebook with top-level await).
# `embedded_chunks` and `query_vector` come from the embedding stage.
store = VectorStore(InMemoryBackend())
scope = StoreScope(user_id="alice", session_id="s-001")

# Index
for embedded_chunk in embedded_chunks:
    entry = StoreEntry.from_chunk(
        embedded_chunk,
        scope=scope,
        store_category=StoreCategory.SEMANTIC,
        abstract=embedded_chunk.chunk.content[:100],
    )
    await store.write(entry)

# Retrieve with detail-level control
query = StoreQuery(
    text="What is the refund policy?",
    scope=scope,
    embedding=query_vector,
    top_k=5,
    detail_level=DetailLevel.L2,
)
results = await store.read(query)
for r in results:
    print(f"(rank={r.rank}, score={r.score:.3f}) {r.entry.content}")
```
**Production backend (Pgvector):**
```python
from railtracks.retrieval.stores import VectorStore, PgvectorBackend
backend = await PgvectorBackend.create(
dsn="postgresql://user:pass@localhost/mydb",
table="my_index",
dim=1536,
)
store = VectorStore(backend)
```
**Persistent development backend (Chroma on-disk):**
```python
from railtracks.retrieval.stores import VectorStore, ChromaBackend
backend = ChromaBackend("my-collection", path="/data/chroma")
await backend.initialize()
store = VectorStore(backend)
```
## Summary

Adds `HuggingFaceDatasetLoader`, a streaming document loader that wraps `datasets.load_dataset` in streaming mode and yields one `Document` per row through the existing `BaseDocumentLoader.astream()` contract. Unlocks hundreds of HF Hub datasets (squad, ms_marco, trivia_qa, wiki_dpr, c4, …) for the RAG ingestion pipeline without per-dataset code.

Adds a new optional extra: `huggingface = ["datasets >= 2.14.0"]`.

Closes #1092

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [x] Docs
- [x] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`): 16 new tests, full loaders suite 98/98 green
- [x] Docs updated; new page `docs/rag/components/ingestion/huggingface.md`, overview table row, mkdocs nav entry
- [x] Breaking changes include migration notes; N/A, adds new public name + new optional extra only; no existing exports or behaviour changed

## Notes

### Acceptance criteria

- [x] Extends `BaseDocumentLoader`
- [x] Constructor accepts `dataset_name`, `split`, `content_columns: list[str]`, optional `metadata_columns` and `dataset_kwargs`
- [x] `astream()` is truly lazy, verified by a test that confirms only one row is consumed per `__anext__` call (no full-dataset buffering)
- [x] Each row yields a `Document` with `source = "{dataset_name}/{split}"` and metadata populated from `metadata_columns` + `row_index`
- [x] Raises `ValueError` if any `content_columns` are not present in the dataset schema
- [x] Unit tests use a mocked `IterableDataset`; no network calls during the test suite

### Basic usage

```python
from railtracks.retrieval.loaders.huggingface_loader import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(
    dataset_name="squad",
    split="validation",
    content_columns=["question", "context"],
    metadata_columns=["title", "id"],
    content_separator="\n\n",
)

# Rows are streamed lazily from the Hub, one at a time
async for doc in loader.astream():
    # doc.source == "squad/validation"
    # doc.content == "<question>\n\n<context>"
    # doc.metadata == {"title": "...", "id": "...", "row_index": N}
    ...
```

For datasets with subsets, pinned revisions, or auth tokens, forward via `dataset_kwargs`:

`HuggingFaceDatasetLoader(..., dataset_kwargs={"name": "v2.1", "revision": "abc"})`

`streaming=True` is set by default and can be overridden through the same dict.
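The laziness criterion above can be illustrated without the `datasets` dependency. In this sketch, `stream_rows` is a hypothetical helper that maps any iterable of row dicts to document-like dicts; the `available_columns` parameter stands in for the dataset schema and is an assumption, as is the fact that validation fires on the first `__anext__` call.

```python
import asyncio
from typing import AsyncIterator, Collection, Iterable, Optional, Sequence


async def stream_rows(
    rows: Iterable[dict],
    *,
    source: str,
    content_columns: Sequence[str],
    metadata_columns: Sequence[str] = (),
    content_separator: str = "\n\n",
    available_columns: Optional[Collection[str]] = None,
) -> AsyncIterator[dict]:
    """Yield one document-like dict per row, pulling rows only on demand."""
    # Schema check (runs when the first item is requested, since this is a generator).
    if available_columns is not None:
        missing = [c for c in content_columns if c not in available_columns]
        if missing:
            raise ValueError(f"content columns not in dataset schema: {missing}")

    for row_index, row in enumerate(rows):
        metadata = {column: row[column] for column in metadata_columns}
        metadata["row_index"] = row_index
        yield {
            "source": source,
            "content": content_separator.join(str(row[c]) for c in content_columns),
            "metadata": metadata,
        }
```

Because the helper never materialises `rows`, a consumer that stops after the first document has pulled exactly one row from the underlying iterator.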
## Summary

Adds `BaseOCRLoader` (abstract) and `PyPDFOCRLoader` (concrete) to the retrieval module. Closes #1079.

`PyPDFOCRLoader` handles both text-based and scanned PDFs in a single loader:

- Tries `pypdf` text extraction per page first
- Pages where extraction returns empty are rasterized with `pypdfium2` and recognized via Tesseract (`pytesseract`)
- `force_ocr=True` flag OCRs every page unconditionally
- Metadata records whether each page used OCR (`metadata["ocr"]` for page strategy; `metadata["ocr_pages"]` for document strategy)

**Design decisions:**

- **pypdfium2** (Apache-2.0 / BSD-3) over Poppler (GPL-2), PyMuPDF (AGPL), and Wand/ImageMagick (transitively AGPL via Ghostscript). It was already a transitive dependency via `pdfplumber`.
- **Tesseract** as the default OCR engine: open-source, no API key, the obvious default for an OSS library. Subclassing `BaseOCRLoader` with a different engine (EasyOCR, Textract, LLM vision) is a one-method override.
- New `[ocr]` optional-dependency extra in `pyproject.toml`. Users also need the Tesseract binary on PATH; this is surfaced in the `ImportError` message.

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Test coverage:**

- **29 unit tests** (mocked, no real binaries needed in CI):
  - `BaseOCRLoader` contract + inheritance (6 tests)
  - `PyPDFOCRLoader` init / errors / text-fast-path / OCR-fallback / `force_ocr` / page strategy / document strategy / directory loading (23 tests)
- **End-to-end smoke test** in `packages/railtracks/tests/end_to_end/rag/`, excluded from regular CI per `pyproject.toml`'s existing `addopts`. Generates a synthetic image-only PDF with Pillow, runs the loader, asserts text recovery. Verified locally on Windows with Tesseract v5.5.0.20241111.
- **Manual verification** against a real scanned typed document (clean recovery) and a handwritten document (expected Tesseract limitation, intended to be addressed in future via additional `BaseOCRLoader` subclasses for handwriting-capable engines).
- Full loaders test suite: 111 passed, no regressions.

**Installation:** users need to `pip install "railtracks[ocr]"` for the Python packages, and separately install the Tesseract binary and put it on PATH. The loader's `ImportError` message points to the official Tesseract install docs.

---------

Co-authored-by: Pooria Ashrafian <61358399+Pooria90@users.noreply.github.com>
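The text-first, OCR-fallback flow described in the summary can be sketched with the extraction and OCR steps injected as callables, so no PDF or Tesseract dependency is needed to follow the control flow. `extract_pages` and `merge_document` are hypothetical helpers; treating whitespace-only text as "empty" and storing `ocr_pages` as a list of page indices are assumptions.

```python
from typing import Callable


def extract_pages(
    page_count: int,
    extract_text: Callable[[int], str],
    run_ocr: Callable[[int], str],
    force_ocr: bool = False,
) -> list[dict]:
    """Per-page strategy: native text first, OCR only where it yields nothing."""
    pages = []
    for page in range(page_count):
        text = "" if force_ocr else (extract_text(page) or "")
        used_ocr = force_ocr or not text.strip()
        if used_ocr:
            text = run_ocr(page)
        pages.append({"content": text, "metadata": {"page": page, "ocr": used_ocr}})
    return pages


def merge_document(pages: list[dict]) -> dict:
    """Document strategy: join pages and record which ones needed OCR."""
    return {
        "content": "\n\n".join(p["content"] for p in pages),
        "metadata": {
            "ocr_pages": [p["metadata"]["page"] for p in pages if p["metadata"]["ocr"]]
        },
    }
```

Swapping the OCR engine then amounts to passing a different `run_ocr` callable, which mirrors the one-method override described for `BaseOCRLoader` subclasses.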
…1109)

## Summary

Adds the `RetrievalRuntime` orchestrator and the supporting `Store` / loader changes needed to drive the full ingest → retrieve flow end-to-end. Also removes the legacy `railtracks.vector_stores` package now that `railtracks.retrieval` supersedes it.

```
┌────────┐   ┌─────────┐   ┌──────────┐   ┌────────┐   ┌───────────┐
│ Loader │ → │ Chunker │ → │ Embedder │ → │ Store  │ → │ Retrieval │
└────────┘   └─────────┘   └──────────┘   └────────┘   └───────────┘
    ▲             ▲             ▲             ▲              ▲
    └─────────────┴─────────────┴─────────────┴──────────────┘
                                │
                     ┌─────────────────────┐
                     │  RetrievalRuntime   │
                     │ (the orchestrator)  │
                     └─────────────────────┘
```

### `RetrievalRuntime`

The orchestrator that wires a chunker + embedder + `Store` (+ optional `scope`) into the ingest/retrieve flow.

* **Loader is passed to `ingest()`, not the constructor**: one runtime captures *how* to process (chunker/embedder/store/scope); the loader decides *what*. A single runtime can ingest from many sources and re-ingest to update.
* **Streaming + aggregate APIs:** `ingest(loader)` is an async generator yielding per-batch events; `ingest_all(loader)` drains it and returns `IngestionStats`.
* **Events:** `BatchIngested` (carries per-batch `EmbeddingMetrics`: tokens, cost, latency, vector count), `EmbeddingFailure`, `DocumentFailed`, `DocumentSkipped`. `batch_index` is **per-document**, not run-global.
* **Upsert semantics:** before writing the *first* chunk of a document the runtime fires `store.delete_where({"document_id": str(doc.id)})` to clear the prior version. The delete only runs once a batch succeeds, so a total embedding failure preserves the previous version. Writes are per-chunk and **not transactional**: a crash mid-write leaves a partial document (recovered on the next ingest, see below).
* **Count-aware staleness (skip unchanged docs):** a document is skipped only when the store already holds a *complete* copy, matched on `source_path` + `content_hash` and the persisted `doc_chunk_count`. A partially-written document (fewer chunks than expected after an interrupted run) is re-ingested rather than left broken. Counting is done via `find()` rather than a `count()` call so the runtime depends only on the `Store` protocol.
* **Token-size guard:** when `max_tokens` is set, chunks over the per-item limit are dropped *before* embedding and surfaced as `EmbeddingFailure` instead of causing provider 4xx errors. Uses `TiktokenTokenizer` by default. (Partial fix for the embedding per-item token-cap gap; see Known limitations.)
* **Embedding-model consistency:** the model is captured from the first successful batch; a later `retrieve()` with a different embedder raises `EmbeddingModelMismatchError` (cross-model similarity scores are meaningless). Note: capture is **in-process only**; a fresh runtime over an existing store won't enforce until its first ingest.
* `on_ingest` / `on_retrieve` callbacks for logging/observability; `delete_document(id)` convenience wrapper.

### `stores` module

`Store` protocol:

* added `delete_where(filters)` and `find(filters, limit=1)` (metadata-only lookup, no vector search); both are required by the runtime's upsert/staleness paths.

`StoreEntry`:

* `vector` is now `list[float] | None`. Read results no longer round-trip the vector (was `[]`, now `None`); the backend owns the stored vector, and callers must not rely on this field on retrieved entries.

`StoreQuery`:

* `scope` is now optional (`StoreScope | None`) for single-tenant callers.
* `metadata_filters` retyped `dict[str, Any]` (was `dict[str, str]`).
* removed the unused `strategies` field and the `RetrievalStrategy` enum.

`VectorStore` (base) / `VectorBackend`:

* `VectorBackend` protocol gained `list_where(filters, limit)` and `count(filters)`. **`count` lives on the backend, not `Store`**, which keeps the runtime's dependency surface to the `Store` protocol alone.
* `VectorStore` now implements `find`, `delete_where`, and `count`.
* Payload encoding spreads scalar `chunk_metadata` values to the top level (in addition to the JSON-encoded blob) so flat-equality `metadata_filters` / `find` work against them.

Backend implementations (`chroma`, `in_memory`, `pgvector`) all implement `list_where` + `count`. Plus:

* **pgvector** `_build_where` now compares JSONB-to-JSONB (`payload->$k::text = $v::jsonb`) so non-string scalars (int/bool/None) keep their JSON type instead of being stringified. Filters are parameterized; `LIMIT` is `int`-cast before interpolation. Added `pool_kwargs` passthrough to `asyncpg.create_pool` for tuning `min_size`/`max_size`/etc.
* **in_memory** `_flush` is now async: JSON encode happens under the lock, and the disk write is offloaded to a thread so the event loop isn't blocked. Search now sanitizes non-finite scores (NaN/inf from a misbehaving embedder): they're logged and sorted/dropped to the end instead of corrupting the ranking.

### `loaders` module

`Document`:

* **`id` is now derived deterministically from `source` via `uuid5(NAMESPACE_URL, source)`** so re-ingesting the same source yields the same id across processes. Fixes a silent upsert bug where modified files left their prior chunks orphaned in the store, because `delete_where({"document_id": ...})` was keyed on a fresh random UUID each pass. Sourceless documents fall back to `uuid4()` (no stable identity ⇒ no upsert semantics).
* added `content_hash` (SHA-256, computed by the runtime at ingest time; loaders leave it `None`) used by staleness detection. `type` now defaults to `DocumentType.TEXT`.
* `Sanitizer` protocol for PII redaction (sync **or** async `sanitize`; errors propagate, no logic baked into the framework).
* `SanitizingLoader` wraps any `BaseDocumentLoader` + a `Sanitizer`, running every yielded document through it.

### Removals / cleanup

* Deleted the legacy `railtracks.vector_stores` package (`chroma`, `chunking/`, `filter`, `vector_store_base`) and its tests; fully superseded by `railtracks.retrieval.stores` and `railtracks.retrieval.chunking` (~7.5k lines).
* `retrieval.__init__` now exports the public surface: `RetrievalRuntime`, the ingestion event/stats types, `Store`, `StoreEntry`, `StoreQuery`, `StoreScope`, `VectorStore`, `EmbeddingFailure`, `EmbeddingModelMismatchError`.

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Review callouts**

* Ingest upsert is not transactional (per-chunk writes); count-aware staleness is what makes an interrupted ingest self-heal on the next run.
* `_captured_model` is in-process only; model-mismatch enforcement doesn't survive a fresh runtime over a pre-populated store until its first ingest.
* pgvector `list_where` interpolates `LIMIT {int(limit)}` (int-cast, safe); all filter values stay parameterized.

**Known limitations / follow-ups**

* The `max_tokens` guard enforces a **per-item** token cap (drops oversized chunks pre-embedding); it does **not** do batch-level token-budget packing. Batches are still sized by count (`default_batch_size`), so a batch of in-spec chunks can still exceed a provider's per-request token limit (e.g. OpenAI's 8191). Worth a follow-up for token-aware batch packing.
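The deterministic-id and count-aware staleness rules described above can be sketched with the standard library alone. The helper names are hypothetical, `stored_entries` stands in for whatever `find()` returns, and the exact-equality check on `doc_chunk_count` is an assumption about what "complete copy" means.

```python
import hashlib
import uuid
from typing import Optional


def document_id(source: Optional[str]) -> uuid.UUID:
    """Stable id for a known source; random id when there is no source."""
    if source:
        return uuid.uuid5(uuid.NAMESPACE_URL, source)
    return uuid.uuid4()  # no stable identity, so no upsert semantics


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the document text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def is_up_to_date(stored_entries: list[dict], source_path: str, content: str) -> bool:
    """Skip only when a complete copy with the same hash is already stored."""
    digest = content_hash(content)
    matching = [
        entry
        for entry in stored_entries
        if entry.get("source_path") == source_path
        and entry.get("content_hash") == digest
    ]
    if not matching:
        return False  # new or modified document
    expected = matching[0].get("doc_chunk_count")
    # Fewer stored chunks than expected means an interrupted ingest: re-ingest.
    return expected is not None and len(matching) == expected
```

With a random id per pass, the `delete_where` keyed on `document_id` could never find the previous version; deriving the id from `source` is what makes the upsert reachable.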
Dropping stale imports
…L) (#1102)

## Summary

- Adds `S3Loader` / `GCSLoader` / `AzureBlobLoader` / `SQLLoader`
- All providers use optional extras (`railtracks[aws]`, `railtracks[gcp]`, `railtracks[azure-blob]`, `railtracks[sql]`) so the core package stays lean
- All loaders expose sync and async interfaces (`load`/`aload`)
- SQL classes include a context manager (`with SQLLoader(...) as l`) and explicit `close()` for engine lifecycle management
- SQL identifier arguments are validated against a strict allowlist at construction time to prevent injection
- Full unit test coverage across all classes
- Comprehensive developer docs
- Writers are not included in this PR; they will land in a separate module

## Security hardening

- SQL table/column names validated at `__init__` time; raises `ValueError` on any metacharacter (`[A-Za-z_][A-Za-z0-9_$]*` allowlist, supports `schema.table`)
- Helpful `ValueError` when `content_column` is missing from query results (was a bare `KeyError`)
- `__repr__` on all classes exposes only non-sensitive fields (bucket/container name); credentials never appear in repr
- All `ImportError` messages include both `pip install` and `uv add` forms

## Limitations documented

- CTE (`WITH …`) queries not supported as `table_or_query`; workaround shown in docs
- `aload` is thread-backed (`asyncio.to_thread`), not true async; noted in docs with guidance for high-concurrency cases

## Test plan

- [x] 127 unit tests passing across all 4 providers × loader + writer
- [x] SQL tests use real in-memory SQLite (no mocks for correctness)
- [x] Cloud tests (S3/GCS/Azure) use provider SDK mocks
- [x] Async variants covered for all classes

## Additional file

The tutorial is provided as an ipynb notebook on Colab; it is not included in this PR, per the requirement for single-location consistency.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
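The identifier allowlist quoted under "Security hardening" can be sketched as a small validator. `validate_identifier` is a hypothetical name, and limiting dotted names to at most two parts (`schema.table`) is an assumption; only the character pattern comes from the description.

```python
import re

# Pattern quoted in the PR description for a single identifier part.
_IDENTIFIER_PART = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def validate_identifier(name: str) -> str:
    """Return the name unchanged if it is a safe identifier, else raise ValueError."""
    parts = name.split(".")
    # Assumption: allow `table` or `schema.table`, nothing deeper.
    if not 1 <= len(parts) <= 2 or not all(_IDENTIFIER_PART.fullmatch(p) for p in parts):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name
```

Using `fullmatch` rather than `match` matters here: it rejects a valid prefix followed by metacharacters, such as `users; DROP TABLE x`.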
This'll be a big one, draft to see how big