Retrieval/Ingestion feature branch merge by Amir-R25 · Pull Request #1129 · RailtownAI/railtracks

Amir-R25 · 2026-05-28T18:00:57Z

This'll be a big one, draft to see how big

**RAG ingestion: document loaders, data models, and docs**

Builds out the ingestion layer of the new RAG pipeline, the first stage in the `load → chunk → embed → store` design.

Note: ignore the inline code in the docs for now; that will be handled before merging with main.

## What's included

**Loaders** (`railtracks.retrieval.loaders`)

- `BaseDocumentLoader`: streaming-first abstract base; `astream()` is the single primitive, `aload()` and `load()` derive from it automatically
- `TextLoader`: `.txt` and `.md` files and directories
- `CSVLoader`: one `Document` per row; `content_columns`, `ignore_columns`, and `content_separator` control how columns map to content vs metadata
- `JSONLoader`: one `Document` per JSON object; `content_keys` (`"*"` by default) and `ignore_keys` give the same field-level control
- `PyPDFLoader`: per-page or whole-document breakdown strategies; requires `railtracks[pdf]`

All loaders handle both single files and recursive directory loading. The interface is always a flat stream of `Document` objects regardless of source.

**Data models** (`railtracks.retrieval.models`)

- `Document`: typed (`DocumentType` enum), with UUID and source metadata
- `Chunk`: carries `document_id` so lineage back to the source document is never lost

**Tests**: 82 unit tests covering all four loaders and the base class.

**Docs**: reference pages for each loader, custom loader guide, and ingestion overview. Design docs updated to reflect the actual `astream()`-based interface.
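The streaming-first idea described for `BaseDocumentLoader` can be sketched as follows. This is a minimal illustration, not the PR's implementation: `Doc`, `StreamingLoader`, and `ListLoader` are hypothetical stand-ins, and the real class may differ in signatures and behaviour.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import AsyncIterator, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the PR's Document model."""
    content: str
    source: Optional[str] = None
    metadata: dict = field(default_factory=dict)


class StreamingLoader(ABC):
    """Hypothetical base class: astream() is the only method a subclass writes."""

    @abstractmethod
    def astream(self) -> AsyncIterator[Doc]:
        """Yield documents one at a time."""

    async def aload(self) -> list:
        # Derived: drain the stream into a list.
        return [doc async for doc in self.astream()]

    def load(self) -> list:
        # Derived: sync wrapper. Assumes no event loop is already running.
        return asyncio.run(self.aload())


class ListLoader(StreamingLoader):
    """Toy loader that streams documents from an in-memory list of strings."""

    def __init__(self, texts):
        self._texts = texts

    async def astream(self):
        for i, text in enumerate(self._texts):
            yield Doc(content=text, source=f"memory://{i}")


docs = ListLoader(["alpha", "beta"]).load()
```

Keeping a single primitive means a custom loader only has to get one method right; the list-returning variants cannot drift out of sync with it.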

## What does this add?

This PR introduces the **retrieval chunking subsystem** under `railtracks.retrieval.chunking`: a small, composable API for turning `Document` instances into `Chunk` instances before embedding and storage.

It adds:

- **Core abstractions**: `Chunker` (ABC), `Splitter` (protocol), `Tokenizer` (protocol), plus a single protected `_make_chunks` path so `document_id`, dense `index`, metadata inheritance, and optional character `offsets` stay consistent.
- **Concrete chunkers**:
  - `RecursiveCharacterChunker`: recursive separator splitting + overlap merge with offset-aware spans.
  - `MarkdownHeaderChunker`: heading-aware sections with header breadcrumbs in metadata and optional fallback splitting for oversized bodies.
  - `SentenceChunker`: fixed-size **sentence windows** with sentence-level overlap (default regex sentence boundaries; pluggable splitter).
  - `FixedTokenChunker`: fixed **token** windows with overlap (default `TiktokenTokenizer`); `Chunk.offsets` intentionally unset in v1 for this chunker.
- **Unit tests** under `packages/railtracks/tests/unit_tests/retrieval/chunking/` covering chunkers, tokenization, and cross-chunker invariants.
- **User-facing docs**: `docs/rag/components/chunking/overview.md` and `docs/rag/components/chunking/methods.md`, aligned with the existing ingestion docs layout and the Chunking entries already listed in `mkdocs.yml`.

If this branch also includes **ingestion package import fixes** (e.g. trimming broken `loaders` exports and exposing `JSONLoader`), call that out explicitly in the final PR body so reviewers see it as a small packaging fix bundled with retrieval work.

**Problem solved:** downstream RAG pieces need stable, backend-agnostic `Chunk` objects with predictable metadata and (where supported) source offsets, without ad hoc splitting logic scattered across the codebase.
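To make the sentence-window idea concrete, here is a rough sketch of fixed-size sentence windows with sentence-level overlap, carrying `document_id`, a dense `index`, and character offsets. `MiniChunk`, `sentence_windows`, and the naive boundary regex are assumptions for illustration; the PR's `SentenceChunker` may split and merge differently.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MiniChunk:
    """Hypothetical stand-in for the PR's Chunk model."""
    document_id: str
    index: int
    content: str
    offsets: Tuple[int, int]


# Naive boundary rule (an assumption): text up to a run of '.', '!' or '?'.
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def sentence_windows(text, document_id, window=3, overlap=1):
    """Group sentences into windows of `window`, sharing `overlap` sentences."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    # Collect (start, end) character spans of each sentence, whitespace-trimmed.
    spans = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        if not raw.strip():
            continue
        lead = len(raw) - len(raw.lstrip())
        trail = len(raw) - len(raw.rstrip())
        spans.append((match.start() + lead, match.end() - trail))

    chunks = []
    step = window - overlap
    for start in range(0, len(spans), step):
        group = spans[start:start + window]
        begin, end = group[0][0], group[-1][1]
        # len(chunks) keeps the index dense: 0, 1, 2, ...
        chunks.append(MiniChunk(document_id, len(chunks), text[begin:end], (begin, end)))
        if start + window >= len(spans):
            break  # the last sentence is covered; avoid a redundant tail chunk
    return chunks


text = "One. Two. Three. Four."
chunks = sentence_windows(text, "doc-1", window=2, overlap=1)
```

Because every chunk is built in one place, the invariant that `text[start:end] == content` is easy to check across all chunks.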
## Type of changes

Please check the type of change your PR introduces:

- [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
- [x] ✨ New feature (non-breaking change which adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of the code)
- [ ] ♻️ Refactoring (code change that neither fixes a bug nor adds a feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)

## Background context

Retrieval is being built out in layers: ingestion produces `Document` objects; chunking is the next step before embeddings and vector stores. This PR lands the **chunking slice** of that pipeline as an explicit module with tests and public docs, while keeping chunking imports under `railtracks.retrieval.chunking` (domain models remain on `railtracks.retrieval`).

## Checklist for Author

### Code Quality

- [x] Code follows the project's style guidelines (run `ruff check .` and `ruff format .`)
- [x] Code is commented, particularly in hard-to-understand areas

### Testing

- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Test coverage maintained

### Documentation

- [x] Documentation updated if needed (bot will verify)

### Git & PR Management

- [ ] PR title clearly describes the change

### Breaking Changes

- [ ] Breaking changes are documented
- [ ] Migration guide provided in documentation (step-by-step instructions for users to update their code/config)

---

## Final Product

**Import surface**

```python
from railtracks.retrieval import Document
from railtracks.retrieval.chunking import (
    RecursiveCharacterChunker,
    MarkdownHeaderChunker,
    SentenceChunker,
    FixedTokenChunker,
)
```

Embedding Module Summary:

- `Embedding` abstract base class:
  - `aembed`: subclasses need to implement this
  - `embed`: defaults to `asyncio.run(aembed)` if not overridden by subclasses
  - `astream_batches`: streams embedding batches of chunks for the upcoming RAG pipeline
- `SyncEmbedding`: mixin for providers that do not provide an async embedding API
- `LiteLLMEmbedding`: implements the `Embedding` base class for use with LiteLLM
  - Point of concern: LiteLLM does not directly expose embedding costs, so we are using an internal field of theirs called `_hidden_params`, which provides cost. This is potentially problematic, as it's an internal, undocumented param.
  - This may be an argument for implementing each provider's logic directly for embeddings. The current design future-proofs that decision, as `OpenAIEmbedding` can inherit from `Embedding` directly rather than from `LiteLLMEmbedding`.
- `OpenAIEmbedding`, `AzureEmbedding` and `OllamaEmbedding` child classes extending `LiteLLMEmbedding` for now
- AI-generated documentation (placeholder for now; will need to edit/verify before merging the feature branch to main)
- Unit tests
- Current data model (please provide feedback)

<img width="1323" height="861" alt="image" src="https://github.com/user-attachments/assets/f5a7af72-aab3-41b6-b778-deffdb0b5121" />

Please focus your review mainly on the `Embedding` base class and whether it makes sense from the following two perspectives:

- Extensible embedding wrapper for advanced users
- Part of an event-based ingestion/retrieval pipeline

---------

Co-authored-by: Copilot <copilot@github.com>
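For reviewers weighing the base-class design, the async-first contract with a sync default can be sketched roughly as below. All class names here are hypothetical, and running the blocking call through `asyncio.to_thread` in the sync mixin is an assumption about how such a mixin could work, not a description of the PR's `SyncEmbedding`.

```python
import asyncio
from abc import ABC, abstractmethod


class EmbeddingSketch(ABC):
    """Hypothetical async-first base: subclasses implement aembed()."""

    @abstractmethod
    async def aembed(self, texts):
        """Return one vector per input text."""

    def embed(self, texts):
        # Default sync path, as described in the summary: asyncio.run(aembed).
        # Assumes no event loop is already running in this thread.
        return asyncio.run(self.aembed(texts))


class SyncEmbeddingSketch(EmbeddingSketch):
    """Hypothetical mixin for providers that only offer a blocking API."""

    @abstractmethod
    def embed(self, texts):
        """Blocking provider call, implemented by the subclass."""

    async def aembed(self, texts):
        # Assumption: offload the blocking call so the event loop stays free.
        return await asyncio.to_thread(self.embed, texts)


class ConstantEmbedding(EmbeddingSketch):
    """Toy async provider: every text maps to the same vector."""

    async def aembed(self, texts):
        return [[1.0, 0.0] for _ in texts]


class LengthEmbedding(SyncEmbeddingSketch):
    """Toy blocking provider: the vector is just the text length."""

    def embed(self, texts):
        return [[float(len(t))] for t in texts]


async_first = ConstantEmbedding().embed(["a", "b"])
sync_first = asyncio.run(LengthEmbedding().aembed(["ab", "abcd"]))
```

Either way, callers see both `embed` and `aembed`, which is what a pipeline needs in order to treat providers interchangeably.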

## What does this add?

Adds the **Stores** layer to the RAG pipeline: the persistence and retrieval layer that sits after the embedding stage.

**Core data models**

- `StoreEntry`: the atomic unit written to and retrieved from a store. Required fields map directly from `EmbeddedChunk`; all enrichment fields (`abstract`, `summary`, `scope`, `store_category`, `entities`, `valid_from`, `valid_until`) are optional.
- `StoreEntry.from_chunk(embedded_chunk, ...)`: classmethod that builds an entry from an `EmbeddedChunk` automatically; all field mapping is handled, enrichment fields are keyword-only overrides.
- `StoreScope`: frozen dataclass for hard-filter namespacing. Any non-`None` field becomes a mandatory equality filter on every read and clear, so a query scoped to `user_id="alice"` will never return entries written under `user_id="bob"`.
- `StoreCategory`: string enum (`EPISODIC`, `SEMANTIC`, `SKILL`, `PROCEDURAL`) for classifying entry knowledge type.
- `DetailLevel`: enum controlling retrieval verbosity: `L0` returns abstract only, `L1` adds summary, `L2` returns full content (default). Enables cheap first-pass ranking before an expensive full-content pass.
- `StoreQuery` / `RetrievedStoreEntry`: query input and scored result wrappers.

**`Store` protocol**

An async `write` / `read` / `delete` / `clear` interface. Any class satisfying it can be used wherever a `Store` is expected.

**`VectorStore`**

The built-in implementation. Delegates low-level index I/O to a swappable `VectorBackend` while owning payload serialization, scope filter construction, and detail-level projection.

**Three backends**

| Backend | Persistence | Install |
|---|---|---|
| `InMemoryBackend` | Optional JSON snapshot | None |
| `ChromaBackend` | Via client mode (ephemeral / on-disk / HTTP) | `railtracks[stores-chroma]` |
| `PgvectorBackend` | Full Postgres durability | `railtracks[stores-vector]` |

All backends accept a `DistanceMetric` (`COSINE`, `L2`, `IP`). `ChromaBackend` and `PgvectorBackend` expose an async `create(...)` factory that initialises the backend in one call.

**Documentation**

Full docs added under `docs/rag/components/stores/`: an overview page covering the data model, querying, the `Store` protocol, and `VectorStore` usage, plus a backends page covering all three backends and the custom-backend recipe. Wired into `mkdocs.yml`.

**Model rename: `Memory*` → `Store*`**

All public model names were changed from `Memory`-prefixed to `Store`-prefixed (`MemoryEntry` → `StoreEntry`, `MemoryScope` → `StoreScope`, etc.) because "memory" is overloaded in the LLM agent literature (LangMem, MemGPT) and these stores are general-purpose retrieval stores, not agent memory systems.

---

## Type of changes

- [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
- [x] ✨ New feature (non-breaking change which adds functionality)
- [x] 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] 📚 Documentation update (improvements or corrections to documentation)
- [ ] 🎨 Code style/formatting (changes that do not affect the meaning of the code)
- [x] ♻️ Refactoring (code change that neither fixes a bug nor adds a feature)
- [ ] ⚡ Performance improvement (code change that improves performance)
- [x] ✅ Test update (adding missing tests or correcting existing tests)
- [ ] 🔧 Build/CI changes (changes to build process or continuous integration)
- [ ] 🗑️ Chore (other changes that don't modify src or test files)

---

## Background context

The stores layer completes the core RAG pipeline (`ingestion → chunking → embedding → stores`). It was designed backend-agnostic from the start: the `VectorBackend` protocol is the only contract between `VectorStore` and the underlying index, so swapping from `InMemoryBackend` to `PgvectorBackend` requires changing one line.

`StoreScope` was modelled as a hard filter (not a soft hint) to make multi-tenant data isolation a first-class concern. The scope fields are serialised as `scope_*` prefixed keys in the payload so they are always filterable at the backend level, not post-hoc in Python.

`DetailLevel` exists to enable multi-stage retrieval pipelines that are common in production RAG: a cheap `L0` pass over abstracts can narrow a large corpus to a short candidate list before a more expensive `L2` pass fetches full content for re-ranking.

The `Memory*` → `Store*` rename was made early to avoid a harder migration later. The term "memory" is already claimed by agent memory systems (LangMem, MemGPT, semantic/episodic agent memory); using it for a retrieval store creates confusing overlap in docs and user mental models.

---

## Checklist for Author

### Code Quality

- [x] Code follows the project's style guidelines (run `ruff check .` and `ruff format .`)
- [ ] Code is commented, particularly in hard-to-understand areas

### Testing

- [x] Tests added/updated and pass locally (`pytest tests`)
- [x] Test coverage maintained

### Documentation

- [x] Documentation updated if needed (bot will verify)

### Git & PR Management

- [x] PR title clearly describes the change

### Breaking Changes

- [x] Breaking changes are documented
- [x] Migration guide provided in documentation (step-by-step instructions for users to update their code/config)

---

## Final Product

**Creating and indexing entries from an embedding pipeline:**

```python
from railtracks.retrieval.stores import (
    VectorStore,
    StoreEntry,
    StoreScope,
    StoreCategory,
    StoreQuery,
    DetailLevel,
)
from railtracks.retrieval.stores.vector.backends import InMemoryBackend

store = VectorStore(InMemoryBackend())
scope = StoreScope(user_id="alice", session_id="s-001")

# Index
for embedded_chunk in embedded_chunks:
    entry = StoreEntry.from_chunk(
        embedded_chunk,
        scope=scope,
        store_category=StoreCategory.SEMANTIC,
        abstract=embedded_chunk.chunk.content[:100],
    )
    await store.write(entry)

# Retrieve with detail-level control
query = StoreQuery(
    text="What is the refund policy?",
    scope=scope,
    embedding=query_vector,
    top_k=5,
    detail_level=DetailLevel.L2,
)
results = await store.read(query)
for r in results:
    print(f"(rank={r.rank}, score={r.score:.3f}) {r.entry.content}")
```

**Production backend (Pgvector):**

```python
from railtracks.retrieval.stores import VectorStore, PgvectorBackend

backend = await PgvectorBackend.create(
    dsn="postgresql://user:pass@localhost/mydb",
    table="my_index",
    dim=1536,
)
store = VectorStore(backend)
```

**Persistent development backend (Chroma on-disk):**

```python
from railtracks.retrieval.stores import VectorStore, ChromaBackend

backend = ChromaBackend("my-collection", path="/data/chroma")
await backend.initialize()
store = VectorStore(backend)
```
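The hard-filter behaviour of the scope can be illustrated with a small standalone sketch. `ScopeSketch`, `to_filter`, and `matches` are hypothetical names, and the exact `scope_<field>` key format is an assumption based on the `scope_*` prefix mentioned in the description; the real `StoreScope` may carry more fields.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ScopeSketch:
    """Hypothetical stand-in for StoreScope with two of its fields."""
    user_id: Optional[str] = None
    session_id: Optional[str] = None

    def to_filter(self) -> dict:
        # Every non-None field becomes a mandatory equality filter.
        return {
            f"scope_{f.name}": getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


def matches(payload: dict, flt: dict) -> bool:
    """Flat equality check, as a backend might apply it before ranking."""
    return all(payload.get(key) == value for key, value in flt.items())


payloads = [
    {"content": "alice note", "scope_user_id": "alice", "scope_session_id": "s-001"},
    {"content": "bob note", "scope_user_id": "bob", "scope_session_id": "s-001"},
]

alice_only = ScopeSketch(user_id="alice").to_filter()
visible = [p["content"] for p in payloads if matches(p, alice_only)]
```

Since the filter is built from the scope itself, a caller cannot forget to apply it, which is the point of treating scope as a hard filter rather than a hint.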

## Summary

Adds `HuggingFaceDatasetLoader`, a streaming document loader that wraps `datasets.load_dataset` in streaming mode and yields one `Document` per row through the existing `BaseDocumentLoader.astream()` contract. Unlocks hundreds of HF Hub datasets (squad, ms_marco, trivia_qa, wiki_dpr, c4, …) for the RAG ingestion pipeline without per-dataset code.

Adds a new optional extra: `huggingface = ["datasets >= 2.14.0"]`.

Closes #1092

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [x] Docs
- [x] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`): 16 new tests, full loaders suite 98/98 green
- [x] Docs updated; new page `docs/rag/components/ingestion/huggingface.md`, overview table row, mkdocs nav entry
- [x] Breaking changes include migration notes; N/A, adds new public name + new optional extra only; no existing exports or behaviour changed

## Notes

### Acceptance criteria

- [x] Extends `BaseDocumentLoader`
- [x] Constructor accepts `dataset_name`, `split`, `content_columns: list[str]`, optional `metadata_columns` and `dataset_kwargs`
- [x] `astream()` is truly lazy: verified by a test that confirms only one row is consumed per `__anext__` call (no full-dataset buffering)
- [x] Each row yields a `Document` with `source = "{dataset_name}/{split}"` and metadata populated from `metadata_columns` + `row_index`
- [x] Raises `ValueError` if any `content_columns` are not present in the dataset schema
- [x] Unit tests use a mocked `IterableDataset`, so there are no network calls during the test suite

### Basic usage

```python
from railtracks.retrieval.loaders.huggingface_loader import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(
    dataset_name="squad",
    split="validation",
    content_columns=["question", "context"],
    metadata_columns=["title", "id"],
    content_separator="\n\n",
)

# Rows are streamed lazily from the Hub, one at a time
async for doc in loader.astream():
    # doc.source == "squad/validation"
    # doc.content == "<question>\n\n<context>"
    # doc.metadata == {"title": "...", "id": "...", "row_index": N}
    ...
```

For datasets with subsets, pinned revisions, or auth tokens, forward via `dataset_kwargs`: `HuggingFaceDatasetLoader(..., dataset_kwargs={"name": "v2.1", "revision": "abc"})`.

`streaming=True` is set by default and can be overridden through the same dict.

## Summary

Adds `BaseOCRLoader` (abstract) and `PyPDFOCRLoader` (concrete) to the retrieval module. Closes #1079.

`PyPDFOCRLoader` handles both text-based and scanned PDFs in a single loader:

- Tries `pypdf` text extraction per page first
- Pages where extraction returns empty are rasterized with `pypdfium2` and recognized via Tesseract (`pytesseract`)
- `force_ocr=True` flag OCRs every page unconditionally
- Metadata records whether each page used OCR (`metadata["ocr"]` for page strategy; `metadata["ocr_pages"]` for document strategy)

**Design decisions:**

- **pypdfium2** (Apache-2.0 / BSD-3) over Poppler (GPL-2), PyMuPDF (AGPL), and Wand/ImageMagick (transitively AGPL via Ghostscript). It was already a transitive dependency via `pdfplumber`.
- **Tesseract** as the default OCR engine: open-source, no API key, the obvious default for an OSS library. Subclassing `BaseOCRLoader` with a different engine (EasyOCR, Textract, LLM vision) is a one-method override.
- New `[ocr]` optional-dependency extra in `pyproject.toml`. Users also need the Tesseract binary on PATH; this is surfaced in the `ImportError` message.

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Test coverage:**

- **29 unit tests** (mocked, no real binaries needed in CI):
  - `BaseOCRLoader` contract + inheritance (6 tests)
  - `PyPDFOCRLoader` init / errors / text-fast-path / OCR-fallback / `force_ocr` / page strategy / document strategy / directory loading (23 tests)
- **End-to-end smoke test** in `packages/railtracks/tests/end_to_end/rag/`, excluded from regular CI per `pyproject.toml`'s existing `addopts`. Generates a synthetic image-only PDF with Pillow, runs the loader, asserts text recovery. Verified locally on Windows with Tesseract v5.5.0.20241111.
- **Manual verification** against a real scanned typed document (clean recovery) and a handwritten document (expected Tesseract limitation, intended to be addressed in future via additional `BaseOCRLoader` subclasses for handwriting-capable engines).
- Full loaders test suite: 111 passed, no regressions.

**Installation:** users need to `pip install "railtracks[ocr]"` for the Python packages, and separately install the Tesseract binary and put it on PATH. The loader's `ImportError` message points to the official Tesseract install docs.

---------

Co-authored-by: Pooria Ashrafian <61358399+Pooria90@users.noreply.github.com>
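The text-first, OCR-fallback flow described for `PyPDFOCRLoader` can be sketched without any PDF or OCR dependency by injecting the two steps as callables. `extract_pages` and its parameters are hypothetical, and treating whitespace-only extraction as "empty" is an assumption; the real loader wires in `pypdf`, `pypdfium2`, and `pytesseract` and may decide differently.

```python
def extract_pages(pages, extract_text, ocr, force_ocr=False):
    """Return one record per page, noting whether OCR was used.

    pages        -- any iterable of page objects
    extract_text -- callable(page) -> str, the cheap text-layer path
    ocr          -- callable(page) -> str, the expensive fallback
    """
    records = []
    for number, page in enumerate(pages, start=1):
        # Skip the text layer entirely when OCR is forced.
        text = "" if force_ocr else (extract_text(page) or "")
        # Assumption: whitespace-only text counts as an empty extraction.
        used_ocr = force_ocr or not text.strip()
        if used_ocr:
            text = ocr(page)
        records.append({"page": number, "content": text, "ocr": used_ocr})
    return records


# Fake pages: a dict with a text layer and a pretend OCR result.
fake_pages = [
    {"text": "typed page", "image_text": "typed page (ocr)"},
    {"text": "   ", "image_text": "scanned page"},
]

normal = extract_pages(
    fake_pages,
    extract_text=lambda p: p["text"],
    ocr=lambda p: p["image_text"],
)
forced = extract_pages(
    fake_pages,
    extract_text=lambda p: p["text"],
    ocr=lambda p: p["image_text"],
    force_ocr=True,
)
```

Injecting the OCR step as a callable mirrors the one-method-override idea: swapping engines changes only what `ocr` does, not the page loop.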

…1109)

## Summary

Adds the `RetrievalRuntime` orchestrator and the supporting `Store` / loader changes needed to drive the full ingest → retrieve flow end-to-end. Also removes the legacy `railtracks.vector_stores` package now that `railtracks.retrieval` supersedes it.

```
┌────────┐   ┌─────────┐   ┌──────────┐   ┌────────┐   ┌───────────┐
│ Loader │ → │ Chunker │ → │ Embedder │ → │ Store  │ → │ Retrieval │
└────────┘   └─────────┘   └──────────┘   └────────┘   └───────────┘
    ▲             ▲             ▲             ▲              ▲
    └─────────────┴─────────────┴─────────────┴──────────────┘
                                │
                     ┌─────────────────────┐
                     │  RetrievalRuntime   │
                     │ (the orchestrator)  │
                     └─────────────────────┘
```

### `RetrievalRuntime`

The orchestrator that wires a chunker + embedder + `Store` (+ optional `scope`) into the ingest/retrieve flow.

* **Loader is passed to `ingest()`, not the constructor**: one runtime captures *how* to process (chunker/embedder/store/scope); the loader decides *what*. A single runtime can ingest from many sources and re-ingest to update.
* **Streaming + aggregate APIs:** `ingest(loader)` is an async generator yielding per-batch events; `ingest_all(loader)` drains it and returns `IngestionStats`.
* **Events:** `BatchIngested` (carries per-batch `EmbeddingMetrics`: tokens, cost, latency, vector count), `EmbeddingFailure`, `DocumentFailed`, `DocumentSkipped`. `batch_index` is **per-document**, not run-global.
* **Upsert semantics:** before writing the *first* chunk of a document the runtime fires `store.delete_where({"document_id": str(doc.id)})` to clear the prior version. The delete only runs once a batch succeeds, so a total embedding failure preserves the previous version. Writes are per-chunk and **not transactional**: a crash mid-write leaves a partial document (recovered on the next ingest, see below).
* **Count-aware staleness (skip unchanged docs):** a document is skipped only when the store already holds a *complete* copy, matched on `source_path` + `content_hash` and the persisted `doc_chunk_count`. A partially-written document (fewer chunks than expected after an interrupted run) is re-ingested rather than left broken. Counting is done via `find()` rather than a `count()` call so the runtime depends only on the `Store` protocol.
* **Token-size guard:** when `max_tokens` is set, chunks over the per-item limit are dropped *before* embedding and surfaced as `EmbeddingFailure` instead of causing provider 4xx errors. Uses `TiktokenTokenizer` by default. (Partial fix for the embedding per-item token-cap gap; see Known limitations.)
* **Embedding-model consistency:** the model is captured from the first successful batch; a later `retrieve()` with a different embedder raises `EmbeddingModelMismatchError` (cross-model similarity scores are meaningless). Note: capture is **in-process only**; a fresh runtime over an existing store won't enforce until its first ingest.
* `on_ingest` / `on_retrieve` callbacks for logging/observability; `delete_document(id)` convenience wrapper.

### `stores` module

`Store` protocol:

* added `delete_where(filters)` and `find(filters, limit=1)` (metadata-only lookup, no vector search); both are required by the runtime's upsert/staleness paths.

`StoreEntry`:

* `vector` is now `list[float] | None`. Read results no longer round-trip the vector (was `[]`, now `None`): the backend owns the stored vector; callers must not rely on this field on retrieved entries.

`StoreQuery`:

* `scope` is now optional (`StoreScope | None`) for single-tenant callers.
* `metadata_filters` retyped `dict[str, Any]` (was `dict[str, str]`).
* removed the unused `strategies` field and the `RetrievalStrategy` enum.

`VectorStore` (base) / `VectorBackend`:

* `VectorBackend` protocol gained `list_where(filters, limit)` and `count(filters)`. **`count` lives on the backend, not `Store`**, which keeps the runtime's dependency surface to the `Store` protocol alone.
* `VectorStore` now implements `find`, `delete_where`, and `count`.
* Payload encoding spreads scalar `chunk_metadata` values to the top level (in addition to the JSON-encoded blob) so flat-equality `metadata_filters` / `find` work against them.

Backend implementations (`chroma`, `in_memory`, `pgvector`) all implement `list_where` + `count`. Plus:

* **pgvector** `_build_where` now compares JSONB-to-JSONB (`payload->$k::text = $v::jsonb`) so non-string scalars (int/bool/None) keep their JSON type instead of being stringified. Filters are parameterized; `LIMIT` is `int`-cast before interpolation. Added `pool_kwargs` passthrough to `asyncpg.create_pool` for tuning `min_size`/`max_size`/etc.
* **in_memory** `_flush` is now async: JSON encode happens under the lock, the disk write is offloaded to a thread so the event loop isn't blocked. Search now sanitizes non-finite scores (NaN/inf from a misbehaving embedder): they're logged and sorted/dropped to the end instead of corrupting the ranking.

### `loaders` module

`Document`:

* **`id` is now derived deterministically from `source` via `uuid5(NAMESPACE_URL, source)`** so re-ingesting the same source yields the same id across processes. Fixes a silent upsert bug where modified files left their prior chunks orphaned in the store, because `delete_where({"document_id": ...})` was keyed on a fresh random UUID each pass. Sourceless documents fall back to `uuid4()` (no stable identity ⇒ no upsert semantics).
* added `content_hash` (SHA-256, computed by the runtime at ingest time; loaders leave it `None`) used by staleness detection. `type` now defaults to `DocumentType.TEXT`.
* `Sanitizer` protocol for PII redaction (sync **or** async `sanitize`; errors propagate, no logic baked into the framework).
* `SanitizingLoader` wraps any `BaseDocumentLoader` + a `Sanitizer`, running every yielded document through it.

### Removals / cleanup

* Deleted the legacy `railtracks.vector_stores` package (`chroma`, `chunking/`, `filter`, `vector_store_base`) and its tests, fully superseded by `railtracks.retrieval.stores` and `railtracks.retrieval.chunking` (~7.5k lines).
* `retrieval.__init__` now exports the public surface: `RetrievalRuntime`, the ingestion event/stats types, `Store`, `StoreEntry`, `StoreQuery`, `StoreScope`, `VectorStore`, `EmbeddingFailure`, `EmbeddingModelMismatchError`.

## Type of change

- [ ] Bug fix
- [x] Feature
- [ ] Breaking change
- [ ] Docs
- [ ] Refactor / chore / build / tests

## Checklist

- [x] Lint & format pass (`ruff check . && ruff format .`)
- [x] Tests added/updated and pass locally (`pytest tests`)
- [ ] Docs updated if user-facing behavior changed
- [ ] Breaking changes include migration notes

## Notes

**Review callouts**

* Ingest upsert is not transactional (per-chunk writes); count-aware staleness is what makes an interrupted ingest self-heal on the next run.
* `_captured_model` is in-process only: model-mismatch enforcement doesn't survive a fresh runtime over a pre-populated store until its first ingest.
* pgvector `list_where` interpolates `LIMIT {int(limit)}` (int-cast, safe); all filter values stay parameterized.

**Known limitations / follow-ups**

* The `max_tokens` guard enforces a **per-item** token cap (drops oversized chunks pre-embedding); it does **not** do batch-level token-budget packing. Batches are still sized by count (`default_batch_size`), so a batch of in-spec chunks can still exceed a provider's per-request token limit (e.g. OpenAI's 8191). Worth a follow-up for token-aware batch packing.
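The deterministic id and the count-aware staleness check can be sketched with the standard library alone. The helper names and the dict-shaped entries are hypothetical; in the runtime the matching entries would come from the store's `find()` rather than an in-memory list, and the comparison rule used here (at least `doc_chunk_count` matching entries) is an assumption.

```python
import hashlib
import uuid
from typing import Optional


def document_id(source: Optional[str]) -> uuid.UUID:
    """Same source gives the same id across runs; no source gives a random id."""
    if source is None:
        return uuid.uuid4()
    return uuid.uuid5(uuid.NAMESPACE_URL, source)


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the document text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def is_complete_copy(found: list, source_path: str, digest: str) -> bool:
    """True only if every expected chunk of this exact content is present."""
    matching = [
        entry for entry in found
        if entry.get("source_path") == source_path
        and entry.get("content_hash") == digest
    ]
    if not matching:
        return False
    expected = matching[0].get("doc_chunk_count")
    # A partial write (fewer chunks than expected) must not count as up to date.
    return isinstance(expected, int) and len(matching) >= expected


digest = content_hash("hello world")
entry = {"source_path": "docs/a.md", "content_hash": digest, "doc_chunk_count": 3}

complete = is_complete_copy([entry] * 3, "docs/a.md", digest)   # all 3 chunks stored
partial = is_complete_copy([entry] * 2, "docs/a.md", digest)    # interrupted run
```

With a stable id, the delete-before-write step always targets the previous version of the same source, which is what the orphaned-chunk fix relies on.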

Dropping stale imports

…L) (#1102)

## Summary

- Adds `S3Loader` / `GCSLoader` / `AzureBlobLoader` / `SQLLoader`
- All providers use optional extras (`railtracks[aws]`, `railtracks[gcp]`, `railtracks[azure-blob]`, `railtracks[sql]`) so the core package stays lean
- All loaders expose sync and async interfaces (`load`/`aload`)
- SQL classes include a context manager (`with SQLLoader(...) as l`) and explicit `close()` for engine lifecycle management
- SQL identifier arguments are validated against a strict allowlist at construction time to prevent injection
- Full unit test coverage across all classes
- Comprehensive developer docs
- Writers are not included in this PR; they will be in a separate module

## Security hardening

- SQL table/column names validated at `__init__` time; raises `ValueError` on any metacharacter (`[A-Za-z_][A-Za-z0-9_$]*` allowlist, supports `schema.table`)
- Helpful `ValueError` when `content_column` is missing from query results (was a bare `KeyError`)
- `__repr__` on all classes exposes only non-sensitive fields (bucket/container name); credentials never appear in repr
- All `ImportError` messages include both `pip install` and `uv add` forms

## Limitations documented

- CTE (`WITH …`) queries not supported as `table_or_query`; workaround shown in docs
- `aload` methods are thread-backed (`asyncio.to_thread`), not true-async; noted in docs with guidance for high-concurrency cases

## Test plan

- [x] 127 unit tests passing across all 4 providers × loader + writer
- [x] SQL tests use real in-memory SQLite (no mocks for correctness)
- [x] Cloud tests (S3/GCS/Azure) use provider SDK mocks
- [x] Async variants covered for all classes

## Additional file

Tutorial files were added as an ipynb notebook on Colab; they are not included in this PR, per the requirement for single-location consistency.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
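The identifier allowlist quoted under security hardening can be sketched as a small validator. `validate_identifier` is a hypothetical helper name, and allowing at most one `schema.` qualifier is an assumption about what "supports `schema.table`" means; the PR's own check may be stricter or looser.

```python
import re

# One identifier part, using the allowlist quoted in the PR description.
_PART = r"[A-Za-z_][A-Za-z0-9_$]*"
# Assumption: a name is either `table` or `schema.table` (a single dot).
_IDENTIFIER = re.compile(rf"{_PART}(?:\.{_PART})?")


def validate_identifier(name):
    """Return the name unchanged if it is safe to interpolate, else raise."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name


def is_valid(name):
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_identifier(name)
    except ValueError:
        return False
    return True


checks = {
    "users": is_valid("users"),
    "public.users": is_valid("public.users"),
    "users; DROP TABLE users": is_valid("users; DROP TABLE users"),
    "1users": is_valid("1users"),
}
```

Validating at construction time matters because identifiers, unlike values, cannot be passed as bound query parameters, so they have to be checked before being placed into SQL text.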

Amir-R25 and others added 8 commits April 28, 2026 15:57

stale imports removed (#1128)

8355bfb

Dropping stale imports

Amir-R25 linked an issue May 28, 2026 that may be closed by this pull request

[Feature][Retrieval] Retrieval Module #1078

Open

Amir-R25 wants to merge 9 commits into main from feature-branch-rag


5 participants
