hybrid semantic search: embeddings + KNN + RRF fusion (#1)#33
Open
arreyder wants to merge 7 commits into
Open
Conversation
Adds dense-vector semantic search blended with the existing lexical ranking, so conceptually-related memories surface even with no shared terms. - internal/embed: pluggable Embedder; Ollama /api/embeddings impl; FromEnv (EMBED_URL/EMBED_MODEL/EMBED_DIM). Disabled no-op when unconfigured → lexical-only, never fails. - schema: knn_vector_768 DenseVectorField (cosine) + `embedding` field (stored=false so vectors never bloat responses; indexed for KNN). - store/bulk_store: embed title+content on write. update: re-embed only when content/title change (tag/importance updates skip it), fetching the missing half so the vector reflects both. - search: when enabled, embed the query, KNN alongside lexical edismax, fuse with reciprocal rank fusion (k=60). Over-fetch fusionK then trim to limit. semantic=false opts out; start>0 (pagination) forces lexical-only. Any embed/KNN error degrades to lexical-only. - client.KNNQuery (POSTs the vector to dodge URL-length limits) + formatVector. - cmd/solr-mem-backfill: one-shot re-embed of existing memories (idempotent). Tests: embedder (request/parse/dim, disabled), fuseResponses (RRF order, dedup, semantic-only inclusion, limit, nil). build/vet/test/gofmt clean. Deploy: schema reload (docker cp + cores RELOAD) + EMBED_* env on the server + run backfill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t 6000) 16/596 memories failed backfill with 'input length exceeds context length' (nomic-embed-text ~2048 tokens). Truncate embed input by runes; title+head carries the semantic signal. Configurable via EMBED_MAX_CHARS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /select handler defaults to edismax, which doesn't honor the leading
{!knn} parser switch — it tokenized the 768-float vector literal across qf
fields, exceeding maxClauseCount (1024) -> 500. Send defType=lucene (+ drop
facet/hl) so the knn parser is used.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exical count) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ric retrieval nomic-embed-text is trained with task prefixes; without them query/doc vectors are misaligned and precision suffers. Split Embed into EmbedDocument/EmbedQuery; prefix stored text with 'search_document: ' and queries with 'search_query: ' (auto for nomic models, overridable via EMBED_DOC_PREFIX/EMBED_QUERY_PREFIX). Requires re-backfill so stored vectors carry the document prefix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
Author
|
Follow-up #1 (nomic task prefixes) is now included in this PR (commit Live quality check after prefixes:
|
…ng1024 field Dimension change can't happen in place (Lucene forbids mixed vector dims in a field). Add embedding1024 (knn_vector_1024) and point store/update/backfill/KNN at it; the old 768 'embedding' field goes empty/vestigial (no delete-all, zero risk to existing data). Set EMBED_MODEL=mxbai-embed-large, EMBED_DIM=1024. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pus too hard mxbai-embed-large (1024d) benchmarks higher but its 512-token window forced ~1200-char truncation (vs nomic's 6000) — embedding ~1/5 of each memory, with no conceptual-precision gain. nomic-embed-text (2048 tok) fits our content far better. Point store/update/backfill/KNN back at the 768 'embedding' field; embedding1024 stays defined-but-vestigial (kept so its data doesn't break reload). Env reverts to nomic/768/6000. Real conceptual-precision lever is reranking or chunked embeddings, not a bigger model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Roadmap item #1: dense-vector semantic search blended with the existing lexical ranking, so conceptually-related memories surface even with no shared terms.
What it does
internal/embed: pluggableEmbedder; Ollama/api/embeddingsimpl;FromEnv(EMBED_URL/EMBED_MODEL/EMBED_DIM, default nomic-embed-text/768). Disabled no-op when unconfigured → lexical-only, never fails. Input truncated toEMBED_MAX_CHARS(default 6000) to stay under the model's context window.knn_vector_768DenseVectorField (cosine) +embeddingfield (stored=falseso vectors never bloat responses; indexed for KNN).store/bulk_storeembed title+content;updatere-embeds only when content/title change (tag/importance updates skip it), fetching the missing half so the vector reflects both.semantic=falseopts out;start>0(pagination) forces lexical-only. Any embed/KNN error degrades to lexical-only.client.KNNQuery: POSTs the vector (dodges URL-length limits) and forcesdefType=lucene— the/selecthandler defaults to edismax, which doesn't honor the leading{!knn}parser-switch and would tokenize the 768-float literal acrossqffields, blowingmaxClauseCount.cmd/solr-mem-backfill: one-shot re-embed of existing memories (idempotent).Verified live (crr-mini0)
596/596 memories embedded. A query with zero lexical matches (
semantic=false→ 0 results) returns conceptually-related memories via vector similarity (semantic=true→ hits). Mechanism confirmed end-to-end.Tests
embedder (request/parse/dim/truncate/disabled), fuseResponses (RRF order, dedup, semantic-only inclusion, limit, nil). build/vet/test/gofmt clean.
Deploy notes (gotchas hit)
127.0.0.1(EMBED_URL=http://127.0.0.1:11434). Tunnel is a systemd service on pax99.docker cp managed-schema.xmlinto the live core +cores?action=RELOAD(classic read-only schema factory).Follow-ups (relevance tuning — not in this PR)
search_document:to stored text andsearch_query:to queries — nomic-embed-text is trained for this and precision is noticeably better with it. Needs a doc-vs-query distinction in the Embedder + a re-backfill. Highest-value next step.mxbai-embed-large(1024d) for higher quality.🤖 Generated with Claude Code