Background
daft-lance supports creating Lance indexes through Daft, but there are two different levels of support that should be tracked separately:
- Index creation: the index can be created from daft-lance.
- Distributed segment-index creation: index segments are built on workers, usually per Lance fragment or fragment group, and then committed by the coordinator through Lance segment-index APIs.
This issue tracks the second form: distributed segment-index creation.
Today, daft-lance has partial support:
- BTREE has an opt-in segmented distributed path via segmented=True.
- ZONEMAP can be created, but currently falls back to single-node Lance scalar index creation.
- INVERTED / FTS have an older distributed path based on merged index metadata, not the newer segment-index workflow.
- Vector index creation is out of scope for this scalar segment-index tracking issue.
Current Lance Dependency
Current daft-lance main / v0.4.0 depends on:
- pylance>=7.0.0
- locked version: pylance==7.0.0
Distributed scalar segment-index creation depends on Lance exposing the required uncommitted segment creation and commit APIs through its Python bindings. daft-lance should align its dependency with the Lance version that provides those public APIs for the target scalar index types.
Before implementing more scalar segment-index types, we should decide whether to bump the Lance dependency and migrate to the public APIs:
- create_index_uncommitted(..., fragment_ids=...)
- commit_existing_index_segments(...)
The existing BTREE segmented implementation should also be revisited after the dependency/API baseline is updated, so it can use public Lance APIs consistently with the new implementations.
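One way to make the dependency decision testable is to probe for the two public method names at runtime instead of hard-coding a version number. The sketch below is only an illustration: the helper name is hypothetical, it assumes the APIs are exposed as methods on the dataset object, and it checks only that the names exist and are callable, not their signatures.

```python
from __future__ import annotations

# Method names taken from this issue; where they live and how they are
# called in a given pylance release is an assumption to verify.
REQUIRED_SEGMENT_INDEX_METHODS = (
    "create_index_uncommitted",
    "commit_existing_index_segments",
)


def missing_segment_index_methods(dataset) -> list[str]:
    """Return the required method names that `dataset` does not expose as callables."""
    return [
        name
        for name in REQUIRED_SEGMENT_INDEX_METHODS
        if not callable(getattr(dataset, name, None))
    ]
```

An empty result means both names are present, so the segment-index path can be attempted; a non-empty result is a reasonable trigger for falling back to the existing build path or raising a clear error about the required pylance version.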
Support Matrix
| Index type | Index creation today | Distributed segment-index creation today | Notes |
| --- | --- | --- | --- |
| BTREE | Yes | Partially yes | Supported with segmented=True; should be migrated to the public Lance segment-index APIs. |
| BITMAP | Yes, via Lance scalar index creation | No | Good first scalar follow-up after the dependency/API foundation. |
| INVERTED | Yes | Not via segment-index | Existing distributed path uses legacy metadata merge flow, not segment-index. |
| FTS | Yes, via inverted-index style path | Not via segment-index | Should be implemented together with INVERTED if they share the same Lance segment-index path. |
| ZONEMAP | Yes | No | Existing creation path is not segment-index distributed creation. |
| Vector indexes | In progress / proposed | No | #25 is in progress, but not based on segment-index. |
Proposed Work Items
1. Dependency and API foundation
Goal:
- Bump the Lance dependency if needed.
- Introduce a shared segment-index creation path in daft-lance.
- Prefer public Lance APIs over private/internal APIs.
Tasks:
- Evaluate the minimum required pylance version.
- Likely bump from pylance>=7.0.0 to a version with public scalar segment-index APIs.
- Add a common helper for worker-side uncommitted segment creation.
- Add a common coordinator path for committing existing index segments.
- Migrate existing segmented BTREE support to the public API.
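The shared path described in these tasks could be shaped roughly as follows. This is a sketch under stated assumptions, not the daft-lance implementation: the function name is hypothetical, a thread pool stands in for Daft workers, and the two Lance calls are injected as plain callables so that no Lance signature is assumed here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def build_and_commit_segments(
    fragment_groups: Sequence[Sequence[int]],
    build_segment: Callable[[Sequence[int]], Any],
    commit_segments: Callable[[list], Any],
    max_workers: int = 4,
) -> Any:
    """Build one uncommitted segment per fragment group, then commit them all once.

    `build_segment` stands in for the worker-side uncommitted creation call and
    `commit_segments` for the coordinator-side commit; both are hypothetical hooks.
    """
    if not fragment_groups:
        raise ValueError("no fragment groups to index")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order and re-raises the first
        # worker exception, so the commit below is skipped if any build fails.
        segments = list(pool.map(build_segment, fragment_groups))
    return commit_segments(segments)
```

Keeping the commit as a single coordinator-side step after all builds succeed means a failed worker leaves no partially committed index, which is the property each index-type PR would want to preserve.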
2. BITMAP distributed segment-index creation
Goal:
- Add true distributed segment-index creation for BITMAP.
Expected behavior:
- Split Lance fragments across Daft workers.
- Build uncommitted BITMAP index segments on workers.
- Commit index segments on the coordinator.
- Add tests covering distributed execution and query correctness.
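The first step, splitting fragments across workers, can be illustrated with a small pure-Python helper. The function name and the choice of contiguous, near-equal groups are assumptions made for illustration; the actual grouping policy in daft-lance may differ.

```python
from __future__ import annotations

from typing import Iterable


def split_fragments(fragment_ids: Iterable[int], num_workers: int) -> list[list[int]]:
    """Split fragment IDs into at most `num_workers` contiguous, near-equal groups."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ids = sorted(set(fragment_ids))  # de-duplicate and fix a stable order
    group_count = min(num_workers, len(ids))
    if group_count == 0:
        return []
    base, extra = divmod(len(ids), group_count)
    groups, start = [], 0
    for i in range(group_count):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more
        groups.append(ids[start:start + size])
        start += size
    return groups
```

For example, seven fragments over three workers yields groups of sizes 3, 2, and 2, and no group is ever empty, so each worker call corresponds to exactly one segment.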
3. INVERTED / FTS distributed segment-index creation
Goal:
- Migrate INVERTED and FTS to the segment-index workflow.
Expected behavior:
- Reuse the same underlying implementation where Lance treats FTS as an inverted-index style index.
- Preserve both user-facing entry points / aliases.
- Add separate tests for INVERTED and FTS behavior.
- Avoid changing existing query semantics while replacing the build path.
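Sharing one implementation while preserving both entry points could be as simple as normalizing the user-facing name before dispatch. The sketch below is hypothetical: the mapping and function name are illustrative, and it assumes FTS resolves to the same build path as INVERTED, which this issue leaves conditional on how Lance treats it.

```python
# Hypothetical alias table: both user-facing names resolve to one build path.
_INDEX_TYPE_ALIASES = {
    "INVERTED": "INVERTED",
    "FTS": "INVERTED",
}


def canonical_index_type(index_type: str) -> str:
    """Map a user-facing index type name to the build path that implements it."""
    key = index_type.strip().upper()
    try:
        return _INDEX_TYPE_ALIASES[key]
    except KeyError:
        raise ValueError(f"unsupported index type: {index_type!r}") from None
```

Because the alias is resolved only at dispatch time, the separate INVERTED and FTS tests can still call their own entry points and assert on user-visible behavior independently.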
4. ZONEMAP distributed segment-index creation
Goal:
- Add distributed segment-index creation for ZONEMAP.
Expected behavior:
- Split Lance fragments across Daft workers.
- Build uncommitted ZONEMAP index segments on workers.
- Commit index segments on the coordinator.
- Add tests covering distributed execution and query correctness.
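A distributed-execution test can check, before the commit, that the worker-built segments cover every fragment exactly once. The helper below is a hypothetical sketch of such a check; it works on plain fragment-ID lists and assumes each segment can report which fragments it was built from.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Sequence


def check_segment_coverage(
    expected_fragment_ids: Iterable[int],
    segment_fragment_ids: Sequence[Sequence[int]],
) -> dict[str, list[int]]:
    """Report fragments that are missing, covered more than once, or unexpected."""
    expected = set(expected_fragment_ids)
    counts = Counter(fid for segment in segment_fragment_ids for fid in segment)
    covered = set(counts)
    return {
        "missing": sorted(expected - covered),
        "duplicated": sorted(fid for fid, n in counts.items() if n > 1),
        "unexpected": sorted(covered - expected),
    }
```

A test would assert that all three lists are empty for a successful build, then separately compare filtered query results with and without the index to cover query correctness.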
Proposed PR Breakdown
| Order | PR | Purpose |
| --- | --- | --- |
| 1 | Dependency/API foundation + BTREE cleanup | Upgrade the Lance dependency if needed, introduce a shared segment-index workflow, and migrate existing BTREE segmented=True support to public Lance APIs. |
| 2 | BITMAP segment-index | Add distributed segment-index creation for BITMAP. |
| 3 | INVERTED / FTS segment-index | Migrate both INVERTED and FTS to the segment-index workflow, sharing implementation where possible but keeping separate behavior coverage. |
| 4 | ZONEMAP segment-index | Add distributed segment-index creation for ZONEMAP. |
References
- ZONEMAP index creation and query optimization
- BTREE index building