Track distributed segment-index creation support for Lance index types #27

@everySympathy

Background

daft-lance supports creating Lance indexes through Daft, but there are two different levels of support that should be tracked separately:

  1. Index creation: the index can be created from daft-lance.
  2. Distributed segment-index creation: index segments are built on workers, usually per Lance fragment or fragment group, and then committed by the coordinator through Lance segment-index APIs.

This issue tracks the second form: distributed segment-index creation.
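The "per Lance fragment or fragment group" step can be illustrated with a small helper that partitions fragment IDs into one group per worker. This is only a sketch: `split_fragments` is a hypothetical name, and contiguous, near-equal groups are an assumed policy rather than what daft-lance necessarily does today.

```python
from typing import Sequence


def split_fragments(fragment_ids: Sequence[int], num_groups: int) -> list[list[int]]:
    """Split fragment IDs into at most `num_groups` contiguous, near-equal groups."""
    if num_groups < 1:
        raise ValueError("num_groups must be >= 1")
    ids = list(fragment_ids)
    # Never create more groups than there are fragments.
    num_groups = min(num_groups, len(ids)) or 1
    base, extra = divmod(len(ids), num_groups)
    groups: list[list[int]] = []
    start = 0
    for i in range(num_groups):
        # The first `extra` groups take one additional fragment.
        size = base + (1 if i < extra else 0)
        groups.append(ids[start:start + size])
        start += size
    # Drop the single empty group produced when there are no fragments.
    return [g for g in groups if g]
```

Each returned group would then be handed to one worker as the set of fragments to index.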

Today, daft-lance has partial support:

  • BTREE has an opt-in segmented distributed path via segmented=True.
  • ZONEMAP can be created, but currently falls back to single-node Lance scalar index creation.
  • INVERTED / FTS have an older distributed path based on merged index metadata, not the newer segment-index workflow.
  • Vector index creation is out of scope for this scalar segment-index tracking issue.

Current Lance Dependency

Current daft-lance main / v0.4.0 depends on:

  • pylance>=7.0.0
  • locked version: pylance==7.0.0

Distributed scalar segment-index creation depends on Lance exposing the required uncommitted segment creation and commit APIs through its Python bindings. daft-lance should align its dependency with the Lance version that provides those public APIs for the target scalar index types.

Before implementing more scalar segment-index types, we should decide whether to bump the Lance dependency and migrate to the public APIs:

  • create_index_uncommitted(..., fragment_ids=...)
  • commit_existing_index_segments(...)

The existing BTREE segmented implementation should also be revisited after the dependency/API baseline is updated, so it can use public Lance APIs consistently with the new implementations.
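A minimal sketch of the shared build-then-commit flow is shown below. It deliberately does not call Lance: `build_segment` and `commit_segments` are hypothetical callables standing in for wrappers around `create_index_uncommitted(..., fragment_ids=...)` and `commit_existing_index_segments(...)`, whose exact arguments and return types are not specified in this issue. A thread pool stands in for Daft workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def build_and_commit(
    fragment_groups: Sequence[Sequence[int]],
    build_segment: Callable[[list[int]], Any],
    commit_segments: Callable[[list[Any]], Any],
    max_workers: int = 4,
) -> Any:
    """Build one uncommitted segment per fragment group, then commit them all at once."""
    groups = [list(g) for g in fragment_groups if g]
    if not groups:
        raise ValueError("no fragments to index")
    # Worker side: each group produces one uncommitted segment.
    # If any build raises, the exception propagates and nothing is committed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        segments = list(pool.map(build_segment, groups))
    # Coordinator side: a single commit over all segments.
    return commit_segments(segments)
```

Keeping the Lance calls behind two injected callables is one way to let BTREE, BITMAP, INVERTED / FTS, and ZONEMAP share the same orchestration while the dependency baseline is still being decided.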

Support Matrix

| Index type | Index creation today | Distributed segment-index creation today | Notes |
| --- | --- | --- | --- |
| BTREE | Yes | Partially yes | Supported with segmented=True; should be migrated to the public Lance segment-index APIs. |
| BITMAP | Yes, via Lance scalar index creation | No | Good first scalar follow-up after the dependency/API foundation. |
| INVERTED | Yes | Not via segment-index | Existing distributed path uses legacy metadata merge flow, not segment-index. |
| FTS | Yes, via inverted-index style path | Not via segment-index | Should be implemented together with INVERTED if they share the same Lance segment-index path. |
| ZONEMAP | Yes | No | Existing creation path is not segment-index distributed creation. |
| Vector indexes | In progress / proposed | No | #25 is in progress, but not based on segment-index |

Proposed Work Items

1. Dependency and API foundation

Goal:

  • Bump the Lance dependency if needed.
  • Introduce a shared segment-index creation path in daft-lance.
  • Prefer public Lance APIs over private/internal APIs.

Tasks:

  • Evaluate the minimum required pylance version.
  • Likely bump from pylance>=7.0.0 to a version with public scalar segment-index APIs.
  • Add a common helper for worker-side uncommitted segment creation.
  • Add a common coordinator path for committing existing index segments.
  • Migrate existing segmented BTREE support to the public API.
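When evaluating the minimum pylance version, one option is to detect the public APIs at runtime rather than hard-coding a version number. The sketch below assumes both APIs are methods on the dataset object, which this issue does not state explicitly; `missing_segment_apis` is a hypothetical helper name.

```python
# API names are taken from this issue; their location on the dataset object is an assumption.
REQUIRED_SEGMENT_APIS = (
    "create_index_uncommitted",
    "commit_existing_index_segments",
)


def missing_segment_apis(dataset: object) -> list[str]:
    """Return the names of required segment-index APIs that `dataset` does not expose."""
    return [
        name
        for name in REQUIRED_SEGMENT_APIS
        if not callable(getattr(dataset, name, None))
    ]
```

An empty result would mean the installed Lance build exposes both calls; a non-empty result could be turned into a clear error or a fallback to single-node creation.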

2. BITMAP distributed segment-index creation

Goal:

  • Add true distributed segment-index creation for BITMAP.

Expected behavior:

  • Split Lance fragments across Daft workers.
  • Build uncommitted BITMAP index segments on workers.
  • Commit index segments on the coordinator.
  • Add tests covering distributed execution and query correctness.
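One property the distributed-execution tests could assert, for BITMAP here and equally for ZONEMAP below, is that the fragment groups handed to workers cover every fragment exactly once before the coordinator commits. The following is a sketch of such a check; `check_fragment_coverage` is a hypothetical helper, not an existing daft-lance or Lance function.

```python
from collections import Counter
from typing import Iterable, Sequence


def check_fragment_coverage(
    expected: Iterable[int], groups: Sequence[Sequence[int]]
) -> None:
    """Raise ValueError unless `groups` covers each expected fragment ID exactly once."""
    counts = Counter(fid for group in groups for fid in group)
    expected_set = set(expected)
    seen = set(counts)
    duplicated = sorted(fid for fid, n in counts.items() if n > 1)
    missing = sorted(expected_set - seen)
    unknown = sorted(seen - expected_set)
    if duplicated or missing or unknown:
        raise ValueError(
            f"bad fragment coverage: duplicated={duplicated}, "
            f"missing={missing}, unknown={unknown}"
        )
```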

3. INVERTED / FTS distributed segment-index creation

Goal:

  • Migrate INVERTED and FTS to the segment-index workflow.

Expected behavior:

  • Reuse the same underlying implementation where Lance treats FTS as an inverted-index style index.
  • Preserve both user-facing entry points / aliases.
  • Add separate tests for INVERTED and FTS behavior.
  • Avoid changing existing query semantics while replacing the build path.
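Preserving both entry points while sharing one build path could be as simple as normalizing the alias before dispatch. This sketch assumes FTS resolves to the same builder as INVERTED, which holds only if Lance treats the two identically; the mapping and the function name are hypothetical.

```python
# Assumed alias table: FTS is routed to the INVERTED build path.
_INDEX_TYPE_ALIASES = {"FTS": "INVERTED"}


def canonical_index_type(index_type: str) -> str:
    """Map a user-facing index type name to the name of the build path it uses."""
    key = index_type.strip().upper()
    return _INDEX_TYPE_ALIASES.get(key, key)
```

User-facing code keeps accepting both names, and only the internal dispatch sees the canonical one, so separate INVERTED and FTS tests can still exercise each entry point.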

4. ZONEMAP distributed segment-index creation

Goal:

  • Add distributed segment-index creation for ZONEMAP.

Expected behavior:

  • Split Lance fragments across Daft workers.
  • Build uncommitted ZONEMAP index segments on workers.
  • Commit index segments on the coordinator.
  • Add tests covering distributed execution and query correctness.

Proposed PR Breakdown

| Order | PR | Purpose |
| --- | --- | --- |
| 1 | Dependency/API foundation + BTREE cleanup | Upgrade the Lance dependency if needed, introduce a shared segment-index workflow, and migrate existing BTREE segmented=True support to public Lance APIs. |
| 2 | BITMAP segment-index | Add distributed segment-index creation for BITMAP. |
| 3 | INVERTED / FTS segment-index | Migrate both INVERTED and FTS to the segment-index workflow, sharing implementation where possible but keeping separate behavior coverage. |
| 4 | ZONEMAP segment-index | Add distributed segment-index creation for ZONEMAP. |
