Skip to content

fix: access logical dataset#7

Open
ejdam87 wants to merge 3 commits into
masterfrom
fix/dataset-index
Open

fix: access logical dataset#7
ejdam87 wants to merge 3 commits into
masterfrom
fix/dataset-index

Conversation

@ejdam87

@ejdam87 ejdam87 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

This PR replaces usage of underlying table for logical dataset.

Summary by CodeRabbit

  • Bug Fixes
    • Improved slide tile indexing so filtered or selected datasets return the correct tiles.
    • Fixed a mismatch that could cause outdated row offsets when working with already-filtered data.
    • Clarified internal row-index handling to better support dataset operations.

@ejdam87 ejdam87 requested a review from Adames4 June 30, 2026 15:12
@ejdam87 ejdam87 self-assigned this Jun 30, 2026
@ejdam87 ejdam87 requested review from a team June 30, 2026 15:12
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@ejdam87, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 48 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f3d35ab-980b-4fb0-aedb-0642252abf97

📥 Commits

Reviewing files that changed from the base of the PR and between d0fab0c and ea66d2c.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • rationai/mlkit/data/datasets/slides_tiles_loader.py
📝 Walkthrough

Walkthrough

In SlidesTilesLoader._build_tile_index, slide_ids derivation and row count are changed from direct PyArrow table access (tiles.data.column("slide_id"), tiles.data.num_rows) to the HuggingFace dataset logical interface (pa.array(tiles["slide_id"]), len(tiles)). Comments are updated to reflect that this respects prior .filter()/.select() calls.

Changes

Tile Index Logical Interface Fix

Layer / File(s) Summary
HF dataset logical access in _build_tile_index
rationai/mlkit/data/datasets/slides_tiles_loader.py
slide_ids now uses pa.array(tiles["slide_id"]) and num_rows uses len(tiles) instead of direct PyArrow table column access; comments updated to describe logical row indices and filter-safe semantics.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • RationAI/mlkit#5: Introduced the original SlidesTilesLoader._build_tile_index implementation that this PR directly patches to use the HF dataset logical interface.

Suggested reviewers

  • matejpekar
  • vejtek
  • JakubPekar

🐇 A bunny once hopped through a dataset of tiles,
But the PyArrow shortcut caused index-shaped wiles.
"Use the HF interface!" the rabbit declared,
"So filters and selects are properly shared!"
Now logical rows hop in logical styles. 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and clearly matches the main change: switching dataset access to the logical dataset interface.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dataset-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies _build_tile_index in slides_tiles_loader.py to read slide_id through the Hugging Face Dataset's logical interface, ensuring that any active filters or selections are respected. The reviewer noted that using pa.array(tiles["slide_id"]) causes performance overhead by converting the column to a Python list first, and recommended using tiles.with_format("arrow")["slide_id"] to retrieve the PyArrow array directly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread rationai/mlkit/data/datasets/slides_tiles_loader.py Outdated
@matejpekar matejpekar requested a review from JakubPekar June 30, 2026 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants