Skip to content

[Bug][QDP] Fix List<T> readers failing on nullable outer rows#1402

Open
0lai0 wants to merge 1 commit into
apache:mainfrom
0lai0:fix-1401-list-null-outer-rows
Open

[Bug][QDP] Fix List<T> readers failing on nullable outer rows#1402
0lai0 wants to merge 1 commit into
apache:mainfrom
0lai0:fix-1401-list-null-outer-rows

Conversation

@0lai0

@0lai0 0lai0 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Related Issues

Closes #1401
Part of #1338

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

Reading a nullable List<T> column with a null outer row (e.g. [[1, 2], null, [3, 4]]) fails with:
InvalidInput("Inconsistent sample sizes: expected 2, got 0")
Root cause: sample-size validation calls ListArray::value_length(i) without checking is_null(i). For a null outer row, Arrow returns 0, which the reader treats as a length-0 list.

If a non-null row was seen first → validation fails with expected N, got 0
If the null row is row 0 → sample_size is seeded to 0 and corrupts all subsequent rows
This is pre-existing (not introduced by #1393) and affects three readers:

  • ParquetReader
  • ParquetStreamingReader
  • ArrowIPCReader

NullHandling::FillZero does not help because the failure happens at list-length validation, before value filling runs.

How

Null outer row semantics

Policy Behavior
Reject Return InvalidInput with a clear message
FillZero Fill sample_size zeros once sample_size is known
sample_size unknown Skip the null row (no zeros, not counted in num_samples) — e.g. leading all-null batch before any non-null row establishes size

Code changes

  1. ParquetReader (parquet.rs)

    • Add is_null(i) guard in sample-size validation loop
    • Split fast path (no null outer rows) vs slow path (per-row null handling per NullHandling)
  2. ParquetStreamingReader (parquet.rs)

    • Seed sample_size from first non-null row instead of row 0
    • Skip all-null batches when size is unknown
    • Same null-handling policy as batch reader
    • Defensive .max(1) on sample-size divisor in read_batch
  3. ArrowIPCReader (arrow_ipc.rs)

    • Two-phase List handling
    • Phase 1: validate size from non-null rows only
    • Phase 2: collect data with the same null policy

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes

// All rows null but sample_size known from an earlier batch.
Some(ss) => ss,
// All rows null and sample_size unknown: skip batch.
None => continue,

@rich7420 rich7420 Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In Reject mode, when the first batch is entirely null and sample_size is still unknown, None => continue returns before null_handling is consulted — so this batch is silently skipped instead of rejected. The batch ParquetReader errors on the same input, so the two readers diverge here. It's also the one Reject path without test coverage. Suggest either erroring on Reject here too, or documenting the difference explicitly.

@rich7420

Copy link
Copy Markdown
Contributor

@0lai0 thanks for the patch!

})?;

let current_size = float_array.len();
if list_array.is_null(i) {

@ryankert01 ryankert01 Jun 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to ask about the context of a possible null might exists in a stream of data? We often assume the data user sent are correct. We need a boundry of what we should check and what we shouldn't check for user. (to strike a balance between performance and early stop when handling error data)

this pr already do well actually~

@ryankert01 ryankert01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] List<T> reader fails with "Inconsistent sample sizes" on nullable outer rows

3 participants