[Bug][QDP] Fix List<T> readers failing on nullable outer rows #1402
0lai0 wants to merge 1 commit into
    // All rows null but sample_size known from an earlier batch.
    Some(ss) => ss,
    // All rows null and sample_size unknown: skip batch.
    None => continue,
nit: In Reject mode, when the first batch is entirely null and sample_size is still unknown, None => continue returns before null_handling is consulted — so this batch is silently skipped instead of rejected. The batch ParquetReader errors on the same input, so the two readers diverge here. It's also the one Reject path without test coverage. Suggest either erroring on Reject here too, or documenting the difference explicitly.
@0lai0 thanks for the patch!
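A minimal sketch of the first option suggested above (erroring in Reject mode before the early skip). The names `plan_batch`, `BatchPlan`, and the local `NullHandling` enum are hypothetical stand-ins for illustration, not the actual QDP types, and rows are modelled as `Option<Vec<f64>>` rather than an Arrow array:

```rust
// Hypothetical stand-in for the reader's null-handling mode.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NullHandling {
    Reject,
    FillZero,
}

// Hypothetical per-batch decision.
#[derive(Debug, PartialEq)]
enum BatchPlan {
    Process { sample_size: usize },
    Skip,
}

// Decide what to do with one batch. `known` is the sample size
// established by earlier batches, if any.
fn plan_batch(
    rows: &[Option<Vec<f64>>],
    known: Option<usize>,
    mode: NullHandling,
) -> Result<BatchPlan, String> {
    // Consult the null-handling mode first, so an all-null batch
    // cannot slip past Reject through the early skip below.
    let has_null = rows.iter().any(|r| r.is_none());
    if has_null && mode == NullHandling::Reject {
        return Err("null outer row in List column (Reject mode)".to_string());
    }

    // Length of the first non-null row in this batch, if there is one.
    let first_len = rows.iter().find_map(|r| r.as_ref().map(|v| v.len()));

    match known.or(first_len) {
        Some(sample_size) => Ok(BatchPlan::Process { sample_size }),
        // All rows null and sample size still unknown: skip the batch.
        None => Ok(BatchPlan::Skip),
    }
}
```

With this ordering, a leading all-null batch is rejected in Reject mode and only skipped in FillZero mode, which would bring the streaming path in line with the batch reader's behavior described in the comment.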
    })?;

    let current_size = float_array.len();
    if list_array.is_null(i) {
I'd like to ask about the context in which a null might appear in a stream of data. We often assume the data users send is correct. We need a boundary between what we should check on the user's behalf and what we shouldn't (to strike a balance between performance and stopping early on bad data).
This PR already does that well, actually~
Related Issues
Closes #1401
Part of #1338
Changes
Why
Reading a nullable List<T> column with a null outer row (e.g. [[1, 2], null, [3, 4]]) fails with:

    InvalidInput("Inconsistent sample sizes: expected 2, got 0")

Root cause: sample-size validation calls ListArray::value_length(i) without checking is_null(i). For a null outer row, Arrow returns 0, which the reader treats as a length-0 list.

- If a non-null row was seen first → validation fails with "expected N, got 0"
- If the null row is row 0 → sample_size is seeded to 0 and corrupts all subsequent rows

This is pre-existing (not introduced by #1393) and affects three readers: ParquetReader, ParquetStreamingReader, and ArrowIPCReader.

NullHandling::FillZero does not help because the failure happens at list-length validation, before value filling runs.

How
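The failure mode and the is_null(i) guard can be sketched without Arrow at all. In the sketch below, `MiniList` is a hypothetical stand-in for a list column (offsets plus a validity mask), and `infer_sample_size` is an illustrative helper, not the code in this PR:

```rust
// Hypothetical stand-in for a list column: row i spans
// offsets[i]..offsets[i + 1], and valid[i] == false marks a null row.
struct MiniList {
    offsets: Vec<usize>,
    valid: Vec<bool>,
}

impl MiniList {
    fn len(&self) -> usize {
        self.valid.len()
    }
    fn is_null(&self, i: usize) -> bool {
        !self.valid[i]
    }
    // A null row that occupies no child values has length 0.
    fn value_length(&self, i: usize) -> usize {
        self.offsets[i + 1] - self.offsets[i]
    }
}

// Infer the common row length. With `skip_null_rows == false` this mimics
// the pre-fix validation; with `true` it applies the is_null guard.
fn infer_sample_size(list: &MiniList, skip_null_rows: bool) -> Result<Option<usize>, String> {
    let mut sample_size: Option<usize> = None;
    for i in 0..list.len() {
        if skip_null_rows && list.is_null(i) {
            continue; // null rows neither seed nor validate the size
        }
        let len = list.value_length(i);
        match sample_size {
            Some(ss) if ss != len => {
                return Err(format!(
                    "Inconsistent sample sizes: expected {}, got {}",
                    ss, len
                ));
            }
            Some(_) => {}
            None => sample_size = Some(len), // first row considered seeds the size
        }
    }
    Ok(sample_size)
}
```

For [[1, 2], null, [3, 4]] (offsets [0, 2, 2, 4], validity [true, false, true]) the unguarded call reproduces the "expected 2, got 0" error, while the guarded call yields a sample size of 2. With the null in row 0, the unguarded version seeds the size to 0 and then rejects the first real row.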
Null outer row semantics

- Reject: InvalidInput with a clear message
- FillZero: sample_size zeros once sample_size is known
- sample_size unknown: (… num_samples), e.g. a leading all-null batch before any non-null row establishes the size

Code changes

- ParquetReader (parquet.rs): is_null(i) guard in sample-size validation loop (… NullHandling)
- ParquetStreamingReader (parquet.rs): sample_size from first non-null row instead of row 0; .max(1) on sample-size divisor in read_batch
- ArrowIPCReader (arrow_ipc.rs)

Checklist