[GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision in Spark-33 #12261
Xtpacz wants to merge 3 commits into
Run Gluten Clickhouse CI on x86
philo-he left a comment:
The changes look good to me. Could you rebase the code to check whether the CI failure goes away?
@wForget Thanks for the correction!
I will rebase. Thanks!
@Xtpacz The execution plan in your screenshot does not appear to match the reproducer SQL (the reproducer SQL does not have an id_count column).
@wForget Sorry for the confusion. Out of habit I anonymized the column names, which caused the mismatch with the screenshot; it does not affect the actual result. Let me rerun with the exact reproducer SQL and update here.
INSERT INTO tb_sink
SELECT
  a.val,
  (a.val - COALESCE(SUM(b.val), 0) / 5.0)
    / (COALESCE(SUM(b.val), 0) / 5.0) AS growth_rate
FROM t1 a CROSS JOIN t2 b
GROUP BY a.val;
I suspect it might be related to apache/spark#36698. cc @ulysses-you, could you please take a look?
@Xtpacz So it looks like the issue only exists in spark-33. If I understand correctly, there is a fallback in your test; could you check the fallback reason in the driver log? There should be a log line like "Fallback due to xxx".
@zhouyuan Thanks for your review. I confirmed that this issue only exists in spark-33, as you mentioned. Here is the fallback log from the driver on spark-33:
This confirms that CheckOverflowTransformer uses original.child.dataType (which is left.dataType in spark-33's BinaryArithmetic), causing the precision mismatch.
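The point that BinaryArithmetic.dataType is left.dataType in Spark 3.3 can be illustrated with a small model of that version's decimal division typing. In Spark 3.3 the DecimalPrecision rule promotes both operands to a common wider decimal type and wraps the division in CheckOverflow carrying the result type, so the Divide node itself reports the promoted left type. The sketch below is dependency-free Scala; the names Dec, widerType and divideResultType are hypothetical (not Spark APIs), both operands are assumed to be decimals, and the precision-loss adjustment Spark applies above precision 38 is deliberately not modeled.

```scala
// Hypothetical, dependency-free model of Spark 3.3 decimal typing for e1 / e2.
// Assumes both operands are decimals; the precision-loss adjustment Spark
// applies when precision would exceed 38 is not modeled.
final case class Dec(precision: Int, scale: Int) {
  override def toString: String = s"decimal($precision,$scale)"
}

object DecimalDivideModel {
  // Common type both operands are promoted to before the division.
  def widerType(a: Dec, b: Dec): Dec = {
    val scale = math.max(a.scale, b.scale)
    val intDigits = math.max(a.precision - a.scale, b.precision - b.scale)
    Dec(intDigits + scale, scale)
  }

  // Result type carried by the CheckOverflow wrapper:
  // scale = max(6, s1 + p2 + 1), precision = p1 - s1 + s2 + scale.
  def divideResultType(a: Dec, b: Dec): Dec = {
    val scale = math.max(6, a.scale + b.precision + 1)
    val precision = a.precision - a.scale + b.scale + scale
    require(precision <= 38, "precision-loss adjustment is not modeled")
    Dec(precision, scale)
  }

  def main(args: Array[String]): Unit = {
    val left = Dec(10, 2)
    val right = Dec(10, 2)
    // Divide.dataType is left.dataType, i.e. the promoted (wider) type.
    println(s"Divide.dataType:        ${widerType(left, right)}")
    println(s"CheckOverflow.dataType: ${divideResultType(left, right)}")
  }
}
```

For two decimal(10,2) operands this prints decimal(10,2) for the Divide node and decimal(23,13) for CheckOverflow, so the child's declared type says little about the type the arithmetic actually produces.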



Fix: #12260
What changes are proposed in this pull request?
CheckOverflowTransformer reads original.child.dataType to decide whether to insert a cast. In Spark 3.3, BinaryArithmetic.dataType returns left.dataType rather than the arithmetic result type. After the child transformers apply rescale optimizations, the actual output type may differ from the Spark-declared type, so the cast is wrongly skipped. The resulting Substrait plan has decimal types that do not match the function signatures, Velox's SimpleFunction validation rejects it, and ColumnarPartialProjectRule falls the entire Project back to the JVM. The result is still correct (via the fallback), but native acceleration is lost.
Reproducer
Root cause:
gluten/gluten-substrait/src/main/scala/org/apache/gluten/expression/UnaryExpressionTransformer.scala
Line 90 in fc90a79
This line reads the Spark expression's declared type instead of the child transformer's actual output type.
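The cast decision itself reduces to a type comparison. The following is a hypothetical Scala model, not Gluten's actual code: DecType, ChildInfo, needsCastBuggy and needsCastFixed are illustrative names, and it assumes a cast is skipped exactly when the compared child type equals the CheckOverflow target type. The example types are made up to show a case where the Spark-declared child type happens to match the target while the child transformer, after rescaling, emits a different type.

```scala
// Hypothetical model of a CheckOverflow-style cast decision (not Gluten code).
final case class DecType(precision: Int, scale: Int)

// Hypothetical: the type Spark declares for the child expression versus the
// type the child transformer actually emits after rescale optimizations.
final case class ChildInfo(sparkDeclared: DecType, transformerOutput: DecType)

object CastDecisionModel {
  // Buggy variant: decides based on the Spark-declared child type.
  def needsCastBuggy(target: DecType, child: ChildInfo): Boolean =
    child.sparkDeclared != target

  // Fixed variant: decides based on the transformer's actual output type.
  def needsCastFixed(target: DecType, child: ChildInfo): Boolean =
    child.transformerOutput != target

  def main(args: Array[String]): Unit = {
    val target = DecType(23, 13) // made-up CheckOverflow result type
    // Made-up case: declared type matches the target, actual output does not.
    val child = ChildInfo(sparkDeclared = DecType(23, 13), transformerOutput = DecType(21, 11))
    println(s"buggy variant inserts cast: ${needsCastBuggy(target, child)}") // false
    println(s"fixed variant inserts cast: ${needsCastFixed(target, child)}") // true
  }
}
```

With these inputs the buggy variant skips the cast and the fixed variant inserts it, which mirrors the signature mismatch that makes Velox reject the expression.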
Fix
How was this patch tested?
Before the fix, Project falls back to the JVM:
After the fix, Project runs natively in Velox:
Key differences:
- Project (JVM, * = codegen) becomes ProjectExecTransformer (Velox native, ^ = transformer).
- VeloxColumnarToRow moves from before the Project (a forced conversion to feed the JVM) to after the Project (deferred until output).