Skip to content

[GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision in Spark-33#12261

Open
Xtpacz wants to merge 3 commits into
apache:mainfrom
Xtpacz:fix-decimal-pb
Open

[GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision in Spark-33#12261
Xtpacz wants to merge 3 commits into
apache:mainfrom
Xtpacz:fix-decimal-pb

Conversation

@Xtpacz

@Xtpacz Xtpacz commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Fix: #12260

What changes are proposed in this pull request?

CheckOverflowTransformer reads original.child.dataType to decide whether to insert a cast. For BinaryArithmetic, Spark's .dataType returns left.dataType rather than the arithmetic result type. After child transformers apply rescale optimizations, the actual output type may differ from the Spark-declared type, and the cast is wrongly skipped.

The resulting substrait plan has decimal types that mismatch function signatures. Velox's SimpleFunction validation rejects it, and ColumnarPartialProjectRule falls the entire Project back to JVM. Result is correct (via fallback) but native acceleration is lost.

Reproducer

CREATE TABLE t1 (val BIGINT) USING parquet;
CREATE TABLE t2 (val BIGINT) USING parquet;
INSERT INTO t1 VALUES (200);
INSERT INTO t2 VALUES (100), (100), (100), (100), (100);

SELECT
    a.val,
    (a.val - COALESCE(SUM(b.val), 0) / 5.0)
        / (COALESCE(SUM(b.val), 0) / 5.0) AS growth_rate
FROM t1 a CROSS JOIN t2 b
GROUP BY a.val;

Root cause:


this read the Spark expression's declared type instead of the transformer's actual output type.

Fix

- original.child.dataType,
+ child.dataType,

How was this patch tested?

Before fix — Project falls back to JVM:

== Final Plan ==
* Project (17)                                         ← JVM, codegen id=3
+- VeloxColumnarToRow (16)                             ← extra C2R conversion
   +- ^ RegularHashAggregateExecTransformer (14)
      +- ^ VeloxBroadcastNestedLoopJoinExecTransformer (13)
         :- ^ InputIteratorTransformer (7)
         :  +- BroadcastQueryStage (5)
         :     +- ColumnarBroadcastExchange (4)
         :        +- RowToVeloxColumnar (3)
         :           +- * ColumnarToRow (2)
         :              +- BatchScan (1)
         +- ^ InputIteratorTransformer (12)
            +- RowToVeloxColumnar (10)
               +- * ColumnarToRow (9)
                  +- BatchScan (8)

After fix — Project runs natively in Velox:

== Final Plan ==
VeloxColumnarToRow (17)
+- ^ ProjectExecTransformer (15)                       ← native Velox Project
   +- ^ RegularHashAggregateExecTransformer (14)
      +- ^ VeloxBroadcastNestedLoopJoinExecTransformer (13)
         :- ^ InputIteratorTransformer (7)
         :  +- BroadcastQueryStage (5)
         :     +- ColumnarBroadcastExchange (4)
         :        +- RowToVeloxColumnar (3)
         :           +- * ColumnarToRow (2)
         :              +- BatchScan (1)
         +- ^ InputIteratorTransformer (12)
            +- RowToVeloxColumnar (10)
               +- * ColumnarToRow (9)
                  +- BatchScan (8)

Key differences:

  • Node (15) changes from Project (JVM, * = codegen) to ProjectExecTransformer (Velox native, ^ = transformer)
  • VeloxColumnarToRow moves from before Project (forced conversion to feed JVM) to after Project (deferred until output)
  • Aggregate→Project pipeline stays in Velox without breaking

@github-actions github-actions Bot added the CORE works for Gluten Core label Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Xtpacz

Xtpacz commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@zhouyuan @philo-he could you have a look at the PR?

@Xtpacz

Xtpacz commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@JkSelf @wForget could you have a look at the PR?

@wForget

wForget commented Jun 10, 2026

Copy link
Copy Markdown
Member

I executed the reproducer sqls you provided, but it doesn't seem to reproduce the issue. Is there something I'm missing?

image

spark version: 3.5.1
gluten version: 1.5.0

For BinaryArithmetic, Spark's .dataType returns left.dataType rather than the arithmetic result type.

Furthermore, this description seems incorrect; BinaryArithmetic.dataType should be the result type.

@philo-he philo-he left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me. Could you rebase the code to check whether the CI failure goes away?

@Xtpacz

Xtpacz commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@wForget Thanks for the correction!
My environment is Spark 3.3.4 + Gluten 1.6.0
|image

@Xtpacz

Xtpacz commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

The changes look good to me. Could you rebase the code to check whether the CI failure goes away?

I will rebase. Thanks!

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@wForget

wForget commented Jun 10, 2026

Copy link
Copy Markdown
Member

@Xtpacz The execution plan in your screenshot does not appear to match the reproducer sql (the reproducer sql does not have an id_count column).

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Xtpacz

Xtpacz commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@wForget Sorry for the confusion — I habitually desensitized the column names, which caused the mismatch with the screenshot. It doesn't affect the actual result though. Let me rerun with the exact reproducer SQL and update here.

INSERT INTO tb_sink
SELECT
    a.val,
    (a.val - COALESCE(SUM(b.val), 0) / 5.0)
        / (COALESCE(SUM(b.val), 0) / 5.0) AS growth_rate
FROM t1 a CROSS JOIN t2 b
GROUP BY a.val;
image

@wForget

wForget commented Jun 10, 2026

Copy link
Copy Markdown
Member

I suspect it might be related to apache/spark#36698, cc @ulysses-you Could you please take a look?

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zhouyuan

Copy link
Copy Markdown
Member

@Xtpacz
Starting from spark-34, the data type check logic is changed, it will conduct the right type based on the child decimal datatype:
https://github.com/apache/spark/blob/branch-3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L221
it's actually a optimization to skip the cast in some cases.

So it looks like the issue only exists in spark-33. if i understand it correctly, there is a fallback in your test, would you please help to check the fallback in the driver log? There should be some log like "Fallback due to xxx"

@Xtpacz

Xtpacz commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@zhouyuan Thanks for you review. It is confirmed that this issue only exists in spark-33, as you mentioned. Here is the fallback log from driver on spark-33:

INFO org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project[QueryId=4], due to: 
 - Native validation failed: 
   |- Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1450 function:validate, thrown from file:ExprCompiler.cpp line:311 function:compileCall, reason:Found incompatible return types for 'divide' (DECIMAL(38, 9) vs. DECIMAL(38, 10)) for input types (DECIMAL(29, 6), DECIMAL(27, 6)).

This confirms that CheckOverflowTransformer use original.child.dataType (which is left.dataType in spark-33's BinaryArithmetic), so causing the precision mismatch.

@philo-he philo-he changed the title [VL] Fix CheckOverflowTransformer using wrong child type for cast decision [GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision Jun 11, 2026
@zhouyuan zhouyuan changed the title [GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision [GLUTEN-12260][VL] Fix CheckOverflowTransformer using wrong child type for cast decision in Spark-33 Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[VL] Project with decimal arithmetic falls back to JVM when aggregate result is divided by integer-valued decimal literal

4 participants