[NGSOK-1703] Pull changes from v4.2.0-rc1#39
Draft
giggsoff wants to merge 263 commits into
Draft
Conversation
…nResolution=true for lazy column validation test
### What changes were proposed in this pull request?
Pin `spark.sql.analyzer.strictDataFrameColumnResolution=true` around the body of the `lazy column validation` test in `DataFrameSuite`. The config is set via `spark.conf.set/unset` rather than `withSQLConf` because the lazy SQLConf entry trips `withSQLConf`'s `isModifiable` check on the Connect server.
### Why are the changes needed?
The test asserts that `df4.schema` throws `AnalysisException` for `df1("x")` when `df1` does not contain `x`. This holds only under strict plan-id-based resolution; if the name-based fallback path is enabled, `df1("x")` resolves to `df2.x` from the join output and `df4.schema` succeeds. Today this works because `STRICT_DATAFRAME_COLUMN_RESOLUTION` defaults to `true`, but the test should not silently rely on that default; pinning it makes the assumption explicit and keeps the test robust against future default changes or environments where the default is overridden.
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
Existing `DataFrameSuite."lazy column validation"`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7
Closes apache#55604 from zhengruifeng/SPARK-strict-df-col-resolution-test-pin.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…oc log noise
### What changes were proposed in this pull request?
Three related changes that fix the most common CI debug pain on this project: a single root cause cascading into many red checks, each surfacing only "Process completed with exit code 1" with no file:line.
**1. Structural decoupling — style is no longer attached to compile.**
Today, scalastyle runs as a side effect of `(Compile / compile)` in SBT and as the default `<execution>` of `scalastyle-maven-plugin` bound to the Maven `verify` phase. A single style violation in `core` aborts the compile of `core` and every transitive dependent, so every CI job that recompiles those sources fails — Build modules, Documentation generation, Java 17/25 Maven build, sparkr, Docker integration tests, TPC-DS — each with no file/line in its annotation. The dedicated lint job is just one of seven jobs that fail; nothing tells the user it's the root cause.
This PR removes the coupling at both build systems:
- `project/SparkBuild.scala`: drop the `(Compile / compile) := { scalaStyleOnCompile.value; … }` (and matching `(Test / compile)`) hooks from `enableScalaStyle`. The standalone `scalaStyleOnCompile` / `scalaStyleOnTest` task definitions remain so `dev/lint-scala` → `dev/scalastyle` → `sbt scalastyle test:scalastyle` continues to work; only the compile-time auto-invocation is removed. `NOLINT_ON_COMPILE` env-var gating is dropped (no-op once the hook is gone).
- `pom.xml`: remove the default `<execution>` from `scalastyle-maven-plugin` and add an opt-in `scalastyle` profile that re-binds `check`. Default Maven builds (`mvn install`, `mvn package`, etc.) no longer trigger scalastyle; activate the profile (`mvn ... -Pscalastyle`) or invoke the goal directly (`mvn scalastyle:check`) to run it.
- `.github/workflows/build_and_test.yml`: remove the now-no-op `NOLINT_ON_COMPILE` env vars from Build modules / pyspark / sparkr / lint / docs jobs.
After this change, scalastyle runs in exactly one place in CI: the dedicated lint job (via SBT through `dev/lint-scala`). A style violation fails ONE check; every other CI job stays green because they don't see the violation. No `needs: lint` chaining is needed — the cascade is gone at the source. Style coverage is unchanged (the lint job still scans every Scala file with the same config).
**2. Surface scalastyle violations as inline PR annotations.**
Even with the cascade removed, the lint job's natural output is verbose sbt output where the actual file:line of a violation is buried. `dev/scalastyle` now parses scalastyle's output and re-emits each violation as a GitHub Actions `::error file=...,line=...,title=Scalastyle::...` annotation when running under CI. The violation appears inline on the PR's "Files changed" tab — no log scrolling needed.
Two output formats are handled:
- scalastyle's native console writer: `error file=<path> message=<text> line=<n> [column=<n>]`, used by the explicit `scalastyle` / `test:scalastyle` tasks.
- sbt's logger format: `[error] <path>:<line>: <message>`, used when `Tasks.doScalastyle` is invoked through sbt's logger (the format we actually see in CI today).
The two regex branches are precise enough to skip near-miss lines: sbt's "task failed" summaries, exception traces, and regular Scala compile errors with their `:LINE:COL:` triple-colon shape are correctly not matched.
**3. Strip genjavadoc-stub diagnostic noise from the CI log.**
`docs/_plugins/build_api_docs.rb` wraps the `build/sbt unidoc` invocation with a small inline state machine that drops genjavadoc-stub `[error]` blocks from CI stdout. javadoc emits ~3500 such lines per unidoc run (`cannot find symbol` / `illegal combination of modifiers` / `non-static type variable` / `X is not public in org.apache.spark.Y; cannot be accessed from outside package` on stub `T1, T2, ...` type variables and stub references to package-private Scala types) — all benign because `--ignore-source-errors` is set, but they bury everything else. Each diagnostic is a header line followed by 3–5 `[error|warn]`-prefixed continuation lines (snippet, caret, symbol/location); the state machine drops both.
The filter matches by **message text**, not just by `target/java/` path, so legitimate doclint diagnostics on stub paths (e.g. a heading-out-of-sequence in a Scala doc comment that genjavadoc preserves into the stub) survive. Real `src/main/java/` diagnostics are untouched.
Note: this PR previously also experimented with `-Xdoclint:html` and a custom doclet to surface the per-file diagnostic when unidoc fails. That work was dropped after PR apache#55581 (apache#55581) identified the real root cause: javadoc's default `-Xmaxerrs 100` caps error reporting at 100, so the diagnostics we wanted never reached our interception layer at all. PR apache#55581's `-Xmaxerrs`/`-Xmaxwarns` bump and `-verbose` addition is the right fix and is complementary to the log filter shipped here.
### Why are the changes needed?
A single scalastyle violation in catalyst recently caused 7 red checks at once, each surfacing only a generic "exit code 1" annotation; finding the actual file:line required grepping through a multi-megabyte job log. Decoupling collapses the style cascade to one red check; the scalastyle annotation makes that check actionable on the PR's Files-changed tab; the genjavadoc-stub log filter cuts ~3500 lines of inert noise from every unidoc run so the rest of the doc-gen output stays scannable.
### Does this PR introduce _any_ user-facing change?
No SQL or runtime changes. Two developer-facing changes:
- `mvn install` (or any Maven invocation that hits the `verify` phase) no longer runs scalastyle by default. Activate the `scalastyle` profile (`mvn ... -Pscalastyle`) or invoke `mvn scalastyle:check` explicitly. The lint job in CI continues to enforce style.
- `sbt compile` no longer triggers scalastyle. Run `sbt scalastyle test:scalastyle` (or `dev/lint-scala`) explicitly to check style locally; CI still enforces it.
### How was this patch tested?
- **Annotation parsers** were unit-checked locally against representative real CI samples for both scalastyle output formats (native + sbt logger). Near-miss lines that must NOT match (sbt task-failed summaries, exception traces, compile errors with the `:LINE:COL:` shape) are correctly skipped.
- **End-to-end annotation rendering** confirmed on an earlier CI run by planting a deliberate scalastyle violation on this branch (later reverted): the violation surfaced as an inline annotation on the PR's "Files changed" tab pointing at the right file:line.
- **Stub-line filter** unit-checked locally against representative samples for the four known-benign genjavadoc structural error patterns plus two doclint diagnostics on stub paths plus three non-stub control cases — all classify correctly. End-to-end verified by inspecting captured CI logs from earlier failed unidoc runs on this branch.
- **Style coverage unchanged**: the lint job continues to run `sbt scalastyle test:scalastyle` exactly as before; the same scalastyle config (`scalastyle-config.xml`) governs which violations fire.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Anthropic)
Closes apache#55563 from cloud-fan/ci-error-clarity.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…G_PANDAS_UDF ### What changes were proposed in this pull request? Add ASV micro-benchmarks for `SQL_WINDOW_AGG_PANDAS_UDF` to `python/benchmarks/bench_eval_type.py`, mirroring the existing `SQL_WINDOW_AGG_ARROW_UDF` benchmarks added in [SPARK-56120](https://issues.apache.org/jira/browse/SPARK-56120). The new mixin `_WindowAggPandasBenchMixin` reuses the same scenario shapes and UDF set as the Arrow variant (sum, mean_multi) and writes the worker payload with `runner_conf={"window_bound_types": "unbounded"}`. ### Why are the changes needed? Part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724). Establishes performance baselines before refactoring this eval type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `./python/asv check --python=same` passes. Local run with `-a repeat=3`: ```text bench_eval_type.WindowAggPandasUDFTimeBench.time_worker ================ ============ ================ -- udf ---------------- ----------------------------- scenario sum_udf mean_multi_udf ================ ============ ================ few_groups_sm 50.3+/-0.2ms 57.2+/-2ms few_groups_lg 98.9+/-0.9ms 109+/-2ms many_groups_sm 1.84+/-0s 2.10+/-0.04s many_groups_lg 540+/-1ms 639+/-30ms wide_cols 515+/-20ms 515+/-0.7ms ================ ============ ================ bench_eval_type.WindowAggPandasUDFPeakmemBench.peakmem_worker ================ ========= ================ -- udf ---------------- -------------------------- scenario sum_udf mean_multi_udf ================ ========= ================ few_groups_sm 472M 469M few_groups_lg 536M 532M many_groups_sm 502M 500M many_groups_lg 607M 605M wide_cols 588M 585M ================ ========= ================ ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#55603 from Yicong-Huang/SPARK-56658/bench/window-agg-pandas. Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…AP_PANDAS_ITER_UDF
### What changes were proposed in this pull request?
Adds ASV microbenchmark `GroupedMapPandasIterUDF{Time,Peakmem}Bench` for `SQL_GROUPED_MAP_PANDAS_ITER_UDF` in `python/benchmarks/bench_eval_type.py`.
### Why are the changes needed?
Establish a baseline before refactoring `SQL_GROUPED_MAP_PANDAS_ITER_UDF` (subtask of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724)).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`COLUMNS=120 asv run --bench GroupedMapPandasIterUDF --python=same -a repeat=3`
```text
[75.00%] ··· bench_eval_type.GroupedMapPandasIterUDFPeakmemBench.peakmem_worker ok
[75.00%] ··· ================= ============== ========== ==================
-- udf
----------------- --------------------------------------------
scenario identity_udf sort_udf key_identity_udf
================= ============== ========== ==================
sm_grp_few_col 477M 478M 471M
sm_grp_many_col 484M 485M 484M
lg_grp_few_col 634M 635M 550M
lg_grp_many_col 767M 768M 765M
mixed_types 467M 467M 467M
================= ============== ========== ==================
[100.00%] ··· bench_eval_type.GroupedMapPandasIterUDFTimeBench.time_worker ok
[100.00%] ··· ================= ============== ============ ==================
-- udf
----------------- ----------------------------------------------
scenario identity_udf sort_udf key_identity_udf
================= ============== ============ ==================
sm_grp_few_col 438±4ms 495±5ms 412±4ms
sm_grp_many_col 357±10ms 374±70ms 346±2ms
lg_grp_few_col 790±8ms 1.01±0.05s 707±10ms
lg_grp_many_col 932±20ms 1.02±0s 939±10ms
mixed_types 429±2ms 467±2ms 396±2ms
================= ============== ============ ==================
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#55600 from Yicong-Huang/SPARK-56649/bench/grouped-map-pandas-iter.
Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…S_ITER_UDF
### What changes were proposed in this pull request?
Adds ASV microbenchmark coverage for `SQL_MAP_PANDAS_ITER_UDF` (eval type backing `DataFrame.mapInPandas`) in `python/benchmarks/bench_eval_type.py`: `MapPandasIterUDFTimeBench` and `MapPandasIterUDFPeakmemBench`.
### Why are the changes needed?
Baseline coverage under SPARK-55724 before the upcoming `read_udfs()` consolidation refactor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`COLUMNS=120 asv run --bench MapPandasIter --python=same -a repeat=3`, two stable runs locally.
```text
=================== ============== =========== ============
-- udf
------------------- ---------------------------------------
scenario identity_udf sort_udf filter_udf
=================== ============== =========== ============
sm_batch_few_col 293+/-20ms 339+/-30ms 328+/-10ms
sm_batch_many_col 199+/-9ms 217+/-10ms 199+/-1ms
lg_batch_few_col 652+/-10ms 796+/-6ms 721+/-20ms
lg_batch_many_col 747+/-7ms 1.10+/-0.3s 781+/-30ms
pure_ints 137+/-10ms 154+/-2ms 143+/-1ms
pure_floats 133+/-4ms 152+/-3ms 142+/-5ms
pure_strings 590+/-9ms 719+/-6ms 597+/-10ms
pure_ts 201+/-6ms 223+/-4ms 211+/-5ms
mixed_types 380+/-3ms 446+/-4ms 407+/-6ms
=================== ============== =========== ============
```
```text
=================== ============== ========== ============
-- udf
------------------- --------------------------------------
scenario identity_udf sort_udf filter_udf
=================== ============== ========== ============
sm_batch_few_col 473M 474M 473M
sm_batch_many_col 478M 478M 477M
lg_batch_few_col 522M 523M 522M
lg_batch_many_col 565M 566M 567M
pure_ints 484M 485M 484M
pure_floats 497M 498M 497M
pure_strings 511M 512M 511M
pure_ts 501M 501M 501M
mixed_types 494M 496M 494M
=================== ============== ========== ============
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#55601 from Yicong-Huang/SPARK-56657/bench/map-pandas-iter.
Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…s as local callable ### What changes were proposed in this pull request? Adds the engine-side foundation for the language-agnostic UDF framework introduced by [SPARK-55278](https://issues.apache.org/jira/browse/SPARK-55278). Lives under `udf/worker/core` with no dependency beyond `spark-tags` and the existing `udf/worker/proto`. Public API in `org.apache.spark.udf.worker.core`: - `WorkerDispatcher` — owns workers for one `UDFWorkerSpecification`; hands callers `WorkerSession`s. - `WorkerSession` — one UDF execution. `init` (exactly once) and `process` (at most once, after init) are `final` with AtomicBoolean guards; subclasses implement `doInit` / `doProcess`. `InitMessage` is a placeholder until the UDF wire protocol lands. - `WorkerConnection` — transport channel; one server-side endpoint shared by all sessions. - `UnixSocketWorkerConnection` — UDS subclass; owns the socket path and removes the file on close. - `WorkerSecurityScope`, `WorkerLogger` — pool-partitioning identity and minimal logging surface (no SLF4J dep). Direct-creation impl in `.direct`: - `DirectWorkerDispatcher` (abstract, transport-agnostic) — process spawn, environment lifecycle (`installation` / `environment_verification` / `environment_cleanup`), worker registry, race-safe close. Five protected hooks for transport plug-in: `newEndpointAddress`, `waitForReady`, `cleanupEndpointAddress`, `closeTransport`, `validateTransportSupport`. - `DirectUnixSocketWorkerDispatcher` (abstract) — UDS implementation of the hooks (private 0700 socket dir, file-existence polling, file deletion). - `DirectWorkerProcess` + `WorkerArtifacts` (process + connection + output log) — close sends SIGTERM, waits `gracefulTimeoutMs`, then artifacts handle connection close + SIGKILL+reap + file cleanup. - `DirectWorkerException` / `DirectWorkerTimeoutException` — runtime-failure types; the timeout subclass lets callers discriminate retry policy. Hardening: - CAS-idempotent `close()`; acquire-before-publish + post-publish `closed` re-check makes `createSession` racing with `close` either throw or get a session whose worker is reaped. - SIGKILL paths use a bounded-wait reap helper so the kernel actually reaps the child; stuck zombies WARN with a context tag. - Proto-provided init/graceful timeouts are clamped to a 30s engine maximum (WARN on clamp). - All install failures (timeout or non-zero exit) transition the environment to `Failed` permanently. - UDS socket directory is created 0700; non-POSIX fallback WARNs if the platform refuses the perms. Proto: - `UDFWorkerProperties.connection` is singular (one server-side endpoint shared by sessions). - `WorkerConnectionSpec` proto message renamed to disambiguate from the core Scala `WorkerConnection`. Notably out of scope (TODOs in code): connection pooling, security-scope partitioning, indirect creation (daemon), TCP transport (a sibling subclass implementing the same five hooks), retriable-vs-permanent failure classification, shared shutdown hook across dispatchers. ### Why are the changes needed? This is the Spark-independent foundation the rest of the language-agnostic UDF work will build on. The dispatcher is already factored to support transports beyond UDS. ### Does this PR introduce _any_ user-facing change? No. The module is not wired into any planner/executor path; classes are `Experimental`. ### How was this patch tested? `DirectWorkerDispatcherSuite` (31 tests) covering: worker spawn / SIGTERM / SIGKILL escalation; concurrent createSession (distinct processes and socket paths); the close-vs-in-flight-createSession race; createSession-after-close rejection; socket-dir 0700 + cleanup; init / callable / graceful timeout caps and behaviour; environment lifecycle (skip-install-on-verify, install-once-under-concurrency, cached failure, install-timeout permanence, cleanup on close); spec validation (missing/non-UDS connection, verify-without-install, securityScope). build/mvn test -pl udf/worker/core -am ### Was this patch authored or co-authored using generative AI tooling? Yes. Closes apache#55406 from haiyangsun-db/SPARK-56412. Authored-by: Haiyang Sun <haiyang.sun@databricks.com> Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…raint output ### What changes were proposed in this pull request? `BaseConstraint.toDescription()` now always emits `RELY|NORELY`, matching `toDDL()` used by SHOW CREATE TABLE. Previously, `toDescription()` skipped the `NORELY` token when `rely=false`, so the same constraint rendered differently across the two surfaces. After this change, the invariant holds: `toDDL` is `"CONSTRAINT " + name + " " + toDescription`. ### Why are the changes needed? After SPARK-52141 (DESCRIBE EXTENDED) and SPARK-52142 (SHOW CREATE TABLE) landed, the two surfaces render the same constraint asymmetrically: | State | DESCRIBE EXTENDED (before) | SHOW CREATE TABLE | |--------------|-----------------------------------|-------------------------------------| | PK default | PRIMARY KEY (a) NOT ENFORCED | PRIMARY KEY (a) NOT ENFORCED NORELY | | CHECK default| CHECK (...) ENFORCED | CHECK (...) ENFORCED NORELY | The original review on PR apache#51577 considered skipping both `NOT ENFORCED` and `NORELY` when default, but only `NORELY` was skipped because `ENFORCED`'s default is per-subclass and not visible to `BaseConstraint`. The result is a confusing experience: SHOW CREATE TABLE reports the rely state explicitly while DESCRIBE EXTENDED hides it for the default value. ### Does this PR introduce any user-facing change? Slight output change in DESCRIBE EXTENDED command output. DESCRIBE EXTENDED now emits `NORELY` for constraints whose `rely` value is the default. PRIMARY KEY / UNIQUE / FOREIGN KEY / CHECK constraints all show the full `ENFORCED|NOT ENFORCED RELY|NORELY` suffix, matching SHOW CREATE TABLE. ### How was this patch tested? Updated golden strings in the v2 `DescribeTableSuite` "desc table constraints" test. Ran: build/sbt 'sql/testOnly \ org.apache.spark.sql.execution.command.v2.DescribeTableSuite \ org.apache.spark.sql.execution.command.v2.ShowCreateTableSuite' 41 tests pass. ### Was this patch authored or co-authored using generative AI tooling? Claude Opus 4.7 Closes apache#55590 from yyanyy/desc-extended-rely-norely. Authored-by: yyanyy <yyanyyyy@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…mixins ### What changes were proposed in this pull request? Three PySpark test mixin classes still inherit explicitly from `object`: - `python/pyspark/sql/tests/test_udf.py` — `BaseUDFTestsMixin(object)` - `python/pyspark/sql/tests/test_resources.py` — `ResourceProfileTestsMixin(object)` - `python/pyspark/sql/tests/arrow/test_arrow_map.py` — `MapInArrowTestsMixin(object)` This PR drops the explicit `(object)` base, matching the modern `class Foo:` form used by the other ~50 mixin classes in `pyspark.sql.tests`. ### Why are the changes needed? `class Foo(object):` is a Python 2 idiom required to opt into new-style classes. Under Python 3 — which is the only version Spark supports — every class implicitly inherits from `object`, so the `(object)` base is dead code. Three mixins still carry the legacy form while the rest of the test suite has standardized on the modern form; aligning them keeps the codebase consistent and removes a small piece of cargo-culted boilerplate that occasionally gets copied into new mixins. ### Does this PR introduce _any_ user-facing change? No. Test-only change with no behavioral impact. ### How was this patch tested? Existing tests. `class Foo(object):` and `class Foo:` produce identical classes under Python 3 (same MRO, same `__bases__`, same `isinstance` results). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.7) Closes apache#55608 from zhengruifeng/drop-mixin-object-base. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…ark test classes ### What changes were proposed in this pull request? Seven concrete PySpark test classes declare the framework base class before the test mixin (e.g. `class X(ReusedSQLTestCase, XTestsMixin)`), while the rest of the codebase consistently puts the mixin first. This PR reorders the parent classes in those 7 files to follow the project convention (mixin first, base class last): - `python/pyspark/sql/tests/test_utils.py` — `UtilsTests` - `python/pyspark/sql/tests/test_errors.py` — `ErrorsTests` - `python/pyspark/sql/tests/test_functions.py` — `FunctionsTests` - `python/pyspark/sql/tests/pandas/test_pandas_map.py` — `MapInPandasTests` - `python/pyspark/sql/tests/connect/client/test_artifact.py` — `ArtifactTests` - `python/pyspark/sql/tests/connect/client/test_artifact_localcluster.py` — `LocalClusterArtifactTests` - `python/pyspark/sql/tests/connect/test_utils.py` — `ConnectUtilsTests` ### Why are the changes needed? Python resolves attribute and method lookups along the **Method Resolution Order (MRO)** — the linear sequence computed via C3 linearization from the parent list. The first class in the MRO that defines the requested attribute wins, and `super()` delegates to the *next* class in the MRO (not the syntactic parent). The order of parents in the class declaration directly drives the MRO: ```python class X(ReusedSQLTestCase, XTestsMixin): ... # MRO: X -> ReusedSQLTestCase -> ... -> XTestsMixin -> object # `setUp` defined on the mixin would be SHADOWED by ReusedSQLTestCase.setUp. class X(XTestsMixin, ReusedSQLTestCase): ... # MRO: X -> XTestsMixin -> ReusedSQLTestCase -> ... -> object # `setUp` defined on the mixin takes precedence, then super() chains # into ReusedSQLTestCase.setUp as expected. ``` Today none of the affected mixins override `setUp` / `tearDown` / `setUpClass` / `tearDownClass`, so MRO resolves to the same target either way and the current behavior is identical. The inconsistency is a latent footgun: any future lifecycle hook added to the mixin would be silently shadowed in these 7 classes only, while the same hook would work correctly in the ~50 sibling test classes that follow the convention. That class of bug surfaces only at runtime in the affected suites and is easy to miss during review. Standardizing the parent ordering makes the convention enforceable by inspection. ### Does this PR introduce _any_ user-facing change? No. This is a test-only change with no behavioral impact today. ### How was this patch tested? Existing tests. The change is a parent-class reorder that does not affect MRO resolution for any currently overridden method. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.7) Closes apache#55607 from zhengruifeng/fix-mixin-inheritance-order. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request? Currently, DSv2 is lacking the required abstractions and machinery for allowing transactionality in DML operations. First, this PR introduces the public Java interfaces that connectors need to implement. These are the following: `TransactionInfo` — carries the transaction metadata. `Transaction` — represents a transaction. Exposes catalog(), commit(), abort(), and close(). `TransactionalCatalogPlugin` — extends CatalogPlugin with beginTransaction(TransactionInfo) method. Second, it expands Spark to manage the Transaction lifecycle. This is done as follows: - Pre-analysis. At pre-analysis, we look at the plan for logical nodes that implement the `TransactionalWrite` trait. When a plan contains such a node we initiate a transaction. Commands that do not result in execution, e.g. EXPLAIN, we should never initiate a transaction. - Analysis. We create a copy of the analyzer that contains the `TransactionAwareCatalogManager` instance. This is a `CatalogManager` that is aware of the current transaction and intercepts catalog(name) and currentCatalog lookups by returning the transaction catalog instead of the default session catalog. All catalog look ups within the query go through the `TransactionAwareCatalogManager`. - Planning. At planning we need to propagate the transaction to the execution nodes. This is necessary so we can commit the transaction before the relation cache is refreshed. This process occurs post planning where the plan is fully formed. We walk the plan and inject the transaction callback to each `TransactionalExec` node. - Execution. `V2ExistingTableWriteExec` nodes invoke the transaction commit right after the write operation and before the relation cache is refreshed. - Abort/close. Every QE operation, i.e. analyzed, commandExecuted, optimizedPlan, sparkPlan, executedPlan etc., is wrapped with executeWithTransactionContext closure. This ensures that if an issue occurs at any point throughout the QE, the transition is aborted and closed. Note, since we can have multiple QueryExecution instantiations for the same operation, the abort/close operations need to be idempotent. Due the nested structure of query execution, we might end up triggering nested abort/close calls. This PR is based on aokolnychyi's prototype. ### Why are the changes needed? We are currently lacking the required abstractions and machinery for DSv2 connectors to implement transactions in write operations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Introduced new suites and added tests in existing suites. ### Was this patch authored or co-authored using generative AI tooling? Claude Sonnet 4.6. Closes apache#55278 from andreaschat-db/dsv2TransactionApi5. Authored-by: Andreas Chatzistergiou <andreas.chatzistergiou@databricks.com> Signed-off-by: Anton Okolnychyi <aokolnychyi@apache.org>
… create v2 for new format ### What changes were proposed in this pull request? Revert some of the changes to profile result. Now we keep the old profile result format and create a completely new one. Create a new accumulator ID for the new format. The worker will produce both format and send them to different accumulators. The old client will consume the old format and the new client will consume the new format. ### Why are the changes needed? To keep the old client working with the new server. Also make it possible in the future to fully migrate to a better format. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#55524 from gaogaotiantian/fix-profile-result. Authored-by: Tian Gao <gaogaotiantian@hotmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
….test_session ### What changes were proposed in this pull request? Four related changes to `python/pyspark/sql/tests/connect/test_session.py`: 1. Add the standard `if __name__ == "__main__"` block (matching the convention used by every other `test_*.py` module in the same directory). It is the only `test_*.py` file under `python/pyspark/**/tests/` missing this block. 2. Fix four tests that called `session.stop()` against a fake remote (`sc://other`). `stop()` invokes `client.release_session()`, which makes a real gRPC `ReleaseSession` call and retries until exhaustion. The fix sets `release_session_on_close = False` before `stop()`, matching the same intent as `test_active_session_expires_when_client_closes` / `test_default_session_expires_when_client_closes`, which set `_client._closed = True` for similar reasons. 3. Add a `setUp` that resets `RemoteSparkSession._default_session` and `_active_session.session` to `None`, so tests are order-independent. `_set_default_and_active_session` only sets state when it is currently `None`, and several tests intentionally leak state via the `_client._closed = True` workaround. 4. Replace `RemoteSparkSession.getDefaultSession()` with `RemoteSparkSession._get_default_session()` in `test_default_session_expires_when_client_closes` (2 occurrences). `getDefaultSession` does not exist on either `pyspark.sql.connect.session.SparkSession` or `pyspark.sql.session.SparkSession`; the actual classmethod is `_get_default_session`. ### Why are the changes needed? The test runner invokes each module via `bin/pyspark <module>` → `python -m <module>`. Without a `__main__` block, that just imports the module's class definitions and exits — **no tests ran**. CI was a silent no-op for this file. Adding the `__main__` block exposed three pre-existing latent bugs (timeouts on `stop()`, an undefined-method call, and order-dependent state), each of which has been broken since the corresponding test was added. All four changes must land together for the file to actually contribute coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pre-existing tests in this file. The `__main__` block is byte-identical to the pattern used by sibling files (e.g., `test_connect_basic.py`). All test fixes only affect test files; production `stop()` semantics are unchanged (`release_session_on_close` is a public attribute on `SparkSession`, defaulting to `True`). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-7) Closes apache#55560 from zhengruifeng/pyspark-test-session-main-block. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
… for reused DataFrame in natural join ### What changes were proposed in this pull request? Fix an `AMBIGUOUS_COLUMN_REFERENCE` regression introduced by SPARK-55070 when a DataFrame is referenced both directly in a join and also nested under a natural/USING join elsewhere in the same plan. Introduce a `Candidate(expr, depth, hidden)` case class. Both `resolveDataFrameColumnByPlanId` and `resolveDataFrameColumnRecursively` now return `(Seq[Candidate], Boolean)` — they collect candidates rather than merging eagerly. The walk latches `hidden = h || r.references.subsetOf(AttributeSet(p.metadataOutput))` at each ancestor; the filter then keeps candidates whose references are visible in `p.output ++ p.metadataOutput`. The merge is **lifted from per-recursion-level to the top of `resolveDataFrameColumn`**. Two effects: 1. After the walk returns the full candidate list, partition by `hidden`: - if any regular (`hidden = false`) candidate exists, run the merge over regulars only and ignore hidden ones (e.g. the natural/USING-join hidden key); - otherwise run the same merge over hidden candidates (preserves SPARK-55070's `rhs["join_key"]` case where the only matching candidate is hidden). 2. Subtree-internal collisions no longer throw early — a depth-0 candidate from a different subtree can rescue a case the upstream foldLeft would have thrown on. The depth-0 direct-match tiebreaker in the final `foldLeft` is preserved. ### Why are the changes needed? SPARK-55070 broadened the ancestor filter in `resolveDataFrameColumnRecursively` from `p.outputSet` to `p.output ++ p.metadataOutput` so that `rhs["join_key"]` works after a natural/USING join (where one join key is hidden in `Project.hiddenOutputTag`). But when the same DataFrame is both used directly in a join and also nested under a natural/USING-join wrapper elsewhere in the plan, the broadened filter lets both candidates through `resolveDataFrameColumnByPlanId`'s merge, tripping `throw ambiguousColumnReferences(u)`. For example, queries like: ```python enriched = events.join(dim, "dim_id", "left") # USING join hides dim's dim_id result = (fact .join(dim, fact["fk"] == dim["dim_id"], "left") # direct use of dim .join(enriched, "txn_id", "full_outer") .select(dim["dim_id"])) # previously AMBIGUOUS ``` now resolve `dim["dim_id"]` to the direct-usage output candidate. ### Does this PR introduce _any_ user-facing change? Yes — bug fix. Queries that referenced a DataFrame both directly in a join and nested under a natural/USING join (where the wrapper hides one of the columns into `metadataOutput`) previously raised `AMBIGUOUS_COLUMN_REFERENCE`. They now resolve to the direct-usage candidate. ### How was this patch tested? - New `test_select_regular_column_with_reused_dataframe_hidden_in_natural_join` added to `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`. The events data is constructed so the USING-wrapper's hidden `dim_id` would be NULL on one row while the direct-join `dim_id` is non-null, pinning resolution to the direct candidate (not just verifying the throw is gone). - Manually verified `pyspark.sql.tests.test_column` (classic) and `pyspark.sql.tests.connect.test_parity_column` (Connect, both `ColumnParityTests` and `ColumnParityTestsWithNonStrictDFColResolution`): the new test passes; pre-existing `test_column_date_time_op` errors with `UNSUPPORTED_TIME_TYPE` are unrelated to this change. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (model: claude-opus-4-7) Closes apache#55582 from zhengruifeng/metadata-col-resolution-single-pass. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…atch CDC reads ### What changes were proposed in this pull request? This PR adds the `netChanges` deduplication mode to the `ResolveChangelogTable` analyzer rule (SPARK-55668 / apache#55508 ). When a CDC read sets `deduplicationMode = 'netChanges'`, intermediate changes per row identity are collapsed into a single net effect, per the [SPIP](https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?tab=t.0#heading=h.m1700lw4wsoj) `Deduplication Semantics` in B.8. #### Net change collapse rules (per SPIP) | Change Sequence | Net Result | |-----------------------|-----------------------------------------| | INSERT → DELETE | No row (cancels out) | | INSERT → UPDATE(s) | Final insert only | | UPDATE(s) | First pre-image + final post-image | | UPDATE(s) → DELETE | delete with original values | #### Implementation: 2x2 matrix on `(existedBefore, existsAfter)` The four SPIP rules map onto a 2x2 matrix that the implementation evaluates per `rowId` partition: - `existedBefore` is `true` iff the partition's first event is `delete` or `update_preimage`. - `existsAfter` is `true` iff the partition's last event is `insert` or `update_postimage`. These two booleans are sufficient to reproduce the SPIP rules above, because the SPIP only cares about whether the row existed at the boundaries of the version range — never about the intermediate events. | existedBefore | existsAfter | output | |---------------|-------------|-------------------------------------| | false | false | (cancel) | | false | true | insert | | true | false | delete | | true | true | update_preimage + update_postimage | If `computeUpdates = false`, the `update_preimage` + `update_postimage` pair is emitted as `delete` + `insert` instead. Pipeline: `Window` (per-`rowId` aggregates: row number, row count, first/last `_change_type`) → `Filter` (keep first and/or last row per partition) → `Project` (relabel `_change_type`, drop helper columns). ### Why are the changes needed? This completes the net-change post-processing capability of the DSv2 CDC API per the SPIP. Without it, connectors that surface intermediate changes cannot expose a deduplicated change feed to users via the standard CDC API. ### Does this PR introduce _any_ user-facing change? Yes. Requesting `deduplicationMode = 'netChanges'` on a CDC read now produces a deduplicated change stream. Previously the same request was rejected up-front. ### How was this patch tested? Added `ResolveChangelogTableNetChangesSuite` — a trait + 2 concrete suite classes (`...WithComputeUpdatesSuite`, `...WithoutComputeUpdatesSuite`) running the same 16-test body under both modes (32 invocations total). Coverage: - 4 single-event tests (lone insert/delete across various range shapes). - 9 matrix tests covering all (first_change_type, last_change_type) cells. - 1 range-narrowing test (events outside the requested version range are not seen). - 2 multi-rowId tests (independent partitions, mixed mode-dependent cells). Removed 2 obsolete tests in `ResolveChangelogTablePostProcessingSuite` that asserted the previous "not supported" rejection. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-7) Closes apache#55583 from SanJSp/SPARK-55953-net-changes. Lead-authored-by: Sandro Sp <sandrospeh@gmail.com> Co-authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>
… operations ### What changes were proposed in this pull request? This PR implement group filtering for WriteDelta row level operations. ### Why are the changes needed? These changes are needed to close the gap in WriteDelta plans. ### Does this PR introduce _any_ user-facing change? Changes are backward compatible. ### How was this patch tested? This PR comes with tests. ### Was this patch authored or co-authored using generative AI tooling? Claude Code v2.1.123. Closes apache#55612 from aokolnychyi/spark-56669. Authored-by: Anton Okolnychyi <aokolnychyi@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…mands ### What changes were proposed in this pull request? Follow-up to [SPARK-52729](https://issues.apache.org/jira/browse/SPARK-52729) (which added `MetadataOnlyTable` and CREATE / ALTER VIEW … AS support for DS v2 catalogs). This PR closes out the *Remaining work* section of that PR's description, plus a few API/test cleanups. The branch is structured as **three commits**: **1. `[SPARK-56655][SQL] Rename v2 metadata-table API: MetadataOnlyTable, RelationCatalog, loadRelation`** — mechanical rename of the v2 view-API surface introduced by the parent PR: | Before | After | |---------------------|--------------------| | `MetadataOnlyTable` | `MetadataTable` | | `RelationCatalog` | `TableViewCatalog` | | `loadRelation()` | `loadTableOrView()` | | `DataSourceV2MetadataOnly{Table,View}Suite` | `DataSourceV2Metadata{Table,View}Suite` | The unrelated v2 helpers `CatalogV2Util.loadRelation` and `RelationResolution`'s private `loadRelation(V2TableReference)` predate the parent PR and are intentionally not renamed. **2. `[SPARK-56655][SQL] Implement remaining v2 view DDL and inspection commands`** — the production-side work: *New view DDL execs* (`AlterV2ViewExec.scala`): - `AlterV2ViewSetPropertiesExec`, `AlterV2ViewUnsetPropertiesExec` — merge / drop user TBLPROPERTIES on the existing view; dispatch to `ViewCatalog#replaceView`. - `AlterV2ViewSchemaBindingExec` — rewrites the schema-binding mode field; dispatch to `replaceView`. - `RenameV2ViewExec` — dispatches to a new abstract `ViewCatalog#renameView(Identifier, Identifier)` (added to `ViewCatalog`, contracted on `TableViewCatalog`). - A shared `V2ViewMetadataMutation.builderFrom(existing)` helper seeds a `ViewInfo.Builder` from the existing view so mutating execs override only the field they're changing. *New view inspection execs* (`V2ViewInspectionExecs.scala`): - `ShowCreateV2ViewExec` — reconstructs the `CREATE VIEW …` DDL from `ViewInfo`. - `ShowV2ViewPropertiesExec`, `ShowV2ViewColumnsExec`, `DescribeV2ViewExec`, `DescribeV2ViewColumnExec` — produce output rows directly from `ViewInfo`. `DESCRIBE TABLE EXTENDED` emits a v2-native `# Detailed View Information` block. *v1-parity for `SHOW TABLES` on a `TableViewCatalog`* (`ShowTablesExec.scala`): when the catalog is a `TableViewCatalog`, route through `listRelationSummaries` so views appear alongside tables. Pure `TableCatalog` catalogs continue to use `listTables` and return tables only. *All `UNSUPPORTED_FEATURE.TABLE_OPERATION` pins* from the parent PR for these commands are replaced with real strategy cases. *Architecture — single typed payload for resolved views*. To match the v2-table convention (where `ResolvedTable.table: Table` carries the resolved Table — and v1 tables appear as `V1Table` wrapping a `CatalogTable`), `ResolvedPersistentView` now carries `info: ViewInfo`: ```scala case class ResolvedPersistentView( catalog: CatalogPlugin, identifier: Identifier, info: ViewInfo) // v2 ViewInfo for v2 views; V1ViewInfo wrapping // CatalogTable for session-catalog (v1) views ``` A new `private[sql] V1ViewInfo(v1Table: CatalogTable) extends ViewInfo` exposes a v1 `CatalogTable` through the v2 `ViewInfo` surface, mirroring `V1Table` for `Table`. `ViewInfo`'s constructor relaxed from `private` to `protected` so the subclass can call it. v1-only paths (DescribeTableCommand via `ResolvedChildHelper`, the v1 DescribeRelation JSON path, ApplyDefaultCollation's AlterViewAs rewrite, CreateTableLike strategy case) recover the original `CatalogTable` via `info.asInstanceOf[V1ViewInfo].v1Table`. v2 strategy cases consume `rpv.info` directly. `ResolvedPersistentView.output` now exposes the view's schema attributes (with char/varchar normalization), so `DescribeColumn` against a v2 view survives `CheckAnalysis` — the column resolves naturally through `ResolveReferences`, and the strategy / v1 rewrite both extract `nameParts` from the resolved attribute. No new logical plan needed for DESCRIBE COLUMN; the existing one flows to the planner intact. Also folded into this commit: - `ALTER TABLE <view> RENAME TO …` is rejected with `EXPECT_TABLE_NOT_VIEW.USE_ALTER_VIEW` on the v2 path (mirrors v1 `DDLUtils.verifyAlterTableType`). **3. `[SPARK-56655][SQL][TESTS] Add per-catalog view command test triplets and fold in late prod tweaks`** — test scaffolding: Mirror the DROP TABLE test layout from `sql/core/test/.../command/{,v1/,v2/}` for every v2 view DDL / inspection command. Each command lands as: - `command/<Cmd>SuiteBase.scala` — unified tests parameterized by `$catalog` - `command/v1/<Cmd>Suite.scala` — extends Base + v1 `ViewCommandSuiteBase` (pins `$catalog` to `spark_catalog`, so the unified tests target the session catalog) - `command/v2/<Cmd>Suite.scala` — extends Base + v2 `ViewCommandSuiteBase` (pins `$catalog` to a fresh `test_view_catalog` backed by a new general-purpose `InMemoryTableViewCatalog` test fixture), plus catalog-state assertions specific to the v2 fixture Triplets cover: CREATE VIEW, ALTER VIEW … AS, ALTER VIEW SET / UNSET TBLPROPERTIES, ALTER VIEW RENAME TO, ALTER VIEW WITH SCHEMA, SHOW CREATE TABLE, SHOW TBLPROPERTIES, SHOW COLUMNS, DESCRIBE TABLE, DESCRIBE TABLE … COLUMN, DROP VIEW, SHOW VIEWS. Each Base test runs against both `spark_catalog` (v1, hits the existing v1 commands) and `test_view_catalog` (v2, hits the new execs from commit #2), giving a single source of cross-catalog behavioral parity. The pre-existing `DataSourceV2MetadataViewSuite` is trimmed: CREATE / ALTER / DROP / SHOW VIEW DDL coverage moves to the per-catalog triplets. What remains in the leaf suite is genuinely v2-specific structural coverage (view read-path, V1Table.toCatalogTable round-trip, pure-ViewCatalog read + ALTER, multi-level-namespace cyclic detection / error rendering, REFRESH / ANALYZE rejection, SHOW TABLES on a `TableViewCatalog` returning both kinds). ### Why are the changes needed? The parent PR shipped CREATE VIEW + ALTER VIEW … AS through the v2 surface but pinned the rest of the view DDL/inspection family with `UNSUPPORTED_FEATURE.TABLE_OPERATION`. Third-party v2 catalogs that host views still couldn't run `ALTER VIEW SET TBLPROPERTIES`, `ALTER VIEW … RENAME TO`, `ALTER VIEW … WITH SCHEMA <mode>`, `DESCRIBE`, `SHOW CREATE TABLE`, `SHOW TBLPROPERTIES`, `SHOW COLUMNS` against their views — full v1 parity for non-session view catalogs requires this set. The rename to `MetadataTable` / `TableViewCatalog` / `loadTableOrView` was discussed during the parent PR's review as a clarity improvement; doing it now (before the API ships in a release) avoids deprecation churn later. `ResolvedPersistentView.info: ViewInfo` (with `V1ViewInfo` for v1 views) brings v2 view command flow in line with the existing v2 table command convention (`ResolvedTable.table: Table`, with `V1Table` for v1 tables) — single typed payload, resolved at analysis time, consumed at exec time. Looking up database objects at runtime is the anti-pattern this removes. ### Does this PR introduce _any_ user-facing change? Yes for **connector developers**: - The rename is a source-incompatible change on the still-`Evolving` v2 view API surface. Connectors implementing the parent PR's `RelationCatalog` / overriding `loadRelation` / referencing `MetadataOnlyTable` need to update to `TableViewCatalog` / `loadTableOrView` / `MetadataTable`. - `ViewCatalog` gains a new abstract `renameView(oldIdent, newIdent)` method. Existing `ViewCatalog` implementations need to add it (the parent PR has not yet released, so this is still pre-release breakage). - `ViewInfo`'s constructor is now `protected` (was `private`). Existing call sites use the `ViewInfo.Builder`; only internal subclasses need to call the constructor directly. Yes for **end users on a non-session v2 view catalog**: the listed DDL / inspection commands now succeed against a `ViewCatalog` instead of erroring with `UNSUPPORTED_FEATURE.TABLE_OPERATION`. `SHOW TABLES` on a `TableViewCatalog` now returns both tables and views, matching v1 SHOW TABLES output. `ALTER TABLE <view> RENAME TO …` (wrong syntax) now returns `EXPECT_TABLE_NOT_VIEW.USE_ALTER_VIEW` instead of silently succeeding. No user-facing change on the session catalog path — those plans are still rewritten to the v1 commands by `ResolveSessionCatalog` and behave exactly as before. ### How was this patch tested? - 13 new per-catalog `*SuiteBase` triplets under `sql/core/src/test/.../command/{,v1/,v2/}` exercise each command against both a v1 (session) catalog and a fresh v2 `InMemoryTableViewCatalog` fixture. - The pre-existing `DataSourceV2MetadataViewSuite` is trimmed to v2-specific structural tests (read path, `V1Table.toCatalogTable` round-trip, pure-ViewCatalog read+ALTER, multi-level-namespace cyclic / error rendering, REFRESH/ANALYZE rejection, SHOW TABLES TableViewCatalog). - 242 view-related tests pass locally across 30 suites, plus 54/54 SimpleSQLViewSuite, plus 171/171 table-side inspection suites (verifying no regression in `ResolvedChildHelper.getTableMetadata` after the metadata→info refactor). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude (Anthropic) Closes apache#55593 from cloud-fan/v2-view-followup. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…llocator
### What changes were proposed in this pull request?
This PR restricts `ExecutorResizePlugin` to activate only when the `direct` pods allocator (`ExecutorPodsAllocator`) is configured via `spark.kubernetes.allocation.pods.allocator`.
When the configured allocator is anything else (`statefulset`, `deployment`, or a user-supplied class), `ExecutorResizeDriverPlugin.init()` now logs a warning and returns early without creating a Kubernetes client or
scheduling the periodic memory check.
### Why are the changes needed?
`ExecutorResizePlugin` resizes individual executor pods in place via `pods().withName(...).subresource("resize").patch(...)`. This works correctly only when Spark itself owns each pod, i.e. with the `direct` allocator. For the other allocators, user-defined or the higher-level controller's pod template is the source of truth and patching a single pod under those allocators may have unexpected side-effects.
- `statefulset` → `StatefulSet.spec.template`
- `deployment` → `Deployment.spec.template`
### Does this PR introduce _any_ user-facing change?
No, `ExecutorResizePlugin` is a new unreleased feature of Apache Spark 4.2.0.
### How was this patch tested?
Pass the CIs with the newly added test coverage.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
Closes apache#55615 from dongjoon-hyun/SPARK-56670.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ng and -verbose post-filter JIRA: https://issues.apache.org/jira/browse/SPARK-56666 ### What changes were proposed in this pull request? Refines the unidoc javacOptions in `JavaUnidoc / unidoc / javacOptions` and the post-process stream filter in `docs/_plugins/build_api_docs.rb` so that the Documentation generation CI log is small enough to scan visually while still surfacing per-file `error: reference not found` diagnostics on broken `{link}` references. Builds on the `-Xmaxerrs` and `-verbose` insight from apache#55581 (SPARK-56630 follow-up): javadoc's default `-Xmaxerrs 100` cap was hit by the ~100 inert genjavadoc-stub errors during source loading, so doclint never ran on the real sources, and the per-file `error: reference not found` diagnostics surfaced only with `-verbose`. That PR's flag set (`-Xmaxerrs 999999`, `-Xmaxwarns 999999`, `-verbose`) achieved the diagnostic goal but at a ~77K-line CI log per run. This PR keeps the diagnostic visibility and brings the visible CI log down to ~4K lines (95% reduction), with four changes: 1. **`-Xmaxerrs 0`** instead of `-Xmaxerrs 999999`. The `0` value is treated as unlimited by javadoc (locally verified) and reads cleaner than the magic number. 2. **`-Xdoclint:all` + `-Xdoclint:-missing`** (two separate flags, matching the existing `Compile / doc / javacOptions` pattern in `SparkBuild.scala`). Suppresses the `missing` doclint group at javadoc level: the ~22K `no comment` / `no param` / `no return` / `no throws` warnings (each rendered as a 3-line block) that dominate the log on every Spark unidoc run. The two-flag form is load-bearing — bare `-Xdoclint:-missing` alone demotes other doclint groups (notably `reference`) to warning level, making broken `{link}` non-fatal; the explicit `-Xdoclint:all` first keeps reference at error level. Locally verified. 3. **Drop `-Xmaxwarns 999999`.** Warnings don't fail CI; error visibility is governed by `-Xmaxerrs`, not `-Xmaxwarns`. javadoc's default cap of 100 is sufficient — shows a sample of any remaining warnings without flooding. Saves ~4K lines beyond `-Xdoclint:-missing` alone. 4. **Post-filter `-verbose` progress lines from the build_api_docs.rb stream.** `-verbose` itself stays (it is load-bearing for per-file `error: reference not found` emission per apache#55581), but its progress noise — `Loading source file ...`, `[parsing started/completed]`, `[loading /path/X.class]`, `Generating /path/X.html` — carries no diagnostic signal. The existing stream filter is extended with a `verbose_line` regex that drops these single-line progress entries from stdout. Saves ~13K lines. ### Why are the changes needed? Documentation generation CI logs were ~77K lines per run after SPARK-56630's flag set. That is large enough that scanning for diagnostics by eye is impractical, and grep-piping is the only reasonable workflow. Most of the volume is structural noise (genjavadoc stub errors, `no comment` warnings, `-verbose` progress markers) with no diagnostic signal. After this PR the log is ~4K lines on a real-failure run; the per-file `error: reference not found` diagnostics PR apache#55581 added are the dominant content. Empirical breakdown of the reduction (verified end-to-end on this branch's earlier test commits with deliberately broken `{link}` plants in both a real `.java` source and a Scala source): | State | Log lines | Vs baseline | | --- | ---: | ---: | | PR apache#55581's flag set (baseline) | 77K | | | Add `-Xdoclint:all,-missing` | 22K | -71% | | Drop `-Xmaxwarns 999999` | 18K | -77% | | Post-filter `-verbose` progress | **~4K** | **-95%** | All diagnostic targets remain visible: per-file `error: reference not found` for both Java sources and Scala sources via the genjavadoc stub. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Validated end-to-end on earlier (now reverted) test commits of this branch with planted broken `{link}` references in both code paths: - `ColumnarMap.java` (real Java source): `{link org.apache.spark.deliberately.NoSuchClass}` and `{link ColumnVector#nonExistentMethod()}`. - `Partition.scala` (Scala source via genjavadoc): `[[Partition.index]]` — the wrong `.` separator that javadoc reads as inner-class lookup and fails to resolve. This is the case PR apache#55581's AGENTS.md note documents as the most common scaladoc-side cause of unidoc failure. Both surfaced as per-file `error: reference not found` diagnostics in the CI log on the test commit, doc gen failed as expected, log size dropped to 3,977 lines, and zero `Loading source file` / `[parsing started]` / `[loading X.class]` / `Generating *.html` / `no comment` lines remained visible. See the test-result comment below for the full breakdown. `-Xmaxerrs 0` and the bare-`-Xdoclint:-missing` demotion behavior were verified locally with standalone javadoc invocations on a minimal test file. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude (Anthropic) Closes apache#55605 from cloud-fan/unidoc-fixes-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ECUTE IMMEDIATE
### What changes were proposed in this pull request?
Fix parameter marker collation strength in `EXECUTE IMMEDIATE` so that parameters get **implicit** collation strength instead of **explicit**.
- In `ParameterHandler.convertToSql`, collated string parameters (including those nested in arrays, maps, structs) are now serialized as `CAST('value' AS STRING COLLATE X)` instead of `'value' COLLATE X`, giving them implicit collation strength when re-parsed.
- Null parameters delegate to the `CAST(NULL AS ...)` form that `Literal.sql` already emits for typed nulls; the CAST's `NullType` child yields Default (not Implicit) strength via `CollationTypeCoercion`'s `Cast` base case, which is the intended behavior for NULL.
- Add `DataTypeUtils.hasNonDefaultStringCharOrVarcharType` to recursively check if a type contains any explicitly collated STRING/CHAR/VARCHAR.
- Add `ElementAt` collation context propagation in `CollationTypeCoercion.findCollationContext` to extract collation context from a map's value type or an array's element type, so that downstream expressions (e.g. comparisons with columns of different collations) can correctly detect indeterminate collation conflicts.
Note on scope: this PR fixes the strength only for `EXECUTE IMMEDIATE` parameters (which are substituted as SQL text by `ParameterHandler`). Parameters bound at the plan level via `spark.sql(\"...\", Map(...))` (`NameParameterizedQuery`) continue to receive Default strength, matching their pre-existing behavior. The two paths are exercised side-by-side in the new tests (`parameterized query vs column collation` vs. `execute immediate parameter implicit vs column collation`) so the divergence is explicit. Aligning the plan-level path is left for a follow-up.
### Why are the changes needed?
Previously, parameters in `EXECUTE IMMEDIATE` had explicit collation strength. This caused incorrect behavior — for example, a parameter with `COLLATE UTF8_LCASE` would win over a column's collation instead of producing an `INDETERMINATE_COLLATION_IN_EXPRESSION` error, which is the correct behavior for implicit-strength collations meeting a different column collation.
Additionally, `element_at()` on a collated value did not propagate the value's collation context through `CollationTypeCoercion`, so downstream comparisons could not correctly detect indeterminate collation conflicts.
### Does this PR introduce _any_ user-facing change?
Yes. Parameters in `EXECUTE IMMEDIATE` now have implicit collation strength (matching `NamedExpression` / `VariableReference` / `SubqueryExpression`, i.e. columns and session variables — not string literals, which are Default). Collation conflicts between parameters and columns with different collations will now correctly raise `INDETERMINATE_COLLATION_IN_EXPRESSION` instead of silently using the parameter's collation.
### How was this patch tested?
Added 13 new tests in `CollationSuite` covering:
- `EXECUTE IMMEDIATE` with literal parameters (with/without explicit COLLATE) vs query literals and columns
- `EXECUTE IMMEDIATE` with variable parameters vs query literals and columns
- `EXECUTE IMMEDIATE` with complex type parameters (ARRAY, MAP, STRUCT) including collation and strength, and the `DATATYPE_MISMATCH.MAP_FUNCTION_DIFF_TYPES` error when a collated map key is looked up with a mismatched lookup key
- `EXECUTE IMMEDIATE` with null parameters (plain NULL, typed NULL, variable NULL)
- `EXECUTE IMMEDIATE` with named parameters (`USING value AS name`)
- `EXECUTE IMMEDIATE` with two parameters of different collations producing indeterminate collation errors
- Parameterized queries (`spark.sql()` API) vs columns and with collation strength
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-6)
Closes apache#55219 from ilicmarkodb/param_mark_collation.
Authored-by: Marko Ilić <marko.ilic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ow level operations" This reverts commit 5ef2e1b.
…test_connect_session ### What changes were proposed in this pull request? Consolidate `python/pyspark/sql/tests/connect/test_session.py` into `python/pyspark/sql/tests/connect/test_connect_session.py`: - Move the module-level `CustomChannelBuilder(DefaultChannelBuilder)` and the `SparkSessionTestCase` class from `test_session.py` into `test_connect_session.py`. - Rename `SparkSessionTestCase` to `SparkConnectSessionBuilderTests` to disambiguate from `pyspark.testing.mlutils.SparkSessionTestCase` and to describe its scope (builder validation + session-state tests against fake `sc://` remotes, no real server). - Add `Optional` and `DefaultChannelBuilder` imports; the rest of the imports were already present in `test_connect_session.py`. - Delete `python/pyspark/sql/tests/connect/test_session.py`. - Drop the now-stale `pyspark.sql.tests.connect.test_session` entry from `dev/sparktestsupport/modules.py`. ### Why are the changes needed? Both files lived in the same `tests/connect/` directory and tested Spark Connect session behavior. Two separate files added test-runner overhead (separate module entry, separate process startup) without a clear separation of concerns. Folding them into a single file groups the three session test classes together: - `SparkConnectSessionTests` — real-server query / error / session-id tests - `SparkConnectSessionBuilderTests` — builder validation + session-state tests against fake `sc://` remotes - `SparkConnectSessionWithOptionsTest` — session with config options ### Does this PR introduce _any_ user-facing change? No. Test-only change. ### How was this patch tested? Existing tests in both files are preserved. The merged file passes `python -m py_compile` and `black --check` (26.3.1). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-opus-4-7) Closes apache#55617 from zhengruifeng/merge-connect-session-tests. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…ces` for Connect parity convention ### What changes were proposed in this pull request? The Connect parity test file `python/pyspark/sql/tests/connect/test_resources.py` and its class `ResourceProfileTests` are the only ones in `python/pyspark/sql/tests/connect/` that don't follow the naming convention used by 200+ other parity tests: - Filename should be `test_parity_<name>.py`. - Class name should be `<Name>ParityTests`. This PR renames: - `python/pyspark/sql/tests/connect/test_resources.py` -> `python/pyspark/sql/tests/connect/test_parity_resources.py` - `ResourceProfileTests` -> `ResourceProfileParityTests` References to the old module path are updated in `dev/sparktestsupport/modules.py` and `.github/workflows/build_python_connect.yml`. ### Why are the changes needed? Consistency. The `test_parity_*` filename and `*ParityTests` class name are the established convention across the Connect parity test suite; this rename brings the resource profile parity tests in line and makes it easier to locate and identify parity tests at a glance. ### Does this PR introduce _any_ user-facing change? No. Test-only change with no behavioral impact. ### How was this patch tested? Existing tests; only filename and class name changed. The test module is still discovered through `dev/sparktestsupport/modules.py` and the connect CI workflow. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.7) Closes apache#55618 from zhengruifeng/rename-test-parity-resources. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request? Refactor `SQL_GROUPED_MAP_PANDAS_UDF` to be self-contained in `read_udfs()`. ### Why are the changes needed? Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in `read_udfs()` improves readability and makes it easier to reason about the data flow for each eval type independently. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. No behavior change. ASV benchmark (`GroupedMapPandasUDFTimeBench`): ```text master: 174fc60 vs PR: 41dc496 Time (ms, lower = better) scenario udf master PR Δ ----------------------------------------------------------------- sm_grp_few_col identity_udf 424.0 423.3 -0.17% sm_grp_few_col sort_udf 481.7 473.5 -1.70% sm_grp_few_col key_identity_udf 427.6 384.3 -10.12% sm_grp_many_col identity_udf 332.6 328.6 -1.21% sm_grp_many_col sort_udf 343.4 341.9 -0.43% sm_grp_many_col key_identity_udf 330.5 324.6 -1.77% lg_grp_few_col identity_udf 242.2 236.1 -2.52% lg_grp_few_col sort_udf 359.7 357.6 -0.57% lg_grp_few_col key_identity_udf 212.3 216.1+ 1.77% lg_grp_many_col identity_udf 492.1 517.9+ 5.24% lg_grp_many_col sort_udf 598.7 613.2+ 2.42% lg_grp_many_col key_identity_udf 479.2 488.7+ 1.97% mixed_types identity_udf 422.9 440.2+ 4.08% mixed_types sort_udf 449.4 456.9+ 1.65% mixed_types key_identity_udf 398.1 383.2 -3.73% ----------------------------------------------------------------- SUM 5994.5 5986.1 -0.14% ``` Aggregate essentially flat (-0.14%); per-scenario variation within run-to-run noise. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#55495 from Yicong-Huang/SPARK-56477. Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request? Introduce a new method `.withSchemaEvolution()` to the dataframe writer API to enable schema evolution during writes. This fills the current gap where schema evolution in inserts can only be enabled via SQL using `INSERT WITH SCHEMA EVOLUTION`. MERGE using dataframe already has a similar user-surface: [withSchemaEvolution()](https://github.com/apache/spark/blob/cf8be22a20bfc1e6fe2265977540974f810795ad/sql/api/src/main/scala/org/apache/spark/sql/MergeIntoWriter.scala#L162). Checks are added to fail when schema evolution is enabled on writes that don't support it: creating or replacing a table, writes on DSv1 tables. ### Does this PR introduce _any_ user-facing change? Yes, users can now enable schema evolution using the dataframe writer API, for example: ``` df.write .mode("append") .withSchemaEvolution() .saveAsTable(t) ``` and other variations of writes using DataframeWriter & DataFrameWriterV2. ### How was this patch tested? Added tests covering enabling (and using) schema evolution via all supported dataframe write methods. Negative tests for calls that don't support schema evolution Closes apache#55446 from johanl-db/SPARK-55689-df-with-schema-evolution. Authored-by: Johan Lasperas <johan.lasperas@databricks.com> Signed-off-by: Anton Okolnychyi <aokolnychyi@apache.org>
…ical code blocks ### What changes were proposed in this pull request? Extract a `scheduleResubmit()` helper method in `DAGScheduler` to replace three identical code blocks that schedule `ResubmitFailedStages` via `messageScheduler`. ### Why are the changes needed? The same 5-line block for scheduling a `ResubmitFailedStages` event is duplicated in three places (rollback handling, fetch failure handling, and barrier stage failure handling). Extracting it into a helper reduces duplication and improves readability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a pure refactoring with no behavior change. Existing tests cover the affected code paths. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.6) Closes apache#55614 from jiangxb1987/SPARK-56575. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
…s.( -> filterPropagation.)symmetricFilterPropagation.enabled` ### What changes were proposed in this pull request? This PR renames the SQL config key `spark.sql.optimizer.mergeSubplans.symmetricFilterPropagation.enabled` to `spark.sql.optimizer.mergeSubplans.filterPropagation.symmetricFilterPropagation.enabled`. ### Why are the changes needed? To follow the Apache Spark configuration namespace rule. The new key better reflects the configuration hierarchy. `spark.sql.optimizer.mergeSubplans.filterPropagation.enabled` is the parent gate for filter propagation in `MergeSubplans`, and the symmetric variant is a sub-feature of it. Nesting the symmetric flag under `filterPropagation.*` makes that relationship explicit, and groups related options together when listed alphabetically. The previous key `mergeSubplans.symmetricFilterPropagation.enabled` placed the two sibling options at different levels of the namespace, which obscured the dependency (the symmetric flag has no effect when filter propagation is disabled, as documented in the config `doc`). This config was added in 4.2.0 and has not been released yet, so no backwards-compat alias is required. ### Does this PR introduce _any_ user-facing change? No. The config has not been released yet (introduced for 4.2.0). ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7 Closes apache#55633 from dongjoon-hyun/dongjoon/suspicious-hofstadter-e56b11. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…istTableAndViewSummaries ### What changes were proposed in this pull request? Rename the unified listing method on `TableViewCatalog` from `listRelationSummaries` to `listTableAndViewSummaries`. Updates the declaration in `TableViewCatalog.java` and the two existing references introduced by [SPARK-56655](https://issues.apache.org/jira/browse/SPARK-56655): - `sql/catalyst/.../TableViewCatalog.java` -- method declaration + javadoc - `sql/core/.../ShowTablesExec.scala` -- call site + scaladoc reference - `sql/core/.../DataSourceV2MetadataViewSuite.scala` -- comment reference ### Why are the changes needed? The new name explicitly states what the method returns (tables and views) and avoids overloading Spark's existing "Relation" terminology (`BaseRelation`, `LogicalRelation`, etc.), matching the rename rationale that already changed `RelationCatalog` -> `TableViewCatalog` in SPARK-56655. The API is still `Evolving` and unreleased, so this is a pre-release rename with no deprecation cycle needed. ### Does this PR introduce _any_ user-facing change? No end-user behavior change. ### How was this patch tested? Pure rename; existing tests in `DataSourceV2MetadataViewSuite` exercise the renamed method via `SHOW TABLES` on a `TableViewCatalog`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude (Anthropic), claude-opus-4 Closes apache#55616 from gengliangwang/SPARK-56672. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…esClient` to `k8s` package ### What changes were proposed in this pull request? This PR promotes the `kubernetesClient` constructor parameter of `KubernetesClusterSchedulerBackend` to a `private[k8s] val`, exposing it as a read-only member to other components in the `org.apache.spark.scheduler.cluster.k8s` package. ### Why are the changes needed? Like `ExecutorResizePlugin`, many plugins currently build its own `KubernetesClient` via `SparkKubernetesClientFactory.createKubernetesClient(...)`. This produces multiple clients in the driver process that target the same API server with the same credentials. The driver already owns a single `KubernetesClient` through `KubernetesClusterSchedulerBackend`, whose lifecycle is managed in `start()` / `stop()`. Exposing it as `private[k8s] val` lets package-local components reuse this shared client in follow-up work, without widening the API surface to other modules or external users. This PR is intentionally limited to the visibility change. Reusing the shared client from existing plugins is left to follow-up PRs so each change stays minimal and reviewable. ### Does this PR introduce _any_ user-facing change? No. The visibility is `private[k8s]`, so the change is invisible outside the `org.apache.spark.scheduler.cluster.k8s` package. ### How was this patch tested? Pass the CIs with a new unit test. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7 Closes apache#55634 from dongjoon-hyun/SPARK-56684. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…columns at refresh time
Co-authored-by: Isaac
### What changes were proposed in this pull request?
This PR adds column ID support to the DSv2 Column interface so that Spark can detect when a column has been
dropped and re-added with the same name during table refresh.
Connectors that support column IDs (e.g., Delta, Iceberg) can now return a unique identifier via `Column.id()`. Spark validates these IDs at refresh time and fails with a clear error if a column has been replaced.
Column IDs are assigned at the top-level column granularity. Nested struct fields, array elements,
and map keys/values do not have separate IDs through this API.
A top-level-only ID will not detect same-name same-type drop+re-add of a nested field. This
matches the current behavior of connectors like Delta and Iceberg, where the top-level column ID
is unchanged when a nested field is dropped and re-added.
Connectors that want to detect nested drop+re-add can do so by encoding their nested field IDs
into the top-level `Column.id()` string. For example, a column `person STRUCT<name, age>` with
root ID 5 and nested IDs 10, 11 could return `"5[10,11]"` — if `age` is dropped and re-added
with new nested ID 12, the string becomes `"5[10,12]"`, and Spark detects the mismatch.
Changes to nested fields that alter the parent's data type (e.g., dropping `person.age` changes
the struct from `STRUCT<name, age>` to `STRUCT<name>`) are caught by schema validation regardless
of ID support.
The default `InMemoryTableCatalog` preserves column IDs across type changes, matching the behavior
of real connectors like Delta and Iceberg. A new `ComposedColumnIdTableCatalog` demonstrates the
recommended adoption pattern for connectors with nested IDs: encoding the full subtree into the
top-level `Column.id()` string so that any nested change is detected.
### Why are the changes needed?
Spark's existing refresh validation only checks column names and types, so it cannot detect when a column has been replaced with a same-named column. This PR exposes column IDs through the DSv2 API so Spark can catch these scenarios at refresh time.
### Does this PR introduce _any_ user-facing change?
No. Column IDs are a new opt-in API for DSv2 connectors. Existing connectors are unaffected because `Column.id()` defaults to `null`, and null IDs are skipped during validation. Only connectors that explicitly implement `Column.id()` will see the new `COLUMN_ID_MISMATCH` error when a column is dropped and re-added between analysis and execution.
### How was this patch tested?
New tests added in `DataSourceV2DataFrameSuite` and `DataSourceV2ExtSessionColumnIdSuite`.
#### Testing concept
Real connectors (Delta, Iceberg, etc.) differ in how much identity tracking they support: some assign unique IDs to every column, some only to certain columns, some support table-level IDs but not column IDs, and some support none at all. The test infrastructure needs to cover this full connector support matrix so we can verify that column ID validation works correctly when the connector opts in, and is safely skipped when it does not.
To achieve this, we introduce custom test catalogs that each simulate a different real-world connector behavior. They all extend `InMemoryTableCatalog` and share the same `InMemoryBaseTable` base which auto-assigns unique column IDs via a global counter. Each catalog then selectively overrides or strips certain identity fields to model a specific connector scenario.
#### Test catalog hierarchy
```
InMemoryTableCatalog (base: full table ID + full column IDs)
|
+-- NullTableIdInMemoryTableCatalog
| Table ID = null, column IDs = normal
| Purpose: verify column IDs alone detect drop+recreate without table-level identity
|
| +-- SharedInMemoryTableCatalog
| Tables stored in a static map shared across SparkSessions
| Purpose: verify cross-session detection (session1 holds stale DF, session2 mutates)
|
+-- NullColumnIdInMemoryTableCatalog
| Table ID = normal, column IDs = ALL null
| Purpose: verify validation is safely skipped when connector does not support column IDs
|
+-- MixedColumnIdTableCatalog
| Table ID = normal, column IDs = selectively null per column name
| Purpose: verify that columns with null IDs are skipped without incorrectly failing
|
+-- TypeChangeResetsColIdTableCatalog
| Table ID = normal, column IDs reset to fresh values on type changes
| Purpose: verify column ID mismatch fires when type changes produce new IDs
|
+-- ComposedColumnIdTableCatalog
Table ID = normal, nested field IDs encoded into top-level Column.id() string
Purpose: verify nested drop+re-add and deep nesting detected via subtree-ID composition
```
#### What is tested
| Catalog | What it tests |
|---------|--------------|
| `testcat` (`InMemoryTableCatalog`) | Full column ID support, IDs preserved across type changes (matching Delta/Iceberg). Bulk of the tests: drop+re-add (same/different type/case, multiple cols, complex types), access patterns (joins, filters, aggregates, sorts, subqueries, temp views, writeTo, insertInto), layered defense (type widening and nested type change caught by schema validation when ID is preserved), nested field ID preservation (parent and sibling IDs unchanged after nested drop or add), and no-false-positive checks (column additions, data-only inserts, nested field additions tolerated) |
| `nullidcat` (`NullTableIdInMemoryTableCatalog`) | Column IDs alone detect drop+recreate table scenario without table-level identity |
| `nullcolidcat` (`NullColumnIdInMemoryTableCatalog`) | Validation is safely skipped when all column IDs are null |
| `mixedcolidcat` (`MixedColumnIdTableCatalog`) | Columns with null IDs are skipped; columns with non-null IDs are still validated |
| `resetidcat` (`TypeChangeResetsColIdTableCatalog`) | Column ID mismatch fires when type changes produce fresh IDs (unlike the default which preserves IDs) |
| `composedidcat` (`ComposedColumnIdTableCatalog`) | Nested drop+re-add detected via subtree-ID composition (structs, arrays, maps), depth-3 nesting, drop+re-add with same name changes composed ID, data insertion tolerance |
| `sharedcat` (`SharedInMemoryTableCatalog`) | Cross-session: session2 mutates table, session1's stale DataFrame detects it at refresh |
### Was this patch authored or co-authored using generative AI tooling?
Yes
Closes apache#55376 from longvu-db/dsv2-column-id.
Authored-by: Thang Long Vu <long.vu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? This PR wires frozen SQL PATH semantics into analysis for persisted views and SQL functions (scalar and table). Specifically: - `AnalysisContext.withAnalysisContext(viewDesc)` now reads `viewStoredResolutionPath` and seeds `resolutionPathEntries` from persisted metadata when present. - `AnalysisContext.withAnalysisContext(function)` now reads `functionStoredResolutionPath` and seeds `resolutionPathEntries` for SQL function body analysis. - `CatalogManager.deserializePathEntries` is added to parse stored JSON path entries into analysis-time path entries. - Relation-resolution comments/docs are updated to reflect that persisted frozen path is now applied. - Regression tests are added: - `SQLViewSuite`: persisted view keeps creation-time PATH semantics. - `SQLFunctionSuite`: persisted SQL scalar function keeps creation-time PATH semantics. - `SQLFunctionSuite`: persisted SQL table function keeps creation-time PATH semantics. ### Why are the changes needed? Without this wiring, persisted views/functions may resolve unqualified names using the caller's current PATH instead of the PATH captured at creation time. That can cause behavior drift after `SET PATH` changes. This PR makes persisted object resolution stable and deterministic. ### Does this PR introduce _any_ user-facing change? Yes. For persisted views and SQL functions (including SQL table functions) created with PATH enabled, unqualified name resolution now consistently follows the stored creation-time PATH, even if the session PATH changes later. ### How was this patch tested? Added/updated unit tests and ran focused suites locally: - `build/sbt 'sql/testOnly org.apache.spark.sql.execution.SimpleSQLViewSuite'` - `build/sbt 'sql/testOnly org.apache.spark.sql.execution.SQLFunctionSuite'` Both suites passed with the new regression tests. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor Codex 5.3 Closes apache#55569 from srielau/SPARK-56639-frozen-path-semantics. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Allison Wang <allison.wang@databricks.com>
…on in FilterExec codegen ### What changes were proposed in this pull request? Followup to apache#56209. This adds an internal conf `spark.sql.subexpressionElimination.filterExec.enabled` (default `true`) that gates subexpression elimination (CSE) in `FilterExec` whole-stage codegen specifically. When set to `false`, `FilterExec` falls back to the existing predicate codegen path (`generatePredicateCode`) that loads input columns lazily and short-circuits; subexpression elimination elsewhere (e.g. `ProjectExec`) is unaffected. The conf is checked alongside the existing `spark.sql.subexpressionElimination.enabled` gate in `FilterExec.doConsume`. ### Why are the changes needed? CSE in `FilterExec` codegen (SPARK-56032) can introduce a performance regression for some filters. To materialize eliminated subexpressions into shared variables, the CSE path emits an eager prologue (`inputVarsEvalCode`) that evaluates every input column referenced by the predicates at the top of the per-row loop. The non-CSE path instead loads each column lazily, right before the predicate that needs it, so a cheap, highly selective leading predicate can short-circuit and skip decoding the remaining columns for rejected rows. The eager prologue defeats that short-circuiting. When a filter has expensive-to-decode columns behind a cheaper, selective predicate -- e.g. high-precision decimals, which allocate a `BigInteger`/`BigDecimal` per decode -- eagerly decoding those columns for every row, including rows the cheap predicate would reject, is pure waste. This showed up as a measurable regression on TPC-DS q28. apache#56209 already removes the prologue when there is no common subexpression at all. But even when a common subexpression exists, the eager prologue can still regress if the savings from eliminating it don't outweigh the lost short-circuiting. The global `spark.sql.subexpressionElimination.enabled` flag is too coarse to address this -- turning it off also disables CSE for projections and other operators. This conf provides a targeted kill-switch to fall back to the lazy, short-circuiting path for `FilterExec` while keeping CSE everywhere else. The default stays `true`, so behavior is unchanged. ### Does this PR introduce _any_ user-facing change? No. The conf is internal and defaults to `true`, preserving the current behavior. ### How was this patch tested? New unit test in `WholeStageCodegenSuite`: with a genuine common subexpression in the filter predicates, flipping `spark.sql.subexpressionElimination.filterExec.enabled` off (while leaving global subexpression elimination on) makes `FilterExec` fall back to the lazy non-CSE path -- the shared subexpression is re-evaluated per use, matching the code generated when CSE is globally disabled. Existing SPARK-56032 FilterExec CSE tests continue to exercise the default-on path. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude (Claude Code) Closes apache#56271 from cloud-fan/SPARK-56032-filterExec-cse-conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6d61d46) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Pin pandas and pandas-stub image for lint image. ### Why are the changes needed? CI is failing, but we don't support pandas 3 now. We pin pandas version on our docker image to match the dev env. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? mypy passed locally. Pending CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#56289 from gaogaotiantian/pin-pandas-version-lint. Authored-by: Tian Gao <gaogaotiantian@hotmail.com> Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com> (cherry picked from commit af55029) Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com>
### What changes were proposed in this pull request? Use sqlite instead of file for mlflow doctest. ### Why are the changes needed? Pandas CI is failing - https://github.com/manhha2502/spark/actions/runs/26841118356/job/79154335111 The latest `mlflow` does not allow the usage of file anymore - it will raise an exception. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Locally passed, pending CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#56290 from gaogaotiantian/use-sqlite-uri. Authored-by: Tian Gao <gaogaotiantian@hotmail.com> Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com> (cherry picked from commit 21fee7d) Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com>
… memory mode ### What changes were proposed in this pull request? Backport of apache#56234 to branch-4.2. ### Why are the changes needed? See apache#56234. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? See apache#56234. ### Was this patch authored or co-authored using generative AI tooling? Co-authored with Claude (Anthropic), used for analysis, code generation and review assistance. Generated-by: Claude Sonnet 4.6 Closes apache#56281 from kete1987/SPARK-57183-rocksdb-lrucache-leak-branch-4.2. Authored-by: Iván Morales <kete1987@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? `UnionExec` whole-stage codegen fusion (SPARK-56482) kept per-emission codegen state in mutable instance fields on the plan node: `currentEmittingChild` (set in `doProduce`, read in `doConsume` to pick a child's projection) and `numOutputRowsTerm` (the once-per-stage metric term). This PR moves both fields to `ThreadLocal`, isolating the state to the single thread that runs a given `doCodeGen` pass. ### Why are the changes needed? A single `UnionExec` instance can have its whole-stage codegen driven by more than one thread at the same time: a reused exchange/subquery stage is generated concurrently with the main plan, and async subquery / dynamic-partition-pruning execution can overlap a driver-side `doCodeGen`. With the shared mutable field, a racing `doProduce` resets `currentEmittingChild` to `-1` while another thread is still inside `doConsume`, tripping: ``` java.lang.IllegalArgumentException: requirement failed: UnionExec.doConsume invoked outside doProduce emission window ``` This surfaced as a flaky `LogicalPlanTagInSparkPlanSuite.q2` failure (q2 contains a `UNION`, and union fusion is enabled by default). Each `doCodeGen` pass is itself single-threaded (`produce` -> `doConsume` run inline on one thread), so a `ThreadLocal` isolates the state per pass without the cross-thread race, while preserving the existing per-stage semantics (the metric term is still computed once per pass). ### Does this PR introduce _any_ user-facing change? No. It removes an intermittent internal code-generation failure; the generated code and query results are unchanged. ### How was this patch tested? Added a `UnionCodegenSuite` test, "SPARK-57196: concurrent codegen of a shared UnionExec stage is thread-safe", that drives `doCodeGen()` on one shared fused `UnionExec` stage from 8 threads. It reproduces the "outside doProduce emission window" failure on the unpatched code and passes with this fix. Also verified the full `UnionCodegenSuite` (43 tests), its ANSI/AQE variants, and `LogicalPlanTagInSparkPlanSuite` q2 all pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8) Closes apache#56252 from gengliangwang/spark-union-codegen-threadsafe. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 694c848) Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request? Follow-up to apache#56160 (SPARK-57113) addressing post-merge review comments and reducing duplication in the AutoCDC flow code and its test suites. No behavior change. `sql/pipelines/.../graph/FlowExecution.scala`: - Hoist the `org.json4s` imports out of `serializeKeyColumnNames` / `parseKeyColumnNames` to the top of the file, per Spark's import conventions. - Factor the duplicated AutoCDC key-field resolution shared by `auxiliaryKeyColumnNames` and `validateNoAutoCdcKeyDrift` into a single `expectedAuxiliaryKeyFields` helper. - Import `scala.collection.mutable` instead of using a fully-qualified inline reference. Tests: - Add a shared `singleAutoCdcFlowPipeline` helper to `AutoCdcGraphExecutionTestMixin` and use it across the AutoCDC SCD1 E2E suites (`AutoCdcScd1KeyDriftSuite`, `AutoCdcScd1MultiPipelineSuite`, `AutoCdcScd1AuxiliaryTableDurabilitySuite`, `AutoCdcScd1SchemaEvolutionSuite`), removing the repeated single-table/single-flow `TestGraphRegistrationContext` registration boilerplate. ### Why are the changes needed? Addresses non-blocking review feedback left on apache#56160 and reduces duplication in the AutoCDC flow code and its tests, improving readability and maintainability. The net diff removes ~300 lines. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pure refactor with no behavior change, covered by the existing AutoCDC suites: - `AutoCdcAuxiliaryTableSuite` - `AutoCdcScd1KeyDriftSuite` - `AutoCdcScd1MultiPipelineSuite` - `AutoCdcScd1AuxiliaryTableDurabilitySuite` - `AutoCdcScd1SchemaEvolutionSuite` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.8) Closes apache#56255 from szehon-ho/spark-57113-follow-up. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 232c309) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Reorder `ReplaceDeduplicateWithAggregate` before `RewriteExceptAll` in the "Replace Operators" optimizer batch.
### Why are the changes needed?
`dropDuplicates("id", "name").exceptAll(other)` throws `INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND` at execution time. The root cause is that `RewriteExceptAll` captures attribute references from `left.output` before `ReplaceDeduplicateWithAggregate` has replaced the Deduplicate node with an Aggregate(First(...)). The First() alias creates new exprIds that don't match what `RewriteExceptAll` baked into its Generate node.
### Does this PR introduce _any_ user-facing change?
Yes. `exceptAll (and intersectAll)` now work correctly after `dropDuplicates` with a column subset.
### How was this patch tested?
Added a test in `DataFrameSetOperationsSuite` verifying `exceptAll`, `except`, and `intersectAll` after `dropDuplicates(subset)`.
### Was this patch authored or co-authored using generative AI tooling?
Yes.
Closes apache#55905 from shrirangmhalgi/SPARK-51262-except-all-not-working-with-drop-duplicates.
Authored-by: Shrirang Mhalgi <shrirangmhalgi@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8a76ac4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This PR wires the `tpcds-1g` job in `.github/workflows/build_and_test.yml` to consume the shared `precompile` artifact, extending the pattern already applied to `docker-integration-tests` and `k8s-integration-tests` ([SPARK-57069](https://issues.apache.org/jira/browse/SPARK-57069); parent [SPARK-56830](https://issues.apache.org/jira/browse/SPARK-56830)). Concretely: - The `precompile` job's `if:` gate is extended to also fire when `tpcds-1g == 'true'` in the precondition output, so the artifact is available whenever the job runs. - `tpcds-1g`: - `needs: precondition` -> `needs: [precondition, precompile]` - `if:` extended with `(!cancelled()) &&` so the job still runs if precompile is cancelled. - Adds "Download precompiled artifact" + "Extract precompiled artifact" steps after Java install, with graceful fallback (`continue-on-error: true`). The `tpcds-1g` job drives SBT directly via `build/sbt "sql/testOnly ..."` (and `build/sbt "sql/Test/runMain org.apache.spark.sql.GenTPCDSData ..."` on a TPC-DS data cache miss), so it does not go through `dev/run-tests.py` and needs no `SKIP_SCALA_BUILD` flag -- the same situation as `k8s-integration-tests`. The first SBT invocation otherwise compiles `sql/core` (main + test) from scratch. The `precompile` job already runs `Test/package`, which compiles the `sql/core` test classes this job depends on (`TPCDSQueryTestSuite`, `TPCDSCollationQueryTestSuite`, `GenTPCDSData`, `TPCDSSchema`). Extracting the precompiled `target/` lets SBT skip that compile and run the test phase directly. ### Optional: graceful fallback if precompile fails Same pattern as the prior consumers: - `precompile` keeps `continue-on-error: true`. - The "Download precompiled artifact" step is gated on `needs.precompile.result == 'success'` and has `continue-on-error: true`. - "Extract precompiled artifact" is gated on the download succeeding and has `continue-on-error: true`. - If extraction fails or the artifact is missing, SBT compiles from scratch exactly as before. Worst case is degraded to the pre-PR behavior, not a workflow failure. Note: the existing `# Any TPC-DS related updates on this job need to be applied to tpcds-1g-gen job of benchmark.yml as well` comment refers to TPC-DS data-generation parameters (scale factor, `tpcds-kit` ref, `GenTPCDSData` args). This PR changes none of those -- it only adds build-artifact reuse, and `benchmark.yml` is a standalone workflow with no shared `precompile` job -- so no corresponding change is needed there. ### Why are the changes needed? Today every run of `build_and_test.yml` that requires `tpcds-1g` re-runs the same `sql/core` SBT compile that the `precompile` job already produced for `pyspark` / `sparkr` / `build` / docker / k8s. Wiring `tpcds-1g` to the existing artifact removes that duplicate compile for free (precompile is already running). ### Does this PR introduce _any_ user-facing change? No. CI infrastructure change only. ### How was this patch tested? The change is exercised by the CI run of this PR itself. The Download/Extract steps log the artifact size; if the precompile job is forced to fail (or its artifact is missing), the job falls back to the original local SBT build. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.7) Closes apache#56200 from zhengruifeng/precompile-tpcds-ci-share-dev5. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com> (cherry picked from commit 5ec09fe) Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…xpected exception ### What changes were proposed in this pull request? In YARN client mode, `YarnClientSchedulerBackend`'s `MonitorThread` only catches `InterruptedException` / `InterruptedIOException`. If any other exception occurs during application monitoring (e.g., network failure, credential expiration, or other runtime errors), the thread dies silently. Since the driver JVM has active non-daemon threads (SparkUI, heartbeats), the process hangs indefinitely in a zombie state. This patch adds a `NonFatal` catch clause that logs the error and calls `sc.stop()`, ensuring the driver shuts down cleanly. ### Why are the changes needed? In managed environments (cloud platform agents, workflow schedulers), a hung driver is indistinguishable from one doing legitimate post-execution work. This causes resource leakage, orphaned processes, and extended job timeout durations. ### Does this PR introduce _any_ user-facing change? Yes. Previously, certain failures in the monitor thread caused the driver to hang forever. Now the driver shuts down cleanly with an error log. ### How was this patch tested? Added a new test in `YarnClientSchedulerBackendSuite` with a test that mocks `Client.monitorApplication` to throw a `RuntimeException` and asserts `sc.stop()` is called (via `SparkListener.onApplicationEnd`). ### Was this patch authored or co-authored using generative AI tooling? Yes. Closes apache#56274 from shrirangmhalgi/SPARK-57191-yarn-driver-hang. Authored-by: shrirangmhalgi <shrirangmhalgi@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@apache.org> (cherry picked from commit d9ed843) Signed-off-by: Kousuke Saruta <sarutak@apache.org>
…jobs
### What changes were proposed in this pull request?
Replace 8 distinct per-job Coursier cache keys with a single `coursier-<hash>` key in `.github/workflows/build_and_test.yml` and `python_hosted_runner_test.yml`:
- **`precompile`** and **`build`** (Scala test matrix): `actions/cachev5` — both can write `coursier-<hash>`. `precompile` is the primary writer (runs first, full dependency superset via all `-P` profiles). `build` is the fallback writer — when `precompile` is absent or its save fails, the first `build` matrix entry seeds the cache. When `precompile` did save it, `build` gets an exact key hit and GHA automatically skips the post-save (caches are immutable).
- **All other consumers** (`pyspark` ×9, `sparkr`, `lint`, `docs`, `tpcds-1g`, `docker-integration-tests`, `k8s-integration-tests`): converted to `actions/cache/restorev5` — restore-only, never write. `tpcds-1g` in particular only fires when SQL code changes and is skipped on the vast majority of runs, so its own Coursier cache entry would typically be LRU-evicted before the next run anyway.
### Why are the changes needed?
**1. Same-commit duplicates — ~0.01% apart by bytes.**
Per-job keys let every consumer job re-save its own copy of effectively the same content. Measured on master:
```
precompile-coursier-4f7e6f95 1,469,711,354 B current superset
25-hadoop3-coursier-4f7e6f95 1,469,562,712 B 145 KB different ← 0.01% apart
precompile-coursier-03ca361a 1,624,890,072 B previous-hash superset (~90% same)
────────────────────────────────────────────────────────────────────
total: ~4.56 GB distinct content: ~1.47 GB
```
The 145 KB delta exists because Coursier doesn't prune: on a cold run the test-matrix job restores the precompile superset via restore-key, runs tests (which resolve nothing beyond it), and its post-step re-saves a byte-for-byte copy under its own key. The per-module keys are not holding different dependency sets — they are holding copies of the same superset.
**2. Repo-wide 10 GB budget consumed by duplicates.**
Duplicates from just two branches left no room for any other branch:
```
branch-4.x: tpcds-coursier 1895 MB
21-hadoop3-coursier 1437 MB
docker-integration-coursier 1437 MB → 4770 MB
master: precompile-coursier (hash A) 1549 MB
precompile-coursier (hash B) 1401 MB
25-hadoop3-coursier (hash B) 1401 MB → 4351 MB
total: ~9.1 GB used, 10 GB budget
```
Old maintenance branches (branch-4.0, 4.1, 4.2, 3.5) had their caches evicted before their next scheduled CI run and were always cold.
**3. Dep-upgrade burst amplifies the problem.**
`pom.xml`/`plugins.sbt` are touched ~5–6 times per month on average, but upgrades cluster: on 2026-05-28 alone, 5 dependency upgrades merged in a single day (rocksdbjni, joda-time, gson, Jetty, zstd-jni). Each commit rolls the hash, so 5 consecutive CI runs each start with a cold Coursier cache. Under the old design each cold run raced to create ~5 new ~1.4 GB entries (~7 GB), immediately overflowing the budget and evicting the previous run's still-warm caches. Under the new design each cold run creates exactly 1 entry (~1.4 GB), so a burst of 5 dep-upgrade commits creates ~7 GB total — still within budget and without evicting each other.
**Summary:** with one writer per branch the per-branch footprint drops from ~4.5 GB to ~1.4 GB, fitting ~6 branches in the 10 GB budget simultaneously, and a burst of dep-upgrade commits no longer triggers a cascade of mutual evictions.
### Does this PR introduce _any_ user-facing change?
No. CI-only.
### How was this patch tested?
YAML validates with `python3 -c "import yaml; yaml.safe_load(...)"`.
The correctness of the one-writer design relies on two GHA cache guarantees verified in prior CI runs:
1. Caches are immutable — an exact key hit skips the post-save step (`Cache hit occurred on the primary key …, not saving cache`), so multiple jobs using `actions/cachev5` with the same key don't produce duplicates when the cache already exists.
2. The `precompile` job builds with every profile (`-Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver -Pdocker-integration-tests -Pvolcano`), so its `~/.cache/coursier` is a superset of every consumer job's closure.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-sonnet-4-6)
Closes apache#56201 from zhengruifeng/unify-coursier-ci-cache-opt-dev6.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 4dbc9d7)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Clarify the Scaladoc and PySpark docstring for three public `RuntimeConfig` methods whose descriptions did not accurately explain the interaction between an explicitly-set value and the key's built-in (`ConfigEntry`) default value.
**`get(key: String)` / `get(key)`**
- Before: *"If the key is not set yet, return its default value if possible"*
- After: *"If the key is not explicitly set, return its built-in default value if one exists"*
**`get(key: String, default: String)` / `get(key, default)`**
- Before: *"If the key is not set yet, return the user given `default`"* — silent about overriding the ConfigEntry default
- After: explicitly states the user-supplied `default` is returned *instead of* the key's built-in default value
**`getOption(key: String)`**
- Before: *"return its default value if possible, otherwise `None`"* — vague
- After: *"return `Some` of its built-in default value if one exists, otherwise `None`"*
**PySpark `RuntimeConfig.get()`**
- Removed the misleading phrase *"assuming it is set"*
- Rewrote the description to explain both calling forms separately
- Added a `Raises` section documenting `SparkNoSuchElementException`
- Fixed the `Returns` section to follow NumPy docstring format
- Added four illustrative doctest examples using `spark.sql.sources.partitionOverwriteMode` that demonstrate the built-in-default behaviour
---
### Why are the changes needed?
The documentation was misleading in a subtle but important way. Users reading `get(key)` would think it throws whenever the key is *unset*, but in fact it returns the ConfigEntry's built-in default silently. Similarly, users reading `get(key, default)` had no way to know the supplied default overrides the built-in default — not just fills a gap when no default exists at all.
A concrete example (from SPARK-49798):
```python
spark.conf.unset("spark.sql.session.timeZone")
# Old doc implied this would throw — it does NOT; returns built-in default "Etc/UTC"
spark.conf.get("spark.sql.session.timeZone")
# => "Etc/UTC"
# Old doc did not mention this ignores the built-in default "Etc/UTC"
spark.conf.get("spark.sql.session.timeZone", "Europe/Berlin")
# => "Europe/Berlin"
```
---
### Does this PR introduce _any_ user-facing change?
No.
---
### How was this patch tested?
No new tests added. The described behaviour is already fully covered by existing tests:
- **Scala:** `RuntimeConfigSuite` — *"set and get a config with defaultValue"* tests all three cases (`get(key)` returning built-in default, `getOption(key)` returning `Some(builtInDefault)`, and `get(key, userDefault)` overriding the built-in default)
- **Python:** `pyspark/sql/tests/test_conf.py` — `test_conf` (lines 38–45) asserts the same behaviour using `spark.sql.sources.partitionOverwriteMode`
Four new doctest examples were added to `pyspark/sql/conf.py`. They are executable via `python python/pyspark/sql/conf.py` and assert the same cases the unit tests cover.
---
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6
Closes apache#56154 from brijrajk/SPARK-49798-fix-runtimeconfig-docs.
Authored-by: BRIJ RAJ KISHORE <22271048+brijrajk@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d1ef991)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ath. ### What changes were proposed in this pull request? Fixes the bug introduced by apache#55610. When `local` is a very large negative value, `Math.floorMod(local, unitMicros)` is positive because `unitMicros` is positive, and the subtraction can overflow to positive values. This produces a inconsistent result from the slow path. In most common cases, this optimization still takes effect. The performance loss is ignorable. ### Why are the changes needed? Fix incorrect result. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. It passes before apache#55610, or after this change. But it fails on the current master. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#56313 from chenhao-db/fix_trunc_overflow. Authored-by: chenhao-db <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 94d23c3) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Current implementation for path based tables in SQL is partial. We are removing it for Spark 4.2. ### Why are the changes needed? The previous [attempt](apache#56039) to add path-based table support was halted due to concerns. For Spark 4.2 we are letting the connector handle it. ### Does this PR introduce _any_ user-facing change? No. Path based support in SQL was not yet released. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? Claude Opus 4.7. Closes apache#56316 from andreaschat-db/dsv2TransactionRemoveSQLPathBasedSupport. Authored-by: Andreas Chatzistergiou <andreas.chatzistergiou@databricks.com> Signed-off-by: Anton Okolnychyi <aokolnychyi@apache.org> (cherry picked from commit 42db152) Signed-off-by: Anton Okolnychyi <aokolnychyi@apache.org>
… respect `spark.sql.redaction.string.regex` ### What changes were proposed in this pull request? This PR changes `SparkSQLDriver.scala` to redact a query before `setJobDescription`. ### Why are the changes needed? In the current implementation, redaction is done in `SQLExecution.scala` so the description in the table on the top of `/SQL/execution` is redacted. <img width="1083" height="349" alt="sql-execution-page-top-table" src="https://github.com/user-attachments/assets/b06fb255-2b46-473d-9046-1b2d578e3bda" /> But the description in the table on the `/jobs` page and the one in the table on the bottom of `/SQL/execution` page are not redacted. <img width="525" height="692" alt="jobs-page-before" src="https://github.com/user-attachments/assets/0a5a8ce8-e4be-4669-bd7d-a6c62fe316ca" /> <img width="515" height="274" alt="sql-execution-page-before" src="https://github.com/user-attachments/assets/bd0406cc-5b0b-40a0-96c4-9f9fa1aa048a" /> ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added new test. Also confirmed descriptions are redacted in UI. ``` $ bin/spark-sql --conf spark.sql.redaction.string.regex="secret.*=.*" spark-sql (default)> CREATE TABLE test1(secret string); spark-sql (default)> SELECT * FROM test1 WHERE secret=1; ``` <img width="852" height="690" alt="jobs-page-after" src="https://github.com/user-attachments/assets/8e28e37e-369f-479c-9711-999b431756db" /> <img width="598" height="272" alt="sql-execution-page-after" src="https://github.com/user-attachments/assets/cb734556-619b-45c6-a7f6-d52e60132aff" /> ### Was this patch authored or co-authored using generative AI tooling? Kiro CLI / Claude Closes apache#56326 from sarutak/fix-redact-sql-description. Authored-by: Kousuke Saruta <sarutak@amazon.co.jp> Signed-off-by: Kousuke Saruta <sarutak@apache.org> (cherry picked from commit 583e5bb) Signed-off-by: Kousuke Saruta <sarutak@apache.org>
…y should respect `spark.sql.redaction.string.regex`" This reverts commit ab3ec61.
…espects `spark.sql.redaction.string.regex` ### What changes were proposed in this pull request? This PR adds a test to `SQLExecutionSuite` that verifies the SQL execution description (`SparkListenerSQLExecutionStart.description`) is redacted according to `spark.sql.redaction.string.regex`. ### Why are the changes needed? `SQLExecution` redacts the job description before it is recorded in `SparkListenerSQLExecutionStart`, but there is no test covering this behavior. This test guards the redaction so it is not accidentally dropped. ### Does this PR introduce _any_ user-facing change? No. This is a test-only change. ### How was this patch tested? Pass the CI with the newly added test. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8) Closes apache#56358 from dongjoon-hyun/SPARK-57297. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Kousuke Saruta <sarutak@apache.org> (cherry picked from commit 4bc7196) Signed-off-by: Kousuke Saruta <sarutak@apache.org>
… respect `spark.sql.redaction.string.regex` ### What changes were proposed in this pull request? This PR changes `SparkSQLDriver.scala` to redact a query before `setJobDescription`. ### Why are the changes needed? In the current implementation, when a query is executed through `SparkSQLDriver`, redaction is done in `SQLExecution.scala` so the description in the table on the top of `/SQL/execution` is redacted. <img width="1083" height="349" alt="sql-execution-page-top-table" src="https://github.com/user-attachments/assets/b06fb255-2b46-473d-9046-1b2d578e3bda" /> But the description in the table on the `/jobs` page and the one in the table on the bottom of `/SQL/execution` page are not redacted. <img width="525" height="692" alt="jobs-page-before" src="https://github.com/user-attachments/assets/31c88b98-779b-4305-bf71-58f19a1d7117" /> <img width="515" height="274" alt="sql-execution-page-before" src="https://github.com/user-attachments/assets/012be251-f642-4ded-8f77-32f811b05cac" /> NOTE: Even after this PR is merged, when a job description is set manually using `sc.setJobDescription`, the description displayed in the `/jobs` page and the one on the bottom of `/SQL/execution` page are not redacted though the one on the top of `SQL/execution` page is redacted. ``` $ bin/spark-shell -c spark.sql.redaction.string.regex="secret.*=.*" scala> val s = "SELECT * FROM (SELECT 'secret=1')" scala> sc.setJobDescription(s) scala> sql(s).show() +--------+ |secret=1| +--------+ |secret=1| +--------+ ``` **description in `/jobs` page** <img width="555" height="226" alt="jobs-page-not-redacted" src="https://github.com/user-attachments/assets/b4e084ad-b648-4ba6-b049-ef42f570398d" /> **description in `/SQL/execution` (top)** <img width="913" height="203" alt="sql-execution-page-redacted" src="https://github.com/user-attachments/assets/91e745f0-aa7f-4618-98e9-5b4b117415da" /> **description in `/SQL/execution` (bottom)** <img width="536" height="292" alt="sql-execution-page-not-redacted" src="https://github.com/user-attachments/assets/761aad76-0d1b-49af-9e03-58510cd474d1" /> This is consistent with the previous behavior and not a regression. There is no simple way to redact them and doing it is out of scope of this PR. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added new test and confirmed the test `SQL execution description should respect spark.sql.redaction.string.regex` added in apache#56358 passed. Also confirmed descriptions are redacted in UI. ``` $ bin/spark-sql --conf spark.sql.redaction.string.regex="secret.*=.*" spark-sql (default)> CREATE TABLE test1(secret string); spark-sql (default)> SELECT * FROM test1 WHERE secret=1; ``` <img width="607" height="213" alt="jobs-page-after-2" src="https://github.com/user-attachments/assets/62646cfc-67c3-46b5-a9f9-695b1f874462" /> <img width="589" height="274" alt="sql-execution-page-after-2" src="https://github.com/user-attachments/assets/597db0da-58fb-4275-b6aa-7e8b301f15d0" /> ### Was this patch authored or co-authored using generative AI tooling? Kiro CLI / Claude Closes apache#56361 from sarutak/fix-redact-sql-description-v2. Authored-by: Kousuke Saruta <sarutak@amazon.co.jp> Signed-off-by: Kousuke Saruta <sarutak@apache.org> (cherry picked from commit 96b255f) Signed-off-by: Kousuke Saruta <sarutak@apache.org>
…b Actions cache ### What changes were proposed in this pull request? Install `zstd` in all CI container image Dockerfiles (`dev/infra/Dockerfile` and the `python-*`, `docs`, `lint`, `sparkr` images under `dev/spark-test-image/`). ### Why are the changes needed? `actions/cache` has never successfully restored a cache in any container-based CI job — confirmed by `apache/spark`'s cache history, which has no `pyspark-coursier-*` / `sparkr-coursier-*` / `docs-coursier-*` entry. This is a long-standing issue, present since container jobs were introduced. `actions/cache` computes a cache **version** = `SHA256(paths + compression_method)` and includes it in the lookup URL. Host runners have `zstd` and use it; container images lack `zstd` and fall back to `gzip`. The version then differs, so caches saved by host jobs (e.g. the Coursier cache written by `precompile`) are invisible to container jobs even when the key matches. Installing `zstd` aligns the compression method. (The `build-` cache happened to work because it is written by both host and container jobs, so a gzip-version entry also existed; host-only caches had no gzip entry.) ### Does this PR introduce _any_ user-facing change? No. CI-only. ### How was this patch tested? Before (run [26956300346](https://github.com/zhengruifeng/spark/actions/runs/26956300346)): `precompile` saved `Linux-coursier-<hash>`, but all container jobs reported `Cache not found` for the same key minutes later. After (run [26996424034](https://github.com/zhengruifeng/spark/actions/runs/26996424034)): `pyspark`, `sparkr`, `lint`, `docs` all `Cache restored from key: Linux-coursier-<hash>`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (claude-sonnet-4-6) Closes apache#56324 from zhengruifeng/fix-container-coursier-cache-ci-cache-opt-dev6. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com> (cherry picked from commit 9c1adaf) Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Make every CI cache `key:` and `restore-keys:` value OS-specific and give them a uniform shape: `<prefix>-${{ runner.os }}-<env>-<hash>` — the human-readable prefix leads, `${{ runner.os }}` follows, then any environment-specific component (e.g. the Java version, or Java + Hadoop for Coursier), then the `hashFiles(...)` hash.
- `build_and_test.yml`: `build-`, `docs-maven-`, `tpcds-`, and the Coursier caches (`coursier-${{ runner.os }}-…`, including the matrix variant `coursier-${{ runner.os }}-${{ matrix.java }}-${{ matrix.hadoop }}-…`)
- `build_sparkr_window.yml`: `build-sparkr-windows-maven-`
- `maven_test.yml`: `build-`, and `maven-${{ runner.os }}-java${{ ... }}-` (Java version after the OS)
- `build_python_connect.yml` / `build_python_connect40.yml`: `build-spark-connect-python-only-`, `coursier-build-spark-connect-python-only-`
- `python_hosted_runner_test.yml`: `build-`, `coursier-`
- `publish_snapshot.yml`: `snapshot-maven-`
The Coursier caches already embedded `${{ runner.os }}` (as a leading prefix); they are reordered here so that every cache key in the workflows follows the same `<prefix>-${{ runner.os }}-…` convention.
### Why are the changes needed?
Without an OS component, cache entries from Linux, macOS, and Windows runners share the same key namespace. A cache written by one OS can be restored on another, causing subtle build failures or stale-artifact issues. Embedding `${{ runner.os }}` ensures each OS has its own isolated cache. Keeping the descriptive prefix first preserves readability, groups related caches together, and makes the keys consistent across all workflows.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI workflow change only; no code logic changed. The cache keys are verified by inspection.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.8
Closes apache#56342 from zhengruifeng/ci-cache-key-runner-os-dev7.
Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 4d8b715)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…nner CI jobs ### What changes were proposed in this pull request? This PR extends the SBT precompile-sharing pattern (parent: [SPARK-56830](https://issues.apache.org/jira/browse/SPARK-56830), pyspark: [SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768)) to the python-only macOS / ARM workflows that run via `.github/workflows/python_hosted_runner_test.yml`. Concretely: - New `precompile` job in `python_hosted_runner_test.yml` runs Spark's SBT build once on `${{ inputs.os }}`: ``` ./build/sbt -Phadoop-3 -Pyarn -Pspark-ganglia-lgpl -Phadoop-cloud -Phive \ -Pkubernetes -Pjvm-profiler -Pkinesis-asl -Phive-thriftserver \ -Pdocker-integration-tests -Pvolcano \ Test/package streaming-kinesis-asl-assembly/assembly connect/assembly assembly/package ``` It tars every `target/` directory (excluding `./build/` and `./.git/`) with `tar -czf`, uploads as `spark-compile-<os>-<branch>-<run_id>` with `retention-days: 1`. - The 9 pyspark matrix entries in the same workflow add `precompile` to `needs:` and `if: (!cancelled())`, download/extract the artifact (with graceful fallback), and export `SKIP_SCALA_BUILD=true` so `dev/run-tests.py` skips `build_apache_spark` and `build_spark_assembly_sbt`. - Cache steps in the new precompile job are gated `if: ${{ runner.os != 'macOS' }}` to match the existing TODO(SPARK-54466) pattern in this file: on `macos-26` the precompile runs without GHA cache; on `ubuntu-24.04-arm` it caches as expected. - Artifact name includes `${{ inputs.os }}` so the two callers (`build_python_3.12_macos26.yml` and `build_python_3.12_arm.yml`) cannot collide. This benefits both callers of the reusable workflow: - `.github/workflows/build_python_3.12_macos26.yml` (macos-26) - `.github/workflows/build_python_3.12_arm.yml` (ubuntu-24.04-arm) ### Optional: graceful fallback if precompile fails Same pattern as SPARK-56768: - `precompile` has `continue-on-error: true` so a failed or cancelled precompile does not fail the workflow run. - The matrix's "Download precompiled artifact" step is gated on `needs.precompile.result == 'success'` and itself has `continue-on-error: true`. - The "Extract precompiled artifact" step is gated on the download succeeding, and also has `continue-on-error: true`. - Inside the "Run tests" bash block, `SKIP_SCALA_BUILD=true` is exported only when `steps.extract-precompiled.outcome == 'success'`. Otherwise it stays unset and `dev/run-tests.py` falls back to the original local SBT build. ### Why are the changes needed? Today every one of the 9 pyspark matrix entries in `python_hosted_runner_test.yml` runs the same SBT build from scratch. Sharing the compile artifact once across the matrix avoids 8x duplicate SBT compile work per scheduled run of `build_python_3.12_macos26.yml` (and `build_python_3.12_arm.yml`). This mirrors the savings already realized for the Linux pyspark matrix in SPARK-56768. ### Does this PR introduce _any_ user-facing change? No. CI infrastructure change only. ### How was this patch tested? The change is exercised by the CI run of this PR itself. To validate the `macos-26` path specifically, a temporary PR-builder hook ran `python_hosted_runner_test.yml` on `macos-26` (dropped before merge); all 9 `Python on macOS` pyspark matrix entries passed while reusing the shared precompiled artifact: https://github.com/zhengruifeng/spark/actions/runs/27010550379 If the precompile job is forced to fail (or its artifact is missing), the matrix entries should still pass via the fallback path. The "Run tests" step logs `Reusing precompiled artifact, skipping local SBT build.` to make the fast path visible per matrix entry. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.7) Closes apache#56107 from zhengruifeng/share-sbt-compile-python-macos-dev5. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com> (cherry picked from commit e8ca287) Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
…on API Currently, the DSv2 Transaction API skips dataframe caching. This can cause significant performance regression to existing workloads. Dataframes cached prior to the transaction should be allowed to be reused within the transaction. Cache substitution during a transaction now delegates to the connector via `Transaction.registerScans`. Spark hands every materialized scan in a candidate cached subtree to the active transaction, and the connector decides whether reusing the cached snapshots is compatible with its isolation contract. Furthermore, this PR fixes an issue where the `v2TableReference` to Relation mechanism would not take into account subqueries. With the fix all pre-resolved relations in the plan (including subqueries) are un-resolved to `v2TableReference` and then resolved again at `ResolveRelations`. ### What changes were proposed in this pull request? This transaction introduces `Transaction.registerScans` in the transaction API. Check above for more details. ### Why are the changes needed? Without this fix cached dataframes cannot be used in transactions. As a result, the DSv2 transaction API will introduce significant performance regression for relevant workloads. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing and new tests. ### Was this patch authored or co-authored using generative AI tooling? Opus 4.7 Closes apache#56121 from andreaschat-db/dsv2TransactionDFCachingFix. Authored-by: Andreas Chatzistergiou <andreas.chatzistergiou@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 542ea3b) Signed-off-by: Gengliang Wang <gengliang@apache.org>
…e triggered Put all the CI-unrelated files in a module called `dev-tools` so they are correctly categorized - then they won't be considered part of "root" module and trigger a full CI. Notice that `lint` workflow is not impacted because it does not check changed files. To reduce CI usage. According to Claude Code estimation, we could skip about 70 commits in the past year (2% of all commits). No. CI. No. Closes apache#56312 from gaogaotiantian/ignore-some-files. Authored-by: Tian Gao <gaogaotiantian@hotmail.com> Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com> (cherry picked from commit 952a283) Signed-off-by: Tian Gao <gaogaotiantian@hotmail.com>
Brings in 260 upstream commits from apache/spark v4.2.0-preview5..v4.2.0-rc1. Conflict resolutions: - pom.xml (45), docs/_config.yml, python/pyspark/version.py: kept the downstream version string 4.2.0.1-4.3.0-0 (ours), preserving rc1's other auto-merged content. - .github/workflows/*.yml (10): kept deleted (downstream uses a single ci.yml; these Apache workflows were removed in NGSOK-1622). - DropTableExec.scala: combined downstream purge-on-external (isPurgeableExternalTable) with rc1's ViewCatalog-aware not-exists branch. - ddl.scala (DropTableCommand): kept downstream's hoisted `val table` (reused for isPurgeableExternalTable) and adopted rc1's CatalogTable.isViewLike(t) match to also cover METRIC_VIEW.
fbda773 to
214c045
Compare
Signed-off-by: Petr Fedchenkov <giggsoff@gmail.com>
Bump all module versions, docs config, PySpark and R package versions from 4.2.0.1-4.3.0-0 to 4.2.0.1-4.3.0-1. Signed-off-by: Petr Fedchenkov <giggsoff@gmail.com>
214c045 to
fbf9f21
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sync changes from v4.2.0-rc1 and update version to 4.2.0.1-4.3.0-1