[Improvement-17330][K8s] Replace job watcher with informer by det101 · Pull Request #18358 · apache/dolphinscheduler

det101 · 2026-06-17T06:28:09Z

Was this PR generated or assisted by AI?

YES

Purpose of the pull request

fix #17330

Brief change log

Verify this pull request

This change added tests and can be verified as follows:

./mvnw -pl dolphinscheduler-task-plugin/dolphinscheduler-task-api clean test -Dtest=K8sTaskExecutorTest

[√] Manual K8s smoke test with a short busybox task (sleep 10) succeeds end-to-end
[√] Manual K8s smoke test with a long busybox task (sleep 2400) does not fail with "too old resource version" ([Improvement][K8s] too old resource version #17330)

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

sonarqubecloud · 2026-06-17T09:33:30Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

det101 · 2026-06-23T06:02:24Z

Regarding the Fabric8 upgrade from 6.0 to 6.4: the BOM upgrade turned out to affect every Kubernetes client in the project, so its impact is significant. I have reworked the change, and it is still based on version 6.0. @SbloodyS @ruanwenjun

det101 · 2026-06-23T07:44:05Z

Manual verification (minikube)
Tested on standalone + minikube with busybox:1.30.1, two sequential K8S tasks.

Short task (sleep 15): Job submitted; informer logged event received, job: ..., action: ADD/UPDATE; terminal status 0 → succeed in k8s. Task SUCCESS.

Long task (sleep 2400, ~40 min): Informer kept receiving ADD/UPDATE for the full run; pod finished after 40 min; status 0 → succeed in k8s. Workflow SUCCESS. No "too old resource version" error and no "fail in k8s" result.

SbloodyS

This still does not fully replace the old Watcher.onClose failure path. In Fabric8 6.0, SharedIndexInformer.start() completes from Reflector.listSyncAndWatch(), but Reflector does not compose the watch future returned by startWatcher(); later non-HttpGone watch closures only set running=false and do not complete this start future exceptionally. For non-timeout tasks, awaitJobCompletion() can therefore block forever instead of failing the task as the old onClose(WatcherException) did. We need an explicit monitor/failure path for informer/watch stopping, or another way to count down the latch when the informer can no longer observe the Job.

det101 · 2026-06-24T03:27:28Z

Hi, to address the concern that awaitJobCompletion() may block forever when the informer stops observing the Job in Fabric8 6.0, my approach is:

Primary: SharedIndexInformer handles ADD/UPDATE/DELETE.
Safety net: poll Job status via GET every 30s; count down the latch on terminal state or Job deletion if informer events are missed.
We intentionally do not fail on isWatching()==false to avoid false failures during relist gaps or while the Job is still running. Task timeout remains the final fallback.
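The two-path design described above (informer events as the primary signal, a periodic GET as the safety net) can be sketched as a single completion signal that both paths feed. This is a minimal illustration using only the JDK; `JobCompletionSignal`, `Phase`, and `observe` are hypothetical names, not classes from this PR or from Fabric8, and the mapping from a real Job status to a `Phase` is assumed to happen elsewhere.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: one latch shared by the informer handler and the GET poller.
final class JobCompletionSignal {
    // Assumed simplification of a Job's observable state.
    enum Phase { RUNNING, SUCCEEDED, FAILED, DELETED }

    private final CountDownLatch latch = new CountDownLatch(1);
    private final AtomicReference<Phase> terminal = new AtomicReference<>();

    /** Called by both paths: informer ADD/UPDATE/DELETE callbacks and the periodic poll. */
    void observe(Phase phase) {
        if (phase == Phase.RUNNING) {
            return; // non-terminal observation: keep waiting
        }
        // The first terminal observation wins; duplicates from the other path are ignored.
        if (terminal.compareAndSet(null, phase)) {
            latch.countDown();
        }
    }

    /** Waits up to the given time; returns the terminal phase, or null if none was seen. */
    Phase await(long timeout, TimeUnit unit) {
        try {
            latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return terminal.get();
    }
}
```

Because `observe` is idempotent once a terminal phase is recorded, it does not matter whether the informer or the poll reports completion first, which is what lets the poll act purely as a backstop.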

Does this approach work for you?

SbloodyS · 2026-06-27T06:18:22Z

I agree that polling the Job status via GET is a useful safety net when informer events are missed and the Kubernetes API is still reachable.

However, I think this still does not fully cover the old Watcher.onClose failure path. In the current implementation, poll failures are only logged and do not count down the latch. Also, task timeout is only a fallback when the timeout strategy is FAILED or WARNFAILED; for tasks without those timeout strategies, awaitJobCompletion() can still block indefinitely if the informer stops delivering events and GET keeps failing.

Could we add a bounded failure policy for continuous polling errors, or another explicit fatal/stopped informer path, so the task can fail instead of waiting forever? A unit test for “informer started, no terminal event, GET keeps failing, no timeout strategy” would also help cover this case.

Co-authored-by: Cursor <cursoragent@cursor.com>

Use runnableInformer with start().whenComplete() for startup failure handling, check terminal status on onAdd, align event log format, and expand unit tests. Keep Fabric8 at 6.0.0 instead of 6.4.0: the BOM upgrade affects all K8s client call sites (API cluster management, datasource, Spark on K8s, task execution) and tightens kubeconfig validation, which breaks unrelated flows such as cluster update when config is re-validated. The 6.4 stopped() API is not available on 6.0; startup errors are covered by start().whenComplete(), and watch relist handles too old resource version at runtime.
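As a rough illustration of the `start().whenComplete()` idea in the commit message above, the sketch below attaches a callback to a start future and releases the waiting task only when startup fails. It uses a plain JDK `CompletionStage<Void>` as a stand-in for whatever the informer's `start()` returns; `StartupFailureHook` and `failFastOnStartupError` are hypothetical names, not part of the PR or of Fabric8.

```java
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: surface an informer startup error to the thread waiting on the latch.
final class StartupFailureHook {
    private StartupFailureHook() {
    }

    /**
     * Registers a completion callback on the start future (a stand-in for the informer's).
     * On exceptional completion the error is recorded and the latch is released;
     * on normal completion nothing happens and the task keeps waiting for Job events.
     */
    static void failFastOnStartupError(CompletionStage<Void> startFuture,
                                       CountDownLatch latch,
                                       AtomicReference<Throwable> failure) {
        startFuture.whenComplete((ignored, error) -> {
            if (error != null) {
                failure.set(error);
                latch.countDown();
            }
        });
    }
}
```

This covers only failures that complete the start future exceptionally; as the review discussion notes, a watch that stops later needs a separate failure path.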

det101 · 2026-06-29T01:33:33Z

Good catch — you're right that logging poll errors alone doesn't cover the old Watcher.onClose fatal path, and without a FAILED/WARNFAILED timeout strategy awaitJobCompletion() could block indefinitely.
I've added a bounded failure policy: after 3 consecutive GET poll failures (30s interval, ~90s total), the task fails and counts down the latch. A successful GET resets the counter, so transient errors don't immediately fail the task. This mirrors the intent of the old watcher close path when the API stays unreachable.
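The counter behaviour described above (fail after 3 consecutive GET failures, reset on any success) can be sketched roughly as follows. `PollFailurePolicy` and its method names are hypothetical and only illustrate the policy; the actual implementation in the PR may be structured differently.

```java
// Hypothetical sketch of a bounded failure policy for the GET safety-net poll.
final class PollFailurePolicy {
    private final int maxConsecutiveFailures;
    private int consecutiveFailures;

    PollFailurePolicy(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** A successful GET resets the counter, so transient errors do not accumulate. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records a failed GET.
     * Returns true once the limit is reached, meaning the caller should
     * mark the task as failed and count down the latch.
     */
    synchronized boolean recordFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }
}
```

With a limit of 3 and a 30s poll interval, the caller would give up roughly 90s after the API becomes unreachable, matching the figures in the comment above.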

ruanwenjun

Please don't put all the implementation into a single class, as it makes the code harder to maintain. Consider introducing a dedicated K8sJobMonitor class to monitor the job status after submission, which would help keep the responsibilities better separated.

det101 requested review from Gallardot, SbloodyS and caishunfeng as code owners June 17, 2026 06:28

det101 force-pushed the fix-17330-k8s-stale-resource-version branch from 746825e to b65a890 Compare June 17, 2026 07:12

SbloodyS closed this Jun 17, 2026

SbloodyS reopened this Jun 17, 2026

github-actions Bot added backend test labels Jun 17, 2026

github-actions Bot assigned det101 Jun 17, 2026

ruanwenjun reviewed Jun 18, 2026

Comment thread ...-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/k8s/impl/K8sTaskExecutor.java Outdated

Comment thread ...-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/k8s/impl/K8sTaskExecutor.java Outdated

det101 force-pushed the fix-17330-k8s-stale-resource-version branch from b65a890 to ab3aeef Compare June 22, 2026 06:08

SbloodyS requested changes Jun 22, 2026

ruanwenjun force-pushed the fix-17330-k8s-stale-resource-version branch from fc4f603 to 273ecec Compare June 22, 2026 08:41

det101 force-pushed the fix-17330-k8s-stale-resource-version branch 2 times, most recently from fc72e84 to 38650b9 Compare June 23, 2026 05:52

SbloodyS added this to the 3.5.0 milestone Jun 24, 2026

SbloodyS added the improvement make more easy to user or prompt friendly label Jun 24, 2026

SbloodyS reviewed Jun 24, 2026

github-advanced-security AI found potential problems Jun 24, 2026

Comment thread ...-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/k8s/impl/K8sTaskExecutor.java Fixed

det101 requested review from SbloodyS and ruanwenjun June 26, 2026 09:07

luxl and others added 4 commits June 29, 2026 09:33

[Improvement-17330][K8s] Replace job watcher with informer

5b15fa3

Co-authored-by: Cursor <cursoragent@cursor.com>

[Improvement-17330][K8s] Add GET job status polling as informer safet…

a80da0c

…y net

Remove unused taskInstanceId from job watcher path

7f8bb62

[Improvement-17330][K8s] Fail task after consecutive GET poll failures

9d67b03

Co-authored-by: Cursor <cursoragent@cursor.com>

det101 force-pushed the fix-17330-k8s-stale-resource-version branch from 5b588a3 to 9d67b03 Compare June 29, 2026 01:33

det101 closed this Jun 30, 2026

det101 reopened this Jun 30, 2026

ruanwenjun reviewed Jun 30, 2026

det101 wants to merge 5 commits into apache:dev from det101:fix-17330-k8s-stale-resource-version

5 participants