Skip to content

[Improvement-17330][K8s] Replace job watcher with informer#18358

Open
det101 wants to merge 5 commits into
apache:devfrom
det101:fix-17330-k8s-stale-resource-version
Open

[Improvement-17330][K8s] Replace job watcher with informer#18358
det101 wants to merge 5 commits into
apache:devfrom
det101:fix-17330-k8s-stale-resource-version

Conversation

@det101

@det101 det101 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Was this PR generated or assisted by AI?

YES

Purpose of the pull request

fix #17330

Brief change log

Verify this pull request

This change added tests and can be verified as follows:

./mvnw -pl dolphinscheduler-task-plugin/dolphinscheduler-task-api clean test -Dtest=K8sTaskExecutorTest

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

@det101 det101 force-pushed the fix-17330-k8s-stale-resource-version branch from 746825e to b65a890 Compare June 17, 2026 07:12
@SbloodyS SbloodyS closed this Jun 17, 2026
@SbloodyS SbloodyS reopened this Jun 17, 2026
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 60%)

See analysis details on SonarQube Cloud

@det101 det101 force-pushed the fix-17330-k8s-stale-resource-version branch from b65a890 to ab3aeef Compare June 22, 2026 06:08
@ruanwenjun ruanwenjun force-pushed the fix-17330-k8s-stale-resource-version branch from fc4f603 to 273ecec Compare June 22, 2026 08:41
@det101 det101 force-pushed the fix-17330-k8s-stale-resource-version branch 2 times, most recently from fc72e84 to 38650b9 Compare June 23, 2026 05:52
@det101

det101 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Upgrading Fabric8 from version 6.4 to 6.0: The BOM upgrade was found to affect all Kubernetes clients, resulting in a significant impact. Modifications were made, but still based on version 6.0. @SbloodyS @ruanwenjun

@det101

det101 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Manual verification (minikube)
Tested on standalone + minikube with busybox:1.30.1, two sequential K8S tasks.

Short task (sleep 15): Job submitted; informer logged event received, job: ..., action: ADD/UPDATE; terminal status 0 → succeed in k8s. Task SUCCESS.

Long task (sleep 2400, ~40 min): Informer kept receiving ADD/UPDATE for the full run; pod finished after 40 min; status 0 → succeed in k8s. Workflow SUCCESS. No too old resource version or fail in k8s.

image image

@SbloodyS SbloodyS added this to the 3.5.0 milestone Jun 24, 2026
@SbloodyS SbloodyS added the improvement make more easy to user or prompt friendly label Jun 24, 2026

@SbloodyS SbloodyS left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still does not fully replace the old Watcher.onClose failure path. In Fabric8 6.0, SharedIndexInformer.start() completes from Reflector.listSyncAndWatch(), but Reflector does not compose the watch future returned by startWatcher(); later non-HttpGone watch closures only set running=false and do not complete this start future exceptionally. For non-timeout tasks, awaitJobCompletion() can therefore block forever instead of failing the task as the old onClose(WatcherException) did. We need an explicit monitor/failure path for informer/watch stopping, or another way to count down the latch when the informer can no longer observe the Job.

@det101

det101 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

This still does not fully replace the old Watcher.onClose failure path. In Fabric8 6.0, SharedIndexInformer.start() completes from Reflector.listSyncAndWatch(), but Reflector does not compose the watch future returned by startWatcher(); later non-HttpGone watch closures only set running=false and do not complete this start future exceptionally. For non-timeout tasks, awaitJobCompletion() can therefore block forever instead of failing the task as the old onClose(WatcherException) did. We need an explicit monitor/failure path for informer/watch stopping, or another way to count down the latch when the informer can no longer observe the Job.

Hi, to address the concern that awaitJobCompletion() may block forever when the informer stops observing the Job in Fabric8 6.0, my approach is:

Primary: SharedIndexInformer handles ADD/UPDATE/DELETE.
Safety net: poll Job status via GET every 30s; count down the latch on terminal state or Job deletion if informer events are missed.
We intentionally do not fail on isWatching()==false to avoid false failures during relist gaps or while the Job is still running. Task timeout remains the final fallback.

Does this approach work for you?

@det101 det101 requested review from SbloodyS and ruanwenjun June 26, 2026 09:07
@SbloodyS

Copy link
Copy Markdown
Member

I agree that polling the Job status via GET is a useful safety net when informer events are missed and the Kubernetes API is still reachable.

However, I think this still does not fully cover the old Watcher.onClose failure path. In the current implementation, poll failures are only logged and do not count down the latch. Also, task timeout is only a fallback when the timeout strategy is FAILED or WARNFAILED; for tasks without those timeout strategies, awaitJobCompletion() can still block indefinitely if the informer stops delivering events and GET keeps failing.

Could we add a bounded failure policy for continuous polling errors, or another explicit fatal/stopped informer path, so the task can fail instead of waiting forever? A unit test for “informer started, no terminal event, GET keeps failing, no timeout strategy” would also help cover this case.

luxl and others added 4 commits June 29, 2026 09:33
Co-authored-by: Cursor <cursoragent@cursor.com>
Use runnableInformer with start().whenComplete() for startup failure handling,
check terminal status on onAdd, align event log format, and expand unit tests.

Keep Fabric8 at 6.0.0 instead of 6.4.0: the BOM upgrade affects all K8s client
call sites (API cluster management, datasource, Spark on K8s, task execution) and
tightens kubeconfig validation, which breaks unrelated flows such as cluster
update when config is re-validated. The 6.4 stopped() API is not available on
6.0; startup errors are covered by start().whenComplete(), and watch relist
handles too old resource version at runtime.
@det101 det101 force-pushed the fix-17330-k8s-stale-resource-version branch from 5b588a3 to 9d67b03 Compare June 29, 2026 01:33
@det101

det101 commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Good catch — you're right that logging poll errors alone doesn't cover the old Watcher.onClose fatal path, and without a FAILED/WARNFAILED timeout strategy awaitJobCompletion() could block indefinitely.
I've added a bounded failure policy: after 3 consecutive GET poll failures (30s interval, ~90s total), the task fails and counts down the latch. A successful GET resets the counter, so transient errors don't immediately fail the task. This mirrors the intent of the old watcher close path when the API stays unreachable.

@det101 det101 closed this Jun 30, 2026
@det101 det101 reopened this Jun 30, 2026

@ruanwenjun ruanwenjun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't put all the implementation into a single class, as it makes the code harder to maintain. Consider introducing a dedicated K8sJobMonitor class to monitor the job status after submission, which would help keep the responsibilities better separated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend improvement make more easy to user or prompt friendly test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Improvement][K8s] too old resource version

5 participants