[Improvement-17330][K8s] Replace job watcher with informer#18358
[Improvement-17330][K8s] Replace job watcher with informer#18358det101 wants to merge 5 commits into
Conversation
746825e to
b65a890
Compare
|
b65a890 to
ab3aeef
Compare
fc4f603 to
273ecec
Compare
fc72e84 to
38650b9
Compare
|
Upgrading Fabric8 from version 6.4 to 6.0: The BOM upgrade was found to affect all Kubernetes clients, resulting in a significant impact. Modifications were made, but still based on version 6.0. @SbloodyS @ruanwenjun |
SbloodyS
left a comment
There was a problem hiding this comment.
This still does not fully replace the old Watcher.onClose failure path. In Fabric8 6.0, SharedIndexInformer.start() completes from Reflector.listSyncAndWatch(), but Reflector does not compose the watch future returned by startWatcher(); later non-HttpGone watch closures only set running=false and do not complete this start future exceptionally. For non-timeout tasks, awaitJobCompletion() can therefore block forever instead of failing the task as the old onClose(WatcherException) did. We need an explicit monitor/failure path for informer/watch stopping, or another way to count down the latch when the informer can no longer observe the Job.
Hi, to address the concern that awaitJobCompletion() may block forever when the informer stops observing the Job in Fabric8 6.0, my approach is: Primary: SharedIndexInformer handles ADD/UPDATE/DELETE. Does this approach work for you? |
|
I agree that polling the Job status via GET is a useful safety net when informer events are missed and the Kubernetes API is still reachable. However, I think this still does not fully cover the old Could we add a bounded failure policy for continuous polling errors, or another explicit fatal/stopped informer path, so the task can fail instead of waiting forever? A unit test for “informer started, no terminal event, GET keeps failing, no timeout strategy” would also help cover this case. |
Co-authored-by: Cursor <cursoragent@cursor.com>
Use runnableInformer with start().whenComplete() for startup failure handling, check terminal status on onAdd, align event log format, and expand unit tests. Keep Fabric8 at 6.0.0 instead of 6.4.0: the BOM upgrade affects all K8s client call sites (API cluster management, datasource, Spark on K8s, task execution) and tightens kubeconfig validation, which breaks unrelated flows such as cluster update when config is re-validated. The 6.4 stopped() API is not available on 6.0; startup errors are covered by start().whenComplete(), and watch relist handles too old resource version at runtime.
Co-authored-by: Cursor <cursoragent@cursor.com>
5b588a3 to
9d67b03
Compare
|
Good catch — you're right that logging poll errors alone doesn't cover the old |
ruanwenjun
left a comment
There was a problem hiding this comment.
Please don't put all the implementation into a single class, as it makes the code harder to maintain. Consider introducing a dedicated K8sJobMonitor class to monitor the job status after submission, which would help keep the responsibilities better separated.




Was this PR generated or assisted by AI?
YES
Purpose of the pull request
fix #17330
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
./mvnw -pl dolphinscheduler-task-plugin/dolphinscheduler-task-api clean test -Dtest=K8sTaskExecutorTest
busyboxshort task (sleep 10) succeeds end-to-endsleep 2400) does not fail withtoo old resource version([Improvement][K8s] too old resource version #17330)Pull Request Notice
Pull Request Notice
If your pull request contains incompatible change, you should also add it to
docs/docs/en/guide/upgrade/incompatible.md