[Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode by hehe7318 · Pull Request #707 · aws/aws-parallelcluster-node

hehe7318 · 2026-06-10T16:04:13Z

Description of changes

_start_partitions now resets the nodes of INACTIVE partitions before bringing partitions back UP, via a new reset_nodes_in_inactive_partitions helper in slurm_commands. Only INACTIVE partitions are reset, so nodes of active partitions (which may be running jobs) are not touched. The stop path is unchanged.

Why: when protected mode disables a partition, it sets it INACTIVE but does not reset its nodes in the same iteration (cleanup is normally done on a later clustermgtd iteration). Dynamic nodes left powering up without a backing instance can therefore keep a stale nodeaddr. Starting the fleet brought partitions back UP without resetting these nodes, so right after start clustermgtd saw them again as bootstrap failures and sent the cluster straight back into protected mode. As documented, starting the fleet alone should be enough to recover from protected mode without requiring a fleet stop; resetting these nodes on start makes that reliable.

Tests

Unit test passed as expected.
Integ-test test_slurm_custom_partitions passed.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani · 2026-06-10T16:18:31Z

    log.info("Setting slurm partitions to INACTIVE and terminating all compute nodes...")
-    update_all_partitions(PartitionStatus.INACTIVE, reset_node_addrs_hostname=True)
+    # Reset nodes for all partitions, even those already INACTIVE: protected mode may have left dirty nodes.
+    update_all_partitions(


In our public doc we declare it is enough to start the fleet to recover from the protected mode.

See https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-protected-mode-v3.html#slurm-protected-mode-exit-v3

This is the correct expectation because we do not want to force the user to stop the fleet just to recover one queue from protected mode.

However, it seems from the code change that the fleet stop is required to make the recovery reliable.
Is it possible to make the start fleet able to recover without the need for the stop?

That's a good question, we may want to move the logic from cluster.stop() to cluster.start().
For example during the cluster.start(), we reset all the nodes in INACTIVE partitions.

Done in new version!

…ition nodes When protected mode disables a partition it sets the partition to INACTIVE but does not reset its nodes in the same iteration; that cleanup is normally done by a later clustermgtd iteration. Dynamic nodes that were powering up without a backing instance can therefore be left with a stale nodeaddr. Starting the compute fleet brought the partitions back UP without resetting these nodes, so right after start clustermgtd saw them again as bootstrap failures and sent the cluster straight back into protected mode, leaving the partitions INACTIVE. As documented, starting the fleet alone should be enough to recover from protected mode. Reset the nodes of INACTIVE partitions on fleet start, before bringing the partitions back UP. Only INACTIVE partitions are reset, so nodes of active partitions that may be running jobs are left untouched.

gmarciani · 2026-06-11T20:26:03Z

+        ),
+    ],
+)
+def test_reset_nodes_in_inactive_partitions(mock_partitions, expected_idle_call, mocker):


[Testing] It is important that the function reset_nodes_in_inactive_partitions is executed before update_all_partitions(PartitionStatus.UP). Can we capture the ordering with unit test?

gmarciani · 2026-06-11T20:27:21Z

 def _start_partitions():
    log.info("Setting slurm partitions to UP and resuming nodes...")
+    reset_nodes_in_inactive_partitions()
    update_all_partitions(PartitionStatus.UP, reset_node_addrs_hostname=False)


We explicitly set reset_node_addrs_hostname=False. which seems the opposite of what we are doing by adding reset_nodes_in_inactive_partitions. Why do we think this is not a contradiction and it is fine?

update_all_partitions(UP, reset_node_addrs_hostname=False) only flips the partition state and intentionally leaves node addrs alone. And the node reset is a separate, explicit step (reset_nodes_in_inactive_partitions) scoped only to INACTIVE partitions and executed before the partitions are brought back UP.

So not a contradiction.

gmarciani · 2026-06-11T21:10:10Z

+        ]
+        if inactive_nodes:
+            log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes)
+            set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True)


The function set_nodes_idle sets the state of nodes in INACTIVE partitions to resume. If I am not wrong this is the first time we set resume on inactive partitions. What is the state of those nodes before we resume those? what is the effect of setting them idle when the partition is inactive?

Good catch, now I switched to reset_nodes(state="down"), same as clustermgtd's _reset_nodes_in_inactive_partitions does for inactive partitions. Also added needs_reset_when_inactive() filtering so that only the nodes that actually need it (e.g. nodeaddr still set) are reset, again to keep the behavior aligned with clustermgtd's inactive cleanup.

Because I saw

aws-parallelcluster-node/src/slurm_plugin/clustermgtd.py

Line 637 in d31f163

# Setting to down and not power_down cause while inactive power_down doesn't seem to be applied

The comment there says power_down doesn't seem to be applied while the partition is inactive, so my previous change had a chance of not working as well.

gmarciani · 2026-06-11T21:14:07Z

+        ]
+        if inactive_nodes:
+            log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes)
+            set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True)


Would it be possible for inactive_nodes to contain empty elements?
If possible then we must put a safeguard because empty elements could produce a malformed comma separated list.

Example:

inactive_nodes = ["node1,node2", "", "node4,node5"] ",".join(inactive_nodes) -> "node1,node2,,node3,node4" (two commas with no elements inside)

Thanks! Fixed, partitions with empty nodenames are skipped, and the list now goes through get_nodes_info + per-node filtering. Added a unit test for the empty-partition case.

…per-node filtering Set the nodes to 'down' instead of 'resume': while a partition is INACTIVE, Slurm does not seem to apply power-state transitions (clustermgtd's inactive cleanup uses 'down' for the same reason). Once the partitions are brought back UP, clustermgtd reconciles these down nodes to powered_down. Filter the nodes through needs_reset_when_inactive() instead of resetting all nodes of the inactive partitions, and skip partitions with no nodes to avoid building a malformed node list. Also assert in unit tests that the reset runs before partitions are brought back UP.

gmarciani · 2026-06-12T20:20:16Z

+            for part in get_partitions_info()
+            if PartitionStatus(part.state) == PartitionStatus.INACTIVE and part.nodenames
+        ]
+        if not inactive_partition_nodes:


Could you please add a log line saying that there is no nodes in inactive partitions to reset?
I know it may seem a minor, but considering that we are adding this logic to fix a race condition, it may be useuful for the troubleshooting to know exactly what the daemons sees when it checks

…tter debug

hehe7318 added the 3.x label Jun 10, 2026

hehe7318 requested review from a team as code owners June 10, 2026 16:04

hehe7318 mentioned this pull request Jun 10, 2026

[Test] Revert the test_slurm_custom_partitions test-only fix aws/aws-parallelcluster#7438

Open

gmarciani reviewed Jun 10, 2026

View reviewed changes

hehe7318 force-pushed the wip/clean-all-nodes-when-cluster-stop branch from fdbecf1 to bfaff17 Compare June 11, 2026 19:36

hehe7318 changed the title ~~[Node] Reset all partition nodes on cluster stop~~ [Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode Jun 11, 2026

Merge branch 'develop' into wip/clean-all-nodes-when-cluster-stop

f3b3f08

gmarciani reviewed Jun 11, 2026

View reviewed changes

gmarciani reviewed Jun 12, 2026

View reviewed changes

Add log when there is no nodes in inactive partitions to reset for be…

a167842

…tter debug

gmarciani approved these changes Jun 12, 2026

View reviewed changes

hehe7318 merged commit aca9d7b into aws:develop Jun 15, 2026
13 of 14 checks passed

Conversation

hehe7318 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Tests

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hehe7318 Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hehe7318 commented Jun 10, 2026 •

edited

Loading

hehe7318 Jun 12, 2026 •

edited

Loading