Skip to content

[Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode#707

Merged
hehe7318 merged 4 commits into
aws:developfrom
hehe7318:wip/clean-all-nodes-when-cluster-stop
Jun 15, 2026
Merged

[Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode#707
hehe7318 merged 4 commits into
aws:developfrom
hehe7318:wip/clean-all-nodes-when-cluster-stop

Conversation

@hehe7318

@hehe7318 hehe7318 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Description of changes

_start_partitions now resets the nodes of INACTIVE partitions before bringing partitions back UP, via a new reset_nodes_in_inactive_partitions helper in slurm_commands. Only INACTIVE partitions are reset, so nodes of active partitions (which may be running jobs) are not touched. The stop path is unchanged.

Why: when protected mode disables a partition, it sets it INACTIVE but does not reset its nodes in the same iteration (cleanup is normally done on a later clustermgtd iteration). Dynamic nodes left powering up without a backing instance can therefore keep a stale nodeaddr. Starting the fleet brought partitions back UP without resetting these nodes, so right after start clustermgtd saw them again as bootstrap failures and sent the cluster straight back into protected mode. As documented, starting the fleet alone should be enough to recover from protected mode without requiring a fleet stop; resetting these nodes on start makes that reliable.

Tests

  • Unit test passed as expected.
  • Integ-test test_slurm_custom_partitions passed.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

log.info("Setting slurm partitions to INACTIVE and terminating all compute nodes...")
update_all_partitions(PartitionStatus.INACTIVE, reset_node_addrs_hostname=True)
# Reset nodes for all partitions, even those already INACTIVE: protected mode may have left dirty nodes.
update_all_partitions(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our public doc we declare it is enough to start the fleet to recover from the protected mode.

See https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-protected-mode-v3.html#slurm-protected-mode-exit-v3

This is the correct expectation because we do not want to force the user to stop the fleet just to recover one queue from protected mode.

However, it seems from the code change that the fleet stop is required to make the recovery reliable.
Is it possible to make the start fleet able to recover without the need for the stop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, we may want to move the logic from cluster.stop() to cluster.start().
For example during the cluster.start(), we reset all the nodes in INACTIVE partitions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in new version!

…ition nodes

When protected mode disables a partition it sets the partition to INACTIVE but does
not reset its nodes in the same iteration; that cleanup is normally done by a later
clustermgtd iteration. Dynamic nodes that were powering up without a backing instance
can therefore be left with a stale nodeaddr. Starting the compute fleet brought the
partitions back UP without resetting these nodes, so right after start clustermgtd saw
them again as bootstrap failures and sent the cluster straight back into protected
mode, leaving the partitions INACTIVE.

As documented, starting the fleet alone should be enough to recover from protected
mode. Reset the nodes of INACTIVE partitions on fleet
start, before bringing the partitions back UP. Only INACTIVE partitions are reset, so
nodes of active partitions that may be running jobs are left untouched.
@hehe7318 hehe7318 force-pushed the wip/clean-all-nodes-when-cluster-stop branch from fdbecf1 to bfaff17 Compare June 11, 2026 19:36
@hehe7318 hehe7318 changed the title [Node] Reset all partition nodes on cluster stop [Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode Jun 11, 2026
),
],
)
def test_reset_nodes_in_inactive_partitions(mock_partitions, expected_idle_call, mocker):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Testing] It is important that the function reset_nodes_in_inactive_partitions is executed before update_all_partitions(PartitionStatus.UP). Can we capture the ordering with unit test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

def _start_partitions():
log.info("Setting slurm partitions to UP and resuming nodes...")
reset_nodes_in_inactive_partitions()
update_all_partitions(PartitionStatus.UP, reset_node_addrs_hostname=False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We explicitly set reset_node_addrs_hostname=False. which seems the opposite of what we are doing by adding reset_nodes_in_inactive_partitions. Why do we think this is not a contradiction and it is fine?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_all_partitions(UP, reset_node_addrs_hostname=False) only flips the partition state and intentionally leaves node addrs alone. And the node reset is a separate, explicit step (reset_nodes_in_inactive_partitions) scoped only to INACTIVE partitions and executed before the partitions are brought back UP.

So not a contradiction.

Comment thread src/common/schedulers/slurm_commands.py Outdated
]
if inactive_nodes:
log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes)
set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function set_nodes_idle sets the state of nodes in INACTIVE partitions to resume. If I am not wrong this is the first time we set resume on inactive partitions. What is the state of those nodes before we resume those? what is the effect of setting them idle when the partition is inactive?

@hehe7318 hehe7318 Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, now I switched to reset_nodes(state="down"), same as clustermgtd's _reset_nodes_in_inactive_partitions does for inactive partitions. Also added needs_reset_when_inactive() filtering so that only the nodes that actually need it (e.g. nodeaddr still set) are reset, again to keep the behavior aligned with clustermgtd's inactive cleanup.

Because I saw

# Setting to down and not power_down cause while inactive power_down doesn't seem to be applied

The comment there says power_down doesn't seem to be applied while the partition is inactive, so my previous change had a chance of not working as well.

Comment thread src/common/schedulers/slurm_commands.py Outdated
]
if inactive_nodes:
log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes)
set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible for inactive_nodes to contain empty elements?
If possible then we must put a safeguard because empty elements could produce a malformed comma separated list.

Example:

inactive_nodes = ["node1,node2", "", "node4,node5"]
",".join(inactive_nodes) -> "node1,node2,,node3,node4"  (two commas with no elements inside)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed, partitions with empty nodenames are skipped, and the list now goes through get_nodes_info + per-node filtering. Added a unit test for the empty-partition case.

…per-node filtering

Set the nodes to 'down' instead of 'resume': while a partition is INACTIVE, Slurm does
not seem to apply power-state transitions (clustermgtd's inactive cleanup uses 'down'
for the same reason). Once the partitions are brought back UP, clustermgtd reconciles
these down nodes to powered_down.

Filter the nodes through needs_reset_when_inactive() instead of resetting all nodes of
the inactive partitions, and skip partitions with no nodes to avoid building a
malformed node list. Also assert in unit tests that the reset runs before partitions
are brought back UP.
for part in get_partitions_info()
if PartitionStatus(part.state) == PartitionStatus.INACTIVE and part.nodenames
]
if not inactive_partition_nodes:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a log line saying that there is no nodes in inactive partitions to reset?
I know it may seem a minor, but considering that we are adding this logic to fix a race condition, it may be useuful for the troubleshooting to know exactly what the daemons sees when it checks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@hehe7318 hehe7318 merged commit aca9d7b into aws:develop Jun 15, 2026
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants