[Node] Reset inactive partition nodes on fleet start to ensure recovering from protected mode#707
Conversation
| log.info("Setting slurm partitions to INACTIVE and terminating all compute nodes...") | ||
| update_all_partitions(PartitionStatus.INACTIVE, reset_node_addrs_hostname=True) | ||
| # Reset nodes for all partitions, even those already INACTIVE: protected mode may have left dirty nodes. | ||
| update_all_partitions( |
There was a problem hiding this comment.
In our public doc we declare it is enough to start the fleet to recover from the protected mode.
This is the correct expectation because we do not want to force the user to stop the fleet just to recover one queue from protected mode.
However, it seems from the code change that the fleet stop is required to make the recovery reliable.
Is it possible to make the start fleet able to recover without the need for the stop?
There was a problem hiding this comment.
That's a good question, we may want to move the logic from cluster.stop() to cluster.start().
For example during the cluster.start(), we reset all the nodes in INACTIVE partitions.
There was a problem hiding this comment.
Done in new version!
…ition nodes When protected mode disables a partition it sets the partition to INACTIVE but does not reset its nodes in the same iteration; that cleanup is normally done by a later clustermgtd iteration. Dynamic nodes that were powering up without a backing instance can therefore be left with a stale nodeaddr. Starting the compute fleet brought the partitions back UP without resetting these nodes, so right after start clustermgtd saw them again as bootstrap failures and sent the cluster straight back into protected mode, leaving the partitions INACTIVE. As documented, starting the fleet alone should be enough to recover from protected mode. Reset the nodes of INACTIVE partitions on fleet start, before bringing the partitions back UP. Only INACTIVE partitions are reset, so nodes of active partitions that may be running jobs are left untouched.
fdbecf1 to
bfaff17
Compare
| ), | ||
| ], | ||
| ) | ||
| def test_reset_nodes_in_inactive_partitions(mock_partitions, expected_idle_call, mocker): |
There was a problem hiding this comment.
[Testing] It is important that the function reset_nodes_in_inactive_partitions is executed before update_all_partitions(PartitionStatus.UP). Can we capture the ordering with unit test?
| def _start_partitions(): | ||
| log.info("Setting slurm partitions to UP and resuming nodes...") | ||
| reset_nodes_in_inactive_partitions() | ||
| update_all_partitions(PartitionStatus.UP, reset_node_addrs_hostname=False) |
There was a problem hiding this comment.
We explicitly set reset_node_addrs_hostname=False. which seems the opposite of what we are doing by adding reset_nodes_in_inactive_partitions. Why do we think this is not a contradiction and it is fine?
There was a problem hiding this comment.
update_all_partitions(UP, reset_node_addrs_hostname=False) only flips the partition state and intentionally leaves node addrs alone. And the node reset is a separate, explicit step (reset_nodes_in_inactive_partitions) scoped only to INACTIVE partitions and executed before the partitions are brought back UP.
So not a contradiction.
| ] | ||
| if inactive_nodes: | ||
| log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes) | ||
| set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True) |
There was a problem hiding this comment.
The function set_nodes_idle sets the state of nodes in INACTIVE partitions to resume. If I am not wrong this is the first time we set resume on inactive partitions. What is the state of those nodes before we resume those? what is the effect of setting them idle when the partition is inactive?
There was a problem hiding this comment.
Good catch, now I switched to reset_nodes(state="down"), same as clustermgtd's _reset_nodes_in_inactive_partitions does for inactive partitions. Also added needs_reset_when_inactive() filtering so that only the nodes that actually need it (e.g. nodeaddr still set) are reset, again to keep the behavior aligned with clustermgtd's inactive cleanup.
Because I saw
The comment there says power_down doesn't seem to be applied while the partition is inactive, so my previous change had a chance of not working as well.
| ] | ||
| if inactive_nodes: | ||
| log.info("Resetting nodes of INACTIVE partitions: %s", inactive_nodes) | ||
| set_nodes_idle(",".join(inactive_nodes), reset_node_addrs_hostname=True) |
There was a problem hiding this comment.
Would it be possible for inactive_nodes to contain empty elements?
If possible then we must put a safeguard because empty elements could produce a malformed comma separated list.
Example:
inactive_nodes = ["node1,node2", "", "node4,node5"]
",".join(inactive_nodes) -> "node1,node2,,node3,node4" (two commas with no elements inside)
There was a problem hiding this comment.
Thanks! Fixed, partitions with empty nodenames are skipped, and the list now goes through get_nodes_info + per-node filtering. Added a unit test for the empty-partition case.
…per-node filtering Set the nodes to 'down' instead of 'resume': while a partition is INACTIVE, Slurm does not seem to apply power-state transitions (clustermgtd's inactive cleanup uses 'down' for the same reason). Once the partitions are brought back UP, clustermgtd reconciles these down nodes to powered_down. Filter the nodes through needs_reset_when_inactive() instead of resetting all nodes of the inactive partitions, and skip partitions with no nodes to avoid building a malformed node list. Also assert in unit tests that the reset runs before partitions are brought back UP.
| for part in get_partitions_info() | ||
| if PartitionStatus(part.state) == PartitionStatus.INACTIVE and part.nodenames | ||
| ] | ||
| if not inactive_partition_nodes: |
There was a problem hiding this comment.
Could you please add a log line saying that there is no nodes in inactive partitions to reset?
I know it may seem a minor, but considering that we are adding this logic to fix a race condition, it may be useuful for the troubleshooting to know exactly what the daemons sees when it checks
Description of changes
_start_partitionsnow resets the nodes of INACTIVE partitions before bringing partitions back UP, via a newreset_nodes_in_inactive_partitionshelper inslurm_commands. Only INACTIVE partitions are reset, so nodes of active partitions (which may be running jobs) are not touched. The stop path is unchanged.Why: when protected mode disables a partition, it sets it INACTIVE but does not reset its nodes in the same iteration (cleanup is normally done on a later clustermgtd iteration). Dynamic nodes left powering up without a backing instance can therefore keep a stale nodeaddr. Starting the fleet brought partitions back UP without resetting these nodes, so right after start clustermgtd saw them again as bootstrap failures and sent the cluster straight back into protected mode. As documented, starting the fleet alone should be enough to recover from protected mode without requiring a fleet stop; resetting these nodes on start makes that reliable.
Tests
test_slurm_custom_partitionspassed.Checklist
developadd the branch name as prefix in the PR title (e.g.[release-3.6]).Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.