net: don't constantly post notifications to clients until they send#2534
Open
hawkw wants to merge 1 commit into
Open
net: don't constantly post notifications to clients until they send#2534hawkw wants to merge 1 commit into
hawkw wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently, as described in #2532, if the system is connected to a management network, the
nettask posts notifications to its client tasks somewhat incessantly until the first time they actually send a packet. These notifications are spurious. Ultimately, this behavior was part of the root cause of #2528: due to a couple bugs inhiffy, the spurious notifications could preventtask-hiffyfrom checking whether a debug probe has attempted to kick it into executing a HIF script whenhiffyis compiled with network RPC support. PR #2530 fixed the bugs inhiffy, so it now behaves correctly no matter how many timesnetnotifies it. However, I still wanted to get to the bottom of the spurious notifications.The details of my adventures debugging this can be found in issue #2532. Essentially, what's going on here is that, when
netis woken by IP activity, it calls a function calledwake_sockets(), which is what delivers notifications to client tasks.wake_sockets()will notify a client task if either of the following conditions is true:can_recv()insmoltcpreturnstrue)can_send()insmoltcpreturnstrue) and a flag in an array calledclient_waiting_to_sendis set for that socket's index. This is intended to only notify clients if they have previously attempted a send and the buffer was full, so that they can try again. We don't want to notify clients any time their TX buffer has space, as they may not actually care at that point in time.Here's where we arrive at the bug. As I described in this comment, the values of
client_waiting_to_sendis initialized totruefor every socket when the net task starts. As the comment from @cbiffle states,Notifying every socket when the
nettask restarts is necessary because a client may have attempted to send a packet, found that the buffer was full, and then gone intosys_recv_notificationto wait for send capacity, and then thenettask panicked. When thenettask starts back up, it doesn't know that client task was waiting for send capacity, and therefore would never wake it. Thus, we must assume that every task is waiting to send when we restart.However, the current behavior results in spurious notifications because the flag in
client_waiting_to_sendnever gets cleared until that client actually does asend_packetsuccessfully, as I described here. In almost all the management network communication protocols we use today, the SP is never the initiator, so in general, most tasks will not try to send packets on their own in the absence of a request from the control plane orhumility. This means that when thenettask starts up for the first time, and no tasks were waiting on send capacity from a previousnetgeneration, it will notify them any time it sees IP activity. If the client is not waiting to send, it will generally handle the notification by trying to callrecv_packet, and then it gets backErr(QueueEmpty), goes "huh, I guess there's nothing for me", and go back to waiting for a notification. Then, the next time there's any IP activity on the management network, the whole thing will happen again, because theclient_waiting_to_sendflags never get cleared until the task actually does a send.This doesn't manifest as much in the product, because the control plane will periodically talk to the SP on both its
control-plane-agentsocket (for polling sensor data) and itssnitchsocket (for collecting ereports). The first time the control plane sends the SP a request and the SP sends a packet in response, theclient_waiting_to_sendflag gets cleared and that task gets out of the spurious notification state pretty quickly. But on a bench Cosmo connected to a Minibar where nothing is actually trying to talk to it over the management network, thenettask basically just keeps waking everyone forever. Most tasks are more or less resilient to this, but (as seen withhiffyin #2528), it can trigger weird bugs that are sensitive to spurious notifications.This commit fixes it in the simplest possible way. Instead of initializing
client_waiting_to_sendtotrue, we initialize it tofalseand only set the flag if the client actually callssend_packet(). To ensure that everyone is woken up whennetrestarts in case they were waiting in asend_packet()whennetpanics, we just blast out a notification to everyone immediately upon startup, exactly one time. This doesn't need to check if there's actually space in the TX buffers, because, well, thenettask just started, so we know there's nothing in the buffer. I'm pretty sure this should be totally fine: if a client task is waiting for send capacity after all this happens, and it restarts, it will have to go all the way back through whatever code path led to it sendingsend_packet()and encountering the full buffer again, soclient_waiting_in_sendwill be set by then anyway. But, I'd love confirmation from @cbiffle that this is correct.I did some testing to validate the fix, which can be read here.
Fixes #2532