Skip to content

Commit c9c89dc

Browse files
jrfastabborkmann
authored andcommitted
bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
If copy_page_to_iter() fails or even partially completes, but with fewer bytes copied than expected we currently reset sg.start and return EFAULT. This proves problematic if we already copied data into the user buffer before we return an error. Because we leave the copied data in the user buffer and fail to unwind the scatterlist so kernel side believes data has been copied and user side believes data has _not_ been received. Expected behavior should be to return number of bytes copied and then on the next read we need to return the error assuming its still there. This can happen if we have a copy length spanning multiple scatterlist elements and one or more complete before the error is hit. The error is rare enough though that my normal testing with server side programs, such as nginx, httpd, envoy, etc., I have never seen this. The only reliable way to reproduce that I've found is to stream movies over my browser for a day or so and wait for it to hang. Not very scientific, but with a few extra WARN_ON()s in the code the bug was obvious. When we review the errors from copy_page_to_iter() it seems we are hitting a page fault from copy_page_to_iter_iovec() where the code checks fault_in_pages_writeable(buf, copy) where buf is the user buffer. It also seems typical server applications don't hit this case. The other way to try and reproduce this is run the sockmap selftest tool test_sockmap with data verification enabled, but it doesn't reproduce the fault. Perhaps we can trigger this case artificially somehow from the test tools. I haven't sorted out a way to do that yet though. Fixes: 604326b ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
1 parent 2acc3c1 commit c9c89dc

1 file changed

Lines changed: 9 additions & 6 deletions

File tree

net/ipv4/tcp_bpf.c

Lines changed: 9 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
1515
{
1616
struct iov_iter *iter = &msg->msg_iter;
1717
int peek = flags & MSG_PEEK;
18-
int i, ret, copied = 0;
1918
struct sk_msg *msg_rx;
19+
int i, copied = 0;
2020

2121
msg_rx = list_first_entry_or_null(&psock->ingress_msg,
2222
struct sk_msg, list);
@@ -37,11 +37,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
3737
page = sg_page(sge);
3838
if (copied + copy > len)
3939
copy = len - copied;
40-
ret = copy_page_to_iter(page, sge->offset, copy, iter);
41-
if (ret != copy) {
42-
msg_rx->sg.start = i;
43-
return -EFAULT;
44-
}
40+
copy = copy_page_to_iter(page, sge->offset, copy, iter);
41+
if (!copy)
42+
return copied ? copied : -EFAULT;
4543

4644
copied += copy;
4745
if (likely(!peek)) {
@@ -56,6 +54,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
5654
put_page(page);
5755
}
5856
} else {
57+
/* Lets not optimize peek case if copy_page_to_iter
58+
* didn't copy the entire length lets just break.
59+
*/
60+
if (copy != sge->length)
61+
return copied;
5962
sk_msg_iter_var_next(i);
6063
}
6164

0 commit comments

Comments
 (0)