From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
dgibson@redhat.com
Subject: Re: [PATCH 2/2] tcp: allow retransmit when peer receive window is zero
Date: Wed, 24 Apr 2024 11:04:51 +1000 [thread overview]
Message-ID: <Zihas53XonHtyLI0@zatzit> (raw)
In-Reply-To: <20240420191920.104876-3-jmaloy@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 7413 bytes --]
On Sat, Apr 20, 2024 at 03:19:20PM -0400, Jon Maloy wrote:
> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> from the peer, while buffer reads doesn't lead to it being updated.
> At the same time, the zero window stops this side from sending out
> more data to trigger new advertisements to be sent from the peer.
This is a bit confusing to me. I guess the buffer reads must be
happening on the peer, not "this side", but that's not very clear. I
think "received from the peer" might be better than "sent from the
peer" to make it more obvious that the point of view is uniformly from
"this side".
> RFC 793 states that it always is permitted for a sender to send one byte
> of data even when the window is zero. This resolves the deadlock described
> above, so we choose to introduce it here as a last resort. We allow it both
> during fast and as keep-alives when the timer sees no activity on
> the connection.
Looks like there's a missing noun after "during fast".
> However, we notice that this solution doesn´t work well. Traffic sometimes
> goes to zero, and onley recovers after the timer has resolved the situation.
s/onley/only/
> Because of this, we chose to improve it slightly: The deadlock happens when a
Is it actually useful to describe the "bad" workaround if we're using
a different one? I don't see how the better workaround is an obvious
extension of the bad one.
> packet has been dropped at the peer end because of memory squeeze. We therefore
> consider it legitimate to retransmit that packet while considering the window
> size that was valid at the moment it was first transmitted. This works
> much better.
>
> It should be noted that although this solves the problem we have at hand,
> it is not a genuine solution to the kernel bug. There may well be TCP stacks
> around in other OS-es which don't do this probing.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> tcp.c | 26 ++++++++++++++++----------
> tcp_conn.h | 2 ++
> 2 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 95d400a..9dea151 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1774,6 +1774,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>
> conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> + conn->max_seq_to_tap = conn->seq_to_tap;
> }
>
> /**
> @@ -2123,9 +2124,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> *
> * #syscalls recvmsg
> */
> -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled)
> {
> - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> int sendlen, len, plen, v4 = CONN_V4(conn);
> int s = conn->sock, i, ret = 0;
> @@ -2212,6 +2212,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>
> return 0;
> }
> + sendlen = len;
> + if (!peek_offset_cap)
> + sendlen -= already_sent;
>
> sendlen = len;
> if (!peek_offset_cap)
Looks like some duplicated lines from a bad rebase.
> @@ -2241,7 +2244,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> tcp_data_to_tap(c, conn, plen, no_csum, seq);
> seq += plen;
> }
> -
> + /* We need this to know this during retransmission: */
> + if (SEQ_GT(seq, conn->max_seq_to_tap))
> + conn->max_seq_to_tap = seq;
> conn_flag(c, conn, ACK_FROM_TAP_DUE);
>
> return 0;
> @@ -2317,8 +2322,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> SEQ_GE(ack_seq, max_ack_seq)) {
> /* Fast re-transmit */
> retr = !len && !th->fin &&
> - ack_seq == max_ack_seq &&
> - ntohs(th->window) == max_ack_seq_wnd;
> + ack_seq == max_ack_seq;
>
> max_ack_seq_wnd = ntohs(th->window);
> max_ack_seq = ack_seq;
> @@ -2385,9 +2389,10 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> flow_trace(conn,
> "fast re-transmit, ACK: %u, previous sequence: %u",
> max_ack_seq, conn->seq_to_tap);
> +
> conn->seq_ack_from_tap = max_ack_seq;
> conn->seq_to_tap = max_ack_seq;
> - tcp_data_from_sock(c, conn);
> + tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
Does the MAX do anything: if max_seq_to_tap equals seq_ack_from_tap
then, by definition, there is nothing to retransmit - everything has
been acked.
> }
>
> if (!iov_i)
> @@ -2483,7 +2488,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> /* The client might have sent data already, which we didn't
> * dequeue waiting for SYN,ACK from tap -- check now.
> */
> - tcp_data_from_sock(c, conn);
> + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> tcp_send_flag(c, conn, ACK);
> }
>
> @@ -2575,7 +2580,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>
> tcp_tap_window_update(conn, ntohs(th->window));
>
> - tcp_data_from_sock(c, conn);
> + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
>
> if (p->count - idx == 1)
> return 1;
> @@ -2788,7 +2793,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> flow_dbg(conn, "ACK timeout, retry");
> conn->retrans++;
> conn->seq_to_tap = conn->seq_ack_from_tap;
> - tcp_data_from_sock(c, conn);
> + tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
> tcp_timer_ctl(c, conn);
> }
> } else {
> @@ -2807,6 +2812,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> tcp_rst(c, conn);
> }
> }
> +
Spurious change
> }
>
> /**
> @@ -2843,7 +2849,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> conn_event(c, conn, SOCK_FIN_RCVD);
>
> if (events & EPOLLIN)
> - tcp_data_from_sock(c, conn);
> + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
>
> if (events & EPOLLOUT)
> tcp_update_seqack_wnd(c, conn, 0, NULL);
> diff --git a/tcp_conn.h b/tcp_conn.h
> index a5f5cfe..afcdec9 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -29,6 +29,7 @@
> * @wnd_from_tap: Last window size from tap, unscaled (as received)
> * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
> * @seq_to_tap: Next sequence for packets to tap
> + * @max_seq_to_tap: Next seq after highest ever sent. Needeed during retransmit
> * @seq_ack_from_tap: Last ACK number received from tap
> * @seq_from_tap: Next sequence for packets from tap (not actually sent)
> * @seq_ack_to_tap: Last ACK number sent to tap
> @@ -100,6 +101,7 @@ struct tcp_tap_conn {
> uint16_t wnd_to_tap;
>
> uint32_t seq_to_tap;
> + uint32_t max_seq_to_tap;
> uint32_t seq_ack_from_tap;
> uint32_t seq_from_tap;
> uint32_t seq_ack_to_tap;
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-04-24 1:04 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-20 19:19 [PATCH 0/2] Support for SO_PEEK_OFF when a available Jon Maloy
2024-04-20 19:19 ` [PATCH 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-04-23 17:50 ` Stefano Brivio
2024-04-24 0:48 ` David Gibson
2024-04-24 18:30 ` Stefano Brivio
2024-04-26 3:27 ` David Gibson
2024-04-26 5:58 ` Stefano Brivio
2024-04-29 1:46 ` David Gibson
2024-04-25 23:06 ` Jon Maloy
2024-04-24 0:44 ` David Gibson
2024-04-25 23:23 ` Jon Maloy
2024-04-26 3:29 ` David Gibson
2024-04-20 19:19 ` [PATCH 2/2] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-04-24 1:04 ` David Gibson [this message]
2024-04-24 18:31 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zihas53XonHtyLI0@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=dgibson@redhat.com \
--cc=jmaloy@redhat.com \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).