From: Jon Maloy <jmaloy@redhat.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero
Date: Tue, 14 May 2024 16:19:16 -0400 [thread overview]
Message-ID: <b20434e0-135f-4a67-2601-11ef0933bf02@redhat.com> (raw)
In-Reply-To: <20240514194650.53b433f9@elisabeth>
On 2024-05-14 13:46, Stefano Brivio wrote:
> On Sat, 11 May 2024 11:20:08 -0400
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> A bug in kernel TCP may lead to a deadlock where a zero window is sent
>> from the peer, while it is unable to send out window updates even after
>> reads have freed up enough buffer space to permit a larger window.
>> In this situation, new window advertisemnts from the peer can only be
>> triggered by packets arriving from this side.
>>
>> However, such packets are never sent, because the zero-window condition
>> currently prevents this side from sending out any packets whatsoever
>> to the peer.
>>
>> We notice that the above bug is triggered *only* after the peer has
>> dropped an arriving packet because of severe memory squeeze, and that we
>> hence always enter a retransmission situation when this occurs. This
>> also means that it goes against the RFC 9293 recommendation that a
>> previously advertised window never should shrink.
>>
>> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
>> the following statement:
>> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
>> window edge to the left (SHLD-14). However, a sending TCP peer MUST
>> be robust against window shrinking, which may cause the
>> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
>>
>> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
>> SHOULD retransmit normally the old unacknowledged data between SND.UNA
>> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
>> beyond SND.UNA+SND.WND (MAY-7)"
>>
>> We never see the window become negative, but we interpret this as a
>> recommendation to use the previously available window during
>> retransmission even when the currently advertised window is zero.
>>
>> In case of a zero-window non-retransmission situation where there
>> is no new data to be sent, we also add a simple zero-window probing
>> feature. By sending an empty packet at regular timeout events we
>> resolve the situation described above, since the peer receives the
>> necessary trigger to advertise its window once it becomes non-zero
>> again.
>>
>> It should be noted that although this solves the problem we have at
>> hand, it is not a genuine solution to the kernel bug. There may well
>> be TCP stacks around in other OS-es which don't do this, nor have
>> keep-alive probing as an alternatve way to solve the situation.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>>
>> ---
>> v2: - Using previously advertised window during retransmission, instead
>> highest send sequencece number in the cycle.
>> v3: - Rebased to newest code
>> - Changes based on feedback from PASST team
>> - Sending out empty probe message at timer expiration when
>> we are not in retransmit situation.
>> ---
>> tcp.c | 30 +++++++++++++++++++++---------
>> tcp_conn.h | 2 ++
>> 2 files changed, 23 insertions(+), 9 deletions(-)
>>
>> diff --git a/tcp.c b/tcp.c
>> index 8297812..bd6bf35 100644
>> --- a/tcp.c
>> +++ b/tcp.c
>> @@ -1774,9 +1774,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>> */
>> static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
>> {
>> + uint32_t wnd_upper;
>> +
>> wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>> conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>>
>> + wnd_upper = conn->seq_ack_from_tap + wnd;
>> + if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap))
>> + conn->seq_wup_from_tap = wnd_upper;
>> +
>> /* FIXME: reflect the tap-side receiver's window back to the sock-side
>> * sender by adjusting SO_RCVBUF? */
>> }
>> @@ -1809,6 +1815,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>> ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>>
>> conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
>> + conn->seq_wup_from_tap = conn->seq_to_tap;
>> }
>>
>> /**
>> @@ -2220,7 +2227,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>> */
>> static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>> {
>> - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>> int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>> int sendlen, len, dlen, v4 = CONN_V4(conn);
>> uint32_t max_send, seq, already_sent;
>> @@ -2241,10 +2247,11 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>> }
>>
>> /* How much are we still allowed to send within current window ? */
>> - max_send = conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap;
>> + max_send = conn->seq_wup_from_tap - conn->seq_to_tap;
>> if (SEQ_LE(max_send, 0)) {
>> - flow_trace(conn, "Empty window: win: %u, sent: %u",
>> - wnd_scaled, conn->seq_to_tap);
>> + flow_trace(conn, "Empty window: win_upper: %u, sent: %u",
>> + conn->seq_wup_from_tap, conn->seq_to_tap);
>> + conn->seq_wup_from_tap = conn->seq_to_tap;
>> conn_flag(c, conn, STALLED);
>> conn_flag(c, conn, ACK_FROM_TAP_DUE);
>> return 0;
>> @@ -2380,7 +2387,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>> ASSERT(conn->events & ESTABLISHED);
>>
>> for (i = idx, iov_i = 0; i < (int)p->count; i++) {
>> - uint32_t seq, seq_offset, ack_seq;
>> + uint32_t seq, seq_offset, ack_seq, wnd;
>> const struct tcphdr *th;
>> char *data;
>> size_t off;
>> @@ -2413,11 +2420,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>> if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
>> SEQ_GE(ack_seq, max_ack_seq)) {
>> /* Fast re-transmit */
>> + wnd = ntohs(th->window);
>> retr = !len && !th->fin &&
>> ack_seq == max_ack_seq &&
>> - ntohs(th->window) == max_ack_seq_wnd;
>> + (wnd == max_ack_seq_wnd || !wnd);
> Just as a reminder, as I mentioned on Monday: this means we'll
> re-transmit whenever we get a pure window update (!len && !th->fin
> && ack_seq == max_ack_seq) with a zero window. The receiver is telling
> us it ran out of space, and wham, we flood them, as a punishment.
>
> I would let this check alone, and just add zero-window probing, plus
> whatever retransmission you mentioned from the RFC -- but not a fast
> re-transmit on a zero window.
I think I have a good idea here. I'll use it in my next version.
>
>>
>> - max_ack_seq_wnd = ntohs(th->window);
>> + max_ack_seq_wnd = wnd;
>> max_ack_seq = ack_seq;
>> }
>> }
>> @@ -2480,8 +2488,9 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>
>> if (retr) {
>> flow_trace(conn,
>> - "fast re-transmit, ACK: %u, previous sequence: %u",
>> - max_ack_seq, conn->seq_to_tap);
>> + "fast re-transmit, seqno %u -> %u, win_upper: %u",
>> + conn->seq_to_tap, max_ack_seq,
> I'm not sure if "->" really conveys the meaning of "we're sending this
> sequence *because* of that acknowledgement number".
It really means "we are rewinding seq_to_tap from X to Y". That it is
caused by a
duplicate ack is implicit.
> I would rather keep
> the received acknowledged sequence before everything else, because
> that's the causal trigger for the retransmission.
>
>> + conn->seq_wup_from_tap);
>>
>> conn->seq_to_tap = max_ack_seq;
>> tcp_set_peek_offset(conn->sock, 0);
>> @@ -2931,6 +2940,9 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>> flow_dbg(conn, "activity timeout");
>> tcp_rst(c, conn);
>> }
>> + /* No data sent recently? Keep connection alive. */
>> + if (conn->seq_to_tap == conn->seq_ack_from_tap)
>> + tcp_send_flag(c, conn, ACK_IF_NEEDED);
> If the window is zero, this won't send anything, see the first
> condition in tcp_send_flag(). ACK_IF_NEEDED implies that that function
> should queue an ACK segment if we have data to acknowledge.
Ok. I missed that.
> Here, the flag you want is simply 'ACK'. But we should make sure that
> this can't be taken as a duplicate ACK, that is, we should only send
> this if seq_ack_to_tap == seq_from_tap.
>
> Otherwise, we shouldn't send anything, lest the peer retransmit
> anything that we didn't acknowledge yet.
But then we have no probing... Wasn't that the whole pint of this?
>
>> }
>> }
>>
>> diff --git a/tcp_conn.h b/tcp_conn.h
>> index d280b22..8ae20ef 100644
>> --- a/tcp_conn.h
>> +++ b/tcp_conn.h
>> @@ -30,6 +30,7 @@
>> * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
>> * @seq_to_tap: Next sequence for packets to tap
>> * @seq_ack_from_tap: Last ACK number received from tap
>> + * @seq_wup_from_tap: Right edge of last non-zero window from tap
> "Right edge" makes much more sense to me, and it also matches RFC
> language. Could we turn all the "wup" and "upper" references into
> something like "edge" or "right_edge"?
I tried to come up with something short, because the field name becomes
impractically long. I am open to suggestions.
///jon
>
>> * @seq_from_tap: Next sequence for packets from tap (not actually sent)
>> * @seq_ack_to_tap: Last ACK number sent to tap
>> * @seq_init_from_tap: Initial sequence number from tap
>> @@ -101,6 +102,7 @@ struct tcp_tap_conn {
>>
>> uint32_t seq_to_tap;
>> uint32_t seq_ack_from_tap;
>> + uint32_t seq_wup_from_tap;
>> uint32_t seq_from_tap;
>> uint32_t seq_ack_to_tap;
>> uint32_t seq_init_from_tap;
next prev parent reply other threads:[~2024-05-14 20:19 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-11 15:20 [PATCH v3 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-11 15:20 ` [PATCH v3 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-13 2:09 ` David Gibson
2024-05-14 16:48 ` Stefano Brivio
2024-05-11 15:20 ` [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-13 2:23 ` David Gibson
2024-05-14 17:22 ` Stefano Brivio
2024-05-14 20:06 ` Jon Maloy
2024-05-14 21:00 ` Stefano Brivio
2024-05-11 15:20 ` [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-05-14 17:46 ` Stefano Brivio
2024-05-14 20:19 ` Jon Maloy [this message]
2024-05-14 21:09 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b20434e0-135f-4a67-2601-11ef0933bf02@redhat.com \
--to=jmaloy@redhat.com \
--cc=dgibson@redhat.com \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).