public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero
Date: Thu, 16 May 2024 13:22:42 +0200	[thread overview]
Message-ID: <20240516132242.20bc761b@elisabeth> (raw)
In-Reply-To: <c7d897ac-9589-f96a-ff51-1eb3af3b7646@redhat.com>

On Wed, 15 May 2024 19:10:49 -0400
Jon Maloy <jmaloy@redhat.com> wrote:

> On 2024-05-15 16:24, Stefano Brivio wrote:
> > On Wed, 15 May 2024 11:34:29 -0400
> > Jon Maloy <jmaloy@redhat.com> wrote:
> >  
> >> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> >> from the peer, while it is unable to send out window updates even after
> >> reads have freed up enough buffer space to permit a larger window.
> >> In this situation, new window advertisemnts from the peer can only be
> >> triggered by packets arriving from this side.
> >>
> >> However, such packets are never sent, because the zero-window condition
> >> currently prevents this side from sending out any packets whatsoever
> >> to the peer.
> >>
> >> We notice that the above bug is triggered *only* after the peer has
> >> dropped an arriving packet because of severe memory squeeze, and that we
> >> hence always enter a retransmission situation when this occurs. This
> >> also means that it goes against the RFC 9293 recommendation that a
> >> previously advertised window never should shrink.
> >>
> >> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
> >> the following statement:
> >> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
> >> window edge to the left (SHLD-14). However, a sending TCP peer MUST
> >> be robust against window shrinking, which may cause the
> >> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
> >>
> >> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
> >> SHOULD retransmit normally the old unacknowledged data between SND.UNA
> >> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
> >> beyond SND.UNA+SND.WND (MAY-7)"
> >>
> >> We never see the window become negative, but we interpret this as a
> >> recommendation to use the previously available window during
> >> retransmission even when the currently advertised window is zero.
> >>
> >> We use the above mechanism only at timer-induced retransmits.
> >> In the case we receive duplicate ack and a zero window, but
> >> still know we have outstanding data acks waiting, we send out an
> >> empty "fast probe" instead of doing fast retransmit. This averts
> >> the risk of overwhelming a memory squeezed peer with retransmits,
> >> while still forcing it to send out a new window update when the
> >> probe is received. This entails a theoretical risk of redundant
> >> retransmits from the peer, but that is a risk worth taking.
> >>
> >> In case of a zero-window non-retransmission situation where there
> >> is no new data to be sent, we also add a simple zero-window probing
> >> feature. By sending an empty packet at regular timeout events we
> >> resolve the situation described above, since the peer receives the
> >> necessary trigger to advertise its window once it becomes non-zero
> >> again.
> >>
> >> It should be noted that although this solves the problem we have at
> >> hand, it is not a genuine solution to the kernel bug. There may well
> >> be TCP stacks around in other OS-es which don't do this, nor have
> >> keep-alive probing as an alternatve way to solve the situation.
> >>
> >> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> >>
> >> ---
> >> v2: - Using previously advertised window during retransmission, instead
> >>        highest send sequencece number in the cycle.
> >> v3: - Rebased to newest code
> >>      - Changes based on feedback from PASST team
> >>      - Sending out empty probe message at timer expiration when
> >>        we are not in retransmit situation.
> >> v4: - Some small changes based on feedback from PASST team.
> >>      - Replaced fast retransmit with a one-time 'fast probe' when
> >>        window is zero.
> >> ---
> >>   tcp.c      | 32 +++++++++++++++++++++++++++-----
> >>   tcp_conn.h |  2 ++
> >>   2 files changed, 29 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/tcp.c b/tcp.c
> >> index 4163bf9..a33f494 100644
> >> --- a/tcp.c
> >> +++ b/tcp.c
> >> @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> >>    */
> >>   static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> >>   {
> >> +	uint32_t wnd_edge;
> >> +
> >>   	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> >>   	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> >>   
> >> +	wnd_edge = conn->seq_ack_from_tap + wnd;
> >> +	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))  
> > Here, cppcheck ('make cppcheck') says:
> >
> > tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
> >   if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
> >       ^
> > tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
> >   wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> >         ^
> > tcp.c:1770:6: note: Condition 'wnd' is always true
> >   if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
> >       ^
> >
> > See the comment in tcp_update_seqack_wnd() and related suppression.
> >
> > It's clearly a false positive (if you omit the MIN() macro, it goes
> > away), so we need that same suppression here.  
> Ok. I'll change it. Still a little annoying when our tools are causing 
> us extra job because they aren't up to the task.

Just like in David's experience, cppcheck probably saved me a
substantial number of embarrassing situations.

I would say it's very much up to the task to be honest, a few false
positives are something we can definitely expect.

> >  
> >> +		conn->seq_wnd_edge_from_tap = wnd_edge;
> >> +
> >>   	/* FIXME: reflect the tap-side receiver's window back to the sock-side
> >>   	 * sender by adjusting SO_RCVBUF? */
> >>   }
> >> @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >>   	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> >>   
> >>   	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> >> +	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
> >>   }
> >>   
> >>   /**
> >> @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >>    */
> >>   static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >>   {
> >> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >>   	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> >>   	int sendlen, len, dlen, v4 = CONN_V4(conn);
> >> +	uint32_t already_sent, max_send, seq;
> >>   	int s = conn->sock, i, ret = 0;
> >>   	struct msghdr mh_sock = { 0 };
> >>   	uint16_t mss = MSS_GET(conn);
> >> -	uint32_t already_sent, seq;
> >>   	struct iovec *iov;
> >>   
> >>   	/* How much have we read/sent since last received ack ? */
> >> @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >>   		tcp_set_peek_offset(s, 0);
> >>   	}
> >>   
> >> -	if (!wnd_scaled || already_sent >= wnd_scaled) {
> >> +	/* How much are we still allowed to send within current window ? */
> >> +	max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
> >> +	if (SEQ_LE(max_send, 0)) {
> >> +		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",  
> > This is not win_upper anymore, and the window is actually full rather
> > than empty (it's... empty of space). Maybe:
> >
> > 		flow_trace(conn, "Window full: right edge: %u, sent: %u"  
> yes.
> >  
> >> +			   conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
> >> +		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
> >>   		conn_flag(c, conn, STALLED);
> >>   		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >>   		return 0;
> >>   	}
> >>   
> >>   	/* Set up buffer descriptors we'll fill completely and partially. */
> >> -	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
> >> +	fill_bufs = DIV_ROUND_UP(max_send,  mss);
> >>   	if (fill_bufs > TCP_FRAMES) {
> >>   		fill_bufs = TCP_FRAMES;
> >>   		iov_rem = 0;
> >>   	} else {
> >> -		iov_rem = (wnd_scaled - already_sent) % mss;
> >> +		iov_rem = max_send % mss;
> >>   	}
> >>   
> >>   	/* Prepare iov according to kernel capability */
> >> @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >>   		conn->seq_to_tap = max_ack_seq;
> >>   		tcp_set_peek_offset(conn->sock, 0);
> >>   		tcp_data_from_sock(c, conn);
> >> +	} else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) {
> >> +		/* Force peer to send new advertisement now, but only once */  
> > Two questions:
> >
> > - which advertisement? We're sending a zero-window probe, not forcing
> >    the peer to do much really. I would rather just state that we're
> >    sending a probe  
> Actually, it is only clear if you know the code of the (linux) peer.
> I realized this was maybe a too strong statement, but this is really
> what happens.

Our peer is not necessarily the Linux kernel, though, and, especially,
not a specific (broken) version of it.

> > - what guarantees it only happens once? If we get more data from the
> >    socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a
> >    bit, and send another ACK (duplicate) to the peer, without the peer
> >    necessarily ever advertising a non-zero window meanwhile  
> Right. I need to give this  a new spin, but I think I am on the right 
> track.
> I *really* want this to be solved here, and not need to fall back to
>   the timeout except under exceptional conditions.
> This happens really often in my tests, and with this fix it actually
> works like a charm.
> 
> >
> >    I'm struggling a bit to understand how this can work "cleanly", a
> >    packet capture of this mechanism in action would certainly help.  
> Ok. In v5.
> >  
> >> +		flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u",
> >> +			   max_ack_seq, conn->seq_to_tap);
> >> +		tcp_send_flag(c, conn, ACK);
> >> +		conn->seq_to_tap = max_ack_seq;
> >> +		tcp_set_peek_offset(conn->sock, 0);
> >>   	}
> >>   
> >>   	if (!iov_i)
> >> @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >>   			flow_dbg(conn, "activity timeout");
> >>   			tcp_rst(c, conn);
> >>   		}
> >> +		/* No data exchanged recently? Keep connection alive. */  
> > ...I just spotted this from v3: this is not the reason why we're
> > sending a keep-alive. We're sending a keep-alive segment because the
> > peer advertised its window as zero.
> >
> > I also realised that this is not scheduled additionally, so it will
> > just trigger on an activity timeout, I suppose. We should reschedule
> > this after ACK_TIMEOUT, instead (that was my earlier suggestion, I
> > didn't check anymore) when the peer advertises a zero window.
> >  
> >> +		if (conn->seq_to_tap == conn->seq_ack_from_tap &&  
> > ...this part will only work if we reset seq_to_tap to seq_ack_from_tap
> > earlier, and we have no pending data to send, which is not necessarily
> > the case if we want to send a zero-window probe.
> >  
> >> +		    conn->seq_from_tap == conn->seq_ack_to_tap)
> >> +			tcp_send_flag(c, conn, ACK);  
> > I think the conditions should simply be:
> >
> > - the window currently advertised by the peer is zero
> >
> > - we don't have pending data to acknowledge (otherwise the peer can
> >    interpret our keep-alive as a duplicate ACK)  
> That is ok.
> The deadlock situation we have ended up in will anyway be resolved by the
> real retransmit that will happen before we get here.
> I am now wondering if this probe it serves any purpose at all?

Well, RFC 9293 says we must implement it, so it's a good idea to do
that, I guess.

Let's say your first probe is lost for whatever reason (if the system
is low on memory, that is actually likely to happen with tap devices).

> >  
> >>   	}
> >>   }
> >>   
> >> diff --git a/tcp_conn.h b/tcp_conn.h
> >> index d280b22..5cbad2a 100644
> >> --- a/tcp_conn.h
> >> +++ b/tcp_conn.h
> >> @@ -30,6 +30,7 @@
> >>    * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> >>    * @seq_to_tap:		Next sequence for packets to tap
> >>    * @seq_ack_from_tap:	Last ACK number received from tap
> >> + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
> >>    * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
> >>    * @seq_ack_to_tap:	Last ACK number sent to tap
> >>    * @seq_init_from_tap:	Initial sequence number from tap
> >> @@ -101,6 +102,7 @@ struct tcp_tap_conn {
> >>   
> >>   	uint32_t	seq_to_tap;
> >>   	uint32_t	seq_ack_from_tap;
> >> +	uint32_t	seq_wnd_edge_from_tap;
> >>   	uint32_t	seq_from_tap;
> >>   	uint32_t	seq_ack_to_tap;
> >>   	uint32_t	seq_init_from_tap;  
> 
> I'll re-spin the two first ones asap, so you can apply them, and then I 
> will try to improve
> this one further.

Just mind that if 2/3 makes the issue you're working around here more
likely to happen (me, I've never seen it), we shouldn't go ahead with
2/3 without this, I guess.

-- 
Stefano


      parent reply	other threads:[~2024-05-16 11:23 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-15 15:34 [PATCH v4 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-15 15:34 ` [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-15 20:20   ` Stefano Brivio
2024-05-16  2:24   ` David Gibson
2024-05-16  2:57     ` Jon Maloy
2024-05-16  4:16       ` David Gibson
2024-06-04 17:36         ` Jon Maloy
2024-06-04 18:04           ` Jon Maloy
2024-06-04 18:10             ` Stefano Brivio
2024-05-15 15:34 ` [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-15 20:22   ` Stefano Brivio
2024-05-16  2:29   ` David Gibson
2024-05-16  3:03     ` Jon Maloy
2024-05-15 15:34 ` [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-05-15 20:24   ` Stefano Brivio
2024-05-15 23:10     ` Jon Maloy
2024-05-16  7:19       ` David Gibson
2024-05-16 11:22       ` Stefano Brivio [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240516132242.20bc761b@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).