[net-next 0/2] tcp: add support for SO_PEEK

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option
@ 2024-04-03 22:58 Jon Maloy
  2024-04-03 22:58 ` [net-next 1/2] " Jon Maloy
  2024-04-03 22:58 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze Jon Maloy
  0 siblings, 2 replies; 12+ messages in thread
From: Jon Maloy @ 2024-04-03 22:58 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson, jmaloy

We add support for the SO_PEEK_OFF socket option as a new feature in
TCP.

In a separate patch, we fix a bug that was revealed while testing this
feature.

Jon Maloy (2):
  tcp: add support for SO_PEEK_OFF
  tcp: correct handling of extreme menory squeeze

 net/ipv4/af_inet.c    |  1 +
 net/ipv4/tcp.c        | 16 ++++++++++------
 net/ipv4/tcp_output.c |  5 ++++-
 3 files changed, 15 insertions(+), 7 deletions(-)

-- 
2.42.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [net-next 1/2] tcp: add support for SO_PEEK_OFF socket option
  2024-04-03 22:58 [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option Jon Maloy
@ 2024-04-03 22:58 ` Jon Maloy
  2024-04-03 22:58 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze Jon Maloy
  1 sibling, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2024-04-03 22:58 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson, jmaloy

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.

In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
v4: - Same as v3. Posting was delayed because I first had to debug
      an issue that turned out to not be directly related to this
      change. See next commit in this series.
---
 net/ipv4/af_inet.c |  1 +
 net/ipv4/tcp.c     | 16 ++++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ad278009e469..5c35917b166c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
 #endif
 	.splice_eof	   = inet_splice_eof,
 	.splice_read	   = tcp_splice_read,
+	.set_peek_off      = sk_set_peek_off,
 	.read_sock	   = tcp_read_sock,
 	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c82dc42f57c6..d4890f2a86d4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);
-- 
@@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-03 22:58 [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option Jon Maloy
  2024-04-03 22:58 ` [net-next 1/2] " Jon Maloy
@ 2024-04-03 22:58 ` Jon Maloy
  2024-04-05 17:55   ` Stefano Brivio
  1 sibling, 1 reply; 12+ messages in thread
From: Jon Maloy @ 2024-04-03 22:58 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson, jmaloy

Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
in this series along with the pasta protocol splicer revealed a bug in
the way tcp handles window advertising during extreme memory squeeze
situations.

The excerpt of the below logging session shows what is happeing:

[5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]:   ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83

[...]

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0

We can see that although we are adverising a window size of zero,
tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
between this side's and the peer's view of the current window size.
- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly caclulates a new
  window size it finds too small to advertise.

Hence no messages are received, and no acknowledges are sent, and
the situation remains locked even after the last queued receive buffer
has been consumed.

We fix this by setting tp->rcv_wnd to 0 before we return from the
function tcp_select_window() in this particular case.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..5803fd402708 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
 	 * are out of memory. The window is temporary, so we don't store
 	 * it on the socket.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
@@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
 	 * are out of memory. The window is temporary, so we don't store
 	 * it on the socket.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-03 22:58 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze Jon Maloy
@ 2024-04-05 17:55   ` Stefano Brivio
  2024-04-05 19:37     ` Jon Maloy
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2024-04-05 17:55 UTC (permalink / raw)
  To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson

On Wed,  3 Apr 2024 18:58:33 -0400
Jon Maloy <jmaloy@redhat.com> wrote:

> Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> in this series along with the pasta protocol splicer revealed a bug in
> the way tcp handles window advertising during extreme memory squeeze
> situations.
> 
> The excerpt of the below logging session shows what is happeing:
> 
> [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> 
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> 
> [...]
> 
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> 
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> 
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> 
> We can see that although we are adverising a window size of zero,
> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> between this side's and the peer's view of the current window size.
> - The peer thinks the window is zero, and stops sending.
> - This side ends up in a cycle where it repeatedly caclulates a new
>   window size it finds too small to advertise.
> 
> Hence no messages are received, and no acknowledges are sent, and
> the situation remains locked even after the last queued receive buffer
> has been consumed.
> 
> We fix this by setting tp->rcv_wnd to 0 before we return from the
> function tcp_select_window() in this particular case.
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  net/ipv4/tcp_output.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e3167ad96567..5803fd402708 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
>  	 * are out of memory. The window is temporary, so we don't store
>  	 * it on the socket.

One nit: now that you do store it on the socket, you should probably
change this comment as well.

>  	 */
> -	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> +	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> +		tp->rcv_wnd = 0;
> +		tp->rcv_wup = tp->rcv_nxt;

...I'm wondering if you should set 'pred_flags' to 0, as it's done at
the end of the function for other cases where the window is advertised
as zero.

At least according to the comment to tcp_rcv_established() it looks
like it's needed:

 *      - A zero window was announced from us - zero window probing
 *        is only handled properly in the slow path.

>  		return 0;
> +	}
>  
>  	cur_win = tcp_receive_window(tp);
>  	new_win = __tcp_select_window(sk);

The rest, including 1/2, looks good to me.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-05 17:55   ` Stefano Brivio
@ 2024-04-05 19:37     ` Jon Maloy
  0 siblings, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2024-04-05 19:37 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, lvivier, dgibson



On 2024-04-05 13:55, Stefano Brivio wrote:
> On Wed,  3 Apr 2024 18:58:33 -0400
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
>> in this series along with the pasta protocol splicer revealed a bug in
>> the way tcp handles window advertising during extreme memory squeeze
>> situations.
>>
>> The excerpt of the below logging session shows what is happeing:
>>
>> [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
>> [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
>> [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
>> [5201<->54494]:   ADVERTISING WINDOW SIZE 0
>> [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>>
>> [5201<->54494]: tcp_recvmsg_locked(->)
>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
>> [5201<->54494]:     NOT calling tcp_send_ack()
>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
>>
>> [...]
>>
>> [5201<->54494]: tcp_recvmsg_locked(->)
>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
>> [5201<->54494]:     NOT calling tcp_send_ack()
>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
>>
>> [5201<->54494]: tcp_recvmsg_locked(->)
>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
>> [5201<->54494]:     NOT calling tcp_send_ack()
>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
>>
>> [5201<->54494]: tcp_recvmsg_locked(->)
>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]:     NOT calling tcp_send_ack()
>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>>
>> We can see that although we are adverising a window size of zero,
>> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
>> between this side's and the peer's view of the current window size.
>> - The peer thinks the window is zero, and stops sending.
>> - This side ends up in a cycle where it repeatedly caclulates a new
>>    window size it finds too small to advertise.
>>
>> Hence no messages are received, and no acknowledges are sent, and
>> the situation remains locked even after the last queued receive buffer
>> has been consumed.
>>
>> We fix this by setting tp->rcv_wnd to 0 before we return from the
>> function tcp_select_window() in this particular case.
>> Further testing shows that the connection recovers neatly from the
>> squeeze situation, and traffic can continue indefinitely.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>> ---
>>   net/ipv4/tcp_output.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index e3167ad96567..5803fd402708 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
>>   	 * are out of memory. The window is temporary, so we don't store
>>   	 * it on the socket.
> One nit: now that you do store it on the socket, you should probably
> change this comment as well.
>
>>   	 */
>> -	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
>> +	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
>> +		tp->rcv_wnd = 0;
>> +		tp->rcv_wup = tp->rcv_nxt;
> ...I'm wondering if you should set 'pred_flags' to 0, as it's done at
> the end of the function for other cases where the window is advertised
> as zero.
>
> At least according to the comment to tcp_rcv_established() it looks
> like it's needed:
>
>   *      - A zero window was announced from us - zero window probing
>   *        is only handled properly in the slow path.
>
>>   		return 0;
>> +	}
>>   
>>   	cur_win = tcp_receive_window(tp);
>>   	new_win = __tcp_select_window(sk);
> The rest, including 1/2, looks good to me.
>
Good points. I'll fix those and post the patches with your "Reviewed-by:"

/thx


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-08  8:03     ` Eric Dumazet
@ 2024-04-08 11:13       ` Jon Maloy
  0 siblings, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2024-04-08 11:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson, eric.dumazet



On 2024-04-08 06:03, Eric Dumazet wrote:
> On Sat, Apr 6, 2024 at 8:37 PM Eric Dumazet <edumazet@google.com> wrote:
>> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
[...]
>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
>>>
>>> [5201<->54494]: tcp_recvmsg_locked(->)
>>> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>>> [5201<->54494]:     NOT calling tcp_send_ack()
>>> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>>>
>>> We can see that although we are adverising a window size of zero,
>>> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
>>> between this side's and the peer's view of the current window size.
>>> - The peer thinks the window is zero, and stops sending.
>>> - This side ends up in a cycle where it repeatedly caclulates a new
>>>    window size it finds too small to advertise.
>>>
>>> Hence no messages are received, and no acknowledges are sent, and
>>> the situation remains locked even after the last queued receive buffer
>>> has been consumed.
>>>
>>> We fix this by setting tp->rcv_wnd to 0 before we return from the
>>> function tcp_select_window() in this particular case.
>>> Further testing shows that the connection recovers neatly from the
>>> squeeze situation, and traffic can continue indefinitely.
>>>
>>> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
>>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> I do not think this patch is good. If we reach zero window, it is a
> sign something is wrong.
>
> TCP has heuristics to slow down the sender if the receiver does not
> drain the receive queue fast enough.
>
> MSG_PEEK is an obvious reason, and SO_RCVLOWAT too.
>
> I suggest you take a look at tcp_set_rcvlowat(), see what is needed
> for SO_PEEK_OFF (ab)use ?
>
> In short, when SO_PEEK_OFF is in action :
> - TCP needs to not delay ACK when receive queue starts to fill
> - TCP needs to make sure sk_rcvbuf and tp->window_clamp grow (if
> autotuning is enabled)
>
We are not talking about the same socket here. The one being
overloaded is the terminating socket at the guest side. This is
just a regular socket not using MSG_PEEK or SO_PEEK_OFF.

SO_PEEK_OFF is used in the intermediate socket terminating
the connection towards the remote end.  We want to preserve
the message in its receive queue until it has been acknowledged
by the guest side, so we don't need to keep a copy of it in user space.
This seems to work flawlessly.

Anyway, I think this is worth taking a closer look at, as you say.
I don't think this situation should occur at all.

///jon


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-07  5:51       ` Menglong Dong
@ 2024-04-08 11:01         ` Jon Maloy
  0 siblings, 0 replies; 12+ messages in thread
From: Jon Maloy @ 2024-04-08 11:01 UTC (permalink / raw)
  To: Menglong Dong, Jason Xing, Eric Dumazet
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	eric.dumazet, dongmenglong.8



On 2024-04-07 03:51, Menglong Dong wrote:
> On Sun, Apr 7, 2024 at 2:52 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>> On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
[...]
>>>> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>>>>
>>>> We can see that although we are adverising a window size of zero,
>>>> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
>>>> between this side's and the peer's view of the current window size.
>>>> - The peer thinks the window is zero, and stops sending.
> Hi!
>
> In my original logic, the client will send a zero-window
> ack when it drops the skb because it is out of the
> memory. And the peer SHOULD keep retrans the dropped
> packet.
>
> Does the peer do the transmission in this case? The receive
> window of the peer SHOULD recover once the
> retransmission is successful.
The "peer" is this case is our user-space protocol splicer, emulating
the behavior of of the remote end socket.
At a first glance, it looks like it is *not* performing any retransmits
at all when it sees a zero window at the receiver, so this might indeed
be the problem.
I will be out of office today, but I will test this later this week.

///jon

>
>>>> - This side ends up in a cycle where it repeatedly caclulates a new
>>>>    window size it finds too small to advertise.
> Yeah,  the zero-window suppressed the sending of ack in
> __tcp_cleanup_rbuf, which I wasn't aware of.
>
> The ack will recover the receive window of the peer. Does
> it make the peer retrans the dropped data immediately?
> In my opinion, the peer still needs to retrans the dropped
> packet until the retransmission timer timeout. Isn't it?
>
> If it is, maybe we can do the retransmission immediately
> if we are in zero-window from a window-shrink, which can
> make the recovery faster.
>
> [......]
>>> Any particular reason to not cc Menglong Dong ?
>>> (I just did)
>> He is not working at Tencent any more. Let me CC here one more time.
> Thanks for CC the new email of mine, it's very kind of you,
> xing :/
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-06 16:37   ` Eric Dumazet
  2024-04-07  4:52     ` Jason Xing
@ 2024-04-08  8:03     ` Eric Dumazet
  2024-04-08 11:13       ` Jon Maloy
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2024-04-08  8:03 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson, eric.dumazet

On Sat, Apr 6, 2024 at 8:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > in this series along with the pasta protocol splicer revealed a bug in
> > the way tcp handles window advertising during extreme memory squeeze
> > situations.
> >
> > The excerpt of the below logging session shows what is happeing:
> >
> > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> >
> > [...]
>
> I would prefer a packetdrill test, it is not clear what is happening...
>
> In particular, have you used SO_RCVBUF ?
>
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> >
> > We can see that although we are adverising a window size of zero,
> > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > between this side's and the peer's view of the current window size.
> > - The peer thinks the window is zero, and stops sending.
> > - This side ends up in a cycle where it repeatedly caclulates a new
> >   window size it finds too small to advertise.
> >
> > Hence no messages are received, and no acknowledges are sent, and
> > the situation remains locked even after the last queued receive buffer
> > has been consumed.
> >
> > We fix this by setting tp->rcv_wnd to 0 before we return from the
> > function tcp_select_window() in this particular case.
> > Further testing shows that the connection recovers neatly from the
> > squeeze situation, and traffic can continue indefinitely.
> >
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>

I do not think this patch is good. If we reach zero window, it is a
sign something is wrong.

TCP has heuristics to slow down the sender if the receiver does not
drain the receive queue fast enough.

MSG_PEEK is an obvious reason, and SO_RCVLOWAT too.

I suggest you take a look at tcp_set_rcvlowat(), see what is needed
for SO_PEEK_OFF (ab)use ?

In short, when SO_PEEK_OFF is in action :
- TCP needs to not delay ACK when receive queue starts to fill
- TCP needs to make sure sk_rcvbuf and tp->window_clamp grow (if
autotuning is enabled)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-07  4:52     ` Jason Xing
@ 2024-04-07  5:51       ` Menglong Dong
  2024-04-08 11:01         ` Jon Maloy
  0 siblings, 1 reply; 12+ messages in thread
From: Menglong Dong @ 2024-04-07  5:51 UTC (permalink / raw)
  To: Jason Xing, Eric Dumazet, jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	eric.dumazet, dongmenglong.8

On Sun, Apr 7, 2024 at 2:52 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> > >
> > > From: Jon Maloy <jmaloy@redhat.com>
> > >
> > > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > > in this series along with the pasta protocol splicer revealed a bug in
> > > the way tcp handles window advertising during extreme memory squeeze
> > > situations.
> > >
> > > The excerpt of the below logging session shows what is happeing:
> > >
> > > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> > >
> > > [...]
> >
> > I would prefer a packetdrill test, it is not clear what is happening...
> >
> > In particular, have you used SO_RCVBUF ?
> >
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> > >
> > > [5201<->54494]: tcp_recvmsg_locked(->)
> > > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]:     NOT calling tcp_send_ack()
> > > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> > >
> > > We can see that although we are adverising a window size of zero,
> > > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > > between this side's and the peer's view of the current window size.
> > > - The peer thinks the window is zero, and stops sending.

Hi!

In my original logic, the client will send a zero-window
ack when it drops the skb because it is out of the
memory. And the peer SHOULD keep retrans the dropped
packet.

Does the peer do the transmission in this case? The receive
window of the peer SHOULD recover once the
retransmission is successful.

> > > - This side ends up in a cycle where it repeatedly caclulates a new
> > >   window size it finds too small to advertise.

Yeah,  the zero-window suppressed the sending of ack in
__tcp_cleanup_rbuf, which I wasn't aware of.

The ack will recover the receive window of the peer. Does
it make the peer retrans the dropped data immediately?
In my opinion, the peer still needs to retrans the dropped
packet until the retransmission timer timeout. Isn't it?

If it is, maybe we can do the retransmission immediately
if we are in zero-window from a window-shrink, which can
make the recovery faster.

[......]
> > Any particular reason to not cc Menglong Dong ?
> > (I just did)
>
> He is not working at Tencent any more. Let me CC here one more time.

Thanks for CC the new email of mine, it's very kind of you,
xing :/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-06 16:37   ` Eric Dumazet
@ 2024-04-07  4:52     ` Jason Xing
  2024-04-07  5:51       ` Menglong Dong
  2024-04-08  8:03     ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Xing @ 2024-04-07  4:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jmaloy, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, eric.dumazet, menglong8.dong, dongmenglong.8

On Sun, Apr 7, 2024 at 2:38 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> > in this series along with the pasta protocol splicer revealed a bug in
> > the way tcp handles window advertising during extreme memory squeeze
> > situations.
> >
> > The excerpt of the below logging session shows what is happeing:
> >
> > [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> > [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> > [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> > [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> > [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
> >
> > [...]
>
> I would prefer a packetdrill test, it is not clear what is happening...
>
> In particular, have you used SO_RCVBUF ?
>
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
> >
> > [5201<->54494]: tcp_recvmsg_locked(->)
> > [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]:     NOT calling tcp_send_ack()
> > [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> > [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
> >
> > We can see that although we are adverising a window size of zero,
> > tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> > between this side's and the peer's view of the current window size.
> > - The peer thinks the window is zero, and stops sending.
> > - This side ends up in a cycle where it repeatedly caclulates a new
> >   window size it finds too small to advertise.
> >
> > Hence no messages are received, and no acknowledges are sent, and
> > the situation remains locked even after the last queued receive buffer
> > has been consumed.
> >
> > We fix this by setting tp->rcv_wnd to 0 before we return from the
> > function tcp_select_window() in this particular case.
> > Further testing shows that the connection recovers neatly from the
> > squeeze situation, and traffic can continue indefinitely.
> >
> > Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > ---
> >  net/ipv4/tcp_output.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 9282fafc0e61..57ead8f3c334 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
> >         u32 cur_win, new_win;
> >
> >         /* Make the window 0 if we failed to queue the data because we
> > -        * are out of memory. The window is temporary, so we don't store
> > -        * it on the socket.
> > +        * are out of memory. The window needs to be stored in the socket
> > +        * for the connection to recover.
> >          */
> > -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> > -               return 0;
> > +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> > +               new_win = 0;
> > +               tp->rcv_wnd = 0;
> > +               tp->rcv_wup = tp->rcv_nxt;
> > +               goto out;
> > +       }
> >
> >         cur_win = tcp_receive_window(tp);
> >         new_win = __tcp_select_window(sk);
> > @@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
> >
> >         /* RFC1323 scaling applied */
> >         new_win >>= tp->rx_opt.rcv_wscale;
> > -
> > +out:
> >         /* If we advertise zero window, disable fast path. */
> >         if (new_win == 0) {
> >                 tp->pred_flags = 0;
> > --
> > 2.42.0
> >
>
> Any particular reason to not cc Menglong Dong ?
> (I just did)

He is not working at Tencent any more. Let me CC here one more time.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-06 18:21 [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option jmaloy
@ 2024-04-06 18:21 ` jmaloy
  2024-04-06 16:37   ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: jmaloy @ 2024-04-06 18:21 UTC (permalink / raw)
  To: netdev, davem
  Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson, eric.dumazet,
	edumazet

From: Jon Maloy <jmaloy@redhat.com>

Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
in this series along with the pasta protocol splicer revealed a bug in
the way tcp handles window advertising during extreme memory squeeze
situations.

The excerpt of the below logging session shows what is happeing:

[5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]:   ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83

[...]

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]:     NOT calling tcp_send_ack()
[5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0

We can see that although we are adverising a window size of zero,
tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
between this side's and the peer's view of the current window size.
- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly caclulates a new
  window size it finds too small to advertise.

Hence no messages are received, and no acknowledges are sent, and
the situation remains locked even after the last queued receive buffer
has been consumed.

We fix this by setting tp->rcv_wnd to 0 before we return from the
function tcp_select_window() in this particular case.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp_output.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9282fafc0e61..57ead8f3c334 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory. The window needs to be stored in the socket
+	 * for the connection to recover.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
-		return 0;
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		new_win = 0;
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
+		goto out;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
@@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
 
 	/* RFC1323 scaling applied */
 	new_win >>= tp->rx_opt.rcv_wscale;
-
+out:
 	/* If we advertise zero window, disable fast path. */
 	if (new_win == 0) {
 		tp->pred_flags = 0;
-- 
@@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory. The window needs to be stored in the socket
+	 * for the connection to recover.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
-		return 0;
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		new_win = 0;
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
+		goto out;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
@@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
 
 	/* RFC1323 scaling applied */
 	new_win >>= tp->rx_opt.rcv_wscale;
-
+out:
 	/* If we advertise zero window, disable fast path. */
 	if (new_win == 0) {
 		tp->pred_flags = 0;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [net-next 2/2] tcp: correct handling of extreme menory squeeze
  2024-04-06 18:21 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze jmaloy
@ 2024-04-06 16:37   ` Eric Dumazet
  2024-04-07  4:52     ` Jason Xing
  2024-04-08  8:03     ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2024-04-06 16:37 UTC (permalink / raw)
  To: jmaloy, Menglong Dong
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson, eric.dumazet

On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
> in this series along with the pasta protocol splicer revealed a bug in
> the way tcp handles window advertising during extreme memory squeeze
> situations.
>
> The excerpt of the below logging session shows what is happeing:
>
> [5201<->54494]:     ==== Activating log @ tcp_select_window()/268 ====
> [5201<->54494]:     (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->54494]:   tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
> [5201<->54494]:   ADVERTISING WINDOW SIZE 0
> [5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
>
> [...]

I would prefer a packetdrill test, it is not clear what is happening...

In particular, have you used SO_RCVBUF ?

>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
>
> [5201<->54494]: tcp_recvmsg_locked(->)
> [5201<->54494]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]:     NOT calling tcp_send_ack()
> [5201<->54494]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
> [5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
>
> We can see that although we are adverising a window size of zero,
> tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
> between this side's and the peer's view of the current window size.
> - The peer thinks the window is zero, and stops sending.
> - This side ends up in a cycle where it repeatedly caclulates a new
>   window size it finds too small to advertise.
>
> Hence no messages are received, and no acknowledges are sent, and
> the situation remains locked even after the last queued receive buffer
> has been consumed.
>
> We fix this by setting tp->rcv_wnd to 0 before we return from the
> function tcp_select_window() in this particular case.
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  net/ipv4/tcp_output.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 9282fafc0e61..57ead8f3c334 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -263,11 +263,15 @@ static u16 tcp_select_window(struct sock *sk)
>         u32 cur_win, new_win;
>
>         /* Make the window 0 if we failed to queue the data because we
> -        * are out of memory. The window is temporary, so we don't store
> -        * it on the socket.
> +        * are out of memory. The window needs to be stored in the socket
> +        * for the connection to recover.
>          */
> -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> -               return 0;
> +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> +               new_win = 0;
> +               tp->rcv_wnd = 0;
> +               tp->rcv_wup = tp->rcv_nxt;
> +               goto out;
> +       }
>
>         cur_win = tcp_receive_window(tp);
>         new_win = __tcp_select_window(sk);
> @@ -301,7 +305,7 @@ static u16 tcp_select_window(struct sock *sk)
>
>         /* RFC1323 scaling applied */
>         new_win >>= tp->rx_opt.rcv_wscale;
> -
> +out:
>         /* If we advertise zero window, disable fast path. */
>         if (new_win == 0) {
>                 tp->pred_flags = 0;
> --
> 2.42.0
>

Any particular reason to not cc Menglong Dong ?
(I just did)

This code was added in

commit e2142825c120d4317abf7160a0fc34b3de532586
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Aug 11 10:55:27 2023 +0800

    net: tcp: send zero-window ACK when no memory

    For now, skb will be dropped when no memory, which makes client keep
    retrans util timeout and it's not friendly to the users.

    In this patch, we reply an ACK with zero-window in this case to update
    the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
    connection and will probe the zero-window with the retransmits.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-04-08 11:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-03 22:58 [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option Jon Maloy
2024-04-03 22:58 ` [net-next 1/2] " Jon Maloy
2024-04-03 22:58 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze Jon Maloy
2024-04-05 17:55   ` Stefano Brivio
2024-04-05 19:37     ` Jon Maloy
2024-04-06 18:21 [net-next 0/2] tcp: add support for SO_PEEK_OFF socket option jmaloy
2024-04-06 18:21 ` [net-next 2/2] tcp: correct handling of extreme menory squeeze jmaloy
2024-04-06 16:37   ` Eric Dumazet
2024-04-07  4:52     ` Jason Xing
2024-04-07  5:51       ` Menglong Dong
2024-04-08 11:01         ` Jon Maloy
2024-04-08  8:03     ` Eric Dumazet
2024-04-08 11:13       ` Jon Maloy

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).