public inbox for passt-dev@passt.top
* [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers
@ 2023-09-22 22:06 Stefano Brivio
  2023-09-22 22:06 ` [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

The fundamental patch here is 3/5, which is a workaround for a rather
surprising kernel behaviour we seem to be hitting. This all comes from
the investigation around https://bugs.passt.top/show_bug.cgi?id=74.

I can't hit stalls anymore and throughput finally looks good to me
(~3.5 Gbps with 208 KiB rmem_max and wmem_max), but... please test.

Stefano Brivio (5):
  tcp: Fix comment to tcp_sock_consume()
  tcp: Reset STALLED flag on ACK only, check for pending socket data
  tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  tcp, tap: Don't increase tap-side sequence counter for dropped frames
  passt.1: Add note about tuning rmem_max and wmem_max for throughput

 passt.1 | 33 +++++++++++++++++++++++++
 tap.c   | 10 +++++---
 tap.h   |  2 +-
 tcp.c   | 74 +++++++++++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 102 insertions(+), 17 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume()
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
@ 2023-09-22 22:06 ` Stefano Brivio
  2023-09-23  2:48   ` David Gibson
  2023-09-22 22:06 ` [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data Stefano Brivio
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.

Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index dd3142d..aa1c8c9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2103,7 +2103,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 }
 
 /**
- * tcp_sock_consume() - Consume (discard) data from buffer, update ACK sequence
+ * tcp_sock_consume() - Consume (discard) data from buffer
  * @conn:	Connection pointer
  * @ack_seq:	ACK sequence, host order
  *
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
  2023-09-22 22:06 ` [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
@ 2023-09-22 22:06 ` Stefano Brivio
  2023-09-25  3:07   ` David Gibson
  2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
that we ran out of tap-side window space, or that all available
socket data is already in flight -- better names welcome!) on any
event: do that only if the first packet in a batch has the ACK flag
set.

Make sure we check for pending socket data when we reset it:
reverting back to level-triggered epoll events, as tcp_epoll_ctl()
does, isn't guaranteed to actually trigger a socket event.

Further, note that the flag only makes sense once a connection is
established, so move all this to the right place, which is convenient
for the next patch, as we want to check if the STALLED flag was set
before processing any new information about the window size
advertised by the tap.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index aa1c8c9..5528e05 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2572,8 +2572,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
 	if (th->ack && !(conn->events & ESTABLISHED))
 		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
 
-	conn_flag(c, conn, ~STALLED);
-
 	/* Establishing connection from socket */
 	if (conn->events & SOCK_ACCEPTED) {
 		if (th->syn && th->ack && !th->fin) {
@@ -2631,6 +2629,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
 	if (conn->seq_ack_to_tap != conn->seq_from_tap)
 		ack_due = 1;
 
+	if ((conn->flags & STALLED) && th->ack) {
+		conn_flag(c, conn, ~STALLED);
+		tcp_data_from_sock(c, conn);
+	}
+
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
  2023-09-22 22:06 ` [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
  2023-09-22 22:06 ` [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data Stefano Brivio
@ 2023-09-22 22:06 ` Stefano Brivio
  2023-09-22 22:31   ` Stefano Brivio
                     ` (2 more replies)
  2023-09-22 22:06 ` [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

It looks like we need it as a workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:

- an iperf3 client on the host sends about 160 KiB, typically
  segmented into five frames by passt. We read this data using
  MSG_PEEK

- the iperf3 server on the guest starts receiving

- meanwhile, the host kernel advertised a zero-sized window to the
  receiver, as expected

- eventually, the guest acknowledges all the data sent so far, and
  we drop it from the buffer, courtesy of tcp_sock_consume(), using
  recv() with MSG_TRUNC

- the client, however, doesn't get an updated window value, and
  even keepalive packets are answered with zero-window segments,
  until the connection is closed

It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).

Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the guest.

I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).
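
Just to illustrate the workaround being described -- a minimal sketch,
not the actual passt code, with made-up names and no error handling:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Re-apply the clamp we already have: even with an unchanged value,
     * this appears to make the kernel recompute and re-advertise the
     * receive window.
     */
    static int tcp_force_wnd_update(int s, int clamp)
    {
            return setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                              &clamp, sizeof(clamp));
    }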

Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/tcp.c b/tcp.c
index 5528e05..4606f17 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
 	wnd <<= conn->ws_from_tap;
 	wnd = MIN(MAX_WINDOW, wnd);
 
-	if (conn->flags & WND_CLAMPED) {
+	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
+	 * with a zero-sized window on a TCP socket, dropping data (once
+	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
+	 * to be enough to make the kernel advertise a non-zero window to the
+	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
+	 * value, fixes this.
+	 *
+	 * The STALLED flag on a connection is a sufficient indication that we
+	 * might have a zero-sized window on the socket, because it's set if we
+	 * exhausted the tap-side window, or if everything we receive from a
+	 * socket is already in flight to the guest.
+	 *
+	 * So, if STALLED is set, and we received a window value from the tap,
+	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
+	 * further and fixed in the kernel instead, if confirmed.
+	 */
+	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
 		if (prev_scaled == wnd)
 			return;
 
@@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			i = keep - 1;
 	}
 
-	tcp_clamp_window(c, conn, max_ack_seq_wnd);
-
 	/* On socket flush failure, pretend there was no ACK, try again later */
 	if (ack && !tcp_sock_consume(conn, max_ack_seq))
 		tcp_update_seqack_from_tap(c, conn, max_ack_seq);
 
+	tcp_clamp_window(c, conn, max_ack_seq_wnd);
+
 	if (retr) {
 		trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
 		      max_ack_seq, conn->seq_to_tap);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
                   ` (2 preceding siblings ...)
  2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
@ 2023-09-22 22:06 ` Stefano Brivio
  2023-09-25  4:47   ` David Gibson
  2023-09-22 22:06 ` [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput Stefano Brivio
  2023-09-25  5:52 ` [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers David Gibson
  5 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).

It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.

But we don't need the receiver to tell us (by way of duplicate or
missing ACKs) that we couldn't send them: we already know, because
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.

Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tap.c | 10 +++++++---
 tap.h |  2 +-
 tcp.c | 43 ++++++++++++++++++++++++++++++++++++-------
 3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/tap.c b/tap.c
index 93db989..b30ff81 100644
--- a/tap.c
+++ b/tap.c
@@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
  * @c:		Execution context
  * @iov:	Array of buffers, each containing one frame (with L2 headers)
  * @n:		Number of buffers/frames in @iov
+ *
+ * Return: number of frames actually sent
  */
-void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
+size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
 {
 	size_t m;
 
 	if (!n)
-		return;
+		return 0;
 
 	if (c->mode == MODE_PASST)
 		m = tap_send_frames_passt(c, iov, n);
@@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
 		m = tap_send_frames_pasta(c, iov, n);
 
 	if (m < n)
-		debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
+		debug("tap: failed to send %lu frames of %lu", n - m, n);
 
 	pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
+
+	return m;
 }
 
 /**
diff --git a/tap.h b/tap.h
index 021fb7c..952fafc 100644
--- a/tap.h
+++ b/tap.h
@@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
 		    void *in, size_t len);
 int tap_send(const struct ctx *c, const void *data, size_t len);
-void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
+size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
 void tap_update_mac(struct tap_hdr *taph,
 		    const unsigned char *eth_d, const unsigned char *eth_s);
 void tap_listen_handler(struct ctx *c, uint32_t events);
diff --git a/tcp.c b/tcp.c
index 4606f17..76b7b8d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -434,6 +434,16 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
  */
 static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
+/**
+ * tcp_buf_seq_update - Sequences to update with length of frames once sent
+ * @seq:	Pointer to sequence number sent to tap-side, to be updated
+ * @len:	TCP payload length
+ */
+struct tcp_buf_seq_update {
+	uint32_t *seq;
+	uint16_t len;
+};
+
 /* Static buffers */
 
 /**
@@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
 #endif
 tcp4_l2_buf[TCP_FRAMES_MEM];
 
+static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
+
 static unsigned int tcp4_l2_buf_used;
 
 /**
@@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
 #endif
 tcp6_l2_buf[TCP_FRAMES_MEM];
 
+static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
+
 static unsigned int tcp6_l2_buf_used;
 
 /* recvmsg()/sendmsg() data for tap */
@@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
  */
 static void tcp_l2_data_buf_flush(struct ctx *c)
 {
-	tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+	unsigned i;
+	size_t m;
+
+	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+	for (i = 0; i < m; i++)
+		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
 	tcp6_l2_buf_used = 0;
 
-	tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+	for (i = 0; i < m; i++)
+		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
 	tcp4_l2_buf_used = 0;
 }
 
@@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
  * @plen:	Payload length at L4
  * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
  * @seq:	Sequence number to be sent
- * @now:	Current timestamp
+ * @seq_update:	Pointer to sequence number to update on successful send
  */
 static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
-			    ssize_t plen, int no_csum, uint32_t seq)
+			    ssize_t plen, int no_csum, uint32_t seq,
+			    uint32_t *seq_update)
 {
 	struct iovec *iov;
 
@@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
 		uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
 
+		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
+		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
+
 		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
 		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
 						       check, seq);
@@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 	} else if (CONN_V6(conn)) {
 		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
 
+		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
+		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
+
 		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
 		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
 						       NULL, seq);
@@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 	int s = conn->sock, i, ret = 0;
 	struct msghdr mh_sock = { 0 };
 	uint16_t mss = MSS_GET(conn);
-	uint32_t already_sent;
+	uint32_t already_sent, seq;
 	struct iovec *iov;
 
 	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
@@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 
 	/* Finally, queue to tap */
 	plen = mss;
+	seq = conn->seq_to_tap;
 	for (i = 0; i < send_bufs; i++) {
 		int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
 
 		if (i == send_bufs - 1)
 			plen = last_len;
 
-		tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
-		conn->seq_to_tap += plen;
+		tcp_data_to_tap(c, conn, plen, no_csum, seq, &conn->seq_to_tap);
+		seq += plen;
 	}
 
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
                   ` (3 preceding siblings ...)
  2023-09-22 22:06 ` [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
@ 2023-09-22 22:06 ` Stefano Brivio
  2023-09-25  4:57   ` David Gibson
  2023-09-25  5:52 ` [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers David Gibson
  5 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:06 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt.1 | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/passt.1 b/passt.1
index 1ad4276..bcbe6fd 100644
--- a/passt.1
+++ b/passt.1
@@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the
 current sending buffer size to guest or target namespace. This might affect
 throughput of TCP connections.
 
+.SS Tuning for high throughput
+
+On Linux, by default, the maximum memory that can be set for receive and send
+socket buffers is 208 KiB. Those limits are set by the
+\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files,
+see \fBsocket\fR(7).
+
+As of Linux 6.5, while the TCP implementation can dynamically shrink buffers
+depending on utilisation even above those limits, such a small limit will
+reflect on the advertised TCP window at the beginning of a connection, and the
+buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed
+these limits anyway.
+
+Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and
+\fBSO_SNDBUF\fR will prevent TCP buffers from expanding above the \fIrmem_max\fR and
+\fIwmem_max\fR limits because the automatic adjustment provided by the TCP
+implementation is then disabled.
+
+As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and
+will not set TCP socket buffer sizes if they are lower than 2 MiB, because this
+would affect the maximum size of TCP buffers for the whole duration of a
+connection.
+
+Note that 208 KiB is, accounting for kernel overhead, enough to fit less than
+three TCP packets at the default MSS. In applications where high throughput is
+expected, it is therefore advisable to increase those limits to at least 2 MiB,
+or even 16 MiB:
+
+.nf
+	sysctl -w net.core.rmem_max=$((16 << 20))
+	sysctl -w net.core.wmem_max=$((16 << 20))
+.fi
+
 .SH LIMITATIONS
 
 Currently, IGMP/MLD proxying (RFC 4605) and support for SCTP (RFC 4960) are not
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
@ 2023-09-22 22:31   ` Stefano Brivio
  2023-09-23  7:55   ` David Gibson
  2023-09-25  4:09   ` David Gibson
  2 siblings, 0 replies; 32+ messages in thread
From: Stefano Brivio @ 2023-09-22 22:31 UTC (permalink / raw)
  To: David Gibson, Matej Hrica; +Cc: passt-dev

Oops,

On Sat, 23 Sep 2023 00:06:08 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> It looks like we need it as workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected
    ^^^ sender

> 
> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and
>   even keepalive packets are answered with zero-window segments,
>   until the connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the
> window which is advertised to the guest.
                                    ^^^ sender

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume()
  2023-09-22 22:06 ` [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
@ 2023-09-23  2:48   ` David Gibson
  0 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-23  2:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 1039 bytes --]

On Sat, Sep 23, 2023 at 12:06:06AM +0200, Stefano Brivio wrote:
> Note that tcp_sock_consume() doesn't update ACK sequence counters
> anymore.
> 
> Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index dd3142d..aa1c8c9 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2103,7 +2103,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>  }
>  
>  /**
> - * tcp_sock_consume() - Consume (discard) data from buffer, update ACK sequence
> + * tcp_sock_consume() - Consume (discard) data from buffer
>   * @conn:	Connection pointer
>   * @ack_seq:	ACK sequence, host order
>   *

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
  2023-09-22 22:31   ` Stefano Brivio
@ 2023-09-23  7:55   ` David Gibson
  2023-09-25  4:09   ` David Gibson
  2 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-23  7:55 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 4144 bytes --]

On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> It looks like we need it as workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected

You noted the s/receiver/sender/ here..

> 
> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and
>   even keepalive packets are answered with zero-window segments,
>   until the connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the
> window which is advertised to the guest.

..and the s/guest/sender/ here..

> 
> I couldn't quite confirm this issue by following all the possible
> code paths in the kernel, yet. If confirmed, this should be fixed in
> the kernel, but meanwhile this workaround looks robust to me (and it
> will be needed for backward compatibility anyway).
> 
> Reported-by: Matej Hrica <mhrica@redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=74
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 5528e05..4606f17 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
>  	wnd <<= conn->ws_from_tap;
>  	wnd = MIN(MAX_WINDOW, wnd);
>  
> -	if (conn->flags & WND_CLAMPED) {
> +	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> +	 * with a zero-sized window on a TCP socket, dropping data (once
> +	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> +	 * to be enough to make the kernel advertise a non-zero window to the
> +	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing

.. but you need another s/receiver/sender/ here too.

> +	 * value, fixes this.
> +	 *
> +	 * The STALLED flag on a connection is a sufficient indication that we
> +	 * might have a zero-sized window on the socket, because it's set if we
> +	 * exhausted the tap-side window, or if everything we receive from a
> +	 * socket is already in flight to the guest.
> +	 *
> +	 * So, if STALLED is set, and we received a window value from the tap,
> +	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> +	 * further and fixed in the kernel instead, if confirmed.
> +	 */
> +	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
>  		if (prev_scaled == wnd)
>  			return;
>  
> @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			i = keep - 1;
>  	}
>  
> -	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> -
>  	/* On socket flush failure, pretend there was no ACK, try again later */
>  	if (ack && !tcp_sock_consume(conn, max_ack_seq))
>  		tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>  
> +	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> +
>  	if (retr) {
>  		trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
>  		      max_ack_seq, conn->seq_to_tap);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-22 22:06 ` [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data Stefano Brivio
@ 2023-09-25  3:07   ` David Gibson
  2023-09-27 17:05     ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-25  3:07 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 3832 bytes --]

I think the change itself here is sound, but I have some nits to pick
with the description and reasoning.

On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> that we ran out of tap-side window space, or that all available
> socket data is already in flight -- better names welcome!

Hmm.. when you put it like that it makes me wonder if those two quite
different conditions really need the same handling.  Hrm.. I guess
both conditions mean that we can't accept data from the socket, even
if it's availble.

> ) on any
> event: do that only if the first packet in a batch has the ACK flag
> set.

"First packet in a batch" may not be accurate here - we're looking at
whichever packet we were up to before calling data_from_tap().  There
could have been earlier packets in the receive batch that were already
processed.

This also raises the question of why the first data packet should be
particularly privileged here.  I'm wondering if what we really want to
check is whether data_from_tap() advanced the ack pointer at all.

I'm not clear on when the th->ack check would ever fail in practice:
aren't the only normal packets in a TCP connection without ACK the
initial SYN or an RST?  We've handled the SYN case earlier, so should
we just have a blanket case above this that if we get a packet with
!ACK, we reset the connection?

> Make sure we check for pending socket data when we reset it:
> reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> does, isn't guaranteed to actually trigger a socket event.

Which sure seems like a kernel bug.  Some weird edge conditions for
edge-triggered seems expected, but this doesn't seem like valid
level-triggered semantics.

Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
what's going on here is that we can't take more data from the socket
until something happens on the tap side (either the window expands, or
it acks some data).  In which case should we be toggling EPOLLIN on
the socket instead?  That seems more explicitly to be saying to the
socket side "we don't currently care if you have data available".
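
Something like this, say (just a sketch of the idea, with made-up
names, not the actual tcp_epoll_ctl() logic):

    #include <sys/epoll.h>

    /* Drop EPOLLIN from the socket's event mask while we can't accept
     * more data from it, and add it back once the tap side makes
     * progress; keep EPOLLRDHUP so we still notice a peer shutdown.
     */
    static int tcp_sock_want_input(int epollfd, int s, void *ref, int want)
    {
            struct epoll_event ev = {
                    .events = EPOLLRDHUP | (want ? EPOLLIN : 0),
                    .data.ptr = ref,
            };

            return epoll_ctl(epollfd, EPOLL_CTL_MOD, s, &ev);
    }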

> Further, note that the flag only makes sense once a connection is
> established, so move all this to the right place, which is convenient
> for the next patch, as we want to check if the STALLED flag was set
> before processing any new information about the window size
> advertised by the tap.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index aa1c8c9..5528e05 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2572,8 +2572,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
>  	if (th->ack && !(conn->events & ESTABLISHED))
>  		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
>  
> -	conn_flag(c, conn, ~STALLED);
> -
>  	/* Establishing connection from socket */
>  	if (conn->events & SOCK_ACCEPTED) {
>  		if (th->syn && th->ack && !th->fin) {
> @@ -2631,6 +2629,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
>  	if (conn->seq_ack_to_tap != conn->seq_from_tap)
>  		ack_due = 1;
>  
> +	if ((conn->flags & STALLED) && th->ack) {
> +		conn_flag(c, conn, ~STALLED);
> +		tcp_data_from_sock(c, conn);
> +	}
> +
>  	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
>  		shutdown(conn->sock, SHUT_WR);
>  		conn_event(c, conn, SOCK_FIN_SENT);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
  2023-09-22 22:31   ` Stefano Brivio
  2023-09-23  7:55   ` David Gibson
@ 2023-09-25  4:09   ` David Gibson
  2023-09-25  4:10     ` David Gibson
  2023-09-25  4:21     ` David Gibson
  2 siblings, 2 replies; 32+ messages in thread
From: David Gibson @ 2023-09-25  4:09 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 5939 bytes --]

On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> It looks like we need it as workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected
> 
> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and
>   even keepalive packets are answered with zero-window segments,
>   until the connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the
> window which is advertised to the guest.
> 
> I couldn't quite confirm this issue by following all the possible
> code paths in the kernel, yet. If confirmed, this should be fixed in
> the kernel, but meanwhile this workaround looks robust to me (and it
> will be needed for backward compatibility anyway).

So, I tested this, and things got a bit complicated.

First, I reproduced the "read side" problem by setting
net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB.
The "160kiB" stall happened almost every time.  Applying this patch
appears to fix it completely, getting GiB/s throughput consistently.
So, yah.

Then I tried reproducing it differently: by setting both
net.core.rmem_max and net.core.wmem_max to 16MiB, but setting
SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which
actually results in a 256kiB buffer, because of the kernel's weird
interpretation).
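
For reference, the "weird interpretation" is that the kernel doubles
the requested value to allow for bookkeeping overhead, see socket(7).
A minimal illustration of that, not the actual tcp_sock_set_bufsize()
code:

    #include <stdio.h>
    #include <sys/socket.h>

    static void rcvbuf_demo(int s)
    {
            int want = 128 * 1024, got = 0;
            socklen_t len = sizeof(got);

            setsockopt(s, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
            getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
            /* got is now roughly twice what we asked for, ~256 KiB */
            printf("asked for %d, kernel set %d\n", want, got);
    }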

With the SO_RCVBUF clamp and without this patch, I don't get the
160kiB stall consistently any more.  What I *do* get nearly every
time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps.
Sometimes it stalls after several seconds. The stall is slightly
different from the 160kiB stall though: the 160kiB stall shows 0 bytes
transferred on both sides.  With the RCVBUF stall I get a trickle of
bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes
per interval on the sender but occasionally an interval with several
hundred KB.

That is, it seems like there's a buffer somewhere that's very slowly
draining into the receiver, then getting topped up in an instant once
it gets low enough.

When I have both this patch and the RCVBUF clamp, I don't seem to be
able to reproduce the trickle-stall anymore, but I still get the slow
transfer speeds most, but not every time.  Sometimes, but only rarely,
I do seem to still get a complete stall (0 bytes on both sides).

So it looks like there are two different things going on here.  It
looks like this patch fixes at least something, but there might be
some more things to investigate afterwards.  On that basis:

Tested-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> 
> Reported-by: Matej Hrica <mhrica@redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=74
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 5528e05..4606f17 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
>  	wnd <<= conn->ws_from_tap;
>  	wnd = MIN(MAX_WINDOW, wnd);
>  
> -	if (conn->flags & WND_CLAMPED) {
> +	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> +	 * with a zero-sized window on a TCP socket, dropping data (once
> +	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> +	 * to be enough to make the kernel advertise a non-zero window to the
> +	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
> +	 * value, fixes this.
> +	 *
> +	 * The STALLED flag on a connection is a sufficient indication that we
> +	 * might have a zero-sized window on the socket, because it's set if we
> +	 * exhausted the tap-side window, or if everything we receive from a
> +	 * socket is already in flight to the guest.
> +	 *
> +	 * So, if STALLED is set, and we received a window value from the tap,
> +	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> +	 * further and fixed in the kernel instead, if confirmed.
> +	 */
> +	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
>  		if (prev_scaled == wnd)
>  			return;
>  
> @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			i = keep - 1;
>  	}
>  
> -	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> -
>  	/* On socket flush failure, pretend there was no ACK, try again later */
>  	if (ack && !tcp_sock_consume(conn, max_ack_seq))
>  		tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>  
> +	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> +
>  	if (retr) {
>  		trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
>  		      max_ack_seq, conn->seq_to_tap);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-25  4:09   ` David Gibson
@ 2023-09-25  4:10     ` David Gibson
  2023-09-25  4:21     ` David Gibson
  1 sibling, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-25  4:10 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 4016 bytes --]

On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:
> On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> > It looks like we need it as workaround for this situation, readily
> > reproducible at least with a 6.5 Linux kernel, with default rmem_max
> > and wmem_max values:
> > 
> > - an iperf3 client on the host sends about 160 KiB, typically
> >   segmented into five frames by passt. We read this data using
> >   MSG_PEEK
> > 
> > - the iperf3 server on the guest starts receiving
> > 
> > - meanwhile, the host kernel advertised a zero-sized window to the
> >   receiver, as expected
> > 
> > - eventually, the guest acknowledges all the data sent so far, and
> >   we drop it from the buffer, courtesy of tcp_sock_consume(), using
> >   recv() with MSG_TRUNC
> > 
> > - the client, however, doesn't get an updated window value, and
> >   even keepalive packets are answered with zero-window segments,
> >   until the connection is closed
> > 
> > It looks like dropping data from a socket using MSG_TRUNC doesn't
> > cause a recalculation of the window, which would be expected as a
> > result of any receiving operation that invalidates data on a buffer
> > (that is, not with MSG_PEEK).
> > 
> > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> > the previous value we clamped to, forces a recalculation of the
> > window which is advertised to the guest.
> > 
> > I couldn't quite confirm this issue by following all the possible
> > code paths in the kernel, yet. If confirmed, this should be fixed in
> > the kernel, but meanwhile this workaround looks robust to me (and it
> > will be needed for backward compatibility anyway).
> 
> So, I tested this, and things got a bit complicated.
> 
> First, I reproduced the "read side" problem by setting
> net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB.
> The "160kiB" stall happened almost every time.  Applying this patch
> appears to fix it completely, getting GiB/s throughput consistently.
> So, yah.
> 
> Then I tried reproducing it differently: by setting both
> net.core.rmem_max and net.core.wmem_max to 16MiB, but setting
> SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which
> actually results in a 256kiB buffer, because of the kernel's weird
> interpretation).
> 
> With the SO_RCVBUF clamp and without this patch, I don't get the
> 160kiB stall consistently any more.  What I *do* get is nearly every
> time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps.
> Sometimes it stalls after several seconds. The stall is slightly
> different from the 160kiB stall though: the 160kiB stall seems 0 bytes
> transferred on both sides.  With the RCVBUF stall I get a trickle of
> bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes
> per interval on the sender but occasionally an interval with several
> hundred KB.
> 
> That is it seems like there's a buffer somewhere that's very slowly
> draining into the receiver, then getting topped up in an instant once
> it gets low enough.
> 
> When I have both this patch and the RCVBUF clamp, I don't seem to be
> able to reproduce the trickle-stall anymore, but I still get the slow
> transfer speeds most, but not every time.  Sometimes, but only rarely,
> I do seem to still get a complete stall (0 bytes on both sides).
> 
> So it looks like there are two different things going on here.  It
> looks like this patch fixes at least something, but there might be
> some more things to investigate afterwards.  On that basis:
> 
> Tested-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

With the exception of the previously noted s/receiver/sender/ fixes,
of course.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-25  4:09   ` David Gibson
  2023-09-25  4:10     ` David Gibson
@ 2023-09-25  4:21     ` David Gibson
  2023-09-27 17:05       ` Stefano Brivio
  1 sibling, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-25  4:21 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 4080 bytes --]

On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:
> On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> > It looks like we need it as workaround for this situation, readily
> > reproducible at least with a 6.5 Linux kernel, with default rmem_max
> > and wmem_max values:
> > 
> > - an iperf3 client on the host sends about 160 KiB, typically
> >   segmented into five frames by passt. We read this data using
> >   MSG_PEEK
> > 
> > - the iperf3 server on the guest starts receiving
> > 
> > - meanwhile, the host kernel advertised a zero-sized window to the
> >   receiver, as expected
> > 
> > - eventually, the guest acknowledges all the data sent so far, and
> >   we drop it from the buffer, courtesy of tcp_sock_consume(), using
> >   recv() with MSG_TRUNC
> > 
> > - the client, however, doesn't get an updated window value, and
> >   even keepalive packets are answered with zero-window segments,
> >   until the connection is closed
> > 
> > It looks like dropping data from a socket using MSG_TRUNC doesn't
> > cause a recalculation of the window, which would be expected as a
> > result of any receiving operation that invalidates data on a buffer
> > (that is, not with MSG_PEEK).
> > 
> > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> > the previous value we clamped to, forces a recalculation of the
> > window which is advertised to the guest.
> > 
> > I couldn't quite confirm this issue by following all the possible
> > code paths in the kernel, yet. If confirmed, this should be fixed in
> > the kernel, but meanwhile this workaround looks robust to me (and it
> > will be needed for backward compatibility anyway).
> 
> So, I tested this, and things got a bit complicated.
> 
> First, I reproduced the "read side" problem by setting
> net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB.
> The "160kiB" stall happened almost every time.  Applying this patch
> appears to fix it completely, getting GiB/s throughput consistently.
> So, yah.
> 
> Then I tried reproducing it differently: by setting both
> net.core.rmem_max and net.core.wmem_max to 16MiB, but setting
> SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which
> actually results in a 256kiB buffer, because of the kernel's weird
> interpretation).
> 
> With the SO_RCVBUF clamp and without this patch, I don't get the
> 160kiB stall consistently any more.  What I *do* get is nearly every
> time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps.
> Sometimes it stalls after several seconds. The stall is slightly
> different from the 160kiB stall though: the 160kiB stall seems 0 bytes
> transferred on both sides.  With the RCVBUF stall I get a trickle of
> bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes
> per interval on the sender but occasionally an interval with several
> hundred KB.
> 
> That is it seems like there's a buffer somewhere that's very slowly
> draining into the receiver, then getting topped up in an instant once
> it gets low enough.
> 
> When I have both this patch and the RCVBUF clamp, I don't seem to be
> able to reproduce the trickle-stall anymore, but I still get the slow
> transfer speeds most, but not every time.  Sometimes, but only rarely,
> I do seem to still get a complete stall (0 bytes on both sides).

I noted another oddity.  With this patch, _no_ RCVBUF clamp and 16MiB
wmem_max fixed, things seem to behave much better with a small
rmem_max than large.  With rmem_max=256KiB I get pretty consistent
37Gbps throughput and iperf3 -c reports 0 retransmits.

With rmem_max=16MiB, the throughput fluctuates from second to second
between ~3Gbps and ~30Gbps.  The client reports retransmits in some
intervals, which is pretty weird over lo.

Urgh... so many variables.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-22 22:06 ` [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
@ 2023-09-25  4:47   ` David Gibson
  2023-09-27 17:06     ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-25  4:47 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 8687 bytes --]

On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:
> ...so that we'll retry sending them, instead of more-or-less silently
> dropping them. This happens quite frequently if our sending buffer on
> the UNIX domain socket is heavily constrained (for instance, by the
> 208 KiB default memory limit).
> 
> It might be argued that dropping frames is part of the expected TCP
> flow: we don't dequeue those from the socket anyway, so we'll
> eventually retransmit them.
> 
> But we don't need the receiver to tell us (by way of duplicate or
> missing ACKs) that we couldn't send them: we already know, as
> sendmsg() reports that. This seems to considerably increase
> throughput stability and throughput itself for TCP connections with
> default wmem_max values.
> 
> Unfortunately, the 16 bits left as padding in the frame descriptors

I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
yes?  For AVX2 we have substantially more space here.  Couldn't we put
a conn (or seq) pointer in here at the cost of a few bytes MSS for
non-AVX2 and zero cost for AVX2 (which is probably the majority case)?
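
Something like this is what I have in mind, just as an illustration
(made-up struct and field names, not the actual tcp[46]_l2_buf_t layout):

#include <stdint.h>

/* Hypothetical sketch: keep the bookkeeping in the frame buffer itself,
 * in what is currently padding, instead of a parallel array.  On
 * non-AVX2 builds this would cost a few bytes of usable MSS.
 */
struct l2_buf_sketch {
	uint32_t *seq_to_tap;	/* sequence counter to advance once sent */
	uint16_t plen;		/* TCP payload length of this frame */
	/* ... Ethernet, IP and TCP headers, then payload ... */
};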

> we use internally aren't enough to uniquely identify for which
> connection we should update sequence numbers: create a parallel
> array of pointers to sequence numbers and L4 lengths, of
> TCP_FRAMES_MEM size, and go through it after calling sendmsg().
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tap.c | 10 +++++++---
>  tap.h |  2 +-
>  tcp.c | 43 ++++++++++++++++++++++++++++++++++++-------
>  3 files changed, 44 insertions(+), 11 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 93db989..b30ff81 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
>   * @c:		Execution context
>   * @iov:	Array of buffers, each containing one frame (with L2 headers)
>   * @n:		Number of buffers/frames in @iov
> + *
> + * Return: number of frames actually sent
>   */
> -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
>  {
>  	size_t m;
>  
>  	if (!n)
> -		return;
> +		return 0;
>  
>  	if (c->mode == MODE_PASST)
>  		m = tap_send_frames_passt(c, iov, n);
> @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
>  		m = tap_send_frames_pasta(c, iov, n);
>  
>  	if (m < n)
> -		debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
> +		debug("tap: failed to send %lu frames of %lu", n - m, n);
>  
>  	pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> +
> +	return m;
>  }
>  
>  /**
> diff --git a/tap.h b/tap.h
> index 021fb7c..952fafc 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
>  		    const struct in6_addr *src, const struct in6_addr *dst,
>  		    void *in, size_t len);
>  int tap_send(const struct ctx *c, const void *data, size_t len);
> -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
>  void tap_update_mac(struct tap_hdr *taph,
>  		    const unsigned char *eth_d, const unsigned char *eth_s);
>  void tap_listen_handler(struct ctx *c, uint32_t events);
> diff --git a/tcp.c b/tcp.c
> index 4606f17..76b7b8d 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -434,6 +434,16 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
>   */
>  static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
>  
> +/**
> + * tcp_buf_seq_update - Sequences to update with length of frames once sent
> + * @seq:	Pointer to sequence number sent to tap-side, to be updated
> + * @len:	TCP payload length
> + */
> +struct tcp_buf_seq_update {
> +	uint32_t *seq;
> +	uint16_t len;
> +};
> +
>  /* Static buffers */
>  
>  /**
> @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
>  #endif
>  tcp4_l2_buf[TCP_FRAMES_MEM];
>  
> +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> +
>  static unsigned int tcp4_l2_buf_used;
>  
>  /**
> @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
>  #endif
>  tcp6_l2_buf[TCP_FRAMES_MEM];
>  
> +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> +
>  static unsigned int tcp6_l2_buf_used;
>  
>  /* recvmsg()/sendmsg() data for tap */
> @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
>   */
>  static void tcp_l2_data_buf_flush(struct ctx *c)
>  {
> -	tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> +	unsigned i;
> +	size_t m;
> +
> +	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> +	for (i = 0; i < m; i++)
> +		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
>  	tcp6_l2_buf_used = 0;
>  
> -	tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> +	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> +	for (i = 0; i < m; i++)
> +		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
>  	tcp4_l2_buf_used = 0;
>  }
>  
> @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
>   * @plen:	Payload length at L4
>   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
>   * @seq:	Sequence number to be sent
> - * @now:	Current timestamp
> + * @seq_update:	Pointer to sequence number to update on successful send
>   */
>  static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> -			    ssize_t plen, int no_csum, uint32_t seq)
> +			    ssize_t plen, int no_csum, uint32_t seq,
> +			    uint32_t *seq_update)

seq_update is always &conn->seq_to_tap, so there's no need for an
additional parameter.

>  {
>  	struct iovec *iov;
>  
> @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
>  		uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
>  
> +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> +
>  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
>  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
>  						       check, seq);
> @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  	} else if (CONN_V6(conn)) {
>  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
>  
> +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> +
>  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
>  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
>  						       NULL, seq);
> @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  	int s = conn->sock, i, ret = 0;
>  	struct msghdr mh_sock = { 0 };
>  	uint16_t mss = MSS_GET(conn);
> -	uint32_t already_sent;
> +	uint32_t already_sent, seq;
>  	struct iovec *iov;
>  
>  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  
>  	/* Finally, queue to tap */
>  	plen = mss;
> +	seq = conn->seq_to_tap;

This will only be correct if tcp_l2_data_buf_flush() is *always*
called between tcp_data_from_sock() calls for the same socket.  That
should be true for the normal course of things.  However, couldn't it
happen that we get a normal socket EPOLLIN event for a particular
connection - calling tcp_data_from_sock() - but in the same epoll()
round we also get a tap ack for the same connection which causes
another call to tcp_data_from_sock() (with the change from patch
2/5).  IIRC those would both happen before the deferred handling and
therefore the data_buf_flush().

Not sure how to deal with that short of separate 'seq_queued' and
'seq_sent' counters in the connection structure, which is a bit
unfortunate.
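
To illustrate, something along these lines (names invented, not a
concrete proposal for the actual connection structure):

#include <stdint.h>

/* Hypothetical sketch only: split the single counter into what has been
 * queued into the tap-side buffers and what sendmsg() confirmed; the
 * real struct tcp_tap_conn has many more fields.
 */
struct conn_seq_sketch {
	uint32_t seq_queued;	/* queued into frame buffers */
	uint32_t seq_sent;	/* confirmed sent towards the tap side */
};

static void frame_queued(struct conn_seq_sketch *s, uint16_t plen)
{
	s->seq_queued += plen;
}

static void frames_flushed(struct conn_seq_sketch *s, uint32_t bytes_sent)
{
	s->seq_sent += bytes_sent;
}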

>  	for (i = 0; i < send_bufs; i++) {
>  		int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
>  
>  		if (i == send_bufs - 1)
>  			plen = last_len;
>  
> -		tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
> -		conn->seq_to_tap += plen;
> +		tcp_data_to_tap(c, conn, plen, no_csum, seq, &conn->seq_to_tap);
> +		seq += plen;
>  	}
>  
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput
  2023-09-22 22:06 ` [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput Stefano Brivio
@ 2023-09-25  4:57   ` David Gibson
  2023-09-27 17:06     ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-25  4:57 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 3201 bytes --]

On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  passt.1 | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/passt.1 b/passt.1
> index 1ad4276..bcbe6fd 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the
>  current sending buffer size to guest or target namespace. This might affect
>  throughput of TCP connections.
>  
> +.SS Tuning for high throughput
> +
> +On Linux, by default, the maximum memory that can be set for receive and send
> +socket buffers is 208 KiB. Those limits are set by the
> +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files,
> +see \fBsocket\fR(7).
> +
> +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers
> +depending on utilisation even above those limits, such a small limit will

"shrink buffers" and "even above those limits" don't seem to quite
work together.

> +reflect on the advertised TCP window at the beginning of a connection, and the

Hmmm.... while [rw]mem_max might limit that initial window size, I
wouldn't expect increasing the limits alone to increase that initial
window size: wouldn't that instead be affected by the TCP default
buffer size i.e. the middle value in net.ipv4.tcp_rmem?

> +buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed
> +these limits anyway.
> +
> +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and
> +\fBSO_SNDBUF\fR will prevent TCP buffers from expanding above the \fIrmem_max\fR and
> +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP
> +implementation is then disabled.
> +
> +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and
> +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this
> +would affect the maximum size of TCP buffers for the whole duration of a
> +connection.
> +
> +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than
> +three TCP packets at the default MSS. In applications where high throughput is
> +expected, it is therefore advisable to increase those limits to at least 2 MiB,
> +or even 16 MiB:
> +
> +.nf
> +	sysctl -w net.core.rmem_max=$((16 << 20))
> +	sysctl -w net.core.wmem_max=$((16 << 20))
> +.fi

As noted in a previous mail, empirically, this doesn't necessarily
seem to work better for me.  I'm wondering if we'd be better off never
touching RCVBUF and SNDBUF for TCP sockets, and letting the kernel do
its adaptive thing.  We probably still want to expand the buffers as
much as we can for the Unix socket, though.  And we likely still want
expanded limits for the tests so that iperf3 can use large buffers.

> +
>  .SH LIMITATIONS
>  
>  Currently, IGMP/MLD proxying (RFC 4605) and support for SCTP (RFC 4960) are not

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers
  2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
                   ` (4 preceding siblings ...)
  2023-09-22 22:06 ` [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput Stefano Brivio
@ 2023-09-25  5:52 ` David Gibson
  5 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-25  5:52 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 2746 bytes --]

On Sat, Sep 23, 2023 at 12:06:05AM +0200, Stefano Brivio wrote:
> The fundamental patch here is 3/5, which is a workaround for a rather
> surprising kernel behaviour we seem to be hitting. This all comes from
> the investigation around https://bugs.passt.top/show_bug.cgi?id=74.
> 
> I can't hit stalls anymore and throughput looks finally good to me
> (~3.5gbps with 208 KiB rmem_max and wmem_max), but... please test.

Write side issue, testing results

I replied with a bunch of test information already, but that was all
related specifically to the read-side issue: I used 16MiB wmem_max
throughout, but limited the read side buffer either with rmem_max or
SO_RCVBUF.

I've now done some tests looking specifically for write side issues.
I basically reversed the setup, with rmem_max set to 4MiB throughout,
but wmem_max limited to 256kiB.

With no patches applied, I easily get a stall, although the exact
details are a little different from the read-side stall: rather than
being consistently 0 there are a few small bursts of traffic on both
sides.

With 2/5 applied, there doesn't appear to be much difference in
behaviour.

With 3/5 applied, I can no longer reproduce stalls, but throughput
isn't very good.

With 4/5 applied, throughput seems to improve notably (from ~300Mbps
to ~2.5Gbps, though, not surprisingly, it's variable from second to
second).

Tentative conclusions:

 * The primary cause of the stalls appears to be the kernel bug
   identified, where the window isn't properly recalculated after
   MSG_TRUNC.  3/5 appears to successfully work around that bug (a
   minimal sketch of the idea follows this list).  I think getting that
   merged is our top priority.

 * 2/5 makes logical sense to me, but I don't see a lot of evidence of
   it changing the behaviour here much.  I think we hold it back for
   now, polish it a bit, maybe reconsider it as part of a broader
   rethink of the STALLED flag.

 * 4/5 doesn't appear to be linked to the stalls per se, but does
   appear to generally improve behaviour with limited wmem_max.  I
   think we can improve the implementation a bit, then look at merging
   as the second priority.
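
For reference, the heart of that workaround is essentially the
setsockopt() below (a minimal sketch, not the actual 3/5 patch; the
surrounding flag handling is omitted):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: re-setting TCP_WINDOW_CLAMP, even to the value we already
 * clamped to, makes the kernel recompute the window it advertises
 * after we dropped data with recv(..., MSG_TRUNC).
 */
static void tcp_force_window_update(int s, int clamp)
{
	if (setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, sizeof(clamp)))
		perror("TCP_WINDOW_CLAMP");
}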

Open questions:

 * Even with the fixes, why does very large rmem_max seem to cause
   wildly variable and not great throughput?

 * Why does explicitly limiting RCVBUF usually, but not always, cause
   very poor throughput but without stalling?

 * Given the above oddities, is there any value to us setting RCVBUF
   for TCP sockets, rather than just letting the kernel adapt it?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-25  3:07   ` David Gibson
@ 2023-09-27 17:05     ` Stefano Brivio
  2023-09-28  1:48       ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-27 17:05 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Mon, 25 Sep 2023 13:07:24 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> I think the change itself here is sound, but I have some nits to pick
> with the description and reasoning.
> 
> On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > that we ran out of tap-side window space, or that all available
> > socket data is already in flight -- better names welcome!  
> 
> Hmm.. when you put it like that it makes me wonder if those two quite
> different conditions really need the same handling.  Hrm.. I guess
> both conditions mean that we can't accept data from the socket, even
> if it's available.

Right. I mean, we can also call them differently... or maybe pick a
name that reflects the outcome/handling instead of what happened.

> > ) on any
> > event: do that only if the first packet in a batch has the ACK flag
> > set.  
> 
> "First packet in a batch" may not be accurate here - we're looking at
> whichever packet we were up to before calling data_from_tap().  There
> could have been earlier packets in the receive batch that were already
> processed.

Well, it depends on what we call "batch" -- here I meant the pool of
packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
would be more accurate.

> This also raises the question of why the first data packet should be
> particularly privileged here.

No reason other than convenience, and yes, it can be subtly wrong.

> I'm wondering if what we really want to
> check is whether data_from_tap() advanced the ack pointer at all.

Right, we probably should do that instead.

> I'm not clear on when the th->ack check would ever fail in practice:
> aren't the only normal packets in a TCP connection without ACK the
> initial SYN or an RST?  We've handled the SYN case earlier, so should
> we just have a blanket case above this that if we get a packet with
> !ACK, we reset the connection?

One thing that's legitimate (rarely seen, but I've seen it, I don't
remember if the Linux kernel ever does that) is a segment without ACK,
and without data, that just updates the window (for example after a
zero window).

If the sequence received/processed so far doesn't correspond to the
latest sequence sent, omitting the ACK flag is useful so that the
window update is not taken as duplicate ACK (that would trigger
retransmission).

> > Make sure we check for pending socket data when we reset it:
> > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > does, isn't guaranteed to actually trigger a socket event.  
> 
> Which sure seems like a kernel bug.  Some weird edge conditions for
> edge-triggered seems expected, but this doesn't seem like valid
> level-triggered semantics.

Hmm, yes, and by doing a quick isolated test actually this seems to work
as intended in the kernel. I should drop this and try again.

> Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
> what's going on here is that we can't take more data from the socket
> until something happens on the tap side (either the window expands, or
> it acks some data).  In which case should we be toggling EPOLLIN on
> the socket instead?  That seems more explicitly to be saying to the
> socket side "we don't currently care if you have data available".

The reason was to act on EPOLLRDHUP at the same time. But well, we
could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
more sense.
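
That is, roughly (a sketch only, with a made-up helper, ignoring how the
reference is actually packed into the epoll_event data in passt):

#include <sys/epoll.h>

/* Sketch: while we can't take more data from the socket, stop watching
 * EPOLLIN and EPOLLRDHUP on it; restore the usual event set once the
 * tap side acks data or the window reopens.
 */
static int sock_input_mask(int epollfd, int s, void *ref, int stalled)
{
	struct epoll_event ev = {
		.events = stalled ? 0 : (EPOLLIN | EPOLLRDHUP),
		.data.ptr = ref,
	};

	return epoll_ctl(epollfd, EPOLL_CTL_MOD, s, &ev);
}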

For the time being, we should probably try to see what happens
if this patch simply moves conn_flag(c, conn, ~STALLED); to where 3/5
needs it (or directly incorporate that into 3/5).

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-25  4:21     ` David Gibson
@ 2023-09-27 17:05       ` Stefano Brivio
  2023-09-28  1:51         ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-27 17:05 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Mon, 25 Sep 2023 14:21:47 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:
> > On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:  
> > > It looks like we need it as workaround for this situation, readily
> > > reproducible at least with a 6.5 Linux kernel, with default rmem_max
> > > and wmem_max values:
> > > 
> > > - an iperf3 client on the host sends about 160 KiB, typically
> > >   segmented into five frames by passt. We read this data using
> > >   MSG_PEEK
> > > 
> > > - the iperf3 server on the guest starts receiving
> > > 
> > > - meanwhile, the host kernel advertised a zero-sized window to the
> > >   receiver, as expected
> > > 
> > > - eventually, the guest acknowledges all the data sent so far, and
> > >   we drop it from the buffer, courtesy of tcp_sock_consume(), using
> > >   recv() with MSG_TRUNC
> > > 
> > > - the client, however, doesn't get an updated window value, and
> > >   even keepalive packets are answered with zero-window segments,
> > >   until the connection is closed
> > > 
> > > It looks like dropping data from a socket using MSG_TRUNC doesn't
> > > cause a recalculation of the window, which would be expected as a
> > > result of any receiving operation that invalidates data on a buffer
> > > (that is, not with MSG_PEEK).
> > > 
> > > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> > > the previous value we clamped to, forces a recalculation of the
> > > window which is advertised to the guest.
> > > 
> > > I couldn't quite confirm this issue by following all the possible
> > > code paths in the kernel, yet. If confirmed, this should be fixed in
> > > the kernel, but meanwhile this workaround looks robust to me (and it
> > > will be needed for backward compatibility anyway).  
> > 
> > So, I tested this, and things got a bit complicated.
> > 
> > First, I reproduced the "read side" problem by setting
> > net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB.
> > The "160kiB" stall happened almost every time.  Applying this patch
> > appears to fix it completely, getting GiB/s throughput consistently.
> > So, yah.
> > 
> > Then I tried reproducing it differently: by setting both
> > net.core.rmem_max and net.core.wmem_max to 16MiB, but setting
> > SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which
> > actually results in a 256kiB buffer, because of the kernel's weird
> > interpretation).
> > 
> > With the SO_RCVBUF clamp and without this patch, I don't get the
> > 160kiB stall consistently any more.  What I *do* get nearly every
> > time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps.
> > Sometimes it stalls after several seconds. The stall is slightly
> > different from the 160kiB stall though: the 160kiB stall seems to be 0 bytes
> > transferred on both sides.  With the RCVBUF stall I get a trickle of
> > bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes
> > per interval on the sender but occasionally an interval with several
> > hundred KB.
> > 
> > That is, it seems like there's a buffer somewhere that's very slowly
> > draining into the receiver, then getting topped up in an instant once
> > it gets low enough.
> > 
> > When I have both this patch and the RCVBUF clamp, I don't seem to be
> > able to reproduce the trickle-stall anymore, but I still get the slow
> > transfer speeds most, but not every time.  Sometimes, but only rarely,
> > I do seem to still get a complete stall (0 bytes on both sides).  
> 
> I noted another oddity.  With this patch, _no_ RCVBUF clamp and 16MiB
> wmem_max fixed, things seem to behave much better with a small
> rmem_max than large.  With rmem_max=256KiB I get pretty consistent
> 37Gbps throughput and iperf3 -c reports 0 retransmits.
> 
> With rmem_max=16MiB, the throughput fluctuates from second to second
> between ~3Gbps and ~30Gbps.  The client reports retransmits in some
> intervals, which is pretty weird over lo.
> 
> Urgh... so many variables.

This is probably due to the receive buffer getting bigger than
TCP_FRAMES_MEM * MSS4 (or MSS6), so the amount of data we can read in
one shot from the sockets isn't optimally sized anymore.

We should have a look at the difference between not clamping at all
(and if that yields the same throughput, great), and clamping to, I
guess, TCP_FRAMES_MEM * MIN(MSS4, MSS6).
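
Meaning, roughly, something like this instead of asking for the biggest
buffer we can get (sketch only; 'frames' and the MSS values stand for
TCP_FRAMES_MEM, MSS4 and MSS6 above):

#include <stdio.h>
#include <sys/socket.h>

/* Sketch: clamp the kernel receive buffer to what we can drain in one
 * recvmsg() round, rather than the largest RCVBUF we can get.
 */
static void sock_clamp_rcvbuf(int s, int frames, int mss4, int mss6)
{
	int v = frames * (mss4 < mss6 ? mss4 : mss6);

	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v)))
		perror("SO_RCVBUF");
}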

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-25  4:47   ` David Gibson
@ 2023-09-27 17:06     ` Stefano Brivio
  2023-09-28  1:58       ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-27 17:06 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Mon, 25 Sep 2023 14:47:52 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:
> > ...so that we'll retry sending them, instead of more-or-less silently
> > dropping them. This happens quite frequently if our sending buffer on
> > the UNIX domain socket is heavily constrained (for instance, by the
> > 208 KiB default memory limit).
> > 
> > It might be argued that dropping frames is part of the expected TCP
> > flow: we don't dequeue those from the socket anyway, so we'll
> > eventually retransmit them.
> > 
> > But we don't need the receiver to tell us (by way of duplicate or
> > missing ACKs) that we couldn't send them: we already know, as
> > sendmsg() reports that. This seems to considerably increase
> > throughput stability and throughput itself for TCP connections with
> > default wmem_max values.
> > 
> > Unfortunately, the 16 bits left as padding in the frame descriptors  
> 
> I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> yes?

Right, that.

> For AVX2 we have substantially more space here.  Couldn't we put
> a conn (or seq) pointer in here at the cost of a few bytes MSS for
> non-AVX2 and zero cost for AVX2 (which is probably the majority case)?

Yes, true. On the other hand, having this parallel array only affects
readability I guess, whereas inserting pointers and lengths in
tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
non-AVX2 x86, but also on other architectures). So I'd rather stick to
this.

> > we use internally aren't enough to uniquely identify for which
> > connection we should update sequence numbers: create a parallel
> > array of pointers to sequence numbers and L4 lengths, of
> > TCP_FRAMES_MEM size, and go through it after calling sendmsg().
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tap.c | 10 +++++++---
> >  tap.h |  2 +-
> >  tcp.c | 43 ++++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 44 insertions(+), 11 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 93db989..b30ff81 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> >   * @c:		Execution context
> >   * @iov:	Array of buffers, each containing one frame (with L2 headers)
> >   * @n:		Number of buffers/frames in @iov
> > + *
> > + * Return: number of frames actually sent
> >   */
> > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> >  {
> >  	size_t m;
> >  
> >  	if (!n)
> > -		return;
> > +		return 0;
> >  
> >  	if (c->mode == MODE_PASST)
> >  		m = tap_send_frames_passt(c, iov, n);
> > @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> >  		m = tap_send_frames_pasta(c, iov, n);
> >  
> >  	if (m < n)
> > -		debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
> > +		debug("tap: failed to send %lu frames of %lu", n - m, n);
> >  
> >  	pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> > +
> > +	return m;
> >  }
> >  
> >  /**
> > diff --git a/tap.h b/tap.h
> > index 021fb7c..952fafc 100644
> > --- a/tap.h
> > +++ b/tap.h
> > @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
> >  		    const struct in6_addr *src, const struct in6_addr *dst,
> >  		    void *in, size_t len);
> >  int tap_send(const struct ctx *c, const void *data, size_t len);
> > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> >  void tap_update_mac(struct tap_hdr *taph,
> >  		    const unsigned char *eth_d, const unsigned char *eth_s);
> >  void tap_listen_handler(struct ctx *c, uint32_t events);
> > diff --git a/tcp.c b/tcp.c
> > index 4606f17..76b7b8d 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -434,6 +434,16 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
> >   */
> >  static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> >  
> > +/**
> > + * tcp_buf_seq_update - Sequences to update with length of frames once sent
> > + * @seq:	Pointer to sequence number sent to tap-side, to be updated
> > + * @len:	TCP payload length
> > + */
> > +struct tcp_buf_seq_update {
> > +	uint32_t *seq;
> > +	uint16_t len;
> > +};
> > +
> >  /* Static buffers */
> >  
> >  /**
> > @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
> >  #endif
> >  tcp4_l2_buf[TCP_FRAMES_MEM];
> >  
> > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> > +
> >  static unsigned int tcp4_l2_buf_used;
> >  
> >  /**
> > @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
> >  #endif
> >  tcp6_l2_buf[TCP_FRAMES_MEM];
> >  
> > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> > +
> >  static unsigned int tcp6_l2_buf_used;
> >  
> >  /* recvmsg()/sendmsg() data for tap */
> > @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
> >   */
> >  static void tcp_l2_data_buf_flush(struct ctx *c)
> >  {
> > -	tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > +	unsigned i;
> > +	size_t m;
> > +
> > +	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > +	for (i = 0; i < m; i++)
> > +		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> >  	tcp6_l2_buf_used = 0;
> >  
> > -	tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > +	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > +	for (i = 0; i < m; i++)
> > +		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> >  	tcp4_l2_buf_used = 0;
> >  }
> >  
> > @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
> >   * @plen:	Payload length at L4
> >   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
> >   * @seq:	Sequence number to be sent
> > - * @now:	Current timestamp
> > + * @seq_update:	Pointer to sequence number to update on successful send
> >   */
> >  static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > -			    ssize_t plen, int no_csum, uint32_t seq)
> > +			    ssize_t plen, int no_csum, uint32_t seq,
> > +			    uint32_t *seq_update)  
> 
> seq_update is always &conn->seq_to_tap, so there's no need for an
> additional parameter.

Oh, right, I'll drop that.

> >  {
> >  	struct iovec *iov;
> >  
> > @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> >  		uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
> >  
> > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> > +
> >  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> >  						       check, seq);
> > @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  	} else if (CONN_V6(conn)) {
> >  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> >  
> > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> > +
> >  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> >  						       NULL, seq);
> > @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  	int s = conn->sock, i, ret = 0;
> >  	struct msghdr mh_sock = { 0 };
> >  	uint16_t mss = MSS_GET(conn);
> > -	uint32_t already_sent;
> > +	uint32_t already_sent, seq;
> >  	struct iovec *iov;
> >  
> >  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  	/* Finally, queue to tap */
> >  	plen = mss;
> > +	seq = conn->seq_to_tap;  
> 
> This will only be correct if tcp_l2_data_buf_flush() is *always*
> called between tcp_data_from_sock() calls for the same socket.  That
> should be true for the normal course of things.  However, couldn't it
> happen that we get a normal socket EPOLLIN event for a particular
> connection - calling tcp_data_from_sock() - but in the same epoll()
> round we also get a tap ack for the same connection which causes
> another call to tcp_data_from_sock() (with the change from patch
> 2/5).  IIRC those would both happen before the deferred handling and
> therefore the data_buf_flush().

Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
but with that change, it's not. Unless we drop that change from 2/5.

> Not sure how to deal with that short of separate 'seq_queued' and
> 'seq_sent' counters in the connection structure, which is a bit
> unfortunate.

I wonder how bad it is if we call tcp_l2_data_buf_flush()
unconditionally before calling tcp_data_from_sock() from
tcp_tap_handler(). But again, maybe this is not needed at all, we
should check that epoll detail from 2/5 first...

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput
  2023-09-25  4:57   ` David Gibson
@ 2023-09-27 17:06     ` Stefano Brivio
  2023-09-28  2:02       ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-27 17:06 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Mon, 25 Sep 2023 14:57:40 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  passt.1 | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> > 
> > diff --git a/passt.1 b/passt.1
> > index 1ad4276..bcbe6fd 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the
> >  current sending buffer size to guest or target namespace. This might affect
> >  throughput of TCP connections.
> >  
> > +.SS Tuning for high throughput
> > +
> > +On Linux, by default, the maximum memory that can be set for receive and send
> > +socket buffers is 208 KiB. Those limits are set by the
> > +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files,
> > +see \fBsocket\fR(7).
> > +
> > +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers
> > +depending on utilisation even above those limits, such a small limit will  
> 
> "shrink buffers" and "even above those limits" don't seem to quite
> work together.

Oops. I guess I should simply s/shrink/grow/ here.

> > +reflect on the advertised TCP window at the beginning of a connection, and the  
> 
> Hmmm.... while [rw]mem_max might limit that initial window size, I
> wouldn't expect increasing the limits alone to increase that initial
> window size: wouldn't that instead be affected by the TCP default
> buffer size i.e. the middle value in net.ipv4.tcp_rmem?

If we don't use SO_RCVBUF, yes... but we currently do, and with that,
we can get a much larger initial window (as we do now).

On the other hand, maybe, as mentioned in my follow-up about 3/5, we
should drop SO_RCVBUF for TCP sockets.

> > +buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed
> > +these limits anyway.
> > +
> > +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and
> > +\fBSO_SNDBUF\fR will prevent TCP buffers from expanding above the \fIrmem_max\fR and
> > +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP
> > +implementation is then disabled.
> > +
> > +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and
> > +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this
> > +would affect the maximum size of TCP buffers for the whole duration of a
> > +connection.
> > +
> > +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than
> > +three TCP packets at the default MSS. In applications where high throughput is
> > +expected, it is therefore advisable to increase those limits to at least 2 MiB,
> > +or even 16 MiB:
> > +
> > +.nf
> > +	sysctl -w net.core.rmem_max=$((16 << 20))
> > +	sysctl -w net.core.wmem_max=$((16 << 20))
> > +.fi  
> 
> As noted in a previous mail, empirically, this doesn't necessarily
> seem to work better for me.  I'm wondering if we'd be better off never
> touching RCVBUF and SNDBUF for TCP sockets, and letting the kernel do
> its adaptive thing.  We probably still want to expand the buffers as
> much as we can for the Unix socket, though.  And we likely still want
> expanded limits for the tests so that iperf3 can use large buffers.

Right. Let's keep this patch for a later time then, and meanwhile check
if we should drop SO_RCVBUF, SO_SNDBUF, or both, for TCP sockets.

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-27 17:05     ` Stefano Brivio
@ 2023-09-28  1:48       ` David Gibson
  2023-09-29 15:20         ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-28  1:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 5497 bytes --]

On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> On Mon, 25 Sep 2023 13:07:24 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > I think the change itself here is sound, but I have some nits to pick
> > with the description and reasoning.
> > 
> > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > that we ran out of tap-side window space, or that all available
> > > socket data is already in flight -- better names welcome!  
> > 
> > Hmm.. when you put it like that it makes me wonder if those two quite
> > different conditions really need the same handling.  Hrm.. I guess
> > both conditions mean that we can't accept data from the socket, even
> > if it's available.
> 
> Right. I mean, we can also call them differently... or maybe pick a
> name that reflects the outcome/handling instead of what happened.

Sure, if we could think of one.  Except, on second thoughts, I'm not
sure my characterization is correct.  If the tap side window is full
then, indeed, we can't accept data from the socket.  However if
everything we have is in flight that doesn't mean we couldn't put more
data into flight if it arrived.

That consideration, together with the way we use MSG_PEEK, possibly
means that we fundamentally need to use edge-triggered interrupts -
with the additional trickiness that entails - to avoid busy polling.
Although even that only works if we get a new edge interrupt when data
is added to a buffer that's been PEEKed but not TRUNCed.  If that's
not true the MSG_PEEK approach might be doomed :(.
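
To spell out the pattern I mean, in simplified form (not the real passt
code paths, just the shape of it): data is only peeked when we build
frames, and discarded from the socket buffer later, once the guest acks
it, so plain level-triggered EPOLLIN would keep firing on the same
unconsumed bytes:

#include <sys/socket.h>

/* Peek pending data to build frames, without consuming it */
static ssize_t peek_pending(int s, void *buf, size_t len)
{
	return recv(s, buf, len, MSG_PEEK | MSG_DONTWAIT);
}

/* Once the guest acks, discard the acked bytes from the socket buffer:
 * MSG_TRUNC with a NULL buffer drops 'acked' bytes on a TCP socket.
 */
static void consume_acked(int s, size_t acked)
{
	recv(s, NULL, acked, MSG_DONTWAIT | MSG_TRUNC);
}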

> > > ) on any
> > > event: do that only if the first packet in a batch has the ACK flag
> > > set.  
> > 
> > "First packet in a batch" may not be accurate here - we're looking at
> > whichever packet we were up to before calling data_from_tap().  There
> > could have been earlier packets in the receive batch that were already
> > processed.
> 
> Well, it depends on what we call "batch" -- here I meant the pool of
> packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> would be more accurate.

Uh.. I don't think that actually helps.  Remember pools aren't queues.
The point here is that the packet we're considering is not the
first of the batch/pool/whatever, but the first of what's left.

> > This also raises the question of why the first data packet should be
> > particularly privileged here.
> 
> No reason other than convenience, and yes, it can be subtly wrong.
> 
> > I'm wondering if what we really want to
> > check is whether data_from_tap() advanced the ack pointer at all.
> 
> Right, we probably should do that instead.

Ok.

> > I'm not clear on when the th->ack check would ever fail in practice:
> > aren't the only normal packets in a TCP connection without ACK the
> > initial SYN or an RST?  We've handled the SYN case earlier, so should
> > we just have a blanket case above this that if we get a packet with
> > !ACK, we reset the connection?
> 
> One thing that's legitimate (rarely seen, but I've seen it, I don't
> remember if the Linux kernel ever does that) is a segment without ACK,
> and without data, that just updates the window (for example after a
> zero window).
> 
> If the sequence received/processed so far doesn't correspond to the
> latest sequence sent, omitting the ACK flag is useful so that the
> window update is not taken as duplicate ACK (that would trigger
> retransmission).

Ah, ok, I wasn't aware of that case.

> > > Make sure we check for pending socket data when we reset it:
> > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > does, isn't guaranteed to actually trigger a socket event.  
> > 
> > Which sure seems like a kernel bug.  Some weird edge conditions for
> > edge-triggered seems expected, but this doesn't seem like valid
> > level-triggered semantics.
> 
> Hmm, yes, and by doing a quick isolated test actually this seems to work
> as intended in the kernel. I should drop this and try again.
> 
> > Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
> > what's going on here is that we can't take more data from the socket
> > until something happens on the tap side (either the window expands, or
> > it acks some data).  In which case should we be toggling EPOLLIN on
> > the socket instead?  That seems more explicitly to be saying to the
> > socket side "we don't currently care if you have data available".
> 
> The reason was to act on EPOLLRDHUP at the same time. But well, we
> could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> more sense.

So specifically to mask EPOLLRDHUP as well? On the grounds that if
we're still chewing on what we got already we don't yet care that
we've reached the end, yes?  So yes, explicitly masking both those
makes more sense to me.. except that as above, I suspect we can't have
level-triggered + MSG_PEEK + no busy polling all at once.

> For the time being, we should probably try to see what happens
> if this patch simply moves conn_flag(c, conn, ~STALLED); to where 3/5
> needs it (or directly incorporate that into 3/5).

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
  2023-09-27 17:05       ` Stefano Brivio
@ 2023-09-28  1:51         ` David Gibson
  0 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-28  1:51 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 5132 bytes --]

On Wed, Sep 27, 2023 at 07:05:50PM +0200, Stefano Brivio wrote:
> On Mon, 25 Sep 2023 14:21:47 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:
> > > On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:  
> > > > It looks like we need it as workaround for this situation, readily
> > > > reproducible at least with a 6.5 Linux kernel, with default rmem_max
> > > > and wmem_max values:
> > > > 
> > > > - an iperf3 client on the host sends about 160 KiB, typically
> > > >   segmented into five frames by passt. We read this data using
> > > >   MSG_PEEK
> > > > 
> > > > - the iperf3 server on the guest starts receiving
> > > > 
> > > > - meanwhile, the host kernel advertised a zero-sized window to the
> > > >   receiver, as expected
> > > > 
> > > > - eventually, the guest acknowledges all the data sent so far, and
> > > >   we drop it from the buffer, courtesy of tcp_sock_consume(), using
> > > >   recv() with MSG_TRUNC
> > > > 
> > > > - the client, however, doesn't get an updated window value, and
> > > >   even keepalive packets are answered with zero-window segments,
> > > >   until the connection is closed
> > > > 
> > > > It looks like dropping data from a socket using MSG_TRUNC doesn't
> > > > cause a recalculation of the window, which would be expected as a
> > > > result of any receiving operation that invalidates data on a buffer
> > > > (that is, not with MSG_PEEK).
> > > > 
> > > > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> > > > the previous value we clamped to, forces a recalculation of the
> > > > window which is advertised to the guest.
> > > > 
> > > > I couldn't quite confirm this issue by following all the possible
> > > > code paths in the kernel, yet. If confirmed, this should be fixed in
> > > > the kernel, but meanwhile this workaround looks robust to me (and it
> > > > will be needed for backward compatibility anyway).  
> > > 
> > > So, I tested this, and things got a bit complicated.
> > > 
> > > First, I reproduced the "read side" problem by setting
> > > net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB.
> > > The "160kiB" stall happened almost every time.  Applying this patch
> > > appears to fix it completely, getting GiB/s throughput consistently.
> > > So, yah.
> > > 
> > > Then I tried reproducing it differently: by setting both
> > > net.core.rmem_max and net.core.wmem_max to 16MiB, but setting
> > > SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which
> > > actually results in a 256kiB buffer, because of the kernel's weird
> > > interpretation).
> > > 
> > > With the SO_RCVBUF clamp and without this patch, I don't get the
> > > 160kiB stall consistently any more.  What I *do* get nearly every
> > > time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps.
> > > Sometimes it stalls after several seconds. The stall is slightly
> > > different from the 160kiB stall though: the 160kiB stall seems to be 0 bytes
> > > transferred on both sides.  With the RCVBUF stall I get a trickle of
> > > bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes
> > > per interval on the sender but occasionally an interval with several
> > > hundred KB.
> > > 
> > > That is, it seems like there's a buffer somewhere that's very slowly
> > > draining into the receiver, then getting topped up in an instant once
> > > it gets low enough.
> > > 
> > > When I have both this patch and the RCVBUF clamp, I don't seem to be
> > > able to reproduce the trickle-stall anymore, but I still get the slow
> > > transfer speeds most, but not every time.  Sometimes, but only rarely,
> > > I do seem to still get a complete stall (0 bytes on both sides).  
> > 
> > I noted another oddity.  With this patch, _no_ RCVBUF clamp and 16MiB
> > wmem_max fixed, things seem to behave much better with a small
> > rmem_max than large.  With rmem_max=256KiB I get pretty consistent
> > 37Gbps throughput and iperf3 -c reports 0 retransmits.
> > 
> > With rmem_max=16MiB, the throughput fluctuates from second to second
> > between ~3Gbps and ~30Gbps.  The client reports retransmits in some
> > intervals, which is pretty weird over lo.
> > 
> > Urgh... so many variables.
> 
> This is probably due to the receive buffer getting bigger than
> TCP_FRAMES_MEM * MSS4 (or MSS6), so the amount of data we can read in
> one shot from the sockets isn't optimally sized anymore.

Hm, ok.  Not really sure why that would cause such nasty behaviour.

> We should have a look at the difference between not clamping at all
> (and if that yields the same throughput, great), and clamping to, I
> guess, TCP_FRAMES_MEM * MIN(MSS4, MSS6).

Right.  We currently ask for the largest RCVBUF we can get, which
might not really be what we want.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-27 17:06     ` Stefano Brivio
@ 2023-09-28  1:58       ` David Gibson
  2023-09-29 15:19         ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-09-28  1:58 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 10506 bytes --]

On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:
> On Mon, 25 Sep 2023 14:47:52 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:
> > > ...so that we'll retry sending them, instead of more-or-less silently
> > > dropping them. This happens quite frequently if our sending buffer on
> > > the UNIX domain socket is heavily constrained (for instance, by the
> > > 208 KiB default memory limit).
> > > 
> > > It might be argued that dropping frames is part of the expected TCP
> > > flow: we don't dequeue those from the socket anyway, so we'll
> > > eventually retransmit them.
> > > 
> > > But we don't need the receiver to tell us (by way of duplicate or
> > > missing ACKs) that we couldn't send them: we already know, as
> > > sendmsg() reports that. This seems to considerably increase
> > > throughput stability and throughput itself for TCP connections with
> > > default wmem_max values.
> > > 
> > > Unfortunately, the 16 bits left as padding in the frame descriptors  
> > 
> > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> > yes?
> 
> Right, that.
> 
> > For AVX2 we have substantially more space here.  Couldn't we put
> > a conn (or seq) pointer in here at the cost of a few bytes MSS for
> > non-AVX2 and zero cost for AVX2 (which is probably the majority case)?
> 
> Yes, true. On the other hand, having this parallel array only affects
> readability I guess, whereas inserting pointers and lengths in
> tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
> non-AVX2 x86, but also on other architectures). So I'd rather stick to
> this.

Yeah, I guess so.

Actually.. I did just think of one other option.  It avoids both any
extra padding and a parallel array, but at the cost of additional work
when frames are dropped.  We could use those 16 bits of padding to
store the TCP payload length.  Then when we don't manage to send all
our frames, we do another loop through and add up how many stream
bytes we actually sent to update the seq pointer.
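
Roughly like this, I mean (sketch only, with a dummy frame struct
standing in for tcp[46]_l2_buf_t, and glossing over which connection
each queued frame belongs to):

#include <stddef.h>
#include <stdint.h>

/* Sketch: store the L4 payload length in the padding of each queued
 * frame, then, after a short send of only 'sent' frames, walk the ones
 * that went out and add up the stream bytes to advance the counter.
 */
struct frame_sketch {
	uint16_t pad_plen;	/* TCP payload length kept in the padding */
	/* ... headers and payload ... */
};

static uint32_t stream_bytes_sent(const struct frame_sketch *f, size_t sent)
{
	uint32_t bytes = 0;
	size_t i;

	for (i = 0; i < sent; i++)
		bytes += f[i].pad_plen;

	return bytes;
}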

> > > we use internally aren't enough to uniquely identify for which
> > > connection we should update sequence numbers: create a parallel
> > > array of pointers to sequence numbers and L4 lengths, of
> > > TCP_FRAMES_MEM size, and go through it after calling sendmsg().
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  tap.c | 10 +++++++---
> > >  tap.h |  2 +-
> > >  tcp.c | 43 ++++++++++++++++++++++++++++++++++++-------
> > >  3 files changed, 44 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/tap.c b/tap.c
> > > index 93db989..b30ff81 100644
> > > --- a/tap.c
> > > +++ b/tap.c
> > > @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > >   * @c:		Execution context
> > >   * @iov:	Array of buffers, each containing one frame (with L2 headers)
> > >   * @n:		Number of buffers/frames in @iov
> > > + *
> > > + * Return: number of frames actually sent
> > >   */
> > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > >  {
> > >  	size_t m;
> > >  
> > >  	if (!n)
> > > -		return;
> > > +		return 0;
> > >  
> > >  	if (c->mode == MODE_PASST)
> > >  		m = tap_send_frames_passt(c, iov, n);
> > > @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > >  		m = tap_send_frames_pasta(c, iov, n);
> > >  
> > >  	if (m < n)
> > > -		debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
> > > +		debug("tap: failed to send %lu frames of %lu", n - m, n);
> > >  
> > >  	pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> > > +
> > > +	return m;
> > >  }
> > >  
> > >  /**
> > > diff --git a/tap.h b/tap.h
> > > index 021fb7c..952fafc 100644
> > > --- a/tap.h
> > > +++ b/tap.h
> > > @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
> > >  		    const struct in6_addr *src, const struct in6_addr *dst,
> > >  		    void *in, size_t len);
> > >  int tap_send(const struct ctx *c, const void *data, size_t len);
> > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> > >  void tap_update_mac(struct tap_hdr *taph,
> > >  		    const unsigned char *eth_d, const unsigned char *eth_s);
> > >  void tap_listen_handler(struct ctx *c, uint32_t events);
> > > diff --git a/tcp.c b/tcp.c
> > > index 4606f17..76b7b8d 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -434,6 +434,16 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
> > >   */
> > >  static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> > >  
> > > +/**
> > > + * tcp_buf_seq_update - Sequences to update with length of frames once sent
> > > + * @seq:	Pointer to sequence number sent to tap-side, to be updated
> > > + * @len:	TCP payload length
> > > + */
> > > +struct tcp_buf_seq_update {
> > > +	uint32_t *seq;
> > > +	uint16_t len;
> > > +};
> > > +
> > >  /* Static buffers */
> > >  
> > >  /**
> > > @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
> > >  #endif
> > >  tcp4_l2_buf[TCP_FRAMES_MEM];
> > >  
> > > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> > > +
> > >  static unsigned int tcp4_l2_buf_used;
> > >  
> > >  /**
> > > @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
> > >  #endif
> > >  tcp6_l2_buf[TCP_FRAMES_MEM];
> > >  
> > > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> > > +
> > >  static unsigned int tcp6_l2_buf_used;
> > >  
> > >  /* recvmsg()/sendmsg() data for tap */
> > > @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
> > >   */
> > >  static void tcp_l2_data_buf_flush(struct ctx *c)
> > >  {
> > > -	tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > > +	unsigned i;
> > > +	size_t m;
> > > +
> > > +	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > > +	for (i = 0; i < m; i++)
> > > +		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> > >  	tcp6_l2_buf_used = 0;
> > >  
> > > -	tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > > +	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > > +	for (i = 0; i < m; i++)
> > > +		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> > >  	tcp4_l2_buf_used = 0;
> > >  }
> > >  
> > > @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
> > >   * @plen:	Payload length at L4
> > >   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
> > >   * @seq:	Sequence number to be sent
> > > - * @now:	Current timestamp
> > > + * @seq_update:	Pointer to sequence number to update on successful send
> > >   */
> > >  static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > > -			    ssize_t plen, int no_csum, uint32_t seq)
> > > +			    ssize_t plen, int no_csum, uint32_t seq,
> > > +			    uint32_t *seq_update)  
> > 
> > seq_update is always &conn->seq_to_tap, so there's no need for an
> > additional parameter.
> 
> Oh, right, I'll drop that.
> 
> > >  {
> > >  	struct iovec *iov;
> > >  
> > > @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > >  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> > >  		uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
> > >  
> > > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> > > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> > > +
> > >  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> > >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > >  						       check, seq);
> > > @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > >  	} else if (CONN_V6(conn)) {
> > >  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> > >  
> > > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> > > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> > > +
> > >  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> > >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > >  						       NULL, seq);
> > > @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > >  	int s = conn->sock, i, ret = 0;
> > >  	struct msghdr mh_sock = { 0 };
> > >  	uint16_t mss = MSS_GET(conn);
> > > -	uint32_t already_sent;
> > > +	uint32_t already_sent, seq;
> > >  	struct iovec *iov;
> > >  
> > >  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > >  
> > >  	/* Finally, queue to tap */
> > >  	plen = mss;
> > > +	seq = conn->seq_to_tap;  
> > 
> > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > called between tcp_data_from_sock() calls for the same socket.  That
> > should be true for the normal course of things.  However, couldn't it
> > happen that we get a normal socket EPOLLIN event for a particular
> > connection - calling tcp_data_from_sock() - but in the same epoll()
> > round we also get a tap ack for the same connection which causes
> > another call to tcp_data_from_sock() (with the change from patch
> > 2/5).  IIRC those would both happen before the deferred handling and
> > therefore the data_buf_flush().
> 
> Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
> but with that change, it's not. Unless we drop that change from 2/5.

Even if we drop the change, it's a worryingly subtle constraint.

> > Not sure how to deal with that short of separate 'seq_queued' and
> > 'seq_sent' counters in the connection structure, which is a bit
> > unfortunate.
> 
> I wonder how bad it is if we call tcp_l2_data_buf_flush()
> unconditionally before calling tcp_data_from_sock() from
> tcp_tap_handler(). But again, maybe this is not needed at all, we
> should check that epoll detail from 2/5 first...
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput
  2023-09-27 17:06     ` Stefano Brivio
@ 2023-09-28  2:02       ` David Gibson
  0 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-09-28  2:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

On Wed, Sep 27, 2023 at 07:06:16PM +0200, Stefano Brivio wrote:
> On Mon, 25 Sep 2023 14:57:40 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sat, Sep 23, 2023 at 12:06:10AM +0200, Stefano Brivio wrote:
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  passt.1 | 33 +++++++++++++++++++++++++++++++++
> > >  1 file changed, 33 insertions(+)
> > > 
> > > diff --git a/passt.1 b/passt.1
> > > index 1ad4276..bcbe6fd 100644
> > > --- a/passt.1
> > > +++ b/passt.1
> > > @@ -926,6 +926,39 @@ If the sending window cannot be queried, it will always be announced as the
> > >  current sending buffer size to guest or target namespace. This might affect
> > >  throughput of TCP connections.
> > >  
> > > +.SS Tuning for high throughput
> > > +
> > > +On Linux, by default, the maximum memory that can be set for receive and send
> > > +socket buffers is 208 KiB. Those limits are set by the
> > > +\fI/proc/sys/net/core/rmem_max\fR and \fI/proc/sys/net/core/wmem_max\fR files,
> > > +see \fBsocket\fR(7).
> > > +
> > > +As of Linux 6.5, while the TCP implementation can dynamically shrink buffers
> > > +depending on utilisation even above those limits, such a small limit will  
> > 
> > "shrink buffers" and "even above those limits" don't seem to quite
> > work together.
> 
> Oops. I guess I should simply s/shrink/grow/ here.

Or "resize" would work too.

> > > +reflect on the advertised TCP window at the beginning of a connection, and the  
> > 
> > Hmmm.... while [rw]mem_max might limit that initial window size, I
> > wouldn't expect increasing the limits alone to increase that initial
> > window size: wouldn't that instead be affected by the TCP default
> > buffer size i.e. the middle value in net.ipv4.tcp_rmem?
> 
> If we don't use SO_RCVBUF, yes... but we currently do, and with that,
> we can get a much larger initial window (as we do now).

Good point.

> On the other hand, maybe, as mentioned in my follow-up about 3/5, we
> should drop SO_RCVBUF for TCP sockets.

Ok.

> > > +buffer size of the UNIX domain socket buffer used by \fBpasst\fR cannot exceed
> > > +these limits anyway.
> > > +
> > > +Further, as of Linux 6.5, using socket options \fBSO_RCVBUF\fR and
> > > +\fBSO_SNDBUF\fR will prevent TCP buffers to expand above the \fIrmem_max\fR and
> > > +\fIwmem_max\fR limits because the automatic adjustment provided by the TCP
> > > +implementation is then disabled.
> > > +
> > > +As a consequence, \fBpasst\fR and \fBpasta\fR probe these limits at start-up and
> > > +will not set TCP socket buffer sizes if they are lower than 2 MiB, because this
> > > +would affect the maximum size of TCP buffers for the whole duration of a
> > > +connection.
> > > +
> > > +Note that 208 KiB is, accounting for kernel overhead, enough to fit less than
> > > +three TCP packets at the default MSS. In applications where high throughput is
> > > +expected, it is therefore advisable to increase those limits to at least 2 MiB,
> > > +or even 16 MiB:
> > > +
> > > +.nf
> > > +	sysctl -w net.core.rmem_max=$((16 << 20))
> > > +	sysctl -w net.core.wmem_max=$((16 << 20))
> > > +.fi  
> > 
> > As noted in a previous mail, empirically, this doesn't necessarily
> > seem to work better for me.  I'm wondering if we'd be better off never
> > touching RCFBUF and SNDBUF for TCP sockets, and letting the kernel do
> > its adaptive thing.  We probably still want to expand the buffers as
> > much as we can for the Unix socket, though.  And we likely still want
> > expanded limits for the tests so that iperf3 can use large buffers
> 
> Right. Let's keep this patch for a later time then, and meanwhile check
> if we should drop SO_RCVBUF, SO_SNDBUF, or both, for TCP sockets.

Makes sense to me.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-28  1:58       ` David Gibson
@ 2023-09-29 15:19         ` Stefano Brivio
  2023-10-03  3:22           ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:19 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Thu, 28 Sep 2023 11:58:45 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:
> > On Mon, 25 Sep 2023 14:47:52 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:  
> > > > ...so that we'll retry sending them, instead of more-or-less silently
> > > > dropping them. This happens quite frequently if our sending buffer on
> > > > the UNIX domain socket is heavily constrained (for instance, by the
> > > > 208 KiB default memory limit).
> > > > 
> > > > It might be argued that dropping frames is part of the expected TCP
> > > > flow: we don't dequeue those from the socket anyway, so we'll
> > > > eventually retransmit them.
> > > > 
> > > > But we don't need the receiver to tell us (by the way of duplicate or
> > > > missing ACKs) that we couldn't send them: we already know as
> > > > sendmsg() reports that. This seems to considerably increase
> > > > throughput stability and throughput itself for TCP connections with
> > > > default wmem_max values.
> > > > 
> > > > Unfortunately, the 16 bits left as padding in the frame descriptors    
> > > 
> > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> > > yes?  
> > 
> > Right, that.
> >   
> > > For AVX2 we have substantially more space here.  Couldn't we put
> > > a conn (or seq) pointer in here at the cost of a few bytes MSS for
> > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)?  
> > 
> > Yes, true. On the other hand, having this parallel array only affects
> > readability I guess, whereas inserting pointers and lengths in
> > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
> > non-AVX2 x86, but also on other architectures). So I'd rather stick to
> > this.  
> 
> Yeah, I guess so.
> 
> Actually.. I did just think of one other option.  It avoids both any
> extra padding and a parallel array, but at the cost of additional work
> when frames are dropped.  We could use those 16 bits of padding to
> store the TCP payload length.  Then when we don't manage to send all
> our frames, we do another loop through and add up how many stream
> bytes we actually sent to update the seq pointer.

Hmm, yes. It's slightly more memory efficient, but the complexity seems
a bit overkill to me.

> > > > we use internally aren't enough to uniquely identify for which
> > > > connection we should update sequence numbers: create a parallel
> > > > array of pointers to sequence numbers and L4 lengths, of
> > > > TCP_FRAMES_MEM size, and go through it after calling sendmsg().
> > > > 
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > >  tap.c | 10 +++++++---
> > > >  tap.h |  2 +-
> > > >  tcp.c | 43 ++++++++++++++++++++++++++++++++++++-------
> > > >  3 files changed, 44 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/tap.c b/tap.c
> > > > index 93db989..b30ff81 100644
> > > > --- a/tap.c
> > > > +++ b/tap.c
> > > > @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > > >   * @c:		Execution context
> > > >   * @iov:	Array of buffers, each containing one frame (with L2 headers)
> > > >   * @n:		Number of buffers/frames in @iov
> > > > + *
> > > > + * Return: number of frames actually sent
> > > >   */
> > > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > > >  {
> > > >  	size_t m;
> > > >  
> > > >  	if (!n)
> > > > -		return;
> > > > +		return 0;
> > > >  
> > > >  	if (c->mode == MODE_PASST)
> > > >  		m = tap_send_frames_passt(c, iov, n);
> > > > @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> > > >  		m = tap_send_frames_pasta(c, iov, n);
> > > >  
> > > >  	if (m < n)
> > > > -		debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
> > > > +		debug("tap: failed to send %lu frames of %lu", n - m, n);
> > > >  
> > > >  	pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> > > > +
> > > > +	return m;
> > > >  }
> > > >  
> > > >  /**
> > > > diff --git a/tap.h b/tap.h
> > > > index 021fb7c..952fafc 100644
> > > > --- a/tap.h
> > > > +++ b/tap.h
> > > > @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
> > > >  		    const struct in6_addr *src, const struct in6_addr *dst,
> > > >  		    void *in, size_t len);
> > > >  int tap_send(const struct ctx *c, const void *data, size_t len);
> > > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> > > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> > > >  void tap_update_mac(struct tap_hdr *taph,
> > > >  		    const unsigned char *eth_d, const unsigned char *eth_s);
> > > >  void tap_listen_handler(struct ctx *c, uint32_t events);
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 4606f17..76b7b8d 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -434,6 +434,16 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
> > > >   */
> > > >  static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> > > >  
> > > > +/**
> > > > + * tcp_buf_seq_update - Sequences to update with length of frames once sent
> > > > + * @seq:	Pointer to sequence number sent to tap-side, to be updated
> > > > + * @len:	TCP payload length
> > > > + */
> > > > +struct tcp_buf_seq_update {
> > > > +	uint32_t *seq;
> > > > +	uint16_t len;
> > > > +};
> > > > +
> > > >  /* Static buffers */
> > > >  
> > > >  /**
> > > > @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
> > > >  #endif
> > > >  tcp4_l2_buf[TCP_FRAMES_MEM];
> > > >  
> > > > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> > > > +
> > > >  static unsigned int tcp4_l2_buf_used;
> > > >  
> > > >  /**
> > > > @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
> > > >  #endif
> > > >  tcp6_l2_buf[TCP_FRAMES_MEM];
> > > >  
> > > > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> > > > +
> > > >  static unsigned int tcp6_l2_buf_used;
> > > >  
> > > >  /* recvmsg()/sendmsg() data for tap */
> > > > @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
> > > >   */
> > > >  static void tcp_l2_data_buf_flush(struct ctx *c)
> > > >  {
> > > > -	tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > > > +	unsigned i;
> > > > +	size_t m;
> > > > +
> > > > +	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> > > > +	for (i = 0; i < m; i++)
> > > > +		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> > > >  	tcp6_l2_buf_used = 0;
> > > >  
> > > > -	tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > > > +	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> > > > +	for (i = 0; i < m; i++)
> > > > +		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> > > >  	tcp4_l2_buf_used = 0;
> > > >  }
> > > >  
> > > > @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
> > > >   * @plen:	Payload length at L4
> > > >   * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
> > > >   * @seq:	Sequence number to be sent
> > > > - * @now:	Current timestamp
> > > > + * @seq_update:	Pointer to sequence number to update on successful send
> > > >   */
> > > >  static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > > > -			    ssize_t plen, int no_csum, uint32_t seq)
> > > > +			    ssize_t plen, int no_csum, uint32_t seq,
> > > > +			    uint32_t *seq_update)    
> > > 
> > > seq_update is always &conn->seq_to_tap, so there's no need for an
> > > additional parameter.  
> > 
> > Oh, right, I'll drop that.
> >   
> > > >  {
> > > >  	struct iovec *iov;
> > > >  
> > > > @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > > >  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> > > >  		uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
> > > >  
> > > > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> > > > +		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> > > > +
> > > >  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> > > >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > > >  						       check, seq);
> > > > @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> > > >  	} else if (CONN_V6(conn)) {
> > > >  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> > > >  
> > > > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> > > > +		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> > > > +
> > > >  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> > > >  		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > > >  						       NULL, seq);
> > > > @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > >  	int s = conn->sock, i, ret = 0;
> > > >  	struct msghdr mh_sock = { 0 };
> > > >  	uint16_t mss = MSS_GET(conn);
> > > > -	uint32_t already_sent;
> > > > +	uint32_t already_sent, seq;
> > > >  	struct iovec *iov;
> > > >  
> > > >  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> > > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > >  
> > > >  	/* Finally, queue to tap */
> > > >  	plen = mss;
> > > > +	seq = conn->seq_to_tap;    
> > > 
> > > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > > called between tcp_data_from_sock() calls for the same socket.  That
> > > should be true for the normal course of things.  However, couldn't it
> > > happen that we get a normal socket EPOLLIN event for a particular
> > > connection - calling tcp_data_from_sock() - but in the same epoll()
> > > round we also get a tap ack for the same connection which causes
> > > another call to tcp_data_from_sock() (with the change from patch
> > > 2/5).  IIRC those would both happen before the deferred handling and
> > > therefore the data_buf_flush().  
> > 
> > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
> > but with that change, it's not. Unless we drop that change from 2/5.  
> 
> Even if we drop the change, it's a worryingly subtle constraint.

Another option to avoid this...

> > > Not sure how to deal with that short of separate 'seq_queued' and
> > > 'seq_sent' counters in the connection structure, which is a bit
> > > unfortunate.  
> > 
> > I wonder how bad it is if we call tcp_l2_data_buf_flush()
> > unconditionally before calling tcp_data_from_sock() from
> > tcp_tap_handler(). But again, maybe this is not needed at all, we
> > should check that epoll detail from 2/5 first...

other than this one, would be to use that external table to update
sequence numbers *in the frames* as we send stuff out.

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-28  1:48       ` David Gibson
@ 2023-09-29 15:20         ` Stefano Brivio
  2023-10-03  3:20           ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:20 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Thu, 28 Sep 2023 11:48:38 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> > On Mon, 25 Sep 2023 13:07:24 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > I think the change itself here is sound, but I have some nits to pick
> > > with the description and reasoning.
> > > 
> > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:  
> > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > > that we ran out of tap-side window space, or that all available
> > > > socket data is already in flight -- better names welcome!    
> > > 
> > > Hmm.. when you put it like that it makes me wonder if those two quite
> > > different conditions really need the same handling.  Hrm.. I guess
> > > both conditions mean that we can't accept data from the socket, even
> > > if it's available.
> > 
> > Right. I mean, we can also call them differently... or maybe pick a
> > name that reflects the outcome/handling instead of what happened.  
> 
> Sure, if we could think of one.  Except, on second thoughts, I'm not
> sure my characterization is correct.  If the tap side window is full
> then, indeed, we can't accept data from the socket.  However if
> everything we have is in flight that doesn't mean we couldn't put more
> data into flight if it arrived.

Right, but that's why we set EPOLLET...

> That consideration, together with the way we use MSG_PEEK possibly
> means that we fundamentally need to use edge-triggered interrupts -
> with the additional trickiness that entails - to avoid busy polling.
> Although even that only works if we get a new edge interrupt when data
> is added to a buffer that's been PEEKed but not TRUNCed.  If that's
> not true the MSG_PEEK approach might be doomed :(.

without EPOLLONESHOT, which wouldn't have this behaviour. From epoll(7):

       Since even with edge-triggered epoll, multiple events can be  generated
       upon  receipt  of multiple chunks of data, the caller has the option to
       specify the EPOLLONESHOT flag [...]

so yes, in general, when the socket has more data, we'll get another
event. I didn't test this in an isolated case, perhaps we should, but
from my memory it always worked.

On the other hand, we could actually use EPOLLONESHOT in the other
case, as an optimisation, when we're waiting for an ACK from the tap
side.

> > > > ) on any
> > > > event: do that only if the first packet in a batch has the ACK flag
> > > > set.    
> > > 
> > > "First packet in a batch" may not be accurate here - we're looking at
> > > whichever packet we were up to before calling data_from_tap().  There
> > > could have been earlier packets in the receive batch that were already
> > > processed.  
> > 
> > Well, it depends on what we call "batch" -- here I meant the pool of
> > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> > would be more accurate.  
> 
> Uh.. I don't think that actually helps.  Remember pools aren't queues.
> The point here is that the packet we're considering is not the
> first of the batch/pool/whatever, but the first of what's left.

Right, yes, I actually meant the sub-pool starting from the index (now)
given by the caller.

> > > This also raises the question of why the first data packet should be
> > > particularly privileged here.  
> > 
> > No reason other than convenience, and yes, it can be subtly wrong.
> >   
> > > I'm wondering if what we really want to
> > > check is whether data_from_tap() advanced the ack pointer at all.  
> > 
> > Right, we probably should do that instead.  
> 
> Ok.
> 
> > > I'm not clear on when the th->ack check would ever fail in practice:
> > > aren't the only normal packets in a TCP connection without ACK the
> > > initial SYN or an RST?  We've handled the SYN case earlier, so should
> > > we just have a blanket case above this that if we get a packet with
> > > !ACK, we reset the connection?  
> > 
> > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > remember if the Linux kernel ever does that) is a segment without ACK,
> > and without data, that just updates the window (for example after a
> > zero window).
> > 
> > If the sequence received/processed so far doesn't correspond to the
> > latest sequence sent, omitting the ACK flag is useful so that the
> > window update is not taken as duplicate ACK (that would trigger
> > retransmission).  
> 
> Ah, ok, I wasn't aware of that case.

On a second thought, in that case, we just got a window update, so it's
very reasonable to actually check again if we can send more. Hence the
check on th->ack is bogus anyway.
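
For illustration, a rough sketch of checking for progress instead of
th->ack (the call shape and the use of wnd_from_tap are guesses made up
for the example, not the actual tcp_tap_handler() code) could be:

/* Sketch only: clear STALLED on actual forward progress -- an ACK or a
 * window update from the tap side -- rather than on the ACK flag of
 * one particular segment.
 */
uint32_t ack_before = conn->seq_ack_from_tap;
uint32_t wnd_before = conn->wnd_from_tap;

tcp_data_from_tap(c, conn, p, idx);	/* hypothetical call shape */

if (conn->seq_ack_from_tap != ack_before ||
    conn->wnd_from_tap != wnd_before) {
	/* The peer freed up space: worth another look at the socket */
	conn_flag(c, conn, ~STALLED);
	tcp_data_from_sock(c, conn);
}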

> > > > Make sure we check for pending socket data when we reset it:
> > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > > does, isn't guaranteed to actually trigger a socket event.    
> > > 
> > > Which sure seems like a kernel bug.  Some weird edge conditions for
> > > edge-triggered seems expected, but this doesn't seem like valid
> > > level-triggered semantics.  
> > 
> > Hmm, yes, and by doing a quick isolated test actually this seems to work
> > as intended in the kernel. I should drop this and try again.
> >   
> > > Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
> > > what's going on here is that we can't take more data from the socket
> > > until something happens on the tap side (either the window expands, or
> > > it acks some data).  In which case should we be toggling EPOLLIN on
> > > the socket instead?  That seems more explicitly to be saying to the
> > > socket side "we don't currently care if you have data available".  
> > 
> > The reason was to act on EPOLLRDHUP at the same time. But well, we
> > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> > more sense.  
> 
> So specifically to mask EPOLLRDHUP as well? On the grounds that if
> we're still chewing on what we got already we don't yet care that
> we've reached the end, yes?

Right.

> So yes, explicitly masking both those
> makes more sense to me.. except that as above, I suspect we can't have
> level-triggered + MSG_PEEK + no busy polling all at once.

Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
events on new data. I think EPOLLET is really what we need here, at
least for the case where we are not necessarily waiting for an ACK.

For the other case (window full), we can either mask EPOLLIN |
EPOLLRDHUP or set EPOLLONESHOT (possibly slightly more complicated
because we need to re-add the file descriptor).
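
As a rough sketch of that second case (helper names made up for the
example, and a plain .data.fd instead of the epoll reference passt
actually stores in the event), the masking could look like:

/* Sketch only: when the tap-side window is full, stop listening for
 * socket input; once an ACK or window update re-opens the window,
 * listen again and process whatever piled up in the meantime.
 */
static void conn_sock_mask_input(int epollfd, int s)
{
	struct epoll_event ev = { .events = 0, .data.fd = s };

	/* EPOLLHUP and EPOLLERR are reported anyway */
	epoll_ctl(epollfd, EPOLL_CTL_MOD, s, &ev);
}

static void conn_sock_unmask_input(int epollfd, int s)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP,
				  .data.fd = s };

	epoll_ctl(epollfd, EPOLL_CTL_MOD, s, &ev);
	/* ...then recheck the socket for data received while masked */
}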

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-09-29 15:20         ` Stefano Brivio
@ 2023-10-03  3:20           ` David Gibson
  2023-10-05  6:18             ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-10-03  3:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

On Fri, Sep 29, 2023 at 05:20:15PM +0200, Stefano Brivio wrote:
> On Thu, 28 Sep 2023 11:48:38 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> > > On Mon, 25 Sep 2023 13:07:24 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > I think the change itself here is sound, but I have some nits to pick
> > > > with the description and reasoning.
> > > > 
> > > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:  
> > > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > > > that we ran out of tap-side window space, or that all available
> > > > > socket data is already in flight -- better names welcome!    
> > > > 
> > > > Hmm.. when you put it like that it makes me wonder if those two quite
> > > > different conditions really need the same handling.  Hrm.. I guess
> > > > both conditions mean that we can't accept data from the socket, even
> > > > if it's available.
> > > 
> > > Right. I mean, we can also call them differently... or maybe pick a
> > > name that reflects the outcome/handling instead of what happened.  
> > 
> > Sure, if we could think of one.  Except, on second thoughts, I'm not
> > sure my characterization is correct.  If the tap side window is full
> > then, indeed, we can't accept data from the socket.  However if
> > everything we have is in flight that doesn't mean we couldn't put more
> > data into flight if it arrived.
> 
> Right, but that's why we set EPOLLET...
> 
> > That consideration, together with the way we use MSG_PEEK possibly
> > means that we fundamentally need to use edge-triggered interrupts -
> > with the additional trickiness that entails - to avoid busy polling.
> > Although even that only works if we get a new edge interrupt when data
> > is added to a buffer that's been PEEKed but not TRUNCed.  If that's
> > not true the MSG_PEEK approach might be doomed :(.
> 
> without EPOLLONESHOT, which wouldn't have this behaviour. From epoll(7):
> 
>        Since even with edge-triggered epoll, multiple events can be  generated
>        upon  receipt  of multiple chunks of data, the caller has the option to
>        specify the EPOLLONESHOT flag [...]
> 
> so yes, in general, when the socket has more data, we'll get another
> event. I didn't test this in an isolated case, perhaps we should, but
> from my memory it always worked.

Ok.  That text does seem to suggest it works that way, although it's
not entirely clear that it must always give new events.

> On the other hand, we could actually use EPOLLONESHOT in the other
> case, as an optimisation, when we're waiting for an ACK from the tap
> side.

Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
By the time we know the receiver's window has been filled, we're
already processing the last event that we'll be able to until the
window opens again.  Setting EPOLLONESHOT would be requesting one more
event.

> > > > > ) on any
> > > > > event: do that only if the first packet in a batch has the ACK flag
> > > > > set.    
> > > > 
> > > > "First packet in a batch" may not be accurate here - we're looking at
> > > > whichever packet we were up to before calling data_from_tap().  There
> > > > could have been earlier packets in the receive batch that were already
> > > > processed.  
> > > 
> > > Well, it depends on what we call "batch" -- here I meant the pool of
> > > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> > > would be more accurate.  
> > 
> > Uh.. I don't think that actually helps.  Remember pools aren't queues.
> > The point here is that the packet we're considering is not the
> > first of the batch/pool/whatever, but the first of what's left.
> 
> Right, yes, I actually meant the sub-pool starting from the index (now)
> given by the caller.
> 
> > > > This also raises the question of why the first data packet should be
> > > > particularly privileged here.  
> > > 
> > > No reason other than convenience, and yes, it can be subtly wrong.
> > >   
> > > > I'm wondering if what we really want to
> > > > check is whether data_from_tap() advanced the ack pointer at all.  
> > > 
> > > Right, we probably should do that instead.  
> > 
> > Ok.
> > 
> > > > I'm not clear on when the th->ack check would ever fail in practice:
> > > > aren't the only normal packets in a TCP connection without ACK the
> > > > initial SYN or an RST?  We've handled the SYN case earlier, so should
> > > > we just have a blanket case above this that if we get a packet with
> > > > !ACK, we reset the connection?  
> > > 
> > > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > > remember if the Linux kernel ever does that) is a segment without ACK,
> > > and without data, that just updates the window (for example after a
> > > zero window).
> > > 
> > > If the sequence received/processed so far doesn't correspond to the
> > > latest sequence sent, omitting the ACK flag is useful so that the
> > > window update is not taken as duplicate ACK (that would trigger
> > > retransmission).  
> > 
> > Ah, ok, I wasn't aware of that case.
> 
> On a second thought, in that case, we just got a window update, so it's
> very reasonable to actually check again if we can send more. Hence the
> check on th->ack is bogus anyway.
> 
> > > > > Make sure we check for pending socket data when we reset it:
> > > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > > > does, isn't guaranteed to actually trigger a socket event.    
> > > > 
> > > > Which sure seems like a kernel bug.  Some weird edge conditions for
> > > > edge-triggered seems expected, but this doesn't seem like valid
> > > > level-triggered semantics.  
> > > 
> > > Hmm, yes, and by doing a quick isolated test actually this seems to work
> > > as intended in the kernel. I should drop this and try again.
> > >   
> > > > Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
> > > > what's going on here is that we can't take more data from the socket
> > > > until something happens on the tap side (either the window expands, or
> > > > it acks some data).  In which case should we be toggling EPOLLIN on
> > > > the socket instead?  That seems more explicitly to be saying to the
> > > > socket side "we don't currently care if you have data available".  
> > > 
> > > The reason was to act on EPOLLRDHUP at the same time. But well, we
> > > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> > > more sense.  
> > 
> > So specifically to mask EPOLLRDHUP as well? On the grounds that if
> > we're still chewing on what we got already we don't yet care that
> > we've reached the end, yes?
> 
> Right.

Ok.

> > So yes, explicitly masking both those
> > makes more sense to me.. except that as above, I suspect we can't have
> > level-triggered + MSG_PEEK + no busy polling all at once.
> 
> Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
> events on new data. I think EPOLLET is really what we need here, at
> least for the case where we are not necessarily waiting for an ACK.
> 
> For the other case (window full), we can either mask EPOLLIN |
> EPOLLRDHUP or set EPOLLONESHOT (possibly slightly more complicated
> because we need to re-add the file descriptor).

So, thinking further, what I think we want is to always set EPOLLET on
the TCP sockets.  If all received data is in flight we don't need
anything special, at least assuming epoll works like we think it does
above: when we get more data, we get an event and check if we can send
more data into flight.

When the receive window fills we really don't care about new data
until it opens again, so clear EPOLLIN | EPOLLRDHUP.  When the window
does open again - i.e. when we get an ack or window update - both
reenable EPOLLIN | EPOLLRDHUP and call sock_handler() to process
anything that's accumulated since we turned EPOLLIN off.

Details to figure out:
 * Do we need to be careful about order of re-enable EPOLLIN
   vs. rechecking the recv() buffer?

 * At what point would we trigger the CLAMP_WINDOW workaround in that
   scheme?

 * Is there any impact of EPOLLET on the other events?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-09-29 15:19         ` Stefano Brivio
@ 2023-10-03  3:22           ` David Gibson
  2023-10-05  6:19             ` Stefano Brivio
  0 siblings, 1 reply; 32+ messages in thread
From: David Gibson @ 2023-10-03  3:22 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:
> On Thu, 28 Sep 2023 11:58:45 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:
> > > On Mon, 25 Sep 2023 14:47:52 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:  
> > > > > ...so that we'll retry sending them, instead of more-or-less silently
> > > > > dropping them. This happens quite frequently if our sending buffer on
> > > > > the UNIX domain socket is heavily constrained (for instance, by the
> > > > > 208 KiB default memory limit).
> > > > > 
> > > > > It might be argued that dropping frames is part of the expected TCP
> > > > > flow: we don't dequeue those from the socket anyway, so we'll
> > > > > eventually retransmit them.
> > > > > 
> > > > > But we don't need the receiver to tell us (by the way of duplicate or
> > > > > missing ACKs) that we couldn't send them: we already know as
> > > > > sendmsg() reports that. This seems to considerably increase
> > > > > throughput stability and throughput itself for TCP connections with
> > > > > default wmem_max values.
> > > > > 
> > > > > Unfortunately, the 16 bits left as padding in the frame descriptors    
> > > > 
> > > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> > > > yes?  
> > > 
> > > Right, that.
> > >   
> > > > For AVX2 we have substantially more space here.  Couldn't we put
> > > > a conn (or seq) pointer in here at the cost of a few bytes MSS for
> > > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)?  
> > > 
> > > Yes, true. On the other hand, having this parallel array only affects
> > > readability I guess, whereas inserting pointers and lengths in
> > > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
> > > non-AVX2 x86, but also on other architectures). So I'd rather stick to
> > > this.  
> > 
> > Yeah, I guess so.
> > 
> > Actually.. I did just think of one other option.  It avoids both any
> > extra padding and a parallel array, but at the cost of additional work
> > when frames are dropped.  We could use those 16 bits of padding to
> > store the TCP payload length.  Then when we don't manage to send all
> > our frames, we do another loop through and add up how many stream
> > bytes we actually sent to update the seq pointer.
> 
> Hmm, yes. It's slightly more memory efficient, but the complexity seems
> a bit overkill to me.

More importantly, I forgot the fact that by the time we're sending the
frames, we don't know what connection they're associated with any
more.

[snip]
> > > > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > > >  
> > > > >  	/* Finally, queue to tap */
> > > > >  	plen = mss;
> > > > > +	seq = conn->seq_to_tap;    
> > > > 
> > > > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > > > called between tcp_data_from_sock() calls for the same socket.  That
> > > > should be true for the normal course of things.  However, couldn't it
> > > > happen that we get a normal socket EPOLLIN event for a particular
> > > > connection - calling tcp_data_from_sock() - but in the same epoll()
> > > > round we also get a tap ack for the same connection which causes
> > > > another call to tcp_data_from_sock() (with the change from patch
> > > > 2/5).  IIRC those would both happen before the deferred handling and
> > > > therefore the data_buf_flush().  
> > > 
> > > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
> > > but with that change, it's not. Unless we drop that change from 2/5.  
> > 
> > Even if we drop the change, it's a worryingly subtle constraint.
> 
> Another option to avoid this...
> 
> > > > Not sure how to deal with that short of separate 'seq_queued' and
> > > > 'seq_sent' counters in the connection structure, which is a bit
> > > > unfortunate.  
> > > 
> > > I wonder how bad it is if we call tcp_l2_data_buf_flush()
> > > unconditionally before calling tcp_data_from_sock() from
> > > tcp_tap_handler(). But again, maybe this is not needed at all, we
> > > should check that epoll detail from 2/5 first...
> 
> other than this one, would be to use that external table to update
> sequence numbers *in the frames* as we send stuff out.

Not really sure what you're proposing there.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-10-03  3:20           ` David Gibson
@ 2023-10-05  6:18             ` Stefano Brivio
  2023-10-05  7:36               ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-10-05  6:18 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Tue, 3 Oct 2023 14:20:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Sep 29, 2023 at 05:20:15PM +0200, Stefano Brivio wrote:
> > On Thu, 28 Sep 2023 11:48:38 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:  
> > > > On Mon, 25 Sep 2023 13:07:24 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > I think the change itself here is sound, but I have some nits to pick
> > > > > with the description and reasoning.
> > > > > 
> > > > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:    
> > > > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > > > > that we ran out of tap-side window space, or that all available
> > > > > > socket data is already in flight -- better names welcome!      
> > > > > 
> > > > > Hmm.. when you put it like that it makes me wonder if those two quite
> > > > > different conditions really need the same handling.  Hrm.. I guess
> > > > > both conditions mean that we can't accept data from the socket, even
> > > > > if it's available.
> > > > 
> > > > Right. I mean, we can also call them differently... or maybe pick a
> > > > name that reflects the outcome/handling instead of what happened.    
> > > 
> > > Sure, if we could think of one.  Except, on second thoughts, I'm not
> > > sure my characterization is correct.  If the tap side window is full
> > > then, indeed, we can't accept data from the socket.  However if
> > > everything we have is in flight that doesn't mean we couldn't put more
> > > data into flight if it arrived.  
> > 
> > Right, but that's why we set EPOLLET...
> >   
> > > That consideration, together with the way we use MSG_PEEK possibly
> > > means that we fundamentally need to use edge-triggered interrupts -
> > > with the additional trickiness that entails - to avoid busy polling.
> > > Although even that only works if we get a new edge interrupt when data
> > > is added to a buffer that's been PEEKed but not TRUNCed.  If that's
> > > not true the MSG_PEEK approach might be doomed :(.  
> > 
> > without EPOLLONESHOT, which wouldn't have this behaviour. From epoll(7):
> > 
> >        Since even with edge-triggered epoll, multiple events can be  generated
> >        upon  receipt  of multiple chunks of data, the caller has the option to
> >        specify the EPOLLONESHOT flag [...]
> > 
> > so yes, in general, when the socket has more data, we'll get another
> > event. I didn't test this in an isolated case, perhaps we should, but
> > from my memory it always worked.  
> 
> Ok.  That text does seem to suggest it works that way, although it's
> not entirely clear that it must always give new events.

A glimpse at the code confirms that, but... yes, I think we should test
this more specifically, perhaps even shipping that test case under doc/.
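
Something like this could work as that isolated check (sketch only: it
uses a socketpair() instead of a TCP connection to keep it short, on
the assumption that the edge-triggered behaviour under test is the
same):

/* Does edge-triggered epoll report a new event when more data arrives
 * on a socket that was only MSG_PEEK'd, never drained?
 */
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	char buf[64];
	int s[2], ep;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, s))
		return 1;

	ep = epoll_create1(0);
	ev.data.fd = s[1];
	epoll_ctl(ep, EPOLL_CTL_ADD, s[1], &ev);

	send(s[0], "first", 5, 0);
	printf("first chunk:  %d event(s)\n", epoll_wait(ep, &ev, 1, 1000));

	recv(s[1], buf, sizeof(buf), MSG_PEEK);	/* peek, don't dequeue */

	send(s[0], "second", 6, 0);
	printf("second chunk: %d event(s)\n", epoll_wait(ep, &ev, 1, 1000));

	return 0;
}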

> > On the other hand, we could actually use EPOLLONESHOT in the other
> > case, as an optimisation, when we're waiting for an ACK from the tap
> > side.  
> 
> Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
> By the time we know the receiver's window has been filled, we're
> already processing the last event that we'll be able to until the
> window opens again.  Setting EPOLLONESHOT would be requesting one more
> event.

Ah, true -- we should have it "always" set and always re-arm, which is
messy and would probably kill any semblance of high throughput.

> > > > > > ) on any
> > > > > > event: do that only if the first packet in a batch has the ACK flag
> > > > > > set.      
> > > > > 
> > > > > "First packet in a batch" may not be accurate here - we're looking at
> > > > > whichever packet we were up to before calling data_from_tap().  There
> > > > > could have been earlier packets in the receive batch that were already
> > > > > processed.    
> > > > 
> > > > Well, it depends on what we call "batch" -- here I meant the pool of
> > > > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> > > > would be more accurate.    
> > > 
> > > Uh.. I don't think that actually helps.  Remember pools aren't queues.
> > > The point here is that the packet we're considering is not the
> > > first of the batch/pool/whatever, but the first of what's left.  
> > 
> > Right, yes, I actually meant the sub-pool starting from the index (now)
> > given by the caller.
> >   
> > > > > This also raises the question of why the first data packet should be
> > > > > particularly privileged here.    
> > > > 
> > > > No reason other than convenience, and yes, it can be subtly wrong.
> > > >     
> > > > > I'm wondering if what we really want to
> > > > > check is whether data_from_tap() advanced the ack pointer at all.    
> > > > 
> > > > Right, we probably should do that instead.    
> > > 
> > > Ok.
> > >   
> > > > > I'm not clear on when the th->ack check would ever fail in practice:
> > > > > aren't the only normal packets in a TCP connection without ACK the
> > > > > initial SYN or an RST?  We've handled the SYN case earlier, so should
> > > > > we just have a blanket case above this that if we get a packet with
> > > > > !ACK, we reset the connection?    
> > > > 
> > > > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > > > remember if the Linux kernel ever does that) is a segment without ACK,
> > > > and without data, that just updates the window (for example after a
> > > > zero window).
> > > > 
> > > > If the sequence received/processed so far doesn't correspond to the
> > > > latest sequence sent, omitting the ACK flag is useful so that the
> > > > window update is not taken as duplicate ACK (that would trigger
> > > > retransmission).    
> > > 
> > > Ah, ok, I wasn't aware of that case.  
> > 
> > On a second thought, in that case, we just got a window update, so it's
> > very reasonable to actually check again if we can send more. Hence the
> > check on th->ack is bogus anyway.
> >   
> > > > > > Make sure we check for pending socket data when we reset it:
> > > > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > > > > does, isn't guaranteed to actually trigger a socket event.      
> > > > > 
> > > > > Which sure seems like a kernel bug.  Some weird edge conditions for
> > > > > edge-triggered seems expected, but this doesn't seem like valid
> > > > > level-triggered semantics.    
> > > > 
> > > > Hmm, yes, and by doing a quick isolated test actually this seems to work
> > > > as intended in the kernel. I should drop this and try again.
> > > >     
> > > > > Hmmm... is toggling EPOLLET even what we want.  IIUC, the heart of
> > > > > what's going on here is that we can't take more data from the socket
> > > > > until something happens on the tap side (either the window expands, or
> > > > > it acks some data).  In which case should we be toggling EPOLLIN on
> > > > > the socket instead?  That seems more explicitly to be saying to the
> > > > > socket side "we don't currently care if you have data available".    
> > > > 
> > > > The reason was to act on EPOLLRDHUP at the same time. But well, we
> > > > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> > > > more sense.    
> > > 
> > > So specifically to mask EPOLLRDHUP as well? On the grounds that if
> > > we're still chewing on what we got already we don't yet care that
> > > we've reached the end, yes?  
> > 
> > Right.  
> 
> Ok.
> 
> > > So yes, explicitly masking both those
> > > makes more sense to me.. except that as above, I suspect we can't have
> > > level-triggered + MSG_PEEK + no busy polling all at once.  
> > 
> > Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
> > events on new data. I think EPOLLET is really what we need here, at
> > least for the case where we are not necessarily waiting for an ACK.
> > 
> > For the other case (window full), we can either mask EPOLLIN |
> > EPOLLRDHUP or set EPOLLONESHOT (possibly slightly more complicated
> > because we need to re-add the file descriptor).  
> 
> So, thinking further, what I think we want is to always set EPOLLET on
> the TCP sockets.  If all received data is in flight we don't need
> anything special, at least assuming epoll works like we think it does
> above: when we get more data, we get an event and check if we can send
> more data into flight.

...I think having EPOLLET always set causes races that are potentially
unsolvable, because we can't always read from the socket until we get
-EAGAIN. That is, this will work:

- recv() all data, we can't write more data to tap
- at some point, we can write again to tap
- more data comes, we'll wake up and continue

but this won't:

- partial recv(), we can't write more data to tap
- at some point, we can write again to tap
- no additional data comes

> When the receive window fills we really don't care about new data
> until it opens again, so clear EPOLLIN | EPOLLRDHUP.  When the window
> does open again - i.e. when we get an ack or window update - both
> reenable EPOLLIN | EPOLLRDHUP and call sock_handler() to process
> anything that's accumulated since we turned EPOLLIN off.

Agreed. This should be a mere optimisation on top of the current
behaviour, by the way.

> Details to figure out:
>  * Do we need to be careful about order of re-enable EPOLLIN
>    vs. rechecking the recv() buffer?

It should be done before checking, I guess, so that at worst we end up
with one spurious event (because we already read the data a future
EPOLLIN will tell us about), but never a missed event (because data
arrives between recv() and epoll_ctl()).
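
That is, roughly (sketch, with a made-up helper name for the re-arming
part):

conn_sock_reenable_input(c, conn);	/* put EPOLLIN | EPOLLRDHUP back first */
tcp_data_from_sock(c, conn);		/* then catch up on anything already queued */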

>  * At what point would we trigger the CLAMP_WINDOW workaround in that
>    scheme?

When we read any data from the socket, with MSG_TRUNC, after the window
full condition.

>  * Is there any impact of EPOLLET on the other events?

Not that I'm aware of. EPOLLHUP and EPOLLERR are reported anyway, and
we don't want EPOLLRDHUP to differ (in this sense) from EPOLLIN. But
again, this is under the assumption that we do *not* always set EPOLLET.

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-10-03  3:22           ` David Gibson
@ 2023-10-05  6:19             ` Stefano Brivio
  2023-10-05  7:38               ` David Gibson
  0 siblings, 1 reply; 32+ messages in thread
From: Stefano Brivio @ 2023-10-05  6:19 UTC (permalink / raw)
  To: David Gibson; +Cc: Matej Hrica, passt-dev

On Tue, 3 Oct 2023 14:22:59 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:
> > On Thu, 28 Sep 2023 11:58:45 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:  
> > > > On Mon, 25 Sep 2023 14:47:52 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:    
> > > > > > ...so that we'll retry sending them, instead of more-or-less silently
> > > > > > dropping them. This happens quite frequently if our sending buffer on
> > > > > > the UNIX domain socket is heavily constrained (for instance, by the
> > > > > > 208 KiB default memory limit).
> > > > > > 
> > > > > > It might be argued that dropping frames is part of the expected TCP
> > > > > > flow: we don't dequeue those from the socket anyway, so we'll
> > > > > > eventually retransmit them.
> > > > > > 
> > > > > > But we don't need the receiver to tell us (by the way of duplicate or
> > > > > > missing ACKs) that we couldn't send them: we already know as
> > > > > > sendmsg() reports that. This seems to considerably increase
> > > > > > throughput stability and throughput itself for TCP connections with
> > > > > > default wmem_max values.
> > > > > > 
> > > > > > Unfortunately, the 16 bits left as padding in the frame descriptors      
> > > > > 
> > > > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> > > > > yes?    
> > > > 
> > > > Right, that.
> > > >     
> > > > > For AVX2 we have substantially more space here.  Couldn't we put
> > > > > a conn (or seq) pointer in here at the cost of a few bytes MSS for
> > > > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)?    
> > > > 
> > > > Yes, true. On the other hand, having this parallel array only affects
> > > > readability I guess, whereas inserting pointers and lengths in
> > > > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
> > > > non-AVX2 x86, but also on other architectures). So I'd rather stick to
> > > > this.    
> > > 
> > > Yeah, I guess so.
> > > 
> > > Actually.. I did just think of one other option.  It avoids both any
> > > extra padding and a parallel array, but at the cost of additional work
> > > when frames are dropped.  We could use those 16 bits of padding to
> > > store the TCP payload length.  Then when we don't manage to send all
> > > our frames, we do another loop through and add up how many stream
> > > bytes we actually sent to update the seq pointer.  
> > 
> > Hmm, yes. It's slightly more memory efficient, but the complexity seems
> > a bit overkill to me.  
> 
> More importantly, I forgot the fact that by the time we're sending the
> frames, we don't know what connection they're associated with any
> more.

Oh, I thought you wanted to rebuild the information about the
connection by looking into the hash table or something like that.

> [snip]
> > > > > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > > > >  
> > > > > >  	/* Finally, queue to tap */
> > > > > >  	plen = mss;
> > > > > > +	seq = conn->seq_to_tap;      
> > > > > 
> > > > > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > > > > called between tcp_data_from_sock() calls for the same socket.  That
> > > > > should be true for the normal course of things.  However, couldn't it
> > > > > happen that we get a normal socket EPOLLIN event for a particular
> > > > > connection - calling tcp_data_from_sock() - but in the same epoll()
> > > > > round we also get a tap ack for the same connection which causes
> > > > > another call to tcp_data_from_sock() (with the change from patch
> > > > > 2/5).  IIRC those would both happen before the deferred handling and
> > > > > therefore the data_buf_flush().    
> > > > 
> > > > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
> > > > but with that change, it's not. Unless we drop that change from 2/5.    
> > > 
> > > Even if we drop the change, it's a worryingly subtle constraint.  
> > 
> > Another option to avoid this...
> >   
> > > > > Not sure how to deal with that short of separate 'seq_queued' and
> > > > > 'seq_sent' counters in the connection structure, which is a bit
> > > > > unfortunate.    
> > > > 
> > > > I wonder how bad it is if we call tcp_l2_data_buf_flush()
> > > > unconditionally before calling tcp_data_from_sock() from
> > > > tcp_tap_handler(). But again, maybe this is not needed at all, we
> > > > should check that epoll detail from 2/5 first...  
> > 
> > other than this one, would be to use that external table to update
> > sequence numbers *in the frames* as we send stuff out.  
> 
> Not really sure what you're proposing there.

That tcp_l2_buf_fill_headers() calculates the sequence from
conn->seq_to_tap plus a cumulative count from that table, instead of
passing it from the caller.
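
Roughly, assuming the table from 4/5 keeps, for each queued frame, the
owning connection and its payload length (all names below are
illustrative, untested):

	static struct {
		struct tcp_tap_conn *conn;	/* frame owner */
		uint16_t len;			/* queued TCP payload length */
	} seq_update[TCP_FRAMES_MEM];
	static unsigned int seq_update_cnt;

	/* In tcp_l2_buf_fill_headers(): the sequence for the frame being
	 * queued is the connection counter plus whatever this batch already
	 * holds for the same connection, no sequence argument from the
	 * caller.
	 */
	uint32_t seq = conn->seq_to_tap;
	unsigned int i;

	for (i = 0; i < seq_update_cnt; i++) {
		if (seq_update[i].conn == conn)
			seq += seq_update[i].len;
	}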

-- 
Stefano


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data
  2023-10-05  6:18             ` Stefano Brivio
@ 2023-10-05  7:36               ` David Gibson
  0 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-10-05  7:36 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 12202 bytes --]

On Thu, Oct 05, 2023 at 08:18:49AM +0200, Stefano Brivio wrote:
> On Tue, 3 Oct 2023 14:20:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Sep 29, 2023 at 05:20:15PM +0200, Stefano Brivio wrote:
> > > On Thu, 28 Sep 2023 11:48:38 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:  
> > > > > On Mon, 25 Sep 2023 13:07:24 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > I think the change itself here is sound, but I have some nits to pick
> > > > > > with the description and reasoning.
> > > > > > 
> > > > > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:    
> > > > > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > > > > > that we ran out of tap-side window space, or that all available
> > > > > > > socket data is already in flight -- better names welcome!      
> > > > > > 
> > > > > > Hmm.. when you put it like that it makes me wonder if those two quite
> > > > > > different conditions really need the same handling.  Hrm.. I guess
> > > > > > both conditions mean that we can't accept data from the socket, even
> > > > > > if it's available.    
> > > > > 
> > > > > Right. I mean, we can also call them differently... or maybe pick a
> > > > > name that reflects the outcome/handling instead of what happened.    
> > > > 
> > > > Sure, if we could think of one.  Except, on second thoughts, I'm not
> > > > sure my characterization is correct.  If the tap side window is full
> > > > then, indeed, we can't accept data from the socket.  However if
> > > > everything we have is in flight that doesn't mean we couldn't put more
> > > > data into flight if it arrived.  
> > > 
> > > Right, but that's why we set EPOLLET...
> > >   
> > > > That consideration, together with the way we use MSG_PEEK possibly
> > > > means that we fundamentally need to use edge-triggered interrupts -
> > > > with the additional trickiness that entails - to avoid busy polling.
> > > > Although even that only works if we get a new edge interrupt when data
> > > > is added to a buffer that's been PEEKed but not TRUNCed.  If that's
> > > > not true the MSG_PEEK approach might be doomed :(.  
> > > 
> > > without EPOLLONESHOT, which wouldn't have this behaviour. From epoll(7):
> > > 
> > >        Since even with edge-triggered epoll, multiple events can be  generated
> > >        upon  receipt  of multiple chunks of data, the caller has the option to
> > >        specify the EPOLLONESHOT flag [...]
> > > 
> > > so yes, in general, when the socket has more data, we'll get another
> > > event. I didn't test this in an isolated case, perhaps we should, but
> > > from my memory it always worked.  
> > 
> > Ok.  That text does seem to suggest it works that way, although it's
> > not entirely clear that it must always give new events.
> 
> A glimpse at the code confirms that, but... yes, I think we should test
> this more specifically, perhaps even shipping that test case under doc/.

Seems wise.
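
A minimal check could look like this (error handling omitted, loopback
only), maybe as a starting point for that test case under doc/:

/* Does an EPOLLET socket report a new event when data is added to a
 * receive queue that was already MSG_PEEKed, but never drained?
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	int lfd = socket(AF_INET, SOCK_STREAM, 0), cfd, afd, ep;
	socklen_t al = sizeof(a);
	char buf[64];

	bind(lfd, (struct sockaddr *)&a, sizeof(a));
	getsockname(lfd, (struct sockaddr *)&a, &al);	/* pick up the port */
	listen(lfd, 1);

	cfd = socket(AF_INET, SOCK_STREAM, 0);
	connect(cfd, (struct sockaddr *)&a, sizeof(a));
	afd = accept(lfd, NULL, NULL);

	ep = epoll_create1(0);
	epoll_ctl(ep, EPOLL_CTL_ADD, afd, &ev);

	send(cfd, "first", 5, 0);			/* first chunk, first edge */
	epoll_wait(ep, &ev, 1, 1000);
	recv(afd, buf, sizeof(buf), MSG_PEEK);		/* peek, don't dequeue */

	send(cfd, "second", 6, 0);			/* new data, queue not empty */
	if (epoll_wait(ep, &ev, 1, 1000) == 1)
		printf("new event after MSG_PEEK: we're fine\n");
	else
		printf("no new event: MSG_PEEK plus EPOLLET can't work\n");

	return 0;
}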

> > > On the other hand, we could actually use EPOLLONESHOT in the other
> > > case, as an optimisation, when we're waiting for an ACK from the tap
> > > side.  
> > 
> > Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
> > By the time we know the receiver's window has been filled, we're
> > already processing the last event that we'll be able to until the
> > window opens again.  Setting EPOLLONESHOT would be requesting one more
> > event.
> 
> Ah, true -- we should have it "always" set and always re-arm, which is
> messy and would probably kill any semblance of high throughput.
> 
> > > > > > > ) on any
> > > > > > > event: do that only if the first packet in a batch has the ACK flag
> > > > > > > set.      
> > > > > > 
> > > > > > "First packet in a batch" may not be accurate here - we're looking at
> > > > > > whichever packet we were up to before calling data_from_tap().  There
> > > > > > could have been earlier packets in the receive batch that were already
> > > > > > processed.    
> > > > > 
> > > > > Well, it depends on what we call "batch" -- here I meant the pool of
> > > > > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> > > > > would be more accurate.    
> > > > 
> > > > Uh.. I don't think that actually helps.  Remember pools aren't queues.
> > > > The point here is that the packet we're considering is not the
> > > > first of the batch/pool/whatever, but the first of what's left.  
> > > 
> > > Right, yes, I actually meant the sub-pool starting from the index (now)
> > > given by the caller.
> > >   
> > > > > > This also raises the question of why the first data packet should be
> > > > > > particularly privileged here.    
> > > > > 
> > > > > No reason other than convenience, and yes, it can be subtly wrong.
> > > > >     
> > > > > > I'm wondering if what we really want to
> > > > > > check is whether data_from_tap() advanced the ack pointer at all.    
> > > > > 
> > > > > Right, we probably should do that instead.    
> > > > 
> > > > Ok.
> > > >   
> > > > > > I'm not clear on when the th->ack check would ever fail in practice:
> > > > > > aren't the only normal packets in a TCP connection without ACK the
> > > > > > initial SYN or an RST?  We've handled the SYN case earlier, so should
> > > > > > we just have a blanket case above this that if we get a packet with
> > > > > > !ACK, we reset the connection?    
> > > > > 
> > > > > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > > > > remember if the Linux kernel ever does that) is a segment without ACK,
> > > > > and without data, that just updates the window (for example after a
> > > > > zero window).
> > > > > 
> > > > > If the sequence received/processed so far doesn't correspond to the
> > > > > latest sequence sent, omitting the ACK flag is useful so that the
> > > > > window update is not taken as duplicate ACK (that would trigger
> > > > > retransmission).    
> > > > 
> > > > Ah, ok, I wasn't aware of that case.  
> > > 
> > > On a second thought, in that case, we just got a window update, so it's
> > > very reasonable to actually check again if we can send more. Hence the
> > > check on th->ack is bogus anyway.
> > >   
> > > > > > > Make sure we check for pending socket data when we reset it:
> > > > > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > > > > > does, isn't guaranteed to actually trigger a socket event.      
> > > > > > 
> > > > > > Which sure seems like a kernel bug.  Some weird edge conditions for
> > > > > > edge-triggered seems expected, but this doesn't seem like valid
> > > > > > level-triggered semantics.    
> > > > > 
> > > > > Hmm, yes, and by doing a quick isolated test actually this seems to work
> > > > > as intended in the kernel. I should drop this and try again.
> > > > >     
> > > > > > Hmmm... is toggling EPOLLET even what we want?  IIUC, the heart of
> > > > > > what's going on here is that we can't take more data from the socket
> > > > > > until something happens on the tap side (either the window expands, or
> > > > > > it acks some data).  In which case should we be toggling EPOLLIN on
> > > > > > the socket instead?  That seems more explicitly to be saying to the
> > > > > > socket side "we don't currently care if you have data available".    
> > > > > 
> > > > > The reason was to act on EPOLLRDHUP at the same time. But well, we
> > > > > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> > > > > more sense.    
> > > > 
> > > > So specifically to mask EPOLLRDHUP as well? On the grounds that if
> > > > we're still chewing on what we got already we don't yet care that
> > > > we've reached the end, yes?  
> > > 
> > > Right.  
> > 
> > Ok.
> > 
> > > > So yes, explicitly masking both those
> > > > makes more sense to me.. except that as above, I suspect we can't have
> > > > level-triggered + MSG_PEEK + no busy polling all at once.  
> > > 
> > > Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
> > > events on new data. I think EPOLLET is really what we need here, at
> > > least for the case where we are not necessarily waiting for an ACK.
> > > 
> > > For the other case (window full), we can either mask EPOLLIN |
> > > EPOLLRDHUP or set EPOLLONESHOT (possibly slightly more complicated
> > > because we need to re-add the file descriptor).  
> > 
> > So, thinking further, what I think we want is to always set EPOLLET on
> > the TCP sockets.  If all received data is in flight we don't need
> > anything special, at least assuming epoll works like we think it does
> > above: when we get more data, we get an event and check if we can send
> > more data into flight.
> 
> ...I think having EPOLLET always set causes races that are potentially
> unsolvable,

Hrm.. I'm a little sceptical that we'd have unsolvable races when
always using EPOLLET that we don't already have using it sometimes
(though maybe with more subtle and complex triggering conditions).  I
feel like it's easier to reason through to avoid races if we stay in a
single trigger mode.

> because we can't always read from the socket until we get
> -EAGAIN. That is, this will work:
> 
> - recv() all data, we can't write more data to tap
> - at some point, we can write again to tap
> - more data comes, we'll wake up and continue
> 
> but this won't:
> 
> - partial recv(), we can't write more data to tap
> - at some point, we can write again to tap
> - no additional data comes

Well.. with the other change I'm suggesting here, once the data we did
send gets ACKed, we'll recheck the incoming socket and send more.  It
will delay that remainder data a bit, but not by _that_ much.  Since
in this case we're not getting more data from the socket, that
shouldn't even be all that harmful to throughput.

That said, it's not ideal.  We could address that by enabling an
EPOLLOUT (possibly ONESHOT) event on the tap socket if we get a
partial send.  When that event is triggered, we'd scan through
connections for any unsent data.
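
Sketching it with invented names (and plain EPOLLOUT rather than
ONESHOT, to avoid having to re-arm everything else on the tap
descriptor afterwards):

	/* tap_send() got a short write: also subscribe to EPOLLOUT until we
	 * manage to flush what's left.
	 */
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLRDHUP | EPOLLOUT,
		.data.u64 = ref_tap_u64,	/* placeholder reference */
	};

	epoll_ctl(c->epollfd, EPOLL_CTL_MOD, c->fd_tap, &ev);

	/* In the EPOLLOUT handler: drop EPOLLOUT again with another
	 * EPOLL_CTL_MOD, then walk the connections and call
	 * tcp_data_from_sock() on whatever has data queued but not sent.
	 */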

> > When the receive window fills we really don't care about new data
> > until it opens again, so clear EPOLLIN | EPOLLRDHUP.  When the window
> > does open again - i.e. when we get an ack or window update - both
> > reenable EPOLLIN | EPOLLRDHUP and call sock_handler() to process
> > anything that's accumulated since we turned EPOLLIN off.
> 
> Agreed. This should be a mere optimisation on top of the current
> behaviour, by the way.

Well, it becomes an essential change if we always enable EPOLLET.

> > Details to figure out:
> >  * Do we need to be careful about order of re-enable EPOLLIN
> >    vs. rechecking the recv() buffer?
> 
> It should be done before checking I guess, so that we can end up with
> one spurious event (because we already read the data a future EPOLLIN
> will tell us about), but never a missing event (because data comes
> between recv() and epoll_ctl()).

Yes, I think that's right.
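
In code, that would be roughly (sketch only, connection reference and
error handling left out):

	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLRDHUP | EPOLLET,
		.data.u64 = 0,		/* connection reference goes here */
	};

	/* Window opened again: re-enable events on the socket first... */
	epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->sock, &ev);

	/* ...then recheck whatever piled up while EPOLLIN was masked.  Worst
	 * case, a later event points us at data we already read here; the
	 * opposite order can miss data landing between the two calls.
	 */
	tcp_data_from_sock(c, conn);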

> >  * At what point would we trigger the CLAMP_WINDOW workaround in that
> >    scheme?
> 
> When we read any data from the socket, with MSG_TRUNC, after the window
> full condition.
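
Ok, so at the point where we discard data the guest acked, i.e. in
tcp_sock_consume(), something like:

	/* No buffer needed: MSG_TRUNC just drops 'len' bytes from the
	 * receive queue.  After a window-full condition, this would be the
	 * spot where the clamp workaround gets triggered.
	 */
	recv(conn->sock, NULL, len, MSG_DONTWAIT | MSG_TRUNC);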

Well, yes, but what's the point at which we flag that the window full
condition has occurred?  Since that's occurring on the read side
buffer, we're not directly aware of it.

> >  * Is there any impact of EPOLLET on the other events?
> 
> Not that I'm aware of. EPOLLHUP and EPOLLERR are reported anyway, and
> we don't want EPOLLRDHUP to differ (in this sense) from EPOLLIN. But
> again, this is under the assumption that we do *not* always set EPOLLET.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames
  2023-10-05  6:19             ` Stefano Brivio
@ 2023-10-05  7:38               ` David Gibson
  0 siblings, 0 replies; 32+ messages in thread
From: David Gibson @ 2023-10-05  7:38 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Matej Hrica, passt-dev

[-- Attachment #1: Type: text/plain, Size: 5762 bytes --]

On Thu, Oct 05, 2023 at 08:19:00AM +0200, Stefano Brivio wrote:
> On Tue, 3 Oct 2023 14:22:59 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Sep 29, 2023 at 05:19:50PM +0200, Stefano Brivio wrote:
> > > On Thu, 28 Sep 2023 11:58:45 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote:  
> > > > > On Mon, 25 Sep 2023 14:47:52 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote:    
> > > > > > > ...so that we'll retry sending them, instead of more-or-less silently
> > > > > > > dropping them. This happens quite frequently if our sending buffer on
> > > > > > > the UNIX domain socket is heavily constrained (for instance, by the
> > > > > > > 208 KiB default memory limit).
> > > > > > > 
> > > > > > > It might be argued that dropping frames is part of the expected TCP
> > > > > > > flow: we don't dequeue those from the socket anyway, so we'll
> > > > > > > eventually retransmit them.
> > > > > > > 
> > > > > > > > But we don't need the receiver to tell us (by way of duplicate or
> > > > > > > missing ACKs) that we couldn't send them: we already know as
> > > > > > > sendmsg() reports that. This seems to considerably increase
> > > > > > > throughput stability and throughput itself for TCP connections with
> > > > > > > default wmem_max values.
> > > > > > > 
> > > > > > > Unfortunately, the 16 bits left as padding in the frame descriptors      
> > > > > > 
> > > > > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t,
> > > > > > yes?    
> > > > > 
> > > > > Right, that.
> > > > >     
> > > > > > For AVX2 we have substantially more space here.  Couldn't we put
> > > > > > a conn (or seq) pointer in here at the cost of a few bytes MSS for
> > > > > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)?    
> > > > > 
> > > > > Yes, true. On the other hand, having this parallel array only affects
> > > > > readability I guess, whereas inserting pointers and lengths in
> > > > > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on
> > > > > non-AVX2 x86, but also on other architectures). So I'd rather stick to
> > > > > this.    
> > > > 
> > > > Yeah, I guess so.
> > > > 
> > > > Actually.. I did just think of one other option.  It avoids both any
> > > > extra padding and a parallel array, but at the cost of additional work
> > > > when frames are dropped.  We could use those 16 bits of padding to
> > > > store the TCP payload length.  Then when we don't manage to send all
> > > > our frames, we do another loop through and add up how many stream
> > > > bytes we actually sent to update the seq pointer.  
> > > 
> > > Hmm, yes. It's slightly more memory efficient, but the complexity seems
> > > a bit overkill to me.  
> > 
> > More importantly, I forgot the fact that by the time we're sending the
> > frames, we don't know what connection they're associated with any
> > more.
> 
> Oh, I thought you wanted to rebuild the information about the
> connection by looking into the hash table or something like that.

No, I guess we could do that, but I think it would be substantially
messier than the parallel array approach.
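
That is, keep the per-frame { conn, len } table and, when sendmsg()
falls short, do something like this (sketch only, ignoring the case of
a partially sent frame, and assuming seq_to_tap was already advanced
when the frame was queued):

	/* In tcp_l2_data_buf_flush(): 'sent' out of 'queued' frames went
	 * out, the rest never left us, so take it back out of the tap-side
	 * sequence counters and let that data be read from the sockets and
	 * queued again later.
	 */
	for (i = sent; i < queued; i++)
		seq_update[i].conn->seq_to_tap -= seq_update[i].len;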

> > [snip]
> > > > > > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > >  
> > > > > > >  	/* Finally, queue to tap */
> > > > > > >  	plen = mss;
> > > > > > > +	seq = conn->seq_to_tap;      
> > > > > > 
> > > > > > This will only be correct if tcp_l2_data_buf_flush() is *always*
> > > > > > called between tcp_data_from_sock() calls for the same socket.  That
> > > > > > should be true for the normal course of things.  However, couldn't it
> > > > > > happen that we get a normal socket EPOLLIN event for a particular
> > > > > > connection - calling tcp_data_from_sock() - but in the same epoll()
> > > > > > round we also get a tap ack for the same connection which causes
> > > > > > another call to tcp_data_from_sock() (with the change from patch
> > > > > > 2/5).  IIRC those would both happen before the deferred handling and
> > > > > > therefore the data_buf_flush().    
> > > > > 
> > > > > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/
> > > > > but with that change, it's not. Unless we drop that change from 2/5.    
> > > > 
> > > > Even if we drop the change, it's a worryingly subtle constraint.  
> > > 
> > > Another option to avoid this...
> > >   
> > > > > > Not sure how to deal with that short of separate 'seq_queued' and
> > > > > > 'seq_sent' counters in the connection structure, which is a bit
> > > > > > unfortunate.    
> > > > > 
> > > > > I wonder how bad it is if we call tcp_l2_data_buf_flush()
> > > > > unconditionally before calling tcp_data_from_sock() from
> > > > > tcp_tap_handler(). But again, maybe this is not needed at all, we
> > > > > should check that epoll detail from 2/5 first...  
> > > 
> > > other than this one, would be to use that external table to update
> > > sequence numbers *in the frames* as we send stuff out.  
> > 
> > Not really sure what you're proposing there.
> 
> That tcp_l2_buf_fill_headers() calculates the sequence from
> conn->seq_to_tap plus a cumulative count from that table, instead of
> passing it from the caller.

Ah, right, that could work.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-10-05  8:15 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-22 22:06 [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
2023-09-22 22:06 ` [PATCH RFT 1/5] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
2023-09-23  2:48   ` David Gibson
2023-09-22 22:06 ` [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for pending socket data Stefano Brivio
2023-09-25  3:07   ` David Gibson
2023-09-27 17:05     ` Stefano Brivio
2023-09-28  1:48       ` David Gibson
2023-09-29 15:20         ` Stefano Brivio
2023-10-03  3:20           ` David Gibson
2023-10-05  6:18             ` Stefano Brivio
2023-10-05  7:36               ` David Gibson
2023-09-22 22:06 ` [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
2023-09-22 22:31   ` Stefano Brivio
2023-09-23  7:55   ` David Gibson
2023-09-25  4:09   ` David Gibson
2023-09-25  4:10     ` David Gibson
2023-09-25  4:21     ` David Gibson
2023-09-27 17:05       ` Stefano Brivio
2023-09-28  1:51         ` David Gibson
2023-09-22 22:06 ` [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
2023-09-25  4:47   ` David Gibson
2023-09-27 17:06     ` Stefano Brivio
2023-09-28  1:58       ` David Gibson
2023-09-29 15:19         ` Stefano Brivio
2023-10-03  3:22           ` David Gibson
2023-10-05  6:19             ` Stefano Brivio
2023-10-05  7:38               ` David Gibson
2023-09-22 22:06 ` [PATCH RFT 5/5] passt.1: Add note about tuning rmem_max and wmem_max for throughput Stefano Brivio
2023-09-25  4:57   ` David Gibson
2023-09-27 17:06     ` Stefano Brivio
2023-09-28  2:02       ` David Gibson
2023-09-25  5:52 ` [PATCH RFT 0/5] Fixes and a workaround for TCP stalls with small buffers David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).