* [PATCH v2 0/3] Fixes and a workaround for TCP stalls with small buffers
@ 2023-09-29 15:04 Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 1/3] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:04 UTC (permalink / raw)
To: passt-dev; +Cc: David Gibson, Matej Hrica
The fundamental patch here is 2/3, which is a workaround for a rather
surprising kernel behaviour we seem to be hitting. This all comes from
the investigation around https://bugs.passt.top/show_bug.cgi?id=74.
I can't hit stalls anymore and throughput looks finally good to me
(~3.5gbps with 208 KiB rmem_max and wmem_max), but... please test
(again).
v2:
- Drop 1/5 (checking for ACK before resetting STALLED and calling
tcp_data_from_sock() directly from tcp_tap_handler())
- Moving reset of STALLED flag is now done in 2/3 (was 3/5)
- Don't pass unnecessary argument to tcp_data_to_tap() in 3/3 (was
4/5)
- Drop 5/5 as long as we're not sure what kind of buffer clamping
is actually beneficial
Stefano Brivio (3):
tcp: Fix comment to tcp_sock_consume()
tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
tcp, tap: Don't increase tap-side sequence counter for dropped frames
tap.c | 10 ++++++---
tap.h | 2 +-
tcp.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++----------
3 files changed, 67 insertions(+), 16 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2 1/3] tcp: Fix comment to tcp_sock_consume()
2023-09-29 15:04 [PATCH v2 0/3] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
@ 2023-09-29 15:04 ` Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
2 siblings, 0 replies; 7+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:04 UTC (permalink / raw)
To: passt-dev; +Cc: David Gibson, Matej Hrica
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.
Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp.c b/tcp.c
index 2933123..dff5e79 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2103,7 +2103,7 @@ static void tcp_conn_from_tap(struct ctx *c,
}
/**
- * tcp_sock_consume() - Consume (discard) data from buffer, update ACK sequence
+ * tcp_sock_consume() - Consume (discard) data from buffer
* @conn: Connection pointer
* @ack_seq: ACK sequence, host order
*
--
@@ -2103,7 +2103,7 @@ static void tcp_conn_from_tap(struct ctx *c,
}
/**
- * tcp_sock_consume() - Consume (discard) data from buffer, update ACK sequence
+ * tcp_sock_consume() - Consume (discard) data from buffer
* @conn: Connection pointer
* @ack_seq: ACK sequence, host order
*
--
2.39.2
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
2023-09-29 15:04 [PATCH v2 0/3] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 1/3] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
@ 2023-09-29 15:04 ` Stefano Brivio
2023-09-29 15:44 ` Stefano Brivio
2023-10-03 2:47 ` David Gibson
2023-09-29 15:04 ` [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
2 siblings, 2 replies; 7+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:04 UTC (permalink / raw)
To: passt-dev; +Cc: David Gibson, Matej Hrica
It looks like we need it as workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:
- an iperf3 client on the host sends about 160 KiB, typically
segmented into five frames by passt. We read this data using
MSG_PEEK
- the iperf3 server on the guest starts receiving
- meanwhile, the host kernel advertised a zero-sized window to the
receiver, as expected
- eventually, the guest acknowledges all the data sent so far, and
we drop it from the buffer, courtesy of tcp_sock_consume(), using
recv() with MSG_TRUNC
- the client, however, doesn't get an updated window value, and
even keepalive packets are answered with zero-window segments,
until the connection is closed
It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).
Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the guest.
I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).
Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/tcp.c b/tcp.c
index dff5e79..32917c8 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
wnd <<= conn->ws_from_tap;
wnd = MIN(MAX_WINDOW, wnd);
- if (conn->flags & WND_CLAMPED) {
+ /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
+ * with a zero-sized window on a TCP socket, dropping data (once
+ * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
+ * to be enough to make the kernel advertise a non-zero window to the
+ * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
+ * value, fixes this.
+ *
+ * The STALLED flag on a connection is a sufficient indication that we
+ * might have a zero-sized window on the socket, because it's set if we
+ * exhausted the tap-side window, or if everything we receive from a
+ * socket is already in flight to the guest.
+ *
+ * So, if STALLED is set, and we received a window value from the tap,
+ * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
+ * further and fixed in the kernel instead, if confirmed.
+ */
+ if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
if (prev_scaled == wnd)
return;
@@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
i = keep - 1;
}
- tcp_clamp_window(c, conn, max_ack_seq_wnd);
-
/* On socket flush failure, pretend there was no ACK, try again later */
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
+ tcp_clamp_window(c, conn, max_ack_seq_wnd);
+
if (retr) {
trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
@@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
if (th->ack && !(conn->events & ESTABLISHED))
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- conn_flag(c, conn, ~STALLED);
-
/* Establishing connection from socket */
if (conn->events & SOCK_ACCEPTED) {
if (th->syn && th->ack && !th->fin) {
@@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
if (count == -1)
goto reset;
+ /* Note: STALLED matters for tcp_clamp_window(): unset it only after
+ * processing data (and window) from the tap side
+ */
+ conn_flag(c, conn, ~STALLED);
+
if (conn->seq_ack_to_tap != conn->seq_from_tap)
ack_due = 1;
--
@@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
wnd <<= conn->ws_from_tap;
wnd = MIN(MAX_WINDOW, wnd);
- if (conn->flags & WND_CLAMPED) {
+ /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
+ * with a zero-sized window on a TCP socket, dropping data (once
+ * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
+ * to be enough to make the kernel advertise a non-zero window to the
+ * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
+ * value, fixes this.
+ *
+ * The STALLED flag on a connection is a sufficient indication that we
+ * might have a zero-sized window on the socket, because it's set if we
+ * exhausted the tap-side window, or if everything we receive from a
+ * socket is already in flight to the guest.
+ *
+ * So, if STALLED is set, and we received a window value from the tap,
+ * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
+ * further and fixed in the kernel instead, if confirmed.
+ */
+ if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
if (prev_scaled == wnd)
return;
@@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
i = keep - 1;
}
- tcp_clamp_window(c, conn, max_ack_seq_wnd);
-
/* On socket flush failure, pretend there was no ACK, try again later */
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
+ tcp_clamp_window(c, conn, max_ack_seq_wnd);
+
if (retr) {
trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
@@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
if (th->ack && !(conn->events & ESTABLISHED))
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- conn_flag(c, conn, ~STALLED);
-
/* Establishing connection from socket */
if (conn->events & SOCK_ACCEPTED) {
if (th->syn && th->ack && !th->fin) {
@@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
if (count == -1)
goto reset;
+ /* Note: STALLED matters for tcp_clamp_window(): unset it only after
+ * processing data (and window) from the tap side
+ */
+ conn_flag(c, conn, ~STALLED);
+
if (conn->seq_ack_to_tap != conn->seq_from_tap)
ack_due = 1;
--
2.39.2
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames
2023-09-29 15:04 [PATCH v2 0/3] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 1/3] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
@ 2023-09-29 15:04 ` Stefano Brivio
2023-10-03 2:50 ` David Gibson
2 siblings, 1 reply; 7+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:04 UTC (permalink / raw)
To: passt-dev; +Cc: David Gibson, Matej Hrica
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).
It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.
But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.
Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tap.c | 10 +++++++---
tap.h | 2 +-
tcp.c | 40 ++++++++++++++++++++++++++++++++++------
3 files changed, 42 insertions(+), 10 deletions(-)
diff --git a/tap.c b/tap.c
index 93db989..b30ff81 100644
--- a/tap.c
+++ b/tap.c
@@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
* @c: Execution context
* @iov: Array of buffers, each containing one frame (with L2 headers)
* @n: Number of buffers/frames in @iov
+ *
+ * Return: number of frames actually sent
*/
-void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
+size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
{
size_t m;
if (!n)
- return;
+ return 0;
if (c->mode == MODE_PASST)
m = tap_send_frames_passt(c, iov, n);
@@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
m = tap_send_frames_pasta(c, iov, n);
if (m < n)
- debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
+ debug("tap: failed to send %lu frames of %lu", n - m, n);
pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
+
+ return m;
}
/**
diff --git a/tap.h b/tap.h
index 021fb7c..952fafc 100644
--- a/tap.h
+++ b/tap.h
@@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
void *in, size_t len);
int tap_send(const struct ctx *c, const void *data, size_t len);
-void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
+size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
void tap_update_mac(struct tap_hdr *taph,
const unsigned char *eth_d, const unsigned char *eth_s);
void tap_listen_handler(struct ctx *c, uint32_t events);
diff --git a/tcp.c b/tcp.c
index 32917c8..3313d72 100644
--- a/tcp.c
+++ b/tcp.c
@@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS];
*/
static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
+/**
+ * tcp_buf_seq_update - Sequences to update with length of frames once sent
+ * @seq: Pointer to sequence number sent to tap-side, to be updated
+ * @len: TCP payload length
+ */
+struct tcp_buf_seq_update {
+ uint32_t *seq;
+ uint16_t len;
+};
+
/* Static buffers */
/**
@@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
#endif
tcp4_l2_buf[TCP_FRAMES_MEM];
+static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
+
static unsigned int tcp4_l2_buf_used;
/**
@@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
#endif
tcp6_l2_buf[TCP_FRAMES_MEM];
+static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
+
static unsigned int tcp6_l2_buf_used;
/* recvmsg()/sendmsg() data for tap */
@@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
*/
static void tcp_l2_data_buf_flush(struct ctx *c)
{
- tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+ unsigned i;
+ size_t m;
+
+ m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+ for (i = 0; i < m; i++)
+ *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
tcp6_l2_buf_used = 0;
- tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+ m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+ for (i = 0; i < m; i++)
+ *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
tcp4_l2_buf_used = 0;
}
@@ -2149,17 +2170,20 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
* @plen: Payload length at L4
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
- * @now: Current timestamp
*/
static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
ssize_t plen, int no_csum, uint32_t seq)
{
+ uint32_t *seq_update = &conn->seq_to_tap;
struct iovec *iov;
if (CONN_V4(conn)) {
struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
+ tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
+ tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
+
iov = tcp4_l2_iov + tcp4_l2_buf_used++;
iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
check, seq);
@@ -2168,6 +2192,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
} else if (CONN_V6(conn)) {
struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
+ tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
+ tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
+
iov = tcp6_l2_iov + tcp6_l2_buf_used++;
iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
NULL, seq);
@@ -2193,7 +2220,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
int s = conn->sock, i, ret = 0;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
- uint32_t already_sent;
+ uint32_t already_sent, seq;
struct iovec *iov;
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
@@ -2282,14 +2309,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
/* Finally, queue to tap */
plen = mss;
+ seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
if (i == send_bufs - 1)
plen = last_len;
- tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
- conn->seq_to_tap += plen;
+ tcp_data_to_tap(c, conn, plen, no_csum, seq);
+ seq += plen;
}
conn_flag(c, conn, ACK_FROM_TAP_DUE);
--
@@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS];
*/
static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
+/**
+ * tcp_buf_seq_update - Sequences to update with length of frames once sent
+ * @seq: Pointer to sequence number sent to tap-side, to be updated
+ * @len: TCP payload length
+ */
+struct tcp_buf_seq_update {
+ uint32_t *seq;
+ uint16_t len;
+};
+
/* Static buffers */
/**
@@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
#endif
tcp4_l2_buf[TCP_FRAMES_MEM];
+static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
+
static unsigned int tcp4_l2_buf_used;
/**
@@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
#endif
tcp6_l2_buf[TCP_FRAMES_MEM];
+static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
+
static unsigned int tcp6_l2_buf_used;
/* recvmsg()/sendmsg() data for tap */
@@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
*/
static void tcp_l2_data_buf_flush(struct ctx *c)
{
- tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+ unsigned i;
+ size_t m;
+
+ m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+ for (i = 0; i < m; i++)
+ *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
tcp6_l2_buf_used = 0;
- tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+ m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+ for (i = 0; i < m; i++)
+ *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
tcp4_l2_buf_used = 0;
}
@@ -2149,17 +2170,20 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
* @plen: Payload length at L4
* @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
* @seq: Sequence number to be sent
- * @now: Current timestamp
*/
static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
ssize_t plen, int no_csum, uint32_t seq)
{
+ uint32_t *seq_update = &conn->seq_to_tap;
struct iovec *iov;
if (CONN_V4(conn)) {
struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
+ tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
+ tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
+
iov = tcp4_l2_iov + tcp4_l2_buf_used++;
iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
check, seq);
@@ -2168,6 +2192,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
} else if (CONN_V6(conn)) {
struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
+ tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
+ tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
+
iov = tcp6_l2_iov + tcp6_l2_buf_used++;
iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
NULL, seq);
@@ -2193,7 +2220,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
int s = conn->sock, i, ret = 0;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
- uint32_t already_sent;
+ uint32_t already_sent, seq;
struct iovec *iov;
already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
@@ -2282,14 +2309,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
/* Finally, queue to tap */
plen = mss;
+ seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
if (i == send_bufs - 1)
plen = last_len;
- tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
- conn->seq_to_tap += plen;
+ tcp_data_to_tap(c, conn, plen, no_csum, seq);
+ seq += plen;
}
conn_flag(c, conn, ACK_FROM_TAP_DUE);
--
2.39.2
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
2023-09-29 15:04 ` [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
@ 2023-09-29 15:44 ` Stefano Brivio
2023-10-03 2:47 ` David Gibson
1 sibling, 0 replies; 7+ messages in thread
From: Stefano Brivio @ 2023-09-29 15:44 UTC (permalink / raw)
To: passt-dev, David Gibson; +Cc: Matej Hrica
On Fri, 29 Sep 2023 17:04:45 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:
> It looks like we need it as workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
>
> - an iperf3 client on the host sends about 160 KiB, typically
> segmented into five frames by passt. We read this data using
> MSG_PEEK
>
> - the iperf3 server on the guest starts receiving
>
> - meanwhile, the host kernel advertised a zero-sized window to the
> receiver, as expected
Gosh, sorry, I forgot all the s/receiver/whatever/ here. Fixed locally
and not reposting, unless there's something else...
--
Stefano
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
2023-09-29 15:04 ` [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
2023-09-29 15:44 ` Stefano Brivio
@ 2023-10-03 2:47 ` David Gibson
1 sibling, 0 replies; 7+ messages in thread
From: David Gibson @ 2023-10-03 2:47 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Matej Hrica
[-- Attachment #1: Type: text/plain, Size: 4875 bytes --]
On Fri, Sep 29, 2023 at 05:04:45PM +0200, Stefano Brivio wrote:
> It looks like we need it as workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
>
> - an iperf3 client on the host sends about 160 KiB, typically
> segmented into five frames by passt. We read this data using
> MSG_PEEK
>
> - the iperf3 server on the guest starts receiving
>
> - meanwhile, the host kernel advertised a zero-sized window to the
> receiver, as expected
>
> - eventually, the guest acknowledges all the data sent so far, and
> we drop it from the buffer, courtesy of tcp_sock_consume(), using
> recv() with MSG_TRUNC
>
> - the client, however, doesn't get an updated window value, and
> even keepalive packets are answered with zero-window segments,
> until the connection is closed
>
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
>
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the
> window which is advertised to the guest.
>
> I couldn't quite confirm this issue by following all the possible
> code paths in the kernel, yet. If confirmed, this should be fixed in
> the kernel, but meanwhile this workaround looks robust to me (and it
> will be needed for backward compatibility anyway).
>
> Reported-by: Matej Hrica <mhrica@redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=74
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 29 ++++++++++++++++++++++++-----
> 1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index dff5e79..32917c8 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
> wnd <<= conn->ws_from_tap;
> wnd = MIN(MAX_WINDOW, wnd);
>
> - if (conn->flags & WND_CLAMPED) {
> + /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> + * with a zero-sized window on a TCP socket, dropping data (once
> + * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> + * to be enough to make the kernel advertise a non-zero window to the
> + * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
> + * value, fixes this.
> + *
> + * The STALLED flag on a connection is a sufficient indication that we
> + * might have a zero-sized window on the socket, because it's set if we
> + * exhausted the tap-side window, or if everything we receive from a
> + * socket is already in flight to the guest.
> + *
> + * So, if STALLED is set, and we received a window value from the tap,
> + * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> + * further and fixed in the kernel instead, if confirmed.
> + */
> + if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
> if (prev_scaled == wnd)
> return;
>
> @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> i = keep - 1;
> }
>
> - tcp_clamp_window(c, conn, max_ack_seq_wnd);
> -
> /* On socket flush failure, pretend there was no ACK, try again later */
> if (ack && !tcp_sock_consume(conn, max_ack_seq))
> tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>
> + tcp_clamp_window(c, conn, max_ack_seq_wnd);
> +
> if (retr) {
> trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
> max_ack_seq, conn->seq_to_tap);
> @@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
> if (th->ack && !(conn->events & ESTABLISHED))
> tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
>
> - conn_flag(c, conn, ~STALLED);
> -
> /* Establishing connection from socket */
> if (conn->events & SOCK_ACCEPTED) {
> if (th->syn && th->ack && !th->fin) {
> @@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
> if (count == -1)
> goto reset;
>
> + /* Note: STALLED matters for tcp_clamp_window(): unset it only after
> + * processing data (and window) from the tap side
> + */
> + conn_flag(c, conn, ~STALLED);
> +
> if (conn->seq_ack_to_tap != conn->seq_from_tap)
> ack_due = 1;
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames
2023-09-29 15:04 ` [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
@ 2023-10-03 2:50 ` David Gibson
0 siblings, 0 replies; 7+ messages in thread
From: David Gibson @ 2023-10-03 2:50 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Matej Hrica
[-- Attachment #1: Type: text/plain, Size: 7435 bytes --]
On Fri, Sep 29, 2023 at 05:04:46PM +0200, Stefano Brivio wrote:
> ...so that we'll retry sending them, instead of more-or-less silently
> dropping them. This happens quite frequently if our sending buffer on
> the UNIX domain socket is heavily constrained (for instance, by the
> 208 KiB default memory limit).
>
> It might be argued that dropping frames is part of the expected TCP
> flow: we don't dequeue those from the socket anyway, so we'll
> eventually retransmit them.
>
> But we don't need the receiver to tell us (by the way of duplicate or
> missing ACKs) that we couldn't send them: we already know as
> sendmsg() reports that. This seems to considerably increase
> throughput stability and throughput itself for TCP connections with
> default wmem_max values.
>
> Unfortunately, the 16 bits left as padding in the frame descriptors
> we use internally aren't enough to uniquely identify for which
> connection we should update sequence numbers: create a parallel
> array of pointers to sequence numbers and L4 lengths, of
> TCP_FRAMES_MEM size, and go through it after calling sendmsg().
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tap.c | 10 +++++++---
> tap.h | 2 +-
> tcp.c | 40 ++++++++++++++++++++++++++++++++++------
> 3 files changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/tap.c b/tap.c
> index 93db989..b30ff81 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> * @c: Execution context
> * @iov: Array of buffers, each containing one frame (with L2 headers)
> * @n: Number of buffers/frames in @iov
> + *
> + * Return: number of frames actually sent
> */
> -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> {
> size_t m;
>
> if (!n)
> - return;
> + return 0;
>
> if (c->mode == MODE_PASST)
> m = tap_send_frames_passt(c, iov, n);
> @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n)
> m = tap_send_frames_pasta(c, iov, n);
>
> if (m < n)
> - debug("tap: dropped %lu frames of %lu due to short send", n - m, n);
> + debug("tap: failed to send %lu frames of %lu", n - m, n);
>
> pcap_multiple(iov, m, c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> +
> + return m;
> }
>
> /**
> diff --git a/tap.h b/tap.h
> index 021fb7c..952fafc 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c,
> const struct in6_addr *src, const struct in6_addr *dst,
> void *in, size_t len);
> int tap_send(const struct ctx *c, const void *data, size_t len);
> -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n);
> void tap_update_mac(struct tap_hdr *taph,
> const unsigned char *eth_d, const unsigned char *eth_s);
> void tap_listen_handler(struct ctx *c, uint32_t events);
> diff --git a/tcp.c b/tcp.c
> index 32917c8..3313d72 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS];
> */
> static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
>
> +/**
> + * tcp_buf_seq_update - Sequences to update with length of frames once sent
> + * @seq: Pointer to sequence number sent to tap-side, to be updated
> + * @len: TCP payload length
> + */
> +struct tcp_buf_seq_update {
> + uint32_t *seq;
> + uint16_t len;
> +};
> +
> /* Static buffers */
>
> /**
> @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t {
> #endif
> tcp4_l2_buf[TCP_FRAMES_MEM];
>
> +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> +
> static unsigned int tcp4_l2_buf_used;
>
> /**
> @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t {
> #endif
> tcp6_l2_buf[TCP_FRAMES_MEM];
>
> +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> +
> static unsigned int tcp6_l2_buf_used;
>
> /* recvmsg()/sendmsg() data for tap */
> @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c)
> */
> static void tcp_l2_data_buf_flush(struct ctx *c)
> {
> - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> + unsigned i;
> + size_t m;
> +
> + m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
> + for (i = 0; i < m; i++)
> + *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> tcp6_l2_buf_used = 0;
>
> - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> + m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
> + for (i = 0; i < m; i++)
> + *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> tcp4_l2_buf_used = 0;
> }
>
> @@ -2149,17 +2170,20 @@ static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
> * @plen: Payload length at L4
> * @no_csum: Don't compute IPv4 checksum, use the one from previous buffer
> * @seq: Sequence number to be sent
> - * @now: Current timestamp
> */
> static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> ssize_t plen, int no_csum, uint32_t seq)
> {
> + uint32_t *seq_update = &conn->seq_to_tap;
> struct iovec *iov;
>
> if (CONN_V4(conn)) {
> struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
>
> + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> +
> iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> check, seq);
> @@ -2168,6 +2192,9 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
> } else if (CONN_V6(conn)) {
> struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
>
> + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> +
> iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> NULL, seq);
> @@ -2193,7 +2220,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> int s = conn->sock, i, ret = 0;
> struct msghdr mh_sock = { 0 };
> uint16_t mss = MSS_GET(conn);
> - uint32_t already_sent;
> + uint32_t already_sent, seq;
> struct iovec *iov;
>
> already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> @@ -2282,14 +2309,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>
> /* Finally, queue to tap */
> plen = mss;
> + seq = conn->seq_to_tap;
> for (i = 0; i < send_bufs; i++) {
> int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
>
> if (i == send_bufs - 1)
> plen = last_len;
>
> - tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap);
> - conn->seq_to_tap += plen;
> + tcp_data_to_tap(c, conn, plen, no_csum, seq);
> + seq += plen;
> }
>
> conn_flag(c, conn, ACK_FROM_TAP_DUE);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-10-03 2:50 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29 15:04 [PATCH v2 0/3] Fixes and a workaround for TCP stalls with small buffers Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 1/3] tcp: Fix comment to tcp_sock_consume() Stefano Brivio
2023-09-29 15:04 ` [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Stefano Brivio
2023-09-29 15:44 ` Stefano Brivio
2023-10-03 2:47 ` David Gibson
2023-09-29 15:04 ` [PATCH v2 3/3] tcp, tap: Don't increase tap-side sequence counter for dropped frames Stefano Brivio
2023-10-03 2:50 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).