public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues
@ 2025-01-16 20:32 Stefano Brivio
  2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-16 20:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, David Gibson

Patch 3/4 is similar to the patch I sent recently addressing Lina's
report. The rest is for stuff I found while reproducing that using a
parallel HTTP downloader (aria2c) with different parameters and server
locations.

Stefano Brivio (4):
  tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up
  tcp: Don't subscribe to EPOLLOUT events on STALLED
  tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
  tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from
    the guest

 tcp.c      | 16 +++++++++++-----
 tcp_buf.c  |  5 +++++
 tcp_conn.h |  1 +
 tcp_vu.c   |  6 ++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up
  2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
@ 2025-01-16 20:32 ` Stefano Brivio
  2025-01-16 20:32 ` [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED Stefano Brivio
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-16 20:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, David Gibson

In the next patches, I'm extending the usage of STALLED to a few more
cases.

Doing so revealed this issue: if we set STALLED and, consequently,
EPOLLOUT (which is wrong, fixed later) right after we set a connection
to ESTABLISHED (which also happened by mistake while I was preparing
another change), with the guest sending data together with the final
ACK in the handshake, say:

  41.3661: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3662: Flow 2 (NEW): FREE -> NEW
  41.3663: Flow 2 (INI): NEW -> INI
  41.3663: Flow 2 (INI): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => ?
  41.3665: Flow 2 (TGT): INI -> TGT
  41.3666: Flow 2 (TGT): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3667: Flow 2 (TCP connection): TGT -> TYPED
  41.3667: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3669: Flow 2 (TCP connection): TAP_SYN_RCVD: CLOSED -> SYN_SENT
  41.3670: Flow 2 (TCP connection): Side 0 hash table insert: bucket: 339814
  41.3672: Flow 2 (TCP connection): TYPED -> ACTIVE
  41.3673: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003
  41.3674: Flow 2 (TCP connection): TAP_SYN_ACK_SENT: SYN_SENT -> SYN_RCVD
  41.3675: Flow 2 (TCP connection): ACK_FROM_TAP_DUE
  41.3675: Flow 2 (TCP connection): timer expires in 10.000s
  41.3675: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.3676: Flow 2 (TCP connection): ACK_FROM_TAP_DUE dropped
  41.3676: Flow 2 (TCP connection): ESTABLISHED: SYN_RCVD -> ESTABLISHED
  41.3678: Flow 2 (TCP connection): STALLED
  41.3678: vhost-user: got kick_data: 0000000000000002 idx: 1
  41.3679: Flow 2 (TCP connection): ACK_TO_TAP_DUE
  41.3680: Flow 2 (TCP connection): timer expires in 0.010s
  41.3680: Flow 2 (TCP connection): STALLED dropped

we'll immediately get an EPOLLOUT event, call tcp_update_seqack_wnd(),
but ignore window and ACK sequence update. At this point, we think we
acknowledged all the data to the guest (but we didn't) and we'll
happily proceed to clear the ACK_TO_TAP_DUE flag:

  41.3780: Flow 2 (TCP connection): ACK_TO_TAP_DUE dropped
  41.3780: Flow 2 (TCP connection): timer expires in 7200.000s
  41.5754: vhost-user: got kick_data: 0000000000000001 idx: 1
  41.9956: vhost-user: got kick_data: 0000000000000001 idx: 1
  42.8275: vhost-user: got kick_data: 0000000000000001 idx: 1

while the guest starts retransmitting that data desperately, without
ever getting an ACK segment from us:

   1433  38.746353 2a01:4f8:222:904::2 → 2001:db8:9a55::1 94 TCP 54312 → 10003 [SYN] Seq=0 Win=65460 Len=0 MSS=65460 SACK_PERM TSval=1089126192 TSecr=0 WS=128
   1434  38.747357 2001:db8:9a55::1 → 2a01:4f8:222:904::2 82 TCP 10003 → 54312 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256
   1435  38.747500 2a01:4f8:222:904::2 → 2001:db8:9a55::1 74 TCP 54312 → 10003 [ACK] Seq=1 Ack=1 Win=65536 Len=0
   1436  38.747769 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1437  38.747798 2a01:4f8:222:904::2 → 2001:db8:9a55::1 32841 TCP 54312 → 10003 [ACK] Seq=8193 Ack=1 Win=65536 Len=32767
   1438  38.748049 2001:db8:9a55::1 → 2a01:4f8:222:904::2 74 TCP [TCP Window Update] 10003 → 54312 [ACK] Seq=1 Ack=1 Win=65280 Len=0
   1439  38.954044 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1440  39.370096 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192
   1441  40.202135 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192

because seq_ack_to_tap is already set to the sequence after frame
number 1437 in the example.

For some reason, I could only reproduce this with vhost-user, IPv6,
and passt running under valgrind while taking captures. Even under
these conditions, it happens quite rarely.

Forcibly send an ACK segment if we update the ACK sequence (or the
advertised window).

Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index ec433f7..72fca63 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2200,8 +2200,10 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		if (events & EPOLLIN)
 			tcp_data_from_sock(c, conn);
 
-		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, false, NULL);
+		if (events & EPOLLOUT) {
+			if (tcp_update_seqack_wnd(c, conn, false, NULL))
+				tcp_send_flag(c, conn, ACK);
+		}
 
 		return;
 	}
-- 
@@ -2200,8 +2200,10 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 		if (events & EPOLLIN)
 			tcp_data_from_sock(c, conn);
 
-		if (events & EPOLLOUT)
-			tcp_update_seqack_wnd(c, conn, false, NULL);
+		if (events & EPOLLOUT) {
+			if (tcp_update_seqack_wnd(c, conn, false, NULL))
+				tcp_send_flag(c, conn, ACK);
+		}
 
 		return;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED
  2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
  2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
@ 2025-01-16 20:32 ` Stefano Brivio
  2025-01-16 20:32 ` [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
  2025-01-16 20:32 ` [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Stefano Brivio
  3 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-16 20:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, David Gibson

I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.

Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 72fca63..ef33388 100644
--- a/tcp.c
+++ b/tcp.c
@@ -437,7 +437,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 			return EPOLLET;
 
 		if (conn_flags & STALLED)
-			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
+			return EPOLLIN | EPOLLRDHUP | EPOLLET;
 
 		return EPOLLIN | EPOLLRDHUP;
 	}
-- 
@@ -437,7 +437,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 			return EPOLLET;
 
 		if (conn_flags & STALLED)
-			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
+			return EPOLLIN | EPOLLRDHUP | EPOLLET;
 
 		return EPOLLIN | EPOLLRDHUP;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
  2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
  2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
  2025-01-16 20:32 ` [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED Stefano Brivio
@ 2025-01-16 20:32 ` Stefano Brivio
  2025-01-16 20:32 ` [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Stefano Brivio
  3 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-16 20:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, David Gibson

Before SO_PEEK_OFF support was introduced by commit e63d281871ef
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).

If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().

With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.

As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.

This is the original report, mentioning the originally proposed fix:

--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:

epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--

we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).

Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.

And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.

In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.

While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().

And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.

Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_buf.c | 3 +++
 tcp_vu.c  | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/tcp_buf.c b/tcp_buf.c
index a975a55..8c15101 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -359,6 +359,9 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			return -errno;
 		}
 
+		if (already_sent) /* No new data and EAGAIN: set EPOLLET */
+			conn_flag(c, conn, STALLED);
+
 		return 0;
 	}
 
diff --git a/tcp_vu.c b/tcp_vu.c
index 10e17d3..8256f53 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -399,6 +399,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			tcp_rst(c, conn);
 			return len;
 		}
+
+		if (already_sent) /* No new data and EAGAIN: set EPOLLET */
+			conn_flag(c, conn, STALLED);
+
 		return 0;
 	}
 
-- 
@@ -399,6 +399,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			tcp_rst(c, conn);
 			return len;
 		}
+
+		if (already_sent) /* No new data and EAGAIN: set EPOLLET */
+			conn_flag(c, conn, STALLED);
+
 		return 0;
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest
  2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
                   ` (2 preceding siblings ...)
  2025-01-16 20:32 ` [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
@ 2025-01-16 20:32 ` Stefano Brivio
  3 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-16 20:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, David Gibson

There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.

If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:

  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
  timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
  epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1

also known as:

  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s
  29.1517: Flow 2 (TCP connection): timer expires in 2.000s

which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.

That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.

In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.

If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.

This is definitely not elegant, but it's a minimal fix.

We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).

I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).

Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c      | 8 ++++++--
 tcp_buf.c  | 2 ++
 tcp_conn.h | 1 +
 tcp_vu.c   | 2 ++
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index ef33388..3b3193a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -345,7 +345,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 
 static const char *tcp_flag_str[] __attribute((__unused__)) = {
 	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
-	"ACK_FROM_TAP_DUE",
+	"ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
@@ -436,8 +436,12 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 		if (events & TAP_FIN_SENT)
 			return EPOLLET;
 
-		if (conn_flags & STALLED)
+		if (conn_flags & STALLED) {
+			if (conn_flags & ACK_FROM_TAP_BLOCKS)
+				return EPOLLRDHUP | EPOLLET;
+
 			return EPOLLIN | EPOLLRDHUP | EPOLLET;
+		}
 
 		return EPOLLIN | EPOLLRDHUP;
 	}
diff --git a/tcp_buf.c b/tcp_buf.c
index 8c15101..cbefa42 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -309,6 +309,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -387,6 +388,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
+	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
 	conn_flag(c, conn, ~STALLED);
 
 	send_bufs = DIV_ROUND_UP(len, mss);
diff --git a/tcp_conn.h b/tcp_conn.h
index 6ae0511..d342680 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -77,6 +77,7 @@ struct tcp_tap_conn {
 #define ACTIVE_CLOSE		BIT(2)
 #define ACK_TO_TAP_DUE		BIT(3)
 #define ACK_FROM_TAP_DUE	BIT(4)
+#define ACK_FROM_TAP_BLOCKS	BIT(5)
 
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
diff --git a/tcp_vu.c b/tcp_vu.c
index 8256f53..a216bb1 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
+	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
 	conn_flag(c, conn, ~STALLED);
 
 	/* Likely, some new data was acked too. */
-- 
@@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
+	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
 	conn_flag(c, conn, ~STALLED);
 
 	/* Likely, some new data was acked too. */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-01-16 20:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
2025-01-16 20:32 ` [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED Stefano Brivio
2025-01-16 20:32 ` [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-16 20:32 ` [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).