On Thu, Jan 16, 2025 at 09:32:47PM +0100, Stefano Brivio wrote: > In the next patches, I'm extending the usage of STALLED to a few more > cases. > > Doing so revealed this issue: if we set STALLED and, consequently, > EPOLLOUT (which is wrong, fixed later) right after we set a connection > to ESTABLISHED (which also happened by mistake while I was preparing > another change), with the guest sending data together with the final > ACK in the handshake, say: > > 41.3661: vhost-user: got kick_data: 0000000000000001 idx: 1 > 41.3662: Flow 2 (NEW): FREE -> NEW > 41.3663: Flow 2 (INI): NEW -> INI > 41.3663: Flow 2 (INI): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => ? > 41.3665: Flow 2 (TGT): INI -> TGT > 41.3666: Flow 2 (TGT): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003 > 41.3667: Flow 2 (TCP connection): TGT -> TYPED > 41.3667: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003 > 41.3669: Flow 2 (TCP connection): TAP_SYN_RCVD: CLOSED -> SYN_SENT > 41.3670: Flow 2 (TCP connection): Side 0 hash table insert: bucket: 339814 > 41.3672: Flow 2 (TCP connection): TYPED -> ACTIVE > 41.3673: Flow 2 (TCP connection): TAP [2a01:4f8:222:904::2]:52536 -> [2001:db8:9a55::1]:10003 => HOST [::]:0 -> [2001:db8:9a55::1]:10003 > 41.3674: Flow 2 (TCP connection): TAP_SYN_ACK_SENT: SYN_SENT -> SYN_RCVD > 41.3675: Flow 2 (TCP connection): ACK_FROM_TAP_DUE > 41.3675: Flow 2 (TCP connection): timer expires in 10.000s > 41.3675: vhost-user: got kick_data: 0000000000000001 idx: 1 > 41.3676: Flow 2 (TCP connection): ACK_FROM_TAP_DUE dropped > 41.3676: Flow 2 (TCP connection): ESTABLISHED: SYN_RCVD -> ESTABLISHED > 41.3678: Flow 2 (TCP connection): STALLED > 41.3678: vhost-user: got kick_data: 0000000000000002 idx: 1 > 41.3679: Flow 2 (TCP connection): ACK_TO_TAP_DUE > 41.3680: Flow 2 (TCP connection): timer expires in 0.010s > 41.3680: Flow 2 (TCP connection): STALLED dropped > > we'll immediately get an EPOLLOUT event, call tcp_update_seqack_wnd(), > but ignore window and ACK sequence update. At this point, we think we > acknowledged all the data to the guest (but we didn't) and we'll > happily proceed to clear the ACK_TO_TAP_DUE flag: > > 41.3780: Flow 2 (TCP connection): ACK_TO_TAP_DUE dropped > 41.3780: Flow 2 (TCP connection): timer expires in 7200.000s > 41.5754: vhost-user: got kick_data: 0000000000000001 idx: 1 > 41.9956: vhost-user: got kick_data: 0000000000000001 idx: 1 > 42.8275: vhost-user: got kick_data: 0000000000000001 idx: 1 > > while the guest starts retransmitting that data desperately, without > ever getting an ACK segment from us: > > 1433 38.746353 2a01:4f8:222:904::2 → 2001:db8:9a55::1 94 TCP 54312 → 10003 [SYN] Seq=0 Win=65460 Len=0 MSS=65460 SACK_PERM TSval=1089126192 TSecr=0 WS=128 > 1434 38.747357 2001:db8:9a55::1 → 2a01:4f8:222:904::2 82 TCP 10003 → 54312 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=61440 WS=256 > 1435 38.747500 2a01:4f8:222:904::2 → 2001:db8:9a55::1 74 TCP 54312 → 10003 [ACK] Seq=1 Ack=1 Win=65536 Len=0 > 1436 38.747769 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192 > 1437 38.747798 2a01:4f8:222:904::2 → 2001:db8:9a55::1 32841 TCP 54312 → 10003 [ACK] Seq=8193 Ack=1 Win=65536 Len=32767 > 1438 38.748049 2001:db8:9a55::1 → 2a01:4f8:222:904::2 74 TCP [TCP Window Update] 10003 → 54312 [ACK] Seq=1 Ack=1 Win=65280 Len=0 > 1439 38.954044 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192 > 1440 39.370096 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192 > 1441 40.202135 2a01:4f8:222:904::2 → 2001:db8:9a55::1 8266 TCP [TCP Retransmission] 54312 → 10003 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=8192 > > because seq_ack_to_tap is already set to the sequence after frame > number 1437 in the example. > > For some reason, I could only reproduce this with vhost-user, IPv6, > and passt running under valgrind while taking captures. Even under > these conditions, it happens quite rarely. > > Forcibly send an ACK segment if we update the ACK sequence (or the > advertised window). > > Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson > --- > tcp.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/tcp.c b/tcp.c > index ec433f7..72fca63 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -2200,8 +2200,10 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, > if (events & EPOLLIN) > tcp_data_from_sock(c, conn); > > - if (events & EPOLLOUT) > - tcp_update_seqack_wnd(c, conn, false, NULL); > + if (events & EPOLLOUT) { > + if (tcp_update_seqack_wnd(c, conn, false, NULL)) > + tcp_send_flag(c, conn, ACK); > + } > > return; > } -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson