From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=Kl+113Wv; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 13AA95A0278 for ; Wed, 10 Sep 2025 08:38:00 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1757486280; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Neo/pjMEFJ9Mlg/JNGhdRkrkXK0+A/EuxepeRL6LSBw=; b=Kl+113Wv0qEbe3qSoNd74k9/j0NtGT/jgwFXvJN6OUSx0KWLSO+EXqfUxULZVK2qKzQe4P XbikIL2dH4PYLX+NtDA7N43Qjs5qkDHLiRMRKvLSbvR+uROb2rRcXti20XHv6Vf8JgrrkT Uho/HdSzQliuhYrob1LSkAAOAA8y/w0= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-517-5iz84ACOOMuV0zwQGD3BfQ-1; Wed, 10 Sep 2025 02:37:58 -0400 X-MC-Unique: 5iz84ACOOMuV0zwQGD3BfQ-1 X-Mimecast-MFC-AGG-ID: 5iz84ACOOMuV0zwQGD3BfQ_1757486277 Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-3e68a54dac2so139147f8f.1 for ; Tue, 09 Sep 2025 23:37:57 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1757486277; x=1758091077; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Neo/pjMEFJ9Mlg/JNGhdRkrkXK0+A/EuxepeRL6LSBw=; b=R4pHWqwQUND8YeCAwoHFPrvhfKM4nm7K/kaC6d+UyHgsTtkeGPrk3Mp8/ZMHbFJwme Yr1xhNMteo9J2q7Jv0Xcnoq4xRMyBK1PvvvyF+Ye6i8HxRjDXwmr8TyHhh8gFJKCVTJH b7DVFCfpZVpsvmzziBQA2U8OWVzMqd8hfZHMBc1QNXjVWdhp/LOaOsTrp9W0c4of+vB6 PMrfEz3WYPJ/ePUC0E124z3y6G8KX6eIORRp4lQNuploAvG8Oi7+YcXtS2qNuhxBMh2Z mADUgH7UEDQFkP2VKlMqAGRASN6HKrzCwZnw9HeuTdy3KyqJCvsIyj3O4gYH/j44z9gV /ftQ== X-Gm-Message-State: AOJu0YxLsOqwYD5rhIw8JOhwAoynGqXgeeu0M2b7DUwSvl6/QCJNX+LO el4o31VSTAZ2KQdjFeOwkIpG3fJdOnoG/JtyAErLRmSJG4gWg4WWq/lNe+xiq+k+pOVS9NRYzGs 1LydCdnnC+3BjiMu0JgYDr2IFwEaAnOkuL2Pfj107YewSTIgs2UYMBg== X-Gm-Gg: ASbGnctlDxH7motkfGoGIIDn0Waj3TiewfhEXHP4jxAUbBNsAeIFUZvuVPqLDOZNVqM VwB04PdiD0XX+X0obM7cQ8IDrmL3OA1ipAoHCCn+OTsT2uueAFwxZOEybbiXuOE9FCKAI2xO+MT CSfpPXIps72WF6e0ft2LD/CWaour3FlFcw+eGzoly8quXVeku2hIvY4m3XQHwZF7CELu3UgUeSG doM3RnQG7/3Hi9sYqFDD+nYLO9/1megCCVOI1O9blUFLt59XWNeMbDXudWXMcRh5jVHmHnPgIxV mHtKbVyBYcn4tSuQWR7hPhAKNB73Pxj/0t9Bx2OZLBP7lWUfYtJptkQXPMbXslfcFIpg X-Received: by 2002:a5d:5f46:0:b0:3d3:b30:4cf2 with SMTP id ffacd0b85a97d-3e629f1a8f6mr10926828f8f.19.1757486276750; Tue, 09 Sep 2025 23:37:56 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGWD0V9XzqDNh2+6VipwAy7zRpGOOnovUSLO0UqrxLtzP46e2we1upQAx6x6Y8SdOqD/k8qlA== X-Received: by 2002:a5d:5f46:0:b0:3d3:b30:4cf2 with SMTP id ffacd0b85a97d-3e629f1a8f6mr10926814f8f.19.1757486276325; Tue, 09 Sep 2025 23:37:56 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-3e7521c9a2esm5987409f8f.14.2025.09.09.23.37.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Sep 2025 23:37:55 -0700 (PDT) Date: Wed, 10 Sep 2025 08:37:54 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Message-ID: <20250910083754.323c0b1e@elisabeth> In-Reply-To: References: <20250909181655.2990223-1-sbrivio@redhat.com> <20250909181655.2990223-9-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: N7lc97WWLxvKB1noSdoit2XoSYylI7QzLM5IppGN0K8_1757486277 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: BISSVZJYTV6Y2IEHFSOHZFXJ2MM3SOAD X-Message-ID-Hash: BISSVZJYTV6Y2IEHFSOHZFXJ2MM3SOAD X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Jon Maloy , Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 10 Sep 2025 12:29:48 +1000 David Gibson wrote: > On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote: > > For some reason, tcp_vu_data_from_sock() already takes care of this, > > but the non-vhost-user version ignores this possibility and just sends > > out a FIN segment whenever we infer we received one host-side, > > regardless of the fact that we might have unacknowledged data still to > > send. > >=20 > > Somewhat surprisingly, this didn't cause any issue to be reported yet, > > until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > > came around, leading to the following report from Paul, who hit this > > running Podman tests: > >=20 > > 439 0.033032 169.254.1.2 =E2=86=92 192.168.122.100 65540 TCP 56602= =E2=86=92 5789 [PSH, ACK] Seq=3D10336361 Ack=3D1 Win=3D65536 Len=3D65480 > > 440 0.033055 169.254.1.2 =E2=86=92 192.168.122.100 30324 TCP [TCP = Window Full] 56602 =E2=86=92 5789 [PSH, ACK] Seq=3D10401841 Ack=3D1 Win=3D6= 5536 Len=3D30264 > >=20 > > we're sending data to the container, up to the edge of the window > >=20 > > 441 0.033059 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP 5789 =E2= =86=92 56602 [ACK] Seq=3D1 Ack=3D10401841 Win=3D83968 Len=3D0 > >=20 > > and the container acknowledges it > >=20 > > 442 0.033091 169.254.1.2 =E2=86=92 192.168.122.100 53716 TCP 56602= =E2=86=92 5789 [PSH, ACK] Seq=3D10432105 Ack=3D1 Win=3D65536 Len=3D53656 > >=20 > > we send more data, all we possibly can, in window > >=20 > > 443 0.033126 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Zer= oWindow] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D0 Len=3D0 > >=20 > > and the container shrinks the window due to the issue introduced > > by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no > > memory"). With a previous patch from this series, we rewind the > > sequence, meaning that we assign conn->seq_to_tap from > > conn->seq_ack_from_tap, so that we'll retransmit this segment, by > > reading again from the socket, and increasing conn->seq_to_tap > > once more. > >=20 > > However: > >=20 > > 444 0.033144 169.254.1.2 =E2=86=92 192.168.122.100 60 TCP 56602 = =E2=86=92 5789 [FIN, PSH, ACK] Seq=3D10485761 Ack=3D1 Win=3D65536 Len=3D0 > >=20 > > we eventually get a zero-length read from the socket and we miss the > > fact that conn->seq_to_tap !=3D conn->seq_ack_from_tap, so we send a > > FIN flag with the most recent sequence. The kernel insists: > >=20 > > 445 0.033147 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Zer= oWindow] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D0 Len=3D0 > >=20 > > with its buggy zero-window update, but: > >=20 > > 446 0.033152 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Win= dow Update] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D69120 L= en=3D0 > > 447 0.033202 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Win= dow Update] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D142848 = Len=3D0 > >=20 > > we don't reset the TAP_FIN_SENT flag anymore, and don't resend > > the FIN segment (nor data), as we already rewound the sequence > > earlier. > >=20 > > To solve this, hold off the FIN segment until we get a zero-length > > read from the socket *and* we know that there's no unacknowledged > > pending data, also without vhost-user, in tcp_buf_data_from_sock(). > >=20 > > Reported-by: Paul Holzinger > > Signed-off-by: Stefano Brivio =20 >=20 > Reviewed-by: David Gibson >=20 > At least, I think this makes sense. As always, I find the semantics > of the STALLED flag difficult to wrap my head around. After this fix I start thinking that we should probably split STALLED into several flags explicitly indicating what we are waiting for, because it's rather complicated to make sure that some code path will eventually take care of doing something on a later trigger/condition (in this case, sending the FIN flag when we can). I don't have a concrete list of possible flags in mind but something like WAITING_ON_(x), or consistent sets of BLOCKED_BY_(x) (we can't proceed with anything otherwise) and EXPECTING_(x) (x is expected but we can proceed with something else). Maybe we'll need to go the full (y)_BLOCKED_BY_(x) route, though (FIN_BLOCKED_BY_TAP_ACK?). I hope not. --=20 Stefano