From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202602 header.b=E3b/RuC/;
	dkim-atps=neutral
Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76])
	by passt.top (Postfix) with ESMTPS id 1ED365A0269
	for <passt-dev@passt.top>; Tue, 21 Apr 2026 06:23:38 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202602; t=1776745415;
	bh=31xScKlmdL/6AIE3Aif9HI2kG1fQDcAAsqDbPqKkKXs=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=E3b/RuC/U6Yvbyk/hhQ+cQKDxWIfEqvOYJuoW1jzkpBLKoCo9jB3pFCPM32f0gWbb
	 08LhlsQ+KoSVJa0L3OlMcP69GAbpM/LoSKe3unOpxglS/k9G2SSbbkaI5zVTvgvPJn
	 ohNc6raBaVrev0MPvRgRnbFPeVOVz2NPCgz8+E0l18FLFO4rP5HJoq1TjJ+S94zall
	 UuOqwFiNHXbaIIUMYBVJ/9D7yxN8yX6u73BJYfyqov8S+pRT4lRXfaOjU0//vwhVYr
	 6/mA9HRLNC0SEVB5Gz8vHWqBLwyq/WkFQe5A/wVm0nQKnmnecdU7z8G+mKUedZg9+C
	 2ef4JvjtUTW0Q==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4g08P315CXz4wDx; Tue, 21 Apr 2026 14:23:35 +1000 (AEST)
Date: Tue, 21 Apr 2026 14:23:29 +1000
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Subject: Re: [PATCH v1] tcp: Handle errors from tcp_send_flag()
Message-ID: <aeb7wQ_6yrHxXvjz@zatzit>
References: <20260410075539.1566421-1-anskuma@redhat.com>
 <20260415213827.39495072@elisabeth>
 <aeWT3iMTnMmlc75u@zatzit>
 <20260421014059.4dccee9e@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="9sOXeN/Y+CymvZI6"
Content-Disposition: inline
In-Reply-To: <20260421014059.4dccee9e@elisabeth>
Message-ID-Hash: O3WQ3XPSKYYMC2BHG4S5HIFNH33BREWZ
X-Message-ID-Hash: O3WQ3XPSKYYMC2BHG4S5HIFNH33BREWZ
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: Anshu Kumari <anskuma@redhat.com>, passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/aeb7wQ_6yrHxXvjz@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/O3WQ3XPSKYYMC2BHG4S5HIFNH33BREWZ/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--9sOXeN/Y+CymvZI6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Apr 21, 2026 at 01:41:03AM +0200, Stefano Brivio wrote:
> On Mon, 20 Apr 2026 12:47:58 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>=20
> > On Wed, Apr 15, 2026 at 09:38:28PM +0200, Stefano Brivio wrote:
> > > Nit: v1 in the subject tag is not necessary (not harmful either): if
> > > there are no version tags, it's implicit we're talking about version =
1.
> > >=20
> > > On Fri, 10 Apr 2026 13:25:39 +0530
> > > Anshu Kumari <anskuma@redhat.com> wrote:
> > >  =20
> > > > tcp_send_flag() can return error codes from tcp_prepare_flags()
> > > > failing TCP_INFO, or from failure to collect buffers on the
> > > > vhost-user path. These errors indicate the connection requires
> > > > resetting.
> > > >=20
> > > > Most callers of tcp_send_flag() were ignoring the error code and
> > > > carrying on as if nothing was wrong. Check the return value at
> > > > each call site and handle the error appropriately:
> > > >   - in tcp_data_from_tap(), return -1 so the caller resets
> > > >   - in tcp_tap_handler(), goto reset
> > > >   - in tcp_timer_handler()/tcp_sock_handler()/tcp_conn_from_sock_fi=
nish(),
> > > >     call tcp_rst() and return
> > > >   - in tcp_tap_conn_from_sock(), set CLOSING flag (flow not yet act=
ive)
> > > >   - in tcp_keepalive(), call tcp_rst() and continue the loop
> > > >   - in tcp_flow_migrate_target_ext(), goto fail
> > > >=20
> > > > The call in tcp_rst_do() is left unchecked: we are already
> > > > resetting, and tcp_sock_rst() still needs to run regardless.
> > > >=20
> > > > Bug: https://bugs.passt.top/show_bug.cgi?id=3D194 =20
> > >=20
> > > Nit: we always use Link: tags (CONTRIBUTING.md uses the plural which
> > > might be a bit confusing, I guess we should fix that), rationale:
> > >=20
> > >   https://archives.passt.top/passt-dev/20230704132104.48106368@elisab=
eth/
> > >   https://archives.passt.top/passt-dev/20251105163137.424a6537@elisab=
eth/
> > >=20
> > > But I fix up these tags on merge anyway, no need to re-send (in
> > > general).
> > >  =20
> > > > Signed-off-by: Anshu Kumari <anskuma@redhat.com>
> > > > ---
> > > >  tcp.c | 59 ++++++++++++++++++++++++++++++++++++++++++++-----------=
----
> > > >  1 file changed, 44 insertions(+), 15 deletions(-)
> > > >=20
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 8ea9be8..9ce671a 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -1917,7 +1917,9 @@ static int tcp_data_from_tap(const struct ctx=
 *c, struct tcp_tap_conn *conn,
> > > >  				   "keep-alive sequence: %u, previous: %u",
> > > >  				   seq, conn->seq_from_tap);
> > > > =20
> > > > -			tcp_send_flag(c, conn, ACK);
> > > > +			if (tcp_send_flag(c, conn, ACK))
> > > > +				return -1; =20
> > >=20
> > > A general comment: in _some_ of these cases where we fail to send ACK
> > > segments, I intentionally didn't check for errors and let the
> > > connection live on, because that looked like the most graceful failure
> > > handling to me.
> > >=20
> > > After all, ACK segments without data are not assumed to be reliably
> > > transmitted (RFC 9293, 3.8.4), so, given that failing to send some
> > > should have a similar outcome as the peer missing some, I guess we're
> > > always expected to recover from a situation like that.
> > >=20
> > > This doesn't apply to other occurrences below where we fail to send a
> > > SYN segment or where failure to send ACK segments might mean we are in
> > > some expected state (including a connection that might get stuck
> > > forever).
> > >=20
> > > But reading David's description of bug #194, I wonder if he had
> > > something else in mind. That is, I don't have a strong preference
> > > against resetting the connection whenever we fail to prepare buffers,
> > > but in many of these cases we don't really _have to_ reset the
> > > connection. David, do you see this differently? =20
> >=20
> > That's a good point, which I didn't think about when I reviewed.  I
> > don't recall if I'd considered it when I filed the bug.
> >=20
> > So, in principle, you're right; failing to send an ack is equivalent
> > to losing an ack en route, which generally shouldn't cause an
> > immediate reset of the connection in most cases.
> >=20
> > But... that's complicated by the details of what errors can actually
> > occur here.  AFAICT there are only three cases:
> >=20
> >  1) tcp_vu_send_flag() returns -1 because vu_collect() returned 0
> >=20
> > IIUC this means we ran out of buffers.  We probably shouldn't reset in
> > this case.
>=20
> Right.
>=20
> >  2) tcp_vu_send_flag() returns -1 because vu_collect() returned > 1
> >=20
> > This means the guest buffer layout wasn't what we expected.  We
> > probably shouldn't reset in this case, but I believe it will go away
> > with Laurent's pending multi-buffer vu patches anyway.  So we can
> > probably ignore this one.
>=20
> Yes, that will go away in a bit, so we can ignore it.
>=20
> > [Also, we according to docs and other usage we should return an errno
> >  in both these cases, not a bare -1]
> >=20
> >  3) tcp_prepare_flags() returns -ECONNRESET
> >=20
> > This only happens if TCP_INFO fails.  In this case we *should* reset -
> > in fact we already set CLOSED in the connection events.  Essentially,
> > we've concluded the socket side is irretrievably broken, so we should
> > shut down the tap side as well.
>=20
> Agreed.
>=20
> > So I think we have two options here:
> >=20
> > A) tcp_send_flag() reports only "fatal" errors
> >=20
> > Simpler to do, but possibly confusing.  We'd need to
>=20
> Rather confusing for me indeed.
>=20
> >     A1) Change tcp_vu_send_flag() not to report an error if
> >         vu_collect() returns 0.  AIUI, this case is more or less
> > 	equivalent to filling the tap socket queue, and we don't
> > 	report that case as an error in tcp_buf_send_flag().
> > 	tcp_payload_flush() reverts sequences in this case, but
> > 	doesn't report the error any further.
> >=20
> >     A2) That leaves only the TCP_INFO failure case, so we can and
> >         should reset on any failure from tcp_send_flag(), just like
> > 	this draft
>=20
> This looks like a stretch and almost a layering violation to me, that
> is, we would adapt the semantics of tcp_send_flag() to the caller
> instead of making it return what naturally makes sense.

Yeah, maybe.

> > B) tcp_send_flag() reports both transient and fatal errors
> >=20
> > More complex, but maybe less surprising semantics?
>=20
> Definitely less surprising I think.
>=20
> > tcp_send_flag()
> > would need to use different return codes for the different cases
> > (maybe ECONNRESET vs. EBUSY?)
>=20
> Or... if we run out of buffers, and we're sending an ACK, "try again":
> -EAGAIN.

Ah, yes EAGAIN is a better choice than EBUSY.

> I was assuming that if we're sending a SYN that's a different case and
> we should reset right away... but actually delivery of a SYN isn't
> reliable either, and we have retries implemented with proper handling
> of the case where we exceed the number of retries.
>=20
> I think in general we should give ENOBUFS / -ENOBUFS (depending on the
> specific convention of the given function) for functions that failed to
> provide / reserve buffers, and -EAGAIN if we were trying to send an
> ACK... or even just a SYN.

It seems like a layering violation to me to peer into the contents of
what we're sending to alter the return code.  But I don't think we
need to; I think it makes more sense to always send EAGAIN if the
problem is something we expect to be transient - in practice, running
out of space on the tap side, whether that's buffers in the VU ring or
the tap socket's send buffer.

If a caller really can't tolerate a delay in sending, it still has the
option of aborting on an EAGAIN, but I don't see any reason we'd
actually want that.

> >     B1) For consistency we should propagate failures from
> >         tcp_payload_flush(), through tcp_buf_send_flag() to its
> > 	callers,
> >     B2) We should only reset if tcp_send_flag() returns -ECONNRESET
>=20
> Or, at this point, on anything that's not -EAGAIN?

Right.

> > > > +
> > > >  			tcp_timer_ctl(c, conn);
> > > > =20
> > > >  			if (setsockopt(conn->sock, SOL_SOCKET, SO_KEEPALIVE,
> > > > @@ -2043,14 +2045,16 @@ eintr:
> > > >  			 *   Then swiftly looked away and left.
> > > >  			 */
> > > >  			conn->seq_from_tap =3D seq_from_tap;
> > > > -			tcp_send_flag(c, conn, ACK);
> > > > +			if (tcp_send_flag(c, conn, ACK))
> > > > +				return -1;
> > > >  		}
> > > > =20
> > > >  		if (errno =3D=3D EINTR)
> > > >  			goto eintr;
> > > > =20
> > > >  		if (errno =3D=3D EAGAIN || errno =3D=3D EWOULDBLOCK) {
> > > > -			tcp_send_flag(c, conn, ACK | DUP_ACK);
> > > > +			if (tcp_send_flag(c, conn, ACK | DUP_ACK))
> > > > +				return -1;
> > > >  			return p->count - idx;
> > > > =20
> > > >  		}
> > > > @@ -2070,7 +2074,8 @@ out:
> > > >  		 */
> > > >  		if (conn->seq_dup_ack_approx !=3D (conn->seq_from_tap & 0xff)) {
> > > >  			conn->seq_dup_ack_approx =3D conn->seq_from_tap & 0xff;
> > > > -			tcp_send_flag(c, conn, ACK | DUP_ACK);
> > > > +			if (tcp_send_flag(c, conn, ACK | DUP_ACK))
> > > > +				return -1;
> > > >  		}
> > > >  		return p->count - idx;
> > > >  	}
> > > > @@ -2084,7 +2089,8 @@ out:
> > > > =20
> > > >  		conn_event(c, conn, TAP_FIN_RCVD);
> > > >  	} else {
> > > > -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > > > +		if (tcp_send_flag(c, conn, ACK_IF_NEEDED))
> > > > +			return -1;
> > > >  	}
> > > > =20
> > > >  	return p->count - idx;
> > > > @@ -2122,7 +2128,10 @@ static void tcp_conn_from_sock_finish(const =
struct ctx *c,
> > > >  		return;
> > > >  	}
> > > > =20
> > > > -	tcp_send_flag(c, conn, ACK);
> > > > +	if (tcp_send_flag(c, conn, ACK)) {
> > > > +		tcp_rst(c, conn);
> > > > +		return;
> > > > +	}
> > > > =20
> > > >  	/* The client might have sent data already, which we didn't
> > > >  	 * dequeue waiting for SYN,ACK from tap -- check now.
> > > > @@ -2308,7 +2317,9 @@ int tcp_tap_handler(const struct ctx *c, uint=
8_t pif, sa_family_t af,
> > > >  				goto reset;
> > > >  			}
> > > > =20
> > > > -			tcp_send_flag(c, conn, ACK);
> > > > +			if (tcp_send_flag(c, conn, ACK))
> > > > +				goto reset;
> > > > +
> > > >  			conn_event(c, conn, SOCK_FIN_SENT);
> > > > =20
> > > >  			return 1;
> > > > @@ -2388,7 +2399,9 @@ int tcp_tap_handler(const struct ctx *c, uint=
8_t pif, sa_family_t af,
> > > >  		}
> > > > =20
> > > >  		conn_event(c, conn, SOCK_FIN_SENT);
> > > > -		tcp_send_flag(c, conn, ACK);
> > > > +		if (tcp_send_flag(c, conn, ACK))
> > > > +			goto reset;
> > > > +
> > > >  		ack_due =3D 0;
> > > > =20
> > > >  		/* If we received a FIN, but the socket is in TCP_ESTABLISHED
> > > > @@ -2478,7 +2491,11 @@ static void tcp_tap_conn_from_sock(const str=
uct ctx *c, union flow *flow,
> > > > =20
> > > >  	conn->wnd_from_tap =3D WINDOW_DEFAULT;
> > > > =20
> > > > -	tcp_send_flag(c, conn, SYN);
> > > > +	if (tcp_send_flag(c, conn, SYN)) {
> > > > +		conn_flag(c, conn, CLOSING); =20
> > >=20
> > > I would wait for David to confirm, but I'm fairly sure that this needs
> > > FLOW_ACTIVATE(conn); before returning, just like in the other error p=
ath
> > > of this function, because otherwise we'll leave the newly created flow
> > > in an "incomplete" state. =20
> >=20
> > Ah, yes, I missed that.
> >=20
> > > Due to flow table restrictions we adopted to keep the implementation
> > > simple (see "Theory of Operation - allocating and freeing flow entrie=
s"
> > > in flow.c), quoting from the documentation to enum flow_state in
> > > flow.h:
> > >=20
> > >  *        Caveats:
> > >  *            - At most one entry may be NEW, INI, TGT or TYPED at a =
time, so
> > >  *              it's unsafe to use flow_alloc() again until this entr=
y moves to
> > >  *              ACTIVE or FREE
> > >=20
> > > so, if we create a second connection within the same epoll cycle (for
> > > example by calling tcp_tap_conn_from_sock() again), we'll now have two
> > > entries in state TYPED, which breaks this assumption, and things will=
 =20
> >=20
> > Exactly right.
> >=20
> > > David, I think this isn't documented very obviously, even though it's
> > > all there in flow.h. This just occurred to me because of commit
> > > 52419a64f2df ("migrate, tcp: Don't flow_alloc_cancel() during incoming
> > > migration") but we can't expect others to know about past commits. =
=20
> >=20
> > I agree it's not as clear as I'd like...
> >=20
> > > I wonder if you could think of a quick way to make this more prominen=
t...
> > > should we perhaps state return conditions in functions, like you alre=
ady
> > > added for isolation.c? =20
> >=20
> > ... but I'm not really sure how to make it more prominent in a useful
> > way.  Maybe a note in the function header would do the trick?  I'm not
> > sure.
>=20
> I don't think that's particularly helpful, because we have other
> functions that would need that, and we might write more without
> noticing.
>=20
> I think we should, eventually, one day, have another look at the
> documentation in flow.c and clearly state manipulation rules that are
> otherwise only listed as comments to enums in flow.h.
>=20
> That is, perhaps a new section with just those caveats, and in more
> practical terms: instead of "at most one entry ...", something like
> "make sure you call FLOW_ACTIVATE() after ...".
>=20
> But regardless of that I don't really think it's a priority right now.
> Let's keep this for later.

I concur.

--=20
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--9sOXeN/Y+CymvZI6
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmnm+8AACgkQzQJF27ox
2GcJ7Q//aGgeOdC+/c48cx2G66vC9qClB8J8J0mPWtGqMgBAkQsUbjA2fAM4eM4D
olsgo4crVDjZJx7GF662jdF91BMKToWq1nrFsGZFuN/5AN4ia7EtGQUOX2NO6JEK
g4rnY0mSQXdKMai/Gmo7R6AVJBz1MWWanvWKE+Eo9pMvxbjjMzKDQvHmeH2C5ZX7
Zq1zVpRNWzMYtMYAMBP3sFpTH2gfnsyEkkpLFT4O8qsm3FENQShrU57o3XC5rWvN
tDY1XBQvTbaRN/T/N6ta3QPVDEF/UicQWldCbc3fZuJ5u5NgmyiZ9bia9ez3qt/K
Ak1x15Z/DSNwVEIaVZZ0uV/y9K3JPE1smuJ8uHn7BbSSdXn0KfuY/7HQQjoXiAw8
m+a/dW29YUARfBUm+jY+AvTDKZ/mzyYxbkjS7uTxAsImP46X+CkSuy79TdiUTvSC
5WGB4hYawYdIHD5m4I8GoHqqTUDdhfVqcUMkA4vX4fJi58HIF1DQQlODADJwQkch
Z1wSewYadZ8R6++bIuI0eqrSLoqojRlqfVJlqVRFdrAp5ZjhDtqY6xOXrX2SgUAX
0AqNMKRp7bUemNRl/zhsVUt/OF685f8ba+6VFpb7gjE2P5+xUY7OXUhENSXIYadt
4Lq7VBsySUp7PiES/usdfYUwVHMv2SD4FPKR6uPJgMuaGXYslQk=
=Z+J1
-----END PGP SIGNATURE-----

--9sOXeN/Y+CymvZI6--