From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3])
	by passt.top (Postfix) with ESMTPS id F41C35A026D
	for <passt-dev@passt.top>; Sat,  6 Jan 2024 13:18:57 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202312; t=1704543531;
	bh=wLV9qlcOWKnEe4p8dQgYt7xZgQRXwz40qJlkRr4KkUc=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=abzoOy70SrEfY/Wkme4+641paTYp5FNljJAM54KGCi5IUdWsts/mywi/sj2qfas1Q
	 j7Rn6AqPnoYwc/tal841mlmBguoX0hgsl/b2D+gosCqKI9rAAXBv8LDf7xTsZinPlA
	 mwqlPQc72oBINUJRxJ4NdJDZ2Mpwe94ZEimze1rU0BoNheg3MDbL/WLAf0eF15sy0m
	 GdB1tXkNiHu5x7QjJRzJx/yp63rlUbjTYeJg6xGn2Y6ANVrUbLRM0rKL39e5OZc+47
	 mSTHMQwGIl2DX/s4K/Z44mAo8cj+AY7fw4nNGJajj3H8V4WxJSRC+L2IdWxCt3qRQE
	 ax0uD7WN4/ajA==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4T6fWH72qQz4x5r; Sat,  6 Jan 2024 23:18:51 +1100 (AEDT)
Date: Sat, 6 Jan 2024 22:32:10 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Subject: Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
Message-ID: <ZZk6OmPaHMitFC9-@zatzit>
References: <20231221061549.976358-1-david@gibson.dropbear.id.au>
 <20231221061549.976358-14-david@gibson.dropbear.id.au>
 <20231228192525.7ba1ee48@elisabeth>
 <20231230113304.37c60a9a@elisabeth>
 <ZZKpjUztf_OzfOLI@zatzit>
 <20240102191341.7c91dd44@elisabeth>
 <ZZaCK8U3mYUTForT@zatzit>
 <20240105093335.0c725692@elisabeth>
 <ZZfOZgDeVlzQU6PI@zatzit>
 <20240105112754.75c765e3@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="gRd88EjuwcLlALst"
Content-Disposition: inline
In-Reply-To: <20240105112754.75c765e3@elisabeth>
Message-ID-Hash: 7MHFF67GD74VTJ5A4CHVU5CJHQ7MYVQI
X-Message-ID-Hash: 7MHFF67GD74VTJ5A4CHVU5CJHQ7MYVQI
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/ZZk6OmPaHMitFC9-@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/7MHFF67GD74VTJ5A4CHVU5CJHQ7MYVQI/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--gRd88EjuwcLlALst
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:
> On Fri, 5 Jan 2024 20:39:50 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>=20
> > On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
> > > On Thu, 4 Jan 2024 21:02:19 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >  =20
> > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote: =20
> > > > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >    =20
> > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:=
   =20
> > > > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >      =20
> > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > >=20
> > > > > > > > > [...]     =20
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > I wonder if we really have to keep track of the number of (=
non-)entries
> > > > > > > > in the free "block", and if we have to be explicit about th=
e two cases.
> > > > > > > >=20
> > > > > > > > I'm trying to find out if we can simplify the whole thing w=
ith slightly
> > > > > > > > different variants, for example:     =20
> > > > > > >=20
> > > > > > > So... I think the version with (explicit) blocks has this fun=
damental
> > > > > > > advantage, on deletion:
> > > > > > >      =20
> > > > > > > > > +	flow->f.type =3D FLOW_TYPE_NONE;
> > > > > > > > > +	/* Put it back in a length 1 free block, don't attempt =
to fully reverse
> > > > > > > > > +	 * flow_alloc()s steps.  This will get folded together =
the next time
> > > > > > > > > +	 * flow_defer_handler runs anyway() */
> > > > > > > > > +	flow->free.n =3D 1;
> > > > > > > > > +	flow->free.next =3D flow_first_free;
> > > > > > > > > +	flow_first_free =3D FLOW_IDX(flow);     =20
> > > > > > >=20
> > > > > > > which is doable even without explicit blocks, but much harder=
 to
> > > > > > > follow.     =20
> > > > > >=20
> > > > > > Remember this is not a general deletion, only a "cancel" of the=
 most
> > > > > > recent allocation.   =20
> > > > >=20
> > > > > Oh, I thought that was only the case for this series and you woul=
d use
> > > > > that as actual deletion in another pending series (which I haven't
> > > > > finished reviewing yet).   =20
> > > >=20
> > > > No.  Not allowing deletion of any entry at any time is what I'm
> > > > trading off to get both O(1) allocation and (effectively) O(1)
> > > > deletion.
> > > >  =20
> > > > > But now I'm not sure anymore why I was thinking this...
> > > > >=20
> > > > > Anyway... do we really need it, then? Can't we just mark the "fai=
led"
> > > > > flows as whatever means "closed" for a specific protocol, and cle=
an
> > > > > them up later, instead of calling cancel() right away?   =20
> > > >=20
> > > > We could, but I'm not sure we want to.  For starters, that requires
> > > > protocol-specific behaviour whenever we need to back out an allocat=
ion
> > > > like this.  Not a big deal, since that's in protocol specific code
> > > > already, but I think it's uglier than calling cancel.
> > > >=20
> > > > It also requires that the protocol specific deferred cleanup functi=
ons
> > > > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > > > 'cancel' we can back out just the initialisation steps we've already
> > > > done (because we know where we've failed during init), then remove =
the
> > > > entry.  The deferred cleanup function only needs to deal with
> > > > "complete" entries.  Again, certainly possible, but IMO uglier than
> > > > having 'cancel'. =20
> > >=20
> > > Okay, yes, I see now.
> > >=20
> > > Another doubt that comes to me now is: if you don't plan to use this
> > > alloc_cancel() thing anywhere else, the only reason why you are adding
> > > it is to replace the (flow_count >=3D FLOW_MAX) check with a flow_all=
oc()
> > > version that can fail.
> > >=20
> > > But at this point, speaking of ugliness, couldn't we just have a
> > > bool flow_may_alloc() { return flow_first_free < FLOW_MAX }; the call=
er
> > > can use to decide to abort earlier? To me it looks so much simpler and
> > > more robust. =20
> >=20
> > Well, we could, but there are a couple of reasons I don't love it.
> > The first is abstraction: this returns explicit handling of the layout
> > of the table to the protocol specific callers.  It's not a huge deal
> > right now, but once we have 4 or 5 protocols doing this, having to
> > change all of them if we make any tiny change to the semantics of
> > flow_first_free isn't great.
>=20
> Hmm, I don't get the difference in terms of abstraction between
> checking the return value of flow_alloc() and checking the return value
> of flow_may_alloc().

Oh, sorry, I thought you were proposing open-coding the check against
FLOW_MAX, like it is at the moment.  See below for comments on a
flow_may_alloc() or similar call.

> In both cases the protocol handlers know that there's a table, a
> function to reserve entries, and that reserving entries might fail...
> and not much else.
>=20
> > The other issue is that to do this (without a bunch of fairly large
> > and ugly temporaries) means we'd populate at least some of the fields
> > in flow_common before we have officially "allocated" the entry.  At
> > that point it becomes a bit fuzzy as to when that allocation really
> > occurs.  Is it when we do the FLOW_MAX tesT?
>=20
> I would say yes -- after that we can't fail.

Uh.. we can't fail to allocate.  We can fail for other reasons.

> I mean, we work with rather constrained structures for a number of
> reasons, which comes with a number of hard problems... let's at least
> reap the benefits of it?

I'm not sure what you're getting at here.

> > Is it when we write to
> > f.type?  Is it when we update flow_first_free?  If we fail somewhere
> > in the middle of that, what steps do we need to reverse?
>=20
> We can't fail in the middle of it, at the moment.

Yes we can, that's kind of the whole point of this.

> Of course, if the
> "allocation" function changes, we might need to change the scheme. But
> is it really likely? And anyway it's just a few lines in your current
> version...
>=20
> > For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> > draft I did this differently with a "flow_prealloc()", which was
> > essentially the check against FLOW_MAX, then a later
> > flow_alloc_commit().  I thought it turned out pretty confusing
> > compared to the alloc/cancel approach.
>=20
> The current flow_alloc_cancel() implementation is definitely simple and
> semantically clear.
>=20
> What worries me a bit is that you would have two different cases for
> free "blocks" of size one, depending on the order of the events. So if
> we want to debug something like that and we see a block with size one
> it might be a failed bind(), so a fake one, or also not: it might be an
> actual block with size one.

The cluster of size one from cancel is still a valid free cluster that
satisfies all the usual invaraints, it's not "fake".  It does mean
that we could get two contiguous free clusters, which we wouldn't
otherwise.  The idea is that they'll be merged back together on the
next deferred scan, but as noted on a different subthread, that's not
currently working and I'll have to fix it.

> Thinking of multithreading: defining flow_may_alloc() becomes more
> complicated because the caller can't just assume the "allocation" will
> succeed (as long as we don't cross an "epoll cycle" or something like
> that). But we'll probably need some form of locking or userspace RCU
> giving us barriers around the pair may_alloc() / alloc().

For multi-threaded, I really thin we'd want alloc/free semantics, not
may_alloc/alloc semantics.  We have to change state in some way at
"may alloc" time, or something else could try to allocate the same
slot.  "cancel" semantics also don't make sense here, because we can
no longer be confident that the alloc we did above is still the most
recent allloc.  So a bunch of things would need to change for
multi-threading.

> If we stick to the failing alloc(), this part is simpler, but the
> interpretation of flow_first_free and block sizes becomes non-trivial.

Again, not sure what you're getting at.

> Well, on the other hand, it's all simple enough that we can change it
> as needed (for example for multithreading). If we can hope that the new
> scheme is reasonably low on bugs and we'll probably never have to guess
> why a block has size one, I'm fine with the failing alloc() as well.
>=20

--=20
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--gRd88EjuwcLlALst
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmWZOg0ACgkQzQJF27ox
2GfZEBAAm0RsGsE2A5/dNl2WOs0+xf4ydn89Wz7gPK/1TDJY/PGDr6gk01C6ijOn
07BiJCUlyaWzsOl7EprZBLURD6VV5hyb9k3ZZFmLnZUVYSVRBXQKU+w3NxP1mPfO
TV7SzkiAGJodJHLqG07IxInNn7+mJYCi7EhTXyZFIDZOOHt/LzF6RVXRznN0cUfT
uaft1+ksGxMV4mdyTww9QrM7cCN6FVotjoAMIdr+0CKxmrMaaS4xsjPNmc1SlKLW
9KxI0YtbnorJ69elY9SaJ1wXKEx4nVFEFghudmmViATnjf0O6c3ivIzjRaJghAeN
yPDKfrkwSK7Og7F28TPK9wOAfJ8Cq9roNb1p/OmhG1Vt7EHMMVU/PJxx24FoUiO9
TQT1jPlV/3grmAKLWxwpt3nB0RtTI/p2SzKwwqD4YJy8xmi0VOBiNAA83NBVZmTB
KFT58+KDzO5xgIGHsFENjAsJMh9/nVkpAlXAVbFZlZMkyq/5yNZD0HYu5hb8Q0s8
JtMM8ioJinP/Nr9dbUWP6lj2jXNDdj3kJDREJXafRGH/dKnPHCsojuFf9NRNktuB
h6pgRme/M25erjOsTsn5twBmTlv29lE8c6/wW6i1Z73DAdczf1xkzPO+I3yM31qu
PNSxiQVS/LUchKu41umi6akb9afaH/qZ3OlwsqpxOVfZ7Ym9qak=
=bJUu
-----END PGP SIGNATURE-----

--gRd88EjuwcLlALst--