From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 0C4175A0272 for ; Sun, 7 Jan 2024 06:23:54 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1704605027; bh=UWuPUXHwWj9G8VgMPiXBH7JOWPMnxofBiufW9gTWwqE=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=i9AQ2GoySkcIpD68JfHOkKIgrHdOfoSd6OsDhSfCNLBE0fCN+R0YfFwFD01MADJXm 9gdx4QgdpWU9QsumR8i+d3+r79P2hrosDMYghDAHC0g8NIhZW2ABzcZSeofosR6LyX 2xMRAnH7hsrh6cHRtunKSVT8x+jnkEawSCr9OCrMhd9YdZn5JFCqOrQXW+ZLeQIGwV qWGiKdShzRF5HXiVXZM+3KyTCfUJ3w8XPDjgdF5eBuZ8qAz8QhPD45DzN6umTsALd7 kuqMBeFqJKu+sAQwYGYLtCWvGWPDGRygP5rz0VZYDJcf71UAn428AQeaLAPlKJwVwr 1jG8qohuSN3yQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4T75Fv6xprz4wd5; Sun, 7 Jan 2024 16:23:47 +1100 (AEDT) Date: Sun, 7 Jan 2024 16:20:33 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table Message-ID: References: <20231228192525.7ba1ee48@elisabeth> <20231230113304.37c60a9a@elisabeth> <20240102191341.7c91dd44@elisabeth> <20240105093335.0c725692@elisabeth> <20240105112754.75c765e3@elisabeth> <20240106140201.11bf25cd@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="qv4JauW4DfTXfqye" Content-Disposition: inline In-Reply-To: <20240106140201.11bf25cd@elisabeth> Message-ID-Hash: 3ZJJRM5KA2O3UXGE5LBF4ERK2ASLMC3H X-Message-ID-Hash: 3ZJJRM5KA2O3UXGE5LBF4ERK2ASLMC3H X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --qv4JauW4DfTXfqye Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Jan 06, 2024 at 02:02:01PM +0100, Stefano Brivio wrote: > On Sat, 6 Jan 2024 22:32:10 +1100 > David Gibson wrote: >=20 > > On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote: > > > On Fri, 5 Jan 2024 20:39:50 +1100 > > > David Gibson wrote: > > > =20 > > > > On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote: =20 > > > > > On Thu, 4 Jan 2024 21:02:19 +1100 > > > > > David Gibson wrote: > > > > > =20 > > > > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:= =20 > > > > > > > On Mon, 1 Jan 2024 23:01:17 +1100 > > > > > > > David Gibson wrote: > > > > > > > =20 > > > > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wr= ote: =20 > > > > > > > > > On Thu, 28 Dec 2023 19:25:25 +0100 > > > > > > > > > Stefano Brivio wrote: > > > > > > > > > =20 > > > > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100 > > > > > > > > > > > David Gibson wrote: > > > > > > > > > > >=20 > > > > > > > > > > > [...] =20 > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > I wonder if we really have to keep track of the number = of (non-)entries > > > > > > > > > > in the free "block", and if we have to be explicit abou= t the two cases. > > > > > > > > > >=20 > > > > > > > > > > I'm trying to find out if we can simplify the whole thi= ng with slightly > > > > > > > > > > different variants, for example: =20 > > > > > > > > >=20 > > > > > > > > > So... I think the version with (explicit) blocks has this= fundamental > > > > > > > > > advantage, on deletion: > > > > > > > > > =20 > > > > > > > > > > > + flow->f.type =3D FLOW_TYPE_NONE; > > > > > > > > > > > + /* Put it back in a length 1 free block, don't atte= mpt to fully reverse > > > > > > > > > > > + * flow_alloc()s steps. This will get folded toget= her the next time > > > > > > > > > > > + * flow_defer_handler runs anyway() */ > > > > > > > > > > > + flow->free.n =3D 1; > > > > > > > > > > > + flow->free.next =3D flow_first_free; > > > > > > > > > > > + flow_first_free =3D FLOW_IDX(flow); =20 > > > > > > > > >=20 > > > > > > > > > which is doable even without explicit blocks, but much ha= rder to > > > > > > > > > follow. =20 > > > > > > > >=20 > > > > > > > > Remember this is not a general deletion, only a "cancel" of= the most > > > > > > > > recent allocation. =20 > > > > > > >=20 > > > > > > > Oh, I thought that was only the case for this series and you = would use > > > > > > > that as actual deletion in another pending series (which I ha= ven't > > > > > > > finished reviewing yet). =20 > > > > > >=20 > > > > > > No. Not allowing deletion of any entry at any time is what I'm > > > > > > trading off to get both O(1) allocation and (effectively) O(1) > > > > > > deletion. > > > > > > =20 > > > > > > > But now I'm not sure anymore why I was thinking this... > > > > > > >=20 > > > > > > > Anyway... do we really need it, then? Can't we just mark the = "failed" > > > > > > > flows as whatever means "closed" for a specific protocol, and= clean > > > > > > > them up later, instead of calling cancel() right away? = =20 > > > > > >=20 > > > > > > We could, but I'm not sure we want to. For starters, that requ= ires > > > > > > protocol-specific behaviour whenever we need to back out an all= ocation > > > > > > like this. Not a big deal, since that's in protocol specific c= ode > > > > > > already, but I think it's uglier than calling cancel. > > > > > >=20 > > > > > > It also requires that the protocol specific deferred cleanup fu= nctions > > > > > > (e.g. tcp_flow_defer()) handle partially initialised entries. = With > > > > > > 'cancel' we can back out just the initialisation steps we've al= ready > > > > > > done (because we know where we've failed during init), then rem= ove the > > > > > > entry. The deferred cleanup function only needs to deal with > > > > > > "complete" entries. Again, certainly possible, but IMO uglier = than > > > > > > having 'cancel'. =20 > > > > >=20 > > > > > Okay, yes, I see now. > > > > >=20 > > > > > Another doubt that comes to me now is: if you don't plan to use t= his > > > > > alloc_cancel() thing anywhere else, the only reason why you are a= dding > > > > > it is to replace the (flow_count >=3D FLOW_MAX) check with a flow= _alloc() > > > > > version that can fail. > > > > >=20 > > > > > But at this point, speaking of ugliness, couldn't we just have a > > > > > bool flow_may_alloc() { return flow_first_free < FLOW_MAX }; the = caller > > > > > can use to decide to abort earlier? To me it looks so much simple= r and > > > > > more robust. =20 > > > >=20 > > > > Well, we could, but there are a couple of reasons I don't love it. > > > > The first is abstraction: this returns explicit handling of the lay= out > > > > of the table to the protocol specific callers. It's not a huge deal > > > > right now, but once we have 4 or 5 protocols doing this, having to > > > > change all of them if we make any tiny change to the semantics of > > > > flow_first_free isn't great. =20 > > >=20 > > > Hmm, I don't get the difference in terms of abstraction between > > > checking the return value of flow_alloc() and checking the return val= ue > > > of flow_may_alloc(). =20 > >=20 > > Oh, sorry, I thought you were proposing open-coding the check against > > FLOW_MAX, like it is at the moment. See below for comments on a > > flow_may_alloc() or similar call. > >=20 > > > In both cases the protocol handlers know that there's a table, a > > > function to reserve entries, and that reserving entries might fail... > > > and not much else. > > > =20 > > > > The other issue is that to do this (without a bunch of fairly large > > > > and ugly temporaries) means we'd populate at least some of the fiel= ds > > > > in flow_common before we have officially "allocated" the entry. At > > > > that point it becomes a bit fuzzy as to when that allocation really > > > > occurs. Is it when we do the FLOW_MAX tesT? =20 > > >=20 > > > I would say yes -- after that we can't fail. =20 > >=20 > > Uh.. we can't fail to allocate. We can fail for other reasons. >=20 > Yes, I meant, if we pass the FLOW_MAX check, then we can't fail to > allocate -- therefore flow_may_alloc() would contain just that check. Right, ok. > > > I mean, we work with rather constrained structures for a number of > > > reasons, which comes with a number of hard problems... let's at least > > > reap the benefits of it? =20 > >=20 > > I'm not sure what you're getting at here. >=20 > ...I'm saying that we don't actually allocate memory, which means that > we can have a simple test (on FLOW_MAX), and then we know that, at > least within the handling of a single epoll event, we will be able to > allocate memory, without possible races. >=20 > We couldn't do that with an actual heap allocation because that's out > of our control. >=20 > This is one of the few aspects where using statically allocated memory > only could make our lives easier (while in general we need to spend > more effort for other things). Ok, I understand. > > > > Is it when we write to > > > > f.type? Is it when we update flow_first_free? If we fail somewhere > > > > in the middle of that, what steps do we need to reverse? =20 > > >=20 > > > We can't fail in the middle of it, at the moment. =20 > >=20 > > Yes we can, that's kind of the whole point of this. >=20 > But as of this patch flow_alloc() can only fail on the FLOW_MAX check... I'm not talking about allocation failure, I'm talking about the other failures in setting up a new flow. We can fail between checking FLOW_MAX and "committing" to the allocation by updating flow_first_free (or whatever). Before the point we finally commit, we do want to write some fields of the flow. That's the obvious place to put the address from accept() for example. But there are some things we mustn't touch, because it will mess up how the slot is interpreted as a free slot (type, and anything in the protocol specific fields, because that will overlap struct flow_free_cluster. I don't like having this fragile intermediate state with subtle rules about what we can and can't safely do to the flow entry. With alloc cancel, we can do whatever we like as soon as we've alloc()ed, and on failure the cancel() will restore whatever free cluster information needs to be there. > > > Of course, if the > > > "allocation" function changes, we might need to change the scheme. But > > > is it really likely? And anyway it's just a few lines in your current > > > version... > > > =20 > > > > For those reasons I prefer the scheme presented. Fwiw, in an earli= er > > > > draft I did this differently with a "flow_prealloc()", which was > > > > essentially the check against FLOW_MAX, then a later > > > > flow_alloc_commit(). I thought it turned out pretty confusing > > > > compared to the alloc/cancel approach. =20 > > >=20 > > > The current flow_alloc_cancel() implementation is definitely simple a= nd > > > semantically clear. > > >=20 > > > What worries me a bit is that you would have two different cases for > > > free "blocks" of size one, depending on the order of the events. So if > > > we want to debug something like that and we see a block with size one > > > it might be a failed bind(), so a fake one, or also not: it might be = an > > > actual block with size one. =20 > >=20 > > The cluster of size one from cancel is still a valid free cluster that > > satisfies all the usual invaraints, it's not "fake". It does mean > > that we could get two contiguous free clusters, which we wouldn't > > otherwise. >=20 > Okay, yes, not having contiguous free clusters is probably not a > valuable invariant anyway. Right, that was my feeling. > > The idea is that they'll be merged back together on the > > next deferred scan, but as noted on a different subthread, that's not > > currently working and I'll have to fix it. >=20 > Yes yes that part is clear to me. I don't find it very problematic, or > in any way "wrong". >=20 > > > Thinking of multithreading: defining flow_may_alloc() becomes more > > > complicated because the caller can't just assume the "allocation" will > > > succeed (as long as we don't cross an "epoll cycle" or something like > > > that). But we'll probably need some form of locking or userspace RCU > > > giving us barriers around the pair may_alloc() / alloc(). =20 > >=20 > > For multi-threaded, I really thin we'd want alloc/free semantics, not > > may_alloc/alloc semantics. We have to change state in some way at > > "may alloc" time, or something else could try to allocate the same > > slot. "cancel" semantics also don't make sense here, because we can > > no longer be confident that the alloc we did above is still the most > > recent allloc. So a bunch of things would need to change for > > multi-threading. >=20 > By the way, another possibility would be to just go ahead and call > socket() and bind(), or accept() the connection from the socket, then if > we fail to "allocate" a flow we'll close the socket. That doesn't solve > the synchronisation problem entirely but it makes it simpler to handle. We could. But (a) the may-alloc check is much cheaper than the syscalls and (b) that means we need to store all information from those syscalls temporarily before copying it into the flow table, which is kind of nasty. > Now, a bound socket or an accepted connection is visible to the user > which is (I guess) the reason why you want to avoid this, but if we > can't open/accept new connections it's getting pretty bad anyway... so > should we really care? Right, that's certainly not my concern about it. > Actually, calling accept() and then close() on a socket (peer gets RST) > is probably a nicer behaviour than letting a peer hang because we ran > out of slots. Maybe, maybe not. If we are persistently overloaded, then yes. However if we just have a sudden spike of short lived connections, then we could just "bounce" against the flow limit, but we'll still process those other connection attempts in a little bit. > > > If we stick to the failing alloc(), this part is simpler, but the > > > interpretation of flow_first_free and block sizes becomes non-trivial= =2E =20 > >=20 > > Again, not sure what you're getting at. >=20 > By "failing alloc()" I mean the alloc/cancel semantics, as opposed to > an allocation function that can't fail. That bit I got. > There, as you mentioned, we can no longer be confident that the > canceled allocation would be the most recent one (which changes the > meaning of flow_first_free). If we're multithreading, then I think this allocation scheme needs some revision anyway. > > > Well, on the other hand, it's all simple enough that we can change it > > > as needed (for example for multithreading). If we can hope that the n= ew > > > scheme is reasonably low on bugs and we'll probably never have to gue= ss > > > why a block has size one, I'm fine with the failing alloc() as well. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --qv4JauW4DfTXfqye Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmWaNKAACgkQzQJF27ox 2GcvZQ//R3GoGqrvvWWKUqlskkshhgk9MDBhIX6PMMX9Mcbxnpk5TS77JIDsHR8Y foUy49KVq8UBwD3mSoU6YxtXhEAq3Csj3zqRIZWKhWSAGZK1/zXBxXRgVWK8mesY OB4SQNmvlzMLb7GwGS26pdnYbh/WdW0a3Hyqc9/4H3NHifGY3XxqJBZHa9nx+MBR d8MelJXpDGluKeOSN/stkYuht0AZ99fXsR7/XTn8Dim2YNfzEUWry7vlEqMhHVMv lHCrOGLbri0jMTFaolic+pQ8P1dpbL/uHqdlnjGOAiOZqglWiLtDvZaA4o7Q5i7p vgAT5hm4VPSUiooEGwFBGQTMhNaOxAcrvLkiyKWMacCP592i3MfAGjO0OtOfCcQt VwJZ02WJX/pRYZkLRCk9JLttLtu+1JISctQsqKxd22bzYAX7yWlO3bks7q1GRWwL mjcaUWD8rTfQT/3g7v1EsyQKqAJVaZsk1xQbBt5mHY2z1ArinlD6nKaOK5M0PTQX y9p5QxsCuk/f04q4CWckGK6sI5IyHMh98h6VoATYoEgh1Ygq9Gv/oCCQP8bQRVWg +QYBDqAcT3O/rT7ukBguaPqYNALvXCNCCyWlSZPZ/Rc7G4WgeTpan+0KRgca/nny hySoVgESn5yFC+nlXeW/evRJGaYYbb5uWQf2vIcLOaZTqKXFW6Y= =4oY0 -----END PGP SIGNATURE----- --qv4JauW4DfTXfqye--