From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id D56BE5A026F for ; Mon, 1 Jan 2024 13:01:27 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1704110483; bh=QyWVVbmmWui9iMQE8I0SchTruZ86ryoHOyp+MfMCEfY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=NgmFOoGBRDrP8nGiFPYaCXOGalieF6NqWUbDT4Q2M2/hfYofd2BI+Z5oCkdFkvNPE u/afFoW7tDxcTft0cb6AKjyyBboPNbAcrZOiltB5zjYU5j3M3m+ZpskoPiAE/9J+Rx QD6JnzLNNQ7wD/HWHW2haimA1jZfm9vN7HeZK4gVHda1R6znXSDIOUBt2Zbq+9x2CM vFANQ+6YSsep7affpFTvr8cy1LjFtQ8ZEMwAwSiOiBbDUm7IdlsKGMIvRNCE/7f/+n F57Q4eTKudRCTe6ffyH35bki1K97yk1QEpjNVKqHp9r0sdqW8jS4WmqjjjMEI3SIr5 s03tSOZiSsyxA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4T3ZMR48ZCz4wd4; Mon, 1 Jan 2024 23:01:23 +1100 (AEDT) Date: Mon, 1 Jan 2024 21:44:54 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table Message-ID: References: <20231221061549.976358-1-david@gibson.dropbear.id.au> <20231221061549.976358-14-david@gibson.dropbear.id.au> <20231228192525.7ba1ee48@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="2FJt/E/x7hlNGObX" Content-Disposition: inline In-Reply-To: <20231228192525.7ba1ee48@elisabeth> Message-ID-Hash: JCWI4DMOP42K6AP5KWNQCCNAK76QDW32 X-Message-ID-Hash: JCWI4DMOP42K6AP5KWNQCCNAK76QDW32 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --2FJt/E/x7hlNGObX Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote: > On Thu, 21 Dec 2023 17:15:49 +1100 > David Gibson wrote: >=20 > > Currently we always keep the flow table maximally compact: that is all = the > > active entries are contiguous at the start of the table. Doing this > > sometimes requires moving an entry when one is freed. That's kind of > > fiddly, and potentially expensive: it requires updating the hash table = for > > the new location, and depending on flow type, it may require EPOLL_CTL_= MOD, > > system calls to update epoll tags with the new location too. > >=20 > > Implement a new way of managing the flow table that doesn't ever move > > entries. It attempts to maintain some compactness by always using the > > first free slot for a new connection, and mitigates the effect of non > > compactness by cheaply skipping over contiguous blocks of free entries. > > See the "theory of operation" comment in flow.c for details. > >=20 > > Signed-off-by: David Gibson > > --- > > flow.c | 177 +++++++++++++++++++++++++++++++++++++-------------- > > flow.h | 1 + > > flow_table.h | 16 ++++- > > passt.c | 2 + > > tcp.c | 23 ------- > > tcp_conn.h | 3 - > > tcp_splice.c | 11 ---- > > 7 files changed, 146 insertions(+), 87 deletions(-) > >=20 > > diff --git a/flow.c b/flow.c > > index 7b99b58..421e6b5 100644 > > --- a/flow.c > > +++ b/flow.c > > @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) =3D=3D FLOW_N= UM_TYPES, > > "flow_type_str[] doesn't match enum flow_type"); > > =20 > > /* Global Flow Table */ > > -unsigned flow_count; > > +unsigned flow_first_free; >=20 > As mentioned in the comment to 10/13: this should be initialised to > zero. I'm pretty sure it already is.. > > union flow flowtab[FLOW_MAX]; > > =20 > > /* Last time the flow timers ran */ > > @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri,= const char *fmt, ...) > > logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); > > } > > =20 > > +/** >=20 > As a general comment: interesting -- I wonder if this already has a name, > but I couldn't find anything similar in the literature. Perhaps it's > equivalent to a flattened search tree where free/busy slots can be > found by following the links. Yeah, I'm not aware of this technique being used elsewhere, though I can't rule it out, of course. > > + * DOC: Theory of Operation - allocation and freeing of flow entries >=20 > "allocating and freeing flow entries" ...flows better for me. Good idea, changed. > I think this comment should be before the declaration of flowtab[] -- > above here, with a function in between, is not exactly where I expected > to find it as I started reading the next paragraph. Fair enough, moved. > > + * > > + * Each flow takes a single slot in flowtab[]. Moving entries in that= table >=20 > You jump right away to "moving entries", we know why, but I guess any > other reader would be lost at this point. >=20 > In general, it took me some effort to grasp this paragraph. What about: >=20 > * Flows are entries in the flowtab[]. We need to routinely scan the whol= e table > * to perform deferred bookkeeping tasks on active entries, and sparse em= pty > * slots waste time and worsen data locality. > * > * But keeping the table compacted by moving entries on deletion is fiddl= y: it > * requires updating hash tables, and the epoll references to the flows. = Instead, > * we implement the compromise described below. >=20 > ? Much better, thanks > > + * (which we used to do) is fiddly and possibly expensive: it requires= updating > > + * the hash table indexing flows, and may require updating epoll data = which > > + * references the flow by index. However, we also want to keep the ac= tive > > + * entries in the table compact where possible, because otherwise scan= ning > > + * through the entire table becomes more expensive. This describes the > > + * compromise implemented below. > > + * > > + * Free blocks > > + * A "free block" is a contiguous sequence of unused (FLOW_TYPE_NON= E) entries >=20 > This gave me the idea that blocks are some kind of "bucket" with a fixed > size. I think we should either mention that the blocks have variable > size, or call them sequences, or clusters (of free slots) perhaps? Fair point. I've changed everything to the term "free cluster". > > + * in flowtab. The first entry in each block contains metadata, sp= ecifically > > + * the number of entries in the block, and the index of the next (n= on > > + * contiguous) free block (struct flow_free_block). >=20 > ...it doesn't waste any space, but I don't think we actually need to > store the number of "entries" (slots?)... more details below. I haven't seen a way to avoid it, while maintaining the properties we want, but I'm open to suggestions. > > + * > > + * Free block list > > + * flow_first_free gives the index of the first entry of the first = (lowest > > + * index) free block. Each free block has the index of the next fr= ee block, > > + * or MAX_FLOW if it is the last free block. Together these form a= linked > > + * list of free blocks, in strictly increasing order of index. > > + * > > + * Allocation > > + * When allocating a new flow, we always use the first entry of the= first > > + * free block, that is, at index flow_first_free. If the block has= more than > > + * one entry, flow_first_free is updated to the next entry, which i= s updated > > + * to represent the new smaller free block. Otherwise the free blo= ck is > > + * eliminated and flow_first_free is updated to the next free block. > > + * > > + * Scanning the table > > + * Theoretically, scanning the table requires FLOW_MAX iterations. = However, > > + * when we encounter the start of a free block, we can immediately = skip to > > + * its end, meaning that in practice we only need (number of active >=20 > after its end better yet, "...immediately skip past it,". > > + * connections) + (number of free blocks) iterations. > > + * > > + * Freeing > > + * We can only free entries when scanning the whole flow table in > > + * flow_defer_handler(). This is what lets us maintain the fee blo= ck list in >=20 > s/fee/free/ Fixed. > > + * index sorted order. As we scan we keep track of whether the pre= vious > > + * entry was in a free block or not. If so when an entry is freed = (its > > + * deferred handler returns 'true'), we add it to that free block. = Otherwise > > + * we create a new free block for it and chain it to the last free = block we > > + * scanned. > > + */ > > + > > /** > > * flow_alloc() - Allocate a new flow > > * > > @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pr= i, const char *fmt, ...) > > */ > > union flow *flow_alloc(void) > > { > > - if (flow_count >=3D FLOW_MAX) > > + union flow *flow =3D &flowtab[flow_first_free]; > > + > > + if (flow_first_free >=3D FLOW_MAX) > > return NULL; > > =20 > > - return &flowtab[flow_count++]; > > + ASSERT(flow->f.type =3D=3D FLOW_TYPE_NONE); > > + ASSERT(flow->free.n >=3D 1); > > + > > + if (flow->free.n > 1) { > > + /* Use one entry from the block */ > > + union flow *next =3D &flowtab[++flow_first_free]; >=20 > ...but we just checked (flow_first_free < FLOW_MAX) above. Now you > increment flow_first_free by one. >=20 > I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n > should be 0, so we shouldn't end up in this case -- but it's a bit > confusing. That's correct, except that it's flow->free.n =3D=3D 1, not 0 (we're checking for strictly > 1 above). If it's not, that's a free cluster claiming it extends past the end of the table. I've added ASSERT(flow_first_free + flow->free.n <=3D FLOW_MAX); Above the if to try to clarify this. > > + > > + ASSERT(FLOW_IDX(next) < FLOW_MAX); > > + ASSERT(next->f.type =3D=3D FLOW_TYPE_NONE); > > + ASSERT(next->free.n =3D=3D 0); > > + > > + next->free.n =3D flow->free.n - 1; > > + next->free.next =3D flow->free.next; > > + } else { > > + /* Use the entire block */ > > + flow_first_free =3D flow->free.next; > > + } >=20 > I wonder if we really have to keep track of the number of (non-)entries > in the free "block", and if we have to be explicit about the two cases. >=20 > I'm trying to find out if we can simplify the whole thing with slightly > different variants, for example: >=20 > struct flow_free_block { > [...] > unsigned next_free; > unsigned next_busy; > }; >=20 > or: >=20 > struct flow_free_block { > [...] > unsigned next_free; > }; >=20 > or even some sort of double-linked list: >=20 > struct flow_free_block { > [...] > unsigned prev_free; > unsigned next_free; > }; >=20 > but I couldn't quite find a more elegant solution yet. >=20 > > + > > + memset(flow, 0, sizeof(*flow)); > > + return flow; > > } > > =20 > > /** > > @@ -70,48 +137,15 @@ union flow *flow_alloc(void) > > */ > > void flow_alloc_cancel(union flow *flow) > > { > > - ASSERT(FLOW_IDX(flow) =3D=3D flow_count - 1); > > - memset(flow, 0, sizeof(*flow)); > > - flow_count--; > > -} > > - > > -/** > > - * flow_table_compact() - Perform compaction on flow table > > - * @c: Execution context > > - * @hole: Pointer to recently closed flow > > - */ > > -static void flow_table_compact(const struct ctx *c, union flow *hole) > > -{ > > - union flow *from; > > - > > - if (FLOW_IDX(hole) =3D=3D --flow_count) { > > - debug("flow: table compaction: maximum index was %u (%p)", > > - FLOW_IDX(hole), (void *)hole); > > - memset(hole, 0, sizeof(*hole)); > > - return; > > - } > > - > > - from =3D flowtab + flow_count; > > - memcpy(hole, from, sizeof(*hole)); > > - > > - switch (from->f.type) { > > - case FLOW_TCP: > > - tcp_tap_conn_update(c, &from->tcp, &hole->tcp); > > - break; > > - case FLOW_TCP_SPLICE: > > - tcp_splice_conn_update(c, &hole->tcp_splice); > > - break; > > - default: > > - die("Unexpected %s in tcp_table_compact()", > > - FLOW_TYPE(&from->f)); > > - } > > - > > - debug("flow: table compaction (%s): old index %u, new index %u, " > > - "from: %p, to: %p", > > - FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole), > > - (void *)from, (void *)hole); > > - > > - memset(from, 0, sizeof(*from)); > > + ASSERT(flow_first_free > FLOW_IDX(flow)); > > + > > + flow->f.type =3D FLOW_TYPE_NONE; > > + /* Put it back in a length 1 free block, don't attempt to fully rever= se > > + * flow_alloc()s steps. This will get folded together the next time > > + * flow_defer_handler runs anyway() */ > > + flow->free.n =3D 1; > > + flow->free.next =3D flow_first_free; > > + flow_first_free =3D FLOW_IDX(flow); > > } > > =20 > > /** > > @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *= c, union flow *hole) > > */ > > void flow_defer_handler(const struct ctx *c, const struct timespec *no= w) > > { > > + struct flow_free_block *free_head =3D NULL; > > + unsigned *last_next =3D &flow_first_free; > > bool timer =3D false; > > - union flow *flow; > > + unsigned idx; > > =20 > > if (timespec_diff_ms(now, &flow_timer_run) >=3D FLOW_TIMER_INTERVAL) { > > timer =3D true; > > flow_timer_run =3D *now; > > } > > =20 > > - for (flow =3D flowtab + flow_count - 1; flow >=3D flowtab; flow--) { > > + for (idx =3D 0; idx < FLOW_MAX; idx++) { > > + union flow *flow =3D &flowtab[idx]; > > bool closed =3D false; > > =20 > > + if (flow->f.type =3D=3D FLOW_TYPE_NONE) { > > + /* Start of a free block */ > > + free_head =3D &flow->free; > > + *last_next =3D idx; > > + last_next =3D &free_head->next; > > + /* Skip the rest of the block */ > > + idx +=3D free_head->n - 1; > > + continue; > > + } > > + > > + =09 >=20 > Stray tabs. >=20 > > switch (flow->f.type) { > > + case FLOW_TYPE_NONE: > > + closed =3D true; > > + break; > > case FLOW_TCP: > > closed =3D tcp_flow_defer(flow); > > break; > > @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const= struct timespec *now) > > ; > > } > > =20 > > - if (closed) > > - flow_table_compact(c, flow); > > + if (closed) { >=20 > Should this be a flow_del() function perhaps, possibly taking flow and > free_head as argument? I suppose it could, but since it's not that long I'd prefer to inline it to emphasise the fact that for the tracking scheme to work, deletions *must* be done in the context of scanning the full table. >=20 > > + flow->f.type =3D FLOW_TYPE_NONE; > > + > > + if (free_head) { > > + /* Add slot to current free block */ > > + ASSERT(idx =3D=3D FLOW_IDX(free_head) + free_head->n); > > + free_head->n++; > > + flow->free.n =3D flow->free.next =3D 0; > > + } else { > > + /* Create new free block */ > > + free_head =3D &flow->free; > > + free_head->n =3D 1; > > + *last_next =3D idx; > > + last_next =3D &free_head->next; > > + } > > + } else { > > + free_head =3D NULL; > > + } > > } > > + > > + *last_next =3D FLOW_MAX; > > +} > > + > > +/** > > + * flow_init() - Initialise flow related data structures > > + */ > > +void flow_init(void) > > +{ > > + /* Initial state is a single free block containing the whole table */ > > + flowtab[0].free.n =3D FLOW_MAX; > > + flowtab[0].free.next =3D FLOW_MAX; > > } > > diff --git a/flow.h b/flow.h > > index 8064f0e..48a0ab4 100644 > > --- a/flow.h > > +++ b/flow.h > > @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_s= idx_t b) > > =20 > > union flow; > > =20 > > +void flow_init(void); > > void flow_defer_handler(const struct ctx *c, const struct timespec *no= w); > > =20 > > void flow_log_(const struct flow_common *f, int pri, const char *fmt, = =2E..) > > diff --git a/flow_table.h b/flow_table.h > > index 2773a2b..34fc679 100644 > > --- a/flow_table.h > > +++ b/flow_table.h > > @@ -9,6 +9,19 @@ > > =20 > > #include "tcp_conn.h" > > =20 > > +/** > > + * struct flow_free_block - Information about a block of free entries > > + * @f: Generic flow information > > + * @n: Number of entries in this free block (including this one) > > + * @next: Index of next free block > > + */ > > +struct flow_free_block { > > + /* Must be first element */ > > + struct flow_common f; > > + unsigned n; > > + unsigned next; > > +}; > > + > > /** > > * union flow - Descriptor for a logical packet flow (e.g. connection) > > * @f: Fields common between all variants > > @@ -17,12 +30,13 @@ > > */ > > union flow { > > struct flow_common f; > > + struct flow_free_block free; > > struct tcp_tap_conn tcp; > > struct tcp_splice_conn tcp_splice; > > }; > > =20 > > /* Global Flow Table */ > > -extern unsigned flow_count; > > +extern unsigned flow_first_free; > > extern union flow flowtab[]; > > =20 > > =20 > > diff --git a/passt.c b/passt.c > > index 71bea8f..d315438 100644 > > --- a/passt.c > > +++ b/passt.c > > @@ -285,6 +285,8 @@ int main(int argc, char **argv) > > =20 > > clock_gettime(CLOCK_MONOTONIC, &now); > > =20 > > + flow_init(); > > + > > if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c))) > > exit(EXIT_FAILURE); > > =20 > > diff --git a/tcp.c b/tcp.c > > index a38deb0..6aacb56 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c, > > tc_hash[b] =3D FLOW_SIDX_NONE; > > } > > =20 > > -/** > > - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the= table > > - * @c: Execution context > > - * @old: Old location of tcp_tap_conn > > - * @new: New location of tcp_tap_conn > > - */ > > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old, > > - struct tcp_tap_conn *new) > > - > > -{ > > - unsigned b =3D tcp_hash_probe(c, old); > > - > > - if (!flow_at_sidx(tc_hash[b])) > > - return; /* Not in hash table, nothing to update */ > > - > > - tc_hash[b] =3D FLOW_SIDX(new, TAPSIDE); > > - > > - debug("TCP: hash table update: old index %u, new index %u, sock %i, " > > - "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b); > > - > > - tcp_epoll_ctl(c, new); > > -} > > - > > /** > > * tcp_hash_lookup() - Look up connection given remote address and por= ts > > * @c: Execution context > > diff --git a/tcp_conn.h b/tcp_conn.h > > index 636224e..a5f5cfe 100644 > > --- a/tcp_conn.h > > +++ b/tcp_conn.h > > @@ -155,9 +155,6 @@ struct tcp_splice_conn { > > extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > > extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > > =20 > > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old, > > - struct tcp_tap_conn *new); > > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_con= n *new); > > bool tcp_flow_defer(union flow *flow); > > bool tcp_splice_flow_defer(union flow *flow); > > void tcp_splice_timer(const struct ctx *c, union flow *flow); > > diff --git a/tcp_splice.c b/tcp_splice.c > > index 0711b19..7aae623 100644 > > --- a/tcp_splice.c > > +++ b/tcp_splice.c > > @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, str= uct tcp_splice_conn *conn, > > } while (0) > > =20 > > =20 > > -/** > > - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved = in the table > > - * @c: Execution context > > - * @new: New location of tcp_splice_conn > > - */ > > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_con= n *new) > > -{ > > - if (tcp_splice_epoll_ctl(c, new)) > > - conn_flag(c, new, CLOSING); > > -} > > - > > /** > > * tcp_splice_flow_defer() - Deferred per-flow handling (clean up clos= ed) > > * @flow: Flow table entry for this connection >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --2FJt/E/x7hlNGObX Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmWSl5IACgkQzQJF27ox 2Gcn4Q//enHy+Ymp6DN72e8nR3n+SDG9uomY9M/men9pkque0p2pSDz5ctxr8022 AVcJExmQa3XR2GT6IZSlDBr/L0mDvf/UtqL9A48FBJsx6oaFxko9g9U4BQTAksBu JIzzTFYd9374VgfexFR2Uo3aM1u6W/60XQSDaaAHNM4DSJZ+SR4NosMporZtpzzj VEOzZEjktNWykXYt6UMZrJ/Py7CNqcRw8XUhDqrki2Ix6N86gIK9DZWx4QMTCGJg luOaRC7Q7UJleQWC4Fzbfz7Nxo2lToYLjNbokd1ZvWTr7vv0r+cRH+pnHh/JLMTA ut37k2txbJQd4K1+eemezSBGuF6/t+37FQb/9nNHPXP7RQSmSv9YLmoV0inA0qZc Pnh4WH1w0NGBK/P3VqKQzTIyv4gMp6xvxaYY6JbhoCd9m4MgYdHynhOPAaUEoQh3 S/c2J/XKB6e+gdW+sS16Yd91MIyibE3TYhJmBpI1JVfUzFvILzXHUYVAv0TIDTW9 0n6CBcoZuuph//UxMIoMq7yQgzTNPGS2BaU6l6LSW4x5KyWFO2H5rP424BgTeGaE SNzrNMBCrMz6GaS/ErWCmE1KNqgE3d5S75vSV5zwPOhrgAJWS6INRdN3Wi3o3eWx xV1NYP9mYFeuudY4eDPyrpKnXC9alXIg/c2iOM62sb/BevfGZo4= =NlIn -----END PGP SIGNATURE----- --2FJt/E/x7hlNGObX--