From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
Date: Mon, 1 Jan 2024 21:44:54 +1100
Message-ID: <ZZKXptlHUziT4ngN@zatzit>
In-Reply-To: <20231228192525.7ba1ee48@elisabeth>
On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:49 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Currently we always keep the flow table maximally compact: that is all the
> > active entries are contiguous at the start of the table. Doing this
> > sometimes requires moving an entry when one is freed. That's kind of
> > fiddly, and potentially expensive: it requires updating the hash table for
> > the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
> > system calls to update epoll tags with the new location too.
> >
> > Implement a new way of managing the flow table that doesn't ever move
> > entries. It attempts to maintain some compactness by always using the
> > first free slot for a new connection, and mitigates the effect of non
> > compactness by cheaply skipping over contiguous blocks of free entries.
> > See the "theory of operation" comment in flow.c for details.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > flow.c | 177 +++++++++++++++++++++++++++++++++++++--------------
> > flow.h | 1 +
> > flow_table.h | 16 ++++-
> > passt.c | 2 +
> > tcp.c | 23 -------
> > tcp_conn.h | 3 -
> > tcp_splice.c | 11 ----
> > 7 files changed, 146 insertions(+), 87 deletions(-)
> >
> > diff --git a/flow.c b/flow.c
> > index 7b99b58..421e6b5 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
> > "flow_type_str[] doesn't match enum flow_type");
> >
> > /* Global Flow Table */
> > -unsigned flow_count;
> > +unsigned flow_first_free;
>
> As mentioned in the comment to 10/13: this should be initialised to
> zero.
I'm pretty sure it already is: as a global, it's zero-initialised by default.
> > union flow flowtab[FLOW_MAX];
> >
> > /* Last time the flow timers ran */
> > @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> > }
> >
> > +/**
>
> As a general comment: interesting -- I wonder if this already has a name,
> but I couldn't find anything similar in the literature. Perhaps it's
> equivalent to a flattened search tree where free/busy slots can be
> found by following the links.
Yeah, I'm not aware of this technique being used elsewhere, though I
can't rule it out, of course.
> > + * DOC: Theory of Operation - allocation and freeing of flow entries
>
> "allocating and freeing flow entries" ...flows better for me.
Good idea, changed.
> I think this comment should be before the declaration of flowtab[] --
> above here, with a function in between, is not exactly where I expected
> to find it as I started reading the next paragraph.
Fair enough, moved.
> > + *
> > + * Each flow takes a single slot in flowtab[]. Moving entries in that table
>
> You jump right away to "moving entries", we know why, but I guess any
> other reader would be lost at this point.
>
> In general, it took me some effort to grasp this paragraph. What about:
>
> * Flows are entries in the flowtab[]. We need to routinely scan the whole table
> * to perform deferred bookkeeping tasks on active entries, and sparse empty
> * slots waste time and worsen data locality.
> *
> * But keeping the table compacted by moving entries on deletion is fiddly: it
> * requires updating hash tables, and the epoll references to the flows. Instead,
> * we implement the compromise described below.
>
> ?
Much better, thanks
> > + * (which we used to do) is fiddly and possibly expensive: it requires updating
> > + * the hash table indexing flows, and may require updating epoll data which
> > + * references the flow by index. However, we also want to keep the active
> > + * entries in the table compact where possible, because otherwise scanning
> > + * through the entire table becomes more expensive. This describes the
> > + * compromise implemented below.
> > + *
> > + * Free blocks
> > + * A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
>
> This gave me the idea that blocks are some kind of "bucket" with a fixed
> size. I think we should either mention that the blocks have variable
> size, or call them sequences, or clusters (of free slots) perhaps?
Fair point. I've changed everything to the term "free cluster".
> > + * in flowtab. The first entry in each block contains metadata, specifically
> > + * the number of entries in the block, and the index of the next (non
> > + * contiguous) free block (struct flow_free_block).
>
> ...it doesn't waste any space, but I don't think we actually need to
> store the number of "entries" (slots?)... more details below.
I haven't seen a way to avoid it, while maintaining the properties we
want, but I'm open to suggestions.
> > + *
> > + * Free block list
> > + * flow_first_free gives the index of the first entry of the first (lowest
> > + * index) free block. Each free block has the index of the next free block,
> > + * or MAX_FLOW if it is the last free block. Together these form a linked
> > + * list of free blocks, in strictly increasing order of index.
> > + *
> > + * Allocation
> > + * When allocating a new flow, we always use the first entry of the first
> > + * free block, that is, at index flow_first_free. If the block has more than
> > + * one entry, flow_first_free is updated to the next entry, which is updated
> > + * to represent the new smaller free block. Otherwise the free block is
> > + * eliminated and flow_first_free is updated to the next free block.
> > + *
> > + * Scanning the table
> > + * Theoretically, scanning the table requires FLOW_MAX iterations. However,
> > + * when we encounter the start of a free block, we can immediately skip to
> > + * its end, meaning that in practice we only need (number of active
>
> after its end
Better yet: "...immediately skip past it,".
> > + * connections) + (number of free blocks) iterations.
> > + *
> > + * Freeing
> > + * We can only free entries when scanning the whole flow table in
> > + * flow_defer_handler(). This is what lets us maintain the fee block list in
>
> s/fee/free/
Fixed.
> > + * index sorted order. As we scan we keep track of whether the previous
> > + * entry was in a free block or not. If so when an entry is freed (its
> > + * deferred handler returns 'true'), we add it to that free block. Otherwise
> > + * we create a new free block for it and chain it to the last free block we
> > + * scanned.
> > + */
> > +
> > /**
> > * flow_alloc() - Allocate a new flow
> > *
> > @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > */
> > union flow *flow_alloc(void)
> > {
> > - if (flow_count >= FLOW_MAX)
> > + union flow *flow = &flowtab[flow_first_free];
> > +
> > + if (flow_first_free >= FLOW_MAX)
> > return NULL;
> >
> > - return &flowtab[flow_count++];
> > + ASSERT(flow->f.type == FLOW_TYPE_NONE);
> > + ASSERT(flow->free.n >= 1);
> > +
> > + if (flow->free.n > 1) {
> > + /* Use one entry from the block */
> > + union flow *next = &flowtab[++flow_first_free];
>
> ...but we just checked (flow_first_free < FLOW_MAX) above. Now you
> increment flow_first_free by one.
>
> I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
> should be 0, so we shouldn't end up in this case -- but it's a bit
> confusing.
That's correct, except that it's flow->free.n == 1, not 0 (we're
checking for strictly > 1 above). If it's not, that's a free cluster
claiming it extends past the end of the table.
I've added

	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);

above the if to try to clarify this.
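To make the invariant concrete, here's a hedged standalone model of the
allocation fast path (simplified types and names echoing the patch, but not
the actual union flow): if the assert holds, then n > 1 guarantees the slot
at flow_first_free + 1 exists.

```c
#include <assert.h>
#include <string.h>

/* Simplified model, not the patch's real types */
#define FLOW_MAX 8
#define FLOW_TYPE_NONE 0

struct slot {
	int type;
	unsigned n;	/* free-cluster length, valid at the head */
	unsigned next;	/* index of the next free cluster */
};

static struct slot flowtab[FLOW_MAX];
static unsigned flow_first_free;

static void flow_init(void)
{
	/* Initial state: one free cluster spanning the whole table */
	flowtab[0].n = FLOW_MAX;
	flowtab[0].next = FLOW_MAX;
}

static struct slot *flow_alloc(void)
{
	struct slot *flow = &flowtab[flow_first_free];

	if (flow_first_free >= FLOW_MAX)
		return NULL;

	/* A cluster may never claim to extend past the table's end, so
	 * when n > 1 the slot after flow_first_free must exist */
	assert(flow_first_free + flow->n <= FLOW_MAX);

	if (flow->n > 1) {
		/* Shrink the cluster: its second slot becomes the head */
		struct slot *next = &flowtab[++flow_first_free];

		assert(next->type == FLOW_TYPE_NONE);
		next->n = flow->n - 1;
		next->next = flow->next;
	} else {
		/* Single-slot cluster is consumed whole */
		flow_first_free = flow->next;
	}

	memset(flow, 0, sizeof(*flow));
	flow->type = 1;		/* caller would set the real flow type */
	return flow;
}
```

Allocating all FLOW_MAX slots leaves flow_first_free == FLOW_MAX, at which
point the next flow_alloc() fails cleanly.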
> > +
> > + ASSERT(FLOW_IDX(next) < FLOW_MAX);
> > + ASSERT(next->f.type == FLOW_TYPE_NONE);
> > + ASSERT(next->free.n == 0);
> > +
> > + next->free.n = flow->free.n - 1;
> > + next->free.next = flow->free.next;
> > + } else {
> > + /* Use the entire block */
> > + flow_first_free = flow->free.next;
> > + }
>
> I wonder if we really have to keep track of the number of (non-)entries
> in the free "block", and if we have to be explicit about the two cases.
>
> I'm trying to find out if we can simplify the whole thing with slightly
> different variants, for example:
>
> struct flow_free_block {
> [...]
> unsigned next_free;
> unsigned next_busy;
> };
>
> or:
>
> struct flow_free_block {
> [...]
> unsigned next_free;
> };
>
> or even some sort of double-linked list:
>
> struct flow_free_block {
> [...]
> unsigned prev_free;
> unsigned next_free;
> };
>
> but I couldn't quite find a more elegant solution yet.
>
> > +
> > + memset(flow, 0, sizeof(*flow));
> > + return flow;
> > }
> >
> > /**
> > @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
> > */
> > void flow_alloc_cancel(union flow *flow)
> > {
> > - ASSERT(FLOW_IDX(flow) == flow_count - 1);
> > - memset(flow, 0, sizeof(*flow));
> > - flow_count--;
> > -}
> > -
> > -/**
> > - * flow_table_compact() - Perform compaction on flow table
> > - * @c: Execution context
> > - * @hole: Pointer to recently closed flow
> > - */
> > -static void flow_table_compact(const struct ctx *c, union flow *hole)
> > -{
> > - union flow *from;
> > -
> > - if (FLOW_IDX(hole) == --flow_count) {
> > - debug("flow: table compaction: maximum index was %u (%p)",
> > - FLOW_IDX(hole), (void *)hole);
> > - memset(hole, 0, sizeof(*hole));
> > - return;
> > - }
> > -
> > - from = flowtab + flow_count;
> > - memcpy(hole, from, sizeof(*hole));
> > -
> > - switch (from->f.type) {
> > - case FLOW_TCP:
> > - tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> > - break;
> > - case FLOW_TCP_SPLICE:
> > - tcp_splice_conn_update(c, &hole->tcp_splice);
> > - break;
> > - default:
> > - die("Unexpected %s in tcp_table_compact()",
> > - FLOW_TYPE(&from->f));
> > - }
> > -
> > - debug("flow: table compaction (%s): old index %u, new index %u, "
> > - "from: %p, to: %p",
> > - FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> > - (void *)from, (void *)hole);
> > -
> > - memset(from, 0, sizeof(*from));
> > + ASSERT(flow_first_free > FLOW_IDX(flow));
> > +
> > + flow->f.type = FLOW_TYPE_NONE;
> > + /* Put it back in a length 1 free block, don't attempt to fully reverse
> > + * flow_alloc()s steps. This will get folded together the next time
> > + * flow_defer_handler runs anyway() */
> > + flow->free.n = 1;
> > + flow->free.next = flow_first_free;
> > + flow_first_free = FLOW_IDX(flow);
> > }
> >
> > /**
> > @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
> > */
> > void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > {
> > + struct flow_free_block *free_head = NULL;
> > + unsigned *last_next = &flow_first_free;
> > bool timer = false;
> > - union flow *flow;
> > + unsigned idx;
> >
> > if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> > timer = true;
> > flow_timer_run = *now;
> > }
> >
> > - for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > + for (idx = 0; idx < FLOW_MAX; idx++) {
> > + union flow *flow = &flowtab[idx];
> > bool closed = false;
> >
> > + if (flow->f.type == FLOW_TYPE_NONE) {
> > + /* Start of a free block */
> > + free_head = &flow->free;
> > + *last_next = idx;
> > + last_next = &free_head->next;
> > + /* Skip the rest of the block */
> > + idx += free_head->n - 1;
> > + continue;
> > + }
> > +
> > +
>
> Stray tabs.
>
> > switch (flow->f.type) {
> > + case FLOW_TYPE_NONE:
> > + closed = true;
> > + break;
> > case FLOW_TCP:
> > closed = tcp_flow_defer(flow);
> > break;
> > @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > ;
> > }
> >
> > - if (closed)
> > - flow_table_compact(c, flow);
> > + if (closed) {
>
> Should this be a flow_del() function perhaps, possibly taking flow and
> free_head as argument?
I suppose it could, but since it's not that long I'd prefer to inline
it to emphasise the fact that for the tracking scheme to work,
deletions *must* be done in the context of scanning the full table.
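To make that constraint concrete, a hedged standalone sketch (simplified
types, hypothetical names) of the scan-time freeing: the index-sorted
cluster list only stays consistent because slots are freed while walking
the table in increasing index order, so a newly freed slot is either
adjacent to the cluster just scanned or starts a new cluster chained after
it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model, not the patch's real types */
#define FLOW_MAX 8
#define FLOW_TYPE_NONE 0

struct slot {
	int type;
	unsigned n;	/* free-cluster length, valid at the head */
	unsigned next;	/* index of the next free cluster */
};

static struct slot tab[FLOW_MAX];
static unsigned first_free = FLOW_MAX;

/* One pass: free every slot whose type matches 'closing', maintaining an
 * index-sorted cluster list -- which only works because we walk the table
 * in order */
static void scan_free(int closing)
{
	struct slot *head = NULL;	   /* current cluster head, if any */
	unsigned *last_next = &first_free; /* where to chain the next cluster */
	unsigned idx;

	for (idx = 0; idx < FLOW_MAX; idx++) {
		struct slot *s = &tab[idx];

		if (s->type == FLOW_TYPE_NONE) {
			/* Existing cluster: rechain it, then skip past it */
			head = s;
			*last_next = idx;
			last_next = &head->next;
			idx += head->n - 1;
			continue;
		}

		if (s->type == closing) {
			s->type = FLOW_TYPE_NONE;
			if (head && idx == (unsigned)(head - tab) + head->n) {
				/* Fold into the preceding cluster */
				head->n++;
				s->n = s->next = 0;
			} else {
				/* Start a new cluster */
				head = s;
				head->n = 1;
				*last_next = idx;
				last_next = &head->next;
			}
		} else {
			head = NULL;
		}
	}
	*last_next = FLOW_MAX;
}
```

Freeing out of band (outside the scan) would leave no cheap way to know
whether the freed slot abuts an existing cluster or where to splice it
into the sorted list.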
>
> > + flow->f.type = FLOW_TYPE_NONE;
> > +
> > + if (free_head) {
> > + /* Add slot to current free block */
> > + ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> > + free_head->n++;
> > + flow->free.n = flow->free.next = 0;
> > + } else {
> > + /* Create new free block */
> > + free_head = &flow->free;
> > + free_head->n = 1;
> > + *last_next = idx;
> > + last_next = &free_head->next;
> > + }
> > + } else {
> > + free_head = NULL;
> > + }
> > }
> > +
> > + *last_next = FLOW_MAX;
> > +}
> > +
> > +/**
> > + * flow_init() - Initialise flow related data structures
> > + */
> > +void flow_init(void)
> > +{
> > + /* Initial state is a single free block containing the whole table */
> > + flowtab[0].free.n = FLOW_MAX;
> > + flowtab[0].free.next = FLOW_MAX;
> > }
> > diff --git a/flow.h b/flow.h
> > index 8064f0e..48a0ab4 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> >
> > union flow;
> >
> > +void flow_init(void);
> > void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> >
> > void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > diff --git a/flow_table.h b/flow_table.h
> > index 2773a2b..34fc679 100644
> > --- a/flow_table.h
> > +++ b/flow_table.h
> > @@ -9,6 +9,19 @@
> >
> > #include "tcp_conn.h"
> >
> > +/**
> > + * struct flow_free_block - Information about a block of free entries
> > + * @f: Generic flow information
> > + * @n: Number of entries in this free block (including this one)
> > + * @next: Index of next free block
> > + */
> > +struct flow_free_block {
> > + /* Must be first element */
> > + struct flow_common f;
> > + unsigned n;
> > + unsigned next;
> > +};
> > +
> > /**
> > * union flow - Descriptor for a logical packet flow (e.g. connection)
> > * @f: Fields common between all variants
> > @@ -17,12 +30,13 @@
> > */
> > union flow {
> > struct flow_common f;
> > + struct flow_free_block free;
> > struct tcp_tap_conn tcp;
> > struct tcp_splice_conn tcp_splice;
> > };
> >
> > /* Global Flow Table */
> > -extern unsigned flow_count;
> > +extern unsigned flow_first_free;
> > extern union flow flowtab[];
> >
> >
> > diff --git a/passt.c b/passt.c
> > index 71bea8f..d315438 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -285,6 +285,8 @@ int main(int argc, char **argv)
> >
> > clock_gettime(CLOCK_MONOTONIC, &now);
> >
> > + flow_init();
> > +
> > if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
> > exit(EXIT_FAILURE);
> >
> > diff --git a/tcp.c b/tcp.c
> > index a38deb0..6aacb56 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
> > tc_hash[b] = FLOW_SIDX_NONE;
> > }
> >
> > -/**
> > - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> > - * @c: Execution context
> > - * @old: Old location of tcp_tap_conn
> > - * @new: New location of tcp_tap_conn
> > - */
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > - struct tcp_tap_conn *new)
> > -
> > -{
> > - unsigned b = tcp_hash_probe(c, old);
> > -
> > - if (!flow_at_sidx(tc_hash[b]))
> > - return; /* Not in hash table, nothing to update */
> > -
> > - tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> > -
> > - debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> > - "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> > -
> > - tcp_epoll_ctl(c, new);
> > -}
> > -
> > /**
> > * tcp_hash_lookup() - Look up connection given remote address and ports
> > * @c: Execution context
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 636224e..a5f5cfe 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -155,9 +155,6 @@ struct tcp_splice_conn {
> > extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
> > extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
> >
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > - struct tcp_tap_conn *new);
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
> > bool tcp_flow_defer(union flow *flow);
> > bool tcp_splice_flow_defer(union flow *flow);
> > void tcp_splice_timer(const struct ctx *c, union flow *flow);
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 0711b19..7aae623 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
> > } while (0)
> >
> >
> > -/**
> > - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> > - * @c: Execution context
> > - * @new: New location of tcp_splice_conn
> > - */
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> > -{
> > - if (tcp_splice_epoll_ctl(c, new))
> > - conn_flag(c, new, CLOSING);
> > -}
> > -
> > /**
> > * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
> > * @flow: Flow table entry for this connection
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson