On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:49 +1100
> David Gibson wrote:
> 
> > Currently we always keep the flow table maximally compact: that is all the
> > active entries are contiguous at the start of the table. Doing this
> > sometimes requires moving an entry when one is freed. That's kind of
> > fiddly, and potentially expensive: it requires updating the hash table for
> > the new location, and depending on flow type, it may require EPOLL_CTL_MOD
> > system calls to update epoll tags with the new location too.
> > 
> > Implement a new way of managing the flow table that doesn't ever move
> > entries. It attempts to maintain some compactness by always using the
> > first free slot for a new connection, and mitigates the effect of non
> > compactness by cheaply skipping over contiguous blocks of free entries.
> > See the "theory of operation" comment in flow.c for details.
> > 
> > Signed-off-by: David Gibson
> > ---
> >  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
> >  flow.h       |   1 +
> >  flow_table.h |  16 ++++-
> >  passt.c      |   2 +
> >  tcp.c        |  23 -------
> >  tcp_conn.h   |   3 -
> >  tcp_splice.c |  11 ----
> >  7 files changed, 146 insertions(+), 87 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 7b99b58..421e6b5 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
> >  		   "flow_type_str[] doesn't match enum flow_type");
> >  
> >  /* Global Flow Table */
> > -unsigned flow_count;
> > +unsigned flow_first_free;
> 
> As mentioned in the comment to 10/13: this should be initialised to
> zero.

I'm pretty sure it already is: it's a global, so it's zero-initialised.

> >  union flow flowtab[FLOW_MAX];
> >  
> >  /* Last time the flow timers ran */
> > @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> >  }
> >  
> > +/**
> 
> As a general comment: interesting -- I wonder if this already has a name,
> but I couldn't find anything similar in the literature. Perhaps it's
> equivalent to a flattened search tree where free/busy slots can be
> found by following the links.

Yeah, I'm not aware of this technique being used elsewhere, though I
can't rule it out, of course.

> > + * DOC: Theory of Operation - allocation and freeing of flow entries
> 
> "allocating and freeing flow entries" ...flows better for me.

Good idea, changed.

> I think this comment should be before the declaration of flowtab[] --
> above here, with a function in between, is not exactly where I expected
> to find it as I started reading the next paragraph.

Fair enough, moved.

> > + *
> > + * Each flow takes a single slot in flowtab[].  Moving entries in that table
> 
> You jump right away to "moving entries", we know why, but I guess any
> other reader would be lost at this point.
> 
> In general, it took me some effort to grasp this paragraph. What about:
> 
>  * Flows are entries in the flowtab[]. We need to routinely scan the whole table
>  * to perform deferred bookkeeping tasks on active entries, and sparse empty
>  * slots waste time and worsen data locality.
>  *
>  * But keeping the table compacted by moving entries on deletion is fiddly: it
>  * requires updating hash tables, and the epoll references to the flows. Instead,
>  * we implement the compromise described below.
> 
> ?
Much better, thanks

> > + * (which we used to do) is fiddly and possibly expensive: it requires updating
> > + * the hash table indexing flows, and may require updating epoll data which
> > + * references the flow by index. However, we also want to keep the active
> > + * entries in the table compact where possible, because otherwise scanning
> > + * through the entire table becomes more expensive. This describes the
> > + * compromise implemented below.
> > + *
> > + * Free blocks
> > + * A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
> 
> This gave me the idea that blocks are some kind of "bucket" with a fixed
> size. I think we should either mention that the blocks have variable
> size, or call them sequences, or clusters (of free slots) perhaps?

Fair point. I've changed everything to the term "free cluster".

> > + * in flowtab. The first entry in each block contains metadata, specifically
> > + * the number of entries in the block, and the index of the next (non
> > + * contiguous) free block (struct flow_free_block).
> 
> ...it doesn't waste any space, but I don't think we actually need to
> store the number of "entries" (slots?)... more details below.

I haven't seen a way to avoid it, while maintaining the properties we
want, but I'm open to suggestions.

> > + *
> > + * Free block list
> > + * flow_first_free gives the index of the first entry of the first (lowest
> > + * index) free block. Each free block has the index of the next free block,
> > + * or MAX_FLOW if it is the last free block. Together these form a linked
> > + * list of free blocks, in strictly increasing order of index.
> > + *
> > + * Allocation
> > + * When allocating a new flow, we always use the first entry of the first
> > + * free block, that is, at index flow_first_free. If the block has more than
> > + * one entry, flow_first_free is updated to the next entry, which is updated
> > + * to represent the new smaller free block. Otherwise the free block is
> > + * eliminated and flow_first_free is updated to the next free block.
> > + *
> > + * Scanning the table
> > + * Theoretically, scanning the table requires FLOW_MAX iterations. However,
> > + * when we encounter the start of a free block, we can immediately skip to
> > + * its end, meaning that in practice we only need (number of active
> 
> after its end

Better yet, "...immediately skip past it,".

> > + * connections) + (number of free blocks) iterations.
> > + *
> > + * Freeing
> > + * We can only free entries when scanning the whole flow table in
> > + * flow_defer_handler(). This is what lets us maintain the fee block list in
> 
> s/fee/free/

Fixed.

> > + * index sorted order. As we scan we keep track of whether the previous
> > + * entry was in a free block or not. If so when an entry is freed (its
> > + * deferred handler returns 'true'), we add it to that free block. Otherwise
> > + * we create a new free block for it and chain it to the last free block we
> > + * scanned.
> > + */
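In case a concrete picture helps here: below is a toy, self-contained
model of that scan (toy types and a scaled-down table rather than the
real flowtab[] and deferred handlers; the printf() stands in for the
per-flow work). Visiting three active slots across two free clusters
takes five iterations rather than TOY_MAX:

#include <stdio.h>

#define TOY_MAX 10

struct toy_slot {
	int active;	/* zero means the slot is free */
	unsigned n;	/* cluster size, valid in the first slot of a cluster */
};

int main(void)
{
	/* Slots 0-1 active, 2-4 a free cluster (n = 3), 5 active,
	 * 6-9 a free cluster (n = 4) */
	struct toy_slot tab[TOY_MAX] = {
		[0] = { 1, 0 }, [1] = { 1, 0 },
		[2] = { 0, 3 },
		[5] = { 1, 0 },
		[6] = { 0, 4 },
	};
	unsigned idx, iterations = 0;

	for (idx = 0; idx < TOY_MAX; idx++) {
		iterations++;
		if (!tab[idx].active) {
			/* First slot of a free cluster: skip the rest */
			idx += tab[idx].n - 1;
			continue;
		}
		printf("visiting active slot %u\n", idx);
	}

	printf("%u iterations for a table of %u slots\n",
	       iterations, (unsigned)TOY_MAX);
	return 0;
}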
> > +
> >  /**
> >   * flow_alloc() - Allocate a new flow
> >   *
> > @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >   */
> >  union flow *flow_alloc(void)
> >  {
> > -	if (flow_count >= FLOW_MAX)
> > +	union flow *flow = &flowtab[flow_first_free];
> > +
> > +	if (flow_first_free >= FLOW_MAX)
> >  		return NULL;
> >  
> > -	return &flowtab[flow_count++];
> > +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> > +	ASSERT(flow->free.n >= 1);
> > +
> > +	if (flow->free.n > 1) {
> > +		/* Use one entry from the block */
> > +		union flow *next = &flowtab[++flow_first_free];
> 
> ...but we just checked (flow_first_free < FLOW_MAX) above. Now you
> increment flow_first_free by one.
> 
> I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
> should be 0, so we shouldn't end up in this case -- but it's a bit
> confusing.

That's correct, except that it's flow->free.n == 1, not 0 (we're
checking for strictly > 1 above). If it's not, that's a free cluster
claiming it extends past the end of the table. I've added

	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);

above the if to try to clarify this.
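In case it makes the two paths easier to see, a walk-through with
made-up indices (illustration only, with FLOW_MAX scaled down to 8):

	/* Case 1: the cluster has more than one entry.
	 *   before: flow_first_free = 2, flowtab[2].free = { .n = 3, .next = 6 }
	 * flow_alloc() hands out slot 2 and shrinks the cluster:
	 *   after:  flow_first_free = 3, flowtab[3].free = { .n = 2, .next = 6 }
	 *
	 * Case 2: the cluster is a single entry.
	 *   before: flow_first_free = 7, flowtab[7].free = { .n = 1, .next = 8 }
	 * flow_alloc() consumes the whole cluster:
	 *   after:  flow_first_free = 8 == FLOW_MAX, so the next call fails
	 *
	 * The new ASSERT() guarantees 7 + 1 <= FLOW_MAX in case 2: a cluster
	 * can never claim to extend past the end of the table.
	 */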
> > +
> > +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> > +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> > +		ASSERT(next->free.n == 0);
> > +
> > +		next->free.n = flow->free.n - 1;
> > +		next->free.next = flow->free.next;
> > +	} else {
> > +		/* Use the entire block */
> > +		flow_first_free = flow->free.next;
> > +	}
> 
> I wonder if we really have to keep track of the number of (non-)entries
> in the free "block", and if we have to be explicit about the two cases.
> 
> I'm trying to find out if we can simplify the whole thing with slightly
> different variants, for example:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   	unsigned next_busy;
>   };
> 
> or:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   };
> 
> or even some sort of double-linked list:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned prev_free;
>   	unsigned next_free;
>   };
> 
> but I couldn't quite find a more elegant solution yet.
> 
> > +
> > +	memset(flow, 0, sizeof(*flow));
> > +	return flow;
> >  }
> >  
> >  /**
> > @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
> >   */
> >  void flow_alloc_cancel(union flow *flow)
> >  {
> > -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> > -	memset(flow, 0, sizeof(*flow));
> > -	flow_count--;
> > -}
> > -
> > -/**
> > - * flow_table_compact() - Perform compaction on flow table
> > - * @c:		Execution context
> > - * @hole:	Pointer to recently closed flow
> > - */
> > -static void flow_table_compact(const struct ctx *c, union flow *hole)
> > -{
> > -	union flow *from;
> > -
> > -	if (FLOW_IDX(hole) == --flow_count) {
> > -		debug("flow: table compaction: maximum index was %u (%p)",
> > -		      FLOW_IDX(hole), (void *)hole);
> > -		memset(hole, 0, sizeof(*hole));
> > -		return;
> > -	}
> > -
> > -	from = flowtab + flow_count;
> > -	memcpy(hole, from, sizeof(*hole));
> > -
> > -	switch (from->f.type) {
> > -	case FLOW_TCP:
> > -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> > -		break;
> > -	case FLOW_TCP_SPLICE:
> > -		tcp_splice_conn_update(c, &hole->tcp_splice);
> > -		break;
> > -	default:
> > -		die("Unexpected %s in tcp_table_compact()",
> > -		    FLOW_TYPE(&from->f));
> > -	}
> > -
> > -	debug("flow: table compaction (%s): old index %u, new index %u, "
> > -	      "from: %p, to: %p",
> > -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> > -	      (void *)from, (void *)hole);
> > -
> > -	memset(from, 0, sizeof(*from));
> > +	ASSERT(flow_first_free > FLOW_IDX(flow));
> > +
> > +	flow->f.type = FLOW_TYPE_NONE;
> > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > +	 * flow_alloc()s steps. This will get folded together the next time
> > +	 * flow_defer_handler runs anyway() */
> > +	flow->free.n = 1;
> > +	flow->free.next = flow_first_free;
> > +	flow_first_free = FLOW_IDX(flow);
> >  }
> >  
> >  /**
> > @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
> >   */
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  {
> > +	struct flow_free_block *free_head = NULL;
> > +	unsigned *last_next = &flow_first_free;
> >  	bool timer = false;
> > -	union flow *flow;
> > +	unsigned idx;
> >  
> >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> >  		timer = true;
> >  		flow_timer_run = *now;
> >  	}
> >  
> > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > +		union flow *flow = &flowtab[idx];
> >  		bool closed = false;
> >  
> > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > +			/* Start of a free block */
> > +			free_head = &flow->free;
> > +			*last_next = idx;
> > +			last_next = &free_head->next;
> > +			/* Skip the rest of the block */
> > +			idx += free_head->n - 1;
> > +			continue;
> > +		}
> > +
> > +
> 
> Stray tabs.
> 
> >  		switch (flow->f.type) {
> > +		case FLOW_TYPE_NONE:
> > +			closed = true;
> > +			break;
> >  		case FLOW_TCP:
> >  			closed = tcp_flow_defer(flow);
> >  			break;
> > @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  			;
> >  		}
> >  
> > -		if (closed)
> > -			flow_table_compact(c, flow);
> > +		if (closed) {
> 
> Should this be a flow_del() function perhaps, possibly taking flow and
> free_head as argument?

I suppose it could, but since it's not that long I'd prefer to inline
it to emphasise the fact that for the tracking scheme to work,
deletions *must* be done in the context of scanning the full table.
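To illustrate how that keeps the list sorted, consider a made-up state:
slot 0 active, slots 1-2 a free cluster (n = 2), slots 3 and 4 active.

	/* If slot 3's deferred handler reports it closed while 4 stays open:
	 *   idx 1: skip the cluster, remember free_head = &flowtab[1].free
	 *   idx 3: closed, appended to that cluster, its n becomes 3
	 *   idx 4: still active, so free_head is reset to NULL
	 *
	 * If instead slot 4 closes and slot 3 survives, slot 4 can't be
	 * merged: it becomes a one-entry cluster of its own, and the cluster
	 * at slot 1 is pointed at it through ->next, so the list stays in
	 * increasing index order without ever moving an entry.
	 */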
> > +			flow->f.type = FLOW_TYPE_NONE;
> > +
> > +			if (free_head) {
> > +				/* Add slot to current free block */
> > +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> > +				free_head->n++;
> > +				flow->free.n = flow->free.next = 0;
> > +			} else {
> > +				/* Create new free block */
> > +				free_head = &flow->free;
> > +				free_head->n = 1;
> > +				*last_next = idx;
> > +				last_next = &free_head->next;
> > +			}
> > +		} else {
> > +			free_head = NULL;
> > +		}
> >  	}
> > +
> > +	*last_next = FLOW_MAX;
> > +}
> > +
> > +/**
> > + * flow_init() - Initialise flow related data structures
> > + */
> > +void flow_init(void)
> > +{
> > +	/* Initial state is a single free block containing the whole table */
> > +	flowtab[0].free.n = FLOW_MAX;
> > +	flowtab[0].free.next = FLOW_MAX;
> >  }
> > diff --git a/flow.h b/flow.h
> > index 8064f0e..48a0ab4 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> >  
> >  union flow;
> >  
> > +void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > diff --git a/flow_table.h b/flow_table.h
> > index 2773a2b..34fc679 100644
> > --- a/flow_table.h
> > +++ b/flow_table.h
> > @@ -9,6 +9,19 @@
> >  
> >  #include "tcp_conn.h"
> >  
> > +/**
> > + * struct flow_free_block - Information about a block of free entries
> > + * @f:		Generic flow information
> > + * @n:		Number of entries in this free block (including this one)
> > + * @next:	Index of next free block
> > + */
> > +struct flow_free_block {
> > +	/* Must be first element */
> > +	struct flow_common f;
> > +	unsigned n;
> > +	unsigned next;
> > +};
> > +
> >  /**
> >   * union flow - Descriptor for a logical packet flow (e.g. connection)
> >   * @f:	Fields common between all variants
> > @@ -17,12 +30,13 @@
> >   */
> >  union flow {
> >  	struct flow_common f;
> > +	struct flow_free_block free;
> >  	struct tcp_tap_conn tcp;
> >  	struct tcp_splice_conn tcp_splice;
> >  };
> >  
> >  /* Global Flow Table */
> > -extern unsigned flow_count;
> > +extern unsigned flow_first_free;
> >  extern union flow flowtab[];
> >  
> >  
> > diff --git a/passt.c b/passt.c
> > index 71bea8f..d315438 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -285,6 +285,8 @@ int main(int argc, char **argv)
> >  
> >  	clock_gettime(CLOCK_MONOTONIC, &now);
> >  
> > +	flow_init();
> > +
> >  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
> >  		exit(EXIT_FAILURE);
> >  
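For completeness, the state that call sets up, and what the first
allocation then does to it (worked from the flow.c hunks above, values
illustrative only):

	/* After flow_init():
	 *   flow_first_free = 0 (zero-initialised global)
	 *   flowtab[0].free = { .n = FLOW_MAX, .next = FLOW_MAX }
	 *
	 * After the first flow_alloc():
	 *   flow_first_free = 1
	 *   flowtab[1].free = { .n = FLOW_MAX - 1, .next = FLOW_MAX }
	 */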
> > diff --git a/tcp.c b/tcp.c
> > index a38deb0..6aacb56 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
> >  	tc_hash[b] = FLOW_SIDX_NONE;
> >  }
> >  
> > -/**
> > - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> > - * @c:		Execution context
> > - * @old:	Old location of tcp_tap_conn
> > - * @new:	New location of tcp_tap_conn
> > - */
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new)
> > -
> > -{
> > -	unsigned b = tcp_hash_probe(c, old);
> > -
> > -	if (!flow_at_sidx(tc_hash[b]))
> > -		return; /* Not in hash table, nothing to update */
> > -
> > -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> > -
> > -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> > -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> > -
> > -	tcp_epoll_ctl(c, new);
> > -}
> > -
> >  /**
> >   * tcp_hash_lookup() - Look up connection given remote address and ports
> >   * @c:		Execution context
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 636224e..a5f5cfe 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -155,9 +155,6 @@ struct tcp_splice_conn {
> >  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
> >  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
> >  
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new);
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
> >  bool tcp_flow_defer(union flow *flow);
> >  bool tcp_splice_flow_defer(union flow *flow);
> >  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 0711b19..7aae623 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
> >  	} while (0)
> >  
> >  
> > -/**
> > - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> > - * @c:		Execution context
> > - * @new:	New location of tcp_splice_conn
> > - */
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> > -{
> > -	if (tcp_splice_epoll_ctl(c, new))
> > -		conn_flag(c, new, CLOSING);
> > -}
> > -
> >  /**
> >   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
> >   * @flow:	Flow table entry for this connection
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson