On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:49 +1100
> David Gibson wrote:
> 
> > Currently we always keep the flow table maximally compact: that is all the
> > active entries are contiguous at the start of the table. Doing this
> > sometimes requires moving an entry when one is freed. That's kind of
> > fiddly, and potentially expensive: it requires updating the hash table for
> > the new location, and depending on flow type, it may require EPOLL_CTL_MOD
> > system calls to update epoll tags with the new location too.
> > 
> > Implement a new way of managing the flow table that doesn't ever move
> > entries. It attempts to maintain some compactness by always using the
> > first free slot for a new connection, and mitigates the effect of non
> > compactness by cheaply skipping over contiguous blocks of free entries.
> > See the "theory of operation" comment in flow.c for details.
> > 
> > Signed-off-by: David Gibson
> > ---
> >  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
> >  flow.h       |   1 +
> >  flow_table.h |  16 ++++-
> >  passt.c      |   2 +
> >  tcp.c        |  23 -------
> >  tcp_conn.h   |   3 -
> >  tcp_splice.c |  11 ----
> >  7 files changed, 146 insertions(+), 87 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 7b99b58..421e6b5 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
> >  		   "flow_type_str[] doesn't match enum flow_type");
> >  
> >  /* Global Flow Table */
> > -unsigned flow_count;
> > +unsigned flow_first_free;
> 
> As mentioned in the comment to 10/13: this should be initialised to
> zero.

I'm pretty sure it already is: it's a global, so it's zero-initialised.

> >  union flow flowtab[FLOW_MAX];
> >  
> >  /* Last time the flow timers ran */
> > @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> >  }
> >  
> > +/**
> 
> As a general comment: interesting -- I wonder if this already has a name,
> but I couldn't find anything similar in the literature. Perhaps it's
> equivalent to a flattened search tree where free/busy slots can be
> found by following the links.

Yeah, I'm not aware of this technique being used elsewhere, though I
can't rule it out, of course.

> > + * DOC: Theory of Operation - allocation and freeing of flow entries
> 
> "allocating and freeing flow entries" ...flows better for me.

Good idea, changed.

> I think this comment should be before the declaration of flowtab[] --
> above here, with a function in between, is not exactly where I expected
> to find it as I started reading the next paragraph.

Fair enough, moved.

> > + *
> > + * Each flow takes a single slot in flowtab[].  Moving entries in that table
> 
> You jump right away to "moving entries", we know why, but I guess any
> other reader would be lost at this point.
> 
> In general, it took me some effort to grasp this paragraph. What about:
> 
>  * Flows are entries in the flowtab[]. We need to routinely scan the whole table
>  * to perform deferred bookkeeping tasks on active entries, and sparse empty
>  * slots waste time and worsen data locality.
>  *
>  * But keeping the table compacted by moving entries on deletion is fiddly: it
>  * requires updating hash tables, and the epoll references to the flows. Instead,
>  * we implement the compromise described below.
> 
> ?
Much better, thanks

> > + * (which we used to do) is fiddly and possibly expensive: it requires updating
> > + * the hash table indexing flows, and may require updating epoll data which
> > + * references the flow by index. However, we also want to keep the active
> > + * entries in the table compact where possible, because otherwise scanning
> > + * through the entire table becomes more expensive. This describes the
> > + * compromise implemented below.
> > + *
> > + * Free blocks
> > + * A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
> 
> This gave me the idea that blocks are some kind of "bucket" with a fixed
> size. I think we should either mention that the blocks have variable
> size, or call them sequences, or clusters (of free slots) perhaps?

Fair point. I've changed everything to the term "free cluster".

> > + * in flowtab. The first entry in each block contains metadata, specifically
> > + * the number of entries in the block, and the index of the next (non
> > + * contiguous) free block (struct flow_free_block).
> 
> ...it doesn't waste any space, but I don't think we actually need to
> store the number of "entries" (slots?)... more details below.

I haven't seen a way to avoid it, while maintaining the properties we
want, but I'm open to suggestions.

> > + *
> > + * Free block list
> > + * flow_first_free gives the index of the first entry of the first (lowest
> > + * index) free block. Each free block has the index of the next free block,
> > + * or MAX_FLOW if it is the last free block. Together these form a linked
> > + * list of free blocks, in strictly increasing order of index.
> > + *
> > + * Allocation
> > + * When allocating a new flow, we always use the first entry of the first
> > + * free block, that is, at index flow_first_free. If the block has more than
> > + * one entry, flow_first_free is updated to the next entry, which is updated
> > + * to represent the new smaller free block. Otherwise the free block is
> > + * eliminated and flow_first_free is updated to the next free block.
> > + *
> > + * Scanning the table
> > + * Theoretically, scanning the table requires FLOW_MAX iterations. However,
> > + * when we encounter the start of a free block, we can immediately skip to
> > + * its end, meaning that in practice we only need (number of active
> 
> after its end

Better yet, "...immediately skip past it,".

> > + * connections) + (number of free blocks) iterations.
> > + *
> > + * Freeing
> > + * We can only free entries when scanning the whole flow table in
> > + * flow_defer_handler(). This is what lets us maintain the fee block list in
> 
> s/fee/free/

Fixed.

> > + * index sorted order. As we scan we keep track of whether the previous
> > + * entry was in a free block or not. If so when an entry is freed (its
> > + * deferred handler returns 'true'), we add it to that free block. Otherwise
> > + * we create a new free block for it and chain it to the last free block we
> > + * scanned.
> > + */
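In case a concrete picture helps here: below is a toy, self-contained
model of that scan (toy types and a scaled-down table rather than the
real flowtab[] and deferred handlers; the printf() stands in for the
per-flow work). Visiting three active slots across two free clusters
takes five iterations rather than TOY_MAX:

#include <stdio.h>

#define TOY_MAX 10

struct toy_slot {
	int active;	/* zero means the slot is free */
	unsigned n;	/* cluster size, valid in the first slot of a cluster */
};

int main(void)
{
	/* Slots 0-1 active, 2-4 a free cluster (n = 3), 5 active,
	 * 6-9 a free cluster (n = 4) */
	struct toy_slot tab[TOY_MAX] = {
		[0] = { 1, 0 }, [1] = { 1, 0 },
		[2] = { 0, 3 },
		[5] = { 1, 0 },
		[6] = { 0, 4 },
	};
	unsigned idx, iterations = 0;

	for (idx = 0; idx < TOY_MAX; idx++) {
		iterations++;
		if (!tab[idx].active) {
			/* First slot of a free cluster: skip the rest */
			idx += tab[idx].n - 1;
			continue;
		}
		printf("visiting active slot %u\n", idx);
	}

	printf("%u iterations for a table of %u slots\n",
	       iterations, (unsigned)TOY_MAX);
	return 0;
}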
> > +
> >  /**
> >   * flow_alloc() - Allocate a new flow
> >   *
> > @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >   */
> >  union flow *flow_alloc(void)
> >  {
> > -	if (flow_count >= FLOW_MAX)
> > +	union flow *flow = &flowtab[flow_first_free];
> > +
> > +	if (flow_first_free >= FLOW_MAX)
> >  		return NULL;
> >  
> > -	return &flowtab[flow_count++];
> > +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> > +	ASSERT(flow->free.n >= 1);
> > +
> > +	if (flow->free.n > 1) {
> > +		/* Use one entry from the block */
> > +		union flow *next = &flowtab[++flow_first_free];
> 
> ...but we just checked (flow_first_free < FLOW_MAX) above. Now you
> increment flow_first_free by one.
> 
> I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
> should be 0, so we shouldn't end up in this case -- but it's a bit
> confusing.

That's correct, except that it's flow->free.n == 1, not 0 (we're
checking for strictly > 1 above). If it's not, that's a free cluster
claiming it extends past the end of the table. I've added

	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);

above the if to try to clarify this.
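In case it makes the two paths easier to see, a walk-through with
made-up indices (illustration only, with FLOW_MAX scaled down to 8):

	/* Case 1: the cluster has more than one entry.
	 *   before: flow_first_free = 2, flowtab[2].free = { .n = 3, .next = 6 }
	 * flow_alloc() hands out slot 2 and shrinks the cluster:
	 *   after:  flow_first_free = 3, flowtab[3].free = { .n = 2, .next = 6 }
	 *
	 * Case 2: the cluster is a single entry.
	 *   before: flow_first_free = 7, flowtab[7].free = { .n = 1, .next = 8 }
	 * flow_alloc() consumes the whole cluster:
	 *   after:  flow_first_free = 8 == FLOW_MAX, so the next call fails
	 *
	 * The new ASSERT() guarantees 7 + 1 <= FLOW_MAX in case 2: a cluster
	 * can never claim to extend past the end of the table.
	 */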
> > +
> > +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> > +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> > +		ASSERT(next->free.n == 0);
> > +
> > +		next->free.n = flow->free.n - 1;
> > +		next->free.next = flow->free.next;
> > +	} else {
> > +		/* Use the entire block */
> > +		flow_first_free = flow->free.next;
> > +	}
> 
> I wonder if we really have to keep track of the number of (non-)entries
> in the free "block", and if we have to be explicit about the two cases.
> 
> I'm trying to find out if we can simplify the whole thing with slightly
> different variants, for example:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   	unsigned next_busy;
>   };
> 
> or:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   };
> 
> or even some sort of double-linked list:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned prev_free;
>   	unsigned next_free;
>   };
> 
> but I couldn't quite find a more elegant solution yet.
> 
> > +
> > +	memset(flow, 0, sizeof(*flow));
> > +	return flow;
> >  }
> >  
> >  /**
> > @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
> >   */
> >  void flow_alloc_cancel(union flow *flow)
> >  {
> > -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> > -	memset(flow, 0, sizeof(*flow));
> > -	flow_count--;
> > -}
> > -
> > -/**
> > - * flow_table_compact() - Perform compaction on flow table
> > - * @c:		Execution context
> > - * @hole:	Pointer to recently closed flow
> > - */
> > -static void flow_table_compact(const struct ctx *c, union flow *hole)
> > -{
> > -	union flow *from;
> > -
> > -	if (FLOW_IDX(hole) == --flow_count) {
> > -		debug("flow: table compaction: maximum index was %u (%p)",
> > -		      FLOW_IDX(hole), (void *)hole);
> > -		memset(hole, 0, sizeof(*hole));
> > -		return;
> > -	}
> > -
> > -	from = flowtab + flow_count;
> > -	memcpy(hole, from, sizeof(*hole));
> > -
> > -	switch (from->f.type) {
> > -	case FLOW_TCP:
> > -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> > -		break;
> > -	case FLOW_TCP_SPLICE:
> > -		tcp_splice_conn_update(c, &hole->tcp_splice);
> > -		break;
> > -	default:
> > -		die("Unexpected %s in tcp_table_compact()",
> > -		    FLOW_TYPE(&from->f));
> > -	}
> > -
> > -	debug("flow: table compaction (%s): old index %u, new index %u, "
> > -	      "from: %p, to: %p",
> > -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> > -	      (void *)from, (void *)hole);
> > -
> > -	memset(from, 0, sizeof(*from));
> > +	ASSERT(flow_first_free > FLOW_IDX(flow));
> > +
> > +	flow->f.type = FLOW_TYPE_NONE;
> > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > +	 * flow_alloc()s steps. This will get folded together the next time
> > +	 * flow_defer_handler runs anyway() */
> > +	flow->free.n = 1;
> > +	flow->free.next = flow_first_free;
> > +	flow_first_free = FLOW_IDX(flow);
> >  }
> >  
> >  /**
> > @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
> >   */
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  {
> > +	struct flow_free_block *free_head = NULL;
> > +	unsigned *last_next = &flow_first_free;
> >  	bool timer = false;
> > -	union flow *flow;
> > +	unsigned idx;
> >  
> >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> >  		timer = true;
> >  		flow_timer_run = *now;
> >  	}
> >  
> > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > +		union flow *flow = &flowtab[idx];
> >  		bool closed = false;
> >  
> > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > +			/* Start of a free block */
> > +			free_head = &flow->free;
> > +			*last_next = idx;
> > +			last_next = &free_head->next;
> > +			/* Skip the rest of the block */
> > +			idx += free_head->n - 1;
> > +			continue;
> > +		}
> > +
> > +
> 
> Stray tabs.
> 
> >  		switch (flow->f.type) {
> > +		case FLOW_TYPE_NONE:
> > +			closed = true;
> > +			break;
> >  		case FLOW_TCP:
> >  			closed = tcp_flow_defer(flow);
> >  			break;
> > @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  			;
> >  		}
> >  
> > -		if (closed)
> > -			flow_table_compact(c, flow);
> > +		if (closed) {
> 
> Should this be a flow_del() function perhaps, possibly taking flow and
> free_head as argument?

I suppose it could, but since it's not that long I'd prefer to inline
it to emphasise the fact that for the tracking scheme to work,
deletions *must* be done in the context of scanning the full table.
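To illustrate how that keeps the list sorted, consider a made-up state:
slot 0 active, slots 1-2 a free cluster (n = 2), slots 3 and 4 active.

	/* If slot 3's deferred handler reports it closed while 4 stays open:
	 *   idx 1: skip the cluster, remember free_head = &flowtab[1].free
	 *   idx 3: closed, appended to that cluster, its n becomes 3
	 *   idx 4: still active, so free_head is reset to NULL
	 *
	 * If instead slot 4 closes and slot 3 survives, slot 4 can't be
	 * merged: it becomes a one-entry cluster of its own, and the cluster
	 * at slot 1 is pointed at it through ->next, so the list stays in
	 * increasing index order without ever moving an entry.
	 */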
> > +			flow->f.type = FLOW_TYPE_NONE;
> > +
> > +			if (free_head) {
> > +				/* Add slot to current free block */
> > +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> > +				free_head->n++;
> > +				flow->free.n = flow->free.next = 0;
> > +			} else {
> > +				/* Create new free block */
> > +				free_head = &flow->free;
> > +				free_head->n = 1;
> > +				*last_next = idx;
> > +				last_next = &free_head->next;
> > +			}
> > +		} else {
> > +			free_head = NULL;
> > +		}
> >  	}
> > +
> > +	*last_next = FLOW_MAX;
> > +}
> > +
> > +/**
> > + * flow_init() - Initialise flow related data structures
> > + */
> > +void flow_init(void)
> > +{
> > +	/* Initial state is a single free block containing the whole table */
> > +	flowtab[0].free.n = FLOW_MAX;
> > +	flowtab[0].free.next = FLOW_MAX;
> >  }
> > diff --git a/flow.h b/flow.h
> > index 8064f0e..48a0ab4 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> >  
> >  union flow;
> >  
> > +void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > diff --git a/flow_table.h b/flow_table.h
> > index 2773a2b..34fc679 100644
> > --- a/flow_table.h
> > +++ b/flow_table.h
> > @@ -9,6 +9,19 @@
> >  
> >  #include "tcp_conn.h"
> >  
> > +/**
> > + * struct flow_free_block - Information about a block of free entries
> > + * @f:		Generic flow information
> > + * @n:		Number of entries in this free block (including this one)
> > + * @next:	Index of next free block
> > + */
> > +struct flow_free_block {
> > +	/* Must be first element */
> > +	struct flow_common f;
> > +	unsigned n;
> > +	unsigned next;
> > +};
> > +
> >  /**
> >   * union flow - Descriptor for a logical packet flow (e.g. connection)
> >   * @f:	Fields common between all variants
> > @@ -17,12 +30,13 @@
> >   */
> >  union flow {
> >  	struct flow_common f;
> > +	struct flow_free_block free;
> >  	struct tcp_tap_conn tcp;
> >  	struct tcp_splice_conn tcp_splice;
> >  };
> >  
> >  /* Global Flow Table */
> > -extern unsigned flow_count;
> > +extern unsigned flow_first_free;
> >  extern union flow flowtab[];
> >  
> >  
> > diff --git a/passt.c b/passt.c
> > index 71bea8f..d315438 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -285,6 +285,8 @@ int main(int argc, char **argv)
> >  
> >  	clock_gettime(CLOCK_MONOTONIC, &now);
> >  
> > +	flow_init();
> > +
> >  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
> >  		exit(EXIT_FAILURE);
> >  
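For completeness, the state that call sets up, and what the first
allocation then does to it (worked from the flow.c hunks above, values
illustrative only):

	/* After flow_init():
	 *   flow_first_free = 0 (zero-initialised global)
	 *   flowtab[0].free = { .n = FLOW_MAX, .next = FLOW_MAX }
	 *
	 * After the first flow_alloc():
	 *   flow_first_free = 1
	 *   flowtab[1].free = { .n = FLOW_MAX - 1, .next = FLOW_MAX }
	 */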
> > diff --git a/tcp.c b/tcp.c
> > index a38deb0..6aacb56 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
> >  	tc_hash[b] = FLOW_SIDX_NONE;
> >  }
> >  
> > -/**
> > - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> > - * @c:		Execution context
> > - * @old:	Old location of tcp_tap_conn
> > - * @new:	New location of tcp_tap_conn
> > - */
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new)
> > -
> > -{
> > -	unsigned b = tcp_hash_probe(c, old);
> > -
> > -	if (!flow_at_sidx(tc_hash[b]))
> > -		return; /* Not in hash table, nothing to update */
> > -
> > -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> > -
> > -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> > -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> > -
> > -	tcp_epoll_ctl(c, new);
> > -}
> > -
> >  /**
> >   * tcp_hash_lookup() - Look up connection given remote address and ports
> >   * @c:		Execution context
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 636224e..a5f5cfe 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -155,9 +155,6 @@ struct tcp_splice_conn {
> >  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
> >  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
> >  
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new);
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
> >  bool tcp_flow_defer(union flow *flow);
> >  bool tcp_splice_flow_defer(union flow *flow);
> >  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 0711b19..7aae623 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
> >  	} while (0)
> >  
> >  
> > -/**
> > - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> > - * @c:		Execution context
> > - * @new:	New location of tcp_splice_conn
> > - */
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> > -{
> > -	if (tcp_splice_epoll_ctl(c, new))
> > -		conn_flag(c, new, CLOSING);
> > -}
> > -
> >  /**
> >   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
> >   * @flow:	Flow table entry for this connection
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson