From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
Date: Thu, 28 Dec 2023 19:25:25 +0100	[thread overview]
Message-ID: <20231228192525.7ba1ee48@elisabeth> (raw)
In-Reply-To: <20231221061549.976358-14-david@gibson.dropbear.id.au>

On Thu, 21 Dec 2023 17:15:49 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we always keep the flow table maximally compact: that is all the
> active entries are contiguous at the start of the table.  Doing this
> sometimes requires moving an entry when one is freed.  That's kind of
> fiddly, and potentially expensive: it requires updating the hash table for
> the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
> system calls to update epoll tags with the new location too.
> 
> Implement a new way of managing the flow table that doesn't ever move
> entries.  It attempts to maintain some compactness by always using the
> first free slot for a new connection, and mitigates the effect of non
> compactness by cheaply skipping over contiguous blocks of free entries.
> See the "theory of operation" comment in flow.c for details.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
>  flow.h       |   1 +
>  flow_table.h |  16 ++++-
>  passt.c      |   2 +
>  tcp.c        |  23 -------
>  tcp_conn.h   |   3 -
>  tcp_splice.c |  11 ----
>  7 files changed, 146 insertions(+), 87 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 7b99b58..421e6b5 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
>  	      "flow_type_str[] doesn't match enum flow_type");
>  
>  /* Global Flow Table */
> -unsigned flow_count;
> +unsigned flow_first_free;

As mentioned in the comment to 10/13: this should be initialised to
zero.

>  union flow flowtab[FLOW_MAX];
>  
>  /* Last time the flow timers ran */
> @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
>  }
>  
> +/**

As a general comment: interesting -- I wonder if this already has a name,
but I couldn't find anything similar in the literature. Perhaps it's
equivalent to a flattened search tree where free/busy slots can be
found by following the links.

> + * DOC: Theory of Operation - allocation and freeing of flow entries

"allocating and freeing flow entries" ...flows better for me.

I think this comment should be before the declaration of flowtab[] --
above here, with a function in between, is not exactly where I expected
to find it as I started reading the next paragraph.

> + *
> + * Each flow takes a single slot in flowtab[].  Moving entries in that table

You jump right away to "moving entries": we know why, but I guess any
other reader would be lost at this point.

In general, it took me some effort to grasp this paragraph. What about:

 * Flows are entries in the flowtab[]. We need to routinely scan the whole table
 * to perform deferred bookkeeping tasks on active entries, and sparse empty
 * slots waste time and worsen data locality.
 *
 * But keeping the table compacted by moving entries on deletion is fiddly: it
 * requires updating hash tables, and the epoll references to the flows. Instead,
 * we implement the compromise described below.

?

> + * (which we used to do) is fiddly and possibly expensive: it requires updating
> + * the hash table indexing flows, and may require updating epoll data which
> + * references the flow by index.  However, we also want to keep the active
> + * entries in the table compact where possible, because otherwise scanning
> + * through the entire table becomes more expensive.  This describes the
> + * compromise implemented below.
> + *
> + * Free blocks
> + *    A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries

This gave me the idea that blocks are some kind of "bucket" with a fixed
size. I think we should either mention that the blocks have variable
size, or call them sequences, or clusters (of free slots) perhaps?

> + *    in flowtab.  The first entry in each block contains metadata, specifically
> + *    the number of entries in the block, and the index of the next (non
> + *    contiguous) free block (struct flow_free_block).

...it doesn't waste any space, but I don't think we actually need to
store the number of "entries" (slots?)... more details below.

> + *
> + * Free block list
> + *    flow_first_free gives the index of the first entry of the first (lowest
> + *    index) free block.  Each free block has the index of the next free block,
> + *    or MAX_FLOW if it is the last free block.  Together these form a linked
> + *    list of free blocks, in strictly increasing order of index.
> + *
> + * Allocation
> + *    When allocating a new flow, we always use the first entry of the first
> + *    free block, that is, at index flow_first_free.  If the block has more than
> + *    one entry, flow_first_free is updated to the next entry, which is updated
> + *    to represent the new smaller free block.  Otherwise the free block is
> + *    eliminated and flow_first_free is updated to the next free block.
> + *
> + * Scanning the table
> + *    Theoretically, scanning the table requires FLOW_MAX iterations.  However,
> + *    when we encounter the start of a free block, we can immediately skip to
> + *    its end, meaning that in practice we only need (number of active

after its end

> + *    connections) + (number of free blocks) iterations.
> + *
> + * Freeing
> + *    We can only free entries when scanning the whole flow table in
> + *    flow_defer_handler().  This is what lets us maintain the fee block list in

s/fee/free/

> + *    index sorted order.  As we scan we keep track of whether the previous
> + *    entry was in a free block or not.  If so when an entry is freed (its
> + *    deferred handler returns 'true'), we add it to that free block.  Otherwise
> + *    we create a new free block for it and chain it to the last free block we
> + *    scanned.
> + */
> +
>  /**
>   * flow_alloc() - Allocate a new flow
>   *
> @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>   */
>  union flow *flow_alloc(void)
>  {
> -	if (flow_count >= FLOW_MAX)
> +	union flow *flow = &flowtab[flow_first_free];
> +
> +	if (flow_first_free >= FLOW_MAX)
>  		return NULL;
>  
> -	return &flowtab[flow_count++];
> +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> +	ASSERT(flow->free.n >= 1);
> +
> +	if (flow->free.n > 1) {
> +		/* Use one entry from the block */
> +		union flow *next = &flowtab[++flow_first_free];

...but we just checked (flow_first_free < FLOW_MAX) above. Now you
increment flow_first_free by one.

I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
should be 1, so we shouldn't end up in this case -- but it's a bit
confusing.

> +
> +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> +		ASSERT(next->free.n == 0);
> +
> +		next->free.n = flow->free.n - 1;
> +		next->free.next = flow->free.next;
> +	} else {
> +		/* Use the entire block */
> +		flow_first_free = flow->free.next;
> +	}

I wonder if we really have to keep track of the number of (non-)entries
in the free "block", and if we have to be explicit about the two cases.

I'm trying to find out if we can simplify the whole thing with slightly
different variants, for example:

  struct flow_free_block {
  	[...]
  	unsigned next_free;
  	unsigned next_busy;
  };

or:

  struct flow_free_block {
  	[...]
  	unsigned next_free;
  };

or even some sort of double-linked list:

  struct flow_free_block {
  	[...]
  	unsigned prev_free;
  	unsigned next_free;
  };

but I couldn't quite find a more elegant solution yet.
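Meanwhile, to convince myself the allocation path does what the comment
says, I put together a small standalone model of it. All the names here
are made up for the sketch, it's not the actual flow code:

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8u

/* Toy model of the free block scheme: the first slot of each free
 * block stores the block length and the index of the next free block
 * (TOY_MAX if there is none)
 */
struct toy_slot {
	int used;	/* 0: free, 1: in use */
	unsigned n;	/* block length (first slot of a block only) */
	unsigned next;	/* next free block (first slot of a block only) */
};

static struct toy_slot tab[TOY_MAX];
static unsigned first_free;

static void toy_init(void)
{
	/* Initial state: one free block covering the whole table */
	memset(tab, 0, sizeof(tab));
	first_free = 0;
	tab[0].n = TOY_MAX;
	tab[0].next = TOY_MAX;
}

/* Take the first slot of the first free block, as flow_alloc() does */
static int toy_alloc(void)
{
	unsigned idx = first_free;

	if (idx >= TOY_MAX)
		return -1;

	assert(tab[idx].n >= 1);

	if (tab[idx].n > 1) {
		/* Shrink the block: its second slot becomes the head */
		tab[idx + 1].n = tab[idx].n - 1;
		tab[idx + 1].next = tab[idx].next;
		first_free = idx + 1;
	} else {
		/* Block used up: jump to the next one */
		first_free = tab[idx].next;
	}

	tab[idx].used = 1;
	tab[idx].n = tab[idx].next = 0;
	return (int)idx;
}
```

Starting from a single block covering the whole table, this hands out
indices 0, 1, 2, ... until the table is full, matching what the old
flow_count++ used to do.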

> +
> +	memset(flow, 0, sizeof(*flow));
> +	return flow;
>  }
>  
>  /**
> @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
>   */
>  void flow_alloc_cancel(union flow *flow)
>  {
> -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> -	memset(flow, 0, sizeof(*flow));
> -	flow_count--;
> -}
> -
> -/**
> - * flow_table_compact() - Perform compaction on flow table
> - * @c:		Execution context
> - * @hole:	Pointer to recently closed flow
> - */
> -static void flow_table_compact(const struct ctx *c, union flow *hole)
> -{
> -	union flow *from;
> -
> -	if (FLOW_IDX(hole) == --flow_count) {
> -		debug("flow: table compaction: maximum index was %u (%p)",
> -		      FLOW_IDX(hole), (void *)hole);
> -		memset(hole, 0, sizeof(*hole));
> -		return;
> -	}
> -
> -	from = flowtab + flow_count;
> -	memcpy(hole, from, sizeof(*hole));
> -
> -	switch (from->f.type) {
> -	case FLOW_TCP:
> -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> -		break;
> -	case FLOW_TCP_SPLICE:
> -		tcp_splice_conn_update(c, &hole->tcp_splice);
> -		break;
> -	default:
> -		die("Unexpected %s in tcp_table_compact()",
> -		    FLOW_TYPE(&from->f));
> -	}
> -
> -	debug("flow: table compaction (%s): old index %u, new index %u, "
> -	      "from: %p, to: %p",
> -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> -	      (void *)from, (void *)hole);
> -
> -	memset(from, 0, sizeof(*from));
> +	ASSERT(flow_first_free > FLOW_IDX(flow));
> +
> +	flow->f.type = FLOW_TYPE_NONE;
> +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> +	 * flow_alloc()s steps.  This will get folded together the next time
> +	 * flow_defer_handler runs anyway() */
> +	flow->free.n = 1;
> +	flow->free.next = flow_first_free;
> +	flow_first_free = FLOW_IDX(flow);
>  }
>  
>  /**
> @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
>   */
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  {
> +	struct flow_free_block *free_head = NULL;
> +	unsigned *last_next = &flow_first_free;
>  	bool timer = false;
> -	union flow *flow;
> +	unsigned idx;
>  
>  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
>  		timer = true;
>  		flow_timer_run = *now;
>  	}
>  
> -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> +	for (idx = 0; idx < FLOW_MAX; idx++) {
> +		union flow *flow = &flowtab[idx];
>  		bool closed = false;
>  
> +		if (flow->f.type == FLOW_TYPE_NONE) {
> +			/* Start of a free block */
> +			free_head = &flow->free;
> +			*last_next = idx;
> +			last_next = &free_head->next;
> +			/* Skip the rest of the block */
> +			idx += free_head->n - 1;
> +			continue;
> +		}
> +
> +		

Stray tabs.

>  		switch (flow->f.type) {
> +		case FLOW_TYPE_NONE:
> +			closed = true;
> +			break;
>  		case FLOW_TCP:
>  			closed = tcp_flow_defer(flow);
>  			break;
> @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  			;
>  		}
>  
> -		if (closed)
> -			flow_table_compact(c, flow);
> +		if (closed) {

Should this be a flow_del() function perhaps, possibly taking flow and
free_head as argument?
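Something like this, perhaps? A standalone sketch of how I picture it,
with made-up names, and the last_next pointer threaded through as well,
just to check that the list stays sorted and coalesced:

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX 8u

/* Toy slot states: 0 free, 1 active, 2 closed (deferred handler
 * returned true)
 */
struct toy_slot {
	int state;
	unsigned n;	/* block length (first slot of a block only) */
	unsigned next;	/* next free block (first slot of a block only) */
};

static struct toy_slot tab[TOY_MAX];
static unsigned first_free;

/* Candidate toy_del(): grow the free block we are scanning through, or
 * start a new one and chain it after the previous block. The real
 * version could keep the patch's ASSERT that the slot is contiguous
 * with free_head.
 */
static void toy_del(unsigned idx, struct toy_slot **free_head,
		    unsigned **last_next)
{
	tab[idx].state = 0;

	if (*free_head) {
		(*free_head)->n++;
		tab[idx].n = tab[idx].next = 0;
	} else {
		*free_head = &tab[idx];
		tab[idx].n = 1;
		**last_next = idx;
		*last_next = &tab[idx].next;
	}
}

/* The scan, modelled on the new flow_defer_handler() loop */
static void toy_scan(void)
{
	struct toy_slot *free_head = NULL;
	unsigned *last_next = &first_free;
	unsigned idx;

	for (idx = 0; idx < TOY_MAX; idx++) {
		if (tab[idx].state == 0) {
			/* Existing free block: relink and skip over it */
			free_head = &tab[idx];
			*last_next = idx;
			last_next = &free_head->next;
			idx += free_head->n - 1;
			continue;
		}

		if (tab[idx].state == 2)
			toy_del(idx, &free_head, &last_next);
		else
			free_head = NULL;	/* active slot ends the block */
	}

	*last_next = TOY_MAX;
}
```

With, say, slots 1-2 and 6 closing while 4-5 are already a free block,
one scan leaves two blocks, 1-2 and 4-6, chained in index order.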

> +			flow->f.type = FLOW_TYPE_NONE;
> +
> +			if (free_head) {
> +				/* Add slot to current free block */
> +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> +				free_head->n++;
> +				flow->free.n = flow->free.next = 0;
> +			} else {
> +				/* Create new free block */
> +				free_head = &flow->free;
> +				free_head->n = 1;
> +				*last_next = idx;
> +				last_next = &free_head->next;
> +			}
> +		} else {
> +			free_head = NULL;
> +		}
>  	}
> +
> +	*last_next = FLOW_MAX;
> +}
> +
> +/**
> + * flow_init() - Initialise flow related data structures
> + */
> +void flow_init(void)
> +{
> +	/* Initial state is a single free block containing the whole table */
> +	flowtab[0].free.n = FLOW_MAX;
> +	flowtab[0].free.next = FLOW_MAX;
>  }
> diff --git a/flow.h b/flow.h
> index 8064f0e..48a0ab4 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
>  
>  union flow;
>  
> +void flow_init(void);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> diff --git a/flow_table.h b/flow_table.h
> index 2773a2b..34fc679 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -9,6 +9,19 @@
>  
>  #include "tcp_conn.h"
>  
> +/**
> + * struct flow_free_block - Information about a block of free entries
> + * @f:		Generic flow information
> + * @n:		Number of entries in this free block (including this one)
> + * @next:	Index of next free block
> + */
> +struct flow_free_block {
> +	/* Must be first element */
> +	struct flow_common f;
> +	unsigned n;
> +	unsigned next;
> +};
> +
>  /**
>   * union flow - Descriptor for a logical packet flow (e.g. connection)
>   * @f:		Fields common between all variants
> @@ -17,12 +30,13 @@
>  */
>  union flow {
>  	struct flow_common f;
> +	struct flow_free_block free;
>  	struct tcp_tap_conn tcp;
>  	struct tcp_splice_conn tcp_splice;
>  };
>  
>  /* Global Flow Table */
> -extern unsigned flow_count;
> +extern unsigned flow_first_free;
>  extern union flow flowtab[];
>  
>  
> diff --git a/passt.c b/passt.c
> index 71bea8f..d315438 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -285,6 +285,8 @@ int main(int argc, char **argv)
>  
>  	clock_gettime(CLOCK_MONOTONIC, &now);
>  
> +	flow_init();
> +
>  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
>  		exit(EXIT_FAILURE);
>  
> diff --git a/tcp.c b/tcp.c
> index a38deb0..6aacb56 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
>  	tc_hash[b] = FLOW_SIDX_NONE;
>  }
>  
> -/**
> - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> - * @c:		Execution context
> - * @old:	Old location of tcp_tap_conn
> - * @new:	New location of tcp_tap_conn
> - */
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new)
> -
> -{
> -	unsigned b = tcp_hash_probe(c, old);
> -
> -	if (!flow_at_sidx(tc_hash[b]))
> -		return; /* Not in hash table, nothing to update */
> -
> -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> -
> -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> -
> -	tcp_epoll_ctl(c, new);
> -}
> -
>  /**
>   * tcp_hash_lookup() - Look up connection given remote address and ports
>   * @c:		Execution context
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 636224e..a5f5cfe 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -155,9 +155,6 @@ struct tcp_splice_conn {
>  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
>  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
>  
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new);
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
>  bool tcp_flow_defer(union flow *flow);
>  bool tcp_splice_flow_defer(union flow *flow);
>  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 0711b19..7aae623 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	} while (0)
>  
>  
> -/**
> - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> - * @c:		Execution context
> - * @new:	New location of tcp_splice_conn
> - */
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> -{
> -	if (tcp_splice_epoll_ctl(c, new))
> -		conn_flag(c, new, CLOSING);
> -}
> -
>  /**
>   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
>   * @flow:	Flow table entry for this connection

-- 
Stefano

