From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Gibson
To: passt-dev@passt.top, Stefano Brivio
Cc: David Gibson
Subject: [PATCH v2 13/13] flow: Avoid moving flow entries to compact table
Date: Wed, 20 Dec 2023 18:09:08 +1100
Message-ID: <20231220070908.2506277-14-david@gibson.dropbear.id.au>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20231220070908.2506277-1-david@gibson.dropbear.id.au>
References: <20231220070908.2506277-1-david@gibson.dropbear.id.au>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id: Development discussion and patches for passt

Currently we always keep the flow table maximally compact: that is, all
the active entries are contiguous at the start of the table.  Doing this
sometimes requires moving an entry when one is freed.  That's kind of
fiddly, and potentially expensive: it requires updating the hash table
for the new location, and depending on flow type, it may require
EPOLL_CTL_MOD system calls to update epoll tags with the new location
too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of
non-compactness by cheaply skipping over contiguous blocks of free
entries.  See the "theory of operation" comment in flow.c for details.

Signed-off-by: David Gibson
---
 flow.c       | 151 ++++++++++++++++++++++++++++++++++++---------------
 flow.h       |   1 +
 flow_table.h |  20 ++++++-
 passt.c      |   2 +
 tcp.c        |  23 --------
 tcp_conn.h   |   3 -
 tcp_splice.c |  11 ----
 7 files changed, 128 insertions(+), 83 deletions(-)

diff --git a/flow.c b/flow.c
index f298bd0..a92864e 100644
--- a/flow.c
+++ b/flow.c
@@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
 	      "flow_type_str[] doesn't match enum flow_type");
 
 /* Global Flow Table */
-unsigned flow_count;
+unsigned flow_first_free;
 union flow flowtab[FLOW_MAX];
 
 /* Last time the flow timers ran */
@@ -49,6 +49,51 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
 
+/**
+ * DOC: Theory of Operation - allocation and freeing of flow entries
+ *
+ * Each flow takes a single slot in flowtab[].  Moving entries in that table
+ * (which we used to do) is fiddly and possibly expensive: it requires updating
+ * the hash table indexing flows, and may require updating epoll data which
+ * references the flow by index.  However, we also want to keep the active
+ * entries in the table compact where possible, because otherwise scanning
+ * through the entire table becomes more expensive.  This describes the
+ * compromise implemented below.
+ *
+ * Free blocks
+ *    A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE)
+ *    entries in flowtab.  The first entry in each block contains metadata,
+ *    specifically the number of entries in the block, and the index of the
+ *    next (non-contiguous) free block (struct flow_free_block).
+ *
+ * Free block list
+ *    flow_first_free gives the index of the first entry of the first (lowest
+ *    index) free block.  Each free block has the index of the next free
+ *    block, or FLOW_MAX if it is the last free block.  Together these form a
+ *    linked list of free blocks, in strictly increasing order of index.
+ *
+ * Allocation
+ *    When allocating a new flow, we always use the first entry of the first
+ *    free block, that is, at index flow_first_free.  If the block has more
+ *    than one entry, flow_first_free is updated to the next entry, which is
+ *    updated to represent the new smaller free block.  Otherwise the free
+ *    block is eliminated and flow_first_free is updated to the next free
+ *    block.
+ *
+ * Scanning the table
+ *    Theoretically, scanning the table requires FLOW_MAX iterations.
+ *    However, when we encounter the start of a free block, we can
+ *    immediately skip to its end, meaning that in practice we only need
+ *    (number of active connections) + (number of free blocks) iterations.
+ *
+ * Freeing
+ *    We can only free entries when scanning the whole flow table in
+ *    flow_defer_handler().  This is what lets us maintain the free block
+ *    list in index-sorted order.  As we scan we keep track of whether the
+ *    previous entry was in a free block or not.  If so, when an entry is
+ *    freed (its deferred handler returns 'true'), we add it to that free
+ *    block.  Otherwise we create a new free block for it and chain it to the
+ *    last free block we scanned.
+ */
 
 /**
  * flow_alloc_commit() - Mark a new flow as used
@@ -61,49 +106,27 @@ union flow *flow_alloc_commit(const union flow *pflow, enum flow_type type)
 {
 	union flow *flow = (union flow *)pflow;
 
-	ASSERT(FLOW_IDX(flow) == flow_count);
-	flow_count++;
-	flow->f.type = type;
-	return flow;
-}
+	ASSERT(FLOW_IDX(flow) == flow_first_free);
+	ASSERT(flow->f.type == FLOW_TYPE_NONE);
+	ASSERT(flow->free.n >= 1);
 
-/**
- * flow_table_compact() - Perform compaction on flow table
- * @c:		Execution context
- * @hole:	Pointer to recently closed flow
- */
-static void flow_table_compact(const struct ctx *c, union flow *hole)
-{
-	union flow *from;
+	if (flow->free.n > 1) {
+		/* Use one entry from the block */
+		union flow *next = &flowtab[++flow_first_free];
 
-	if (FLOW_IDX(hole) == --flow_count) {
-		debug("flow: table compaction: maximum index was %u (%p)",
-		      FLOW_IDX(hole), (void *)hole);
-		memset(hole, 0, sizeof(*hole));
-		return;
-	}
+		ASSERT(FLOW_IDX(next) < FLOW_MAX);
+		ASSERT(next->f.type == FLOW_TYPE_NONE);
+		ASSERT(next->free.n == 0);
 
-	from = flowtab + flow_count;
-	memcpy(hole, from, sizeof(*hole));
-
-	switch (from->f.type) {
-	case FLOW_TCP:
-		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
-		break;
-	case FLOW_TCP_SPLICE:
-		tcp_splice_conn_update(c, &hole->tcp_splice);
-		break;
-	default:
-		die("Unexpected %s in tcp_table_compact()",
-		    FLOW_TYPE(&from->f));
+		next->free.n = flow->free.n - 1;
+		next->free.next = flow->free.next;
+	} else {
+		/* Use the entire block */
+		flow_first_free = flow->free.next;
 	}
 
-	debug("flow: table compaction (%s): old index %u, new index %u, "
-	      "from: %p, to: %p",
-	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
-	      (void *)from, (void *)hole);
-
-	memset(from, 0, sizeof(*from));
+	flow->f.type = type;
+	return flow;
 }
 
 /**
@@ -113,18 +136,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
  */
 void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
+	struct flow_free_block *free_head = NULL;
+	unsigned *last_next = &flow_first_free;
 	bool timer = false;
-	union flow *flow;
+	unsigned idx;
 
 	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
 		timer = true;
 		flow_timer_run = *now;
 	}
 
-	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+	for (idx = 0; idx < FLOW_MAX; idx++) {
+		union flow *flow = &flowtab[idx];
 		bool closed = false;
 
+		if (flow->f.type == FLOW_TYPE_NONE) {
+			/* Start of a free block */
+			free_head = &flow->free;
+			*last_next = idx;
+			last_next = &free_head->next;
+			/* Skip the rest of the block */
+			idx += free_head->n - 1;
+			continue;
+		}
+
+		switch (flow->f.type) {
+		case FLOW_TYPE_NONE:
+			closed = true;
+			break;
 		case FLOW_TCP:
 			closed = tcp_flow_defer(flow);
 			break;
@@ -138,7 +178,32 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			;
 		}
 
-		if (closed)
-			flow_table_compact(c, flow);
+		if (closed) {
+			memset(flow, 0, sizeof(*flow));
+
+			if (free_head) {
+				/* Add slot to current free block */
+				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
+				free_head->n++;
+			} else {
+				/* Create new free block */
+				free_head = &flow->free;
+				free_head->n = 1;
+				*last_next = idx;
+				last_next = &free_head->next;
+			}
+		} else {
+			free_head = NULL;
+		}
 	}
 }
+
+/**
+ * flow_init() - Initialise flow related data structures
+ */
+void flow_init(void)
+{
+	/* Initial state is a single free block containing the whole table */
+	flowtab[0].free.n = FLOW_MAX;
+	flowtab[0].free.next = FLOW_MAX;
+}
diff --git a/flow.h b/flow.h
index 8064f0e..48a0ab4 100644
--- a/flow.h
+++ b/flow.h
@@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
+void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/flow_table.h b/flow_table.h
index 3236288..a885c2f 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -9,6 +9,19 @@
 
 #include "tcp_conn.h"
 
+/**
+ * struct flow_free_block - Information about a block of free entries
+ * @f:		Generic flow information
+ * @n:		Number of entries in this free block (including this one)
+ * @next:	Index of next free block
+ */
+struct flow_free_block {
+	/* Must be first element */
+	struct flow_common f;
+	unsigned n;
+	unsigned next;
+};
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
  * @f:	Fields common between all variants
@@ -17,12 +30,13 @@
  */
 union flow {
 	struct flow_common f;
+	struct flow_free_block free;
 	struct tcp_tap_conn tcp;
 	struct tcp_splice_conn tcp_splice;
 };
 
 /* Global Flow Table */
-extern unsigned flow_count;
+extern unsigned flow_first_free;
 extern union flow flowtab[];
 
@@ -98,9 +112,9 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
  */
 static inline const union flow *flow_prealloc(void)
 {
-	if (flow_count >= FLOW_MAX)
+	if (flow_first_free >= FLOW_MAX)
 		return NULL;
-	return &flowtab[flow_count];
+	return &flowtab[flow_first_free];
 }
 
 union flow *flow_alloc_commit(const union flow *pflow, enum flow_type type);
diff --git a/passt.c b/passt.c
index 71bea8f..d315438 100644
--- a/passt.c
+++ b/passt.c
@@ -285,6 +285,8 @@ int main(int argc, char **argv)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
+	flow_init();
+
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
 		exit(EXIT_FAILURE);
 
diff --git a/tcp.c b/tcp.c
index 0071654..54d0752 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
 	tc_hash[b] = FLOW_SIDX_NONE;
 }
 
-/**
- * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
- * @c:		Execution context
- * @old:	Old location of tcp_tap_conn
- * @new:	New location of tcp_tap_conn
- */
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new)
-
-{
-	unsigned b = tcp_hash_probe(c, old);
-
-	if (!flow_at_sidx(tc_hash[b]))
-		return; /* Not in hash table, nothing to update */
-
-	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
-
-	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
-	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
-
-	tcp_epoll_ctl(c, new);
-}
-
 /**
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
diff --git a/tcp_conn.h b/tcp_conn.h
index 636224e..a5f5cfe 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -155,9 +155,6 @@ struct tcp_splice_conn {
 extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new);
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 bool tcp_flow_defer(union flow *flow);
 bool tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
diff --git a/tcp_splice.c b/tcp_splice.c
index 571228f..5c751f1 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	} while (0)
 
-/**
- * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
- * @c:		Execution context
- * @new:	New location of tcp_splice_conn
- */
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
-{
-	if (tcp_splice_epoll_ctl(c, new))
-		conn_flag(c, new, CLOSING);
-}
-
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @flow:	Flow table entry for this connection
-- 
2.43.0