public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top
Subject: Re: [PATCH v5 02/10] fwd: Added cache table for ARP/NDP contents
Date: Mon, 8 Sep 2025 12:42:31 +1000	[thread overview]
Message-ID: <aL5Cl10oxs2g0R4C@zatzit> (raw)
In-Reply-To: <20250906021154.2760611-3-jmaloy@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 11370 bytes --]

On Fri, Sep 05, 2025 at 10:11:46PM -0400, Jon Maloy wrote:
> We add a cache table to keep partial contents of the kernel ARP/NDP
> tables. This way, we drastically reduce the number of netlink calls
> to read those tables.
> 
> We create placeholder cache entries representing non- or not-yet-
> existing ARP/NDP entries when needed. We add a short expiration time
> to each such entry, so that we can know when to make repeated calls to
> the kernel tables in the beginning. We also add an access counter to the
> entries, to ensure that the timer becomes longer and the call frequency
> abates over time if no ARP/NDP entry shows up.
> 
> For regular entries we use a much longer timer, with the purpose to
> update the entry in the rare case that a remote host changes its
> MAC address.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> 
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> ---
>  conf.c    |   2 +
>  fwd.c     | 206 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fwd.h     |  11 +++
>  netlink.c |   2 -
>  4 files changed, 219 insertions(+), 2 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index f47f48e..11d9d63 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  		c->udp.fwd_out.mode = fwd_default;
>  
>  	fwd_scan_ports_init(c);
> +	if (fwd_mac_cache_init())
> +		die("Failed to initiate neighbour MAC cache");
>  
>  	if (!c->quiet)
>  		conf_print(c);
> diff --git a/fwd.c b/fwd.c
> index 250cf56..236e58e 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -19,6 +19,8 @@
>  #include <sched.h>
>  #include <unistd.h>
>  #include <stdio.h>
> +#include <time.h>
> +#include <netinet/if_ether.h>
>  
>  #include "util.h"
>  #include "ip.h"
> @@ -26,6 +28,8 @@
>  #include "passt.h"
>  #include "lineread.h"
>  #include "flow_table.h"
> +#include "inany.h"
> +#include "netlink.h"
>  
>  /* Empheral port range: values from RFC 6335 */
>  static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> @@ -33,6 +37,208 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
>  
>  #define PORT_RANGE_SYSCTL	"/proc/sys/net/ipv4/ip_local_port_range"
>  
> +#define MAC_CACHE_BUCKETS 1024  /* Must be power of two */
> +#define MAC_CACHE_RENEWAL 3600  /* Refresh entry from ARP/NDP every hour */
> +
> +/* Partial cache of ARP/NDP table contents */
> +struct mac_cache_entry {
> +	union inany_addr key;
> +	unsigned char mac[ETH_ALEN];
> +	struct timespec expiry;
> +	uint32_t count;
> +
> +	/* Hash bucket chain */
> +	struct mac_cache_entry *next;

Given we can't allocate (see below), you may be better off with
probing rather than explicit hash chains.  That's what we do to look
up the flow table.

Or... we might even be ok with a fixed number of entries per bucket.
The chance of being on a network segment that causes a bucket overflow
might be low enough to ignore that possibility.


> +};
> +
> +struct mac_cache_table {
> +	struct mac_cache_entry **buckets;
> +	size_t nbuckets;
> +	size_t size;
> +};
> +
> +static struct mac_cache_table mac_cache;
> +const uint8_t undefined_mac[6] = {0, 0, 0, 0, 0, 0};
> +
> +/**
> + * fwd_mac_cache_bucket_idx() - Find the table index of an entry
> + * @c:		Execution context
> + * @key:	IPv4 or IPv6 address, used as key for the hash lookup
> + * @nbuckets:	Number of buckets in the table
> + *
> + * Return: The index found
> + */
> +static inline size_t fwd_mac_cache_bucket_idx(const struct ctx *c,
> +					      const union inany_addr *key,
> +					      size_t nbuckets)
> +{
> +	struct siphash_state st = SIPHASH_INIT(c->hash_secret);
> +	uint32_t h;
> +
> +	inany_siphash_feed(&st, key);
> +	h = siphash_final(&st, sizeof(*key), 0);
> +
> +	return ((size_t)h) & (nbuckets - 1);
> +}
> +
> +/**
> + * timespec_before() - Check the relation between two pints in time
> + * @a:		Point in time to be tested
> + * @b:		Point in time test a against
> + * Return: True if a comes before b, otherwise b
> + */
> +static inline bool timespec_before(const struct timespec *a,
> +				   const struct timespec *b)

This probably belongs in util.c, with timespec_diff_us() and friends.

> +{
> +	return (a->tv_sec < b->tv_sec) ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +/**
> + * mac_entry_placeholder() - Check if a cache entry is a placeholder

I don't think "placeholder" is a great name for negative cache
entries.  It suggests that at some later point it will be replaced
with a real entry, but (at least if we're doing the lookup on first
reply thing) there's no particular reason to expect that to be the
case.

> + * @e:		Cache entry
> + *
> + * Return: True if the entry is a placeholder, false otherwise
> + */
> +bool mac_entry_placeholder(const struct mac_cache_entry *e)
> +{
> +	return mac_undefined(e->mac);
> +}
> +
> +/**
> + * mac_entry_expired() - Check if a cache entry has expired
> + * @e:		Cache entry
> + *
> + * Return: True if the entry has expired, false otherwise
> + */
> +static bool mac_entry_expired(const struct mac_cache_entry *e)
> +{
> +	struct timespec now;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &now);

Mostly we try to keep to a single clock_gettime() call per epoll
cycle, passing 'now' down to the things we call there.

> +	return timespec_before(&e->expiry, &now);
> +}
> +
> +/**
> + * mac_entry_set_expiry() - Set the time for a cache entry to expire
> + * @e:		Cache entry
> + * @expiry:	Expiration time, in seconds from current moment.
> + *
> + * Return: The result of the hash

Copypasta error?

> + */
> +static void mac_entry_set_expiry(struct mac_cache_entry *e, int expiry)
> +{
> +	clock_gettime(CLOCK_MONOTONIC, &e->expiry);
> +	e->expiry.tv_sec += expiry;
> +}
> +
> +/**
> + * fwd_mac_cache_find() - Find an entry in the ARP/NDP cache table
> + * @c:		Execution context
> + * @key:	IPv4 or IPv6 address, used as key for the hash lookup
> + *
> + * Return: Pointer to the entry on success, NULL on failure.
> + */
> +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c,
> +						  const union inany_addr *key)
> +{
> +	const struct mac_cache_table *t = &mac_cache;
> +	struct mac_cache_entry *e;
> +	size_t idx;
> +
> +	idx = fwd_mac_cache_bucket_idx(c, key, t->nbuckets);
> +	for (e = t->buckets[idx]; e; e = e->next)
> +		if (inany_equals(&e->key, key))
> +			return e;
> +	return NULL;
> +}
> +
> +/**
> + * fwd_mac_cache_add() - Add a new entry to the ARP/NDP cache table
> + * @c:		Execution context
> + * @key:	IPv4 or IPv6 address, used as key for the hash lookup
> + * @mac:	Buffer for Ethernet MAC, left unchanged if not found/usable

This doesn't seem correct.  Also, the only caller passes uninitialized
memory here to fill in later, so there doesn't seem any point to this
parameter.

> + *
> + * Return: Pointer to the new entry on success, NULL on failure.
> + */
> +static struct mac_cache_entry *fwd_mac_cache_add(const struct ctx *c,
> +						 const union inany_addr *key,
> +						 const unsigned char *mac)
> +{
> +	struct mac_cache_table *t = &mac_cache;
> +	size_t idx = fwd_mac_cache_bucket_idx(c, key, t->nbuckets);
> +	struct mac_cache_entry *e;
> +
> +	e = calloc(1, sizeof(*e));

No memory allocation allowed in pasta, sorry.

> +	if (!e)
> +		return NULL;
> +
> +	e->key = *key;
> +	memcpy(e->mac, mac, ETH_ALEN);
> +	e->count = 0;
> +	e->next = t->buckets[idx];
> +	t->buckets[idx] = e;
> +
> +	return e;
> +}
> +
> +/**
> + * fwd_neigh_mac_get() - Find a MAC address the ARP/NDP cache table
> + * @c:		Execution context
> + * @addr:	IPv4 or IPv6 address
> + * @ifi:	Interface index
> + * @mac:	Buffer for Ethernet MAC to return, found or default.
> + *
> + * Return: true if real MAC found, false if not found or failure
> + */
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> +		       uint8_t *mac)
> +{
> +	struct mac_cache_entry *e = fwd_mac_cache_find(c, addr);
> +	int ifi = inany_v4(addr) ? c->ifi4 : c->ifi6;
> +	bool refresh = false;
> +	bool ret = false;
> +
> +	if (e)
> +		refresh = mac_entry_expired(e);
> +	else if ((e = fwd_mac_cache_add(c, addr, mac)))

As noted, this is inserting an uninitialized value as mac.  We'll
overwrite it, but it seems unnecessarily counterintuitive.

> +		refresh = true;
> +	else
> +		return false;
> +
> +	if (!refresh) {
> +		ret = !mac_entry_placeholder(e);
> +	} else {
> +		ret = nl_neigh_mac_get(nl_sock, addr, ifi, e->mac);

If this retursn false (no neighbour found), we never actually
initialise e->mac to 00:00:00:00:00:00 as it looks like was intended.

> +		mac_entry_set_expiry(e, MAC_CACHE_RENEWAL);
> +	}
> +
> +	if (ret) {
> +		memcpy(mac, e->mac, ETH_ALEN);
> +		return true;
> +	}
> +
> +	/* Do linear back-off of new netlink calls if nothing found */
> +	mac_entry_set_expiry(e, e->count++);

As noted above, I'm not sure there's any real reason to treat expiry
of negative entries differently from positive entries.

> +	memcpy(mac, c->our_tap_mac, ETH_ALEN);
> +	return false;
> +}
> +
> +/**
> + * fwd_mac_cache_init() - Initiate ARP/NDP cache table
> + *
> + * Return: 0 on success, -1 on failure.
> + */
> +int fwd_mac_cache_init(void)
> +{
> +	struct mac_cache_table *t = &mac_cache;
> +
> +	t->nbuckets = MAC_CACHE_BUCKETS;
> +	t->buckets = calloc(t->nbuckets, sizeof(*t->buckets));

Nope.

> +	t->size = 0;
> +	return t->buckets ? 0 : -1;
> +}
> +
>  /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
>   *
>   * Work out what ports the host thinks are emphemeral and record it for later
> diff --git a/fwd.h b/fwd.h
> index 65c7c96..80da4b1 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -42,6 +42,13 @@ struct fwd_ports {
>  	in_port_t delta[NUM_PORTS];
>  };
>  
> +extern const unsigned char undefined_mac[];
> +
> +static inline bool mac_undefined(const uint8_t *mac)
> +{
> +	return !memcmp(mac, undefined_mac, 6);
> +}
> +
>  void fwd_scan_ports_tcp(struct fwd_ports *fwd, const struct fwd_ports *rev);
>  void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
>  			const struct fwd_ports *tcp_fwd,
> @@ -57,4 +64,8 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
>  uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>  			  const struct flowside *ini, struct flowside *tgt);
>  
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> +		       uint8_t *mac);
> +int fwd_mac_cache_init(void);
> +
>  #endif /* FWD_H */
> diff --git a/netlink.c b/netlink.c
> index 1ca2c9a..3ba0597 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -878,8 +878,6 @@ bool nl_neigh_mac_get(int s, const union inany_addr *addr,
>  			}
>  		}
>  	}
> -	if (status < 0)
> -		warn("netlink: RTM_NEWNEIGH failed: %s", strerror_(-status));

Should this change be folded into the first patch?

>  	return found;
>  }
> -- 
> 2.50.1
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-09-08  3:09 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-06  2:11 [PATCH v5 00/10] Use true MAC address of LAN local remote hosts Jon Maloy
2025-09-06  2:11 ` [PATCH v5 01/10] netlink: add function to extract MAC addresses from NDP/ARP table Jon Maloy
2025-09-08  2:12   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 02/10] fwd: Added cache table for ARP/NDP contents Jon Maloy
2025-09-08  2:42   ` David Gibson [this message]
2025-09-09 15:02     ` Jon Maloy
2025-09-10  1:49       ` David Gibson
2025-09-08  9:57   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 03/10] fwd: Add entries of ARP/NDP cache table to a FIFO/LRU queue Jon Maloy
2025-09-08  2:51   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 04/10] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
2025-09-08  3:04   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 05/10] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
2025-09-08  3:07   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 06/10] udp: forward external source MAC address through tap interface Jon Maloy
2025-09-08  3:13   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 07/10] tcp: " Jon Maloy
2025-09-08  3:18   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 08/10] tap: change signature of function tap_push_l2h() Jon Maloy
2025-09-08  3:21   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 09/10] tcp: make tcp_rst_no_conn() respond with correct MAC address Jon Maloy
2025-09-08  3:29   ` David Gibson
2025-09-06  2:11 ` [PATCH v5 10/10] icmp: let icmp use mac address from flowside structure Jon Maloy
2025-09-08  3:35   ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aL5Cl10oxs2g0R4C@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).