From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202508 header.b=S7rXFuYr; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 8429A5A027B for ; Thu, 21 Aug 2025 04:04:01 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202508; t=1755741833; bh=ogV4RmRsQeeia4VSDNwf1fd8o9peSDxAAV8i+40qiac=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=S7rXFuYrWNTgCNIGRrvIRtjMjTrFkXyFF65A4bPYopWDV+Ujdp0HnwY62GEs58rex YowIlIQtKq67vsWoUQF1IPjeyd3rb0PlQ+w6bFJZeby77J87bJBwDViNKaOJuFVJxL zXaXmVNNazspmJJ0G9gsHtfUqrQMBToIqJcG0xjaGm21RCSej8AhzbIWgxqq1U1N8U knhgemkM9qqatVpgnv1PzhVO98ofNfOe0JdzmV2Jqy3KO8cnPw7VNOsU2/AIu3+61s dBv9GgXgYbS9WqQPw5QtBwhGyJhwZZiOIwfNiVKVt26rPHp6LysnkpN/eK9KGB3hXd uWgwXc6tPT2MQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4c6mp12kpKz4xSQ; Thu, 21 Aug 2025 12:03:53 +1000 (AEST) Date: Thu, 21 Aug 2025 12:03:47 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v4 9/9] fwd: Added cache table for ARP/NDP contents Message-ID: References: <20250820031005.2725591-1-jmaloy@redhat.com> <20250820031005.2725591-10-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="+Ylcj9ExOV+liWoX" Content-Disposition: inline In-Reply-To: <20250820031005.2725591-10-jmaloy@redhat.com> Message-ID-Hash: 5PAEFN7YJE7TTQAZI47OLZA7P7OXNZZV X-Message-ID-Hash: 5PAEFN7YJE7TTQAZI47OLZA7P7OXNZZV X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --+Ylcj9ExOV+liWoX Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Aug 19, 2025 at 11:10:05PM -0400, Jon Maloy wrote: > We add a cache table to keep partial contents of the kernel ARP/NDP > tables. This way, we drastically reduce the number of netlink calls > to read those tables. Do you have data to suggest this is necessary? It's a lot of code to optimise something only needed for some pretty uncommon cases. > We create dummy cache entries representing non-(not-yet)-existing > ARP/NDP entries when needed. We add a short expiration time to each > such entry, so that we can know when to make repeated calls to the > kernel tables in the beginning. We also add an access counter to the > entries, to ensure that the timer becomes longer and the call frequency > abates over time if no ARP/NDP entry is created. >=20 > For regular entries we use a much longer timer, with the purpose to > update the entry in the rare case that a remote host changes its > MAC address. >=20 > Signed-off-by: Jon Maloy > --- > arp.c | 3 +- > conf.c | 2 + > flow.c | 5 +- > fwd.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > fwd.h | 4 ++ > ndp.c | 3 +- > tcp.c | 3 +- > 7 files changed, 218 insertions(+), 8 deletions(-) >=20 > diff --git a/arp.c b/arp.c > index c37867a..040d4fe 100644 > --- a/arp.c > +++ b/arp.c > @@ -29,7 +29,6 @@ > #include "dhcp.h" > #include "passt.h" > #include "tap.h" > -#include "netlink.h" > =20 > /** > * arp() - Check if this is a supported ARP message, reply as needed > @@ -79,7 +78,7 @@ int arp(const struct ctx *c, const struct pool *p) > */ > inany_from_af(&tgt, AF_INET, am->tip); > if (!fwd_inany_nat(c, &tgt)) > - nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha); > + fwd_neigh_mac_get(c, &tgt, c->ifi4, am->sha); > =20 > memcpy(swap, am->tip, sizeof(am->tip)); > memcpy(am->tip, am->sip, sizeof(am->tip)); > diff --git a/conf.c b/conf.c > index f47f48e..0abdbf7 100644 > --- a/conf.c > +++ b/conf.c > @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv) > c->udp.fwd_out.mode =3D fwd_default; > =20 > fwd_scan_ports_init(c); > + if (fwd_mac_cache_init()) > + die("Failed to initiate neighnor MAC cache"); "neighnor" > if (!c->quiet) > conf_print(c); > diff --git a/flow.c b/flow.c > index d7b3fd1..3ca30f1 100644 > --- a/flow.c > +++ b/flow.c > @@ -459,7 +459,7 @@ struct flowside *flow_target(const struct ctx *c, uni= on flow *flow, > */ > ifi =3D inany_v4(&ini->oaddr) ? c->ifi4 : c->ifi6; > if (!fwd_inany_nat(c, &ini->oaddr)) > - nl_neigh_mac_get(nl_sock, &ini->oaddr, ifi, f->tap_omac); > + fwd_neigh_mac_get(c, &ini->oaddr, ifi, f->tap_omac); > break; > =20 > case PIF_SPLICE: > @@ -474,7 +474,7 @@ struct flowside *flow_target(const struct ctx *c, uni= on flow *flow, > */ > ifi =3D inany_v4(&ini->eaddr) ? c->ifi4 : c->ifi6; > if (!fwd_inany_nat(c, &ini->eaddr)) > - nl_neigh_mac_get(nl_sock, &ini->eaddr, ifi, f->tap_omac); > + fwd_neigh_mac_get(c, &ini->eaddr, ifi, f->tap_omac); > break; > =20 > default: > @@ -491,6 +491,7 @@ struct flowside *flow_target(const struct ctx *c, uni= on flow *flow, > =20 > f->pif[TGTSIDE] =3D tgtpif; > flow_set_state(f, FLOW_STATE_TGT); > + > return tgt; > } > =20 > diff --git a/fwd.c b/fwd.c > index 55bf5f2..6647764 100644 > --- a/fwd.c > +++ b/fwd.c > @@ -19,6 +19,8 @@ > #include > #include > #include > +#include > +#include > =20 > #include "util.h" > #include "ip.h" > @@ -26,6 +28,8 @@ > #include "passt.h" > #include "lineread.h" > #include "flow_table.h" > +#include "inany.h" > +#include "netlink.h" > =20 > /* Empheral port range: values from RFC 6335 */ > static in_port_t fwd_ephemeral_min =3D (1 << 15) + (1 << 14); > @@ -33,6 +37,208 @@ static in_port_t fwd_ephemeral_max =3D NUM_PORTS - 1; > =20 > #define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range" > =20 > +#define MAC_CACHE_BUCKETS 1024 > +#define MAC_CACHE_RENEWAL 3600 /* Refresh entry from ARP/NDP every hour= */ > + > +/* Partial cache of ARP/NDP table contents */ > +struct mac_cache_entry { > + union inany_addr key; > + unsigned char mac[ETH_ALEN]; > + struct timespec expiry; > + uint32_t count; > + struct mac_cache_entry *next; > +}; > + > +struct mac_cache_table { > + struct mac_cache_entry **buckets; > + size_t nbuckets; > +}; > + > +static struct mac_cache_table mac_cache; > + > +/** > + * timespec_before() - Check the relation between two pints in time > + * @a: Point in time to be tested > + * @b: Point in time test a against > + * Return: True if a comes before b, otherwise b > + */ > +static inline bool timespec_before(const struct timespec *a, > + const struct timespec *b) > +{ > + return (a->tv_sec < b->tv_sec) || > + (a->tv_sec =3D=3D b->tv_sec && a->tv_nsec < b->tv_nsec); > +} > + > +/** > + * mac_entry_is_dummy() - Check if a cache entry is a placeholder > + * @c: Execution context > + * @e: Cache entry > + * > + * Return: True if the entry is a placeholder, false otherwise > + */ > +bool mac_entry_is_dummy(const struct ctx *c, const struct mac_cache_entr= y *e) > +{ > + return !memcmp(c->our_tap_mac, e->mac, ETH_ALEN); > +} > + > +/** > + * mac_entry_expired() - Check if a cache entry has expired > + * @e: Cache entry > + * > + * Return: True if the entry has expired, false otherwise > + */ > +static bool mac_entry_expired(const struct mac_cache_entry *e) > +{ > + struct timespec now; > + > + clock_gettime(CLOCK_MONOTONIC, &now); > + return timespec_before(&e->expiry, &now); > +} > + > +/** > + * mac_entry_set_expiry() - Set the time for a cache entry to expire > + * @e: Cache entry > + * @expiry: Expiration time, in seconds from current moment. > + * > + * Return: The result of the hash > + */ > +static void mac_entry_set_expiry(struct mac_cache_entry *e, int expiry) > +{ > + clock_gettime(CLOCK_MONOTONIC, &e->expiry); > + e->expiry.tv_sec +=3D expiry; > +} > + > +/** > + * inany_hash32() - Hash the contenst of key into an integer > + * @key: IPv4 or IPv6 address, used as key > + * > + * Return: The result of the hash > + */ > +static inline uint32_t inany_hash32(const struct ctx *c, > + const union inany_addr *key) > +{ > + struct siphash_state st =3D SIPHASH_INIT(c->hash_secret); > + > + inany_siphash_feed(&st, key); > + > + return (uint32_t)siphash_final(&st, sizeof(*key), 0); > +} > + > +/** > + * fwd_mac_cache_bucket_idx() - Find the table index of an entry > + * @c: Execution context > + * @key: IPv4 or IPv6 address, used as key for the hash lookup > + * @nbuckets: Number of buckets in the table > + * > + * Return: The index found > + */ > +static inline size_t fwd_mac_cache_bucket_idx(const struct ctx *c, > + const union inany_addr *key, > + size_t nbuckets) > +{ > + uint32_t h =3D inany_hash32(c, key); > + > + return (nbuckets & (nbuckets - 1)) ? (h % nbuckets) : (h & (nbuckets - = 1)); I don't think this is worth the complexity: I wouldn't be confident the conditional branch is less expensive than an integer divmod. And if it is, it wouldn't suprise me if the compiler did that optimisation itself. > +} > + > +/** > + * fwd_mac_cache_find() - Find an entry in the ARP/NDP cache table > + * @c: Execution context > + * @key: IPv4 or IPv6 address, used as key for the hash lookup > + * > + * Return: Pointer to the entry on success, NULL on failure. > + */ > +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c, > + const union inany_addr *key) > +{ > + const struct mac_cache_table *t =3D &mac_cache; > + struct mac_cache_entry *e; > + size_t idx; > + > + idx =3D fwd_mac_cache_bucket_idx(c, key, t->nbuckets); > + for (e =3D t->buckets[idx]; e; e =3D e->next) > + if (inany_equals(&e->key, key)) > + return e; > + return NULL; > +} > + > +/** > + * fwd_mac_cache_add() - Add a new entry to the ARP/NDP cache table > + * @c: Execution context > + * @key: IPv4 or IPv6 address, used as key for the hash lookup > + * @mac: Buffer for Ethernet MAC, left unchanged if not found/usable > + * > + * Return: Pointer to the new entry on success, NULL on failure. > + */ > +static struct mac_cache_entry *fwd_mac_cache_add(const struct ctx *c, > + const union inany_addr *key, > + const unsigned char *mac) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + size_t idx =3D fwd_mac_cache_bucket_idx(c, key, t->nbuckets); > + struct mac_cache_entry *e; > + > + e =3D calloc(1, sizeof(*e)); > + if (!e) > + return NULL; > + e->key =3D *key; > + memcpy(e->mac, mac, ETH_ALEN); > + e->count =3D 0; > + e->next =3D t->buckets[idx]; > + t->buckets[idx] =3D e; > + > + return e; > +} > + > +/** > + * fwd_neigh_mac_get() - Find a MAC address the ARP/NDP cache table > + * @c: Execution context > + * @addr: IPv4 or IPv6 address > + * @ifi: Interface index > + * @mac: Buffer for Ethernet MAC, left unchanged if not found/usable > + * > + * Return: 0 on success, -1 on failure. > + */ > +int fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, > + int ifi, unsigned char *mac) > +{ > + struct mac_cache_entry *e =3D fwd_mac_cache_find(c, addr); > + bool refresh =3D false; > + > + if (e) > + refresh =3D mac_entry_expired(e); > + else if ((e =3D fwd_mac_cache_add(c, addr, mac))) > + refresh =3D true; > + else > + return -1; > + > + if (refresh) { > + nl_neigh_mac_get(nl_sock, addr, ifi, e->mac); > + if (mac_entry_is_dummy(c, e)) > + mac_entry_set_expiry(e, e->count++); > + else > + mac_entry_set_expiry(e, MAC_CACHE_RENEWAL); > + } > + memcpy(mac, e->mac, ETH_ALEN); > + > + return 0; > +} > + > +/** > + * fwd_mac_cache_init() - Initiate ARP/NDP cache table > + * > + * Return: 0 on success, -1 on failure. > + */ > +int fwd_mac_cache_init(void) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + > + t->nbuckets =3D MAC_CACHE_BUCKETS; > + t->buckets =3D calloc(t->nbuckets, sizeof(*t->buckets)); > + > + return t->buckets ? 0 : -1; > +} > + > /** fwd_probe_ephemeral() - Determine what ports this host considers eph= emeral > * > * Work out what ports the host thinks are emphemeral and record it for = later > diff --git a/fwd.h b/fwd.h > index c8d485d..ebfc114 100644 > --- a/fwd.h > +++ b/fwd.h > @@ -58,4 +58,8 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t = proto, > const struct flowside *ini, struct flowside *tgt); > bool fwd_inany_nat(const struct ctx *c, const union inany_addr *addr); > =20 > +int fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, > + int ifi, unsigned char *mac); > +int fwd_mac_cache_init(void); > + > #endif /* FWD_H */ > diff --git a/ndp.c b/ndp.c > index 19b9b28..af957c9 100644 > --- a/ndp.c > +++ b/ndp.c > @@ -32,7 +32,6 @@ > #include "passt.h" > #include "tap.h" > #include "log.h" > -#include "netlink.h" > =20 > #define RT_LIFETIME 65535 > =20 > @@ -222,7 +221,7 @@ static void ndp_na(const struct ctx *c, const struct = in6_addr *dst, > */ > inany_from_af(&tgt, AF_INET6, addr); > if (!fwd_inany_nat(c, &tgt)) > - nl_neigh_mac_get(nl_sock, &tgt, c->ifi6, na.target_l2_addr.mac); > + fwd_neigh_mac_get(c, &tgt, c->ifi6, na.target_l2_addr.mac); > =20 > ndp_send(c, dst, &na, sizeof(na)); > } > diff --git a/tcp.c b/tcp.c > index 28f9ef5..6b5b4ff 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -309,7 +309,6 @@ > #include "tcp_internal.h" > #include "tcp_buf.h" > #include "tcp_vu.h" > -#include "netlink.h" > =20 > #ifndef __USE_MISC > /* From Linux UAPI, missing in netinet/tcp.h provided by musl */ > @@ -1908,7 +1907,7 @@ static void tcp_rst_no_conn(const struct ctx *c, in= t af, > memcpy(src_mac, c->our_tap_mac, ETH_ALEN); > inany_from_af(&tgt, af, daddr); > if (!fwd_inany_nat(c, &tgt)) > - nl_neigh_mac_get(nl_sock, &tgt, ifi, src_mac); > + fwd_neigh_mac_get(c, &tgt, ifi, src_mac); > =20 > if (af =3D=3D AF_INET) { > struct iphdr *ip4h =3D tap_push_l2h(c, buf, src_mac, ETH_P_IP); --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --+Ylcj9ExOV+liWoX Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmimfoIACgkQzQJF27ox 2Gfd7RAAo9vaXTS8X4wMFcfCHCfEPZSi9rtICN8CwF+88cLy/UgRuylZLkbl3bJD Nm8jXR34MvrS9mzkS730+yd/1PxV9X8C3OjXROkFJqDseHDFFuhwsA4wd533cSQp KcsQ9fY2sHvYDj2KrcT1MEKSVXJp8DP1t6Y4FZKyBlXRtuAbzs0hiJAu32NL5rii bfD6fTPrRRd1EOQc0Vfvyd+SOpgZp34LTSNRpnzuTPYAg3HvrOM5LCAEXDtZakyM /YF2DUrrFi9cXUxnG9zgshb1NVCW3DLSh9RrAhduTTWYHq6yKHqHg58frnR2ILjR b4iB52qeLZVEbBz2tic/SQQRHmSuQuDaDz1QE3Eb1JjHE+tEGVNoor49sQtUQuJn k2+R2FhzihoVrfYOgWdf0Qz6owMlKfWtSyKr04C46Af7oz6kI6hXhHRpfBGyIPPG b8Ibp+sgzLDX3jbnp8YjXEwd0EARj17GI3vLXfwzkr/t96V4lxzSHZcpeON3z1+0 fa7AixGMgmuPz8Ta6xceoAhELXI5V59qtzgxsJrBQL22ga9Rv9bmg73eqr7HUpQT obxKO8kA69iptJeoRHcYFOgcC8pNsHgxnKOLWiZOaCXsOQnI/WuJSaAqGoopqVhF gfFa2ojEQTOpauDfF0ErgobnmQq+epZSr0LoeKX4hCQa+92kb+U= =V6ck -----END PGP SIGNATURE----- --+Ylcj9ExOV+liWoX--