From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202508 header.b=fzzB9aKM; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 64A455A0271 for ; Fri, 12 Sep 2025 05:27:10 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202508; t=1757647627; bh=BLx0d1mtZCdZj+tp48jj6EaF6gDlSzMkUmtevbne9Fw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=fzzB9aKMuqMERs3gwjUVGAJIlhkBxMfO1Ql6Ch+u0upvClHZNaRFeYp6sJDuH75I7 6IVFvVx647HRRTvGAsXIwXLBHseaGJ5gmuwCBbRjIomeZrWB2S4C61mWq0/gHgQ1c7 kBdoLwCgW4rlRk+6QJb/nfHOwutUeldxDA0kmj4SylXL72FO6jClggoPB2PKXYLeW6 jpqykXl/J4h1LkFv+M0eYAocjyCCPhWsRJO6X6mVUCHBebjEtdaH3Dyj+wzc0TK7Cp 7+jKK+8sMFANRazrWueqCBaJOtf5jkUtbE00VKVsuiJ6JAqkUWzsrVe2VXEYgkkeFV HioBa4D1AdQ6g== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4cNKbv04xrz4w9Q; Fri, 12 Sep 2025 13:27:06 +1000 (AEST) Date: Fri, 12 Sep 2025 13:27:01 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v7 2/8] fwd: Added cache table for ARP/NDP contents Message-ID: References: <20250910150355.366780-1-jmaloy@redhat.com> <20250910150355.366780-3-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="PFR+SEkACGHTlFHv" Content-Disposition: inline In-Reply-To: <20250910150355.366780-3-jmaloy@redhat.com> Message-ID-Hash: CLHTZ3AUJJPU6R7MYLGJPQFNGGUVTSZL X-Message-ID-Hash: CLHTZ3AUJJPU6R7MYLGJPQFNGGUVTSZL X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --PFR+SEkACGHTlFHv Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Sep 10, 2025 at 11:03:49AM -0400, Jon Maloy wrote: > We add a cache table to keep partial contents of the kernel ARP/NDP > tables. This way, we drastically reduce the number of netlink calls > to read those tables. >=20 > We create undefined cache entries representing non- or not-yet- > existing ARP/NDP entries when needed. We add a short expiration time > to each such entry, so that we can know when to make repeated calls to > the kernel tables in the beginning. We also add an access counter to the > entries, to ensure that the timer becomes longer and the call frequency > abates over time if no ARP/NDP entry shows up. >=20 > For regular entries we use a much longer timer, with the purpose to > update the entry in the rare case that a remote host changes its > MAC address. >=20 > Signed-off-by: Jon Maloy >=20 > --- > v5: - Moved to earlier in series to reduce rebase conflicts > v6: - Sqashed the hash list commit and the FIFO/LRU queue commit > - Removed hash lookup. We now only use linear lookup in a > linked list > - Eliminated dynamic memory allocation. > - Ensured there is only one call to clock_gettime() > - Using MAC_ZERO instead of the previously dedicated definitions > v7: - NOW using MAC_ZERO where needed > - I am still using linear back-off for empty cache entries. Even > an incoming, flow-creating packet from a local host gives no > guarantee that its MAC address is in the ARP table, so we must > allow for a few new attempts at first possible occasions. Only > after several failed lookups can we conclude that we probably > never will succeed. Hence the back-off. Still not sure I'm entirely convinced, but ok. Reviewed-by: David Gibson > - Fixed a bug that David inadvertently made me aware of: I only > intended to set the initial expiry value to MAC_CACHE_RENEWAL > when an ARP/NDP table lookup was successful. > - Improved struct and function description comments > --- > conf.c | 1 + > fwd.c | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > fwd.h | 4 ++ > util.c | 12 ++++ > util.h | 1 + > 5 files changed, 207 insertions(+) >=20 > diff --git a/conf.c b/conf.c > index f47f48e..27b04d3 100644 > --- a/conf.c > +++ b/conf.c > @@ -2122,6 +2122,7 @@ void conf(struct ctx *c, int argc, char **argv) > c->udp.fwd_out.mode =3D fwd_default; > =20 > fwd_scan_ports_init(c); > + fwd_mac_cache_init(); > =20 > if (!c->quiet) > conf_print(c); > diff --git a/fwd.c b/fwd.c > index 250cf56..fcd119e 100644 > --- a/fwd.c > +++ b/fwd.c > @@ -19,6 +19,8 @@ > #include > #include > #include > +#include > +#include > =20 > #include "util.h" > #include "ip.h" > @@ -26,6 +28,8 @@ > #include "passt.h" > #include "lineread.h" > #include "flow_table.h" > +#include "inany.h" > +#include "netlink.h" > =20 > /* Empheral port range: values from RFC 6335 */ > static in_port_t fwd_ephemeral_min =3D (1 << 15) + (1 << 14); > @@ -33,6 +37,191 @@ static in_port_t fwd_ephemeral_max =3D NUM_PORTS - 1; > =20 > #define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range" > =20 > +#define MAC_CACHE_SIZE 128 > +#define MAC_CACHE_RENEWAL 3600 /* Refresh entry from ARP/NDP every hour= */ > + > +/* Partial cache of ARP/NDP table contents */ > +/** > + * mac_cache_entry - Entry in the ARP/NDP table cache > + * @prev: Previous entry in LRU list > + * @next: Next entry in LRU list > + * @expiry: Point of time after which a new netlink lookup is allowed > + * @addr: IP address of represented host > + * @mac: MAC address of represented host, if known > + * @access_count: Access counter. Used for back-off from netlink calls > + */ > +struct mac_cache_entry { > + struct mac_cache_entry *prev; > + struct mac_cache_entry *next; > + struct timespec expiry; > + union inany_addr addr; > + uint8_t mac[ETH_ALEN]; > + uint16_t access_count; > +}; > + > +/** > + * mac_cache_entry - Partial cache of ARP/NDP table contents > + * @head: Least recent used entry in LRU list > + * @tail: Most recent used entry in LRU list > + * @mac_table: Array of entries, organized as FIFO/LRU list > + */ > +struct mac_cache_table { > + struct mac_cache_entry *head; > + struct mac_cache_entry *tail; > + struct mac_cache_entry mac_table[MAC_CACHE_SIZE]; > +}; > + > +static struct mac_cache_table mac_cache; > + > +/** > + * fwd_mac_cache_unlink() - Unlink entry from LRU queue > + */ > +static void fwd_mac_cache_unlink(struct mac_cache_entry *e) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + > + if (e->prev) > + e->prev->next =3D e->next; > + else > + t->head =3D e->next; > + > + if (e->next) > + e->next->prev =3D e->prev; > + else > + t->tail =3D e->prev; > + > + e->prev =3D e->next =3D NULL; > +} > + > +/** > + * fwd_mac_cache_append_tail() - Add entry to tail of LRU queue > + */ > +static void fwd_mac_cache_append_to_tail(struct mac_cache_entry *e) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + > + e->next =3D NULL; > + e->prev =3D t->tail; > + t->tail->next =3D e; > + t->tail =3D e; > +} > + > +/** > + * fwd_mac_cache_move_to_tail() - Move entry to tail of LRU queue > + */ > +static void fwd_mac_cache_move_to_tail(struct mac_cache_entry *e) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + > + if (t->tail =3D=3D e) > + return; > + > + fwd_mac_cache_unlink(e); > + fwd_mac_cache_append_to_tail(e); > +} > + > +/** > + * mac_entry_set_expiry() - Set the time for a cache entry to expire > + * @now: Current point in time > + * @e: Cache entry > + * @expiry: Expiration time, in seconds from current moment. > + */ > +static void mac_entry_set_expiry(struct mac_cache_entry *e, struct times= pec *now, int expiry) > +{ > + e->expiry =3D *now; > + e->expiry.tv_sec +=3D expiry; > +} > + > +/** > + * fwd_mac_cache_find() - Find an entry in the ARP/NDP cache table > + * @addr: IPv4 or IPv6 address, used as key for the lookup > + * > + * Return: Pointer to the entry on success, NULL on failure. > + */ > +static struct mac_cache_entry *fwd_mac_cache_find(const union inany_addr= *addr) > +{ > + const struct mac_cache_table *t =3D &mac_cache; > + struct mac_cache_entry *e =3D t->tail; > + > + for (e =3D t->tail; e; e =3D e->prev) { > + if (inany_equals(&e->addr, addr)) > + return e; > + } > + return NULL; > +} > + > +/** > + * fwd_neigh_mac_get() - Lookup MAC address in the real or cached ARP/ND= P table > + * @c: Execution context > + * @addr: IPv4 or IPv6 address, used as lookup key > + * @mac: Buffer for Ethernet MAC to return, found or default value. > + * > + * Return: true if real MAC found, false if not found or if failure > + */ > +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, > + uint8_t *mac) > +{ > + struct mac_cache_entry *e =3D fwd_mac_cache_find(addr); > + int ifi =3D inany_v4(addr) ? c->ifi4 : c->ifi6; > + bool refresh =3D false; > + struct timespec now; > + bool found =3D false; > + > + clock_gettime(CLOCK_MONOTONIC, &now); > + > + if (e) { > + refresh =3D timespec_before(&e->expiry, &now); > + } else { > + /* Recycle least recently used entry */ > + e =3D mac_cache.head; > + e->addr =3D *addr; > + memcpy(e->mac, MAC_ZERO, ETH_ALEN); > + e->access_count =3D 0; > + refresh =3D true; > + } > + > + if (!refresh) { > + found =3D !MAC_IS_ZERO(e->mac); > + } else { > + found =3D nl_neigh_mac_get(nl_sock, addr, ifi, e->mac); > + if (found) > + mac_entry_set_expiry(e, &now, MAC_CACHE_RENEWAL); > + } > + > + if (found) { > + memcpy(mac, e->mac, ETH_ALEN); > + } else { > + /* Back off from new netlink calls if nothing found */ > + mac_entry_set_expiry(e, &now, e->access_count++); > + memcpy(mac, c->our_tap_mac, ETH_ALEN); > + } > + > + /* Set to most recently used */ > + fwd_mac_cache_move_to_tail(e); > + > + return found; > +} > + > +/** > + * fwd_mac_cache_init() - Initiate ARP/NDP cache table > + */ > +void fwd_mac_cache_init(void) > +{ > + struct mac_cache_table *t =3D &mac_cache; > + struct mac_cache_entry *e; > + int i; > + > + memset(t, 0, sizeof(*t)); > + > + for (i =3D 0; i < MAC_CACHE_SIZE; i++) { > + e =3D &t->mac_table[i]; > + e->prev =3D (i =3D=3D 0) ? NULL : &t->mac_table[i - 1]; > + e->next =3D (i < (MAC_CACHE_SIZE - 1)) ? &t->mac_table[i + 1] : NULL; > + } > + t->head =3D &t->mac_table[0]; > + t->tail =3D &t->mac_table[MAC_CACHE_SIZE - 1]; > +} > + > /** fwd_probe_ephemeral() - Determine what ports this host considers eph= emeral > * > * Work out what ports the host thinks are emphemeral and record it for = later > diff --git a/fwd.h b/fwd.h > index 65c7c96..728601f 100644 > --- a/fwd.h > +++ b/fwd.h > @@ -57,4 +57,8 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_= t proto, > uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > const struct flowside *ini, struct flowside *tgt); > =20 > +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, > + uint8_t *mac); > +void fwd_mac_cache_init(void); > + > #endif /* FWD_H */ > diff --git a/util.c b/util.c > index c492f90..8a8846c 100644 > --- a/util.c > +++ b/util.c > @@ -305,6 +305,18 @@ long timespec_diff_ms(const struct timespec *a, cons= t struct timespec *b) > return timespec_diff_us(a, b) / 1000; > } > =20 > +/** > + * timespec_before() - Check the relation between two points in time > + * @a: Point in time to be tested > + * @b: Point in time test a against > + * Return: True if a comes before b, otherwise b > + */ > +bool timespec_before(const struct timespec *a, const struct timespec *b) > +{ > + return (a->tv_sec < b->tv_sec) || > + (a->tv_sec =3D=3D b->tv_sec && a->tv_nsec < b->tv_nsec); > +} > + > /** > * bitmap_set() - Set single bit in bitmap > * @map: Pointer to bitmap > diff --git a/util.h b/util.h > index 2a8c38f..5ec3f22 100644 > --- a/util.h > +++ b/util.h > @@ -207,6 +207,7 @@ int sock_unix(char *sock_path); > void sock_probe_mem(struct ctx *c); > long timespec_diff_ms(const struct timespec *a, const struct timespec *b= ); > int64_t timespec_diff_us(const struct timespec *a, const struct timespec= *b); > +bool timespec_before(const struct timespec *a, const struct timespec *b); > void bitmap_set(uint8_t *map, unsigned bit); > void bitmap_clear(uint8_t *map, unsigned bit); > bool bitmap_isset(const uint8_t *map, unsigned bit); > --=20 > 2.50.1 >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --PFR+SEkACGHTlFHv Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmjDkwQACgkQzQJF27ox 2GcUXA/+Mb3LJSNfZYzr82fsOwG4s7VqdRYLCCuWOanWgWYBQkQtDQJRc/t7pc8M WCbgmZjCntiZANudZzXVwSEKyegploIxcd0hm5CqtgZJ3Iaqz89LlK3uh+1+62E4 8IlPwQW3a3U/Q3s80vb7w7Pjjwnfv6O3K28PL4YjKBVxcEVgHh1BdiAHKDb5Xcaz anVhiaWt2YYUU3q+v1QizLqJcmSTFXENdihxTGg+kkdYK5hYXepBhVj5Nltzsspo F0nt53mlfIeequmFof/0fbJRK9OUQ9UAowYGu9dOsAesCD79RmVcybGIIrBQGXZy Xq4ZyNKkyeQB7Ztw42dz1mnhNz1eMkBgrwrUUcTPZvuV3exLLcfdF3dXrBmGkxVt iml5Asv/L+OwffgizZikL9DRBG6zde1SGP60MCjKV389cL6l/we7dIuIqcPwNiti iqxRBd7VdVe0dOcV8+iuKCu7q2Xb2Nm4KkKQyS8BTdVYAwY7yd9oom91kOyZgH0c 1j4ADFK7DD7Lzkme9ijyrU0DuPeaX82DvlQrQsEJ4uR8rnyt3ex8SmYHhFXDRgRc 5nz91Wtebhjzw1cLJG/anhhrGb63w8Rqhku1dzwliogdNmZGy62SQ/wLScJb2opQ nWVts0lQDMNisAXr63G7+ngIypoAbSEwL40qO+pcBrK73hB5uBI= =+W6O -----END PGP SIGNATURE----- --PFR+SEkACGHTlFHv--