From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top
Subject: Re: [PATCH v13 03/10] fwd: Add cache table for ARP/NDP contents
Date: Tue, 14 Oct 2025 15:55:54 +1100
Message-ID: <aO3X2gZ-YwqIZ2I8@zatzit>
In-Reply-To: <20251012193337.616835-4-jmaloy@redhat.com>
On Sun, Oct 12, 2025 at 03:33:30PM -0400, Jon Maloy wrote:
> We add a cache table to keep track of the contents of the kernel ARP
> and NDP tables. The table is fed from the just introduced netlink based
> neighbour subscription function. The new table eliminates the need for
> explicit netlink calls to find a host's MAC address.

The last sentence only really makes sense in the context of earlier
versions of this series, which won't be in the final commit log.

>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> v6: - Squashed the hash list commit and the FIFO/LRU queue commit
> - Removed hash lookup. We now only use linear lookup in a
> linked list
> - Eliminated dynamic memory allocation.
> - Ensured there is only one call to clock_gettime()
> - Using MAC_ZERO instead of the previously dedicated definitions
> v7: - NOW using MAC_ZERO where needed
> - I am still using linear back-off for empty cache entries. Even
> an incoming, flow-creating packet from a local host gives no
> guarantee that its MAC address is in the ARP table, so we must
> allow for a few new attempts at first possible occasions. Only
> after several failed lookups can we conclude that we probably
> never will succeed. Hence the back-off.
> - Fixed a bug that David inadvertently made me aware of: I only
> intended to set the initial expiry value to MAC_CACHE_RENEWAL
> when an ARP/NDP table lookup was successful.
> - Improved struct and function description comments.
> v8: - Total re-design of table, adapting to the new, subscription
> based way of updating it.
> v9: - Catering for MAC address change for an existing host.
> v10: - Changes according to feedback from David Gibson
> v12: - Changes according to feedback from David and Stefano
> - Added dummy entries for loopback and default GW addresses
> v13: - Changes according to feedback and discussions with David
> and Stefano
> ---
> fwd.c | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fwd.h | 7 ++
> netlink.c | 3 +
> passt.c | 1 +
> 4 files changed, 233 insertions(+)
>
> diff --git a/fwd.c b/fwd.c
> index 250cf56..d062ebc 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -26,6 +26,7 @@
> #include "passt.h"
> #include "lineread.h"
> #include "flow_table.h"
> +#include "netlink.h"
>
> /* Empheral port range: values from RFC 6335 */
> static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> @@ -33,6 +34,227 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
>
> #define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
>
> +#define NEIGH_TABLE_SLOTS 1024
> +#define NEIGH_TABLE_SIZE (NEIGH_TABLE_SLOTS / 2)
> +static_assert((NEIGH_TABLE_SLOTS & (NEIGH_TABLE_SLOTS - 1)) == 0,
> + "NEIGH_TABLE_SLOTS must be a power of two");
> +
> +/**
> + * struct neigh_table_entry - Entry in the ARP/NDP table
> + * @next: Next entry in slot or free list
> + * @addr: IP address of represented host
> + * @mac: MAC address of represented host
> + * @permanent: Entry cannot be altered or freed by notification
> + */
> +struct neigh_table_entry {
> + struct neigh_table_entry *next;
> + union inany_addr addr;
> + uint8_t mac[ETH_ALEN];
> + bool permanent;
> +};
> +
> +/**
> + * struct neigh_table - Cache of ARP/NDP table contents
> + * @entries: Entries to be plugged into the hash slots when allocated
> + * @slots: Hash table slots
> + * @free: Linked list of unused entries
> + */
> +struct neigh_table {
> + struct neigh_table_entry entries[NEIGH_TABLE_SIZE];
> + struct neigh_table_entry *slots[NEIGH_TABLE_SLOTS];
> + struct neigh_table_entry *free;
> +};
> +
> +static struct neigh_table neigh_table;
> +
> +/**
> + * neigh_table_slot() - Hash key to a number within the table range
> + * @c: Execution context
> + * @key: The key to be used for the hash
> + *
> + * Return: the resulting hash value
> + */
> +static size_t neigh_table_slot(const struct ctx *c,
> + const union inany_addr *key)
> +{
> + struct siphash_state st = SIPHASH_INIT(c->hash_secret);
> + uint32_t i;
> +
> + inany_siphash_feed(&st, key);
> + i = siphash_final(&st, sizeof(*key), 0);
> +
> + return ((size_t)i) & (NEIGH_TABLE_SIZE - 1);
> +}
> +
> +/**
> + * fwd_neigh_table_find() - Find a MAC table entry
> + * @c: Execution context
> + * @addr: Neighbour address to be used as key for the lookup
> + *
> + * Return: the matching entry, if found. Otherwise NULL
> + */
> +static struct neigh_table_entry *fwd_neigh_table_find(const struct ctx *c,
> + const union inany_addr *addr)
> +{
> + size_t slot = neigh_table_slot(c, addr);
> + struct neigh_table_entry *e = neigh_table.slots[slot];
> +
> + while (e && !inany_equals(&e->addr, addr))
> + e = e->next;
> +
> + return e;
> +}
> +
> +/**
> + * fwd_neigh_table_update() - Allocate or update neighbour table entry
> + * @c: Execution context
> + * @addr: IP address used to determine insertion slot and store in entry
> + * @mac: The MAC address associated with the neighbour address
> + * @permanent: Created entry cannot be altered or freed
> + */
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac, bool permanent)
> +{
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e;
> + union inany_addr daddr;
> + ssize_t slot;
> +
> + /* We only add guest-side visible addresses */
> + if (!nat_inbound(c, addr, &daddr))
> + return;

To me, it makes more sense to have the nat_inbound() in the caller.
That way the hash table is just a hash table, with no specific
knowledge of what the contents mean. The caller is the thing which
knows it has a host-side address, and wants to update a corresponding
entry in a table indexed by guest-side address. It also means that
the fwd_neigh_table_*() functions will consistently take guest-side
addresses.

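As a rough illustration of that split, here is a minimal, self-contained
sketch. It is not the passt code: struct addr, struct ctx,
nat_inbound_stub(), table_find(), table_update() and neigh_event_new() are
all hypothetical stand-ins, and the table is a plain array rather than a
hash table. The only point is that the table functions take guest-side
addresses, and the caller does the translation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN   6
#define TABLE_SIZE 8

/* Hypothetical stand-ins for union inany_addr and struct ctx */
struct addr { uint32_t v; };
struct ctx  { struct addr host_lo, guest_lo; };

struct entry { struct addr addr; uint8_t mac[ETH_ALEN]; bool used; };
static struct entry table[TABLE_SIZE];

/* Stand-in for nat_inbound(): map a host-side address to the address the
 * guest sees it as. Returns false if the guest can't see it at all.
 */
static bool nat_inbound_stub(const struct ctx *c, const struct addr *haddr,
			     struct addr *gaddr)
{
	if (haddr->v == 0)
		return false;
	*gaddr = (haddr->v == c->host_lo.v) ? c->guest_lo : *haddr;
	return true;
}

/* The table only ever sees guest-side addresses and doesn't interpret them */
static struct entry *table_find(const struct addr *gaddr)
{
	for (int i = 0; i < TABLE_SIZE; i++)
		if (table[i].used && table[i].addr.v == gaddr->v)
			return &table[i];
	return NULL;
}

static void table_update(const struct addr *gaddr, const uint8_t *mac)
{
	struct entry *e;

	assert(gaddr && mac);

	e = table_find(gaddr);
	for (int i = 0; !e && i < TABLE_SIZE; i++)
		if (!table[i].used)
			e = &table[i];
	if (!e)
		return;	/* Table full: drop the update */

	e->addr = *gaddr;
	memcpy(e->mac, mac, ETH_ALEN);
	e->used = true;
}

/* The event handler knows it has a host-side address, so it translates */
static void neigh_event_new(const struct ctx *c, const struct addr *haddr,
			    const uint8_t *mac)
{
	struct addr gaddr;

	if (!nat_inbound_stub(c, haddr, &gaddr))
		return;
	table_update(&gaddr, mac);
}
```

With this shape, callers that already hold a guest-side address (such as
the permanent entries added at init time) can call the table function
directly, without going through any translation.
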
> + /* MAC address might change sometimes */
> + e = fwd_neigh_table_find(c, &daddr);
> + if (e) {
> + if (!e->permanent)
> + memcpy(e->mac, mac, ETH_ALEN);
> + return;
> + }
> +
> + e = t->free;
> + if (!e) {
> + debug("Failed to allocate neighbour table entry");
> + return;
> + }
> + t->free = e->next;
> + slot = neigh_table_slot(c, &daddr);
> + e->next = t->slots[slot];
> + t->slots[slot] = e;
> +
> + memcpy(&e->addr, &daddr, sizeof(daddr));
> + memcpy(e->mac, mac, ETH_ALEN);
> + e->permanent = permanent;
> +}
> +
> +/**
> + * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list
> + * @c: Execution context
> + * @addr: IP address used to find the slot for the entry
> + */
> +void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr)
> +{
> + ssize_t slot = neigh_table_slot(c, addr);
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e, **prev;
> +
> + prev = &t->slots[slot];
> + e = t->slots[slot];
> + while (e && !inany_equals(&e->addr, addr)) {
> + prev = &e->next;
> + e = e->next;
> + }
> +
> + if (!e || e->permanent)
> + return;
> +
> + *prev = e->next;
> + e->next = t->free;
> + t->free = e;
> + memset(&e->addr, 0, sizeof(*addr));
> + memset(e->mac, 0, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table
> + * @c: Execution context
> + * @addr: Neighbour IP address used as lookup key
> + * @mac: Buffer for returned MAC address
> + */
> +void fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac)
> +{
> + const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr);
> +
> + if (e)
> + memcpy(mac, e->mac, ETH_ALEN);
> + else
> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_table_init() - Initialize the neighbour table
> + * @c: Execution context
> + */
> +void fwd_neigh_table_init(const struct ctx *c)
> +{
> + union inany_addr mhl = inany_from_v4(c->ip4.map_host_loopback);
> + union inany_addr mga = inany_from_v4(c->ip4.map_guest_addr);
> + union inany_addr ggw = inany_from_v4(c->ip4.guest_gw);
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e;
> + int i;
> +
> + memset(t, 0, sizeof(*t));
> +
> + for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
> + e = &t->entries[i];
> + e->next = t->free;
> + t->free = e;
> + }
> +
> + /* Blocker entries to stop events from hosts using these addresses */
> + if (!inany_is_unspecified4(&mhl))
> + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true);

Here mhl is already a guest-side address - and may not have a
corresponding address host side - so we definitely don't want the
nat_inbound() inside fwd_neigh_table_update().

> + if (!inany_is_unspecified4(&ggw) && !c->no_map_gw)
> + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true);

Same here.

> + if (!inany_is_unspecified4(&mga) && !inany_equals(&mhl, &mga)) {
> + uint8_t mac[ETH_ALEN];
> + int rc;
> +
> + rc = nl_link_get_mac(nl_sock, c->ifi4, mac);
> + if (rc < 0) {
> + debug("Couldn't get ip4 MAC addr: %s", strerror_(-rc));
> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
> + }
> + fwd_neigh_table_update(c, &mga, mac, true);

And here.

> + }
> +
> + mhl = *(union inany_addr *)&c->ip6.map_host_loopback;
> + mga = *(union inany_addr *)&c->ip4.map_guest_addr;
> + ggw = *(union inany_addr *)&c->ip4.guest_gw;
> +
> + if (!inany_is_unspecified6(&mhl))
> + fwd_neigh_table_update(c, &mhl, c->our_tap_mac, true);

...and so forth.

> + if (!inany_is_unspecified6(&ggw) && !c->no_map_gw)
> + fwd_neigh_table_update(c, &ggw, c->our_tap_mac, true);
> +
> + if (!inany_is_unspecified6(&mga) && !inany_equals(&mhl, &mga)) {
> + uint8_t mac[ETH_ALEN];
> + int rc;
> +
> + rc = nl_link_get_mac(nl_sock, c->ifi6, mac);
> + if (rc < 0) {
> + debug("Couldn't get ip6 MAC addr: %s", strerror_(-rc));
> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
> + }
> + fwd_neigh_table_update(c, &mga, mac, true);
> + }
> +}
> +
> /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
> *
> * Work out what ports the host thinks are emphemeral and record it for later
> diff --git a/fwd.h b/fwd.h
> index 65c7c96..352f3b5 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac, bool permanent);
> +void fwd_neigh_table_free(const struct ctx *c,
> + const union inany_addr *addr);
> +void fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac);
> +void fwd_neigh_table_init(const struct ctx *c);
>
> #endif /* FWD_H */
> diff --git a/netlink.c b/netlink.c
> index 99dcb72..61796fb 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -1186,10 +1186,12 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
>
> if (nh->nlmsg_type == RTM_DELNEIGH) {
> trace("neigh table delete: %s", ip_str);
> + fwd_neigh_table_free(c, &addr);

You'll need a nat_inbound() on the delete side as well. Again, I
think it makes more sense here than in fwd_neigh_table_free(), but at
the moment it's in neither.

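A minimal sketch of the idea, assuming the handler translates once up
front so that the delete and update paths both use the same guest-side
key. Everything here is a hypothetical stand-in, not the real netlink.c
code: enum ev replaces the RTM_NEWNEIGH / RTM_DELNEIGH message types,
addresses are plain integers, and the table is a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 4
#define HOST_LO    0x7f000001u	/* hypothetical host-side address */
#define GUEST_LO   0x0a000202u	/* what the guest sees it as */

enum ev { EV_NEW, EV_DEL };	/* stand-ins for RTM_NEWNEIGH / RTM_DELNEIGH */

static uint32_t cached[TABLE_SIZE];	/* guest-side keys, 0 means free */

/* Stand-in for nat_inbound(): false if the guest can't see the address */
static bool nat_inbound_stub(uint32_t haddr, uint32_t *gaddr)
{
	if (!haddr)
		return false;
	*gaddr = (haddr == HOST_LO) ? GUEST_LO : haddr;
	return true;
}

static bool table_has(uint32_t gaddr)
{
	for (int i = 0; gaddr && i < TABLE_SIZE; i++)
		if (cached[i] == gaddr)
			return true;
	return false;
}

static void table_update(uint32_t gaddr)
{
	assert(gaddr != 0);

	if (table_has(gaddr))
		return;
	for (int i = 0; i < TABLE_SIZE; i++) {
		if (!cached[i]) {
			cached[i] = gaddr;
			return;
		}
	}
}

static void table_free(uint32_t gaddr)
{
	for (int i = 0; gaddr && i < TABLE_SIZE; i++)
		if (cached[i] == gaddr)
			cached[i] = 0;
}

/* Handler: translate first, then both paths operate on the guest-side key */
static void neigh_msg(enum ev type, uint32_t haddr)
{
	uint32_t gaddr;

	if (!nat_inbound_stub(haddr, &gaddr))
		return;

	if (type == EV_DEL) {
		table_free(gaddr);
		return;
	}
	table_update(gaddr);
}
```

Without the translation on the delete path, an entry inserted under the
guest-side key would never match the host-side key carried by the delete
event, and so would never be removed.
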
> return;
> }
> if (!(ndm->ndm_state & NUD_VALID)) {
> trace("neigh table: invalid state for %s", ip_str);
> + fwd_neigh_table_free(c, &addr);
> return;
> }
> if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) {
> @@ -1202,6 +1204,7 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> memcpy(mac, lladdr, ETH_ALEN);
> eth_ntop(mac, mac_str, sizeof(mac_str));
> trace("neigh table update: %s / %s", ip_str, mac_str);
> + fwd_neigh_table_update(c, &addr, mac, false);
> }
>
> /**
> diff --git a/passt.c b/passt.c
> index e21d6ba..98fc430 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -324,6 +324,7 @@ int main(int argc, char **argv)
>
> pcap_init(&c);
>
> + fwd_neigh_table_init(&c);
> nl_neigh_notify_init(&c);
>
> if (!c.foreground) {
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson