[PATCH v9 0/9] Use true MAC address of LAN local remote hosts

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH v9 0/9] Use true MAC address of LAN local remote hosts
@ 2025-09-24  1:13 Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev



Bug #120 asks us to use the true MAC addresses of LAN local
remote hosts, since some programs need this information.
These commits introduces this for ARP, NDP, UDP, TCP and
ICMP.

---
v3: Updated according to feedback from Stefano and David:
    - Made the ARP/NDP lookup call filter out the requested address
      by itself, qualified by the index if the template interface
    - Moved the flow specific MAC address from struct flowside to
      struct flow_common.

v4: - Updated according to feedback from David and Stefan
    - Added a cache table for ARP/NDP table contents

v5: - Updated according to feedback from David and Stefan
    - Added cache table entries to FIFO/LRU queue
    - New criteria for when to consult ARP/NDP

v6: - Simplified and merged mac cache table commits
    - Other changes after feedback from David.
v7: - Fixes in patch #2 based on feedback from David
      and Stefano.
v8: - Redesigned netlink and cache table part to be based
      on a subscription model.
v8: - Small fix to patch #2 so that we cover the case
      when a MAC addess for a host has changed.
    - Added a commit where we send a gratuitous ARP/
      unsolicitated NA to the guest when a new host is
      added to the neighbour cache table.

Jon Maloy (8):
  netlink: add subsciption on changes in NDP/ARP table
  fwd: Add cache table for ARP/NDP contents
  arp/ndp: respond with true MAC address of LAN local remote hosts
  flow: add MAC address of LAN local remote hosts to flow
  udp: forward external source MAC address through tap interface
  tcp: forward external source MAC address through tap interface
  tap: change signature of function tap_push_l2h()
  icmp: let icmp use mac address from flowside structure

 arp.c          |   9 ++-
 conf.c         |   1 +
 epoll_type.h   |   2 +
 flow.c         |   2 +
 flow.h         |   2 +
 fwd.c          | 167 +++++++++++++++++++++++++++++++++++++++++++++++--
 fwd.h          |   9 +++
 icmp.c         |   8 ++-
 inany.c        |   1 +
 ndp.c          |  10 ++-
 netlink.c      | 119 +++++++++++++++++++++++++++++++++++
 netlink.h      |   4 ++
 passt.c        |  17 +++--
 passt.h        |   3 +-
 pasta.c        |   2 +-
 tap.c          |  24 ++++---
 tap.h          |   7 ++-
 tcp.c          |  18 ++++--
 tcp.h          |   2 +-
 tcp_buf.c      |  37 +++++------
 tcp_internal.h |   4 +-
 tcp_vu.c       |   5 +-
 udp.c          |  57 ++++++++++-------
 udp.h          |   2 +-
 24 files changed, 429 insertions(+), 83 deletions(-)

-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  2:47   ` David Gibson
  2025-09-24  1:13 ` [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
requires the ability to translate from an IP address to its
corresponding MAC address in cases where those are present in
the ARP or NDP tables.

To keep track of the contents of these tables we add a netlink
based neighbour subscription feature.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Added an attribute contianing NDA_DST to sent message, so
      that we let the kernel do the filtering of the IP address
      and return only one entry.
    - Added interface index to the call signature. Since the only
      interface we know is the template interface, this limits
      the number of hosts that will be seen as 'network segment
      local' from a PASST viewpoint.
v4: - Made loop independent of attribute order.
    - Ignoring L2 addresses which are not of size ETH_ALEN.
v5: - Changed return value of new function, so caller can know if
      a MAC address really was found.
v6: - Removed warning printout which had ended up in the wrong
      commit.
v8: - Changed to neighbour event subscription model

netlink: arp/ndp table subscription
---
 epoll_type.h |   2 +
 netlink.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
 netlink.h    |   4 ++
 passt.c      |   8 ++++
 4 files changed, 128 insertions(+)

diff --git a/epoll_type.h b/epoll_type.h
index 12ac59b..a90ffb6 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -44,6 +44,8 @@ enum epoll_type {
 	EPOLL_TYPE_REPAIR_LISTEN,
 	/* TCP_REPAIR helper socket */
 	EPOLL_TYPE_REPAIR,
+	/* Netlink neighbour subscription socket */
+	EPOLL_TYPE_NL_NEIGH,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/netlink.c b/netlink.c
index c436780..1faf3da 100644
--- a/netlink.c
+++ b/netlink.c
@@ -53,6 +53,7 @@
 int nl_sock	= -1;
 int nl_sock_ns	= -1;
 static int nl_seq = 1;
+static int nl_sock_neigh = -1;
 
 /**
  * nl_sock_init_do() - Set up netlink sockets in init or target namespace
@@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
 	return 0;
 }
 
+/**
+ * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
+ *
+ * Return: 0 on success, -1 on failure
+ */
+int nl_neigh_subscr_init(struct ctx *c)
+{
+	struct epoll_event ev = { 0 };
+	union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };
+
+	struct sockaddr_nl addr = {
+		.nl_family = AF_NETLINK,
+		.nl_groups = RTMGRP_NEIGH,
+	};
+
+	if (nl_sock_neigh >= 0)
+		return 0;
+
+	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
+	if (nl_sock_neigh < 0)
+		return -1;
+
+	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		close(nl_sock_neigh);
+		nl_sock_neigh = -1;
+		return -1;
+	}
+
+	ref.fd = nl_sock_neigh;
+	ev.events = EPOLLIN;
+	ev.data.u64 = ref.u64;
+	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
+		close(nl_sock_neigh);
+		nl_sock_neigh = -1;
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
+ * @c:	       Execution context
+ */
+void nl_neigh_subscr_handler(struct ctx *c)
+{
+	struct nlmsghdr *nh;
+	char buf[NLBUFSIZ];
+	ssize_t n;
+
+	if (nl_sock_neigh < 0)
+		return;
+
+	for (;;) {
+		n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
+		if (n <= 0)
+			return;
+
+		nh = (struct nlmsghdr *)buf;
+		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
+			struct ndmsg *ndm = NLMSG_DATA(nh);
+			struct rtattr *rta = (struct rtattr *)(ndm + 1);
+			size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
+			const uint8_t *lladdr = NULL;
+			const void *dst = NULL;
+			size_t lladdr_len = 0;
+			size_t dstlen = 0;
+
+			if (nh->nlmsg_type == NLMSG_DONE ||
+			    nh->nlmsg_type == NLMSG_ERROR)
+				continue;
+
+			if (nh->nlmsg_type != RTM_NEWNEIGH &&
+			    nh->nlmsg_type != RTM_DELNEIGH)
+				continue;
+
+			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
+				switch (rta->rta_type) {
+				case NDA_DST:
+					dst = RTA_DATA(rta);
+					dstlen = RTA_PAYLOAD(rta);
+					break;
+				case NDA_LLADDR:
+					lladdr = RTA_DATA(rta);
+					lladdr_len = RTA_PAYLOAD(rta);
+					break;
+				default:
+					break;
+				}
+			}
+
+			if (!dst)
+				continue;
+
+			if (dstlen != sizeof(struct in_addr) &&
+			    dstlen != sizeof(struct in6_addr))
+				continue;
+
+			char abuf[INET6_ADDRSTRLEN];
+
+			if (dstlen == sizeof(struct in_addr))
+				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
+			else
+				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
+
+			if (nh->nlmsg_type == RTM_NEWNEIGH)
+				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
+			else
+				debug("neigh: DEL %s", abuf);
+		}
+	}
+}
+
 /**
  * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
  * @c:		Execution context
diff --git a/netlink.h b/netlink.h
index b51e99c..a7d3506 100644
--- a/netlink.h
+++ b/netlink.h
@@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
 		 int s_dst, unsigned int ifi_dst, sa_family_t af);
 int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
 		void *addr, int *prefix_len, void *addr_l);
+bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
+		      unsigned char *mac);
 int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
 		const void *addr, int prefix_len);
 int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
@@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
 int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
 int nl_link_set_flags(int s, unsigned int ifi,
 		      unsigned int set, unsigned int change);
+int nl_neigh_subscr_init(struct ctx *c);
+void nl_neigh_subscr_handler(struct ctx *c);
 
 #endif /* NETLINK_H */
diff --git a/passt.c b/passt.c
index 31fbb75..e20bbad 100644
--- a/passt.c
+++ b/passt.c
@@ -53,6 +53,7 @@
 #include "vu_common.h"
 #include "migrate.h"
 #include "repair.h"
+#include "netlink.h"
 
 #define EPOLL_EVENTS		8
 
@@ -79,6 +80,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
 	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
+	[EPOLL_TYPE_NL_NEIGH]		= "netlink neighbour subscription socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -322,6 +324,9 @@ int main(int argc, char **argv)
 
 	pcap_init(&c);
 
+	if (nl_neigh_subscr_init(&c) < 0)
+		warn("Failed to subscribe to RTMGRP_NEIGH");
+
 	if (!c.foreground) {
 		if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
 			die_perror("Failed to open /dev/null");
@@ -414,6 +419,9 @@ loop:
 		case EPOLL_TYPE_REPAIR:
 			repair_handler(&c, eventmask);
 			break;
+		case EPOLL_TYPE_NL_NEIGH:
+			nl_neigh_subscr_handler(&c);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  3:03   ` David Gibson
  2025-09-24  1:13 ` [PATCH v9 3/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

We add a cache table to keep track of the contents of the kernel ARP
and NDP tables. The table is fed from the just introduced netlink based
neigbour subscription function. The new table eliminates the need for
explicit netlink calls to find a host's MAC address.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v5: - Moved to earlier in series to reduce rebase conflicts
v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
    - Removed hash lookup. We now only use linear lookup in a
      linked list
    - Eliminated dynamic memory allocation.
    - Ensured there is only one call to clock_gettime()
    - Using MAC_ZERO instead of the previously dedicated definitions
v7: - NOW using MAC_ZERO where needed
    - I am still using linear back-off for empty cache entries. Even
      an incoming, flow-creating packet from a local host gives no
      guarantee that its MAC address is in the ARP table, so we must
      allow for a few new attempts at first possible occasions. Only
      after several failed lookups can we conclude that we probably
      never will succeed. Hence the back-off.
    - Fixed a bug that David inadvertently made me aware of: I only
      intended to set the initial expiry value to MAC_CACHE_RENEWAL
      when an ARP/NDP table lookup was successful.
    - Improved struct and function description comments.
v8: - Total re-design of table, adapting to the new, subscription
      based way of updating it.
v9: - Catering for MAC address change for an existing host.
---
 conf.c    |   1 +
 fwd.c     | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fwd.h     |   7 +++
 netlink.c |  15 +++--
 4 files changed, 180 insertions(+), 5 deletions(-)

diff --git a/conf.c b/conf.c
index 66b9e63..cc491c3 100644
--- a/conf.c
+++ b/conf.c
@@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		c->udp.fwd_out.mode = fwd_default;
 
 	fwd_scan_ports_init(c);
+	fwd_neigh_cache_init();
 
 	if (!c->quiet)
 		conf_print(c);
diff --git a/fwd.c b/fwd.c
index 250cf56..491303d 100644
--- a/fwd.c
+++ b/fwd.c
@@ -32,6 +32,168 @@ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
 static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
 
 #define PORT_RANGE_SYSCTL	"/proc/sys/net/ipv4/ip_local_port_range"
+#define MAC_CACHE_SIZE     1024
+
+/**
+ * mac_cache_entry -  Entry in the ARP/NDP table cache
+ * @next:	Next entry in slot or free list
+ * @addr:	IP address of represented host
+ * @mac:	MAC address of represented host
+ */
+struct mac_cache_entry {
+	struct mac_cache_entry *next;
+	union inany_addr addr;
+	uint8_t mac[ETH_ALEN];
+};
+
+/**
+ * mac_cache_table -  Cache of ARP/NDP table contents
+ * @entries:	Entries to be plugged into the hash slots when allocated
+ * @slots:	Hash table slots
+ * @free:	Linked list of unused entries
+ */
+struct mac_cache_table {
+	struct mac_cache_entry entries[MAC_CACHE_SIZE];
+	struct mac_cache_entry *slots[MAC_CACHE_SIZE];
+	struct mac_cache_entry *free;
+};
+
+static struct mac_cache_table mac_cache;
+
+/**
+ * fwd_mac_cache_slot_idx() - Hash key to a number within the table range
+ * @c:		Execution context
+ * @key:	The key to be used for the hash
+ *
+ * Return: The resulting hash value
+ */
+static inline size_t fwd_mac_cache_slot_idx(const struct ctx *c,
+					    const union inany_addr *key)
+{
+	struct siphash_state st = SIPHASH_INIT(c->hash_secret);
+	uint32_t i;
+
+	inany_siphash_feed(&st, key);
+	i = siphash_final(&st, sizeof(*key), 0);
+
+	return ((size_t)i) & (MAC_CACHE_SIZE - 1);
+}
+
+/**
+ * fwd_mac_cache_find() - Find a MAC cache table entry
+ * @c:		Execution context
+ * @addr:	Neighbour address to be used as key for the lookup
+ *
+ * Return: The found entry, if any. Otherwise NULL.
+ */
+static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c,
+						  const union inany_addr *addr)
+{
+	size_t idx = fwd_mac_cache_slot_idx(c, addr);
+	struct mac_cache_entry *e = mac_cache.slots[idx];
+
+	while (e && !inany_equals(&e->addr, addr))
+		e = e->next;
+
+	return e;
+}
+
+/**
+ * fwd_mac_cache_alloc() - Allocate a mac cache table entry from the free list
+ * @c:		Execution context
+ * @addr:	Address used to determine insertion slot and store in entry
+ * @mac:	The MAC address associated with the neighbour address
+ */
+void fwd_neigh_mac_cache_alloc(const struct ctx *c,
+			       const union inany_addr *addr, uint8_t *mac)
+{
+	struct mac_cache_table *t = &mac_cache;
+	struct mac_cache_entry *e;
+	ssize_t idx;
+
+	/* MAC address might change sometimes */
+	e = fwd_mac_cache_find(c, addr);
+	if (e) {
+		memcpy(e->mac, mac, ETH_ALEN);
+		return;
+	}
+
+	e = t->free;
+	if (!e)
+		return;
+	t->free = e->next;
+
+	idx = fwd_mac_cache_slot_idx(c, addr);
+	e->next = t->slots[idx];
+	t->slots[idx] = e;
+
+	memcpy(&e->addr, addr, sizeof(*addr));
+	memcpy(e->mac, mac, ETH_ALEN);
+}
+
+/**
+ * fwd_mac_cache_free() - Remove an entry from a slot and add it to free list
+ * @c:		Execution context
+ * @addr:	Neighbour address used to find the slot for the entry
+ */
+void fwd_neigh_mac_cache_free(const struct ctx *c, const union inany_addr *addr)
+{
+	ssize_t idx = fwd_mac_cache_slot_idx(c, addr);
+	struct mac_cache_table *t = &mac_cache;
+	struct mac_cache_entry *e, **prev;
+
+	prev = &t->slots[idx];
+	e = t->slots[idx];
+	while (e && !inany_equals(&e->addr, addr)) {
+		prev = &e->next;
+		e = e->next;
+	}
+	if (!e)
+		return;
+
+	*prev = e->next;
+	e->next = t->free;
+	t->free = e;
+	memset(&e->addr, 0, sizeof(*addr));
+	memset(e->mac, 0, ETH_ALEN);
+}
+
+/**
+ * fwd_neigh_mac_get() - Lookup MAC address in the ARP/NDP cache table
+ * @c:		Execution context
+ * @addr:	Neighbour address used as lookup key
+ * @mac:	Buffer for Ethernet MAC to return, found or default value.
+ *
+ * Return: true if real MAC found, false if not found or if failure
+ */
+bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
+		       uint8_t *mac)
+{
+	struct mac_cache_entry *e = fwd_mac_cache_find(c, addr);
+
+	if (e)
+		memcpy(mac, e->mac, ETH_ALEN);
+	else
+		memcpy(mac, c->our_tap_mac, ETH_ALEN);
+
+	return !!e;
+}
+
+/**
+ * fwd_neigh_cache_init() - Initialize the neighbor ARP/NDP cache table
+ */
+void fwd_neigh_cache_init(void)
+{
+	struct mac_cache_table *t = &mac_cache;
+	struct mac_cache_entry *e;
+
+	memset(t, 0, sizeof(*t));
+	for (int i = 0; i < MAC_CACHE_SIZE; i++) {
+		e = &t->entries[i];
+		e->next = t->free;
+		t->free = e;
+	}
+}
 
 /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
  *
diff --git a/fwd.h b/fwd.h
index 65c7c96..a0e8fbc 100644
--- a/fwd.h
+++ b/fwd.h
@@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 			    const struct flowside *ini, struct flowside *tgt);
 uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 			  const struct flowside *ini, struct flowside *tgt);
+void fwd_neigh_mac_cache_alloc(const struct ctx *c,
+			       const union inany_addr *addr, uint8_t *mac);
+void fwd_neigh_mac_cache_free(const struct ctx *c,
+			      const union inany_addr *addr);
+bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
+		       uint8_t *mac);
+void fwd_neigh_cache_init(void);
 
 #endif /* FWD_H */
diff --git a/netlink.c b/netlink.c
index 1faf3da..1e1ec43 100644
--- a/netlink.c
+++ b/netlink.c
@@ -131,6 +131,8 @@ int nl_neigh_subscr_init(struct ctx *c)
  */
 void nl_neigh_subscr_handler(struct ctx *c)
 {
+	union inany_addr addr;
+	uint8_t mac[ETH_ALEN];
 	struct nlmsghdr *nh;
 	char buf[NLBUFSIZ];
 	ssize_t n;
@@ -183,17 +185,20 @@ void nl_neigh_subscr_handler(struct ctx *c)
 			    dstlen != sizeof(struct in6_addr))
 				continue;
 
-			char abuf[INET6_ADDRSTRLEN];
+			if (!lladdr || lladdr_len != ETH_ALEN)
+				continue;
+
+			memcpy(mac, lladdr, ETH_ALEN);
 
 			if (dstlen == sizeof(struct in_addr))
-				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
+				addr = inany_from_v4(*(struct in_addr *) dst);
 			else
-				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
+				addr.a6 = *(struct in6_addr *) dst;
 
 			if (nh->nlmsg_type == RTM_NEWNEIGH)
-				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
+				fwd_neigh_mac_cache_alloc(c, &addr, mac);
 			else
-				debug("neigh: DEL %s", abuf);
+				fwd_neigh_mac_cache_free(c, &addr);
 		}
 	}
 }
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 3/9] arp/ndp: respond with true MAC address of LAN local remote hosts
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 4/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

When we receive an ARP request or NDP neigbor solicitation over
the tap interface for a host on the local network segment attached
to the template interface, we respond with that host's real MAC
address.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Added helper function to find out if a remote ip address is subject
      to NAT. This filters out local host addresses which should be
      presented with the passt/pasta local MAC address 9a:55:9a:55:9a:55 even
      though it is on the local segment.
    - Adapted to the change in nl_mac_get() function, so that we now consider
      only the template interface when checking the ARP/NDP table.
v4: - Moved NAT check into the function nat_outbound() to obtain more
      precise criteria for when NAT is used. We may in theory
      have NAT even if original and translated addresses are equal, and
      we want to catch this case.
    - I chose to keep the wrapper funtion inany_nat(), but moved it to
      fwd.h/fwd.c and renamed it to fwd_inany_nat().
v5: - Simplified criteria for when we do ARP/NDP lookup. Now, we
      just try with the potentially translated address after an
      attempted NAT check.
    - Using the new ARP/NDP cache table instead of using netlink
      directly.
v6: - Fixes after feedback from David:
      - Renamed nat_outbound() to fwd_nat_outbound()
      - Eliminated unnecessary temporary variable in arp.c::arp()
---
 arp.c   | 9 +++++++--
 fwd.c   | 8 ++++----
 fwd.h   | 2 ++
 inany.c | 1 +
 ndp.c   | 8 ++++++++
 5 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arp.c b/arp.c
index ad088b1..442faff 100644
--- a/arp.c
+++ b/arp.c
@@ -69,6 +69,7 @@ static bool ignore_arp(const struct ctx *c,
  */
 int arp(const struct ctx *c, struct iov_tail *data)
 {
+	union inany_addr tgt, tgt_nat;
 	struct {
 		struct ethhdr eh;
 		struct arphdr ah;
@@ -102,8 +103,12 @@ int arp(const struct ctx *c, struct iov_tail *data)
 	resp.ah.ar_hln = ah->ar_hln;
 	resp.ah.ar_pln = ah->ar_pln;
 
-	/* ARP message */
-	memcpy(resp.am.sha,		c->our_tap_mac,	sizeof(resp.am.sha));
+	/* MAC address to return in ARP message */
+	inany_from_af(&tgt, AF_INET, am->tip);
+	fwd_nat_outbound(c, &tgt, &tgt_nat);
+	fwd_neigh_mac_get(c, &tgt_nat, resp.am.sha);
+
+	/* Rest of ARP message */
 	memcpy(resp.am.sip,		am->tip,	sizeof(resp.am.sip));
 	memcpy(resp.am.tha,		am->sha,	sizeof(resp.am.tha));
 	memcpy(resp.am.tip,		am->sip,	sizeof(resp.am.tip));
diff --git a/fwd.c b/fwd.c
index 491303d..c6348ab 100644
--- a/fwd.c
+++ b/fwd.c
@@ -486,7 +486,7 @@ static bool fwd_guest_accessible(const struct ctx *c,
 }
 
 /**
- * nat_outbound() - Apply address translation for outbound (TAP to HOST)
+ * fwd_nat_outbound() - Apply address translation for outbound (TAP to HOST)
  * @c:		Execution context
  * @addr:	Input address (as seen on TAP interface)
  * @translated:	Output address (as seen on HOST interface)
@@ -494,8 +494,8 @@ static bool fwd_guest_accessible(const struct ctx *c,
  * Only handles translations that depend *only* on the address.  Anything
  * related to specific ports or flows is handled elsewhere.
  */
-static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
-			 union inany_addr *translated)
+void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
+		      union inany_addr *translated)
 {
 	if (inany_equals4(addr, &c->ip4.map_host_loopback))
 		*translated = inany_loopback4;
@@ -529,7 +529,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 		   inany_equals6(&ini->oaddr, &c->ip6.dns_match))
 		tgt->eaddr.a6 = c->ip6.dns_host;
 	else
-		nat_outbound(c, &ini->oaddr, &tgt->eaddr);
+		fwd_nat_outbound(c, &ini->oaddr, &tgt->eaddr);
 
 	tgt->eport = ini->oport;
 
diff --git a/fwd.h b/fwd.h
index a0e8fbc..18371bf 100644
--- a/fwd.h
+++ b/fwd.h
@@ -50,6 +50,8 @@ void fwd_scan_ports_init(struct ctx *c);
 
 bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
 		 union inany_addr *translated);
+void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
+		      union inany_addr *translated);
 uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 			 const struct flowside *ini, struct flowside *tgt);
 uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
diff --git a/inany.c b/inany.c
index 65a39f9..7680439 100644
--- a/inany.c
+++ b/inany.c
@@ -16,6 +16,7 @@
 #include "ip.h"
 #include "siphash.h"
 #include "inany.h"
+#include "fwd.h"
 
 const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT);
 const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
diff --git a/ndp.c b/ndp.c
index 588b48f..c0e5f6e 100644
--- a/ndp.c
+++ b/ndp.c
@@ -196,6 +196,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
 static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
 	    const struct in6_addr *addr)
 {
+	union inany_addr tgt, tgt_nat;
 	struct ndp_na na = {
 		.ih = {
 			.icmp6_type		= NA,
@@ -215,6 +216,13 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
 
 	memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
 
+	/* Respond with true MAC address if remote host's address or
+	 * NAT translated address can be found in NDP table.
+	 */
+	inany_from_af(&tgt, AF_INET6, addr);
+	fwd_nat_outbound(c, &tgt, &tgt_nat);
+	fwd_neigh_mac_get(c, &tgt_nat, na.target_l2_addr.mac);
+
 	ndp_send(c, dst, &na, sizeof(na));
 }
 
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 4/9] flow: add MAC address of LAN local remote hosts to flow
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (2 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 3/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 5/9] udp: forward external source MAC address through tap interface Jon Maloy
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

When communicating with remote hosts on the local network, some guest
applications want to see the real MAC address of that host instead
of PASST/PASTA's own tap address. The flow_common structure is a
convenient location for storing that address, so we do that in this
commit.

Note that we don´t add actual usage of this address here, that will
be done in later commits.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Moved the remote host macaddress from struct flowside to
      struct flow_common. I chose to call it 'omac' as suggested
      by David, although in my understanding the correct name would be
      'emac'. (In general I find the address naming scheme confusing.)
    - Adapted to new signature of function nl_mac_get(), now passing
      it the index of the template interface.
v4: - Renamed flow_commeon->omac to flow_common->tap_omac to make is
      role in the code clearer
v5: - Modified the criteria for ARP/NDP table lookup like in the
      previous commits.
    - Removed the PIF_TAP lookup case, as David suggested, and did
      instead give the flow->tap_omac field a value marking it as
      non-initialized.
    - Calling the cache table instead of netlink for ARP/NDP lookup.
    - Unconditionally using the potentially translated IP address
      in the lookup, instead of only if NAT really was applied.
v6: - Using MAC_ZERO instead of own definitions
---
 flow.c | 2 ++
 flow.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/flow.c b/flow.c
index feefda3..510f3c5 100644
--- a/flow.c
+++ b/flow.c
@@ -449,6 +449,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
 
 	switch (f->pif[INISIDE]) {
 	case PIF_TAP:
+		memcpy(f->tap_omac, MAC_ZERO, ETH_ALEN);
 		tgtpif = fwd_nat_from_tap(c, proto, ini, tgt);
 		break;
 
@@ -458,6 +459,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
 
 	case PIF_HOST:
 		tgtpif = fwd_nat_from_host(c, proto, ini, tgt);
+		fwd_neigh_mac_get(c, &ini->eaddr, f->tap_omac);
 		break;
 
 	default:
diff --git a/flow.h b/flow.h
index cac618a..f342895 100644
--- a/flow.h
+++ b/flow.h
@@ -177,6 +177,7 @@ int flowside_connect(const struct ctx *c, int s,
  * @type:	Type of packet flow
  * @pif[]:	Interface for each side of the flow
  * @side[]:	Information for each side of the flow
+ * @tap_omac: MAC address of remote endpoint as seen from the guest
  */
 struct flow_common {
 #ifdef __GNUC__
@@ -192,6 +193,7 @@ struct flow_common {
 #endif
 	uint8_t		pif[SIDES];
 	struct flowside	side[SIDES];
+	uint8_t		tap_omac[6];
 };
 
 #define FLOW_INDEX_BITS		17	/* 128k - 1 */
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 5/9] udp: forward external source MAC address through tap interface
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (3 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 4/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 6/9] tcp: " Jon Maloy
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

We forward the incoming MAC address through the tap interface when
receiving incoming packets from network local hosts.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Adapted to the move of external MAC address from struct flowside
      to struct flow_common
v4: - Changed signature of udp_tap_prepare() to take a MAC address
      instead of a flow.
    - Eliminated initialization of MAC source address in all frames,
      since those now are set per send occasion anyway.
v5: - Added lookup in ARP/NDP table on incoming messages in
      case flow->tap_omac wasn't initialized at flow creation,
      i.e., the flow was initiated from the guest.
v6: - Using MAC_ZERO
---
 passt.c |  2 +-
 udp.c   | 45 +++++++++++++++++++++++++--------------------
 udp.h   |  2 +-
 3 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/passt.c b/passt.c
index e20bbad..fdd275a 100644
--- a/passt.c
+++ b/passt.c
@@ -164,7 +164,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
 void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 {
 	tcp_update_l2_buf(eth_d, eth_s);
-	udp_update_l2_buf(eth_d, eth_s);
+	udp_update_l2_buf(eth_d);
 }
 
 /**
diff --git a/udp.c b/udp.c
index 86585b7..eb57f05 100644
--- a/udp.c
+++ b/udp.c
@@ -133,11 +133,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
 /* UDP header and data for inbound messages */
 static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
 
-/* Ethernet header for IPv4 frames */
-static struct ethhdr udp4_eth_hdr;
-
-/* Ethernet header for IPv6 frames */
-static struct ethhdr udp6_eth_hdr;
+/* Ethernet headers for IPv4 and IPv6 frames */
+static struct ethhdr udp_eth_hdr[UDP_MAX_FRAMES];
 
 /**
  * struct udp_meta_t - Pre-cooked headers for UDP packets
@@ -210,12 +207,13 @@ void udp_portmap_clear(void)
 /**
  * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
- * @eth_s:	Ethernet source address, NULL if unchanged
  */
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void udp_update_l2_buf(const unsigned char *eth_d)
 {
-	eth_update_mac(&udp4_eth_hdr, eth_d, eth_s);
-	eth_update_mac(&udp6_eth_hdr, eth_d, eth_s);
+	int i;
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++)
+		eth_update_mac(&udp_eth_hdr[i], eth_d, NULL);
 }
 
 /**
@@ -238,6 +236,7 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
 
 	*siov = IOV_OF_LVALUE(payload->data);
 
+	tiov[UDP_IOV_ETH] = IOV_OF_LVALUE(udp_eth_hdr[i]);
 	tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
 	tiov[UDP_IOV_PAYLOAD].iov_base = payload;
 
@@ -253,9 +252,6 @@ static void udp_iov_init(const struct ctx *c)
 {
 	size_t i;
 
-	udp4_eth_hdr.h_proto = htons_constant(ETH_P_IP);
-	udp6_eth_hdr.h_proto = htons_constant(ETH_P_IPV6);
-
 	for (i = 0; i < UDP_MAX_FRAMES; i++)
 		udp_iov_init_one(c, i);
 }
@@ -352,31 +348,34 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
  * udp_tap_prepare() - Convert one datagram into a tap frame
  * @mmh:	Receiving mmsghdr array
  * @idx:	Index of the datagram to prepare
+ * @tap_omac:	MAC address of remote endpoint as seen from the guest
  * @toside:	Flowside for destination side
  * @no_udp_csum: Do not set UDP checksum
  */
 static void udp_tap_prepare(const struct mmsghdr *mmh,
-			    unsigned idx, const struct flowside *toside,
+			    unsigned int idx,
+			    const uint8_t *tap_omac,
+			    const struct flowside *toside,
 			    bool no_udp_csum)
 {
 	struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
 	struct udp_payload_t *bp = &udp_payload[idx];
 	struct udp_meta_t *bm = &udp_meta[idx];
+	struct ethhdr *eh = (*tap_iov)[UDP_IOV_ETH].iov_base;
 	size_t l4len;
 
+	eth_update_mac(eh, NULL, tap_omac);
 	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
 		l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
-			       sizeof(udp6_eth_hdr));
-		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
+		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + ETH_HLEN);
+		eh->h_proto = htons_constant(ETH_P_IPV6);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
 	} else {
 		l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
 					mmh[idx].msg_len, no_udp_csum);
-		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
-			       sizeof(udp4_eth_hdr));
-		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
+		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + ETH_HLEN);
+		eh->h_proto = htons_constant(ETH_P_IP);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
 	}
 	(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
@@ -801,13 +800,19 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
 				flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
+	struct udp_flow *uflow = udp_at_sidx(tosidx);
+	uint8_t *omac = uflow->f.tap_omac;
 	int i;
 
 	if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
 		return;
 
+	/* Make one attempt to find true MAC address in ARP/NDP table */
+	if (MAC_IS_ZERO(omac))
+		fwd_neigh_mac_get(c, &toside->oaddr, omac);
+
 	for (i = 0; i < n; i++)
-		udp_tap_prepare(udp_mh_recv, i, toside, false);
+		udp_tap_prepare(udp_mh_recv, i, omac, toside, false);
 
 	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
 }
diff --git a/udp.h b/udp.h
index 8f8531a..dd6e5ad 100644
--- a/udp.h
+++ b/udp.h
@@ -21,7 +21,7 @@ int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
 		  const char *ifname, in_port_t port);
 int udp_init(struct ctx *c);
 void udp_timer(struct ctx *c, const struct timespec *now);
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
+void udp_update_l2_buf(const unsigned char *eth_d);
 
 /**
  * union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 6/9] tcp: forward external source MAC address through tap interface
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (4 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 5/9] udp: forward external source MAC address through tap interface Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 7/9] tap: change signature of function tap_push_l2h() Jon Maloy
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts.

This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Adapted to the signature change in nl_mac_get() function, so that
      we now consider only the template interface when checking the
      ARP/NDP table.
v4: - Adapted to previous name changes in this series
v5: - Added lookup in ARP/NDP cache and/or table on incoming messages
      in  case flow->tap_omac wasn't initialized at flow creation,
      i.e., the flow was initiated from the guest.
v6: - Using MAC_IS_ZERO() instead of mac_undefined()
    - I ignored David's suggestion to move the test for
      undefined flow->tap_omac to tcp_{buf,vu}_data_from_sock(), at
      least for now. It looks like a sub-optimization to first have to
      look up the flow instance in these calls, and then do the
      test for MAC_IS_ZERO(), even though we might in theory end up
      with fewer such calls (how often?).
---
 passt.c        |  7 +++----
 passt.h        |  3 +--
 pasta.c        |  2 +-
 tap.c          |  2 +-
 tcp.c          | 13 +++++++++++--
 tcp.h          |  2 +-
 tcp_buf.c      | 37 +++++++++++++++++--------------------
 tcp_internal.h |  4 ++--
 tcp_vu.c       |  5 ++---
 9 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/passt.c b/passt.c
index fdd275a..a2ce2d9 100644
--- a/passt.c
+++ b/passt.c
@@ -159,11 +159,10 @@ static void timer_init(struct ctx *c, const struct timespec *now)
 /**
  * proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers
  * @eth_d:	Ethernet destination address, NULL if unchanged
- * @eth_s:	Ethernet source address, NULL if unchanged
  */
-void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void proto_update_l2_buf(const unsigned char *eth_d)
 {
-	tcp_update_l2_buf(eth_d, eth_s);
+	tcp_update_l2_buf(eth_d);
 	udp_update_l2_buf(eth_d);
 }
 
@@ -314,7 +313,7 @@ int main(int argc, char **argv)
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
 		_exit(EXIT_FAILURE);
 
-	proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
+	proto_update_l2_buf(c.guest_mac);
 
 	if (c.ifi4 && !c.no_dhcp)
 		dhcp_init();
diff --git a/passt.h b/passt.h
index 0075eb4..ff0236c 100644
--- a/passt.h
+++ b/passt.h
@@ -326,7 +326,6 @@ struct ctx {
 	bool migrate_exit;
 };
 
-void proto_update_l2_buf(const unsigned char *eth_d,
-			 const unsigned char *eth_s);
+void proto_update_l2_buf(const unsigned char *eth_d);
 
 #endif /* PASST_H */
diff --git a/pasta.c b/pasta.c
index 687406b..a42cfd8 100644
--- a/pasta.c
+++ b/pasta.c
@@ -411,7 +411,7 @@ void pasta_ns_conf(struct ctx *c)
 		}
 	}
 
-	proto_update_l2_buf(c->guest_mac, NULL);
+	proto_update_l2_buf(c->guest_mac);
 }
 
 /**
diff --git a/tap.c b/tap.c
index 399eeaa..250a0f6 100644
--- a/tap.c
+++ b/tap.c
@@ -1101,7 +1101,7 @@ void tap_add_packet(struct ctx *c, struct iov_tail *data,
 		memcpy(c->guest_mac, eh->h_source, ETH_ALEN);
 		debug("New guest MAC address observed: %s",
 		      eth_ntop(c->guest_mac, bufmac, sizeof(bufmac)));
-		proto_update_l2_buf(c->guest_mac, NULL);
+		proto_update_l2_buf(c->guest_mac);
 	}
 
 	switch (ntohs(eh->h_proto)) {
diff --git a/tcp.c b/tcp.c
index 48b1ef2..af05d35 100644
--- a/tcp.c
+++ b/tcp.c
@@ -919,6 +919,7 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers() - Fill 802.3, IP, TCP headers
+ * @c:			Execution context
  * @conn:		Connection pointer
  * @taph:		tap backend specific header
  * @ip4h:		Pointer to IPv4 header, or NULL
@@ -929,14 +930,15 @@ static void tcp_fill_header(struct tcphdr *th,
  * @seq:		Sequence number for this segment
  * @no_tcp_csum:	Do not set TCP checksum
  */
-void tcp_fill_headers(const struct tcp_tap_conn *conn,
-		      struct tap_hdr *taph,
+void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
+		      struct tap_hdr *taph, struct ethhdr *eh,
 		      struct iphdr *ip4h, struct ipv6hdr *ip6h,
 		      struct tcphdr *th, struct iov_tail *payload,
 		      const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 	size_t l4len = iov_tail_size(payload) + sizeof(*th);
+	uint8_t *omac = conn->f.tap_omac;
 	size_t l3len = l4len;
 	uint32_t psum = 0;
 
@@ -962,6 +964,7 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn,
 			psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
 						      *src4, *dst4);
 		}
+		eh->h_proto = htons_constant(ETH_P_IP);
 	}
 
 	if (ip6h) {
@@ -982,8 +985,14 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn,
 						      &ip6h->saddr,
 						      &ip6h->daddr);
 		}
+		eh->h_proto = htons_constant(ETH_P_IPV6);
 	}
 
+	/* Make one attempt to find true MAC address in ARP/NDP table */
+	if (MAC_IS_ZERO(omac))
+		fwd_neigh_mac_get(c, &tapside->oaddr, omac);
+	eth_update_mac(eh, NULL, omac);
+
 	tcp_fill_header(th, conn, seq);
 
 	if (no_tcp_csum)
diff --git a/tcp.h b/tcp.h
index 234a803..c1b8385 100644
--- a/tcp.h
+++ b/tcp.h
@@ -24,7 +24,7 @@ int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
 
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
+void tcp_update_l2_buf(const unsigned char *eth_d);
 
 extern bool peek_offset_cap;
 
diff --git a/tcp_buf.c b/tcp_buf.c
index a493b5a..f256dcc 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -40,8 +40,7 @@
 /* Static buffers */
 
 /* Ethernet header for IPv4 and IPv6 frames */
-static struct ethhdr		tcp4_eth_src;
-static struct ethhdr		tcp6_eth_src;
+static struct ethhdr		tcp_eth_hdr[TCP_FRAMES_MEM];
 
 static struct tap_hdr		tcp_payload_tap_hdr[TCP_FRAMES_MEM];
 
@@ -67,12 +66,13 @@ static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 /**
  * tcp_update_l2_buf() - Update Ethernet header buffers with addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
- * @eth_s:	Ethernet source address, NULL if unchanged
  */
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void tcp_update_l2_buf(const unsigned char *eth_d)
 {
-	eth_update_mac(&tcp4_eth_src, eth_d, eth_s);
-	eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
+	int i;
+
+	for (i = 0; i < TCP_FRAMES_MEM; i++)
+		eth_update_mac(&tcp_eth_hdr[i], eth_d, NULL);
 }
 
 /**
@@ -85,9 +85,6 @@ void tcp_sock_iov_init(const struct ctx *c)
 	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
 	int i;
 
-	tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
-	tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
-
 	for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) {
 		tcp6_payload_ip[i] = ip6;
 		tcp4_payload_ip[i] = iph;
@@ -149,13 +146,15 @@ void tcp_payload_flush(const struct ctx *c)
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @check:	Checksum, if already known
  * @seq:	Sequence number for this segment
  * @no_tcp_csum: Do not set TCP checksum
  */
-static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+static void tcp_l2_buf_fill_headers(const struct ctx *c,
+				    struct tcp_tap_conn *conn,
 				    struct iovec *iov, const uint16_t *check,
 				    uint32_t seq, bool no_tcp_csum)
 {
@@ -164,6 +163,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 	struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
 	const struct flowside *tapside = TAPFLOW(conn);
 	const struct in_addr *a4 = inany_v4(&tapside->oaddr);
+	struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base;
 	struct ipv6hdr *ip6h = NULL;
 	struct iphdr *ip4h = NULL;
 
@@ -172,7 +172,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 	else
 		ip6h = iov[TCP_IOV_IP].iov_base;
 
-	tcp_fill_headers(conn, taph, ip4h, ip6h, th, &tail,
+	tcp_fill_headers(c, conn, taph, eh, ip4h, ip6h, th, &tail,
 			 check, seq, no_tcp_csum);
 }
 
@@ -194,14 +194,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	int ret;
 
 	iov = tcp_l2_iov[tcp_payload_used];
-	if (CONN_V4(conn)) {
+	if (CONN_V4(conn))
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
-		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
-	} else {
+	else
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
-		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
-	}
 
+	iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp_eth_hdr[tcp_payload_used]);
 	payload = iov[TCP_IOV_PAYLOAD].iov_base;
 	seq = conn->seq_to_tap;
 	ret = tcp_prepare_flags(c, conn, flags, &payload->th,
@@ -212,7 +210,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	tcp_frame_conns[tcp_payload_used++] = conn;
 	l4len = optlen + sizeof(struct tcphdr);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
-	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
+	tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
 
 	if (flags & DUP_ACK) {
 		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used];
@@ -260,11 +258,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			check = &iph->check;
 		}
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
-		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
 	} else if (CONN_V6(conn)) {
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
-		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
 	}
+	iov[TCP_IOV_ETH].iov_base = &tcp_eth_hdr[tcp_payload_used];
 	payload = iov[TCP_IOV_PAYLOAD].iov_base;
 	payload->th.th_off = sizeof(struct tcphdr) / 4;
 	payload->th.th_x2 = 0;
@@ -272,7 +269,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	payload->th.ack = 1;
 	payload->th.psh = push;
 	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
-	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
+	tcp_l2_buf_fill_headers(c, conn, iov, check, seq, false);
 	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
 		tcp_payload_flush(c);
 }
diff --git a/tcp_internal.h b/tcp_internal.h
index 5cb6cba..f55025c 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -174,8 +174,8 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 
 struct tcp_info_linux;
 
-void tcp_fill_headers(const struct tcp_tap_conn *conn,
-		      struct tap_hdr *taph,
+void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
+		      struct tap_hdr *taph, struct ethhdr *eh,
 		      struct iphdr *ip4h, struct ipv6hdr *ip6h,
 		      struct tcphdr *th, struct iov_tail *payload,
 		      const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum);
diff --git a/tcp_vu.c b/tcp_vu.c
index ebd3a1e..0c17884 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -135,7 +135,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	flags_elem[0].in_sg[0].iov_len = hdrlen + optlen;
 	payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
 
-	tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
+	tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload,
 			 NULL, seq, !*c->pcap);
 
 	if (*c->pcap) {
@@ -308,7 +308,6 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	eh = vu_eth(base);
 
 	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
-	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
 
 	/* initialize header */
 
@@ -332,7 +331,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
 	th->ack = 1;
 	th->psh = push;
 
-	tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
+	tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload,
 			 *check, conn->seq_to_tap, no_tcp_csum);
 	if (ip4h)
 		*check = &ip4h->check;
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 7/9] tap: change signature of function tap_push_l2h()
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (5 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 6/9] tcp: " Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 8/9] icmp: let icmp use mac address from flowside structure Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added Jon Maloy
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

In the next commit it must be possible for the callers of function
tap_push_l2h() to specify which source MAC address should be
added to the ethernet header sent over the tap interface. As a
preparation, we now add a new argument to that function, still
without any logical changes.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: -Improved comment for src_mac argument, as suggested by Stefano.
v4: -Re-added comment in tap.c and remove redundant argument check
     in tap_push_l2()
v5: -Ensured that MAC argument to tap_push_l2h() never is NULL.
v6: -Clarification of comment about ARP lookup
---
 tap.c | 16 +++++++++-------
 tap.h |  3 ++-
 tcp.c |  5 +++--
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/tap.c b/tap.c
index 250a0f6..ec7e525 100644
--- a/tap.c
+++ b/tap.c
@@ -171,17 +171,19 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
  * tap_push_l2h() - Build an L2 header for an inbound packet
  * @c:		Execution context
  * @buf:	Buffer address at which to generate header
+ * @src_mac:	MAC address to be used as source for message.
  * @proto:	Ethernet protocol number for L3
  *
  * Return: pointer at which to write the packet's payload
  */
-void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
+void *tap_push_l2h(const struct ctx *c, void *buf,
+		   const void *src_mac, uint16_t proto)
 {
 	struct ethhdr *eh = (struct ethhdr *)buf;
 
-	/* TODO: ARP table lookup */
+	/* TODO: ARP lookup on tap side */
 	memcpy(eh->h_dest, c->guest_mac, ETH_ALEN);
-	memcpy(eh->h_source, c->our_tap_mac, ETH_ALEN);
+	memcpy(eh->h_source, src_mac, ETH_ALEN);
 	eh->h_proto = ntohs(proto);
 	return eh + 1;
 }
@@ -261,7 +263,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 {
 	size_t l4len = dlen + sizeof(struct udphdr);
 	char buf[USHRT_MAX];
-	struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+	struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
 	struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
 	char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
 
@@ -281,7 +283,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
 		    const void *in, size_t l4len)
 {
 	char buf[USHRT_MAX];
-	struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+	struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
 	struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst,
 					       l4len, IPPROTO_ICMP);
 
@@ -367,7 +369,7 @@ void tap_udp6_send(const struct ctx *c,
 {
 	size_t l4len = dlen + sizeof(struct udphdr);
 	char buf[USHRT_MAX];
-	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
 	struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
 					  l4len, IPPROTO_UDP, flow);
 	char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
@@ -389,7 +391,7 @@ void tap_icmp6_send(const struct ctx *c,
 		    const void *in, size_t l4len)
 {
 	char buf[USHRT_MAX];
-	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
 	struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len,
 						IPPROTO_ICMPV6, 0);
 
diff --git a/tap.h b/tap.h
index 21db4d2..02f7761 100644
--- a/tap.h
+++ b/tap.h
@@ -70,7 +70,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 }
 
 unsigned long tap_l2_max_len(const struct ctx *c);
-void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
+void *tap_push_l2h(const struct ctx *c, void *buf,
+		   const void *src_mac, uint16_t proto);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
 		     struct in_addr dst, size_t l4len, uint8_t proto);
 void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
diff --git a/tcp.c b/tcp.c
index af05d35..b874317 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1978,7 +1978,8 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
 		return;
 
 	if (af == AF_INET) {
-		struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+		struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac,
+						  ETH_P_IP);
 		const struct in_addr *rst_src = daddr;
 		const struct in_addr *rst_dst = saddr;
 
@@ -1988,7 +1989,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
 					      *rst_src, *rst_dst);
 
 	} else {
-		struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+		struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
 		const struct in6_addr *rst_src = daddr;
 		const struct in6_addr *rst_dst = saddr;
 
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 8/9] icmp: let icmp use mac address from flowside structure
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (6 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 7/9] tap: change signature of function tap_push_l2h() Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  1:13 ` [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added Jon Maloy
  8 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

Even ICMP needs to be updated to use the external MAC address instead
of just the own tap address when applicable. We do that here.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

---
v3: - Adapted to the move of external MAC address from struct flowside
      to struct flow_common
v4: - Adapted to name changes in previous commits in this series
v5: - Added conditional lookup in ARP/NDP if the flow's tap_omac is
      undefined
v6: - Looking up MAC of ICMP generating node in udp_send_tap_icmp4/6()
      when available, instead trusting the contents of flow->tap_omac.
---
 icmp.c |  8 ++++++--
 ndp.c  |  2 +-
 tap.c  | 10 ++++++----
 tap.h  |  4 ++--
 udp.c  | 12 ++++++++++--
 5 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/icmp.c b/icmp.c
index 6dffafb..1d99632 100644
--- a/icmp.c
+++ b/icmp.c
@@ -125,17 +125,21 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 	flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
 		 ini->eport, seq);
 
+	/* Try to find true MAC address in ARP/NDP table if needed */
+	if (MAC_IS_ZERO(pingf->f.tap_omac))
+		fwd_neigh_mac_get(c, &ini->oaddr, pingf->f.tap_omac);
+
 	if (pingf->f.type == FLOW_PING4) {
 		const struct in_addr *saddr = inany_v4(&ini->oaddr);
 		const struct in_addr *daddr = inany_v4(&ini->eaddr);
 
 		ASSERT(saddr && daddr); /* Must have IPv4 addresses */
-		tap_icmp4_send(c, *saddr, *daddr, buf, n);
+		tap_icmp4_send(c, *saddr, *daddr, buf, pingf->f.tap_omac, n);
 	} else if (pingf->f.type == FLOW_PING6) {
 		const struct in6_addr *saddr = &ini->oaddr.a6;
 		const struct in6_addr *daddr = &ini->eaddr.a6;
 
-		tap_icmp6_send(c, saddr, daddr, buf, n);
+		tap_icmp6_send(c, saddr, daddr, buf, pingf->f.tap_omac, n);
 	}
 	return;
 
diff --git a/ndp.c b/ndp.c
index c0e5f6e..70b68aa 100644
--- a/ndp.c
+++ b/ndp.c
@@ -184,7 +184,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
 {
 	const struct in6_addr *src = &c->ip6.our_tap_ll;
 
-	tap_icmp6_send(c, src, dst, buf, l4len);
+	tap_icmp6_send(c, src, dst, buf, c->our_tap_mac, l4len);
 }
 
 /**
diff --git a/tap.c b/tap.c
index ec7e525..2a32da5 100644
--- a/tap.c
+++ b/tap.c
@@ -277,13 +277,14 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
  * @src:	IPv4 source address
  * @dst:	IPv4 destination address
  * @in:		ICMP packet, including ICMP header
+ * @src_mac:	MAC address to be used as source for message
  * @l4len:	ICMP packet length, including ICMP header
  */
 void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
-		    const void *in, size_t l4len)
+		    const void *in, const void *src_mac, size_t l4len)
 {
 	char buf[USHRT_MAX];
-	struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
+	struct iphdr *ip4h = tap_push_l2h(c, buf, src_mac, ETH_P_IP);
 	struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst,
 					       l4len, IPPROTO_ICMP);
 
@@ -384,14 +385,15 @@ void tap_udp6_send(const struct ctx *c,
  * @src:	IPv6 source address
  * @dst:	IPv6 destination address
  * @in:		ICMP packet, including ICMP header
+ * @src_mac:	MAC address to be used as source for message
  * @l4len:	ICMP packet length, including ICMP header
  */
 void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
-		    const void *in, size_t l4len)
+		    const void *in, const void *src_mac, size_t l4len)
 {
 	char buf[USHRT_MAX];
-	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
+	struct ipv6hdr *ip6h = tap_push_l2h(c, buf, src_mac, ETH_P_IPV6);
 	struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len,
 						IPPROTO_ICMPV6, 0);
 
diff --git a/tap.h b/tap.h
index 02f7761..1864173 100644
--- a/tap.h
+++ b/tap.h
@@ -91,7 +91,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
 void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
-		    const void *in, size_t l4len);
+		    const void *in, const void *src_mac, size_t l4len);
 const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
 				     const struct in6_addr *src);
 void *tap_push_ip6h(struct ipv6hdr *ip6h,
@@ -103,7 +103,7 @@ void tap_udp6_send(const struct ctx *c,
 		   uint32_t flow, void *in, size_t dlen);
 void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
-		    const void *in, size_t l4len);
+		    const void *in, const void *src_mac, size_t l4len);
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 		       size_t bufs_per_frame, size_t nframes);
diff --git a/udp.c b/udp.c
index eb57f05..ff15e37 100644
--- a/udp.c
+++ b/udp.c
@@ -400,6 +400,8 @@ static void udp_send_tap_icmp4(const struct ctx *c,
 	struct in_addr eaddr = toside->eaddr.v4mapped.a4;
 	in_port_t eport = toside->eport;
 	in_port_t oport = toside->oport;
+	union inany_addr saddr_any;
+	uint8_t tap_omac[ETH_ALEN];
 	struct {
 		struct icmphdr icmp4h;
 		struct iphdr ip4h;
@@ -421,7 +423,10 @@ static void udp_send_tap_icmp4(const struct ctx *c,
 	tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
 	memcpy(&msg.data, in, dlen);
 
-	tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
+	/* Try to obtain the MAC address of the generating node */
+	saddr_any = inany_from_v4(saddr);
+	fwd_neigh_mac_get(c, &saddr_any, tap_omac);
+	tap_icmp4_send(c, saddr, eaddr, &msg, tap_omac, msglen);
 }
 
 
@@ -445,6 +450,7 @@ static void udp_send_tap_icmp6(const struct ctx *c,
 	const struct in6_addr *eaddr = &toside->eaddr.a6;
 	in_port_t eport = toside->eport;
 	in_port_t oport = toside->oport;
+	uint8_t tap_omac[ETH_ALEN];
 	struct {
 		struct icmp6_hdr icmp6h;
 		struct ipv6hdr ip6h;
@@ -466,7 +472,9 @@ static void udp_send_tap_icmp6(const struct ctx *c,
 	tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
 	memcpy(&msg.data, in, dlen);
 
-	tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
+	/* Try to obtain the MAC address of the generating node */
+	fwd_neigh_mac_get(c, (union inany_addr *) saddr, tap_omac);
+	tap_icmp6_send(c, saddr, eaddr, &msg, tap_omac, msglen);
 }
 
 /**
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
                   ` (7 preceding siblings ...)
  2025-09-24  1:13 ` [PATCH v9 8/9] icmp: let icmp use mac address from flowside structure Jon Maloy
@ 2025-09-24  1:13 ` Jon Maloy
  2025-09-24  3:22   ` David Gibson
  8 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-24  1:13 UTC (permalink / raw)
  To: sbrivio, dgibson, david, jmaloy, passt-dev

Gratuitious ARP and unsolicitated NA should be handled with caution
because of the risk of malignant users emitting them to disturb
network communication.

There is however one case we where we know it is legitimate
and safe for us to send out such messages: The one time we switch
from using ctx->own_tap_mac to a MAC address received via the
recently added neigbour subscription function. Later changes to
the MAC address of a host in an existing entry cannot be fully
trusted, so we abstain from doing it in such cases.

When sending this type of messages, we notice that the guest accepts
the update, but also asks for a confirmation in the form of a regular
ARP/NS request. This is responded to with the new value, and we have
exactly the effect we wanted.

This commit adds this functionality.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 arp.c | 39 +++++++++++++++++++++++++++++++++++++++
 arp.h |  2 ++
 fwd.c | 11 +++++++++++
 ndp.c | 10 ++++++++++
 ndp.h |  1 +
 5 files changed, 63 insertions(+)

diff --git a/arp.c b/arp.c
index 442faff..259f736 100644
--- a/arp.c
+++ b/arp.c
@@ -151,3 +151,42 @@ void arp_send_init_req(const struct ctx *c)
 	debug("Sending initial ARP request for guest MAC address");
 	tap_send_single(c, &req, sizeof(req));
 }
+
+/**
+ * arp_send_gratuitous() - Send a gratuitous ARP announcement for an IPv4 host
+ * @c:		Execution context
+ * @ip:	IPv4 address we announce as owned by @mac
+ * @mac:	MAC address to advertise for @ip
+ */
+void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
+			 const unsigned char *mac)
+{
+	char ip_str[INET_ADDRSTRLEN];
+	struct {
+		struct ethhdr eh;
+		struct arphdr ah;
+		struct arpmsg am;
+	} __attribute__((__packed__)) req;
+
+	/* Ethernet header */
+	req.eh.h_proto = htons(ETH_P_ARP);
+	memcpy(req.eh.h_dest, MAC_BROADCAST, sizeof(req.eh.h_dest));
+	memcpy(req.eh.h_source, c->our_tap_mac, sizeof(req.eh.h_source));
+
+	/* ARP header */
+	req.ah.ar_op = htons(ARPOP_REPLY);
+	req.ah.ar_hrd = htons(ARPHRD_ETHER);
+	req.ah.ar_pro = htons(ETH_P_IP);
+	req.ah.ar_hln = ETH_ALEN;
+	req.ah.ar_pln = 4;
+
+	/* ARP message */
+	memcpy(req.am.sha, mac, sizeof(req.am.sha));
+	memcpy(req.am.sip, &ip, sizeof(req.am.sip));
+	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
+	memcpy(req.am.tip, &ip, sizeof(req.am.tip));
+
+	inet_ntop(AF_INET, &ip, ip_str, sizeof(ip_str));
+	debug("Sending gratuitous ARP for %s", ip_str);
+	tap_send_single(c, &req, sizeof(req));
+}
diff --git a/arp.h b/arp.h
index d5ad0e1..b0dbb56 100644
--- a/arp.h
+++ b/arp.h
@@ -22,5 +22,7 @@ struct arpmsg {
 
 int arp(const struct ctx *c, struct iov_tail *data);
 void arp_send_init_req(const struct ctx *c);
+void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
+			 const unsigned char *mac);
 
 #endif /* ARP_H */
diff --git a/fwd.c b/fwd.c
index c6348ab..879a351 100644
--- a/fwd.c
+++ b/fwd.c
@@ -26,6 +26,8 @@
 #include "passt.h"
 #include "lineread.h"
 #include "flow_table.h"
+#include "arp.h"
+#include "ndp.h"
 
 /* Empheral port range: values from RFC 6335 */
 static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
@@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
 
 	memcpy(&e->addr, addr, sizeof(*addr));
 	memcpy(e->mac, mac, ETH_ALEN);
+
+	/* Send gratuitous ARP / unsolicited NA for the new mapping */
+	if (inany_v4(addr)) {
+		struct in_addr ip4 = *inany_v4(addr);
+
+		arp_send_gratuitous(c, ip4, e->mac);
+	} else {
+		ndp_send_unsolicited_na(c, &addr->a6);
+	}
 }
 
 /**
diff --git a/ndp.c b/ndp.c
index 70b68aa..8914f31 100644
--- a/ndp.c
+++ b/ndp.c
@@ -226,6 +226,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
 	ndp_send(c, dst, &na, sizeof(na));
 }
 
+/**
+ * ndp_send_unsolicited_na() - Send unsolicited NA
+ * @c:		Execution context
+ * @addr:	IPv6 address to advertise
+ */
+void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
+{
+	ndp_na(c, &in6addr_ll_all_nodes, addr);
+}
+
 /**
  * ndp_ra() - Send an NDP Router Advertisement (RA) message
  * @c:		Execution context
diff --git a/ndp.h b/ndp.h
index 781ea86..320009c 100644
--- a/ndp.h
+++ b/ndp.h
@@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
 	struct iov_tail *data);
 void ndp_timer(const struct ctx *c, const struct timespec *now);
 void ndp_send_init_req(const struct ctx *c);
+void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
 
 #endif /* NDP_H */
-- 
2.50.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
  2025-09-24  1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
@ 2025-09-24  2:47   ` David Gibson
  2025-09-24  3:34     ` David Gibson
  2025-09-24 18:40     ` Jon Maloy
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2025-09-24  2:47 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 8788 bytes --]

On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> requires the ability to translate from an IP address to its
> corresponding MAC address in cases where those are present in
> the ARP or NDP tables.
> 
> To keep track of the contents of these tables we add a netlink
> based neighbour subscription feature.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

The change to a subscription model is large enough that I think it
warrants dropping pre-existing reviews.

> ---
> v3: - Added an attribute contianing NDA_DST to sent message, so
>       that we let the kernel do the filtering of the IP address
>       and return only one entry.
>     - Added interface index to the call signature. Since the only
>       interface we know is the template interface, this limits
>       the number of hosts that will be seen as 'network segment
>       local' from a PASST viewpoint.
> v4: - Made loop independent of attribute order.
>     - Ignoring L2 addresses which are not of size ETH_ALEN.
> v5: - Changed return value of new function, so caller can know if
>       a MAC address really was found.
> v6: - Removed warning printout which had ended up in the wrong
>       commit.
> v8: - Changed to neighbour event subscription model
> 
> netlink: arp/ndp table subscription
> ---
>  epoll_type.h |   2 +
>  netlink.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  netlink.h    |   4 ++
>  passt.c      |   8 ++++
>  4 files changed, 128 insertions(+)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..a90ffb6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* Netlink neighbour subscription socket */
> +	EPOLL_TYPE_NL_NEIGH,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/netlink.c b/netlink.c
> index c436780..1faf3da 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -53,6 +53,7 @@
>  int nl_sock	= -1;
>  int nl_sock_ns	= -1;
>  static int nl_seq = 1;
> +static int nl_sock_neigh = -1;
>  
>  /**
>   * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> @@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
>  	return 0;
>  }
>  
> +/**
> + * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
> + *
> + * Return: 0 on success, -1 on failure
> + */
> +int nl_neigh_subscr_init(struct ctx *c)
> +{
> +	struct epoll_event ev = { 0 };
> +	union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };

You overwrite the .fd field, so there's not need to include it in the
initializer.  (Also 0 is a bad placeholder, since 0 is a valid fd).

> +
> +	struct sockaddr_nl addr = {
> +		.nl_family = AF_NETLINK,
> +		.nl_groups = RTMGRP_NEIGH,
> +	};
> +
> +	if (nl_sock_neigh >= 0)
> +		return 0;
> +
> +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
> +	if (nl_sock_neigh < 0)
> +		return -1;
> +
> +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		return -1;
> +	}
> +
> +	ref.fd = nl_sock_neigh;
> +	ev.events = EPOLLIN;

You could set this in the initializer.

> +	ev.data.u64 = ref.u64;
> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
> + * @c:	       Execution context
> + */
> +void nl_neigh_subscr_handler(struct ctx *c)
> +{
> +	struct nlmsghdr *nh;
> +	char buf[NLBUFSIZ];
> +	ssize_t n;
> +
> +	if (nl_sock_neigh < 0)
> +		return;
> +
> +	for (;;) {
> +		n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> +		if (n <= 0)
> +			return;

EAGAIN correctly results in a silent return, but you probably want to
report other errors and EOF (n == 0) since those are not expected.

> +
> +		nh = (struct nlmsghdr *)buf;
> +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
> +			struct ndmsg *ndm = NLMSG_DATA(nh);
> +			struct rtattr *rta = (struct rtattr *)(ndm + 1);
> +			size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> +			const uint8_t *lladdr = NULL;
> +			const void *dst = NULL;
> +			size_t lladdr_len = 0;
> +			size_t dstlen = 0;
> +
> +			if (nh->nlmsg_type == NLMSG_DONE ||
> +			    nh->nlmsg_type == NLMSG_ERROR)
> +				continue;

IIUC, NLMSG_ERROR would also be unexpected, so should probably be
reported.

> +			if (nh->nlmsg_type != RTM_NEWNEIGH &&
> +			    nh->nlmsg_type != RTM_DELNEIGH)
> +				continue;

Should we filter on ndm->ndm_ifindex for neighbours on the template
interface only?  I'm not sure.

It seems like we probably should filter on ndm->ndm_state.

> +			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> +				switch (rta->rta_type) {
> +				case NDA_DST:
> +					dst = RTA_DATA(rta);
> +					dstlen = RTA_PAYLOAD(rta);
> +					break;
> +				case NDA_LLADDR:
> +					lladdr = RTA_DATA(rta);
> +					lladdr_len = RTA_PAYLOAD(rta);
> +					break;
> +				default:
> +					break;
> +				}
> +			}

It seems like somewhere there ought to be something specifying the
address type of both DST and LLADDR, but I'm not sure exactly where.
We should check those, just in case some weirdo runs us on a host with
IPX over Token Ring.

> +			if (!dst)
> +				continue;
> +
> +			if (dstlen != sizeof(struct in_addr) &&
> +			    dstlen != sizeof(struct in6_addr))
> +				continue;
> +
> +			char abuf[INET6_ADDRSTRLEN];
> +
> +			if (dstlen == sizeof(struct in_addr))
> +				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
> +			else
> +				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
> +
> +			if (nh->nlmsg_type == RTM_NEWNEIGH)
> +				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
> +			else
> +				debug("neigh: DEL %s", abuf);
> +		}
> +	}
> +}
> +
>  /**
>   * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
>   * @c:		Execution context
> diff --git a/netlink.h b/netlink.h
> index b51e99c..a7d3506 100644
> --- a/netlink.h
> +++ b/netlink.h
> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>  		 int s_dst, unsigned int ifi_dst, sa_family_t af);
>  int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
>  		void *addr, int *prefix_len, void *addr_l);
> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
> +		      unsigned char *mac);
>  int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
>  		const void *addr, int prefix_len);
>  int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
>  int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
>  int nl_link_set_flags(int s, unsigned int ifi,
>  		      unsigned int set, unsigned int change);
> +int nl_neigh_subscr_init(struct ctx *c);
> +void nl_neigh_subscr_handler(struct ctx *c);
>  
>  #endif /* NETLINK_H */
> diff --git a/passt.c b/passt.c
> index 31fbb75..e20bbad 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -53,6 +53,7 @@
>  #include "vu_common.h"
>  #include "migrate.h"
>  #include "repair.h"
> +#include "netlink.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_NL_NEIGH]		= "netlink neighbour subscription socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -322,6 +324,9 @@ int main(int argc, char **argv)
>  
>  	pcap_init(&c);
>  
> +	if (nl_neigh_subscr_init(&c) < 0)
> +		warn("Failed to subscribe to RTMGRP_NEIGH");
> +
>  	if (!c.foreground) {
>  		if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
>  			die_perror("Failed to open /dev/null");
> @@ -414,6 +419,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_NL_NEIGH:
> +			nl_neigh_subscr_handler(&c);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> -- 
> 2.50.1
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents
  2025-09-24  1:13 ` [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
@ 2025-09-24  3:03   ` David Gibson
  2025-09-24 18:54     ` Jon Maloy
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-24  3:03 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 10106 bytes --]

On Tue, Sep 23, 2025 at 09:13:23PM -0400, Jon Maloy wrote:
> We add a cache table to keep track of the contents of the kernel ARP
> and NDP tables. The table is fed from the just introduced netlink based
> neigbour subscription function. The new table eliminates the need for
> explicit netlink calls to find a host's MAC address.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> 
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
>     - Removed hash lookup. We now only use linear lookup in a
>       linked list
>     - Eliminated dynamic memory allocation.
>     - Ensured there is only one call to clock_gettime()
>     - Using MAC_ZERO instead of the previously dedicated definitions
> v7: - NOW using MAC_ZERO where needed
>     - I am still using linear back-off for empty cache entries. Even
>       an incoming, flow-creating packet from a local host gives no
>       guarantee that its MAC address is in the ARP table, so we must
>       allow for a few new attempts at first possible occasions. Only
>       after several failed lookups can we conclude that we probably
>       never will succeed. Hence the back-off.
>     - Fixed a bug that David inadvertently made me aware of: I only
>       intended to set the initial expiry value to MAC_CACHE_RENEWAL
>       when an ARP/NDP table lookup was successful.
>     - Improved struct and function description comments.
> v8: - Total re-design of table, adapting to the new, subscription
>       based way of updating it.
> v9: - Catering for MAC address change for an existing host.
> ---
>  conf.c    |   1 +
>  fwd.c     | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fwd.h     |   7 +++
>  netlink.c |  15 +++--
>  4 files changed, 180 insertions(+), 5 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 66b9e63..cc491c3 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		c->udp.fwd_out.mode = fwd_default;
>  
>  	fwd_scan_ports_init(c);
> +	fwd_neigh_cache_init();
>  
>  	if (!c->quiet)
>  		conf_print(c);
> diff --git a/fwd.c b/fwd.c
> index 250cf56..491303d 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -32,6 +32,168 @@ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>  static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
>  
>  #define PORT_RANGE_SYSCTL	"/proc/sys/net/ipv4/ip_local_port_range"
> +#define MAC_CACHE_SIZE     1024

Nit: you use "mac cache" some places, "neigh cache" others.

> +
> +/**
> + * mac_cache_entry -  Entry in the ARP/NDP table cache
> + * @next:	Next entry in slot or free list
> + * @addr:	IP address of represented host
> + * @mac:	MAC address of represented host
> + */
> +struct mac_cache_entry {
> +	struct mac_cache_entry *next;

Hash buckets might be overkill, but they're pretty easy and cheap, so
whatever.

> +	union inany_addr addr;
> +	uint8_t mac[ETH_ALEN];
> +};
> +
> +/**
> + * mac_cache_table -  Cache of ARP/NDP table contents
> + * @entries:	Entries to be plugged into the hash slots when allocated
> + * @slots:	Hash table slots
> + * @free:	Linked list of unused entries
> + */
> +struct mac_cache_table {
> +	struct mac_cache_entry entries[MAC_CACHE_SIZE];
> +	struct mac_cache_entry *slots[MAC_CACHE_SIZE];

There's no inherent reason these two tables need the same size (and
indeed some reasons they might not want to have the same size).

> +	struct mac_cache_entry *free;
> +};
> +
> +static struct mac_cache_table mac_cache;
> +
> +/**
> + * fwd_mac_cache_slot_idx() - Hash key to a number within the table range
> + * @c:		Execution context
> + * @key:	The key to be used for the hash
> + *
> + * Return: The resulting hash value
> + */
> +static inline size_t fwd_mac_cache_slot_idx(const struct ctx *c,
> +					    const union inany_addr *key)

Maybe just neigh_hash_bucket()?

> +{
> +	struct siphash_state st = SIPHASH_INIT(c->hash_secret);
> +	uint32_t i;
> +
> +	inany_siphash_feed(&st, key);
> +	i = siphash_final(&st, sizeof(*key), 0);
> +
> +	return ((size_t)i) & (MAC_CACHE_SIZE - 1);

This subtly depends on MAC_CACHE_SIZE being a power of 2.  If you use
(... % MAC_CACHE_SIZE) it will work regardless, and the compiler will
still optimize if it _is_ a power of 2.

> +}
> +
> +/**
> + * fwd_mac_cache_find() - Find a MAC cache table entry
> + * @c:		Execution context
> + * @addr:	Neighbour address to be used as key for the lookup
> + *
> + * Return: The found entry, if any. Otherwise NULL.
> + */
> +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c,
> +						  const union inany_addr *addr)
> +{
> +	size_t idx = fwd_mac_cache_slot_idx(c, addr);
> +	struct mac_cache_entry *e = mac_cache.slots[idx];
> +
> +	while (e && !inany_equals(&e->addr, addr))
> +		e = e->next;
> +
> +	return e;
> +}
> +
> +/**
> + * fwd_mac_cache_alloc() - Allocate a mac cache table entry from the free list
> + * @c:		Execution context
> + * @addr:	Address used to determine insertion slot and store in entry
> + * @mac:	The MAC address associated with the neighbour address
> + */
> +void fwd_neigh_mac_cache_alloc(const struct ctx *c,
> +			       const union inany_addr *addr, uint8_t *mac)

This can update as well as allocating.  Maybe fwd_neigh_cache_set()?

> +{
> +	struct mac_cache_table *t = &mac_cache;
> +	struct mac_cache_entry *e;
> +	ssize_t idx;
> +
> +	/* MAC address might change sometimes */
> +	e = fwd_mac_cache_find(c, addr);
> +	if (e) {
> +		memcpy(e->mac, mac, ETH_ALEN);
> +		return;
> +	}
> +
> +	e = t->free;
> +	if (!e)
> +		return;
> +	t->free = e->next;
> +
> +	idx = fwd_mac_cache_slot_idx(c, addr);
> +	e->next = t->slots[idx];
> +	t->slots[idx] = e;
> +
> +	memcpy(&e->addr, addr, sizeof(*addr));
> +	memcpy(e->mac, mac, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_mac_cache_free() - Remove an entry from a slot and add it to free list
> + * @c:		Execution context
> + * @addr:	Neighbour address used to find the slot for the entry
> + */
> +void fwd_neigh_mac_cache_free(const struct ctx *c, const union inany_addr *addr)
> +{
> +	ssize_t idx = fwd_mac_cache_slot_idx(c, addr);
> +	struct mac_cache_table *t = &mac_cache;
> +	struct mac_cache_entry *e, **prev;
> +
> +	prev = &t->slots[idx];
> +	e = t->slots[idx];
> +	while (e && !inany_equals(&e->addr, addr)) {
> +		prev = &e->next;
> +		e = e->next;
> +	}
> +	if (!e)
> +		return;
> +
> +	*prev = e->next;
> +	e->next = t->free;
> +	t->free = e;
> +	memset(&e->addr, 0, sizeof(*addr));
> +	memset(e->mac, 0, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_mac_get() - Lookup MAC address in the ARP/NDP cache table
> + * @c:		Execution context
> + * @addr:	Neighbour address used as lookup key
> + * @mac:	Buffer for Ethernet MAC to return, found or default value.
> + *
> + * Return: true if real MAC found, false if not found or if failure
> + */
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> +		       uint8_t *mac)
> +{
> +	struct mac_cache_entry *e = fwd_mac_cache_find(c, addr);
> +
> +	if (e)
> +		memcpy(mac, e->mac, ETH_ALEN);
> +	else
> +		memcpy(mac, c->our_tap_mac, ETH_ALEN);
> +
> +	return !!e;
> +}
> +
> +/**
> + * fwd_neigh_cache_init() - Initialize the neighbor ARP/NDP cache table
> + */
> +void fwd_neigh_cache_init(void)
> +{
> +	struct mac_cache_table *t = &mac_cache;
> +	struct mac_cache_entry *e;
> +
> +	memset(t, 0, sizeof(*t));
> +	for (int i = 0; i < MAC_CACHE_SIZE; i++) {
> +		e = &t->entries[i];
> +		e->next = t->free;
> +		t->free = e;
> +	}
> +}
>  
>  /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
>   *
> diff --git a/fwd.h b/fwd.h
> index 65c7c96..a0e8fbc 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
>  			    const struct flowside *ini, struct flowside *tgt);
>  uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>  			  const struct flowside *ini, struct flowside *tgt);
> +void fwd_neigh_mac_cache_alloc(const struct ctx *c,
> +			       const union inany_addr *addr, uint8_t *mac);
> +void fwd_neigh_mac_cache_free(const struct ctx *c,
> +			      const union inany_addr *addr);
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> +		       uint8_t *mac);
> +void fwd_neigh_cache_init(void);
>  
>  #endif /* FWD_H */
> diff --git a/netlink.c b/netlink.c
> index 1faf3da..1e1ec43 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -131,6 +131,8 @@ int nl_neigh_subscr_init(struct ctx *c)
>   */
>  void nl_neigh_subscr_handler(struct ctx *c)
>  {
> +	union inany_addr addr;
> +	uint8_t mac[ETH_ALEN];
>  	struct nlmsghdr *nh;
>  	char buf[NLBUFSIZ];
>  	ssize_t n;
> @@ -183,17 +185,20 @@ void nl_neigh_subscr_handler(struct ctx *c)
>  			    dstlen != sizeof(struct in6_addr))
>  				continue;
>  
> -			char abuf[INET6_ADDRSTRLEN];
> +			if (!lladdr || lladdr_len != ETH_ALEN)
> +				continue;
> +
> +			memcpy(mac, lladdr, ETH_ALEN);
>  
>  			if (dstlen == sizeof(struct in_addr))
> -				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
> +				addr = inany_from_v4(*(struct in_addr *) dst);
>  			else
> -				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
> +				addr.a6 = *(struct in6_addr *) dst;
>  
>  			if (nh->nlmsg_type == RTM_NEWNEIGH)
> -				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
> +				fwd_neigh_mac_cache_alloc(c, &addr, mac);
>  			else
> -				debug("neigh: DEL %s", abuf);
> +				fwd_neigh_mac_cache_free(c, &addr);

I think the debug() messages could still be useful.  They could move
into the functions of course, and might want to be demoted to trace().

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24  1:13 ` [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added Jon Maloy
@ 2025-09-24  3:22   ` David Gibson
  2025-09-24 22:18     ` Jon Maloy
  2025-09-26 23:25     ` Jon Maloy
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2025-09-24  3:22 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 6373 bytes --]

On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> Gratuitious ARP and unsolicitated NA should be handled with caution
> because of the risk of malignant users emitting them to disturb
> network communication.
> 
> There is however one case we where we know it is legitimate
> and safe for us to send out such messages: The one time we switch
> from using ctx->own_tap_mac to a MAC address received via the
> recently added neigbour subscription function. Later changes to
> the MAC address of a host in an existing entry cannot be fully
> trusted, so we abstain from doing it in such cases.

So, I think you're right that the gratuitous ARP is safe in this case.

But it concerns me that (other that some edge cases) we're sending
data to the guest under own_tap_mac before we get the real MAC.  At
the point we send data from a flow to the guest, I would have expected
to already have an entry in the host neighbour table (because by
definition the host is talking to the peer), therefore in our cache,
by the subscriber.

I'm wondering if it could be as simple as both the neighbour update
and the actual data coming in the same epoll batch, and we could avoid
the temporary use of own_tap_mac by prioritising processing of the
neighbour events.

> When sending this type of messages, we notice that the guest accepts
> the update, but also asks for a confirmation in the form of a regular
> ARP/NS request. This is responded to with the new value, and we have
> exactly the effect we wanted.
> 
> This commit adds this functionality.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  arp.c | 39 +++++++++++++++++++++++++++++++++++++++
>  arp.h |  2 ++
>  fwd.c | 11 +++++++++++
>  ndp.c | 10 ++++++++++
>  ndp.h |  1 +
>  5 files changed, 63 insertions(+)
> 
> diff --git a/arp.c b/arp.c
> index 442faff..259f736 100644
> --- a/arp.c
> +++ b/arp.c
> @@ -151,3 +151,42 @@ void arp_send_init_req(const struct ctx *c)
>  	debug("Sending initial ARP request for guest MAC address");
>  	tap_send_single(c, &req, sizeof(req));
>  }
> +
> +/**
> + * arp_send_gratuitous() - Send a gratuitous ARP announcement for an IPv4 host
> + * @c:		Execution context
> + * @ip:	IPv4 address we announce as owned by @mac
> + * @mac:	MAC address to advertise for @ip
> + */
> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
> +			 const unsigned char *mac)
> +{
> +	char ip_str[INET_ADDRSTRLEN];
> +	struct {
> +		struct ethhdr eh;
> +		struct arphdr ah;
> +		struct arpmsg am;
> +	} __attribute__((__packed__)) req;

'req' is not a great name, since this is an ARP response, not a
request (but see below).

> +	/* Ethernet header */
> +	req.eh.h_proto = htons(ETH_P_ARP);
> +	memcpy(req.eh.h_dest, MAC_BROADCAST, sizeof(req.eh.h_dest));
> +	memcpy(req.eh.h_source, c->our_tap_mac, sizeof(req.eh.h_source));
> +
> +	/* ARP header */
> +	req.ah.ar_op = htons(ARPOP_REPLY);
> +	req.ah.ar_hrd = htons(ARPHRD_ETHER);
> +	req.ah.ar_pro = htons(ETH_P_IP);
> +	req.ah.ar_hln = ETH_ALEN;
> +	req.ah.ar_pln = 4;
> +
> +	/* ARP message */
> +	memcpy(req.am.sha, mac, sizeof(req.am.sha));
> +	memcpy(req.am.sip, &ip, sizeof(req.am.sip));
> +	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
> +	memcpy(req.am.tip, &ip, sizeof(req.am.tip));

So, I was trying to check if it made sense to use the same IP for both
source and target here, and came across
    https://www.rfc-editor.org/rfc/rfc5227#section-3

Which suggests we should (counter intuitively) be using ARP requests,
not ARP replies for announcements.

> +	inet_ntop(AF_INET, &ip, ip_str, sizeof(ip_str));
> +	debug("Sending gratuitous ARP for %s", ip_str);
> +	tap_send_single(c, &req, sizeof(req));
> +}
> diff --git a/arp.h b/arp.h
> index d5ad0e1..b0dbb56 100644
> --- a/arp.h
> +++ b/arp.h
> @@ -22,5 +22,7 @@ struct arpmsg {
>  
>  int arp(const struct ctx *c, struct iov_tail *data);
>  void arp_send_init_req(const struct ctx *c);
> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
> +			 const unsigned char *mac);
>  
>  #endif /* ARP_H */
> diff --git a/fwd.c b/fwd.c
> index c6348ab..879a351 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -26,6 +26,8 @@
>  #include "passt.h"
>  #include "lineread.h"
>  #include "flow_table.h"
> +#include "arp.h"
> +#include "ndp.h"
>  
>  /* Empheral port range: values from RFC 6335 */
>  static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>  
>  	memcpy(&e->addr, addr, sizeof(*addr));
>  	memcpy(e->mac, mac, ETH_ALEN);
> +
> +	/* Send gratuitous ARP / unsolicited NA for the new mapping */

AFAICT this doesn't actually implement what the commit message
describes - it seems to always send an ARP/NA when the neighbour table
is updated.

> +	if (inany_v4(addr)) {
> +		struct in_addr ip4 = *inany_v4(addr);
> +
> +		arp_send_gratuitous(c, ip4, e->mac);
> +	} else {
> +		ndp_send_unsolicited_na(c, &addr->a6);
> +	}
>  }
>  
>  /**
> diff --git a/ndp.c b/ndp.c
> index 70b68aa..8914f31 100644
> --- a/ndp.c
> +++ b/ndp.c
> @@ -226,6 +226,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
>  	ndp_send(c, dst, &na, sizeof(na));
>  }
>  
> +/**
> + * ndp_send_unsolicited_na() - Send unsolicited NA
> + * @c:		Execution context
> + * @addr:	IPv6 address to advertise
> + */
> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
> +{
> +	ndp_na(c, &in6addr_ll_all_nodes, addr);
> +}
> +
>  /**
>   * ndp_ra() - Send an NDP Router Advertisement (RA) message
>   * @c:		Execution context
> diff --git a/ndp.h b/ndp.h
> index 781ea86..320009c 100644
> --- a/ndp.h
> +++ b/ndp.h
> @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
>  	struct iov_tail *data);
>  void ndp_timer(const struct ctx *c, const struct timespec *now);
>  void ndp_send_init_req(const struct ctx *c);
> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
>  
>  #endif /* NDP_H */
> -- 
> 2.50.1
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
  2025-09-24  2:47   ` David Gibson
@ 2025-09-24  3:34     ` David Gibson
  2025-09-24 18:40     ` Jon Maloy
  1 sibling, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-09-24  3:34 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 1492 bytes --]

On Wed, Sep 24, 2025 at 12:47:22PM +1000, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
[snip]
> > +			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> > +				switch (rta->rta_type) {
> > +				case NDA_DST:
> > +					dst = RTA_DATA(rta);
> > +					dstlen = RTA_PAYLOAD(rta);
> > +					break;
> > +				case NDA_LLADDR:
> > +					lladdr = RTA_DATA(rta);
> > +					lladdr_len = RTA_PAYLOAD(rta);
> > +					break;
> > +				default:
> > +					break;
> > +				}
> > +			}
> 
> It seems like somewhere there ought to be something specifying the
> address type of both DST and LLADDR, but I'm not sure exactly where.
> We should check those, just in case some weirdo runs us on a host with
> IPX over Token Ring.

Looked into this a bit more.  The LLADDR type is a property of the
link.  I don't know if we get an RTM_NEWLINK before RTM_NEWNEIGH on
the subscription channel.  If so, we can check that and give up if
it's not ARPHRD_ETHER.  If not, we should probably probe the LLADDR
type via our normal netlink socket and only subscribe if it is
ARPHRD_ETHER.

For the protocol address type, it looks to be in ndm->ndm_family,
though rtnetlink(7) doesn't make that very clear.  We should check
that, not just the address length.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
  2025-09-24  2:47   ` David Gibson
  2025-09-24  3:34     ` David Gibson
@ 2025-09-24 18:40     ` Jon Maloy
  2025-09-25  6:42       ` David Gibson
  1 sibling, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-24 18:40 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-23 22:47, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
>> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
>> requires the ability to translate from an IP address to its
>> corresponding MAC address in cases where those are present in
>> the ARP or NDP tables.
>>
>> To keep track of the contents of these tables we add a netlink
>> based neighbour subscription feature.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> The change to a subscription model is large enough that I think it
> warrants dropping pre-existing reviews.
> 
>> ---
>> v3: - Added an attribute contianing NDA_DST to sent message, so
>>        that we let the kernel do the filtering of the IP address
>>        and return only one entry.
>>      - Added interface index to the call signature. Since the only
>>        interface we know is the template interface, this limits
>>        the number of hosts that will be seen as 'network segment
>>        local' from a PASST viewpoint.
>> v4: - Made loop independent of attribute order.
>>      - Ignoring L2 addresses which are not of size ETH_ALEN.
>> v5: - Changed return value of new function, so caller can know if
>>        a MAC address really was found.
>> v6: - Removed warning printout which had ended up in the wrong
>>        commit.
>> v8: - Changed to neighbour event subscription model
>>
>> netlink: arp/ndp table subscription
>> ---
>>   epoll_type.h |   2 +
>>   netlink.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   netlink.h    |   4 ++
>>   passt.c      |   8 ++++
>>   4 files changed, 128 insertions(+)
>>
>> diff --git a/epoll_type.h b/epoll_type.h
>> index 12ac59b..a90ffb6 100644
>> --- a/epoll_type.h
>> +++ b/epoll_type.h
>> @@ -44,6 +44,8 @@ enum epoll_type {
>>   	EPOLL_TYPE_REPAIR_LISTEN,
>>   	/* TCP_REPAIR helper socket */
>>   	EPOLL_TYPE_REPAIR,
>> +	/* Netlink neighbour subscription socket */
>> +	EPOLL_TYPE_NL_NEIGH,
>>   
>>   	EPOLL_NUM_TYPES,
>>   };
>> diff --git a/netlink.c b/netlink.c
>> index c436780..1faf3da 100644
>> --- a/netlink.c
>> +++ b/netlink.c
>> @@ -53,6 +53,7 @@
>>   int nl_sock	= -1;
>>   int nl_sock_ns	= -1;
>>   static int nl_seq = 1;
>> +static int nl_sock_neigh = -1;
>>   
>>   /**
>>    * nl_sock_init_do() - Set up netlink sockets in init or target namespace
>> @@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
>>   	return 0;
>>   }
>>   
>> +/**
>> + * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
>> + *
>> + * Return: 0 on success, -1 on failure
>> + */
>> +int nl_neigh_subscr_init(struct ctx *c)
>> +{
>> +	struct epoll_event ev = { 0 };
>> +	union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };
> 
> You overwrite the .fd field, so there's not need to include it in the
> initializer.  (Also 0 is a bad placeholder, since 0 is a valid fd).
ok
> 
>> +
>> +	struct sockaddr_nl addr = {
>> +		.nl_family = AF_NETLINK,
>> +		.nl_groups = RTMGRP_NEIGH,
>> +	};
>> +
>> +	if (nl_sock_neigh >= 0)
>> +		return 0;
>> +
>> +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
>> +	if (nl_sock_neigh < 0)
>> +		return -1;
>> +
>> +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
>> +		close(nl_sock_neigh);
>> +		nl_sock_neigh = -1;
>> +		return -1;
>> +	}
>> +
>> +	ref.fd = nl_sock_neigh;
>> +	ev.events = EPOLLIN;
> 
> You could set this in the initializer.
ok
> 
>> +	ev.data.u64 = ref.u64;
>> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
>> +		close(nl_sock_neigh);
>> +		nl_sock_neigh = -1;
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
>> + * @c:	       Execution context
>> + */
>> +void nl_neigh_subscr_handler(struct ctx *c)
>> +{
>> +	struct nlmsghdr *nh;
>> +	char buf[NLBUFSIZ];
>> +	ssize_t n;
>> +
>> +	if (nl_sock_neigh < 0)
>> +		return;
>> +
>> +	for (;;) {
>> +		n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
>> +		if (n <= 0)
>> +			return;
> 
> EAGAIN correctly results in a silent return, but you probably want to
> report other errors and EOF (n == 0) since those are not expected.

Ok. But EOF is never returned from netlink. I can issue a warn_perror() 
for all other negative return values than EAGAIN.

> 
>> +
>> +		nh = (struct nlmsghdr *)buf;
>> +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
>> +			struct ndmsg *ndm = NLMSG_DATA(nh);
>> +			struct rtattr *rta = (struct rtattr *)(ndm + 1);
>> +			size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
>> +			const uint8_t *lladdr = NULL;
>> +			const void *dst = NULL;
>> +			size_t lladdr_len = 0;
>> +			size_t dstlen = 0;
>> +
>> +			if (nh->nlmsg_type == NLMSG_DONE ||
>> +			    nh->nlmsg_type == NLMSG_ERROR)
>> +				continue;
> 
> IIUC, NLMSG_ERROR would also be unexpected, so should probably be
> reported.
> 
>> +			if (nh->nlmsg_type != RTM_NEWNEIGH &&
>> +			    nh->nlmsg_type != RTM_DELNEIGH)
>> +				continue;
> 
> Should we filter on ndm->ndm_ifindex for neighbours on the template
> interface only?  I'm not sure.

Yes. I was wondering about this, because most events are state changes 
we aren't really interested in. My research indicates I can ignore all 
states except NUD_VALID (==up) and (INCOMPLETE | FAILED) (== down).
This cannot be set in the subscription itself unless I want to involve 
eBPF (I don't), so the right place is to filter it here. That will
save at least some cycles.

> 
> It seems like we probably should filter on .
> 
>> +			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
>> +				switch (rta->rta_type) {
>> +				case NDA_DST:
>> +					dst = RTA_DATA(rta);
>> +					dstlen = RTA_PAYLOAD(rta);
>> +					break;
>> +				case NDA_LLADDR:
>> +					lladdr = RTA_DATA(rta);
>> +					lladdr_len = RTA_PAYLOAD(rta);
>> +					break;
>> +				default:
>> +					break;
>> +				}
>> +			}
> 
> It seems like somewhere there ought to be something specifying the
> address type of both DST and LLADDR, but I'm not sure exactly where.
> We should check those, just in case some weirdo runs us on a host with
> IPX over Token Ring.

There is (ndm_family == AF_INET/6) for DST, and (lladdr_len == 
ETH_ALEN). The latter sounds like a good idea, while the fist sounds 
slightly paranoid. OK, we do have AF_TIPC when giving it a second 
thought ;-)

///jon


> 
>> +			if (!dst)
>> +				continue;
>> +
>> +			if (dstlen != sizeof(struct in_addr) &&
>> +			    dstlen != sizeof(struct in6_addr))
>> +				continue;
>> +
>> +			char abuf[INET6_ADDRSTRLEN];
>> +
>> +			if (dstlen == sizeof(struct in_addr))
>> +				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
>> +			else
>> +				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
>> +
>> +			if (nh->nlmsg_type == RTM_NEWNEIGH)
>> +				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
>> +			else
>> +				debug("neigh: DEL %s", abuf);
>> +		}
>> +	}
>> +}
>> +
>>   /**
>>    * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
>>    * @c:		Execution context
>> diff --git a/netlink.h b/netlink.h
>> index b51e99c..a7d3506 100644
>> --- a/netlink.h
>> +++ b/netlink.h
>> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>>   		 int s_dst, unsigned int ifi_dst, sa_family_t af);
>>   int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
>>   		void *addr, int *prefix_len, void *addr_l);
>> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
>> +		      unsigned char *mac);
>>   int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
>>   		const void *addr, int prefix_len);
>>   int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
>> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
>>   int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
>>   int nl_link_set_flags(int s, unsigned int ifi,
>>   		      unsigned int set, unsigned int change);
>> +int nl_neigh_subscr_init(struct ctx *c);
>> +void nl_neigh_subscr_handler(struct ctx *c);
>>   
>>   #endif /* NETLINK_H */
>> diff --git a/passt.c b/passt.c
>> index 31fbb75..e20bbad 100644
>> --- a/passt.c
>> +++ b/passt.c
>> @@ -53,6 +53,7 @@
>>   #include "vu_common.h"
>>   #include "migrate.h"
>>   #include "repair.h"
>> +#include "netlink.h"
>>   
>>   #define EPOLL_EVENTS		8
>>   
>> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
>>   	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>>   	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>>   	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
>> +	[EPOLL_TYPE_NL_NEIGH]		= "netlink neighbour subscription socket",
>>   };
>>   static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>>   	      "epoll_type_str[] doesn't match enum epoll_type");
>> @@ -322,6 +324,9 @@ int main(int argc, char **argv)
>>   
>>   	pcap_init(&c);
>>   
>> +	if (nl_neigh_subscr_init(&c) < 0)
>> +		warn("Failed to subscribe to RTMGRP_NEIGH");
>> +
>>   	if (!c.foreground) {
>>   		if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
>>   			die_perror("Failed to open /dev/null");
>> @@ -414,6 +419,9 @@ loop:
>>   		case EPOLL_TYPE_REPAIR:
>>   			repair_handler(&c, eventmask);
>>   			break;
>> +		case EPOLL_TYPE_NL_NEIGH:
>> +			nl_neigh_subscr_handler(&c);
>> +			break;
>>   		default:
>>   			/* Can't happen */
>>   			ASSERT(0);
>> -- 
>> 2.50.1
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents
  2025-09-24  3:03   ` David Gibson
@ 2025-09-24 18:54     ` Jon Maloy
  0 siblings, 0 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24 18:54 UTC (permalink / raw)
  To: passt-dev



On 2025-09-23 23:03, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:23PM -0400, Jon Maloy wrote:
>> We add a cache table to keep track of the contents of the kernel ARP
>> and NDP tables. The table is fed from the just introduced netlink based
>> neigbour subscription function. The new table eliminates the need for
>> explicit netlink calls to find a host's MAC address.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>>
>> ---
>> v5: - Moved to earlier in series to reduce rebase conflicts
>> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
>>      - Removed hash lookup. We now only use linear lookup in a
>>        linked list
>>      - Eliminated dynamic memory allocation.
>>      - Ensured there is only one call to clock_gettime()
>>      - Using MAC_ZERO instead of the previously dedicated definitions
>> v7: - NOW using MAC_ZERO where needed
>>      - I am still using linear back-off for empty cache entries. Even
>>        an incoming, flow-creating packet from a local host gives no
>>        guarantee that its MAC address is in the ARP table, so we must
>>        allow for a few new attempts at first possible occasions. Only
>>        after several failed lookups can we conclude that we probably
>>        never will succeed. Hence the back-off.
>>      - Fixed a bug that David inadvertently made me aware of: I only
>>        intended to set the initial expiry value to MAC_CACHE_RENEWAL
>>        when an ARP/NDP table lookup was successful.
>>      - Improved struct and function description comments.
>> v8: - Total re-design of table, adapting to the new, subscription
>>        based way of updating it.
>> v9: - Catering for MAC address change for an existing host.
>> ---
>>   conf.c    |   1 +
>>   fwd.c     | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   fwd.h     |   7 +++
>>   netlink.c |  15 +++--
>>   4 files changed, 180 insertions(+), 5 deletions(-)
>>
>> diff --git a/conf.c b/conf.c
>> index 66b9e63..cc491c3 100644
>> --- a/conf.c
>> +++ b/conf.c
>> @@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv)
>>   		c->udp.fwd_out.mode = fwd_default;
>>   
>>   	fwd_scan_ports_init(c);
>> +	fwd_neigh_cache_init();
>>   
>>   	if (!c->quiet)
>>   		conf_print(c);
>> diff --git a/fwd.c b/fwd.c
>> index 250cf56..491303d 100644
>> --- a/fwd.c
>> +++ b/fwd.c
>> @@ -32,6 +32,168 @@ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>>   static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
>>   
>>   #define PORT_RANGE_SYSCTL	"/proc/sys/net/ipv4/ip_local_port_range"
>> +#define MAC_CACHE_SIZE     1024
> 
> Nit: you use "mac cache" some places, "neigh cache" others.

Good. point, and something I have been thinking about myself. I think
I will change to neigh_cache all over the line.

> 
>> +
>> +/**
>> + * mac_cache_entry -  Entry in the ARP/NDP table cache
>> + * @next:	Next entry in slot or free list
>> + * @addr:	IP address of represented host
>> + * @mac:	MAC address of represented host
>> + */
>> +struct mac_cache_entry {
>> +	struct mac_cache_entry *next;
> 
> Hash buckets might be overkill, but they're pretty easy and cheap, so
> whatever.

In theory there might be hundreds or even thousands of neighbors.
Maybe not often, but this is cheap as you say.

> 
>> +	union inany_addr addr;
>> +	uint8_t mac[ETH_ALEN];
>> +};
>> +
>> +/**
>> + * mac_cache_table -  Cache of ARP/NDP table contents
>> + * @entries:	Entries to be plugged into the hash slots when allocated
>> + * @slots:	Hash table slots
>> + * @free:	Linked list of unused entries
>> + */
>> +struct mac_cache_table {
>> +	struct mac_cache_entry entries[MAC_CACHE_SIZE];
>> +	struct mac_cache_entry *slots[MAC_CACHE_SIZE];
> 
> There's no inherent reason these two tables need the same size (and
> indeed some reasons they might not want to have the same size).

I know, but I decided in the end to give them the same size. It might
make sense to make the slot array larger than the entry array to reduce
risk of collisions, but this risk and its consequences are really
minimal in my view.

> 
>> +	struct mac_cache_entry *free;
>> +};
>> +
>> +static struct mac_cache_table mac_cache;
>> +
>> +/**
>> + * fwd_mac_cache_slot_idx() - Hash key to a number within the table range
>> + * @c:		Execution context
>> + * @key:	The key to be used for the hash
>> + *
>> + * Return: The resulting hash value
>> + */
>> +static inline size_t fwd_mac_cache_slot_idx(const struct ctx *c,
>> +					    const union inany_addr *key)
> 
> Maybe just neigh_hash_bucket()?

Yeah. Maybe.

> 
>> +{
>> +	struct siphash_state st = SIPHASH_INIT(c->hash_secret);
>> +	uint32_t i;
>> +
>> +	inany_siphash_feed(&st, key);
>> +	i = siphash_final(&st, sizeof(*key), 0);
>> +
>> +	return ((size_t)i) & (MAC_CACHE_SIZE - 1);
> 
> This subtly depends on MAC_CACHE_SIZE being a power of 2.  If you use
> (... % MAC_CACHE_SIZE) it will work regardless, and the compiler will
> still optimize if it _is_ a power of 2.

Absolutely. I believe I had a comment about this at the #define line
in earlier versions, but it seems to have disappeared.
I will re-introduce it.

> 
>> +}
>> +
>> +/**
>> + * fwd_mac_cache_find() - Find a MAC cache table entry
>> + * @c:		Execution context
>> + * @addr:	Neighbour address to be used as key for the lookup
>> + *
>> + * Return: The found entry, if any. Otherwise NULL.
>> + */
>> +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c,
>> +						  const union inany_addr *addr)
>> +{
>> +	size_t idx = fwd_mac_cache_slot_idx(c, addr);
>> +	struct mac_cache_entry *e = mac_cache.slots[idx];
>> +
>> +	while (e && !inany_equals(&e->addr, addr))
>> +		e = e->next;
>> +
>> +	return e;
>> +}
>> +
>> +/**
>> + * fwd_mac_cache_alloc() - Allocate a mac cache table entry from the free list
>> + * @c:		Execution context
>> + * @addr:	Address used to determine insertion slot and store in entry
>> + * @mac:	The MAC address associated with the neighbour address
>> + */
>> +void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>> +			       const union inany_addr *addr, uint8_t *mac)
> 
> This can update as well as allocating.  Maybe fwd_neigh_cache_set()?

Ok

> 
>> +{
>> +	struct mac_cache_table *t = &mac_cache;
>> +	struct mac_cache_entry *e;
>> +	ssize_t idx;
>> +
>> +	/* MAC address might change sometimes */
>> +	e = fwd_mac_cache_find(c, addr);
>> +	if (e) {
>> +		memcpy(e->mac, mac, ETH_ALEN);
>> +		return;
>> +	}
>> +
>> +	e = t->free;
>> +	if (!e)
>> +		return;
>> +	t->free = e->next;
>> +
>> +	idx = fwd_mac_cache_slot_idx(c, addr);
>> +	e->next = t->slots[idx];
>> +	t->slots[idx] = e;
>> +
>> +	memcpy(&e->addr, addr, sizeof(*addr));
>> +	memcpy(e->mac, mac, ETH_ALEN);
>> +}
>> +
>> +/**
>> + * fwd_mac_cache_free() - Remove an entry from a slot and add it to free list
>> + * @c:		Execution context
>> + * @addr:	Neighbour address used to find the slot for the entry
>> + */
>> +void fwd_neigh_mac_cache_free(const struct ctx *c, const union inany_addr *addr)
>> +{
>> +	ssize_t idx = fwd_mac_cache_slot_idx(c, addr);
>> +	struct mac_cache_table *t = &mac_cache;
>> +	struct mac_cache_entry *e, **prev;
>> +
>> +	prev = &t->slots[idx];
>> +	e = t->slots[idx];
>> +	while (e && !inany_equals(&e->addr, addr)) {
>> +		prev = &e->next;
>> +		e = e->next;
>> +	}
>> +	if (!e)
>> +		return;
>> +
>> +	*prev = e->next;
>> +	e->next = t->free;
>> +	t->free = e;
>> +	memset(&e->addr, 0, sizeof(*addr));
>> +	memset(e->mac, 0, ETH_ALEN);
>> +}
>> +
>> +/**
>> + * fwd_neigh_mac_get() - Lookup MAC address in the ARP/NDP cache table
>> + * @c:		Execution context
>> + * @addr:	Neighbour address used as lookup key
>> + * @mac:	Buffer for Ethernet MAC to return, found or default value.
>> + *
>> + * Return: true if real MAC found, false if not found or if failure
>> + */
>> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
>> +		       uint8_t *mac)
>> +{
>> +	struct mac_cache_entry *e = fwd_mac_cache_find(c, addr);
>> +
>> +	if (e)
>> +		memcpy(mac, e->mac, ETH_ALEN);
>> +	else
>> +		memcpy(mac, c->our_tap_mac, ETH_ALEN);
>> +
>> +	return !!e;
>> +}
>> +
>> +/**
>> + * fwd_neigh_cache_init() - Initialize the neighbor ARP/NDP cache table
>> + */
>> +void fwd_neigh_cache_init(void)
>> +{
>> +	struct mac_cache_table *t = &mac_cache;
>> +	struct mac_cache_entry *e;
>> +
>> +	memset(t, 0, sizeof(*t));
>> +	for (int i = 0; i < MAC_CACHE_SIZE; i++) {
>> +		e = &t->entries[i];
>> +		e->next = t->free;
>> +		t->free = e;
>> +	}
>> +}
>>   
>>   /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
>>    *
>> diff --git a/fwd.h b/fwd.h
>> index 65c7c96..a0e8fbc 100644
>> --- a/fwd.h
>> +++ b/fwd.h
>> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
>>   			    const struct flowside *ini, struct flowside *tgt);
>>   uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>>   			  const struct flowside *ini, struct flowside *tgt);
>> +void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>> +			       const union inany_addr *addr, uint8_t *mac);
>> +void fwd_neigh_mac_cache_free(const struct ctx *c,
>> +			      const union inany_addr *addr);
>> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
>> +		       uint8_t *mac);
>> +void fwd_neigh_cache_init(void);
>>   
>>   #endif /* FWD_H */
>> diff --git a/netlink.c b/netlink.c
>> index 1faf3da..1e1ec43 100644
>> --- a/netlink.c
>> +++ b/netlink.c
>> @@ -131,6 +131,8 @@ int nl_neigh_subscr_init(struct ctx *c)
>>    */
>>   void nl_neigh_subscr_handler(struct ctx *c)
>>   {
>> +	union inany_addr addr;
>> +	uint8_t mac[ETH_ALEN];
>>   	struct nlmsghdr *nh;
>>   	char buf[NLBUFSIZ];
>>   	ssize_t n;
>> @@ -183,17 +185,20 @@ void nl_neigh_subscr_handler(struct ctx *c)
>>   			    dstlen != sizeof(struct in6_addr))
>>   				continue;
>>   
>> -			char abuf[INET6_ADDRSTRLEN];
>> +			if (!lladdr || lladdr_len != ETH_ALEN)
>> +				continue;
>> +
>> +			memcpy(mac, lladdr, ETH_ALEN);
>>   
>>   			if (dstlen == sizeof(struct in_addr))
>> -				inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
>> +				addr = inany_from_v4(*(struct in_addr *) dst);
>>   			else
>> -				inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
>> +				addr.a6 = *(struct in6_addr *) dst;
>>   
>>   			if (nh->nlmsg_type == RTM_NEWNEIGH)
>> -				debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
>> +				fwd_neigh_mac_cache_alloc(c, &addr, mac);
>>   			else
>> -				debug("neigh: DEL %s", abuf);
>> +				fwd_neigh_mac_cache_free(c, &addr);
> 
> I think the debug() messages could still be useful.  They could move
> into the functions of course, and might want to be demoted to trace().

Yes. Good point.

///jon

> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24  3:22   ` David Gibson
@ 2025-09-24 22:18     ` Jon Maloy
  2025-09-24 23:32       ` Jon Maloy
  2025-09-25  6:36       ` David Gibson
  2025-09-26 23:25     ` Jon Maloy
  1 sibling, 2 replies; 34+ messages in thread
From: Jon Maloy @ 2025-09-24 22:18 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-23 23:22, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
>> Gratuitious ARP and unsolicitated NA should be handled with caution
>> because of the risk of malignant users emitting them to disturb
>> network communication.
>>
>> There is however one case we where we know it is legitimate
>> and safe for us to send out such messages: The one time we switch
>> from using ctx->own_tap_mac to a MAC address received via the
>> recently added neigbour subscription function. Later changes to
>> the MAC address of a host in an existing entry cannot be fully
>> trusted, so we abstain from doing it in such cases.
> 
> So, I think you're right that the gratuitous ARP is safe in this case.
> 
> But it concerns me that (other that some edge cases) we're sending
> data to the guest under own_tap_mac before we get the real MAC.  At
> the point we send data from a flow to the guest, I would have expected
> to already have an entry in the host neighbour table (because by
> definition the host is talking to the peer), therefore in our cache,
> by the subscriber.
> 
> I'm wondering if it could be as simple as both the neighbour update
> and the actual data coming in the same epoll batch, and we could avoid
> the temporary use of own_tap_mac by prioritising processing of the
> neighbour events.
> 

I experimented a bit with this. My test program is a simple UDP 
client-server pair, exchanging first 3 UDP messages client->server, 
followed by
3 messages server->client.

First, I changed the main() loop a bit, so that netlink events are
handled before all other events, if any. (Basically, I added
an extra loop before the main loop, only handling netlink events, before
moving on to the main loop (where netlink events had been excluded.)
This should secure absolute priority of netlink events before any other
events. As you will see below, this made no difference to the scenarios
I describe.

1: When starting the container, I notice that there is no subscription
    event in PASTA, even though I can see the entry for the remote host
    is present in the host's ARP table. There is never any event coming
    up even if I wait for 10+ minutes.
2: The first UDP is attempted sent from the guest. An ARP request is
    sent to PASTA, and responded to with the 9a:9a: address.
3: The UDP, and two more UDPs, are sent via PASTA to the remote host.
    Those are responded to and sent back to the guest.
4: I now receive a neigbour event, and can update my cache, but since
    there is still no new ARP request from the guest, even if I wait
    for many minutes, he continues in the belief the old address
    is confirmed.
5: If I run the same test again after a few minutes,
    the guest *does* send out an ARP request a few seconds after the
    message exchange, and is now updated with the correct address.

- If i run this sequence in the opposite direction everything seems to
   work ok, at least if the ARP entry is already present on the local
   host.

- When I delete that ARP entry before running the sequence, a neigbour
   event shows up after some seconds, but it can take up to a minute, at
   least. If I run my sequence from the remote host before that happens,
   there will be an ARP request from the guest (for the response UDPs),
   responded to with the default tap mac, and it will remain
   like that for a long time, since the guest considers the mac address
   confirmed. It doesn't help much that a neigbour event shows up some
   seconds after the exchange.

In brief, the guest *will* be updated eventually, but depending on luck
and timing it may take a long time, at least several minutes.
My gratuitous ARPs/ non-solicitated NAs doesn't completely solve this 
issue, but it significantly reduces the potential time gap before the 
guest gets properly updated.


>> When sending this type of messages, we notice that the guest accepts
>> the update, but also asks for a confirmation in the form of a regular
>> ARP/NS request. This is responded to with the new value, and we have
>> exactly the effect we wanted.
>>
>> This commit adds this functionality.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>> ---
>>   arp.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   arp.h |  2 ++
>>   fwd.c | 11 +++++++++++
>>   ndp.c | 10 ++++++++++
>>   ndp.h |  1 +
>>   5 files changed, 63 insertions(+)
>>
>> diff --git a/arp.c b/arp.c
>> index 442faff..259f736 100644
>> --- a/arp.c
>> +++ b/arp.c
>> @@ -151,3 +151,42 @@ void arp_send_init_req(const struct ctx *c)
>>   	debug("Sending initial ARP request for guest MAC address");
>>   	tap_send_single(c, &req, sizeof(req));
>>   }
>> +
>> +/**
>> + * arp_send_gratuitous() - Send a gratuitous ARP announcement for an IPv4 host
>> + * @c:		Execution context
>> + * @ip:	IPv4 address we announce as owned by @mac
>> + * @mac:	MAC address to advertise for @ip
>> + */
>> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
>> +			 const unsigned char *mac)
>> +{
>> +	char ip_str[INET_ADDRSTRLEN];
>> +	struct {
>> +		struct ethhdr eh;
>> +		struct arphdr ah;
>> +		struct arpmsg am;
>> +	} __attribute__((__packed__)) req;
> 
> 'req' is not a great name, since this is an ARP response, not a
> request (but see below).
> 
>> +	/* Ethernet header */
>> +	req.eh.h_proto = htons(ETH_P_ARP);
>> +	memcpy(req.eh.h_dest, MAC_BROADCAST, sizeof(req.eh.h_dest));
>> +	memcpy(req.eh.h_source, c->our_tap_mac, sizeof(req.eh.h_source));
>> +
>> +	/* ARP header */
>> +	req.ah.ar_op = htons(ARPOP_REPLY);
>> +	req.ah.ar_hrd = htons(ARPHRD_ETHER);
>> +	req.ah.ar_pro = htons(ETH_P_IP);
>> +	req.ah.ar_hln = ETH_ALEN;
>> +	req.ah.ar_pln = 4;
>> +
>> +	/* ARP message */
>> +	memcpy(req.am.sha, mac, sizeof(req.am.sha));
>> +	memcpy(req.am.sip, &ip, sizeof(req.am.sip));
>> +	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
>> +	memcpy(req.am.tip, &ip, sizeof(req.am.tip));
> 
> So, I was trying to check if it made sense to use the same IP for both
> source and target here, and came across
>      https://www.rfc-editor.org/rfc/rfc5227#section-3
> 
> Which suggests we should (counter intuitively) be using ARP requests,
> not ARP replies for announcements.

Instead of gratuitous ARP, you mean? I can try it.

///jon

> 
>> +	inet_ntop(AF_INET, &ip, ip_str, sizeof(ip_str));
>> +	debug("Sending gratuitous ARP for %s", ip_str);
>> +	tap_send_single(c, &req, sizeof(req));
>> +}
>> diff --git a/arp.h b/arp.h
>> index d5ad0e1..b0dbb56 100644
>> --- a/arp.h
>> +++ b/arp.h
>> @@ -22,5 +22,7 @@ struct arpmsg {
>>   
>>   int arp(const struct ctx *c, struct iov_tail *data);
>>   void arp_send_init_req(const struct ctx *c);
>> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
>> +			 const unsigned char *mac);
>>   
>>   #endif /* ARP_H */
>> diff --git a/fwd.c b/fwd.c
>> index c6348ab..879a351 100644
>> --- a/fwd.c
>> +++ b/fwd.c
>> @@ -26,6 +26,8 @@
>>   #include "passt.h"
>>   #include "lineread.h"
>>   #include "flow_table.h"
>> +#include "arp.h"
>> +#include "ndp.h"
>>   
>>   /* Empheral port range: values from RFC 6335 */
>>   static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>   
>>   	memcpy(&e->addr, addr, sizeof(*addr));
>>   	memcpy(e->mac, mac, ETH_ALEN);
>> +
>> +	/* Send gratuitous ARP / unsolicited NA for the new mapping */
> 
> AFAICT this doesn't actually implement what the commit message
> describes - it seems to always send an ARP/NA when the neighbour table
> is updated.
> 
>> +	if (inany_v4(addr)) {
>> +		struct in_addr ip4 = *inany_v4(addr);
>> +
>> +		arp_send_gratuitous(c, ip4, e->mac);
>> +	} else {
>> +		ndp_send_unsolicited_na(c, &addr->a6);
>> +	}
>>   }
>>   
>>   /**
>> diff --git a/ndp.c b/ndp.c
>> index 70b68aa..8914f31 100644
>> --- a/ndp.c
>> +++ b/ndp.c
>> @@ -226,6 +226,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
>>   	ndp_send(c, dst, &na, sizeof(na));
>>   }
>>   
>> +/**
>> + * ndp_send_unsolicited_na() - Send unsolicited NA
>> + * @c:		Execution context
>> + * @addr:	IPv6 address to advertise
>> + */
>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
>> +{
>> +	ndp_na(c, &in6addr_ll_all_nodes, addr);
>> +}
>> +
>>   /**
>>    * ndp_ra() - Send an NDP Router Advertisement (RA) message
>>    * @c:		Execution context
>> diff --git a/ndp.h b/ndp.h
>> index 781ea86..320009c 100644
>> --- a/ndp.h
>> +++ b/ndp.h
>> @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
>>   	struct iov_tail *data);
>>   void ndp_timer(const struct ctx *c, const struct timespec *now);
>>   void ndp_send_init_req(const struct ctx *c);
>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
>>   
>>   #endif /* NDP_H */
>> -- 
>> 2.50.1
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24 22:18     ` Jon Maloy
@ 2025-09-24 23:32       ` Jon Maloy
  2025-09-25  6:38         ` David Gibson
  2025-09-25  6:36       ` David Gibson
  1 sibling, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-24 23:32 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-24 18:18, Jon Maloy wrote:
> 
> 
> On 2025-09-23 23:22, David Gibson wrote:
>> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:

[...]

>>> --- a/fwd.c
>>> +++ b/fwd.c
>>> @@ -26,6 +26,8 @@
>>>   #include "passt.h"
>>>   #include "lineread.h"
>>>   #include "flow_table.h"
>>> +#include "arp.h"
>>> +#include "ndp.h"
>>>   /* Empheral port range: values from RFC 6335 */
>>>   static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>>       memcpy(&e->addr, addr, sizeof(*addr));
>>>       memcpy(e->mac, mac, ETH_ALEN);
>>> +
>>> +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
>>
>> AFAICT this doesn't actually implement what the commit message
>> describes - it seems to always send an ARP/NA when the neighbour table
>> is updated.

No. Check the code again.

///jon

>>
>>> +    if (inany_v4(addr)) {
>>> +        struct in_addr ip4 = *inany_v4(addr);
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24 22:18     ` Jon Maloy
  2025-09-24 23:32       ` Jon Maloy
@ 2025-09-25  6:36       ` David Gibson
  2025-09-25 13:14         ` Jon Maloy
  1 sibling, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-25  6:36 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 7798 bytes --]

On Wed, Sep 24, 2025 at 06:18:52PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-23 23:22, David Gibson wrote:
> > On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> > > Gratuitious ARP and unsolicitated NA should be handled with caution
> > > because of the risk of malignant users emitting them to disturb
> > > network communication.
> > > 
> > > There is however one case we where we know it is legitimate
> > > and safe for us to send out such messages: The one time we switch
> > > from using ctx->own_tap_mac to a MAC address received via the
> > > recently added neigbour subscription function. Later changes to
> > > the MAC address of a host in an existing entry cannot be fully
> > > trusted, so we abstain from doing it in such cases.
> > 
> > So, I think you're right that the gratuitous ARP is safe in this case.
> > 
> > But it concerns me that (other that some edge cases) we're sending
> > data to the guest under own_tap_mac before we get the real MAC.  At
> > the point we send data from a flow to the guest, I would have expected
> > to already have an entry in the host neighbour table (because by
> > definition the host is talking to the peer), therefore in our cache,
> > by the subscriber.
> > 
> > I'm wondering if it could be as simple as both the neighbour update
> > and the actual data coming in the same epoll batch, and we could avoid
> > the temporary use of own_tap_mac by prioritising processing of the
> > neighbour events.
> 
> I experimented a bit with this. My test program is a simple UDP
> client-server pair, exchanging first 3 UDP messages client->server, followed
> by
> 3 messages server->client.

With the client on the guest, and server outside?  How is the outside
machine arranged - is it a physically separate host? A bridged VM or
container on the same host?  Something else?

> First, I changed the main() loop a bit, so that netlink events are
> handled before all other events, if any. (Basically, I added
> an extra loop before the main loop, only handling netlink events, before
> moving on to the main loop (where netlink events had been excluded.)
> This should secure absolute priority of netlink events before any other
> events. As you will see below, this made no difference to the scenarios
> I describe.

Drat.
> 1: When starting the container, I notice that there is no subscription
>    event in PASTA, even though I can see the entry for the remote host
>    is present in the host's ARP table. There is never any event coming
>    up even if I wait for 10+ minutes.

Huh.... do we need to do something to ensure we get events for
existing entries in the host ARP table, not just ones that are added
or updated after we're running?

> 2: The first UDP is attempted sent from the guest. An ARP request is
>    sent to PASTA, and responded to with the 9a:9a: address.

Maybe we still need to explicitly ask for an ARP resolution when the
guest ARPs.

> 3: The UDP, and two more UDPs, are sent via PASTA to the remote host.
>    Those are responded to and sent back to the guest.
> 4: I now receive a neigbour event, and can update my cache, but since
>    there is still no new ARP request from the guest, even if I wait
>    for many minutes, he continues in the belief the old address
>    is confirmed.
> 5: If I run the same test again after a few minutes,
>    the guest *does* send out an ARP request a few seconds after the
>    message exchange, and is now updated with the correct address.
> 
> - If i run this sequence in the opposite direction everything seems to
>   work ok, at least if the ARP entry is already present on the local
>   host.
> 
> - When I delete that ARP entry before running the sequence,

Delete it from the host ARP table, you mean?

>   a neigbour
>   event shows up after some seconds, but it can take up to a minute, at
>   least.

Oof.  I guess some delay is inevitable, but that's way longer than I
would have expected.

>   If I run my sequence from the remote host before that happens,
>   there will be an ARP request from the guest (for the response UDPs),
>   responded to with the default tap mac, and it will remain
>   like that for a long time, since the guest considers the mac address
>   confirmed. It doesn't help much that a neigbour event shows up some
>   seconds after the exchange.
> 
> In brief, the guest *will* be updated eventually, but depending on luck
> and timing it may take a long time, at least several minutes.
> My gratuitous ARPs/ non-solicitated NAs doesn't completely solve this issue,
> but it significantly reduces the potential time gap before the guest gets
> properly updated.
> 
> 
> > > When sending this type of messages, we notice that the guest accepts
> > > the update, but also asks for a confirmation in the form of a regular
> > > ARP/NS request. This is responded to with the new value, and we have
> > > exactly the effect we wanted.
> > > 
> > > This commit adds this functionality.
> > > 
> > > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > > ---
> > >   arp.c | 39 +++++++++++++++++++++++++++++++++++++++
> > >   arp.h |  2 ++
> > >   fwd.c | 11 +++++++++++
> > >   ndp.c | 10 ++++++++++
> > >   ndp.h |  1 +
> > >   5 files changed, 63 insertions(+)
> > > 
> > > diff --git a/arp.c b/arp.c
> > > index 442faff..259f736 100644
> > > --- a/arp.c
> > > +++ b/arp.c
> > > @@ -151,3 +151,42 @@ void arp_send_init_req(const struct ctx *c)
> > >   	debug("Sending initial ARP request for guest MAC address");
> > >   	tap_send_single(c, &req, sizeof(req));
> > >   }
> > > +
> > > +/**
> > > + * arp_send_gratuitous() - Send a gratuitous ARP announcement for an IPv4 host
> > > + * @c:		Execution context
> > > + * @ip:	IPv4 address we announce as owned by @mac
> > > + * @mac:	MAC address to advertise for @ip
> > > + */
> > > +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
> > > +			 const unsigned char *mac)
> > > +{
> > > +	char ip_str[INET_ADDRSTRLEN];
> > > +	struct {
> > > +		struct ethhdr eh;
> > > +		struct arphdr ah;
> > > +		struct arpmsg am;
> > > +	} __attribute__((__packed__)) req;
> > 
> > 'req' is not a great name, since this is an ARP response, not a
> > request (but see below).
> > 
> > > +	/* Ethernet header */
> > > +	req.eh.h_proto = htons(ETH_P_ARP);
> > > +	memcpy(req.eh.h_dest, MAC_BROADCAST, sizeof(req.eh.h_dest));
> > > +	memcpy(req.eh.h_source, c->our_tap_mac, sizeof(req.eh.h_source));
> > > +
> > > +	/* ARP header */
> > > +	req.ah.ar_op = htons(ARPOP_REPLY);
> > > +	req.ah.ar_hrd = htons(ARPHRD_ETHER);
> > > +	req.ah.ar_pro = htons(ETH_P_IP);
> > > +	req.ah.ar_hln = ETH_ALEN;
> > > +	req.ah.ar_pln = 4;
> > > +
> > > +	/* ARP message */
> > > +	memcpy(req.am.sha, mac, sizeof(req.am.sha));
> > > +	memcpy(req.am.sip, &ip, sizeof(req.am.sip));
> > > +	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
> > > +	memcpy(req.am.tip, &ip, sizeof(req.am.tip));
> > 
> > So, I was trying to check if it made sense to use the same IP for both
> > source and target here, and came across
> >      https://www.rfc-editor.org/rfc/rfc5227#section-3
> > 
> > Which suggests we should (counter intuitively) be using ARP requests,
> > not ARP replies for announcements.
> 
> Instead of gratuitous ARP, you mean? I can try it.

It suggests that what's traditionally meant by "gratuitous ARP" is
actually ARP requests, not responses as you might expect.  There's
some detailed reasoning there, I'd give it a read.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24 23:32       ` Jon Maloy
@ 2025-09-25  6:38         ` David Gibson
  2025-09-25 12:48           ` Jon Maloy
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-25  6:38 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

On Wed, Sep 24, 2025 at 07:32:43PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-24 18:18, Jon Maloy wrote:
> > 
> > 
> > On 2025-09-23 23:22, David Gibson wrote:
> > > On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> 
> [...]
> 
> > > > --- a/fwd.c
> > > > +++ b/fwd.c
> > > > @@ -26,6 +26,8 @@
> > > >   #include "passt.h"
> > > >   #include "lineread.h"
> > > >   #include "flow_table.h"
> > > > +#include "arp.h"
> > > > +#include "ndp.h"
> > > >   /* Empheral port range: values from RFC 6335 */
> > > >   static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> > > > @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
> > > >       memcpy(&e->addr, addr, sizeof(*addr));
> > > >       memcpy(e->mac, mac, ETH_ALEN);
> > > > +
> > > > +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
> > > 
> > > AFAICT this doesn't actually implement what the commit message
> > > describes - it seems to always send an ARP/NA when the neighbour table
> > > is updated.
> 
> No. Check the code again.

Still not seeing it, I'm going to need a more specific pointer to what
I've missed.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
  2025-09-24 18:40     ` Jon Maloy
@ 2025-09-25  6:42       ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-09-25  6:42 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 7900 bytes --]

On Wed, Sep 24, 2025 at 02:40:33PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-23 22:47, David Gibson wrote:
> > On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
> > > The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> > > requires the ability to translate from an IP address to its
> > > corresponding MAC address in cases where those are present in
> > > the ARP or NDP tables.
> > > 
> > > To keep track of the contents of these tables we add a netlink
> > > based neighbour subscription feature.
> > > 
> > > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > The change to a subscription model is large enough that I think it
> > warrants dropping pre-existing reviews.
> > 
> > > ---
> > > v3: - Added an attribute contianing NDA_DST to sent message, so
> > >        that we let the kernel do the filtering of the IP address
> > >        and return only one entry.
> > >      - Added interface index to the call signature. Since the only
> > >        interface we know is the template interface, this limits
> > >        the number of hosts that will be seen as 'network segment
> > >        local' from a PASST viewpoint.
> > > v4: - Made loop independent of attribute order.
> > >      - Ignoring L2 addresses which are not of size ETH_ALEN.
> > > v5: - Changed return value of new function, so caller can know if
> > >        a MAC address really was found.
> > > v6: - Removed warning printout which had ended up in the wrong
> > >        commit.
> > > v8: - Changed to neighbour event subscription model
> > > 
> > > netlink: arp/ndp table subscription
> > > ---
> > >   epoll_type.h |   2 +
> > >   netlink.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >   netlink.h    |   4 ++
> > >   passt.c      |   8 ++++
> > >   4 files changed, 128 insertions(+)
> > > 
> > > diff --git a/epoll_type.h b/epoll_type.h
> > > index 12ac59b..a90ffb6 100644
> > > --- a/epoll_type.h
> > > +++ b/epoll_type.h
> > > @@ -44,6 +44,8 @@ enum epoll_type {
> > >   	EPOLL_TYPE_REPAIR_LISTEN,
> > >   	/* TCP_REPAIR helper socket */
> > >   	EPOLL_TYPE_REPAIR,
> > > +	/* Netlink neighbour subscription socket */
> > > +	EPOLL_TYPE_NL_NEIGH,
> > >   	EPOLL_NUM_TYPES,
> > >   };
> > > diff --git a/netlink.c b/netlink.c
> > > index c436780..1faf3da 100644
> > > --- a/netlink.c
> > > +++ b/netlink.c
> > > @@ -53,6 +53,7 @@
> > >   int nl_sock	= -1;
> > >   int nl_sock_ns	= -1;
> > >   static int nl_seq = 1;
> > > +static int nl_sock_neigh = -1;
> > >   /**
> > >    * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> > > @@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
> > >   	return 0;
> > >   }
> > > +/**
> > > + * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
> > > + *
> > > + * Return: 0 on success, -1 on failure
> > > + */
> > > +int nl_neigh_subscr_init(struct ctx *c)
> > > +{
> > > +	struct epoll_event ev = { 0 };
> > > +	union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };
> > 
> > You overwrite the .fd field, so there's not need to include it in the
> > initializer.  (Also 0 is a bad placeholder, since 0 is a valid fd).
> ok
> > 
> > > +
> > > +	struct sockaddr_nl addr = {
> > > +		.nl_family = AF_NETLINK,
> > > +		.nl_groups = RTMGRP_NEIGH,
> > > +	};
> > > +
> > > +	if (nl_sock_neigh >= 0)
> > > +		return 0;
> > > +
> > > +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
> > > +	if (nl_sock_neigh < 0)
> > > +		return -1;
> > > +
> > > +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> > > +		close(nl_sock_neigh);
> > > +		nl_sock_neigh = -1;
> > > +		return -1;
> > > +	}
> > > +
> > > +	ref.fd = nl_sock_neigh;
> > > +	ev.events = EPOLLIN;
> > 
> > You could set this in the initializer.
> ok
> > 
> > > +	ev.data.u64 = ref.u64;
> > > +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> > > +		close(nl_sock_neigh);
> > > +		nl_sock_neigh = -1;
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
> > > + * @c:	       Execution context
> > > + */
> > > +void nl_neigh_subscr_handler(struct ctx *c)
> > > +{
> > > +	struct nlmsghdr *nh;
> > > +	char buf[NLBUFSIZ];
> > > +	ssize_t n;
> > > +
> > > +	if (nl_sock_neigh < 0)
> > > +		return;
> > > +
> > > +	for (;;) {
> > > +		n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> > > +		if (n <= 0)
> > > +			return;
> > 
> > EAGAIN correctly results in a silent return, but you probably want to
> > report other errors and EOF (n == 0) since those are not expected.
> 
> Ok. But EOF is never returned from netlink.

Right.  Which is exactly why it's important to know if it somehow
happens (it could mean, for example that we somehow clobbered our
socket fd).

> I can issue a warn_perror() for
> all other negative return values than EAGAIN.
> 
> > 
> > > +
> > > +		nh = (struct nlmsghdr *)buf;
> > > +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
> > > +			struct ndmsg *ndm = NLMSG_DATA(nh);
> > > +			struct rtattr *rta = (struct rtattr *)(ndm + 1);
> > > +			size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> > > +			const uint8_t *lladdr = NULL;
> > > +			const void *dst = NULL;
> > > +			size_t lladdr_len = 0;
> > > +			size_t dstlen = 0;
> > > +
> > > +			if (nh->nlmsg_type == NLMSG_DONE ||
> > > +			    nh->nlmsg_type == NLMSG_ERROR)
> > > +				continue;
> > 
> > IIUC, NLMSG_ERROR would also be unexpected, so should probably be
> > reported.
> > 
> > > +			if (nh->nlmsg_type != RTM_NEWNEIGH &&
> > > +			    nh->nlmsg_type != RTM_DELNEIGH)
> > > +				continue;
> > 
> > Should we filter on ndm->ndm_ifindex for neighbours on the template
> > interface only?  I'm not sure.
> 
> Yes. I was wondering about this, because most events are state changes we
> aren't really interested in. My research indicates I can ignore all states
> except NUD_VALID (==up) and (INCOMPLETE | FAILED) (== down).
> This cannot be set in the subscription itself unless I want to involve eBPF
> (I don't), so the right place is to filter it here. That will
> save at least some cycles.

Ok, good.

> > It seems like we probably should filter on .
> > 
> > > +			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> > > +				switch (rta->rta_type) {
> > > +				case NDA_DST:
> > > +					dst = RTA_DATA(rta);
> > > +					dstlen = RTA_PAYLOAD(rta);
> > > +					break;
> > > +				case NDA_LLADDR:
> > > +					lladdr = RTA_DATA(rta);
> > > +					lladdr_len = RTA_PAYLOAD(rta);
> > > +					break;
> > > +				default:
> > > +					break;
> > > +				}
> > > +			}
> > 
> > It seems like somewhere there ought to be something specifying the
> > address type of both DST and LLADDR, but I'm not sure exactly where.
> > We should check those, just in case some weirdo runs us on a host with
> > IPX over Token Ring.
> 
> There is (ndm_family == AF_INET/6) for DST, and (lladdr_len == ETH_ALEN).
> The latter sounds like a good idea, while the fist sounds slightly paranoid.
> OK, we do have AF_TIPC when giving it a second thought ;-)

Right.  I really like to actually check assumptions like this.
Checking lladdr_len isn't technically correct, since in theory it
could be a link that has 6-byte addresses that aren't Ethernet MACs.
Or even, theoretically, a link with variable length addresses, and
this one just happens to have 6 bytes.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-25  6:38         ` David Gibson
@ 2025-09-25 12:48           ` Jon Maloy
  2025-09-26  0:47             ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-25 12:48 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-25 02:38, David Gibson wrote:
> On Wed, Sep 24, 2025 at 07:32:43PM -0400, Jon Maloy wrote:
>>
>>
>> On 2025-09-24 18:18, Jon Maloy wrote:
>>>
>>>
>>> On 2025-09-23 23:22, David Gibson wrote:
>>>> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
>>
>> [...]
>>
>>>>> --- a/fwd.c
>>>>> +++ b/fwd.c
>>>>> @@ -26,6 +26,8 @@
>>>>>    #include "passt.h"
>>>>>    #include "lineread.h"
>>>>>    #include "flow_table.h"
>>>>> +#include "arp.h"
>>>>> +#include "ndp.h"
>>>>>    /* Empheral port range: values from RFC 6335 */
>>>>>    static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>>>>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>>>>        memcpy(&e->addr, addr, sizeof(*addr));
>>>>>        memcpy(e->mac, mac, ETH_ALEN);
>>>>> +
>>>>> +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
>>>>
>>>> AFAICT this doesn't actually implement what the commit message
>>>> describes - it seems to always send an ARP/NA when the neighbour table
>>>> is updated.
>>
>> No. Check the code again.
> 
> Still not seeing it, I'm going to need a more specific pointer to what
> I've missed.
> 
Look in fwd_neigh_mac_cache_alloc().
If the entry already exists, the function updates the mac address just 
to be on the safe side, but then returns without sending any ARP/NA.
Or do I misunderstand your question?

/jon


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-25  6:36       ` David Gibson
@ 2025-09-25 13:14         ` Jon Maloy
  2025-09-26  0:55           ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-25 13:14 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-25 02:36, David Gibson wrote:
> On Wed, Sep 24, 2025 at 06:18:52PM -0400, Jon Maloy wrote:
>>
[...]
>>
>> I experimented a bit with this. My test program is a simple UDP
>> client-server pair, exchanging first 3 UDP messages client->server, followed
>> by
>> 3 messages server->client.
> 
> With the client on the guest, and server outside?  How is the outside
> machine arranged - is it a physically separate host? A bridged VM or
> container on the same host?  Something else?

It is a physically separate host.

> 
>> First, I changed the main() loop a bit, so that netlink events are
>> handled before all other events, if any. (Basically, I added
>> an extra loop before the main loop, only handling netlink events, before
>> moving on to the main loop (where netlink events had been excluded.)
>> This should secure absolute priority of netlink events before any other
>> events. As you will see below, this made no difference to the scenarios
>> I describe.
> 
> Drat.
>> 1: When starting the container, I notice that there is no subscription
>>     event in PASTA, even though I can see the entry for the remote host
>>     is present in the host's ARP table. There is never any event coming
>>     up even if I wait for 10+ minutes.
> 
> Huh.... do we need to do something to ensure we get events for
> existing entries in the host ARP table, not just ones that are added
> or updated after we're running? 

It doesn't seem to be possible, but even if it were it wouldn't help us 
much if the entry isn't here, which is also a problematic case. See below.

> 
>> 2: The first UDP is attempted sent from the guest. An ARP request is
>>     sent to PASTA, and responded to with the 9a:9a: address.
> 
> Maybe we still need to explicitly ask for an ARP resolution when the
> guest ARPs.

I think so. If we limit this to ARP and NDP, this should be unproblematic.


> 
>> 3: The UDP, and two more UDPs, are sent via PASTA to the remote host.
>>     Those are responded to and sent back to the guest.
>> 4: I now receive a neigbour event, and can update my cache, but since
>>     there is still no new ARP request from the guest, even if I wait
>>     for many minutes, he continues in the belief the old address
>>     is confirmed.
>> 5: If I run the same test again after a few minutes,
>>     the guest *does* send out an ARP request a few seconds after the
>>     message exchange, and is now updated with the correct address.
>>
>> - If i run this sequence in the opposite direction everything seems to
>>    work ok, at least if the ARP entry is already present on the local
>>    host.
>>
>> - When I delete that ARP entry before running the sequence,
> 
> Delete it from the host ARP table, you mean?

Yes.

> 
>>    a neigbour
>>    event shows up after some seconds, but it can take up to a minute, at
>>    least.
> 
> Oof.  I guess some delay is inevitable, but that's way longer than I
> would have expected.
> 
>>    If I run my sequence from the remote host before that happens,
>>    there will be an ARP request from the guest (for the response UDPs),
>>    responded to with the default tap mac, and it will remain
>>    like that for a long time, since the guest considers the mac address
>>    confirmed. It doesn't help much that a neigbour event shows up some
>>    seconds after the exchange.
>>
>> In brief, the guest *will* be updated eventually, but depending on luck
>> and timing it may take a long time, at least several minutes.
[...]
>>>> +	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
>>>> +	memcpy(req.am.tip, &ip, sizeof(req.am.tip));
>>>
>>> So, I was trying to check if it made sense to use the same IP for both
>>> source and target here, and came across
>>>       https://www.rfc-editor.org/rfc/rfc5227#section-3
>>>
>>> Which suggests we should (counter intuitively) be using ARP requests,
>>> not ARP replies for announcements.
>>
>> Instead of gratuitous ARP, you mean? I can try it.
> 
> It suggests that what's traditionally meant by "gratuitous ARP" is
> actually ARP requests, not responses as you might expect.  There's
> some detailed reasoning there, I'd give it a read.

So will I. It sounds interesting.

///jon

> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-25 12:48           ` Jon Maloy
@ 2025-09-26  0:47             ` David Gibson
  2025-09-26 22:59               ` Jon Maloy
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-26  0:47 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 2357 bytes --]

On Thu, Sep 25, 2025 at 08:48:51AM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-25 02:38, David Gibson wrote:
> > On Wed, Sep 24, 2025 at 07:32:43PM -0400, Jon Maloy wrote:
> > > 
> > > 
> > > On 2025-09-24 18:18, Jon Maloy wrote:
> > > > 
> > > > 
> > > > On 2025-09-23 23:22, David Gibson wrote:
> > > > > On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> > > 
> > > [...]
> > > 
> > > > > > --- a/fwd.c
> > > > > > +++ b/fwd.c
> > > > > > @@ -26,6 +26,8 @@
> > > > > >    #include "passt.h"
> > > > > >    #include "lineread.h"
> > > > > >    #include "flow_table.h"
> > > > > > +#include "arp.h"
> > > > > > +#include "ndp.h"
> > > > > >    /* Empheral port range: values from RFC 6335 */
> > > > > >    static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> > > > > > @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
> > > > > >        memcpy(&e->addr, addr, sizeof(*addr));
> > > > > >        memcpy(e->mac, mac, ETH_ALEN);
> > > > > > +
> > > > > > +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
> > > > > 
> > > > > AFAICT this doesn't actually implement what the commit message
> > > > > describes - it seems to always send an ARP/NA when the neighbour table
> > > > > is updated.
> > > 
> > > No. Check the code again.
> > 
> > Still not seeing it, I'm going to need a more specific pointer to what
> > I've missed.
> > 
> Look in fwd_neigh_mac_cache_alloc().
> If the entry already exists, the function updates the mac address just to be
> on the safe side, but then returns without sending any ARP/NA.

Uh... no, no it doesn't.  I applied all the patches on a branch and
here's what I see in fwd_neigh_mac_cache_alloc():

	memcpy(&e->addr, addr, sizeof(*addr));
	memcpy(e->mac, mac, ETH_ALEN);

	/* Send gratuitous ARP / unsolicited NA for the new mapping */
	if (inany_v4(addr)) {
		struct in_addr ip4 = *inany_v4(addr);

		arp_send_gratuitous(c, ip4, e->mac);
	} else {
		ndp_send_unsolicited_na(c, &addr->a6);
	}

If it updates, it ARPs.  Do you have something in your tree that
didn't make it into the published patches?

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-25 13:14         ` Jon Maloy
@ 2025-09-26  0:55           ` David Gibson
  2025-09-26 23:05             ` Jon Maloy
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-26  0:55 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 2712 bytes --]

On Thu, Sep 25, 2025 at 09:14:42AM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-25 02:36, David Gibson wrote:
> > On Wed, Sep 24, 2025 at 06:18:52PM -0400, Jon Maloy wrote:
> > > 
> [...]
> > > 
> > > I experimented a bit with this. My test program is a simple UDP
> > > client-server pair, exchanging first 3 UDP messages client->server, followed
> > > by
> > > 3 messages server->client.
> > 
> > With the client on the guest, and server outside?  How is the outside
> > machine arranged - is it a physically separate host? A bridged VM or
> > container on the same host?  Something else?
> 
> It is a physically separate host.
> 
> > 
> > > First, I changed the main() loop a bit, so that netlink events are
> > > handled before all other events, if any. (Basically, I added
> > > an extra loop before the main loop, only handling netlink events, before
> > > moving on to the main loop (where netlink events had been excluded.)
> > > This should secure absolute priority of netlink events before any other
> > > events. As you will see below, this made no difference to the scenarios
> > > I describe.
> > 
> > Drat.
> > > 1: When starting the container, I notice that there is no subscription
> > >     event in PASTA, even though I can see the entry for the remote host
> > >     is present in the host's ARP table. There is never any event coming
> > >     up even if I wait for 10+ minutes.
> > 
> > Huh.... do we need to do something to ensure we get events for
> > existing entries in the host ARP table, not just ones that are added
> > or updated after we're running?
> 
> It doesn't seem to be possible,

Can we do an RTM_GETNEIGH, with no address specified?  It's something
like that we do to get all our links and addresses in other places.

> but even if it were it wouldn't help us much
> if the entry isn't here, which is also a problematic case. See below.
> 
> > 
> > > 2: The first UDP is attempted sent from the guest. An ARP request is
> > >     sent to PASTA, and responded to with the 9a:9a: address.
> > 
> > Maybe we still need to explicitly ask for an ARP resolution when the
> > guest ARPs.
> 
> I think so. If we limit this to ARP and NDP, this should be unproblematic.

I just realised this is harder than I thought, though.  At least if we
want to get the right answer for the first guest ARP.  It's not just a
netlink request, we'd need to wait for the host to ARP, which means
timeouts, and state we need to track, and ...

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-26  0:47             ` David Gibson
@ 2025-09-26 22:59               ` Jon Maloy
  2025-09-29  4:03                 ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-26 22:59 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-25 20:47, David Gibson wrote:
> On Thu, Sep 25, 2025 at 08:48:51AM -0400, Jon Maloy wrote:
>>
>>
>> On 2025-09-25 02:38, David Gibson wrote:
>>> On Wed, Sep 24, 2025 at 07:32:43PM -0400, Jon Maloy wrote:
>>>>
>>>>
>>>> On 2025-09-24 18:18, Jon Maloy wrote:
>>>>>
>>>>>
>>>>> On 2025-09-23 23:22, David Gibson wrote:
>>>>>> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
>>>>
>>>> [...]
>>>>
>>>>>>> --- a/fwd.c
>>>>>>> +++ b/fwd.c
>>>>>>> @@ -26,6 +26,8 @@
>>>>>>>     #include "passt.h"
>>>>>>>     #include "lineread.h"
>>>>>>>     #include "flow_table.h"
>>>>>>> +#include "arp.h"
>>>>>>> +#include "ndp.h"
>>>>>>>     /* Empheral port range: values from RFC 6335 */
>>>>>>>     static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>>>>>>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>>>>>>         memcpy(&e->addr, addr, sizeof(*addr));
>>>>>>>         memcpy(e->mac, mac, ETH_ALEN);
>>>>>>> +
>>>>>>> +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
>>>>>>
>>>>>> AFAICT this doesn't actually implement what the commit message
>>>>>> describes - it seems to always send an ARP/NA when the neighbour table
>>>>>> is updated.
>>>>
>>>> No. Check the code again.
>>>
>>> Still not seeing it, I'm going to need a more specific pointer to what
>>> I've missed.
>>>
>> Look in fwd_neigh_mac_cache_alloc().
>> If the entry already exists, the function updates the mac address just to be
>> on the safe side, but then returns without sending any ARP/NA.
> 
> Uh... no, no it doesn't.  I applied all the patches on a branch and
> here's what I see in fwd_neigh_mac_cache_alloc():
> 
> 	memcpy(&e->addr, addr, sizeof(*addr));
> 	memcpy(e->mac, mac, ETH_ALEN);
> 
> 	/* Send gratuitous ARP / unsolicited NA for the new mapping */
> 	if (inany_v4(addr)) {
> 		struct in_addr ip4 = *inany_v4(addr);
> 
> 		arp_send_gratuitous(c, ip4, e->mac);
> 	} else {
> 		ndp_send_unsolicited_na(c, &addr->a6);
> 	}
> 
> If it updates, it ARPs.  Do you have something in your tree that
> didn't make it into the published patches?
> 

This was introduced already in patch #1, not #9.
The final function looks like this:

void fwd_neigh_mac_cache_alloc(const struct ctx *c,
                                const union inany_addr *addr,
                                uint8_t *mac)

{
         struct mac_cache_table *t = &mac_cache;
         struct mac_cache_entry *e;
         ssize_t idx;

         /* MAC address might change sometimes */
         e = fwd_mac_cache_find(c, addr);
         if (e) {
                 memcpy(e->mac, mac, ETH_ALEN);
====>           return;
         }

         e = t->free;
         if (!e)
                 return;
         t->free = e->next;

         idx = fwd_mac_cache_slot_idx(c, addr);
         e->next = t->slots[idx];
         t->slots[idx] = e;

         memcpy(&e->addr, addr, sizeof(*addr));
         memcpy(e->mac, mac, ETH_ALEN);

         /* Send gratuitous ARP / unsolicited NA for the new mapping */
         if (inany_v4(addr)) {
                 struct in_addr ip4 = *inany_v4(addr);

                 arp_send_gratuitous(c, ip4, e->mac);
         } else {
                 ndp_send_unsolicited_na(c, &addr->a6);
         }
}


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-26  0:55           ` David Gibson
@ 2025-09-26 23:05             ` Jon Maloy
  2025-09-29  4:04               ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-26 23:05 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-25 20:55, David Gibson wrote:
> On Thu, Sep 25, 2025 at 09:14:42AM -0400, Jon Maloy wrote:
>>
>>
>> On 2025-09-25 02:36, David Gibson wrote:
>>> On Wed, Sep 24, 2025 at 06:18:52PM -0400, Jon Maloy wrote:
>>>>
>> [...]
>>>>
>>>> I experimented a bit with this. My test program is a simple UDP
>>>> client-server pair, exchanging first 3 UDP messages client->server, followed
>>>> by
>>>> 3 messages server->client.
>>>
>>> With the client on the guest, and server outside?  How is the outside
>>> machine arranged - is it a physically separate host? A bridged VM or
>>> container on the same host?  Something else?
>>
>> It is a physically separate host.
>>
>>>
>>>> First, I changed the main() loop a bit, so that netlink events are
>>>> handled before all other events, if any. (Basically, I added
>>>> an extra loop before the main loop, only handling netlink events, before
>>>> moving on to the main loop (where netlink events had been excluded.)
>>>> This should secure absolute priority of netlink events before any other
>>>> events. As you will see below, this made no difference to the scenarios
>>>> I describe.
>>>
>>> Drat.
>>>> 1: When starting the container, I notice that there is no subscription
>>>>      event in PASTA, even though I can see the entry for the remote host
>>>>      is present in the host's ARP table. There is never any event coming
>>>>      up even if I wait for 10+ minutes.
>>>
>>> Huh.... do we need to do something to ensure we get events for
>>> existing entries in the host ARP table, not just ones that are added
>>> or updated after we're running?
>>
>> It doesn't seem to be possible,

It actually *is* possible, and I just implemented it. It doesn't solve 
all problems, but makes a huge difference. I will add it in v10.

> 
> Can we do an RTM_GETNEIGH, with no address specified?  It's something
> like that we do to get all our links and addresses in other places.
> 
>> but even if it were it wouldn't help us much
>> if the entry isn't here, which is also a problematic case. See below.
>>
>>>
>>>> 2: The first UDP is attempted sent from the guest. An ARP request is
>>>>      sent to PASTA, and responded to with the 9a:9a: address.
>>>
>>> Maybe we still need to explicitly ask for an ARP resolution when the
>>> guest ARPs.
>>
>> I think so. If we limit this to ARP and NDP, this should be unproblematic.
> 
> I just realised this is harder than I thought, though.  At least if we
> want to get the right answer for the first guest ARP.  It's not just a
> netlink request, we'd need to wait for the host to ARP, which means
> timeouts, and state we need to track, and ...

Yes. I think with the above it is as good as it gets, and it isn't bad.

///jon

> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-24  3:22   ` David Gibson
  2025-09-24 22:18     ` Jon Maloy
@ 2025-09-26 23:25     ` Jon Maloy
  2025-09-27 19:32       ` Jon Maloy
  1 sibling, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-26 23:25 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev


On 2025-09-23 23:22, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
>> Gratuitious ARP and unsolicitated NA should be handled with caution
>> because of the risk of malignant users emitting them to disturb
>> network communication.
>>
[...]
>> +	req.ah.ar_op = htons(ARPOP_REPLY);
>> +	req.ah.ar_hrd = htons(ARPHRD_ETHER);
>> +	req.ah.ar_pro = htons(ETH_P_IP);
>> +	req.ah.ar_hln = ETH_ALEN;
>> +	req.ah.ar_pln = 4;
>> +
>> +	/* ARP message */
>> +	memcpy(req.am.sha, mac, sizeof(req.am.sha));
>> +	memcpy(req.am.sip, &ip, sizeof(req.am.sip));
>> +	memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
>> +	memcpy(req.am.tip, &ip, sizeof(req.am.tip));
> 
> So, I was trying to check if it made sense to use the same IP for both
> source and target here, and came across
>      https://www.rfc-editor.org/rfc/rfc5227#section-3
> 
> Which suggests we should (counter intuitively) be using ARP requests,
> not ARP replies for announcements.

I have now read through it, and it seems to come to the conclusion
that this is not advisable. In principle it should work, if all 
implementations stick to standard, but there might be stacks which are 
not stateless in this regard, i.e., they only accepts ARP replies as a 
response to a sent request.
In short, I think I will stick to my current approach, since it is 
evidently harmless and is proven to work.

///jon


> 
>> +	inet_ntop(AF_INET, &ip, ip_str, sizeof(ip_str));
>> +	debug("Sending gratuitous ARP for %s", ip_str);
>> +	tap_send_single(c, &req, sizeof(req));
>> +}
>> diff --git a/arp.h b/arp.h
>> index d5ad0e1..b0dbb56 100644
>> --- a/arp.h
>> +++ b/arp.h
>> @@ -22,5 +22,7 @@ struct arpmsg {
>>   
>>   int arp(const struct ctx *c, struct iov_tail *data);
>>   void arp_send_init_req(const struct ctx *c);
>> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
>> +			 const unsigned char *mac);
>>   
>>   #endif /* ARP_H */
>> diff --git a/fwd.c b/fwd.c
>> index c6348ab..879a351 100644
>> --- a/fwd.c
>> +++ b/fwd.c
>> @@ -26,6 +26,8 @@
>>   #include "passt.h"
>>   #include "lineread.h"
>>   #include "flow_table.h"
>> +#include "arp.h"
>> +#include "ndp.h"
>>   
>>   /* Empheral port range: values from RFC 6335 */
>>   static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>   
>>   	memcpy(&e->addr, addr, sizeof(*addr));
>>   	memcpy(e->mac, mac, ETH_ALEN);
>> +
>> +	/* Send gratuitous ARP / unsolicited NA for the new mapping */
> 
> AFAICT this doesn't actually implement what the commit message
> describes - it seems to always send an ARP/NA when the neighbour table
> is updated.
> 
>> +	if (inany_v4(addr)) {
>> +		struct in_addr ip4 = *inany_v4(addr);
>> +
>> +		arp_send_gratuitous(c, ip4, e->mac);
>> +	} else {
>> +		ndp_send_unsolicited_na(c, &addr->a6);
>> +	}
>>   }
>>   
>>   /**
>> diff --git a/ndp.c b/ndp.c
>> index 70b68aa..8914f31 100644
>> --- a/ndp.c
>> +++ b/ndp.c
>> @@ -226,6 +226,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
>>   	ndp_send(c, dst, &na, sizeof(na));
>>   }
>>   
>> +/**
>> + * ndp_send_unsolicited_na() - Send unsolicited NA
>> + * @c:		Execution context
>> + * @addr:	IPv6 address to advertise
>> + */
>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
>> +{
>> +	ndp_na(c, &in6addr_ll_all_nodes, addr);
>> +}
>> +
>>   /**
>>    * ndp_ra() - Send an NDP Router Advertisement (RA) message
>>    * @c:		Execution context
>> diff --git a/ndp.h b/ndp.h
>> index 781ea86..320009c 100644
>> --- a/ndp.h
>> +++ b/ndp.h
>> @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
>>   	struct iov_tail *data);
>>   void ndp_timer(const struct ctx *c, const struct timespec *now);
>>   void ndp_send_init_req(const struct ctx *c);
>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
>>   
>>   #endif /* NDP_H */
>> -- 
>> 2.50.1
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-26 23:25     ` Jon Maloy
@ 2025-09-27 19:32       ` Jon Maloy
  2025-09-29  4:08         ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Jon Maloy @ 2025-09-27 19:32 UTC (permalink / raw)
  To: David Gibson; +Cc: sbrivio, dgibson, passt-dev



On 2025-09-26 19:25, Jon Maloy wrote:
> 
> On 2025-09-23 23:22, David Gibson wrote:
>> On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
>>> Gratuitious ARP and unsolicitated NA should be handled with caution
>>> because of the risk of malignant users emitting them to disturb
>>> network communication.
>>>
> [...]
>>> +    req.ah.ar_op = htons(ARPOP_REPLY);
>>> +    req.ah.ar_hrd = htons(ARPHRD_ETHER);
>>> +    req.ah.ar_pro = htons(ETH_P_IP);
>>> +    req.ah.ar_hln = ETH_ALEN;
>>> +    req.ah.ar_pln = 4;
>>> +
>>> +    /* ARP message */
>>> +    memcpy(req.am.sha, mac, sizeof(req.am.sha));
>>> +    memcpy(req.am.sip, &ip, sizeof(req.am.sip));
>>> +    memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
>>> +    memcpy(req.am.tip, &ip, sizeof(req.am.tip));
>>
>> So, I was trying to check if it made sense to use the same IP for both
>> source and target here, and came across
>>      https://www.rfc-editor.org/rfc/rfc5227#section-3
>>
>> Which suggests we should (counter intuitively) be using ARP requests,
>> not ARP replies for announcements.
> 
> I have now read through it, and it seems to come to the conclusion
> that this is not advisable. In principle it should work, if all 
> implementations stick to standard, but there might be stacks which are 
> not stateless in this regard, i.e., they only accepts ARP replies as a 
> response to a sent request.
> In short, I think I will stick to my current approach, since it is 
> evidently harmless and is proven to work.
> 
> ///jon

My response above may look confusing. I had actually experimented with 
both methods, and had in my mind that it was the "ARP Announcement" 
implementation I had posted.
It is now fixed.

That said, further investigation indicates that the other method is 
fully legit, and actually widely used (Windows, Cisco), although not
by Linux.

///jon

> 
> 
>>
>>> +    inet_ntop(AF_INET, &ip, ip_str, sizeof(ip_str));
>>> +    debug("Sending gratuitous ARP for %s", ip_str);
>>> +    tap_send_single(c, &req, sizeof(req));
>>> +}
>>> diff --git a/arp.h b/arp.h
>>> index d5ad0e1..b0dbb56 100644
>>> --- a/arp.h
>>> +++ b/arp.h
>>> @@ -22,5 +22,7 @@ struct arpmsg {
>>>   int arp(const struct ctx *c, struct iov_tail *data);
>>>   void arp_send_init_req(const struct ctx *c);
>>> +void arp_send_gratuitous(const struct ctx *c, struct in_addr ip,
>>> +             const unsigned char *mac);
>>>   #endif /* ARP_H */
>>> diff --git a/fwd.c b/fwd.c
>>> index c6348ab..879a351 100644
>>> --- a/fwd.c
>>> +++ b/fwd.c
>>> @@ -26,6 +26,8 @@
>>>   #include "passt.h"
>>>   #include "lineread.h"
>>>   #include "flow_table.h"
>>> +#include "arp.h"
>>> +#include "ndp.h"
>>>   /* Empheral port range: values from RFC 6335 */
>>>   static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
>>> @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>>>       memcpy(&e->addr, addr, sizeof(*addr));
>>>       memcpy(e->mac, mac, ETH_ALEN);
>>> +
>>> +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
>>
>> AFAICT this doesn't actually implement what the commit message
>> describes - it seems to always send an ARP/NA when the neighbour table
>> is updated.
>>
>>> +    if (inany_v4(addr)) {
>>> +        struct in_addr ip4 = *inany_v4(addr);
>>> +
>>> +        arp_send_gratuitous(c, ip4, e->mac);
>>> +    } else {
>>> +        ndp_send_unsolicited_na(c, &addr->a6);
>>> +    }
>>>   }
>>>   /**
>>> diff --git a/ndp.c b/ndp.c
>>> index 70b68aa..8914f31 100644
>>> --- a/ndp.c
>>> +++ b/ndp.c
>>> @@ -226,6 +226,16 @@ static void ndp_na(const struct ctx *c, const 
>>> struct in6_addr *dst,
>>>       ndp_send(c, dst, &na, sizeof(na));
>>>   }
>>> +/**
>>> + * ndp_send_unsolicited_na() - Send unsolicited NA
>>> + * @c:        Execution context
>>> + * @addr:    IPv6 address to advertise
>>> + */
>>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct 
>>> in6_addr *addr)
>>> +{
>>> +    ndp_na(c, &in6addr_ll_all_nodes, addr);
>>> +}
>>> +
>>>   /**
>>>    * ndp_ra() - Send an NDP Router Advertisement (RA) message
>>>    * @c:        Execution context
>>> diff --git a/ndp.h b/ndp.h
>>> index 781ea86..320009c 100644
>>> --- a/ndp.h
>>> +++ b/ndp.h
>>> @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr 
>>> *saddr,
>>>       struct iov_tail *data);
>>>   void ndp_timer(const struct ctx *c, const struct timespec *now);
>>>   void ndp_send_init_req(const struct ctx *c);
>>> +void ndp_send_unsolicited_na(const struct ctx *c, const struct 
>>> in6_addr *addr);
>>>   #endif /* NDP_H */
>>> -- 
>>> 2.50.1
>>>
>>
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-26 22:59               ` Jon Maloy
@ 2025-09-29  4:03                 ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-09-29  4:03 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]

On Fri, Sep 26, 2025 at 06:59:40PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-25 20:47, David Gibson wrote:
> > On Thu, Sep 25, 2025 at 08:48:51AM -0400, Jon Maloy wrote:
> > > 
> > > 
> > > On 2025-09-25 02:38, David Gibson wrote:
> > > > On Wed, Sep 24, 2025 at 07:32:43PM -0400, Jon Maloy wrote:
> > > > > 
> > > > > 
> > > > > On 2025-09-24 18:18, Jon Maloy wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2025-09-23 23:22, David Gibson wrote:
> > > > > > > On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > --- a/fwd.c
> > > > > > > > +++ b/fwd.c
> > > > > > > > @@ -26,6 +26,8 @@
> > > > > > > >     #include "passt.h"
> > > > > > > >     #include "lineread.h"
> > > > > > > >     #include "flow_table.h"
> > > > > > > > +#include "arp.h"
> > > > > > > > +#include "ndp.h"
> > > > > > > >     /* Empheral port range: values from RFC 6335 */
> > > > > > > >     static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> > > > > > > > @@ -129,6 +131,15 @@ void fwd_neigh_mac_cache_alloc(const struct ctx *c,
> > > > > > > >         memcpy(&e->addr, addr, sizeof(*addr));
> > > > > > > >         memcpy(e->mac, mac, ETH_ALEN);
> > > > > > > > +
> > > > > > > > +    /* Send gratuitous ARP / unsolicited NA for the new mapping */
> > > > > > > 
> > > > > > > AFAICT this doesn't actually implement what the commit message
> > > > > > > describes - it seems to always send an ARP/NA when the neighbour table
> > > > > > > is updated.
> > > > > 
> > > > > No. Check the code again.
> > > > 
> > > > Still not seeing it, I'm going to need a more specific pointer to what
> > > > I've missed.
> > > > 
> > > Look in fwd_neigh_mac_cache_alloc().
> > > If the entry already exists, the function updates the mac address just to be
> > > on the safe side, but then returns without sending any ARP/NA.
> > 
> > Uh... no, no it doesn't.  I applied all the patches on a branch and
> > here's what I see in fwd_neigh_mac_cache_alloc():
> > 
> > 	memcpy(&e->addr, addr, sizeof(*addr));
> > 	memcpy(e->mac, mac, ETH_ALEN);
> > 
> > 	/* Send gratuitous ARP / unsolicited NA for the new mapping */
> > 	if (inany_v4(addr)) {
> > 		struct in_addr ip4 = *inany_v4(addr);
> > 
> > 		arp_send_gratuitous(c, ip4, e->mac);
> > 	} else {
> > 		ndp_send_unsolicited_na(c, &addr->a6);
> > 	}
> > 
> > If it updates, it ARPs.  Do you have something in your tree that
> > didn't make it into the published patches?
> > 
> 
> This was introduced already in patch #1, not #9.
> The final function looks like this:
> 
> void fwd_neigh_mac_cache_alloc(const struct ctx *c,
>                                const union inany_addr *addr,
>                                uint8_t *mac)
> 
> {
>         struct mac_cache_table *t = &mac_cache;
>         struct mac_cache_entry *e;
>         ssize_t idx;
> 
>         /* MAC address might change sometimes */
>         e = fwd_mac_cache_find(c, addr);
>         if (e) {
>                 memcpy(e->mac, mac, ETH_ALEN);
> ====>           return;

Ah, yes, sorry.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-26 23:05             ` Jon Maloy
@ 2025-09-29  4:04               ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-09-29  4:04 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 2316 bytes --]

On Fri, Sep 26, 2025 at 07:05:56PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-25 20:55, David Gibson wrote:
> > On Thu, Sep 25, 2025 at 09:14:42AM -0400, Jon Maloy wrote:
> > > 
> > > 
> > > On 2025-09-25 02:36, David Gibson wrote:
> > > > On Wed, Sep 24, 2025 at 06:18:52PM -0400, Jon Maloy wrote:
> > > > > 
> > > [...]
> > > > > 
> > > > > I experimented a bit with this. My test program is a simple UDP
> > > > > client-server pair, exchanging first 3 UDP messages client->server, followed
> > > > > by
> > > > > 3 messages server->client.
> > > > 
> > > > With the client on the guest, and server outside?  How is the outside
> > > > machine arranged - is it a physically separate host? A bridged VM or
> > > > container on the same host?  Something else?
> > > 
> > > It is a physically separate host.
> > > 
> > > > 
> > > > > First, I changed the main() loop a bit, so that netlink events are
> > > > > handled before all other events, if any. (Basically, I added
> > > > > an extra loop before the main loop, only handling netlink events, before
> > > > > moving on to the main loop (where netlink events had been excluded.)
> > > > > This should secure absolute priority of netlink events before any other
> > > > > events. As you will see below, this made no difference to the scenarios
> > > > > I describe.
> > > > 
> > > > Drat.
> > > > > 1: When starting the container, I notice that there is no subscription
> > > > >      event in PASTA, even though I can see the entry for the remote host
> > > > >      is present in the host's ARP table. There is never any event coming
> > > > >      up even if I wait for 10+ minutes.
> > > > 
> > > > Huh.... do we need to do something to ensure we get events for
> > > > existing entries in the host ARP table, not just ones that are added
> > > > or updated after we're running?
> > > 
> > > It doesn't seem to be possible,
> 
> It actually *is* possible, and I just implemented it. It doesn't solve all
> problems, but makes a huge difference. I will add it in v10.

Hooray!  I thought there ought to be a way.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-27 19:32       ` Jon Maloy
@ 2025-09-29  4:08         ` David Gibson
  2025-09-29 22:23           ` Stefano Brivio
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-09-29  4:08 UTC (permalink / raw)
  To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

On Sat, Sep 27, 2025 at 03:32:38PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-26 19:25, Jon Maloy wrote:
> > 
> > On 2025-09-23 23:22, David Gibson wrote:
> > > On Tue, Sep 23, 2025 at 09:13:30PM -0400, Jon Maloy wrote:
> > > > Gratuitious ARP and unsolicitated NA should be handled with caution
> > > > because of the risk of malignant users emitting them to disturb
> > > > network communication.
> > > > 
> > [...]
> > > > +    req.ah.ar_op = htons(ARPOP_REPLY);
> > > > +    req.ah.ar_hrd = htons(ARPHRD_ETHER);
> > > > +    req.ah.ar_pro = htons(ETH_P_IP);
> > > > +    req.ah.ar_hln = ETH_ALEN;
> > > > +    req.ah.ar_pln = 4;
> > > > +
> > > > +    /* ARP message */
> > > > +    memcpy(req.am.sha, mac, sizeof(req.am.sha));
> > > > +    memcpy(req.am.sip, &ip, sizeof(req.am.sip));
> > > > +    memcpy(req.am.tha, MAC_BROADCAST, sizeof(req.am.tha));
> > > > +    memcpy(req.am.tip, &ip, sizeof(req.am.tip));
> > > 
> > > So, I was trying to check if it made sense to use the same IP for both
> > > source and target here, and came across
> > >      https://www.rfc-editor.org/rfc/rfc5227#section-3
> > > 
> > > Which suggests we should (counter intuitively) be using ARP requests,
> > > not ARP replies for announcements.
> > 
> > I have now read through it, and it seems to come to the conclusion
> > that this is not advisable. In principle it should work, if all

What "this" refers to here is not clear to me.

> > implementations stick to standard, but there might be stacks which are
> > not stateless in this regard, i.e., they only accepts ARP replies as a
> > response to a sent request.
> > In short, I think I will stick to my current approach, since it is
> > evidently harmless and is proven to work.
> > 
> > ///jon
> 
> My response above may look confusing.

Yes.. and I'm still confused.  Without knowing what "this" is above,
I'm not clear what "it" or "the other" are below either.

> I had actually experimented with both
> methods, and had in my mind that it was the "ARP Announcement"
> implementation I had posted.
> It is now fixed.
> 
> That said, further investigation indicates that the other method is fully
> legit, and actually widely used (Windows, Cisco), although not
> by Linux.

My understanding of that RFC is that it is advising _against_ sending
unsolicited ARP replies (as your earlier posted versions did).
Instead, it advises sending ARP requests in order to announce a MAC to
the networm.  The history is confusing because "ARP announcements" and
"gratuitous ARP" can and have been used to refer to both variants.

Does that match your current understanding?

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-29  4:08         ` David Gibson
@ 2025-09-29 22:23           ` Stefano Brivio
  2025-09-30  0:15             ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-09-29 22:23 UTC (permalink / raw)
  To: David Gibson; +Cc: Jon Maloy, dgibson, passt-dev

On Mon, 29 Sep 2025 14:08:27 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Instead, it advises sending ARP requests in order to announce a MAC to
> the networm.

Network, you meant network! The other plan is not public yet. :)

-- 
Stefano


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added
  2025-09-29 22:23           ` Stefano Brivio
@ 2025-09-30  0:15             ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-09-30  0:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Jon Maloy, dgibson, passt-dev

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]

On Tue, Sep 30, 2025 at 12:23:44AM +0200, Stefano Brivio wrote:
> On Mon, 29 Sep 2025 14:08:27 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Instead, it advises sending ARP requests in order to announce a MAC to
> > the networm.
> 
> Network, you meant network! The other plan is not public yet. :)

Turns out the only portable way to do ARP announcements is to infect
the other hosts and fix the bugs in their ARP handling. :)

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-09-30  0:23 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-24  1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
2025-09-24  1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
2025-09-24  2:47   ` David Gibson
2025-09-24  3:34     ` David Gibson
2025-09-24 18:40     ` Jon Maloy
2025-09-25  6:42       ` David Gibson
2025-09-24  1:13 ` [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-09-24  3:03   ` David Gibson
2025-09-24 18:54     ` Jon Maloy
2025-09-24  1:13 ` [PATCH v9 3/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
2025-09-24  1:13 ` [PATCH v9 4/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
2025-09-24  1:13 ` [PATCH v9 5/9] udp: forward external source MAC address through tap interface Jon Maloy
2025-09-24  1:13 ` [PATCH v9 6/9] tcp: " Jon Maloy
2025-09-24  1:13 ` [PATCH v9 7/9] tap: change signature of function tap_push_l2h() Jon Maloy
2025-09-24  1:13 ` [PATCH v9 8/9] icmp: let icmp use mac address from flowside structure Jon Maloy
2025-09-24  1:13 ` [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added Jon Maloy
2025-09-24  3:22   ` David Gibson
2025-09-24 22:18     ` Jon Maloy
2025-09-24 23:32       ` Jon Maloy
2025-09-25  6:38         ` David Gibson
2025-09-25 12:48           ` Jon Maloy
2025-09-26  0:47             ` David Gibson
2025-09-26 22:59               ` Jon Maloy
2025-09-29  4:03                 ` David Gibson
2025-09-25  6:36       ` David Gibson
2025-09-25 13:14         ` Jon Maloy
2025-09-26  0:55           ` David Gibson
2025-09-26 23:05             ` Jon Maloy
2025-09-29  4:04               ` David Gibson
2025-09-26 23:25     ` Jon Maloy
2025-09-27 19:32       ` Jon Maloy
2025-09-29  4:08         ` David Gibson
2025-09-29 22:23           ` Stefano Brivio
2025-09-30  0:15             ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).