* [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:01 ` David Gibson
2025-10-06 22:33 ` Stefano Brivio
2025-10-03 0:34 ` [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
` (8 subsequent siblings)
9 siblings, 2 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
requires the ability to translate from an IP address to its
corresponding MAC address in cases where those are present in
the ARP or NDP tables.
To keep track of the contents of these tables we add a netlink
based neighbour subscription feature.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Added an attribute contianing NDA_DST to sent message, so
that we let the kernel do the filtering of the IP address
and return only one entry.
- Added interface index to the call signature. Since the only
interface we know is the template interface, this limits
the number of hosts that will be seen as 'network segment
local' from a PASST viewpoint.
v4: - Made loop independent of attribute order.
- Ignoring L2 addresses which are not of size ETH_ALEN.
v5: - Changed return value of new function, so caller can know if
a MAC address really was found.
v6: - Removed warning printout which had ended up in the wrong
commit.
v8: - Changed to neighbour event subscription model
- netlink: arp/ndp table subscription
v10:- Updated according to David's latest comments on v8
- Added functionaly where we initially read current
state of ARP/NDP tables
v12:- Updates based on feedback from David and Stefano
---
epoll_type.h | 2 +
netlink.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
netlink.h | 4 +
passt.c | 7 ++
4 files changed, 214 insertions(+), 3 deletions(-)
diff --git a/epoll_type.h b/epoll_type.h
index 12ac59b..a90ffb6 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -44,6 +44,8 @@ enum epoll_type {
EPOLL_TYPE_REPAIR_LISTEN,
/* TCP_REPAIR helper socket */
EPOLL_TYPE_REPAIR,
+ /* Netlink neighbour subscription socket */
+ EPOLL_TYPE_NL_NEIGH,
EPOLL_NUM_TYPES,
};
diff --git a/netlink.c b/netlink.c
index c436780..3fe2fdd 100644
--- a/netlink.c
+++ b/netlink.c
@@ -26,6 +26,7 @@
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/if_ether.h>
+#include <net/if_arp.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
@@ -40,6 +41,10 @@
#define RTNH_NEXT_AND_DEC(rtnh, attrlen) \
((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
+/* Convenience macro borrowed from kernel */
+#define NUD_VALID \
+ (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE)
+
/* Netlink expects a buffer of at least 8kiB or the system page size,
* whichever is larger. 32kiB is recommended for more efficient.
* Since the largest page size on any remotely common Linux setup is
@@ -50,9 +55,10 @@
#define NLBUFSIZ 65536
/* Socket in init, in target namespace, sequence (just needs to be monotonic) */
-int nl_sock = -1;
-int nl_sock_ns = -1;
-static int nl_seq = 1;
+int nl_sock = -1;
+int nl_sock_ns = -1;
+static int nl_sock_neigh = -1;
+static int nl_seq = 1;
/**
* nl_sock_init_do() - Set up netlink sockets in init or target namespace
@@ -1103,3 +1109,195 @@ int nl_link_set_flags(int s, unsigned int ifi,
return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
}
+
+/**
+ * nl_neigh_msg_read() - Interpret a neigbour state message from netlink
+ * @c: Execution context
+ * @nh: Message to be read
+ *
+ */
+static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
+{
+ struct ndmsg *ndm = NLMSG_DATA(nh);
+ struct rtattr *rta = (struct rtattr *)(ndm + 1);
+ size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
+ char ip_str[INET6_ADDRSTRLEN];
+ char mac_str[ETH_ADDRSTRLEN];
+ const uint8_t *lladdr = NULL;
+ const void *dst = NULL;
+ size_t lladdr_len = 0;
+ uint8_t mac[ETH_ALEN];
+ union inany_addr addr;
+ size_t dstlen = 0;
+
+ if (nh->nlmsg_type == NLMSG_DONE)
+ return;
+
+ if (nh->nlmsg_type == NLMSG_ERROR) {
+ warn_perror("nlmsg_type error at msg read");
+ return;
+ }
+
+ if (nh->nlmsg_type != RTM_NEWNEIGH &&
+ nh->nlmsg_type != RTM_DELNEIGH)
+ return;
+
+ for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
+ switch (rta->rta_type) {
+ case NDA_DST:
+ dst = RTA_DATA(rta);
+ dstlen = RTA_PAYLOAD(rta);
+ break;
+ case NDA_LLADDR:
+ lladdr = RTA_DATA(rta);
+ lladdr_len = RTA_PAYLOAD(rta);
+ break;
+ default:
+ break;
+ }
+ }
+
+ if (!dst)
+ return;
+
+ if (!lladdr || lladdr_len != ETH_ALEN)
+ return;
+
+ if (ndm->ndm_type != ARPHRD_ETHER)
+ return;
+
+ memcpy(mac, lladdr, ETH_ALEN);
+ eth_ntop(mac, mac_str, sizeof(mac_str));
+
+ if (ndm->ndm_family == AF_INET &&
+ ndm->ndm_ifindex != c->ifi4)
+ return;
+
+ if (ndm->ndm_family == AF_INET6 &&
+ ndm->ndm_ifindex != c->ifi6)
+ return;
+
+ if (ndm->ndm_family != AF_INET &&
+ ndm->ndm_family != AF_INET6)
+ return;
+
+ if (ndm->ndm_family == AF_INET &&
+ dstlen != sizeof(struct in_addr))
+ return;
+
+ if (ndm->ndm_family == AF_INET6 &&
+ dstlen != sizeof(struct in6_addr))
+ return;
+
+ inany_from_af(&addr, ndm->ndm_family, dst);
+ inany_ntop(dst, ip_str, sizeof(ip_str));
+
+ if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
+ trace("neigh table update: %s / %s", ip_str, mac_str);
+ else
+ trace("neigh table delete: %s / %s", ip_str, mac_str);
+}
+
+/**
+ * nl_neigh_sync() - Read current contents ARP/NDP tables
+ * @c: Execution context
+ * @proto: Protocol, AF_INET or AF_INET6
+ * @ifi: Interface index
+ *
+ */
+static void nl_neigh_sync(const struct ctx *c, int proto, int ifi)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ndmsg ndm;
+ } req = {
+ .nlh = {0},
+ .ndm.ndm_family = proto,
+ .ndm.ndm_ifindex = ifi,
+ .ndm.ndm_state = 0,
+ .ndm.ndm_flags = 0,
+ .ndm.ndm_type = 0
+ };
+ struct nlmsghdr *nh;
+ char buf[NLBUFSIZ];
+ ssize_t status;
+ uint32_t seq;
+
+ seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, NLM_F_DUMP, sizeof(req));
+ nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH)
+ nl_neigh_msg_read(c, nh);
+ if (status < 0)
+ warn("Failed to read netlink message. status == %li", status);
+}
+
+/**
+ * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates
+ * @c: Execution context
+ */
+void nl_neigh_notify_handler(const struct ctx *c)
+{
+ char buf[NLBUFSIZ];
+
+ for (;;) {
+ ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
+ struct nlmsghdr *nh;
+
+ if (n < 0) {
+ if (errno != EAGAIN)
+ warn_perror("Failed to read from netlink sock");
+ return;
+ }
+ nh = (struct nlmsghdr *)buf;
+ for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n))
+ nl_neigh_msg_read(c, nh);
+ }
+}
+
+/**
+ * nl_neigh_notify_init() - Open netlink sock and subscribe to neighbour events
+ * @c: Execution context
+ *
+ * Return: 0 on success, -1 on failure
+ */
+int nl_neigh_notify_init(const struct ctx *c)
+{
+ union epoll_ref ref = {
+ .type = EPOLL_TYPE_NL_NEIGH
+ };
+ struct epoll_event ev = {
+ .events = EPOLLIN
+ };
+ struct sockaddr_nl addr = {
+ .nl_family = AF_NETLINK,
+ .nl_groups = RTMGRP_NEIGH,
+ };
+
+ if (nl_sock_neigh >= 0) {
+ warn("RTMGRP_NEIGH already exists");
+ return 0;
+ }
+ nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
+ if (nl_sock_neigh < 0) {
+ warn("Failed create RTMGRP_NEIGH socket");
+ return -1;
+ }
+ if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+ close(nl_sock_neigh);
+ nl_sock_neigh = -1;
+ warn("Failed bind RTMGRP_NEIGH socket");
+ return -1;
+ }
+
+ ev.data.u64 = ref.u64;
+ if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
+ close(nl_sock_neigh);
+ nl_sock_neigh = -1;
+ warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket");
+ return -1;
+ }
+
+ nl_neigh_sync(c, AF_INET, c->ifi4);
+ nl_neigh_sync(c, AF_INET6, c->ifi6);
+
+ return 0;
+}
diff --git a/netlink.h b/netlink.h
index b51e99c..8f1e9b9 100644
--- a/netlink.h
+++ b/netlink.h
@@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
int s_dst, unsigned int ifi_dst, sa_family_t af);
int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
void *addr, int *prefix_len, void *addr_l);
+bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
+ unsigned char *mac);
int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
const void *addr, int prefix_len);
int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
@@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
int nl_link_set_flags(int s, unsigned int ifi,
unsigned int set, unsigned int change);
+int nl_neigh_notify_init(const struct ctx *c);
+void nl_neigh_notify_handler(const struct ctx *c);
#endif /* NETLINK_H */
diff --git a/passt.c b/passt.c
index 31fbb75..e21d6ba 100644
--- a/passt.c
+++ b/passt.c
@@ -53,6 +53,7 @@
#include "vu_common.h"
#include "migrate.h"
#include "repair.h"
+#include "netlink.h"
#define EPOLL_EVENTS 8
@@ -79,6 +80,7 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
+ [EPOLL_TYPE_NL_NEIGH] = "netlink neighbour subscription socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@@ -322,6 +324,8 @@ int main(int argc, char **argv)
pcap_init(&c);
+ nl_neigh_notify_init(&c);
+
if (!c.foreground) {
if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
die_perror("Failed to open /dev/null");
@@ -414,6 +418,9 @@ loop:
case EPOLL_TYPE_REPAIR:
repair_handler(&c, eventmask);
break;
+ case EPOLL_TYPE_NL_NEIGH:
+ nl_neigh_notify_handler(&c);
+ break;
default:
/* Can't happen */
ASSERT(0);
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table
2025-10-03 0:34 ` [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
@ 2025-10-03 4:01 ` David Gibson
2025-10-06 22:33 ` Stefano Brivio
1 sibling, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:01 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 11484 bytes --]
On Thu, Oct 02, 2025 at 08:34:04PM -0400, Jon Maloy wrote:
> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> requires the ability to translate from an IP address to its
> corresponding MAC address in cases where those are present in
> the ARP or NDP tables.
>
> To keep track of the contents of these tables we add a netlink
> based neighbour subscription feature.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> ---
> v3: - Added an attribute contianing NDA_DST to sent message, so
> that we let the kernel do the filtering of the IP address
> and return only one entry.
> - Added interface index to the call signature. Since the only
> interface we know is the template interface, this limits
> the number of hosts that will be seen as 'network segment
> local' from a PASST viewpoint.
> v4: - Made loop independent of attribute order.
> - Ignoring L2 addresses which are not of size ETH_ALEN.
> v5: - Changed return value of new function, so caller can know if
> a MAC address really was found.
> v6: - Removed warning printout which had ended up in the wrong
> commit.
> v8: - Changed to neighbour event subscription model
> - netlink: arp/ndp table subscription
> v10:- Updated according to David's latest comments on v8
> - Added functionaly where we initially read current
> state of ARP/NDP tables
> v12:- Updates based on feedback from David and Stefano
> ---
> epoll_type.h | 2 +
> netlink.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> netlink.h | 4 +
> passt.c | 7 ++
> 4 files changed, 214 insertions(+), 3 deletions(-)
>
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..a90ffb6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
> EPOLL_TYPE_REPAIR_LISTEN,
> /* TCP_REPAIR helper socket */
> EPOLL_TYPE_REPAIR,
> + /* Netlink neighbour subscription socket */
> + EPOLL_TYPE_NL_NEIGH,
>
> EPOLL_NUM_TYPES,
> };
> diff --git a/netlink.c b/netlink.c
> index c436780..3fe2fdd 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -26,6 +26,7 @@
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <netinet/if_ether.h>
> +#include <net/if_arp.h>
>
> #include <linux/netlink.h>
> #include <linux/rtnetlink.h>
> @@ -40,6 +41,10 @@
> #define RTNH_NEXT_AND_DEC(rtnh, attrlen) \
> ((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
>
> +/* Convenience macro borrowed from kernel */
> +#define NUD_VALID \
> + (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE)
> +
> /* Netlink expects a buffer of at least 8kiB or the system page size,
> * whichever is larger. 32kiB is recommended for more efficient.
> * Since the largest page size on any remotely common Linux setup is
> @@ -50,9 +55,10 @@
> #define NLBUFSIZ 65536
>
> /* Socket in init, in target namespace, sequence (just needs to be monotonic) */
> -int nl_sock = -1;
> -int nl_sock_ns = -1;
> -static int nl_seq = 1;
> +int nl_sock = -1;
> +int nl_sock_ns = -1;
> +static int nl_sock_neigh = -1;
> +static int nl_seq = 1;
>
> /**
> * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> @@ -1103,3 +1109,195 @@ int nl_link_set_flags(int s, unsigned int ifi,
>
> return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
> }
> +
> +/**
> + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink
> + * @c: Execution context
> + * @nh: Message to be read
> + *
> + */
> +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> +{
> + struct ndmsg *ndm = NLMSG_DATA(nh);
> + struct rtattr *rta = (struct rtattr *)(ndm + 1);
> + size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> + char ip_str[INET6_ADDRSTRLEN];
> + char mac_str[ETH_ADDRSTRLEN];
> + const uint8_t *lladdr = NULL;
> + const void *dst = NULL;
> + size_t lladdr_len = 0;
> + uint8_t mac[ETH_ALEN];
> + union inany_addr addr;
> + size_t dstlen = 0;
> +
> + if (nh->nlmsg_type == NLMSG_DONE)
> + return;
> +
> + if (nh->nlmsg_type == NLMSG_ERROR) {
> + warn_perror("nlmsg_type error at msg read");
warn_perror() won't work here, because the error code isn't in global
errno. See nl_status() for how to extract the error code from the
message.
> + return;
> + }
> +
> + if (nh->nlmsg_type != RTM_NEWNEIGH &&
> + nh->nlmsg_type != RTM_DELNEIGH)
> + return;
> +
> + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> + switch (rta->rta_type) {
> + case NDA_DST:
> + dst = RTA_DATA(rta);
> + dstlen = RTA_PAYLOAD(rta);
> + break;
> + case NDA_LLADDR:
> + lladdr = RTA_DATA(rta);
> + lladdr_len = RTA_PAYLOAD(rta);
> + break;
> + default:
> + break;
> + }
> + }
> +
> + if (!dst)
> + return;
> +
> + if (!lladdr || lladdr_len != ETH_ALEN)
> + return;
Do we need lladdr for the DELNEIGH case?
> +
> + if (ndm->ndm_type != ARPHRD_ETHER)
> + return;
> +
> + memcpy(mac, lladdr, ETH_ALEN);
> + eth_ntop(mac, mac_str, sizeof(mac_str));
> +
> + if (ndm->ndm_family == AF_INET &&
> + ndm->ndm_ifindex != c->ifi4)
> + return;
> +
> + if (ndm->ndm_family == AF_INET6 &&
> + ndm->ndm_ifindex != c->ifi6)
> + return;
> +
> + if (ndm->ndm_family != AF_INET &&
> + ndm->ndm_family != AF_INET6)
> + return;
> +
> + if (ndm->ndm_family == AF_INET &&
> + dstlen != sizeof(struct in_addr))
> + return;
> +
> + if (ndm->ndm_family == AF_INET6 &&
> + dstlen != sizeof(struct in6_addr))
> + return;
I'd suggest a warn() for those last two cases, since it means the
kernel is doing something really weird.
> + inany_from_af(&addr, ndm->ndm_family, dst);
> + inany_ntop(dst, ip_str, sizeof(ip_str));
> +
> + if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
> + trace("neigh table update: %s / %s", ip_str, mac_str);
> + else
> + trace("neigh table delete: %s / %s", ip_str, mac_str);
This will give a misleading message if you have RTM_NEWNEIGH and
!NUD_VALID.
> +}
> +
> +/**
> + * nl_neigh_sync() - Read current contents ARP/NDP tables
> + * @c: Execution context
> + * @proto: Protocol, AF_INET or AF_INET6
> + * @ifi: Interface index
> + *
> + */
> +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi)
> +{
> + struct {
> + struct nlmsghdr nlh;
> + struct ndmsg ndm;
> + } req = {
> + .nlh = {0},
> + .ndm.ndm_family = proto,
> + .ndm.ndm_ifindex = ifi,
> + .ndm.ndm_state = 0,
> + .ndm.ndm_flags = 0,
> + .ndm.ndm_type = 0
> + };
> + struct nlmsghdr *nh;
> + char buf[NLBUFSIZ];
> + ssize_t status;
> + uint32_t seq;
> +
> + seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, NLM_F_DUMP, sizeof(req));
> + nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH)
> + nl_neigh_msg_read(c, nh);
> + if (status < 0)
> + warn("Failed to read netlink message. status == %li", status);
> +}
> +
> +/**
> + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates
> + * @c: Execution context
> + */
> +void nl_neigh_notify_handler(const struct ctx *c)
> +{
> + char buf[NLBUFSIZ];
> +
> + for (;;) {
> + ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> + struct nlmsghdr *nh;
> +
> + if (n < 0) {
> + if (errno != EAGAIN)
> + warn_perror("Failed to read from netlink sock");
> + return;
> + }
> + nh = (struct nlmsghdr *)buf;
> + for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n))
> + nl_neigh_msg_read(c, nh);
> + }
> +}
> +
> +/**
> + * nl_neigh_notify_init() - Open netlink sock and subscribe to neighbour events
> + * @c: Execution context
> + *
> + * Return: 0 on success, -1 on failure
> + */
> +int nl_neigh_notify_init(const struct ctx *c)
> +{
> + union epoll_ref ref = {
> + .type = EPOLL_TYPE_NL_NEIGH
> + };
> + struct epoll_event ev = {
> + .events = EPOLLIN
> + };
> + struct sockaddr_nl addr = {
> + .nl_family = AF_NETLINK,
> + .nl_groups = RTMGRP_NEIGH,
> + };
> +
> + if (nl_sock_neigh >= 0) {
> + warn("RTMGRP_NEIGH already exists");
> + return 0;
> + }
> + nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
> + if (nl_sock_neigh < 0) {
> + warn("Failed create RTMGRP_NEIGH socket");
> + return -1;
> + }
> + if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> + close(nl_sock_neigh);
> + nl_sock_neigh = -1;
> + warn("Failed bind RTMGRP_NEIGH socket");
> + return -1;
> + }
> +
> + ev.data.u64 = ref.u64;
> + if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> + close(nl_sock_neigh);
> + nl_sock_neigh = -1;
> + warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket");
> + return -1;
> + }
> +
> + nl_neigh_sync(c, AF_INET, c->ifi4);
> + nl_neigh_sync(c, AF_INET6, c->ifi6);
> +
> + return 0;
> +}
> diff --git a/netlink.h b/netlink.h
> index b51e99c..8f1e9b9 100644
> --- a/netlink.h
> +++ b/netlink.h
> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> int s_dst, unsigned int ifi_dst, sa_family_t af);
> int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
> void *addr, int *prefix_len, void *addr_l);
> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
> + unsigned char *mac);
> int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
> const void *addr, int prefix_len);
> int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
> int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
> int nl_link_set_flags(int s, unsigned int ifi,
> unsigned int set, unsigned int change);
> +int nl_neigh_notify_init(const struct ctx *c);
> +void nl_neigh_notify_handler(const struct ctx *c);
>
> #endif /* NETLINK_H */
> diff --git a/passt.c b/passt.c
> index 31fbb75..e21d6ba 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -53,6 +53,7 @@
> #include "vu_common.h"
> #include "migrate.h"
> #include "repair.h"
> +#include "netlink.h"
>
> #define EPOLL_EVENTS 8
>
> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
> [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
> [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
> [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
> + [EPOLL_TYPE_NL_NEIGH] = "netlink neighbour subscription socket",
> };
> static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
> "epoll_type_str[] doesn't match enum epoll_type");
> @@ -322,6 +324,8 @@ int main(int argc, char **argv)
>
> pcap_init(&c);
>
> + nl_neigh_notify_init(&c);
> +
> if (!c.foreground) {
> if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
> die_perror("Failed to open /dev/null");
> @@ -414,6 +418,9 @@ loop:
> case EPOLL_TYPE_REPAIR:
> repair_handler(&c, eventmask);
> break;
> + case EPOLL_TYPE_NL_NEIGH:
> + nl_neigh_notify_handler(&c);
> + break;
> default:
> /* Can't happen */
> ASSERT(0);
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table
2025-10-03 0:34 ` [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
2025-10-03 4:01 ` David Gibson
@ 2025-10-06 22:33 ` Stefano Brivio
1 sibling, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:33 UTC (permalink / raw)
To: Jon Maloy; +Cc: dgibson, david, passt-dev
In the title: s/subsciption/subscription/
On Thu, 2 Oct 2025 20:34:04 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> requires the ability to translate from an IP address to its
> corresponding MAC address in cases where those are present in
> the ARP or NDP tables.
>
> To keep track of the contents of these tables we add a netlink
> based neighbour subscription feature.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> ---
> v3: - Added an attribute contianing NDA_DST to sent message, so
> that we let the kernel do the filtering of the IP address
> and return only one entry.
> - Added interface index to the call signature. Since the only
> interface we know is the template interface, this limits
> the number of hosts that will be seen as 'network segment
> local' from a PASST viewpoint.
> v4: - Made loop independent of attribute order.
> - Ignoring L2 addresses which are not of size ETH_ALEN.
> v5: - Changed return value of new function, so caller can know if
> a MAC address really was found.
> v6: - Removed warning printout which had ended up in the wrong
> commit.
> v8: - Changed to neighbour event subscription model
> - netlink: arp/ndp table subscription
> v10:- Updated according to David's latest comments on v8
> - Added functionaly where we initially read current
> state of ARP/NDP tables
> v12:- Updates based on feedback from David and Stefano
> ---
> epoll_type.h | 2 +
> netlink.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> netlink.h | 4 +
> passt.c | 7 ++
> 4 files changed, 214 insertions(+), 3 deletions(-)
>
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..a90ffb6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
> EPOLL_TYPE_REPAIR_LISTEN,
> /* TCP_REPAIR helper socket */
> EPOLL_TYPE_REPAIR,
> + /* Netlink neighbour subscription socket */
> + EPOLL_TYPE_NL_NEIGH,
>
> EPOLL_NUM_TYPES,
> };
> diff --git a/netlink.c b/netlink.c
> index c436780..3fe2fdd 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -26,6 +26,7 @@
> #include <arpa/inet.h>
> #include <netinet/in.h>
> #include <netinet/if_ether.h>
> +#include <net/if_arp.h>
>
> #include <linux/netlink.h>
> #include <linux/rtnetlink.h>
> @@ -40,6 +41,10 @@
> #define RTNH_NEXT_AND_DEC(rtnh, attrlen) \
> ((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
>
> +/* Convenience macro borrowed from kernel */
> +#define NUD_VALID \
> + (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE)
> +
> /* Netlink expects a buffer of at least 8kiB or the system page size,
> * whichever is larger. 32kiB is recommended for more efficient.
> * Since the largest page size on any remotely common Linux setup is
> @@ -50,9 +55,10 @@
> #define NLBUFSIZ 65536
>
> /* Socket in init, in target namespace, sequence (just needs to be monotonic) */
> -int nl_sock = -1;
> -int nl_sock_ns = -1;
> -static int nl_seq = 1;
> +int nl_sock = -1;
> +int nl_sock_ns = -1;
> +static int nl_sock_neigh = -1;
> +static int nl_seq = 1;
>
> /**
> * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> @@ -1103,3 +1109,195 @@ int nl_link_set_flags(int s, unsigned int ifi,
>
> return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
> }
> +
> +/**
> + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink
> + * @c: Execution context
> + * @nh: Message to be read
> + *
Nit: excess newline.
> + */
> +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> +{
> + struct ndmsg *ndm = NLMSG_DATA(nh);
> + struct rtattr *rta = (struct rtattr *)(ndm + 1);
> + size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> + char ip_str[INET6_ADDRSTRLEN];
> + char mac_str[ETH_ADDRSTRLEN];
> + const uint8_t *lladdr = NULL;
> + const void *dst = NULL;
> + size_t lladdr_len = 0;
> + uint8_t mac[ETH_ALEN];
> + union inany_addr addr;
> + size_t dstlen = 0;
> +
> + if (nh->nlmsg_type == NLMSG_DONE)
> + return;
> +
> + if (nh->nlmsg_type == NLMSG_ERROR) {
> + warn_perror("nlmsg_type error at msg read");
> + return;
> + }
> +
> + if (nh->nlmsg_type != RTM_NEWNEIGH &&
> + nh->nlmsg_type != RTM_DELNEIGH)
> + return;
> +
> + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> + switch (rta->rta_type) {
> + case NDA_DST:
> + dst = RTA_DATA(rta);
> + dstlen = RTA_PAYLOAD(rta);
> + break;
> + case NDA_LLADDR:
> + lladdr = RTA_DATA(rta);
> + lladdr_len = RTA_PAYLOAD(rta);
> + break;
> + default:
> + break;
> + }
> + }
> +
> + if (!dst)
> + return;
> +
> + if (!lladdr || lladdr_len != ETH_ALEN)
> + return;
> +
> + if (ndm->ndm_type != ARPHRD_ETHER)
> + return;
> +
> + memcpy(mac, lladdr, ETH_ALEN);
> + eth_ntop(mac, mac_str, sizeof(mac_str));
> +
> + if (ndm->ndm_family == AF_INET &&
> + ndm->ndm_ifindex != c->ifi4)
> + return;
> +
> + if (ndm->ndm_family == AF_INET6 &&
> + ndm->ndm_ifindex != c->ifi6)
> + return;
> +
> + if (ndm->ndm_family != AF_INET &&
> + ndm->ndm_family != AF_INET6)
> + return;
> +
> + if (ndm->ndm_family == AF_INET &&
> + dstlen != sizeof(struct in_addr))
> + return;
> +
> + if (ndm->ndm_family == AF_INET6 &&
> + dstlen != sizeof(struct in6_addr))
> + return;
> +
> + inany_from_af(&addr, ndm->ndm_family, dst);
> + inany_ntop(dst, ip_str, sizeof(ip_str));
> +
> + if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
> + trace("neigh table update: %s / %s", ip_str, mac_str);
> + else
> + trace("neigh table delete: %s / %s", ip_str, mac_str);
> +}
> +
> +/**
> + * nl_neigh_sync() - Read current contents ARP/NDP tables
> + * @c: Execution context
> + * @proto: Protocol, AF_INET or AF_INET6
> + * @ifi: Interface index
> + *
> + */
> +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi)
> +{
> + struct {
> + struct nlmsghdr nlh;
> + struct ndmsg ndm;
> + } req = {
> + .nlh = {0},
> + .ndm.ndm_family = proto,
> + .ndm.ndm_ifindex = ifi,
> + .ndm.ndm_state = 0,
> + .ndm.ndm_flags = 0,
> + .ndm.ndm_type = 0
> + };
> + struct nlmsghdr *nh;
> + char buf[NLBUFSIZ];
> + ssize_t status;
> + uint32_t seq;
> +
> + seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, NLM_F_DUMP, sizeof(req));
> + nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH)
> + nl_neigh_msg_read(c, nh);
> + if (status < 0)
> + warn("Failed to read netlink message. status == %li", status);
That should be a human-readable message, so I'm not sure if the C
notation for equality belongs there. In general we use the same format
as perror() (you can't use warn_perror() here) that is,
"Failed to read netlink message: %s", strerror_(status));
--
Stefano
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
2025-10-03 0:34 ` [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 1:03 ` Jon Maloy
` (2 more replies)
2025-10-03 0:34 ` [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
` (7 subsequent siblings)
9 siblings, 3 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
We add a cache table to keep track of the contents of the kernel ARP
and NDP tables. The table is fed from the just introduced netlink based
neigbour subscription function. The new table eliminates the need for
explicit netlink calls to find a host's MAC address.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v5: - Moved to earlier in series to reduce rebase conflicts
v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
- Removed hash lookup. We now only use linear lookup in a
linked list
- Eliminated dynamic memory allocation.
- Ensured there is only one call to clock_gettime()
- Using MAC_ZERO instead of the previously dedicated definitions
v7: - NOW using MAC_ZERO where needed
- I am still using linear back-off for empty cache entries. Even
an incoming, flow-creating packet from a local host gives no
guarantee that its MAC address is in the ARP table, so we must
allow for a few new attempts at first possible occasions. Only
after several failed lookups can we conclude that we probably
never will succeed. Hence the back-off.
- Fixed a bug that David inadvertently made me aware of: I only
intended to set the initial expiry value to MAC_CACHE_RENEWAL
when an ARP/NDP table lookup was successful.
- Improved struct and function description comments.
v8: - Total re-design of table, adapting to the new, subscription
based way of updating it.
v9: - Catering for MAC address change for an existing host.
v10: - Changes according to feedback from David Gibson
v12: - Changes according to feedback from David and Stefano
- Added dummy entries for loopback and default GW addresses
---
conf.c | 1 +
fwd.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fwd.h | 7 +++
netlink.c | 7 ++-
4 files changed, 195 insertions(+), 2 deletions(-)
diff --git a/conf.c b/conf.c
index 66b9e63..3224ee6 100644
--- a/conf.c
+++ b/conf.c
@@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv)
c->udp.fwd_out.mode = fwd_default;
fwd_scan_ports_init(c);
+ fwd_neigh_table_init(c);
if (!c->quiet)
conf_print(c);
diff --git a/fwd.c b/fwd.c
index 250cf56..c34bb1c 100644
--- a/fwd.c
+++ b/fwd.c
@@ -33,6 +33,188 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
#define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
+#define NEIGH_TABLE_SLOTS 1024 /* Must be power of two */
+#define NEIGH_TABLE_SIZE (NEIGH_TABLE_SLOTS / 2)
+static_assert((NEIGH_TABLE_SLOTS & (NEIGH_TABLE_SLOTS - 1)) == 0,
+ "NEIGH_TABLE_SLOTS must be a power of two");
+
+/**
+ * struct neigh_table_entry - Entry in the ARP/NDP table
+ * @next: Next entry in slot or free list
+ * @addr: IP address of represented host
+ * @mac: MAC address of represented host
+ */
+struct neigh_table_entry {
+ struct neigh_table_entry *next;
+ union inany_addr addr;
+ uint8_t mac[ETH_ALEN];
+};
+
+/**
+ * struct neigh_table - Cache of ARP/NDP table contents
+ * @entries: Entries to be plugged into the hash slots when allocated
+ * @slots: Hash table slots
+ * @free: Linked list of unused entries
+ */
+struct neigh_table {
+ struct neigh_table_entry entries[NEIGH_TABLE_SIZE];
+ struct neigh_table_entry *slots[NEIGH_TABLE_SLOTS];
+ struct neigh_table_entry *free;
+};
+
+static struct neigh_table neigh_table;
+
+/**
+ * neigh_table_slot() - Hash key to a number within the table range
+ * @c: Execution context
+ * @key: The key to be used for the hash
+ *
+ * Return: The resulting hash value
+ */
+static size_t neigh_table_slot(const struct ctx *c,
+ const union inany_addr *key)
+{
+ struct siphash_state st = SIPHASH_INIT(c->hash_secret);
+ uint32_t i;
+
+ inany_siphash_feed(&st, key);
+ i = siphash_final(&st, sizeof(*key), 0);
+
+ return ((size_t)i) & (NEIGH_TABLE_SIZE - 1);
+}
+
+/**
+ * fwd_neigh_table_find() - Find a MAC table entry
+ * @c: Execution context
+ * @addr: Neighbour address to be used as key for the lookup
+ *
+ * Return: The matching entry, if found. Otherwise NULL.
+ */
+static struct neigh_table_entry *fwd_neigh_table_find(const struct ctx *c,
+ const union inany_addr *addr)
+{
+ size_t slot = neigh_table_slot(c, addr);
+ struct neigh_table_entry *e = neigh_table.slots[slot];
+
+ while (e && !inany_equals(&e->addr, addr))
+ e = e->next;
+
+ return e;
+}
+
+/**
+ * fwd_neigh_table_update() - Allocate or update neighbour table entry
+ * @c: Execution context
+ * @addr: IP address used to determine insertion slot and store in entry
+ * @mac: The MAC address associated with the neighbour address
+ */
+void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
+ const uint8_t *mac)
+{
+ struct neigh_table *t = &neigh_table;
+ struct neigh_table_entry *e;
+ ssize_t slot;
+
+ /* MAC address might change sometimes */
+ e = fwd_neigh_table_find(c, addr);
+ if (e) {
+ if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
+ return;
+ if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
+ return;
+
+ memcpy(e->mac, mac, ETH_ALEN);
+ return;
+ }
+
+ e = t->free;
+ if (!e) {
+ debug("Failed to allocate neighbour table entry");
+ return;
+ }
+ t->free = e->next;
+
+ slot = neigh_table_slot(c, addr);
+ e->next = t->slots[slot];
+ t->slots[slot] = e;
+
+ memcpy(&e->addr, addr, sizeof(*addr));
+ memcpy(e->mac, mac, ETH_ALEN);
+}
+
+/**
+ * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list
+ * @c: Execution context
+ * @addr: IP address used to find the slot for the entry
+ */
+void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr)
+{
+ ssize_t slot = neigh_table_slot(c, addr);
+ struct neigh_table *t = &neigh_table;
+ struct neigh_table_entry *e, **prev;
+
+ prev = &t->slots[slot];
+ e = t->slots[slot];
+ while (e && !inany_equals(&e->addr, addr)) {
+ prev = &e->next;
+ e = e->next;
+ }
+ if (!e)
+ return;
+
+ *prev = e->next;
+ e->next = t->free;
+ t->free = e;
+ memset(&e->addr, 0, sizeof(*addr));
+ memset(e->mac, 0, ETH_ALEN);
+}
+
+/**
+ * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table
+ * @c: Execution context
+ * @addr: Neighbour IP address used as lookup key
+ * @mac: Buffer for Ethernet MAC to return, found or default value.
+ *
+ * Return: true if real MAC found, false if not found or if failure
+ */
+bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
+ uint8_t *mac)
+{
+ const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr);
+
+ if (e)
+ memcpy(mac, e->mac, ETH_ALEN);
+ else
+ memcpy(mac, c->our_tap_mac, ETH_ALEN);
+
+ return !!e;
+}
+
+/**
+ * fwd_neigh_table_init() - Initialize the neighbour table
+ * @c: Execution context
+ */
+void fwd_neigh_table_init(const struct ctx *c)
+{
+ struct neigh_table *t = &neigh_table;
+ const uint8_t *omac = c->our_tap_mac;
+ struct neigh_table_entry *e;
+ int i;
+
+ memset(t, 0, sizeof(*t));
+ for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
+ e = &t->entries[i];
+ e->next = t->free;
+ t->free = e;
+ }
+
+ /* These addresses must always map to our own MAC address */
+ fwd_neigh_table_update(c, &inany_loopback4, omac);
+ fwd_neigh_table_update(c, &inany_loopback6, omac);
+ fwd_neigh_table_update(c, &inany_from_v4(c->ip4.guest_gw), omac);
+ fwd_neigh_table_update(c, (union inany_addr *)&c->ip6.guest_gw, omac);
+}
+
/** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
*
* Work out what ports the host thinks are emphemeral and record it for later
diff --git a/fwd.h b/fwd.h
index 65c7c96..6ca743c 100644
--- a/fwd.h
+++ b/fwd.h
@@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
+void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
+ const uint8_t *mac);
+void fwd_neigh_table_free(const struct ctx *c,
+ const union inany_addr *addr);
+bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
+ uint8_t *mac);
+void fwd_neigh_table_init(const struct ctx *c);
#endif /* FWD_H */
diff --git a/netlink.c b/netlink.c
index 3fe2fdd..4be5fcf 100644
--- a/netlink.c
+++ b/netlink.c
@@ -1192,10 +1192,13 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
inany_from_af(&addr, ndm->ndm_family, dst);
inany_ntop(dst, ip_str, sizeof(ip_str));
- if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
+ if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID) {
trace("neigh table update: %s / %s", ip_str, mac_str);
- else
+ fwd_neigh_table_update(c, &addr, mac);
+ } else {
trace("neigh table delete: %s / %s", ip_str, mac_str);
+ fwd_neigh_table_free(c, &addr);
+ }
}
/**
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-03 0:34 ` [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
@ 2025-10-03 1:03 ` Jon Maloy
2025-10-03 4:31 ` David Gibson
2025-10-06 22:40 ` Stefano Brivio
2 siblings, 0 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 1:03 UTC (permalink / raw)
To: sbrivio, dgibson, david, passt-dev
On 2025-10-02 20:34, Jon Maloy wrote:
> We add a cache table to keep track of the contents of the kernel ARP
> and NDP tables. The table is fed from the just introduced netlink based
> neigbour subscription function. The new table eliminates the need for
> explicit netlink calls to find a host's MAC address.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
> - Removed hash lookup. We now only use linear lookup in a
> linked list
> - Eliminated dynamic memory allocation.
> - Ensured there is only one call to clock_gettime()
> - Using MAC_ZERO instead of the previously dedicated definitions
> v7: - NOW using MAC_ZERO where needed
> - I am still using linear back-off for empty cache entries. Even
> an incoming, flow-creating packet from a local host gives no
> guarantee that its MAC address is in the ARP table, so we must
> allow for a few new attempts at first possible occasions. Only
> after several failed lookups can we conclude that we probably
> never will succeed. Hence the back-off.
> - Fixed a bug that David inadvertently made me aware of: I only
> intended to set the initial expiry value to MAC_CACHE_RENEWAL
> when an ARP/NDP table lookup was successful.
> - Improved struct and function description comments.
> v8: - Total re-design of table, adapting to the new, subscription
> based way of updating it.
> v9: - Catering for MAC address change for an existing host.
> v10: - Changes according to feedback from David Gibson
> v12: - Changes according to feedback from David and Stefano
> - Added dummy entries for loopback and default GW addresses
> ---
> conf.c | 1 +
> fwd.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fwd.h | 7 +++
> netlink.c | 7 ++-
> 4 files changed, 195 insertions(+), 2 deletions(-)
>
[...]
> + return !!e;
> +}
> +
> +/**
> + * fwd_neigh_table_init() - Initialize the neighbour table
> + * @c: Execution context
> + */
> +void fwd_neigh_table_init(const struct ctx *c)
> +{
> + struct neigh_table *t = &neigh_table;
> + const uint8_t *omac = c->our_tap_mac;
> + struct neigh_table_entry *e;
> + int i;
> +
> + memset(t, 0, sizeof(*t));
> + for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
> + e = &t->entries[i];
> + e->next = t->free;
> + t->free = e;
> + }
> +
> + /* These addresses must always map to our own MAC address */
> + fwd_neigh_table_update(c, &inany_loopback4, omac);
> + fwd_neigh_table_update(c, &inany_loopback6, omac);
> + fwd_neigh_table_update(c, &inany_from_v4(c->ip4.guest_gw), omac);
> + fwd_neigh_table_update(c, (union inany_addr *)&c->ip6.guest_gw, omac);
> +}
> +
Stefano, I know this isn't exactly what we agreed upon when we discussed
this, but I really cannot see the guest address being handled anywhere
in a way that justifies adding it to the the neigbour table.
As far as I can see, the above, plus a fix in the next commit, solves
all the problems regarding host and default GW access.
Please correct me if I am wrong.
///jon
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-03 0:34 ` [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-10-03 1:03 ` Jon Maloy
@ 2025-10-03 4:31 ` David Gibson
2025-10-05 15:52 ` Jon Maloy
2025-10-06 22:40 ` Stefano Brivio
2 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:31 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 8423 bytes --]
On Thu, Oct 02, 2025 at 08:34:05PM -0400, Jon Maloy wrote:
> We add a cache table to keep track of the contents of the kernel ARP
> and NDP tables. The table is fed from the just introduced netlink based
> neigbour subscription function. The new table eliminates the need for
> explicit netlink calls to find a host's MAC address.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
> - Removed hash lookup. We now only use linear lookup in a
> linked list
> - Eliminated dynamic memory allocation.
> - Ensured there is only one call to clock_gettime()
> - Using MAC_ZERO instead of the previously dedicated definitions
> v7: - NOW using MAC_ZERO where needed
> - I am still using linear back-off for empty cache entries. Even
> an incoming, flow-creating packet from a local host gives no
> guarantee that its MAC address is in the ARP table, so we must
> allow for a few new attempts at first possible occasions. Only
> after several failed lookups can we conclude that we probably
> never will succeed. Hence the back-off.
> - Fixed a bug that David inadvertently made me aware of: I only
> intended to set the initial expiry value to MAC_CACHE_RENEWAL
> when an ARP/NDP table lookup was successful.
> - Improved struct and function description comments.
> v8: - Total re-design of table, adapting to the new, subscription
> based way of updating it.
> v9: - Catering for MAC address change for an existing host.
> v10: - Changes according to feedback from David Gibson
> v12: - Changes according to feedback from David and Stefano
> - Added dummy entries for loopback and default GW addresses
> ---
> conf.c | 1 +
> fwd.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fwd.h | 7 +++
> netlink.c | 7 ++-
> 4 files changed, 195 insertions(+), 2 deletions(-)
>
> diff --git a/conf.c b/conf.c
> index 66b9e63..3224ee6 100644
> --- a/conf.c
> +++ b/conf.c
[snip]
> +/**
> + * fwd_neigh_table_update() - Allocate or update neighbour table entry
> + * @c: Execution context
> + * @addr: IP address used to determine insertion slot and store in entry
> + * @mac: The MAC address associated with the neighbour address
> + */
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac)
> +{
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e;
> + ssize_t slot;
> +
> + /* MAC address might change sometimes */
> + e = fwd_neigh_table_find(c, addr);
> + if (e) {
> + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
> + return;
> + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
> + return;
This doesn't make sense to me. From the way its looked up in 4/9, the
IP addresses in the table are as seen by the host, not by the guest
(we look up the table *after* applying NAT). Which makes guest_gw not
meaningful here.
You _do_ need to handle the case that addr is loopback (which guest_gw
might be translated to), and that's handled by your dummy entries.
The other case we might need to consider here is if the (translated)
address is the host's IP. Do we want to give out_tap_addr for that
case? Or the host's real MAC on the template interface? If the
latter will that be in the host ARP table, or would we have to handle
it specially?
> +
> + memcpy(e->mac, mac, ETH_ALEN);
> + return;
> + }
> +
> + e = t->free;
> + if (!e) {
> + debug("Failed to allocate neighbour table entry");
> + return;
> + }
> + t->free = e->next;
> +
> + slot = neigh_table_slot(c, addr);
> + e->next = t->slots[slot];
> + t->slots[slot] = e;
> +
> + memcpy(&e->addr, addr, sizeof(*addr));
> + memcpy(e->mac, mac, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list
> + * @c: Execution context
> + * @addr: IP address used to find the slot for the entry
> + */
> +void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr)
> +{
> + ssize_t slot = neigh_table_slot(c, addr);
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e, **prev;
> +
> + prev = &t->slots[slot];
> + e = t->slots[slot];
> + while (e && !inany_equals(&e->addr, addr)) {
> + prev = &e->next;
> + e = e->next;
> + }
> + if (!e)
> + return;
> +
> + *prev = e->next;
> + e->next = t->free;
> + t->free = e;
> + memset(&e->addr, 0, sizeof(*addr));
> + memset(e->mac, 0, ETH_ALEN);
As Stefano noted earlier, 0xff is probably a better placeholder here,
since 00:00:00:00:00:00 is a valid MAC.
> +}
> +
> +/**
> + * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table
> + * @c: Execution context
> + * @addr: Neighbour IP address used as lookup key
> + * @mac: Buffer for Ethernet MAC to return, found or default value.
> + *
> + * Return: true if real MAC found, false if not found or if failure
> + */
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac)
> +{
> + const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr);
> +
> + if (e)
> + memcpy(mac, e->mac, ETH_ALEN);
> + else
> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
> +
> + return !!e;
> +}
> +
> +/**
> + * fwd_neigh_table_init() - Initialize the neighbour table
> + * @c: Execution context
> + */
> +void fwd_neigh_table_init(const struct ctx *c)
> +{
> + struct neigh_table *t = &neigh_table;
> + const uint8_t *omac = c->our_tap_mac;
> + struct neigh_table_entry *e;
> + int i;
> +
> + memset(t, 0, sizeof(*t));
> + for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
> + e = &t->entries[i];
> + e->next = t->free;
> + t->free = e;
> + }
> +
> + /* These addresses must always map to our own MAC address */
> + fwd_neigh_table_update(c, &inany_loopback4, omac);
> + fwd_neigh_table_update(c, &inany_loopback6, omac);
These make sense.
> + fwd_neigh_table_update(c, &inany_from_v4(c->ip4.guest_gw), omac);
> + fwd_neigh_table_update(c, (union inany_addr *)&c->ip6.guest_gw, omac);
But these don't. For two reasons:
* As noted guest_gw is only meaningful guest side, but the table is
based on host side addresses.
* These will be no-ops, since you explicitly exclude these addresses
in fwd_neigh_table_update().
> +}
> +
> /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
> *
> * Work out what ports the host thinks are emphemeral and record it for later
> diff --git a/fwd.h b/fwd.h
> index 65c7c96..6ca743c 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac);
> +void fwd_neigh_table_free(const struct ctx *c,
> + const union inany_addr *addr);
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac);
> +void fwd_neigh_table_init(const struct ctx *c);
>
> #endif /* FWD_H */
> diff --git a/netlink.c b/netlink.c
> index 3fe2fdd..4be5fcf 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -1192,10 +1192,13 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> inany_from_af(&addr, ndm->ndm_family, dst);
> inany_ntop(dst, ip_str, sizeof(ip_str));
>
> - if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
> + if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID) {
> trace("neigh table update: %s / %s", ip_str, mac_str);
> - else
> + fwd_neigh_table_update(c, &addr, mac);
> + } else {
> trace("neigh table delete: %s / %s", ip_str, mac_str);
> + fwd_neigh_table_free(c, &addr);
> + }
> }
>
> /**
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-03 4:31 ` David Gibson
@ 2025-10-05 15:52 ` Jon Maloy
2025-10-06 22:33 ` Stefano Brivio
2025-10-07 3:33 ` David Gibson
0 siblings, 2 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-05 15:52 UTC (permalink / raw)
To: David Gibson; +Cc: sbrivio, dgibson, passt-dev
On 2025-10-03 00:31, David Gibson wrote:
> On Thu, Oct 02, 2025 at 08:34:05PM -0400, Jon Maloy wrote:
>> We add a cache table to keep track of the contents of the kernel ARP
>> and NDP tables. The table is fed from the just introduced netlink based
[...]
>> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
>> + const uint8_t *mac)
>> +{
>> + struct neigh_table *t = &neigh_table;
>> + struct neigh_table_entry *e;
>> + ssize_t slot;
>> +
>> + /* MAC address might change sometimes */
>> + e = fwd_neigh_table_find(c, addr);
>> + if (e) {
>> + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
>> + return;
>> + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
>> + return;
>
> This doesn't make sense to me. From the way its looked up in 4/9, the
> IP addresses in the table are as seen by the host, not by the guest
> (we look up the table *after* applying NAT). Which makes guest_gw not
> meaningful here.
>
> You _do_ need to handle the case that addr is loopback (which guest_gw
> might be translated to), and that's handled by your dummy entries.
>
> The other case we might need to consider here is if the (translated)
> address is the host's IP. Do we want to give out_tap_addr for that
> case? Or the host's real MAC on the template interface? If the
> latter will that be in the host ARP table, or would we have to handle
> it specially?
>
This is my current understanding of this:
1) Adding the loopback entries to the neigbour table is harmless,
but unnecessary. fwd_neigh_mac_get() will always return our_tap_mac
when it doesn't find an entry in the table, which is the case here.
This addition was a mistake.
2) Regarding the default gw, we have two cases:
2.1) guest_gw IP is the same as the host_gw IP:
In this case, we'll have an event trying to inserting an entry
into the table with the host_gw's real mac address.
In v11, this mapping was announced to the guest, but later
contradicted by all ARP/NDP messages he receives, because they all
come from PASTA/PASST and uses our_tap_mac as source mac in the
message header. This is probably harmless, but causes confusion in
the guest and a warning wireshark.
In v12, I therefore chose to add the entry but suppress the
announcements for the gw, and also the updates of
the mac fields from subsequent events.
This is not right either. We either have to be consistent, and
add the real host_gw mac in all messages sent from the purported
gw address, or we just as consistently use our_tap_mac. I think
the latter is simpler with less consequences for the code.
I think we should just simply suppress all events from the host
gw in such cases.
(Here I have a question: I don't see any NAT mapping making
it possible to communicate between the guest and the the host gw
when guest gw IP and host gw IP are are equal. Shouldn't there be
one?)
2.2) guest_gw IP is different from host_gw IP:
This case is simpler. We just treat the host_gw as any other
local host, add and update its entry at new events, and pass it on
to the guest as an announcement at first contact, consistently
using that host's true mapping.
3) Regarding the host address, we also have two cases:
3.1) guest IP is the same as the host IP (on the default interface):
In this case the guest can only reach the host by using the
guest_gw IP (which lands on the host), some or the other IPs
on the host, or the map_host_loopback IP, as far as I
understand.
The host IP will never enter the neigbour table by ARP/NDP
event, and there is no reason for us to add a dummy for it.
3.2) guest IP differs from the host IP.
In this case we let fwd_neigh_table_init() add an entry for the
host as if it were any other host, using the IP and mac
addresses from the template interface.
Conclusion:
- We use nat_inbound() instead of nat_outbound() before consulting the
table, like you suggest.
- We need to manually add an entry representing the host in the case the
host IP differs from the guest IP. This entry is announced.
In the case the IPs are the same we don't add any entry.
- Local host entries are added by ARP/NDP events, but only if
nat_inbound() doesn't translate the IP address (will that ever
happen?).
We suppress all events from the host gw in case it has the same
IP as the guest gw (unless this is covered by the nat_inbound()
check?).
- We may want a NAT mapping making it possible to reach the host gw
in the case it has the same IP as the guest gw. This has no effect
on our neighbour table.
If you agree with my above analysis I can go ahead and post a v13 of
the series early next week, including the above changes and fixes
for your other comments to the series.
///jon
>> +
>> + memcpy(e->mac, mac, ETH_ALEN);
>> + return;
>> + }
>> +
>> + e = t->free;
>> + if (!e) {
>> + debug("Failed to allocate neighbour table entry");
>> + return;
>> + }
>> + t->free = e->next;
>> +
>> + slot = neigh_table_slot(c, addr);
>> + e->next = t->slots[slot];
>> + t->slots[slot] = e;
>> +
>> + memcpy(&e->addr, addr, sizeof(*addr));
>> + memcpy(e->mac, mac, ETH_ALEN);
>> +}
>> +
>> +/**
>> + * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list
>> + * @c: Execution context
>> + * @addr: IP address used to find the slot for the entry
>> + */
>> +void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr)
>> +{
>> + ssize_t slot = neigh_table_slot(c, addr);
>> + struct neigh_table *t = &neigh_table;
>> + struct neigh_table_entry *e, **prev;
>> +
>> + prev = &t->slots[slot];
>> + e = t->slots[slot];
>> + while (e && !inany_equals(&e->addr, addr)) {
>> + prev = &e->next;
>> + e = e->next;
>> + }
>> + if (!e)
>> + return;
>> +
>> + *prev = e->next;
>> + e->next = t->free;
>> + t->free = e;
>> + memset(&e->addr, 0, sizeof(*addr));
>> + memset(e->mac, 0, ETH_ALEN);
>
> As Stefano noted earlier, 0xff is probably a better placeholder here,
> since 00:00:00:00:00:00 is a valid MAC.
>
>> +}
>> +
>> +/**
>> + * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table
>> + * @c: Execution context
>> + * @addr: Neighbour IP address used as lookup key
>> + * @mac: Buffer for Ethernet MAC to return, found or default value.
>> + *
>> + * Return: true if real MAC found, false if not found or if failure
>> + */
>> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
>> + uint8_t *mac)
>> +{
>> + const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr);
>> +
>> + if (e)
>> + memcpy(mac, e->mac, ETH_ALEN);
>> + else
>> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
>> +
>> + return !!e;
>> +}
>> +
>> +/**
>> + * fwd_neigh_table_init() - Initialize the neighbour table
>> + * @c: Execution context
>> + */
>> +void fwd_neigh_table_init(const struct ctx *c)
>> +{
>> + struct neigh_table *t = &neigh_table;
>> + const uint8_t *omac = c->our_tap_mac;
>> + struct neigh_table_entry *e;
>> + int i;
>> +
>> + memset(t, 0, sizeof(*t));
>> + for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
>> + e = &t->entries[i];
>> + e->next = t->free;
>> + t->free = e;
>> + }
>> +
>> + /* These addresses must always map to our own MAC address */
>> + fwd_neigh_table_update(c, &inany_loopback4, omac);
>> + fwd_neigh_table_update(c, &inany_loopback6, omac);
>
> These make sense.
>
>> + fwd_neigh_table_update(c, &inany_from_v4(c->ip4.guest_gw), omac);
>> + fwd_neigh_table_update(c, (union inany_addr *)&c->ip6.guest_gw, omac);
>
> But these don't. For two reasons:
> * As noted guest_gw is only meaningful guest side, but the table is
> based on host side addresses.
> * These will be no-ops, since you explicitly exclude these addresses
> in fwd_neigh_table_update().
>
>> +}
>> +
>> /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
>> *
>> * Work out what ports the host thinks are emphemeral and record it for later
>> diff --git a/fwd.h b/fwd.h
>> index 65c7c96..6ca743c 100644
>> --- a/fwd.h
>> +++ b/fwd.h
>> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
>> const struct flowside *ini, struct flowside *tgt);
>> uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>> const struct flowside *ini, struct flowside *tgt);
>> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
>> + const uint8_t *mac);
>> +void fwd_neigh_table_free(const struct ctx *c,
>> + const union inany_addr *addr);
>> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
>> + uint8_t *mac);
>> +void fwd_neigh_table_init(const struct ctx *c);
>>
>> #endif /* FWD_H */
>> diff --git a/netlink.c b/netlink.c
>> index 3fe2fdd..4be5fcf 100644
>> --- a/netlink.c
>> +++ b/netlink.c
>> @@ -1192,10 +1192,13 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
>> inany_from_af(&addr, ndm->ndm_family, dst);
>> inany_ntop(dst, ip_str, sizeof(ip_str));
>>
>> - if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
>> + if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID) {
>> trace("neigh table update: %s / %s", ip_str, mac_str);
>> - else
>> + fwd_neigh_table_update(c, &addr, mac);
>> + } else {
>> trace("neigh table delete: %s / %s", ip_str, mac_str);
>> + fwd_neigh_table_free(c, &addr);
>> + }
>> }
>>
>> /**
>> --
>> 2.50.1
>>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-05 15:52 ` Jon Maloy
@ 2025-10-06 22:33 ` Stefano Brivio
2025-10-07 3:33 ` David Gibson
1 sibling, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:33 UTC (permalink / raw)
To: Jon Maloy; +Cc: David Gibson, dgibson, passt-dev
On Sun, 5 Oct 2025 11:52:42 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> On 2025-10-03 00:31, David Gibson wrote:
> > On Thu, Oct 02, 2025 at 08:34:05PM -0400, Jon Maloy wrote:
> >> We add a cache table to keep track of the contents of the kernel ARP
> >> and NDP tables. The table is fed from the just introduced netlink based
>
> [...]
>
> >> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> >> + const uint8_t *mac)
> >> +{
> >> + struct neigh_table *t = &neigh_table;
> >> + struct neigh_table_entry *e;
> >> + ssize_t slot;
> >> +
> >> + /* MAC address might change sometimes */
> >> + e = fwd_neigh_table_find(c, addr);
> >> + if (e) {
> >> + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
> >> + return;
> >> + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
> >> + return;
> >
> > This doesn't make sense to me. From the way its looked up in 4/9, the
> > IP addresses in the table are as seen by the host, not by the guest
> > (we look up the table *after* applying NAT). Which makes guest_gw not
> > meaningful here.
> >
> > You _do_ need to handle the case that addr is loopback (which guest_gw
> > might be translated to), and that's handled by your dummy entries.
> >
> > The other case we might need to consider here is if the (translated)
> > address is the host's IP. Do we want to give out_tap_addr for that
> > case? Or the host's real MAC on the template interface? If the
> > latter will that be in the host ARP table, or would we have to handle
> > it specially?
> >
> This is my current understanding of this:
>
> 1) Adding the loopback entries to the neigbour table is harmless,
> but unnecessary.
Loopback entries yes, but I thought you would add the equivalent
guest-side entries.
> fwd_neigh_mac_get() will always return our_tap_mac
> when it doesn't find an entry in the table, which is the case here.
Kind of, because while with IPv4:
$ pasta sh -c 'ip n a 127.0.0.1 dev lo lladdr 00:ba:db:ad:1d:ea; ip n'
0.0.0.0 dev lo lladdr 00:ba:db:ad:1d:ea PERMANENT
that won't match anything, with IPv6:
$ pasta sh -c 'ip n a ::1 dev lo lladdr 00:ba:db:ad:1d:ea; ip n'
::1 dev lo lladdr 00:ba:db:ad:1d:ea PERMANENT
that would match the loopback address. I don't think we actually need
to care about this, though.
> This addition was a mistake.
>
>
> 2) Regarding the default gw, we have two cases:
>
> 2.1) guest_gw IP is the same as the host_gw IP:
I'm afraid we're losing a substantial amount of time on ambiguous
terminology, so let me clarify what my understanding of these terms is:
a. "guest_gw IP" -> any of the two IP addresses, in the guest,
possibly mapping to the host
...that's because the default gateway address in the guest doesn't
really matter, so I don't think you're referring to the IP address of
the default gateway in the guest.
b. "host_gw IP" -> the IP address of the default gateway on the host
> In this case, we'll have an event trying to inserting an entry
> into the table with the host_gw's real mac address.
> In v11, this mapping was announced to the guest, but later
> contradicted by all ARP/NDP messages he receives, because they all
> come from PASTA/PASST and uses our_tap_mac as source mac in the
> message header. This is probably harmless, but causes confusion in
> the guest and a warning wireshark.
> In v12, I therefore chose to add the entry but suppress the
> announcements for the gw, and also the updates of
> the mac fields from subsequent events.
> This is not right either. We either have to be consistent, and
> add the real host_gw mac in all messages sent from the purported
> gw address, or we just as consistently use our_tap_mac. I think
> the latter is simpler with less consequences for the code.
I also think we should use our_tap_mac for any message coming from the
host itself: they're not coming from the gateway (as seen by the host),
which we are effectively shadowing.
> I think we should just simply suppress all events from the host
> gw in such cases.
This is generally fine, I think, but it comes with the additional
problem that, should we ever make the map_guest_addr or
map_host_loopback parameters configurable at runtime, we'll miss the
entry at that point and we'll need to rework this.
If it doesn't cause substantial effort, I would suggest that, if
entries always refer to IP addresses as seen from the guest (David's
suggestion), you would actually add the entries mapping to the host to
the table, resolving to our_tap_mac.
Those addresses are *not* c->ip[46].guest_gw though. They are
c->ip[46].map_guest_addr and c->ip[46].map_host_loopback.
See nat_outbound() for an overview of translated addresses.
> (Here I have a question: I don't see any NAT mapping making
> it possible to communicate between the guest and the the host gw
> when guest gw IP and host gw IP are are equal. Shouldn't there be
> one?)
Right, there's no such parameter at the moment, because the assumption
is that, if you need to reach the default gateway of the host, you
would specify a non-default --map-host-loopback (possibly 'none').
On the other hand we plan to make this more generic, perhaps as part of
https://bugs.passt.top/show_bug.cgi?id=140, perhaps as part of another
effort.
The flow table should enable arbitrary NATs, and was implemented with
that in mind. We just don't have a suitable configuration front-end at
the moment, even though command line parameters would probably be
enough to cover most common cases.
> 2.2) guest_gw IP is different from host_gw IP:
> This case is simpler. We just treat the host_gw as any other
> local host, add and update its entry at new events, and pass it on
> to the guest as an announcement at first contact, consistently
> using that host's true mapping.
>
> 3) Regarding the host address, we also have two cases:
>
> 3.1) guest IP is the same as the host IP (on the default interface):
My interpretation of the terms here is:
c. "guest IP" -> one of the configured guest IP addresses
(c->ip[46].addr)
d. "host IP" -> one of the IP addresses configured on the host (not in
struct ctx, but two of those will become c->ip[46].addr by default)
> In this case the guest can only reach the host by using the
> guest_gw IP (which lands on the host),
Not according to my interpretation of "guest_gw IP" (a. above).
> some or the other IPs
> on the host, or the map_host_loopback IP, as far as I
> understand.
Yes. And c->ip[46].map_guest_addr on top of them.
> The host IP will never enter the neigbour table by ARP/NDP
> event, and there is no reason for us to add a dummy for it.
If none of the host IP addresses are visible to the guest, right,
there's no need for any matching entry.
> 3.2) guest IP differs from the host IP.
> In this case we let fwd_neigh_table_init() add an entry for the
> host as if it were any other host, using the IP and mac
> addresses from the template interface.
This assumes that we'll use MAC addresses assigned to host interfaces,
instead of using our_tap_mac.
I think it looks more elegant than the alternative, but mind that there
can be 2^32 - 1 interfaces on Linux (see dev_index_reserve() in
net/core/dev.c), so that might flood the table right away. Maybe
our_tap_mac isn't that bad after all.
> Conclusion:
> - We use nat_inbound() instead of nat_outbound() before consulting the
> table, like you suggest.
If this implies standardising on guest-side addresses, it makes sense
to me as well.
> - We need to manually add an entry representing the host in the case the
> host IP differs from the guest IP. This entry is announced.
> In the case the IPs are the same we don't add any entry.
I don't think host and guest IP addresses are really related to this
point. It's rather about the two sets of addresses that might map to
the host.
> - Local host entries are added by ARP/NDP events, but only if
> nat_inbound() doesn't translate the IP address (will that ever
> happen?).
> We suppress all events from the host gw in case it has the same
> IP as the guest gw (unless this is covered by the nat_inbound()
> check?).
I wouldn't hide any address like that, to make this a bit more
future-proof.
> - We may want a NAT mapping making it possible to reach the host gw
> in the case it has the same IP as the guest gw. This has no effect
> on our neighbour table.
>
> If you agree with my above analysis I can go ahead and post a v13 of
> the series early next week, including the above changes and fixes
> for your other comments to the series.
...the rest sounds good to me.
--
Stefano
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-05 15:52 ` Jon Maloy
2025-10-06 22:33 ` Stefano Brivio
@ 2025-10-07 3:33 ` David Gibson
1 sibling, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-07 3:33 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 12411 bytes --]
On Sun, Oct 05, 2025 at 11:52:42AM -0400, Jon Maloy wrote:
>
>
> On 2025-10-03 00:31, David Gibson wrote:
> > On Thu, Oct 02, 2025 at 08:34:05PM -0400, Jon Maloy wrote:
> > > We add a cache table to keep track of the contents of the kernel ARP
> > > and NDP tables. The table is fed from the just introduced netlink based
>
> [...]
>
> > > +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> > > + const uint8_t *mac)
> > > +{
> > > + struct neigh_table *t = &neigh_table;
> > > + struct neigh_table_entry *e;
> > > + ssize_t slot;
> > > +
> > > + /* MAC address might change sometimes */
> > > + e = fwd_neigh_table_find(c, addr);
> > > + if (e) {
> > > + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
> > > + return;
> > > + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
> > > + return;
> >
> > This doesn't make sense to me. From the way its looked up in 4/9, the
> > IP addresses in the table are as seen by the host, not by the guest
> > (we look up the table *after* applying NAT). Which makes guest_gw not
> > meaningful here.
> >
> > You _do_ need to handle the case that addr is loopback (which guest_gw
> > might be translated to), and that's handled by your dummy entries.
> >
> > The other case we might need to consider here is if the (translated)
> > address is the host's IP. Do we want to give out_tap_addr for that
> > case? Or the host's real MAC on the template interface? If the
> > latter will that be in the host ARP table, or would we have to handle
> > it specially?
> >
> This is my current understanding of this:
>
> 1) Adding the loopback entries to the neigbour table is harmless,
> but unnecessary. fwd_neigh_mac_get() will always return our_tap_mac
> when it doesn't find an entry in the table, which is the case here.
> This addition was a mistake.
Yes. In fact, it goes even further than that: now that we're
conceptualizing this in terms of guest side addresses, we should never
ever see loopback addresses on tap (I believe we already filter for
this before we get to this point).
> 2) Regarding the default gw, we have two cases:
>
> 2.1) guest_gw IP is the same as the host_gw IP:
> In this case, we'll have an event trying to inserting an entry
> into the table with the host_gw's real mac address.
Yes. Indeed if guest_gw has the same address as anything on the
host's network segment.
> In v11, this mapping was announced to the guest, but later
> contradicted by all ARP/NDP messages he receives, because they all
> come from PASTA/PASST and uses our_tap_mac as source mac in the
> message header. This is probably harmless, but causes confusion in
> the guest and a warning wireshark.
> In v12, I therefore chose to add the entry but suppress the
> announcements for the gw, and also the updates of
> the mac fields from subsequent events.
> This is not right either. We either have to be consistent, and
> add the real host_gw mac in all messages sent from the purported
> gw address, or we just as consistently use our_tap_mac. I think
> the latter is simpler with less consequences for the code.
> I think we should just simply suppress all events from the host
> gw in such cases.
> (Here I have a question: I don't see any NAT mapping making
> it possible to communicate between the guest and the the host gw
> when guest gw IP and host gw IP are are equal. Shouldn't there be
> one?)
No, there isn't. This is a limitation in passt that's been there
forever and is tricky to remove. The problem is that we can't
allocate IP addresses, but we're trying to give the guest the illusion
of being its own network host with its own address.
Re-using the host's IP for the guest has a number of advantages - we
can avoid many NATs, and in particular means (most) peers will see the
same address as the guest thinks it has. However it means the guest
can't contact the host at all, which is pretty inconvenient.
The gateway NAT is essentially sacrificing the ability for the guest
to contact the (host) gateway, in order to let it contact the host
itself. This is a good tradeoff when the gateway is a dumb network
appliance (common in a data centre) - it's not such a great tradeoff
when the gateway also has other services running on it (common in a
home network).
There are other ways of approaching the tradeoffs here, none of them
perfect. podman uses a different one by default, using link-local
(169.254.x.y) addresses for the NATs.
One of the goals of the flexible forwarding stuff I'm (gradually)
working on is to make more of those options easier to access.
> 2.2) guest_gw IP is different from host_gw IP:
> This case is simpler. We just treat the host_gw as any other
> local host, add and update its entry at new events, and pass it on
> to the guest as an announcement at first contact, consistently
> using that host's true mapping.
So, what to do is a bit confused by the fact that there are 4
conceptually different addresses that usually have the same value:
a) The host gw address
b) The guest gw address
c) The --map-host-loopback address (c->ip*.map_host_loopback)
d) The --map-guest-addr address (c->ip*.map_guest_addr)
I don't think the host gw address is relevant here, except insofar as
its affected by having the same value as others.
The guest gw address is *usually* not directly relevant for forwarding
decisions. But here it kind of is. We can be essentially certain the
guest will ARP this address, and it will look weird if the MAC address
that returns isn't the same as the MAC for packets from peers outside
the network segment (i.e. we are simulating as being routed through
the guest's gateway).
> 3) Regarding the host address, we also have two cases:
>
> 3.1) guest IP is the same as the host IP (on the default interface):
> In this case the guest can only reach the host by using the
> guest_gw IP (which lands on the host), some or the other IPs
> on the host, or the map_host_loopback IP, as far as I
> understand.
It can only reach the host via the map_host_loopback or map_guest_addr
addresses. The guest_gw IP is not relevant... except that it's
usually the same as one of the former.
> The host IP will never enter the neigbour table by ARP/NDP
> event, and there is no reason for us to add a dummy for it.
>
> 3.2) guest IP differs from the host IP.
> In this case we let fwd_neigh_table_init() add an entry for the
> host as if it were any other host, using the IP and mac
> addresses from the template interface.
*thinks*...
* The map_host_loopback and map_guest_addr addresses really refer to
the host, regardless of what that might be shadowing out in the
wider internet
* map_host_loopback address should always use our_tap_mac,
regardless of anything else. The interface it's referring to (lo
on the host) has no MAC address, so our_tap_mac is the only real
choice.
* map_guest_addr usually refers to the host, but could refer to
something else (host addr != guest addr).
* If it refers to the host, we kind of have a choice whether to
use our_tap_mac or the host's real MAC addr
* If it refers to something else, we should probably use the
real MAC addr if it's on the same network segment
* I think the best thing here is not to treat this
specially, relying on nat_inbound to handle it. If it's not
the host that will preserve the MAC. If it is the host it
will preserve the MAC iff we get an event for the host's own
MAC.
* host_gw needs no special handling. If it's the same as guest_gw
it will be handled by one of the other cases. If it's !=
guest_gw, then it's just another machine on the host's network
segment and will be handled the same
* guest_gw is the trickiest to think about.
* guest_gw == map_host_loopback
* our_tap_mac is the only real choice - but that's already
handled by the map_host_loopback rule
* guest_gw == map_guest_addr && guest_addr == host_addr
* guest_gw refers to the host, so we kind of have a choice between
our_tap_mac and the host's real MAC addr
* guest_gw == map_guest_addr && guest_addr != host_addr
* guest_gw refers to some machine that's not the host, but
when we forward packets we're acting on our own, it's not
something that other machine is doing
* guest_gw != map_guest_addr && guest_gw == host_addr
* guest_gw refers to the host without NAT. Makes sense to
use host's real MAC addr here?
* A bunch of other combinations
* My brain hurts.
The confusion with guest_gw comes because the guest can communicate
with it in two different ways: it can be communicating with guest_gw
directly as a peer (so the guest_gw IP actually appears in the
packets), or it can be communicating only as a router (no guest_gw IP
in the packets, but the guest will expect the router's MAC address
there). The question is what do we need to keep consistent between
those cases so that we don't confuse anyone.
I think the way to go - for now at least - is to lock the MAC address
for guest_gw to our_tap_mac, regardless of what else is going on.
That means we pass up the opportunity to preserve the MAC address of
the host router, even if that doesn't conflict with our NATs
(--no-map-gw). But I think it's the simplest way to keep things
consistent. Otherwise we need special logic when we forward a packet
from outside the host network to set the MAC to the router's MAC, even
though the router's IP doesn't appear in the packet or flow.
> Conclusion:
> - We use nat_inbound() instead of nat_outbound() before consulting the
> table, like you suggest.
Agreed.
> - We need to manually add an entry representing the host in the case the
> host IP differs from the guest IP. This entry is announced.
> In the case the IPs are the same we don't add any entry.
Rather than deal with multiple cases explicitly here, I think we
always want to insert an entry for nat_inbound(host_addr), unless that
conflicts with the rules below.
> - Local host entries are added by ARP/NDP events, but only if
> nat_inbound() doesn't translate the IP address (will that ever
> happen?).
I don't think it matter whether nat_inbound() translates, only whether
the guest side address is == map_host_loopback or == guest_gw. We
can't allow an entry for those cases, but I think we can in all other
cases, whether or not there's been a (non-identity) translation.
> We suppress all events from the host gw in case it has the same
> IP as the guest gw (unless this is covered by the nat_inbound()
> check?).
I believe that is handled by the above checks.
> - We may want a NAT mapping making it possible to reach the host gw
> in the case it has the same IP as the guest gw. This has no effect
> on our neighbour table.
host_gw == guest_gw doesn't of itself prevent us from reaching it.
host_gw == map_guest_addr does. There's not really a reason to add a
new NAT. If you need to reach the host gw, use --no-map-gw and/or
override --map-guest-addr. That will leave something else
inaccessible instead, but there's not getting around that - we can't
create a free IP out of thin air[0].
> If you agree with my above analysis I can go ahead and post a v13 of
> the series early next week, including the above changes and fixes
> for your other comments to the series.
With revisions noted above, then yes.
[0] Using link-local addresses, maybe we can, but doing that is a
larger project. And, to do so implies that the "link" is just from
the guest to passt, not encompassing the host's neighbourhood, which
doesn't really match with preserving MAC addresses in the first place.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents
2025-10-03 0:34 ` [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-10-03 1:03 ` Jon Maloy
2025-10-03 4:31 ` David Gibson
@ 2025-10-06 22:40 ` Stefano Brivio
2 siblings, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:40 UTC (permalink / raw)
To: Jon Maloy; +Cc: dgibson, david, passt-dev
On Thu, 2 Oct 2025 20:34:05 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> We add a cache table to keep track of the contents of the kernel ARP
> and NDP tables. The table is fed from the just introduced netlink based
> neigbour subscription function. The new table eliminates the need for
> explicit netlink calls to find a host's MAC address.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v5: - Moved to earlier in series to reduce rebase conflicts
> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit
> - Removed hash lookup. We now only use linear lookup in a
> linked list
> - Eliminated dynamic memory allocation.
> - Ensured there is only one call to clock_gettime()
> - Using MAC_ZERO instead of the previously dedicated definitions
> v7: - NOW using MAC_ZERO where needed
> - I am still using linear back-off for empty cache entries. Even
> an incoming, flow-creating packet from a local host gives no
> guarantee that its MAC address is in the ARP table, so we must
> allow for a few new attempts at first possible occasions. Only
> after several failed lookups can we conclude that we probably
> never will succeed. Hence the back-off.
> - Fixed a bug that David inadvertently made me aware of: I only
> intended to set the initial expiry value to MAC_CACHE_RENEWAL
> when an ARP/NDP table lookup was successful.
> - Improved struct and function description comments.
> v8: - Total re-design of table, adapting to the new, subscription
> based way of updating it.
> v9: - Catering for MAC address change for an existing host.
> v10: - Changes according to feedback from David Gibson
> v12: - Changes according to feedback from David and Stefano
> - Added dummy entries for loopback and default GW addresses
> ---
> conf.c | 1 +
> fwd.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fwd.h | 7 +++
> netlink.c | 7 ++-
> 4 files changed, 195 insertions(+), 2 deletions(-)
>
> diff --git a/conf.c b/conf.c
> index 66b9e63..3224ee6 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv)
> c->udp.fwd_out.mode = fwd_default;
>
> fwd_scan_ports_init(c);
> + fwd_neigh_table_init(c);
>
> if (!c->quiet)
> conf_print(c);
> diff --git a/fwd.c b/fwd.c
> index 250cf56..c34bb1c 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -33,6 +33,188 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
>
> #define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
>
> +#define NEIGH_TABLE_SLOTS 1024 /* Must be power of two */
> +#define NEIGH_TABLE_SIZE (NEIGH_TABLE_SLOTS / 2)
> +static_assert((NEIGH_TABLE_SLOTS & (NEIGH_TABLE_SLOTS - 1)) == 0,
> + "NEIGH_TABLE_SLOTS must be a power of two");
> +
> +/**
> + * struct neigh_table_entry - Entry in the ARP/NDP table
Nit: excess whitespace before "Entry".
> + * @next: Next entry in slot or free list
> + * @addr: IP address of represented host
> + * @mac: MAC address of represented host
> + */
> +struct neigh_table_entry {
> + struct neigh_table_entry *next;
> + union inany_addr addr;
> + uint8_t mac[ETH_ALEN];
> +};
> +
> +/**
> + * struct neigh_table - Cache of ARP/NDP table contents
Same here.
> + * @entries: Entries to be plugged into the hash slots when allocated
> + * @slots: Hash table slots
> + * @free: Linked list of unused entries
> + */
> +struct neigh_table {
> + struct neigh_table_entry entries[NEIGH_TABLE_SIZE];
> + struct neigh_table_entry *slots[NEIGH_TABLE_SLOTS];
> + struct neigh_table_entry *free;
> +};
> +
> +static struct neigh_table neigh_table;
> +
> +/**
> + * neigh_table_slot() - Hash key to a number within the table range
> + * @c: Execution context
> + * @key: The key to be used for the hash
> + *
> + * Return: The resulting hash value
Nit: s/The/the/, it's part of the same statement / sentence.
> + */
> +static size_t neigh_table_slot(const struct ctx *c,
> + const union inany_addr *key)
> +{
> + struct siphash_state st = SIPHASH_INIT(c->hash_secret);
> + uint32_t i;
> +
> + inany_siphash_feed(&st, key);
> + i = siphash_final(&st, sizeof(*key), 0);
> +
> + return ((size_t)i) & (NEIGH_TABLE_SIZE - 1);
> +}
> +
> +/**
> + * fwd_neigh_table_find() - Find a MAC table entry
> + * @c: Execution context
> + * @addr: Neighbour address to be used as key for the lookup
> + *
> + * Return: The matching entry, if found. Otherwise NULL.
Same here.
> + */
> +static struct neigh_table_entry *fwd_neigh_table_find(const struct ctx *c,
> + const union inany_addr *addr)
> +{
> + size_t slot = neigh_table_slot(c, addr);
> + struct neigh_table_entry *e = neigh_table.slots[slot];
> +
> + while (e && !inany_equals(&e->addr, addr))
> + e = e->next;
> +
> + return e;
> +}
> +
> +/**
> + * fwd_neigh_table_update() - Allocate or update neighbour table entry
> + * @c: Execution context
> + * @addr: IP address used to determine insertion slot and store in entry
> + * @mac: The MAC address associated with the neighbour address
> + */
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac)
> +{
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e;
> + ssize_t slot;
> +
> + /* MAC address might change sometimes */
> + e = fwd_neigh_table_find(c, addr);
> + if (e) {
> + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
> + return;
> + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
> + return;
> +
> + memcpy(e->mac, mac, ETH_ALEN);
> + return;
> + }
> +
> + e = t->free;
> + if (!e) {
> + debug("Failed to allocate neighbour table entry");
> + return;
> + }
> + t->free = e->next;
> +
> + slot = neigh_table_slot(c, addr);
> + e->next = t->slots[slot];
> + t->slots[slot] = e;
> +
> + memcpy(&e->addr, addr, sizeof(*addr));
> + memcpy(e->mac, mac, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_table_free() - Remove an entry from a slot and add it to free list
> + * @c: Execution context
> + * @addr: IP address used to find the slot for the entry
> + */
> +void fwd_neigh_table_free(const struct ctx *c, const union inany_addr *addr)
> +{
> + ssize_t slot = neigh_table_slot(c, addr);
> + struct neigh_table *t = &neigh_table;
> + struct neigh_table_entry *e, **prev;
> +
> + prev = &t->slots[slot];
> + e = t->slots[slot];
> + while (e && !inany_equals(&e->addr, addr)) {
> + prev = &e->next;
> + e = e->next;
> + }
> + if (!e)
> + return;
> +
> + *prev = e->next;
> + e->next = t->free;
> + t->free = e;
> + memset(&e->addr, 0, sizeof(*addr));
> + memset(e->mac, 0, ETH_ALEN);
> +}
> +
> +/**
> + * fwd_neigh_mac_get() - Look up MAC address in the ARP/NDP table
> + * @c: Execution context
> + * @addr: Neighbour IP address used as lookup key
> + * @mac: Buffer for Ethernet MAC to return, found or default value.
Nit: trailing '.'
The whole description isn't really clear to me. It sounds like there
are three possible alternatives, that is, "return", "found", or
"default". Maybe simply:
Buffer for returned MAC address
?
> + *
> + * Return: true if real MAC found, false if not found or if failure
This is a MAC address, not the media access control itself (you can't
really... find it).
> + */
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac)
As I pointed out on v11: you seem to be ignoring the return value
everywhere in this series, so returning a value here seems to be a
left-over. I'm not reviewing past this point.
> +{
> + const struct neigh_table_entry *e = fwd_neigh_table_find(c, addr);
> +
> + if (e)
> + memcpy(mac, e->mac, ETH_ALEN);
> + else
> + memcpy(mac, c->our_tap_mac, ETH_ALEN);
> +
> + return !!e;
> +}
> +
> +/**
> + * fwd_neigh_table_init() - Initialize the neighbour table
> + * @c: Execution context
> + */
> +void fwd_neigh_table_init(const struct ctx *c)
> +{
> + struct neigh_table *t = &neigh_table;
> + const uint8_t *omac = c->our_tap_mac;
> + struct neigh_table_entry *e;
> + int i;
> +
> + memset(t, 0, sizeof(*t));
> + for (i = 0; i < NEIGH_TABLE_SIZE; i++) {
> + e = &t->entries[i];
> + e->next = t->free;
> + t->free = e;
> + }
> +
> + /* These addresses must always map to our own MAC address */
> + fwd_neigh_table_update(c, &inany_loopback4, omac);
> + fwd_neigh_table_update(c, &inany_loopback6, omac);
> + fwd_neigh_table_update(c, &inany_from_v4(c->ip4.guest_gw), omac);
> + fwd_neigh_table_update(c, (union inany_addr *)&c->ip6.guest_gw, omac);
> +}
> +
> /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral
> *
> * Work out what ports the host thinks are emphemeral and record it for later
> diff --git a/fwd.h b/fwd.h
> index 65c7c96..6ca743c 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> +void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> + const uint8_t *mac);
> +void fwd_neigh_table_free(const struct ctx *c,
> + const union inany_addr *addr);
> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr,
> + uint8_t *mac);
> +void fwd_neigh_table_init(const struct ctx *c);
>
> #endif /* FWD_H */
> diff --git a/netlink.c b/netlink.c
> index 3fe2fdd..4be5fcf 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -1192,10 +1192,13 @@ static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> inany_from_af(&addr, ndm->ndm_family, dst);
> inany_ntop(dst, ip_str, sizeof(ip_str));
>
> - if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID)
> + if (nh->nlmsg_type == RTM_NEWNEIGH && ndm->ndm_state & NUD_VALID) {
> trace("neigh table update: %s / %s", ip_str, mac_str);
> - else
> + fwd_neigh_table_update(c, &addr, mac);
> + } else {
> trace("neigh table delete: %s / %s", ip_str, mac_str);
> + fwd_neigh_table_free(c, &addr);
> + }
> }
>
> /**
--
Stefano
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
2025-10-03 0:34 ` [PATCH v12 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
2025-10-03 0:34 ` [PATCH v12 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:41 ` David Gibson
2025-10-06 22:51 ` Stefano Brivio
2025-10-03 0:34 ` [PATCH v12 4/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
` (6 subsequent siblings)
9 siblings, 2 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
ARP announcements and unsolicited NAs should be handled with caution
because of the risk of malignant users emitting them to disturb
network communication.
There is however one case we where we know it is legitimate
and safe for us to send out such messages: The one time we switch
from using ctx->own_tap_mac to a MAC address received via the
recently added neigbour subscription function. Later changes to
the MAC address of a host in an existing entry cannot be fully
trusted, so we abstain from doing it in such cases.
When sending this type of messages, we notice that the guest accepts
the update, but shortly later asks for a confirmation in the form of
a regular ARP/NS request. This is responded to with the new value,
and we have exactly the effect we wanted.
This commit adds this functionality.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v10: -Made small changes based of feedback from David G.
v11: -Moved from 'Gratuitous ARP reply' model to 'ARP Announcement'
model.
v12: -Excluding loopback and default GW addresses from the ARP/NA
announcement to be sent to the guest
---
arp.c | 42 ++++++++++++++++++++++++++++++++++++++++++
arp.h | 2 ++
fwd.c | 16 ++++++++++++++++
ndp.c | 10 ++++++++++
ndp.h | 1 +
5 files changed, 71 insertions(+)
diff --git a/arp.c b/arp.c
index ad088b1..b08780f 100644
--- a/arp.c
+++ b/arp.c
@@ -146,3 +146,45 @@ void arp_send_init_req(const struct ctx *c)
debug("Sending initial ARP request for guest MAC address");
tap_send_single(c, &req, sizeof(req));
}
+
+/**
+ * arp_announce() - Send an ARP announcement for an IPv4 host
+ * @c: Execution context
+ * @ip: IPv4 address we announce as owned by @mac
+ * @mac: MAC address to advertise for @ip
+ */
+void arp_announce(const struct ctx *c, struct in_addr *ip,
+ const unsigned char *mac)
+{
+ char ip_str[INET_ADDRSTRLEN];
+ char mac_str[ETH_ADDRSTRLEN];
+ struct {
+ struct ethhdr eh;
+ struct arphdr ah;
+ struct arpmsg am;
+ } __attribute__((__packed__)) annc;
+
+ /* Ethernet header */
+ annc.eh.h_proto = htons(ETH_P_ARP);
+ memcpy(annc.eh.h_dest, MAC_BROADCAST, sizeof(annc.eh.h_dest));
+ memcpy(annc.eh.h_source, mac, sizeof(annc.eh.h_source));
+
+ /* ARP header */
+ annc.ah.ar_op = htons(ARPOP_REQUEST);
+ annc.ah.ar_hrd = htons(ARPHRD_ETHER);
+ annc.ah.ar_pro = htons(ETH_P_IP);
+ annc.ah.ar_hln = ETH_ALEN;
+ annc.ah.ar_pln = 4;
+
+ /* ARP message */
+ memcpy(annc.am.sha, mac, sizeof(annc.am.sha));
+ memcpy(annc.am.sip, ip, sizeof(annc.am.sip));
+ memcpy(annc.am.tha, MAC_BROADCAST, sizeof(annc.am.tha));
+ memcpy(annc.am.tip, ip, sizeof(annc.am.tip));
+
+ inet_ntop(AF_INET, ip, ip_str, sizeof(ip_str));
+ eth_ntop(mac, mac_str, sizeof(mac_str));
+ debug("Announcing ARP for %s / %s", ip_str, mac_str);
+
+ tap_send_single(c, &annc, sizeof(annc));
+}
diff --git a/arp.h b/arp.h
index d5ad0e1..4862e90 100644
--- a/arp.h
+++ b/arp.h
@@ -22,5 +22,7 @@ struct arpmsg {
int arp(const struct ctx *c, struct iov_tail *data);
void arp_send_init_req(const struct ctx *c);
+void arp_announce(const struct ctx *c, struct in_addr *ip,
+ const unsigned char *mac);
#endif /* ARP_H */
diff --git a/fwd.c b/fwd.c
index c34bb1c..ade97c8 100644
--- a/fwd.c
+++ b/fwd.c
@@ -26,6 +26,8 @@
#include "passt.h"
#include "lineread.h"
#include "flow_table.h"
+#include "arp.h"
+#include "ndp.h"
/* Empheral port range: values from RFC 6335 */
static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
@@ -140,6 +142,20 @@ void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
memcpy(&e->addr, addr, sizeof(*addr));
memcpy(e->mac, mac, ETH_ALEN);
+
+ if (inany_equals(addr, &inany_loopback4))
+ return;
+ if (inany_equals(addr, &inany_loopback6))
+ return;
+ if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
+ return;
+ if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
+ return;
+
+ if (inany_v4(addr))
+ arp_announce(c, inany_v4(addr), e->mac);
+ else
+ ndp_send_unsolicited_na(c, &addr->a6);
}
/**
diff --git a/ndp.c b/ndp.c
index 588b48f..d7f64a3 100644
--- a/ndp.c
+++ b/ndp.c
@@ -218,6 +218,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
ndp_send(c, dst, &na, sizeof(na));
}
+/**
+ * ndp_send_unsolicited_na() - Send unsolicited NA
+ * @c: Execution context
+ * @addr: IPv6 address to advertise
+ */
+void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
+{
+ ndp_na(c, &in6addr_ll_all_nodes, addr);
+}
+
/**
* ndp_ra() - Send an NDP Router Advertisement (RA) message
* @c: Execution context
diff --git a/ndp.h b/ndp.h
index 781ea86..320009c 100644
--- a/ndp.h
+++ b/ndp.h
@@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
struct iov_tail *data);
void ndp_timer(const struct ctx *c, const struct timespec *now);
void ndp_send_init_req(const struct ctx *c);
+void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
#endif /* NDP_H */
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added
2025-10-03 0:34 ` [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
@ 2025-10-03 4:41 ` David Gibson
2025-10-07 10:10 ` Stefano Brivio
2025-10-06 22:51 ` Stefano Brivio
1 sibling, 1 reply; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:41 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 6405 bytes --]
On Thu, Oct 02, 2025 at 08:34:06PM -0400, Jon Maloy wrote:
> ARP announcements and unsolicited NAs should be handled with caution
> because of the risk of malignant users emitting them to disturb
> network communication.
>
> There is however one case we where we know it is legitimate
> and safe for us to send out such messages: The one time we switch
> from using ctx->own_tap_mac to a MAC address received via the
> recently added neigbour subscription function. Later changes to
> the MAC address of a host in an existing entry cannot be fully
> trusted, so we abstain from doing it in such cases.
>
> When sending this type of messages, we notice that the guest accepts
> the update, but shortly later asks for a confirmation in the form of
> a regular ARP/NS request. This is responded to with the new value,
> and we have exactly the effect we wanted.
>
> This commit adds this functionality.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v10: -Made small changes based of feedback from David G.
> v11: -Moved from 'Gratuitous ARP reply' model to 'ARP Announcement'
> model.
> v12: -Excluding loopback and default GW addresses from the ARP/NA
> announcement to be sent to the guest
> ---
> arp.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> arp.h | 2 ++
> fwd.c | 16 ++++++++++++++++
> ndp.c | 10 ++++++++++
> ndp.h | 1 +
> 5 files changed, 71 insertions(+)
>
> diff --git a/arp.c b/arp.c
> index ad088b1..b08780f 100644
> --- a/arp.c
> +++ b/arp.c
> @@ -146,3 +146,45 @@ void arp_send_init_req(const struct ctx *c)
> debug("Sending initial ARP request for guest MAC address");
> tap_send_single(c, &req, sizeof(req));
> }
> +
> +/**
> + * arp_announce() - Send an ARP announcement for an IPv4 host
> + * @c: Execution context
> + * @ip: IPv4 address we announce as owned by @mac
> + * @mac: MAC address to advertise for @ip
> + */
> +void arp_announce(const struct ctx *c, struct in_addr *ip,
> + const unsigned char *mac)
> +{
> + char ip_str[INET_ADDRSTRLEN];
> + char mac_str[ETH_ADDRSTRLEN];
> + struct {
> + struct ethhdr eh;
> + struct arphdr ah;
> + struct arpmsg am;
> + } __attribute__((__packed__)) annc;
> +
> + /* Ethernet header */
> + annc.eh.h_proto = htons(ETH_P_ARP);
> + memcpy(annc.eh.h_dest, MAC_BROADCAST, sizeof(annc.eh.h_dest));
> + memcpy(annc.eh.h_source, mac, sizeof(annc.eh.h_source));
> +
> + /* ARP header */
> + annc.ah.ar_op = htons(ARPOP_REQUEST);
> + annc.ah.ar_hrd = htons(ARPHRD_ETHER);
> + annc.ah.ar_pro = htons(ETH_P_IP);
> + annc.ah.ar_hln = ETH_ALEN;
> + annc.ah.ar_pln = 4;
> +
> + /* ARP message */
> + memcpy(annc.am.sha, mac, sizeof(annc.am.sha));
> + memcpy(annc.am.sip, ip, sizeof(annc.am.sip));
> + memcpy(annc.am.tha, MAC_BROADCAST, sizeof(annc.am.tha));
> + memcpy(annc.am.tip, ip, sizeof(annc.am.tip));
As noted in several earlier revisions, having sip == tip (but with
different mac addresses) looks odd. Is that what the RFCs say to do
for ARP announcements?
> + inet_ntop(AF_INET, ip, ip_str, sizeof(ip_str));
> + eth_ntop(mac, mac_str, sizeof(mac_str));
> + debug("Announcing ARP for %s / %s", ip_str, mac_str);
> +
> + tap_send_single(c, &annc, sizeof(annc));
> +}
> diff --git a/arp.h b/arp.h
> index d5ad0e1..4862e90 100644
> --- a/arp.h
> +++ b/arp.h
> @@ -22,5 +22,7 @@ struct arpmsg {
>
> int arp(const struct ctx *c, struct iov_tail *data);
> void arp_send_init_req(const struct ctx *c);
> +void arp_announce(const struct ctx *c, struct in_addr *ip,
> + const unsigned char *mac);
>
> #endif /* ARP_H */
> diff --git a/fwd.c b/fwd.c
> index c34bb1c..ade97c8 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -26,6 +26,8 @@
> #include "passt.h"
> #include "lineread.h"
> #include "flow_table.h"
> +#include "arp.h"
> +#include "ndp.h"
>
> /* Empheral port range: values from RFC 6335 */
> static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> @@ -140,6 +142,20 @@ void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
>
> memcpy(&e->addr, addr, sizeof(*addr));
> memcpy(e->mac, mac, ETH_ALEN);
> +
> + if (inany_equals(addr, &inany_loopback4))
> + return;
> + if (inany_equals(addr, &inany_loopback6))
> + return;
Since you need these explicit checks anyway, there's not much point to
the dummy entries you created - you could exit on these addresses
before even looking up the table.
> + if (inany_equals(addr, &inany_from_v4(c->ip4.guest_gw)))
> + return;
> + if (inany_equals(addr, (union inany_addr *)&c->ip6.guest_gw))
> + return;
Again, guest addresses, so they don't make sense here.
> +
> + if (inany_v4(addr))
> + arp_announce(c, inany_v4(addr), e->mac);
> + else
> + ndp_send_unsolicited_na(c, &addr->a6);
Oh... this is a guest side operation but we have host side addresses.
So we need to apply a nat_inbound() first. So we might need to filter
on that retranslated address instead.
> }
>
> /**
> diff --git a/ndp.c b/ndp.c
> index 588b48f..d7f64a3 100644
> --- a/ndp.c
> +++ b/ndp.c
> @@ -218,6 +218,16 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
> ndp_send(c, dst, &na, sizeof(na));
> }
>
> +/**
> + * ndp_send_unsolicited_na() - Send unsolicited NA
> + * @c: Execution context
> + * @addr: IPv6 address to advertise
> + */
> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr)
Maybe just 'ndp_unsolicited_na', by analogy with ndp_na() itself.
> +{
> + ndp_na(c, &in6addr_ll_all_nodes, addr);
> +}
> +
> /**
> * ndp_ra() - Send an NDP Router Advertisement (RA) message
> * @c: Execution context
> diff --git a/ndp.h b/ndp.h
> index 781ea86..320009c 100644
> --- a/ndp.h
> +++ b/ndp.h
> @@ -12,5 +12,6 @@ int ndp(const struct ctx *c, const struct in6_addr *saddr,
> struct iov_tail *data);
> void ndp_timer(const struct ctx *c, const struct timespec *now);
> void ndp_send_init_req(const struct ctx *c);
> +void ndp_send_unsolicited_na(const struct ctx *c, const struct in6_addr *addr);
>
> #endif /* NDP_H */
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added
2025-10-03 4:41 ` David Gibson
@ 2025-10-07 10:10 ` Stefano Brivio
0 siblings, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2025-10-07 10:10 UTC (permalink / raw)
To: David Gibson; +Cc: Jon Maloy, dgibson, passt-dev
On Fri, 3 Oct 2025 14:41:56 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, Oct 02, 2025 at 08:34:06PM -0400, Jon Maloy wrote:
> > ARP announcements and unsolicited NAs should be handled with caution
> > because of the risk of malignant users emitting them to disturb
> > network communication.
> >
> > There is however one case we where we know it is legitimate
> > and safe for us to send out such messages: The one time we switch
> > from using ctx->own_tap_mac to a MAC address received via the
> > recently added neigbour subscription function. Later changes to
> > the MAC address of a host in an existing entry cannot be fully
> > trusted, so we abstain from doing it in such cases.
> >
> > When sending this type of messages, we notice that the guest accepts
> > the update, but shortly later asks for a confirmation in the form of
> > a regular ARP/NS request. This is responded to with the new value,
> > and we have exactly the effect we wanted.
> >
> > This commit adds this functionality.
> >
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> >
> > ---
> > v10: -Made small changes based of feedback from David G.
> > v11: -Moved from 'Gratuitous ARP reply' model to 'ARP Announcement'
> > model.
> > v12: -Excluding loopback and default GW addresses from the ARP/NA
> > announcement to be sent to the guest
> > ---
> > arp.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > arp.h | 2 ++
> > fwd.c | 16 ++++++++++++++++
> > ndp.c | 10 ++++++++++
> > ndp.h | 1 +
> > 5 files changed, 71 insertions(+)
> >
> > diff --git a/arp.c b/arp.c
> > index ad088b1..b08780f 100644
> > --- a/arp.c
> > +++ b/arp.c
> > @@ -146,3 +146,45 @@ void arp_send_init_req(const struct ctx *c)
> > debug("Sending initial ARP request for guest MAC address");
> > tap_send_single(c, &req, sizeof(req));
> > }
> > +
> > +/**
> > + * arp_announce() - Send an ARP announcement for an IPv4 host
> > + * @c: Execution context
> > + * @ip: IPv4 address we announce as owned by @mac
> > + * @mac: MAC address to advertise for @ip
> > + */
> > +void arp_announce(const struct ctx *c, struct in_addr *ip,
> > + const unsigned char *mac)
> > +{
> > + char ip_str[INET_ADDRSTRLEN];
> > + char mac_str[ETH_ADDRSTRLEN];
> > + struct {
> > + struct ethhdr eh;
> > + struct arphdr ah;
> > + struct arpmsg am;
> > + } __attribute__((__packed__)) annc;
> > +
> > + /* Ethernet header */
> > + annc.eh.h_proto = htons(ETH_P_ARP);
> > + memcpy(annc.eh.h_dest, MAC_BROADCAST, sizeof(annc.eh.h_dest));
> > + memcpy(annc.eh.h_source, mac, sizeof(annc.eh.h_source));
> > +
> > + /* ARP header */
> > + annc.ah.ar_op = htons(ARPOP_REQUEST);
> > + annc.ah.ar_hrd = htons(ARPHRD_ETHER);
> > + annc.ah.ar_pro = htons(ETH_P_IP);
> > + annc.ah.ar_hln = ETH_ALEN;
> > + annc.ah.ar_pln = 4;
> > +
> > + /* ARP message */
> > + memcpy(annc.am.sha, mac, sizeof(annc.am.sha));
> > + memcpy(annc.am.sip, ip, sizeof(annc.am.sip));
> > + memcpy(annc.am.tha, MAC_BROADCAST, sizeof(annc.am.tha));
> > + memcpy(annc.am.tip, ip, sizeof(annc.am.tip));
>
> As noted in several earlier revisions, having sip == tip (but with
> different mac addresses) looks odd. Is that what the RFCs say to do
> for ARP announcements?
>
> > + inet_ntop(AF_INET, ip, ip_str, sizeof(ip_str));
> > + eth_ntop(mac, mac_str, sizeof(mac_str));
> > + debug("Announcing ARP for %s / %s", ip_str, mac_str);
> > +
> > + tap_send_single(c, &annc, sizeof(annc));
> > +}
> > diff --git a/arp.h b/arp.h
> > index d5ad0e1..4862e90 100644
> > --- a/arp.h
> > +++ b/arp.h
> > @@ -22,5 +22,7 @@ struct arpmsg {
> >
> > int arp(const struct ctx *c, struct iov_tail *data);
> > void arp_send_init_req(const struct ctx *c);
> > +void arp_announce(const struct ctx *c, struct in_addr *ip,
> > + const unsigned char *mac);
> >
> > #endif /* ARP_H */
> > diff --git a/fwd.c b/fwd.c
> > index c34bb1c..ade97c8 100644
> > --- a/fwd.c
> > +++ b/fwd.c
> > @@ -26,6 +26,8 @@
> > #include "passt.h"
> > #include "lineread.h"
> > #include "flow_table.h"
> > +#include "arp.h"
> > +#include "ndp.h"
> >
> > /* Empheral port range: values from RFC 6335 */
> > static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14);
> > @@ -140,6 +142,20 @@ void fwd_neigh_table_update(const struct ctx *c, const union inany_addr *addr,
> >
> > memcpy(&e->addr, addr, sizeof(*addr));
> > memcpy(e->mac, mac, ETH_ALEN);
> > +
> > + if (inany_equals(addr, &inany_loopback4))
> > + return;
> > + if (inany_equals(addr, &inany_loopback6))
> > + return;
>
> Since you need these explicit checks anyway, there's not much point to
> the dummy entries you created - you could exit on these addresses
> before even looking up the table.
I guess those entries make sense if we can drop all these checks as a
result. I think we should be able to.
But if we can't, then yes, let's just go with explicit checks
everywhere, so that it's consistent.
--
Stefano
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added
2025-10-03 0:34 ` [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
2025-10-03 4:41 ` David Gibson
@ 2025-10-06 22:51 ` Stefano Brivio
1 sibling, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:51 UTC (permalink / raw)
To: Jon Maloy; +Cc: dgibson, david, passt-dev
Nit, in the title: neighbour.
On Thu, 2 Oct 2025 20:34:06 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> ARP announcements and unsolicited NAs should be handled with caution
> because of the risk of malignant users emitting them to disturb
> network communication.
>
> There is however one case we where we know it is legitimate
> and safe for us to send out such messages: The one time we switch
> from using ctx->own_tap_mac to a MAC address received via the
> recently added neigbour subscription function. Later changes to
...same here.
Other than this, and all the pending comments (including changes to
consistently represent guest-side link-layer addresses), and the
remaining part of 2/9, the rest looks good to me.
--
Stefano
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 4/9] arp/ndp: respond with true MAC address of LAN local remote hosts
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (2 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 3/9] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:48 ` David Gibson
2025-10-03 0:34 ` [PATCH v12 5/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
` (5 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
When we receive an ARP request or NDP neigbour solicitation over
the tap interface for a host on the local network segment attached
to the template interface, we respond with that host's real MAC
address, if available.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Added helper function to find out if a remote ip address is subject
to NAT. This filters out local host addresses which should be
presented with the passt/pasta local MAC address 9a:55:9a:55:9a:55 even
though it is on the local segment.
- Adapted to the change in nl_mac_get() function, so that we now consider
only the template interface when checking the ARP/NDP table.
v4: - Moved NAT check into the function nat_outbound() to obtain more
precise criteria for when NAT is used. We may in theory
have NAT even if original and translated addresses are equal, and
we want to catch this case.
- I chose to keep the wrapper funtion inany_nat(), but moved it to
fwd.h/fwd.c and renamed it to fwd_inany_nat().
v5: - Simplified criteria for when we do ARP/NDP lookup. Now, we
just try with the potentially translated address after an
attempted NAT check.
- Using the new ARP/NDP cache table instead of using netlink
directly.
v6: - Fixes after feedback from David:
- Renamed nat_outbound() to fwd_nat_outbound()
- Eliminated unnecessary temporary variable in arp.c::arp()
v12: - Some minor changes after feedback from Stefano
---
arp.c | 9 +++++++--
fwd.c | 8 ++++----
fwd.h | 2 ++
inany.c | 1 +
ndp.c | 8 ++++++++
5 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/arp.c b/arp.c
index b08780f..a95c377 100644
--- a/arp.c
+++ b/arp.c
@@ -69,6 +69,7 @@ static bool ignore_arp(const struct ctx *c,
*/
int arp(const struct ctx *c, struct iov_tail *data)
{
+ union inany_addr tgt, tgt_nat;
struct {
struct ethhdr eh;
struct arphdr ah;
@@ -102,8 +103,12 @@ int arp(const struct ctx *c, struct iov_tail *data)
resp.ah.ar_hln = ah->ar_hln;
resp.ah.ar_pln = ah->ar_pln;
- /* ARP message */
- memcpy(resp.am.sha, c->our_tap_mac, sizeof(resp.am.sha));
+ /* MAC address to return in ARP message */
+ inany_from_af(&tgt, AF_INET, am->tip);
+ fwd_nat_outbound(c, &tgt, &tgt_nat);
+ fwd_neigh_mac_get(c, &tgt_nat, resp.am.sha);
+
+ /* Rest of ARP message */
memcpy(resp.am.sip, am->tip, sizeof(resp.am.sip));
memcpy(resp.am.tha, am->sha, sizeof(resp.am.tha));
memcpy(resp.am.tip, am->sip, sizeof(resp.am.tip));
diff --git a/fwd.c b/fwd.c
index ade97c8..a69d84d 100644
--- a/fwd.c
+++ b/fwd.c
@@ -522,7 +522,7 @@ static bool fwd_guest_accessible(const struct ctx *c,
}
/**
- * nat_outbound() - Apply address translation for outbound (TAP to HOST)
+ * fwd_nat_outbound() - Apply address translation for outbound (TAP to HOST)
* @c: Execution context
* @addr: Input address (as seen on TAP interface)
* @translated: Output address (as seen on HOST interface)
@@ -530,8 +530,8 @@ static bool fwd_guest_accessible(const struct ctx *c,
* Only handles translations that depend *only* on the address. Anything
* related to specific ports or flows is handled elsewhere.
*/
-static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
- union inany_addr *translated)
+void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
+ union inany_addr *translated)
{
if (inany_equals4(addr, &c->ip4.map_host_loopback))
*translated = inany_loopback4;
@@ -565,7 +565,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
inany_equals6(&ini->oaddr, &c->ip6.dns_match))
tgt->eaddr.a6 = c->ip6.dns_host;
else
- nat_outbound(c, &ini->oaddr, &tgt->eaddr);
+ fwd_nat_outbound(c, &ini->oaddr, &tgt->eaddr);
tgt->eport = ini->oport;
diff --git a/fwd.h b/fwd.h
index 6ca743c..f91fa55 100644
--- a/fwd.h
+++ b/fwd.h
@@ -50,6 +50,8 @@ void fwd_scan_ports_init(struct ctx *c);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated);
+void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
+ union inany_addr *translated);
uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
const struct flowside *ini, struct flowside *tgt);
uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
diff --git a/inany.c b/inany.c
index 65a39f9..7680439 100644
--- a/inany.c
+++ b/inany.c
@@ -16,6 +16,7 @@
#include "ip.h"
#include "siphash.h"
#include "inany.h"
+#include "fwd.h"
const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT);
const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
diff --git a/ndp.c b/ndp.c
index d7f64a3..6bba0a8 100644
--- a/ndp.c
+++ b/ndp.c
@@ -196,6 +196,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
const struct in6_addr *addr)
{
+ union inany_addr tgt, tgt_nat;
struct ndp_na na = {
.ih = {
.icmp6_type = NA,
@@ -215,6 +216,13 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
+ /* Respond with recorded MAC address if remote host's address
+ * or NAT translated address can be found in NDP table.
+ */
+ inany_from_af(&tgt, AF_INET6, addr);
+ fwd_nat_outbound(c, &tgt, &tgt_nat);
+ fwd_neigh_mac_get(c, &tgt_nat, na.target_l2_addr.mac);
+
ndp_send(c, dst, &na, sizeof(na));
}
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 4/9] arp/ndp: respond with true MAC address of LAN local remote hosts
2025-10-03 0:34 ` [PATCH v12 4/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
@ 2025-10-03 4:48 ` David Gibson
0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:48 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 6787 bytes --]
On Thu, Oct 02, 2025 at 08:34:07PM -0400, Jon Maloy wrote:
> When we receive an ARP request or NDP neigbour solicitation over
> the tap interface for a host on the local network segment attached
> to the template interface, we respond with that host's real MAC
> address, if available.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> ---
> v3: - Added helper function to find out if a remote ip address is subject
> to NAT. This filters out local host addresses which should be
> presented with the passt/pasta local MAC address 9a:55:9a:55:9a:55 even
> though it is on the local segment.
> - Adapted to the change in nl_mac_get() function, so that we now consider
> only the template interface when checking the ARP/NDP table.
> v4: - Moved NAT check into the function nat_outbound() to obtain more
> precise criteria for when NAT is used. We may in theory
> have NAT even if original and translated addresses are equal, and
> we want to catch this case.
> - I chose to keep the wrapper funtion inany_nat(), but moved it to
> fwd.h/fwd.c and renamed it to fwd_inany_nat().
> v5: - Simplified criteria for when we do ARP/NDP lookup. Now, we
> just try with the potentially translated address after an
> attempted NAT check.
> - Using the new ARP/NDP cache table instead of using netlink
> directly.
> v6: - Fixes after feedback from David:
> - Renamed nat_outbound() to fwd_nat_outbound()
> - Eliminated unnecessary temporary variable in arp.c::arp()
> v12: - Some minor changes after feedback from Stefano
> ---
> arp.c | 9 +++++++--
> fwd.c | 8 ++++----
> fwd.h | 2 ++
> inany.c | 1 +
> ndp.c | 8 ++++++++
> 5 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/arp.c b/arp.c
> index b08780f..a95c377 100644
> --- a/arp.c
> +++ b/arp.c
> @@ -69,6 +69,7 @@ static bool ignore_arp(const struct ctx *c,
> */
> int arp(const struct ctx *c, struct iov_tail *data)
> {
> + union inany_addr tgt, tgt_nat;
> struct {
> struct ethhdr eh;
> struct arphdr ah;
> @@ -102,8 +103,12 @@ int arp(const struct ctx *c, struct iov_tail *data)
> resp.ah.ar_hln = ah->ar_hln;
> resp.ah.ar_pln = ah->ar_pln;
>
> - /* ARP message */
> - memcpy(resp.am.sha, c->our_tap_mac, sizeof(resp.am.sha));
> + /* MAC address to return in ARP message */
> + inany_from_af(&tgt, AF_INET, am->tip);
> + fwd_nat_outbound(c, &tgt, &tgt_nat);
> + fwd_neigh_mac_get(c, &tgt_nat, resp.am.sha);
> + /* Rest of ARP message */
> memcpy(resp.am.sip, am->tip, sizeof(resp.am.sip));
> memcpy(resp.am.tha, am->sha, sizeof(resp.am.tha));
> memcpy(resp.am.tip, am->sip, sizeof(resp.am.tip));
> diff --git a/fwd.c b/fwd.c
> index ade97c8..a69d84d 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -522,7 +522,7 @@ static bool fwd_guest_accessible(const struct ctx *c,
> }
>
> /**
> - * nat_outbound() - Apply address translation for outbound (TAP to HOST)
> + * fwd_nat_outbound() - Apply address translation for outbound (TAP to HOST)
> * @c: Execution context
> * @addr: Input address (as seen on TAP interface)
> * @translated: Output address (as seen on HOST interface)
> @@ -530,8 +530,8 @@ static bool fwd_guest_accessible(const struct ctx *c,
> * Only handles translations that depend *only* on the address. Anything
> * related to specific ports or flows is handled elsewhere.
> */
> -static void nat_outbound(const struct ctx *c, const union inany_addr *addr,
> - union inany_addr *translated)
> +void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
> + union inany_addr *translated)
> {
> if (inany_equals4(addr, &c->ip4.map_host_loopback))
> *translated = inany_loopback4;
> @@ -565,7 +565,7 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
> inany_equals6(&ini->oaddr, &c->ip6.dns_match))
> tgt->eaddr.a6 = c->ip6.dns_host;
> else
> - nat_outbound(c, &ini->oaddr, &tgt->eaddr);
> + fwd_nat_outbound(c, &ini->oaddr, &tgt->eaddr);
>
> tgt->eport = ini->oport;
>
> diff --git a/fwd.h b/fwd.h
> index 6ca743c..f91fa55 100644
> --- a/fwd.h
> +++ b/fwd.h
> @@ -50,6 +50,8 @@ void fwd_scan_ports_init(struct ctx *c);
>
> bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
> union inany_addr *translated);
> +void fwd_nat_outbound(const struct ctx *c, const union inany_addr *addr,
> + union inany_addr *translated);
> uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
> const struct flowside *ini, struct flowside *tgt);
> uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
> diff --git a/inany.c b/inany.c
> index 65a39f9..7680439 100644
> --- a/inany.c
> +++ b/inany.c
> @@ -16,6 +16,7 @@
> #include "ip.h"
> #include "siphash.h"
> #include "inany.h"
> +#include "fwd.h"
>
> const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT);
> const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT);
> diff --git a/ndp.c b/ndp.c
> index d7f64a3..6bba0a8 100644
> --- a/ndp.c
> +++ b/ndp.c
> @@ -196,6 +196,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
> static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
> const struct in6_addr *addr)
> {
> + union inany_addr tgt, tgt_nat;
> struct ndp_na na = {
> .ih = {
> .icmp6_type = NA,
> @@ -215,6 +216,13 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
>
> memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
>
> + /* Respond with recorded MAC address if remote host's address
> + * or NAT translated address can be found in NDP table.
> + */
> + inany_from_af(&tgt, AF_INET6, addr);
> + fwd_nat_outbound(c, &tgt, &tgt_nat);
> + fwd_neigh_mac_get(c, &tgt_nat, na.target_l2_addr.mac);
You do indeed need nat_outbound() here, and as noted on the previous
patch you need nat_inbound() for unsolicited NAs triggered by host
neighbour updates. But you want to avoid doing both an inbound and
outbound NAT for the same operation, which might require some
restructuring here.
Or, I guess, you could convert the whole neighbour table to be in
terms of guest side addresses, in which case you'd remove the NAT
here, and instead need a nat_inbound() before you insert things into
the table.
> ndp_send(c, dst, &na, sizeof(na));
> }
>
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 5/9] flow: add MAC address of LAN local remote hosts to flow
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (3 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 4/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 0:34 ` [PATCH v12 6/9] udp: forward external source MAC address through tap interface Jon Maloy
` (4 subsequent siblings)
9 siblings, 0 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
When communicating with remote hosts on the local network, some guest
applications want to see the real MAC address of that host instead
of PASST/PASTA's own tap address. The flow_common structure is a
convenient location for storing that address, so we do that in this
commit.
Note that we don´t add actual usage of this address here, that will
be done in later commits.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Moved the remote host macaddress from struct flowside to
struct flow_common. I chose to call it 'omac' as suggested
by David, although in my understanding the correct name would be
'emac'. (In general I find the address naming scheme confusing.)
- Adapted to new signature of function nl_mac_get(), now passing
it the index of the template interface.
v4: - Renamed flow_commeon->omac to flow_common->tap_omac to make is
role in the code clearer
v5: - Modified the criteria for ARP/NDP table lookup like in the
previous commits.
- Removed the PIF_TAP lookup case, as David suggested, and did
instead give the flow->tap_omac field a value marking it as
non-initialized.
- Calling the cache table instead of netlink for ARP/NDP lookup.
- Unconditionally using the potentially translated IP address
in the lookup, instead of only if NAT really was applied.
v6: - Using MAC_ZERO instead of own definitions
v12:- Using MAC_UNDEF (==ff:ff:ff:ff:ff:ff) instead of MAC_ZERO,
which is a legal MAC address.
---
flow.c | 2 ++
flow.h | 2 ++
util.h | 2 ++
3 files changed, 6 insertions(+)
diff --git a/flow.c b/flow.c
index feefda3..df3ce2a 100644
--- a/flow.c
+++ b/flow.c
@@ -449,6 +449,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
switch (f->pif[INISIDE]) {
case PIF_TAP:
+ memcpy(f->tap_omac, MAC_UNDEF, ETH_ALEN);
tgtpif = fwd_nat_from_tap(c, proto, ini, tgt);
break;
@@ -458,6 +459,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
case PIF_HOST:
tgtpif = fwd_nat_from_host(c, proto, ini, tgt);
+ fwd_neigh_mac_get(c, &ini->eaddr, f->tap_omac);
break;
default:
diff --git a/flow.h b/flow.h
index cac618a..f342895 100644
--- a/flow.h
+++ b/flow.h
@@ -177,6 +177,7 @@ int flowside_connect(const struct ctx *c, int s,
* @type: Type of packet flow
* @pif[]: Interface for each side of the flow
* @side[]: Information for each side of the flow
+ * @tap_omac: MAC address of remote endpoint as seen from the guest
*/
struct flow_common {
#ifdef __GNUC__
@@ -192,6 +193,7 @@ struct flow_common {
#endif
uint8_t pif[SIDES];
struct flowside side[SIDES];
+ uint8_t tap_omac[6];
};
#define FLOW_INDEX_BITS 17 /* 128k - 1 */
diff --git a/util.h b/util.h
index 22eaac5..6fc8f5d 100644
--- a/util.h
+++ b/util.h
@@ -101,6 +101,8 @@ void abort_with_msg(const char *fmt, ...)
((uint8_t [ETH_ALEN]){ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff })
#define MAC_ZERO ((uint8_t [ETH_ALEN]){ 0 })
#define MAC_IS_ZERO(addr) (!memcmp((addr), MAC_ZERO, ETH_ALEN))
+#define MAC_UNDEF MAC_BROADCAST
+#define MAC_IS_UNDEF(addr) (!memcmp((addr), MAC_UNDEF, ETH_ALEN))
#ifndef __bswap_constant_16
#define __bswap_constant_16(x) \
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 6/9] udp: forward external source MAC address through tap interface
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (4 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 5/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:52 ` David Gibson
2025-10-03 0:34 ` [PATCH v12 7/9] tcp: " Jon Maloy
` (3 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
We forward the incoming MAC address through the tap interface when
receiving incoming packets from network local hosts.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Adapted to the move of external MAC address from struct flowside
to struct flow_common
v4: - Changed signature of udp_tap_prepare() to take a MAC address
instead of a flow.
- Eliminated initialization of MAC source address in all frames,
since those now are set per send occasion anyway.
v5: - Added lookup in ARP/NDP table on incoming messages in
case flow->tap_omac wasn't initialized at flow creation,
i.e., the flow was initiated from the guest.
v6: - Using MAC_IS_ZERO
v12: - Using MAC_IS_UNDEF() instead of MAC_IS_ZERO()
---
passt.c | 2 +-
udp.c | 45 +++++++++++++++++++++++++--------------------
udp.h | 2 +-
3 files changed, 27 insertions(+), 22 deletions(-)
diff --git a/passt.c b/passt.c
index e21d6ba..e47d1ae 100644
--- a/passt.c
+++ b/passt.c
@@ -164,7 +164,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
{
tcp_update_l2_buf(eth_d, eth_s);
- udp_update_l2_buf(eth_d, eth_s);
+ udp_update_l2_buf(eth_d);
}
/**
diff --git a/udp.c b/udp.c
index 86585b7..3acc401 100644
--- a/udp.c
+++ b/udp.c
@@ -133,11 +133,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
/* UDP header and data for inbound messages */
static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
-/* Ethernet header for IPv4 frames */
-static struct ethhdr udp4_eth_hdr;
-
-/* Ethernet header for IPv6 frames */
-static struct ethhdr udp6_eth_hdr;
+/* Ethernet headers for IPv4 and IPv6 frames */
+static struct ethhdr udp_eth_hdr[UDP_MAX_FRAMES];
/**
* struct udp_meta_t - Pre-cooked headers for UDP packets
@@ -210,12 +207,13 @@ void udp_portmap_clear(void)
/**
* udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
* @eth_d: Ethernet destination address, NULL if unchanged
- * @eth_s: Ethernet source address, NULL if unchanged
*/
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void udp_update_l2_buf(const unsigned char *eth_d)
{
- eth_update_mac(&udp4_eth_hdr, eth_d, eth_s);
- eth_update_mac(&udp6_eth_hdr, eth_d, eth_s);
+ int i;
+
+ for (i = 0; i < UDP_MAX_FRAMES; i++)
+ eth_update_mac(&udp_eth_hdr[i], eth_d, NULL);
}
/**
@@ -238,6 +236,7 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
*siov = IOV_OF_LVALUE(payload->data);
+ tiov[UDP_IOV_ETH] = IOV_OF_LVALUE(udp_eth_hdr[i]);
tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph);
tiov[UDP_IOV_PAYLOAD].iov_base = payload;
@@ -253,9 +252,6 @@ static void udp_iov_init(const struct ctx *c)
{
size_t i;
- udp4_eth_hdr.h_proto = htons_constant(ETH_P_IP);
- udp6_eth_hdr.h_proto = htons_constant(ETH_P_IPV6);
-
for (i = 0; i < UDP_MAX_FRAMES; i++)
udp_iov_init_one(c, i);
}
@@ -352,31 +348,34 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
* udp_tap_prepare() - Convert one datagram into a tap frame
* @mmh: Receiving mmsghdr array
* @idx: Index of the datagram to prepare
+ * @tap_omac: MAC address of remote endpoint as seen from the guest
* @toside: Flowside for destination side
* @no_udp_csum: Do not set UDP checksum
*/
static void udp_tap_prepare(const struct mmsghdr *mmh,
- unsigned idx, const struct flowside *toside,
+ unsigned int idx,
+ const uint8_t *tap_omac,
+ const struct flowside *toside,
bool no_udp_csum)
{
struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
+ struct ethhdr *eh = (*tap_iov)[UDP_IOV_ETH].iov_base;
struct udp_payload_t *bp = &udp_payload[idx];
struct udp_meta_t *bm = &udp_meta[idx];
size_t l4len;
+ eth_update_mac(eh, NULL, tap_omac);
if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
mmh[idx].msg_len, no_udp_csum);
- tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
- sizeof(udp6_eth_hdr));
- (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
+ tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + ETH_HLEN);
+ eh->h_proto = htons_constant(ETH_P_IPV6);
(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
} else {
l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
mmh[idx].msg_len, no_udp_csum);
- tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
- sizeof(udp4_eth_hdr));
- (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
+ tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + ETH_HLEN);
+ eh->h_proto = htons_constant(ETH_P_IP);
(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h);
}
(*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len;
@@ -801,13 +800,19 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
flow_sidx_t tosidx)
{
const struct flowside *toside = flowside_at_sidx(tosidx);
+ struct udp_flow *uflow = udp_at_sidx(tosidx);
+ uint8_t *omac = uflow->f.tap_omac;
int i;
if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
return;
+ /* Find if neighbour table has a recorded MAC address */
+ if (MAC_IS_UNDEF(omac))
+ fwd_neigh_mac_get(c, &toside->oaddr, omac);
+
for (i = 0; i < n; i++)
- udp_tap_prepare(udp_mh_recv, i, toside, false);
+ udp_tap_prepare(udp_mh_recv, i, omac, toside, false);
tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
}
diff --git a/udp.h b/udp.h
index 8f8531a..dd6e5ad 100644
--- a/udp.h
+++ b/udp.h
@@ -21,7 +21,7 @@ int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
const char *ifname, in_port_t port);
int udp_init(struct ctx *c);
void udp_timer(struct ctx *c, const struct timespec *now);
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
+void udp_update_l2_buf(const unsigned char *eth_d);
/**
* union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 6/9] udp: forward external source MAC address through tap interface
2025-10-03 0:34 ` [PATCH v12 6/9] udp: forward external source MAC address through tap interface Jon Maloy
@ 2025-10-03 4:52 ` David Gibson
0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:52 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]
On Thu, Oct 02, 2025 at 08:34:09PM -0400, Jon Maloy wrote:
> We forward the incoming MAC address through the tap interface when
> receiving incoming packets from network local hosts.
>
> This is a part of the solution to bug
> https://bugs.passt.top/show_bug.cgi?id=120
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[snip]
> @@ -801,13 +800,19 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
> flow_sidx_t tosidx)
> {
> const struct flowside *toside = flowside_at_sidx(tosidx);
> + struct udp_flow *uflow = udp_at_sidx(tosidx);
> + uint8_t *omac = uflow->f.tap_omac;
> int i;
>
> if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0)
> return;
>
> + /* Find if neighbour table has a recorded MAC address */
> + if (MAC_IS_UNDEF(omac))
> + fwd_neigh_mac_get(c, &toside->oaddr, omac);
Uh oh. Here you're looking up based on a guest side address, whereas
other places are looking up based on a host side address. You need to
pick one.
> +
> for (i = 0; i < n; i++)
> - udp_tap_prepare(udp_mh_recv, i, toside, false);
> + udp_tap_prepare(udp_mh_recv, i, omac, toside, false);
>
> tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
> }
> diff --git a/udp.h b/udp.h
> index 8f8531a..dd6e5ad 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -21,7 +21,7 @@ int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> const char *ifname, in_port_t port);
> int udp_init(struct ctx *c);
> void udp_timer(struct ctx *c, const struct timespec *now);
> -void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
> +void udp_update_l2_buf(const unsigned char *eth_d);
>
> /**
> * union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets
> --
> 2.50.1
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 7/9] tcp: forward external source MAC address through tap interface
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (5 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 6/9] udp: forward external source MAC address through tap interface Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:54 ` David Gibson
2025-10-03 0:34 ` [PATCH v12 8/9] tap: change signature of function tap_push_l2h() Jon Maloy
` (2 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Adapted to the signature change in nl_mac_get() function, so that
we now consider only the template interface when checking the
ARP/NDP table.
v4: - Adapted to previous name changes in this series
v5: - Added lookup in ARP/NDP cache and/or table on incoming messages
in case flow->tap_omac wasn't initialized at flow creation,
i.e., the flow was initiated from the guest.
v6: - Using MAC_IS_ZERO() instead of mac_undefined()
- I ignored David's suggestion to move the test for
undefined flow->tap_omac to tcp_{buf,vu}_data_from_sock(), at
least for now. It looks like a sub-optimization to first have to
look up the flow instance in these calls, and then do the
test for MAC_IS_ZERO(), even though we might in theory end up
with fewer such calls (how often?).
v12: - Using MAC_IS_UNDEF() instead of MAC_IS_ZERO()
- Minor update after feedback from Stefano
---
passt.c | 7 +++----
passt.h | 3 +--
pasta.c | 2 +-
tap.c | 2 +-
tcp.c | 14 ++++++++++++--
tcp.h | 2 +-
tcp_buf.c | 37 +++++++++++++++++--------------------
tcp_internal.h | 4 ++--
tcp_vu.c | 5 ++---
9 files changed, 40 insertions(+), 36 deletions(-)
diff --git a/passt.c b/passt.c
index e47d1ae..fb94f9b 100644
--- a/passt.c
+++ b/passt.c
@@ -159,11 +159,10 @@ static void timer_init(struct ctx *c, const struct timespec *now)
/**
* proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers
* @eth_d: Ethernet destination address, NULL if unchanged
- * @eth_s: Ethernet source address, NULL if unchanged
*/
-void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void proto_update_l2_buf(const unsigned char *eth_d)
{
- tcp_update_l2_buf(eth_d, eth_s);
+ tcp_update_l2_buf(eth_d);
udp_update_l2_buf(eth_d);
}
@@ -314,7 +313,7 @@ int main(int argc, char **argv)
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
_exit(EXIT_FAILURE);
- proto_update_l2_buf(c.guest_mac, c.our_tap_mac);
+ proto_update_l2_buf(c.guest_mac);
if (c.ifi4 && !c.no_dhcp)
dhcp_init();
diff --git a/passt.h b/passt.h
index 0075eb4..ff0236c 100644
--- a/passt.h
+++ b/passt.h
@@ -326,7 +326,6 @@ struct ctx {
bool migrate_exit;
};
-void proto_update_l2_buf(const unsigned char *eth_d,
- const unsigned char *eth_s);
+void proto_update_l2_buf(const unsigned char *eth_d);
#endif /* PASST_H */
diff --git a/pasta.c b/pasta.c
index 687406b..a42cfd8 100644
--- a/pasta.c
+++ b/pasta.c
@@ -411,7 +411,7 @@ void pasta_ns_conf(struct ctx *c)
}
}
- proto_update_l2_buf(c->guest_mac, NULL);
+ proto_update_l2_buf(c->guest_mac);
}
/**
diff --git a/tap.c b/tap.c
index 399eeaa..250a0f6 100644
--- a/tap.c
+++ b/tap.c
@@ -1101,7 +1101,7 @@ void tap_add_packet(struct ctx *c, struct iov_tail *data,
memcpy(c->guest_mac, eh->h_source, ETH_ALEN);
debug("New guest MAC address observed: %s",
eth_ntop(c->guest_mac, bufmac, sizeof(bufmac)));
- proto_update_l2_buf(c->guest_mac, NULL);
+ proto_update_l2_buf(c->guest_mac);
}
switch (ntohs(eh->h_proto)) {
diff --git a/tcp.c b/tcp.c
index 48b1ef2..06b07f8 100644
--- a/tcp.c
+++ b/tcp.c
@@ -919,8 +919,10 @@ static void tcp_fill_header(struct tcphdr *th,
/**
* tcp_fill_headers() - Fill 802.3, IP, TCP headers
+ * @c: Execution context
* @conn: Connection pointer
* @taph: tap backend specific header
+ * @eh: Pointer to Ethernet header
* @ip4h: Pointer to IPv4 header, or NULL
* @ip6h: Pointer to IPv6 header, or NULL
* @th: Pointer to TCP header
@@ -929,14 +931,15 @@ static void tcp_fill_header(struct tcphdr *th,
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*/
-void tcp_fill_headers(const struct tcp_tap_conn *conn,
- struct tap_hdr *taph,
+void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
+ struct tap_hdr *taph, struct ethhdr *eh,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
size_t l4len = iov_tail_size(payload) + sizeof(*th);
+ uint8_t *omac = conn->f.tap_omac;
size_t l3len = l4len;
uint32_t psum = 0;
@@ -962,6 +965,7 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn,
psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
*src4, *dst4);
}
+ eh->h_proto = htons_constant(ETH_P_IP);
}
if (ip6h) {
@@ -982,8 +986,14 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn,
&ip6h->saddr,
&ip6h->daddr);
}
+ eh->h_proto = htons_constant(ETH_P_IPV6);
}
+ /* Find if neighbour table has a recorded MAC address */
+ if (MAC_IS_UNDEF(omac))
+ fwd_neigh_mac_get(c, &tapside->oaddr, omac);
+ eth_update_mac(eh, NULL, omac);
+
tcp_fill_header(th, conn, seq);
if (no_tcp_csum)
diff --git a/tcp.h b/tcp.h
index 234a803..c1b8385 100644
--- a/tcp.h
+++ b/tcp.h
@@ -24,7 +24,7 @@ int tcp_init(struct ctx *c);
void tcp_timer(struct ctx *c, const struct timespec *now);
void tcp_defer_handler(struct ctx *c);
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
+void tcp_update_l2_buf(const unsigned char *eth_d);
extern bool peek_offset_cap;
diff --git a/tcp_buf.c b/tcp_buf.c
index a493b5a..f256dcc 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -40,8 +40,7 @@
/* Static buffers */
/* Ethernet header for IPv4 and IPv6 frames */
-static struct ethhdr tcp4_eth_src;
-static struct ethhdr tcp6_eth_src;
+static struct ethhdr tcp_eth_hdr[TCP_FRAMES_MEM];
static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
@@ -67,12 +66,13 @@ static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
* tcp_update_l2_buf() - Update Ethernet header buffers with addresses
* @eth_d: Ethernet destination address, NULL if unchanged
- * @eth_s: Ethernet source address, NULL if unchanged
*/
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void tcp_update_l2_buf(const unsigned char *eth_d)
{
- eth_update_mac(&tcp4_eth_src, eth_d, eth_s);
- eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
+ int i;
+
+ for (i = 0; i < TCP_FRAMES_MEM; i++)
+ eth_update_mac(&tcp_eth_hdr[i], eth_d, NULL);
}
/**
@@ -85,9 +85,6 @@ void tcp_sock_iov_init(const struct ctx *c)
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
int i;
- tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
- tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
-
for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) {
tcp6_payload_ip[i] = ip6;
tcp4_payload_ip[i] = iph;
@@ -149,13 +146,15 @@ void tcp_payload_flush(const struct ctx *c)
/**
* tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * @c: Execution context
* @conn: Connection pointer
* @iov: Pointer to an array of iovec of TCP pre-cooked buffers
* @check: Checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
*/
-static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+static void tcp_l2_buf_fill_headers(const struct ctx *c,
+ struct tcp_tap_conn *conn,
struct iovec *iov, const uint16_t *check,
uint32_t seq, bool no_tcp_csum)
{
@@ -164,6 +163,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
const struct flowside *tapside = TAPFLOW(conn);
const struct in_addr *a4 = inany_v4(&tapside->oaddr);
+ struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base;
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
@@ -172,7 +172,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
else
ip6h = iov[TCP_IOV_IP].iov_base;
- tcp_fill_headers(conn, taph, ip4h, ip6h, th, &tail,
+ tcp_fill_headers(c, conn, taph, eh, ip4h, ip6h, th, &tail,
check, seq, no_tcp_csum);
}
@@ -194,14 +194,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
int ret;
iov = tcp_l2_iov[tcp_payload_used];
- if (CONN_V4(conn)) {
+ if (CONN_V4(conn))
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
- iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
- } else {
+ else
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
- iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
- }
+ iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp_eth_hdr[tcp_payload_used]);
payload = iov[TCP_IOV_PAYLOAD].iov_base;
seq = conn->seq_to_tap;
ret = tcp_prepare_flags(c, conn, flags, &payload->th,
@@ -212,7 +210,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
tcp_frame_conns[tcp_payload_used++] = conn;
l4len = optlen + sizeof(struct tcphdr);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
- tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
+ tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
if (flags & DUP_ACK) {
struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used];
@@ -260,11 +258,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
check = &iph->check;
}
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
- iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
- iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
+ iov[TCP_IOV_ETH].iov_base = &tcp_eth_hdr[tcp_payload_used];
payload = iov[TCP_IOV_PAYLOAD].iov_base;
payload->th.th_off = sizeof(struct tcphdr) / 4;
payload->th.th_x2 = 0;
@@ -272,7 +269,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
payload->th.ack = 1;
payload->th.psh = push;
iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
- tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
+ tcp_l2_buf_fill_headers(c, conn, iov, check, seq, false);
if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
tcp_payload_flush(c);
}
diff --git a/tcp_internal.h b/tcp_internal.h
index 5cb6cba..f55025c 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -174,8 +174,8 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
struct tcp_info_linux;
-void tcp_fill_headers(const struct tcp_tap_conn *conn,
- struct tap_hdr *taph,
+void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
+ struct tap_hdr *taph, struct ethhdr *eh,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum);
diff --git a/tcp_vu.c b/tcp_vu.c
index ebd3a1e..0c17884 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -135,7 +135,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
flags_elem[0].in_sg[0].iov_len = hdrlen + optlen;
payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
- tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
+ tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload,
NULL, seq, !*c->pcap);
if (*c->pcap) {
@@ -308,7 +308,6 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
eh = vu_eth(base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
- memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
/* initialize header */
@@ -332,7 +331,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
th->ack = 1;
th->psh = push;
- tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload,
+ tcp_fill_headers(c, conn, NULL, eh, ip4h, ip6h, th, &payload,
*check, conn->seq_to_tap, no_tcp_csum);
if (ip4h)
*check = &ip4h->check;
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 7/9] tcp: forward external source MAC address through tap interface
2025-10-03 0:34 ` [PATCH v12 7/9] tcp: " Jon Maloy
@ 2025-10-03 4:54 ` David Gibson
0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:54 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 809 bytes --]
On Thu, Oct 02, 2025 at 08:34:10PM -0400, Jon Maloy wrote:
> We forward the incoming mac address through the tap interface when
> receiving incoming packets from network local hosts.
>
> This is a part of the solution to bug
> https://bugs.passt.top/show_bug.cgi?id=120
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[snip]
> + /* Find if neighbour table has a recorded MAC address */
> + if (MAC_IS_UNDEF(omac))
> + fwd_neigh_mac_get(c, &tapside->oaddr, omac);
Again, this is looking up with a guest side address, not host side.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 8/9] tap: change signature of function tap_push_l2h()
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (6 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 7/9] tcp: " Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 0:34 ` [PATCH v12 9/9] icmp: let icmp use mac address from flowside structure Jon Maloy
2025-10-03 5:33 ` [PATCH v12 0/9] Use true MAC address of LAN local remote hosts David Gibson
9 siblings, 0 replies; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
In the next commit it must be possible for the callers of function
tap_push_l2h() to specify which source MAC address should be
added to the ethernet header sent over the tap interface. As a
preparation, we now add a new argument to that function, still
without any logical changes.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: -Improved comment for src_mac argument, as suggested by Stefano.
v4: -Re-added comment in tap.c and remove redundant argument check
in tap_push_l2()
v5: -Ensured that MAC argument to tap_push_l2h() never is NULL.
v6: -Clarification of comment about ARP lookup
v12: - Minor updates after feedback from Stefano
---
tap.c | 16 +++++++++-------
tap.h | 3 ++-
tcp.c | 6 ++++--
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/tap.c b/tap.c
index 250a0f6..84266d0 100644
--- a/tap.c
+++ b/tap.c
@@ -171,17 +171,19 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
* tap_push_l2h() - Build an L2 header for an inbound packet
* @c: Execution context
* @buf: Buffer address at which to generate header
+ * @src_mac: MAC address to be used as source for message
* @proto: Ethernet protocol number for L3
*
* Return: pointer at which to write the packet's payload
*/
-void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto)
+void *tap_push_l2h(const struct ctx *c, void *buf,
+ const void *src_mac, uint16_t proto)
{
struct ethhdr *eh = (struct ethhdr *)buf;
- /* TODO: ARP table lookup */
+ /* TODO: ARP lookup on tap side */
memcpy(eh->h_dest, c->guest_mac, ETH_ALEN);
- memcpy(eh->h_source, c->our_tap_mac, ETH_ALEN);
+ memcpy(eh->h_source, src_mac, ETH_ALEN);
eh->h_proto = ntohs(proto);
return eh + 1;
}
@@ -261,7 +263,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
- struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+ struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP);
char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
@@ -281,7 +283,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
const void *in, size_t l4len)
{
char buf[USHRT_MAX];
- struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+ struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst,
l4len, IPPROTO_ICMP);
@@ -367,7 +369,7 @@ void tap_udp6_send(const struct ctx *c,
{
size_t l4len = dlen + sizeof(struct udphdr);
char buf[USHRT_MAX];
- struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+ struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
struct udphdr *uh = tap_push_ip6h(ip6h, src, dst,
l4len, IPPROTO_UDP, flow);
char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
@@ -389,7 +391,7 @@ void tap_icmp6_send(const struct ctx *c,
const void *in, size_t l4len)
{
char buf[USHRT_MAX];
- struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+ struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len,
IPPROTO_ICMPV6, 0);
diff --git a/tap.h b/tap.h
index 21db4d2..02f7761 100644
--- a/tap.h
+++ b/tap.h
@@ -70,7 +70,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
}
unsigned long tap_l2_max_len(const struct ctx *c);
-void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
+void *tap_push_l2h(const struct ctx *c, void *buf,
+ const void *src_mac, uint16_t proto);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
struct in_addr dst, size_t l4len, uint8_t proto);
void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
diff --git a/tcp.c b/tcp.c
index 06b07f8..79d2efe 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1979,7 +1979,8 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
return;
if (af == AF_INET) {
- struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP);
+ struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac,
+ ETH_P_IP);
const struct in_addr *rst_src = daddr;
const struct in_addr *rst_dst = saddr;
@@ -1989,7 +1990,8 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
*rst_src, *rst_dst);
} else {
- struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6);
+ struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac,
+ ETH_P_IPV6);
const struct in6_addr *rst_src = daddr;
const struct in6_addr *rst_dst = saddr;
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v12 9/9] icmp: let icmp use mac address from flowside structure
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (7 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 8/9] tap: change signature of function tap_push_l2h() Jon Maloy
@ 2025-10-03 0:34 ` Jon Maloy
2025-10-03 4:57 ` David Gibson
2025-10-03 5:33 ` [PATCH v12 0/9] Use true MAC address of LAN local remote hosts David Gibson
9 siblings, 1 reply; 28+ messages in thread
From: Jon Maloy @ 2025-10-03 0:34 UTC (permalink / raw)
To: sbrivio, dgibson, david, jmaloy, passt-dev
Even ICMP needs to be updated to use the external MAC address instead
of just the own tap address when applicable. We do that here.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
v3: - Adapted to the move of external MAC address from struct flowside
to struct flow_common
v4: - Adapted to name changes in previous commits in this series
v5: - Added conditional lookup in ARP/NDP if the flow's tap_omac is
undefined
v6: - Looking up MAC of ICMP generating node in udp_send_tap_icmp4/6()
when available, instead trusting the contents of flow->tap_omac.
v12: - Using MAC_IS_UNDEF() instead of MAC_IS_ZERO()
- Comment update after feedback from Stefano
---
icmp.c | 8 ++++++--
ndp.c | 2 +-
tap.c | 10 ++++++----
tap.h | 4 ++--
udp.c | 12 ++++++++++--
5 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/icmp.c b/icmp.c
index 6dffafb..93b394a 100644
--- a/icmp.c
+++ b/icmp.c
@@ -125,17 +125,21 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
ini->eport, seq);
+ /* Find if neighbour table has a recorded MAC address */
+ if (MAC_IS_UNDEF(pingf->f.tap_omac))
+ fwd_neigh_mac_get(c, &ini->oaddr, pingf->f.tap_omac);
+
if (pingf->f.type == FLOW_PING4) {
const struct in_addr *saddr = inany_v4(&ini->oaddr);
const struct in_addr *daddr = inany_v4(&ini->eaddr);
ASSERT(saddr && daddr); /* Must have IPv4 addresses */
- tap_icmp4_send(c, *saddr, *daddr, buf, n);
+ tap_icmp4_send(c, *saddr, *daddr, buf, pingf->f.tap_omac, n);
} else if (pingf->f.type == FLOW_PING6) {
const struct in6_addr *saddr = &ini->oaddr.a6;
const struct in6_addr *daddr = &ini->eaddr.a6;
- tap_icmp6_send(c, saddr, daddr, buf, n);
+ tap_icmp6_send(c, saddr, daddr, buf, pingf->f.tap_omac, n);
}
return;
diff --git a/ndp.c b/ndp.c
index 6bba0a8..1af5e4b 100644
--- a/ndp.c
+++ b/ndp.c
@@ -184,7 +184,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst,
{
const struct in6_addr *src = &c->ip6.our_tap_ll;
- tap_icmp6_send(c, src, dst, buf, l4len);
+ tap_icmp6_send(c, src, dst, buf, c->our_tap_mac, l4len);
}
/**
diff --git a/tap.c b/tap.c
index 84266d0..3ef2734 100644
--- a/tap.c
+++ b/tap.c
@@ -277,13 +277,14 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
* @src: IPv4 source address
* @dst: IPv4 destination address
* @in: ICMP packet, including ICMP header
+ * @src_mac: MAC address to be used as source for message
* @l4len: ICMP packet length, including ICMP header
*/
void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
- const void *in, size_t l4len)
+ const void *in, const void *src_mac, size_t l4len)
{
char buf[USHRT_MAX];
- struct iphdr *ip4h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IP);
+ struct iphdr *ip4h = tap_push_l2h(c, buf, src_mac, ETH_P_IP);
struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst,
l4len, IPPROTO_ICMP);
@@ -384,14 +385,15 @@ void tap_udp6_send(const struct ctx *c,
* @src: IPv6 source address
* @dst: IPv6 destination address
* @in: ICMP packet, including ICMP header
+ * @src_mac: MAC address to be used as source for message
* @l4len: ICMP packet length, including ICMP header
*/
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
- const void *in, size_t l4len)
+ const void *in, const void *src_mac, size_t l4len)
{
char buf[USHRT_MAX];
- struct ipv6hdr *ip6h = tap_push_l2h(c, buf, c->our_tap_mac, ETH_P_IPV6);
+ struct ipv6hdr *ip6h = tap_push_l2h(c, buf, src_mac, ETH_P_IPV6);
struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len,
IPPROTO_ICMPV6, 0);
diff --git a/tap.h b/tap.h
index 02f7761..1864173 100644
--- a/tap.h
+++ b/tap.h
@@ -91,7 +91,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
struct in_addr dst, in_port_t dport,
const void *in, size_t dlen);
void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
- const void *in, size_t l4len);
+ const void *in, const void *src_mac, size_t l4len);
const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
const struct in6_addr *src);
void *tap_push_ip6h(struct ipv6hdr *ip6h,
@@ -103,7 +103,7 @@ void tap_udp6_send(const struct ctx *c,
uint32_t flow, void *in, size_t dlen);
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
- const void *in, size_t l4len);
+ const void *in, const void *src_mac, size_t l4len);
void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
size_t bufs_per_frame, size_t nframes);
diff --git a/udp.c b/udp.c
index 3acc401..3981e9b 100644
--- a/udp.c
+++ b/udp.c
@@ -400,6 +400,8 @@ static void udp_send_tap_icmp4(const struct ctx *c,
struct in_addr eaddr = toside->eaddr.v4mapped.a4;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
+ union inany_addr saddr_any;
+ uint8_t tap_omac[ETH_ALEN];
struct {
struct icmphdr icmp4h;
struct iphdr ip4h;
@@ -421,7 +423,10 @@ static void udp_send_tap_icmp4(const struct ctx *c,
tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
- tap_icmp4_send(c, saddr, eaddr, &msg, msglen);
+ /* Try to obtain the MAC address of the generating node */
+ saddr_any = inany_from_v4(saddr);
+ fwd_neigh_mac_get(c, &saddr_any, tap_omac);
+ tap_icmp4_send(c, saddr, eaddr, &msg, tap_omac, msglen);
}
@@ -445,6 +450,7 @@ static void udp_send_tap_icmp6(const struct ctx *c,
const struct in6_addr *eaddr = &toside->eaddr.a6;
in_port_t eport = toside->eport;
in_port_t oport = toside->oport;
+ uint8_t tap_omac[ETH_ALEN];
struct {
struct icmp6_hdr icmp6h;
struct ipv6hdr ip6h;
@@ -466,7 +472,9 @@ static void udp_send_tap_icmp6(const struct ctx *c,
tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen);
memcpy(&msg.data, in, dlen);
- tap_icmp6_send(c, saddr, eaddr, &msg, msglen);
+ /* Try to obtain the MAC address of the generating node */
+ fwd_neigh_mac_get(c, (union inany_addr *) saddr, tap_omac);
+ tap_icmp6_send(c, saddr, eaddr, &msg, tap_omac, msglen);
}
/**
--
2.50.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 9/9] icmp: let icmp use mac address from flowside structure
2025-10-03 0:34 ` [PATCH v12 9/9] icmp: let icmp use mac address from flowside structure Jon Maloy
@ 2025-10-03 4:57 ` David Gibson
0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-03 4:57 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 1947 bytes --]
On Thu, Oct 02, 2025 at 08:34:12PM -0400, Jon Maloy wrote:
> Even ICMP needs to be updated to use the external MAC address instead
> of just the own tap address when applicable. We do that here.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> ---
> v3: - Adapted to the move of external MAC address from struct flowside
> to struct flow_common
> v4: - Adapted to name changes in previous commits in this series
> v5: - Added conditional lookup in ARP/NDP if the flow's tap_omac is
> undefined
> v6: - Looking up MAC of ICMP generating node in udp_send_tap_icmp4/6()
> when available, instead trusting the contents of flow->tap_omac.
> v12: - Using MAC_IS_UNDEF() instead of MAC_IS_ZERO()
> - Comment update after feedback from Stefano
> ---
> icmp.c | 8 ++++++--
> ndp.c | 2 +-
> tap.c | 10 ++++++----
> tap.h | 4 ++--
> udp.c | 12 ++++++++++--
> 5 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/icmp.c b/icmp.c
> index 6dffafb..93b394a 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -125,17 +125,21 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
> flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
> ini->eport, seq);
>
> + /* Find if neighbour table has a recorded MAC address */
> + if (MAC_IS_UNDEF(pingf->f.tap_omac))
> + fwd_neigh_mac_get(c, &ini->oaddr, pingf->f.tap_omac);
Again a lookup by guest address. Actually looking at these later
patches, having the table stored in terms of guest address probably
makes more sense than my earlier suggestions.
That means the NAT needs to move where you populate the table from
host updates.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 0/9] Use true MAC address of LAN local remote hosts
2025-10-03 0:34 [PATCH v12 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
` (8 preceding siblings ...)
2025-10-03 0:34 ` [PATCH v12 9/9] icmp: let icmp use mac address from flowside structure Jon Maloy
@ 2025-10-03 5:33 ` David Gibson
2025-10-05 13:39 ` Jon Maloy
9 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2025-10-03 5:33 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 1165 bytes --]
On Thu, Oct 02, 2025 at 08:34:03PM -0400, Jon Maloy wrote:
>
>
> Bug #120 asks us to use the true MAC addresses of LAN local
> remote hosts, since some programs need this information.
> These commits introduces this for ARP, NDP, UDP, TCP and
> ICMP.
I have a bunch of more detailed comments on the patches, but they're a
bit confused, because for the earlier ones I hadn't read the whole
series to get the whole context.
There is a problem here in that you're not consistent about whether
the neighbour table is indexed by host side addresses or guest side
addresses. You'll mostly get away with this, because they're
usually the same. But it will result in bugs on the edge cases.
Having read the whole series now, I think it will be simplest to
standardize on indexing by guest side addresses. That invalidates
some of my earlier comments, but instead you'll need to use
nat_inbound() when you populate the table from netlink updates.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 0/9] Use true MAC address of LAN local remote hosts
2025-10-03 5:33 ` [PATCH v12 0/9] Use true MAC address of LAN local remote hosts David Gibson
@ 2025-10-05 13:39 ` Jon Maloy
2025-10-07 0:56 ` David Gibson
0 siblings, 1 reply; 28+ messages in thread
From: Jon Maloy @ 2025-10-05 13:39 UTC (permalink / raw)
To: David Gibson; +Cc: sbrivio, dgibson, passt-dev
On 2025-10-03 01:33, David Gibson wrote:
> On Thu, Oct 02, 2025 at 08:34:03PM -0400, Jon Maloy wrote:
>>
>>
>> Bug #120 asks us to use the true MAC addresses of LAN local
>> remote hosts, since some programs need this information.
>> These commits introduces this for ARP, NDP, UDP, TCP and
>> ICMP.
>
> I have a bunch of more detailed comments on the patches, but they're a
> bit confused, because for the earlier ones I hadn't read the whole
> series to get the whole context.
>
> There is a problem here in that you're not consistent about whether
> the neighbour table is indexed by host side addresses or guest side
> addresses. You'll mostly get away with this, because they're
> usually the same. But it will result in bugs on the edge cases.
>
> Having read the whole series now, I think it will be simplest to
> standardize on indexing by guest side addresses. That invalidates
> some of my earlier comments, but instead you'll need to use
> nat_inbound() when you populate the table from netlink updates.
>
I think I agree. After all this is all about what we are showing the
guest in the form of IP and MAC addresses, so it is conceptually clearer.
It doesn't change the implementation much, but it makes it easier
to understand which special cases we need to consider.
See my next email.
///jon
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v12 0/9] Use true MAC address of LAN local remote hosts
2025-10-05 13:39 ` Jon Maloy
@ 2025-10-07 0:56 ` David Gibson
0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2025-10-07 0:56 UTC (permalink / raw)
To: Jon Maloy; +Cc: sbrivio, dgibson, passt-dev
[-- Attachment #1: Type: text/plain, Size: 1824 bytes --]
On Sun, Oct 05, 2025 at 09:39:37AM -0400, Jon Maloy wrote:
>
>
> On 2025-10-03 01:33, David Gibson wrote:
> > On Thu, Oct 02, 2025 at 08:34:03PM -0400, Jon Maloy wrote:
> > >
> > >
> > > Bug #120 asks us to use the true MAC addresses of LAN local
> > > remote hosts, since some programs need this information.
> > > These commits introduces this for ARP, NDP, UDP, TCP and
> > > ICMP.
> >
> > I have a bunch of more detailed comments on the patches, but they're a
> > bit confused, because for the earlier ones I hadn't read the whole
> > series to get the whole context.
> >
> > There is a problem here in that you're not consistent about whether
> > the neighbour table is indexed by host side addresses or guest side
> > addresses. You'll mostly get away with this, because they're
> > usually the same. But it will result in bugs on the edge cases.
> >
> > Having read the whole series now, I think it will be simplest to
> > standardize on indexing by guest side addresses. That invalidates
> > some of my earlier comments, but instead you'll need to use
> > nat_inbound() when you populate the table from netlink updates.
> >
>
> I think I agree. After all this is all about what we are showing the
> guest in the form of IP and MAC addresses, so it is conceptually clearer.
>
> It doesn't change the implementation much, but it makes it easier
> to understand which special cases we need to consider.
Agreed. I think it will also make this easier to extend for
guest-side ARP lookups in future, which we'll want for handling
multiple guests.
> See my next email.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread