public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top
Subject: Re: [PATCH v13 01/10] netlink: add subscription on changes in NDP/ARP table
Date: Tue, 14 Oct 2025 15:39:19 +1100	[thread overview]
Message-ID: <aO3T9wIBcROgN6rI@zatzit> (raw)
In-Reply-To: <20251012193337.616835-2-jmaloy@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 12565 bytes --]

On Sun, Oct 12, 2025 at 03:33:28PM -0400, Jon Maloy wrote:
> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> requires the ability to translate from an IP address to its
> corresponding MAC address in cases where those are present in
> the ARP or NDP tables.
> 
> To keep track of the contents of these tables we add a netlink
> based neighbour subscription feature.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

One important comment, several nits.

> ---
> v3: - Added an attribute contianing NDA_DST to sent message, so
>       that we let the kernel do the filtering of the IP address
>       and return only one entry.
>     - Added interface index to the call signature. Since the only
>       interface we know is the template interface, this limits
>       the number of hosts that will be seen as 'network segment
>       local' from a PASST viewpoint.
> v4: - Made loop independent of attribute order.
>     - Ignoring L2 addresses which are not of size ETH_ALEN.
> v5: - Changed return value of new function, so caller can know if
>       a MAC address really was found.
> v6: - Removed warning printout which had ended up in the wrong
>       commit.
> v8: - Changed to neighbour event subscription model
>     - netlink: arp/ndp table subscription
> v10:- Updated according to David's latest comments on v8
>     - Added functionaly where we initially read current
>       state of ARP/NDP tables
> v12:- Updates based on feedback from David and Stefano
> v13:- Updates based on feedback from David and Stefano
> ---
>  epoll_type.h |   2 +
>  netlink.c    | 215 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  netlink.h    |   4 +
>  passt.c      |   7 ++
>  4 files changed, 224 insertions(+), 4 deletions(-)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..a90ffb6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* Netlink neighbour subscription socket */
> +	EPOLL_TYPE_NL_NEIGH,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/netlink.c b/netlink.c
> index 9fe70f2..99dcb72 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -26,6 +26,7 @@
>  #include <arpa/inet.h>
>  #include <netinet/in.h>
>  #include <netinet/if_ether.h>
> +#include <net/if_arp.h>
>  
>  #include <linux/netlink.h>
>  #include <linux/rtnetlink.h>
> @@ -40,6 +41,10 @@
>  #define RTNH_NEXT_AND_DEC(rtnh, attrlen)				\
>  	((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
>  
> +/* Convenience macro borrowed from kernel */
> +#define NUD_VALID				\
> +	(NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE)
> +
>  /* Netlink expects a buffer of at least 8kiB or the system page size,
>   * whichever is larger.  32kiB is recommended for more efficient.
>   * Since the largest page size on any remotely common Linux setup is
> @@ -50,9 +55,10 @@
>  #define NLBUFSIZ 65536
>  
>  /* Socket in init, in target namespace, sequence (just needs to be monotonic) */
> -int nl_sock	= -1;
> -int nl_sock_ns	= -1;
> -static int nl_seq = 1;
> +int nl_sock		 = -1;
> +int nl_sock_ns		 = -1;
> +static int nl_sock_neigh = -1;
> +static int nl_seq	 = 1;
>  
>  /**
>   * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> @@ -325,7 +331,6 @@ unsigned int nl_get_ext_if(int s, sa_family_t af)
>  
>  	if (status < 0)
>  		warn("netlink: RTM_GETROUTE failed: %s", strerror_(-status));
> -

Nit: unrelated change.

>  	if (defifi) {
>  		if (ndef > 1) {
>  			info("Multiple default %s routes, picked first",
> @@ -1103,3 +1108,205 @@ int nl_link_set_flags(int s, unsigned int ifi,
>  
>  	return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
>  }
> +
> +/**
> + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink
> + * @c:		Execution context
> + * @nh:	Message to be read
> + */
> +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> +{
> +	struct ndmsg *ndm = NLMSG_DATA(nh);
> +	struct rtattr *rta = (struct rtattr *)(ndm + 1);
> +	size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> +	char ip_str[INET6_ADDRSTRLEN];
> +	char mac_str[ETH_ADDRSTRLEN];
> +	const uint8_t *lladdr = NULL;
> +	const void *dst = NULL;
> +	size_t lladdr_len = 0;
> +	uint8_t mac[ETH_ALEN];
> +	union inany_addr addr;
> +	size_t dstlen = 0;
> +
> +	if (nh->nlmsg_type == NLMSG_DONE)
> +		return;
> +
> +	if (nh->nlmsg_type == NLMSG_ERROR) {
> +		warn("nlmsg_type error at msg read");
> +		return;
> +	}
> +
> +	if (nh->nlmsg_type != RTM_NEWNEIGH &&
> +	    nh->nlmsg_type != RTM_DELNEIGH)
> +		return;
> +
> +	for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> +		switch (rta->rta_type) {
> +		case NDA_DST:
> +			dst = RTA_DATA(rta);
> +			dstlen = RTA_PAYLOAD(rta);
> +			break;
> +		case NDA_LLADDR:
> +			lladdr = RTA_DATA(rta);
> +			lladdr_len = RTA_PAYLOAD(rta);
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +
> +	if (!dst)
> +		return;
> +
> +

Nit: extra newline.

> +	if (ndm->ndm_family == AF_INET &&
> +	     ndm->ndm_ifindex != c->ifi4)
> +		return;
> +
> +	if (ndm->ndm_family == AF_INET6 &&
> +	     ndm->ndm_ifindex != c->ifi6)
> +		return;
> +
> +	if (ndm->ndm_family != AF_INET &&
> +	    ndm->ndm_family != AF_INET6)
> +		return;
> +
> +	if (ndm->ndm_family == AF_INET &&
> +	    dstlen != sizeof(struct in_addr)) {
> +		warn("netlink: wrong address length in AF_INET notification");
> +		return;
> +	}
> +	if (ndm->ndm_family == AF_INET6 &&
> +	    dstlen != sizeof(struct in6_addr)) {
> +		warn("netlink: wrong address length in AF_INET6 notification");
> +		return;
> +	}
> +	inany_from_af(&addr, ndm->ndm_family, dst);
> +	inany_ntop(dst, ip_str, sizeof(ip_str));
> +
> +	if (nh->nlmsg_type == RTM_DELNEIGH) {
> +		trace("neigh table delete: %s", ip_str);
> +		return;
> +	}
> +	if (!(ndm->ndm_state & NUD_VALID)) {
> +		trace("neigh table: invalid state for %s", ip_str);
> +		return;
> +	}
> +	if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) {
> +		warn("netlink: notification with illegal msg for %s", ip_str);
> +		return;
> +	}
> +	if (lladdr_len != ETH_ALEN || ndm->ndm_type != ARPHRD_ETHER)
> +		return;
> +
> +	memcpy(mac, lladdr, ETH_ALEN);
> +	eth_ntop(mac, mac_str, sizeof(mac_str));
> +	trace("neigh table update: %s / %s", ip_str, mac_str);
> +}
> +
> +/**
> + * nl_neigh_sync() - Read current contents ARP/NDP tables
> + * @c:		Execution context
> + * @proto:	Protocol, AF_INET or AF_INET6
> + * @ifi:	Interface index
> + */
> +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi)
> +{
> +	struct {
> +		struct nlmsghdr nlh;
> +		struct ndmsg    ndm;
> +	} req = {
> +		.nlh = {0},
> +		.ndm.ndm_family  = proto,
> +		.ndm.ndm_ifindex = ifi,
> +		.ndm.ndm_state   = 0,
> +		.ndm.ndm_flags   = 0,
> +		.ndm.ndm_type    = 0

Nit: unspecified initializer fields are assumed to be 0 (as long as
there's at least one initializer field).

> +	};
> +	struct nlmsghdr *nh;
> +	char buf[NLBUFSIZ];
> +	ssize_t status;
> +	uint32_t seq;
> +
> +	seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, NLM_F_DUMP, sizeof(req));
> +	nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH)
> +		nl_neigh_msg_read(c, nh);
> +	if (status < 0)
> +		warn("netlink: RTM_GETNEIGH failed: %s", strerror_(-status));
> +}
> +
> +/**
> + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates
> + * @c:		Execution context
> + */
> +void nl_neigh_notify_handler(const struct ctx *c)
> +{
> +	char buf[NLBUFSIZ];
> +
> +	for (;;) {
> +		ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> +		struct nlmsghdr *nh = (struct nlmsghdr *)buf;
> +
> +		if (n < 0) {
> +			struct nlmsgerr *errmsg;
> +
> +			if (nh->nlmsg_type != NLMSG_ERROR)
> +				return;

This isn't right.  It looks like you're conflating errors reported by
recv() with errors reported as netlink messages.  If n < 0, it means
recv() has failed, and nh->nlmsg_type is uninitialized.  recv()
succeeding, but the buffer containing an NLMSG_ERROR is a different
case.  There are two different semantic levels, each of which reports
errors in its own way, each needs to be handled separately.

> +			errmsg = (struct nlmsgerr *)NLMSG_DATA(nh);
> +			warn("netlink: recv() failed: error %i", errmsg->error);
> +			return;
> +		}
> +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n))
> +			nl_neigh_msg_read(c, nh);
> +	}
> +}
> +
> +/**
> + * nl_neigh_notify_init() - Open netlink sock and subscribe to neighbour events
> + * @c:		Execution context
> + *
> + * Return: 0 on success, -1 on failure
> + */
> +int nl_neigh_notify_init(const struct ctx *c)
> +{
> +	union epoll_ref ref = {
> +		.type = EPOLL_TYPE_NL_NEIGH
> +	};
> +	struct epoll_event ev = {
> +		.events = EPOLLIN
> +	};
> +	struct sockaddr_nl addr = {
> +		.nl_family = AF_NETLINK,
> +		.nl_groups = RTMGRP_NEIGH,
> +	};
> +
> +	if (nl_sock_neigh >= 0) {
> +		warn("RTMGRP_NEIGH already exists");
> +		return 0;
> +	}
> +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
> +	if (nl_sock_neigh < 0) {
> +		warn("Failed create RTMGRP_NEIGH socket");
> +		return -1;
> +	}
> +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		warn("Failed bind RTMGRP_NEIGH socket");
> +		return -1;
> +	}
> +
> +	ev.data.u64 = ref.u64;
> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket");
> +		return -1;
> +	}

We can probably simplify this a bit with the epoll_add() helper in one
of Laurent's draft patches.  I guess it just depends which lands first.

> +
> +	nl_neigh_sync(c, AF_INET, c->ifi4);
> +	nl_neigh_sync(c, AF_INET6, c->ifi6);
> +
> +	return 0;
> +}
> diff --git a/netlink.h b/netlink.h
> index b51e99c..8f1e9b9 100644
> --- a/netlink.h
> +++ b/netlink.h
> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>  		 int s_dst, unsigned int ifi_dst, sa_family_t af);
>  int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
>  		void *addr, int *prefix_len, void *addr_l);
> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
> +		      unsigned char *mac);
>  int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
>  		const void *addr, int prefix_len);
>  int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
>  int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
>  int nl_link_set_flags(int s, unsigned int ifi,
>  		      unsigned int set, unsigned int change);
> +int nl_neigh_notify_init(const struct ctx *c);
> +void nl_neigh_notify_handler(const struct ctx *c);
>  
>  #endif /* NETLINK_H */
> diff --git a/passt.c b/passt.c
> index 31fbb75..e21d6ba 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -53,6 +53,7 @@
>  #include "vu_common.h"
>  #include "migrate.h"
>  #include "repair.h"
> +#include "netlink.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_NL_NEIGH]		= "netlink neighbour subscription socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -322,6 +324,8 @@ int main(int argc, char **argv)
>  
>  	pcap_init(&c);
>  
> +	nl_neigh_notify_init(&c);
> +
>  	if (!c.foreground) {
>  		if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
>  			die_perror("Failed to open /dev/null");
> @@ -414,6 +418,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_NL_NEIGH:
> +			nl_neigh_notify_handler(&c);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> -- 
> 2.50.1
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-10-14  5:04 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-12 19:33 [PATCH v13 00/10] Use true MAC address of LAN local remote hosts Jon Maloy
2025-10-12 19:33 ` [PATCH v13 01/10] netlink: add subscription on changes in NDP/ARP table Jon Maloy
2025-10-14  4:39   ` David Gibson [this message]
2025-10-12 19:33 ` [PATCH v13 02/10] passt: add no_map_gw flag to struct ctx Jon Maloy
2025-10-14  4:42   ` David Gibson
2025-10-12 19:33 ` [PATCH v13 03/10] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-10-14  4:55   ` David Gibson
2025-10-12 19:33 ` [PATCH v13 04/10] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
2025-10-14  4:57   ` David Gibson
2025-10-12 19:33 ` [PATCH v13 05/10] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
2025-10-14  5:01   ` David Gibson
2025-10-12 19:33 ` [PATCH v13 06/10] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
2025-10-14  5:02   ` David Gibson
2025-10-12 19:33 ` [PATCH v13 07/10] udp: forward external source MAC address through tap interface Jon Maloy
2025-10-12 19:33 ` [PATCH v13 08/10] tcp: " Jon Maloy
2025-10-12 19:33 ` [PATCH v13 09/10] tap: change signature of function tap_push_l2h() Jon Maloy
2025-10-12 19:33 ` [PATCH v13 10/10] icmp: let icmp use mac address from flowside structure Jon Maloy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aO3T9wIBcROgN6rI@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).