On Wed, Sep 24, 2025 at 02:40:33PM -0400, Jon Maloy wrote:
> 
> 
> On 2025-09-23 22:47, David Gibson wrote:
> > On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
> > > The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> > > requires the ability to translate from an IP address to its
> > > corresponding MAC address in cases where those are present in
> > > the ARP or NDP tables.
> > > 
> > > To keep track of the contents of these tables we add a netlink
> > > based neighbour subscription feature.
> > > 
> > > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > The change to a subscription model is large enough that I think it
> > warrants dropping pre-existing reviews.
> > 
> > > ---
> > > v3: - Added an attribute contianing NDA_DST to sent message, so
> > >        that we let the kernel do the filtering of the IP address
> > >        and return only one entry.
> > >      - Added interface index to the call signature. Since the only
> > >        interface we know is the template interface, this limits
> > >        the number of hosts that will be seen as 'network segment
> > >        local' from a PASST viewpoint.
> > > v4: - Made loop independent of attribute order.
> > >      - Ignoring L2 addresses which are not of size ETH_ALEN.
> > > v5: - Changed return value of new function, so caller can know if
> > >        a MAC address really was found.
> > > v6: - Removed warning printout which had ended up in the wrong
> > >        commit.
> > > v8: - Changed to neighbour event subscription model
> > > 
> > > netlink: arp/ndp table subscription
> > > ---
> > >   epoll_type.h |   2 +
> > >   netlink.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >   netlink.h    |   4 ++
> > >   passt.c      |   8 ++++
> > >   4 files changed, 128 insertions(+)
> > > 
> > > diff --git a/epoll_type.h b/epoll_type.h
> > > index 12ac59b..a90ffb6 100644
> > > --- a/epoll_type.h
> > > +++ b/epoll_type.h
> > > @@ -44,6 +44,8 @@ enum epoll_type {
> > >   	EPOLL_TYPE_REPAIR_LISTEN,
> > >   	/* TCP_REPAIR helper socket */
> > >   	EPOLL_TYPE_REPAIR,
> > > +	/* Netlink neighbour subscription socket */
> > > +	EPOLL_TYPE_NL_NEIGH,
> > >   	EPOLL_NUM_TYPES,
> > >   };
> > > diff --git a/netlink.c b/netlink.c
> > > index c436780..1faf3da 100644
> > > --- a/netlink.c
> > > +++ b/netlink.c
> > > @@ -53,6 +53,7 @@
> > >   int nl_sock	= -1;
> > >   int nl_sock_ns	= -1;
> > >   static int nl_seq = 1;
> > > +static int nl_sock_neigh = -1;
> > >   /**
> > >    * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> > > @@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
> > >   	return 0;
> > >   }
> > > +/**
> > > + * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
> > > + *
> > > + * Return: 0 on success, -1 on failure
> > > + */
> > > +int nl_neigh_subscr_init(struct ctx *c)
> > > +{
> > > +	struct epoll_event ev = { 0 };
> > > +	union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };
> > 
> > You overwrite the .fd field, so there's not need to include it in the
> > initializer.  (Also 0 is a bad placeholder, since 0 is a valid fd).
> ok
> > 
> > > +
> > > +	struct sockaddr_nl addr = {
> > > +		.nl_family = AF_NETLINK,
> > > +		.nl_groups = RTMGRP_NEIGH,
> > > +	};
> > > +
> > > +	if (nl_sock_neigh >= 0)
> > > +		return 0;
> > > +
> > > +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
> > > +	if (nl_sock_neigh < 0)
> > > +		return -1;
> > > +
> > > +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> > > +		close(nl_sock_neigh);
> > > +		nl_sock_neigh = -1;
> > > +		return -1;
> > > +	}
> > > +
> > > +	ref.fd = nl_sock_neigh;
> > > +	ev.events = EPOLLIN;
> > 
> > You could set this in the initializer.
> ok
> > 
> > > +	ev.data.u64 = ref.u64;
> > > +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> > > +		close(nl_sock_neigh);
> > > +		nl_sock_neigh = -1;
> > > +		return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
> > > + * @c:	       Execution context
> > > + */
> > > +void nl_neigh_subscr_handler(struct ctx *c)
> > > +{
> > > +	struct nlmsghdr *nh;
> > > +	char buf[NLBUFSIZ];
> > > +	ssize_t n;
> > > +
> > > +	if (nl_sock_neigh < 0)
> > > +		return;
> > > +
> > > +	for (;;) {
> > > +		n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> > > +		if (n <= 0)
> > > +			return;
> > 
> > EAGAIN correctly results in a silent return, but you probably want to
> > report other errors and EOF (n == 0) since those are not expected.
> 
> Ok. But EOF is never returned from netlink.

Right.  Which is exactly why it's important to know if it somehow
happens (it could mean, for example that we somehow clobbered our
socket fd).

> I can issue a warn_perror() for
> all other negative return values than EAGAIN.
> 
> > 
> > > +
> > > +		nh = (struct nlmsghdr *)buf;
> > > +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
> > > +			struct ndmsg *ndm = NLMSG_DATA(nh);
> > > +			struct rtattr *rta = (struct rtattr *)(ndm + 1);
> > > +			size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> > > +			const uint8_t *lladdr = NULL;
> > > +			const void *dst = NULL;
> > > +			size_t lladdr_len = 0;
> > > +			size_t dstlen = 0;
> > > +
> > > +			if (nh->nlmsg_type == NLMSG_DONE ||
> > > +			    nh->nlmsg_type == NLMSG_ERROR)
> > > +				continue;
> > 
> > IIUC, NLMSG_ERROR would also be unexpected, so should probably be
> > reported.
> > 
> > > +			if (nh->nlmsg_type != RTM_NEWNEIGH &&
> > > +			    nh->nlmsg_type != RTM_DELNEIGH)
> > > +				continue;
> > 
> > Should we filter on ndm->ndm_ifindex for neighbours on the template
> > interface only?  I'm not sure.
> 
> Yes. I was wondering about this, because most events are state changes we
> aren't really interested in. My research indicates I can ignore all states
> except NUD_VALID (==up) and (INCOMPLETE | FAILED) (== down).
> This cannot be set in the subscription itself unless I want to involve eBPF
> (I don't), so the right place is to filter it here. That will
> save at least some cycles.

Ok, good.

> > It seems like we probably should filter on .
> > 
> > > +			for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> > > +				switch (rta->rta_type) {
> > > +				case NDA_DST:
> > > +					dst = RTA_DATA(rta);
> > > +					dstlen = RTA_PAYLOAD(rta);
> > > +					break;
> > > +				case NDA_LLADDR:
> > > +					lladdr = RTA_DATA(rta);
> > > +					lladdr_len = RTA_PAYLOAD(rta);
> > > +					break;
> > > +				default:
> > > +					break;
> > > +				}
> > > +			}
> > 
> > It seems like somewhere there ought to be something specifying the
> > address type of both DST and LLADDR, but I'm not sure exactly where.
> > We should check those, just in case some weirdo runs us on a host with
> > IPX over Token Ring.
> 
> There is (ndm_family == AF_INET/6) for DST, and (lladdr_len == ETH_ALEN).
> The latter sounds like a good idea, while the fist sounds slightly paranoid.
> OK, we do have AF_TIPC when giving it a second thought ;-)

Right.  I really like to actually check assumptions like this.
Checking lladdr_len isn't technically correct, since in theory it
could be a link that has 6-byte addresses that aren't Ethernet MACs.
Or even, theoretically, a link with variable length addresses, and
this one just happens to have 6 bytes.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson