public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: dgibson@redhat.com, david@gibson.dropbear.id.au, passt-dev@passt.top
Subject: Re: [PATCH v14 01/10] netlink: add subscription on changes in NDP/ARP table
Date: Sun, 19 Oct 2025 12:07:30 +0200	[thread overview]
Message-ID: <20251019120730.01c2cf4f@elisabeth> (raw)
In-Reply-To: <20251015025521.1449156-2-jmaloy@redhat.com>

On Tue, 14 Oct 2025 22:55:12 -0400
Jon Maloy <jmaloy@redhat.com> wrote:

> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
> requires the ability to translate from an IP address to its
> corresponding MAC address in cases where those are present in
> the ARP or NDP tables.
> 
> To keep track of the contents of these tables we add a netlink
> based neighbour subscription feature.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> 
> ---
> v3: - Added an attribute contianing NDA_DST to sent message, so
>       that we let the kernel do the filtering of the IP address
>       and return only one entry.
>     - Added interface index to the call signature. Since the only
>       interface we know is the template interface, this limits
>       the number of hosts that will be seen as 'network segment
>       local' from a PASST viewpoint.
> v4: - Made loop independent of attribute order.
>     - Ignoring L2 addresses which are not of size ETH_ALEN.
> v5: - Changed return value of new function, so caller can know if
>       a MAC address really was found.
> v6: - Removed warning printout which had ended up in the wrong
>       commit.
> v8: - Changed to neighbour event subscription model
>     - netlink: arp/ndp table subscription
> v10:- Updated according to David's latest comments on v8
>     - Added functionaly where we initially read current
>       state of ARP/NDP tables
> v12:- Updates based on feedback from David and Stefano
> v13:- Updates based on feedback from David and Stefano
> v14:- Updates based on feedback from David
> ---
>  epoll_type.h |   2 +
>  netlink.c    | 209 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  netlink.h    |   4 +
>  passt.c      |   7 ++
>  4 files changed, 219 insertions(+), 3 deletions(-)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..a90ffb6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* Netlink neighbour subscription socket */
> +	EPOLL_TYPE_NL_NEIGH,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/netlink.c b/netlink.c
> index 9fe70f2..186383c 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -26,6 +26,7 @@
>  #include <arpa/inet.h>
>  #include <netinet/in.h>
>  #include <netinet/if_ether.h>
> +#include <net/if_arp.h>
>  
>  #include <linux/netlink.h>
>  #include <linux/rtnetlink.h>
> @@ -40,6 +41,10 @@
>  #define RTNH_NEXT_AND_DEC(rtnh, attrlen)				\
>  	((attrlen) -= RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh))
>  
> +/* Convenience macro borrowed from kernel */
> +#define NUD_VALID				\
> +	(NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE)
> +
>  /* Netlink expects a buffer of at least 8kiB or the system page size,
>   * whichever is larger.  32kiB is recommended for more efficient.
>   * Since the largest page size on any remotely common Linux setup is
> @@ -50,9 +55,10 @@
>  #define NLBUFSIZ 65536
>  
>  /* Socket in init, in target namespace, sequence (just needs to be monotonic) */
> -int nl_sock	= -1;
> -int nl_sock_ns	= -1;
> -static int nl_seq = 1;
> +int nl_sock		 = -1;
> +int nl_sock_ns		 = -1;
> +static int nl_sock_neigh = -1;
> +static int nl_seq	 = 1;
>  
>  /**
>   * nl_sock_init_do() - Set up netlink sockets in init or target namespace
> @@ -1103,3 +1109,200 @@ int nl_link_set_flags(int s, unsigned int ifi,
>  
>  	return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req));
>  }
> +
> +/**
> + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink

For some reason I missed this on both v11 and v12: neighbour.

I can fix it up on merge. I might also shut up, you might say, but:

- I grep for things in the code, typos make it hard

- typos invite "clean-up" patches that are just code churn and make git
  logs and blames harder to follow (while perhaps still better than
  leaving typos behind, I think)

> + * @c:		Execution context
> + * @nh:	Message to be read
> + */
> +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh)
> +{
> +	struct ndmsg *ndm = NLMSG_DATA(nh);
> +	struct rtattr *rta = (struct rtattr *)(ndm + 1);
> +	size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
> +	char ip_str[INET6_ADDRSTRLEN];
> +	char mac_str[ETH_ADDRSTRLEN];
> +	const uint8_t *lladdr = NULL;
> +	const void *dst = NULL;
> +	size_t lladdr_len = 0;
> +	uint8_t mac[ETH_ALEN];
> +	union inany_addr addr;
> +	size_t dstlen = 0;
> +
> +	if (nh->nlmsg_type == NLMSG_DONE)
> +		return;
> +
> +	if (nh->nlmsg_type == NLMSG_ERROR) {
> +		struct nlmsgerr *errmsg = (struct nlmsgerr *)ndm;
> +
> +		warn("neigh_msg_read() failed, error %i", errmsg->error);

But neigh_msg_read() isn't a function name (it's nl_neigh_msg_read()).
To avoid that, if you really need the function name, I'd suggest %s and
__func__.

However, here, I don't think you can say that the function failed. It's
just that we got an error from netlink. The function didn't do almost
anything. Maybe just "netlink error on neighbour notifier: ..."?

By the way of which, usually, we print the error name with _strerror():
note that error in struct nlmsgerr is errno-like, so you can use that
instead of printing a number that perhaps not everybody is familiar
with.

> +		return;
> +	}
> +
> +	if (nh->nlmsg_type != RTM_NEWNEIGH &&
> +	    nh->nlmsg_type != RTM_DELNEIGH)

This fits on a single line now (same here, I missed this on both v11
and v12, sorry).

> +		return;
> +
> +	for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
> +		switch (rta->rta_type) {
> +		case NDA_DST:
> +			dst = RTA_DATA(rta);
> +			dstlen = RTA_PAYLOAD(rta);
> +			break;
> +		case NDA_LLADDR:
> +			lladdr = RTA_DATA(rta);
> +			lladdr_len = RTA_PAYLOAD(rta);
> +			break;
> +		default:
> +			break;

With 'if' / 'else if' you save three lines, but not a strong
preference, maybe not even a preference at all.

> +		}
> +	}
> +
> +	if (!dst)
> +		return;
> +
> +	if (ndm->ndm_family == AF_INET &&
> +	     ndm->ndm_ifindex != c->ifi4)
> +		return;
> +
> +	if (ndm->ndm_family == AF_INET6 &&
> +	     ndm->ndm_ifindex != c->ifi6)
> +		return;
> +
> +	if (ndm->ndm_family != AF_INET &&
> +	    ndm->ndm_family != AF_INET6)
> +		return;
> +
> +	if (ndm->ndm_family == AF_INET &&
> +	    dstlen != sizeof(struct in_addr)) {
> +		warn("netlink: wrong address length in AF_INET notification");
> +		return;
> +	}
> +	if (ndm->ndm_family == AF_INET6 &&
> +	    dstlen != sizeof(struct in6_addr)) {

All these five conditions fit on a single line (missed on v12, but it
wasn't the case on v11).

> +		warn("netlink: wrong address length in AF_INET6 notification");
> +		return;
> +	}
> +	inany_from_af(&addr, ndm->ndm_family, dst);
> +	inany_ntop(dst, ip_str, sizeof(ip_str));
> +
> +	if (nh->nlmsg_type == RTM_DELNEIGH) {
> +		trace("neigh table delete: %s", ip_str);
> +		return;
> +	}
> +	if (!(ndm->ndm_state & NUD_VALID)) {
> +		trace("neigh table: invalid state for %s", ip_str);

It's not an invalid state. It's a state that represents that the
neighbour information lost its validity, but the state is not invalid
(that would look like an error to anybody who hasn't read the code).

Maybe something like "neighbour table: %s unreachable, state: 0x%02x"?

> +		return;
> +	}
> +	if (nh->nlmsg_type != RTM_NEWNEIGH || !lladdr) {

nh->nlmsg_type can only be RTM_NEWNEIGH at this point, because you have
two early returns for:

- nh->nlmsg_type != RTM_NEWNEIGH && nh->nlmsg_type != RTM_DELNEIGH

- nh->nlmsg_type == RTM_DELNEIGH

> +		warn("netlink: notification with illegal msg for %s", ip_str);

...meaning that you can finally be a bit more descriptive about what's
"illegal" in the notification ("missing link-layer address"?).

> +		return;
> +	}
> +	if (lladdr_len != ETH_ALEN || ndm->ndm_type != ARPHRD_ETHER)
> +		return;
> +
> +	memcpy(mac, lladdr, ETH_ALEN);
> +	eth_ntop(mac, mac_str, sizeof(mac_str));

I read 3/10, but I don't understand anyway: why do we need a temporary
'mac' variable? Can't you use 'lladdr' directly?

> +	trace("neigh table update: %s / %s", ip_str, mac_str);
> +}
> +
> +/**
> + * nl_neigh_sync() - Read current contents ARP/NDP tables

Is that a... reverse Saxon Genitive? :) ... content of ...

> + * @c:		Execution context
> + * @proto:	Protocol, AF_INET or AF_INET6
> + * @ifi:	Interface index
> + */
> +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi)
> +{
> +	struct {
> +		struct nlmsghdr nlh;
> +		struct ndmsg    ndm;
> +	} req = {
> +		.nlh = {0},

We always leave spaces inside curly brackets (K&R / kernel style). By
the way, I think David already suggested in another case that you don't
need to initialise stuff explicitly to zero. It's also the case here.

> +		.ndm.ndm_family  = proto,
> +		.ndm.ndm_ifindex = ifi,
> +	};
> +	struct nlmsghdr *nh;
> +	char buf[NLBUFSIZ];
> +	ssize_t status;
> +	uint32_t seq;
> +
> +	seq = nl_send(nl_sock_neigh, &req, RTM_GETNEIGH,
> +		      NLM_F_DUMP, sizeof(req));
> +	nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH)
> +		nl_neigh_msg_read(c, nh);
> +	if (status < 0)
> +		warn("netlink: RTM_GETNEIGH failed: %s", strerror_(-status));
> +}
> +
> +/**
> + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour updates
> + * @c:		Execution context
> + */
> +void nl_neigh_notify_handler(const struct ctx *c)
> +{
> +	char buf[NLBUFSIZ];
> +
> +	for (;;) {
> +		ssize_t n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
> +		struct nlmsghdr *nh = (struct nlmsghdr *)buf;
> +
> +		if (n < 0) {
> +			if (errno != EAGAIN && errno != EINTR)

On EINTR, shouldn't we re-do the recv() call? We usually do that,
because it's actually the right thing to do. See a concise example in
vu_message_read_default(), vhost_user.c.

> +				warn_perror("nl_neigh_notify sock read error");

Same as above with function names: there's no such function as
nl_neigh_notify() and anyway it's not very descriptive to users.

What about "netlink notifier read error"?

> +			return;
> +		}
> +		for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n))
> +			nl_neigh_msg_read(c, nh);
> +	}
> +}
> +
> +/**
> + * nl_neigh_notify_init() - Subscribe to neighbour events
> + * @c:		Execution context
> + *
> + * Return: 0 on success, -1 on failure
> + */
> +int nl_neigh_notify_init(const struct ctx *c)
> +{
> +	union epoll_ref ref = {
> +		.type = EPOLL_TYPE_NL_NEIGH
> +	};
> +	struct epoll_event ev = {
> +		.events = EPOLLIN
> +	};
> +	struct sockaddr_nl addr = {
> +		.nl_family = AF_NETLINK,
> +		.nl_groups = RTMGRP_NEIGH,
> +	};
> +
> +	if (nl_sock_neigh >= 0) {
> +		warn("RTMGRP_NEIGH already exists");

Strictly speaking, the constant existed for quite some years already.
What already and unexpectedly exists here is the notifier socket.

> +		return 0;
> +	}
> +	nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC,

Hint: extra newlines don't account for lines of code using cloc(1). :)

> +			       NETLINK_ROUTE);
> +	if (nl_sock_neigh < 0) {
> +		warn("Failed create RTMGRP_NEIGH socket");

Same here, it's a notifier socket.

> +		return -1;
> +	}
> +	if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		warn("Failed bind RTMGRP_NEIGH socket");

If you move this to just after bind(), you can conveniently call
warn_perror() and also explain why it failed for free.

It's a notifier socket, and a preposition is missing.

> +		return -1;
> +	}
> +
> +	ev.data.u64 = ref.u64;
> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
> +		close(nl_sock_neigh);
> +		nl_sock_neigh = -1;
> +		warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket");

Same as for bind(). epoll_ctrl() doesn't exist, by the way.

> +		return -1;
> +	}
> +
> +	nl_neigh_sync(c, AF_INET, c->ifi4);
> +	nl_neigh_sync(c, AF_INET6, c->ifi6);
> +
> +	return 0;
> +}
> diff --git a/netlink.h b/netlink.h
> index b51e99c..8f1e9b9 100644
> --- a/netlink.h
> +++ b/netlink.h
> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>  		 int s_dst, unsigned int ifi_dst, sa_family_t af);
>  int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
>  		void *addr, int *prefix_len, void *addr_l);
> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
> +		      unsigned char *mac);
>  int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
>  		const void *addr, int prefix_len);
>  int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
>  int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
>  int nl_link_set_flags(int s, unsigned int ifi,
>  		      unsigned int set, unsigned int change);
> +int nl_neigh_notify_init(const struct ctx *c);
> +void nl_neigh_notify_handler(const struct ctx *c);
>  
>  #endif /* NETLINK_H */
> diff --git a/passt.c b/passt.c
> index 31fbb75..e21d6ba 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -53,6 +53,7 @@
>  #include "vu_common.h"
>  #include "migrate.h"
>  #include "repair.h"
> +#include "netlink.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_NL_NEIGH]		= "netlink neighbour subscription socket",

I already commented on this in v11: even if you subscribe using the
socket, it's functionally a notifier (or notification?) socket.

In general, I know it takes a bit longer but I wonder if it's worth the
effort: quote from:

  https://docs.kernel.org/process/6.Followthrough.html#working-with-reviewers

  Andrew Morton has suggested that every review comment which does not
  result in a code change should result in an additional code comment instead;
  that can help future reviewers avoid the questions which came up the first
  time around.

David and myself do that very consistently, and I think it helps
keeping the number of revisions down. It takes time and sometimes
nerves but the return on the investment is substantial.

I don't want to bother people more than I do (yes, I think it's
possible), so I'm not proposing that we enforce that, but maybe it's
something you should consider?

>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -322,6 +324,8 @@ int main(int argc, char **argv)
>  
>  	pcap_init(&c);
>  
> +	nl_neigh_notify_init(&c);
> +
>  	if (!c.foreground) {
>  		if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
>  			die_perror("Failed to open /dev/null");
> @@ -414,6 +418,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_NL_NEIGH:
> +			nl_neigh_notify_handler(&c);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);

-- 
Stefano


  parent reply	other threads:[~2025-10-19 10:07 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-15  2:55 [PATCH v14 00/10] Use true MAC address of LAN local remote hosts Jon Maloy
2025-10-15  2:55 ` [PATCH v14 01/10] netlink: add subscription on changes in NDP/ARP table Jon Maloy
2025-10-17  2:36   ` David Gibson
2025-10-19 10:07   ` Stefano Brivio [this message]
2025-10-20  0:17     ` David Gibson
2025-10-15  2:55 ` [PATCH v14 02/10] passt: add no_map_gw flag to struct ctx Jon Maloy
2025-10-19 10:07   ` Stefano Brivio
2025-10-15  2:55 ` [PATCH v14 03/10] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-10-17  3:05   ` David Gibson
2025-10-17 18:49     ` Jon Maloy
2025-10-20  0:06       ` David Gibson
2025-10-20 10:00         ` Jon Maloy
2025-10-22  1:20           ` David Gibson
2025-10-19 10:07   ` Stefano Brivio
2025-10-15  2:55 ` [PATCH v14 04/10] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
2025-10-17  3:06   ` David Gibson
2025-10-15  2:55 ` [PATCH v14 05/10] arp/ndp: send ARP announcement / unsolicited NA when neigbour entry added Jon Maloy
2025-10-17  3:08   ` David Gibson
2025-10-19 10:08   ` Stefano Brivio
2025-10-15  2:55 ` [PATCH v14 06/10] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
2025-10-15  2:55 ` [PATCH v14 07/10] udp: forward external source MAC address through tap interface Jon Maloy
2025-10-15  2:55 ` [PATCH v14 08/10] tcp: " Jon Maloy
2025-10-15  2:55 ` [PATCH v14 09/10] tap: change signature of function tap_push_l2h() Jon Maloy
2025-10-15  2:55 ` [PATCH v14 10/10] icmp: let icmp use mac address from flowside structure Jon Maloy
2025-10-19 10:08   ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251019120730.01c2cf4f@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).