From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202510 header.b=frLmK1Hl; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 31F6E5A061B for ; Tue, 14 Oct 2025 07:04:51 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202510; t=1760418288; bh=RzcKM359n0yKKLgeqLeFZp5lcO9ExQTtOxZNkkHMT3k=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=frLmK1HlLdQbZEArzrgCvlBmd4jsuawHIdcB+A28V4yKnew4evrs8pQDorNoi8+1F qZzTWThZa+MfsGNmzTKqF8FzyGBCtPwu6G0iaCv9sufyCATceRaY+4ggk5lfm61VrA JcomTd4SxWWcz9jrRxbF9PxdgQSFr6QyzxCTNxcYuHrZelfqOUKoS0ccnt/6W8NgMP h2yXGGiv2TvtKcYHB49lKddVY+Vn652tWqgARNwnbL8Cx174BCppP13+uiFLCdSC7/ WikgKD7/pbmrvxDwzGyBmTjib/0YtZYrElzAJiWveGIxpAwlVmkuTwXPfGfx+SJCNY SV954HDMN3XfQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4cm2Fr0cdFz4wCg; Tue, 14 Oct 2025 16:04:48 +1100 (AEDT) Date: Tue, 14 Oct 2025 15:39:19 +1100 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v13 01/10] netlink: add subscription on changes in NDP/ARP table Message-ID: References: <20251012193337.616835-1-jmaloy@redhat.com> <20251012193337.616835-2-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="RITthKCwcaNAD2LG" Content-Disposition: inline In-Reply-To: <20251012193337.616835-2-jmaloy@redhat.com> Message-ID-Hash: ZBTRJP6G37HJ3OCLF6LAIQ44SK7IJHXF X-Message-ID-Hash: ZBTRJP6G37HJ3OCLF6LAIQ44SK7IJHXF X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --RITthKCwcaNAD2LG Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Oct 12, 2025 at 03:33:28PM -0400, Jon Maloy wrote: > The solution to bug https://bugs.passt.top/show_bug.cgi?id=3D120 > requires the ability to translate from an IP address to its > corresponding MAC address in cases where those are present in > the ARP or NDP tables. >=20 > To keep track of the contents of these tables we add a netlink > based neighbour subscription feature. >=20 > Signed-off-by: Jon Maloy One important comment, several nits. > --- > v3: - Added an attribute contianing NDA_DST to sent message, so > that we let the kernel do the filtering of the IP address > and return only one entry. > - Added interface index to the call signature. Since the only > interface we know is the template interface, this limits > the number of hosts that will be seen as 'network segment > local' from a PASST viewpoint. > v4: - Made loop independent of attribute order. > - Ignoring L2 addresses which are not of size ETH_ALEN. > v5: - Changed return value of new function, so caller can know if > a MAC address really was found. > v6: - Removed warning printout which had ended up in the wrong > commit. > v8: - Changed to neighbour event subscription model > - netlink: arp/ndp table subscription > v10:- Updated according to David's latest comments on v8 > - Added functionaly where we initially read current > state of ARP/NDP tables > v12:- Updates based on feedback from David and Stefano > v13:- Updates based on feedback from David and Stefano > --- > epoll_type.h | 2 + > netlink.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++++- > netlink.h | 4 + > passt.c | 7 ++ > 4 files changed, 224 insertions(+), 4 deletions(-) >=20 > diff --git a/epoll_type.h b/epoll_type.h > index 12ac59b..a90ffb6 100644 > --- a/epoll_type.h > +++ b/epoll_type.h > @@ -44,6 +44,8 @@ enum epoll_type { > EPOLL_TYPE_REPAIR_LISTEN, > /* TCP_REPAIR helper socket */ > EPOLL_TYPE_REPAIR, > + /* Netlink neighbour subscription socket */ > + EPOLL_TYPE_NL_NEIGH, > =20 > EPOLL_NUM_TYPES, > }; > diff --git a/netlink.c b/netlink.c > index 9fe70f2..99dcb72 100644 > --- a/netlink.c > +++ b/netlink.c > @@ -26,6 +26,7 @@ > #include > #include > #include > +#include > =20 > #include > #include > @@ -40,6 +41,10 @@ > #define RTNH_NEXT_AND_DEC(rtnh, attrlen) \ > ((attrlen) -=3D RTNH_ALIGN((rtnh)->rtnh_len), RTNH_NEXT(rtnh)) > =20 > +/* Convenience macro borrowed from kernel */ > +#define NUD_VALID \ > + (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | NUD_PROBE | NUD_STALE) > + > /* Netlink expects a buffer of at least 8kiB or the system page size, > * whichever is larger. 32kiB is recommended for more efficient. > * Since the largest page size on any remotely common Linux setup is > @@ -50,9 +55,10 @@ > #define NLBUFSIZ 65536 > =20 > /* Socket in init, in target namespace, sequence (just needs to be monot= onic) */ > -int nl_sock =3D -1; > -int nl_sock_ns =3D -1; > -static int nl_seq =3D 1; > +int nl_sock =3D -1; > +int nl_sock_ns =3D -1; > +static int nl_sock_neigh =3D -1; > +static int nl_seq =3D 1; > =20 > /** > * nl_sock_init_do() - Set up netlink sockets in init or target namespace > @@ -325,7 +331,6 @@ unsigned int nl_get_ext_if(int s, sa_family_t af) > =20 > if (status < 0) > warn("netlink: RTM_GETROUTE failed: %s", strerror_(-status)); > - Nit: unrelated change. > if (defifi) { > if (ndef > 1) { > info("Multiple default %s routes, picked first", > @@ -1103,3 +1108,205 @@ int nl_link_set_flags(int s, unsigned int ifi, > =20 > return nl_do(s, &req, RTM_NEWLINK, 0, sizeof(req)); > } > + > +/** > + * nl_neigh_msg_read() - Interpret a neigbour state message from netlink > + * @c: Execution context > + * @nh: Message to be read > + */ > +static void nl_neigh_msg_read(const struct ctx *c, struct nlmsghdr *nh) > +{ > + struct ndmsg *ndm =3D NLMSG_DATA(nh); > + struct rtattr *rta =3D (struct rtattr *)(ndm + 1); > + size_t na =3D NLMSG_PAYLOAD(nh, sizeof(*ndm)); > + char ip_str[INET6_ADDRSTRLEN]; > + char mac_str[ETH_ADDRSTRLEN]; > + const uint8_t *lladdr =3D NULL; > + const void *dst =3D NULL; > + size_t lladdr_len =3D 0; > + uint8_t mac[ETH_ALEN]; > + union inany_addr addr; > + size_t dstlen =3D 0; > + > + if (nh->nlmsg_type =3D=3D NLMSG_DONE) > + return; > + > + if (nh->nlmsg_type =3D=3D NLMSG_ERROR) { > + warn("nlmsg_type error at msg read"); > + return; > + } > + > + if (nh->nlmsg_type !=3D RTM_NEWNEIGH && > + nh->nlmsg_type !=3D RTM_DELNEIGH) > + return; > + > + for (; RTA_OK(rta, na); rta =3D RTA_NEXT(rta, na)) { > + switch (rta->rta_type) { > + case NDA_DST: > + dst =3D RTA_DATA(rta); > + dstlen =3D RTA_PAYLOAD(rta); > + break; > + case NDA_LLADDR: > + lladdr =3D RTA_DATA(rta); > + lladdr_len =3D RTA_PAYLOAD(rta); > + break; > + default: > + break; > + } > + } > + > + if (!dst) > + return; > + > + Nit: extra newline. > + if (ndm->ndm_family =3D=3D AF_INET && > + ndm->ndm_ifindex !=3D c->ifi4) > + return; > + > + if (ndm->ndm_family =3D=3D AF_INET6 && > + ndm->ndm_ifindex !=3D c->ifi6) > + return; > + > + if (ndm->ndm_family !=3D AF_INET && > + ndm->ndm_family !=3D AF_INET6) > + return; > + > + if (ndm->ndm_family =3D=3D AF_INET && > + dstlen !=3D sizeof(struct in_addr)) { > + warn("netlink: wrong address length in AF_INET notification"); > + return; > + } > + if (ndm->ndm_family =3D=3D AF_INET6 && > + dstlen !=3D sizeof(struct in6_addr)) { > + warn("netlink: wrong address length in AF_INET6 notification"); > + return; > + } > + inany_from_af(&addr, ndm->ndm_family, dst); > + inany_ntop(dst, ip_str, sizeof(ip_str)); > + > + if (nh->nlmsg_type =3D=3D RTM_DELNEIGH) { > + trace("neigh table delete: %s", ip_str); > + return; > + } > + if (!(ndm->ndm_state & NUD_VALID)) { > + trace("neigh table: invalid state for %s", ip_str); > + return; > + } > + if (nh->nlmsg_type !=3D RTM_NEWNEIGH || !lladdr) { > + warn("netlink: notification with illegal msg for %s", ip_str); > + return; > + } > + if (lladdr_len !=3D ETH_ALEN || ndm->ndm_type !=3D ARPHRD_ETHER) > + return; > + > + memcpy(mac, lladdr, ETH_ALEN); > + eth_ntop(mac, mac_str, sizeof(mac_str)); > + trace("neigh table update: %s / %s", ip_str, mac_str); > +} > + > +/** > + * nl_neigh_sync() - Read current contents ARP/NDP tables > + * @c: Execution context > + * @proto: Protocol, AF_INET or AF_INET6 > + * @ifi: Interface index > + */ > +static void nl_neigh_sync(const struct ctx *c, int proto, int ifi) > +{ > + struct { > + struct nlmsghdr nlh; > + struct ndmsg ndm; > + } req =3D { > + .nlh =3D {0}, > + .ndm.ndm_family =3D proto, > + .ndm.ndm_ifindex =3D ifi, > + .ndm.ndm_state =3D 0, > + .ndm.ndm_flags =3D 0, > + .ndm.ndm_type =3D 0 Nit: unspecified initializer fields are assumed to be 0 (as long as there's at least one initializer field). > + }; > + struct nlmsghdr *nh; > + char buf[NLBUFSIZ]; > + ssize_t status; > + uint32_t seq; > + > + seq =3D nl_send(nl_sock_neigh, &req, RTM_GETNEIGH, NLM_F_DUMP, sizeof(r= eq)); > + nl_foreach_oftype(nh, status, nl_sock_neigh, buf, seq, RTM_NEWNEIGH) > + nl_neigh_msg_read(c, nh); > + if (status < 0) > + warn("netlink: RTM_GETNEIGH failed: %s", strerror_(-status)); > +} > + > +/** > + * nl_neigh_notify_handler() - Non-blocking drain of pending neighbour u= pdates > + * @c: Execution context > + */ > +void nl_neigh_notify_handler(const struct ctx *c) > +{ > + char buf[NLBUFSIZ]; > + > + for (;;) { > + ssize_t n =3D recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT); > + struct nlmsghdr *nh =3D (struct nlmsghdr *)buf; > + > + if (n < 0) { > + struct nlmsgerr *errmsg; > + > + if (nh->nlmsg_type !=3D NLMSG_ERROR) > + return; This isn't right. It looks like you're conflating errors reported by recv() with errors reported as netlink messages. If n < 0, it means recv() has failed, and nh->nlmsg_type is uninitialized. recv() succeeding, but the buffer containing an NLMSG_ERROR is a different case. There are two different semantic levels, each of which reports errors in its own way, each needs to be handled separately. > + errmsg =3D (struct nlmsgerr *)NLMSG_DATA(nh); > + warn("netlink: recv() failed: error %i", errmsg->error); > + return; > + } > + for (; NLMSG_OK(nh, n); nh =3D NLMSG_NEXT(nh, n)) > + nl_neigh_msg_read(c, nh); > + } > +} > + > +/** > + * nl_neigh_notify_init() - Open netlink sock and subscribe to neighbour= events > + * @c: Execution context > + * > + * Return: 0 on success, -1 on failure > + */ > +int nl_neigh_notify_init(const struct ctx *c) > +{ > + union epoll_ref ref =3D { > + .type =3D EPOLL_TYPE_NL_NEIGH > + }; > + struct epoll_event ev =3D { > + .events =3D EPOLLIN > + }; > + struct sockaddr_nl addr =3D { > + .nl_family =3D AF_NETLINK, > + .nl_groups =3D RTMGRP_NEIGH, > + }; > + > + if (nl_sock_neigh >=3D 0) { > + warn("RTMGRP_NEIGH already exists"); > + return 0; > + } > + nl_sock_neigh =3D socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_R= OUTE); > + if (nl_sock_neigh < 0) { > + warn("Failed create RTMGRP_NEIGH socket"); > + return -1; > + } > + if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) { > + close(nl_sock_neigh); > + nl_sock_neigh =3D -1; > + warn("Failed bind RTMGRP_NEIGH socket"); > + return -1; > + } > + > + ev.data.u64 =3D ref.u64; > + if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) =3D=3D -1)= { > + close(nl_sock_neigh); > + nl_sock_neigh =3D -1; > + warn("Failed do epoll_ctrl() on RTMGRP_NEIGH socket"); > + return -1; > + } We can probably simplify this a bit with the epoll_add() helper in one of Laurent's draft patches. I guess it just depends which lands first. > + > + nl_neigh_sync(c, AF_INET, c->ifi4); > + nl_neigh_sync(c, AF_INET6, c->ifi6); > + > + return 0; > +} > diff --git a/netlink.h b/netlink.h > index b51e99c..8f1e9b9 100644 > --- a/netlink.h > +++ b/netlink.h > @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > int s_dst, unsigned int ifi_dst, sa_family_t af); > int nl_addr_get(int s, unsigned int ifi, sa_family_t af, > void *addr, int *prefix_len, void *addr_l); > +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi, > + unsigned char *mac); > int nl_addr_set(int s, unsigned int ifi, sa_family_t af, > const void *addr, int prefix_len); > int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr); > @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void= *mac); > int nl_link_set_mtu(int s, unsigned int ifi, int mtu); > int nl_link_set_flags(int s, unsigned int ifi, > unsigned int set, unsigned int change); > +int nl_neigh_notify_init(const struct ctx *c); > +void nl_neigh_notify_handler(const struct ctx *c); > =20 > #endif /* NETLINK_H */ > diff --git a/passt.c b/passt.c > index 31fbb75..e21d6ba 100644 > --- a/passt.c > +++ b/passt.c > @@ -53,6 +53,7 @@ > #include "vu_common.h" > #include "migrate.h" > #include "repair.h" > +#include "netlink.h" > =20 > #define EPOLL_EVENTS 8 > =20 > @@ -79,6 +80,7 @@ char *epoll_type_str[] =3D { > [EPOLL_TYPE_VHOST_KICK] =3D "vhost-user kick socket", > [EPOLL_TYPE_REPAIR_LISTEN] =3D "TCP_REPAIR helper listening socket", > [EPOLL_TYPE_REPAIR] =3D "TCP_REPAIR helper socket", > + [EPOLL_TYPE_NL_NEIGH] =3D "netlink neighbour subscription socket", > }; > static_assert(ARRAY_SIZE(epoll_type_str) =3D=3D EPOLL_NUM_TYPES, > "epoll_type_str[] doesn't match enum epoll_type"); > @@ -322,6 +324,8 @@ int main(int argc, char **argv) > =20 > pcap_init(&c); > =20 > + nl_neigh_notify_init(&c); > + > if (!c.foreground) { > if ((devnull_fd =3D open("/dev/null", O_RDWR | O_CLOEXEC)) < 0) > die_perror("Failed to open /dev/null"); > @@ -414,6 +418,9 @@ loop: > case EPOLL_TYPE_REPAIR: > repair_handler(&c, eventmask); > break; > + case EPOLL_TYPE_NL_NEIGH: > + nl_neigh_notify_handler(&c); > + break; > default: > /* Can't happen */ > ASSERT(0); > --=20 > 2.50.1 >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --RITthKCwcaNAD2LG Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmjt0/YACgkQzQJF27ox 2Gem8Q/+Mu/92gYPHCPUKt7QFmEr3OD79lpQKUno7M99JbaByiYXCkthrB8nWwAn OoT2+yN3UB10NgJXsEXC5WLjQVVRh2Q0GXabUjNoGIXHt2GxleQ0226VLcmirJQs GfpYYbj1aPY+Amlce/d1PlT+87Il2fPFf4DmoRXqU+DEbIu3ipe4PFFPB+6nwJpd otw+TwG9KepSv5b2ByoLGzfVgPPnkx28TbNhfoM207HkPG8qDVFg3iPH/JbYlEYM h3w40qVVIY6qDjYvIW3b4xPD9DfYwV8kl4ufxhaUq2EwHvR5EGhxAEClfm5lT8o4 5C7nT/rNVy/PiqwOfFltwN3ILtlDMNWeMOE10YR+392/uXF9qZSdJd1JOuxbTBYE AAdjZI5XjC5FWZHXOfVonCKLwVsfMYnAljE597eBiLdwK6DOGS6+raeu8L6Jc77Q mQh+d7YVR3q82SNo6jvE+gZrqws//vJXAjWsUQgo/Jc6RUJlDTNRMusROjaf7Ae7 TfmLQHskfyH1XOsDJHs39GmFpa7bvdoesLKj4dSdqtYNquM1ANug4OlTrsJ+NcSf TO5H36jQsngxUCVK8oYiT9qIyRqYEfwqXHQdEn+GyQQkuNu6vga+AlRPCQRpCdbd mFHsLnE+OmCjPlO/3GDdbwiWAJIymiXyFBVlaTSqEaHkeSFxiMc= =3qA3 -----END PGP SIGNATURE----- --RITthKCwcaNAD2LG--