From: Jon Maloy <jmaloy@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: sbrivio@redhat.com, dgibson@redhat.com, passt-dev@passt.top
Subject: Re: [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table
Date: Wed, 24 Sep 2025 14:40:33 -0400 [thread overview]
Message-ID: <87aafa5b-bc1c-4fe2-808d-3a7514ab11ce@redhat.com> (raw)
In-Reply-To: <aNNbusB_MV4Oxp1_@zatzit>
On 2025-09-23 22:47, David Gibson wrote:
> On Tue, Sep 23, 2025 at 09:13:22PM -0400, Jon Maloy wrote:
>> The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
>> requires the ability to translate from an IP address to its
>> corresponding MAC address in cases where those are present in
>> the ARP or NDP tables.
>>
>> To keep track of the contents of these tables we add a netlink
>> based neighbour subscription feature.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> The change to a subscription model is large enough that I think it
> warrants dropping pre-existing reviews.
>
>> ---
>> v3: - Added an attribute contianing NDA_DST to sent message, so
>> that we let the kernel do the filtering of the IP address
>> and return only one entry.
>> - Added interface index to the call signature. Since the only
>> interface we know is the template interface, this limits
>> the number of hosts that will be seen as 'network segment
>> local' from a PASST viewpoint.
>> v4: - Made loop independent of attribute order.
>> - Ignoring L2 addresses which are not of size ETH_ALEN.
>> v5: - Changed return value of new function, so caller can know if
>> a MAC address really was found.
>> v6: - Removed warning printout which had ended up in the wrong
>> commit.
>> v8: - Changed to neighbour event subscription model
>>
>> netlink: arp/ndp table subscription
>> ---
>> epoll_type.h | 2 +
>> netlink.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> netlink.h | 4 ++
>> passt.c | 8 ++++
>> 4 files changed, 128 insertions(+)
>>
>> diff --git a/epoll_type.h b/epoll_type.h
>> index 12ac59b..a90ffb6 100644
>> --- a/epoll_type.h
>> +++ b/epoll_type.h
>> @@ -44,6 +44,8 @@ enum epoll_type {
>> EPOLL_TYPE_REPAIR_LISTEN,
>> /* TCP_REPAIR helper socket */
>> EPOLL_TYPE_REPAIR,
>> + /* Netlink neighbour subscription socket */
>> + EPOLL_TYPE_NL_NEIGH,
>>
>> EPOLL_NUM_TYPES,
>> };
>> diff --git a/netlink.c b/netlink.c
>> index c436780..1faf3da 100644
>> --- a/netlink.c
>> +++ b/netlink.c
>> @@ -53,6 +53,7 @@
>> int nl_sock = -1;
>> int nl_sock_ns = -1;
>> static int nl_seq = 1;
>> +static int nl_sock_neigh = -1;
>>
>> /**
>> * nl_sock_init_do() - Set up netlink sockets in init or target namespace
>> @@ -84,6 +85,119 @@ static int nl_sock_init_do(void *arg)
>> return 0;
>> }
>>
>> +/**
>> + * nl_neigh_subscr_init() - Open a NETLINK_ROUTE socket and subscribe to neighbor events
>> + *
>> + * Return: 0 on success, -1 on failure
>> + */
>> +int nl_neigh_subscr_init(struct ctx *c)
>> +{
>> + struct epoll_event ev = { 0 };
>> + union epoll_ref ref = { .type = EPOLL_TYPE_NL_NEIGH, .fd = 0 };
>
> You overwrite the .fd field, so there's not need to include it in the
> initializer. (Also 0 is a bad placeholder, since 0 is a valid fd).
ok
>
>> +
>> + struct sockaddr_nl addr = {
>> + .nl_family = AF_NETLINK,
>> + .nl_groups = RTMGRP_NEIGH,
>> + };
>> +
>> + if (nl_sock_neigh >= 0)
>> + return 0;
>> +
>> + nl_sock_neigh = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
>> + if (nl_sock_neigh < 0)
>> + return -1;
>> +
>> + if (bind(nl_sock_neigh, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
>> + close(nl_sock_neigh);
>> + nl_sock_neigh = -1;
>> + return -1;
>> + }
>> +
>> + ref.fd = nl_sock_neigh;
>> + ev.events = EPOLLIN;
>
> You could set this in the initializer.
ok
>
>> + ev.data.u64 = ref.u64;
>> + if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, nl_sock_neigh, &ev) == -1) {
>> + close(nl_sock_neigh);
>> + nl_sock_neigh = -1;
>> + return -1;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * nl_neigh_subscr_handler() - Non-blocking drain of pending neighbor updates
>> + * @c: Execution context
>> + */
>> +void nl_neigh_subscr_handler(struct ctx *c)
>> +{
>> + struct nlmsghdr *nh;
>> + char buf[NLBUFSIZ];
>> + ssize_t n;
>> +
>> + if (nl_sock_neigh < 0)
>> + return;
>> +
>> + for (;;) {
>> + n = recv(nl_sock_neigh, buf, sizeof(buf), MSG_DONTWAIT);
>> + if (n <= 0)
>> + return;
>
> EAGAIN correctly results in a silent return, but you probably want to
> report other errors and EOF (n == 0) since those are not expected.
Ok. But EOF is never returned from netlink. I can issue a warn_perror()
for all other negative return values than EAGAIN.
>
>> +
>> + nh = (struct nlmsghdr *)buf;
>> + for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
>> + struct ndmsg *ndm = NLMSG_DATA(nh);
>> + struct rtattr *rta = (struct rtattr *)(ndm + 1);
>> + size_t na = NLMSG_PAYLOAD(nh, sizeof(*ndm));
>> + const uint8_t *lladdr = NULL;
>> + const void *dst = NULL;
>> + size_t lladdr_len = 0;
>> + size_t dstlen = 0;
>> +
>> + if (nh->nlmsg_type == NLMSG_DONE ||
>> + nh->nlmsg_type == NLMSG_ERROR)
>> + continue;
>
> IIUC, NLMSG_ERROR would also be unexpected, so should probably be
> reported.
>
>> + if (nh->nlmsg_type != RTM_NEWNEIGH &&
>> + nh->nlmsg_type != RTM_DELNEIGH)
>> + continue;
>
> Should we filter on ndm->ndm_ifindex for neighbours on the template
> interface only? I'm not sure.
Yes. I was wondering about this, because most events are state changes
we aren't really interested in. My research indicates I can ignore all
states except NUD_VALID (==up) and (INCOMPLETE | FAILED) (== down).
This cannot be set in the subscription itself unless I want to involve
eBPF (I don't), so the right place is to filter it here. That will
save at least some cycles.
>
> It seems like we probably should filter on .
>
>> + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) {
>> + switch (rta->rta_type) {
>> + case NDA_DST:
>> + dst = RTA_DATA(rta);
>> + dstlen = RTA_PAYLOAD(rta);
>> + break;
>> + case NDA_LLADDR:
>> + lladdr = RTA_DATA(rta);
>> + lladdr_len = RTA_PAYLOAD(rta);
>> + break;
>> + default:
>> + break;
>> + }
>> + }
>
> It seems like somewhere there ought to be something specifying the
> address type of both DST and LLADDR, but I'm not sure exactly where.
> We should check those, just in case some weirdo runs us on a host with
> IPX over Token Ring.
There is (ndm_family == AF_INET/6) for DST, and (lladdr_len ==
ETH_ALEN). The latter sounds like a good idea, while the fist sounds
slightly paranoid. OK, we do have AF_TIPC when giving it a second
thought ;-)
///jon
>
>> + if (!dst)
>> + continue;
>> +
>> + if (dstlen != sizeof(struct in_addr) &&
>> + dstlen != sizeof(struct in6_addr))
>> + continue;
>> +
>> + char abuf[INET6_ADDRSTRLEN];
>> +
>> + if (dstlen == sizeof(struct in_addr))
>> + inet_ntop(AF_INET, dst, abuf, sizeof(abuf));
>> + else
>> + inet_ntop(AF_INET6, dst, abuf, sizeof(abuf));
>> +
>> + if (nh->nlmsg_type == RTM_NEWNEIGH)
>> + debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len);
>> + else
>> + debug("neigh: DEL %s", abuf);
>> + }
>> + }
>> +}
>> +
>> /**
>> * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
>> * @c: Execution context
>> diff --git a/netlink.h b/netlink.h
>> index b51e99c..a7d3506 100644
>> --- a/netlink.h
>> +++ b/netlink.h
>> @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>> int s_dst, unsigned int ifi_dst, sa_family_t af);
>> int nl_addr_get(int s, unsigned int ifi, sa_family_t af,
>> void *addr, int *prefix_len, void *addr_l);
>> +bool nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi,
>> + unsigned char *mac);
>> int nl_addr_set(int s, unsigned int ifi, sa_family_t af,
>> const void *addr, int prefix_len);
>> int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
>> @@ -28,5 +30,7 @@ int nl_link_set_mac(int s, unsigned int ifi, const void *mac);
>> int nl_link_set_mtu(int s, unsigned int ifi, int mtu);
>> int nl_link_set_flags(int s, unsigned int ifi,
>> unsigned int set, unsigned int change);
>> +int nl_neigh_subscr_init(struct ctx *c);
>> +void nl_neigh_subscr_handler(struct ctx *c);
>>
>> #endif /* NETLINK_H */
>> diff --git a/passt.c b/passt.c
>> index 31fbb75..e20bbad 100644
>> --- a/passt.c
>> +++ b/passt.c
>> @@ -53,6 +53,7 @@
>> #include "vu_common.h"
>> #include "migrate.h"
>> #include "repair.h"
>> +#include "netlink.h"
>>
>> #define EPOLL_EVENTS 8
>>
>> @@ -79,6 +80,7 @@ char *epoll_type_str[] = {
>> [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
>> [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
>> [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
>> + [EPOLL_TYPE_NL_NEIGH] = "netlink neighbour subscription socket",
>> };
>> static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>> "epoll_type_str[] doesn't match enum epoll_type");
>> @@ -322,6 +324,9 @@ int main(int argc, char **argv)
>>
>> pcap_init(&c);
>>
>> + if (nl_neigh_subscr_init(&c) < 0)
>> + warn("Failed to subscribe to RTMGRP_NEIGH");
>> +
>> if (!c.foreground) {
>> if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0)
>> die_perror("Failed to open /dev/null");
>> @@ -414,6 +419,9 @@ loop:
>> case EPOLL_TYPE_REPAIR:
>> repair_handler(&c, eventmask);
>> break;
>> + case EPOLL_TYPE_NL_NEIGH:
>> + nl_neigh_subscr_handler(&c);
>> + break;
>> default:
>> /* Can't happen */
>> ASSERT(0);
>> --
>> 2.50.1
>>
>
next prev parent reply other threads:[~2025-09-24 18:40 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-24 1:13 [PATCH v9 0/9] Use true MAC address of LAN local remote hosts Jon Maloy
2025-09-24 1:13 ` [PATCH v9 1/9] netlink: add subsciption on changes in NDP/ARP table Jon Maloy
2025-09-24 2:47 ` David Gibson
2025-09-24 3:34 ` David Gibson
2025-09-24 18:40 ` Jon Maloy [this message]
2025-09-25 6:42 ` David Gibson
2025-09-24 1:13 ` [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents Jon Maloy
2025-09-24 3:03 ` David Gibson
2025-09-24 18:54 ` Jon Maloy
2025-09-24 1:13 ` [PATCH v9 3/9] arp/ndp: respond with true MAC address of LAN local remote hosts Jon Maloy
2025-09-24 1:13 ` [PATCH v9 4/9] flow: add MAC address of LAN local remote hosts to flow Jon Maloy
2025-09-24 1:13 ` [PATCH v9 5/9] udp: forward external source MAC address through tap interface Jon Maloy
2025-09-24 1:13 ` [PATCH v9 6/9] tcp: " Jon Maloy
2025-09-24 1:13 ` [PATCH v9 7/9] tap: change signature of function tap_push_l2h() Jon Maloy
2025-09-24 1:13 ` [PATCH v9 8/9] icmp: let icmp use mac address from flowside structure Jon Maloy
2025-09-24 1:13 ` [PATCH v9 9/9] arp/ndp: send gratuitous ARP / unsolicitated NA when MAC cache entry added Jon Maloy
2025-09-24 3:22 ` David Gibson
2025-09-24 22:18 ` Jon Maloy
2025-09-24 23:32 ` Jon Maloy
2025-09-25 6:38 ` David Gibson
2025-09-25 12:48 ` Jon Maloy
2025-09-26 0:47 ` David Gibson
2025-09-26 22:59 ` Jon Maloy
2025-09-29 4:03 ` David Gibson
2025-09-25 6:36 ` David Gibson
2025-09-25 13:14 ` Jon Maloy
2025-09-26 0:55 ` David Gibson
2025-09-26 23:05 ` Jon Maloy
2025-09-29 4:04 ` David Gibson
2025-09-26 23:25 ` Jon Maloy
2025-09-27 19:32 ` Jon Maloy
2025-09-29 4:08 ` David Gibson
2025-09-29 22:23 ` Stefano Brivio
2025-09-30 0:15 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87aafa5b-bc1c-4fe2-808d-3a7514ab11ce@redhat.com \
--to=jmaloy@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=dgibson@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).