On Thu, Aug 21, 2025 at 12:53:36PM +0200, Stefano Brivio wrote: > On Thu, 21 Aug 2025 12:03:47 +1000 > David Gibson wrote: > > > On Tue, Aug 19, 2025 at 11:10:05PM -0400, Jon Maloy wrote: > > > We add a cache table to keep partial contents of the kernel ARP/NDP > > > tables. This way, we drastically reduce the number of netlink calls > > > to read those tables. > > > > Do you have data to suggest this is necessary? > > I'll chime in as I originally suggested that we need this cache. > > Without it, we'll have one netlink query for each local, non-loopback > flow being established, which sounds rather absurd (...am I missing > something?). Not just local: we only need the MAC for local flows, but we don't know if it's local until we look for the MAC. Yeah, that's true. I guess I don't have a good sense of the relative latency of a netlink lookup versus a flow establishment, so I don't have an intuition as to whether that's absurd or not. Since you do, I'll defer to your judgement on this. > I haven't tested these changes yet but I suppose the usual tcp_crr test > should show the issue. > > It's not just about TCP CRR latency though. We're adding this mostly > for use cases where some kind of LAN service is implemented by a > container (say, Pi-hole), and we can probably expect a ton of > short-lived TCP flows in those cases (say, DNS requests over TCP). Actually, I would have expected more UDP than TCP, but the issue is the same. > > It's a lot of code to optimise something only needed for some pretty > > uncommon cases. > > Actually, it's a bit less code than I expected, but I don't understand > why you're assuming those cases are uncommon. Hm, maybe. > By the way, note that we should be able to get rid of most of this once > we implement a netlink monitor (which we need for other purposes), > because at that point we can also subscribe to ARP / neighbour table > changes. Not really. We don't need explicit queries in that case, but we still need the data structure to record the information we get from the monitor. The bulk of this code is implementing that data structure. > > > We create dummy cache entries representing non-(not-yet)-existing > > > ARP/NDP entries when needed. We add a short expiration time to each > > > such entry, so that we can know when to make repeated calls to the > > > kernel tables in the beginning. We also add an access counter to the > > > entries, to ensure that the timer becomes longer and the call frequency > > > abates over time if no ARP/NDP entry is created. > > > > > > For regular entries we use a much longer timer, with the purpose to > > > update the entry in the rare case that a remote host changes its > > > MAC address. > > > > > > Signed-off-by: Jon Maloy > > > --- > > > arp.c | 3 +- > > > conf.c | 2 + > > > flow.c | 5 +- > > > fwd.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > fwd.h | 4 ++ > > > ndp.c | 3 +- > > > tcp.c | 3 +- > > > 7 files changed, 218 insertions(+), 8 deletions(-) > > > > > > diff --git a/arp.c b/arp.c > > > index c37867a..040d4fe 100644 > > > --- a/arp.c > > > +++ b/arp.c > > > @@ -29,7 +29,6 @@ > > > #include "dhcp.h" > > > #include "passt.h" > > > #include "tap.h" > > > -#include "netlink.h" > > > > > > /** > > > * arp() - Check if this is a supported ARP message, reply as needed > > > @@ -79,7 +78,7 @@ int arp(const struct ctx *c, const struct pool *p) > > > */ > > > inany_from_af(&tgt, AF_INET, am->tip); > > > if (!fwd_inany_nat(c, &tgt)) > > > - nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha); > > > + fwd_neigh_mac_get(c, &tgt, c->ifi4, am->sha); > > > > > > memcpy(swap, am->tip, sizeof(am->tip)); > > > memcpy(am->tip, am->sip, sizeof(am->tip)); > > > diff --git a/conf.c b/conf.c > > > index f47f48e..0abdbf7 100644 > > > --- a/conf.c > > > +++ b/conf.c > > > @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv) > > > c->udp.fwd_out.mode = fwd_default; > > > > > > fwd_scan_ports_init(c); > > > + if (fwd_mac_cache_init()) > > > + die("Failed to initiate neighnor MAC cache"); > > > > "neighnor" > > Jon, by the way, we've been using BrE quite consistently throughout the > codebase, so the two occurrences of "neighbour" (code comments only) > have, well, a 'u' in them. I'd try to keep that consistency if it > doesn't bother anybody. > > -- > Stefano > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson