From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=ErMD54iT; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 21EB95A0271 for ; Wed, 24 Sep 2025 20:54:30 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1758740068; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=b3tg47OhjgnRcqwvAaBM2dA/Dx/AJVw0XisFHnnEY+I=; b=ErMD54iTBXcJ2Iepul3CFOYzuiJxATWYr7/30/7BVfpGPUk7l73vD33WXiEt9j/qmmtnvc xfgSlELsqYzhJg/nzbWHDR/+2zgk6hGHNzlo/Q5U3jtLReORrJoi3c7l/8ZxGpPQx8oltq or1fD+T7m35sD7LMLrZ5wwpcbo3wuRU= Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com [209.85.222.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-427-8KnQxW_SPD22OG-PThxzoQ-1; Wed, 24 Sep 2025 14:54:27 -0400 X-MC-Unique: 8KnQxW_SPD22OG-PThxzoQ-1 X-Mimecast-MFC-AGG-ID: 8KnQxW_SPD22OG-PThxzoQ_1758740067 Received: by mail-qk1-f197.google.com with SMTP id af79cd13be357-81ea2cb6f13so45769285a.0 for ; Wed, 24 Sep 2025 11:54:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1758740067; x=1759344867; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=b3tg47OhjgnRcqwvAaBM2dA/Dx/AJVw0XisFHnnEY+I=; b=DN3TKPYCP0U8y3tjZQuPzzpGMIHIPyCyIDkgPKSRWiijaX4d+HgBKRjh6L+K2wFWSR QLofpo8mP0wf1ok63aypfM3lEP1WegnP3uNlrl57k4MWV0dE0u+XEz8oPG7I1o9P87iM kFk3rPfOcCpRxB7ftJMdcyQf8j0gls2QuZWJrogjkJj+sGyZlJ7VfdLGuoe2qiIcangn owSbVRsIh/qP1jgXobt+DQFMfnbQYeFx9xAkN2j1lCQrPmVqR26RAquCAyzzj+3vTsxe V07kOeawx4fUKiutYIZDb1WoqW7/Nh97u02QbI/63Rf90/VyKitIkiAYcYs5DERZ3W46 FplQ== X-Gm-Message-State: AOJu0YziKRFSI8ks78GMgbwNdNL64Bv+qk6te1X2PyZJYQLfjG6mazFR g/6cNocJkzWdcdz8AqdGM6v9I0K9Up0y5a3ii3XyQy9fzkraTL5DVOMFqhRJYe027NC5QB+guRC LzhU4FggwAb68A7noWl6v9jsS5QycrOYaXkA18cw9tptLZuhc757qcBLM7v+VyfbWsMAm1f3ZKh 28pYhGP9bzmjBxJZM5jLlIm3cWd1P/PjLn4l5xrA== X-Gm-Gg: ASbGncvLrgQS1mOd4rp/P1D+V0yDFsmxTwu10jECORyy2sd/C1H8y4k2+vsCXLdG9Qu MLkvZ79ktr/0XyKAdrh0rhyPANMxyzM4pwlV4XuCX7quK0OVteiNdwTAnpgVcWvxjWQNMtp3mX3 0EX3eYSXsbxL16T0jOPqYwSTr/Lbp9WIgPd7Tvtpsl8Pt86fzYOCeATeEp0/1hyJXAzZjiX10hJ azzEMPhRKJObK/Gdi9D5sZDgFbTBo8QJkbJd9x2cAtDuc+otXnraN22LWqeK9wyzYw01lBHe2Nk VEtC2g92NBVhUAZeP91H+KcCTHwqgTbGaUI/rR0Wde9DxrkCyboqR7NYzzrchKnWTrMA+ftkJ4e 0jUBuQuAYdw== X-Received: by 2002:a05:620a:7114:b0:84a:a85f:42cb with SMTP id af79cd13be357-85adee2b82cmr124238585a.14.1758740066649; Wed, 24 Sep 2025 11:54:26 -0700 (PDT) X-Google-Smtp-Source: AGHT+IF4WZI9wRTm3xsU98Z6uv24BbcuCB2AaY2705Aiivv4oluYCTnHc+MdF/lN3pEOWVRkItSyrw== X-Received: by 2002:a05:620a:7114:b0:84a:a85f:42cb with SMTP id af79cd13be357-85adee2b82cmr124232685a.14.1758740065936; Wed, 24 Sep 2025 11:54:25 -0700 (PDT) Received: from ?IPV6:2001:4958:2206:8901:6025:1483:4146:72dd? ([2001:4958:2206:8901:6025:1483:4146:72dd]) by smtp.gmail.com with ESMTPSA id af79cd13be357-836328d6d8esm1189306885a.58.2025.09.24.11.54.25 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 24 Sep 2025 11:54:25 -0700 (PDT) Message-ID: <404b2431-8042-4e31-acd2-c7729491275e@redhat.com> Date: Wed, 24 Sep 2025 14:54:25 -0400 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v9 2/9] fwd: Add cache table for ARP/NDP contents To: passt-dev@passt.top References: <20250924011330.1168921-1-jmaloy@redhat.com> <20250924011330.1168921-3-jmaloy@redhat.com> From: Jon Maloy In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: BNle3r5TAsIto-etVSdAWeSvdXh8a5kyd8uJeUNURVk_1758740067 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Message-ID-Hash: HZICLO4ZPMSLVA6WSVAGANHXARDVZCWL X-Message-ID-Hash: HZICLO4ZPMSLVA6WSVAGANHXARDVZCWL X-MailFrom: jmaloy@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 2025-09-23 23:03, David Gibson wrote: > On Tue, Sep 23, 2025 at 09:13:23PM -0400, Jon Maloy wrote: >> We add a cache table to keep track of the contents of the kernel ARP >> and NDP tables. The table is fed from the just introduced netlink based >> neigbour subscription function. The new table eliminates the need for >> explicit netlink calls to find a host's MAC address. >> >> Signed-off-by: Jon Maloy >> >> --- >> v5: - Moved to earlier in series to reduce rebase conflicts >> v6: - Sqashed the hash list commit and the FIFO/LRU queue commit >> - Removed hash lookup. We now only use linear lookup in a >> linked list >> - Eliminated dynamic memory allocation. >> - Ensured there is only one call to clock_gettime() >> - Using MAC_ZERO instead of the previously dedicated definitions >> v7: - NOW using MAC_ZERO where needed >> - I am still using linear back-off for empty cache entries. Even >> an incoming, flow-creating packet from a local host gives no >> guarantee that its MAC address is in the ARP table, so we must >> allow for a few new attempts at first possible occasions. Only >> after several failed lookups can we conclude that we probably >> never will succeed. Hence the back-off. >> - Fixed a bug that David inadvertently made me aware of: I only >> intended to set the initial expiry value to MAC_CACHE_RENEWAL >> when an ARP/NDP table lookup was successful. >> - Improved struct and function description comments. >> v8: - Total re-design of table, adapting to the new, subscription >> based way of updating it. >> v9: - Catering for MAC address change for an existing host. >> --- >> conf.c | 1 + >> fwd.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> fwd.h | 7 +++ >> netlink.c | 15 +++-- >> 4 files changed, 180 insertions(+), 5 deletions(-) >> >> diff --git a/conf.c b/conf.c >> index 66b9e63..cc491c3 100644 >> --- a/conf.c >> +++ b/conf.c >> @@ -2133,6 +2133,7 @@ void conf(struct ctx *c, int argc, char **argv) >> c->udp.fwd_out.mode = fwd_default; >> >> fwd_scan_ports_init(c); >> + fwd_neigh_cache_init(); >> >> if (!c->quiet) >> conf_print(c); >> diff --git a/fwd.c b/fwd.c >> index 250cf56..491303d 100644 >> --- a/fwd.c >> +++ b/fwd.c >> @@ -32,6 +32,168 @@ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14); >> static in_port_t fwd_ephemeral_max = NUM_PORTS - 1; >> >> #define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range" >> +#define MAC_CACHE_SIZE 1024 > > Nit: you use "mac cache" some places, "neigh cache" others. Good. point, and something I have been thinking about myself. I think I will change to neigh_cache all over the line. > >> + >> +/** >> + * mac_cache_entry - Entry in the ARP/NDP table cache >> + * @next: Next entry in slot or free list >> + * @addr: IP address of represented host >> + * @mac: MAC address of represented host >> + */ >> +struct mac_cache_entry { >> + struct mac_cache_entry *next; > > Hash buckets might be overkill, but they're pretty easy and cheap, so > whatever. In theory there might be hundreds or even thousands of neighbors. Maybe not often, but this is cheap as you say. > >> + union inany_addr addr; >> + uint8_t mac[ETH_ALEN]; >> +}; >> + >> +/** >> + * mac_cache_table - Cache of ARP/NDP table contents >> + * @entries: Entries to be plugged into the hash slots when allocated >> + * @slots: Hash table slots >> + * @free: Linked list of unused entries >> + */ >> +struct mac_cache_table { >> + struct mac_cache_entry entries[MAC_CACHE_SIZE]; >> + struct mac_cache_entry *slots[MAC_CACHE_SIZE]; > > There's no inherent reason these two tables need the same size (and > indeed some reasons they might not want to have the same size). I know, but I decided in the end to give them the same size. It might make sense to make the slot array larger than the entry array to reduce risk of collisions, but this risk and its consequences are really minimal in my view. > >> + struct mac_cache_entry *free; >> +}; >> + >> +static struct mac_cache_table mac_cache; >> + >> +/** >> + * fwd_mac_cache_slot_idx() - Hash key to a number within the table range >> + * @c: Execution context >> + * @key: The key to be used for the hash >> + * >> + * Return: The resulting hash value >> + */ >> +static inline size_t fwd_mac_cache_slot_idx(const struct ctx *c, >> + const union inany_addr *key) > > Maybe just neigh_hash_bucket()? Yeah. Maybe. > >> +{ >> + struct siphash_state st = SIPHASH_INIT(c->hash_secret); >> + uint32_t i; >> + >> + inany_siphash_feed(&st, key); >> + i = siphash_final(&st, sizeof(*key), 0); >> + >> + return ((size_t)i) & (MAC_CACHE_SIZE - 1); > > This subtly depends on MAC_CACHE_SIZE being a power of 2. If you use > (... % MAC_CACHE_SIZE) it will work regardless, and the compiler will > still optimize if it _is_ a power of 2. Absolutely. I believe I had a comment about this at the #define line in earlier versions, but it seems to have disappeared. I will re-introduce it. > >> +} >> + >> +/** >> + * fwd_mac_cache_find() - Find a MAC cache table entry >> + * @c: Execution context >> + * @addr: Neighbour address to be used as key for the lookup >> + * >> + * Return: The found entry, if any. Otherwise NULL. >> + */ >> +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c, >> + const union inany_addr *addr) >> +{ >> + size_t idx = fwd_mac_cache_slot_idx(c, addr); >> + struct mac_cache_entry *e = mac_cache.slots[idx]; >> + >> + while (e && !inany_equals(&e->addr, addr)) >> + e = e->next; >> + >> + return e; >> +} >> + >> +/** >> + * fwd_mac_cache_alloc() - Allocate a mac cache table entry from the free list >> + * @c: Execution context >> + * @addr: Address used to determine insertion slot and store in entry >> + * @mac: The MAC address associated with the neighbour address >> + */ >> +void fwd_neigh_mac_cache_alloc(const struct ctx *c, >> + const union inany_addr *addr, uint8_t *mac) > > This can update as well as allocating. Maybe fwd_neigh_cache_set()? Ok > >> +{ >> + struct mac_cache_table *t = &mac_cache; >> + struct mac_cache_entry *e; >> + ssize_t idx; >> + >> + /* MAC address might change sometimes */ >> + e = fwd_mac_cache_find(c, addr); >> + if (e) { >> + memcpy(e->mac, mac, ETH_ALEN); >> + return; >> + } >> + >> + e = t->free; >> + if (!e) >> + return; >> + t->free = e->next; >> + >> + idx = fwd_mac_cache_slot_idx(c, addr); >> + e->next = t->slots[idx]; >> + t->slots[idx] = e; >> + >> + memcpy(&e->addr, addr, sizeof(*addr)); >> + memcpy(e->mac, mac, ETH_ALEN); >> +} >> + >> +/** >> + * fwd_mac_cache_free() - Remove an entry from a slot and add it to free list >> + * @c: Execution context >> + * @addr: Neighbour address used to find the slot for the entry >> + */ >> +void fwd_neigh_mac_cache_free(const struct ctx *c, const union inany_addr *addr) >> +{ >> + ssize_t idx = fwd_mac_cache_slot_idx(c, addr); >> + struct mac_cache_table *t = &mac_cache; >> + struct mac_cache_entry *e, **prev; >> + >> + prev = &t->slots[idx]; >> + e = t->slots[idx]; >> + while (e && !inany_equals(&e->addr, addr)) { >> + prev = &e->next; >> + e = e->next; >> + } >> + if (!e) >> + return; >> + >> + *prev = e->next; >> + e->next = t->free; >> + t->free = e; >> + memset(&e->addr, 0, sizeof(*addr)); >> + memset(e->mac, 0, ETH_ALEN); >> +} >> + >> +/** >> + * fwd_neigh_mac_get() - Lookup MAC address in the ARP/NDP cache table >> + * @c: Execution context >> + * @addr: Neighbour address used as lookup key >> + * @mac: Buffer for Ethernet MAC to return, found or default value. >> + * >> + * Return: true if real MAC found, false if not found or if failure >> + */ >> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, >> + uint8_t *mac) >> +{ >> + struct mac_cache_entry *e = fwd_mac_cache_find(c, addr); >> + >> + if (e) >> + memcpy(mac, e->mac, ETH_ALEN); >> + else >> + memcpy(mac, c->our_tap_mac, ETH_ALEN); >> + >> + return !!e; >> +} >> + >> +/** >> + * fwd_neigh_cache_init() - Initialize the neighbor ARP/NDP cache table >> + */ >> +void fwd_neigh_cache_init(void) >> +{ >> + struct mac_cache_table *t = &mac_cache; >> + struct mac_cache_entry *e; >> + >> + memset(t, 0, sizeof(*t)); >> + for (int i = 0; i < MAC_CACHE_SIZE; i++) { >> + e = &t->entries[i]; >> + e->next = t->free; >> + t->free = e; >> + } >> +} >> >> /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral >> * >> diff --git a/fwd.h b/fwd.h >> index 65c7c96..a0e8fbc 100644 >> --- a/fwd.h >> +++ b/fwd.h >> @@ -56,5 +56,12 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, >> const struct flowside *ini, struct flowside *tgt); >> uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, >> const struct flowside *ini, struct flowside *tgt); >> +void fwd_neigh_mac_cache_alloc(const struct ctx *c, >> + const union inany_addr *addr, uint8_t *mac); >> +void fwd_neigh_mac_cache_free(const struct ctx *c, >> + const union inany_addr *addr); >> +bool fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, >> + uint8_t *mac); >> +void fwd_neigh_cache_init(void); >> >> #endif /* FWD_H */ >> diff --git a/netlink.c b/netlink.c >> index 1faf3da..1e1ec43 100644 >> --- a/netlink.c >> +++ b/netlink.c >> @@ -131,6 +131,8 @@ int nl_neigh_subscr_init(struct ctx *c) >> */ >> void nl_neigh_subscr_handler(struct ctx *c) >> { >> + union inany_addr addr; >> + uint8_t mac[ETH_ALEN]; >> struct nlmsghdr *nh; >> char buf[NLBUFSIZ]; >> ssize_t n; >> @@ -183,17 +185,20 @@ void nl_neigh_subscr_handler(struct ctx *c) >> dstlen != sizeof(struct in6_addr)) >> continue; >> >> - char abuf[INET6_ADDRSTRLEN]; >> + if (!lladdr || lladdr_len != ETH_ALEN) >> + continue; >> + >> + memcpy(mac, lladdr, ETH_ALEN); >> >> if (dstlen == sizeof(struct in_addr)) >> - inet_ntop(AF_INET, dst, abuf, sizeof(abuf)); >> + addr = inany_from_v4(*(struct in_addr *) dst); >> else >> - inet_ntop(AF_INET6, dst, abuf, sizeof(abuf)); >> + addr.a6 = *(struct in6_addr *) dst; >> >> if (nh->nlmsg_type == RTM_NEWNEIGH) >> - debug("neigh: NEW %s lladdr_len=%zu", abuf, lladdr_len); >> + fwd_neigh_mac_cache_alloc(c, &addr, mac); >> else >> - debug("neigh: DEL %s", abuf); >> + fwd_neigh_mac_cache_free(c, &addr); > > I think the debug() messages could still be useful. They could move > into the functions of course, and might want to be demoted to trace(). Yes. Good point. ///jon >