From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=fjWtGn31; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 5652F5A026F for ; Mon, 31 Mar 2025 07:25:01 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202502; t=1743398684; bh=ukZEqlWkL537iY7abB8vQ4HximwSd1umWeQqCErt0jg=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=fjWtGn31/twRI26HLobPrhiY4qqImb4oQXsbVNWC2+J18RcsxOQSoiPDVLPd5UOit FlKNSI/CaJGZ1KAmYoGZNmbIjbNsUyyku4hyu9gI1sTn2MzjjsWNJyIjiXOL/Suv8/ oHK7IvSVHlNGK+EWYGqqjEYsu5x9ZRHJlYuepHBlq0J0mhP9nw4QHYAGqWgo/t+Rtv Wi55y98tH/UeRBHI2wJtfkTWoTnKxBnGx5WtiwQ77HLd2louUM/4bcNls6qMdEoLzZ KMOr+VJKzYTTzVZYCjEhinQv2/5WE4CON1uFJ3YQvzpeWD2YLXRHTDdqLb4V+bR5YX LDUMJIATC05og== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4ZR01m11zhz4x0L; Mon, 31 Mar 2025 16:24:44 +1100 (AEDT) Date: Mon, 31 Mar 2025 16:23:40 +1100 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v3] udp: support traceroute Message-ID: References: <20250330210628.47752-1-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="J66foiCZuCVzgA09" Content-Disposition: inline In-Reply-To: <20250330210628.47752-1-jmaloy@redhat.com> Message-ID-Hash: JUSYEFLADTIVBBXJOYGKJKFDTO5M77T7 X-Message-ID-Hash: JUSYEFLADTIVBBXJOYGKJKFDTO5M77T7 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --J66foiCZuCVzgA09 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Mar 30, 2025 at 05:06:28PM -0400, Jon Maloy wrote: > Now that ICMP pass-through from socket-to-tap is in place, it is > easy to support UDP based traceroute functionality in direction > tap-to-socket. >=20 > We fix that in this commit. >=20 > Signed-off-by: Jon Maloy > --- > v2: - Using ancillary data instead of setsockopt to transfer outgoing > TTL. > - Support IPv6 > v3: - Storing ttl per packet instead of per flow. This may not be > elegant, but much less intrusive than changing the flow > criteria. This eliminates the need for the extra, flow-changing > patch we introduced in v2. Hm, I think this will work, but I'm not entirely convinced about the approach. > --- > packet.c | 28 +++++++++++++++++----------- > packet.h | 30 ++++++++++++++++++++++-------- > tap.c | 3 ++- > udp.c | 28 ++++++++++++++++++++++++---- > udp.h | 3 ++- > 5 files changed, 67 insertions(+), 25 deletions(-) >=20 > diff --git a/packet.c b/packet.c > index 72c6158..36a32fe 100644 > --- a/packet.c > +++ b/packet.c > @@ -89,11 +89,12 @@ bool pool_full(const struct pool *p) > * @p: Existing pool > * @len: Length of new descriptor > * @start: Start of data > + * @ttl: TTL/hop_limit for this packet > * @func: For tracing: name of calling function > * @line: For tracing: caller line of function call > */ > void packet_add_do(struct pool *p, size_t len, const char *start, > - const char *func, int line) > + const uint8_t ttl, const char *func, int line) > { > size_t idx =3D p->count; > =20 > @@ -106,8 +107,9 @@ void packet_add_do(struct pool *p, size_t len, const = char *start, > if (packet_check_range(p, start, len, func, line)) > return; > =20 > - p->pkt[idx].iov_base =3D (void *)start; > - p->pkt[idx].iov_len =3D len; > + p->pkt[idx].iov.iov_base =3D (void *)start; > + p->pkt[idx].iov.iov_len =3D len; > + p->pkt[idx].ttl =3D ttl; > =20 > p->count++; > } > @@ -125,7 +127,8 @@ void packet_add_do(struct pool *p, size_t len, const = char *start, > * Return: pointer to start of data range, NULL on invalid range or desc= riptor > */ > void *packet_get_try_do(const struct pool *p, size_t idx, size_t offset, > - size_t len, size_t *left, const char *func, int line) > + size_t len, size_t *left, uint8_t *ttl, > + const char *func, int line) > { > char *ptr; > =20 > @@ -139,18 +142,21 @@ void *packet_get_try_do(const struct pool *p, size_= t idx, size_t offset, > return NULL; > } > =20 > - if (offset > p->pkt[idx].iov_len || > - len > (p->pkt[idx].iov_len - offset)) > + if (offset > p->pkt[idx].iov.iov_len || > + len > (p->pkt[idx].iov.iov_len - offset)) > return NULL; > =20 > - ptr =3D (char *)p->pkt[idx].iov_base + offset; > + ptr =3D (char *)p->pkt[idx].iov.iov_base + offset; > =20 > ASSERT_WITH_MSG(!packet_check_range(p, ptr, len, func, line), > "Corrupt packet pool, %s:%i", func, line); > =20 > if (left) > - *left =3D p->pkt[idx].iov_len - offset - len; > + *left =3D p->pkt[idx].iov.iov_len - offset - len; > =20 > + if (ttl) > + *ttl =3D p->pkt[idx].ttl; > +; > return ptr; > } > =20 > @@ -168,14 +174,14 @@ void *packet_get_try_do(const struct pool *p, size_= t idx, size_t offset, > */ > void *packet_get_do(const struct pool *p, const size_t idx, > size_t offset, size_t len, size_t *left, > - const char *func, int line) > + uint8_t *ttl, const char *func, int line) > { > - void *r =3D packet_get_try_do(p, idx, offset, len, left, func, line); > + void *r =3D packet_get_try_do(p, idx, offset, len, left, ttl, func, lin= e); > =20 > if (!r) { > trace("missing packet data length %zu, offset %zu from " > "length %zu, %s:%i", > - len, offset, p->pkt[idx].iov_len, func, line); > + len, offset, p->pkt[idx].iov.iov_len, func, line); > } > =20 > return r; > diff --git a/packet.h b/packet.h > index c94780a..1f5142c 100644 > --- a/packet.h > +++ b/packet.h > @@ -11,6 +11,8 @@ > /* Maximum size of a single packet stored in pool, including headers */ > #define PACKET_MAX_LEN ((size_t)UINT16_MAX) > =20 > +#define DEFAULT_TTL 64 This appears to be the default default on Linux, but looks like it can be changed by sysctl. > /** > * struct pool - Generic pool of packets stored in a buffer > * @buf: Buffer storing packet descriptors, > @@ -26,28 +28,36 @@ struct pool { > size_t buf_size; > size_t size; > size_t count; > - struct iovec pkt[]; > + struct { > + struct iovec iov; > + uint8_t ttl; Hm. I'm not entirely sure I like the idea of putting the TTL explicitly here, given that it is usually contained implicitly in the IP headrs of the packets in the pool. Then again, packet pool entries are roughly like skbs, which do include a bunch of fields giving "summary" information. The alternative approach to this would be to add TTL to the criteria for making "seqs" in tap.c. That is update the logic for L4_MATCH and L4_SET as in the earlier versions. Then you can pass a single ttl for a group of packets. The "seqs" tend to roughly line up with flows, but they don't have to. > + uint8_t pad[3]; I don't think you should need to explicitly pad, the compiler should do that automatically. On 64-bit systems (i.e. most of them), I think you'd need 7 bytes of padding to keep things properly aligned. > + } pkt[]; > }; > =20 > int vu_packet_check_range(void *buf, const char *ptr, size_t len); > void packet_add_do(struct pool *p, size_t len, const char *start, > - const char *func, int line); > + const uint8_t ttl, const char *func, int line); > void *packet_get_try_do(const struct pool *p, const size_t idx, > size_t offset, size_t len, size_t *left, > - const char *func, int line); > + uint8_t *ttl, const char *func, int line); > void *packet_get_do(const struct pool *p, const size_t idx, > size_t offset, size_t len, size_t *left, > - const char *func, int line); > + uint8_t *ttl, const char *func, int line); > bool pool_full(const struct pool *p); > void pool_flush(struct pool *p); > =20 > #define packet_add(p, len, start) \ > - packet_add_do(p, len, start, __func__, __LINE__) > + packet_add_do(p, len, start, DEFAULT_TTL, __func__, __LINE__) > +#define packet_add_ttl(p, len, start, ttl) \ > + packet_add_do(p, len, start, ttl, __func__, __LINE__) I also don't love the fact that the ttl value in the pool will only sometimes be correct: only in the paths where you actually bother to parse it from the packet. > #define packet_get_try(p, idx, offset, len, left) \ > - packet_get_try_do(p, idx, offset, len, left, __func__, __LINE__) > + packet_get_try_do(p, idx, offset, len, left, NULL, __func__, __LINE__) > #define packet_get(p, idx, offset, len, left) \ > - packet_get_do(p, idx, offset, len, left, __func__, __LINE__) > + packet_get_do(p, idx, offset, len, left, NULL, __func__, __LINE__) > +#define packet_get_ttl(p, idx, offset, len, left, ttl) \ > + packet_get_do(p, idx, offset, len, left, ttl, __func__, __LINE__) > =20 > #define PACKET_POOL_DECL(_name, _size, _buf) \ > struct _name ## _t { \ > @@ -55,7 +65,11 @@ struct _name ## _t { \ > size_t buf_size; \ > size_t size; \ > size_t count; \ > - struct iovec pkt[_size]; \ > + struct { \ > + struct iovec iov; \ > + uint8_t ttl; \ > + uint8_t pad[3]; \ > + } pkt[_size]; \ > } > =20 > #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size) \ > diff --git a/tap.c b/tap.c > index 3a6fcbe..ac9b3df 100644 > --- a/tap.c > +++ b/tap.c > @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf); > * @dest: Destination port > * @saddr: Source address > * @daddr: Destination address > + * @ttl: TTL/hop_limit for packet This seems to be a leftover from an earlier version; you're no longer adding a ttle field to tap_l4_t. > * @msg: Array of messages that can be handled in a single call > */ > static struct tap4_l4_t { > @@ -821,7 +822,7 @@ resume: > #undef L4_SET > =20 > append: > - packet_add((struct pool *)&seq->p, l4len, l4h); > + packet_add_ttl((struct pool *)&seq->p, l4len, l4h, iph->ttl); > } > =20 > for (j =3D 0, seq =3D tap4_l4; j < seq_count; j++, seq++) { > diff --git a/udp.c b/udp.c > index 39431d7..5fbba49 100644 > --- a/udp.c > +++ b/udp.c > @@ -859,8 +859,10 @@ fail: > */ > int udp_tap_handler(const struct ctx *c, uint8_t pif, > sa_family_t af, const void *saddr, const void *daddr, > - const struct pool *p, int idx, const struct timespec *now) > + const struct pool *p, int idx, > + const struct timespec *now) Unrelated whitespace change. > { > + char ancillary[CMSG_SPACE(sizeof(int))]; > const struct flowside *toside; > struct mmsghdr mm[UIO_MAXIOV]; > union sockaddr_inany to_sa; > @@ -885,7 +887,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > src =3D ntohs(uh->source); > dst =3D ntohs(uh->dest); > =20 > - tosidx =3D udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst, now); > + tosidx =3D udp_flow_from_tap(c, pif, af, saddr, daddr, > + src, dst, now); > + Unrelated whitespace change. > if (!(uflow =3D udp_at_sidx(tosidx))) { > char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; > =20 > @@ -915,8 +919,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > for (i =3D 0; i < (int)p->count - idx; i++) { > struct udphdr *uh_send; > size_t len; > + uint8_t ttl; > =20 > - uh_send =3D packet_get(p, idx + i, 0, sizeof(*uh), &len); > + uh_send =3D packet_get_ttl(p, idx + i, 0, sizeof(*uh), &len, &ttl); > if (!uh_send) > return p->count - idx; > =20 > @@ -926,7 +931,6 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > if (len) { > m[i].iov_base =3D (char *)(uh_send + 1); > m[i].iov_len =3D len; > - Unrelated whitespace change. > mm[i].msg_hdr.msg_iov =3D m + i; > mm[i].msg_hdr.msg_iovlen =3D 1; > } else { > @@ -938,6 +942,22 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > mm[i].msg_hdr.msg_controllen =3D 0; > mm[i].msg_hdr.msg_flags =3D 0; > =20 > + if (ttl !=3D DEFAULT_TTL) { I'm also not entirely convinced of the wisdom of making things conditional on a check against a fixed DEFAULT_TTL. If the guest has its default TTL configured to something other than 64, this will trigger on every packet. It will also trigger every time (probably) on any traffic that isn't directly from the guest, but forwarded by the guest from some other interface (e.g. a VPN). The TTL might have started at 64 on the endpoint, but it will have decreased by some hops before it reaches us. Handling this sort of case gracefully, is why I had a different approach in mind. In struct udp_flow keep the current TTL for the flow's socket. Whenever we get a packet from tap with a TTL that's different, setsockopt() and update the current socket TTL. For cases like the above where the TTL is not 64, but is uniform, we'll get a single setsockopt() then not have to do anything more. For traceroute we'll get a setsockopt() per "distance" that traceroute probes. Then again, it's possible that adding the cmsg to the existing syscall is of negligible cost, whereas an extra syscall will have some cost, even if it's only sometimes. Not sure. > + struct cmsghdr *cmsg =3D (void *) ancillary; > + > + if (af =3D=3D AF_INET) { > + cmsg->cmsg_level =3D IPPROTO_IP; > + cmsg->cmsg_type =3D IP_TTL; > + } else { > + cmsg->cmsg_level =3D IPPROTO_IPV6; > + cmsg->cmsg_type =3D IPV6_HOPLIMIT; > + } > + cmsg->cmsg_len =3D CMSG_LEN(sizeof(int)); > + *((int *) CMSG_DATA(cmsg)) =3D ttl; I also wonder if we should be settil (ttl - 1) here: should we count ourselves as a "hop"? > + mm[i].msg_hdr.msg_control =3D ancillary; > + mm[i].msg_hdr.msg_controllen =3D sizeof(ancillary); > + } > + > count++; > } > =20 > diff --git a/udp.h b/udp.h > index de2df6d..6adbfcd 100644 > --- a/udp.h > +++ b/udp.h > @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union = epoll_ref ref, > uint32_t events, const struct timespec *now); > int udp_tap_handler(const struct ctx *c, uint8_t pif, > sa_family_t af, const void *saddr, const void *daddr, > - const struct pool *p, int idx, const struct timespec *now); > + const struct pool *p, int idx, > + const struct timespec *now); > int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *a= ddr, > const char *ifname, in_port_t port); > int udp_init(struct ctx *c); --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --J66foiCZuCVzgA09 Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmfqJtsACgkQzQJF27ox 2GciNw//e/my31SR8rdp81bZTEBl7jgt6K8nPwchfJXb7Y1Si1MEykaS0s4KyPbq wUgm4cFPkbOifFr7JTl+N/WD1H1/afU0GF4JglPL6wAJJXyrbYR4HPgDI5UMesGS elMV8n24BfG/NDKIPB2hvRq4X8D5XPy2MpeX9ON3vp+6zO6iqn8AsPnqD7WQI61F GSj33lFSMgQ0C/1zudl8yW0MxXIUUCLS+xGh9CLCOqQIAIATdA/F9/4jd3sw/2jZ 4BaWCiAWDw/HJi+k8laFzD/hDFSRWOTZssffFmCaHcykF2pITCSxeSFoTpllk4Co xMJbiu803fYMmBTpViEyqW/lA+vH5YZy4DyKnU87/yajU3gTL7pelD487AX7ovu/ JDlKf0DgHpo+deiBML5GwGUq1Zi8mLMy3gVPy29LQYn2rIHTf4ud7ZItsi+y6lpM xoQXifm2DFSeN+A4BVa521eOpsjqXSsI0I6/1s7SEZ9mcs+BFrhVaLSuTi3C4e5s UXsuOHlfljjae+YS+XS5FCdiffuaTqCgaTvoSTmcwHtF1p4zBOhSjCdCzsJiL6D5 1hFLtS64E+RPeYY1ue7UiqtAAKE738I8bbviFL4Fw6MS/frY69LEOUlDLhmYZ0sp 1IWpYMIq3vZpn7pqVFR/RJE4IcQIjBKEJGGJ2ZIiGWW5Ku1b3Z0= =H+Y0 -----END PGP SIGNATURE----- --J66foiCZuCVzgA09--