From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>, passt-dev@passt.top
Subject: Re: [PATCH v4] udp: support traceroute
Date: Fri, 4 Apr 2025 13:54:36 +0200
Message-ID: <20250404135436.76faa385@elisabeth>
In-Reply-To: <20250404135015.069d5a91@elisabeth>

On Fri, 4 Apr 2025 13:50:15 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:
> Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s.
> :) I just saw no reason to specifically spam Laurent with this.
>
> On Fri, 4 Apr 2025 10:31:29 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote:
> > >
> > >
> > > On 2025-04-03 11:48, Stefano Brivio wrote:
> > > > The implementation looks solid to me, a list of nits (or a bit
> > > > more) below.
> > > >
> > > > By the way, I don't think you need to Cc: people who are already on
> > > > this list unless you specifically want their attention.
> > > >
> > > > On Wed, 2 Apr 2025 22:22:29 -0400
> > > > Jon Maloy <jmaloy@redhat.com> wrote:
> > > >
> > > > > Now that ICMP pass-through from socket-to-tap is in place, it is
> > > > > easy to support UDP based traceroute functionality in direction
> > > > > tap-to-socket.
> > > > >
> > > > > We fix that in this commit.
> > > > >
> > > > > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > > >
> > > > This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I
> > > > understood correctly.
> > > >
> > > > > ---
> > > > > v2: - Using ancillary data instead of setsockopt to transfer outgoing
> > > > > TTL.
> > > > > - Support IPv6
> > > > > v3: - Storing ttl per packet instead of per flow. This may not be
> > > > > elegant, but much less intrusive than changing the flow
> > >
> > > [...]
> > >
> > > > > @@ -11,6 +11,8 @@
> > > > > /* Maximum size of a single packet stored in pool, including headers */
> > > > > #define PACKET_MAX_LEN ((size_t)UINT16_MAX)
> > > > > +#define DEFAULT_TTL 64
> > > >
> > > > If I understood correctly, David's comment to this on v3:
> > > >
> > > > https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/
> > > >
> > > > was meant to imply that, as the default value can be changed via
> > > > sysctl, the value set via sysctl could be read at start-up. I'm fine
> > > > with 64 as well, by the way, with a slight preference for reading the
> > > > value via sysctl.
> > >
> > > I don't think the local host/container setting will have any effect
> > > if the sending guest is a VM.
> >
> > That's true, but..
> >
> > > The benefit of this is dubious.
> >
> > .. uflow->ttl[] isn't so much representing what the guest set, as a
> > cache of what the socket is sending and that *does* depend on the host
> > value.
>
> Right, my concern is that now we'll use the host value (whatever it is)
> if the value from the container / guest is 64.
>
> So:
>
> - guest uses 63, host has 255 configured: we use 63
>
> - guest uses 64, host has 64 configured: we use 64
>
> - ...but: guest uses 64, host has 255 configured: we use 255
>
> ...and this might actually break traceroute itself in some extreme
> cases.
>
> Let's say we have 255 configured on the host and you're in the middle
> of a traceroute:
>
> - guest sends TTL 62, goes out with 62 -> 62nd hop replies
> - guest sends TTL 63, goes out with 63 -> 63rd hop replies
> - guest sends TTL 64, goes out with 255 -> destination replies
> - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken
>
> See also the comment below.
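
To make the failure mode above concrete, here is a minimal model of the
comparison against the cached value. It is a sketch, not passt code:
model_out_ttl() and model_check() are hypothetical names, and it assumes
each traceroute probe lands on a freshly created flow (traceroute usually
varies the destination port per probe), so the cache always starts at
DEFAULT_TTL while the socket really sends the host default.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_TTL 64	/* value the patch uses to initialise the cache */

/* Hypothetical model: TTL a probe leaves with on a freshly created flow.
 * 'host_default' is what the socket sends unless setsockopt() is called.
 */
uint8_t model_out_ttl(uint8_t guest_ttl, uint8_t host_default)
{
	uint8_t cached = DEFAULT_TTL;	/* what we believe the socket sends */
	uint8_t actual = host_default;	/* what the socket really sends */

	if (guest_ttl != cached) {
		cached = guest_ttl;
		actual = guest_ttl;	/* stands in for setsockopt() */
	}
	(void)cached;

	return actual;
}

/* Reproduce the four probes listed above, with 255 on the host */
void model_check(void)
{
	assert(model_out_ttl(62, 255) == 62);
	assert(model_out_ttl(63, 255) == 63);
	assert(model_out_ttl(64, 255) == 255);	/* the problematic probe */
	assert(model_out_ttl(65, 255) == 65);
}
```

Initialising the cache from the value the socket would actually send makes
the third case go away in this model, because cache and socket then agree
from the start.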
>
> > > > All this might go away, though, please read the comment to
> > > > udp_flow_new() below, first.
> > > >
> > > > > +
> > > > > /**
> > > > > * struct pool - Generic pool of packets stored in a buffer
> > > > > * @buf: Buffer storing packet descriptors,
> > > > > diff --git a/tap.c b/tap.c
> > > > > index 3a6fcbe..e65d592 100644
> > > > > --- a/tap.c
> > > > > +++ b/tap.c
> > > > > @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > > > > * @dest: Destination port
> > > > > * @saddr: Source address
> > > > > * @daddr: Destination address
> > > > > + * @ttl: Time to live
> > > > > * @msg: Array of messages that can be handled in a single call
> > > > > */
> > > > > static struct tap4_l4_t {
> > > > > @@ -574,6 +575,8 @@ static struct tap4_l4_t {
> > > > > struct in_addr saddr;
> > > > > struct in_addr daddr;
> > > > > + uint8_t ttl;
> > > >
> > > > If you move this after 'protocol' you save 4 or 8 bytes depending on
> > > > the architecture and, perhaps more importantly, with 64-byte cachelines,
> > > > you can fit the set of fields involved in the L4_MATCH() comparison
> > > > four times instead of three. If you have a look with pahole(1):
> > > >
> > > Good point. I didn't notice.
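
As the pahole(1) output was trimmed from the quote, here is a small
stand-in that shows the padding effect. The two structs are hypothetical:
the field order is modelled on what the diff shows, and the trailing
pointer is only an assumed placeholder for whatever pointer-aligned member
follows the addresses in the real struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout with 'ttl' appended after the addresses */
struct l4_ttl_last {
	uint8_t protocol;
	uint16_t source;
	uint16_t dest;
	uint32_t saddr;		/* stand-in for struct in_addr */
	uint32_t daddr;
	uint8_t ttl;		/* padding follows, up to pointer alignment */
	void *rest;		/* assumed pointer-aligned member */
};

/* Same fields, with 'ttl' in the padding byte right after 'protocol' */
struct l4_ttl_early {
	uint8_t protocol;
	uint8_t ttl;
	uint16_t source;
	uint16_t dest;
	uint32_t saddr;
	uint32_t daddr;
	void *rest;
};

/* The reordered variant is smaller on common 32-bit and 64-bit ABIs */
static_assert(sizeof(struct l4_ttl_early) < sizeof(struct l4_ttl_last),
	      "moving ttl next to protocol should remove padding");
```

Running pahole(1) on the real object file remains the way to confirm the
actual numbers for struct tap4_l4_t.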
> > >
> > >
> > > [...]
> > > > > const struct flowside *toside;
> > > > > struct mmsghdr mm[UIO_MAXIOV];
> > > > > @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > > > > mm[i].msg_hdr.msg_controllen = 0;
> > > > > mm[i].msg_hdr.msg_flags = 0;
> > > > > + if (ttl != uflow->ttl[tosidx.sidei]) {
> > > > > + uflow->ttl[tosidx.sidei] = ttl;
> > > > > + if (af == AF_INET) {
> > > > > + if (setsockopt(s, IPPROTO_IP, IP_TTL,
> > > > > + &ttl, sizeof(ttl)) < 0)
> > > > > + perror("setsockopt (IP_TTL)");
> > > >
> > > > This would print to file descriptor 2 even if it's a socket. It should
> > > > be err_perror() instead, but now we also have flow_perror() which
> > > > prints flow index and type, given 'uflow' here, say:
> > > >
> > > > flow_perror(uflow, "IP_TTL setsockopt");
> > > >
> > > > > + } else {
> > > > > + if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
> > > > > + &ttl, sizeof(ttl)) < 0)
> > > > > + perror("setsockopt (IP_TTL)");
> > > >
> > > > ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps:
> > > >
> > > > flow_perror(uflow,
> > > > "setsockopt IPV6_HOPLIMIT");
> > > >
> > > Ok.
> > >
> > > > > + }
> > > > > + }
> > > > > +
> > > > > count++;
> > > > > }
> > > > > diff --git a/udp.h b/udp.h
> > > > > index de2df6d..041fad4 100644
> > > > > --- a/udp.h
> > > > > +++ b/udp.h
> > > > > @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> > > > > uint32_t events, const struct timespec *now);
> > > > > int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > > > > sa_family_t af, const void *saddr, const void *daddr,
> > > > > - const struct pool *p, int idx, const struct timespec *now);
> > > > > + uint8_t ttl, const struct pool *p, int idx,
> > > >
> > > > Excess whitespace beetween 'uint8_t' and 'ttl'.
> > > >
> > > > > + const struct timespec *now);
> > > > > int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > > > > const char *ifname, in_port_t port);
> > > > > int udp_init(struct ctx *c);
> > > > > diff --git a/udp_flow.c b/udp_flow.c
> > > > > index bf4b896..39372c2 100644
> > > > > --- a/udp_flow.c
> > > > > +++ b/udp_flow.c
> > > > > @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
> > > > > uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
> > > > > uflow->ts = now->tv_sec;
> > > > > uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
> > > > > + uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;
> > > >
> > > > By the way, instead of using a default value, what about fetching the
> > > > current value with getsockopt()?
> > > >
> > > > One additional system call per UDP flow doesn't feel like a lot of
> > > > overhead, and we can be sure it's correct, no matter if the user
> > > > configures a different value before or after we start.
> > > >
> > > This patch fixes UDP messaging tap->socket, and TTL may have any
> > > value in the first arriving packet. Reading it from the socket here only
> > > makes sense when I add the same support in direction socket->tap.
> > > That is my next project.
>
> Well, wait, the getsockopt() will not tell you the value the socket is
> receiving. It tells you the value that the socket would send, at least
> according to the documentation:
>
> IP_TTL (since Linux 1.0)
> Set or retrieve the current time-to-live field that is
> used in every packet sent from this socket.
>
> and that's what makes it relevant: this is the value that we would
> normally use, unless you issue the setsockopt().
>
> But... there's a plot twist: this is just for IPv4. For IPv6:
>
> IPV6_RTHDR, IPV6_AUTHHDR, IPV6_DSTOPTS, IPV6_HOPOPTS, IPV6_FLOWINFO,
> IPV6_HOPLIMIT
> Set delivery of control messages for incoming datagrams
> containing extension headers from the received packet.
>
> [...]
>
> IPV6_HOPLIMIT delivers an integer containing the hop count of
> the packet.
>
> so I wonder: is it correct to use IPV6_HOPLIMIT at all, even for the
> setsockopt() you're adding?
>
> I haven't tested this (at least not yet), but from the documentation
> that seems to apply to *received* packets. No idea what the
> setsockopt() would do, at this point.
>
> Could it be that we should use IP_TTL for *sent* IPv4 packets as well?
                                                   ^^^^ IPv6, I meant
>
> I'll try to test this specific part in a bit, unless you already did.
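
For that test, a minimal sketch of a round trip on a throwaway UDP socket
could look like the following. ttl_roundtrip() is a hypothetical helper,
not something from the patch. For IPv6 it uses IPV6_UNICAST_HOPS, which
ipv6(7) documents as the unicast hop limit for the socket; whether that is
the right replacement for IPV6_HOPLIMIT in the patch is an assumption
still to be confirmed.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set the outgoing TTL (IPv4) or unicast hop limit (IPv6) on a new UDP
 * socket, then read it back. Returns the value reported by getsockopt(),
 * or -1 on any failure.
 */
int ttl_roundtrip(int af, int val)
{
	socklen_t len = sizeof(int);
	int level, opt, out = -1, s;

	assert(af == AF_INET || af == AF_INET6);

	level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
	opt = (af == AF_INET) ? IP_TTL : IPV6_UNICAST_HOPS;

	s = socket(af, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;

	if (setsockopt(s, level, opt, &val, sizeof(val)) < 0 ||
	    getsockopt(s, level, opt, &out, &len) < 0)
		out = -1;

	close(s);
	return out;
}
```

Calling getsockopt() alone, before any setsockopt(), on the flow's socket
would be the variant that answers the question about the initial value.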
--
Stefano