public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>, passt-dev@passt.top
Subject: Re: [PATCH v4] udp: support traceroute
Date: Fri, 4 Apr 2025 16:03:14 +0200
Message-ID: <20250404160314.7c8e2683@elisabeth>
In-Reply-To: <b46c996d-3f83-473f-9095-64f4712f7a6b@redhat.com>

On Fri, 4 Apr 2025 09:35:25 -0400
Jon Maloy <jmaloy@redhat.com> wrote:

> On 2025-04-04 09:02, Stefano Brivio wrote:
> > On Fri, 4 Apr 2025 08:54:46 -0400
> > Jon Maloy <jmaloy@redhat.com> wrote:
> >   
> >> On 2025-04-04 07:50, Stefano Brivio wrote:  
> >>> Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s.
> >>> :) I just saw no reason to specifically spam Laurent with this.
> >>>
> >>> On Fri, 4 Apr 2025 10:31:29 +1100
> >>> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>>      
> >>>> On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote:  
> >>>>>
> >>>>>
> >>>>> On 2025-04-03 11:48, Stefano Brivio wrote:  
> >>>>>> The implementation looks solid to me, a list of nits (or a bit
> >>>>>> more) below.
> >>>>>>
> >>>>>> By the way, I don't think you need to Cc: people who are already on
> >>>>>> this list unless you specifically want their attention.
> >>>>>>
> >>>>>> On Wed,  2 Apr 2025 22:22:29 -0400
> >>>>>> Jon Maloy <jmaloy@redhat.com> wrote:
> >>>>>>         
> >>>>>>> Now that ICMP pass-through from socket-to-tap is in place, it is
> >>>>>>> easy to support UDP based traceroute functionality in direction
> >>>>>>> tap-to-socket.
> >>>>>>>
> >>>>>>> We fix that in this commit.
> >>>>>>>
> >>>>>>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>  
> >>>>>>
> >>>>>> This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I
> >>>>>> understood correctly.
> >>>>>>         
> >>>>>>> ---
> >>>>>>> v2: - Using ancillary data instead of setsockopt to transfer outgoing
> >>>>>>>          TTL.
> >>>>>>>        - Support IPv6
> >>>>>>> v3: - Storing ttl per packet instead of per flow. This may not be
> >>>>>>>          elegant, but much less intrusive than changing the flow  
> >>>>>
> >>>>> [...]
> >>>>>         
> >>>>>>> @@ -11,6 +11,8 @@
> >>>>>>>     /* Maximum size of a single packet stored in pool, including headers */
> >>>>>>>     #define PACKET_MAX_LEN	((size_t)UINT16_MAX)
> >>>>>>> +#define DEFAULT_TTL 64  
> >>>>>>
> >>>>>> If I understood correctly, David's comment to this on v3:
> >>>>>>
> >>>>>>      https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/
> >>>>>>
> >>>>>> was meant to imply that, as the default value can be changed via
> >>>>>> sysctl, the value set via sysctl could be read at start-up. I'm fine
> >>>>>> with 64 as well, by the way, with a slight preference for reading the
> >>>>>> value via sysctl.  
> >>>>>
> >>>>> I don't think the local host/container setting will have any effect
> >>>>> if the sending guest is a VM.  
> >>>>
> >>>> That's true, but..
> >>>>     
> >>>>> The benefit of this is dubious.  
> >>>>
> >>>> .. uflow->ttl[] isn't so much representing what the guest set, as a
> >>>> cache of what the socket is sending and that *does* depend on the host
> >>>> value.  
> >>>
> >>> Right, my concern is that now we'll use the host value (whatever it is)
> >>> if the value from the container / guest is 64.
> >>>
> >>> So:
> >>>
> >>> - guest uses 63, host has 255 configured: we use 63
> >>>
> >>> - guest uses 64, host has 64 configured: we use 64
> >>>
> >>> - ...but: guest uses 64, host has 255 configured: we use 255
> >>>
> >>> ...and this might actually break traceroute itself in some extreme
> >>> cases.
> >>>
> >>> Let's say we have 255 configured on the host and you're in the middle
> >>> of a traceroute:
> >>>
> >>> - guest sends TTL 62, goes out with 62 -> 62nd hop replies
> >>> - guest sends TTL 63, goes out with 63 -> 63rd hop replies
> >>> - guest sends TTL 64, goes out with 255 -> destination replies
> >>> - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken  
> >>
> >> Conclusion is that we have to set TTL in the socket at the opening of
> >> every new flow in direction tap->sock. I can do that in a separate patch.  
> > 
> > Not necessarily, I think, you can also check what the current value is.
> > If it matches, there's no need to set it.  
> 
> That value is likely to be different from that of the first
> incoming packet from the guest, so we will end up calling setsockopt()
> anyway.

My point was that a match is actually very likely. A container has the same
TTL by default (ip_default_ttl is namespaced though, so it can be
changed), and 64 is a common default value anyway (same default for
Linux guests).
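
To make the "check first" idea concrete, here is a minimal sketch, not
passt code. ttl_matches_socket() is a hypothetical helper name: it asks
the socket, via getsockopt(), which TTL it would currently send, so that
the caller can skip setsockopt() when the guest's value already matches.
It covers IPv4 only.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of passt: returns 1 if socket @s would
 * already send IPv4 packets with TTL @want, 0 if not, -1 on error.
 */
static int ttl_matches_socket(int s, int want)
{
	int cur;
	socklen_t len = sizeof(cur);

	/* IP_TTL read back via getsockopt() is the value used for sending */
	if (getsockopt(s, IPPROTO_IP, IP_TTL, &cur, &len) < 0)
		return -1;

	return cur == want;
}
```

A caller would only issue setsockopt(s, IPPROTO_IP, IP_TTL, ...) when
this returns 0.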

> Besides, I believe the difference between doing an initial
> setsockopt() vs a getsockopt() is minimal, so nothing is gained.

Okay, I thought one would lock the socket "heavily" and the other one
wouldn't, but I guess you're right, it should be a minimal difference
anyway.

> I have a simpler idea: When the udp_flow struct is created, instead
> of using DEFAULT_TTL, we set ttl to the one value TTL can never have: 
> zero. That means the condition for calling setsockopt() will always be
> met for the first arriving packet, and from that point on the value
> in the socket and the cache in udp_flow will always be in sync.

Ah, right, that looks even simpler.
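
For the record, a toy model of why the zero seed helps. This is a sketch
under stated assumptions, not the actual udp_flow code: struct fake_sock
and struct fake_flow are made-up stand-ins for the kernel socket and for
one side of uflow->ttl[], and the assignment to s->ttl stands in for
setsockopt(IP_TTL).

```c
#include <stdint.h>

/* Hypothetical stand-in for a socket: only tracks the TTL it would send */
struct fake_sock {
	uint8_t ttl;		/* starts at the host's ip_default_ttl */
	unsigned sets;		/* number of simulated setsockopt() calls */
};

/* Hypothetical per-flow state, modelling one side of uflow->ttl[] */
struct fake_flow {
	uint8_t ttl_cache;	/* TTL we believe the socket sends, 0: unknown */
};

/* Returns the TTL the packet would actually leave with */
static uint8_t send_with_ttl(struct fake_sock *s, struct fake_flow *f,
			     uint8_t ttl)
{
	if (ttl != f->ttl_cache) {
		f->ttl_cache = ttl;
		s->ttl = ttl;	/* stands in for setsockopt(IP_TTL) */
		s->sets++;
	}

	return s->ttl;
}
```

With the cache seeded to 64 and a host default of 255, a guest packet
with TTL 64 never triggers the update and leaves with 255. With the
cache seeded to 0, the first packet always triggers it, and a second
packet with the same TTL doesn't cause another call.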

Again, regardless of that, I'm not sure if it works for IPv6 and
IPV6_HOPLIMIT.
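
As far as I can tell from ipv6(7) and RFC 3493, the per-socket
counterpart of IP_TTL is IPV6_UNICAST_HOPS, while IPV6_HOPLIMIT is
described in terms of ancillary data. I haven't verified what the kernel
does with IPV6_HOPLIMIT in setsockopt(), so take this as a sketch of the
alternative rather than a statement that the patch is wrong.
set_out_hops() is a hypothetical helper name, not something in passt.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of passt: set TTL (IPv4) or unicast hop
 * limit (IPv6) for packets sent from socket @s. Returns 0, or -1 on error.
 * The value is passed as int rather than uint8_t, to be on the safe side
 * with the option length.
 */
static int set_out_hops(int s, sa_family_t af, int hops)
{
	if (af == AF_INET)
		return setsockopt(s, IPPROTO_IP, IP_TTL,
				  &hops, sizeof(hops));

	/* Assumption: IPV6_UNICAST_HOPS is the IPv6 equivalent of IP_TTL */
	return setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
			  &hops, sizeof(hops));
}
```

The value can be read back with getsockopt() using the same option
names, which is an easy way to check whether the call had any effect.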

> ///jon
> 
> >   
> >>> See also the comment below.
> >>>      
> >>>>>> All this might go away, though, please read the comment to
> >>>>>> udp_flow_new() below, first.
> >>>>>>         
> >>>>>>> +
> >>>>>>>     /**
> >>>>>>>      * struct pool - Generic pool of packets stored in a buffer
> >>>>>>>      * @buf:	Buffer storing packet descriptors,
> >>>>>>> diff --git a/tap.c b/tap.c
> >>>>>>> index 3a6fcbe..e65d592 100644
> >>>>>>> --- a/tap.c
> >>>>>>> +++ b/tap.c
> >>>>>>> @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> >>>>>>>      * @dest:	Destination port
> >>>>>>>      * @saddr:	Source address
> >>>>>>>      * @daddr:	Destination address
> >>>>>>> + * @ttl:	Time to live
> >>>>>>>      * @msg:	Array of messages that can be handled in a single call
> >>>>>>>      */
> >>>>>>>     static struct tap4_l4_t {
> >>>>>>> @@ -574,6 +575,8 @@ static struct tap4_l4_t {
> >>>>>>>     	struct in_addr saddr;
> >>>>>>>     	struct in_addr daddr;
> >>>>>>> +	uint8_t ttl;  
> >>>>>>
> >>>>>> If you move this after 'protocol' you save 4 or 8 bytes depending on
> >>>>>> the architecture and, perhaps more importantly, with 64-byte cachelines,
> >>>>>> you can fit the set of fields involved in the L4_MATCH() comparison
> >>>>>> four times instead of three. If you have a look with pahole(1):
> >>>>>>         
> >>>>> Good point. I didn't notice.
> >>>>>
> >>>>>
> >>>>> [...]  
> >>>>>>>     	const struct flowside *toside;
> >>>>>>>     	struct mmsghdr mm[UIO_MAXIOV];
> >>>>>>> @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >>>>>>>     		mm[i].msg_hdr.msg_controllen = 0;
> >>>>>>>     		mm[i].msg_hdr.msg_flags = 0;
> >>>>>>> +		if (ttl != uflow->ttl[tosidx.sidei]) {
> >>>>>>> +			uflow->ttl[tosidx.sidei] = ttl;
> >>>>>>> +			if (af == AF_INET) {
> >>>>>>> +				if (setsockopt(s, IPPROTO_IP, IP_TTL,
> >>>>>>> +					       &ttl, sizeof(ttl)) < 0)
> >>>>>>> +					perror("setsockopt (IP_TTL)");  
> >>>>>>
> >>>>>> This would print to file descriptor 2 even if it's a socket. It should
> >>>>>> be err_perror() instead, but now we also have flow_perror() which
> >>>>>> prints flow index and type, given 'uflow' here, say:
> >>>>>>
> >>>>>> 					flow_perror(uflow, "IP_TTL setsockopt");
> >>>>>>         
> >>>>>>> +			} else {
> >>>>>>> +				if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
> >>>>>>> +					       &ttl, sizeof(ttl)) < 0)
> >>>>>>> +					perror("setsockopt (IP_TTL)");  
> >>>>>>
> >>>>>> ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps:
> >>>>>>
> >>>>>> 					flow_perror(uflow,
> >>>>>> 						    "setsockopt IPV6_HOPLIMIT");
> >>>>>>         
> >>>>> Ok.
> >>>>>         
> >>>>>>> +			}
> >>>>>>> +		}
> >>>>>>> +
> >>>>>>>     		count++;
> >>>>>>>     	}
> >>>>>>> diff --git a/udp.h b/udp.h
> >>>>>>> index de2df6d..041fad4 100644
> >>>>>>> --- a/udp.h
> >>>>>>> +++ b/udp.h
> >>>>>>> @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> >>>>>>>     			    uint32_t events, const struct timespec *now);
> >>>>>>>     int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >>>>>>>     		    sa_family_t af, const void *saddr, const void *daddr,
> >>>>>>> -		    const struct pool *p, int idx, const struct timespec *now);
> >>>>>>> +		    uint8_t  ttl, const struct pool *p, int idx,  
> >>>>>>
> >>>>>> Excess whitespace between 'uint8_t' and 'ttl'.
> >>>>>>         
> >>>>>>> +		    const struct timespec *now);
> >>>>>>>     int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> >>>>>>>     		  const char *ifname, in_port_t port);
> >>>>>>>     int udp_init(struct ctx *c);
> >>>>>>> diff --git a/udp_flow.c b/udp_flow.c
> >>>>>>> index bf4b896..39372c2 100644
> >>>>>>> --- a/udp_flow.c
> >>>>>>> +++ b/udp_flow.c
> >>>>>>> @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
> >>>>>>>     	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
> >>>>>>>     	uflow->ts = now->tv_sec;
> >>>>>>>     	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
> >>>>>>> +	uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;  
> >>>>>>
> >>>>>> By the way, instead of using a default value, what about fetching the
> >>>>>> current value with getsockopt()?
> >>>>>>
> >>>>>> One additional system call per UDP flow doesn't feel like a lot of
> >>>>>> overhead, and we can be sure it's correct, no matter if the user
> >>>>>> configures a different value before or after we start.
> >>>>>>         
> >>>>> This patch fixes UDP messaging tap->socket, and TTL may have any
> >>>>> value in the first arriving packet. Reading it from the socket here only
> >>>>> makes sense when I add the same support in direction socket->tap.
> >>>>> That is my next project.  
> >>>
> >>> Well, wait, the getsockopt() will not tell you the value the socket is
> >>> receiving. It tells you the value that the socket would send, at least
> >>> according to the documentation:  
> >>
> >> We do have IP_RECVTTL and IPV6_RECVHOPLIMIT for that. When this option is
> >> set, we can catch the TTL on received packets by reading ancillary data.  
> > 
> > Sure, but here we're talking about the value that the socket would
> > send. That's what I'm suggesting to fetch via getsockopt() for IPv4.
> > 
> > How you're using *IPV6_HOPLIMIT* here doesn't match the documentation.
> > Is the documentation wrong? I haven't checked.
> >   
> >> ///jon
> >>  
> >>>
> >>>          IP_TTL (since Linux 1.0)
> >>>                 Set or retrieve the current time-to-live field that is  

-- 
Stefano


Thread overview: 10+ messages
2025-04-03  2:22 [PATCH v4] udp: support traceroute Jon Maloy
2025-04-03 15:48 ` Stefano Brivio
2025-04-03 20:27   ` Jon Maloy
2025-04-03 23:31     ` David Gibson
2025-04-04 11:50       ` Stefano Brivio
2025-04-04 11:54         ` Stefano Brivio
2025-04-04 12:54         ` Jon Maloy
2025-04-04 13:02           ` Stefano Brivio
2025-04-04 13:35             ` Jon Maloy
2025-04-04 14:03               ` Stefano Brivio [this message]
