From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=hzIXqT25; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id 774075A026F for ; Fri, 04 Apr 2025 16:03:20 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1743775399; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5m6ifhRqc8KPrxzSKSmQUZBucbbmO0lEsZ6h4RMRax0=; b=hzIXqT2591XzQprJ0gQkx8MBHEca6tN4FZtb+ncwtArJ5QkGdoHrEm2I/HPCvtcmt8siGR 2YLyOvV0Bml+WLSywg1xH1iBZIQm1F2Xp9OU5kljTxwGu9BfvlHa55TwgNIDX6mCrKQByi aHtBdyijrNUzJ67t7bsCLsyX6j6sXAg= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-16-EFtvYG-SMN-mER303xJAeg-1; Fri, 04 Apr 2025 10:03:17 -0400 X-MC-Unique: EFtvYG-SMN-mER303xJAeg-1 X-Mimecast-MFC-AGG-ID: EFtvYG-SMN-mER303xJAeg_1743775397 Received: by mail-wm1-f71.google.com with SMTP id 5b1f17b1804b1-43ce245c5acso25060035e9.2 for ; Fri, 04 Apr 2025 07:03:17 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1743775396; x=1744380196; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=5m6ifhRqc8KPrxzSKSmQUZBucbbmO0lEsZ6h4RMRax0=; b=im4xRKlQwyaG3mAiQpIJzNK5bjzhvPW80URlpcorq1gXOJPWsISQTH75x8hl0CgYVI /VyBVj3DB/bgA3er9KYetsKS4gNuB5OvrCalFjcOX5iH/peAJwYvP60BYsN0RpkyD7Bk L33HSnYuNPiYm45Ow69KUdn1lATYyuXX4cFBF1CaPyn6tXqE6IQmAKMdKhfq+Hx9rM/t D62CJaT2sp8WMnjK/Ledd94k/I07OBw74vy5OsBgqibSwmP7IdMLD9LuEiPFWidxCrqF 14QaW1ju2CWOkk9A9RaY/g5PEU9FWPruJvBzQiH4hxwHc+d1GF5r86Yt9MpbRAI/okLV 5tlA== X-Forwarded-Encrypted: i=1; AJvYcCX7OIpunL1yzdPay3WkegkwmWxRE7jXADX8GCJKatZf/B7+hFghb+Sva+L4/UC6sS3qLSdoI0Bi5xI=@passt.top X-Gm-Message-State: AOJu0YwKSW9yC7XY9cDk36yby/MA+PKPU8d0UU3+XBMsjrDfEqfg1pp/ l5gIl0NnIE02z2m1ipDXwSpZQSCpaBK9eUfmaK8RZeoc5XfLuLXCFv49/itNCa+FgIkXCs4cuoL 3lSWdLXOGzBpSWzGNT32Y2Phv+U319HqkyZcWdwDpZspHgjYy1g== X-Gm-Gg: ASbGncsuKvzvNnaBdFCmbPPwAhk4zYzTkRE2k800h95KzjWYc7OIZNFkLTC2xyoab6v AJ2URfalP+9SWddjeN7Xel67p3l0ep+KDkKj6UkoCcePtHgDjdLslAt4po6srettVXqezj0i8C5 mUnrmPh70jy7orIPvaBwSetFGBs6f5xebSvaKmSP+OChwGayFQL5Jf2UEJoZApb0UZjeE9YQFjm 0Ci0vVW32qsNIQsd+6gMgPOwgLe+r8yJpPCO02xL4bTeZnqh2wWmMg9fdhKoUuV42iEkI/6jUso HtFQV5NukYHgRGRc65pVe6P3FRfUELR8zWLOnyULoVy7 X-Received: by 2002:a05:6000:1449:b0:39c:30f9:1e57 with SMTP id ffacd0b85a97d-39d0de14297mr2709675f8f.18.1743775396448; Fri, 04 Apr 2025 07:03:16 -0700 (PDT) X-Google-Smtp-Source: AGHT+IH5mp2z1ro+cPPT/PytpkFh5JUQ/OZXVI4Hkwd0lxPLH/BLsESzckdFY6r3JfP/lesSkGV/FA== X-Received: by 2002:a05:6000:1449:b0:39c:30f9:1e57 with SMTP id ffacd0b85a97d-39d0de14297mr2709605f8f.18.1743775395821; Fri, 04 Apr 2025 07:03:15 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-39c300968c4sm4482447f8f.9.2025.04.04.07.03.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 04 Apr 2025 07:03:15 -0700 (PDT) Date: Fri, 4 Apr 2025 16:03:14 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH v4] udp: support traceroute Message-ID: <20250404160314.7c8e2683@elisabeth> In-Reply-To: References: <20250403022229.836067-1-jmaloy@redhat.com> <20250403174833.6d033172@elisabeth> <4986e27d-20d9-4b2b-883d-d696e84ec9cf@redhat.com> <20250404135015.069d5a91@elisabeth> <3b055987-7c7f-4cd0-9757-9676b9142d17@redhat.com> <20250404150234.40f4d150@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: J-1z7S7HVpqumlkg55Oi3YWuKzj7XrcOAu9zT1xX0LA_1743775397 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: LUD5CF3LHERJT66VUIJEFSM7V7Y44YFV X-Message-ID-Hash: LUD5CF3LHERJT66VUIJEFSM7V7Y44YFV X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: David Gibson , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 4 Apr 2025 09:35:25 -0400 Jon Maloy wrote: > On 2025-04-04 09:02, Stefano Brivio wrote: > > On Fri, 4 Apr 2025 08:54:46 -0400 > > Jon Maloy wrote: > > > >> On 2025-04-04 07:50, Stefano Brivio wrote: > >>> Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s. > >>> :) I just saw no reason to specifically spam Laurent with this. > >>> > >>> On Fri, 4 Apr 2025 10:31:29 +1100 > >>> David Gibson wrote: > >>> > >>>> On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote: > >>>>> > >>>>> > >>>>> On 2025-04-03 11:48, Stefano Brivio wrote: > >>>>>> The implementation looks solid to me, a list of nits (or a bit > >>>>>> more) below. > >>>>>> > >>>>>> By the way, I don't think you need to Cc: people who are already on > >>>>>> this list unless you specifically want their attention. > >>>>>> > >>>>>> On Wed, 2 Apr 2025 22:22:29 -0400 > >>>>>> Jon Maloy wrote: > >>>>>> > >>>>>>> Now that ICMP pass-through from socket-to-tap is in place, it is > >>>>>>> easy to support UDP based traceroute functionality in direction > >>>>>>> tap-to-socket. > >>>>>>> > >>>>>>> We fix that in this commit. > >>>>>>> > >>>>>>> Signed-off-by: Jon Maloy > >>>>>> > >>>>>> This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I > >>>>>> understood correctly. > >>>>>> > >>>>>>> --- > >>>>>>> v2: - Using ancillary data instead of setsockopt to transfer outgoing > >>>>>>> TTL. > >>>>>>> - Support IPv6 > >>>>>>> v3: - Storing ttl per packet instead of per flow. This may not be > >>>>>>> elegant, but much less intrusive than changing the flow > >>>>> > >>>>> [...] > >>>>> > >>>>>>> @@ -11,6 +11,8 @@ > >>>>>>> /* Maximum size of a single packet stored in pool, including headers */ > >>>>>>> #define PACKET_MAX_LEN ((size_t)UINT16_MAX) > >>>>>>> +#define DEFAULT_TTL 64 > >>>>>> > >>>>>> If I understood correctly, David's comment to this on v3: > >>>>>> > >>>>>> https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/ > >>>>>> > >>>>>> was meant to imply that, as the default value can be changed via > >>>>>> sysctl, the value set via sysctl could be read at start-up. I'm fine > >>>>>> with 64 as well, by the way, with a slight preference for reading the > >>>>>> value via sysctl. > >>>>> > >>>>> I don't think the local host/container setting will have any effect > >>>>> if the sending guest is a VM. > >>>> > >>>> That's true, but.. > >>>> > >>>>> The benefit is of this is dubious. > >>>> > >>>> .. uflow->ttl[] isn't so much representing what the guest set, as a > >>>> cache of what the socket is sending and that *does* depend on the host > >>>> value. > >>> > >>> Right, my concern is that now we'll use the host value (whatever it is) > >>> if the value from the container / guest is 64. > >>> > >>> So: > >>> > >>> - guest uses 63, host has 255 configured: we use 63 > >>> > >>> - guest uses 64, host has 64 configured: we use 64 > >>> > >>> - ...but: guest uses 64, host has 255 configured: we use 255 > >>> > >>> ...and this might actually break traceroute itself in some extreme > >>> cases. > >>> > >>> Let's say we have 255 configured on the host and you're in the middle > >>> of a traceroute: > >>> > >>> - guest sends TTL 62, goes out with 62 -> 62nd hop replies > >>> - guest sends TTL 63, goes out with 63 -> 63rd hop replies > >>> - guest sends TTL 64, goes out with 255 -> destination replies > >>> - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken > >> > >> Conclusion is that we have to set TTL in the socket at the opening of > >> every new flow in direction tap->sock. I can do that in a separate patch. > > > > Not necessarily, I think, you can also check what the current value is. > > If it matches, there's no need to set it. > > That value is likely to be different from that of the first > incoming packet from the guest, so we will end up calling setsockopt() > anyway. My point was that it's actually very likely. A container has the same TTL by default (ip_default_ttl is namespaced though, so it can be changed), and 64 is a common default value anyway (same default for Linux guests). > Besides, I believe the difference between doing an initial > setsockopt() vs a getsockopt() is mininmal, so nothing is gained. Okay, I thought one would lock the socket "heavily" and the other one wouldn't, but I guess you're right, it should be a minimal difference anyway. > I have a simpler idea: When the udp_flow struct is created, instead > of using DEFAULT_TTL, we set ttl to the one value TTL can never have: > zero. That means the condition for calling setsockopt() will alwaays be > met for the first arriving packet, and from that point on the value > in the socket and the cache in udp_flow will always be in sync. Ah, right, that looks even simpler. Again, regardless of that, I'm not sure if it works for IPv6 and IPV6_HOPLIMIT. > ///jon > > > > >>> See also the comment below. > >>> > >>>>>> All this might go away, though, please read the comment to > >>>>>> udp_flow_new() below, first. > >>>>>> > >>>>>>> + > >>>>>>> /** > >>>>>>> * struct pool - Generic pool of packets stored in a buffer > >>>>>>> * @buf: Buffer storing packet descriptors, > >>>>>>> diff --git a/tap.c b/tap.c > >>>>>>> index 3a6fcbe..e65d592 100644 > >>>>>>> --- a/tap.c > >>>>>>> +++ b/tap.c > >>>>>>> @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf); > >>>>>>> * @dest: Destination port > >>>>>>> * @saddr: Source address > >>>>>>> * @daddr: Destination address > >>>>>>> + * @ttl: Time to live > >>>>>>> * @msg: Array of messages that can be handled in a single call > >>>>>>> */ > >>>>>>> static struct tap4_l4_t { > >>>>>>> @@ -574,6 +575,8 @@ static struct tap4_l4_t { > >>>>>>> struct in_addr saddr; > >>>>>>> struct in_addr daddr; > >>>>>>> + uint8_t ttl; > >>>>>> > >>>>>> If you move this after 'protocol' you save 4 or 8 bytes depending on > >>>>>> the architecture and, perhaps more importantly, with 64-byte cachelines, > >>>>>> you can fit the set of fields involved in the L4_MATCH() comparison > >>>>>> four times instead of three. If you have a look with pahole(1): > >>>>>> > >>>>> Good point. I didn't notice. > >>>>> > >>>>> > >>>>> [...] > >>>>>>> const struct flowside *toside; > >>>>>>> struct mmsghdr mm[UIO_MAXIOV]; > >>>>>>> @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > >>>>>>> mm[i].msg_hdr.msg_controllen = 0; > >>>>>>> mm[i].msg_hdr.msg_flags = 0; > >>>>>>> + if (ttl != uflow->ttl[tosidx.sidei]) { > >>>>>>> + uflow->ttl[tosidx.sidei] = ttl; > >>>>>>> + if (af == AF_INET) { > >>>>>>> + if (setsockopt(s, IPPROTO_IP, IP_TTL, > >>>>>>> + &ttl, sizeof(ttl)) < 0) > >>>>>>> + perror("setsockopt (IP_TTL)"); > >>>>>> > >>>>>> This would print to file descriptor 2 even if it's a socket. It should > >>>>>> be err_perror() instead, but now we also have flow_perror() which > >>>>>> prints flow index and type, given 'uflow' here, say: > >>>>>> > >>>>>> flow_perror(uflow, "IP_TTL setsockopt"); > >>>>>> > >>>>>>> + } else { > >>>>>>> + if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT, > >>>>>>> + &ttl, sizeof(ttl)) < 0) > >>>>>>> + perror("setsockopt (IP_TTL)"); > >>>>>> > >>>>>> ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps: > >>>>>> > >>>>>> flow_perror(uflow, > >>>>>> "setsockopt IPV6_HOPLIMIT"); > >>>>>> > >>>>> Ok. > >>>>> > >>>>>>> + } > >>>>>>> + } > >>>>>>> + > >>>>>>> count++; > >>>>>>> } > >>>>>>> diff --git a/udp.h b/udp.h > >>>>>>> index de2df6d..041fad4 100644 > >>>>>>> --- a/udp.h > >>>>>>> +++ b/udp.h > >>>>>>> @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, > >>>>>>> uint32_t events, const struct timespec *now); > >>>>>>> int udp_tap_handler(const struct ctx *c, uint8_t pif, > >>>>>>> sa_family_t af, const void *saddr, const void *daddr, > >>>>>>> - const struct pool *p, int idx, const struct timespec *now); > >>>>>>> + uint8_t ttl, const struct pool *p, int idx, > >>>>>> > >>>>>> Excess whitespace beetween 'uint8_t' and 'ttl'. > >>>>>> > >>>>>>> + const struct timespec *now); > >>>>>>> int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, > >>>>>>> const char *ifname, in_port_t port); > >>>>>>> int udp_init(struct ctx *c); > >>>>>>> diff --git a/udp_flow.c b/udp_flow.c > >>>>>>> index bf4b896..39372c2 100644 > >>>>>>> --- a/udp_flow.c > >>>>>>> +++ b/udp_flow.c > >>>>>>> @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, > >>>>>>> uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); > >>>>>>> uflow->ts = now->tv_sec; > >>>>>>> uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; > >>>>>>> + uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL; > >>>>>> > >>>>>> By the way, instead of using a default value, what about fetching the > >>>>>> current value with getsockopt()? > >>>>>> > >>>>>> One additional system call per UDP flow doesn't feel like a lot of > >>>>>> overhead, and we can be sure it's correct, no matter if the user > >>>>>> configures a different value before or after we start. > >>>>>> > >>>>> This patch fixes UDP messaging tap->socket, and TTL may have any > >>>>> value in the first arriving packet. Reading it from the socket here only > >>>>> makes sense when I add the same support in direction socket->tap. > >>>>> That is my next project. > >>> > >>> Well, wait, the getsockopt() will not tell you the value the socket is > >>> receiving. It tells you the value that the socket would send, at least > >>> according to the documentation: > >> > >> We do have IP_RECVTTL and IP6_RECVHOPLIMIT for that. When this option is > >> set, we can catch the TTL on received packets by reading anillary data. > > > > Sure, but here we're talking about the value that the socket would > > send. That's what I'm suggesting to fetch via getsockopt() for IPv4. > > > > How you're using *IPV6_HOPLIMIT* here doesn't match the documentation. > > Is the documentation wrong? I haven't checked. > > > >> ///jon > >> > >>> > >>> IP_TTL (since Linux 1.0) > >>> Set or retrieve the current time-to-live field that is -- Stefano