Date: Fri, 4 Apr 2025 13:54:36 +0200
From: Stefano Brivio
To: Jon Maloy
Cc: David Gibson, passt-dev@passt.top
Subject: Re: [PATCH v4] udp: support traceroute
Message-ID: <20250404135436.76faa385@elisabeth>
In-Reply-To: <20250404135015.069d5a91@elisabeth>
References: <20250403022229.836067-1-jmaloy@redhat.com>
 <20250403174833.6d033172@elisabeth>
 <4986e27d-20d9-4b2b-883d-d696e84ec9cf@redhat.com>
 <20250404135015.069d5a91@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Fri, 4 Apr 2025 13:50:15 +0200
Stefano Brivio wrote:

> Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s.
> :) I just saw no reason to specifically spam Laurent with this.
>
> On Fri, 4 Apr 2025 10:31:29 +1100
> David Gibson wrote:
>
> > On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote:
> > >
> > >
> > > On 2025-04-03 11:48, Stefano Brivio wrote:
> > > > The implementation looks solid to me, a list of nits (or a bit
> > > > more) below.
> > > >
> > > > By the way, I don't think you need to Cc: people who are already on
> > > > this list unless you specifically want their attention.
> > > >
> > > > On Wed, 2 Apr 2025 22:22:29 -0400
> > > > Jon Maloy wrote:
> > > >
> > > > > Now that ICMP pass-through from socket-to-tap is in place, it is
> > > > > easy to support UDP based traceroute functionality in direction
> > > > > tap-to-socket.
> > > > >
> > > > > We fix that in this commit.
> > > > >
> > > > > Signed-off-by: Jon Maloy
> > > >
> > > > This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I
> > > > understood correctly.
> > > >
> > > > > ---
> > > > > v2: - Using ancillary data instead of setsockopt to transfer outgoing
> > > > >       TTL.
> > > > >     - Support IPv6
> > > > > v3: - Storing ttl per packet instead of per flow. This may not be
> > > > >       elegant, but much less intrusive than changing the flow
> > >
> > > [...]
> > >
> > > > > @@ -11,6 +11,8 @@
> > > > >  /* Maximum size of a single packet stored in pool, including headers */
> > > > >  #define PACKET_MAX_LEN ((size_t)UINT16_MAX)
> > > > > +#define DEFAULT_TTL 64
> > > >
> > > > If I understood correctly, David's comment to this on v3:
> > > >
> > > > https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/
> > > >
> > > > was meant to imply that, as the default value can be changed via
> > > > sysctl, the value set via sysctl could be read at start-up. I'm fine
> > > > with 64 as well, by the way, with a slight preference for reading the
> > > > value via sysctl.
> > >
> > > I don't think the local host/container setting will have any effect
> > > if the sending guest is a VM.
> >
> > That's true, but..
> >
> > > The benefit of this is dubious.
> >
> > .. uflow->ttl[] isn't so much representing what the guest set, as a
> > cache of what the socket is sending and that *does* depend on the host
> > value.
>
> Right, my concern is that now we'll use the host value (whatever it is)
> if the value from the container / guest is 64.
>
> So:
>
> - guest uses 63, host has 255 configured: we use 63
>
> - guest uses 64, host has 64 configured: we use 64
>
> - ...but: guest uses 64, host has 255 configured: we use 255
>
> ...and this might actually break traceroute itself in some extreme
> cases.
>
> Let's say we have 255 configured on the host and you're in the middle
> of a traceroute:
>
> - guest sends TTL 62, goes out with 62 -> 62nd hop replies
> - guest sends TTL 63, goes out with 63 -> 63rd hop replies
> - guest sends TTL 64, goes out with 255 -> destination replies
> - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken
>
> See also the comment below.
> >
> > > > All this might go away, though, please read the comment to
> > > > udp_flow_new() below, first.
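
The traceroute sequence above can be reproduced with a small model. This
is only a sketch, not passt code. It assumes that each traceroute probe
ends up in a freshly created flow (classic UDP traceroute varies the
destination port per probe), that a new socket starts out with the host's
default TTL, and that the per-flow cache is seeded as in the patch. The
names model_flow, model_flow_new(), model_send() and model_probe() are
made up for illustration.

```c
#include <stdint.h>

#define DEFAULT_TTL 64	/* cache seed used by the patch under review */

/* Hypothetical model of one new flow and the socket behind it */
struct model_flow {
	uint8_t cached_ttl;	/* what we believe the socket sends */
	uint8_t sock_ttl;	/* what the socket really sends */
};

/* A new socket sends the host default; the cache starts at 'seed' */
static struct model_flow model_flow_new(uint8_t host_ttl, uint8_t seed)
{
	struct model_flow f = { .cached_ttl = seed, .sock_ttl = host_ttl };

	return f;
}

/* Return the TTL that goes out on the wire for a guest packet with 'ttl' */
static uint8_t model_send(struct model_flow *f, uint8_t ttl)
{
	if (ttl != f->cached_ttl) {
		f->cached_ttl = ttl;
		f->sock_ttl = ttl;	/* stands in for setsockopt(IP_TTL) */
	}

	return f->sock_ttl;
}

/* One probe, one new flow (assumption stated above) */
static uint8_t model_probe(uint8_t host_ttl, uint8_t seed, uint8_t guest_ttl)
{
	struct model_flow f = model_flow_new(host_ttl, seed);

	return model_send(&f, guest_ttl);
}
```

In this model, with a host default of 255 and the cache seeded with
DEFAULT_TTL, model_probe(255, DEFAULT_TTL, 64) yields 255 while probes
with 63 and 65 go out unchanged. Seeding the cache with the host value
instead, model_probe(255, 255, 64), yields 64.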
> > > >
> > > > > +
> > > > >  /**
> > > > >   * struct pool - Generic pool of packets stored in a buffer
> > > > >   * @buf: Buffer storing packet descriptors,
> > > > > diff --git a/tap.c b/tap.c
> > > > > index 3a6fcbe..e65d592 100644
> > > > > --- a/tap.c
> > > > > +++ b/tap.c
> > > > > @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > > > >   * @dest: Destination port
> > > > >   * @saddr: Source address
> > > > >   * @daddr: Destination address
> > > > > + * @ttl: Time to live
> > > > >   * @msg: Array of messages that can be handled in a single call
> > > > >   */
> > > > >  static struct tap4_l4_t {
> > > > > @@ -574,6 +575,8 @@ static struct tap4_l4_t {
> > > > >          struct in_addr saddr;
> > > > >          struct in_addr daddr;
> > > > > +        uint8_t ttl;
> > > >
> > > > If you move this after 'protocol' you save 4 or 8 bytes depending on
> > > > the architecture and, perhaps more importantly, with 64-byte cachelines,
> > > > you can fit the set of fields involved in the L4_MATCH() comparison
> > > > four times instead of three. If you have a look with pahole(1):
> > > >
> > > Good point. I didn't notice.
> > > >
> > > > > [...]
> > > > >          const struct flowside *toside;
> > > > >          struct mmsghdr mm[UIO_MAXIOV];
> > > > > @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > > > >                  mm[i].msg_hdr.msg_controllen = 0;
> > > > >                  mm[i].msg_hdr.msg_flags = 0;
> > > > > +                if (ttl != uflow->ttl[tosidx.sidei]) {
> > > > > +                        uflow->ttl[tosidx.sidei] = ttl;
> > > > > +                        if (af == AF_INET) {
> > > > > +                                if (setsockopt(s, IPPROTO_IP, IP_TTL,
> > > > > +                                               &ttl, sizeof(ttl)) < 0)
> > > > > +                                        perror("setsockopt (IP_TTL)");
> > > >
> > > > This would print to file descriptor 2 even if it's a socket.
> > > > It should be err_perror() instead, but now we also have
> > > > flow_perror() which prints flow index and type, given 'uflow' here,
> > > > say:
> > > >
> > > >         flow_perror(uflow, "IP_TTL setsockopt");
> > > >
> > > > > +                        } else {
> > > > > +                                if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
> > > > > +                                               &ttl, sizeof(ttl)) < 0)
> > > > > +                                        perror("setsockopt (IP_TTL)");
> > > >
> > > > ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps:
> > > >
> > > >         flow_perror(uflow,
> > > >                     "setsockopt IPV6_HOPLIMIT");
> > > >
> > > Ok.
> > >
> > > > > +                        }
> > > > > +                }
> > > > > +
> > > > >                  count++;
> > > > >          }
> > > > > diff --git a/udp.h b/udp.h
> > > > > index de2df6d..041fad4 100644
> > > > > --- a/udp.h
> > > > > +++ b/udp.h
> > > > > @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> > > > >                              uint32_t events, const struct timespec *now);
> > > > >  int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > > > >                      sa_family_t af, const void *saddr, const void *daddr,
> > > > > -                    const struct pool *p, int idx, const struct timespec *now);
> > > > > +                    uint8_t ttl, const struct pool *p, int idx,
> > > >
> > > > Excess whitespace between 'uint8_t' and 'ttl'.
> > > >
> > > > > +                    const struct timespec *now);
> > > > >  int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > > > >                    const char *ifname, in_port_t port);
> > > > >  int udp_init(struct ctx *c);
> > > > > diff --git a/udp_flow.c b/udp_flow.c
> > > > > index bf4b896..39372c2 100644
> > > > > --- a/udp_flow.c
> > > > > +++ b/udp_flow.c
> > > > > @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
> > > > >          uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
> > > > >          uflow->ts = now->tv_sec;
> > > > >          uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
> > > > > +        uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;
> > > >
> > > > By the way, instead of using a default value, what about fetching the
> > > > current value with getsockopt()?
> > > >
> > > > One additional system call per UDP flow doesn't feel like a lot of
> > > > overhead, and we can be sure it's correct, no matter if the user
> > > > configures a different value before or after we start.
> > > >
> > > This patch fixes UDP messaging tap->socket, and TTL may have any
> > > value in the first arriving packet. Reading it from the socket here only
> > > makes sense when I add the same support in direction socket->tap.
> > > That is my next project.
>
> Well, wait, the getsockopt() will not tell you the value the socket is
> receiving. It tells you the value that the socket would send, at least
> according to the documentation:
>
>        IP_TTL (since Linux 1.0)
>               Set or retrieve the current time-to-live field that is
>               used in every packet sent from this socket.
>
> and that's what makes it relevant: this is the value that we would
> normally use, unless you issue the setsockopt().
>
> But... there's a plot twist: this is just for IPv4. For IPv6:
>
>        IPV6_RTHDR, IPV6_AUTHHDR, IPV6_DSTOPTS, IPV6_HOPOPTS, IPV6_FLOWINFO,
>        IPV6_HOPLIMIT
>               Set delivery of control messages for incoming datagrams
>               containing extension headers from the received packet.
>
>               [...]
>
>               IPV6_HOPLIMIT delivers an integer containing the hop count of
>               the packet.
>
> so I wonder: is it correct to use IPV6_HOPLIMIT at all, even for the
> setsockopt() you're adding?
>
> I haven't tested this (at least not yet), but from the documentation
> that seems to apply to *received* packets. No idea what the
> setsockopt() would do, at this point.
>
> Could it be that we should use IP_TTL for *sent* IPv4 packets as well?
                                                   ^^^^
                                                   IPv6, I meant
>
> I'll try to test this specific part in a bit, unless you already did.

-- 
Stefano
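
For the IPv4 half of the question above, reading back what a socket would
send is a single getsockopt(). The sketch below is a hedged illustration,
not passt code: hops_get() and hops_set() are hypothetical helpers. For
IPv6 it uses IPV6_UNICAST_HOPS, which ipv6(7) describes as setting the
unicast hop limit for the socket. Whether that is the right replacement
for IPV6_HOPLIMIT in this patch is an assumption that still needs the
test mentioned above.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: TTL / hop limit the socket would use for outgoing
 * unicast packets, or -1 on failure (errno is left as set by getsockopt())
 */
static int hops_get(int s, sa_family_t af)
{
	socklen_t sl;
	int v, ret;

	sl = sizeof(v);
	if (af == AF_INET)
		ret = getsockopt(s, IPPROTO_IP, IP_TTL, &v, &sl);
	else	/* assumption: IPV6_UNICAST_HOPS is the sending-side option */
		ret = getsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS, &v, &sl);

	return ret < 0 ? -1 : v;
}

/* Hypothetical helper: set TTL / hop limit for outgoing unicast packets,
 * using an int argument as documented in ip(7) and ipv6(7).
 * Returns 0 on success, -1 on failure.
 */
static int hops_set(int s, sa_family_t af, int hops)
{
	if (af == AF_INET)
		return setsockopt(s, IPPROTO_IP, IP_TTL,
				  &hops, sizeof(hops));

	return setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
			  &hops, sizeof(hops));
}
```

Calling hops_get() right after socket() gives the value a new flow could
use to seed its cache. After hops_set(s, af, n) succeeds, hops_get(s, af)
returns n.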