From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=JWC3A334; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 50C7B5A026F for ; Fri, 04 Apr 2025 15:02:42 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1743771761; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5Bv/upec0EViqcvnj2rWo/5fzdgvAAF4ThNOgJpXMso=; b=JWC3A334tjL6mdQzBtYDiGeEs5FyBftnp0IzqVQaTn4YH5nnNzylbYNrRu9HQ59bwzSTeK QrrxPJH4JFjTmi0NLKQ4HBGaS3B+rnu/FdCL0+lQStTq1Y/G0teg0DIL/lBhX24RBVbMin h5chMfvPOuYQCZ4G49bMaCcAFSw/LeQ= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-671-b1rbAIx3NU2MD467gtKJKA-1; Fri, 04 Apr 2025 09:02:39 -0400 X-MC-Unique: b1rbAIx3NU2MD467gtKJKA-1 X-Mimecast-MFC-AGG-ID: b1rbAIx3NU2MD467gtKJKA_1743771758 Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-391459c0395so978174f8f.2 for ; Fri, 04 Apr 2025 06:02:38 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1743771757; x=1744376557; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=5Bv/upec0EViqcvnj2rWo/5fzdgvAAF4ThNOgJpXMso=; b=nU+dkEa3i+zm1k+k7vmxX35hKDqnKzH8O94xoLuFApv0pjJtuOjb5VcfI8mDpPRr1a x9iiEP94516K4FlsWAX1lyBOEPrXfC0JobDY3k6wJGF1oAVQTjdaOUj5SLLhch61/0FS UqexoJIpnJAuW5cLKv8Fmp8jnPEiAGi8AL9xuaTTV8zvgAMGWkj6NxArFTuhyX7q/VWe yyNccLdpYbzDXQoe4+o8Nn1iLUw6xw52qJ6K5DNrHQJODZBNmLxFMhoHuOHcd8W/HThK mvCpTgfgQBsU6a9I9nIeWtLKGYXLS3JLZ2IZZJystADUDxA3WatM7oEujOHANuaUpLt7 I6/g== X-Forwarded-Encrypted: i=1; AJvYcCUF7XYgcSW2q/eegYRotik74KubN4MPyN+nngPqu43k1ZtRzc8MazUhoV2NTMLrCRLBaD4mNX30htc=@passt.top X-Gm-Message-State: AOJu0Yz1xvq9Uyan2J48+CQq67DPwXNhvNvgTU5XxYzWp3X8DWpxmOui DbJ39UDYjch9Gv7DE4FrxNwXmWeo5tB5gB5MhI7Qp1sv1JKPc6iA9QjhIwHidF3i1dhxPaKgm4Z oqquwxivPanmB8xNowwcYvQXtvWIjqzd/w6vuHbzllxkNr62xFE1UFe5DOA== X-Gm-Gg: ASbGncuHHqoV2SadYfUPq6v9SEkO0NeZDyDR6A/H2i/6n3BqIZnnMd3DeSop/um3YyA mlXeX2glQmLeZBKFGMLehXhBELTDrZxbvNRK0OLX1k7SII0hkqoA5Oqvspm1r9WrDNUu3fSwrT2 J/NVaKrE63NByNocr+cl7ucpOHKzgjAZgLXkylxaw4qVt921pCk/xZcXy9fN9+8p9s/OXqhlN0v 3L1PHHccxSyU1HF2IW9M7D3P6bvpqS0NL+lQ1CAamBD3xn9K19VjXf+8jJ39n/D2zt46ErUd5+B oTRh4+NvmDazWkcD4jogdRy9pPc= X-Received: by 2002:a05:6000:1787:b0:391:1806:e23f with SMTP id ffacd0b85a97d-39cb3596ea9mr2825744f8f.17.1743771756927; Fri, 04 Apr 2025 06:02:36 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGVF2m5x4a61EheBjOM+M/pW2GQTOeoDgUBJvTMcA/ItoKoPvCe/KCK3b1PmSLFITOFarRDCg== X-Received: by 2002:a05:6000:1787:b0:391:1806:e23f with SMTP id ffacd0b85a97d-39cb3596ea9mr2825666f8f.17.1743771756270; Fri, 04 Apr 2025 06:02:36 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-39c3020d6b1sm4377387f8f.62.2025.04.04.06.02.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 04 Apr 2025 06:02:35 -0700 (PDT) Date: Fri, 4 Apr 2025 15:02:34 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH v4] udp: support traceroute Message-ID: <20250404150234.40f4d150@elisabeth> In-Reply-To: <3b055987-7c7f-4cd0-9757-9676b9142d17@redhat.com> References: <20250403022229.836067-1-jmaloy@redhat.com> <20250403174833.6d033172@elisabeth> <4986e27d-20d9-4b2b-883d-d696e84ec9cf@redhat.com> <20250404135015.069d5a91@elisabeth> <3b055987-7c7f-4cd0-9757-9676b9142d17@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: sR-wO-fQojzX8Q-fnSse-aCTQZWn2IKHT0f59M9XIBo_1743771758 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: W5BYY3YW5OEO5Q74EYGARGIRN7NREP46 X-Message-ID-Hash: W5BYY3YW5OEO5Q74EYGARGIRN7NREP46 X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: David Gibson , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 4 Apr 2025 08:54:46 -0400 Jon Maloy wrote: > On 2025-04-04 07:50, Stefano Brivio wrote: > > Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s. > > :) I just saw no reason to specifically spam Laurent with this. > > > > On Fri, 4 Apr 2025 10:31:29 +1100 > > David Gibson wrote: > > > >> On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote: > >>> > >>> > >>> On 2025-04-03 11:48, Stefano Brivio wrote: > >>>> The implementation looks solid to me, a list of nits (or a bit > >>>> more) below. > >>>> > >>>> By the way, I don't think you need to Cc: people who are already on > >>>> this list unless you specifically want their attention. > >>>> > >>>> On Wed, 2 Apr 2025 22:22:29 -0400 > >>>> Jon Maloy wrote: > >>>> > >>>>> Now that ICMP pass-through from socket-to-tap is in place, it is > >>>>> easy to support UDP based traceroute functionality in direction > >>>>> tap-to-socket. > >>>>> > >>>>> We fix that in this commit. > >>>>> > >>>>> Signed-off-by: Jon Maloy > >>>> > >>>> This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I > >>>> understood correctly. > >>>> > >>>>> --- > >>>>> v2: - Using ancillary data instead of setsockopt to transfer outgoing > >>>>> TTL. > >>>>> - Support IPv6 > >>>>> v3: - Storing ttl per packet instead of per flow. This may not be > >>>>> elegant, but much less intrusive than changing the flow > >>> > >>> [...] > >>> > >>>>> @@ -11,6 +11,8 @@ > >>>>> /* Maximum size of a single packet stored in pool, including headers */ > >>>>> #define PACKET_MAX_LEN ((size_t)UINT16_MAX) > >>>>> +#define DEFAULT_TTL 64 > >>>> > >>>> If I understood correctly, David's comment to this on v3: > >>>> > >>>> https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/ > >>>> > >>>> was meant to imply that, as the default value can be changed via > >>>> sysctl, the value set via sysctl could be read at start-up. I'm fine > >>>> with 64 as well, by the way, with a slight preference for reading the > >>>> value via sysctl. > >>> > >>> I don't think the local host/container setting will have any effect > >>> if the sending guest is a VM. > >> > >> That's true, but.. > >> > >>> The benefit is of this is dubious. > >> > >> .. uflow->ttl[] isn't so much representing what the guest set, as a > >> cache of what the socket is sending and that *does* depend on the host > >> value. > > > > Right, my concern is that now we'll use the host value (whatever it is) > > if the value from the container / guest is 64. > > > > So: > > > > - guest uses 63, host has 255 configured: we use 63 > > > > - guest uses 64, host has 64 configured: we use 64 > > > > - ...but: guest uses 64, host has 255 configured: we use 255 > > > > ...and this might actually break traceroute itself in some extreme > > cases. > > > > Let's say we have 255 configured on the host and you're in the middle > > of a traceroute: > > > > - guest sends TTL 62, goes out with 62 -> 62nd hop replies > > - guest sends TTL 63, goes out with 63 -> 63rd hop replies > > - guest sends TTL 64, goes out with 255 -> destination replies > > - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken > > Conclusion is that we have to set TTL in the socket at the opening of > every new flow in direction tap->sock. I can do that in a separate patch. Not necessarily, I think, you can also check what the current value is. If it matches, there's no need to set it. > > See also the comment below. > > > >>>> All this might go away, though, please read the comment to > >>>> udp_flow_new() below, first. > >>>> > >>>>> + > >>>>> /** > >>>>> * struct pool - Generic pool of packets stored in a buffer > >>>>> * @buf: Buffer storing packet descriptors, > >>>>> diff --git a/tap.c b/tap.c > >>>>> index 3a6fcbe..e65d592 100644 > >>>>> --- a/tap.c > >>>>> +++ b/tap.c > >>>>> @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf); > >>>>> * @dest: Destination port > >>>>> * @saddr: Source address > >>>>> * @daddr: Destination address > >>>>> + * @ttl: Time to live > >>>>> * @msg: Array of messages that can be handled in a single call > >>>>> */ > >>>>> static struct tap4_l4_t { > >>>>> @@ -574,6 +575,8 @@ static struct tap4_l4_t { > >>>>> struct in_addr saddr; > >>>>> struct in_addr daddr; > >>>>> + uint8_t ttl; > >>>> > >>>> If you move this after 'protocol' you save 4 or 8 bytes depending on > >>>> the architecture and, perhaps more importantly, with 64-byte cachelines, > >>>> you can fit the set of fields involved in the L4_MATCH() comparison > >>>> four times instead of three. If you have a look with pahole(1): > >>>> > >>> Good point. I didn't notice. > >>> > >>> > >>> [...] > >>>>> const struct flowside *toside; > >>>>> struct mmsghdr mm[UIO_MAXIOV]; > >>>>> @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, > >>>>> mm[i].msg_hdr.msg_controllen = 0; > >>>>> mm[i].msg_hdr.msg_flags = 0; > >>>>> + if (ttl != uflow->ttl[tosidx.sidei]) { > >>>>> + uflow->ttl[tosidx.sidei] = ttl; > >>>>> + if (af == AF_INET) { > >>>>> + if (setsockopt(s, IPPROTO_IP, IP_TTL, > >>>>> + &ttl, sizeof(ttl)) < 0) > >>>>> + perror("setsockopt (IP_TTL)"); > >>>> > >>>> This would print to file descriptor 2 even if it's a socket. It should > >>>> be err_perror() instead, but now we also have flow_perror() which > >>>> prints flow index and type, given 'uflow' here, say: > >>>> > >>>> flow_perror(uflow, "IP_TTL setsockopt"); > >>>> > >>>>> + } else { > >>>>> + if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT, > >>>>> + &ttl, sizeof(ttl)) < 0) > >>>>> + perror("setsockopt (IP_TTL)"); > >>>> > >>>> ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps: > >>>> > >>>> flow_perror(uflow, > >>>> "setsockopt IPV6_HOPLIMIT"); > >>>> > >>> Ok. > >>> > >>>>> + } > >>>>> + } > >>>>> + > >>>>> count++; > >>>>> } > >>>>> diff --git a/udp.h b/udp.h > >>>>> index de2df6d..041fad4 100644 > >>>>> --- a/udp.h > >>>>> +++ b/udp.h > >>>>> @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, > >>>>> uint32_t events, const struct timespec *now); > >>>>> int udp_tap_handler(const struct ctx *c, uint8_t pif, > >>>>> sa_family_t af, const void *saddr, const void *daddr, > >>>>> - const struct pool *p, int idx, const struct timespec *now); > >>>>> + uint8_t ttl, const struct pool *p, int idx, > >>>> > >>>> Excess whitespace beetween 'uint8_t' and 'ttl'. > >>>> > >>>>> + const struct timespec *now); > >>>>> int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, > >>>>> const char *ifname, in_port_t port); > >>>>> int udp_init(struct ctx *c); > >>>>> diff --git a/udp_flow.c b/udp_flow.c > >>>>> index bf4b896..39372c2 100644 > >>>>> --- a/udp_flow.c > >>>>> +++ b/udp_flow.c > >>>>> @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, > >>>>> uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); > >>>>> uflow->ts = now->tv_sec; > >>>>> uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; > >>>>> + uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL; > >>>> > >>>> By the way, instead of using a default value, what about fetching the > >>>> current value with getsockopt()? > >>>> > >>>> One additional system call per UDP flow doesn't feel like a lot of > >>>> overhead, and we can be sure it's correct, no matter if the user > >>>> configures a different value before or after we start. > >>>> > >>> This patch fixes UDP messaging tap->socket, and TTL may have any > >>> value in the first arriving packet. Reading it from the socket here only > >>> makes sense when I add the same support in direction socket->tap. > >>> That is my next project. > > > > Well, wait, the getsockopt() will not tell you the value the socket is > > receiving. It tells you the value that the socket would send, at least > > according to the documentation: > > We do have IP_RECVTTL and IP6_RECVHOPLIMIT for that. When this option is > set, we can catch the TTL on received packets by reading anillary data. Sure, but here we're talking about the value that the socket would send. That's what I'm suggesting to fetch via getsockopt() for IPv4. How you're using *IPV6_HOPLIMIT* here doesn't match the documentation. Is the documentation wrong? I haven't checked. > ///jon > > > > > IP_TTL (since Linux 1.0) > > Set or retrieve the current time-to-live field that is -- Stefano