Message-ID: <3b055987-7c7f-4cd0-9757-9676b9142d17@redhat.com>
Date: Fri, 4 Apr 2025 08:54:46 -0400
Subject: Re: [PATCH v4] udp: support traceroute
To: Stefano Brivio
References: <20250403022229.836067-1-jmaloy@redhat.com> <20250403174833.6d033172@elisabeth> <4986e27d-20d9-4b2b-883d-d696e84ec9cf@redhat.com> <20250404135015.069d5a91@elisabeth>
From: Jon Maloy
In-Reply-To: <20250404135015.069d5a91@elisabeth>
CC: David Gibson, passt-dev@passt.top
List-Id: Development discussion and patches for passt

On 2025-04-04 07:50, Stefano Brivio wrote:
> Jon, I wasn't actually suggesting that you would drop *all* the Cc:'s.
> :) I just saw no reason to specifically spam Laurent with this.
>
> On Fri, 4 Apr 2025 10:31:29 +1100
> David Gibson wrote:
>
>> On Thu, Apr 03, 2025 at 04:27:12PM -0400, Jon Maloy wrote:
>>>
>>>
>>> On 2025-04-03 11:48, Stefano Brivio wrote:
>>>> The implementation looks solid to me, a list of nits (or a bit
>>>> more) below.
>>>>
>>>> By the way, I don't think you need to Cc: people who are already on
>>>> this list unless you specifically want their attention.
>>>>
>>>> On Wed, 2 Apr 2025 22:22:29 -0400
>>>> Jon Maloy wrote:
>>>>
>>>>> Now that ICMP pass-through from socket-to-tap is in place, it is
>>>>> easy to support UDP based traceroute functionality in direction
>>>>> tap-to-socket.
>>>>>
>>>>> We fix that in this commit.
>>>>>
>>>>> Signed-off-by: Jon Maloy
>>>>
>>>> This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I
>>>> understood correctly.
>>>>
>>>>> ---
>>>>> v2: - Using ancillary data instead of setsockopt to transfer outgoing
>>>>>       TTL.
>>>>>     - Support IPv6
>>>>> v3: - Storing ttl per packet instead of per flow. This may not be
>>>>>       elegant, but much less intrusive than changing the flow
>>>
>>> [...]
>>>
>>>>> @@ -11,6 +11,8 @@
>>>>>  /* Maximum size of a single packet stored in pool, including headers */
>>>>>  #define PACKET_MAX_LEN ((size_t)UINT16_MAX)
>>>>> +#define DEFAULT_TTL 64
>>>>
>>>> If I understood correctly, David's comment to this on v3:
>>>>
>>>> https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/
>>>>
>>>> was meant to imply that, as the default value can be changed via
>>>> sysctl, the value set via sysctl could be read at start-up. I'm fine
>>>> with 64 as well, by the way, with a slight preference for reading the
>>>> value via sysctl.
>>>
>>> I don't think the local host/container setting will have any effect
>>> if the sending guest is a VM.
>>
>> That's true, but..
>>
>>> The benefit of this is dubious.
>>
>> .. uflow->ttl[] isn't so much representing what the guest set, as a
>> cache of what the socket is sending and that *does* depend on the host
>> value.
>
> Right, my concern is that now we'll use the host value (whatever it is)
> if the value from the container / guest is 64.
>
> So:
>
> - guest uses 63, host has 255 configured: we use 63
>
> - guest uses 64, host has 64 configured: we use 64
>
> - ...but: guest uses 64, host has 255 configured: we use 255
>
> ...and this might actually break traceroute itself in some extreme
> cases.
>
> Let's say we have 255 configured on the host and you're in the middle
> of a traceroute:
>
> - guest sends TTL 62, goes out with 62 -> 62nd hop replies
> - guest sends TTL 63, goes out with 63 -> 63rd hop replies
> - guest sends TTL 64, goes out with 255 -> destination replies
> - guest sends TTL 65, goes out with 65 -> 65th hop replies, traceroute broken

The conclusion is that we have to set the TTL on the socket when opening
every new flow in the tap->sock direction. I can do that in a separate
patch.

>
> See also the comment below.
>
>>>> All this might go away, though, please read the comment to
>>>> udp_flow_new() below, first.
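
To make the scenario above concrete, here is a minimal, self-contained
model of the comparison done in udp_tap_handler(). It is only a sketch:
struct flow_model and probe_out_ttl() are hypothetical names, not passt
code, setsockopt() is replaced by a plain assignment, and it assumes that
every traceroute probe ends up as a new flow, so the cached value always
starts out as DEFAULT_TTL.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_TTL 64	/* initial cache value, as in the patch under review */

/* Hypothetical model of one freshly created flow: 'cached' is what the code
 * believes the socket sends, 'sock_ttl' is what the kernel would really put
 * on the wire (the host default until the option is set explicitly).
 */
struct flow_model {
	uint8_t cached;
	uint8_t sock_ttl;
};

/* Return the TTL a probe leaves the host with, when the guest sent
 * 'guest_ttl' as the first packet of a new flow.
 */
uint8_t probe_out_ttl(uint8_t guest_ttl, uint8_t host_default)
{
	struct flow_model f = { DEFAULT_TTL, host_default };

	assert(host_default > 0);	/* a host default of 0 makes no sense */

	/* Same shape as the check in the patch: update on mismatch only */
	if (guest_ttl != f.cached) {
		f.cached = guest_ttl;
		f.sock_ttl = guest_ttl;	/* stands in for setsockopt() */
	}

	return f.sock_ttl;
}
```

With a host default of 255, probe_out_ttl(62, 255) and
probe_out_ttl(63, 255) return the guest value, while
probe_out_ttl(64, 255) returns 255, because the guest value happens to
match the assumed default and the socket is never touched.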
>>>>
>>>>> +
>>>>>  /**
>>>>>   * struct pool - Generic pool of packets stored in a buffer
>>>>>   * @buf: Buffer storing packet descriptors,
>>>>> diff --git a/tap.c b/tap.c
>>>>> index 3a6fcbe..e65d592 100644
>>>>> --- a/tap.c
>>>>> +++ b/tap.c
>>>>> @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
>>>>>   * @dest: Destination port
>>>>>   * @saddr: Source address
>>>>>   * @daddr: Destination address
>>>>> + * @ttl: Time to live
>>>>>   * @msg: Array of messages that can be handled in a single call
>>>>>   */
>>>>>  static struct tap4_l4_t {
>>>>> @@ -574,6 +575,8 @@ static struct tap4_l4_t {
>>>>>  	struct in_addr saddr;
>>>>>  	struct in_addr daddr;
>>>>> +	uint8_t ttl;
>>>>
>>>> If you move this after 'protocol' you save 4 or 8 bytes depending on
>>>> the architecture and, perhaps more importantly, with 64-byte cachelines,
>>>> you can fit the set of fields involved in the L4_MATCH() comparison
>>>> four times instead of three. If you have a look with pahole(1):
>>>>
>>> Good point. I didn't notice.
>>>
>>>
>>> [...]
>>>>>  	const struct flowside *toside;
>>>>>  	struct mmsghdr mm[UIO_MAXIOV];
>>>>> @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>>>>>  		mm[i].msg_hdr.msg_controllen = 0;
>>>>>  		mm[i].msg_hdr.msg_flags = 0;
>>>>> +		if (ttl != uflow->ttl[tosidx.sidei]) {
>>>>> +			uflow->ttl[tosidx.sidei] = ttl;
>>>>> +			if (af == AF_INET) {
>>>>> +				if (setsockopt(s, IPPROTO_IP, IP_TTL,
>>>>> +					       &ttl, sizeof(ttl)) < 0)
>>>>> +					perror("setsockopt (IP_TTL)");
>>>>
>>>> This would print to file descriptor 2 even if it's a socket. It should
>>>> be err_perror() instead, but now we also have flow_perror() which
>>>> prints flow index and type, given 'uflow' here, say:
>>>>
>>>> 	flow_perror(uflow, "IP_TTL setsockopt");
>>>>
>>>>> +			} else {
>>>>> +				if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
>>>>> +					       &ttl, sizeof(ttl)) < 0)
>>>>> +					perror("setsockopt (IP_TTL)");
>>>>
>>>> ...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps:
>>>>
>>>> 	flow_perror(uflow,
>>>> 		    "setsockopt IPV6_HOPLIMIT");
>>>>
>>> Ok.
>>>
>>>>> +			}
>>>>> +		}
>>>>> +
>>>>>  		count++;
>>>>>  	}
>>>>> diff --git a/udp.h b/udp.h
>>>>> index de2df6d..041fad4 100644
>>>>> --- a/udp.h
>>>>> +++ b/udp.h
>>>>> @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>>>>>  			    uint32_t events, const struct timespec *now);
>>>>>  int udp_tap_handler(const struct ctx *c, uint8_t pif,
>>>>>  		    sa_family_t af, const void *saddr, const void *daddr,
>>>>> -		    const struct pool *p, int idx, const struct timespec *now);
>>>>> +		    uint8_t ttl, const struct pool *p, int idx,
>>>>
>>>> Excess whitespace between 'uint8_t' and 'ttl'.
>>>>
>>>>> +		    const struct timespec *now);
>>>>>  int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
>>>>>  		  const char *ifname, in_port_t port);
>>>>>  int udp_init(struct ctx *c);
>>>>> diff --git a/udp_flow.c b/udp_flow.c
>>>>> index bf4b896..39372c2 100644
>>>>> --- a/udp_flow.c
>>>>> +++ b/udp_flow.c
>>>>> @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
>>>>>  	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
>>>>>  	uflow->ts = now->tv_sec;
>>>>>  	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
>>>>> +	uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;
>>>>
>>>> By the way, instead of using a default value, what about fetching the
>>>> current value with getsockopt()?
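
For reference, a minimal sketch of what fetching (and setting) the
outgoing value on a socket could look like. The helper names
sock_out_ttl() and sock_set_out_ttl() are hypothetical, not existing
passt functions. Note one assumption that differs from the quoted patch:
for IPv6 the sketch uses IPV6_UNICAST_HOPS with an int argument, which is
the socket option ipv6(7) documents for the unicast hop limit, rather
than IPV6_HOPLIMIT.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: return the TTL (IPv4) or hop limit (IPv6) the kernel
 * would use for unicast packets sent on socket 's', or -1 on failure.
 */
int sock_out_ttl(int s, sa_family_t af)
{
	int level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
	int opt = (af == AF_INET) ? IP_TTL : IPV6_UNICAST_HOPS;
	socklen_t len = sizeof(int);
	int ttl = -1;

	assert(af == AF_INET || af == AF_INET6);

	if (getsockopt(s, level, opt, &ttl, &len) < 0)
		return -1;

	return ttl;
}

/* Hypothetical counterpart: set the outgoing TTL / hop limit on socket 's'.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
int sock_set_out_ttl(int s, sa_family_t af, int ttl)
{
	int level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
	int opt = (af == AF_INET) ? IP_TTL : IPV6_UNICAST_HOPS;

	assert(af == AF_INET || af == AF_INET6);

	return setsockopt(s, level, opt, &ttl, sizeof(ttl));
}
```

Something like uflow->ttl[...] = sock_out_ttl(s, af) right after the
socket for a new flow is opened would then seed the cache with whatever
the host is really configured to send, at the cost of one system call
per flow.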
>>>>
>>>> One additional system call per UDP flow doesn't feel like a lot of
>>>> overhead, and we can be sure it's correct, no matter if the user
>>>> configures a different value before or after we start.
>>>>
>>> This patch fixes UDP messaging tap->socket, and TTL may have any
>>> value in the first arriving packet. Reading it from the socket here only
>>> makes sense when I add the same support in direction socket->tap.
>>> That is my next project.
>
> Well, wait, the getsockopt() will not tell you the value the socket is
> receiving. It tells you the value that the socket would send, at least
> according to the documentation:

We do have IP_RECVTTL and IPV6_RECVHOPLIMIT for that. When the relevant
option is set, we can catch the TTL of received packets by reading
ancillary data.

///jon

>
> IP_TTL (since Linux 1.0)
>        Set or retrieve the current time-to-live field that is
>