Date: Thu, 3 Apr 2025 17:48:33 +0200
From: Stefano Brivio
To: Jon Maloy
Subject: Re: [PATCH v4] udp: support traceroute
Message-ID: <20250403174833.6d033172@elisabeth>
In-Reply-To: <20250403022229.836067-1-jmaloy@redhat.com>
References: <20250403022229.836067-1-jmaloy@redhat.com>
Organization: Red Hat
CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
List-Id: Development discussion and patches for passt

The implementation looks solid to me, a list of nits (or a bit more)
below. By the way, I don't think you need to Cc: people who are already
on this list unless you specifically want their attention.

On Wed, 2 Apr 2025 22:22:29 -0400
Jon Maloy wrote:

> Now that ICMP pass-through from socket-to-tap is in place, it is
> easy to support UDP based traceroute functionality in direction
> tap-to-socket.
>
> We fix that in this commit.
>
> Signed-off-by: Jon Maloy

This fixes https://bugs.passt.top/show_bug.cgi?id=64 ("Link:" tag) if I
understood correctly.

> ---
> v2: - Using ancillary data instead of setsockopt to transfer outgoing
>       TTL.
>     - Support IPv6
> v3: - Storing ttl per packet instead of per flow. This may not be
>       elegant, but much less intrusive than changing the flow
>       criteria. This eliminates the need for the extra, flow-changing
>       patch we introduced in v2.
> v4: - Going back to something similar to the original solution, but
>       storing current ttl in struct udp_flow, plus ensuring that all
>       packets in a struct tap4_l4_t/tap6_l4_t instance, have the same
>       ttl. After input from David Gibson.
> ---
>  packet.h   |  2 ++
>  tap.c      | 18 ++++++++++++++----
>  udp.c      | 17 ++++++++++++++++-
>  udp.h      |  3 ++-
>  udp_flow.c |  1 +
>  udp_flow.h |  1 +
>  6 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/packet.h b/packet.h
> index c94780a..e84e123 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -11,6 +11,8 @@
>  /* Maximum size of a single packet stored in pool, including headers */
>  #define PACKET_MAX_LEN	((size_t)UINT16_MAX)
>
> +#define DEFAULT_TTL 64

If I understood correctly, David's comment to this on v3:

  https://archives.passt.top/passt-dev/Z-om3Ey-HR1Hj8UH@zatzit/

was meant to imply that, as the default value can be changed via sysctl,
the value set via sysctl could be read at start-up.

I'm fine with 64 as well, by the way, with a slight preference for
reading the value via sysctl. All this might go away, though, please
read the comment to udp_flow_new() below, first.
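
To illustrate the sysctl option, here is a minimal sketch, assuming the
value is read once from procfs at start-up (before any sandboxing makes
/proc unreachable). The helper name default_ttl_read() is made up for
this example and is not something that exists in passt; the fallback to
64 is also just an assumption matching DEFAULT_TTL above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of passt: read the default IPv4 TTL from
 * @path (typically /proc/sys/net/ipv4/ip_default_ttl). Falls back to 64 if
 * the file can't be opened, can't be parsed, or holds an out-of-range value.
 */
int default_ttl_read(const char *path)
{
	int ttl = 64;
	FILE *f;

	assert(path);

	f = fopen(path, "r");
	if (!f)
		return ttl;

	/* A valid TTL fits in one byte and is never zero */
	if (fscanf(f, "%d", &ttl) != 1 || ttl < 1 || ttl > 255)
		ttl = 64;

	fclose(f);
	return ttl;
}
```

This would be called once with "/proc/sys/net/ipv4/ip_default_ttl", and
the result stored wherever the constant is used now.
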
> +
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
>   * @buf:	Buffer storing packet descriptors,
> diff --git a/tap.c b/tap.c
> index 3a6fcbe..e65d592 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -563,6 +563,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
>   * @dest:	Destination port
>   * @saddr:	Source address
>   * @daddr:	Destination address
> + * @ttl:	Time to live
>   * @msg:	Array of messages that can be handled in a single call
>   */
>  static struct tap4_l4_t {
> @@ -574,6 +575,8 @@ static struct tap4_l4_t {
>  	struct in_addr saddr;
>  	struct in_addr daddr;
>
> +	uint8_t ttl;

If you move this after 'protocol' you save 4 or 8 bytes depending on the
architecture and, perhaps more importantly, with 64-byte cachelines, you
can fit the set of fields involved in the L4_MATCH() comparison four
times instead of three. If you have a look with pahole(1):

--
struct tap4_l4_t {
	uint8_t                    protocol;             /*     0     1 */

	/* XXX 1 byte hole, try to pack */

	uint16_t                   source;               /*     2     2 */
	uint16_t                   dest;                 /*     4     2 */

	/* XXX 2 bytes hole, try to pack */

	struct in_addr             saddr;                /*     8     4 */
	struct in_addr             daddr;                /*    12     4 */
	uint8_t                    ttl;                  /*    16     1 */

	/* XXX 7 bytes hole, try to pack */

	...
}
--

becomes:

--
struct tap4_l4_t {
	uint8_t                    protocol;             /*     0     1 */
	uint8_t                    ttl;                  /*     1     1 */
	uint16_t                   source;               /*     2     2 */
	uint16_t                   dest;                 /*     4     2 */

	/* XXX 2 bytes hole, try to pack */

	struct in_addr             saddr;                /*     8     4 */
	struct in_addr             daddr;                /*    12     4 */
	...
}
--

...if you move it, please don't forget to update the comment to the
struct.
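
The same thing can be checked without pahole, with a small sketch using
offsetof(). The structs below only mirror the fields shown above:
'struct pool_stub' is a stand-in I made up for 'struct pool_l4_t',
assuming only that it is pointer-aligned, and the names seq_before,
seq_after and seq_bytes_saved() are hypothetical as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>

/* Stand-in for struct pool_l4_t: only its pointer alignment matters here */
struct pool_stub {
	void *buf;
};

/* Layout as in the patch: 'ttl' sits right before the pool */
struct seq_before {
	uint8_t protocol;
	uint16_t source;
	uint16_t dest;
	struct in_addr saddr;
	struct in_addr daddr;
	uint8_t ttl;		/* followed by padding up to pointer alignment */
	struct pool_stub p;
};

/* Suggested layout: 'ttl' fills the hole after 'protocol' */
struct seq_after {
	uint8_t protocol;
	uint8_t ttl;
	uint16_t source;
	uint16_t dest;
	struct in_addr saddr;
	struct in_addr daddr;
	struct pool_stub p;
};

static_assert(offsetof(struct seq_before, ttl) == 16, "ttl after daddr");
static_assert(offsetof(struct seq_after, ttl) == 1, "ttl fills the hole");

/* Bytes saved before the pool starts: 8 with 8-byte pointers, 4 with 4-byte */
size_t seq_bytes_saved(void)
{
	return offsetof(struct seq_before, p) - offsetof(struct seq_after, p);
}
```

With the second layout, the fields compared by L4_MATCH() end at byte 16
instead of spilling into the next pointer-aligned slot.
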
> +
>  	struct pool_l4_t p;
>  } tap4_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
>
> @@ -586,6 +589,7 @@ static struct tap4_l4_t {
>   * @dest:	Destination port
>   * @saddr:	Source address
>   * @daddr:	Destination address
> + * @hop_limit:	Hop limit
>   * @msg:	Array of messages that can be handled in a single call
>   */
>  static struct tap6_l4_t {
> @@ -598,6 +602,8 @@ static struct tap6_l4_t {
>  	struct in6_addr saddr;
>  	struct in6_addr daddr;
>
> +	uint8_t hop_limit;

Here, instead, it doesn't matter, because 'p' starts at 48 bytes anyway,
and we compare the flow label too.

> +
>  	struct pool_l4_t p;
>  } tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
>
> @@ -786,7 +792,8 @@ resume:
>  #define L4_MATCH(iph, uh, seq) \
>  	((seq)->protocol == (iph)->protocol && \
>  	 (seq)->source == (uh)->source && (seq)->dest == (uh)->dest && \
> -	 (seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
> +	 (seq)->saddr.s_addr == (iph)->saddr && \
> +	 (seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
>
>  #define L4_SET(iph, uh, seq) \
>  	do { \
> @@ -795,6 +802,7 @@ resume:
>  		(seq)->dest = (uh)->dest; \
>  		(seq)->saddr.s_addr = (iph)->saddr; \
>  		(seq)->daddr.s_addr = (iph)->daddr; \
> +		(seq)->ttl = (iph)->ttl; \
>  	} while (0)
>
>  	if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
> @@ -843,7 +851,7 @@ append:
>  		for (k = 0; k < p->count; )
>  			k += udp_tap_handler(c, PIF_TAP, AF_INET,
>  					     &seq->saddr, &seq->daddr,
> -					     p, k, now);
> +					     seq->ttl, p, k, now);
>  	}
>  }
>
> @@ -966,7 +974,8 @@ resume:
>  	 (seq)->dest == (uh)->dest && \
>  	 (seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \
>  	 IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \
> -	 IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
> +	 IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr) && \
> +	 (seq)->hop_limit == (ip6h)->hop_limit)
>
>  #define L4_SET(ip6h, proto, uh, seq) \
>  	do { \
> @@ -976,6 +985,7 @@ resume:
>  		(seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \
>  		(seq)->saddr = *saddr; \
>  		(seq)->daddr = *daddr; \
> +		(seq)->hop_limit = (ip6h)->hop_limit; \
>  	} while (0)
>
>  	if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
> @@ -1026,7 +1036,7 @@ append:
>  		for (k = 0; k < p->count; )
>  			k += udp_tap_handler(c, PIF_TAP, AF_INET6,
>  					     &seq->saddr, &seq->daddr,
> -					     p, k, now);
> +					     seq->hop_limit, p, k, now);
>  	}
>  }
>
> diff --git a/udp.c b/udp.c
> index 39431d7..bc93292 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -849,6 +849,7 @@ fail:
>   * @af:		Address family, AF_INET or AF_INET6
>   * @saddr:	Source address
>   * @daddr:	Destination address
> + * @ttl:	TTL or hop limit for packets to be sent in this call
>   * @p:		Pool of UDP packets, with UDP headers
>   * @idx:	Index of first packet to process
>   * @now:	Current timestamp
> @@ -859,7 +860,8 @@ fail:
>   */
>  int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		    sa_family_t af, const void *saddr, const void *daddr,
> -		    const struct pool *p, int idx, const struct timespec *now)
> +		    uint8_t ttl, const struct pool *p, int idx,
> +		    const struct timespec *now)
>  {
>  	const struct flowside *toside;
>  	struct mmsghdr mm[UIO_MAXIOV];
> @@ -938,6 +940,19 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		mm[i].msg_hdr.msg_controllen = 0;
>  		mm[i].msg_hdr.msg_flags = 0;
>
> +		if (ttl != uflow->ttl[tosidx.sidei]) {
> +			uflow->ttl[tosidx.sidei] = ttl;
> +			if (af == AF_INET) {
> +				if (setsockopt(s, IPPROTO_IP, IP_TTL,
> +					       &ttl, sizeof(ttl)) < 0)
> +					perror("setsockopt (IP_TTL)");

This would print to file descriptor 2 even if it's a socket.
It should be err_perror() instead, but now we also have flow_perror()
which prints flow index and type, given 'uflow' here, say:

	flow_perror(uflow, "IP_TTL setsockopt");

> +			} else {
> +				if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
> +					       &ttl, sizeof(ttl)) < 0)
> +					perror("setsockopt (IP_TTL)");

...and this is IPV6_HOPLIMIT, not IP_TTL, so perhaps:

	flow_perror(uflow, "setsockopt IPV6_HOPLIMIT");

> +			}
> +		}
> +
>  		count++;
>  	}
>
> diff --git a/udp.h b/udp.h
> index de2df6d..041fad4 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  			    uint32_t events, const struct timespec *now);
>  int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		    sa_family_t af, const void *saddr, const void *daddr,
> -		    const struct pool *p, int idx, const struct timespec *now);
> +		    uint8_t ttl, const struct pool *p, int idx,

Excess whitespace between 'uint8_t' and 'ttl'.

> +		    const struct timespec *now);
>  int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
>  		  const char *ifname, in_port_t port);
>  int udp_init(struct ctx *c);
> diff --git a/udp_flow.c b/udp_flow.c
> index bf4b896..39372c2 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
>  	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
>  	uflow->ts = now->tv_sec;
>  	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
> +	uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;

By the way, instead of using a default value, what about fetching the
current value with getsockopt()? One additional system call per UDP flow
doesn't feel like a lot of overhead, and we can be sure it's correct, no
matter if the user configures a different value before or after we start.
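
Something along these lines, as a rough sketch only: sock_ttl_get() is a
name I made up, not an existing passt function. For the IPv6 side, I'm
assuming IPV6_UNICAST_HOPS is the socket option to query, as that is the
one documented for the unicast hop limit of a socket.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of passt: return the TTL (AF_INET) or the
 * unicast hop limit (AF_INET6) currently in effect on socket @s, or -1 if
 * getsockopt() fails.
 */
int sock_ttl_get(int s, sa_family_t af)
{
	socklen_t len;
	int v;

	assert(af == AF_INET || af == AF_INET6);

	/* Both options are read here as an int */
	len = sizeof(v);
	if (af == AF_INET) {
		if (getsockopt(s, IPPROTO_IP, IP_TTL, &v, &len) < 0)
			return -1;
	} else {
		if (getsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
			       &v, &len) < 0)
			return -1;
	}

	return v;
}
```

In udp_flow_new(), the result could then initialise uflow->ttl[] for
each side once its socket is open, in place of DEFAULT_TTL.
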
>
>  	if (s_ini >= 0) {
>  		/* When using auto port-scanning the listening port could go
> diff --git a/udp_flow.h b/udp_flow.h
> index 9a1b059..606ac08 100644
> --- a/udp_flow.h
> +++ b/udp_flow.h
> @@ -21,6 +21,7 @@ struct udp_flow {
>  	bool closed :1;
>  	time_t ts;
>  	int s[SIDES];
> +	uint8_t ttl[SIDES];

This should be added to the struct comment above, which, by mistake,
seems to refer to 'struct udp' by the way (I would fix that right away
while at it...).

>  };
>
>  struct udp_flow *udp_at_sidx(flow_sidx_t sidx);

-- 
Stefano