From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v5] udp: support traceroute
Date: Fri, 4 Apr 2025 11:57:58 +1100 [thread overview]
Message-ID: <Z-8uliWIs7YYWCdF@zatzit> (raw)
In-Reply-To: <Z-8ca23ZgeDg9HMc@zatzit>
[-- Attachment #1: Type: text/plain, Size: 10682 bytes --]
On Fri, Apr 04, 2025 at 10:40:27AM +1100, David Gibson wrote:
> On Thu, Apr 03, 2025 at 06:27:06PM -0400, Jon Maloy wrote:
> > Now that ICMP pass-through from socket-to-tap is in place, it is
> > easy to support UDP based traceroute functionality in direction
> > tap-to-socket.
> >
> > We fix that in this commit.
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=64
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> One commont below.
Oh.. wait. I just realised this patch has a weird side effect, for
flows which are initiated from outside, rather than from the guest.
If the flow is initiated from outside, it's maybe a bit unlikely, but
we could still get a non-default TTL from the guest on reply
datagrams. That will trigger this code and alter the socket's TTL.
But in this case the socket is not a flow specific socket, but the
listening socket which initiated the flow, which could be handling
packets on many flows.
The "cached" TTL is stored per-flow, not per-socket, so we won't look
at the right value when we process datagrams from other flows, so they
may go out with the wrong TTL.
I think the obvious way to address this is to stop using the
"listening" socket to send datagrams for a flow, using a connect()ed
socket instead. We have other reasons to do that too, and I'm working
on implementing that right now.
The question is whether this is a serious enough problem to delay this
until the "both sides connect()ed sockets" change is merged.
Mitigating the problem we have:
* Actually having multiple flows on a socket isn't super common (we
handled this completely wrong before UDP flow table support, and
no-one much noticed)
* Non-default TTL packets on replies from the guest are probably
uncommon
* Wrong TTL packets sent out on different flows probably aren't
particularly damaging in most cases
> > ---
> > v2: - Using ancillary data instead of setsockopt to transfer outgoing
> > TTL.
> > - Support IPv6
> > v3: - Storing ttl per packet instead of per flow. This may not be
> > elegant, but much less intrusive than changing the flow
> > criteria. This eliminates the need for the extra, flow-changing
> > patch we introduced in v2.
> > v4: - Going back to something similar to the original solution, but
> > storing current ttl in struct udp_flow, plus ensuring that all
> > packets in a struct tap4_l4_t/tap6_l4_t instance have the same
> > ttl. After input from David Gibson.
> > v5: - Some minor fixes after feedback from Stefano Brivio.
> > ---
> > packet.h | 2 ++
> > tap.c | 17 +++++++++++++----
> > udp.c | 19 ++++++++++++++++++-
> > udp.h | 3 ++-
> > udp_flow.c | 1 +
> > udp_flow.h | 4 +++-
> > 6 files changed, 39 insertions(+), 7 deletions(-)
> >
> > diff --git a/packet.h b/packet.h
> > index c94780a..e84e123 100644
> > --- a/packet.h
> > +++ b/packet.h
> > @@ -11,6 +11,8 @@
> > /* Maximum size of a single packet stored in pool, including headers */
> > #define PACKET_MAX_LEN ((size_t)UINT16_MAX)
> >
> > +#define DEFAULT_TTL 64
>
> This is still fixed, rather than either probing the sysctl or using
> getsockopt() to determine the initial value. I don't think we want to
> delay this further to change that, but it could be a reasonable
> follow up improvement.
>
> > +
> > /**
> > * struct pool - Generic pool of packets stored in a buffer
> > * @buf: Buffer storing packet descriptors,
> > diff --git a/tap.c b/tap.c
> > index 3a6fcbe..d630f6d 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -559,6 +559,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > * struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
> > * @msgs: Count of messages in sequence
> > * @protocol: Protocol number
> > + * @ttl: Time to live
> > * @source: Source port
> > * @dest: Destination port
> > * @saddr: Source address
> > @@ -567,6 +568,7 @@ PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > */
> > static struct tap4_l4_t {
> > uint8_t protocol;
> > + uint8_t ttl;
> >
> > uint16_t source;
> > uint16_t dest;
> > @@ -586,6 +588,7 @@ static struct tap4_l4_t {
> > * @dest: Destination port
> > * @saddr: Source address
> > * @daddr: Destination address
> > + * @hop_limit: Hop limit
> > * @msg: Array of messages that can be handled in a single call
> > */
> > static struct tap6_l4_t {
> > @@ -598,6 +601,8 @@ static struct tap6_l4_t {
> > struct in6_addr saddr;
> > struct in6_addr daddr;
> >
> > + uint8_t hop_limit;
> > +
> > struct pool_l4_t p;
> > } tap6_l4[TAP_SEQS /* Arbitrary: TAP_MSGS in theory, so limit in users */];
> >
> > @@ -786,7 +791,8 @@ resume:
> > #define L4_MATCH(iph, uh, seq) \
> > ((seq)->protocol == (iph)->protocol && \
> > (seq)->source == (uh)->source && (seq)->dest == (uh)->dest && \
> > - (seq)->saddr.s_addr == (iph)->saddr && (seq)->daddr.s_addr == (iph)->daddr)
> > + (seq)->saddr.s_addr == (iph)->saddr && \
> > + (seq)->daddr.s_addr == (iph)->daddr && (seq)->ttl == (iph)->ttl)
> >
> > #define L4_SET(iph, uh, seq) \
> > do { \
> > @@ -795,6 +801,7 @@ resume:
> > (seq)->dest = (uh)->dest; \
> > (seq)->saddr.s_addr = (iph)->saddr; \
> > (seq)->daddr.s_addr = (iph)->daddr; \
> > + (seq)->ttl = (iph)->ttl; \
> > } while (0)
> >
> > if (seq && L4_MATCH(iph, uh, seq) && seq->p.count < UIO_MAXIOV)
> > @@ -843,7 +850,7 @@ append:
> > for (k = 0; k < p->count; )
> > k += udp_tap_handler(c, PIF_TAP, AF_INET,
> > &seq->saddr, &seq->daddr,
> > - p, k, now);
> > + seq->ttl, p, k, now);
> > }
> > }
> >
> > @@ -966,7 +973,8 @@ resume:
> > (seq)->dest == (uh)->dest && \
> > (seq)->flow_lbl == ip6_get_flow_lbl(ip6h) && \
> > IN6_ARE_ADDR_EQUAL(&(seq)->saddr, saddr) && \
> > - IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr))
> > + IN6_ARE_ADDR_EQUAL(&(seq)->daddr, daddr) && \
> > + (seq)->hop_limit == (ip6h)->hop_limit)
> >
> > #define L4_SET(ip6h, proto, uh, seq) \
> > do { \
> > @@ -976,6 +984,7 @@ resume:
> > (seq)->flow_lbl = ip6_get_flow_lbl(ip6h); \
> > (seq)->saddr = *saddr; \
> > (seq)->daddr = *daddr; \
> > + (seq)->hop_limit = (ip6h)->hop_limit; \
> > } while (0)
> >
> > if (seq && L4_MATCH(ip6h, proto, uh, seq) &&
> > @@ -1026,7 +1035,7 @@ append:
> > for (k = 0; k < p->count; )
> > k += udp_tap_handler(c, PIF_TAP, AF_INET6,
> > &seq->saddr, &seq->daddr,
> > - p, k, now);
> > + seq->hop_limit, p, k, now);
> > }
> > }
> >
> > diff --git a/udp.c b/udp.c
> > index 39431d7..618a4e2 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -849,6 +849,7 @@ fail:
> > * @af: Address family, AF_INET or AF_INET6
> > * @saddr: Source address
> > * @daddr: Destination address
> > + * @ttl: TTL or hop limit for packets to be sent in this call
> > * @p: Pool of UDP packets, with UDP headers
> > * @idx: Index of first packet to process
> > * @now: Current timestamp
> > @@ -859,7 +860,8 @@ fail:
> > */
> > int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > sa_family_t af, const void *saddr, const void *daddr,
> > - const struct pool *p, int idx, const struct timespec *now)
> > + uint8_t ttl, const struct pool *p, int idx,
> > + const struct timespec *now)
> > {
> > const struct flowside *toside;
> > struct mmsghdr mm[UIO_MAXIOV];
> > @@ -938,6 +940,21 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > mm[i].msg_hdr.msg_controllen = 0;
> > mm[i].msg_hdr.msg_flags = 0;
> >
> > + if (ttl != uflow->ttl[tosidx.sidei]) {
> > + uflow->ttl[tosidx.sidei] = ttl;
> > + if (af == AF_INET) {
> > + if (setsockopt(s, IPPROTO_IP, IP_TTL,
> > + &ttl, sizeof(ttl)) < 0)
> > + flow_perror(uflow,
> > + "setsockopt IP_TTL");
> > + } else {
> > + if (setsockopt(s, IPPROTO_IPV6, IPV6_HOPLIMIT,
> > + &ttl, sizeof(ttl)) < 0)
> > + flow_perror(uflow,
> > + "setsockopt IPV6_HOPLIMIT");
> > + }
> > + }
> > +
> > count++;
> > }
> >
> > diff --git a/udp.h b/udp.h
> > index de2df6d..a811475 100644
> > --- a/udp.h
> > +++ b/udp.h
> > @@ -15,7 +15,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> > uint32_t events, const struct timespec *now);
> > int udp_tap_handler(const struct ctx *c, uint8_t pif,
> > sa_family_t af, const void *saddr, const void *daddr,
> > - const struct pool *p, int idx, const struct timespec *now);
> > + uint8_t ttl, const struct pool *p, int idx,
> > + const struct timespec *now);
> > int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > const char *ifname, in_port_t port);
> > int udp_init(struct ctx *c);
> > diff --git a/udp_flow.c b/udp_flow.c
> > index bf4b896..39372c2 100644
> > --- a/udp_flow.c
> > +++ b/udp_flow.c
> > @@ -137,6 +137,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
> > uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
> > uflow->ts = now->tv_sec;
> > uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
> > + uflow->ttl[INISIDE] = uflow->ttl[TGTSIDE] = DEFAULT_TTL;
> >
> > if (s_ini >= 0) {
> > /* When using auto port-scanning the listening port could go
> > diff --git a/udp_flow.h b/udp_flow.h
> > index 9a1b059..520de62 100644
> > --- a/udp_flow.h
> > +++ b/udp_flow.h
> > @@ -8,11 +8,12 @@
> > #define UDP_FLOW_H
> >
> > /**
> > - * struct udp - Descriptor for a flow of UDP packets
> > + * struct udp_flow - Descriptor for a flow of UDP packets
> > * @f: Generic flow information
> > * @closed: Flow is already closed
> > * @ts: Activity timestamp
> > * @s: Socket fd (or -1) for each side of the flow
> > + * @ttl: TTL or hop_limit for both sides
> > */
> > struct udp_flow {
> > /* Must be first element */
> > @@ -21,6 +22,7 @@ struct udp_flow {
> > bool closed :1;
> > time_t ts;
> > int s[SIDES];
> > + uint8_t ttl[SIDES];
> > };
> >
> > struct udp_flow *udp_at_sidx(flow_sidx_t sidx);
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2025-04-04 2:04 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-03 22:27 [PATCH v5] udp: support traceroute Jon Maloy
2025-04-03 23:40 ` David Gibson
2025-04-04 0:57 ` David Gibson [this message]
2025-04-04 11:01 ` Stefano Brivio
2025-04-04 11:11 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z-8uliWIs7YYWCdF@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=jmaloy@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).