From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=TjwAnzdV; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id A73165A0623 for ; Thu, 13 Feb 2025 00:22:32 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202502; t=1739402544; bh=MceZydzlXfeL1V3OQuqwFCdMBNIQn8oEbdyEx/NMbkg=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=TjwAnzdV7H/PNkUQ2cZBhkl8VsyHkMWfqbbkXKdcMcD1KmT+2eFPEUN60puEaJrxg I+UUWEcuHvoDuxLZHhX6a9KGVkrocMeSTmZtcQbG38rr3oC78hBrHmEVbMDbk9ASRB Cty5ZdZMwwqYikj9BhdUL/vpmCrNjJetuBOhAPioVUI85jnMj1DwFNM640TVwyiZU0 ddZn1UhVJZHiVovUd/dqN4Ea72e6+AAg0SMj9H29rxV4EfiYfCPzP697aJbQIeyc08 l/uB1Pn+3DC5ueWxVZpeN1FLqqfJcXWZcYedCuF6a/SXPVUs5lizX2IlzD5PO3t3Cf ZVwmDLubUwgqQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4YtZ8w3k5fz4wj2; Thu, 13 Feb 2025 10:22:24 +1100 (AEDT) Date: Thu, 13 Feb 2025 10:21:41 +1100 From: David Gibson To: Jon Maloy Subject: Re: [PATCH] udp: create and send ICMPv4 to local peer when applicable Message-ID: References: <20250209170056.1160547-1-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="JBd5JIw2i4wQ4uxL" Content-Disposition: inline In-Reply-To: Message-ID-Hash: Q5XEK4LUHDBKSDCSJ5MTLE5PK3U3DKVT X-Message-ID-Hash: Q5XEK4LUHDBKSDCSJ5MTLE5PK3U3DKVT X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --JBd5JIw2i4wQ4uxL Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Feb 12, 2025 at 04:39:07PM -0500, Jon Maloy wrote: > On 2025-02-10 19:55, David Gibson wrote: > > On Sun, Feb 09, 2025 at 12:00:56PM -0500, Jon Maloy wrote: > > > When a local peer sends a UDP message to a non-existing port on an > > > existing remote host, that host will return an ICMP message containing > > > the error code ICMP_PORT_UNREACH, plus the header and the first eight > > > bytes of the original message. If the sender socket has been connecte= d, > > > it uses this message to issue a "Connection Refused" event to the use= r. >=20 > [...] >=20 > > > +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, > > > + struct in_addr dst, size_t l4len, uint8_t proto) > >=20 > > As discussed on our call, it probably makes sense to alter this to set > > the Don't Fragment bit always. >=20 > I have been studying the Linux code, but this is handled in so many places > that it it hard to get a full grip on it. Wireshark logs seem > to indicate that it is always set, and googling a bit on the subject > indicates the same. I think it is safe to do it, so if you agree I > can post a prepending patch setting this. I'm not 100% confident 100% pedantically correct. I'm pretty confident it won't cause problems though. >=20 > >=20 > > > { > > > uint16_t l3len =3D l4len + sizeof(*ip4h); > > > diff --git a/tap.h b/tap.h > > > index dfbd8b9..3ba00c1 100644 > > > --- a/tap.h > > > +++ b/tap.h > > > @@ -44,6 +44,8 @@ static inline void tap_hdr_update(struct tap_hdr *t= hdr, size_t l2len) > > > thdr->vnet_len =3D htonl(l2len); > > > } >=20 > [...] > > > + * udp_send_conn_fail_icmp4() - Construct and send ICMP to local peer > > > + * @c: Execution context > > > + * @ee: Extended error descriptor > > > + * @ref: epoll reference > > > + * @iov: First bytes (max 8) of original UDP message body > >=20 > > I don't love passing this as an iov, because that tends to imply there > > might be multiple buffers, which is not the case here (both the caller > > and callee require a single buffer). >=20 > The real reason is that csum_udp4() requires an iov, so if we don't > pass it here we will have to recreate it inside the new function. Yeah, I guess so. Ok, I suppose it makes more sense than not here. > > > + */ > > > +static void udp_send_conn_fail_icmp4(const struct ctx *c, > > > + const struct sock_extended_err *ee, > > > + union epoll_ref ref, > > > + struct iovec *iov) > >=20 > >=20 > >=20 > > > +{ > > > + flow_sidx_t tosidx =3D flow_sidx_opposite(ref.flowside); > >=20 > > This is subtly, but badly wrong. ref.flowside is only valid if the > > ref is of a type which uses the flowside field. That's true for UDP > > reply sockets, but *not* for UDP listening sockets, and > > udp_sock_errs() is called on both. I think this function needs to > > take an explicit flowside. >=20 > Yes, I was a little uncertain about this. Can you give me small code > snippet of how this should be done? Small code for what example? For finding the flow associated with a listening socket, look at udp_flow_from_sock(), specifically the part before flow_alloc(). I think we'll need to either duplicate or split that out into a helper to use for the error path. I'm pretty sure the flow lookup needs to be done in the caller, not here, though, so you'll want to pass it. > > > + const struct flowside *toside =3D flowside_at_sidx(tosidx); > > > + struct in_addr oaddr =3D toside->oaddr.v4mapped.a4; > > > + struct in_addr eaddr =3D toside->eaddr.v4mapped.a4; > > > + struct iov_tail data =3D IOV_TAIL(iov, 1, 0); > > > + struct { > > > + struct icmphdr icmp4h; > > > + struct iphdr ip4h; > > > + struct udphdr uh; > > > + char data[8]; > > > + } msg; > >=20 > > Needs a ((packed)) attribute to ensure we don't get compiler padding. > >=20 > ok. > > It took me a minute to work out what was going on here. Specifically > > that ip4h and uh here aren't the headers of the packet we're sending > > now, but the reconstructed headers of the packet that prompted the > > ICMP error. >=20 > Yes, I try to re-create the original message as closely as possible. Right, and correctly so, but a comment might make it easier to make that leap for someone new looking at it. > > > + size_t udplen =3D sizeof(msg.uh) + iov->iov_len; > > > + size_t msglen =3D sizeof(msg.icmp4h) + sizeof(msg.ip4h) + udplen; > > > + > > > + memset(&msg, 0, sizeof(msg)); > > > + msg.icmp4h.type =3D ee->ee_type; > > > + msg.icmp4h.code =3D ee->ee_code; > > > + tap_push_ip4h(&msg.ip4h, eaddr, oaddr, udplen, IPPROTO_UDP); > >=20 > > This isn't quite a natural fit. It does work, but the tap_push_() > > functions are kind of designed to work with a byte buffer where we > > only work out the positions of headers as we construct them. Probably > > a clean up for some other day. > >=20 > > > + msg.uh.source =3D htons(toside->eport); > > > + msg.uh.dest =3D htons(toside->oport); > > > + msg.uh.len =3D htons(udplen); > > > + csum_udp4(&msg.uh, oaddr, eaddr, &data); > >=20 > > It might make sense to split a tap_push_uh4() function out from > > tap_udp4_send() which you could re-use here. >=20 > Good point. > >=20 > > > + memcpy(&msg.data, iov->iov_base, iov->iov_len); > > > + tap_icmp4_send(c, oaddr, eaddr, &msg, msglen); > > > +} > > > + > > > /** > > > * udp_sock_recverr() - Receive and clear an error from a socket > > > - * @s: Socket to receive from > > > + * @c: Execution context > > > + * @ref: epoll reference > > > * > > > * Return: 1 if error received and processed, 0 if no more errors i= n queue, < 0 > > > * if there was an error reading the queue > > > * > > > * #syscalls recvmsg > > > */ > > > -static int udp_sock_recverr(int s) > > > +static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref) > >=20 > > Similarly udp_sock_recverr() is used both on "listening" and reply > > sockets, so it's not really safe to use the ref here. You can > > explicitly pass the fd easily enough. The flow is trickier... >=20 > I can pass *both* the fd and ref, alternativedly the flowside, but > it looks redundant and adds extra code. > If I add a flowside struct instead of ref I will need to declare > that at the all the multiple locations where udp_sock_errs(). > Do you have a good suggestion? Sorry, I didn't think that through all the way. You will want to pass the ref here, but you'll need to examine the ref type to work out how to determine the flow. For EPOLL_TYPE_UDP_REPLY you can just look at ref.flowside. However for EPOLL_TYPE_UDP_LISTEN you'll need to read the error first, then look up the flow like udp_flow_from_sock() (but not allowing new flows to be created). > > > { > > > const struct sock_extended_err *ee; > > > const struct cmsghdr *hdr; > > > char buf[CMSG_SPACE(sizeof(*ee))]; > > > + char udp_data[8]; > > > + int s =3D ref.fd; > > > + struct iovec iov =3D { > > > + .iov_base =3D udp_data, > > > + .iov_len =3D sizeof(udp_data) >=20 > [...] >=20 > > > if (rc < 0) > > > @@ -558,7 +607,7 @@ static void udp_buf_listen_sock_handler(const str= uct ctx *c, > > > const socklen_t sasize =3D sizeof(udp_meta[0].s_in); > > > int n, i; > > > - if (udp_sock_errs(c, ref.fd, events) < 0) { > > > + if (udp_sock_errs(c, ref, events) < 0) { > >=20 > > ... because in the listening socket case we don't know the flow until > > we've looked at the packet. I'd be tempted as a first cut to only do > > the ICMP thing for errors on reply sockets. >=20 > I can try that, but it would feel a little reduced. It would be, but there's nothing wrong with doing one thing at a time. > > Listening sockets will be trickier. I think we'll have to actually > > check msg_name in the received error in order to work out which flow > > it belongs to. >=20 > Actually, I did that first, until I realized I could extract the source > IP address from the flow object. But how does that help? That only works if you already know the right flow. You do for EPOLL_TYPE_UDP_REPLY, but "listening" sockets (EPOLL_TYPE_UDP_LISTEN) handle multiple flows each so it's impossible to know the flow from the ref alone. You'll have to examine msg_name to figure it out. --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --JBd5JIw2i4wQ4uxL Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmetLO0ACgkQzQJF27ox 2GeXcw/8DDzSxBDVRvCwIP32VqctRBbv4wGdzTehQjfLcdBPuT0Xyv2q7XimG2n5 zD32q9AuLddv1EDPKkEAiYKXfRiGm3SA6WYBhUo7Kb3d/5YlSWgMI7lLXhPSGHs3 iIC/c+NTF5aRqu9CvGSyt4LJjdh+bakKmR5IvNrUZHC4X4tRHUpRrpesM4l/9vbW fl34aKpNhd2mQkeWezxIwlDzDRpNHK86TMt/LBuDByZ1nMfzGSN/MRrL4MwCVBo2 Soo9AzFJ9tOsBMvrHXBoSlY6/YVoYdYSIsPVIg9CQCMD6z/J/yC4Ct3Ivhz8kB5/ dpKk6TWeR2ITvU7iAm9rLNrWTZ37Us1MuLkiX66PRv84IArxpuCSgWmxFzTmaJSG vLDRIA55QFmDOM7J1TCPOhGrcNzpfqQ8B7ntY/09KgQrEHSQwcn7+H6o93/N/bX2 HXONZvseftwDqomZ3L9WdaGOCYv7Vuad/jaQTp55w3dzUfTNm3wr/L3j+TTzKUqj 6KACUo2og4lU17veZFsgUJC7Mnrn9BeBAndRjK0vlpAaOIumPrWCoiPPUYqCmN+E l2/ib79zAOKGHYmhzEc4/XySo75XR+C2Wn28+UZ51jNzFSnVLidaigbpCHIejv7U JxLD4uNxy1TRklh1JNnL2TWeKgiIesOB7fg4ofxFM8vsUpdVILc= =194/ -----END PGP SIGNATURE----- --JBd5JIw2i4wQ4uxL--