From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=iIvT33q0; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 989855A0272 for ; Tue, 11 Feb 2025 01:56:50 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202502; t=1739235402; bh=h+uNNo0AeoXbxx4lOeQXVvojUQyD7XMos9k1uJ9NsRg=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=iIvT33q0W9YBbAxbtAeLLAyalweGqzu1jvWcAXEJTyak00J+xY+4vE791jnHD9fcL ixfrOjvPKGQC6WA3vLUI5nFulQ35YF4q5bP32vw6GjdDrUZPwc8Xnq9CD3xi3Nu9DO 6ZtRKFS0L80NJOWI/Mc0MzO5srwPe3+PLr8r9sDHKWw0SCqBAWNZWTurxr6OwzWxDc CIzC8TBuN7EcwxNgWP4ClDqiK4nKIwYJf/Qeqm3MWYy3EgL6jjjnE+SSq1alDbl4NC nV9EdDbvK2aI1EYJPxtk7+iZhSpRv2l0+r0Cw4BKjqvxtL/QWdiaYwCLs7XDCrgM8A fkDBk47AfyFJw== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4YsNLf43Ldz4wcr; Tue, 11 Feb 2025 11:56:42 +1100 (AEDT) Date: Tue, 11 Feb 2025 11:55:00 +1100 From: David Gibson To: Jon Maloy Subject: Re: [PATCH] udp: create and send ICMPv4 to local peer when applicable Message-ID: References: <20250209170056.1160547-1-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="EqbDUl0hc4C4VDFw" Content-Disposition: inline In-Reply-To: <20250209170056.1160547-1-jmaloy@redhat.com> Message-ID-Hash: GHKGDRY4VSIG2OJ236XZJKV5TUTMEE5Q X-Message-ID-Hash: GHKGDRY4VSIG2OJ236XZJKV5TUTMEE5Q X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --EqbDUl0hc4C4VDFw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Feb 09, 2025 at 12:00:56PM -0500, Jon Maloy wrote: > When a local peer sends a UDP message to a non-existing port on an > existing remote host, that host will return an ICMP message containing > the error code ICMP_PORT_UNREACH, plus the header and the first eight > bytes of the original message. If the sender socket has been connected, > it uses this message to issue a "Connection Refused" event to the user. >=20 > Until now, we have only read such events from the externally facing > socket, but we don't forward them back to the local sender because > we cannot read the ICMP message directly to user space. Because of > this, the local peer will hang and wait for a response that never > arrives. >=20 > We now fix this for IPv4 by recreating and forwarding a correct ICMP > message back to the internal sender. We synthesize the message based > on the information in the extended error structure, plus the returned > part of the original message body. >=20 > Note that for the sake of completeness, we even produce ICMP messages > for other error codes. We have noticed that at least ICMP_PROT_UNREACH > is propagated as an error event back to the user. >=20 > Signed-off-by: Jon Maloy > --- > tap.c | 4 +-- > tap.h | 2 ++ > udp.c | 71 ++++++++++++++++++++++++++++++++++++++++++-------- > udp_internal.h | 2 +- > udp_vu.c | 4 +-- > 5 files changed, 67 insertions(+), 16 deletions(-) >=20 > diff --git a/tap.c b/tap.c > index cd32a90..f83aac5 100644 > --- a/tap.c > +++ b/tap.c > @@ -142,8 +142,8 @@ static void *tap_push_l2h(const struct ctx *c, void *= buf, uint16_t proto) > * > * Return: pointer at which to write the packet's payload > */ > -static void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, > - struct in_addr dst, size_t l4len, uint8_t proto) > +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, > + struct in_addr dst, size_t l4len, uint8_t proto) As discussed on our call, it probably makes sense to alter this to set the Don't Fragment bit always. > { > uint16_t l3len =3D l4len + sizeof(*ip4h); > =20 > diff --git a/tap.h b/tap.h > index dfbd8b9..3ba00c1 100644 > --- a/tap.h > +++ b/tap.h > @@ -44,6 +44,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr,= size_t l2len) > thdr->vnet_len =3D htonl(l2len); > } > =20 > +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, > + struct in_addr dst, size_t l4len, uint8_t proto); > void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sp= ort, > struct in_addr dst, in_port_t dport, > const void *in, size_t dlen); > diff --git a/udp.c b/udp.c > index 923cc38..4b800b6 100644 > --- a/udp.c > +++ b/udp.c > @@ -87,6 +87,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -402,25 +403,70 @@ static void udp_tap_prepare(const struct mmsghdr *m= mh, > (*tap_iov)[UDP_IOV_PAYLOAD].iov_len =3D l4len; > } > =20 > +/** > + * udp_send_conn_fail_icmp4() - Construct and send ICMP to local peer > + * @c: Execution context > + * @ee: Extended error descriptor > + * @ref: epoll reference > + * @iov: First bytes (max 8) of original UDP message body I don't love passing this as an iov, because that tends to imply there might be multiple buffers, which is not the case here (both the caller and callee require a single buffer). > + */ > +static void udp_send_conn_fail_icmp4(const struct ctx *c, > + const struct sock_extended_err *ee, > + union epoll_ref ref, > + struct iovec *iov) > +{ > + flow_sidx_t tosidx =3D flow_sidx_opposite(ref.flowside); This is subtly, but badly wrong. ref.flowside is only valid if the ref is of a type which uses the flowside field. That's true for UDP reply sockets, but *not* for UDP listening sockets, and udp_sock_errs() is called on both. I think this function needs to take an explicit flowside. > + const struct flowside *toside =3D flowside_at_sidx(tosidx); > + struct in_addr oaddr =3D toside->oaddr.v4mapped.a4; > + struct in_addr eaddr =3D toside->eaddr.v4mapped.a4; > + struct iov_tail data =3D IOV_TAIL(iov, 1, 0); > + struct { > + struct icmphdr icmp4h; > + struct iphdr ip4h; > + struct udphdr uh; > + char data[8]; > + } msg; Needs a ((packed)) attribute to ensure we don't get compiler padding. It took me a minute to work out what was going on here. Specifically that ip4h and uh here aren't the headers of the packet we're sending now, but the reconstructed headers of the packet that prompted the ICMP error. > + size_t udplen =3D sizeof(msg.uh) + iov->iov_len; > + size_t msglen =3D sizeof(msg.icmp4h) + sizeof(msg.ip4h) + udplen; > + > + memset(&msg, 0, sizeof(msg)); > + msg.icmp4h.type =3D ee->ee_type; > + msg.icmp4h.code =3D ee->ee_code; > + tap_push_ip4h(&msg.ip4h, eaddr, oaddr, udplen, IPPROTO_UDP); This isn't quite a natural fit. It does work, but the tap_push_() functions are kind of designed to work with a byte buffer where we only work out the positions of headers as we construct them. Probably a clean up for some other day. > + msg.uh.source =3D htons(toside->eport); > + msg.uh.dest =3D htons(toside->oport); > + msg.uh.len =3D htons(udplen); > + csum_udp4(&msg.uh, oaddr, eaddr, &data); It might make sense to split a tap_push_uh4() function out from tap_udp4_send() which you could re-use here. > + memcpy(&msg.data, iov->iov_base, iov->iov_len); > + tap_icmp4_send(c, oaddr, eaddr, &msg, msglen); > +} > + > /** > * udp_sock_recverr() - Receive and clear an error from a socket > - * @s: Socket to receive from > + * @c: Execution context > + * @ref: epoll reference > * > * Return: 1 if error received and processed, 0 if no more errors in que= ue, < 0 > * if there was an error reading the queue > * > * #syscalls recvmsg > */ > -static int udp_sock_recverr(int s) > +static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref) Similarly udp_sock_recverr() is used both on "listening" and reply sockets, so it's not really safe to use the ref here. You can explicitly pass the fd easily enough. The flow is trickier... > { > const struct sock_extended_err *ee; > const struct cmsghdr *hdr; > char buf[CMSG_SPACE(sizeof(*ee))]; > + char udp_data[8]; > + int s =3D ref.fd; > + struct iovec iov =3D { > + .iov_base =3D udp_data, > + .iov_len =3D sizeof(udp_data) > + }; > struct msghdr mh =3D { > .msg_name =3D NULL, > .msg_namelen =3D 0, > - .msg_iov =3D NULL, > - .msg_iovlen =3D 0, > + .msg_iov =3D &iov, > + .msg_iovlen =3D 1, > .msg_control =3D buf, > .msg_controllen =3D sizeof(buf), > }; > @@ -450,8 +496,10 @@ static int udp_sock_recverr(int s) > } > =20 > ee =3D (const struct sock_extended_err *)CMSG_DATA(hdr); > - > - /* TODO: When possible propagate and otherwise handle errors */ > + if (ee->ee_type =3D=3D ICMP_DEST_UNREACH) { > + iov.iov_len =3D rc; > + udp_send_conn_fail_icmp4(c, ee, ref, &iov); > + } > debug("%s error on UDP socket %i: %s", > str_ee_origin(ee), s, strerror_(ee->ee_errno)); > =20 > @@ -461,15 +509,16 @@ static int udp_sock_recverr(int s) > /** > * udp_sock_errs() - Process errors on a socket > * @c: Execution context > - * @s: Socket to receive from > + * @ref: epoll reference > * @events: epoll events bitmap > * > * Return: Number of errors handled, or < 0 if we have an unrecoverable = error > */ > -int udp_sock_errs(const struct ctx *c, int s, uint32_t events) > +int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t eve= nts) > { > unsigned n_err =3D 0; > socklen_t errlen; > + int s =3D ref.fd; > int rc, err; > =20 > ASSERT(!c->no_udp); > @@ -478,7 +527,7 @@ int udp_sock_errs(const struct ctx *c, int s, uint32_= t events) > return 0; /* Nothing to do */ > =20 > /* Empty the error queue */ > - while ((rc =3D udp_sock_recverr(s)) > 0) > + while ((rc =3D udp_sock_recverr(c, ref)) > 0) > n_err +=3D rc; > =20 > if (rc < 0) > @@ -558,7 +607,7 @@ static void udp_buf_listen_sock_handler(const struct = ctx *c, > const socklen_t sasize =3D sizeof(udp_meta[0].s_in); > int n, i; > =20 > - if (udp_sock_errs(c, ref.fd, events) < 0) { > + if (udp_sock_errs(c, ref, events) < 0) { =2E.. because in the listening socket case we don't know the flow until we've looked at the packet. I'd be tempted as a first cut to only do the ICMP thing for errors on reply sockets. Listening sockets will be trickier. I think we'll have to actually check msg_name in the received error in order to work out which flow it belongs to. > err("UDP: Unrecoverable error on listening socket:" > " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port); > /* FIXME: what now? close/re-open socket? */ > @@ -661,7 +710,7 @@ static void udp_buf_reply_sock_handler(const struct c= tx *c, union epoll_ref ref, > =20 > from_s =3D uflow->s[ref.flowside.sidei]; > =20 > - if (udp_sock_errs(c, from_s, events) < 0) { > + if (udp_sock_errs(c, ref, events) < 0) { > flow_err(uflow, "Unrecoverable error on reply socket"); > flow_err_details(uflow); > udp_flow_close(c, uflow); > diff --git a/udp_internal.h b/udp_internal.h > index cc80e30..3b081f5 100644 > --- a/udp_internal.h > +++ b/udp_internal.h > @@ -30,5 +30,5 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_p= ayload_t *bp, > size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, > const struct flowside *toside, size_t dlen, > bool no_udp_csum); > -int udp_sock_errs(const struct ctx *c, int s, uint32_t events); > +int udp_sock_errs(const struct ctx *c, union epoll_ref ref, uint32_t eve= nts); > #endif /* UDP_INTERNAL_H */ > diff --git a/udp_vu.c b/udp_vu.c > index 4123510..c26a223 100644 > --- a/udp_vu.c > +++ b/udp_vu.c > @@ -227,7 +227,7 @@ void udp_vu_listen_sock_handler(const struct ctx *c, = union epoll_ref ref, > struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > int i; > =20 > - if (udp_sock_errs(c, ref.fd, events) < 0) { > + if (udp_sock_errs(c, ref, events) < 0) { > err("UDP: Unrecoverable error on listening socket:" > " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port); > return; > @@ -302,7 +302,7 @@ void udp_vu_reply_sock_handler(const struct ctx *c, u= nion epoll_ref ref, > =20 > ASSERT(!c->no_udp); > =20 > - if (udp_sock_errs(c, from_s, events) < 0) { > + if (udp_sock_errs(c, ref, events) < 0) { > flow_err(uflow, "Unrecoverable error on reply socket"); > flow_err_details(uflow); > udp_flow_close(c, uflow); --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --EqbDUl0hc4C4VDFw Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmeqn9oACgkQzQJF27ox 2Gdh4xAAkz8sz8RMUP8aCErjnmzcEN5Sne5axFms4oKq16u986P/vR3VW+8xzwFp RvzeRhdxGLKNIMa21DQ1fxwQDqmYITQpVvpjfpa2VPB5thsO7A6UcrJrHtZ9ha7s zHULPGfkp3Pkn+mbit4brXZEOiEhyc1MciCp7aHNKIHM05kLX6WhEavqN4aYEIxx NRF87tNIyQbUIB8xULNkhcSwrwwAS8xmKG3WMpusVU5OhhWj4e+H1FtV2YX97YF/ R7E/lSlahksmT7Zs6Lp9pBSiu1jjIG80mfAvM2E+M1OgUnVc61r8ZEmDKjXCubqM 2aukeuReKo6EBz2XqxRjHLhRGUibIAeQ7N4jxW5bom4OGmMUmYOl+XE0A0XkoMiI 65z31WlLfekfv6zeV1tDufQA6qZ5emuSG8m+7HbXn6jp2O9KJqDEF1nEQLhrrjJW ihyr7U8WjgY/kicfYwL2iJOFKcDmx65devHMG5KOYR5+oyHxcltMlDcEUWWP7U/H t94P7Z5bScNyajcEyuW0FrZpYlxuil0WP/yTmdGh8E04nMRdCHkjoaJAPys+067x L3hc/i2M/jW9P4B8bmNRVX791e47j1Cu7rXwd8S1uqOXQcvJ2AthZZGpxXiBJGCp irws+R3CioPGRZiwURnB/cgWzbyjkK3B7ce48/Lmfwa8u9rXytE= =+Lzc -----END PGP SIGNATURE----- --EqbDUl0hc4C4VDFw--