From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 9E46C5A0268 for ; Fri, 18 Nov 2022 06:25:03 +0100 (CET) Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4ND4wr33PYz4xZ3; Fri, 18 Nov 2022 16:25:00 +1100 (AEDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1668749100; bh=MK53V7UcL8Du4wbdmEOFBiaHOYzprbk0ovQv8/Fxnzs=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=nCBOHx3ffTeLENhczbVQ9OlGw8JbwcHF5helV25A1bEzOmbVJSI24t/Ssfz6QN42C FWkuGn1lf8w0R52NV6TRME/b5dOcwBGxSRqMy7+Ftx49Lsi53XVbqtGPp51OIKjBkI uUqB8AEyh7eoF0cbd4A6WWQHOZb9k377gY+UQHss= Date: Fri, 18 Nov 2022 16:08:27 +1100 From: David Gibson To: Stefano Brivio Subject: Re: UDP "splicing" fairly broken :( Message-ID: References: <20221117082100.5fb5ee2d@elisabeth> <20221117210235.78db52b4@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="KX5DaZywMWT6vT2N" Content-Disposition: inline In-Reply-To: <20221117210235.78db52b4@elisabeth> Message-ID-Hash: SXRUX3AWP6HUQQVRNJWFNGCR2E5O672D X-Message-ID-Hash: SXRUX3AWP6HUQQVRNJWFNGCR2E5O672D X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.3 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --KX5DaZywMWT6vT2N Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Nov 17, 2022 at 09:02:35PM +0100, Stefano Brivio wrote: > On Thu, 17 Nov 2022 21:30:03 +1100 > David Gibson wrote: >=20 > > On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote: > > > On Thu, 17 Nov 2022 16:07:32 +1100 > > > David Gibson wrote: > > > =20 > > > > In preparation for trying to implement dual stack sockets for UDP, > > > > I've been getting my head around how the UDP splicing works. Alas, > > > > I'm pretty sure that it's broken if there's not a one-to-one > > > > correspondence between init side source ports and ns side destinati= on > > > > ports. That will typically be the case, but given its UDP there's = no > > > > guarantee. =20 > > >=20 > > > I understand the concern below, but I don't understand this part, that > > > is: in which other way is it broken? =20 > >=20 > > Uh.. other than? >=20 > Sorry, I didn't understand it's just broken for the cases below, but > not in case one source port corresponds one destination port (you wrote > it, I misread). Ah, right. > So, well, I disagree on it being fairly broken: it works in the most > common case. This doesn't really matter though: So, I'm less sanguine about this, as you probably realize. Yes, it's not a common case, but the failure mode is nasty. It's not just that it falls back to a slower path, nor even that it just drops packets. At best it's going to lead to very hard to debug failures. At worst, it's a security problem: the second scenario potentially lets an unprivileged host process intercept replies to another user's requests to a server in the ns, or interpose incorrect reponses to that other user's requests. > - it's an obstacle to unify IPv4 and IPv6 sockets >=20 > - it's definitely not pretty -- it was more of a sketch I wanted to > rewrite for a long time >=20 > - it doesn't make a huge difference in packet rates (far from having > the same impact as the TCP spliced connections) >=20 > - it's broken in those (albeit probably uncommon) cases below >=20 > ...let's drop it for now, and add back a saner version later, now that > we know in much better detail how it should work and where the problems > might be. I had a look at this today, and unfortunately it's less straightforward than I first thought. The difficulty is that the splicing isn't just a performance hack, it's also the only way that the host can access servers within the ns bound explicitly to 127.0.0.1 or ::1. I'm not particularly convinced allowing that's a good idea (it's a behavioural difference between passt and pasta for one). But, removing it now would be a big semantic change, and if we do so we should make it consistent for tcp as well. Still looking and thinking through options. > > > > In addition, UDP servers in the ns will not see the correct port > > > > numbers with getpeername(). That's also true of spliced TCP > > > > connections (see https://bugs.passt.top/show_bug.cgi?id=3D39), but = it's > > > > more likely to matter for UDP (I don't know of any TCP protocols th= at > > > > care about source port number on the server side, but there are some > > > > common UDP protocols that have at least port number conventions on > > > > both sides). =20 > > >=20 > > > I can think of DHCP and DNS, for which we offer special handling > > > somehow. Still, if the flow is started by the guest or container, > > > replies should really come with a source port matching the destination > > > port used initially. =20 > >=20 > > I'm not concerned so much about replies coming from a different port > > as a server which expects initial requests from a particular port. > > Still not that likely, but more likely than with TCP. > >=20 > > > For TCP, I don't see this is as an issue at all. =20 > >=20 > > I largely agree. > >=20 > > > > I can expand on the details later, but pasta will do the wrong thing > > > > in at least some circumstances for both a single init side socket > > > > sendto()ing packets to multiple different ports in the ns/guest and > > > > multiple init side sockets send()ing to the same port in the ns/gue= st. > > > >=20 > > > > I think I know how to fix it, but it's not a trivial job. So, the > > > > question is do I embark on this now, or do I just remove UDP > > > > "splicing" entirely for the time being (other than a minimum requir= ed > > > > to make -U work)? That would unblock dual stack UDP sockets and we > > > > can attempt to reoptimize this later. =20 > > >=20 > > > So, I'm not really sure what's broken here, but in any case, UDP =20 > >=20 > > I'll fill in the details below. > >=20 > > > "splicing" doesn't offer as much value as the TCP one does, the > > > difference in packet rate is not that big. I don't see a problem if we > > > want to remove it temporarily. =20 > >=20 > > Ok, good to know. > >=20 > > > The only real concern I have is how easy it would be to add it back > > > after a rework not taking that functionality into account. =20 > >=20 > > I actually think this will fit better with the tap path once I've made > > the dual stack socket changes. > >=20 > > Ok, for the details of the problem. I'm only considering the case > > where the host side initiates the communication. I think there are > > similar cases the other way, but I haven't thought them through. > >=20 > > Scenario 1: one source port, multiple destination ports > >=20 > > Here pasta is running with -u 200 -u 300 > >=20 > > 1. Client on the host opens UDP socket A and binds it to localhost:100 > >=20 > > 2. Client sends datagram 1 on socket A to localhost:200 with sendto() > > 3. Datagram 1 is received by pasta on splice socket B bound to > > localhost:200 > > 4. Because of the -U 200, pasta handles this in >=20 > -u 200 here >=20 > > udp_sock_handler_splice(), ref has splice=3D=3DUDP_TO_NS > > 5, recvmmsg() gets a single datagram, from source localhost:100, so > > src=3D=3D100 > > 6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call > > udp_splice_connect_ns() > > 6.1. udp_splice_connect() creates socket B*, and connects it > > to localhost:200 in the namespace > > 6.2. udp_splice_map[v6][100].ns_conn_sock is populated with > > socket B* > > 7. sendmmsg() forwards the datagram to socket B* > > 8. Datagram 1 correctly reaches port 200 within the ns > >=20 > > 9. Client sends datagram 2 on socket A to localhost:300 with > > sendto() > > 11. Datagram 2 is received by pasta on socket C bound to > > localhost:300 > > 10. Again, pasta handles this in udp_sock_handler_splice() with > > UDP_TO_NS. Again, src=3D=3D100 > > 11. udp_splice_map[v6][100] is populated with socket B* from above > > 12. sendmmsg() forwads datagram 2 to socket B* > > * 13, Datagram 2 is incorrectly delivered to port 200 within the ns, > > instead of port 300 > >=20 > > Scenario 2: multiple source ports, one destination port > >=20 > > Here pasta is running with -u 1000 > >=20 > > 1. Client on the host opens socket A bound to localhost:2000 > > 2. Client on the host opens socket B bound to localhost:3000 > > 2. Client sends datagram 1 from socket A to localhost:1000 with > > sendto() > > 3. Client sends datagram 2 from socket B to localhost:1000 with > > sendto() > > =20 > > 4. Datagram 1 and 2 are both received by pasta on socket C bound to > > localhost:1000, with UDP_TO_NS > > 5. Datagram 1 and 2 happen to both be received by the same > > recvmmsg(), in that order > > 6. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so > > sets src=3D=3D2000 > > 7. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so > > udp_splice_connect_ns() is called > > 7.1 udp_splice_connect creates socket C* and connects it to > > localhost:1000 within the guest, let's say it gets > > ephemeral bound port 50000. It's tagged with > > UDP_BACK_TO_INIT > > 7.2 udp_splice_map[v6][2000].ns_conn_sock is populated with > > socket C* > > 7.3 udp_splice_map[v6][50000].init_bound_sock is populated with > > socket C > > 7.4 udp_splice_map[v6][50000].init_dst_port is populated with 20= 00 > > 8. sendmmsg() forwads datagrams 1 & 2 to socket C* > > 9. Datagrams 1 & 2 correctly delivered to port 1000 in the namespace > >=20 > > 10. Server within the namespace receives datagram 1 with > > recvfrom(). From address is localhost:50000 (socket C*) > > 11. Server sends reply datagram 1* to localhost:50000 within the ns > > 12. Server receives datagram 2 with recvfrom(). From address is > > again localhost:50000 (socket C*) > > 13. Server sends reply datagram 2* to localhost:50000 within the ns > >=20 > > 14. pasta receives datagrams 1* and 2* on socket C*. > > UDP_BACK_TO_INIT and dst=3D=3D50000 from the epoll ref > >=20 > > 15. udp_sock_handler_splice() sets s to socket C from > > udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000 > > from udp_splice_map[v6][50000] > > 16. sendmmsg() forwards datagram 1* on socket C to localhost:2000 > > 17. Datagram 1* correctly received by socket A on localhost:2000 > > 18. sendmmsg() forwards datagram 2* on socket C to localhost:2000 > > * 19. Datagram 2* incorrectly received by socket A on localhost:2000 > > instead of socket B on localhost:3000 >=20 > Thanks a lot for the details, issues are clear to me now. >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --KX5DaZywMWT6vT2N Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEoULxWu4/Ws0dB+XtgypY4gEwYSIFAmN3E0UACgkQgypY4gEw YSIFjg/9F9xcI0e6Y+My965X51a1UzfgCUrCqDJmrmNdyWtsyi51udEmcPMy8OAN hjEGhJZllO97T1NnqYjfALgodEi6K00rLZI0RL2pdLUP0KKgsxT+Su/sy22NM1bg 1Otp4frVQMIXUBNJwKTTw0sgzwJrRoup5vgqQq7nbUGTNzY2Dc/InzAP4KaorgR+ 9aGz/Zt5EFOtp0G0qkASXDDqdspctOsEBJwp+0iuA7O1ct30JOwrfegWc2aayoeu 1mRo0xNyNJPrMJiYJ81sTe78TLjC9QGsl5qhTc2stpegZLc3LQ9MV5GbDSHjponX j4uhwifgM1D1iTbF60lJlZ5q5lV2hBpnjkIaAavBfWIrk+ozfZ5vQls6QxvQvNle ZTNO0pQVUhyo1MbqQ5cbgGDjLxr8oc5/mLhSj0agXAT4PuOkHQVGoBncL6xgifZh Qwtgs3gQcFhczMrft+47d6AXPPAbcBclwjsnUhk7w/H/gbPAIpb3U88rjY6zlDtZ LeFGtlg3ByI0SfvsJqq3hHa6T3BzDRW3NwSJ2g2drDYo+eKYzf4R+aCZs0RweECs e71CDMk8Ky7CHPQiJhjF8E/Lb2+zH0QeiYoOzCAvv6uBmJHf9cfzLqlbE2pu1jDW j7xyvyur+z79ck9FhP7sWPQDl09hyoP/M9Czr2fNyKKHOKaC5SY= =9drr -----END PGP SIGNATURE----- --KX5DaZywMWT6vT2N--