public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: UDP "splicing" fairly broken :(
Date: Fri, 18 Nov 2022 16:08:27 +1100	[thread overview]
Message-ID: <Y3cTS3eoL2AftUdN@yekko> (raw)
In-Reply-To: <20221117210235.78db52b4@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 9776 bytes --]

On Thu, Nov 17, 2022 at 09:02:35PM +0100, Stefano Brivio wrote:
> On Thu, 17 Nov 2022 21:30:03 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote:
> > > On Thu, 17 Nov 2022 16:07:32 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > In preparation for trying to implement dual stack sockets for UDP,
> > > > I've been getting my head around how the UDP splicing works.  Alas,
> > > > I'm pretty sure that it's broken if there's not a one-to-one
> > > > correspondence between init side source ports and ns side destination
> > > > ports.  That will typically be the case, but given its UDP there's no
> > > > guarantee.  
> > > 
> > > I understand the concern below, but I don't understand this part, that
> > > is: in which other way is it broken?  
> > 
> > Uh.. other than?
> 
> Sorry, I didn't understand it's just broken for the cases below, but
> not in case one source port corresponds one destination port (you wrote
> it, I misread).

Ah, right.

> So, well, I disagree on it being fairly broken: it works in the most
> common case. This doesn't really matter though:

So, I'm less sanguine about this, as you probably realize.  Yes, it's
not a common case, but the failure mode is nasty.  It's not just that
it falls back to a slower path, nor even that it just drops packets.
At best it's going to lead to very hard to debug failures.  At worst,
it's a security problem: the second scenario potentially lets an
unprivileged host process intercept replies to another user's requests
to a server in the ns, or interpose incorrect reponses to that other
user's requests.

> - it's an obstacle to unify IPv4 and IPv6 sockets
> 
> - it's definitely not pretty -- it was more of a sketch I wanted to
>   rewrite for a long time
> 
> - it doesn't make a huge difference in packet rates (far from having
>   the same impact as the TCP spliced connections)
> 
> - it's broken in those (albeit probably uncommon) cases below
> 
> ...let's drop it for now, and add back a saner version later, now that
> we know in much better detail how it should work and where the problems
> might be.

I had a look at this today, and unfortunately it's less
straightforward than I first thought.  The difficulty is that the
splicing isn't just a performance hack, it's also the only way that
the host can access servers within the ns bound explicitly to
127.0.0.1 or ::1.  I'm not particularly convinced allowing that's a
good idea (it's a behavioural difference between passt and pasta for
one).  But, removing it now would be a big semantic change, and if we
do so we should make it consistent for tcp as well.

Still looking and thinking through options.

> > > > In addition, UDP servers in the ns will not see the correct port
> > > > numbers with getpeername().  That's also true of spliced TCP
> > > > connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's
> > > > more likely to matter for UDP (I don't know of any TCP protocols that
> > > > care about source port number on the server side, but there are some
> > > > common UDP protocols that have at least port number conventions on
> > > > both sides).  
> > > 
> > > I can think of DHCP and DNS, for which we offer special handling
> > > somehow. Still, if the flow is started by the guest or container,
> > > replies should really come with a source port matching the destination
> > > port used initially.  
> > 
> > I'm not concerned so much about replies coming from a different port
> > as a server which expects initial requests from a particular port.
> > Still not that likely, but more likely than with TCP.
> > 
> > > For TCP, I don't see this is as an issue at all.  
> > 
> > I largely agree.
> > 
> > > > I can expand on the details later, but pasta will do the wrong thing
> > > > in at least some circumstances for both a single init side socket
> > > > sendto()ing packets to multiple different ports in the ns/guest and
> > > > multiple init side sockets send()ing to the same port in the ns/guest.
> > > > 
> > > > I think I know how to fix it, but it's not a trivial job.  So, the
> > > > question is do I embark on this now, or do I just remove UDP
> > > > "splicing" entirely for the time being (other than a minimum required
> > > > to make -U work)?  That would unblock dual stack UDP sockets and we
> > > > can attempt to reoptimize this later.  
> > > 
> > > So, I'm not really sure what's broken here, but in any case, UDP  
> > 
> > I'll fill in the details below.
> > 
> > > "splicing" doesn't offer as much value as the TCP one does, the
> > > difference in packet rate is not that big. I don't see a problem if we
> > > want to remove it temporarily.  
> > 
> > Ok, good to know.
> > 
> > > The only real concern I have is how easy it would be to add it back
> > > after a rework not taking that functionality into account.  
> > 
> > I actually think this will fit better with the tap path once I've made
> > the dual stack socket changes.
> > 
> > Ok, for the details of the problem.  I'm only considering the case
> > where the host side initiates the communication.  I think there are
> > similar cases the other way, but I haven't thought them through.
> > 
> > Scenario 1: one source port, multiple destination ports
> > 
> > Here pasta is running with -u 200 -u 300
> > 
> >   1. Client on the host opens UDP socket A and binds it to localhost:100
> > 
> >   2. Client sends datagram 1 on socket A to localhost:200 with sendto()
> >   3. Datagram 1 is received by pasta on splice socket B bound to
> >      localhost:200
> >   4. Because of the -U 200, pasta handles this in
> 
> -u 200 here
> 
> >      udp_sock_handler_splice(), ref has splice==UDP_TO_NS
> >   5, recvmmsg() gets a single datagram, from source localhost:100, so
> >      src==100
> >   6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call
> >      udp_splice_connect_ns()
> >       6.1. udp_splice_connect() creates socket B*, and connects it
> >            to localhost:200 in the namespace
> >       6.2. udp_splice_map[v6][100].ns_conn_sock is populated with
> >            socket B*
> >   7. sendmmsg() forwards the datagram to socket B*
> >   8. Datagram 1 correctly reaches port 200 within the ns
> > 
> >   9. Client sends datagram 2 on socket A to localhost:300 with
> >      sendto()
> >   11. Datagram 2 is received by pasta on socket C bound to
> >       localhost:300
> >   10. Again, pasta handles this in udp_sock_handler_splice() with
> >       UDP_TO_NS.  Again, src==100
> >   11. udp_splice_map[v6][100] is populated with socket B* from above
> >   12. sendmmsg() forwads datagram 2 to socket B*
> > * 13, Datagram 2 is incorrectly delivered to port 200 within the ns,
> >       instead of port 300
> > 
> > Scenario 2: multiple source ports, one destination port
> > 
> > Here pasta is running with -u 1000
> > 
> >   1. Client on the host opens socket A bound to localhost:2000
> >   2. Client on the host opens socket B bound to localhost:3000
> >   2. Client sends datagram 1 from socket A to localhost:1000 with
> >      sendto()
> >   3. Client sends datagram 2 from socket B to localhost:1000 with
> >      sendto()
> >   
> >   4. Datagram 1 and 2 are both received by pasta on socket C bound to
> >      localhost:1000, with UDP_TO_NS
> >   5. Datagram 1 and 2 happen to both be received by the same
> >      recvmmsg(), in that order
> >   6. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so
> >      sets src==2000
> >   7. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so
> >      udp_splice_connect_ns() is called
> >        7.1 udp_splice_connect creates socket C* and connects it to
> >            localhost:1000 within the guest, let's say it gets
> > 	   ephemeral bound port 50000.  It's tagged with
> > 	   UDP_BACK_TO_INIT
> >        7.2 udp_splice_map[v6][2000].ns_conn_sock is populated with
> >            socket C*
> >        7.3 udp_splice_map[v6][50000].init_bound_sock is populated with
> >            socket C
> >        7.4 udp_splice_map[v6][50000].init_dst_port is populated with 2000
> >   8. sendmmsg() forwads datagrams 1 & 2 to socket C*
> >   9. Datagrams 1 & 2 correctly delivered to port 1000 in the namespace
> > 
> >   10. Server within the namespace receives datagram 1 with
> >       recvfrom().  From address is localhost:50000 (socket C*)
> >   11. Server sends reply datagram 1* to localhost:50000 within the ns
> >   12. Server receives datagram 2 with recvfrom().  From address is
> >       again localhost:50000 (socket C*)
> >   13. Server sends reply datagram 2* to localhost:50000 within the ns
> > 
> >   14. pasta receives datagrams 1* and 2* on socket C*.
> >       UDP_BACK_TO_INIT and dst==50000 from the epoll ref
> > 
> >   15. udp_sock_handler_splice() sets s to socket C from
> >       udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000
> >       from udp_splice_map[v6][50000]
> >   16. sendmmsg() forwards datagram 1* on socket C to localhost:2000
> >   17. Datagram 1* correctly received by socket A on localhost:2000
> >   18. sendmmsg() forwards datagram 2* on socket C to localhost:2000
> > * 19. Datagram 2* incorrectly received by socket A on localhost:2000
> >       instead of socket B on localhost:3000
> 
> Thanks a lot for the details, issues are clear to me now.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

      reply	other threads:[~2022-11-18  5:25 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-17  5:07 UDP "splicing" fairly broken :( David Gibson
2022-11-17  7:21 ` Stefano Brivio
2022-11-17 10:30   ` David Gibson
2022-11-17 20:02     ` Stefano Brivio
2022-11-18  5:08       ` David Gibson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y3cTS3eoL2AftUdN@yekko \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).