public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Mon, 3 Feb 2025 11:46:13 +1100	[thread overview]
Message-ID: <Z6AR1cPukcRXu9XH@zatzit> (raw)
In-Reply-To: <20250131100919.0950ec1e@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 11341 bytes --]

On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:
> Fixed, finally. Some answers:
> 
> On Fri, 31 Jan 2025 17:14:18 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:
> > > On Thu, 30 Jan 2025 09:32:36 +0100
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >   
> > > > I would like to quickly complete the whole flow first, because I think
> > > > we can inform design and implementation decisions much better at that
> > > > point  
> > > 
> > > So, there seems to be a problem with (testing?) this. I couldn't quite
> > > understand the root cause yet, and it doesn't happen with the reference
> > > source.c and target.c implementations I shared.
> > > 
> > > Let's assume I have a connection in the source guest to 127.0.0.1:9091,
> > > from 127.0.0.1:56350. After the migration, in the target, I get:
> > > 
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(72, "\1", 1, 0, NULL, NULL)    = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0
> > > write(2, "77.6923: ", 977.6923: )                = 9
> > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51
> > > write(2, "\n", 1
> > > )                       = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0
> > > write(2, "77.6924: ", 977.6924: )                = 9
> > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53
> > > write(2, "\n", 1
> > > )                       = 1
> > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > ---
> > > 
> > > EADDRNOTAVAIL, according to the documentation, which seems to be
> > > consistent with a glance at the implementation (that is, I must be
> > > missing some issue in the kernel), should be returned on connect() if:
> > > 
> > >        EADDRNOTAVAIL
> > >               (Internet  domain sockets) The socket referred to by
> > >               sockfd had not previously been bound to  an  address
> > >               and,  upon  attempting  to  bind  it to an ephemeral
> > >               port, it was determined that all port numbers in the
> > >               ephemeral port range are currently in use.  See  the
> > >               discussion of /proc/sys/net/ipv4/ip_local_port_range
> > >               in ip(7).
> > > 
> > > but well, of course it was bound.
> > > 
> > > To a port, indeed, not a full address, that is, any (0.0.0.0) and
> > > address port, but I think for the purposes of this description that
> > > bind() call is enough.  
> > 
> > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired
> > socket.
> 
> It is.
> 
> > Usually, of course, that 0.0.0.0 would be resolved to a real
> > address at connect() time.  But TCP_REPAIR's version of connect()
> > bypasses a bunch of the usual connect logic, so maybe we need an
> > explicit address here.
> 
> No need.

Ok.

> > ...but that doesn't explain the difference between passt and your test
> > implementation.
> 
> The difference that actually matters is that the test implementation
> terminates, and that has the equivalent effect of switching off repair
> mode for the closed sockets, which frees up all the associated context,
> including the port.
> 
> Usually, there are no valid operations on closed sockets (not even
> close()). This is the first exception I ever met: you can set
> TCP_REPAIR_OFF.

I'm still confused by the specific sequence of events that's causing
the problem.  If a socket is closed with close(2) it should no longer
exist, so I don't see how you could even attempt to do anything with
it.

Do you mean that the socket is shutdown(RD|WR)?  Or that it's been
closed by passt, but not by passt-repair?  Or the other way around?

I'd kind of assume that you _must_ close the socket while still in
repair mode, since we want it to go away on the source without
attempting to FIN or RST or anything.

> But there's a catch: you can't pass a closed socket in repair mode via
> SCM_RIGHTS (well, I'm fairly sure nobody approached this level of
> insanity before): you get EBADF (which is an understatement).
> 
> And there's another catch: if you actually try to do that, even if it
> fails, that has the same effect of clearing the socket entirely: you
> free up the port.

!?! this is even more baffling.  Passing what's now an unrelated,
unassigned integer as an fd is having some effect on a socket that was
around!?  If so that's a horrifying kernel bug.

> But we can't use this, unfortunately, because if we do, the peer will
> get a zero-length read (EOF). Now, I could reintroduce a "quit" command
> in passt-repair, and we would know that EOF doesn't actually mean
> completion, but it complicates things again.
> 
> What works, though, is simply terminating. We can't do that before
> VHOST_USER_CHECK_DEVICE_STATE, but just after that. That's what I
> implemented at the moment (updated patches coming soon).
> 
> > > Is this related to SO_REUSEADDR? I need it (on both source and target)
> > > because, at least in my tests, source and target are on the same
> > > machine, in the same namespace. If I drop it:  
> > 
> > Again, I can think of various problems that not having the same
> > address available on source and dest might have, but not any which
> > explain the difference between passt and the experimental impl.
> > 
> > > ---
> > > bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> > > ---
> > > 
> > > as expected.
> > > 
> > > However, in my reference implementation, with a connection from
> > > 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does:
> > > 
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
> > > setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > socket(AF_UNIX, SOCK_STREAM, 0)         = 4
> > > unlink("/tmp/repair.sock")              = 0
> > > bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0
> > > listen(4, 1)                            = 0
> > > accept(4, NULL, NULL)                   = 5
> > > sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(5, "\1", 1, 0, NULL, NULL)     = 1
> > > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0
> > > connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> > > ---
> > > 
> > > The only obvious difference is that, here, I'm not binding to an
> > > ephemeral port: the source port (in both source and target "guests") is
> > > 9998.
> > > 
> > > Fine, so I tried forcing a lower port in passt (source) as well, and
> > > this is what I get in the target now:
> > > 
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(72, "\1", 1, 0, NULL, NULL)    = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0
> > > write(2, "46.9751: ", 946.9751: )                = 9
> > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51
> > > write(2, "\n", 1
> > > )                       = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0
> > > write(2, "46.9752: ", 946.9752: )                = 9
> > > write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54
> > > write(2, "\n", 1
> > > )                       = 1
> > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > ---
> > > 
> > > no obvious difference. I'll try binding to an explicit address, next,
> > > but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it
> > > works with the reference implementation.  
> > 
> > I have no ideas yet :(.
> > 
> > > Yes, I explicitly close() the socket in the source passt now, but that
> > > doesn't change things.
> > > 
> > > This is presumably just an issue with testing, because in real use
> > > cases source and target guests would be on different machines. Another
> > > idea could be separating the namespaces.  
> > 
> > Well, if that's relevant to the problem which isn't clear yet.  I
> > mean, I guess it's worth trying with source and dest in different
> > namespaces.
> > 
> > > I can't just run source and target passt in two instances of pasta
> > > --config-net, because pasta would run into the same issue,  
> > 
> > Uh.. which same issue?  pasta's not trying to do any TCP_REPAIR stuff
> > or migration.
> 
> Same issue in the sense that if I connect namespaces with pasta, I
> can't migrate a connection between them, because pasta can't migrate a
> connection. It would close it and try to reopen it.
> 
> > > but I could
> > > isolate one namespace with it, then add two network namespaces inside
> > > that, and connect them with veth pairs.  
> > 
> > Two pasta instances actually sounds like a better bet to me, because
> > the two "hosts" will have the same address, which is what we'd expect
> > for a "real" migration - and it kind of has to be the case for the
> > host side connections to work afterwards.
> 
> Eh, yes, but we're back to the original problem. A veth interface
> wouldn't care, instead.
> 
> Anyway, no need, it's finally working now.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-02-03  2:16 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-28  0:49   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-28  0:50   ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
2025-01-28  0:53   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-29  1:02       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:44           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:27               ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-28  0:59   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-29  1:03       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:44           ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-28  1:40   ` David Gibson
2025-01-28  6:50     ` Stefano Brivio
2025-01-29  1:16       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:48           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:38               ` David Gibson
2025-01-30  8:32                 ` Stefano Brivio
2025-01-30  8:54                   ` David Gibson
2025-01-31  5:46                     ` Stefano Brivio
2025-01-31  6:32                       ` David Gibson
2025-01-31  9:09                         ` Stefano Brivio
2025-02-03  0:47                           ` David Gibson
2025-01-31  5:36                   ` Stefano Brivio
2025-01-31  6:14                     ` David Gibson
2025-01-31  9:09                       ` Stefano Brivio
2025-02-03  0:46                         ` David Gibson [this message]
2025-02-03  6:09                           ` Stefano Brivio
2025-02-03  9:06                             ` David Gibson
2025-02-03  9:45                               ` Stefano Brivio
2025-02-03  9:57                                 ` David Gibson
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31   ` Stefano Brivio
2025-01-28  1:51   ` David Gibson
2025-01-28  6:51     ` Stefano Brivio
2025-01-29  1:29       ` David Gibson
2025-01-29  7:04         ` Stefano Brivio
2025-01-30  0:53           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:43               ` David Gibson
2025-01-30  7:56                 ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6AR1cPukcRXu9XH@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).