public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Mon, 3 Feb 2025 20:57:52 +1100	[thread overview]
Message-ID: <Z6CTIEW0mpqjWp-2@zatzit> (raw)
In-Reply-To: <20250203104505.71d768ed@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 9388 bytes --]

On Mon, Feb 03, 2025 at 10:45:05AM +0100, Stefano Brivio wrote:
> On Mon, 3 Feb 2025 20:06:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Feb 03, 2025 at 07:09:28AM +0100, Stefano Brivio wrote:
> > > On Mon, 3 Feb 2025 11:46:13 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:  
> > > > > Fixed, finally. Some answers:
> > > > > 
> > > > > On Fri, 31 Jan 2025 17:14:18 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:    
> > > > > > > On Thu, 30 Jan 2025 09:32:36 +0100
> > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >       
> > > > > > > > I would like to quickly complete the whole flow first, because I think
> > > > > > > > we can inform design and implementation decisions much better at that
> > > > > > > > point      
> > > > > > > 
> > > > > > > So, there seems to be a problem with (testing?) this. I couldn't quite
> > > > > > > understand the root cause yet, and it doesn't happen with the reference
> > > > > > > source.c and target.c implementations I shared.
> > > > > > > 
> > > > > > > Let's assume I have a connection in the source guest to 127.0.0.1:9091,
> > > > > > > from 127.0.0.1:56350. After the migration, in the target, I get:
> > > > > > > 
> > > > > > > ---
> > > > > > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > > > > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > > > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > > > > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > > > > > recvfrom(72, "\1", 1, 0, NULL, NULL)    = 1
> > > > > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > > > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0
> > > > > > > write(2, "77.6923: ", 977.6923: )                = 9
> > > > > > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51
> > > > > > > write(2, "\n", 1
> > > > > > > )                       = 1
> > > > > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > > > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0
> > > > > > > write(2, "77.6924: ", 977.6924: )                = 9
> > > > > > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53
> > > > > > > write(2, "\n", 1
> > > > > > > )                       = 1
> > > > > > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > > > > > ---
> > > > > > > 
> > > > > > > EADDRNOTAVAIL, according to the documentation, which seems to be
> > > > > > > consistent with a glance at the implementation (that is, I must be
> > > > > > > missing some issue in the kernel), should be returned on connect() if:
> > > > > > > 
> > > > > > >        EADDRNOTAVAIL
> > > > > > >               (Internet  domain sockets) The socket referred to by
> > > > > > >               sockfd had not previously been bound to  an  address
> > > > > > >               and,  upon  attempting  to  bind  it to an ephemeral
> > > > > > >               port, it was determined that all port numbers in the
> > > > > > >               ephemeral port range are currently in use.  See  the
> > > > > > >               discussion of /proc/sys/net/ipv4/ip_local_port_range
> > > > > > >               in ip(7).
> > > > > > > 
> > > > > > > but well, of course it was bound.
> > > > > > > 
> > > > > > > To a port, indeed, not a full address, that is, any (0.0.0.0) and
> > > > > > > address port, but I think for the purposes of this description that
> > > > > > > bind() call is enough.      
> > > > > > 
> > > > > > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired
> > > > > > socket.    
> > > > > 
> > > > > It is.
> > > > >     
> > > > > > Usually, of course, that 0.0.0.0 would be resolved to a real
> > > > > > address at connect() time.  But TCP_REPAIR's version of connect()
> > > > > > bypasses a bunch of the usual connect logic, so maybe we need an
> > > > > > explicit address here.    
> > > > > 
> > > > > No need.    
> > > > 
> > > > Ok.
> > > >   
> > > > > > ...but that doesn't explain the difference between passt and your test
> > > > > > implementation.    
> > > > > 
> > > > > The difference that actually matters is that the test implementation
> > > > > terminates, and that has the equivalent effect of switching off repair
> > > > > mode for the closed sockets, which frees up all the associated context,
> > > > > including the port.
> > > > > 
> > > > > Usually, there are no valid operations on closed sockets (not even
> > > > > close()). This is the first exception I ever met: you can set
> > > > > TCP_REPAIR_OFF.    
> > > > 
> > > > I'm still confused by the specific sequence of events that's causing
> > > > the problem.  If a socket is closed with close(2) it should no longer
> > > > exist, so I don't see how you could even attempt to do anything with
> > > > it.
> > > > 
> > > > Do you mean that the socket is shutdown(RD|WR)?  Or that it's been
> > > > closed by passt, but not by passt-repair?  Or the other way around?
> > > > 
> > > > I'd kind of assume that you _must_ close the socket while still in
> > > > repair mode, since we want it to go away on the source without
> > > > attempting to FIN or RST or anything.  
> > > 
> > > While the explanation for the issue is what you gave as comment to 8/20
> > > (I need to close() the socket from passt-repair), let me answer here:
> > > sure, I must close() it, and it was close()d by passt but not
> > > passt-repair.  
> > 
> > Right, I realised the problem with the missing close in passt-repair
> > after I wrote this.
> > 
> > > > > But there's a catch: you can't pass a closed socket in repair mode via
> > > > > SCM_RIGHTS (well, I'm fairly sure nobody approached this level of
> > > > > insanity before): you get EBADF (which is an understatement).
> > > > > 
> > > > > And there's another catch: if you actually try to do that, even if it
> > > > > fails, that has the same effect of clearing the socket entirely: you
> > > > > free up the port.    
> > > > 
> > > > !?! this is even more baffling.  Passing what's now an unrelated,
> > > > unassigned integer as an fd is having some effect on a socket that was
> > > > around!?  If so that's a horrifying kernel bug.  
> > > 
> > > Nah, most likely not. The EBADF on a close()d socket is a bit
> > > questionable (it should be EINVAL? Or a -1 socket in the
> > > recipient?),  
> > 
> > You're not "passing a closed socket", that's nonsensical.  You're
> > trying to pass a stale fd that's no longer refers to your socket.
> 
> Well, I'm just passing a number that doesn't happen to refer to a
> current socket, but:
> 
> > EBADF is _exactly_ what should happen, regardless of whether or not
> > the underlying socket is really closed, or if it's held open by
> > another fd somewhere (a dup() or something passed to another process
> > like in this case).
> 
> ...EBADF on a sendmsg() means, in POSIX.1-2024:
> 
> [EBADF]
>     The socket argument is not a valid file descriptor.
> 
> and nothing else. This matches GNU/Linux documentation by the way.
> 
> The socket argument is *not* one of the file descriptors that you can
> pass via SCM_RIGHTS.

Eh, I guess so.

> I would argue that a more reasonable and less surprising behaviour
> would be signalling that there's no socket to send with a -1 in the
> receiver. Or omit it in ancillary data altogether. POSIX specifies
> SCM_RIGHTS, but doesn't mention any error for ancillary data.

I mean, regardless of the letter of the law, throwing an error on
sendmsg() seems much more useful than the receiver having to process a
bogus value.  And given that, EBADF seems again sensible, regardless
of the letter of the law.  EINVAL can mean so many things its kind of
useless.

For the man page, I'd chalk it up to them just not considering all the
things that could go wrong with SCM_RIGHTS.  Not covering it in POSIX
is a bit more surprising.

> > > but other than that, the explanation is that passing that closed socket
> > > caused EOF in passt-repair, and passt-repair would quit, solving the
> > > issue.  
> > 
> > Passing a bad fd caused an error on the sendmsg(), which caused an EOF
> > on the other end.  Which is a little odd, but again nothing to do with
> > "passing a closed socket"; that's impossible - if the socket is closed
> > there's no way to refer to it and so no way to even attempt sending
> > it.
> 
> Right, so it shouldn't be sent, but the error doesn't match.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-02-03  9:57 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-28  0:49   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-28  0:50   ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
2025-01-28  0:53   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-29  1:02       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:44           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:27               ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-28  0:59   ` David Gibson
2025-01-28  6:48     ` Stefano Brivio
2025-01-29  1:03       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:44           ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-28  1:40   ` David Gibson
2025-01-28  6:50     ` Stefano Brivio
2025-01-29  1:16       ` David Gibson
2025-01-29  7:33         ` Stefano Brivio
2025-01-30  0:48           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:38               ` David Gibson
2025-01-30  8:32                 ` Stefano Brivio
2025-01-30  8:54                   ` David Gibson
2025-01-31  5:46                     ` Stefano Brivio
2025-01-31  6:32                       ` David Gibson
2025-01-31  9:09                         ` Stefano Brivio
2025-02-03  0:47                           ` David Gibson
2025-01-31  5:36                   ` Stefano Brivio
2025-01-31  6:14                     ` David Gibson
2025-01-31  9:09                       ` Stefano Brivio
2025-02-03  0:46                         ` David Gibson
2025-02-03  6:09                           ` Stefano Brivio
2025-02-03  9:06                             ` David Gibson
2025-02-03  9:45                               ` Stefano Brivio
2025-02-03  9:57                                 ` David Gibson [this message]
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31   ` Stefano Brivio
2025-01-28  1:51   ` David Gibson
2025-01-28  6:51     ` Stefano Brivio
2025-01-29  1:29       ` David Gibson
2025-01-29  7:04         ` Stefano Brivio
2025-01-30  0:53           ` David Gibson
2025-01-30  4:55             ` Stefano Brivio
2025-01-30  7:43               ` David Gibson
2025-01-30  7:56                 ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6CTIEW0mpqjWp-2@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).