From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Mon, 3 Feb 2025 10:45:05 +0100 [thread overview]
Message-ID: <20250203104505.71d768ed@elisabeth> (raw)
In-Reply-To: <Z6CHFJFA2GljJbUG@zatzit>
On Mon, 3 Feb 2025 20:06:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Feb 03, 2025 at 07:09:28AM +0100, Stefano Brivio wrote:
> > On Mon, 3 Feb 2025 11:46:13 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:
> > > > Fixed, finally. Some answers:
> > > >
> > > > On Fri, 31 Jan 2025 17:14:18 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:
> > > > > > On Thu, 30 Jan 2025 09:32:36 +0100
> > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > > I would like to quickly complete the whole flow first, because I think
> > > > > > > we can inform design and implementation decisions much better at that
> > > > > > > point
> > > > > >
> > > > > > So, there seems to be a problem with (testing?) this. I couldn't quite
> > > > > > understand the root cause yet, and it doesn't happen with the reference
> > > > > > source.c and target.c implementations I shared.
> > > > > >
> > > > > > Let's assume I have a connection in the source guest to 127.0.0.1:9091,
> > > > > > from 127.0.0.1:56350. After the migration, in the target, I get:
> > > > > >
> > > > > > ---
> > > > > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > > > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > > > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > > > > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1
> > > > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0
> > > > > > write(2, "77.6923: ", 977.6923: ) = 9
> > > > > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51
> > > > > > write(2, "\n", 1
> > > > > > ) = 1
> > > > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0
> > > > > > write(2, "77.6924: ", 977.6924: ) = 9
> > > > > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53
> > > > > > write(2, "\n", 1
> > > > > > ) = 1
> > > > > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > > > > ---
> > > > > >
> > > > > > EADDRNOTAVAIL, according to the documentation, which seems to be
> > > > > > consistent with a glance at the implementation (that is, I must be
> > > > > > missing some issue in the kernel), should be returned on connect() if:
> > > > > >
> > > > > > EADDRNOTAVAIL
> > > > > > (Internet domain sockets) The socket referred to by
> > > > > > sockfd had not previously been bound to an address
> > > > > > and, upon attempting to bind it to an ephemeral
> > > > > > port, it was determined that all port numbers in the
> > > > > > ephemeral port range are currently in use. See the
> > > > > > discussion of /proc/sys/net/ipv4/ip_local_port_range
> > > > > > in ip(7).
> > > > > >
> > > > > > but well, of course it was bound.
> > > > > >
> > > > > > To a port, indeed, not a full address, that is, any (0.0.0.0) and
> > > > > > address port, but I think for the purposes of this description that
> > > > > > bind() call is enough.
> > > > >
> > > > > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired
> > > > > socket.
> > > >
> > > > It is.
> > > >
> > > > > Usually, of course, that 0.0.0.0 would be resolved to a real
> > > > > address at connect() time. But TCP_REPAIR's version of connect()
> > > > > bypasses a bunch of the usual connect logic, so maybe we need an
> > > > > explicit address here.
> > > >
> > > > No need.
> > >
> > > Ok.
> > >
> > > > > ...but that doesn't explain the difference between passt and your test
> > > > > implementation.
> > > >
> > > > The difference that actually matters is that the test implementation
> > > > terminates, and that has the equivalent effect of switching off repair
> > > > mode for the closed sockets, which frees up all the associated context,
> > > > including the port.
> > > >
> > > > Usually, there are no valid operations on closed sockets (not even
> > > > close()). This is the first exception I ever met: you can set
> > > > TCP_REPAIR_OFF.
> > >
> > > I'm still confused by the specific sequence of events that's causing
> > > the problem. If a socket is closed with close(2) it should no longer
> > > exist, so I don't see how you could even attempt to do anything with
> > > it.
> > >
> > > Do you mean that the socket is shutdown(RD|WR)? Or that it's been
> > > closed by passt, but not by passt-repair? Or the other way around?
> > >
> > > I'd kind of assume that you _must_ close the socket while still in
> > > repair mode, since we want it to go away on the source without
> > > attempting to FIN or RST or anything.
> >
> > While the explanation for the issue is what you gave as comment to 8/20
> > (I need to close() the socket from passt-repair), let me answer here:
> > sure, I must close() it, and it was close()d by passt but not
> > passt-repair.
>
> Right, I realised the problem with the missing close in passt-repair
> after I wrote this.
>
> > > > But there's a catch: you can't pass a closed socket in repair mode via
> > > > SCM_RIGHTS (well, I'm fairly sure nobody approached this level of
> > > > insanity before): you get EBADF (which is an understatement).
> > > >
> > > > And there's another catch: if you actually try to do that, even if it
> > > > fails, that has the same effect of clearing the socket entirely: you
> > > > free up the port.
> > >
> > > !?! this is even more baffling. Passing what's now an unrelated,
> > > unassigned integer as an fd is having some effect on a socket that was
> > > around!? If so that's a horrifying kernel bug.
> >
> > Nah, most likely not. The EBADF on a close()d socket is a bit
> > questionable (it should be EINVAL? Or a -1 socket in the
> > recipient?),
>
> You're not "passing a closed socket", that's nonsensical. You're
> trying to pass a stale fd that's no longer refers to your socket.
Well, I'm just passing a number that doesn't happen to refer to a
current socket, but:
> EBADF is _exactly_ what should happen, regardless of whether or not
> the underlying socket is really closed, or if it's held open by
> another fd somewhere (a dup() or something passed to another process
> like in this case).
...EBADF on a sendmsg() means, in POSIX.1-2024:
[EBADF]
The socket argument is not a valid file descriptor.
and nothing else. This matches GNU/Linux documentation by the way.
The socket argument is *not* one of the file descriptors that you can
pass via SCM_RIGHTS.
I would argue that a more reasonable and less surprising behaviour
would be signalling that there's no socket to send with a -1 in the
receiver. Or omit it in ancillary data altogether. POSIX specifies
SCM_RIGHTS, but doesn't mention any error for ancillary data.
> > but other than that, the explanation is that passing that closed socket
> > caused EOF in passt-repair, and passt-repair would quit, solving the
> > issue.
>
> Passing a bad fd caused an error on the sendmsg(), which caused an EOF
> on the other end. Which is a little odd, but again nothing to do with
> "passing a closed socket"; that's impossible - if the socket is closed
> there's no way to refer to it and so no way to even attempt sending
> it.
Right, so it shouldn't be sent, but the error doesn't match.
--
Stefano
next prev parent reply other threads:[~2025-02-03 9:45 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-28 0:49 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-28 0:50 ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
2025-01-28 0:53 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:02 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:27 ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-28 0:59 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:03 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-28 1:40 ` David Gibson
2025-01-28 6:50 ` Stefano Brivio
2025-01-29 1:16 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:48 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:38 ` David Gibson
2025-01-30 8:32 ` Stefano Brivio
2025-01-30 8:54 ` David Gibson
2025-01-31 5:46 ` Stefano Brivio
2025-01-31 6:32 ` David Gibson
2025-01-31 9:09 ` Stefano Brivio
2025-02-03 0:47 ` David Gibson
2025-01-31 5:36 ` Stefano Brivio
2025-01-31 6:14 ` David Gibson
2025-01-31 9:09 ` Stefano Brivio
2025-02-03 0:46 ` David Gibson
2025-02-03 6:09 ` Stefano Brivio
2025-02-03 9:06 ` David Gibson
2025-02-03 9:45 ` Stefano Brivio [this message]
2025-02-03 9:57 ` David Gibson
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31 ` Stefano Brivio
2025-01-28 1:51 ` David Gibson
2025-01-28 6:51 ` Stefano Brivio
2025-01-29 1:29 ` David Gibson
2025-01-29 7:04 ` Stefano Brivio
2025-01-30 0:53 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:43 ` David Gibson
2025-01-30 7:56 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250203104505.71d768ed@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).