From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top, Laurent Vivier
Date: Mon, 3 Feb 2025 11:46:13 +1100
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
References: <20250128075001.3557d398@elisabeth> <20250129083350.220a7ab0@elisabeth>
 <20250130055522.39acb265@elisabeth> <20250130093236.117c3fd0@elisabeth>
 <20250131063655.41a5861b@elisabeth> <20250131100919.0950ec1e@elisabeth>
In-Reply-To: <20250131100919.0950ec1e@elisabeth>
List-Id: Development discussion and patches for passt

On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:
> Fixed, finally. Some answers:
>
> On Fri, 31 Jan 2025 17:14:18 +1100
> David Gibson wrote:
>
> > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:
> > > On Thu, 30 Jan 2025 09:32:36 +0100
> > > Stefano Brivio wrote:
> > >
> > > > I would like to quickly complete the whole flow first, because I think
> > > > we can inform design and implementation decisions much better at that
> > > > point
> > >
> > > So, there seems to be a problem with (testing?) this. I couldn't quite
> > > understand the root cause yet, and it doesn't happen with the reference
> > > source.c and target.c implementations I shared.
> > >
> > > Let's assume I have a connection in the source guest to 127.0.0.1:9091,
> > > from 127.0.0.1:56350.
> > > After the migration, in the target, I get:
> > >
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0
> > > write(2, "77.6923: ", 977.6923: ) = 9
> > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51
> > > write(2, "\n", 1
> > > ) = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0
> > > write(2, "77.6924: ", 977.6924: ) = 9
> > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53
> > > write(2, "\n", 1
> > > ) = 1
> > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > ---
> > >
> > > EADDRNOTAVAIL, according to the documentation, which seems to be
> > > consistent with a glance at the implementation (that is, I must be
> > > missing some issue in the kernel), should be returned on connect() if:
> > >
> > >        EADDRNOTAVAIL
> > >               (Internet domain sockets) The socket referred to by
> > >               sockfd had not previously been bound to an address
> > >               and, upon attempting to bind it to an ephemeral
> > >               port, it was determined that all port numbers in the
> > >               ephemeral port range are currently in use.
> > >               See the discussion of
> > >               /proc/sys/net/ipv4/ip_local_port_range in ip(7).
> > >
> > > but well, of course it was bound.
> > >
> > > To a port, indeed, not a full address, that is, any (0.0.0.0) and
> > > address port, but I think for the purposes of this description that
> > > bind() call is enough.
> >
> > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired
> > socket.
>
> It is.
>
> > Usually, of course, that 0.0.0.0 would be resolved to a real
> > address at connect() time. But TCP_REPAIR's version of connect()
> > bypasses a bunch of the usual connect logic, so maybe we need an
> > explicit address here.
>
> No need.

Ok.

> > ...but that doesn't explain the difference between passt and your test
> > implementation.
>
> The difference that actually matters is that the test implementation
> terminates, and that has the equivalent effect of switching off repair
> mode for the closed sockets, which frees up all the associated context,
> including the port.
>
> Usually, there are no valid operations on closed sockets (not even
> close()). This is the first exception I ever met: you can set
> TCP_REPAIR_OFF.

I'm still confused by the specific sequence of events that's causing
the problem.

If a socket is closed with close(2) it should no longer exist, so I
don't see how you could even attempt to do anything with it. Do you
mean that the socket is shutdown(RD|WR)? Or that it's been closed by
passt, but not by passt-repair? Or the other way around?

I'd kind of assume that you _must_ close the socket while still in
repair mode, since we want it to go away on the source without
attempting to FIN or RST or anything.

> But there's a catch: you can't pass a closed socket in repair mode via
> SCM_RIGHTS (well, I'm fairly sure nobody approached this level of
> insanity before): you get EBADF (which is an understatement).
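
For what it's worth, EBADF is what sendmsg() gives you for any
SCM_RIGHTS message naming a descriptor number that isn't open in the
sender, repair mode or not. A minimal sketch of that, assuming nothing
about passt or passt-repair internals: send_fd() is a made-up helper
name, and the one-byte payload just mirrors the "\1" in the traces
above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (not from passt): send descriptor 'fd' over the
 * AF_UNIX socket 'chan' as SCM_RIGHTS ancillary data, together with a
 * one-byte payload. Returns 0 on success, negative errno on failure.
 */
int send_fd(int chan, int fd)
{
	char cmd = 1;
	struct iovec iov = { .iov_base = &cmd, .iov_len = sizeof(cmd) };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	/* The kernel looks 'fd' up in the sender's descriptor table here:
	 * a number that was already close()d fails with EBADF.
	 */
	if (sendmsg(chan, &msg, 0) < 0)
		return -errno;

	return 0;
}
```

With a socketpair(AF_UNIX, SOCK_STREAM, 0, sv) as the channel,
send_fd(sv[0], s) returns 0 while s is open, and -EBADF once s has been
close()d. That much is expected. It's any side effect on the old socket
that I can't account for.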
>
> And there's another catch: if you actually try to do that, even if it
> fails, that has the same effect of clearing the socket entirely: you
> free up the port.

!?! this is even more baffling. Passing what's now an unrelated,
unassigned integer as an fd is having some effect on a socket that was
around!? If so that's a horrifying kernel bug.

> But we can't use this, unfortunately, because if we do, the peer will
> get a zero-length read (EOF). Now, I could reintroduce a "quit" command
> in passt-repair, and we would know that EOF doesn't actually mean
> completion, but it complicates things again.
>
> What works, though, is simply terminating. We can't do that before
> VHOST_USER_CHECK_DEVICE_STATE, but just after that. That's what I
> implemented at the moment (updated patches coming soon).
>
> > > Is this related to SO_REUSEADDR? I need it (on both source and target)
> > > because, at least in my tests, source and target are on the same
> > > machine, in the same namespace. If I drop it:
> >
> > Again, I can think of various problems that not having the same
> > address available on source and dest might have, but not any which
> > explain the difference between passt and the experimental impl.
> >
> > > ---
> > > bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> > > ---
> > >
> > > as expected.
> > >
> > > However, in my reference implementation, with a connection from
> > > 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does:
> > >
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
> > > setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > socket(AF_UNIX, SOCK_STREAM, 0) = 4
> > > unlink("/tmp/repair.sock") = 0
> > > bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0
> > > listen(4, 1) = 0
> > > accept(4, NULL, NULL) = 5
> > > sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(5, "\1", 1, 0, NULL, NULL) = 1
> > > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0
> > > connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> > > ---
> > >
> > > The only obvious difference is that, here, I'm not binding to an
> > > ephemeral port: the source port (in both source and target "guests") is
> > > 9998.
> > >
> > > Fine, so I tried forcing a lower port in passt (source) as well, and
> > > this is what I get in the target now:
> > >
> > > ---
> > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0
> > > write(2, "46.9751: ", 946.9751: ) = 9
> > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51
> > > write(2, "\n", 1
> > > ) = 1
> > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0
> > > write(2, "46.9752: ", 946.9752: ) = 9
> > > write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54
> > > write(2, "\n", 1
> > > ) = 1
> > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > ---
> > >
> > > no obvious difference. I'll try binding to an explicit address, next,
> > > but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it
> > > works with the reference implementation.
> >
> > I have no ideas yet :(.
> >
> > > Yes, I explicitly close() the socket in the source passt now, but that
> > > doesn't change things.
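
In case the negative values in that last trace look suspicious: strace
prints the TCP_QUEUE_SEQ argument as a signed int, while the log
message prints it unsigned, so -348109334 and 3946857962 are the same
32-bit sequence number, and likewise for the receive queue. A quick
sanity check, where seq_from_strace() is just an illustrative name and
not a passt function:

```c
#include <stdint.h>

/* Hypothetical helper: reinterpret the signed value strace displays as
 * the unsigned 32-bit sequence number that was actually passed to
 * TCP_QUEUE_SEQ. The conversion is modulo 2^32, so negative values map
 * to value + 4294967296.
 */
uint32_t seq_from_strace(int32_t shown)
{
	return (uint32_t)shown;
}
```

So the mismatch between trace and log is only a display difference, not
a second discrepancy to chase.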
> > >
> > > This is presumably just an issue with testing, because in real use
> > > cases source and target guests would be on different machines. Another
> > > idea could be separating the namespaces.
> >
> > Well, if that's relevant to the problem which isn't clear yet. I
> > mean, I guess it's worth trying with source and dest in different
> > namespaces.
> >
> > > I can't just run source and target passt in two instances of pasta
> > > --config-net, because pasta would run into the same issue,
> >
> > Uh.. which same issue? pasta's not trying to do any TCP_REPAIR stuff
> > or migration.
>
> Same issue in the sense that if I connect namespaces with pasta, I
> can't migrate a connection between them, because pasta can't migrate a
> connection. It would close it and try to reopen it.
>
> > > but I could
> > > isolate one namespace with it, then add two network namespaces inside
> > > that, and connect them with veth pairs.
> >
> > Two pasta instances actually sounds like a better bet to me, because
> > the two "hosts" will have the same address, which is what we'd expect
> > for a "real" migration - and it kind of has to be the case for the
> > host side connections to work afterwards.
>
> Eh, yes, but we're back to the original problem. A veth interface
> wouldn't care, instead.
>
> Anyway, no need, it's finally working now.
>

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson