Date: Fri, 31 Jan 2025 10:09:19 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Message-ID: <20250131100919.0950ec1e@elisabeth>
References: <20250127231532.672363-7-sbrivio@redhat.com>
 <20250128075001.3557d398@elisabeth>
 <20250129083350.220a7ab0@elisabeth>
 <20250130055522.39acb265@elisabeth>
 <20250130093236.117c3fd0@elisabeth>
 <20250131063655.41a5861b@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

Fixed, finally.
Some answers:

On Fri, 31 Jan 2025 17:14:18 +1100
David Gibson wrote:

> On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:
> > On Thu, 30 Jan 2025 09:32:36 +0100
> > Stefano Brivio wrote:
> >
> > > I would like to quickly complete the whole flow first, because I think
> > > we can inform design and implementation decisions much better at that
> > > point
> >
> > So, there seems to be a problem with (testing?) this. I couldn't quite
> > understand the root cause yet, and it doesn't happen with the reference
> > source.c and target.c implementations I shared.
> >
> > Let's assume I have a connection in the source guest to 127.0.0.1:9091,
> > from 127.0.0.1:56350. After the migration, in the target, I get:
> >
> > ---
> > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1
> > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0
> > write(2, "77.6923: ", 977.6923: ) = 9
> > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51
> > write(2, "\n", 1
> > ) = 1
> > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0
> > write(2, "77.6924: ", 977.6924: ) = 9
> > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53
> > write(2, "\n", 1
> > ) = 1
> > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > ---
> >
> > EADDRNOTAVAIL, according to the documentation, which seems to be
> > consistent with a glance at the implementation (that is, I must be
> > missing some issue in the kernel), should be returned on connect() if:
> >
> >        EADDRNOTAVAIL
> >               (Internet domain sockets) The socket referred to by
> >               sockfd had not previously been bound to an address
> >               and, upon attempting to bind it to an ephemeral
> >               port, it was determined that all port numbers in the
> >               ephemeral port range are currently in use. See the
> >               discussion of /proc/sys/net/ipv4/ip_local_port_range
> >               in ip(7).
> >
> > but well, of course it was bound.
> >
> > To a port, indeed, not a full address, that is, any (0.0.0.0) and
> > address port, but I think for the purposes of this description that
> > bind() call is enough.
>
> So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired
> socket.

It is.

> Usually, of course, that 0.0.0.0 would be resolved to a real
> address at connect() time. But TCP_REPAIR's version of connect()
> bypasses a bunch of the usual connect logic, so maybe we need an
> explicit address here.

No need.

> ...but that doesn't explain the difference between passt and your test
> implementation.

The difference that actually matters is that the test implementation
terminates, and that has the equivalent effect of switching off repair
mode for the closed sockets, which frees up all the associated context,
including the port.

Usually, there are no valid operations on closed sockets (not even
close()). This is the first exception I ever met: you can set
TCP_REPAIR_OFF.

But there's a catch: you can't pass a closed socket in repair mode via
SCM_RIGHTS (well, I'm fairly sure nobody approached this level of
insanity before): you get EBADF (which is an understatement).

And there's another catch: if you actually try to do that, even if it
fails, that has the same effect of clearing the socket entirely: you
free up the port.
But we can't use this, unfortunately, because if we do, the peer will
get a zero-length read (EOF). Now, I could reintroduce a "quit" command
in passt-repair, and we would know that EOF doesn't actually mean
completion, but it complicates things again.

What works, though, is simply terminating. We can't do that before
VHOST_USER_CHECK_DEVICE_STATE, but just after that. That's what I
implemented at the moment (updated patches coming soon).

> > Is this related to SO_REUSEADDR? I need it (on both source and target)
> > because, at least in my tests, source and target are on the same
> > machine, in the same namespace. If I drop it:
>
> Again, I can think of various problems that not having the same
> address available on source and dest might have, but not any which
> explain the difference between passt and the experimental impl.
>
> > ---
> > bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> > ---
> >
> > as expected.
> >
> > However, in my reference implementation, with a connection from
> > 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does:
> >
> > ---
> > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
> > setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > socket(AF_UNIX, SOCK_STREAM, 0) = 4
> > unlink("/tmp/repair.sock") = 0
> > bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0
> > listen(4, 1) = 0
> > accept(4, NULL, NULL) = 5
> > sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > recvfrom(5, "\1", 1, 0, NULL, NULL) = 1
> > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0
> > setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0
> > connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> > ---
> >
> > The only obvious difference is that, here, I'm not binding to an
> > ephemeral port: the source port (in both source and target "guests") is
> > 9998.
> >
> > Fine, so I tried forcing a lower port in passt (source) as well, and
> > this is what I get in the target now:
> >
> > ---
> > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79
> > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1
> > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1
> > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0
> > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0
> > write(2, "46.9751: ", 946.9751: ) = 9
> > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51
> > write(2, "\n", 1
> > ) = 1
> > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0
> > write(2, "46.9752: ", 946.9752: ) = 9
> > write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54
> > write(2, "\n", 1
> > ) = 1
> > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > ---
> >
> > no obvious difference. I'll try binding to an explicit address, next,
> > but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it
> > works with the reference implementation.
>
> I have no ideas yet :(.
>
> > Yes, I explicitly close() the socket in the source passt now, but that
> > doesn't change things.
> >
> > This is presumably just an issue with testing, because in real use
> > cases source and target guests would be on different machines. Another
> > idea could be separating the namespaces.
>
> Well, if that's relevant to the problem which isn't clear yet. I
> mean, I guess it's worth trying with source and dest in different
> namespaces.
>
> > I can't just run source and target passt in two instances of pasta
> > --config-net, because pasta would run into the same issue,
>
> Uh.. which same issue? pasta's not trying to do any TCP_REPAIR stuff
> or migration.

Same issue in the sense that if I connect namespaces with pasta, I
can't migrate a connection between them, because pasta can't migrate a
connection. It would close it and try to reopen it.

> > but I could
> > isolate one namespace with it, then add two network namespaces inside
> > that, and connect them with veth pairs.
>
> Two pasta instances actually sounds like a better bet to me, because
> the two "hosts" will have the same address, which is what we'd expect
> for a "real" migration - and it kind of has to be the case for the
> host side connections to work afterwards.

Eh, yes, but we're back to the original problem. A veth interface
wouldn't care, instead.

Anyway, no need, it's finally working now.

-- 
Stefano
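
For reference, the sendmsg() calls in the traces quoted in this message
hand the TCP socket to the repair helper as SCM_RIGHTS ancillary data,
next to a one-byte payload (cmsg_len=20, msg_controllen=24 on 64-bit
Linux). What follows is only a minimal sketch of that kind of handoff,
not code taken from passt or passt-repair: send_fd(), recv_fd() and
fd_pass_demo() are hypothetical names, and the demo passes one end of a
pipe over a socketpair() instead of a socket in repair mode, so that it
runs without CAP_NET_ADMIN.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one byte on 'chan', with 'fd' attached as SCM_RIGHTS data.
 * Returns 0 on success, -1 on failure.
 */
static int send_fd(int chan, int fd)
{
	char byte = 1;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one byte on 'chan' and return the descriptor attached to
 * it, or -1 if nothing usable was attached.
 */
static int recv_fd(int chan)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Round trip: pass the write end of a pipe across a socketpair, write
 * through the received copy, read the byte back from the pipe.
 * Returns 0 on success. Descriptors leak on error paths (sketch only).
 */
static int fd_pass_demo(void)
{
	int sp[2], p[2], got;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) || pipe(p))
		return -1;
	if (send_fd(sp[0], p[1]))
		return -1;
	if ((got = recv_fd(sp[1])) < 0)
		return -1;
	if (write(got, "x", 1) != 1 || read(p[0], &c, 1) != 1)
		return -1;

	close(got); close(p[0]); close(p[1]);
	close(sp[0]); close(sp[1]);
	return c == 'x' ? 0 : -1;
}
```

Calling fd_pass_demo() from a small main() and checking that it returns
0 is enough to exercise both directions. The receiving side gets a new
descriptor number that refers to the same open file description, which
is why the helper can change socket options on a socket it did not
create.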