From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=FqPaQ/Hx; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 693355A061B for ; Fri, 20 Dec 2024 14:29:05 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1734701344; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Se0abgq8ssy0GxRfxAOgePb9Wq272B/k8Y/qrlxc6GA=; b=FqPaQ/Hx1fdvGbS1MAaMyPh66/dXuPFEuoHn8NB+wUTtdhLvct7jJZx18bUFDe+bKuRbhY DFDfC8xsgZyKxz5siDq9gz12R87RsjYFEVAqp4GbMDXc2cKO0uIagVfBt7HZs2HKf/nOFg mvNoWCw0j5yUIWa4wxFKyDZixdxrWfo= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-251-1i05M_PeM3Gr9Aw_m1P4CA-1; Fri, 20 Dec 2024 08:29:03 -0500 X-MC-Unique: 1i05M_PeM3Gr9Aw_m1P4CA-1 X-Mimecast-MFC-AGG-ID: 1i05M_PeM3Gr9Aw_m1P4CA Received: by mail-wm1-f71.google.com with SMTP id 5b1f17b1804b1-4361ac8b25fso11176895e9.2 for ; Fri, 20 Dec 2024 05:29:03 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734701341; x=1735306141; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Se0abgq8ssy0GxRfxAOgePb9Wq272B/k8Y/qrlxc6GA=; b=ilY8JyCjAr3lM6nG6zTKbgwxToV3HaybLrh8lNvqZ2rU7uizeYY2gTYi1O6CiekuvW ihWnlnkhnkt+NYQUNA1vWC9WiHflFTLGdkkyYixHmqlJUulqLJ2NkUYVimq/UpH4ORdi EPBBDuIYQ++Gp7grbUGDc2QPcNBIK/ZWzFf1B2fPPy5fbHXRV8+1HwNefaxZ4aDtg+rI YpqLFoZK34ZO5hAg5xuT9VFcKNfQ37Cholc/DYtbFkyPlywWMkI5rQ+aQjpUt2Aqlkss 94h/E8aNETL9N4PiTycJYhf5nkspuhdyK6WGd+kaDfBV+1dSrkBg/3sJ3l8tNrIEBDL5 vsCw== X-Gm-Message-State: AOJu0Yz3Cpu+YMaAMgWchxFu2IqEt8I10u027ypIIy/UII+9wK5OEgaf PAzfCWiWu6Kj0KbY1wnQ5Vf9lsn2956yvALvpWAR4KBKpcrOg0pyg1XgGgMMD1oSc3UprttgyeR Est5ZaE/OHXb2mA4tSIB/ql3iCWDf4xEudqAjtfOp2i+4ehvceItp1E9crt2aaxDzqaEmUN+FrI cKhS4uBrRVsY742+FN5i7HltJlgI7yZ6Lu X-Gm-Gg: ASbGncsNMSuOJkAhi0WeByydxLK7NeDmdF/MAfOW167sElmALWOC6vYSdz7GpwUGnFW bYJ6DbpbLCWc8OK5lkCSINqU+FqrWEHtr5ZJHZTWZ9br28TEIp+WapH+9YaN1xIl49kZx8r45fd ZqEq32EjzOPTe043cWWMvZLYOs7O3+w7D0D2VmXzTV637hMveU77WWc01OIRjfSGSidKMlh5Dfs xv440c4QPu8n7PaxGvJeELBgooaulkQgJ4WunolE6kje8pGWLu4reG6D0KAlRPfymV74t1b54C0 mvscJhAcTw== X-Received: by 2002:a05:600c:1549:b0:436:1c0c:bfb6 with SMTP id 5b1f17b1804b1-43668b78cd4mr24876005e9.27.1734701340909; Fri, 20 Dec 2024 05:29:00 -0800 (PST) X-Google-Smtp-Source: AGHT+IG4tJRVmjyMscdJll/2LWijt80WKjt9HJLXnIc/gDvQ2MBy9o24OgarClR/FoE3bl+9f+OY+A== X-Received: by 2002:a05:600c:1549:b0:436:1c0c:bfb6 with SMTP id 5b1f17b1804b1-43668b78cd4mr24875755e9.27.1734701340302; Fri, 20 Dec 2024 05:29:00 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-43656b119ccsm80066425e9.24.2024.12.20.05.28.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 20 Dec 2024 05:28:59 -0800 (PST) Date: Fri, 20 Dec 2024 14:28:58 +0100 From: Stefano Brivio To: Laurent Vivier Subject: Re: [PATCH 8/9] vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command Message-ID: <20241220142858.5426dba6@elisabeth> In-Reply-To: <61d7f502-6f4d-40bc-b3a5-feed4229585d@redhat.com> References: <20241219111400.2352110-1-lvivier@redhat.com> <20241219111400.2352110-9-lvivier@redhat.com> <20241219204759.65bc5353@elisabeth> <61d7f502-6f4d-40bc-b3a5-feed4229585d@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: PEQh91w1Eqe3XMfayN-1rpk1uJBQOy3vjmsAzdtHjrg_1734701342 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: Z3ZX4TJJF5LMG6VAWXJELL7HZNHQC46Y X-Message-ID-Hash: Z3ZX4TJJF5LMG6VAWXJELL7HZNHQC46Y X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 20 Dec 2024 08:56:27 +0100 Laurent Vivier wrote: > On 19/12/2024 20:47, Stefano Brivio wrote: > > On Thu, 19 Dec 2024 12:13:59 +0100 > > Laurent Vivier wrote: > > > >> +/** > >> + * vu_migrate() -- Send/receive passt insternal state to/from QEMU > > > > Magic! > > > >> + * @vdev: vhost-user device > >> + * @events: epoll events > >> + */ > >> +void vu_migrate(struct vu_dev *vdev, uint32_t events) > >> +{ > >> + int ret; > >> + > >> + /* TODO: collect/set passt internal state > >> + * and use vdev->device_state_fd to send/receive it > >> + */ > >> + debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); > >> + if (events & EPOLLOUT) { > > > > I haven't really reviewed the series yet, but I have a preliminary > > question: how does the hypervisor tell us that we're writing too much? > > The hypervisor is not aware of anything. It's only a bridge between the two passt instances. Well, conceptually, yes, but: > But normally the destination side can report an error, this aborts the migration on both > sides. > > In QEMU, the code is: [Thanks, this makes everything much easier] > > while (true) { > ssize_t read_ret; > > /* read the data from the backend */ > read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size)); ...there are two buffers involved here: the file buffer, and QEMU's 'transfer_buf'. While 'transfer_buf' is handled transparently (read and write right away... except that if writing fails, with the current implementation, we'll lose data, I guess), the (kernel) buffer we write to has a size limit. If we write faster than QEMU can call read() on it, we'll fill it up. Of course, this won't happen if we write "PASST", but if we transfer a full flow table, or, perhaps one day, socket queues, that's megabytes... so we'll have to handle partial writes I suppose. Anyway, it's surely not troublesome right now, I was just curious to see how this is handled. > if (read_ret < 0) { > ret = -errno; > error_setg_errno(errp, -ret, "Failed to receive state"); > goto fail; > } > > assert(read_ret <= chunk_size); > qemu_put_be32(f, read_ret); > > if (read_ret == 0) { > /* EOF */ > break; > } > > /* send to destination QEMU */ > qemu_put_buffer(f, transfer_buf, read_ret); > } > > /* > * Back-end will not really care, but be clean and close our end of the pipe > * before inquiring the back-end about whether transfer was successful > */ > close(read_fd); > > On the other side: > > while (true) { > size_t this_chunk_size = qemu_get_be32(f); > ssize_t write_ret; > const uint8_t *transfer_pointer; > > if (this_chunk_size == 0) { > /* End of state */ > break; > } > > if (transfer_buf_size < this_chunk_size) { > transfer_buf = g_realloc(transfer_buf, this_chunk_size); > transfer_buf_size = this_chunk_size; > } > > if (qemu_get_buffer(f, transfer_buf, this_chunk_size) < > this_chunk_size) > { > error_setg(errp, "Failed to read state"); > ret = -EINVAL; > goto fail; > } > > transfer_pointer = transfer_buf; > while (this_chunk_size > 0) { > write_ret = RETRY_ON_EINTR( > write(write_fd, transfer_pointer, this_chunk_size) > ); > if (write_ret < 0) { > ret = -errno; > error_setg_errno(errp, -ret, "Failed to send state"); > goto fail; > } else if (write_ret == 0) { > error_setg(errp, "Failed to send state: Connection is closed"); > ret = -ECONNRESET; > goto fail; > } > > assert(write_ret <= this_chunk_size); > this_chunk_size -= write_ret; > transfer_pointer += write_ret; > } > } > > /* > * Close our end, thus ending transfer, before inquiring the back-end about > * whether transfer was successful > */ > close(write_fd); > > > Moreover, I think it's important to know, the source side is stopped at the end of > migration but it is in a state it can be restarted. You mean we'll get EPOLLHUP and we know we should terminate at that point, right? Or is there another mechanism? > > I guess we'll do a short write and we'll need to go back to EPOLLOUT? > > There's no minimum chunk size we can write, correct? > > Correct. It works even if there is no write() in this code. > The hypervisor reads until we close the file descriptor (it's a pipe in fact). Ah, sorry, I meant: there's no mimimum chunk size _we are guaranteed to be able to write_. That is, if we fail after "P", we'll need to note down the progress and go back to wait for EPOLLOUT. Not necessarily at the moment, as the context we transfer is small, but I guess we'll need to handle short writes eventually. > >> + debug("Saving backend state"); > >> + > >> + /* send some stuff */ > >> + ret = write(vdev->device_state_fd, "PASST", 6); > >> + /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ > >> + vdev->device_state_result = ret == -1 ? -1 : 0; > >> + /* Closing the file descriptor signals the end of transfer */ > >> + epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, > >> + vdev->device_state_fd, NULL); > >> + close(vdev->device_state_fd); > >> + vdev->device_state_fd = -1; > >> + } else if (events & EPOLLIN) { > > > > ...and similarly here, I guess we'll get a short read? > > We must read until we get a close(). Yes, I get that, but symmetrically with the case above, if we get EAGAIN, I guess we'll need to go back to waiting on EPOLLIN. > >> + char buf[6]; > >> + > >> + debug("Loading backend state"); > >> + /* read some stuff */ > >> + ret = read(vdev->device_state_fd, buf, sizeof(buf)); > >> + /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ > >> + if (ret != sizeof(buf)) { > >> + vdev->device_state_result = -1; > >> + } else { > >> + ret = strncmp(buf, "PASST", sizeof(buf)); > >> + vdev->device_state_result = ret == 0 ? 0 : -1; > >> + } > >> + } else if (events & EPOLLHUP) { > >> + debug("Closing migration channel"); > >> + > >> + /* The end of file signals the end of the transfer. */ > >> + epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, > >> + vdev->device_state_fd, NULL); > >> + close(vdev->device_state_fd); > >> + vdev->device_state_fd = -1; > >> + } > >> +} -- Stefano