Re: [PATCH 8/9] vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

From: Stefano Brivio <sbrivio@redhat.com>
To: Laurent Vivier <lvivier@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 8/9] vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command
Date: Fri, 20 Dec 2024 14:28:58 +0100	[thread overview]
Message-ID: <20241220142858.5426dba6@elisabeth> (raw)
In-Reply-To: <61d7f502-6f4d-40bc-b3a5-feed4229585d@redhat.com>

On Fri, 20 Dec 2024 08:56:27 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 19/12/2024 20:47, Stefano Brivio wrote:
> > On Thu, 19 Dec 2024 12:13:59 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> +/**
> >> + * vu_migrate() -- Send/receive passt insternal state to/from QEMU  
> > 
> > Magic!
> >   
> >> + * @vdev:	vhost-user device
> >> + * @events:	epoll events
> >> + */
> >> +void vu_migrate(struct vu_dev *vdev, uint32_t events)
> >> +{
> >> +	int ret;
> >> +
> >> +	/* TODO: collect/set passt internal state
> >> +         *       and use vdev->device_state_fd to send/receive it
> >> +         */
> >> +	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> >> +	if (events & EPOLLOUT) {  
> > 
> > I haven't really reviewed the series yet, but I have a preliminary
> > question: how does the hypervisor tell us that we're writing too much?  
> 
> The hypervisor is not aware of anything. It's only a bridge between the two passt instances.

Well, conceptually, yes, but:

> But normally the destination side can report an error, this aborts the migration on both 
> sides.
> 
> In QEMU, the code is:

[Thanks, this makes everything much easier]

> 
>      while (true) {
>          ssize_t read_ret;
> 
> 	/* read the data from the backend */
>          read_ret = RETRY_ON_EINTR(read(read_fd, transfer_buf, chunk_size));

...there are two buffers involved here: the file buffer, and QEMU's
'transfer_buf'.

While 'transfer_buf' is handled transparently (read and write right
away... except that if writing fails, with the current implementation,
we'll lose data, I guess), the (kernel) buffer we write to has a size
limit.

If we write faster than QEMU can call read() on it, we'll fill it up.

Of course, this won't happen if we write "PASST", but if we transfer
a full flow table, or, perhaps one day, socket queues, that's
megabytes... so we'll have to handle partial writes I suppose.

Anyway, it's surely not troublesome right now, I was just curious to
see how this is handled.

>          if (read_ret < 0) {
>              ret = -errno;
>              error_setg_errno(errp, -ret, "Failed to receive state");
>              goto fail;
>          }
> 
>          assert(read_ret <= chunk_size);
>          qemu_put_be32(f, read_ret);
> 
>          if (read_ret == 0) {
>              /* EOF */
>              break;
>          }
> 
> 	/* send to destination QEMU */
>          qemu_put_buffer(f, transfer_buf, read_ret);
>      }
> 
>      /*
>       * Back-end will not really care, but be clean and close our end of the pipe
>       * before inquiring the back-end about whether transfer was successful
>       */
>      close(read_fd);
> 
> On the other side:
> 
>      while (true) {
>          size_t this_chunk_size = qemu_get_be32(f);
>          ssize_t write_ret;
>          const uint8_t *transfer_pointer;
> 
>          if (this_chunk_size == 0) {
>              /* End of state */
>              break;
>          }
> 
>          if (transfer_buf_size < this_chunk_size) {
>              transfer_buf = g_realloc(transfer_buf, this_chunk_size);
>              transfer_buf_size = this_chunk_size;
>          }
> 
>          if (qemu_get_buffer(f, transfer_buf, this_chunk_size) <
>                  this_chunk_size)
>          {
>              error_setg(errp, "Failed to read state");
>              ret = -EINVAL;
>              goto fail;
>          }
> 
>          transfer_pointer = transfer_buf;
>          while (this_chunk_size > 0) {
>              write_ret = RETRY_ON_EINTR(
>                  write(write_fd, transfer_pointer, this_chunk_size)
>              );
>              if (write_ret < 0) {
>                  ret = -errno;
>                  error_setg_errno(errp, -ret, "Failed to send state");
>                  goto fail;
>              } else if (write_ret == 0) {
>                  error_setg(errp, "Failed to send state: Connection is closed");
>                  ret = -ECONNRESET;
>                  goto fail;
>              }
> 
>              assert(write_ret <= this_chunk_size);
>              this_chunk_size -= write_ret;
>              transfer_pointer += write_ret;
>          }
>      }
> 
>      /*
>       * Close our end, thus ending transfer, before inquiring the back-end about
>       * whether transfer was successful
>       */
>      close(write_fd);
> 
> 
> Moreover, I think it's important to know, the source side is stopped at the end of 
> migration but it is in a state it can be restarted.

You mean we'll get EPOLLHUP and we know we should terminate at that
point, right? Or is there another mechanism?

> > I guess we'll do a short write and we'll need to go back to EPOLLOUT?
> > There's no minimum chunk size we can write, correct?  
> 
> Correct. It works even if there is no write() in this code.
> The hypervisor reads until we close the file descriptor (it's a pipe in fact).

Ah, sorry, I meant: there's no mimimum chunk size _we are guaranteed to
be able to write_.

That is, if we fail after "P", we'll need to note down the progress
and go back to wait for EPOLLOUT.

Not necessarily at the moment, as the context we transfer is small, but
I guess we'll need to handle short writes eventually.

> >> +		debug("Saving backend state");
> >> +
> >> +		/* send some stuff */
> >> +		ret = write(vdev->device_state_fd, "PASST", 6);
> >> +		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> >> +		vdev->device_state_result = ret == -1 ? -1 : 0;
> >> +		/* Closing the file descriptor signals the end of transfer */
> >> +		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> >> +			  vdev->device_state_fd, NULL);
> >> +		close(vdev->device_state_fd);
> >> +		vdev->device_state_fd = -1;
> >> +	} else if (events & EPOLLIN) {  
> > 
> > ...and similarly here, I guess we'll get a short read?  
> 
> We must read until we get a close().

Yes, I get that, but symmetrically with the case above, if we get
EAGAIN, I guess we'll need to go back to waiting on EPOLLIN.

> >> +		char buf[6];
> >> +
> >> +		debug("Loading backend state");
> >> +		/* read some stuff */
> >> +		ret = read(vdev->device_state_fd, buf, sizeof(buf));
> >> +		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> >> +		if (ret != sizeof(buf)) {
> >> +			vdev->device_state_result = -1;
> >> +		} else {
> >> +			ret = strncmp(buf, "PASST", sizeof(buf));
> >> +			vdev->device_state_result = ret == 0 ? 0 : -1;
> >> +		}
> >> +	} else if (events & EPOLLHUP) {
> >> +		debug("Closing migration channel");
> >> +
> >> +		/* The end of file signals the end of the transfer. */
> >> +		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> >> +			  vdev->device_state_fd, NULL);
> >> +		close(vdev->device_state_fd);
> >> +		vdev->device_state_fd = -1;
> >> +	}
> >> +}  

-- 
Stefano

next prev parent reply	other threads:[~2024-12-20 13:29 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-19 11:13 [PATCH 0/9] vhost-user: Migration support Laurent Vivier
2024-12-19 11:13 ` [PATCH 1/9] virtio: Use const pointer for vu_dev Laurent Vivier
2024-12-20  0:24   ` David Gibson
2025-01-06  8:58     ` Stefano Brivio
2024-12-19 11:13 ` [PATCH 2/9] vhost-user: update protocol features and commands list Laurent Vivier
2025-01-17 18:04   ` Stefano Brivio
2024-12-19 11:13 ` [PATCH 3/9] vhost-user: add VHOST_USER_SET_LOG_FD command Laurent Vivier
2025-01-17 18:04   ` Stefano Brivio
2024-12-19 11:13 ` [PATCH 4/9] vhost-user: Pass vu_dev to more virtio functions Laurent Vivier
2024-12-19 11:13 ` [PATCH 5/9] vhost-user: add VHOST_USER_SET_LOG_BASE command Laurent Vivier
2025-01-17 18:05   ` Stefano Brivio
2025-01-20 10:57     ` Laurent Vivier
2025-01-17 19:10   ` Stefano Brivio
2024-12-19 11:13 ` [PATCH 6/9] vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD Laurent Vivier
2024-12-19 11:13 ` [PATCH 7/9] vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command Laurent Vivier
2024-12-19 11:13 ` [PATCH 8/9] vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command Laurent Vivier
2024-12-19 19:47   ` Stefano Brivio
2024-12-20  7:56     ` Laurent Vivier
2024-12-20 13:28       ` Stefano Brivio [this message]
2025-01-17 18:05   ` Stefano Brivio
2025-01-20 11:00     ` Laurent Vivier
2025-01-20 20:09       ` Stefano Brivio
2024-12-19 11:14 ` [PATCH 9/9] vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE Laurent Vivier
2025-01-17 12:13 ` [PATCH 0/9] vhost-user: Migration support Laurent Vivier
2025-01-17 12:44   ` Stefano Brivio
2025-01-17 13:27     ` Laurent Vivier
2025-01-17 13:38       ` Stefano Brivio
2025-01-17 13:58         ` Laurent Vivier
2025-01-17 14:29           ` Stefano Brivio
2025-01-17 13:31     ` Stefano Brivio
2025-01-17 16:51 ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241220142858.5426dba6@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).