public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Thu, 30 Jan 2025 05:55:22 +0100
Message-ID: <20250130055522.39acb265@elisabeth>
In-Reply-To: <Z5rMU0dVWJWSZ_ta@zatzit>

On Thu, 30 Jan 2025 11:48:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Jan 2025 12:16:58 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:  
> > > > On Tue, 28 Jan 2025 12:40:12 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:    
> > > > > > Add two sets (source or target) of three functions each for passt in
> > > > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > > > > 
> > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > > >   for migration, before data is transferred
> > > > > > 
> > > > > > - migrate_source() sends, and migrate_target() receives migration data
> > > > > > 
> > > > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > > >   any post-migration task
> > > > > > 
> > > > > > Callbacks are added to these functions with arrays of function
> > > > > > pointers in migrate.c. Migration handlers are versioned.
> > > > > > 
> > > > > > Versioned descriptions of data sections will be added to the
> > > > > > data_versions array, which points to versioned iovec arrays. Version
> > > > > > 1 is currently empty and will be filled in in subsequent patches.
> > > > > > 
> > > > > > The source announces the data version to be used and informs the peer
> > > > > > about endianness, and the size of void *, time_t, flow entries and
> > > > > > flow hash table entries.
> > > > > > 
> > > > > > The target checks if the version of the source is still supported. If
> > > > > > it's not, it aborts the migration.
> > > > > > 
> > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > > ---
> > > > > >  Makefile    |  12 +--
> > > > > >  migrate.c   | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  migrate.h   |  90 ++++++++++++++++++
> > > > > >  passt.c     |   2 +-
> > > > > >  vu_common.c | 122 ++++++++++++++++---------
> > > > > >  vu_common.h |   2 +-
> > > > > >  6 files changed, 438 insertions(+), 49 deletions(-)
> > > > > >  create mode 100644 migrate.c
> > > > > >  create mode 100644 migrate.h
> > > > > > 
> > > > > > diff --git a/Makefile b/Makefile
> > > > > > index 464eef1..1383875 100644
> > > > > > --- a/Makefile
> > > > > > +++ b/Makefile
> > > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > > > >  
> > > > > >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > > > -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > > > -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > > > +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > >  	vhost_user.c virtio.c vu_common.c
> > > > > >  QRAP_SRCS = qrap.c
> > > > > >  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > > > >  
> > > > > >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > > >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > > > -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > > > -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > > > -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > > > -	virtio.h vu_common.h
> > > > > > +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > > > +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > > > +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > > > +	vhost_user.h virtio.h vu_common.h
> > > > > >  HEADERS = $(PASST_HEADERS) seccomp.h
> > > > > >  
> > > > > >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > > > > diff --git a/migrate.c b/migrate.c
> > > > > > new file mode 100644
> > > > > > index 0000000..bee9653
> > > > > > --- /dev/null
> > > > > > +++ b/migrate.c
> > > > > > @@ -0,0 +1,259 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > +
> > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > + *  for qemu/UNIX domain socket mode
> > > > > > + *
> > > > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > > > + *  for network namespace/tap device mode
> > > > > > + *
> > > > > > + * migrate.c - Migration sections, layout, and routines
> > > > > > + *
> > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > > + */
> > > > > > +
> > > > > > +#include <errno.h>
> > > > > > +#include <sys/uio.h>
> > > > > > +
> > > > > > +#include "util.h"
> > > > > > +#include "ip.h"
> > > > > > +#include "passt.h"
> > > > > > +#include "inany.h"
> > > > > > +#include "flow.h"
> > > > > > +#include "flow_table.h"
> > > > > > +
> > > > > > +#include "migrate.h"
> > > > > > +
> > > > > > +/* Current version of migration data */
> > > > > > +#define MIGRATE_VERSION		1
> > > > > > +
> > > > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > > > +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> > > > > > +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1      
> > > > > 
> > > > > As noted, I'm hoping we can get rid of "either endian" migration.  But
> > > > > if this stays, we should define it using __bswap_constant_32() to
> > > > > avoid embarrassing mistakes.    
> > > > 
> > > > Those always give me issues on musl,    
> > > 
> > > What sort of issues?  We're already using them, and have fallback
> > > versions defined in util.h  
> > 
> > The very issues that brought me to introduce those fallback versions,
> > so I'm instinctively reluctant to use them.
> > 
> > Actually, I think it's even clearer to have this spelt out (I always
> > need to stop for a moment and think: what happens when I cross the
> > 32-bit boundary?).  
> 
> Oh, yes, we'd need to add a __bswap_constant_64() for this.

...which doesn't exist on musl. On current Alpine Edge:

util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration]
  131 | #define htonll_constant(x)       (__bswap_constant_64(x))
      |                                  ^~~~~~~~~~~~~~~~~~~

...so rather than adding an implementation for this single usage, which
makes it actually less clear to me, I would keep it like it is.
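
For the record, such a fallback could look like the sketch below.
BSWAP64_CONST is a hypothetical name, not something from util.h or from
any libc, and the assertion only shows that the spelt-out value can be
checked at build time either way:

```c
#include <assert.h>
#include <stdint.h>

/* Magic as we see it and as seen with reverse endianness (from the patch) */
#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

/* Hypothetical fallback, not part of passt or of any libc: reverse the byte
 * order of a 64-bit constant with shifts and masks only, so that the result
 * is still a constant expression.
 */
#define BSWAP64_CONST(x)						\
	((((uint64_t)(x) & 0x00000000000000ffULL) << 56) |		\
	 (((uint64_t)(x) & 0x000000000000ff00ULL) << 40) |		\
	 (((uint64_t)(x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((uint64_t)(x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((uint64_t)(x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((uint64_t)(x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((uint64_t)(x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((uint64_t)(x) & 0xff00000000000000ULL) >> 56))

/* Build-time check that the hand-written swapped magic is right */
static_assert(BSWAP64_CONST(MIGRATE_MAGIC) == MIGRATE_MAGIC_SWAPPED,
	      "MIGRATE_MAGIC_SWAPPED is not the byte-swapped MIGRATE_MAGIC");
```

The two values in the patch do pass this check, so the explicit
definition is at least not wrong.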

> [snip]
> > > > > > +/**
> > > > > > + * union migrate_header - Migration header from source
> > > > > > + * @magic:		0xB1BB1D1B0BB1D1B0, host order
> > > > > > + * @version:		Source sends highest known, target aborts if unsupported
> > > > > > + * @voidp_size:		sizeof(void *), network order
> > > > > > + * @time_t_size:	sizeof(time_t), network order
> > > > > > + * @flow_size:		sizeof(union flow), network order
> > > > > > + * @flow_sidx_size:	sizeof(struct flow_sidx_t), network order
> > > > > > + * @unused:		Go figure
> > > > > > + */
> > > > > > +union migrate_header {
> > > > > > +	struct {
> > > > > > +		uint64_t magic;
> > > > > > +		uint32_t version;
> > > > > > +		uint32_t voidp_size;
> > > > > > +		uint32_t time_t_size;
> > > > > > +		uint32_t flow_size;
> > > > > > +		uint32_t flow_sidx_size;
> > > > > > +	};
> > > > > > +	uint8_t unused[65536];      
> > > > > 
> > > > > So, having looked at this, I no longer think padding the header to 64kiB
> > > > > is a good idea.  The structure means we're basically stuck always
> > > > > having that chunky header.  Instead, I think the header should be
> > > > > absolutely minimal: basically magic and version only.  v1 (and maybe
> > > > > others) can add a "metadata" or whatever section for additional
> > > > > information like this they need.    
> > > > 
> > > > The header is processed by the target in a separate, preliminary step,
> > > > though.
> > > > 
> > > > That's why I added metadata right in the header: if the target needs to
> > > > abort the migration because, say, the size of a flow entry is too big
> > > > to handle for a particular version, then we should know that before
> > > > migrate_target_pre().    
> > > 
> > > Ah, yes, I missed that, we'd need a more complex design to do
> > > additional transfers and checks before making the target_pre
> > > callbacks.
> > >   
> > > > As long as we check the version first, we can always shrink the header
> > > > later on.    
> > > 
> > > *thinks*.. I guess so, though it's kind of awkward; a future version
> > > would have to read the "header of the header", check the version, then
> > > if it's the old one, read the remainder of the 64kiB block.
> > > 
> > > I still think we should clearly separate the part that we're
> > > committing to being in every future version (which I think should just
> > > be magic and version), from the stuff that's just v1.  
> > 
> > Sure, I can add a comment.
> >   
> > > > But having 64 KiB reserved looks more robust because it's a
> > > > safe place to add this kind of metadata.
> > > > 
> > > > Note that 64 KiB is typically transferred in a single read/write
> > > > from/to the vhost-user back-end.    
> > > 
> > > Ok, but it also has to go over the qemu migration channel, which will
> > > often be a physical link, not a super-fast local/virtual one, and may
> > > be bandwidth capped as well.  I'm not actually certain if 64kiB is
> > > likely to be a problem there, but it *is* large compared to the state
> > > blobs of most qemu devices (usually only a few hundred bytes).  
> > 
> > Even if we transfer just what we need of a flow, it's still something
> > well in excess of 50 bytes each. 100k flows would be 5 megs.  
> 
> Just transferring the in-use flows would be higher priority than being
> selective about what we send within each flow.

Well, of course, I meant that we would only transfer used flows at that
point: it's no longer about transferring the flow table as a whole, so
none of the advantages and disadvantages of that approach apply.

But still one can have 128k flows at the moment.
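
Back-of-the-envelope, as a sketch: the figures below are assumptions for
illustration ("128k" read as 131072 flows, and the rough lower bound of
50 bytes per flow from above), not measured sizes, and the names are
made up:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed figures, for illustration only */
#define FLOWS_MAX_ASSUMED	(128 * 1024)	/* "128k" read as 131072 */
#define FLOW_BYTES_ASSUMED	50		/* rough lower bound per flow */
#define HEADER_BYTES_64K	65536		/* header size being discussed */

/* Hypothetical helper: lower bound, in bytes, on the flow data transferred
 * for a given number of flows in use.
 */
uint64_t migrate_flow_bytes_min(uint64_t flows)
{
	return flows * FLOW_BYTES_ASSUMED;
}
```

With these numbers, 100000 flows give 5000000 bytes, and 131072 flows
give 6553600 bytes, that is, a hundred times the 64 KiB header.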

> It's both easier to do
> and a bigger win in most cases.  That would dramatically reduce the
> size sent here.

Yep, feel free.

> > Well, anyway, let's cut this down to 4k, which should be enough, so
> > that it's not a topic anymore.  
> 
> I still think it's ugly, but whatever.

Same here.
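
To make the outcome concrete, here is a sketch of how the header could
look with the padding cut down and a comment separating the part common
to all versions from the rest. The 4096-byte size is my reading of "4k",
and the enum and the check function are hypothetical names for
illustration, not what the next revision will necessarily contain:

```c
#include <assert.h>
#include <stdint.h>

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1
#define MIGRATE_HEADER_SIZE	4096	/* assumed: "4k" read as 4096 bytes */

union migrate_header {
	struct {
		/* Part meant to stay the same in every future version */
		uint64_t magic;
		uint32_t version;

		/* Version 1 only from here on */
		uint32_t voidp_size;
		uint32_t time_t_size;
		uint32_t flow_size;
		uint32_t flow_sidx_size;
	};
	uint8_t unused[MIGRATE_HEADER_SIZE];
};

static_assert(sizeof(union migrate_header) == MIGRATE_HEADER_SIZE,
	      "header must be exactly as big as its padding");

/* Hypothetical result of checking the magic on the target */
enum migrate_magic_result {
	MIGRATE_MAGIC_OK,		/* same endianness as the source */
	MIGRATE_MAGIC_REVERSED,		/* source has opposite endianness */
	MIGRATE_MAGIC_BAD,		/* not a migration header at all */
};

/* Hypothetical helper: classify the magic before looking at anything else */
enum migrate_magic_result migrate_magic_check(const union migrate_header *h)
{
	if (h->magic == MIGRATE_MAGIC)
		return MIGRATE_MAGIC_OK;
	if (h->magic == MIGRATE_MAGIC_SWAPPED)
		return MIGRATE_MAGIC_REVERSED;
	return MIGRATE_MAGIC_BAD;
}
```

The target would look at magic and version first, and only then trust
the remaining fields, so a future version is still free to shrink or
drop everything after the first two members.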

-- 
Stefano

