From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v13 5/6] migrate: Migrate TCP flows
Date: Mon, 10 Feb 2025 10:51:31 +0100 [thread overview]
Message-ID: <20250210105131.5403ae1b@elisabeth> (raw)
In-Reply-To: <Z6mXFDrbvGm3R6uB@zatzit>
On Mon, 10 Feb 2025 17:05:08 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Sun, Feb 09, 2025 at 11:20:04PM +0100, Stefano Brivio wrote:
> >
> > +/**
> > + * flow_migrate_source_rollback() - Disable repair mode, return failure
> > + * @c: Execution context
> > + * @max_flow: Maximum index of affected flows
> > + * @ret: Negative error code
> > + *
> > + * Return: @ret
> > + */
> > +static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
> > + int ret)
> > +{
> > + union flow *flow;
> > + unsigned i;
> > +
> > + debug("...roll back migration");
> > +
> > + foreach_tcp_flow(i, flow, max_flow)
> > + tcp_flow_repair_off(c, &flow->tcp);
> > +
> > + repair_flush(c);
>
> I think this should die() on failures. If we get here, it could well
> mean we've already had a failure enabling repair mode, so an error
> disabling is more plausible than usual. I think die()ing is
> preferable to carrying on, since resuming normal operation with some
> of our sockets in repair mode is almost certain to result in really
> weird, hard to debug behaviour.
It makes sense in general, except for the case where we were unable to
set repair mode for any of the sockets, say, passt-repair isn't there
or we can't use it for any reason.
We should probably tell this function how to handle failures, from
callers.
> > +
> > + return ret;
> > +}
> > +
> > +/**
> > + * flow_migrate_repair_all() - Turn repair mode on or off for all flows
> > + * @c: Execution context
> > + * @enable: Switch repair mode on if set, off otherwise
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +static int flow_migrate_repair_all(struct ctx *c, bool enable)
> > +{
> > + union flow *flow;
> > + unsigned i;
> > + int rc;
> > +
> > + foreach_tcp_flow(i, flow, FLOW_MAX) {
> > + if (enable)
> > + rc = tcp_flow_repair_on(c, &flow->tcp);
> > + else
> > + rc = tcp_flow_repair_off(c, &flow->tcp);
> > +
> > + if (rc) {
> > + debug("Can't %s repair mode: %s",
> > + enable ? "enable" : "disable", strerror_(-rc));
> > + return flow_migrate_source_rollback(c, i, rc);
> > + }
> > + }
> > +
> > + if ((rc = repair_flush(c))) {
> > + debug("Can't %s repair mode: %s",
> > + enable ? "enable" : "disable", strerror_(-rc));
> > + return flow_migrate_source_rollback(c, i, rc);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_early() - Early tasks: shrink (RFC 7323 2.2) TCP windows
> > + * @c: Execution context
> > + * @stage: Migration stage information, unused
> > + * @fd: Migration file descriptor, unused
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > + int fd)
> > +{
> > + union flow *flow;
> > + unsigned i;
> > + int rc;
> > +
> > + (void)stage;
> > + (void)fd;
> > +
> > + /* We need repair mode to dump and set (some) window parameters */
> > + if ((rc = flow_migrate_repair_all(c, true)))
> > + return -rc;
> > +
> > + foreach_tcp_flow(i, flow, FLOW_MAX) {
> > + if ((rc = tcp_flow_migrate_shrink_window(i, &flow->tcp))) {
> > + err("Shrinking window, flow %u: %s", i, strerror_(-rc));
> > + return flow_migrate_source_rollback(c, i, -rc);
> > + }
> > + }
> > +
> > + /* Now send window updates. We'll flip repair mode back on in a bit */
> > + if ((rc = flow_migrate_repair_all(c, false)))
> > + return -rc;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
> > + * @c: Execution context
> > + * @stage: Migration stage information (unused)
> > + * @fd: Migration file descriptor (unused)
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > + int fd)
> > +{
> > + int rc;
> > +
> > + (void)stage;
> > + (void)fd;
> > +
> > + if ((rc = flow_migrate_repair_all(c, true)))
> > + return -rc;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source() - Dump all the remaining information and send data
> > + * @c: Execution context (unused)
> > + * @stage: Migration stage information (unused)
> > + * @fd: Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > + int fd)
> > +{
> > + uint32_t count = 0;
> > + union flow *flow;
> > + unsigned i;
> > + int rc;
> > +
> > + (void)c;
> > + (void)stage;
> > +
> > + foreach_tcp_flow(i, flow, FLOW_MAX)
> > + count++;
> > +
> > + count = htonl(count);
> > + if ((rc = write_all_buf(fd, &count, sizeof(count)))) {
> > + rc = errno;
> > + err_perror("Can't send flow count (%u)", ntohl(count));
> > + return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > + }
> > +
> > + debug("Sending %u flows", ntohl(count));
> > +
> > + /* Dump and send information that can be stored in the flow table */
> > + foreach_tcp_flow(i, flow, FLOW_MAX) {
> > + if ((rc = tcp_flow_migrate_source(fd, &flow->tcp))) {
> > + err("Can't send data, flow %u: %s", i, strerror_(-rc));
> > + return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
> > + }
> > + }
> > +
> > + /* And then "extended" data (including window data we saved previously):
> > + * the target needs to set repair mode on sockets before it can set
> > + * this stuff, but it needs sockets (and flows) for that.
> > + *
> > + * This also closes sockets so that the target can start connecting
> > + * theirs: you can't sendmsg() to queues (using the socket) if the
> > + * socket is not connected (EPIPE), not even in repair mode. And the
> > + * target needs to restore queues now because we're sending the data.
> > + *
> > + * So, no rollback here, just try as hard as we can.
> > + */
> > + foreach_tcp_flow(i, flow, FLOW_MAX) {
> > + if ((rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp)))
> > + err("Extended data for flow %u: %s", i, strerror_(-rc));
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_target() - Receive flows and insert in flow table
> > + * @c: Execution context
> > + * @stage: Migration stage information (unused)
> > + * @fd: Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > + int fd)
> > +{
> > + uint32_t count;
> > + unsigned i;
> > + int rc;
> > +
> > + (void)stage;
> > +
> > + if (read_all_buf(fd, &count, sizeof(count)))
> > + return errno;
> > +
> > + count = ntohl(count);
> > + debug("Receiving %u flows", count);
> > +
> > + if ((rc = flow_migrate_repair_all(c, true)))
> > + return -rc;
> > +
> > + repair_flush(c);
>
> Unnecessary, flow_migrate_repair_all() already handles this.
>
> > +
> > + /* TODO: flow header with type, instead? */
> > + for (i = 0; i < count; i++) {
> > + rc = tcp_flow_migrate_target(c, fd);
> > + if (rc) {
> > + debug("Bad target data for flow %u: %s, abort",
> > + i, strerror_(-rc));
> > + return -rc;
> > + }
> > + }
> > +
> > + repair_flush(c);
> > +
> > + for (i = 0; i < count; i++) {
> > + rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
> > + if (rc) {
> > + debug("Bad target extended data for flow %u: %s, abort",
> > + i, strerror_(-rc));
> > + return -rc;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * flow_init() - Initialise flow related data structures
> > */
> > diff --git a/flow.h b/flow.h
> > index 24ba3ef..675726e 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -249,6 +249,14 @@ union flow;
> >
> > void flow_init(void);
> > void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > + int fd);
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > + int fd);
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > + int fd);
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > + int fd);
> >
> > void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > __attribute__((format(printf, 3, 4)));
> > diff --git a/migrate.c b/migrate.c
> > index 1c59016..c5c6663 100644
> > --- a/migrate.c
> > +++ b/migrate.c
> > @@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
> >
> > /* Stages for version 1 */
> > static const struct migrate_stage stages_v1[] = {
> > + /* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
> > + * *sometimes* closes the connection for real.
> > + */
> > +/* {
> > + .name = "shrink TCP windows",
> > + .source = flow_migrate_source_early,
> > + .target = NULL,
> > + },
> > +*/
>
> Given we're not sure if this will help, and it adds some
> complications, probably makes sense to split this into a separate
> patch.
I'd rather not because, due to this, the code for the case *without it*
is a bit different anyway, and we avoid some code churn. I would also
like to merge it unused to make our lives a bit easier the day we retry
to work on it.
> > {
> > .name = "observed addresses",
> > .source = seen_addrs_source_v1,
> > .target = seen_addrs_target_v1,
> > },
> > + {
> > + .name = "prepare flows",
> > + .source = flow_migrate_source_pre,
> > + .target = NULL,
> > + },
> > + {
> > + .name = "transfer flows",
> > + .source = flow_migrate_source,
> > + .target = flow_migrate_target,
> > + },
> > { 0 },
> > };
> >
> > diff --git a/passt.c b/passt.c
> > index 6f9fb4d..68d1a28 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -223,9 +223,6 @@ int main(int argc, char **argv)
> > if (sigaction(SIGCHLD, &sa, NULL))
> > die_perror("Couldn't install signal handlers");
> >
> > - if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > - die_perror("Couldn't set disposition for SIGPIPE");
> > -
> > c.mode = MODE_PASTA;
> > } else if (strstr(name, "passt")) {
> > c.mode = MODE_PASST;
> > @@ -233,6 +230,9 @@ int main(int argc, char **argv)
> > _exit(EXIT_FAILURE);
> > }
> >
> > + if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > + die_perror("Couldn't set disposition for SIGPIPE");
> > +
> > madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
> >
> > c.epollfd = epoll_create1(EPOLL_CLOEXEC);
> > diff --git a/repair.c b/repair.c
> > index 784b994..da85edb 100644
> > --- a/repair.c
> > +++ b/repair.c
> > @@ -190,7 +190,6 @@ int repair_flush(struct ctx *c)
> > *
> > * Return: 0 on success, negative error code on failure
> > */
> > -/* cppcheck-suppress unusedFunction */
> > int repair_set(struct ctx *c, int s, int cmd)
> > {
> > int rc;
> > diff --git a/tcp.c b/tcp.c
> > index af6bd95..78db64f 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -280,6 +280,7 @@
> > #include <stddef.h>
> > #include <string.h>
> > #include <sys/epoll.h>
> > +#include <sys/ioctl.h>
> > #include <sys/socket.h>
> > #include <sys/timerfd.h>
> > #include <sys/types.h>
> > @@ -287,6 +288,8 @@
> > #include <time.h>
> > #include <arpa/inet.h>
> >
> > +#include <linux/sockios.h>
> > +
> > #include "checksum.h"
> > #include "util.h"
> > #include "iov.h"
> > @@ -299,6 +302,7 @@
> > #include "log.h"
> > #include "inany.h"
> > #include "flow.h"
> > +#include "repair.h"
> > #include "linux_dep.h"
> >
> > #include "flow_table.h"
> > @@ -326,6 +330,19 @@
> > ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
> > #define CONN_HAS(conn, set) (((conn)->events & (set)) == (set))
> >
> > +/* Buffers to migrate pending data from send and receive queues. No, they don't
> > + * use memory if we don't use them. And we're going away after this, so splurge.
> > + */
> > +#define TCP_MIGRATE_SND_QUEUE_MAX (64 << 20)
> > +#define TCP_MIGRATE_RCV_QUEUE_MAX (64 << 20)
> > +uint8_t tcp_migrate_snd_queue [TCP_MIGRATE_SND_QUEUE_MAX];
> > +uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > +
> > +#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> > +
> > +/* "Extended" data (not stored in the flow table) for TCP flow migration */
> > +static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > +
> > static const char *tcp_event_str[] __attribute((__unused__)) = {
> > "SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
> >
> > @@ -2645,3 +2662,775 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
> > if (c->mode == MODE_PASTA)
> > tcp_splice_refill(c);
> > }
> > +
> > +/**
> > + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> > + * @c: Execution context
> > + * @conn: Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > + int rc = 0;
> > +
> > + if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
> > + err("Failed to set TCP_REPAIR");
> > +
> > + return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
> > + * @c: Execution context
> > + * @conn: Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > + int rc = 0;
> > +
> > + if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
> > + err("Failed to clear TCP_REPAIR");
> > +
> > + return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_queues() - Read or write socket queues, or send unsent data
> > + * @s: Socket
> > + * @sndbuf: Send queue buffer read or written/sent depending on @set
> > + * @sndlen: Length of send queue buffer to set, network order
> > + * @notsentlen: Length of not sent data, non-zero to actually _send_
> > + * @rcvbuf: Receive queue buffer, read or written depending on @set
> > + * @rcvlen: Length of receive queue buffer to set, network order
> > + * @set: Set or send (unsent data only) if true, dump if false
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + *
> > + * #syscalls:vu ioctl
> > + */
> > +static int tcp_flow_repair_queues(int s,
> > + uint8_t *sndbuf,
> > + uint32_t *sndlen, uint32_t *notsentlen,
> > + uint8_t *rcvbuf, uint32_t *rcvlen,
> > + bool set)
> > +{
> > + ssize_t rc;
> > + int v;
> > +
> > + if (set && rcvbuf) { /* FIXME: can't check notsentlen, rework this */
>
> Not really clear why this is its own block, rather than part of the if
> (set) below.
Because in the general case we need to select the send queue, for both
set (set/write send queue) and !set (get/dump send queue), but not in
the case we write to the "real" socket (send send queue, don't write
it).
That happens to be the case where I pass rcvbuf as NULL, and that's how
I grossly identified that subcase. This function needs a complete
rewrite, it got out of hand.
> > + v = TCP_SEND_QUEUE;
> > + if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> > + rc = -errno;
> > + err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
> > + return rc;
> > + }
> > + }
> > +
> > + if (set) {
> >
> > [...]
> >
> > +/**
> > + * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> > + * @fd: Descriptor for state migration
> > + * @fidx: Flow index
> > + * @conn: Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_migrate_source_ext(int fd, int fidx,
> > + const struct tcp_tap_conn *conn)
> > +{
> > + struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
> > + struct tcp_repair_window wnd;
> > + int s = conn->sock;
> > + int rc;
> > +
> > + /* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */
> > + tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
> > + &t->sock_max_window, &t->sock_rcv_wnd,
> > + &t->sock_rcv_wup, &wnd);
> > +
> > + rc = tcp_flow_repair_seq(s, &t->sock_seq_snd, &t->sock_seq_rcv, false);
> > + if (rc) {
> > + err("Failed to get sequences on source for socket %i: %s",
> > + s, strerror_(-rc));
> > + return rc;
> > + }
> > +
> > + tcp_flow_repair_opt(s, &t->snd_wscale, &t->rcv_wscale, &t->sock_mss,
> > + false);
> > +
> > + /* Dump receive queue as late as possible */
>
> Hmm.. why?
I'm not entirely sure if it makes sense, but my thought was: if we
spend time doing other things *after* dumping the receive queue, then
the part we risk missing of it (data that comes to the receive queue
after we dump it) is bigger.
If we dump (and transfer) it last, we should decrease the risk of
missing data.
--
Stefano
next prev parent reply other threads:[~2025-02-10 9:51 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
2025-02-10 2:26 ` David Gibson
2025-02-09 22:20 ` [PATCH v13 2/6] migrate: Migrate guest observed addresses Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair Stefano Brivio
2025-02-10 2:59 ` David Gibson
2025-02-10 15:54 ` Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state Stefano Brivio
2025-02-10 3:43 ` David Gibson
2025-02-09 22:20 ` [PATCH v13 5/6] migrate: Migrate TCP flows Stefano Brivio
2025-02-10 6:05 ` David Gibson
2025-02-10 9:51 ` Stefano Brivio [this message]
2025-02-10 15:54 ` Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 6/6] test: Add migration tests Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250210105131.5403ae1b@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).