Date: Mon, 10 Feb 2025 10:51:31 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Subject: Re: [PATCH v13 5/6] migrate: Migrate TCP flows
Message-ID: <20250210105131.5403ae1b@elisabeth>
References: <20250209222005.1640077-1-sbrivio@redhat.com>
	<20250209222005.1640077-6-sbrivio@redhat.com>
Organization: Red Hat
X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
CC: passt-dev@passt.top
List-Id: Development discussion and patches for passt

On Mon, 10 Feb 2025 17:05:08 +1100
David Gibson wrote:

> On Sun, Feb 09, 2025 at 11:20:04PM +0100, Stefano Brivio wrote:
> >
> > +/**
> > + * flow_migrate_source_rollback() - Disable repair mode, return failure
> > + * @c:		Execution context
> > + * @max_flow:	Maximum index of affected flows
> > + * @ret:	Negative error code
> > + *
> > + * Return: @ret
> > + */
> > +static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
> > +					int ret)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +
> > +	debug("...roll back migration");
> > +
> > +	foreach_tcp_flow(i, flow, max_flow)
> > +		tcp_flow_repair_off(c, &flow->tcp);
> > +
> > +	repair_flush(c);
>
> I think this should die() on failures. If we get here, it could well
> mean we've already had a failure enabling repair mode, so an error
> disabling is more plausible than usual. I think die()ing is
> preferable to carrying on, since resuming normal operation with some
> of our sockets in repair mode is almost certain to result in really
> weird, hard-to-debug behaviour.

It makes sense in general, except for the case where we were unable to
set repair mode for any of the sockets, say, if passt-repair isn't
there or we can't use it for any reason. We should probably let
callers tell this function how to handle failures.

> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * flow_migrate_repair_all() - Turn repair mode on or off for all flows
> > + * @c:		Execution context
> > + * @enable:	Switch repair mode on if set, off otherwise
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +static int flow_migrate_repair_all(struct ctx *c, bool enable)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if (enable)
> > +			rc = tcp_flow_repair_on(c, &flow->tcp);
> > +		else
> > +			rc = tcp_flow_repair_off(c, &flow->tcp);
> > +
> > +		if (rc) {
> > +			debug("Can't %s repair mode: %s",
> > +			      enable ? "enable" : "disable", strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, i, rc);
> > +		}
> > +	}
> > +
> > +	if ((rc = repair_flush(c))) {
> > +		debug("Can't %s repair mode: %s",
> > +		      enable ?
> > +		      "enable" : "disable", strerror_(-rc));
> > +		return flow_migrate_source_rollback(c, i, rc);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_early() - Early tasks: shrink (RFC 7323 2.2) TCP windows
> > + * @c:		Execution context
> > + * @stage:	Migration stage information, unused
> > + * @fd:		Migration file descriptor, unused
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > +			      int fd)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)stage;
> > +	(void)fd;
> > +
> > +	/* We need repair mode to dump and set (some) window parameters */
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_shrink_window(i, &flow->tcp))) {
> > +			err("Shrinking window, flow %u: %s", i, strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, i, -rc);
> > +		}
> > +	}
> > +
> > +	/* Now send window updates.
> > +	 * We'll flip repair mode back on in a bit */
> > +	if ((rc = flow_migrate_repair_all(c, false)))
> > +		return -rc;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
> > + * @c:		Execution context
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor (unused)
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > +			    int fd)
> > +{
> > +	int rc;
> > +
> > +	(void)stage;
> > +	(void)fd;
> > +
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source() - Dump all the remaining information and send data
> > + * @c:		Execution context (unused)
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd)
> > +{
> > +	uint32_t count = 0;
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)c;
> > +	(void)stage;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX)
> > +		count++;
> > +
> > +	count = htonl(count);
> > +	if ((rc = write_all_buf(fd, &count, sizeof(count)))) {
> > +		rc = errno;
> > +		err_perror("Can't send flow count (%u)", ntohl(count));
> > +		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > +	}
> > +
> > +	debug("Sending %u flows", ntohl(count));
> > +
> > +	/* Dump and send information that can be stored in the flow table */
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_source(fd, &flow->tcp))) {
> > +			err("Can't send data, flow %u: %s", i, strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
> > +		}
> > +	}
> > +
> > +	/* And then "extended" data (including window data we saved
> > +	 * previously):
> > +	 * the target needs to set repair mode on sockets before it can set
> > +	 * this stuff, but it needs sockets (and flows) for that.
> > +	 *
> > +	 * This also closes sockets so that the target can start connecting
> > +	 * theirs: you can't sendmsg() to queues (using the socket) if the
> > +	 * socket is not connected (EPIPE), not even in repair mode. And the
> > +	 * target needs to restore queues now because we're sending the data.
> > +	 *
> > +	 * So, no rollback here, just try as hard as we can.
> > +	 */
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp)))
> > +			err("Extended data for flow %u: %s", i, strerror_(-rc));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_target() - Receive flows and insert in flow table
> > + * @c:		Execution context
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd)
> > +{
> > +	uint32_t count;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)stage;
> > +
> > +	if (read_all_buf(fd, &count, sizeof(count)))
> > +		return errno;
> > +
> > +	count = ntohl(count);
> > +	debug("Receiving %u flows", count);
> > +
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	repair_flush(c);
>
> Unnecessary, flow_migrate_repair_all() already handles this.
>
> > +
> > +	/* TODO: flow header with type, instead?
> > +	 */
> > +	for (i = 0; i < count; i++) {
> > +		rc = tcp_flow_migrate_target(c, fd);
> > +		if (rc) {
> > +			debug("Bad target data for flow %u: %s, abort",
> > +			      i, strerror_(-rc));
> > +			return -rc;
> > +		}
> > +	}
> > +
> > +	repair_flush(c);
> > +
> > +	for (i = 0; i < count; i++) {
> > +		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
> > +		if (rc) {
> > +			debug("Bad target extended data for flow %u: %s, abort",
> > +			      i, strerror_(-rc));
> > +			return -rc;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * flow_init() - Initialise flow related data structures
> >   */
> > diff --git a/flow.h b/flow.h
> > index 24ba3ef..675726e 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -249,6 +249,14 @@ union flow;
> >
> >  void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > +			      int fd);
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > +			    int fd);
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd);
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd);
> >
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	__attribute__((format(printf, 3, 4)));
> > diff --git a/migrate.c b/migrate.c
> > index 1c59016..c5c6663 100644
> > --- a/migrate.c
> > +++ b/migrate.c
> > @@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
> >
> >  /* Stages for version 1 */
> >  static const struct migrate_stage stages_v1[] = {
> > +	/* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
> > +	 * *sometimes* closes the connection for real.
> > +	 */
> > +/*	{
> > +		.name = "shrink TCP windows",
> > +		.source = flow_migrate_source_early,
> > +		.target = NULL,
> > +	},
> > +*/
>
> Given we're not sure if this will help, and it adds some
> complications, probably makes sense to split this into a separate
> patch.

I'd rather not: because of this, the code for the case *without* it is
a bit different anyway, so we avoid some code churn. I would also like
to merge it unused, to make our lives a bit easier the day we retry to
work on it.

> >  	{
> >  		.name = "observed addresses",
> >  		.source = seen_addrs_source_v1,
> >  		.target = seen_addrs_target_v1,
> >  	},
> > +	{
> > +		.name = "prepare flows",
> > +		.source = flow_migrate_source_pre,
> > +		.target = NULL,
> > +	},
> > +	{
> > +		.name = "transfer flows",
> > +		.source = flow_migrate_source,
> > +		.target = flow_migrate_target,
> > +	},
> >  	{ 0 },
> >  };
> >
> > diff --git a/passt.c b/passt.c
> > index 6f9fb4d..68d1a28 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -223,9 +223,6 @@ int main(int argc, char **argv)
> >  		if (sigaction(SIGCHLD, &sa, NULL))
> >  			die_perror("Couldn't install signal handlers");
> >
> > -		if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > -			die_perror("Couldn't set disposition for SIGPIPE");
> > -
> >  		c.mode = MODE_PASTA;
> >  	} else if (strstr(name, "passt")) {
> >  		c.mode = MODE_PASST;
> > @@ -233,6 +230,9 @@ int main(int argc, char **argv)
> >  		_exit(EXIT_FAILURE);
> >  	}
> >
> > +	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > +		die_perror("Couldn't set disposition for SIGPIPE");
> > +
> >  	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
> >
> >  	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
> > diff --git a/repair.c b/repair.c
> > index 784b994..da85edb 100644
> > --- a/repair.c
> > +++ b/repair.c
> > @@ -190,7 +190,6 @@ int repair_flush(struct ctx *c)
> >   *
> >   * Return: 0 on success, negative error code on failure
> >   */
> > -/* cppcheck-suppress unusedFunction */
> >  int repair_set(struct ctx *c, int s, int cmd)
> >  {
> >  	int rc;
> > diff --git a/tcp.c b/tcp.c
> > index af6bd95..78db64f 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -280,6 +280,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -287,6 +288,8 @@
> >  #include
> >  #include
> >
> > +#include
> > +
> >  #include "checksum.h"
> >  #include "util.h"
> >  #include "iov.h"
> > @@ -299,6 +302,7 @@
> >  #include "log.h"
> >  #include "inany.h"
> >  #include "flow.h"
> > +#include "repair.h"
> >  #include "linux_dep.h"
> >
> >  #include "flow_table.h"
> > @@ -326,6 +330,19 @@
> >  	 ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
> >  #define CONN_HAS(conn, set)	(((conn)->events & (set)) == (set))
> >
> > +/* Buffers to migrate pending data from send and receive queues. No, they don't
> > + * use memory if we don't use them. And we're going away after this, so splurge.
> > + */
> > +#define TCP_MIGRATE_SND_QUEUE_MAX	(64 << 20)
> > +#define TCP_MIGRATE_RCV_QUEUE_MAX	(64 << 20)
> > +uint8_t tcp_migrate_snd_queue[TCP_MIGRATE_SND_QUEUE_MAX];
> > +uint8_t tcp_migrate_rcv_queue[TCP_MIGRATE_RCV_QUEUE_MAX];
> > +
> > +#define TCP_MIGRATE_RESTORE_CHUNK_MIN	1024	/* Try smaller when above this */
> > +
> > +/* "Extended" data (not stored in the flow table) for TCP flow migration */
> > +static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > +
> >  static const char *tcp_event_str[] __attribute((__unused__)) = {
> >  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
> >
> > @@ -2645,3 +2662,775 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
> >  	if (c->mode == MODE_PASTA)
> >  		tcp_splice_refill(c);
> >  }
> > +
> > +/**
> > + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> > + * @c:		Execution context
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > +	int rc = 0;
> > +
> > +	if ((rc =
> > +	     repair_set(c, conn->sock, TCP_REPAIR_ON)))
> > +		err("Failed to set TCP_REPAIR");
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
> > + * @c:		Execution context
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > +	int rc = 0;
> > +
> > +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
> > +		err("Failed to clear TCP_REPAIR");
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_queues() - Read or write socket queues, or send unsent data
> > + * @s:		Socket
> > + * @sndbuf:	Send queue buffer read or written/sent depending on @set
> > + * @sndlen:	Length of send queue buffer to set, network order
> > + * @notsentlen:	Length of not sent data, non-zero to actually _send_
> > + * @rcvbuf:	Receive queue buffer, read or written depending on @set
> > + * @rcvlen:	Length of receive queue buffer to set, network order
> > + * @set:	Set or send (unsent data only) if true, dump if false
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + *
> > + * #syscalls:vu ioctl
> > + */
> > +static int tcp_flow_repair_queues(int s,
> > +				  uint8_t *sndbuf,
> > +				  uint32_t *sndlen, uint32_t *notsentlen,
> > +				  uint8_t *rcvbuf, uint32_t *rcvlen,
> > +				  bool set)
> > +{
> > +	ssize_t rc;
> > +	int v;
> > +
> > +	if (set && rcvbuf) { /* FIXME: can't check notsentlen, rework this */
>
> Not really clear why this is its own block, rather than part of the if
> (set) below.

Because in the general case we need to select the send queue, both for
set (set/write the send queue) and for !set (get/dump the send queue),
but not in the case where we write to the "real" socket (send the send
queue, don't write it). That happens to be the case where I pass
rcvbuf as NULL, and that's how I grossly identified that subcase.
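For illustration only (the enum and helper names below are invented, not
from the patch), the implicit "rcvbuf == NULL means send-for-real"
dispatch could be made explicit along these lines:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: name the three cases of tcp_flow_repair_queues()
 * instead of inferring them from @set and a NULL @rcvbuf. QUEUE_SEND is
 * the one case where TCP_SEND_QUEUE must not be selected, because data
 * is sent on the connected socket rather than written into the queue.
 */
enum repair_queue_op {
	QUEUE_DUMP,	/* !set: read queue contents out of the socket */
	QUEUE_SET,	/* set, with @rcvbuf: write contents into queues */
	QUEUE_SEND,	/* set, @rcvbuf == NULL: send unsent data for real */
};

enum repair_queue_op repair_queue_op(bool set, const uint8_t *rcvbuf)
{
	if (!set)
		return QUEUE_DUMP;

	return rcvbuf ? QUEUE_SET : QUEUE_SEND;
}

/* Whether TCP_REPAIR_QUEUE must select TCP_SEND_QUEUE first */
bool need_send_queue_select(enum repair_queue_op op)
{
	return op != QUEUE_SEND;
}
```

With an explicit operation argument, the first block's purpose would be
self-documenting and the FIXME about not being able to check notsentlen
would go away.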
This function needs a complete rewrite; it got out of hand.

> > +		v = TCP_SEND_QUEUE;
> > +		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> > +			rc = -errno;
> > +			err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
> > +			return rc;
> > +		}
> > +	}
> > +
> > +	if (set) {
> >
> > [...]
> >
> > +/**
> > + * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> > + * @fd:	Descriptor for state migration
> > + * @fidx:	Flow index
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_migrate_source_ext(int fd, int fidx,
> > +				const struct tcp_tap_conn *conn)
> > +{
> > +	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
> > +	struct tcp_repair_window wnd;
> > +	int s = conn->sock;
> > +	int rc;
> > +
> > +	/* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */
> > +	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
> > +			    &t->sock_max_window, &t->sock_rcv_wnd,
> > +			    &t->sock_rcv_wup, &wnd);
> > +
> > +	rc = tcp_flow_repair_seq(s, &t->sock_seq_snd, &t->sock_seq_rcv, false);
> > +	if (rc) {
> > +		err("Failed to get sequences on source for socket %i: %s",
> > +		    s, strerror_(-rc));
> > +		return rc;
> > +	}
> > +
> > +	tcp_flow_repair_opt(s, &t->snd_wscale, &t->rcv_wscale, &t->sock_mss,
> > +			    false);
> > +
> > +	/* Dump receive queue as late as possible */
>
> Hmm.. why?

I'm not entirely sure it makes sense, but my thought was: if we spend
time doing other things *after* dumping the receive queue, then the
part of it we risk missing (data that arrives in the receive queue
after we dump it) is bigger. If we dump (and transfer) it last, we
should decrease the risk of missing data.

-- 
Stefano
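P.S.: for reference, the stream framing used by flow_migrate_source()
and flow_migrate_target() above is a single uint32_t flow count in
network byte order, followed by the per-flow records. A minimal sketch
of just the count exchange (helper names made up; the real code uses
write_all_buf()/read_all_buf(), which retry short transfers, while a
single write()/read() is enough for a 4-byte demo over a pipe):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: write the flow count in network order, as the source does */
int send_flow_count(int fd, uint32_t count)
{
	uint32_t wire = htonl(count);	/* network order on the wire */

	return write(fd, &wire, sizeof(wire)) == sizeof(wire) ? 0 : -1;
}

/* Sketch: read the flow count back, converting to host order */
int recv_flow_count(int fd, uint32_t *count)
{
	uint32_t wire;

	if (read(fd, &wire, sizeof(wire)) != sizeof(wire))
		return -1;

	*count = ntohl(wire);
	return 0;
}
```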
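P.P.S.: on the die()-on-rollback-failure question, one hypothetical
shape for letting callers pick the behaviour (names invented for
illustration, not from the patch):

```c
#include <stdbool.h>

/* Hypothetical: callers tell the rollback path how to treat failures to
 * clear TCP_REPAIR. If repair mode may never have been enabled (say,
 * passt-repair isn't there), failing to disable it is expected and
 * harmless; if sockets really were in repair mode, carrying on with
 * some of them still in repair mode is worse than dying.
 */
enum rollback_mode {
	ROLLBACK_BEST_EFFORT,	/* repair mode may never have been on */
	ROLLBACK_STRICT,	/* it was on: failure to clear is fatal */
};

bool rollback_failure_is_fatal(enum rollback_mode mode, bool clear_failed)
{
	return mode == ROLLBACK_STRICT && clear_failed;
}
```

flow_migrate_source_rollback() could then take the mode from its callers
and die() only when the strict path reports a failure.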