public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v3 0/3] More graceful handling of migration without passt-repair
@ 2025-02-26  6:04 David Gibson
  2025-02-26  6:04 ` [PATCH v3 1/3] migrate, flow: Trivially succeed if migrating with no flows David Gibson
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: David Gibson @ 2025-02-26  6:04 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

From Red Hat internal testing we've had some reports that if
attempting to migrate without passt-repair, the failure mode is uglier
than we'd like.  The migration fails, which is somewhat expected, but
we don't correctly roll things back on the source, so it breaks
network there as well.

Handle this more gracefully allowing the migration to proceed in this
case, but allow TCP connections to break

I've test patches 1..2/3 reasonably:
 * I get a clean migration if there are now active flows
 * Migration completes, although connections are broken if
   passt-repair isn't connected
 * Basic test suite (minus perf)

Patch 3 I've run the basic test suite on, but haven't tested the
specific functionality of.  Alas, I've spent most of today battling
with RHEL, virt-install, unshare and various other things trying to
create a test environment simulating two hosts with (possibly)
different addresses.  Despite the example given by Stefano in reply to
the previous version, I haven't really succeeded yet.

There are more fragile cases that I'm looking to fix, particularly the
die()s in flow_migrate_source_rollback() and elsewhere, however I ran
into various complications that I didn't manage to sort out today.
I'll continue looking at those tomorrow.

David Gibson (3):
  migrate, flow: Trivially succeed if migrating with no flows
  migrate, flow: Don't attempt to migrate TCP flows without passt-repair
  migrate, tcp: Don't attempt to carry on migration after
    flow_alloc_cancel()

 flow.c | 17 +++++++++++++++--
 tcp.c  |  5 ++++-
 2 files changed, 19 insertions(+), 3 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3 1/3] migrate, flow: Trivially succeed if migrating with no flows
  2025-02-26  6:04 [PATCH v3 0/3] More graceful handling of migration without passt-repair David Gibson
@ 2025-02-26  6:04 ` David Gibson
  2025-02-26  6:04 ` [PATCH v3 2/3] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2025-02-26  6:04 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

We could get a migration request when we have no active flows; or at least
none that we need or are able to migrate.  In this case after sending or
receiving the number of flows we continue to step through various lists.

In the target case, this could include communication with passt-repair.  If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.

Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/flow.c b/flow.c
index bb5dcc3c..6cf96c26 100644
--- a/flow.c
+++ b/flow.c
@@ -999,6 +999,9 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 
 	debug("Sending %u flows", ntohl(count));
 
+	if (!count)
+		return 0;
+
 	/* Dump and send information that can be stored in the flow table.
 	 *
 	 * Limited rollback options here: if we fail to transfer any data (that
@@ -1070,6 +1073,9 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	count = ntohl(count);
 	debug("Receiving %u flows", count);
 
+	if (!count)
+		return 0;
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
-- 
@@ -999,6 +999,9 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 
 	debug("Sending %u flows", ntohl(count));
 
+	if (!count)
+		return 0;
+
 	/* Dump and send information that can be stored in the flow table.
 	 *
 	 * Limited rollback options here: if we fail to transfer any data (that
@@ -1070,6 +1073,9 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	count = ntohl(count);
 	debug("Receiving %u flows", count);
 
+	if (!count)
+		return 0;
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v3 2/3] migrate, flow: Don't attempt to migrate TCP flows without passt-repair
  2025-02-26  6:04 [PATCH v3 0/3] More graceful handling of migration without passt-repair David Gibson
  2025-02-26  6:04 ` [PATCH v3 1/3] migrate, flow: Trivially succeed if migrating with no flows David Gibson
@ 2025-02-26  6:04 ` David Gibson
  2025-02-26  6:04 ` [PATCH v3 3/3] migrate, tcp: Don't attempt to carry on migration after flow_alloc_cancel() David Gibson
  2025-02-26 19:01 ` [PATCH v3 0/3] More graceful handling of migration without passt-repair Stefano Brivio
  3 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2025-02-26  6:04 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.  If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode.  In some cases we may not roll back these changes properly, meaning
we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we have
no connection from passt-repair carry on with the migration, but don't
attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index 6cf96c26..749c4984 100644
--- a/flow.c
+++ b/flow.c
@@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
+	/* If we don't have a repair helper, there's nothing we can do */
+	if (c->fd_repair < 0)
+		return 0;
+
 	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
@@ -987,8 +991,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow)
-		count++;
+	/* If we don't have a repair helper, we can't migrate TCP flows */
+	if (c->fd_repair >= 0) {
+		foreach_established_tcp_flow(flow)
+			count++;
+	}
 
 	count = htonl(count);
 	if (write_all_buf(fd, &count, sizeof(count))) {
-- 
@@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
+	/* If we don't have a repair helper, there's nothing we can do */
+	if (c->fd_repair < 0)
+		return 0;
+
 	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
@@ -987,8 +991,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow)
-		count++;
+	/* If we don't have a repair helper, we can't migrate TCP flows */
+	if (c->fd_repair >= 0) {
+		foreach_established_tcp_flow(flow)
+			count++;
+	}
 
 	count = htonl(count);
 	if (write_all_buf(fd, &count, sizeof(count))) {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v3 3/3] migrate, tcp: Don't attempt to carry on migration after flow_alloc_cancel()
  2025-02-26  6:04 [PATCH v3 0/3] More graceful handling of migration without passt-repair David Gibson
  2025-02-26  6:04 ` [PATCH v3 1/3] migrate, flow: Trivially succeed if migrating with no flows David Gibson
  2025-02-26  6:04 ` [PATCH v3 2/3] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson
@ 2025-02-26  6:04 ` David Gibson
  2025-02-26 19:01 ` [PATCH v3 0/3] More graceful handling of migration without passt-repair Stefano Brivio
  3 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2025-02-26  6:04 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on.  This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process.  However, it can't
quite work: the flow_alloc_cancel() means that the flows in the target's
flow table are no longer one to one match to the flows which the source
is sending data for.  This means that data for later flows will be
mismatched to a different flow.  Most likely that will cause some nasty
error later, but even worse it might appear to succeed but lead to data
corruption due to incorrectly restoring one of the flows.

Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it.  Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow.  To make
what's going on clearer, though, put an explicit test for a bad socket fd
in tcp_flow_migrate_target_ext() and discard at that point.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index e3c0a53b..f713fa99 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3376,7 +3376,6 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 
 	if ((rc = tcp_flow_repair_socket(c, conn))) {
 		flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
-		flow_alloc_cancel(flow);
 		return 0;
 	}
 
@@ -3452,6 +3451,10 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		return rc;
 	}
 
+	if (conn->sock < 0)
+		/* We weren't able to create the socket, discard flow */
+		goto fail;
+
 	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
 		goto fail;
 
-- 
@@ -3376,7 +3376,6 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
 
 	if ((rc = tcp_flow_repair_socket(c, conn))) {
 		flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
-		flow_alloc_cancel(flow);
 		return 0;
 	}
 
@@ -3452,6 +3451,10 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 		return rc;
 	}
 
+	if (conn->sock < 0)
+		/* We weren't able to create the socket, discard flow */
+		goto fail;
+
 	if (tcp_flow_select_queue(s, TCP_SEND_QUEUE))
 		goto fail;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 0/3] More graceful handling of migration without passt-repair
  2025-02-26  6:04 [PATCH v3 0/3] More graceful handling of migration without passt-repair David Gibson
                   ` (2 preceding siblings ...)
  2025-02-26  6:04 ` [PATCH v3 3/3] migrate, tcp: Don't attempt to carry on migration after flow_alloc_cancel() David Gibson
@ 2025-02-26 19:01 ` Stefano Brivio
  2025-02-27  1:23   ` David Gibson
  3 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2025-02-26 19:01 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 26 Feb 2025 17:04:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> From Red Hat internal testing we've had some reports that if
> attempting to migrate without passt-repair, the failure mode is uglier
> than we'd like.  The migration fails, which is somewhat expected, but
> we don't correctly roll things back on the source, so it breaks
> network there as well.
> 
> Handle this more gracefully allowing the migration to proceed in this
> case, but allow TCP connections to break
> 
> I've test patches 1..2/3 reasonably:
>  * I get a clean migration if there are now active flows
>  * Migration completes, although connections are broken if
>    passt-repair isn't connected
>  * Basic test suite (minus perf)
> 
> Patch 3 I've run the basic test suite on, but haven't tested the
> specific functionality of.  Alas, I've spent most of today battling
> with RHEL, virt-install, unshare and various other things trying to
> create a test environment simulating two hosts with (possibly)
> different addresses.

Sorry, I've been busy with other stuff most of the day and I didn't
manage to test the specific functionality of 3/3 either. :(

I reviewed this and it all looks good to me but I'm not sure if I
should merge this now or wait until we manage to test the 3/3 case. Let
me know your preference. I can also merge up to 2/3 if it makes things
more convenient.

-- 
Stefano


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 0/3] More graceful handling of migration without passt-repair
  2025-02-26 19:01 ` [PATCH v3 0/3] More graceful handling of migration without passt-repair Stefano Brivio
@ 2025-02-27  1:23   ` David Gibson
  0 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2025-02-27  1:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1868 bytes --]

On Wed, Feb 26, 2025 at 08:01:13PM +0100, Stefano Brivio wrote:
> On Wed, 26 Feb 2025 17:04:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > From Red Hat internal testing we've had some reports that if
> > attempting to migrate without passt-repair, the failure mode is uglier
> > than we'd like.  The migration fails, which is somewhat expected, but
> > we don't correctly roll things back on the source, so it breaks
> > network there as well.
> > 
> > Handle this more gracefully allowing the migration to proceed in this
> > case, but allow TCP connections to break
> > 
> > I've test patches 1..2/3 reasonably:
> >  * I get a clean migration if there are now active flows
> >  * Migration completes, although connections are broken if
> >    passt-repair isn't connected
> >  * Basic test suite (minus perf)
> > 
> > Patch 3 I've run the basic test suite on, but haven't tested the
> > specific functionality of.  Alas, I've spent most of today battling
> > with RHEL, virt-install, unshare and various other things trying to
> > create a test environment simulating two hosts with (possibly)
> > different addresses.
> 
> Sorry, I've been busy with other stuff most of the day and I didn't
> manage to test the specific functionality of 3/3 either. :(
> 
> I reviewed this and it all looks good to me but I'm not sure if I
> should merge this now or wait until we manage to test the 3/3 case. Let
> me know your preference. I can also merge up to 2/3 if it makes things
> more convenient.

Feel free to merge 1..2/3 whenever you want.  I'll probably send a new
spin with some other things fixed as well today.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-02-27  1:57 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-26  6:04 [PATCH v3 0/3] More graceful handling of migration without passt-repair David Gibson
2025-02-26  6:04 ` [PATCH v3 1/3] migrate, flow: Trivially succeed if migrating with no flows David Gibson
2025-02-26  6:04 ` [PATCH v3 2/3] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson
2025-02-26  6:04 ` [PATCH v3 3/3] migrate, tcp: Don't attempt to carry on migration after flow_alloc_cancel() David Gibson
2025-02-26 19:01 ` [PATCH v3 0/3] More graceful handling of migration without passt-repair Stefano Brivio
2025-02-27  1:23   ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).