* [PATCH 0/1] migrate: Eliminate listening_sock field from TCP connection state
@ 2026-01-30 5:58 David Gibson
2026-01-30 5:58 ` [PATCH 1/1] migrate: Use forward table information to close() listening sockets David Gibson
0 siblings, 1 reply; 6+ messages in thread
From: David Gibson @ 2026-01-30 5:58 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
A bunch of the things I'm working on right now are made more difficult
by the lack of spare space in the TCP connection structure (without
adding an additional cache line).
I realised that even the minimal forwarding table implementation we
have now allows at least one easy win though: the listening_sock field
can be eliminated pretty easily.
David Gibson (1):
migrate: Use forward table information to close() listening sockets
flow.c | 12 ++++++++++++
fwd.c | 21 +++++++++++++++++++++
fwd.h | 1 +
tcp.c | 9 ---------
tcp_conn.h | 3 ---
5 files changed, 34 insertions(+), 12 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/1] migrate: Use forward table information to close() listening sockets
2026-01-30 5:58 [PATCH 0/1] migrate: Eliminate listening_sock field from TCP connection state David Gibson
@ 2026-01-30 5:58 ` David Gibson
2026-01-31 9:47 ` Stefano Brivio
0 siblings, 1 reply; 6+ messages in thread
From: David Gibson @ 2026-01-30 5:58 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
On incoming migrations we need to bind() reconstructed sockets to their
correct local address. We can't do this if the origin passt instance is
in the same namespace and still has those addresses bound. Arguably that's
a bug in bind()s operation during repair mode, but for now we have to work
around it.
So, to allow local-to-local migrations we close() sockets on the outgoing
side as we process them. In addition to closing the connected socket we
also have to close the associated listen()ing socket, because that can also
cause an address conflict.
To do that, we introduced the listening_sock field in the connection
state, because we had no other way to find the right listening sockets.
Now that we have the forwarding table, we have a complete list of
listening sockets elsewhere. We can use that instead, to close all
listening sockets on outbound migration, rather than just the ones that
might conflict.
This is cleaner and, importantly, saves a valuable 32-bits in the flow
state structure. It does mean that there is a longer window where a peer
attempting to connect during migration might get a Connection Refused.
I think this is an acceptable trade-off for now: arguably we should not
allow local-to-local migrations in any case, since the socket closes make
it impossible to safely roll back migration as per the qemu model.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
flow.c | 12 ++++++++++++
fwd.c | 21 +++++++++++++++++++++
fwd.h | 1 +
tcp.c | 9 ---------
tcp_conn.h | 3 ---
5 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/flow.c b/flow.c
index fd4d5f38..5207143d 100644
--- a/flow.c
+++ b/flow.c
@@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
debug("...roll back migration");
+ if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
+ die("Failed to re-establish listening sockets");
+
foreach_established_tcp_flow(flow) {
if (FLOW_IDX(flow) >= bound)
break;
@@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
return flow_migrate_source_rollback(c, FLOW_MAX, rc);
}
+ /* HACK: A local to local migrate will fail if the origin passt has the
+ * listening sockets still open when the destination passt tries to bind
+ * them. This does mean there's a window where we lost our listen()s,
+ * even if the migration is rolled back later. The only way to really
+ * fix that is to not allow local to local migration, which arguably we
+ * should (use namespaces for testing instead). */
+ debug("Stop listen()s");
+ fwd_listen_close(&c->tcp.fwd_in);
+
debug("Sending %u flows", ntohl(count));
if (!count)
diff --git a/fwd.c b/fwd.c
index edbeaf44..4052b797 100644
--- a/fwd.c
+++ b/fwd.c
@@ -654,6 +654,27 @@ int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd,
return 0;
}
+/** fwd_listen_close() - Close all listening sockets
+ * @fwd: Forwarding information
+ */
+void fwd_listen_close(const struct fwd_ports *fwd)
+{
+ unsigned i;
+
+ for (i = 0; i < fwd->count; i++) {
+ const struct fwd_rule *rule = &fwd->rules[i];
+ unsigned port;
+
+ for (port = rule->first; port <= rule->last; port++) {
+ int *fdp = &rule->socks[port - rule->first];
+ if (*fdp >= 0) {
+ close(*fdp);
+ *fdp = -1;
+ }
+ }
+ }
+}
+
/* See enum in kernel's include/net/tcp_states.h */
#define UDP_LISTEN 0x07
#define TCP_LISTEN 0x0a
diff --git a/fwd.h b/fwd.h
index a5dc89db..16070111 100644
--- a/fwd.h
+++ b/fwd.h
@@ -118,6 +118,7 @@ void fwd_scan_ports_timer(struct ctx * c, const struct timespec *now);
int fwd_listen_sync(const struct ctx *c, const struct fwd_ports *fwd,
uint8_t pif, uint8_t proto);
+void fwd_listen_close(const struct fwd_ports *fwd);
bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
union inany_addr *translated);
diff --git a/tcp.c b/tcp.c
index 17e5b006..8d036b61 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1719,7 +1719,6 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
conn->sock = s;
conn->timer = -1;
- conn->listening_sock = -1;
flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT);
if (flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, s, TGTSIDE) < 0) {
flow_perror(flow, "Can't register with epoll");
@@ -2483,7 +2482,6 @@ static void tcp_tap_conn_from_sock(const struct ctx *c, union flow *flow,
void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
const struct timespec *now)
{
- struct tcp_tap_conn *conn;
union sockaddr_inany sa;
socklen_t sl = sizeof(sa);
struct flowside *ini;
@@ -2499,9 +2497,6 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
if (s < 0)
goto cancel;
- conn = (struct tcp_tap_conn *)flow;
- conn->listening_sock = ref.fd;
-
tcp_sock_set_nodelay(s);
/* FIXME: If useful: when the listening port has a specific bound
@@ -3443,9 +3438,6 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
return rc;
}
- if (conn->listening_sock != -1 && !fcntl(conn->listening_sock, F_GETFD))
- close(conn->listening_sock);
-
return 0;
}
@@ -3655,7 +3647,6 @@ static int tcp_flow_repair_connect(const struct ctx *c,
}
conn->timer = -1;
- conn->listening_sock = -1;
return 0;
}
diff --git a/tcp_conn.h b/tcp_conn.h
index 9c6ff9ee..21cea109 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -18,7 +18,6 @@
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
* @sock: Socket descriptor number
* @events: Connection events, implying connection states
- * @listening_sock: Listening socket this socket was accept()ed from, or -1
* @timer: timerfd descriptor for timeout events
* @flags: Connection flags representing internal attributes
* @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
@@ -75,8 +74,6 @@ struct tcp_tap_conn {
#define CONN_STATE_BITS /* Setting these clears other flags */ \
(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
- int listening_sock;
-
int timer :FD_REF_BITS;
uint8_t flags;
--
2.52.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets
2026-01-30 5:58 ` [PATCH 1/1] migrate: Use forward table information to close() listening sockets David Gibson
@ 2026-01-31 9:47 ` Stefano Brivio
2026-02-02 0:24 ` David Gibson
0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2026-01-31 9:47 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Fri, 30 Jan 2026 16:58:11 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On incoming migrations we need to bind() reconstructed sockets to their
> correct local address. We can't do this if the origin passt instance is
> in the same namespace and still has those addresses bound. Arguably that's
> a bug in bind()s operation during repair mode, but for now we have to work
> around it.
>
> So, to allow local-to-local migrations we close() sockets on the outgoing
> side as we process them. In addition to closing the connected socket we
> also have to close the associated listen()ing socket, because that can also
> cause an address conflict.
>
> To do that, we introduced the listening_sock field in the connection
> state, because we had no other way to find the right listening sockets.
> Now that we have the forwarding table, we have a complete list of
> listening sockets elsewhere. We can use that instead, to close all
> listening sockets on outbound migration, rather than just the ones that
> might conflict.
>
> This is cleaner and, importantly, saves a valuable 32-bits in the flow
> state structure. It does mean that there is a longer window where a peer
> attempting to connect during migration might get a Connection Refused.
> I think this is an acceptable trade-off for now: arguably we should not
> allow local-to-local migrations in any case, since the socket closes make
> it impossible to safely roll back migration as per the qemu model.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> flow.c | 12 ++++++++++++
> fwd.c | 21 +++++++++++++++++++++
> fwd.h | 1 +
> tcp.c | 9 ---------
> tcp_conn.h | 3 ---
> 5 files changed, 34 insertions(+), 12 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index fd4d5f38..5207143d 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
>
> debug("...roll back migration");
>
> + if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
> + die("Failed to re-establish listening sockets");
> +
> foreach_established_tcp_flow(flow) {
> if (FLOW_IDX(flow) >= bound)
> break;
> @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
Nit: the comment to this function currently says "Send data (flow
table) for flow, close listening". I fixed that up (dropped ", close listening").
> return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> }
>
> + /* HACK: A local to local migrate will fail if the origin passt has the
> + * listening sockets still open when the destination passt tries to bind
> + * them. This does mean there's a window where we lost our listen()s,
> + * even if the migration is rolled back later. The only way to really
> + * fix that is to not allow local to local migration, which arguably we
> + * should (use namespaces for testing instead). */
Actually, we already use namespaces in the current tests, but we didn't
(always) do that during development, and it might be convenient in
general to have the possibility to test *a part* of the implementation
using the same namespace as long as it's reasonably cheap (it seems to
be).
That's just a part because anyway bind() and connect() will conflict,
if we're in the same namespace, which is a kernel issue you already
noted:
https://pad.passt.top/p/TcpRepairTodo#L3
Repair mode sockets should not have address conflicts with non-repair
sockets (both bind() and connect())
but even that part is convenient to have, I think, so I'm a bit worried
that somebody might take this comment as a to-do item, while I don't
think it should be.
Patch applied anyway, to give this as much testing time and exposure as
possible.
--
Stefano
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets
2026-01-31 9:47 ` Stefano Brivio
@ 2026-02-02 0:24 ` David Gibson
2026-02-02 22:12 ` Stefano Brivio
0 siblings, 1 reply; 6+ messages in thread
From: David Gibson @ 2026-02-02 0:24 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 5125 bytes --]
On Sat, Jan 31, 2026 at 10:47:28AM +0100, Stefano Brivio wrote:
> On Fri, 30 Jan 2026 16:58:11 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On incoming migrations we need to bind() reconstructed sockets to their
> > correct local address. We can't do this if the origin passt instance is
> > in the same namespace and still has those addresses bound. Arguably that's
> > a bug in bind()s operation during repair mode, but for now we have to work
> > around it.
> >
> > So, to allow local-to-local migrations we close() sockets on the outgoing
> > side as we process them. In addition to closing the connected socket we
> > also have to close the associated listen()ing socket, because that can also
> > cause an address conflict.
> >
> > To do that, we introduced the listening_sock field in the connection
> > state, because we had no other way to find the right listening sockets.
> > Now that we have the forwarding table, we have a complete list of
> > listening sockets elsewhere. We can use that instead, to close all
> > listening sockets on outbound migration, rather than just the ones that
> > might conflict.
> >
> > This is cleaner and, importantly, saves a valuable 32-bits in the flow
> > state structure. It does mean that there is a longer window where a peer
> > attempting to connect during migration might get a Connection Refused.
> > I think this is an acceptable trade-off for now: arguably we should not
> > allow local-to-local migrations in any case, since the socket closes make
> > it impossible to safely roll back migration as per the qemu model.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > flow.c | 12 ++++++++++++
> > fwd.c | 21 +++++++++++++++++++++
> > fwd.h | 1 +
> > tcp.c | 9 ---------
> > tcp_conn.h | 3 ---
> > 5 files changed, 34 insertions(+), 12 deletions(-)
> >
> > diff --git a/flow.c b/flow.c
> > index fd4d5f38..5207143d 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
> >
> > debug("...roll back migration");
> >
> > + if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
> > + die("Failed to re-establish listening sockets");
> > +
> > foreach_established_tcp_flow(flow) {
> > if (FLOW_IDX(flow) >= bound)
> > break;
> > @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
>
> Nit: the comment to this function currently says "Send data (flow
> table) for flow, close listening". I fixed that up (dropped ", close listening").
Good point, thanks.
> > return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > }
> >
> > + /* HACK: A local to local migrate will fail if the origin passt has the
> > + * listening sockets still open when the destination passt tries to bind
> > + * them. This does mean there's a window where we lost our listen()s,
> > + * even if the migration is rolled back later. The only way to really
> > + * fix that is to not allow local to local migration, which arguably we
> > + * should (use namespaces for testing instead). */
>
> Actually, we already use namespaces in the current tests,
Oh, nice.
> but we didn't
> (always) do that during development, and it might be convenient in
> general to have the possibility to test *a part* of the implementation
> using the same namespace as long as it's reasonably cheap (it seems to
> be).
Depends what cost you're talking about. It's cheap in terms of
computational complexity, and code compexity. It means, however, that
we can't necessarily roll back failed migrations - i.e. resume on the
origin system. That isn't really correct for the qemu migration
model, which is why I think allowing local migrations probably isn't
the best idea, at least by default.
> That's just a part because anyway bind() and connect() will conflict,
> if we're in the same namespace, which is a kernel issue you already
> noted:
Well, it's a kernel issue that the bound listen()ing sockets conflict
with the half-constructed flow sockets. Having the listening sockets
of the origin passt conflict with the listening sockets of the
destination passt is pretty much expected, and would still be an
impediment to local migration.
> https://pad.passt.top/p/TcpRepairTodo#L3
> Repair mode sockets should not have address conflicts with non-repair
> sockets (both bind() and connect())
>
> but even that part is convenient to have, I think,
I'm not really sure what you mean by that.
> so I'm a bit worried
> that somebody might take this comment as a to-do item, while I don't
> think it should be.
>
> Patch applied anyway, to give this as much testing time and exposure as
> possible.
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets
2026-02-02 0:24 ` David Gibson
@ 2026-02-02 22:12 ` Stefano Brivio
2026-02-04 11:57 ` David Gibson
0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2026-02-02 22:12 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Mon, 2 Feb 2026 10:24:14 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Sat, Jan 31, 2026 at 10:47:28AM +0100, Stefano Brivio wrote:
> > On Fri, 30 Jan 2026 16:58:11 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On incoming migrations we need to bind() reconstructed sockets to their
> > > correct local address. We can't do this if the origin passt instance is
> > > in the same namespace and still has those addresses bound. Arguably that's
> > > a bug in bind()s operation during repair mode, but for now we have to work
> > > around it.
> > >
> > > So, to allow local-to-local migrations we close() sockets on the outgoing
> > > side as we process them. In addition to closing the connected socket we
> > > also have to close the associated listen()ing socket, because that can also
> > > cause an address conflict.
> > >
> > > To do that, we introduced the listening_sock field in the connection
> > > state, because we had no other way to find the right listening sockets.
> > > Now that we have the forwarding table, we have a complete list of
> > > listening sockets elsewhere. We can use that instead, to close all
> > > listening sockets on outbound migration, rather than just the ones that
> > > might conflict.
> > >
> > > This is cleaner and, importantly, saves a valuable 32-bits in the flow
> > > state structure. It does mean that there is a longer window where a peer
> > > attempting to connect during migration might get a Connection Refused.
> > > I think this is an acceptable trade-off for now: arguably we should not
> > > allow local-to-local migrations in any case, since the socket closes make
> > > it impossible to safely roll back migration as per the qemu model.
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > flow.c | 12 ++++++++++++
> > > fwd.c | 21 +++++++++++++++++++++
> > > fwd.h | 1 +
> > > tcp.c | 9 ---------
> > > tcp_conn.h | 3 ---
> > > 5 files changed, 34 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/flow.c b/flow.c
> > > index fd4d5f38..5207143d 100644
> > > --- a/flow.c
> > > +++ b/flow.c
> > > @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
> > >
> > > debug("...roll back migration");
> > >
> > > + if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
> > > + die("Failed to re-establish listening sockets");
> > > +
> > > foreach_established_tcp_flow(flow) {
> > > if (FLOW_IDX(flow) >= bound)
> > > break;
> > > @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> >
> > Nit: the comment to this function currently says "Send data (flow
> > table) for flow, close listening". I fixed that up (dropped ", close listening").
>
> Good point, thanks.
>
> > > return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > > }
> > >
> > > + /* HACK: A local to local migrate will fail if the origin passt has the
> > > + * listening sockets still open when the destination passt tries to bind
> > > + * them. This does mean there's a window where we lost our listen()s,
> > > + * even if the migration is rolled back later. The only way to really
> > > + * fix that is to not allow local to local migration, which arguably we
> > > + * should (use namespaces for testing instead). */
> >
> > Actually, we already use namespaces in the current tests,
>
> Oh, nice.
>
> > but we didn't
> > (always) do that during development, and it might be convenient in
> > general to have the possibility to test *a part* of the implementation
> > using the same namespace as long as it's reasonably cheap (it seems to
> > be).
>
> Depends what cost you're talking about. It's cheap in terms of
> computational complexity, and code compexity. It means, however, that
> we can't necessarily roll back failed migrations - i.e. resume on the
> origin system. That isn't really correct for the qemu migration
> model, which is why I think allowing local migrations probably isn't
> the best idea, at least by default.
Well it's not entirely true that we can't roll back failed migrations.
Depending on the stage we're at, we might close connections, but other
than that we can always continue, somehow. Losing some connections
looks relatively cheap as well.
But sure, it's not ideal. Maybe yes, we shouldn't have that as default,
add a new option and update QEMU's documentation (see below) with it.
> > That's just a part because anyway bind() and connect() will conflict,
> > if we're in the same namespace, which is a kernel issue you already
> > noted:
>
> Well, it's a kernel issue that the bound listen()ing sockets conflict
> with the half-constructed flow sockets. Having the listening sockets
> of the origin passt conflict with the listening sockets of the
> destination passt is pretty much expected, and would still be an
> impediment to local migration.
Ah, right, we don't set listening sockets to repair mode... should we,
by the way? With the fix for freezing incoming TCP queues I'm working
on, as a side effect, that would also mean that the kernel will ignore
SYN segments altogether, which is desirable I think. Maybe it already
happens for some other reason...?
But anyway, without forwarded ports you don't have listening sockets.
> > https://pad.passt.top/p/TcpRepairTodo#L3
> > Repair mode sockets should not have address conflicts with non-repair
> > sockets (both bind() and connect())
> >
> > but even that part is convenient to have, I think,
>
> I'm not really sure what you mean by that.
That we can still *partially* test local migration, without forwarded
ports.
We can check the interface, including passt-repair and all the
vhost-user bits without actual connections, in a simpler way compared
to namespaces.
See:
https://www.qemu.org/docs/master/system/devices/net.html#example-of-migration-of-a-guest-on-the-same-host
and:
https://lore.kernel.org/qemu-devel/20260131032700.12f27487@elisabeth/
it won't work for any practical case, but one can still test a big
chunk of functionality that way.
--
Stefano
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets
2026-02-02 22:12 ` Stefano Brivio
@ 2026-02-04 11:57 ` David Gibson
0 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2026-02-04 11:57 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 7522 bytes --]
On Mon, Feb 02, 2026 at 11:12:08PM +0100, Stefano Brivio wrote:
> On Mon, 2 Feb 2026 10:24:14 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Sat, Jan 31, 2026 at 10:47:28AM +0100, Stefano Brivio wrote:
> > > On Fri, 30 Jan 2026 16:58:11 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On incoming migrations we need to bind() reconstructed sockets to their
> > > > correct local address. We can't do this if the origin passt instance is
> > > > in the same namespace and still has those addresses bound. Arguably that's
> > > > a bug in bind()s operation during repair mode, but for now we have to work
> > > > around it.
> > > >
> > > > So, to allow local-to-local migrations we close() sockets on the outgoing
> > > > side as we process them. In addition to closing the connected socket we
> > > > also have to close the associated listen()ing socket, because that can also
> > > > cause an address conflict.
> > > >
> > > > To do that, we introduced the listening_sock field in the connection
> > > > state, because we had no other way to find the right listening sockets.
> > > > Now that we have the forwarding table, we have a complete list of
> > > > listening sockets elsewhere. We can use that instead, to close all
> > > > listening sockets on outbound migration, rather than just the ones that
> > > > might conflict.
> > > >
> > > > This is cleaner and, importantly, saves a valuable 32-bits in the flow
> > > > state structure. It does mean that there is a longer window where a peer
> > > > attempting to connect during migration might get a Connection Refused.
> > > > I think this is an acceptable trade-off for now: arguably we should not
> > > > allow local-to-local migrations in any case, since the socket closes make
> > > > it impossible to safely roll back migration as per the qemu model.
> > > >
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > > flow.c | 12 ++++++++++++
> > > > fwd.c | 21 +++++++++++++++++++++
> > > > fwd.h | 1 +
> > > > tcp.c | 9 ---------
> > > > tcp_conn.h | 3 ---
> > > > 5 files changed, 34 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/flow.c b/flow.c
> > > > index fd4d5f38..5207143d 100644
> > > > --- a/flow.c
> > > > +++ b/flow.c
> > > > @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
> > > >
> > > > debug("...roll back migration");
> > > >
> > > > + if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
> > > > + die("Failed to re-establish listening sockets");
> > > > +
> > > > foreach_established_tcp_flow(flow) {
> > > > if (FLOW_IDX(flow) >= bound)
> > > > break;
> > > > @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > >
> > > Nit: the comment to this function currently says "Send data (flow
> > > table) for flow, close listening". I fixed that up (dropped ", close listening").
> >
> > Good point, thanks.
> >
> > > > return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > > > }
> > > >
> > > > + /* HACK: A local to local migrate will fail if the origin passt has the
> > > > + * listening sockets still open when the destination passt tries to bind
> > > > + * them. This does mean there's a window where we lost our listen()s,
> > > > + * even if the migration is rolled back later. The only way to really
> > > > + * fix that is to not allow local to local migration, which arguably we
> > > > + * should (use namespaces for testing instead). */
> > >
> > > Actually, we already use namespaces in the current tests,
> >
> > Oh, nice.
> >
> > > but we didn't
> > > (always) do that during development, and it might be convenient in
> > > general to have the possibility to test *a part* of the implementation
> > > using the same namespace as long as it's reasonably cheap (it seems to
> > > be).
> >
> > Depends what cost you're talking about. It's cheap in terms of
> > computational complexity, and code compexity. It means, however, that
> > we can't necessarily roll back failed migrations - i.e. resume on the
> > origin system. That isn't really correct for the qemu migration
> > model, which is why I think allowing local migrations probably isn't
> > the best idea, at least by default.
>
> Well it's not entirely true that we can't roll back failed migrations.
>
> Depending on the stage we're at, we might close connections, but other
> than that we can always continue, somehow. Losing some connections
> looks relatively cheap as well.
I guess that's true. I'm not sure we currently handle this even as
well as is possible within the constraints.
> But sure, it's not ideal. Maybe yes, we shouldn't have that as default,
> add a new option and update QEMU's documentation (see below) with
> it.
There's two layers to it. Dropping connections on what's otherwise a
no-op migration failure is ugly. Opening the possibility that we
might not be able to rebind ports we were already listening on is
worse.
> > > That's just a part because anyway bind() and connect() will conflict,
> > > if we're in the same namespace, which is a kernel issue you already
> > > noted:
> >
> > Well, it's a kernel issue that the bound listen()ing sockets conflict
> > with the half-constructed flow sockets. Having the listening sockets
> > of the origin passt conflict with the listening sockets of the
> > destination passt is pretty much expected, and would still be an
> > impediment to local migration.
>
> Ah, right, we don't set listening sockets to repair mode... should we,
> by the way?
Uh.. I don't think so? I'm not really sure how repair mode operates
for listening sockets, or if it should even allow that. The relevant
state of a listening socket is pretty much limited to its bound
address, so I don't think we need any additional mechanism to extract
state.
> With the fix for freezing incoming TCP queues I'm working
> on, as a side effect, that would also mean that the kernel will ignore
> SYN segments altogether, which is desirable I think. Maybe it already
> happens for some other reason...?
>
> But anyway, without forwarded ports you don't have listening sockets.
True.
> > > https://pad.passt.top/p/TcpRepairTodo#L3
> > > Repair mode sockets should not have address conflicts with non-repair
> > > sockets (both bind() and connect())
> > >
> > > but even that part is convenient to have, I think,
> >
> > I'm not really sure what you mean by that.
>
> That we can still *partially* test local migration, without forwarded
> ports.
Ah, ok.
> We can check the interface, including passt-repair and all the
> vhost-user bits without actual connections, in a simpler way compared
> to namespaces.
>
> See:
>
> https://www.qemu.org/docs/master/system/devices/net.html#example-of-migration-of-a-guest-on-the-same-host
>
> and:
>
> https://lore.kernel.org/qemu-devel/20260131032700.12f27487@elisabeth/
>
> it won't work for any practical case, but one can still test a big
> chunk of functionality that way.
Good point.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-02-04 12:05 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-30 5:58 [PATCH 0/1] migrate: Eliminate listening_sock field from TCP connection state David Gibson
2026-01-30 5:58 ` [PATCH 1/1] migrate: Use forward table information to close() listening sockets David Gibson
2026-01-31 9:47 ` Stefano Brivio
2026-02-02 0:24 ` David Gibson
2026-02-02 22:12 ` Stefano Brivio
2026-02-04 11:57 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).