On Thu, Feb 20, 2025 at 09:07:26AM +0100, Stefano Brivio wrote:
> On Thu, 20 Feb 2025 17:03:18 +1100
> David Gibson wrote:
> 
> > Migrating TCP flows requires passt-repair in order to use
> > TCP_REPAIR.  If passt-repair is not started, our failure mode is
> > pretty ugly though: we'll attempt the migration, hitting various
> > problems when we can't enter repair mode.  In some cases we may not
> > roll back these changes properly, meaning we break network
> > connections on the source.
> > 
> > Our general approach is not to completely block migration if there
> > are problems, but simply to break any flows we can't migrate.  So,
> > if we have no connection from passt-repair, carry on with the
> > migration, but don't attempt to migrate any TCP connections.
> > 
> > Signed-off-by: David Gibson
> > ---
> >  flow.c | 11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 6cf96c26..749c4984 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
> >  	union flow *flow;
> >  	int rc;
> >  
> > +	/* If we don't have a repair helper, there's nothing we can do */
> > +	if (c->fd_repair < 0)
> > +		return 0;
> > +
> 
> This doesn't fix the behaviour in a relatively likely failure mode:
> passt-repair is there, but we can't communicate to it (LSM policy
> issues or similar).

Ah... true.  Although it shouldn't make it any worse for that case,
right?  So that could be a separate fix.

> In that case, unconditionally terminating on failure in the rollback
> function:
> 
> 	if (tcp_flow_repair_off(c, &flow->tcp))
> 		die("Failed to roll back TCP_REPAIR mode");
> 
> 	if (repair_flush(c))
> 		die("Failed to roll back TCP_REPAIR mode");
> 
> isn't a very productive thing to do: we go from an uneventful failure
> where flows were not affected at all to a guest left without
> connectivity.

So, the issue is that leaving sockets in repair mode after we leave
the migration path would be very bad.  We can't easily close
sockets/flows for which that's the case, because the batching means
that if there's a failure we don't actually know which sockets are in
which mode, hence the die().

> That starts looking less robust than the alternative (e.g. what I
> implemented in v12: silently fail and continue), at least without
> https://patchew.org/QEMU/20250217092550.1172055-1-lvivier@redhat.com/
> in the general case as well: if we continue, we'll have hanging flows
> that will expire on timeout, but if we don't, again, we'll have a
> guest without connectivity.
> 
> I understand that leaving flows around for that long might present a
> relevant inconsistency, though.
> 
> So I'm wondering about some alternatives: actually, the rollback
> function shouldn't be called at all in this case.  Or it could just
> (indirectly) call tcp_rst() on all the flows that were possibly
> affected.

Making it a safe no-op if we never managed to turn repair on for
anything would make sense to me.

Unfortunately, in this situation we won't see an error until we do a
repair_flush(), which means we no longer know the state of any of the
sockets we already passed to repair_set().

We could, I suppose, close all the flows that we had passed to
repair_set() by the time we see the error.  If we have < one batch's
worth of connections, that will kill connectivity almost as much as
die()ing, though.  I guess it will come back without needing qemu to
restart us, though, so that's something.
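
Concretely, I have in mind something like the sketch below (untested,
and the names are made up: foreach_established_tcp_flow() and
tcp_flow_repair_on() stand in for whatever iterator and enable-side
helper we actually end up with, and tcp_rst() might not take exactly
these arguments):

static int flow_migrate_repair_or_reset(struct ctx *c)
{
	union flow *flow;
	unsigned queued = 0;
	int rc = 0;

	foreach_established_tcp_flow(flow) {
		/* Queue this socket for the helper; as noted above, a
		 * failure may only surface later, at repair_flush() */
		if ((rc = tcp_flow_repair_on(c, &flow->tcp)))
			break;
		queued++;
	}

	if (!rc)
		rc = repair_flush(c);

	if (rc) {
		unsigned i = 0;

		/* We can't tell which of the queued sockets actually
		 * entered repair mode, so conservatively reset every
		 * flow we queued: those connections die, but the rest
		 * of the guest's connectivity survives, and we don't
		 * need qemu to restart us */
		foreach_established_tcp_flow(flow) {
			if (i++ >= queued)
				break;
			tcp_rst(c, &flow->tcp);
		}
	}

	return rc;
}

That at least bounds the damage to the flows we had already queued,
instead of taking down the guest's entire connectivity with die().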
This sort of thing is, incidentally, why I suggested way back that
passt-repair could report failures per-fd, rather than just per-batch.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson