On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote: > On Wed, 29 Jan 2025 12:29:27 +1100 > David Gibson wrote: > > > On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote: > > > On Tue, 28 Jan 2025 12:51:59 +1100 > > > David Gibson wrote: > > > > > > > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote: > > > > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of > > > > > passt. Not used yet. > > > > > > > > > > Signed-off-by: Stefano Brivio > > > > > --- > > > > > Makefile | 10 +++-- > > > > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ > > > > > 2 files changed, 118 insertions(+), 3 deletions(-) > > > > > create mode 100644 passt-repair.c > > > > > > > > > > diff --git a/Makefile b/Makefile > > > > > index 1383875..1b71cb0 100644 > > > > > --- a/Makefile > > > > > +++ b/Makefile > > > > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > > > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > > > vhost_user.c virtio.c vu_common.c > > > > > QRAP_SRCS = qrap.c > > > > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > > > > +PASST_REPAIR_SRCS = passt-repair.c > > > > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) > > > > > > > > > > MANPAGES = passt.1 pasta.1 qrap.1 > > > > > > > > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man > > > > > man1dir ?= $(mandir)/man1 > > > > > > > > > > ifeq ($(TARGET_ARCH),x86_64) > > > > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap > > > > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair > > > > > else > > > > > -BIN := passt pasta qrap > > > > > +BIN := passt pasta qrap passt-repair > > > > > endif > > > > > > > > > > all: $(BIN) $(MANPAGES) docs > > > > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% > > > > > qrap: $(QRAP_SRCS) passt.h > > > > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) > > > > > > > > > > +passt-repair: $(PASST_REPAIR_SRCS) > > > > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) > > > > > + > > > > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ > > > > > rt_sigreturn getpid gettid kill clock_gettime mmap \ > > > > > mmap2 munmap open unlink gettimeofday futex statx \ > > > > > diff --git a/passt-repair.c b/passt-repair.c > > > > > new file mode 100644 > > > > > index 0000000..e9b9609 > > > > > --- /dev/null > > > > > +++ b/passt-repair.c > > > > > @@ -0,0 +1,111 @@ > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > > > + > > > > > +/* PASST - Plug A Simple Socket Transport > > > > > + * for qemu/UNIX domain socket mode > > > > > + * > > > > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets > > > > > + * > > > > > + * Copyright (c) 2025 Red Hat GmbH > > > > > + * Author: Stefano Brivio > > > > > + * > > > > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along > > > > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or > > > > > + * off. Reply by echoing the command. Exit if the command is INT_MAX. > > > > > + */ > > > > > + > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > + > > > > > +#include > > > > > + > > > > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ > > > > > + > > > > > +int main(int argc, char **argv) > > > > > +{ > > > > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] > > > > > + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); > > > > > + struct sockaddr_un a = { AF_UNIX, "" }; > > > > > + int cmd, fds[SCM_MAX_FD], s, ret, i; > > > > > + struct cmsghdr *cmsg; > > > > > + struct msghdr msg; > > > > > + struct iovec iov; > > > > > + > > > > > + iov = (struct iovec){ &cmd, sizeof(cmd) }; > > > > > > > > I mean, local to local, it's *probably* fine, but still a network > > > > protocol not defined in terms of explicit width fields makes me > > > > nervous. I'd prefer to see the cmd being a packed structure with > > > > fixed width elements. > > > > > > It actually is, because: > > > > > > struct { > > > int cmd; > > > }; > > > > > > is a packet structure with fixed width elements. Any architecture we > > > build for (at least the ones I'm aware of) has a 32-bit int. We can > > > make it uint32_t if it makes you feel better. > > > > Sorry, I should have said "*explicitly* fixed width fields". So, yes, > > uint32_t would make me feel better :) > > Changed to int8_t anyway meanwhile. We don't need all those bits. Works or me. > > > > I also think we should do some sort of basic magic / version exchange. > > > > I don't see any reason we'd need to extend the protocol, but I'd > > > > rather have the option if we have to. > > > > > > passt-repair will be packaged and distributed together with passt, > > > though. Versions must match. > > > > But nothing enforces that. > > Distribution packages. If I run claws-mail with the wrong version of, > say, libpixman, it won't start. If you don't use them, you're on your > own. But shared libraries *do* have versioning checks: there are defined compatibility semantics for sonames, and there can be symbol versions as well. > > AIUI KubeVirt will be running passt-repair > > in a different context. Which means it may well be deployed by a > > different path than the passt binary > > No, that's not the way it works. It needs to match, in the sense that > 1. it's a KubeVirt requirement to have compatible packages between > distribution and the "base container image" and 2. this would most > likely be sourced from the "base container image" anyway. > > I maintain the packages for four distributions, plus AppArmor and > SELinux policies upstream and downstream, and I take care of updating > the package in KubeVirt as well, so I guess I have a vague idea of > what's convenient, enforced, burdensome, and so on. > > > which means however we > > distribute it's quite plausible that a downstream screwup could > > mismatch the versions. We should endeavour to have a reasonably > > graceful failure mode for that. > > Regardless of this, I think that *this one* is an interface (I wouldn't > even call it a protocol) that needs to be set in stone, except for > hypothetical (and highly unlikely) UAPI additions which we'll be anyway > able to accommodate for easily. Ok, I can buy that, but it's a contradictory position to "versions must match". > It's a single socket option with three possible values (for 13 years > now), of which we plan to use two. If we want this interface to do > anything else, it should be another interface. > > So there's really no problem with this. > > Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW > should ideally suffice), so it needs to be extremely simple and > auditable. Sending and checking a magic number is not a lot of complexity, even in something on this scale. > > > > And latency here might matter more than in > > > the rest of the migration process. > > > > > > Plus checking a magic number > > > > should make things less damaging and more debuggable if you were to > > > > point the repair helper at an entirely unrelated unix socket instead > > > > of passt's repair socket. > > > > > > Maybe, yes, even though I don't really see good chances for that > > > mistake to happen. Feel free to post a proposal, of course. > > > > I disagree on the good chances for a mistake thing. In GSS I saw > > plenty of occasions where things that shouldn't be mismatched were due > > to some packaging or user screwup. And that's before even considering > > the way that KubeVirt deploys its various pieces seems to provide a > > number of opportunities to mess this up. > > > > So, I'll see what I can come up with. I'm fine with requiring > > matching versions if it's actually checked. Maybe a magic derived > > from our git hash, or even our build-id. > > Both would make things significantly less usable, because they would > make different but compatible builds incompatible, and different > implementations rather inconvenient. Ok, so you're definitely now saying versions *don't* need to match. > For example, it might be a practical solution to have a Go > implementation of this in KubeVirt's virt-handler itself, but if it > needs to extract information or strings from the binary, that becomes > impractical. Ok... could we at least add just a magic number then. If we do ever need a new protocol we can change it, otherwise the protocol immutable for now. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson