On Fri, Feb 21, 2025 at 06:59:12AM +0100, Stefano Brivio wrote: > On Fri, 21 Feb 2025 13:40:12 +1100 > David Gibson wrote: > > > On Thu, Feb 20, 2025 at 11:38:00AM +0100, Stefano Brivio wrote: > > > On Thu, 20 Feb 2025 21:18:06 +1100 > > > David Gibson wrote: > > > > > > > This sort of thing is, incidentally, why I did way back suggest the > > > > possibility of passt-repair reporting failures per-fd, rather than > > > > just per-batch. > > > > > > Sorry, I somehow missed that proposal, and I can't find any trace of > > > it. > > > > It may have just been on IRC somewhere. > > > > > But anyway, the problem is that if we fail to read a batch for any > > > reason (invalid ancillary data... maybe always implying a kernel issue, > > > but I'm not sure), you can't _reliably_ report per-fd failures. > > > *Usually*, you can. Worth it? > > > > Ah, I see. We could handle that by being able to report both per-fd > > and "whole batch" failure (equivalent to failure on every fd), but > > that would complexify the protocol, of course. > > By the way, after having another look at the kernel interface and > implementation: this makes no sense. > > Either we're able to set repair mode for all the sockets, or for none > of them (EPERM). Well.. probably. I suspect something sufficiently insane in an LSM could break that rule. > And if there's any invalid file descriptor in the set, > we'll get EBADF for the whole sendmsg(). > > The stuff I'm proposing below is well beyond my threshold of things > that make no sense to implement, but at least it limits damage in terms > of complexity (and hence of potential impact on the actual usage, > because that's what we're talking about here: a die() that makes no > sense but now proves to be actually harmful). I wouldn't go any further > than that. > > > > In any case, if it's simple, we can still do it, because passt and > > > passt-repair are distributed together. You can't pass back the file > > > descriptors via SCM_RIGHTS though, because we want to close() them > > > before we reply. > > > > > > Another alternative could be that passt-repair reverts back the state > > > of the file descriptors that were already switched, on failure. > > > > That might help a bit, we'd still need to rework the passt-side > > interface to know what needs reverting at the right stage. > > Not for what I'm proposing: > > 1. passt sends 1 (TCP_REPAIR_ON) for sockets 2, 3 > > 2. passt-repair sets repair mode for 2 > > 3. passt-repair fails to set repair mode for 3 > > 4. passt-repair clears repair mode for 2 Again, it shouldn't happen in practice, but you will get a mess if you ever managed to set repair mode, but failed to clear it. There's a similar question in the other direction, if passt is trying to REPAIR_OFF, and we fail part way through. Should passt-repair REPAIR_ON again? I'm not sure it has the information to make a sane choice here. > 5. passt-repair closes the connection to signal failure > > 6. passt knows that repair mode is *not* set for any socket in the batch > > The interface remains the same (per-batch error only), but you can rely > on the whole batch to have failed. > > You already can, even with no changes in passt-repair (see above), by > the way. Well, I just finished implementing a simple way of reporting partial failures. I guess see what you think of it. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson