On Tue, Jun 16, 2026 at 06:37:05PM +0200, David du Colombier wrote: > tap4_handler() requires the L2 payload after the IP header to match > the IP datagram length exactly. Guests whose drivers pad transmitted > frames to the 60 byte Ethernet minimum, as real hardware requires and > as drivers modelled on hardware do (Plan 9's virtio-net, for one), > send pure ACK and FIN segments as 60 byte frames: 14 byte Ethernet > header, 40 byte IPv4 datagram, 6 padding octets. Those frames fail > the exact length check and are dropped without trace. > > passt then never sees such a guest's acknowledgements: it > retransmits from the lowest unacknowledged sequence with exponential > backoff while the guest, which received and acknowledged everything, > waits. Every fresh connection stalls for minutes (a 1 MiB HTTP fetch > over --map-host-loopback measured 248 s before this change, 0.27 s > after; bulk transfer over established connections, whose ACKs ride > data segments above the padding threshold, is unaffected). FIN > segments are padded too, so teardown hangs as well. Note that > tap_send_single() pads passt's own outbound frames to ETH_ZLEN, so > the receive path was already stricter than the send path. > > Trim the trailing padding to the IP datagram length instead, using a > new iov_tail_trim() helper, and keep dropping frames genuinely > shorter than the datagram they claim to carry. IPv6 is unaffected: > its minimal TCP frame is 74 bytes, above the padding threshold. > > Signed-off-by: David du Colombier <0intro@gmail.com> > --- > v2: > - store the iov_tail_size() result in a local variable instead of > computing it twice > - pass ARRAY_SIZE(trim_iov) instead of UIO_MAXIOV This is addressing a real, nasty, bug and I'm glad this is merged. Couple of thoughts that we might want to consider in the future, though. [snip] > +/** > + * iov_tail_trim() - Limit a tail to @len bytes via a scratch iovec array > + * @tail: Pointer to the iov_tail to trim; rebuilt on success to > + * reference @scratch > + * @len: Number of bytes to keep > + * @scratch: Scratch iovec array backing the trimmed tail; must stay > + * valid as long as the trimmed tail is in use > + * @scratch_cnt: Number of elements in @scratch > + * > + * Return: true on success, false if @tail is shorter than @len or does > + * not fit in @scratch (@tail is unchanged on failure) > + */ This interface seems kind of fragile - it imposes a pretty subtle lifetime constraint on @scratch. It also arguably breaks the model of an iov_tail() as a "window" into an underlying, immutable iovec array. I've been wondering for a while if we should extend (and rename) iov_tail to be such a window that can skip bytes from either end of the underlying iovec[], not just the beginning - so it would track a (byte) length as well as offset field. That would make merge iov_tail_clone() and iov_tail_trim(): take the "window" and transcribe it in fresh iov[]. As with existing iov_tail_clone() calls, we'd generally want to leave that to the last possible moment to avoid either repeatedly copying or tricky lifetime dependencies between different iovec arrays. Laurent, I feel like that could be useful for some of the vhost-user stuff you've done. Any thoughts? > +bool iov_tail_trim(struct iov_tail *tail, size_t len, > + struct iovec *scratch, size_t scratch_cnt) > +{ > + ssize_t cnt = iov_tail_clone(scratch, scratch_cnt, tail); > + size_t left = len; > + unsigned int i; > + > + if (cnt < 0) > + return false; > + > + for (i = 0; i < (size_t)cnt && left; i++) { > + if (scratch[i].iov_len > left) > + scratch[i].iov_len = left; > + left -= scratch[i].iov_len; > + } > + > + if (left) > + return false; > + > + *tail = IOV_TAIL(scratch, i, 0); > + return true; > +} > diff --git a/iov.h b/iov.h > index 4fdf14a..3af467e 100644 > --- a/iov.h > +++ b/iov.h > @@ -97,6 +97,8 @@ size_t iov_push_header_(struct iov_tail *tail, const void *v, size_t len); > void *iov_remove_header_(struct iov_tail *tail, void *v, size_t len, size_t align); > ssize_t iov_tail_clone(struct iovec *dst_iov, size_t dst_iov_cnt, > struct iov_tail *tail); > +bool iov_tail_trim(struct iov_tail *tail, size_t len, > + struct iovec *scratch, size_t scratch_cnt); > > /** > * IOV_PEEK_HEADER() - Get typed pointer to a header from an IOV tail > diff --git a/tap.c b/tap.c > index 4cba4c7..6ed7f8d 100644 > --- a/tap.c > +++ b/tap.c > @@ -716,7 +716,8 @@ static int tap4_handler(struct ctx *c, const struct pool *in, > i = 0; > resume: > for (seq_count = 0, seq = NULL; i < in->count; i++) { > - size_t l3len, hlen, l4len; > + struct iovec trim_iov[UIO_MAXIOV]; > + size_t l3len, hlen, l4len, check; > struct ethhdr eh_storage; > struct iphdr iph_storage; > struct udphdr uh_storage; > @@ -775,7 +776,17 @@ resume: > > if (!iov_drop_header(&data, hlen)) > continue; > - if (iov_tail_size(&data) != l4len) > + > + check = iov_tail_size(&data); > + if (check < l4len) > + continue; > + > + /* Drivers modelled on real hardware (Plan 9's virtio, for > + * one) pad short frames to the 60 byte Ethernet minimum: > + * trim trailing padding instead of dropping the packet. > + */ > + if (check > l4len && > + !iov_tail_trim(&data, l4len, trim_iov, ARRAY_SIZE(trim_iov))) > continue; > > if (iph->protocol == IPPROTO_ICMP) { > -- > 2.54.0 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson