From: Stefano Brivio <sbrivio@redhat.com>
To: Laurent Vivier <lvivier@redhat.com>
Cc: passt-dev@passt.top, Jon Maloy <jmaloy@redhat.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH v4 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element
Date: Tue, 26 May 2026 10:38:10 +0200 (CEST)
Message-ID: <20260526103809.54da7aac@elisabeth>
In-Reply-To: <20260526095955.008a6ea1@elisabeth>
On Tue, 26 May 2026 09:59:55 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:
> On Tue, 26 May 2026 09:31:51 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
>
> > On 5/22/26 14:04, Stefano Brivio wrote:
> > > On Fri, 22 May 2026 07:44:55 +0200
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > >> On Fri, 22 May 2026 06:22:39 +0200
> > >> Stefano Brivio <sbrivio@redhat.com> wrote:
> > >>
> > >>> On Fri, 22 May 2026 01:13:33 +0200
> > >>> Laurent Vivier <lvivier@redhat.com> wrote:
> > >>>
> > >>>> On 5/21/26 10:30, Laurent Vivier wrote:
> > >>>>> On 5/20/26 22:53, Stefano Brivio wrote:
> > >>>>>> On Wed, 20 May 2026 18:18:52 +0200
> > >>>>>> Stefano Brivio <sbrivio@redhat.com> wrote:
> > >>>>>>
> > >>>>>>> On Wed, 20 May 2026 18:07:08 +0200
> > >>>>>>> Stefano Brivio <sbrivio@redhat.com> wrote:
> > >>>>>>>
> > >>>>>>>> On Wed, 20 May 2026 17:34:45 +0200
> > >>>>>>>> Stefano Brivio <sbrivio@redhat.com> wrote:
> > >>>>>>>>> On Wed, 13 May 2026 13:52:08 +0200
> > >>>>>>>>> Laurent Vivier <lvivier@redhat.com> wrote:
> > >>>>>>>>>> Currently, the vhost-user path assumes each virtqueue element contains
> > >>>>>>>>>> exactly one iovec entry covering the entire frame. This assumption
> > >>>>>>>>>> breaks as some virtio-net drivers (notably iPXE) provide descriptors where the
> > >>>>>>>>>> vnet header and the frame payload are in separate buffers, resulting in
> > >>>>>>>>>> two iovec entries per virtqueue element.
> > >>>>>>>>>>
> > >>>>>>>>>> This series refactors the vhost-user data path so that frame lengths,
> > >>>>>>>>>> header sizes, and padding are tracked and passed explicitly rather than
> > >>>>>>>>>> being derived from iovec sizes. This decoupling is a prerequisite for
> > >>>>>>>>>> correctly handling padding of multi-buffer frames.
> > >>>>>>>>>
> > >>>>>>>>> Sorry to bring (likely) bad news, but this series seems to introduce a
> > > >>>>>>>>> regression: I got the migration/rampstream_in test failing twice in a
> > > >>>>>>>>> row, which I had never seen happening (I think I saw a single failure a
> > >>>>>>>>> long time ago when the machine had a high CPU load, but nothing else).
> > >>>>>>>>>
> > >>>>>>>>> I'm currently bisecting and the bisect seems to point towards the end
> > >>>>>>>>> of the series (probably 10/10), but I haven't finished yet. I'll keep
> > >>>>>>>>> you posted. I haven't spotted anything that might cause issues there.
> > >>>>>>>>
> > >>>>>>>> Yeah, that's the one :(
> > >>>>>>>>
> > >>>>>>>> $ git bisect bad
> > >>>>>>>> db798fc60f4c5869cb53168354e068fb4dabd91a is the first bad commit
> > >>>>>>>> commit db798fc60f4c5869cb53168354e068fb4dabd91a
> > >>>>>>>> Author: Laurent Vivier <lvivier@redhat.com>
> > >>>>>>>> Date: Wed May 13 13:52:18 2026 +0200
> > >>>>>>>>
> > >>>>>>>> vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()
> > >>>>>
> > > >>>>> I checked on my system with the commit preceding this series,
> > > >>>>> bcc3d37a6e01 ("util: Fix changes to assert_with_msg()"), and rampstream_in fails too (not
> > > >>>>> every time).
> > >>>>>
> > >>>>> > TCP/IPv4: sequence check, ramps, inbound
> > >>>>> ...failed.
> > >>>>>
> > > >>>>> and rampstream_out sometimes hangs too.
> > > >>>>>
> > > >>>>> I'm going to try with earlier commits.
> > >>>>
> > >>>> For me the problem can happen with any commit...
> > >>>>
> > > >>>> As it depends on the execution path and on the load and speed of the system, it looks like
> > > >>>> a race condition.
> > >>>
> > >>> Hah, thanks for checking. Maybe...
> > >>>
> > >>>> Did you try to test on a host with a kernel patched with
> > >>>> "[PATCH net v2 0/2] Fix race condition between TCP_REPAIR dump and data receive" ?
> > >>>
> > >>> Now I tried, and yes, the test doesn't hang anymore! I seem to have an
> > >>> issue with teardown functions on recent kernels (current net.git HEAD
> > >>> more or less):
> > >>>
> > >>> ---
> > >>> [...]
> > >>>
> > >>> 2026/05/22 04:08:23 socat[73089] E connect(5, AF=40 cid:94558 port:22, 16): Connection timed out
> > >>> Connection closed by UNKNOWN port 65535
> > >>> ...
> > >>> ---
> > >>>
> > >>> it looks like we stop QEMU a bit too early. But it should be unrelated.
> > >
> > > Oops, I forgot to upgrade QEMU on the virtual machine I was using to
> > > test those kernel builds, I had a somewhat outdated 8.1 version and it
> > > failed migration for unrelated reasons. It works with 11.0.
> > >
> > > Back to kernel versions: the "problem" is that with a recent
> > > net-next.git HEAD, with or without my fix, in a nested VM, the test
> > > always passes (20/20). And I can't easily test things non-nested.
> > >
> > > > I guess I could just skip that test, for the moment, from the set I run on git
> > > > push, and run it manually in the virtual machine.
> > >
> > > But judging from captures (test_logs/pasta_1.pcap from PCAP=1 ./run)
> > > I'm fairly sure it's not *that* issue:
> > >
> > > 465 12.141763 192.0.2.1 → 88.198.0.164 58451 TCP [TCP Window Full] 34416 → 10001 [PSH, ACK] Seq=10002100 Ack=1 Win=65536 Len=58397
> > > 466 12.187195 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
> > > 467 13.187281 192.0.2.1 → 88.198.0.164 4150 TCP 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
> > >
> > > last data transfer from client (rampstream):
> > >
> > > 468 13.187358 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
> > >
> > > everything acknowledged, migration starts now:
> > >
> > > 469 14.143217 fe80::f471:c3ff:fe10:4e45 → ff02::2 70 ICMPv6 Router Solicitation from f6:71:c3:10:4e:45
> > > 470 14.687123 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] [TCP Keep-Alive] 10001 → 34416 [ACK] Seq=0 Ack=10060497 Win=0 Len=0
> > >
> > > migration completed: and we acknowledge the right sequence (10060497),
> > > so it didn't jump forward.
> > >
> > > But starting from this point:
> > >
> > > 471 14.687265 192.0.2.1 → 88.198.0.164 60 TCP 34416 → 10001 [ACK] Seq=10060497 Ack=1 Win=65536 Len=0
> > > 472 16.687412 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
> > > 473 16.687450 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
> > > 474 20.687650 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
> > > 475 20.687692 88.198.0.164 → 192.0.2.1 54 TCP [TCP ZeroWindow] 10001 → 34416 [ACK] Seq=1 Ack=10060497 Win=0 Len=0
> > > 476 28.687817 192.0.2.1 → 88.198.0.164 4150 TCP [TCP Retransmission] 34416 → 10001 [PSH, ACK] Seq=10060497 Ack=1 Win=65536 Len=4096
> > >
> > > we keep advertising a zero window (that's the kernel doing it really),
> > > as if we were unable to dequeue data.
> > >
> > > I enabled --trace just for the target instance of passt, and I don't
> > > see anything suspicious there:
> > >
> > > 13.0958: Receiving 1 flows
> > > 13.0958: Flow 0 (NEW): FREE -> NEW
> > > 13.0958: Flow 0 (TCP connection): TGT -> TYPED
> > > 13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
> > > 13.0958: Flow 0 (TCP connection): Side 1 hash table insert: bucket: 138154
> > > 13.0958: Flow 0 (TCP connection): TYPED -> ACTIVE
> > > 13.0958: Flow 0 (TCP connection): HOST [192.0.2.1]:49892 -> [88.198.0.164]:10001 => TAP [192.0.2.1]:49892 -> [88.198.0.164]:10001
> > > 13.0959: Flow 0 (TCP connection): Extended migration data, socket 83 sequences send 3121929544 receive 1643895001
> > > 13.0959: Flow 0 (TCP connection): pending queues: send 0 not sent 0 receive 3500081
> > > 13.0959: Flow 0 (TCP connection): window: snd_wl1 1647395082 snd_wnd 65536 max 65536 rcv_wnd 0 rcv_wup 1647395082
> > > 13.0959: Flow 0 (TCP connection): SO_PEEK_OFF disabled offset=0
> > > 13.0985: Got packet, but RX virtqueue not usable yet
> > > 13.0985: Closing migration channel, fd: 82
> > > 13.0985: Closing TCP_REPAIR helper socket
> > > 13.0985: passt: epoll event on vhost-user command socket 77 (events: 0x00000001)
> > >
> > > then the usual VHOST_USER_CHECK_DEVICE_STATE and VHOST_USER_SET_VRING_ENABLE
> > > commands. After that, a tight loop of:
> > >
> > > 13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
> > > 13.0986: Got packet, but RX virtqueue not usable yet
> > > 13.0986: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
> > > 13.0986: Got packet, but RX virtqueue not usable yet
> > >
> > > until we go further with the vhost-user setup. I still see this message
> > > which I had never noticed (but I didn't try to bisect around it):
> > >
> > > 13.1006: ================ Vhost user message ================
> > > 13.1006: Request: VHOST_USER_SET_VRING_ADDR (9)
> > > [...]
> > > 13.1006: Last avail index != used index: 3252 != 3027
> > >
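As a side note on the two numbers in that message: split virtqueue indices are free-running 16-bit counters, so the distance between them has to be computed modulo 2^16. The sketch below is only one possible reading of the message (elements taken from the available ring but not yet returned to the used ring), and vq_inflight() is a hypothetical name, not a passt function:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of elements the device side has fetched from the
 * available ring without having returned them to the used ring yet. Both
 * indices wrap at 65536, so the subtraction must be truncated to 16 bits.
 */
static uint16_t vq_inflight(uint16_t last_avail_idx, uint16_t used_idx)
{
	return (uint16_t)(last_avail_idx - used_idx);
}
```

For the values logged above, vq_inflight(3252, 3027) is 225. The same helper still gives a small distance across a wraparound, for example vq_inflight(5, 65531) is 10.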
> > > and then after VHOST_USER_SET_VRING_CALL, and:
> > >
> > > 13.1008: passt: epoll event on vhost-user kick socket 78 (events: 0x00000001)
> > > 13.1008: vhost-user: got kick_data: 0000000000000001 idx: 1
> > >
> > > it's just a tight loop of:
> > >
> > > 13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
> > > 13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
> > > 13.1008: passt: epoll event on connected TCP socket 83 (events: 0x00000001)
> > >
> > > as if we weren't dequeueing anything from there.
> > >
> > > I'm starting to suspect we might be hitting two different issues: perhaps
> > > things fail on your setup because of the kernel bug with TCP_REPAIR not
> > > freezing the queue, and they fail on my setup for some other reason.
> > >
> > > For me it's very deterministic though: with patch 10/10 things always
> > > fail, and without it they never fail.
> > >
> > > I guess I'll add more prints and check for more messages before/after
> > > that patch.
> > >
> >
> > In fact there is a buffer leak because iov_skip_bytes() doesn't correctly compute the
> > number of used elements and then we don't release all the unused buffers.
> >
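To make the kind of accounting involved concrete, here is a minimal sketch of counting how many iovec entries a given number of bytes actually occupies, so that the remaining ones can be given back. It is an illustration under assumptions, not the code from either series: iov_entries_used() is a hypothetical name, and it doesn't reproduce the behaviour of iov_skip_bytes().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: number of entries of iov[] touched by the first len
 * bytes. An entry counts as used as soon as it holds at least one byte, and
 * an entry starting exactly at offset len is not counted.
 */
static size_t iov_entries_used(const struct iovec *iov, size_t cnt, size_t len)
{
	size_t i;

	assert(iov || !cnt);

	for (i = 0; i < cnt && len > 0; i++)
		len -= len < iov[i].iov_len ? len : iov[i].iov_len;

	return i;
}
```

With entries of 12, 1514 and 1514 bytes, 100 bytes use two entries, and so do exactly 1526 bytes: the boundary case is where an off-by-one in this count would either leak one buffer or release one that is still in use.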
> > I'm trying to fix that.
> >
> > Please try with series "[PATCH v7 0/4] vhost-user,tcp: Handle multiple iovec entries per
> > virtqueue element" applied, it reworks this part.
>
> I'm trying it now. If that totally reworks this part and it fixes
> things and it's ready to be merged (sorry, I didn't manage to have a
> look yet) I don't think it's strictly necessary to figure out the
> leak.
All tests pass with it, rampstream_in passed 20/20 times. Should I go
ahead and merge both series (UDP and TCP, they both look ready) or do
you still need to figure out the buffer leak first for other reasons?
--
Stefano