From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
Date: Fri, 3 Jan 2025 12:16:45 +1100
Message-ID: <Z3c6ffgMKOVIZRrc@zatzit>
In-Reply-To: <20250102225948.2cfbd033@elisabeth>
On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 12:00:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:29 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > We verify that every packet we store in a pool - and every partial
> > > > packet we retrieve from it - has a length no longer than UINT16_MAX.
> > > > This originated in the older packet pool implementation, which stored
> > > > packet lengths in a uint16_t. Now that packets are represented by a
> > > > struct iovec with its size_t length, this check serves only as a
> > > > sanity / security check that we don't have some wildly out of range
> > > > length due to a bug elsewhere.
> > > >
> > > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > > stored in the pool is in fact an entire frame including both L2 and any
> > > > backend specific headers. We can exceed this in passt mode, even with the
> > > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > > 4 bytes of qemu stream length header = 65538 bytes.
> > > >
> > > > Introduce our own define for the maximum length of a packet in the pool and
> > > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > > specific headers. We'll use different amounts of that depending on the
> > > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > > to be 100% tight.
> > >
> > > I couldn't find the time to check the maximum number of bytes we
> > > can get here depending on hypervisor and interface, but if this patch
> >
> > So, it's a separate calculation for each backend type, and some of
> > them are pretty tricky.
> >
> > For anything based on the kernel tap device it is 65535, because it
> > has an internal frame size limit of 65535, already including any L2
> > headers (it explicitly limits the MTU to 65535 - hard_header_len).
> > There is no "hardware" header.
> >
> > For the qemu stream protocol it gets pretty complicated, because there
> > are multiple layers which could clamp the maximum size. It doesn't
> > look like the socket protocol code itself imposes a limit beyond the
> > structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> > which could be less for 32-bit systems). AFAICT, it's not
> > theoretically impossible to have gigabyte frames with a weird virtual
> > NIC model... though obviously that wouldn't be IP, and probably not
> > even Ethernet.
>
> Theoretically speaking, it could actually be IPv6 with Jumbograms. They
> never really gained traction (because of Ethernet, I think) and we don't
> support them, but the only attempt to deprecate them I'm aware of
> didn't succeed (yet):
>
> https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
>
> ...and I actually wonder if somebody will ever try to revive them for
> virtual networks, where they might make somewhat more sense (you could
> transfer filesystems with one packet per file or similar silly tricks).
Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a
time.
> > Each virtual NIC could have its own limit. I suspect that's going to
> > be in the vicinity of 64k. But, I'm really struggling to figure out
> > what it is just for virtio-net, so I really don't want to try to
> > figure it out for all of them. With a virtio-net NIC, I seem to be
> > able to set MTU all the way up to 65535 successfully, which implies a
> > maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> > header) = 65553 at least.
>
> The Layer-2 header is included (because that also happens to be
> ETH_MAX_MTU on Linux); it wouldn't be on top.
No. Or if so, that's a guest side bug. The MTU set with ip-link is
the maximum L3 size - at the default 1500, the L2 frame can be 1514
bytes. If the driver can't send an L2 frame greater than 65535 bytes,
then it should clamp the MTU to (65535 - hard_header_len) like tuntap
already does.
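To illustrate the sort of clamp I mean, here's a sketch of the
driver-side logic (not actual kernel code):

    /* A device that can't handle L2 frames larger than 65535 bytes
     * should advertise an L3 MTU that leaves room for the L2 header,
     * the way tuntap does: */
    dev->max_mtu = 65535 - dev->hard_header_len;
    /* for ethernet: 65535 - 14 = 65521 */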
I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
that fits in an ethernet frame (in which case the frame itself is
longer than 65535 bytes), or is it the maximum ethernet frame size
(leading to an IP MTU of 65521)? I plan to eliminate use of
ETH_MAX_MTU in favour of clearer things in my MTU series. I should
merge that series with this one; it might make the context clearer.
AFAICT there is *no* structural limit on the size of an ethernet
frame; the length isn't in any header, it's just assumed to be
reported out of band by the hardware. There's no theoretical reason
that reporting mechanism couldn't allow lengths > 65535, whether
slightly (65535 bytes of payload + header & FCS) or vastly.
> > Similar situation for vhost-user, where I'm finding it even more
> > inscrutable to figure out what limits are imposed at the sub-IP
> > levels. At the moment the "hardware" header
> > (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> > packet.c layer, but we might have reasons to change that.
> >
> > So, any sub-IP limits for qemu, I'm basically not managing to find.
> > However, we (more or less) only care about IP, which imposes a more
> > practical limit of: 65535 + L2 header size + "hardware" header size.
> >
> > At present that maxes out at 65553, as above, but if we ever support
> > other L2 encapsulations, or other backend protocols with larger
> > "hardware" headers, that could change.
>
> Okay. I was thinking of a more practical approach, based on the fact
> that we only support Ethernet anyway, with essentially four types of
> adapters (three virtio-net implementations, and tap), plus rather rare
> reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we
> could actually test things.
>
> Well, I did that anyway, because I'm not quite comfortable dropping the
> UINT16_MAX check in packet_add_do() without testing...
We're not dropping it, just raising the limit, fairly slightly.
> and yes, with
> this patch we can trigger a buffer overflow with vhost-user. In detail:
>
> - muvm, virtio-net with 65535 bytes MTU:
>
> - ping -s 65492 works (65534 bytes "on wire" from pcap)
> - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
> on wire (that's with a newer kernel, probably that's the reason why
> it's not the same limit as QEMU, below)
That's a guest driver bug. If the MTU is 65535, it should be able to
send a 65535 byte IP datagram without fragmentation. Looks like it
needs to clamp the MTU based on L2 limitations.
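(Working through the numbers: -s 65493 means 65493 bytes of ICMP
payload + 8 bytes of ICMP header + 20 bytes of IPv4 header = 65521
bytes of IP datagram, i.e. a 65535 byte frame once you add the 14 byte
ethernet header - which should fit without fragmentation if the MTU
really were 65535.)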
> - QEMU, virtio-net without vhost-user, 65535 bytes MTU:
>
> - ping -s 65493 works (65535 bytes "on wire" from pcap)
> - with -s 65494: "Bad frame size from guest, resetting connection"
That's our check, which I plan to fix in the MTU series.
> - QEMU, virtio-net with vhost-user, 65535 bytes MTU:
>
> - ping -s 65493 works (65535 bytes "on wire" from pcap)
> - ping -s 65494:
>
> *** buffer overflow detected ***: terminated
>
> without this patch, we catch that in packet_add_do() (and without
> 9/12 we don't crash)
Ouch. That's a bug in our vhost-user code. The easy fix would be to
clamp the MTU to 65521; arguably more correct would be to decouple its
notion of maximum frame size from ETH_MAX_MTU.
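The easy version would be something like this (a sketch only; where
and how the vhost-user code tracks the MTU is glossed over here,
ETH_HLEN being the usual 14 byte ethernet header length):

    /* Never advertise an MTU whose ethernet frames wouldn't fit in
     * 65535 bytes: */
    if (mtu > 65535 - ETH_HLEN)     /* that is, > 65521 */
        mtu = 65535 - ETH_HLEN;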
> - tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
> correct maximum though):
No, 65521 is correct: 65521 + 14 = 65535, which is the maximum allowed
tap frame size. It seems like there are other drivers that should
also be clamping their max MTU similarly, but aren't.
> - ping -s 65493 works (65535 bytes on wire)
> - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)
This time the fragmentation is correct, because the MTU is only 65521.
> So, I guess we should find out the issue with vhost-user first.
Yeah. I definitely need to intermingle the MTU series with this one
to get the order of operations right.
> Other than that, it looks like we can reach at least 65539 bytes if we
> add the "virtio-net" length descriptors, but still the frame size
> itself (which is actually what matters for the functions in packet.c)
Well... sort of. The packet.c functions really care nothing about any
of the layers; it's just a blob of data to them. I did miss that
tap_passt_input() excludes the qemu header before inserting the frame
into the pool, so indeed, the pool layer won't currently see lengths
greater than 65535. Of course, since that frame header *is* stored in
pkt_buf with all the rest, that's arguably not using the pool layer's
buffer bound checks to the full extent we could.
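For reference, the check we're talking about is essentially this
(simplified from packet_add_do() in packet.c):

    if (len > PACKET_MAX_LEN) {     /* previously: len > UINT16_MAX */
        trace("packet too large, %zu bytes", len);
        return;
    }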
> can't exceed 65535 bytes, at least from my tests.
>
> Then, yes, I agree that it's not actually correct, even though it fits
> all the use cases we have, because we *could* have an implementation
> exceeding that value (at the moment, it looks like we don't).
So, it seems like the Linux drivers might not actually generate
ethernet frames > 65535 bytes - although they don't all correctly
reduce their maximum MTU to reflect that. I don't think we should
rely on that; AFAICT it would be reasonable for a driver + VMM
implementation to allow ethernet frames of at least 65535 bytes
+ L2 header. That would also leave room for 18 byte 802.1Q VLAN
L2 headers (the 14 byte ethernet header plus a 4 byte tag).
> > > fixes an actual issue as you seem to imply, actually checking that with
> > > QEMU and muvm would be nice.
> > >
> > > By the way, as you mention a specific calculation, does it really make
> > > sense to use a "good enough" value here? Can we ever exceed 65538
> > > bytes, or can we use that as limit? It would be good to find out, while
> > > at it.
> >
> > So, yes, I think we can exceed 65538. But more significantly, trying
> > to make the limit tight here feels like a borderline layering
> > violation. The packet layer doesn't really care about the frame size
> > as long as it's "sane".
>
> It might still be convenient for some back-ends to define "sane" as 64
> KiB. I'm really not sure if it is, I didn't look into the matching part
> of vhost-user in detail.
That's fair, but I don't think the pool layer should impose that limit
on the backends, because I think it's equally reasonable for another
backend to allow slightly larger frames with a 65535 byte L3 payload.
Hence setting the limit to 64k + "a little bit".
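Concretely, something like this (a sketch: the name and the exact
amount of slack are illustrative):

    /* Maximum size of a single packet stored in a pool, including
     * L2 and any backend specific headers. This is only a sanity
     * check, so the bound doesn't need to be tight. */
    #define PACKET_MAX_LEN  ((size_t)UINT16_MAX + 128)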
> If it's buggy because we have a user that can exceed that, sure, let's
> fix it. If not... also fine by me as it's some kind of theoretical
> flaw, but we shouldn't crash.
>
> > Fwiw, in the draft changes I have improving
> > MTU handling, it's my intention that individual backends calculate
> > and/or enforce tighter limits of their own where practical, and
> > BUILD_ASSERT() that those fit within the packet layer's frame size
> > limit.
>
> Ah, nice, that definitely sounds like an improvement.
>
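To sketch what I mean there (names hypothetical; BUILD_ASSERT() as
mentioned above):

    /* Each backend computes its own tight bound, e.g. for passt the
     * maximum ethernet frame plus the qemu stream length header... */
    #define L2_MAX_LEN_PASST    (USHRT_MAX + sizeof(uint32_t))
    /* ...and statically checks that it fits within the pool layer's
     * more generous limit: */
    BUILD_ASSERT(L2_MAX_LEN_PASST <= PACKET_MAX_LEN);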
--
David Gibson (he or they)        | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au   | minimalist, thank you, not the other way
                                 | around.
http://www.ozlabs.org/~dgibson