public inbox for passt-dev@passt.top
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
Date: Fri, 3 Jan 2025 12:16:45 +1100	[thread overview]
Message-ID: <Z3c6ffgMKOVIZRrc@zatzit> (raw)
In-Reply-To: <20250102225948.2cfbd033@elisabeth>

On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 12:00:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:29 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > We verify that every packet we store in a pool - and every partial
> > > > packet we retrieve from it - has a length no longer than UINT16_MAX.
> > > > This originated in the older packet pool implementation, which stored
> > > > packet lengths in a uint16_t.  Now that packets are represented by a
> > > > struct iovec with its size_t length, this check serves only as a
> > > > sanity / security check that we don't have some wildly out-of-range
> > > > length due to a bug elsewhere.
> > > > 
> > > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > > stored in the pool is in fact an entire frame including both L2 and any
> > > > backend specific headers.  We can exceed this in passt mode, even with the
> > > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > > 4 bytes of qemu stream length header = 65538 bytes.
> > > > 
> > > > Introduce our own define for the maximum length of a packet in the pool and
> > > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > > specific headers.  We'll use different amounts of that depending on the
> > > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > > to be 100% tight.  
> > > 
> > > I couldn't find the time to check the maximum number of bytes we
> > > can get here depending on hypervisor and interface, but if this patch  
> > 
> > So, it's a separate calculation for each backend type, and some of
> > them are pretty tricky.
> > 
> > For anything based on the kernel tap device it is 65535, because it
> > has an internal frame size limit of 65535, already including any L2
> > headers (it explicitly limits the MTU to 65535 - hard_header_len).
> > There is no "hardware" header.
> > 
> > For the qemu stream protocol it gets pretty complicated, because there
> > are multiple layers which could clamp the maximum size.  It doesn't
> > look like the socket protocol code itself imposes a limit beyond the
> > structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> > which could be less for 32-bit systems).  AFAICT, it's not
> > theoretically impossible to have gigabyte frames with a weird virtual
> > NIC model... though obviously that wouldn't be IP, and probably not
> > even Ethernet.
> 
> Theoretically speaking, it could actually be IPv6 with Jumbograms. They
> never really gained traction (because of Ethernet, I think) and we don't
> support them, but the only attempt to deprecate them I'm aware of
> didn't succeed (yet):
> 
>   https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
> 
> ...and I actually wonder if somebody will ever try to revive them for
> virtual networks, where they might make somewhat more sense (you could
> transfer filesystems with one packet per file or similar silly tricks).

Hm, yes.  Well, one problem at a time.  Well, ok, 2 or 3 problems at a
time.

> > Each virtual NIC could have its own limit.  I suspect that's going to
> > be in the vicinity of 64k.  But, I'm really struggling to figure out
> > what it is just for virtio-net, so I really don't want to try to
> > figure it out for all of them.  With a virtio-net NIC, I seem to be
> > able to set MTU all the way up to 65535 successfully, which implies a
> > maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> > header) = 65553 at least.
> 
> The Layer-2 header is included (because that also happens to be
> ETH_MAX_MTU, on Linux); it wouldn't be on top.

No.  Or if so, that's a guest side bug.  The MTU set with ip-link is
the maximum L3 size - at the default 1500, the L2 frame can be 1514
bytes.  If the driver can't send an L2 frame greater than 65535 bytes,
then it should clamp the MTU to (65535 - hard_header_len) like tuntap
already does.

I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
which can be carried in an ethernet frame (that's longer), or the
maximum ethernet frame size (leading to an IP MTU of 65521)?  I plan
to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU
series.  I should merge that with this one; it might make the context
clearer.

AFAICT there is *no* structural limit on the size of an ethernet
frame; the length isn't in any header, it's just assumed to be
reported out of band by the hardware.  No theoretical reason that
reporting mechanism couldn't allow lengths > 65535, whether slightly
(65535 bytes of payload + header & FCS) or vastly.

> > Similar situation for vhost-user, where I'm finding it even more
> > inscrutable to figure out what limits are imposed at the sub-IP
> > levels.  At the moment the "hardware" header
> > (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> > packet.c layer, but we might have reasons to change that.
> > 
> > So, any sub-IP limits for qemu, I'm basically not managing to find.
> > However, we (more or less) only care about IP, which imposes a more
> > practical limit of: 65535 + L2 header size + "hardware" header size.
> > 
> > At present that maxes out at 65553, as above, but if we ever support
> > other L2 encapsulations, or other backend protocols with larger
> > "hardware" headers, that could change.
> 
> Okay. I was thinking of a more practical approach, based on the fact
> that we only support Ethernet anyway, with essentially four types of
> adapters (three virtio-net implementations, and tap), plus rather rare
> reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107), and
> we could actually test things.
> 
> Well, I did that anyway, because I'm not quite comfortable dropping the
> UINT16_MAX check in packet_add_do() without testing...

We're not dropping it, just raising the limit, fairly slightly.

> and yes, with
> this patch we can trigger a buffer overflow with vhost-user. In detail:
> 
> - muvm, virtio-net with 65535 bytes MTU:
> 
>   - ping -s 65492 works (65534 bytes "on wire" from pcap)
>   - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
>     on wire (that's with a newer kernel, probably that's the reason why
>     it's not the same limit as QEMU, below)

That's a guest driver bug.  If the MTU is 65535, it should be able to
send a 65535 byte IP datagram without fragmentation.  Looks like it
needs to clamp the MTU based on L2 limitations.

> - QEMU, virtio-net without vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - with -s 65494: "Bad frame size from guest, resetting connection"

That's our check, which I plan to fix in the MTU series.

> - QEMU, virtio-net with vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - ping -s 65494:
> 
>     *** buffer overflow detected ***: terminated
> 
>     without this patch, we catch that in packet_add_do() (and without
>     9/12 we don't crash)

Ouch.  That's a bug in our vhost-user code.  The easy fix would be to
clamp MTU to 65521, arguably more correct would be to decouple its
notion of maximum frame size from ETH_MAX_MTU.

> - tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
>   correct maximum though):

No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed
tap frame size.  Seems like there are other drivers that should also
be clamping their max MTU similarly, but aren't.

>   - ping -s 65493 works (65535 bytes on wire)
>   - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

This time the fragmentation is correct, because the MTU is only 65521.

> So, I guess we should find out the issue with vhost-user first.

Yeah.  I definitely need to intermingle the MTU series with this one
to get the order of operations right.

> Other than that, it looks like we can reach at least 65539 bytes if we
> add the "virtio-net" length descriptors, but still the frame size
> itself (which is actually what matters for the functions in packet.c)

Well... sort of.  The packet.c functions really care nothing about any
of the layers; it's just a blob of data to them.  I did miss that
tap_passt_input() excludes the qemu header before inserting the frame
into the pool, so indeed, the pool layer won't currently see a length
greater than 65535.  Of course, since that frame header *is* stored in
pkt_buf with all the rest, that's arguably not using the pool layer's
buffer bound checks to the full extent we could.

> can't exceed 65535 bytes, at least from my tests.
> 
> Then, yes, I agree that it's not actually correct, even though it fits
> all the use cases we have, because we *could* have an implementation
> exceeding that value (at the moment, it looks like we don't).

So, it seems like the Linux drivers might not actually generate
ethernet frames > 65535 bytes - although they don't all correctly
reduce their maximum MTU to reflect that.  I don't think we should
rely on that; AFAICT it would be reasonable for a driver + VMM
implementation to allow ethernet frames that are at least 65535 bytes
+ L2 header.  That might also allow for 18 byte 802.1Q VLAN L2
headers.

> > > fixes an actual issue as you seem to imply, actually checking that with
> > > QEMU and muvm would be nice.
> > > 
> > > By the way, as you mention a specific calculation, does it really make
> > > sense to use a "good enough" value here? Can we ever exceed 65538
> > > bytes, or can we use that as limit? It would be good to find out, while
> > > at it.  
> > 
> > So, yes, I think we can exceed 65538.  But more significantly, trying
> > to make the limit tight here feels like a borderline layering
> > violation.  The packet layer doesn't really care about the frame size
> > as long as it's "sane".
> 
> It might still be convenient for some backends to define "sane" as 64
> KiB. I'm really not sure if it is; I didn't look into the matching part
> of vhost-user in detail.

That's fair, but I don't think the pool layer should impose that limit
on the backends, because I think it's equally reasonable for another
backend to allow slightly larger frames with a 65535 byte L3 payload.
Hence setting the limit to 64k + "a little bit".

> If it's buggy because we have a user that can exceed that, sure, let's
> fix it. If not... also fine by me as it's some kind of theoretical
> flaw, but we shouldn't crash.
> 
> > Fwiw, in the draft changes I have improving
> > MTU handling, it's my intention that individual backends calculate
> > and/or enforce tighter limits of their own where practical, and
> > BUILD_ASSERT() that those fit within the packet layer's frame size
> > limit.
> 
> Ah, nice, that definitely sounds like an improvement.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
