Date: Fri, 3 Jan 2025 12:16:45 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au> <20241220083535.1372523-7-david@gibson.dropbear.id.au> <20250101225433.45f52b86@elisabeth> <20250102225948.2cfbd033@elisabeth>
In-Reply-To: <20250102225433.45f52b86@elisabeth>

On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 12:00:30 +1100
> David Gibson wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:29 +1100
> > > David Gibson wrote:
> > > 
> > > > We verify that every packet we store in a pool - and every partial
> > > > packet we retrieve from it - has a length no longer than UINT16_MAX.
> > > > This originated in the older packet pool implementation, which
> > > > stored packet lengths in a uint16_t. Now that packets are
> > > > represented by a struct iovec with its size_t length, this check
> > > > serves only as a sanity / security check that we don't have some
> > > > wildly out of range length due to a bug elsewhere.
> > > >
> > > > However, UINT16_MAX (65535) isn't quite enough, because the
> > > > "packet" as stored in the pool is in fact an entire frame,
> > > > including both L2 and any backend specific headers. We can exceed
> > > > this in passt mode, even with the default MTU: 65520 bytes of IP
> > > > datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream
> > > > length header = 65538 bytes.
> > > >
> > > > Introduce our own define for the maximum length of a packet in the
> > > > pool and set it slightly larger, allowing 128 bytes for L2 and/or
> > > > other backend specific headers. We'll use different amounts of
> > > > that depending on the tap backend, but since this is just a sanity
> > > > check, the bound doesn't need to be 100% tight.
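
(For concreteness, that description translates into something like the
sketch below; hypothetical name and exact value, not necessarily what
the patch itself ends up with:)

	/* Maximum length of a packet stored in a pool, including L2 and
	 * backend specific headers; a sanity bound only, so 64 KiB of
	 * payload plus some headroom is enough */
	#define PACKET_MAX_LEN	((size_t)(65536 + 128))

	/* ...with packet_add_do() and packet_get_do() then checking
	 * lengths against PACKET_MAX_LEN instead of UINT16_MAX */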

> > > 
> > > I couldn't find the time to check what's the maximum amount of bytes
> > > we can get here depending on hypervisor and interface, but if this
> > > patch
> > 
> > So, it's a separate calculation for each backend type, and some of
> > them are pretty tricky.
> > 
> > For anything based on the kernel tap device it is 65535, because it
> > has an internal frame size limit of 65535, already including any L2
> > headers (it explicitly limits the MTU to 65535 - hard_header_len).
> > There is no "hardware" header.
> > 
> > For the qemu stream protocol it gets pretty complicated, because there
> > are multiple layers which could clamp the maximum size. It doesn't
> > look like the socket protocol code itself imposes a limit beyond the
> > structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> > which could be less for 32-bit systems). AFAICT, it's not
> > theoretically impossible to have gigabyte frames with a weird virtual
> > NIC model... though obviously that wouldn't be IP, and probably not
> > even Ethernet.
> 
> Theoretically speaking, it could actually be IPv6 with Jumbograms. They
> never really gained traction (because of Ethernet, I think) and we don't
> support them, but the only attempt to deprecate them I'm aware of
> didn't succeed (yet):
> 
> https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
> 
> ...and I actually wonder if somebody will ever try to revive them for
> virtual networks, where they might make somewhat more sense (you could
> transfer filesystems with one packet per file or similar silly tricks).

Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a
time.

> > Each virtual NIC could have its own limit. I suspect that's going to
> > be in the vicinity of 64k. But, I'm really struggling to figure out
> > what it is just for virtio-net, so I really don't want to try to
> > figure it out for all of them. With a virtio-net NIC, I seem to be
> > able to set the MTU all the way up to 65535 successfully, which
> > implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream
> > protocol header) = 65553 at least.
> 
> The Layer-2 header is included (because that also happens to be
> ETH_MAX_MTU, on Linux), it wouldn't be on top.

No. Or if so, that's a guest side bug. The MTU set with ip-link is the
maximum L3 size - at the default of 1500, the L2 frame can be 1514
bytes. If the driver can't send an L2 frame greater than 65535 bytes,
then it should clamp the MTU to (65535 - hard_header_len) like tuntap
already does.

I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
which can be had in an Ethernet frame (that's longer), or is it the
maximum Ethernet frame size (leading to an IP MTU of 65521)? I plan to
eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU
series. I should merge that with this one; it might make the context
clearer.
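
(In code terms, the invariant I'm arguing for is roughly the following;
a hypothetical helper, just to illustrate, not anything in the tree:)

	/* Largest L3 MTU a driver should advertise, given the largest
	 * L2 frame the device can actually pass */
	static unsigned mtu_max(unsigned frame_max, unsigned hard_header_len)
	{
		/* The MTU counts L3 payload only, so L2 overhead comes
		 * out of the frame budget, as tuntap already does:
		 * 65535 - 14 = 65521 for plain Ethernet */
		return frame_max - hard_header_len;
	}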

AFAICT there is *no* structural limit on the size of an Ethernet frame;
the length isn't in any header, it's just assumed to be reported out of
band by the hardware. No theoretical reason that reporting mechanism
couldn't allow lengths > 65535, whether slightly (65535 bytes of
payload + header & FCS) or vastly.

> > Similar situation for vhost-user, where I'm finding it even more
> > inscrutable to figure out what limits are imposed at the sub-IP
> > levels. At the moment the "hardware" header
> > (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> > packet.c layer, but we might have reasons to change that.
> > 
> > So, any sub-IP limits for qemu, I'm basically not managing to find.
> > However, we (more or less) only care about IP, which imposes a more
> > practical limit of: 65535 + L2 header size + "hardware" header size.
> > 
> > At present that maxes out at 65553, as above, but if we ever support
> > other L2 encapsulations, or other backend protocols with larger
> > "hardware" headers, that could change.
> 
> Okay. I was thinking of a more practical approach, based on the fact
> that we only support Ethernet anyway, with essentially four types of
> adapters (three virtio-net implementations, and tap), plus rather rare
> reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107), and
> we could actually test things.
> 
> Well, I did that anyway, because I'm not quite comfortable dropping the
> UINT16_MAX check in packet_add_do() without testing...

We're not dropping it, just raising the limit, fairly slightly.

> and yes, with this patch we can trigger a buffer overflow with
> vhost-user. In detail:
> 
> - muvm, virtio-net with 65535 bytes MTU:
> 
>   - ping -s 65492 works (65534 bytes "on wire" from pcap)
>   - ping -s 65493 doesn't because the guest fragments: 65530 plus 39
>     bytes on wire (that's with a newer kernel, probably that's the
>     reason why it's not the same limit as QEMU, below)

That's a guest driver bug. If the MTU is 65535, it should be able to
send a 65535 byte IP datagram without fragmentation. Looks like it
needs to clamp the MTU based on L2 limitations.

> - QEMU, virtio-net without vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - with -s 65494: "Bad frame size from guest, resetting connection"

That's our check, which I plan to fix in the MTU series.

> - QEMU, virtio-net with vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - ping -s 65494:
> 
>     *** buffer overflow detected ***: terminated
> 
>   without this patch, we catch that in packet_add_do() (and without
>   9/12 we don't crash)

Ouch. That's a bug in our vhost-user code. The easy fix would be to
clamp the MTU to 65521, arguably more correct would be to decouple its
notion of maximum frame size from ETH_MAX_MTU.

> - tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
>   correct maximum though):

No, 65521 is correct: 65521 + 14 = 65535, which is the maximum allowed
tap frame size. Seems like there are other drivers that should also be
clamping their max MTU similarly, but aren't.

>   - ping -s 65493 works (65535 bytes on wire)
>   - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

This time the fragmentation is correct, because the MTU is only 65521.

> So, I guess we should find out the issue with vhost-user first.

Yeah. I definitely need to intermingle the MTU series with this one to
get the order of operations right.
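
(As a cross-check of the numbers above; my arithmetic, assuming IPv4
without options and plain Ethernet. ping -s N sends N bytes of ICMP
payload, so:)

	65493 + 8 (ICMP header) + 20 (IPv4 header) = 65521 byte IP datagram
	65521 + 14 (Ethernet header)               = 65535 bytes on the wire

(which is exactly the maximum tap/Ethernet frame, and why -s 65494 is
the first size that has to either fragment or overflow.)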

> Other than that, it looks like we can reach at least 65539 bytes if we
> add the "virtio-net" length descriptors, but still the frame size
> itself (which is actually what matters for the functions in packet.c)

Well... sort of. The packet.c functions really care nothing about any
of the layers; it's just a blob of data to them.

I did miss that tap_passt_input() excludes the qemu header before
inserting the frame into the pool, so indeed, the pool layer won't
currently see a length greater than 65535. Of course, since that frame
header *is* stored in pkt_buf with all the rest, that's arguably not
using the pool layer's buffer bound checks to the full extent we could.

> can't exceed 65535 bytes, at least from my tests.
> 
> Then, yes, I agree that it's not actually correct, even though it fits
> all the use cases we have, because we *could* have an implementation
> exceeding that value (at the moment, it looks like we don't).

So, it seems like the Linux drivers might not actually generate
Ethernet frames > 65535 bytes - although they don't all correctly
reduce their maximum MTU to reflect that. I don't think we should rely
on that; AFAICT it would be reasonable for a driver + VMM
implementation to allow Ethernet frames of at least 65535 bytes + L2
header. That might also allow for 18 byte 802.1q VLAN L2 headers.

> > > fixes an actual issue as you seem to imply, actually checking that
> > > with QEMU and muvm would be nice.
> > > 
> > > By the way, as you mention a specific calculation, does it really
> > > make sense to use a "good enough" value here? Can we ever exceed
> > > 65538 bytes, or can we use that as limit? It would be good to find
> > > out, while at it.
> > 
> > So, yes, I think we can exceed 65538. But more significantly, trying
> > to make the limit tight here feels like a borderline layering
> > violation. The packet layer doesn't really care about the frame size
> > as long as it's "sane".
> 
> It might still be convenient for some back-ends to define "sane" as 64
> KiB. I'm really not sure if it is, I didn't look into the matching part
> of vhost-user in detail.

That's fair, but I don't think the pool layer should impose that limit
on the backends, because I think it's equally reasonable for another
backend to allow slightly larger frames with a 65535 byte L3 payload.
Hence setting the limit to 64k + "a little bit".

> If it's buggy because we have a user that can exceed that, sure, let's
> fix it. If not... also fine by me as it's some kind of theoretical
> flaw, but we shouldn't crash.
> 
> > Fwiw, in the draft changes I have improving MTU handling, it's my
> > intention that individual backends calculate and/or enforce tighter
> > limits of their own where practical, and BUILD_ASSERT() that those
> > fit within the packet layer's frame size limit.
> 
> Ah, nice, that definitely sounds like an improvement.
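
(Something like this, say; the frame-size constants are hypothetical
names just to give the flavour of those BUILD_ASSERT()s, with
PACKET_MAX_LEN as sketched earlier:)

	/* Each backend derives its own tight bound... */
	#define TAP_FRAME_MAX	65535U			/* tap: limit includes L2 */
	#define QEMU_FRAME_MAX	(65535U + 14U + 4U)	/* L3 max + Ethernet + stream header */

	/* ...and we check statically that it fits what the pool accepts */
	BUILD_ASSERT(TAP_FRAME_MAX <= PACKET_MAX_LEN);
	BUILD_ASSERT(QEMU_FRAME_MAX <= PACKET_MAX_LEN);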

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson