From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: passt.top;
	dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=JrGj+MWF;
	dkim-atps=neutral
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by passt.top (Postfix) with ESMTPS id C5FED5A061A
	for <passt-dev@passt.top>; Mon, 06 Jan 2025 00:43:38 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1736120617;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=tU9CfD9VoL4M1hbVJK/36s3Fofm+bBA97NpUMcQZF4k=;
	b=JrGj+MWFnV3zh9zTOjBvB0geBdY9W2Z9U6lMS5NrkE4pgUk6yzEpzjOAWAJDseeFejsK+f
	EycdsBAUiUff3SiqI9fkNXnuzoIduLDW1um4W+N1PeYym21FF1C/ZKfBDZ3oAHKnT2Z3ML
	cysniw/EXSjSb0HW8+cOpagolSafMuk=
Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com
 [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-127-ALPJprROPMOMdK_OtoD8AA-1; Sun, 05 Jan 2025 18:43:35 -0500
X-MC-Unique: ALPJprROPMOMdK_OtoD8AA-1
X-Mimecast-MFC-AGG-ID: ALPJprROPMOMdK_OtoD8AA
Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-3862b364578so6738765f8f.1
        for <passt-dev@passt.top>; Sun, 05 Jan 2025 15:43:35 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1736120614; x=1736725414;
        h=content-transfer-encoding:mime-version:organization:references
         :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state
         :from:to:cc:subject:date:message-id:reply-to;
        bh=tU9CfD9VoL4M1hbVJK/36s3Fofm+bBA97NpUMcQZF4k=;
        b=efLmQCHOOslNjHJldLBnqJZoKyPqyTegzPgh1d71o9Q5x6xX4qdp2xD+QIJPyL2XgQ
         1EDhnqVQzmg2NgsrmmLxq/IYDXmZakdLIEPuaDsLgI7IlSgHVeg41C3q+I6LGhXVjwGc
         HcDhH6kWMTO22pAHfEpq1CHFXYGLLfLUG434fPjTL6AFvMVhzeFPT20CQu+7oR9Hwh/C
         bXMSX18blAassSEiZMpqiWhsdqpRk4icwQ9PO0wYIP8OLv6PeMzVJDmzev6T6ssTbA8D
         HSs273TqZQixsF1252cxpkKmoFtGGQfCUbYRc3Xdg7mTrqXP3D3hN59VFMYdCDgQjcmw
         G16g==
X-Gm-Message-State: AOJu0Yzq5puuv/ROhUiKuRR7oEwMcOxUx8YBY5nFmGzwcqtgOK65k1Jd
	Ff8PsQMvgMp8DUXs/R4zhFCJ8H648DZ+839sJSpvfM2h0jwBUUek8MEVFeD7fCFdslAzfCVJe5Z
	4UHPoGqbcDEuth9HYZvPasGU9SjVN2GtVjqs0eDNsluvFkgeCgH90B5/GpQ==
X-Gm-Gg: ASbGncuVGDivhXw0chBirMfhtvVEbjPCV9ZnJx4j/Io03+DQmG8sKOo5JIu7DG6oSU5
	+a3trcGRnz99f4rWeBzaEHV/REi3Inyh8Oy5xpdaSMvw2H7Zr5i3Yb/S5Yz5SoZeAu2K3o8759D
	jnupKlHvc/kaBiEpuaiOHuHZjUyd93UTFJODOXUL+QNmEuVyDV06soQXXXyt+Sjro1YUBFenR5U
	N0bMuM9GwGuPZXyIo44zxFx2pUL2ub8tJhXYTjUlE/aj+zySfeP6d0ajtdGCzYlp2gOn6Wbqxs5
	Zq7PcrG0Og==
X-Received: by 2002:a5d:6da1:0:b0:386:3afc:14a7 with SMTP id ffacd0b85a97d-38a1a1fdae4mr49296814f8f.7.1736120613764;
        Sun, 05 Jan 2025 15:43:33 -0800 (PST)
X-Google-Smtp-Source: AGHT+IErH+T1JfRh+kPyaB561uhyG7x6Tqk1BZT2TSa9/QSF5UmDzDz3AC+bI9/K6Ye11HbJcA3bkA==
X-Received: by 2002:a5d:6da1:0:b0:386:3afc:14a7 with SMTP id ffacd0b85a97d-38a1a1fdae4mr49296801f8f.7.1736120613116;
        Sun, 05 Jan 2025 15:43:33 -0800 (PST)
Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4])
        by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-38a1c8288b8sm46118953f8f.11.2025.01.05.15.43.30
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 05 Jan 2025 15:43:30 -0800 (PST)
Date: Mon, 6 Jan 2025 00:43:27 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to
 UINT16_MAX
Message-ID: <20250106004327.037b9924@elisabeth>
In-Reply-To: <Z3c6ffgMKOVIZRrc@zatzit>
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au>
	<20241220083535.1372523-7-david@gibson.dropbear.id.au>
	<20250101225433.45f52b86@elisabeth>
	<Z3XlLkHJpSCdGZ1L@zatzit>
	<20250102225948.2cfbd033@elisabeth>
	<Z3c6ffgMKOVIZRrc@zatzit>
Organization: Red Hat
X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: yiuPZXt73SOdk2lA13KyoYTP0LNdLD1TBL2Bvtw0-SY_1736120615
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID-Hash: IYKN5QUMRINDKMCYAQQINRVQM23UQOZA
X-Message-ID-Hash: IYKN5QUMRINDKMCYAQQINRVQM23UQOZA
X-MailFrom: sbrivio@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/20250106004327.037b9924@elisabeth/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/IYKN5QUMRINDKMCYAQQINRVQM23UQOZA/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Fri, 3 Jan 2025 12:16:45 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
> > On Thu, 2 Jan 2025 12:00:30 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:  
> > > > On Fri, 20 Dec 2024 19:35:29 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > We verify that every packet we store in a pool - and every partial packet
> > > > > we retrieve from it - has a length no longer than UINT16_MAX.  This
> > > > > originated in the older packet pool implementation which stored packet
> > > > > lengths in a uint16_t.  Now that packets are represented by a struct
> > > > > iovec with its size_t length, this check serves only as a sanity / security
> > > > > check that we don't have some wildly out of range length due to a bug
> > > > > elsewhere.
> > > > > 
> > > > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > > > stored in the pool is in fact an entire frame including both L2 and any
> > > > > backend specific headers.  We can exceed this in passt mode, even with the
> > > > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > > > 4 bytes of qemu stream length header = 65538 bytes.
> > > > > 
> > > > > Introduce our own define for the maximum length of a packet in the pool and
> > > > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > > > specific headers.  We'll use different amounts of that depending on the
> > > > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > > > to be 100% tight.    
> > > > 
> > > > I couldn't find the time to check what's the maximum amount of bytes we
> > > > can get here depending on hypervisor and interface, but if this patch    
> > > 
> > > So, it's a separate calculation for each backend type, and some of
> > > them are pretty tricky.
> > > 
> > > For anything based on the kernel tap device it is 65535, because it
> > > has an internal frame size limit of 65535, already including any L2
> > > headers (it explicitly limits the MTU to 65535 - hard_header_len).
> > > There is no "hardware" header.
> > > 
> > > For the qemu stream protocol it gets pretty complicated, because there
> > > are multiple layers which could clamp the maximum size.  It doesn't
> > > look like the socket protocol code itself imposes a limit beyond the
> > > structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> > > which could be less for 32-bit systems).  AFAICT, it's not
> > > theoretically impossible to have gigabyte frames with a weird virtual
> > > NIC model... though obviously that wouldn't be IP, and probably not
> > > even Ethernet.  
> > 
> > Theoretically speaking, it could actually be IPv6 with Jumbograms. They
> > never really gained traction (because of Ethernet, I think) and we don't
> > support them, but the only attempt to deprecate them I'm aware of
> > didn't succeed (yet):
> > 
> >   https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
> > 
> > ...and I actually wonder if somebody will ever try to revive them for
> > virtual networks, where they might make somewhat more sense (you could
> > transfer filesystems with one packet per file or similar silly tricks).  
> 
> Hm, yes.  Well, one problem at a time.  Well, ok, 2 or 3 problems at a
> time.
> 
> > > Each virtual NIC could have its own limit.  I suspect that's going to
> > > be in the vicinity of 64k.  But, I'm really struggling to figure out
> > > what it is just for virtio-net, so I really don't want to try to
> > > figure it out for all of them.  With a virtio-net NIC, I seem to be
> > > able to set MTU all the way up to 65535 successfully, which implies a
> > > maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> > > header) = 65553 at least.  
> > 
> > The Layer-2 header is included (because that also happens to be
> > ETH_MAX_MTU, on Linux), it wouldn't be on top.  
> 
> No.  Or if so, that's a guest side bug.

I see your point, but it's pretty much a universal bug. I've never seen
a packet interface on Linux that's able to send more than 65535 bytes,
and yet, there are several drivers that allow the MTU to be 65535
bytes.

Yes, they should be fixed, eventually, but I guess the obstacle to
fixing them is that there are, of course, two ways to fix that.

The correct one (enabling frames to be bigger than 64 KiB) would
probably uncover all kinds of issues and perhaps kill throughput in many
cases. The wrong one (clamping the MTU) is... well, wrong. But it would
be the only sane option, I suppose.

> The MTU set with ip-link is
> the maximum L3 size - at the default 1500, the L2 frame can be 1514
> bytes.  If the driver can't send an L2 frame greater than 65535 bytes,
> then it should clamp the MTU to (65535 - hard_header_len) like tuntap
> already does.
> 
> I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
> which can be had in an ethernet frame (that's longer), or is it the
> maximum ethernet frame size (leading to an IP mtu of 65521).

Historically, I think it used to be something along the lines of
"maximum frame size of something that looks like Ethernet".

Note that the maximum payload size allowed by 802.3 is 1500 bytes (1518
bytes of frame, with header and FCS). With Jumbo frames, one can
typically have 9000 or 9216 bytes, but there's no defined standard.
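
To keep the two notions apart (L3 MTU versus L2 frame size), here is a
minimal sketch of the arithmetic discussed in this thread. All names
are made up for illustration, they come neither from passt nor from
kernel headers, and the Ethernet header is assumed to be 14 bytes (no
VLAN tag, no FCS):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants, not taken from passt or kernel headers */
#define SKETCH_ETH_HLEN		14U	/* Ethernet header, no VLAN, no FCS */
#define SKETCH_QEMU_LEN_HDR	4U	/* qemu stream length prefix */
#define SKETCH_TAP_FRAME_MAX	65535U	/* tap limit, L2 header included */

/* Size of the L2 frame carrying an L3 datagram of 'mtu' bytes */
static inline size_t sketch_l2_frame_len(size_t mtu)
{
	return mtu + SKETCH_ETH_HLEN;
}

/* Bytes on a qemu stream socket for a full-sized datagram */
static inline size_t sketch_stream_len(size_t mtu)
{
	return sketch_l2_frame_len(mtu) + SKETCH_QEMU_LEN_HDR;
}

/* Largest MTU whose frame still fits in 'frame_max' bytes */
static inline size_t sketch_mtu_clamp(size_t frame_max)
{
	assert(frame_max >= SKETCH_ETH_HLEN);
	return frame_max - SKETCH_ETH_HLEN;
}

/* The tuntap-style clamp mentioned in the thread: 65535 - 14 */
_Static_assert(SKETCH_TAP_FRAME_MAX - SKETCH_ETH_HLEN == 65521U,
	       "tap MTU clamp");
```

With these helpers, an MTU of 1500 gives a 1514-byte frame, 65520
gives 65538 bytes on a stream socket, and 65535 gives 65553.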

> I plan
> to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU
> series.  I should merge that with this one, it might make the context
> clearer.

Well, but it's used in the kernel anyway, and that's where the
confusion comes from.

> AFAICT there is *no* structural limit on the size of an ethernet
> frame; the length isn't in any header, it's just assumed to be
> reported out of band by the hardware.  No theoretical reason that
> reporting mechanism couldn't allow lengths > 65535, whether slightly
> (65535 bytes of payload + header & FCS) or vastly.

Same as my understanding.

> > > Similar situation for vhost-user, where I'm finding it even more
> > > inscrutable to figure out what limits are imposed at the sub-IP
> > > levels.  At the moment the "hardware" header
> > > (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> > > packet.c layer, but we might have reasons to change that.
> > > 
> > > So, any sub-IP limits for qemu, I'm basically not managing to find.
> > > However, we (more or less) only care about IP, which imposes a more
> > > practical limit of: 65535 + L2 header size + "hardware" header size.
> > > 
> > > At present that maxes out at 65553, as above, but if we ever support
> > > other L2 encapsulations, or other backend protocols with larger
> > > "hardware" headers, that could change.  
> > 
> > Okay. I was thinking of a more practical approach, based on the fact
> > that we only support Ethernet anyway, with essentially four types of
> > adapters (three virtio-net implementations, and tap), plus rather rare
> > reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we
> > could actually test things. 
> > 
> > Well, I did that anyway, because I'm not quite comfortable dropping the
> > UINT16_MAX check in packet_add_do() without testing...  
> 
> We're not dropping it, just raising the limit, fairly slightly.
> 
> > and yes, with
> > this patch we can trigger a buffer overflow with vhost-user. In detail:
> > 
> > - muvm, virtio-net with 65535 bytes MTU:
> > 
> >   - ping -s 65492 works (65534 bytes "on wire" from pcap)
> >   - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
> >     on wire (that's with a newer kernel, probably that's the reason why
> >     it's not the same limit as QEMU, below)  
> 
> That's a guest driver bug.  If the MTU is 65535, it should be able to
> send a 65535 byte IP datagram without fragmentation.  Looks like it
> needs to clamp the MTU based on L2 limitations.

I would have assumed that the issue is in the virtio-net driver in the
Linux kernel (not the network implementation in libkrun or in the
virtio-drivers Rust crate), but what's interesting is that we have one
byte of difference with virtio-net as implemented by QEMU... so
probably my assumption is wrong.
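
For reference, the "on wire" sizes in these tests are consistent with
plain IPv4 overhead. A minimal sketch, assuming an ICMP echo over IPv4
without options and an untagged Ethernet header (function and macro
names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header sizes: ICMP echo, IPv4 without options, Ethernet */
#define SKETCH_ICMP_HLEN	8U
#define SKETCH_IP4_HLEN		20U
#define SKETCH_ETH_HLEN		14U
#define SKETCH_PING_OVERHEAD	\
	(SKETCH_ICMP_HLEN + SKETCH_IP4_HLEN + SKETCH_ETH_HLEN)

/* L2 frame size for an unfragmented 'ping -s payload' over IPv4 */
static inline size_t sketch_ping_frame_len(size_t payload)
{
	assert(payload <= SIZE_MAX - SKETCH_PING_OVERHEAD);
	return payload + SKETCH_PING_OVERHEAD;
}
```

That is, -s 65492 gives 65534 bytes, -s 65493 gives 65535 bytes, and
-s 65494 would need a 65536-byte frame, one byte past 65535.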

> > - QEMU, virtio-net without vhost-user, 65535 bytes MTU:
> > 
> >   - ping -s 65493 works (65535 bytes "on wire" from pcap)
> >   - with -s 65494: "Bad frame size from guest, resetting connection"  
> 
> That's our check, which I plan to fix in the MTU series.
> 
> > - QEMU, virtio-net with vhost-user, 65535 bytes MTU:
> > 
> >   - ping -s 65493 works (65535 bytes "on wire" from pcap)
> >   - ping -s 65494:
> > 
> >     *** buffer overflow detected ***: terminated
> > 
> >     without this patch, we catch that in packet_add_do() (and without
> >     9/12 we don't crash)  
> 
> Ouch.  That's a bug in our vhost-user code.  The easy fix would be to
> clamp MTU to 65521, arguably more correct would be to decouple its
> notion of maximum frame size from ETH_MAX_MTU.

Where would you clamp that? I'm not sure if it's something that we can
negotiate over the vhost-user protocol. If we can't, then we need to
find another solution for compatibility, even if we fix it in the
kernel.

> > - tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
> >   correct maximum though):  
> 
> No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed
> tap frame size.  Seems like there are other drivers that should also
> be clamping their max MTU similarly, but aren't.

I don't remember exactly why now, but because of some combination of
requirements from normative references, you can't really have an MTU
that's not a multiple of (32-bit) IPv4 words (and things actually
break with 65521... maybe TCP?). See also passt(1).

That's why we use 65520 by default. I'll try to find out if you're
interested.
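
Just to show the arithmetic, not the reason behind it: rounding the tap
maximum of 65521 down to a multiple of four bytes gives 65520. A
minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Round 'x' down to a multiple of 'align', which must be a power of 2 */
static inline size_t sketch_round_down(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}
```

Values that are already aligned, such as 1500 or 65520, are left
unchanged.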

> >   - ping -s 65493 works (65535 bytes on wire)
> >   - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)  
> 
> This time the fragmentation is correct, because the MTU is only 65521.
> 
> > So, I guess we should find out the issue with vhost-user first.  
> 
> Yeah.  I definitely need to intermingle the MTU series with this one
> to get the order of operations right.
> 
> > Other than that, it looks like we can reach at least 65539 bytes if we
> > add the "virtio-net" length descriptors, but still the frame size
> > itself (which is actually what matters for the functions in packet.c)  
> 
> Well.. sort of.  The packet.c functions really care nothing about any
> of the layers, it's just a blob of data to them.  I did miss that
> tap_passt_input() excludes the qemu header before inserting the frame
> into the pool, so indeed, the pool layer won't currently see length
> greater than 65535.  Of course, since that frame header *is* stored in
> pkt_buf with all the rest, that's arguably not using the pool layer's
> buffer bound checks to the full extent we could.

Hm, yes, true.

But on the other hand, if we just say "plus 128", then we're not using
bound checks to the fullest extent we can, either.
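
To illustrate the trade-off, a minimal sketch comparing the tight
bound with the "plus 128" one. The macro and function names are
hypothetical, they are not the ones from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: slack for L2 and backend headers, loose bound */
#define SKETCH_HDR_SLACK	128U
#define SKETCH_POOL_MAX_LEN	((size_t)UINT16_MAX + SKETCH_HDR_SLACK)

/* Generic sanity check of a length against a given limit */
static inline bool sketch_len_ok(size_t len, size_t limit)
{
	return len <= limit;
}

/* Pool-style check: abort on wildly out-of-range lengths */
static inline size_t sketch_checked_len(size_t len)
{
	assert(sketch_len_ok(len, SKETCH_POOL_MAX_LEN));
	return len;
}
```

A 65538-byte frame passes the loose check but not the UINT16_MAX one,
and anything up to 65663 bytes goes through unnoticed, which is the
part of the bound checking we give up.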

> > can't exceed 65535 bytes, at least from my tests.
> > 
> > Then, yes, I agree that it's not actually correct, even though it fits
> > all the use cases we have, because we *could* have an implementation
> > exceeding that value (at the moment, it looks like we don't).  
> 
> So, it seems like the Linux drivers might not actually generate
> ethernet frames > 65535 bytes - although they don't all correctly
> reduce their maximum MTU to reflect that.  I don't think we should
> rely on that; AFAICT it would be reasonable for a driver + VMM
> implementation to allow ethernet frames that are at least 65535 bytes
> + L2 header.  That might also allow for 16 byte 802.1q vlan L2
> headers.

If it's convenient in our implementation (I think it is, especially on
the opposite path), then I think we can kind of rely on it, in the
sense that we could "simply" be robust to drivers that send out frames
bigger than 65535 bytes (I've seen none to date, not even with VLANs),
and change things if somebody ever needs that.

I mean, probably, the people who're the most likely to try and add
support for bigger frames in hypervisors/kernel at some point in the
future are actually us, because with user-mode networking the
guest-facing MTU is almost entirely useless.

> > > > fixes an actual issue as you seem to imply, actually checking that with
> > > > QEMU and muvm would be nice.
> > > > 
> > > > By the way, as you mention a specific calculation, does it really make
> > > > sense to use a "good enough" value here? Can we ever exceed 65538
> > > > bytes, or can we use that as limit? It would be good to find out, while
> > > > at it.    
> > > 
> > > So, yes, I think we can exceed 65538.  But more significantly, trying
> > > to make the limit tight here feels like a borderline layering
> > > violation.  The packet layer doesn't really care about the frame size
> > > as long as it's "sane".  
> > 
> > It might still be convenient for some back-ends to define "sane" as 64
> > KiB. I'm really not sure if it is, I didn't look into the matching part
> > of vhost-user in detail.  
> 
> That's fair, but I don't think the pool layer should impose that limit
> on the backends, because I think it's equally reasonable for another
> backend to allow slightly larger frames with a 65535 byte L3 payload.
> Hence setting the limit to 64k + "a little bit".
> 
> > If it's buggy because we have a user that can exceed that, sure, let's
> > fix it. If not... also fine by me as it's some kind of theoretical
> > flaw, but we shouldn't crash.
> >   
> > > Fwiw, in the draft changes I have improving
> > > MTU handling, it's my intention that individual backends calculate
> > > and/or enforce tighter limits of their own where practical, and
> > > BUILD_ASSERT() that those fit within the packet layer's frame size
> > > limit.  
> > 
> > Ah, nice, that definitely sounds like an improvement.
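
For what it's worth, the kind of build-time check described there
could look roughly like this. It's a sketch only: the macro is a
stand-in built on C11 _Static_assert, not passt's BUILD_ASSERT(), and
the per-backend limits are assumed values for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BUILD_ASSERT()-style macro */
#define SKETCH_BUILD_ASSERT(cond)	_Static_assert((cond), #cond)

/* Assumed limits, for illustration only */
#define SKETCH_POOL_MAX_LEN	((size_t)UINT16_MAX + 128)
#define SKETCH_TAP_MAX_FRAME	65535
#define SKETCH_STREAM_MAX_FRAME	(65535 + 14)

/* Each backend's own limit has to fit within the pool's limit */
SKETCH_BUILD_ASSERT(SKETCH_TAP_MAX_FRAME <= SKETCH_POOL_MAX_LEN);
SKETCH_BUILD_ASSERT(SKETCH_STREAM_MAX_FRAME <= SKETCH_POOL_MAX_LEN);

/* Runtime counterpart, for limits only known at start-up */
static inline void sketch_check_backend_limit(size_t backend_max)
{
	assert(backend_max <= SKETCH_POOL_MAX_LEN);
}
```

If a backend limit ever grows past the pool limit, the build fails
instead of the pool check tripping at runtime.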

-- 
Stefano