Date: Thu, 2 Jan 2025 22:59:48 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top
Subject: Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
Message-ID: <20250102225948.2cfbd033@elisabeth>
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au>
 <20241220083535.1372523-7-david@gibson.dropbear.id.au>
 <20250101225433.45f52b86@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Thu, 2 Jan 2025 12:00:30 +1100
David Gibson wrote:

> On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:29 +1100
> > David Gibson wrote:
> > 
> > > We verify that every packet we store in a pool - and every partial packet
> > > we retrieve from it - has a length no longer than UINT16_MAX. This
> > > originated in the older packet pool implementation which stored packet
> > > lengths in a uint16_t.
> > > Now that packets are represented by a struct
> > > iovec with its size_t length, this check serves only as a sanity / security
> > > check that we don't have some wildly out of range length due to a bug
> > > elsewhere.
> > > 
> > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > stored in the pool is in fact an entire frame including both L2 and any
> > > backend specific headers. We can exceed this in passt mode, even with the
> > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > 4 bytes of qemu stream length header = 65538 bytes.
> > > 
> > > Introduce our own define for the maximum length of a packet in the pool and
> > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > specific headers. We'll use different amounts of that depending on the
> > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > to be 100% tight.
> > 
> > I couldn't find the time to check what's the maximum amount of bytes we
> > can get here depending on hypervisor and interface, but if this patch
> 
> So, it's a separate calculation for each backend type, and some of
> them are pretty tricky.
> 
> For anything based on the kernel tap device it is 65535, because it
> has an internal frame size limit of 65535, already including any L2
> headers (it explicitly limits the MTU to 65535 - hard_header_len).
> There is no "hardware" header.
> 
> For the qemu stream protocol it gets pretty complicated, because there
> are multiple layers which could clamp the maximum size. It doesn't
> look like the socket protocol code itself imposes a limit beyond the
> structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> which could be less for 32-bit systems). AFAICT, it's not
> theoretically impossible to have gigabyte frames with a weird virtual
> NIC model... though obviously that wouldn't be IP, and probably not
> even Ethernet.
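
For reference, the arithmetic from the quoted commit message can be written
down as a small C sketch. All macro and function names below are
hypothetical and not taken from the passt sources, and the pool limit of
UINT16_MAX + 128 is an assumption based on the description "allowing 128
bytes" for headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes quoted in the commit message */
#define IP_DGRAM_DEFAULT_MAX	65520u	/* IP datagram with the default MTU */
#define ETH_HDR_LEN		14u	/* Ethernet header */
#define QEMU_STREAM_HDR_LEN	4u	/* qemu stream length header */

/* Assumed, hypothetical limit: UINT16_MAX plus 128 bytes of header room */
#define POOL_HDR_HEADROOM	128u
#define POOL_PACKET_MAX		(UINT16_MAX + POOL_HDR_HEADROOM)

/* Bytes stored in the pool for one frame in passt mode, given L3 length */
static inline size_t passt_frame_len(size_t l3len)
{
	return l3len + ETH_HDR_LEN + QEMU_STREAM_HDR_LEN;
}

/* Sanity check only: nonzero if the length is within the assumed limit */
static inline int pool_len_ok(size_t len)
{
	return len <= POOL_PACKET_MAX;
}

/* 65520 + 14 + 4 = 65538, which no longer fits in 16 bits */
static_assert(IP_DGRAM_DEFAULT_MAX + ETH_HDR_LEN + QEMU_STREAM_HDR_LEN ==
	      65538u, "frame length from the commit message");
static_assert(65538u > UINT16_MAX, "exceeds the old UINT16_MAX check");
```

The check is deliberately loose: it only rejects wildly out-of-range
lengths, it doesn't try to be exact for any given backend.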

Theoretically speaking, it could actually be IPv6 with Jumbograms. They
never really gained traction (because of Ethernet, I think) and we don't
support them, but the only attempt to deprecate them I'm aware of didn't
succeed (yet):

  https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/

...and I actually wonder if somebody will ever try to revive them for
virtual networks, where they might make somewhat more sense (you could
transfer filesystems with one packet per file or similar silly tricks).

> Each virtual NIC could have its own limit. I suspect that's going to
> be in the vicinity of 64k. But, I'm really struggling to figure out
> what it is just for virtio-net, so I really don't want to try to
> figure it out for all of them. With a virtio-net NIC, I seem to be
> able to set MTU all the way up to 65535 successfully, which implies a
> maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> header) = 65553 at least.

The Layer-2 header is included (because that also happens to be
ETH_MAX_MTU, on Linux), it wouldn't be on top.

> Similar situation for vhost-user, where I'm finding it even more
> inscrutable to figure out what limits are imposed at the sub-IP
> levels. At the moment the "hardware" header
> (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> packet.c layer, but we might have reasons to change that.
> 
> So, any sub-IP limits for qemu, I'm basically not managing to find.
> However, we (more or less) only care about IP, which imposes a more
> practical limit of: 65535 + L2 header size + "hardware" header size.
> 
> At present that maxes out at 65553, as above, but if we ever support
> other L2 encapsulations, or other backend protocols with larger
> "hardware" headers, that could change.

Okay.
I was thinking of a more practical approach, based on the fact that
we only support Ethernet anyway, with essentially four types of
adapters (three virtio-net implementations, and tap), plus rather rare
reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we
could actually test things.

Well, I did that anyway, because I'm not quite comfortable dropping the
UINT16_MAX check in packet_add_do() without testing... and yes, with
this patch we can trigger a buffer overflow with vhost-user. In detail:

- muvm, virtio-net with 65535 bytes MTU:

  - ping -s 65492 works (65534 bytes "on wire" from pcap)
  - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
    on wire (that's with a newer kernel, probably that's the reason why
    it's not the same limit as QEMU, below)

- QEMU, virtio-net without vhost-user, 65535 bytes MTU:

  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - with -s 65494: "Bad frame size from guest, resetting connection"

- QEMU, virtio-net with vhost-user, 65535 bytes MTU:

  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - ping -s 65494:

      *** buffer overflow detected ***: terminated

    without this patch, we catch that in packet_add_do() (and without
    9/12 we don't crash)

- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
  correct maximum though):

  - ping -s 65493 works (65535 bytes on wire)
  - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

So, I guess we should find out the issue with vhost-user first.

Other than that, it looks like we can reach at least 65539 bytes if we
add the "virtio-net" length descriptors, but still the frame size
itself (which is actually what matters for the functions in packet.c)
can't exceed 65535 bytes, at least from my tests.

Then, yes, I agree that it's not actually correct, even though it fits
all the use cases we have, because we *could* have an implementation
exceeding that value (at the moment, it looks like we don't).
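
The "on wire" sizes in the test list above are consistent with plain
IPv4 ping over Ethernet. A minimal C sketch of that arithmetic, assuming
IPv4 without options and an 8-byte ICMP echo header (the names are
hypothetical, and the 4-byte descriptor is the length prefix mentioned
earlier in the thread):

```c
#include <assert.h>
#include <stddef.h>

#define ICMP_HDR_LEN	8u	/* ICMP echo header */
#define IPV4_HDR_LEN	20u	/* IPv4 header, assuming no options */
#define ETH_HDR_LEN	14u	/* Ethernet header */
#define LEN_DESC_LEN	4u	/* 4-byte length descriptor, not part of the frame */

/* Ethernet frame length resulting from "ping -s <payload>" over IPv4 */
static inline size_t ping_frame_len(size_t payload)
{
	return payload + ICMP_HDR_LEN + IPV4_HDR_LEN + ETH_HDR_LEN;
}

/* -s 65493 gives exactly 65535 bytes of frame, the largest size observed */
static_assert(65493u + ICMP_HDR_LEN + IPV4_HDR_LEN + ETH_HDR_LEN == 65535u,
	      "largest unfragmented ping in the tests");
/* Adding the length descriptor gives the 65539 bytes mentioned above */
static_assert(65535u + LEN_DESC_LEN == 65539u,
	      "frame plus length descriptor");
```

With these assumptions, -s 65494 would need a 65536-byte frame, one byte
more than the largest frame seen in any of the tests.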

> > fixes an actual issue as you seem to imply, actually checking that with
> > QEMU and muvm would be nice.
> > 
> > By the way, as you mention a specific calculation, does it really make
> > sense to use a "good enough" value here? Can we ever exceed 65538
> > bytes, or can we use that as limit? It would be good to find out, while
> > at it.
> 
> So, yes, I think we can exceed 65538. But more significantly, trying
> to make the limit tight here feels like a borderline layering
> violation. The packet layer doesn't really care about the frame size
> as long as it's "sane".

It might still be convenient for some back-ends to define "sane" as 64
KiB. I'm really not sure if it is, I didn't look into the matching part
of vhost-user in detail.

If it's buggy because we have a user that can exceed that, sure, let's
fix it. If not... also fine by me as it's some kind of theoretical
flaw, but we shouldn't crash.

> Fwiw, in the draft changes I have improving
> MTU handling, it's my intention that individual backends calculate
> and/or enforce tighter limits of their own where practical, and
> BUILD_ASSERT() that those fit within the packet layer's frame size
> limit.

Ah, nice, that definitely sounds like an improvement.

-- 
Stefano