From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3])
	by passt.top (Postfix) with ESMTPS id 7266B5A004E
	for ; Fri, 26 Jul 2024 03:27:12 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202312; t=1721957222;
	bh=F60asz/4QTMDeaibFD7ua6qfkhLeN9582k9Gn7Xvs48=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=UC6Ci2N8VxgLy4E16WKY1FHVtvZTWiSk6EbH3wSO09o0SYBIlrWmgLD59OwzSSVup
	 y21PpOZrb50nloXIoBbNdWxDtpPFdcRhKrWBJSgkAuyRiePhz2rRYF3T/HRueYrbIc
	 iRwEsY83d15hn8H79ir6hUW/yR3hAH84rwJ9Hw78swaFKZSyL+SXQYY83poOnqazKA
	 F7+mqCNpUsdvCvc7kYm4tgkjsG0ajBv1CjeOazQ2z8a/PD80EgFVngb40hsIDUX01r
	 7DGuyeGgtx5jnflIvqjR1sppIXuV1JuyhGwzlLZypF7UPqORj/2igCXdYxYkCkNLBX
	 2/T28nmLX3g6w==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4WVVTy6NPXz4x3J; Fri, 26 Jul 2024 11:27:02 +1000 (AEST)
Date: Fri, 26 Jul 2024 11:22:31 +1000
From: David Gibson
To: Stefano Brivio
Subject: Re: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch
Message-ID:
References: <20240724215021.3366863-1-sbrivio@redhat.com>
	<20240724215021.3366863-11-sbrivio@redhat.com>
	<20240725111456.47c37d6f@elisabeth>
	<20240725130908.2ca21e7c@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="gDynP2QfVqY3kjLN"
Content-Disposition: inline
In-Reply-To: <20240725130908.2ca21e7c@elisabeth>
Message-ID-Hash: EQTFV4NNN4HGCN3K3J7W6J7B4SVGPBJE
X-Message-ID-Hash: EQTFV4NNN4HGCN3K3J7W6J7B4SVGPBJE
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency;
	loop; banned-address; member-moderation; nonmember-moderation;
	administrivia; implicit-dest; max-recipients; max-size;
	news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt
Archived-At:
Archived-At:
List-Archive:
List-Archive:
List-Help:
List-Owner:
List-Post:
List-Subscribe:
List-Unsubscribe:

--gDynP2QfVqY3kjLN
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jul 25, 2024 at 01:09:19PM +0200, Stefano Brivio wrote:
> On Thu, 25 Jul 2024 20:23:55 +1000
> David Gibson wrote:
>
> > On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:
> > > On Thu, 25 Jul 2024 14:37:43 +1000
> > > David Gibson wrote:
> > >
> > > > On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
> > > > > This was reported by Matej a while ago, but we forgot to fix it. Even
> > > > > if the hypervisor is necessarily trusted by passt, as it can in any
> > > > > case terminate the guest or disrupt guest connectivity, it's a good
> > > > > idea to be robust against possible issues.
> > > > >
> > > > > Instead of resetting the connection to the hypervisor, just discard
> > > > > the data we read with a single recv(), as we had a few cases where
> > > > > QEMU would get the length descriptor wrong, in the past.
> > > > >
> > > > > While at it, change l2len in tap_handler_passt() to uint32_t, as the
> > > > > length descriptor is logically unsigned and 32-bit wide.
> > > > >
> > > > > Reported-by: Matej Hrica
> > > > > Suggested-by: Matej Hrica
> > > > > Signed-off-by: Stefano Brivio
> > > > > ---
> > > > >  tap.c | 10 ++++++----
> > > > >  1 file changed, 6 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/tap.c b/tap.c
> > > > > index 44bd444..62ba6a4 100644
> > > > > --- a/tap.c
> > > > > +++ b/tap.c
> > > > > @@ -1011,15 +1011,18 @@ redo:
> > > > >  	}
> > > > >
> > > > >  	while (n > (ssize_t)sizeof(uint32_t)) {
> > > > > -		ssize_t l2len = ntohl(*(uint32_t *)p);
> > > > > +		uint32_t l2len = ntohl(*(uint32_t *)p);
> > > > >
> > > > >  		p += sizeof(uint32_t);
> > > > >  		n -= sizeof(uint32_t);
> > > > >
> > > > > +		if (l2len > (ssize_t)TAP_BUF_BYTES - n)
> > > > > +			return;
> > > >
> > > > Neither the condition nor the action makes much sense to me here.
> > > > We're testing if the frame can fit in the remaining buffer space.
> > >
> > > Not really, we're just checking that the length descriptor fits the
> > > remaining buffer space. We're using this in the second recv() below,
> > > that's why it matters here.
> >
> > But AFAICT, what we need to know is if the remainder of the frame fits
> > in the buffer.
>
> It always does because of how TAP_BUF_BYTES and TAP_BUF_FILL are
> defined.

Assuming the frame is <= ETH_MAX_MTU bytes, which we haven't yet
verified.  But that's not really the point I was making.  The key word
above is *remainder*: we may have already read part of the frame, so
comparing the *total* frame length against the remaining buffer space
is incorrect.

> > That could be less than the length descriptor if we've
> > already recv()ed part of a frame.
> >
> > > > But we may have already read part (or all) of the frame - i.e. it's
> > > > included in 'n'.  So I don't see how that condition is useful.
> > >
> > > ...that is, it has nothing to do with exceeding or not exceeding the
> > > buffer on recv(), that's already taken care of by the recv() call,
> > > implicitly.
> > >
> > > > Then, simply returning doesn't seem right under pretty much any
> > > > circumstances - that discards some amount of data, and leaves us in an
> > > > unsynchronized state w.r.t. the frame boundaries.
> > >
> > > That might happen, of course, but it might also happen that the
> > > hypervisor sent us *one* corrupted buffer, and the next recv() will
> > > read consistent data.
> >
> > Well, sure, it's possible, but it doesn't seem particularly likely to
> > me.  AFAICT this is a stream in which we need every length field to
> > interpret it properly.  If we lose one, or it's corrupted, I think
> > we're done for.
>
> In most cases, the content of one recv() corresponds to a given number
> of whole frames, so what we're doing by returning here is, practically
> speaking, similar to what you're suggesting below with MSG_TRUNC.

Or, I guess more technically, the recv() boundaries will probably
correspond to qemu's send() boundaries, which are likely to
correspond to the frame boundaries.

But, I think I now see the root of our disagreement here.  If the
frame length fields don't seem to make sense relative to the recv()
boundaries, do we trust the former (my position) or the latter (your
position, I think)?

Thinking about it like that, I don't think either position makes a lot
of sense.  If these mismatch, something has already gone horribly wrong
and we should probably just die(), or at least reset the tap socket.

> > > > If this is just supposed to be a sanity check on the frame length,
> > > > then I think we'd be better off with a fixed limit - 64kiB is the
> > > > obvious choice.
> > >
> > > That's already checked below (l2len > ETH_MAX_MTU), and...
> >
> > Right.  I wonder if it would make sense to do that earlier.
>
> We can.
> But right now what I want to fix here is just that missing
> check, because it actually causes passt to crash (even though we assume
> a trusted hypervisor, so it's not really security relevant, but not nice
> either).

Ok, fair enough.

> > > > If we hit that, we can warn() and discard data up to
> > > > the end of the too-large frame.  That at least has a chance of letting
> > > > us recover and move on to future acceptable frames.
> > >
> > > that's exactly what we do in that case (goto next).
> >
> > Only for the case that the length is too long, but not *too* long.  In
> > particular it needs to fit in the buffer to even get there.  If we
> > sanity checked the frame length earlier we could use MSG_TRUNC to
> > discard even a ludicrously large frame and still continue on to the
> > next one.
>
> We don't know if the length descriptor actually matches the length of
> the frame, though. If you have a way you think is more robust, would
> you mind sending a patch?

It's more that if the length descriptor doesn't match the actual
length of the frame, I think we're beyond saving.  The recv()
boundaries probably correspond to frame boundaries, but that's not
guaranteed by SOCK_STREAM semantics, so I don't think we should ever
trust them _over_ the length fields we receive.

> > > > >  		/* At most one packet might not fit in a single read, and this
> > > > >  		 * needs to be blocking.
> > > > >  		 */
> > > > > -		if (l2len > n) {
> > > > > +		if (l2len > (size_t)n) {
> > > > >  			rem = recv(c->fd_tap, p + n, l2len - n, 0);
> > >                                                 ^^^^^^^^^^^^^^^^
> > >
> > > This is the reason why the check above is relevant.
> >
> > Relevant, sure, but I still don't think it's right.  Actually
> > (TAP_BUF_BYTES - n) is an even stranger quantity than I initially
> > thought.  It's the total space of the buffer minus the current partial
> > frame - counting *both* the stuff before our partial frame and after
> > it.
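
To make that quantity concrete, here is a minimal standalone sketch.
It is not passt code: BUF_BYTES is an arbitrary stand-in for
TAP_BUF_BYTES, 'off' stands for (p - pkt_buf), and both helper names
are made up for illustration.  It contrasts the limit the check in the
patch applies to l2len with the room that is really left between p and
the end of the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TAP_BUF_BYTES; the real value differs */
#define BUF_BYTES 4096

/* Check as written in the patch: l2len is compared against the whole
 * buffer minus the n bytes of the current frame we already hold,
 * regardless of where in the buffer that frame starts.
 */
static bool patch_check_rejects(size_t l2len, size_t n)
{
	assert(n <= BUF_BYTES);
	return l2len > BUF_BYTES - n;
}

/* Position-aware check: a frame starting at offset 'off' in the buffer
 * only fits if it ends at or before the end of the buffer.
 */
static bool space_check_rejects(size_t l2len, size_t off)
{
	assert(off <= BUF_BYTES);
	return l2len > BUF_BYTES - off;
}
```

With off == BUF_BYTES - 100 and n == 0, a 200 byte length is accepted
by patch_check_rejects() but rejected by space_check_rejects(), because
only 100 bytes remain after p.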
>
> That should be enough to make sure we don't have bad side effects on
> this second recv(), but yes:

No, I don't think it is.  Suppose we get exactly TAP_BUF_FILL bytes in
the first recv().  (TAP_BUF_FILL-4) bytes of that are perfectly
ordinary, valid frames, which we process.  We've read the remaining 4
bytes as a frame length field, so:
	p == pkt_buf + TAP_BUF_FILL
	n == 0

Now suppose that last frame length field is bad, say
	l2len == 2*ETH_MAX_MTU

The test you've suggested:
	if (l2len > (ssize_t)TAP_BUF_BYTES - n)
will let us continue, but the recv()
	rem = recv(c->fd_tap, p + n, l2len - n, 0);
will overrun pkt_buf + TAP_BUF_BYTES.

That _probably_ won't happen, since the recv() is likely to end at a
frame boundary instead, but again SOCK_STREAM semantics don't
guarantee that.  And even if the kernel does always preserve
send()/recv() boundaries, that doesn't rule out qemu transmitting the
frame length as a separate send() (maybe on a debugging / slow path).

> > I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).
>
> ...that is actually more accurate. I can fix this up in another
> version, unless you can think of a more comprehensive change that also
> gives us better possibilities to recover from a corrupted stream.

So, as noted above, I don't think we should even attempt to recover
from a corrupted stream.  One could certainly design a stream protocol
that allows for recovery from corrupted portions, but the qemu -net
stream protocol isn't so designed.

> > > > >  			if ((n += rem) != l2len)
> > > > >  				return;
> > > >
> > > > Pre-existing, but a 'return' here basically lands us in a situation we
> > > > have no meaningful chance of recovering from.  A die() would be
> > > > preferable.  Better yet would be continuing to re-recv() until we have
> > > > the whole frame, similar to what we do for write_remainder().
> > >
> > > Same as above, it depends on what failure you're assuming.
> > > If it's just one botched recv(), instead, we recv() again the next
> > > time and we recover.
> >
> > Even if it's just one bad recv(), we still have no idea where we are
> > w.r.t. frame boundaries, so I can't see any way we could recover.
>
> The idea is that the recv() is _usually_ big enough to flush the
> buffer anyway.

Hrm, not something I'd really like to rely on.

So, I just realised there's another pre-existing problem here.  AFAICT
there's no guarantee that frame lengths are aligned - which means
after an odd-sized frame, we'd be making an unaligned u32 read for the
frame length, which will crash on some platforms.

I'll have a look at all the above today and see what I can come up
with.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--gDynP2QfVqY3kjLN
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmai+lYACgkQzQJF27ox
2GfvYw//XasbB3rhKfc0qVrwrf+Vc3xmLWODmrSbEUZD2LzSUN7koQIaHm3SwzwY
HT3ca9v7XoDr/DHYugc0dEsBefSvCuAQbM3WxeEPgm34rnUdWYn4o+jYJFh2MrJ7
2oL/eGHjZO/YE8aBw6O+UqTXXM8iCb7mSudVzhYkaph+ggI4BLbyx+sdBO8SCTaZ
jSo6+Q/DkuIfa2qtmTVINGYuDd6iiGycAbw8fBpYf/whzBDFqUW5arHpfVfqcHdR
Ovb0vv5wmqzBB6/DMCv++t6nvn4wialJCt0HtqQleW+rU8/qV/Sm3h/gQg8Ms1Y4
JpXHuwofNwoJHCwzDCCrzYcwAL3Z9/OzzwS5ltqvbIv4kcrNhcpjUSVcQVdnD1tn
sHoCV3EtII+FEqh8Se3327a+J7FOngpox6hZ78Qc2+GT5X6Yl1T/HZwRXeBnGthz
dl70c6po7FVuuVvvwODw4M0Uni2Y1vWprDZjylO0V0OMUKBSSPSItVwO+dNi2oAt
vMjdjs0NapwHIjBVrMqvTrDea2U6Vb/EveDd2qM7EwWnb9HtmgHDy0cAAdDokrJk
e8Ard9gr4ffF+SYvr3SMigO60z3k0SmsEH/NA2Z70693sGClx7Vx+GMFHj5Fz8cE
RRazFera/9wu0HyFR7FU1TOnlKXi/wPX5pgN5PApLZOtihuIrZ4=
=D7Ad
-----END PGP SIGNATURE-----

--gDynP2QfVqY3kjLN--
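
On the unaligned length read raised at the end of the message: one
common way to avoid it is to memcpy() the four bytes into a local
uint32_t before calling ntohl(), rather than dereferencing a cast
pointer.  What follows is only a sketch of that general technique, not
necessarily the fix that was applied to passt; read_l2len() is a
hypothetical helper name.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian length from p, which may be unaligned.
 * memcpy() copies byte by byte as far as the language is concerned, so
 * no aligned access to *p is required; compilers typically turn this
 * into a single load where the platform allows it.
 */
static uint32_t read_l2len(const void *p)
{
	uint32_t be;

	assert(p != NULL);
	memcpy(&be, p, sizeof(be));
	return ntohl(be);
}
```

For example, calling read_l2len(buf + 1) on a byte buffer works even
though buf + 1 is not 4-byte aligned.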