From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 534F75A0265 for ; Wed, 9 Nov 2022 11:29:48 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1667989787; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=irTgzSMhbJgPAWU6OSVUIRf19L5asgyAz5WsOE90MhU=; b=gOl0l46jEtsxvm5/45l9AHWAnLtwsmLVjODTc1Z/Ti/RlaMadMczKP1cZsxRPRCR5jc1tK +XwbdbNJlOTWNQj/Z6AyqIdLLAcTeH4NP7Nh+3kGEoNyWxISx+p0bXcV7GkB/qcM/r9BL9 BLKb6iw0C/gU9kqvJ1nW8N0QzToau7w= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-668-hH48FCttMKe76qR4QnNQtQ-1; Wed, 09 Nov 2022 05:29:46 -0500 X-MC-Unique: hH48FCttMKe76qR4QnNQtQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.rdu2.redhat.com [10.11.54.8]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id B69B1800B23; Wed, 9 Nov 2022 10:29:45 +0000 (UTC) Received: from maya.cloud.tilaa.com (ovpn-208-9.brq.redhat.com [10.40.208.9]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 44849C16922; Wed, 9 Nov 2022 10:29:45 +0000 (UTC) Date: Wed, 9 Nov 2022 11:29:29 +0100 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 1/2] tap: Keep stream consistent if qemu length descriptor spans two recv() calls Message-ID: <20221109112929.0ee6c107@elisabeth> In-Reply-To: References: <20221108085419.3220900-1-sbrivio@redhat.com> <20221108085419.3220900-2-sbrivio@redhat.com> Organization: Red Hat MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.8 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: QTIIGCRRVLRQXKR7JZBCBKZ4OJP5P2Q3 X-Message-ID-Hash: QTIIGCRRVLRQXKR7JZBCBKZ4OJP5P2Q3 X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.3 Precedence: list List-Id: Development discussion and patches for passt Archived-At: <> Archived-At: List-Archive: <> List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 9 Nov 2022 21:15:48 +1100 David Gibson wrote: > On Tue, Nov 08, 2022 at 09:54:18AM +0100, Stefano Brivio wrote: > > I got all paranoid after triggering a divide-by-zero general > > protection fault in passt with a qemu version without the virtio_net > > TX hang fix, while flooding UDP. I start thinking this was actually > > coming from some random changes I was playing with, but before > > reaching this conclusion I reviewed once more the relatively short > > path in tap_handler_passt() before we start using packet_*() > > functions, and found this. > > > > Never observed in practice, but artificially reproduced with changes > > in qemu's socket interface: if we don't receive from qemu a complete > > length descriptor in one recv() call, or if we receive a partial one > > at the end of one call, we currently disregard the rest, which would > > make the stream inconsistent. > > > > Nothing really bad happens, except that from that point on we would > > disregard all the packets we get until, if ever, we get the stream > > back in sync by chance. > > > > Force reading a complete packet length descriptor with a blocking > > recv(), if needed -- not just a complete packet later. > > > > Signed-off-by: Stefano Brivio > > This seems an ok short term fix, but I think we want another approach > in the slightly longer term. Read on.. > > > --- > > tap.c | 18 +++++++++++++++--- > > 1 file changed, 15 insertions(+), 3 deletions(-) > > > > diff --git a/tap.c b/tap.c > > index f8314ef..11ac732 100644 > > --- a/tap.c > > +++ b/tap.c > > @@ -747,14 +747,26 @@ redo: > > return -ECONNRESET; > > } > > > > - while (n > (ssize_t)sizeof(uint32_t)) { > > - ssize_t len = ntohl(*(uint32_t *)p); > > + while (n > 0) { > > + ssize_t len; > > + > > + /* Force receiving at least a complete length descriptor to > > + * avoid an inconsistent stream. > > + */ > > Is it actually enough for this to be blocking? AFAICT, recv() on a > stream socket, like read(), can return less data than you requested. It's not enough, hence the check on 'rem' afterwards, and this doesn't cover anyway the case were qemu would decide to send one byte at a time (because as you pointed out blocking doesn't mean we'll get the full amount requested), which never happens in practice, though. > > + if (n < (ssize_t)sizeof(uint32_t)) { > > + rem = recv(c->fd_tap, p + n, > > + (ssize_t)sizeof(uint32_t) - n, 0); > > + if ((n += rem) != (ssize_t)sizeof(uint32_t)) > > + return 0; > > + } > > + > > + len = ntohl(*(uint32_t *)p); > > > > p += sizeof(uint32_t); > > n -= sizeof(uint32_t); > > > > /* At most one packet might not fit in a single read, and this > > - * needs to be blocking. > > + * also needs to be blocking. > > Same issue here (obviously not introduced by this patch, though). Same here. > > */ > > if (len > n) { > > rem = recv(c->fd_tap, p + n, len - n, 0); > > Can we handle both these cases more neatly (and without blocking > recv()) calls, if we maintain two pointers into pkt_buf. The first > one tracks how much we've read from the qemu socket, the second tracks > how much has been parsed into packets. When we get an epoll > notification on the qemu socket, we recv() and advance the first > pointer. Then we discern as many full packets as we can, advancing > the second pointer. Yes, and I actually drafted something like that, but it takes a lot of attention and time to get it right, so I preferred to keep it simple until now. I can file a ticket as enhancement. Also, given that a subsequent recv() would operate on the "next" pointer, it will have less space available than the first one. Ideally this should be a ringbuffer (using scatter-gather IO). -- Stefano