Date: Thu, 26 Jan 2023 00:21:33 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v3 00/18] RFC: Unify and simplify tap send path
Message-ID: <20230126002133.7a8eec98@elisabeth>
References: <20230106004322.985665-1-david@gibson.dropbear.id.au>
 <20230124222043.281ef58c@elisabeth>

On Wed, 25 Jan 2023 14:13:44 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jan 24, 2023 at 10:20:43PM +0100, Stefano Brivio wrote:
> > On Fri, 6 Jan 2023 11:43:04 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > Although we have an abstraction for the "slow path" (DHCP, NDP) guest
> > > bound packets, the TCP and UDP forwarding paths write directly to the
> > > tap fd. However, it turns out how they send frames to the tap device
> > > is more similar than it originally appears.
> > > 
> > > This series unifies the low-level tap send functions for TCP and UDP,
> > > and makes some clean ups along the way.
> > > 
> > > This is based on my earlier outstanding series.
> > 
> > For some reason, performance tests consistently get stuck (both TCP and
> > UDP, sometimes throughput, sometimes latency tests) with this series,
> > and not without it, but I don't see any possible relationship with that.
> 
> Drat, I didn't encounter that. Any chance you could bisect to figure
> out which patch specifically seems to trigger it?

I couldn't do it conclusively, yet. :/
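
For reference, the fully automated run I'd aim for is something like
this -- just a sketch, assuming the stall reproduces within a fixed
timeout; the test entry point (test/run) and the good commit are
placeholders:

  # Sketch: bisect the stall automatically. timeout(1) exits with
  # status 124 on a hang, which "git bisect run" treats as bad.
  git bisect start
  git bisect bad                 # HEAD, with this series applied
  git bisect good <last-good>    # placeholder: commit before the series
  git bisect run timeout 600 sh -c 'make && (cd test && ./run)'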
Before "tcp: Combine two parts of passt tap send path together", no
stalls at all. After that, I'm routinely getting a stall on the
perf/passt_udp test, IPv4 host-to-guest with 256B MTU. I know, that
test is probably meaningless as a performance figure, but it helps
find issues like this, at least. :)

Yes, UDP -- the iperf3 client doesn't connect to the server, passt
doesn't crash, but it's gone (zombie) by the time I get to it. I think
it's the test scripts terminating it (even though I don't see anything
on the terminal), and script.log ends with:

2023/01/25 21:27:14 socat[3432381] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
ssh-keygen: generating new host keys: RSA
2023/01/25 21:27:14 socat[3432390] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432393] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432396] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432399] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
DSA ECDSA ED25519
# Warning: Permanently added 'guest' (ED25519) to the list of known hosts.

which looks like fairly normal retries.

If I run the tests with DEBUG=1, they get stuck during UDP functional
testing, so I'm leaving that aside for a moment.

If I apply the whole series, other tests get stuck (including TCP
ones). There might be something going wrong with iperf3's (TCP)
control message exchange. I'm going to run this single test next, and
add some debugging prints here and there.

> I wonder if this could be related to the stalls I'm debugging,
> although those didn't appear on the perf tests and also occur on
> main. I have now discovered they seem to be masked by large socket
> buffer sizes - more info at https://bugs.passt.top/show_bug.cgi?id=41

Maybe the subsequent failures (or even this one) could actually be
related, triggered somehow by some change in timing. I'm still
clueless at the moment.

-- 
Stefano