Date: Mon, 24 Oct 2022 02:36:33 +0200
From: Stefano Brivio
To: David Gibson
Subject: Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
Message-ID: <20221024023633.526148e5@elisabeth>
References: <20221022064503.386563-1-sbrivio@redhat.com> <20221022101535.4ceda8d6@elisabeth>
Organization: Red Hat
CC: passt-dev@passt.top, Andrea Bolognani
List-Id: Development discussion and patches for passt

On Mon, 24 Oct 2022 10:52:55 +1100
David Gibson wrote:

> On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote:
> > On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:
> > > On Sat, 22 Oct 2022 08:45:03 +0200
> > > Stefano Brivio wrote:
> > >
> > > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init()
> > > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
> > > > everything: nl_sock_init() is effectively part of main(), after
> > > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with
> > > > CAP_NET_BIND_SERVICE before isolate_user()").
> > > >
> > > > This means we exceed the maximum stack size, and we get SIGSEGV,
> > > > under any condition, at start time, as reported by Andrea on a recent
> > > > build for CentOS Stream 9.
> > > >
> > > > The calculation of NS_FN_STACK_SIZE, which is the stack size we
> > > > reserve for clones, was previously obtained by dividing the maximum
> > > > stack size by two, to avoid an explicit check on architecture (on
> > > > PA-RISC, also known as hppa, the stack grows up, so we point the
> > > > clone to the middle of this area), and then further divided by two
> > > > to allow for any additional usage in the caller.
> > > >
> > > > Well, if there are essentially no function calls anymore, this is
> > > > not enough. Divide it by eight, which is anyway much more than
> > > > possibly needed by any clone()d callee.
> > > >
> > > > I think this is robust, so it's a fix in some sense. Strictly
> > > > speaking, though, we have no formal guarantees that this isn't
> > > > either too little or too much.
> > > >
> > > > What we should do, eventually: check cloned() callees, there are just
> > > > thirteen of them at the moment. Note down any stack usage (they are
> > > > mostly small helpers), bonus points for an automated way at build
> > > > time, quadruple that or so, to allow for extreme clumsiness, and use
> > > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.
> > > >
> > > > Reported-by: Andrea Bolognani
> > > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
> > > > Signed-off-by: Stefano Brivio
> > > > ---
> > >
> > > I posted this in any case for (later) review, but I'm actually applying
> > > it right away, given that some builds are completely unusable otherwise.
> >
> > That's sensible, however, this patch confuses me. I don't really
> > understand how reducing the stack size is avoiding a SEGV, regardless
> > of what LTO does.
>
> Shortly after I wrote this, I realized what the issue was. IIUC, the
> problem is basically because we're allocating the stack of the
> sub-thread as a buffer on the stack of the main thread, so the main
> thread stack has to have room for both.

Right, I could have phrased this better.

> > The fact that we're basing the runtime stack size
> > on a limit that's from build time also doesn't really make sense to
> > me.
>
> This aspect still seems pretty bogus to me.

It is. It's still a useful approximation, because those limits are
rarely set to non-default per-arch values (all the distributions and
versions we test happen to have, on a given architecture, the same
value), and at the same time vary wildly depending on the architecture.
And with that, we avoid bug-prone VLAs or, worse, alloca().

And if users set limits to substantially lower values, typically other
programs won't be able to run as well.

But... we don't really need that. With 16 KiB, you usually won't be
able to ls. With 128 KiB, gimp crashes for me.

We probably need something in between, but that implies we shouldn't
just give NS_CALL()ed helpers "a lot", rather (some multiple of) what
they need.

They're not many and they're quite short, so we could note down some
manual calculations, or we could automate that with nm, or pahole, or
gcc -save-temps... or even some objdump script.

I'm not sure that's worth adding a further compilation phase, though.
We have well-known architecture-specific type sizes and alignments on
Linux and we already do something similar for AVX2 alignments, so I
would try to do this manually, take some reasonable margin on top, and
maybe add an explicit stack guard (I think we shouldn't rely on
-fstack-protector alone, or at all).

-- 
Stefano
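
For illustration only, here is a rough sketch of what "some multiple of
what they need, plus an explicit stack guard" could look like. This is
not passt code: guarded_stack_map(), run_in_clone(), sample_helper() and
both HELPER_STACK_* values are made up for the example, the 16 KiB
figure is a placeholder rather than a measurement, and it assumes a
stack that grows down (so it would need a variant for hppa).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical numbers, not measured on passt: worst-case stack usage
 * of a helper, and the safety factor applied on top of it.
 */
#define HELPER_STACK_NEED	(16 * 1024)
#define HELPER_STACK_FACTOR	4

struct guarded_stack {
	char *map;	/* Start of the mapping, guard pages included */
	size_t len;	/* Length of the whole mapping */
	char *top;	/* Initial stack pointer (stack grows down) */
};

/* Map a stack with one inaccessible page below and one above it, so
 * that an overflow faults right away instead of silently corrupting
 * neighbouring data. Returns 0 on success, -1 on failure.
 */
static int guarded_stack_map(struct guarded_stack *s)
{
	long ret = sysconf(_SC_PAGESIZE);
	size_t page, usable = HELPER_STACK_NEED * HELPER_STACK_FACTOR;
	void *p;

	if (ret <= 0)
		return -1;
	page = (size_t)ret;

	/* Round the usable part up to a whole number of pages */
	usable = (usable + page - 1) / page * page;

	/* Everything starts as PROT_NONE, guard pages stay that way */
	p = mmap(NULL, usable + 2 * page, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	s->map = p;
	s->len = usable + 2 * page;

	if (mprotect(s->map + page, usable, PROT_READ | PROT_WRITE)) {
		munmap(s->map, s->len);
		return -1;
	}

	s->top = s->map + page + usable;
	return 0;
}

/* Run fn(arg) in a clone()d child on a guarded stack and return its
 * exit status, or -1 on any failure.
 */
static int run_in_clone(int (*fn)(void *), void *arg)
{
	struct guarded_stack s;
	int status, ret = -1;
	pid_t pid;

	assert(fn != NULL);

	if (guarded_stack_map(&s))
		return -1;

	/* CLONE_VFORK suspends the caller until the helper exits */
	pid = clone(fn, s.top, CLONE_VM | CLONE_VFORK | SIGCHLD, arg);
	if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	munmap(s.map, s.len);
	return ret;
}

/* Stand-in for a small helper: memory is shared through CLONE_VM */
static int sample_helper(void *arg)
{
	*(int *)arg = 42;
	return 0;
}
```

A caller would do something like "int out = 0; run_in_clone(sample_helper,
&out);" and then read out. Since the helper's stack is a separate mapping
in this sketch, the caller's own stack frame doesn't have to hold it,
whatever LTO decides to inline. Whether an mmap() per call is acceptable
is a separate question that the sketch doesn't try to answer.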