public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
@ 2022-10-22  6:45 Stefano Brivio
  2022-10-22  8:15 ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2022-10-22  6:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Andrea Bolognani

...instead of one fourth. On the main() -> conf() -> nl_sock_init()
call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
everything: nl_sock_init() is effectively part of main(), after
commit 3e2eb4337bc0 ("conf: Bind inbound ports with
CAP_NET_BIND_SERVICE before isolate_user()").

This means we exceed the maximum stack size, and we get SIGSEGV,
under any condition, at start time, as reported by Andrea on a recent
build for CentOS Stream 9.

The calculation of NS_FN_STACK_SIZE, which is the stack size we
reserve for clones, was previously obtained by dividing the maximum
stack size by two, to avoid an explicit check on architecture (on
PA-RISC, also known as hppa, the stack grows up, so we point the
clone to the middle of this area), and then further divided by two
to allow for any additional usage in the caller.

Well, if there are essentially no function calls anymore, this is
not enough. Divide it by eight, which is anyway much more than
possibly needed by any clone()d callee.

I think this is robust, so it's a fix in some sense. Strictly
speaking, though, we have no formal guarantees that this isn't
either too little or too much.

What we should do, eventually: check cloned() callees, there are just
thirteen of them at the moment. Note down any stack usage (they are
mostly small helpers), bonus points for an automated way at build
time, quadruple that or so, to allow for extreme clumsiness, and use
as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.

Reported-by: Andrea Bolognani <abologna@redhat.com>
Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 util.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util.h b/util.h
index 27829b1..c498a80 100644
--- a/util.h
+++ b/util.h
@@ -72,7 +72,7 @@
 #define IPV4_IS_LOOPBACK(addr)						\
 	((addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
 
-#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 4)
+#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 #define NS_CALL(fn, arg)						\
 	do {								\
 		char ns_fn_stack[NS_FN_STACK_SIZE];			\
-- 
@@ -72,7 +72,7 @@
 #define IPV4_IS_LOOPBACK(addr)						\
 	((addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
 
-#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 4)
+#define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 #define NS_CALL(fn, arg)						\
 	do {								\
 		char ns_fn_stack[NS_FN_STACK_SIZE];			\
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
  2022-10-22  6:45 [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size Stefano Brivio
@ 2022-10-22  8:15 ` Stefano Brivio
  2022-10-23 23:36   ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2022-10-22  8:15 UTC (permalink / raw)
  To: passt-dev; +Cc: Andrea Bolognani

On Sat, 22 Oct 2022 08:45:03 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> ...instead of one fourth. On the main() -> conf() -> nl_sock_init()
> call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
> everything: nl_sock_init() is effectively part of main(), after
> commit 3e2eb4337bc0 ("conf: Bind inbound ports with
> CAP_NET_BIND_SERVICE before isolate_user()").
> 
> This means we exceed the maximum stack size, and we get SIGSEGV,
> under any condition, at start time, as reported by Andrea on a recent
> build for CentOS Stream 9.
> 
> The calculation of NS_FN_STACK_SIZE, which is the stack size we
> reserve for clones, was previously obtained by dividing the maximum
> stack size by two, to avoid an explicit check on architecture (on
> PA-RISC, also known as hppa, the stack grows up, so we point the
> clone to the middle of this area), and then further divided by two
> to allow for any additional usage in the caller.
> 
> Well, if there are essentially no function calls anymore, this is
> not enough. Divide it by eight, which is anyway much more than
> possibly needed by any clone()d callee.
> 
> I think this is robust, so it's a fix in some sense. Strictly
> speaking, though, we have no formal guarantees that this isn't
> either too little or too much.
> 
> What we should do, eventually: check cloned() callees, there are just
> thirteen of them at the moment. Note down any stack usage (they are
> mostly small helpers), bonus points for an automated way at build
> time, quadruple that or so, to allow for extreme clumsiness, and use
> as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.
> 
> Reported-by: Andrea Bolognani <abologna@redhat.com>
> Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---

I posted this in any case for (later) review, but I'm actually applying
it right away, given that some builds are completely unusable otherwise.

I guess it would make sense to extend build ("distro") tests with some
compiler flags, common and less common.

-- 
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
  2022-10-22  8:15 ` Stefano Brivio
@ 2022-10-23 23:36   ` David Gibson
  2022-10-23 23:52     ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2022-10-23 23:36 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Andrea Bolognani

[-- Attachment #1: Type: text/plain, Size: 2804 bytes --]

On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:
> On Sat, 22 Oct 2022 08:45:03 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > ...instead of one fourth. On the main() -> conf() -> nl_sock_init()
> > call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
> > everything: nl_sock_init() is effectively part of main(), after
> > commit 3e2eb4337bc0 ("conf: Bind inbound ports with
> > CAP_NET_BIND_SERVICE before isolate_user()").
> > 
> > This means we exceed the maximum stack size, and we get SIGSEGV,
> > under any condition, at start time, as reported by Andrea on a recent
> > build for CentOS Stream 9.
> > 
> > The calculation of NS_FN_STACK_SIZE, which is the stack size we
> > reserve for clones, was previously obtained by dividing the maximum
> > stack size by two, to avoid an explicit check on architecture (on
> > PA-RISC, also known as hppa, the stack grows up, so we point the
> > clone to the middle of this area), and then further divided by two
> > to allow for any additional usage in the caller.
> > 
> > Well, if there are essentially no function calls anymore, this is
> > not enough. Divide it by eight, which is anyway much more than
> > possibly needed by any clone()d callee.
> > 
> > I think this is robust, so it's a fix in some sense. Strictly
> > speaking, though, we have no formal guarantees that this isn't
> > either too little or too much.
> > 
> > What we should do, eventually: check cloned() callees, there are just
> > thirteen of them at the moment. Note down any stack usage (they are
> > mostly small helpers), bonus points for an automated way at build
> > time, quadruple that or so, to allow for extreme clumsiness, and use
> > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.
> > 
> > Reported-by: Andrea Bolognani <abologna@redhat.com>
> > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> 
> I posted this in any case for (later) review, but I'm actually applying
> it right away, given that some builds are completely unusable otherwise.

That's sensible, however, this patch confuses me.  I don't really
understand how reducing the stack size is avoiding a SEGV, regardless
of what LTO does.  The fact that we're basing the runtime stack size
on a limit that's from build time also doesn't really make sense to
me.

> I guess it would make sense to extend build ("distro") tests with some
> compiler flags, common and less common.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
  2022-10-23 23:36   ` David Gibson
@ 2022-10-23 23:52     ` David Gibson
  2022-10-24  0:36       ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2022-10-23 23:52 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Andrea Bolognani

[-- Attachment #1: Type: text/plain, Size: 3275 bytes --]

On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote:
> On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:
> > On Sat, 22 Oct 2022 08:45:03 +0200
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> > 
> > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init()
> > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
> > > everything: nl_sock_init() is effectively part of main(), after
> > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with
> > > CAP_NET_BIND_SERVICE before isolate_user()").
> > > 
> > > This means we exceed the maximum stack size, and we get SIGSEGV,
> > > under any condition, at start time, as reported by Andrea on a recent
> > > build for CentOS Stream 9.
> > > 
> > > The calculation of NS_FN_STACK_SIZE, which is the stack size we
> > > reserve for clones, was previously obtained by dividing the maximum
> > > stack size by two, to avoid an explicit check on architecture (on
> > > PA-RISC, also known as hppa, the stack grows up, so we point the
> > > clone to the middle of this area), and then further divided by two
> > > to allow for any additional usage in the caller.
> > > 
> > > Well, if there are essentially no function calls anymore, this is
> > > not enough. Divide it by eight, which is anyway much more than
> > > possibly needed by any clone()d callee.
> > > 
> > > I think this is robust, so it's a fix in some sense. Strictly
> > > speaking, though, we have no formal guarantees that this isn't
> > > either too little or too much.
> > > 
> > > What we should do, eventually: check cloned() callees, there are just
> > > thirteen of them at the moment. Note down any stack usage (they are
> > > mostly small helpers), bonus points for an automated way at build
> > > time, quadruple that or so, to allow for extreme clumsiness, and use
> > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.
> > > 
> > > Reported-by: Andrea Bolognani <abologna@redhat.com>
> > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > 
> > I posted this in any case for (later) review, but I'm actually applying
> > it right away, given that some builds are completely unusable otherwise.
> 
> That's sensible, however, this patch confuses me.  I don't really
> understand how reducing the stack size is avoiding a SEGV, regardless
> of what LTO does.

Shortly after I wrote this, I realized what the issue was.  IIUC, the
problem is basically because we're allocating the stack of the
sub-thread as a buffer on the stack of the main thread, so the main
thread stack has to have room for both.

> The fact that we're basing the runtime stack size
> on a limit that's from build time also doesn't really make sense to
> me.

This aspect still seems pretty bogus to me.

> > I guess it would make sense to extend build ("distro") tests with some
> > compiler flags, common and less common.
> > 
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size
  2022-10-23 23:52     ` David Gibson
@ 2022-10-24  0:36       ` Stefano Brivio
  0 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2022-10-24  0:36 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Andrea Bolognani

On Mon, 24 Oct 2022 10:52:55 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote:
> > On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote:  
> > > On Sat, 22 Oct 2022 08:45:03 +0200
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >   
> > > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init()
> > > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline...
> > > > everything: nl_sock_init() is effectively part of main(), after
> > > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with
> > > > CAP_NET_BIND_SERVICE before isolate_user()").
> > > > 
> > > > This means we exceed the maximum stack size, and we get SIGSEGV,
> > > > under any condition, at start time, as reported by Andrea on a recent
> > > > build for CentOS Stream 9.
> > > > 
> > > > The calculation of NS_FN_STACK_SIZE, which is the stack size we
> > > > reserve for clones, was previously obtained by dividing the maximum
> > > > stack size by two, to avoid an explicit check on architecture (on
> > > > PA-RISC, also known as hppa, the stack grows up, so we point the
> > > > clone to the middle of this area), and then further divided by two
> > > > to allow for any additional usage in the caller.
> > > > 
> > > > Well, if there are essentially no function calls anymore, this is
> > > > not enough. Divide it by eight, which is anyway much more than
> > > > possibly needed by any clone()d callee.
> > > > 
> > > > I think this is robust, so it's a fix in some sense. Strictly
> > > > speaking, though, we have no formal guarantees that this isn't
> > > > either too little or too much.
> > > > 
> > > > What we should do, eventually: check cloned() callees, there are just
> > > > thirteen of them at the moment. Note down any stack usage (they are
> > > > mostly small helpers), bonus points for an automated way at build
> > > > time, quadruple that or so, to allow for extreme clumsiness, and use
> > > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa.
> > > > 
> > > > Reported-by: Andrea Bolognani <abologna@redhat.com>
> > > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()")
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---  
> > > 
> > > I posted this in any case for (later) review, but I'm actually applying
> > > it right away, given that some builds are completely unusable otherwise.  
> > 
> > That's sensible, however, this patch confuses me.  I don't really
> > understand how reducing the stack size is avoiding a SEGV, regardless
> > of what LTO does.  
> 
> Shortly after I wrote this, I realized what the issue was.  IIUC, the
> problem is basically because we're allocating the stack of the
> sub-thread as a buffer on the stack of the main thread, so the main
> thread stack has to have room for both.

Right, I could have phrased this better.

> > The fact that we're basing the runtime stack size
> > on a limit that's from build time also doesn't really make sense to
> > me.  
> 
> This aspect still seems pretty bogus to me.

It is.

It's still a useful approximation, because those limits are rarely set
to non-default per-arch values (all the distributions and versions we
test happen to have, on a given architecture, the same value), and at
the same time vary wildly depending on the architecture. And with that,
we avoid bug-prone VLAs or, worse, alloca().

And if users set limits to substantially lower values, typically other
programs won't be able to run as well.

But... we don't really need that. With 16 KiB, you usually won't be able
to ls. With 128 KiB, gimp crashes for me. We probably need something in
between, but that implies we shouldn't just give NS_CALL()ed helpers "a
lot", rather (some multiple of) what they need.

They're not many and they're quite short, so we could note down some
manual calculations, or we could automate that with nm, or pahole, or
gcc -save-temps... or even some objdump script. I'm not sure that's
worth adding a further compilation phase, though.

We have well-known architecture-specific type sizes and alignments on
Linux and we already do something similar for AVX2 alignments, so I
would try to do this manually, take some reasonable margin on top, and
maybe add an explicit stack guard (I think we shouldn't rely on
-fstack-protector alone, or at all).

-- 
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-10-24  0:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-22  6:45 [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size Stefano Brivio
2022-10-22  8:15 ` Stefano Brivio
2022-10-23 23:36   ` David Gibson
2022-10-23 23:52     ` David Gibson
2022-10-24  0:36       ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).