* [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size @ 2022-10-22 6:45 Stefano Brivio 2022-10-22 8:15 ` Stefano Brivio 0 siblings, 1 reply; 5+ messages in thread From: Stefano Brivio @ 2022-10-22 6:45 UTC (permalink / raw) To: passt-dev; +Cc: Andrea Bolognani ...instead of one fourth. On the main() -> conf() -> nl_sock_init() call path, LTO from gcc 12 on (at least) x86_64 decides to inline... everything: nl_sock_init() is effectively part of main(), after commit 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()"). This means we exceed the maximum stack size, and we get SIGSEGV, under any condition, at start time, as reported by Andrea on a recent build for CentOS Stream 9. The calculation of NS_FN_STACK_SIZE, which is the stack size we reserve for clones, was previously obtained by dividing the maximum stack size by two, to avoid an explicit check on architecture (on PA-RISC, also known as hppa, the stack grows up, so we point the clone to the middle of this area), and then further divided by two to allow for any additional usage in the caller. Well, if there are essentially no function calls anymore, this is not enough. Divide it by eight, which is anyway much more than possibly needed by any clone()d callee. I think this is robust, so it's a fix in some sense. Strictly speaking, though, we have no formal guarantees that this isn't either too little or too much. What we should do, eventually: check cloned() callees, there are just thirteen of them at the moment. Note down any stack usage (they are mostly small helpers), bonus points for an automated way at build time, quadruple that or so, to allow for extreme clumsiness, and use as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. Reported-by: Andrea Bolognani <abologna@redhat.com> Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- util.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/util.h b/util.h index 27829b1..c498a80 100644 --- a/util.h +++ b/util.h @@ -72,7 +72,7 @@ #define IPV4_IS_LOOPBACK(addr) \ ((addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET) -#define NS_FN_STACK_SIZE (RLIMIT_STACK_VAL * 1024 / 4) +#define NS_FN_STACK_SIZE (RLIMIT_STACK_VAL * 1024 / 8) #define NS_CALL(fn, arg) \ do { \ char ns_fn_stack[NS_FN_STACK_SIZE]; \ -- @@ -72,7 +72,7 @@ #define IPV4_IS_LOOPBACK(addr) \ ((addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET) -#define NS_FN_STACK_SIZE (RLIMIT_STACK_VAL * 1024 / 4) +#define NS_FN_STACK_SIZE (RLIMIT_STACK_VAL * 1024 / 8) #define NS_CALL(fn, arg) \ do { \ char ns_fn_stack[NS_FN_STACK_SIZE]; \ -- 2.35.1 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size 2022-10-22 6:45 [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size Stefano Brivio @ 2022-10-22 8:15 ` Stefano Brivio 2022-10-23 23:36 ` David Gibson 0 siblings, 1 reply; 5+ messages in thread From: Stefano Brivio @ 2022-10-22 8:15 UTC (permalink / raw) To: passt-dev; +Cc: Andrea Bolognani On Sat, 22 Oct 2022 08:45:03 +0200 Stefano Brivio <sbrivio@redhat.com> wrote: > ...instead of one fourth. On the main() -> conf() -> nl_sock_init() > call path, LTO from gcc 12 on (at least) x86_64 decides to inline... > everything: nl_sock_init() is effectively part of main(), after > commit 3e2eb4337bc0 ("conf: Bind inbound ports with > CAP_NET_BIND_SERVICE before isolate_user()"). > > This means we exceed the maximum stack size, and we get SIGSEGV, > under any condition, at start time, as reported by Andrea on a recent > build for CentOS Stream 9. > > The calculation of NS_FN_STACK_SIZE, which is the stack size we > reserve for clones, was previously obtained by dividing the maximum > stack size by two, to avoid an explicit check on architecture (on > PA-RISC, also known as hppa, the stack grows up, so we point the > clone to the middle of this area), and then further divided by two > to allow for any additional usage in the caller. > > Well, if there are essentially no function calls anymore, this is > not enough. Divide it by eight, which is anyway much more than > possibly needed by any clone()d callee. > > I think this is robust, so it's a fix in some sense. Strictly > speaking, though, we have no formal guarantees that this isn't > either too little or too much. > > What we should do, eventually: check cloned() callees, there are just > thirteen of them at the moment. Note down any stack usage (they are > mostly small helpers), bonus points for an automated way at build > time, quadruple that or so, to allow for extreme clumsiness, and use > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. > > Reported-by: Andrea Bolognani <abologna@redhat.com> > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > --- I posted this in any case for (later) review, but I'm actually applying it right away, given that some builds are completely unusable otherwise. I guess it would make sense to extend build ("distro") tests with some compiler flags, common and less common. -- Stefano ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size 2022-10-22 8:15 ` Stefano Brivio @ 2022-10-23 23:36 ` David Gibson 2022-10-23 23:52 ` David Gibson 0 siblings, 1 reply; 5+ messages in thread From: David Gibson @ 2022-10-23 23:36 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Andrea Bolognani [-- Attachment #1: Type: text/plain, Size: 2804 bytes --] On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote: > On Sat, 22 Oct 2022 08:45:03 +0200 > Stefano Brivio <sbrivio@redhat.com> wrote: > > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init() > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline... > > everything: nl_sock_init() is effectively part of main(), after > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with > > CAP_NET_BIND_SERVICE before isolate_user()"). > > > > This means we exceed the maximum stack size, and we get SIGSEGV, > > under any condition, at start time, as reported by Andrea on a recent > > build for CentOS Stream 9. > > > > The calculation of NS_FN_STACK_SIZE, which is the stack size we > > reserve for clones, was previously obtained by dividing the maximum > > stack size by two, to avoid an explicit check on architecture (on > > PA-RISC, also known as hppa, the stack grows up, so we point the > > clone to the middle of this area), and then further divided by two > > to allow for any additional usage in the caller. > > > > Well, if there are essentially no function calls anymore, this is > > not enough. Divide it by eight, which is anyway much more than > > possibly needed by any clone()d callee. > > > > I think this is robust, so it's a fix in some sense. Strictly > > speaking, though, we have no formal guarantees that this isn't > > either too little or too much. > > > > What we should do, eventually: check cloned() callees, there are just > > thirteen of them at the moment. Note down any stack usage (they are > > mostly small helpers), bonus points for an automated way at build > > time, quadruple that or so, to allow for extreme clumsiness, and use > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. > > > > Reported-by: Andrea Bolognani <abologna@redhat.com> > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > --- > > I posted this in any case for (later) review, but I'm actually applying > it right away, given that some builds are completely unusable otherwise. That's sensible, however, this patch confuses me. I don't really understand how reducing the stack size is avoiding a SEGV, regardless of what LTO does. The fact that we're basing the runtime stack size on a limit that's from build time also doesn't really make sense to me. > I guess it would make sense to extend build ("distro") tests with some > compiler flags, common and less common. > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size 2022-10-23 23:36 ` David Gibson @ 2022-10-23 23:52 ` David Gibson 2022-10-24 0:36 ` Stefano Brivio 0 siblings, 1 reply; 5+ messages in thread From: David Gibson @ 2022-10-23 23:52 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Andrea Bolognani [-- Attachment #1: Type: text/plain, Size: 3275 bytes --] On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote: > On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote: > > On Sat, 22 Oct 2022 08:45:03 +0200 > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init() > > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline... > > > everything: nl_sock_init() is effectively part of main(), after > > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with > > > CAP_NET_BIND_SERVICE before isolate_user()"). > > > > > > This means we exceed the maximum stack size, and we get SIGSEGV, > > > under any condition, at start time, as reported by Andrea on a recent > > > build for CentOS Stream 9. > > > > > > The calculation of NS_FN_STACK_SIZE, which is the stack size we > > > reserve for clones, was previously obtained by dividing the maximum > > > stack size by two, to avoid an explicit check on architecture (on > > > PA-RISC, also known as hppa, the stack grows up, so we point the > > > clone to the middle of this area), and then further divided by two > > > to allow for any additional usage in the caller. > > > > > > Well, if there are essentially no function calls anymore, this is > > > not enough. Divide it by eight, which is anyway much more than > > > possibly needed by any clone()d callee. > > > > > > I think this is robust, so it's a fix in some sense. Strictly > > > speaking, though, we have no formal guarantees that this isn't > > > either too little or too much. > > > > > > What we should do, eventually: check cloned() callees, there are just > > > thirteen of them at the moment. Note down any stack usage (they are > > > mostly small helpers), bonus points for an automated way at build > > > time, quadruple that or so, to allow for extreme clumsiness, and use > > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. > > > > > > Reported-by: Andrea Bolognani <abologna@redhat.com> > > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > --- > > > > I posted this in any case for (later) review, but I'm actually applying > > it right away, given that some builds are completely unusable otherwise. > > That's sensible, however, this patch confuses me. I don't really > understand how reducing the stack size is avoiding a SEGV, regardless > of what LTO does. Shortly after I wrote this, I realized what the issue was. IIUC, the problem is basically because we're allocating the stack of the sub-thread as a buffer on the stack of the main thread, so the main thread stack has to have room for both. > The fact that we're basing the runtime stack size > on a limit that's from build time also doesn't really make sense to > me. This aspect still seems pretty bogus to me. > > I guess it would make sense to extend build ("distro") tests with some > > compiler flags, common and less common. > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size 2022-10-23 23:52 ` David Gibson @ 2022-10-24 0:36 ` Stefano Brivio 0 siblings, 0 replies; 5+ messages in thread From: Stefano Brivio @ 2022-10-24 0:36 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev, Andrea Bolognani On Mon, 24 Oct 2022 10:52:55 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Mon, Oct 24, 2022 at 10:36:19AM +1100, David Gibson wrote: > > On Sat, Oct 22, 2022 at 10:15:35AM +0200, Stefano Brivio wrote: > > > On Sat, 22 Oct 2022 08:45:03 +0200 > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > > > > > ...instead of one fourth. On the main() -> conf() -> nl_sock_init() > > > > call path, LTO from gcc 12 on (at least) x86_64 decides to inline... > > > > everything: nl_sock_init() is effectively part of main(), after > > > > commit 3e2eb4337bc0 ("conf: Bind inbound ports with > > > > CAP_NET_BIND_SERVICE before isolate_user()"). > > > > > > > > This means we exceed the maximum stack size, and we get SIGSEGV, > > > > under any condition, at start time, as reported by Andrea on a recent > > > > build for CentOS Stream 9. > > > > > > > > The calculation of NS_FN_STACK_SIZE, which is the stack size we > > > > reserve for clones, was previously obtained by dividing the maximum > > > > stack size by two, to avoid an explicit check on architecture (on > > > > PA-RISC, also known as hppa, the stack grows up, so we point the > > > > clone to the middle of this area), and then further divided by two > > > > to allow for any additional usage in the caller. > > > > > > > > Well, if there are essentially no function calls anymore, this is > > > > not enough. Divide it by eight, which is anyway much more than > > > > possibly needed by any clone()d callee. > > > > > > > > I think this is robust, so it's a fix in some sense. Strictly > > > > speaking, though, we have no formal guarantees that this isn't > > > > either too little or too much. > > > > > > > > What we should do, eventually: check cloned() callees, there are just > > > > thirteen of them at the moment. Note down any stack usage (they are > > > > mostly small helpers), bonus points for an automated way at build > > > > time, quadruple that or so, to allow for extreme clumsiness, and use > > > > as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. > > > > > > > > Reported-by: Andrea Bolognani <abologna@redhat.com> > > > > Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > --- > > > > > > I posted this in any case for (later) review, but I'm actually applying > > > it right away, given that some builds are completely unusable otherwise. > > > > That's sensible, however, this patch confuses me. I don't really > > understand how reducing the stack size is avoiding a SEGV, regardless > > of what LTO does. > > Shortly after I wrote this, I realized what the issue was. IIUC, the > problem is basically because we're allocating the stack of the > sub-thread as a buffer on the stack of the main thread, so the main > thread stack has to have room for both. Right, I could have phrased this better. > > The fact that we're basing the runtime stack size > > on a limit that's from build time also doesn't really make sense to > > me. > > This aspect still seems pretty bogus to me. It is. It's still a useful approximation, because those limits are rarely set to non-default per-arch values (all the distributions and versions we test happen to have, on a given architecture, the same value), and at the same time vary wildly depending on the architecture. And with that, we avoid bug-prone VLAs or, worse, alloca(). And if users set limits to substantially lower values, typically other programs won't be able to run as well. But... we don't really need that. With 16 KiB, you usually won't be able to ls. With 128 KiB, gimp crashes for me. We probably need something in between, but that implies we shouldn't just give NS_CALL()ed helpers "a lot", rather (some multiple of) what they need. They're not many and they're quite short, so we could note down some manual calculations, or we could automate that with nm, or pahole, or gcc -save-temps... or even some objdump script. I'm not sure that's worth adding a further compilation phase, though. We have well-known architecture-specific type sizes and alignments on Linux and we already do something similar for AVX2 alignments, so I would try to do this manually, take some reasonable margin on top, and maybe add an explicit stack guard (I think we shouldn't rely on -fstack-protector alone, or at all). -- Stefano ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2022-10-24 0:37 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-10-22 6:45 [PATCH] util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size Stefano Brivio 2022-10-22 8:15 ` Stefano Brivio 2022-10-23 23:36 ` David Gibson 2022-10-23 23:52 ` David Gibson 2022-10-24 0:36 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).