From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 9AA0F5A0082 for ; Wed, 8 Feb 2023 16:06:37 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1675868796; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=4CEia2tpjrF/CrpSGd4IbnWK+Ju+eHP96qgfuDD0QVY=; b=KwdccM3Kw9TjQEgGC1RxqTC2wHdZnfIpS/3HVJra2LM0CnRlNH3UhUtVw0yCL39wyYwAE+ h2wyFPrpGgN88TQ3b/RBlEgHA5ylTUaGS5jasc1FIILy+VhD/QjU1/jimdh8Rxbcwqjp5O YkWN8ylUsext2r06xcuzvm27voLuIsg= Received: from mail-wm1-f69.google.com (mail-wm1-f69.google.com [209.85.128.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-613-Sgo8B2MdMnuJH8PHMJs74g-1; Wed, 08 Feb 2023 10:06:33 -0500 X-MC-Unique: Sgo8B2MdMnuJH8PHMJs74g-1 Received: by mail-wm1-f69.google.com with SMTP id s11-20020a05600c384b00b003dffc7343c3so4757610wmr.0 for ; Wed, 08 Feb 2023 07:06:33 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4CEia2tpjrF/CrpSGd4IbnWK+Ju+eHP96qgfuDD0QVY=; b=fJzupqfGfwjUDb1PN4dVhU6diTu7x5jm+nhaMuABMxQa3fz7WcmxfK4ehdL1c8kij8 YzpPryQJOT+2hdiJFV/o13ui0R3YyRlyfFMIrcY/4xL11vupmNYfu4xStCpANqA/Qi64 VPMI9uyHhvfMSBGzX93wAjM35GIVIPefa/ZHZ+3UTZzPGIhCsERiy/yHekEyh6rN4fbP vv70ucMgm5zQXSRV9Jz1YBG+f2c4O9lXc24MTtVTN6YbXGNMZ8qlfnG/Z1OsUh9xTMmz DX9ER+xYTvXglG4vNXBNno/kwn2LoalstRvTL3uIUw5l6m2brlxMp48ciSRkuBMNQp45 55Zw== X-Gm-Message-State: AO0yUKV9g90BWtA90WHjjPZzP2UhAos8Gc6sS2qkAUIYNCj1BW3FYuyl bKS61VQis5fqT51Zj9vkpJsAyvWo1tbzDgx/aGyz8fUBiVDOkh1+FKQ+s/tRV7t11L1YfaWBwrk DN1QBDbovpT4i X-Received: by 2002:a05:600c:16d6:b0:3db:14d0:65be with SMTP id l22-20020a05600c16d600b003db14d065bemr6841410wmn.34.1675868791883; Wed, 08 Feb 2023 07:06:31 -0800 (PST) X-Google-Smtp-Source: AK7set/9zmA2I6VVQGQ6usCagZEJJrdWX25GcFXe+AehfOv95xoWpver6UxV5Upd2rCBuQwdx7vmPg== X-Received: by 2002:a05:600c:16d6:b0:3db:14d0:65be with SMTP id l22-20020a05600c16d600b003db14d065bemr6841382wmn.34.1675868791531; Wed, 08 Feb 2023 07:06:31 -0800 (PST) Received: from [192.168.188.25] ([80.243.52.134]) by smtp.gmail.com with ESMTPSA id m17-20020a05600c3b1100b003df5be8987esm2389528wms.20.2023.02.08.07.06.30 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 08 Feb 2023 07:06:31 -0800 (PST) Message-ID: Date: Wed, 8 Feb 2023 16:06:30 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0 Subject: Re: [PATCH] pasta: wait for netns setup before calling exec To: Stefano Brivio References: <20230201180116.21281-1-pholzing@redhat.com> <20230202112506.187d852e@elisabeth> <7df94a05-50a5-26d8-784c-3565b6578212@redhat.com> <20230203173703.564f73c3@elisabeth> <20230203195536.667a09f6@elisabeth> <20230207115547.0d26f8ba@elisabeth> <20230208140130.45288e1b@elisabeth> From: Paul Holzinger In-Reply-To: <20230208140130.45288e1b@elisabeth> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-MailFrom: pholzing@redhat.com X-Mailman-Rule-Hits: nonmember-moderation X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation Message-ID-Hash: KMSHFGF6GGBAZKZYKNFXAWHH5YRMKHYN X-Message-ID-Hash: KMSHFGF6GGBAZKZYKNFXAWHH5YRMKHYN X-Mailman-Approved-At: Wed, 08 Feb 2023 17:27:13 +0100 CC: passt-dev@passt.top X-Mailman-Version: 3.3.3 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 08/02/2023 14:01, Stefano Brivio wrote: > On Tue, 7 Feb 2023 20:09:44 +0100 > Paul Holzinger wrote: > >> On 07/02/2023 11:55, Stefano Brivio wrote: >>> On Mon, 6 Feb 2023 20:53:08 +0100 >>> Paul Holzinger wrote: >>> >>>> On 03/02/2023 19:55, Stefano Brivio wrote: >>>>> On Fri, 3 Feb 2023 17:37:03 +0100 >>>>> Stefano Brivio wrote: >>>>> >>>>>> On Fri, 3 Feb 2023 15:44:40 +0100 >>>>>> Paul Holzinger wrote: >>>>>> >>>>>>> On 02/02/2023 11:25, Stefano Brivio wrote: >>>>>>>> On Wed, 1 Feb 2023 19:01:16 +0100 >>>>>>>> Paul Holzinger wrote: >>>>>>>> >>>>>>>>> When a user spawns a command with pasta they expect the network to be >>>>>>>>> ready. Currently this does not work because pasta will fork/exec >>>>>>>>> before it will setup the network config. >>>>>>>>> >>>>>>>>> This patch fixes it by using a pipe to sync parent and child. The child >>>>>>>>> will now block reading from this pipe before the exec call. The parent >>>>>>>>> will then unblock the child only after the netns was configured. >>>>>>>> Thanks for the patch! I'm reviewing this in a bit. >>>>>>>> >>>>>>>> A few considerations meanwhile: >>>>>>>> >>>>>>>> - there's actually a bigger issue (you're fixing here) than the >>>>>>>> namespace configuration (via netlink) itself: the tap device isn't >>>>>>>> ready (tap_sock_init() hasn't been called yet) when we spawn the >>>>>>>> command in the new namespace. Oops. >>>>>>>> >>>>>>>> If you're wondering: we can't just reorder things, because to complete >>>>>>>> the configuration phase (conf()) we need the namespace to be set up, >>>>>>>> and we can't initialise the tap device before it's set up >>>>>>>> >>>>>>>> - pipes are more commonly used to transfer data around (hence the whole >>>>>>>> code you need to open a communication channel, check it, close it). >>>>>>>> Did you try with a signal? Or is there a reason why it wouldn't work? >>>>>>>> >>>>>>>> You could simply SIGSTOP the child, from the child itself: >>>>>>>> >>>>>>>> kill(getpid(), SIGSTOP); >>>>>>>> >>>>>>>> and send a SIGCONT to it (we already store the PID of the child in >>>>>>>> pasta_child_pid) once we're ready. >>>>>>>> >>>>>>>> SIGCONT is special in that it doesn't need CAP_KILL or the processes >>>>>>>> to run under the same UID -- just in the same session, so it wouldn't >>>>>>>> risk interfering with the isolation_*() calls. >>>>>>>> >>>>>>>> I haven't tested this but I think it should lead to simpler code. >>>>>>> Thinking about this more STOP/CONT will not work reliably, it could stop >>>>>>> the child forever when the parent sends SIGCONT before the child >>>>>>> SIGSTOPs itself. While this is unlikely we have no control over how both >>>>>>> processes are scheduled. >>>>>>> >>>>>>> With this pipe version there is no problem when the parent closes the fd >>>>>>> before the child calls read, read will simply return EOF and the child can >>>>>>> continue, thus it will work correctly in all cases. >>>>>> Ah, right, nice catch. Still, you could probably use pause() or >>>>>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone >>>>>> experiment and I'll get back to you (probably early next week), unless >>>>>> you manage to get it working before. >>>>> Sorry, forget about it -- it doesn't solve the problem of waiting, in >>>>> the parent, that the child is stopped, which is exactly the point you >>>>> raised. A waitpid() with WUNTRACED does: >>>>> >>>>> #include >>>>> #include >>>>> #include >>>>> #include >>>>> >>>>> #define DELAY_PARENT 0 >>>>> #define DELAY_CHILD 0 >>>>> >>>>> int main() >>>>> { >>>>> pid_t pid; >>>>> int i; >>>>> >>>>> if ((pid = fork())) { >>>>> #if DELAY_PARENT >>>>> for (i = 0; i < 10000000; i++); >>>>> #endif >>>>> waitpid(pid, NULL, WUNTRACED); >>>>> kill(pid, SIGCONT); >>>>> sleep(1); >>>>> return 0; >>>>> } >>>>> >>>>> #if DELAY_CHILD >>>>> for (i = 0; i < 10000000; i++); >>>>> #endif >>>>> raise(SIGSTOP); >>>>> fprintf(stderr, "received SIGCONT\n"); >>>>> return 0; >>>>> } >>>>> >>>>> I left in some busyloops you can use to check. It's three lines, with >>>>> error checks probably 9, still less than the pipe thing (~16) and it >>>>> looks simpler (to me). >>>>> >>>> I don't know what it is but this doesn't work when I implement it in >>>> pasta. Somehow >>>> the child doesn't seem to be stopped. A short lived processes such as ip >>>> addr causes >>>> pasta to exit even before the parent got to the point where it would >>>> send SIGCONT. >>>> If I get the nanoseconds before and after raise(SIGSTOP) there is almost >>>> no delay. >>>> It is clear that the child still runs after raise(SIGSTOP) even though >>>> the parent never send >>>> SIGCONT at this point, in fact I can completely remove the kill(pid, >>>> SIGCONT) call and >>>> the program works without hanging. >>> Sorry, I didn't imagine it would be so messy. >>> >>> The problem is that the child sees a detached PID namspace -- if you >>> drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as >>> expected. >>> >>> Sending a SIGSTOP from the child means we're stopping the init process, >>> again only from the child perspective, and you can't send SIGSTOP or >>> SIGKILL to init, see sig_task_ignored() in kernel/signal.c: >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85 >>> >>> I haven't actually traced this in the kernel and I'm not sure if we >>> hit that condition -- and especially if we should hit it (one might >>> argue it's a kernel issue, in case), but it seems to fit. >>> >>> Also: >>> >>>> And of course if I run this through strace it works just fine, so I am >>>> bit lost right now. >>>> Likely because strace makes things much slower? >>> ...no, a tracer actually makes things work (I also tried with delays). >>> The child stops and we get a SIGCHLD: >>> >>> [pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD >>> [pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6 >>> [pid 3101056] write(6, "0 0", 3) = 3 >>> [pid 3101056] close(6) = 0 >>> [pid 3101056] gettid() = 1 >>> [pid 3101056] getpid() = 1 >>> [pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached >>> ) = 0 >>> [pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} --- >>> [pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC >>> [pid 3101056] --- stopped by SIGSTOP --- >>> [pid 3101057] <... openat resumed>) = 6 >>> [pid 3101057] setns(6, CLONE_NEWNET) = 0 >>> [pid 3101057] exit(0) = ? >>> [pid 3101055] <... clone resumed>) = 3101057 >>> [pid 3101057] +++ exited with 0 +++ >>> [pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} --- >>> >>> I guess because of this condition in sig_ignored(): >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111 >>> >>> which skips sig_task_ignored() altogether. This kernel behaviour looks >>> rather unexpected to me and we should probably fix it, if my suspicion >>> is really confirmed. >>> >>>> The waitpid call also always fails with ECHILD, I don't understand why. >>>> I think it must >>>> have something to do with the SIGCHILD signal handler that already reaps >>>> the signal info? >>> The tracer shouldn't really reap that, because we only wait on WEXITED >>> there. We get ECHILD (without a tracer) because the child isn't >>> actually stopped. >>> >>>> So either I made a a stupid mistake somewhere or it simply does not work. >>> It won't work like that. We could send SIGSTOP from the parent instead, >>> that works, but then we don't know (concern you raised earlier) if the >>> child is already available when we send it. >>> >>> There's another way to address this concern, though: while >>> CLONE_STOPPED was dropped a long time ago, we can obtain a similar >>> behaviour given that the child inherits our signal mask on clone() (by >>> default), and it can later wait for a given signal. >>> >>> Let's pick SIGUSR1: block it before clone(), and the child will start >>> with it blocked, which means we can safely queue it at any later point >>> in the parent. >>> >>> Something like this: >>> >>> --- >>> diff --git a/passt.c b/passt.c >>> index 8b2c50d..d957e14 100644 >>> --- a/passt.c >>> +++ b/passt.c >>> @@ -301,6 +301,9 @@ int main(int argc, char **argv) >>> else >>> write_pidfile(pidfile_fd, getpid()); >>> >>> + if (pasta_child_pid) >>> + kill(pasta_child_pid, SIGUSR1); >>> + >>> isolate_postfork(&c); >>> >>> timer_init(&c, &now); >>> diff --git a/pasta.c b/pasta.c >>> index 528f02a..9169913 100644 >>> --- a/pasta.c >>> +++ b/pasta.c >>> @@ -47,7 +47,7 @@ >>> #include "log.h" >>> >>> /* PID of child, in case we created a namespace */ >>> -static int pasta_child_pid; >>> +int pasta_child_pid; >>> >>> /** >>> * pasta_child_handler() - Exit once shell exits (if we started it), reap clones >>> @@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg { >>> static int pasta_spawn_cmd(void *arg) >>> { >>> const struct pasta_spawn_cmd_arg *a; >>> + sigset_t set; >>> >>> if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0")) >>> warn("Cannot set ping_group_range, ICMP requests might fail"); >>> >>> + /* Wait for the parent to be ready: see main() */ >>> + sigemptyset(&set); >>> + sigaddset(&set, SIGUSR1); >>> + sigwaitinfo(&set, NULL); >>> + >>> a = (const struct pasta_spawn_cmd_arg *)arg; >>> execvp(a->exe, a->argv); >>> >>> @@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid, >>> char ns_fn_stack[NS_FN_STACK_SIZE]; >>> char *sh_argv[] = { NULL, NULL }; >>> char sh_arg0[PATH_MAX + 1]; >>> + sigset_t set; >>> >>> c->foreground = 1; >>> if (!c->debug) >>> @@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid, >>> arg.argv = sh_argv; >>> } >>> >>> + /* Block SIGUSR1 in child, we queue it in main() when we're ready */ >>> + sigemptyset(&set); >>> + sigaddset(&set, SIGUSR1); >>> + sigprocmask(SIG_BLOCK, &set, NULL); >>> + >>> pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack, >>> sizeof(ns_fn_stack), >>> CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET | >>> diff --git a/pasta.h b/pasta.h >>> index a8b9893..0ccb7e9 100644 >>> --- a/pasta.h >>> +++ b/pasta.h >>> @@ -6,6 +6,8 @@ >>> #ifndef PASTA_H >>> #define PASTA_H >>> >>> +extern int pasta_child_pid; >>> + >>> void pasta_open_ns(struct ctx *c, const char *netns); >>> void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid, >>> int argc, char *argv[]); >>> --- >>> >>> actually works for me. >>> >>> It's a few more lines than I hoped for, but I think it's still (maybe >>> marginally, now) simpler than the pipe approach. What do you think? >> It seems to work so I am good with it. Do you apply your diff or do you >> want me to send a v2 with it? > As you prefer, I can apply this and give you credits in the message. Please apply it. > >> Just one question I see that these sig... function can return errors >> according to the man page, should we handle them? > I think the only meaningful error condition we might actually hit here > is EINTR from sigwaitinfo(), and in that case we might just as well > proceed without synchronisation, rather than failing or trying to > handle it. > >> I see that there is no error handling done in main() for the sig... >> functions so maybe not? > That's not necessarily a good indication. :) In general, you can make > the mild assumption that some code might have been written "quickly" > and some potentially useful error checks were skipped. > >> Lastly another problem I just found when execvp fails, i.e. ENOENT, the >> parent just keeps hanging and does not exit. > Oh, oops. The intention was that the child exiting would trigger a > SIGCHLD and pasta_child_handler() in the parent. I'm not sure what's > wrong there. I found the issue, you need to add signal to the clone() flags. I will send a patch for this. > >> However this is present in current HEAD, my pipe version and your signal >> version so I don't think we need to block this for that bug.