public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Paul Holzinger <pholzing@redhat.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH] pasta: wait for netns setup before calling exec
Date: Wed, 8 Feb 2023 16:06:30 +0100	[thread overview]
Message-ID: <c92b470a-bc6a-f422-8e57-ee81b6f816e0@redhat.com> (raw)
In-Reply-To: <20230208140130.45288e1b@elisabeth>


On 08/02/2023 14:01, Stefano Brivio wrote:
> On Tue, 7 Feb 2023 20:09:44 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
>> On 07/02/2023 11:55, Stefano Brivio wrote:
>>> On Mon, 6 Feb 2023 20:53:08 +0100
>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>   
>>>> On 03/02/2023 19:55, Stefano Brivio wrote:
>>>>> On Fri, 3 Feb 2023 17:37:03 +0100
>>>>> Stefano Brivio <sbrivio@redhat.com> wrote:
>>>>>      
>>>>>> On Fri, 3 Feb 2023 15:44:40 +0100
>>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>>>      
>>>>>>> On 02/02/2023 11:25, Stefano Brivio wrote:
>>>>>>>> On Wed,  1 Feb 2023 19:01:16 +0100
>>>>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>>>>>           
>>>>>>>>> When a user spawns a command with pasta they expect the network to be
>>>>>>>>> ready. Currently this does not work because pasta will fork/exec
>>>>>>>>> before it will setup the network config.
>>>>>>>>>
>>>>>>>>> This patch fixes it by using a pipe to sync parent and child. The child
>>>>>>>>> will now block reading from this pipe before the exec call. The parent
>>>>>>>>> will then unblock the child only after the netns was configured.
>>>>>>>> Thanks for the patch! I'm reviewing this in a bit.
>>>>>>>>
>>>>>>>> A few considerations meanwhile:
>>>>>>>>
>>>>>>>> - there's actually a bigger issue (you're fixing here) than the
>>>>>>>>       namespace configuration (via netlink) itself: the tap device isn't
>>>>>>>>       ready (tap_sock_init() hasn't been called yet) when we spawn the
>>>>>>>>       command in the new namespace. Oops.
>>>>>>>>
>>>>>>>>       If you're wondering: we can't just reorder things, because to complete
>>>>>>>>       the configuration phase (conf()) we need the namespace to be set up,
>>>>>>>>       and we can't initialise the tap device before it's set up
>>>>>>>>
>>>>>>>> - pipes are more commonly used to transfer data around (hence the whole
>>>>>>>>       code you need to open a communication channel, check it, close it).
>>>>>>>>       Did you try with a signal? Or is there a reason why it wouldn't work?
>>>>>>>>
>>>>>>>>       You could simply SIGSTOP the child, from the child itself:
>>>>>>>>
>>>>>>>> 	kill(getpid(), SIGSTOP);
>>>>>>>>
>>>>>>>>       and send a SIGCONT to it (we already store the PID of the child in
>>>>>>>>       pasta_child_pid) once we're ready.
>>>>>>>>
>>>>>>>>       SIGCONT is special in that it doesn't need CAP_KILL or the processes
>>>>>>>>       to run under the same UID -- just in the same session, so it wouldn't
>>>>>>>>       risk interfering with the isolation_*() calls.
>>>>>>>>
>>>>>>>>       I haven't tested this but I think it should lead to simpler code.
>>>>>>> Thinking about this more STOP/CONT will not work reliably, it could stop
>>>>>>> the child forever when the parent sends SIGCONT before the child
>>>>>>> SIGSTOPs itself. While this is unlikely we have no control over how both
>>>>>>> processes are scheduled.
>>>>>>>
>>>>>>> With this pipe version there is no problem when the parent closes the fd
>>>>>>> before the child calls read, read will simply return EOF and the child can
>>>>>>> continue, thus it will work correctly in all cases.
>>>>>> Ah, right, nice catch. Still, you could probably use pause() or
>>>>>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
>>>>>> experiment and I'll get back to you (probably early next week), unless
>>>>>> you manage to get it working before.
>>>>> Sorry, forget about it -- it doesn't solve the problem of waiting, in
>>>>> the parent, that the child is stopped, which is exactly the point you
>>>>> raised. A waitpid() with WUNTRACED does:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>> #include <signal.h>
>>>>> #include <sys/wait.h>
>>>>>
>>>>> #define DELAY_PARENT	0
>>>>> #define DELAY_CHILD	0
>>>>>
>>>>> int main()
>>>>> {
>>>>> 	pid_t pid;
>>>>> 	int i;
>>>>>
>>>>> 	if ((pid = fork())) {
>>>>> #if DELAY_PARENT
>>>>> 		for (i = 0; i < 10000000; i++);
>>>>> #endif
>>>>> 		waitpid(pid, NULL, WUNTRACED);
>>>>> 		kill(pid, SIGCONT);
>>>>> 		sleep(1);
>>>>> 		return 0;
>>>>> 	}
>>>>>
>>>>> #if DELAY_CHILD
>>>>> 	for (i = 0; i < 10000000; i++);
>>>>> #endif
>>>>> 	raise(SIGSTOP);
>>>>> 	fprintf(stderr, "received SIGCONT\n");
>>>>> 	return 0;
>>>>> }
>>>>>
>>>>> I left in some busyloops you can use to check. It's three lines, with
>>>>> error checks probably 9, still less than the pipe thing (~16) and it
>>>>> looks simpler (to me).
>>>>>      
>>>> I don't know what it is but this doesn't work when I implement it in
>>>> pasta. Somehow
>>>> the child doesn't seem to be stopped. A short lived processes such as ip
>>>> addr causes
>>>> pasta to exit even before the parent got to the point where it would
>>>> send SIGCONT.
>>>> If I get the nanoseconds before and after raise(SIGSTOP) there is almost
>>>> no delay.
>>>> It is clear that the child still runs after raise(SIGSTOP) even though
>>>> the parent never send
>>>> SIGCONT at this point, in fact I can completely remove the kill(pid,
>>>> SIGCONT) call and
>>>> the program works without hanging.
>>> Sorry, I didn't imagine it would be so messy.
>>>
>>> The problem is that the child sees a detached PID namspace -- if you
>>> drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as
>>> expected.
>>>
>>> Sending a SIGSTOP from the child means we're stopping the init process,
>>> again only from the child perspective, and you can't send SIGSTOP or
>>> SIGKILL to init, see sig_task_ignored() in kernel/signal.c:
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85
>>>
>>> I haven't actually traced this in the kernel and I'm not sure if we
>>> hit that condition -- and especially if we should hit it (one might
>>> argue it's a kernel issue, in case), but it seems to fit.
>>>
>>> Also:
>>>   
>>>> And of course if I run this through strace it works just fine, so I am
>>>> bit lost right now.
>>>> Likely because strace makes things much slower?
>>> ...no, a tracer actually makes things work (I also tried with delays).
>>> The child stops and we get a SIGCHLD:
>>>
>>> [pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD <unfinished ...>
>>> [pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6
>>> [pid 3101056] write(6, "0 0", 3)        = 3
>>> [pid 3101056] close(6)                  = 0
>>> [pid 3101056] gettid()                  = 1
>>> [pid 3101056] getpid()                  = 1
>>> [pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached
>>> )     = 0
>>> [pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
>>> [pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC <unfinished ...>
>>> [pid 3101056] --- stopped by SIGSTOP ---
>>> [pid 3101057] <... openat resumed>)     = 6
>>> [pid 3101057] setns(6, CLONE_NEWNET)    = 0
>>> [pid 3101057] exit(0)                   = ?
>>> [pid 3101055] <... clone resumed>)      = 3101057
>>> [pid 3101057] +++ exited with 0 +++
>>> [pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
>>>
>>> I guess because of this condition in sig_ignored():
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111
>>>
>>> which skips sig_task_ignored() altogether. This kernel behaviour looks
>>> rather unexpected to me and we should probably fix it, if my suspicion
>>> is really confirmed.
>>>   
>>>> The waitpid call also always fails with ECHILD, I don't understand why.
>>>> I think it must
>>>> have something to do with the SIGCHILD signal handler that already reaps
>>>> the signal info?
>>> The tracer shouldn't really reap that, because we only wait on WEXITED
>>> there. We get ECHILD (without a tracer) because the child isn't
>>> actually stopped.
>>>   
>>>> So either I made a a stupid mistake somewhere or it simply does not work.
>>> It won't work like that. We could send SIGSTOP from the parent instead,
>>> that works, but then we don't know (concern you raised earlier) if the
>>> child is already available when we send it.
>>>
>>> There's another way to address this concern, though: while
>>> CLONE_STOPPED was dropped a long time ago, we can obtain a similar
>>> behaviour given that the child inherits our signal mask on clone() (by
>>> default), and it can later wait for a given signal.
>>>
>>> Let's pick SIGUSR1: block it before clone(), and the child will start
>>> with it blocked, which means we can safely queue it at any later point
>>> in the parent.
>>>
>>> Something like this:
>>>
>>> ---
>>> diff --git a/passt.c b/passt.c
>>> index 8b2c50d..d957e14 100644
>>> --- a/passt.c
>>> +++ b/passt.c
>>> @@ -301,6 +301,9 @@ int main(int argc, char **argv)
>>>           else
>>>                   write_pidfile(pidfile_fd, getpid());
>>>    
>>> +       if (pasta_child_pid)
>>> +               kill(pasta_child_pid, SIGUSR1);
>>> +
>>>           isolate_postfork(&c);
>>>    
>>>           timer_init(&c, &now);
>>> diff --git a/pasta.c b/pasta.c
>>> index 528f02a..9169913 100644
>>> --- a/pasta.c
>>> +++ b/pasta.c
>>> @@ -47,7 +47,7 @@
>>>    #include "log.h"
>>>    
>>>    /* PID of child, in case we created a namespace */
>>> -static int pasta_child_pid;
>>> +int pasta_child_pid;
>>>    
>>>    /**
>>>     * pasta_child_handler() - Exit once shell exits (if we started it), reap clones
>>> @@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg {
>>>    static int pasta_spawn_cmd(void *arg)
>>>    {
>>>           const struct pasta_spawn_cmd_arg *a;
>>> +       sigset_t set;
>>>    
>>>           if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
>>>                   warn("Cannot set ping_group_range, ICMP requests might fail");
>>>    
>>> +       /* Wait for the parent to be ready: see main() */
>>> +       sigemptyset(&set);
>>> +       sigaddset(&set, SIGUSR1);
>>> +       sigwaitinfo(&set, NULL);
>>> +
>>>           a = (const struct pasta_spawn_cmd_arg *)arg;
>>>           execvp(a->exe, a->argv);
>>>    
>>> @@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>           char ns_fn_stack[NS_FN_STACK_SIZE];
>>>           char *sh_argv[] = { NULL, NULL };
>>>           char sh_arg0[PATH_MAX + 1];
>>> +       sigset_t set;
>>>    
>>>           c->foreground = 1;
>>>           if (!c->debug)
>>> @@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>                   arg.argv = sh_argv;
>>>           }
>>>    
>>> +       /* Block SIGUSR1 in child, we queue it in main() when we're ready */
>>> +       sigemptyset(&set);
>>> +       sigaddset(&set, SIGUSR1);
>>> +       sigprocmask(SIG_BLOCK, &set, NULL);
>>> +
>>>           pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
>>>                                      sizeof(ns_fn_stack),
>>>                                      CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
>>> diff --git a/pasta.h b/pasta.h
>>> index a8b9893..0ccb7e9 100644
>>> --- a/pasta.h
>>> +++ b/pasta.h
>>> @@ -6,6 +6,8 @@
>>>    #ifndef PASTA_H
>>>    #define PASTA_H
>>>    
>>> +extern int pasta_child_pid;
>>> +
>>>    void pasta_open_ns(struct ctx *c, const char *netns);
>>>    void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>                       int argc, char *argv[]);
>>> ---
>>>
>>> actually works for me.
>>>
>>> It's a few more lines than I hoped for, but I think it's still (maybe
>>> marginally, now) simpler than the pipe approach. What do you think?
>> It seems to work so I am good with it. Do you apply your diff or do you
>> want me to send a v2 with it?
> As you prefer, I can apply this and give you credits in the message.
Please apply it.
>
>> Just one question I see that these sig... function can return errors
>> according to the man page, should we handle them?
> I think the only meaningful error condition we might actually hit here
> is EINTR from sigwaitinfo(), and in that case we might just as well
> proceed without synchronisation, rather than failing or trying to
> handle it.
>
>> I see that there is no error handling done in main() for the sig...
>> functions so maybe not?
> That's not necessarily a good indication. :) In general, you can make
> the mild assumption that some code might have been written "quickly"
> and some potentially useful error checks were skipped.
>
>> Lastly another problem I just found when execvp fails, i.e. ENOENT, the
>> parent just keeps hanging and does not exit.
> Oh, oops. The intention was that the child exiting would trigger a
> SIGCHLD and pasta_child_handler() in the parent. I'm not sure what's
> wrong there.

I found the issue, you need to add signal to the clone() flags. I will 
send a patch for this.

>
>> However this is present in current HEAD, my pipe version and your signal
>> version so I don't think we need to block this for that bug.


      reply	other threads:[~2023-02-08 15:06 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-01 18:01 [PATCH] pasta: wait for netns setup before calling exec Paul Holzinger
2023-02-02 10:25 ` Stefano Brivio
2023-02-02 15:23   ` Paul Holzinger
2023-02-03 14:44   ` Paul Holzinger
2023-02-03 16:37     ` Stefano Brivio
2023-02-03 18:55       ` Stefano Brivio
2023-02-06 19:53         ` Paul Holzinger
2023-02-07 10:55           ` Stefano Brivio
2023-02-07 19:09             ` Paul Holzinger
2023-02-08 13:01               ` Stefano Brivio
2023-02-08 15:06                 ` Paul Holzinger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c92b470a-bc6a-f422-8e57-ee81b6f816e0@redhat.com \
    --to=pholzing@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).