public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH] pasta: wait for netns setup before calling exec
@ 2023-02-01 18:01 Paul Holzinger
  2023-02-02 10:25 ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Holzinger @ 2023-02-01 18:01 UTC (permalink / raw)
  To: passt-dev; +Cc: Paul Holzinger

When a user spawns a command with pasta they expect the network to be
ready. Currently this does not work because pasta will fork/exec
before it will setup the network config.

This patch fixes it by using a pipe to sync parent and child. The child
will now block reading from this pipe before the exec call. The parent
will then unblock the child only after the netns was configured.

A command like `pasta --config-net -- ping -c1 1.1.1.1` can now
actually work as expected.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
---
 passt.c |  9 ++++++++-
 passt.h |  3 +++
 pasta.c | 31 ++++++++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/passt.c b/passt.c
index 8b2c50d..4ef5797 100644
--- a/passt.c
+++ b/passt.c
@@ -187,7 +187,8 @@ int main(int argc, char **argv)
 
 	isolate_initial();
 
-	c.pasta_netns_fd = c.fd_tap = c.fd_tap_listen = -1;
+	c.pasta_netns_fd = c.pasta_command_ready_fd =
+		c.fd_tap = c.fd_tap_listen = -1;
 
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = 0;
@@ -296,6 +297,12 @@ int main(int argc, char **argv)
 		exit(EXIT_FAILURE);
 	}
 
+	/* start pasta child process now after the netns is setup */
+	if (c.pasta_command_ready_fd > -1) {
+		/* close causes EOF for the read in the child so no need to write() */
+		close(c.pasta_command_ready_fd);
+	}
+
 	if (!c.foreground)
 		__daemon(pidfile_fd, devnull_fd);
 	else
diff --git a/passt.h b/passt.h
index 3d7e567..a78cd81 100644
--- a/passt.h
+++ b/passt.h
@@ -154,6 +154,8 @@ struct ip6_ctx {
  * @pcap:		Path for packet capture file
  * @pid_file:		Path to PID file, empty string if not configured
  * @pasta_netns_fd:	File descriptor for network namespace in pasta mode
+ * @pasta_command_ready_fd:	File descriptor for the ready pipe to
+ * 								start child cmd, -1 if not used
  * @no_netns_quit:	In pasta mode, don't exit if fs-bound namespace is gone
  * @netns_base:		Base name for fs-bound namespace, if any, in pasta mode
  * @netns_dir:		Directory of fs-bound namespace, if any, in pasta mode
@@ -205,6 +207,7 @@ struct ctx {
 	int one_off;
 
 	int pasta_netns_fd;
+	int pasta_command_ready_fd;
 
 	int no_netns_quit;
 	char netns_base[PATH_MAX];
diff --git a/pasta.c b/pasta.c
index 528f02a..56ac326 100644
--- a/pasta.c
+++ b/pasta.c
@@ -149,16 +149,19 @@ void pasta_open_ns(struct ctx *c, const char *netns)
 
 /**
  * struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
- * @exe:	Executable to run
- * @argv:	Command and arguments to run
+ * @exe:		Executable to run
+ * @argv:		Command and arguments to run
+ * @ready_pipe:	Ready pipe pair from parent.
  */
 struct pasta_spawn_cmd_arg {
 	const char *exe;
 	char *const *argv;
+	int ready_pipe[2];
 };
 
 /**
- * pasta_spawn_cmd() - Prepare new netns, start command or shell
+ * pasta_spawn_cmd() - Prepare new netns, spawn child, wait for parent,
+ *						then exec command or shell
  * @arg:	See @pasta_spawn_cmd_arg
  *
  * Return: this function never returns
@@ -166,11 +169,24 @@ struct pasta_spawn_cmd_arg {
 static int pasta_spawn_cmd(void *arg)
 {
 	const struct pasta_spawn_cmd_arg *a;
+	char buf[1];
 
 	if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
 		warn("Cannot set ping_group_range, ICMP requests might fail");
 
 	a = (const struct pasta_spawn_cmd_arg *)arg;
+
+	/* close write side, we want read to return EOF when parent closes the fd */
+	close(a->ready_pipe[1]);
+
+	/* wait here for parent setup to finish before we exec */
+	if (TEMP_FAILURE_RETRY(read(a->ready_pipe[0], buf, sizeof(buf))) < 0) {
+		err("Failed to read ready pipe from parent: %s",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+	close(a->ready_pipe[0]);
+
 	execvp(a->exe, a->argv);
 
 	perror("execvp");
@@ -226,6 +242,13 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 		arg.argv = sh_argv;
 	}
 
+	if (pipe(arg.ready_pipe) < 0) {
+		err("Create ready pipe: %s", strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	c->pasta_command_ready_fd = arg.ready_pipe[1];
+
 	pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
 				   sizeof(ns_fn_stack),
 				   CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
@@ -237,6 +260,8 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 		exit(EXIT_FAILURE);
 	}
 
+	close(arg.ready_pipe[0]);
+
 	NS_CALL(pasta_wait_for_ns, c);
 }
 
-- 
@@ -149,16 +149,19 @@ void pasta_open_ns(struct ctx *c, const char *netns)
 
 /**
  * struct pasta_spawn_cmd_arg - Argument for pasta_spawn_cmd()
- * @exe:	Executable to run
- * @argv:	Command and arguments to run
+ * @exe:		Executable to run
+ * @argv:		Command and arguments to run
+ * @ready_pipe:	Ready pipe pair from parent.
  */
 struct pasta_spawn_cmd_arg {
 	const char *exe;
 	char *const *argv;
+	int ready_pipe[2];
 };
 
 /**
- * pasta_spawn_cmd() - Prepare new netns, start command or shell
+ * pasta_spawn_cmd() - Prepare new netns, spawn child, wait for parent,
+ *						then exec command or shell
  * @arg:	See @pasta_spawn_cmd_arg
  *
  * Return: this function never returns
@@ -166,11 +169,24 @@ struct pasta_spawn_cmd_arg {
 static int pasta_spawn_cmd(void *arg)
 {
 	const struct pasta_spawn_cmd_arg *a;
+	char buf[1];
 
 	if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
 		warn("Cannot set ping_group_range, ICMP requests might fail");
 
 	a = (const struct pasta_spawn_cmd_arg *)arg;
+
+	/* close write side, we want read to return EOF when parent closes the fd */
+	close(a->ready_pipe[1]);
+
+	/* wait here for parent setup to finish before we exec */
+	if (TEMP_FAILURE_RETRY(read(a->ready_pipe[0], buf, sizeof(buf))) < 0) {
+		err("Failed to read ready pipe from parent: %s",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+	close(a->ready_pipe[0]);
+
 	execvp(a->exe, a->argv);
 
 	perror("execvp");
@@ -226,6 +242,13 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 		arg.argv = sh_argv;
 	}
 
+	if (pipe(arg.ready_pipe) < 0) {
+		err("Create ready pipe: %s", strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	c->pasta_command_ready_fd = arg.ready_pipe[1];
+
 	pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
 				   sizeof(ns_fn_stack),
 				   CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
@@ -237,6 +260,8 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 		exit(EXIT_FAILURE);
 	}
 
+	close(arg.ready_pipe[0]);
+
 	NS_CALL(pasta_wait_for_ns, c);
 }
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-01 18:01 [PATCH] pasta: wait for netns setup before calling exec Paul Holzinger
@ 2023-02-02 10:25 ` Stefano Brivio
  2023-02-02 15:23   ` Paul Holzinger
  2023-02-03 14:44   ` Paul Holzinger
  0 siblings, 2 replies; 11+ messages in thread
From: Stefano Brivio @ 2023-02-02 10:25 UTC (permalink / raw)
  To: Paul Holzinger; +Cc: passt-dev

On Wed,  1 Feb 2023 19:01:16 +0100
Paul Holzinger <pholzing@redhat.com> wrote:

> When a user spawns a command with pasta they expect the network to be
> ready. Currently this does not work because pasta will fork/exec
> before it will setup the network config.
> 
> This patch fixes it by using a pipe to sync parent and child. The child
> will now block reading from this pipe before the exec call. The parent
> will then unblock the child only after the netns was configured.

Thanks for the patch! I'm reviewing this in a bit.

A few considerations meanwhile:

- there's actually a bigger issue (you're fixing here) than the
  namespace configuration (via netlink) itself: the tap device isn't
  ready (tap_sock_init() hasn't been called yet) when we spawn the
  command in the new namespace. Oops.

  If you're wondering: we can't just reorder things, because to complete
  the configuration phase (conf()) we need the namespace to be set up,
  and we can't initialise the tap device before it's set up

- pipes are more commonly used to transfer data around (hence the whole
  code you need to open a communication channel, check it, close it).
  Did you try with a signal? Or is there a reason why it wouldn't work?

  You could simply SIGSTOP the child, from the child itself:

	kill(getpid(), SIGSTOP);

  and send a SIGCONT to it (we already store the PID of the child in
  pasta_child_pid) once we're ready.

  SIGCONT is special in that it doesn't need CAP_KILL or the processes
  to run under the same UID -- just in the same session, so it wouldn't
  risk interfering with the isolation_*() calls.

  I haven't tested this but I think it should lead to simpler code.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-02 10:25 ` Stefano Brivio
@ 2023-02-02 15:23   ` Paul Holzinger
  2023-02-03 14:44   ` Paul Holzinger
  1 sibling, 0 replies; 11+ messages in thread
From: Paul Holzinger @ 2023-02-02 15:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On 02/02/2023 11:25, Stefano Brivio wrote:
> On Wed,  1 Feb 2023 19:01:16 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
>> When a user spawns a command with pasta they expect the network to be
>> ready. Currently this does not work because pasta will fork/exec
>> before it will setup the network config.
>>
>> This patch fixes it by using a pipe to sync parent and child. The child
>> will now block reading from this pipe before the exec call. The parent
>> will then unblock the child only after the netns was configured.
> Thanks for the patch! I'm reviewing this in a bit.
>
> A few considerations meanwhile:
>
> - there's actually a bigger issue (you're fixing here) than the
>    namespace configuration (via netlink) itself: the tap device isn't
>    ready (tap_sock_init() hasn't been called yet) when we spawn the
>    command in the new namespace. Oops.
>
>    If you're wondering: we can't just reorder things, because to complete
>    the configuration phase (conf()) we need the namespace to be set up,
>    and we can't initialise the tap device before it's set up
>
> - pipes are more commonly used to transfer data around (hence the whole
>    code you need to open a communication channel, check it, close it).
>    Did you try with a signal? Or is there a reason why it wouldn't work?
>
>    You could simply SIGSTOP the child, from the child itself:
>
> 	kill(getpid(), SIGSTOP);
>
>    and send a SIGCONT to it (we already store the PID of the child in
>    pasta_child_pid) once we're ready.
>
>    SIGCONT is special in that it doesn't need CAP_KILL or the processes
>    to run under the same UID -- just in the same session, so it wouldn't
>    risk interfering with the isolation_*() calls.
>
>    I haven't tested this but I think it should lead to simpler code.
>
I simply haven't thought of using the stop signal, we use pipes like that
in podman so I knew how to implement it.
I will test your approach, I agree that it would be a bit simpler.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-02 10:25 ` Stefano Brivio
  2023-02-02 15:23   ` Paul Holzinger
@ 2023-02-03 14:44   ` Paul Holzinger
  2023-02-03 16:37     ` Stefano Brivio
  1 sibling, 1 reply; 11+ messages in thread
From: Paul Holzinger @ 2023-02-03 14:44 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On 02/02/2023 11:25, Stefano Brivio wrote:
> On Wed,  1 Feb 2023 19:01:16 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
>> When a user spawns a command with pasta they expect the network to be
>> ready. Currently this does not work because pasta will fork/exec
>> before it will setup the network config.
>>
>> This patch fixes it by using a pipe to sync parent and child. The child
>> will now block reading from this pipe before the exec call. The parent
>> will then unblock the child only after the netns was configured.
> Thanks for the patch! I'm reviewing this in a bit.
>
> A few considerations meanwhile:
>
> - there's actually a bigger issue (you're fixing here) than the
>    namespace configuration (via netlink) itself: the tap device isn't
>    ready (tap_sock_init() hasn't been called yet) when we spawn the
>    command in the new namespace. Oops.
>
>    If you're wondering: we can't just reorder things, because to complete
>    the configuration phase (conf()) we need the namespace to be set up,
>    and we can't initialise the tap device before it's set up
>
> - pipes are more commonly used to transfer data around (hence the whole
>    code you need to open a communication channel, check it, close it).
>    Did you try with a signal? Or is there a reason why it wouldn't work?
>
>    You could simply SIGSTOP the child, from the child itself:
>
> 	kill(getpid(), SIGSTOP);
>
>    and send a SIGCONT to it (we already store the PID of the child in
>    pasta_child_pid) once we're ready.
>
>    SIGCONT is special in that it doesn't need CAP_KILL or the processes
>    to run under the same UID -- just in the same session, so it wouldn't
>    risk interfering with the isolation_*() calls.
>
>    I haven't tested this but I think it should lead to simpler code.
>
Thinking about this more STOP/CONT will not work reliably, it could stop

the child forever when the parent sends SIGCONT before the child

SIGSTOPs itself. While this is unlikely we have no control over how both

processes are scheduled.

With this pipe version there is no problem when the parent closes the fd

before the child calls read, read will simply return EOF and the child can

continue, thus it will work correctly in all cases.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-03 14:44   ` Paul Holzinger
@ 2023-02-03 16:37     ` Stefano Brivio
  2023-02-03 18:55       ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2023-02-03 16:37 UTC (permalink / raw)
  To: Paul Holzinger; +Cc: passt-dev

On Fri, 3 Feb 2023 15:44:40 +0100
Paul Holzinger <pholzing@redhat.com> wrote:

> On 02/02/2023 11:25, Stefano Brivio wrote:
> > On Wed,  1 Feb 2023 19:01:16 +0100
> > Paul Holzinger <pholzing@redhat.com> wrote:
> >  
> >> When a user spawns a command with pasta they expect the network to be
> >> ready. Currently this does not work because pasta will fork/exec
> >> before it will setup the network config.
> >>
> >> This patch fixes it by using a pipe to sync parent and child. The child
> >> will now block reading from this pipe before the exec call. The parent
> >> will then unblock the child only after the netns was configured.  
> > Thanks for the patch! I'm reviewing this in a bit.
> >
> > A few considerations meanwhile:
> >
> > - there's actually a bigger issue (you're fixing here) than the
> >    namespace configuration (via netlink) itself: the tap device isn't
> >    ready (tap_sock_init() hasn't been called yet) when we spawn the
> >    command in the new namespace. Oops.
> >
> >    If you're wondering: we can't just reorder things, because to complete
> >    the configuration phase (conf()) we need the namespace to be set up,
> >    and we can't initialise the tap device before it's set up
> >
> > - pipes are more commonly used to transfer data around (hence the whole
> >    code you need to open a communication channel, check it, close it).
> >    Did you try with a signal? Or is there a reason why it wouldn't work?
> >
> >    You could simply SIGSTOP the child, from the child itself:
> >
> > 	kill(getpid(), SIGSTOP);
> >
> >    and send a SIGCONT to it (we already store the PID of the child in
> >    pasta_child_pid) once we're ready.
> >
> >    SIGCONT is special in that it doesn't need CAP_KILL or the processes
> >    to run under the same UID -- just in the same session, so it wouldn't
> >    risk interfering with the isolation_*() calls.
> >
> >    I haven't tested this but I think it should lead to simpler code.
>
> Thinking about this more STOP/CONT will not work reliably, it could stop
> the child forever when the parent sends SIGCONT before the child
> SIGSTOPs itself. While this is unlikely we have no control over how both
> processes are scheduled.
> 
> With this pipe version there is no problem when the parent closes the fd
> before the child calls read, read will simply return EOF and the child can
> continue, thus it will work correctly in all cases.

Ah, right, nice catch. Still, you could probably use pause() or
sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
experiment and I'll get back to you (probably early next week), unless
you manage to get it working before.

If that doesn't work, fine, let's go with the pipe.

Unfortunately CLONE_STOPPED (flag for clone) has gone away a long time
ago.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-03 16:37     ` Stefano Brivio
@ 2023-02-03 18:55       ` Stefano Brivio
  2023-02-06 19:53         ` Paul Holzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2023-02-03 18:55 UTC (permalink / raw)
  To: Paul Holzinger; +Cc: passt-dev

On Fri, 3 Feb 2023 17:37:03 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 3 Feb 2023 15:44:40 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
> 
> > On 02/02/2023 11:25, Stefano Brivio wrote:  
> > > On Wed,  1 Feb 2023 19:01:16 +0100
> > > Paul Holzinger <pholzing@redhat.com> wrote:
> > >    
> > >> When a user spawns a command with pasta they expect the network to be
> > >> ready. Currently this does not work because pasta will fork/exec
> > >> before it will setup the network config.
> > >>
> > >> This patch fixes it by using a pipe to sync parent and child. The child
> > >> will now block reading from this pipe before the exec call. The parent
> > >> will then unblock the child only after the netns was configured.    
> > > Thanks for the patch! I'm reviewing this in a bit.
> > >
> > > A few considerations meanwhile:
> > >
> > > - there's actually a bigger issue (you're fixing here) than the
> > >    namespace configuration (via netlink) itself: the tap device isn't
> > >    ready (tap_sock_init() hasn't been called yet) when we spawn the
> > >    command in the new namespace. Oops.
> > >
> > >    If you're wondering: we can't just reorder things, because to complete
> > >    the configuration phase (conf()) we need the namespace to be set up,
> > >    and we can't initialise the tap device before it's set up
> > >
> > > - pipes are more commonly used to transfer data around (hence the whole
> > >    code you need to open a communication channel, check it, close it).
> > >    Did you try with a signal? Or is there a reason why it wouldn't work?
> > >
> > >    You could simply SIGSTOP the child, from the child itself:
> > >
> > > 	kill(getpid(), SIGSTOP);
> > >
> > >    and send a SIGCONT to it (we already store the PID of the child in
> > >    pasta_child_pid) once we're ready.
> > >
> > >    SIGCONT is special in that it doesn't need CAP_KILL or the processes
> > >    to run under the same UID -- just in the same session, so it wouldn't
> > >    risk interfering with the isolation_*() calls.
> > >
> > >    I haven't tested this but I think it should lead to simpler code.  
> >
> > Thinking about this more STOP/CONT will not work reliably, it could stop
> > the child forever when the parent sends SIGCONT before the child
> > SIGSTOPs itself. While this is unlikely we have no control over how both
> > processes are scheduled.
> > 
> > With this pipe version there is no problem when the parent closes the fd
> > before the child calls read, read will simply return EOF and the child can
> > continue, thus it will work correctly in all cases.  
> 
> Ah, right, nice catch. Still, you could probably use pause() or
> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
> experiment and I'll get back to you (probably early next week), unless
> you manage to get it working before.

Sorry, forget about it -- it doesn't solve the problem of waiting, in
the parent, that the child is stopped, which is exactly the point you
raised. A waitpid() with WUNTRACED does:

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

#define DELAY_PARENT	0
#define DELAY_CHILD	0

int main()
{
	pid_t pid;
	int i;

	if ((pid = fork())) {
#if DELAY_PARENT
		for (i = 0; i < 10000000; i++);
#endif
		waitpid(pid, NULL, WUNTRACED);
		kill(pid, SIGCONT);
		sleep(1);
		return 0;
	}

#if DELAY_CHILD
	for (i = 0; i < 10000000; i++);
#endif
	raise(SIGSTOP);
	fprintf(stderr, "received SIGCONT\n");
	return 0;
}

I left in some busyloops you can use to check. It's three lines, with
error checks probably 9, still less than the pipe thing (~16) and it
looks simpler (to me).

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-03 18:55       ` Stefano Brivio
@ 2023-02-06 19:53         ` Paul Holzinger
  2023-02-07 10:55           ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Holzinger @ 2023-02-06 19:53 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On 03/02/2023 19:55, Stefano Brivio wrote:
> On Fri, 3 Feb 2023 17:37:03 +0100
> Stefano Brivio <sbrivio@redhat.com> wrote:
>
>> On Fri, 3 Feb 2023 15:44:40 +0100
>> Paul Holzinger <pholzing@redhat.com> wrote:
>>
>>> On 02/02/2023 11:25, Stefano Brivio wrote:
>>>> On Wed,  1 Feb 2023 19:01:16 +0100
>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>     
>>>>> When a user spawns a command with pasta they expect the network to be
>>>>> ready. Currently this does not work because pasta will fork/exec
>>>>> before it will setup the network config.
>>>>>
>>>>> This patch fixes it by using a pipe to sync parent and child. The child
>>>>> will now block reading from this pipe before the exec call. The parent
>>>>> will then unblock the child only after the netns was configured.
>>>> Thanks for the patch! I'm reviewing this in a bit.
>>>>
>>>> A few considerations meanwhile:
>>>>
>>>> - there's actually a bigger issue (you're fixing here) than the
>>>>     namespace configuration (via netlink) itself: the tap device isn't
>>>>     ready (tap_sock_init() hasn't been called yet) when we spawn the
>>>>     command in the new namespace. Oops.
>>>>
>>>>     If you're wondering: we can't just reorder things, because to complete
>>>>     the configuration phase (conf()) we need the namespace to be set up,
>>>>     and we can't initialise the tap device before it's set up
>>>>
>>>> - pipes are more commonly used to transfer data around (hence the whole
>>>>     code you need to open a communication channel, check it, close it).
>>>>     Did you try with a signal? Or is there a reason why it wouldn't work?
>>>>
>>>>     You could simply SIGSTOP the child, from the child itself:
>>>>
>>>> 	kill(getpid(), SIGSTOP);
>>>>
>>>>     and send a SIGCONT to it (we already store the PID of the child in
>>>>     pasta_child_pid) once we're ready.
>>>>
>>>>     SIGCONT is special in that it doesn't need CAP_KILL or the processes
>>>>     to run under the same UID -- just in the same session, so it wouldn't
>>>>     risk interfering with the isolation_*() calls.
>>>>
>>>>     I haven't tested this but I think it should lead to simpler code.
>>> Thinking about this more STOP/CONT will not work reliably, it could stop
>>> the child forever when the parent sends SIGCONT before the child
>>> SIGSTOPs itself. While this is unlikely we have no control over how both
>>> processes are scheduled.
>>>
>>> With this pipe version there is no problem when the parent closes the fd
>>> before the child calls read, read will simply return EOF and the child can
>>> continue, thus it will work correctly in all cases.
>> Ah, right, nice catch. Still, you could probably use pause() or
>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
>> experiment and I'll get back to you (probably early next week), unless
>> you manage to get it working before.
> Sorry, forget about it -- it doesn't solve the problem of waiting, in
> the parent, that the child is stopped, which is exactly the point you
> raised. A waitpid() with WUNTRACED does:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <signal.h>
> #include <sys/wait.h>
>
> #define DELAY_PARENT	0
> #define DELAY_CHILD	0
>
> int main()
> {
> 	pid_t pid;
> 	int i;
>
> 	if ((pid = fork())) {
> #if DELAY_PARENT
> 		for (i = 0; i < 10000000; i++);
> #endif
> 		waitpid(pid, NULL, WUNTRACED);
> 		kill(pid, SIGCONT);
> 		sleep(1);
> 		return 0;
> 	}
>
> #if DELAY_CHILD
> 	for (i = 0; i < 10000000; i++);
> #endif
> 	raise(SIGSTOP);
> 	fprintf(stderr, "received SIGCONT\n");
> 	return 0;
> }
>
> I left in some busyloops you can use to check. It's three lines, with
> error checks probably 9, still less than the pipe thing (~16) and it
> looks simpler (to me).
>

I don't know what it is but this doesn't work when I implement it in 
pasta. Somehow
the child doesn't seem to be stopped. A short lived processes such as ip 
addr causes
pasta to exit even before the parent got to the point where it would 
send SIGCONT.
If I get the nanoseconds before and after raise(SIGSTOP) there is almost 
no delay.
It is clear that the child still runs after raise(SIGSTOP) even though 
the parent never send
SIGCONT at this point, in fact I can completely remove the kill(pid, 
SIGCONT) call and
the program works without hanging.

And of course if I run this through strace it works just fine, so I am 
bit lost right now.
Likely because strace makes things much slower?

The waitpid call also always fails with ECHILD, I don't understand why. 
I think it must
have something to do with the SIGCHILD signal handler that already reaps 
the signal info?

So either I made a a stupid mistake somewhere or it simply does not work.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-06 19:53         ` Paul Holzinger
@ 2023-02-07 10:55           ` Stefano Brivio
  2023-02-07 19:09             ` Paul Holzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2023-02-07 10:55 UTC (permalink / raw)
  To: Paul Holzinger; +Cc: passt-dev

On Mon, 6 Feb 2023 20:53:08 +0100
Paul Holzinger <pholzing@redhat.com> wrote:

> On 03/02/2023 19:55, Stefano Brivio wrote:
> > On Fri, 3 Feb 2023 17:37:03 +0100
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> >  
> >> On Fri, 3 Feb 2023 15:44:40 +0100
> >> Paul Holzinger <pholzing@redhat.com> wrote:
> >>  
> >>> On 02/02/2023 11:25, Stefano Brivio wrote:  
> >>>> On Wed,  1 Feb 2023 19:01:16 +0100
> >>>> Paul Holzinger <pholzing@redhat.com> wrote:
> >>>>       
> >>>>> When a user spawns a command with pasta they expect the network to be
> >>>>> ready. Currently this does not work because pasta will fork/exec
> >>>>> before it will setup the network config.
> >>>>>
> >>>>> This patch fixes it by using a pipe to sync parent and child. The child
> >>>>> will now block reading from this pipe before the exec call. The parent
> >>>>> will then unblock the child only after the netns was configured.  
> >>>> Thanks for the patch! I'm reviewing this in a bit.
> >>>>
> >>>> A few considerations meanwhile:
> >>>>
> >>>> - there's actually a bigger issue (you're fixing here) than the
> >>>>     namespace configuration (via netlink) itself: the tap device isn't
> >>>>     ready (tap_sock_init() hasn't been called yet) when we spawn the
> >>>>     command in the new namespace. Oops.
> >>>>
> >>>>     If you're wondering: we can't just reorder things, because to complete
> >>>>     the configuration phase (conf()) we need the namespace to be set up,
> >>>>     and we can't initialise the tap device before it's set up
> >>>>
> >>>> - pipes are more commonly used to transfer data around (hence the whole
> >>>>     code you need to open a communication channel, check it, close it).
> >>>>     Did you try with a signal? Or is there a reason why it wouldn't work?
> >>>>
> >>>>     You could simply SIGSTOP the child, from the child itself:
> >>>>
> >>>> 	kill(getpid(), SIGSTOP);
> >>>>
> >>>>     and send a SIGCONT to it (we already store the PID of the child in
> >>>>     pasta_child_pid) once we're ready.
> >>>>
> >>>>     SIGCONT is special in that it doesn't need CAP_KILL or the processes
> >>>>     to run under the same UID -- just in the same session, so it wouldn't
> >>>>     risk interfering with the isolation_*() calls.
> >>>>
> >>>>     I haven't tested this but I think it should lead to simpler code.  
> >>> Thinking about this more STOP/CONT will not work reliably, it could stop
> >>> the child forever when the parent sends SIGCONT before the child
> >>> SIGSTOPs itself. While this is unlikely we have no control over how both
> >>> processes are scheduled.
> >>>
> >>> With this pipe version there is no problem when the parent closes the fd
> >>> before the child calls read, read will simply return EOF and the child can
> >>> continue, thus it will work correctly in all cases.  
> >> Ah, right, nice catch. Still, you could probably use pause() or
> >> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
> >> experiment and I'll get back to you (probably early next week), unless
> >> you manage to get it working before.  
> > Sorry, forget about it -- it doesn't solve the problem of waiting, in
> > the parent, that the child is stopped, which is exactly the point you
> > raised. A waitpid() with WUNTRACED does:
> >
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <signal.h>
> > #include <sys/wait.h>
> >
> > #define DELAY_PARENT	0
> > #define DELAY_CHILD	0
> >
> > int main()
> > {
> > 	pid_t pid;
> > 	int i;
> >
> > 	if ((pid = fork())) {
> > #if DELAY_PARENT
> > 		for (i = 0; i < 10000000; i++);
> > #endif
> > 		waitpid(pid, NULL, WUNTRACED);
> > 		kill(pid, SIGCONT);
> > 		sleep(1);
> > 		return 0;
> > 	}
> >
> > #if DELAY_CHILD
> > 	for (i = 0; i < 10000000; i++);
> > #endif
> > 	raise(SIGSTOP);
> > 	fprintf(stderr, "received SIGCONT\n");
> > 	return 0;
> > }
> >
> > I left in some busyloops you can use to check. It's three lines, with
> > error checks probably 9, still less than the pipe thing (~16) and it
> > looks simpler (to me).
> >  
> 
> I don't know what it is but this doesn't work when I implement it in 
> pasta. Somehow
> the child doesn't seem to be stopped. A short lived processes such as ip 
> addr causes
> pasta to exit even before the parent got to the point where it would 
> send SIGCONT.
> If I get the nanoseconds before and after raise(SIGSTOP) there is almost 
> no delay.
> It is clear that the child still runs after raise(SIGSTOP) even though 
> the parent never send
> SIGCONT at this point, in fact I can completely remove the kill(pid, 
> SIGCONT) call and
> the program works without hanging.

Sorry, I didn't imagine it would be so messy.

The problem is that the child sees a detached PID namspace -- if you
drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as
expected.

Sending a SIGSTOP from the child means we're stopping the init process,
again only from the child perspective, and you can't send SIGSTOP or
SIGKILL to init, see sig_task_ignored() in kernel/signal.c:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85

I haven't actually traced this in the kernel and I'm not sure if we
hit that condition -- and especially if we should hit it (one might
argue it's a kernel issue, in case), but it seems to fit.

Also:

> And of course if I run this through strace it works just fine, so I am 
> bit lost right now.
> Likely because strace makes things much slower?

...no, a tracer actually makes things work (I also tried with delays).
The child stops and we get a SIGCHLD:

[pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD <unfinished ...>
[pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6
[pid 3101056] write(6, "0 0", 3)        = 3
[pid 3101056] close(6)                  = 0
[pid 3101056] gettid()                  = 1
[pid 3101056] getpid()                  = 1
[pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached
)     = 0
[pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
[pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 3101056] --- stopped by SIGSTOP ---
[pid 3101057] <... openat resumed>)     = 6
[pid 3101057] setns(6, CLONE_NEWNET)    = 0
[pid 3101057] exit(0)                   = ?
[pid 3101055] <... clone resumed>)      = 3101057
[pid 3101057] +++ exited with 0 +++
[pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} ---

I guess because of this condition in sig_ignored():
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111

which skips sig_task_ignored() altogether. This kernel behaviour looks
rather unexpected to me and we should probably fix it, if my suspicion
is really confirmed.

> The waitpid call also always fails with ECHILD, I don't understand why. 
> I think it must
> have something to do with the SIGCHILD signal handler that already reaps 
> the signal info?

The tracer shouldn't really reap that, because we only wait on WEXITED
there. We get ECHILD (without a tracer) because the child isn't
actually stopped.

> So either I made a a stupid mistake somewhere or it simply does not work.

It won't work like that. We could send SIGSTOP from the parent instead,
that works, but then we don't know (concern you raised earlier) if the
child is already available when we send it.

There's another way to address this concern, though: while
CLONE_STOPPED was dropped a long time ago, we can obtain a similar
behaviour given that the child inherits our signal mask on clone() (by
default), and it can later wait for a given signal.

Let's pick SIGUSR1: block it before clone(), and the child will start
with it blocked, which means we can safely queue it at any later point
in the parent.

Something like this:

---
diff --git a/passt.c b/passt.c
index 8b2c50d..d957e14 100644
--- a/passt.c
+++ b/passt.c
@@ -301,6 +301,9 @@ int main(int argc, char **argv)
        else
                write_pidfile(pidfile_fd, getpid());
 
+       if (pasta_child_pid)
+               kill(pasta_child_pid, SIGUSR1);
+
        isolate_postfork(&c);
 
        timer_init(&c, &now);
diff --git a/pasta.c b/pasta.c
index 528f02a..9169913 100644
--- a/pasta.c
+++ b/pasta.c
@@ -47,7 +47,7 @@
 #include "log.h"
 
 /* PID of child, in case we created a namespace */
-static int pasta_child_pid;
+int pasta_child_pid;
 
 /**
  * pasta_child_handler() - Exit once shell exits (if we started it), reap clones
@@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg {
 static int pasta_spawn_cmd(void *arg)
 {
        const struct pasta_spawn_cmd_arg *a;
+       sigset_t set;
 
        if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
                warn("Cannot set ping_group_range, ICMP requests might fail");
 
+       /* Wait for the parent to be ready: see main() */
+       sigemptyset(&set);
+       sigaddset(&set, SIGUSR1);
+       sigwaitinfo(&set, NULL);
+
        a = (const struct pasta_spawn_cmd_arg *)arg;
        execvp(a->exe, a->argv);
 
@@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
        char ns_fn_stack[NS_FN_STACK_SIZE];
        char *sh_argv[] = { NULL, NULL };
        char sh_arg0[PATH_MAX + 1];
+       sigset_t set;
 
        c->foreground = 1;
        if (!c->debug)
@@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
                arg.argv = sh_argv;
        }
 
+       /* Block SIGUSR1 in child, we queue it in main() when we're ready */
+       sigemptyset(&set);
+       sigaddset(&set, SIGUSR1);
+       sigprocmask(SIG_BLOCK, &set, NULL);
+
        pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
                                   sizeof(ns_fn_stack),
                                   CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
diff --git a/pasta.h b/pasta.h
index a8b9893..0ccb7e9 100644
--- a/pasta.h
+++ b/pasta.h
@@ -6,6 +6,8 @@
 #ifndef PASTA_H
 #define PASTA_H
 
+extern int pasta_child_pid;
+
 void pasta_open_ns(struct ctx *c, const char *netns);
 void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
                    int argc, char *argv[]);
---

actually works for me.

It's a few more lines than I hoped for, but I think it's still (maybe
marginally, now) simpler than the pipe approach. What do you think?

-- 
@@ -6,6 +6,8 @@
 #ifndef PASTA_H
 #define PASTA_H
 
+extern int pasta_child_pid;
+
 void pasta_open_ns(struct ctx *c, const char *netns);
 void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
                    int argc, char *argv[]);
---

actually works for me.

It's a few more lines than I hoped for, but I think it's still (maybe
marginally, now) simpler than the pipe approach. What do you think?

-- 
Stefano


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-07 10:55           ` Stefano Brivio
@ 2023-02-07 19:09             ` Paul Holzinger
  2023-02-08 13:01               ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Holzinger @ 2023-02-07 19:09 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On 07/02/2023 11:55, Stefano Brivio wrote:
> On Mon, 6 Feb 2023 20:53:08 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
>> On 03/02/2023 19:55, Stefano Brivio wrote:
>>> On Fri, 3 Feb 2023 17:37:03 +0100
>>> Stefano Brivio <sbrivio@redhat.com> wrote:
>>>   
>>>> On Fri, 3 Feb 2023 15:44:40 +0100
>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>   
>>>>> On 02/02/2023 11:25, Stefano Brivio wrote:
>>>>>> On Wed,  1 Feb 2023 19:01:16 +0100
>>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>>>        
>>>>>>> When a user spawns a command with pasta they expect the network to be
>>>>>>> ready. Currently this does not work because pasta will fork/exec
>>>>>>> before it will setup the network config.
>>>>>>>
>>>>>>> This patch fixes it by using a pipe to sync parent and child. The child
>>>>>>> will now block reading from this pipe before the exec call. The parent
>>>>>>> will then unblock the child only after the netns was configured.
>>>>>> Thanks for the patch! I'm reviewing this in a bit.
>>>>>>
>>>>>> A few considerations meanwhile:
>>>>>>
>>>>>> - there's actually a bigger issue (you're fixing here) than the
>>>>>>      namespace configuration (via netlink) itself: the tap device isn't
>>>>>>      ready (tap_sock_init() hasn't been called yet) when we spawn the
>>>>>>      command in the new namespace. Oops.
>>>>>>
>>>>>>      If you're wondering: we can't just reorder things, because to complete
>>>>>>      the configuration phase (conf()) we need the namespace to be set up,
>>>>>>      and we can't initialise the tap device before it's set up
>>>>>>
>>>>>> - pipes are more commonly used to transfer data around (hence the whole
>>>>>>      code you need to open a communication channel, check it, close it).
>>>>>>      Did you try with a signal? Or is there a reason why it wouldn't work?
>>>>>>
>>>>>>      You could simply SIGSTOP the child, from the child itself:
>>>>>>
>>>>>> 	kill(getpid(), SIGSTOP);
>>>>>>
>>>>>>      and send a SIGCONT to it (we already store the PID of the child in
>>>>>>      pasta_child_pid) once we're ready.
>>>>>>
>>>>>>      SIGCONT is special in that it doesn't need CAP_KILL or the processes
>>>>>>      to run under the same UID -- just in the same session, so it wouldn't
>>>>>>      risk interfering with the isolation_*() calls.
>>>>>>
>>>>>>      I haven't tested this but I think it should lead to simpler code.
>>>>> Thinking about this more STOP/CONT will not work reliably, it could stop
>>>>> the child forever when the parent sends SIGCONT before the child
>>>>> SIGSTOPs itself. While this is unlikely we have no control over how both
>>>>> processes are scheduled.
>>>>>
>>>>> With this pipe version there is no problem when the parent closes the fd
>>>>> before the child calls read, read will simply return EOF and the child can
>>>>> continue, thus it will work correctly in all cases.
>>>> Ah, right, nice catch. Still, you could probably use pause() or
>>>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
>>>> experiment and I'll get back to you (probably early next week), unless
>>>> you manage to get it working before.
>>> Sorry, forget about it -- it doesn't solve the problem of waiting, in
>>> the parent, that the child is stopped, which is exactly the point you
>>> raised. A waitpid() with WUNTRACED does:
>>>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <signal.h>
>>> #include <sys/wait.h>
>>>
>>> #define DELAY_PARENT	0
>>> #define DELAY_CHILD	0
>>>
>>> int main()
>>> {
>>> 	pid_t pid;
>>> 	int i;
>>>
>>> 	if ((pid = fork())) {
>>> #if DELAY_PARENT
>>> 		for (i = 0; i < 10000000; i++);
>>> #endif
>>> 		waitpid(pid, NULL, WUNTRACED);
>>> 		kill(pid, SIGCONT);
>>> 		sleep(1);
>>> 		return 0;
>>> 	}
>>>
>>> #if DELAY_CHILD
>>> 	for (i = 0; i < 10000000; i++);
>>> #endif
>>> 	raise(SIGSTOP);
>>> 	fprintf(stderr, "received SIGCONT\n");
>>> 	return 0;
>>> }
>>>
>>> I left in some busyloops you can use to check. It's three lines, with
>>> error checks probably 9, still less than the pipe thing (~16) and it
>>> looks simpler (to me).
>>>   
>> I don't know what it is but this doesn't work when I implement it in
>> pasta. Somehow
>> the child doesn't seem to be stopped. A short lived processes such as ip
>> addr causes
>> pasta to exit even before the parent got to the point where it would
>> send SIGCONT.
>> If I get the nanoseconds before and after raise(SIGSTOP) there is almost
>> no delay.
>> It is clear that the child still runs after raise(SIGSTOP) even though
>> the parent never send
>> SIGCONT at this point, in fact I can completely remove the kill(pid,
>> SIGCONT) call and
>> the program works without hanging.
> Sorry, I didn't imagine it would be so messy.
>
> The problem is that the child sees a detached PID namspace -- if you
> drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as
> expected.
>
> Sending a SIGSTOP from the child means we're stopping the init process,
> again only from the child perspective, and you can't send SIGSTOP or
> SIGKILL to init, see sig_task_ignored() in kernel/signal.c:
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85
>
> I haven't actually traced this in the kernel and I'm not sure if we
> hit that condition -- and especially if we should hit it (one might
> argue it's a kernel issue, in case), but it seems to fit.
>
> Also:
>
>> And of course if I run this through strace it works just fine, so I am
>> bit lost right now.
>> Likely because strace makes things much slower?
> ...no, a tracer actually makes things work (I also tried with delays).
> The child stops and we get a SIGCHLD:
>
> [pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD <unfinished ...>
> [pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6
> [pid 3101056] write(6, "0 0", 3)        = 3
> [pid 3101056] close(6)                  = 0
> [pid 3101056] gettid()                  = 1
> [pid 3101056] getpid()                  = 1
> [pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached
> )     = 0
> [pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
> [pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC <unfinished ...>
> [pid 3101056] --- stopped by SIGSTOP ---
> [pid 3101057] <... openat resumed>)     = 6
> [pid 3101057] setns(6, CLONE_NEWNET)    = 0
> [pid 3101057] exit(0)                   = ?
> [pid 3101055] <... clone resumed>)      = 3101057
> [pid 3101057] +++ exited with 0 +++
> [pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
>
> I guess because of this condition in sig_ignored():
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111
>
> which skips sig_task_ignored() altogether. This kernel behaviour looks
> rather unexpected to me and we should probably fix it, if my suspicion
> is really confirmed.
>
>> The waitpid call also always fails with ECHILD, I don't understand why.
>> I think it must
>> have something to do with the SIGCHILD signal handler that already reaps
>> the signal info?
> The tracer shouldn't really reap that, because we only wait on WEXITED
> there. We get ECHILD (without a tracer) because the child isn't
> actually stopped.
>
>> So either I made a a stupid mistake somewhere or it simply does not work.
> It won't work like that. We could send SIGSTOP from the parent instead,
> that works, but then we don't know (concern you raised earlier) if the
> child is already available when we send it.
>
> There's another way to address this concern, though: while
> CLONE_STOPPED was dropped a long time ago, we can obtain a similar
> behaviour given that the child inherits our signal mask on clone() (by
> default), and it can later wait for a given signal.
>
> Let's pick SIGUSR1: block it before clone(), and the child will start
> with it blocked, which means we can safely queue it at any later point
> in the parent.
>
> Something like this:
>
> ---
> diff --git a/passt.c b/passt.c
> index 8b2c50d..d957e14 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -301,6 +301,9 @@ int main(int argc, char **argv)
>          else
>                  write_pidfile(pidfile_fd, getpid());
>   
> +       if (pasta_child_pid)
> +               kill(pasta_child_pid, SIGUSR1);
> +
>          isolate_postfork(&c);
>   
>          timer_init(&c, &now);
> diff --git a/pasta.c b/pasta.c
> index 528f02a..9169913 100644
> --- a/pasta.c
> +++ b/pasta.c
> @@ -47,7 +47,7 @@
>   #include "log.h"
>   
>   /* PID of child, in case we created a namespace */
> -static int pasta_child_pid;
> +int pasta_child_pid;
>   
>   /**
>    * pasta_child_handler() - Exit once shell exits (if we started it), reap clones
> @@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg {
>   static int pasta_spawn_cmd(void *arg)
>   {
>          const struct pasta_spawn_cmd_arg *a;
> +       sigset_t set;
>   
>          if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
>                  warn("Cannot set ping_group_range, ICMP requests might fail");
>   
> +       /* Wait for the parent to be ready: see main() */
> +       sigemptyset(&set);
> +       sigaddset(&set, SIGUSR1);
> +       sigwaitinfo(&set, NULL);
> +
>          a = (const struct pasta_spawn_cmd_arg *)arg;
>          execvp(a->exe, a->argv);
>   
> @@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>          char ns_fn_stack[NS_FN_STACK_SIZE];
>          char *sh_argv[] = { NULL, NULL };
>          char sh_arg0[PATH_MAX + 1];
> +       sigset_t set;
>   
>          c->foreground = 1;
>          if (!c->debug)
> @@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>                  arg.argv = sh_argv;
>          }
>   
> +       /* Block SIGUSR1 in child, we queue it in main() when we're ready */
> +       sigemptyset(&set);
> +       sigaddset(&set, SIGUSR1);
> +       sigprocmask(SIG_BLOCK, &set, NULL);
> +
>          pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
>                                     sizeof(ns_fn_stack),
>                                     CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
> diff --git a/pasta.h b/pasta.h
> index a8b9893..0ccb7e9 100644
> --- a/pasta.h
> +++ b/pasta.h
> @@ -6,6 +6,8 @@
>   #ifndef PASTA_H
>   #define PASTA_H
>   
> +extern int pasta_child_pid;
> +
>   void pasta_open_ns(struct ctx *c, const char *netns);
>   void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>                      int argc, char *argv[]);
> ---
>
> actually works for me.
>
> It's a few more lines than I hoped for, but I think it's still (maybe
> marginally, now) simpler than the pipe approach. What do you think?
It seems to work so I am good with it. Do you apply your diff or do you 
want me to send a v2 with it?
Just one question I see that these sig... function can return errors 
according to the man page, should we handle them?
I see that there is no error handling done in main() for the sig... 
functions so maybe not?

Lastly another problem I just found when execvp fails, i.e. ENOENT, the 
parent just keeps hanging and does not exit.
However this is present in current HEAD, my pipe version and your signal 
version so I don't think we need to block this for that bug.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-07 19:09             ` Paul Holzinger
@ 2023-02-08 13:01               ` Stefano Brivio
  2023-02-08 15:06                 ` Paul Holzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2023-02-08 13:01 UTC (permalink / raw)
  To: Paul Holzinger; +Cc: passt-dev

On Tue, 7 Feb 2023 20:09:44 +0100
Paul Holzinger <pholzing@redhat.com> wrote:

> On 07/02/2023 11:55, Stefano Brivio wrote:
> > On Mon, 6 Feb 2023 20:53:08 +0100
> > Paul Holzinger <pholzing@redhat.com> wrote:
> >  
> >> On 03/02/2023 19:55, Stefano Brivio wrote:  
> >>> On Fri, 3 Feb 2023 17:37:03 +0100
> >>> Stefano Brivio <sbrivio@redhat.com> wrote:
> >>>     
> >>>> On Fri, 3 Feb 2023 15:44:40 +0100
> >>>> Paul Holzinger <pholzing@redhat.com> wrote:
> >>>>     
> >>>>> On 02/02/2023 11:25, Stefano Brivio wrote:  
> >>>>>> On Wed,  1 Feb 2023 19:01:16 +0100
> >>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
> >>>>>>          
> >>>>>>> When a user spawns a command with pasta they expect the network to be
> >>>>>>> ready. Currently this does not work because pasta will fork/exec
> >>>>>>> before it will setup the network config.
> >>>>>>>
> >>>>>>> This patch fixes it by using a pipe to sync parent and child. The child
> >>>>>>> will now block reading from this pipe before the exec call. The parent
> >>>>>>> will then unblock the child only after the netns was configured.  
> >>>>>> Thanks for the patch! I'm reviewing this in a bit.
> >>>>>>
> >>>>>> A few considerations meanwhile:
> >>>>>>
> >>>>>> - there's actually a bigger issue (you're fixing here) than the
> >>>>>>      namespace configuration (via netlink) itself: the tap device isn't
> >>>>>>      ready (tap_sock_init() hasn't been called yet) when we spawn the
> >>>>>>      command in the new namespace. Oops.
> >>>>>>
> >>>>>>      If you're wondering: we can't just reorder things, because to complete
> >>>>>>      the configuration phase (conf()) we need the namespace to be set up,
> >>>>>>      and we can't initialise the tap device before it's set up
> >>>>>>
> >>>>>> - pipes are more commonly used to transfer data around (hence the whole
> >>>>>>      code you need to open a communication channel, check it, close it).
> >>>>>>      Did you try with a signal? Or is there a reason why it wouldn't work?
> >>>>>>
> >>>>>>      You could simply SIGSTOP the child, from the child itself:
> >>>>>>
> >>>>>> 	kill(getpid(), SIGSTOP);
> >>>>>>
> >>>>>>      and send a SIGCONT to it (we already store the PID of the child in
> >>>>>>      pasta_child_pid) once we're ready.
> >>>>>>
> >>>>>>      SIGCONT is special in that it doesn't need CAP_KILL or the processes
> >>>>>>      to run under the same UID -- just in the same session, so it wouldn't
> >>>>>>      risk interfering with the isolation_*() calls.
> >>>>>>
> >>>>>>      I haven't tested this but I think it should lead to simpler code.  
> >>>>> Thinking about this more STOP/CONT will not work reliably, it could stop
> >>>>> the child forever when the parent sends SIGCONT before the child
> >>>>> SIGSTOPs itself. While this is unlikely we have no control over how both
> >>>>> processes are scheduled.
> >>>>>
> >>>>> With this pipe version there is no problem when the parent closes the fd
> >>>>> before the child calls read, read will simply return EOF and the child can
> >>>>> continue, thus it will work correctly in all cases.  
> >>>> Ah, right, nice catch. Still, you could probably use pause() or
> >>>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
> >>>> experiment and I'll get back to you (probably early next week), unless
> >>>> you manage to get it working before.  
> >>> Sorry, forget about it -- it doesn't solve the problem of waiting, in
> >>> the parent, that the child is stopped, which is exactly the point you
> >>> raised. A waitpid() with WUNTRACED does:
> >>>
> >>> #include <stdio.h>
> >>> #include <unistd.h>
> >>> #include <signal.h>
> >>> #include <sys/wait.h>
> >>>
> >>> #define DELAY_PARENT	0
> >>> #define DELAY_CHILD	0
> >>>
> >>> int main()
> >>> {
> >>> 	pid_t pid;
> >>> 	int i;
> >>>
> >>> 	if ((pid = fork())) {
> >>> #if DELAY_PARENT
> >>> 		for (i = 0; i < 10000000; i++);
> >>> #endif
> >>> 		waitpid(pid, NULL, WUNTRACED);
> >>> 		kill(pid, SIGCONT);
> >>> 		sleep(1);
> >>> 		return 0;
> >>> 	}
> >>>
> >>> #if DELAY_CHILD
> >>> 	for (i = 0; i < 10000000; i++);
> >>> #endif
> >>> 	raise(SIGSTOP);
> >>> 	fprintf(stderr, "received SIGCONT\n");
> >>> 	return 0;
> >>> }
> >>>
> >>> I left in some busyloops you can use to check. It's three lines, with
> >>> error checks probably 9, still less than the pipe thing (~16) and it
> >>> looks simpler (to me).
> >>>     
> >> I don't know what it is but this doesn't work when I implement it in
> >> pasta. Somehow
> >> the child doesn't seem to be stopped. A short lived processes such as ip
> >> addr causes
> >> pasta to exit even before the parent got to the point where it would
> >> send SIGCONT.
> >> If I get the nanoseconds before and after raise(SIGSTOP) there is almost
> >> no delay.
> >> It is clear that the child still runs after raise(SIGSTOP) even though
> >> the parent never send
> >> SIGCONT at this point, in fact I can completely remove the kill(pid,
> >> SIGCONT) call and
> >> the program works without hanging.  
> > Sorry, I didn't imagine it would be so messy.
> >
> > The problem is that the child sees a detached PID namspace -- if you
> > drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as
> > expected.
> >
> > Sending a SIGSTOP from the child means we're stopping the init process,
> > again only from the child perspective, and you can't send SIGSTOP or
> > SIGKILL to init, see sig_task_ignored() in kernel/signal.c:
> >    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85
> >
> > I haven't actually traced this in the kernel and I'm not sure if we
> > hit that condition -- and especially if we should hit it (one might
> > argue it's a kernel issue, in case), but it seems to fit.
> >
> > Also:
> >  
> >> And of course if I run this through strace it works just fine, so I am
> >> bit lost right now.
> >> Likely because strace makes things much slower?  
> > ...no, a tracer actually makes things work (I also tried with delays).
> > The child stops and we get a SIGCHLD:
> >
> > [pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD <unfinished ...>
> > [pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6
> > [pid 3101056] write(6, "0 0", 3)        = 3
> > [pid 3101056] close(6)                  = 0
> > [pid 3101056] gettid()                  = 1
> > [pid 3101056] getpid()                  = 1
> > [pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached
> > )     = 0
> > [pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
> > [pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC <unfinished ...>
> > [pid 3101056] --- stopped by SIGSTOP ---
> > [pid 3101057] <... openat resumed>)     = 6
> > [pid 3101057] setns(6, CLONE_NEWNET)    = 0
> > [pid 3101057] exit(0)                   = ?
> > [pid 3101055] <... clone resumed>)      = 3101057
> > [pid 3101057] +++ exited with 0 +++
> > [pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
> >
> > I guess because of this condition in sig_ignored():
> >    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111
> >
> > which skips sig_task_ignored() altogether. This kernel behaviour looks
> > rather unexpected to me and we should probably fix it, if my suspicion
> > is really confirmed.
> >  
> >> The waitpid call also always fails with ECHILD, I don't understand why.
> >> I think it must
> >> have something to do with the SIGCHILD signal handler that already reaps
> >> the signal info?  
> > The tracer shouldn't really reap that, because we only wait on WEXITED
> > there. We get ECHILD (without a tracer) because the child isn't
> > actually stopped.
> >  
> >> So either I made a a stupid mistake somewhere or it simply does not work.  
> > It won't work like that. We could send SIGSTOP from the parent instead,
> > that works, but then we don't know (concern you raised earlier) if the
> > child is already available when we send it.
> >
> > There's another way to address this concern, though: while
> > CLONE_STOPPED was dropped a long time ago, we can obtain a similar
> > behaviour given that the child inherits our signal mask on clone() (by
> > default), and it can later wait for a given signal.
> >
> > Let's pick SIGUSR1: block it before clone(), and the child will start
> > with it blocked, which means we can safely queue it at any later point
> > in the parent.
> >
> > Something like this:
> >
> > ---
> > diff --git a/passt.c b/passt.c
> > index 8b2c50d..d957e14 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -301,6 +301,9 @@ int main(int argc, char **argv)
> >          else
> >                  write_pidfile(pidfile_fd, getpid());
> >   
> > +       if (pasta_child_pid)
> > +               kill(pasta_child_pid, SIGUSR1);
> > +
> >          isolate_postfork(&c);
> >   
> >          timer_init(&c, &now);
> > diff --git a/pasta.c b/pasta.c
> > index 528f02a..9169913 100644
> > --- a/pasta.c
> > +++ b/pasta.c
> > @@ -47,7 +47,7 @@
> >   #include "log.h"
> >   
> >   /* PID of child, in case we created a namespace */
> > -static int pasta_child_pid;
> > +int pasta_child_pid;
> >   
> >   /**
> >    * pasta_child_handler() - Exit once shell exits (if we started it), reap clones
> > @@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg {
> >   static int pasta_spawn_cmd(void *arg)
> >   {
> >          const struct pasta_spawn_cmd_arg *a;
> > +       sigset_t set;
> >   
> >          if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
> >                  warn("Cannot set ping_group_range, ICMP requests might fail");
> >   
> > +       /* Wait for the parent to be ready: see main() */
> > +       sigemptyset(&set);
> > +       sigaddset(&set, SIGUSR1);
> > +       sigwaitinfo(&set, NULL);
> > +
> >          a = (const struct pasta_spawn_cmd_arg *)arg;
> >          execvp(a->exe, a->argv);
> >   
> > @@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
> >          char ns_fn_stack[NS_FN_STACK_SIZE];
> >          char *sh_argv[] = { NULL, NULL };
> >          char sh_arg0[PATH_MAX + 1];
> > +       sigset_t set;
> >   
> >          c->foreground = 1;
> >          if (!c->debug)
> > @@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
> >                  arg.argv = sh_argv;
> >          }
> >   
> > +       /* Block SIGUSR1 in child, we queue it in main() when we're ready */
> > +       sigemptyset(&set);
> > +       sigaddset(&set, SIGUSR1);
> > +       sigprocmask(SIG_BLOCK, &set, NULL);
> > +
> >          pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
> >                                     sizeof(ns_fn_stack),
> >                                     CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
> > diff --git a/pasta.h b/pasta.h
> > index a8b9893..0ccb7e9 100644
> > --- a/pasta.h
> > +++ b/pasta.h
> > @@ -6,6 +6,8 @@
> >   #ifndef PASTA_H
> >   #define PASTA_H
> >   
> > +extern int pasta_child_pid;
> > +
> >   void pasta_open_ns(struct ctx *c, const char *netns);
> >   void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
> >                      int argc, char *argv[]);
> > ---
> >
> > actually works for me.
> >
> > It's a few more lines than I hoped for, but I think it's still (maybe
> > marginally, now) simpler than the pipe approach. What do you think?  
> It seems to work so I am good with it. Do you apply your diff or do you 
> want me to send a v2 with it?

As you prefer, I can apply this and give you credits in the message.

> Just one question I see that these sig... function can return errors 
> according to the man page, should we handle them?

I think the only meaningful error condition we might actually hit here
is EINTR from sigwaitinfo(), and in that case we might just as well
proceed without synchronisation, rather than failing or trying to
handle it.

> I see that there is no error handling done in main() for the sig... 
> functions so maybe not?

That's not necessarily a good indication. :) In general, you can make
the mild assumption that some code might have been written "quickly"
and some potentially useful error checks were skipped.

> Lastly another problem I just found when execvp fails, i.e. ENOENT, the 
> parent just keeps hanging and does not exit.

Oh, oops. The intention was that the child exiting would trigger a
SIGCHLD and pasta_child_handler() in the parent. I'm not sure what's
wrong there.

> However this is present in current HEAD, my pipe version and your signal 
> version so I don't think we need to block this for that bug.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] pasta: wait for netns setup before calling exec
  2023-02-08 13:01               ` Stefano Brivio
@ 2023-02-08 15:06                 ` Paul Holzinger
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Holzinger @ 2023-02-08 15:06 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On 08/02/2023 14:01, Stefano Brivio wrote:
> On Tue, 7 Feb 2023 20:09:44 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
>> On 07/02/2023 11:55, Stefano Brivio wrote:
>>> On Mon, 6 Feb 2023 20:53:08 +0100
>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>   
>>>> On 03/02/2023 19:55, Stefano Brivio wrote:
>>>>> On Fri, 3 Feb 2023 17:37:03 +0100
>>>>> Stefano Brivio <sbrivio@redhat.com> wrote:
>>>>>      
>>>>>> On Fri, 3 Feb 2023 15:44:40 +0100
>>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>>>      
>>>>>>> On 02/02/2023 11:25, Stefano Brivio wrote:
>>>>>>>> On Wed,  1 Feb 2023 19:01:16 +0100
>>>>>>>> Paul Holzinger <pholzing@redhat.com> wrote:
>>>>>>>>           
>>>>>>>>> When a user spawns a command with pasta they expect the network to be
>>>>>>>>> ready. Currently this does not work because pasta will fork/exec
>>>>>>>>> before it will setup the network config.
>>>>>>>>>
>>>>>>>>> This patch fixes it by using a pipe to sync parent and child. The child
>>>>>>>>> will now block reading from this pipe before the exec call. The parent
>>>>>>>>> will then unblock the child only after the netns was configured.
>>>>>>>> Thanks for the patch! I'm reviewing this in a bit.
>>>>>>>>
>>>>>>>> A few considerations meanwhile:
>>>>>>>>
>>>>>>>> - there's actually a bigger issue (you're fixing here) than the
>>>>>>>>       namespace configuration (via netlink) itself: the tap device isn't
>>>>>>>>       ready (tap_sock_init() hasn't been called yet) when we spawn the
>>>>>>>>       command in the new namespace. Oops.
>>>>>>>>
>>>>>>>>       If you're wondering: we can't just reorder things, because to complete
>>>>>>>>       the configuration phase (conf()) we need the namespace to be set up,
>>>>>>>>       and we can't initialise the tap device before it's set up
>>>>>>>>
>>>>>>>> - pipes are more commonly used to transfer data around (hence the whole
>>>>>>>>       code you need to open a communication channel, check it, close it).
>>>>>>>>       Did you try with a signal? Or is there a reason why it wouldn't work?
>>>>>>>>
>>>>>>>>       You could simply SIGSTOP the child, from the child itself:
>>>>>>>>
>>>>>>>> 	kill(getpid(), SIGSTOP);
>>>>>>>>
>>>>>>>>       and send a SIGCONT to it (we already store the PID of the child in
>>>>>>>>       pasta_child_pid) once we're ready.
>>>>>>>>
>>>>>>>>       SIGCONT is special in that it doesn't need CAP_KILL or the processes
>>>>>>>>       to run under the same UID -- just in the same session, so it wouldn't
>>>>>>>>       risk interfering with the isolation_*() calls.
>>>>>>>>
>>>>>>>>       I haven't tested this but I think it should lead to simpler code.
>>>>>>> Thinking about this more STOP/CONT will not work reliably, it could stop
>>>>>>> the child forever when the parent sends SIGCONT before the child
>>>>>>> SIGSTOPs itself. While this is unlikely we have no control over how both
>>>>>>> processes are scheduled.
>>>>>>>
>>>>>>> With this pipe version there is no problem when the parent closes the fd
>>>>>>> before the child calls read, read will simply return EOF and the child can
>>>>>>> continue, thus it will work correctly in all cases.
>>>>>> Ah, right, nice catch. Still, you could probably use pause() or
>>>>>> sigsuspend() instead of the SIGSTOP. Let me try a quick stand-alone
>>>>>> experiment and I'll get back to you (probably early next week), unless
>>>>>> you manage to get it working before.
>>>>> Sorry, forget about it -- it doesn't solve the problem of waiting, in
>>>>> the parent, that the child is stopped, which is exactly the point you
>>>>> raised. A waitpid() with WUNTRACED does:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>> #include <signal.h>
>>>>> #include <sys/wait.h>
>>>>>
>>>>> #define DELAY_PARENT	0
>>>>> #define DELAY_CHILD	0
>>>>>
>>>>> int main()
>>>>> {
>>>>> 	pid_t pid;
>>>>> 	int i;
>>>>>
>>>>> 	if ((pid = fork())) {
>>>>> #if DELAY_PARENT
>>>>> 		for (i = 0; i < 10000000; i++);
>>>>> #endif
>>>>> 		waitpid(pid, NULL, WUNTRACED);
>>>>> 		kill(pid, SIGCONT);
>>>>> 		sleep(1);
>>>>> 		return 0;
>>>>> 	}
>>>>>
>>>>> #if DELAY_CHILD
>>>>> 	for (i = 0; i < 10000000; i++);
>>>>> #endif
>>>>> 	raise(SIGSTOP);
>>>>> 	fprintf(stderr, "received SIGCONT\n");
>>>>> 	return 0;
>>>>> }
>>>>>
>>>>> I left in some busyloops you can use to check. It's three lines, with
>>>>> error checks probably 9, still less than the pipe thing (~16) and it
>>>>> looks simpler (to me).
>>>>>      
>>>> I don't know what it is but this doesn't work when I implement it in
>>>> pasta. Somehow
>>>> the child doesn't seem to be stopped. A short lived processes such as ip
>>>> addr causes
>>>> pasta to exit even before the parent got to the point where it would
>>>> send SIGCONT.
>>>> If I get the nanoseconds before and after raise(SIGSTOP) there is almost
>>>> no delay.
>>>> It is clear that the child still runs after raise(SIGSTOP) even though
>>>> the parent never send
>>>> SIGCONT at this point, in fact I can completely remove the kill(pid,
>>>> SIGCONT) call and
>>>> the program works without hanging.
>>> Sorry, I didn't imagine it would be so messy.
>>>
>>> The problem is that the child sees a detached PID namspace -- if you
>>> drop CLONE_NEWPID from do_clone() in pasta_start_ns(), things work as
>>> expected.
>>>
>>> Sending a SIGSTOP from the child means we're stopping the init process,
>>> again only from the child perspective, and you can't send SIGSTOP or
>>> SIGKILL to init, see sig_task_ignored() in kernel/signal.c:
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n85
>>>
>>> I haven't actually traced this in the kernel and I'm not sure if we
>>> hit that condition -- and especially if we should hit it (one might
>>> argue it's a kernel issue, in case), but it seems to fit.
>>>
>>> Also:
>>>   
>>>> And of course if I run this through strace it works just fine, so I am
>>>> bit lost right now.
>>>> Likely because strace makes things much slower?
>>> ...no, a tracer actually makes things work (I also tried with delays).
>>> The child stops and we get a SIGCHLD:
>>>
>>> [pid 3101055] clone(child_stack=0x7fff46e7c3f0, flags=CLONE_VM|CLONE_FILES|CLONE_VFORK|SIGCHLD <unfinished ...>
>>> [pid 3101056] openat(AT_FDCWD, "/proc/sys/net/ipv4/ping_group_range", O_WRONLY|O_TRUNC|O_CLOEXEC) = 6
>>> [pid 3101056] write(6, "0 0", 3)        = 3
>>> [pid 3101056] close(6)                  = 0
>>> [pid 3101056] gettid()                  = 1
>>> [pid 3101056] getpid()                  = 1
>>> [pid 3101056] tgkill(1, 1, SIGSTOPstrace: Process 3101057 attached
>>> )     = 0
>>> [pid 3101056] --- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
>>> [pid 3101057] openat(AT_FDCWD, "/proc/3101056/ns/net", O_RDONLY|O_CLOEXEC <unfinished ...>
>>> [pid 3101056] --- stopped by SIGSTOP ---
>>> [pid 3101057] <... openat resumed>)     = 6
>>> [pid 3101057] setns(6, CLONE_NEWNET)    = 0
>>> [pid 3101057] exit(0)                   = ?
>>> [pid 3101055] <... clone resumed>)      = 3101057
>>> [pid 3101057] +++ exited with 0 +++
>>> [pid 3101055] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=3101056, si_uid=0, si_status=SIGSTOP, si_utime=0, si_stime=0} ---
>>>
>>> I guess because of this condition in sig_ignored():
>>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/signal.c?id=05ecb680708a1dbe6554d6fc17e5d9a8a7cb5e6a#n111
>>>
>>> which skips sig_task_ignored() altogether. This kernel behaviour looks
>>> rather unexpected to me and we should probably fix it, if my suspicion
>>> is really confirmed.
>>>   
>>>> The waitpid call also always fails with ECHILD, I don't understand why.
>>>> I think it must
>>>> have something to do with the SIGCHILD signal handler that already reaps
>>>> the signal info?
>>> The tracer shouldn't really reap that, because we only wait on WEXITED
>>> there. We get ECHILD (without a tracer) because the child isn't
>>> actually stopped.
>>>   
>>>> So either I made a a stupid mistake somewhere or it simply does not work.
>>> It won't work like that. We could send SIGSTOP from the parent instead,
>>> that works, but then we don't know (concern you raised earlier) if the
>>> child is already available when we send it.
>>>
>>> There's another way to address this concern, though: while
>>> CLONE_STOPPED was dropped a long time ago, we can obtain a similar
>>> behaviour given that the child inherits our signal mask on clone() (by
>>> default), and it can later wait for a given signal.
>>>
>>> Let's pick SIGUSR1: block it before clone(), and the child will start
>>> with it blocked, which means we can safely queue it at any later point
>>> in the parent.
>>>
>>> Something like this:
>>>
>>> ---
>>> diff --git a/passt.c b/passt.c
>>> index 8b2c50d..d957e14 100644
>>> --- a/passt.c
>>> +++ b/passt.c
>>> @@ -301,6 +301,9 @@ int main(int argc, char **argv)
>>>           else
>>>                   write_pidfile(pidfile_fd, getpid());
>>>    
>>> +       if (pasta_child_pid)
>>> +               kill(pasta_child_pid, SIGUSR1);
>>> +
>>>           isolate_postfork(&c);
>>>    
>>>           timer_init(&c, &now);
>>> diff --git a/pasta.c b/pasta.c
>>> index 528f02a..9169913 100644
>>> --- a/pasta.c
>>> +++ b/pasta.c
>>> @@ -47,7 +47,7 @@
>>>    #include "log.h"
>>>    
>>>    /* PID of child, in case we created a namespace */
>>> -static int pasta_child_pid;
>>> +int pasta_child_pid;
>>>    
>>>    /**
>>>     * pasta_child_handler() - Exit once shell exits (if we started it), reap clones
>>> @@ -166,10 +166,16 @@ struct pasta_spawn_cmd_arg {
>>>    static int pasta_spawn_cmd(void *arg)
>>>    {
>>>           const struct pasta_spawn_cmd_arg *a;
>>> +       sigset_t set;
>>>    
>>>           if (write_file("/proc/sys/net/ipv4/ping_group_range", "0 0"))
>>>                   warn("Cannot set ping_group_range, ICMP requests might fail");
>>>    
>>> +       /* Wait for the parent to be ready: see main() */
>>> +       sigemptyset(&set);
>>> +       sigaddset(&set, SIGUSR1);
>>> +       sigwaitinfo(&set, NULL);
>>> +
>>>           a = (const struct pasta_spawn_cmd_arg *)arg;
>>>           execvp(a->exe, a->argv);
>>>    
>>> @@ -196,6 +202,7 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>           char ns_fn_stack[NS_FN_STACK_SIZE];
>>>           char *sh_argv[] = { NULL, NULL };
>>>           char sh_arg0[PATH_MAX + 1];
>>> +       sigset_t set;
>>>    
>>>           c->foreground = 1;
>>>           if (!c->debug)
>>> @@ -226,6 +233,11 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>                   arg.argv = sh_argv;
>>>           }
>>>    
>>> +       /* Block SIGUSR1 in child, we queue it in main() when we're ready */
>>> +       sigemptyset(&set);
>>> +       sigaddset(&set, SIGUSR1);
>>> +       sigprocmask(SIG_BLOCK, &set, NULL);
>>> +
>>>           pasta_child_pid = do_clone(pasta_spawn_cmd, ns_fn_stack,
>>>                                      sizeof(ns_fn_stack),
>>>                                      CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET |
>>> diff --git a/pasta.h b/pasta.h
>>> index a8b9893..0ccb7e9 100644
>>> --- a/pasta.h
>>> +++ b/pasta.h
>>> @@ -6,6 +6,8 @@
>>>    #ifndef PASTA_H
>>>    #define PASTA_H
>>>    
>>> +extern int pasta_child_pid;
>>> +
>>>    void pasta_open_ns(struct ctx *c, const char *netns);
>>>    void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>>>                       int argc, char *argv[]);
>>> ---
>>>
>>> actually works for me.
>>>
>>> It's a few more lines than I hoped for, but I think it's still (maybe
>>> marginally, now) simpler than the pipe approach. What do you think?
>> It seems to work so I am good with it. Do you apply your diff or do you
>> want me to send a v2 with it?
> As you prefer, I can apply this and give you credits in the message.
Please apply it.
>
>> Just one question I see that these sig... function can return errors
>> according to the man page, should we handle them?
> I think the only meaningful error condition we might actually hit here
> is EINTR from sigwaitinfo(), and in that case we might just as well
> proceed without synchronisation, rather than failing or trying to
> handle it.
>
>> I see that there is no error handling done in main() for the sig...
>> functions so maybe not?
> That's not necessarily a good indication. :) In general, you can make
> the mild assumption that some code might have been written "quickly"
> and some potentially useful error checks were skipped.
>
>> Lastly another problem I just found when execvp fails, i.e. ENOENT, the
>> parent just keeps hanging and does not exit.
> Oh, oops. The intention was that the child exiting would trigger a
> SIGCHLD and pasta_child_handler() in the parent. I'm not sure what's
> wrong there.

I found the issue, you need to add signal to the clone() flags. I will 
send a patch for this.

>
>> However this is present in current HEAD, my pipe version and your signal
>> version so I don't think we need to block this for that bug.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-02-08 15:06 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01 18:01 [PATCH] pasta: wait for netns setup before calling exec Paul Holzinger
2023-02-02 10:25 ` Stefano Brivio
2023-02-02 15:23   ` Paul Holzinger
2023-02-03 14:44   ` Paul Holzinger
2023-02-03 16:37     ` Stefano Brivio
2023-02-03 18:55       ` Stefano Brivio
2023-02-06 19:53         ` Paul Holzinger
2023-02-07 10:55           ` Stefano Brivio
2023-02-07 19:09             ` Paul Holzinger
2023-02-08 13:01               ` Stefano Brivio
2023-02-08 15:06                 ` Paul Holzinger

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).