public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
@ 2023-05-14 18:14 Stefano Brivio
  2023-05-14 18:14 ` [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req() Stefano Brivio
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

This series, along with pseudo-related fixes, enables:

- optional copy of all routes from selected interface in outer
  namespace, to (hopefully!) fix the issue reported by Callum at:
    https://github.com/containers/podman/issues/18539

- optional copy of all addresses, mostly for consistency. It doesn't,
  however, enable assignment of multiple addresses in the sense
  requested at:
    https://bugs.passt.top/show_bug.cgi?id=47

  because the addresses still need to be present on the host, and
  the "outer" address isn't selected depending on the address used
  inside the container

- operation without a gateway address, to (again, hopefully) support
  usage of Wireguard endpoints established outside the container,
    https://bugs.passt.top/show_bug.cgi?id=49

I tested the single functionalities introduced here, but I didn't
try to reproduce the setups where the issues were reported, so some
help with testing is definitely fundamental here. Thanks.

Stefano Brivio (10):
  netlink: Fix comment about response buffer size for nl_req()
  pasta: Improve error handling on failure to join network namespace
  netlink: Add functionality to copy routes from outer namespace
  conf: --config-net option is for pasta mode only
  conf, pasta: With --config-net, copy all routes by default
  Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask
    and gateway"
  conf: Don't exit if sourced default route has no gateway
  netlink: Add functionality to copy addresses from outer namespace
  conf, pasta: With --config-net, copy all addresses by default
  passt.h: Fix description of pasta_ifi in struct ctx

 conf.c    |  81 ++++++++++++++++++++---------------
 netlink.c | 123 +++++++++++++++++++++++++++++++++++++++---------------
 netlink.h |  13 ++++--
 passt.1   |  25 ++++++++++-
 passt.h   |   8 +++-
 pasta.c   |  26 ++++++++----
 6 files changed, 195 insertions(+), 81 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req()
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-16  3:23   ` David Gibson
  2023-05-14 18:14 ` [PATCH 02/10] pasta: Improve error handling on failure to join network namespace Stefano Brivio
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Fixes: fde8004ab0b4 ("netlink: Use 8 KiB * netlink message header size as response buffer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/netlink.c b/netlink.c
index b99af85..c07a13c 100644
--- a/netlink.c
+++ b/netlink.c
@@ -99,7 +99,7 @@ fail:
 /**
  * nl_req() - Send netlink request and read response
  * @ns:		Use netlink socket in namespace
- * @buf:	Buffer for response (at least BUFSIZ long)
+ * @buf:	Buffer for response (at least NLBUFSIZ long)
  * @req:	Request with netlink header
  * @len:	Request length
  *
-- 
@@ -99,7 +99,7 @@ fail:
 /**
  * nl_req() - Send netlink request and read response
  * @ns:		Use netlink socket in namespace
- * @buf:	Buffer for response (at least BUFSIZ long)
+ * @buf:	Buffer for response (at least NLBUFSIZ long)
  * @req:	Request with netlink header
  * @len:	Request length
  *
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/10] pasta: Improve error handling on failure to join network namespace
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
  2023-05-14 18:14 ` [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req() Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-16  3:24   ` David Gibson
  2023-05-14 18:14 ` [PATCH 03/10] netlink: Add functionality to copy routes from outer namespace Stefano Brivio
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're
busy-looping until the network namespace appears. But any other
failure is not something we're going to recover from: return right
away if we don't get either success or ENOENT.

Now that pasta_wait_for_ns() can actually fail, handle that in
pasta_start_ns() by reporting the issue and exiting.

Looping on EPERM, when pasta doesn't actually have the permissions to
join a given namespace, isn't exactly a productive thing to do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 pasta.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/pasta.c b/pasta.c
index 3a4d704..2fa0168 100644
--- a/pasta.c
+++ b/pasta.c
@@ -94,8 +94,11 @@ static int pasta_wait_for_ns(void *arg)
 	char ns[PATH_MAX];
 
 	snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
-	do
-		while ((c->pasta_netns_fd = open(ns, flags)) < 0);
+	while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
+		if (errno != ENOENT)
+			return 0;
+	}
+
 	while (setns(c->pasta_netns_fd, CLONE_NEWNET) &&
 	       !close(c->pasta_netns_fd));
 
@@ -252,6 +255,8 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 	}
 
 	NS_CALL(pasta_wait_for_ns, c);
+	if (c->pasta_netns_fd < 0)
+		die("Failed to join network namespace");
 }
 
 /**
-- 
@@ -94,8 +94,11 @@ static int pasta_wait_for_ns(void *arg)
 	char ns[PATH_MAX];
 
 	snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
-	do
-		while ((c->pasta_netns_fd = open(ns, flags)) < 0);
+	while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
+		if (errno != ENOENT)
+			return 0;
+	}
+
 	while (setns(c->pasta_netns_fd, CLONE_NEWNET) &&
 	       !close(c->pasta_netns_fd));
 
@@ -252,6 +255,8 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
 	}
 
 	NS_CALL(pasta_wait_for_ns, c);
+	if (c->pasta_netns_fd < 0)
+		die("Failed to join network namespace");
 }
 
 /**
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/10] netlink: Add functionality to copy routes from outer namespace
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
  2023-05-14 18:14 ` [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req() Stefano Brivio
  2023-05-14 18:14 ` [PATCH 02/10] pasta: Improve error handling on failure to join network namespace Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-14 18:14 ` [PATCH 04/10] conf: --config-net option is for pasta mode only Stefano Brivio
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Instead of just fetching the default gateway and configuring a single
equivalent route in the target namespace, on 'pasta --config-net', it
might be desirable in some cases to copy the whole set of routes
corresponding to a given output interface.

For instance, in:
  https://github.com/containers/podman/issues/18539
  IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes

configuring the default gateway won't work without a gateway-less
route (specifying the output interface only), because the default
gateway is, somewhat dubiously, not on the same subnet as the
container.

This is a similar case to the one covered by commit 7656a6f88882
("conf: Adjust netmask on mismatch between IPv4 address/netmask and
gateway"), and I'm not exactly proud of that workaround.

We also have:
  https://bugs.passt.top/show_bug.cgi?id=49
  pasta does not work with tap-style interface

for which, eventually, we should be able to configure a gateway-less
route in the target namespace.

Introduce different operation modes for nl_route(), including a new
NL_DUP one, not exposed yet, which simply parrots back to the kernel
the route dump for a given interface from the outer namespace, fixing
up flags and interface indices on the way, and requesting to add the
same routes in the target namespace, on the interface we manage.

I'm not kidding, it actually works pretty well.

Link: https://github.com/containers/podman/issues/18539
Link: https://bugs.passt.top/show_bug.cgi?id=49
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c    |  4 ++--
 netlink.c | 59 +++++++++++++++++++++++++++++++++++++++----------------
 netlink.h |  9 ++++++++-
 pasta.c   |  6 ++++--
 4 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/conf.c b/conf.c
index 447b000..aad2b00 100644
--- a/conf.c
+++ b/conf.c
@@ -646,7 +646,7 @@ static unsigned int conf_ip4(unsigned int ifi,
 	}
 
 	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->gw))
-		nl_route(0, ifi, AF_INET, &ip4->gw);
+		nl_route(NL_GET, ifi, 0, AF_INET, &ip4->gw);
 
 	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr))
 		nl_addr(0, ifi, AF_INET, &ip4->addr, &ip4->prefix_len, NULL);
@@ -718,7 +718,7 @@ static unsigned int conf_ip6(unsigned int ifi,
 	}
 
 	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->gw))
-		nl_route(0, ifi, AF_INET6, &ip6->gw);
+		nl_route(NL_GET, ifi, 0, AF_INET6, &ip6->gw);
 
 	nl_addr(0, ifi, AF_INET6,
 		IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ? &ip6->addr : NULL,
diff --git a/netlink.c b/netlink.c
index c07a13c..0ff94ae 100644
--- a/netlink.c
+++ b/netlink.c
@@ -185,16 +185,16 @@ unsigned int nl_get_ext_if(sa_family_t af)
 }
 
 /**
- * nl_route() - Get/set default gateway for given interface and address family
- * @ns:		Use netlink socket in namespace
- * @ifi:	Interface index
+ * nl_route() - Get/set/copy routes for given interface and address family
+ * @op:		Requested operation
+ * @ifi:	Interface index in outer network namespace
+ * @ifi_ns:	Interface index in target namespace for NL_SET, NL_DUP
  * @af:		Address family
- * @gw:		Default gateway to fill if zero, to set if not
+ * @gw:		Default gateway to fill on NL_GET, to set on NL_SET
  */
-void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw)
+void nl_route(enum nl_op op, unsigned int ifi, unsigned int ifi_ns,
+	      sa_family_t af, void *gw)
 {
-	int set = (af == AF_INET6 && !IN6_IS_ADDR_UNSPECIFIED(gw)) ||
-		  (af == AF_INET && *(uint32_t *)gw);
 	struct req_t {
 		struct nlmsghdr nlh;
 		struct rtmsg rtm;
@@ -215,7 +215,7 @@ void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw)
 			} r4;
 		} set;
 	} req = {
-		.nlh.nlmsg_type	  = set ? RTM_NEWROUTE : RTM_GETROUTE,
+		.nlh.nlmsg_type	  = op == NL_SET ? RTM_NEWROUTE : RTM_GETROUTE,
 		.nlh.nlmsg_flags  = NLM_F_REQUEST,
 		.nlh.nlmsg_seq	  = nl_seq++,
 
@@ -228,14 +228,14 @@ void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw)
 		.rta.rta_len	  = RTA_LENGTH(sizeof(unsigned int)),
 		.ifi		  = ifi,
 	};
+	ssize_t n, nlmsgs_size;
 	struct nlmsghdr *nh;
 	struct rtattr *rta;
-	struct rtmsg *rtm;
 	char buf[NLBUFSIZ];
-	ssize_t n;
+	struct rtmsg *rtm;
 	size_t na;
 
-	if (set) {
+	if (op == NL_SET) {
 		if (af == AF_INET6) {
 			size_t rta_len = RTA_LENGTH(sizeof(req.set.r6.d));
 
@@ -269,31 +269,56 @@ void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw)
 		req.nlh.nlmsg_flags |= NLM_F_DUMP;
 	}
 
-	if ((n = nl_req(ns, buf, &req, req.nlh.nlmsg_len)) < 0 || set)
+	if ((n = nl_req(op == NL_SET, buf, &req, req.nlh.nlmsg_len)) < 0)
+		return;
+
+	if (op == NL_SET)
 		return;
 
 	nh = (struct nlmsghdr *)buf;
+	nlmsgs_size = n;
+
 	for ( ; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
 		if (nh->nlmsg_type != RTM_NEWROUTE)
 			goto next;
 
+		if (op == NL_DUP) {
+			nh->nlmsg_seq = nl_seq++;
+			nh->nlmsg_pid = 0;
+			nh->nlmsg_flags &= ~NLM_F_DUMP_FILTERED;
+			nh->nlmsg_flags |= NLM_F_REQUEST | NLM_F_ACK |
+					   NLM_F_CREATE;
+		}
+
 		rtm = (struct rtmsg *)NLMSG_DATA(nh);
-		if (rtm->rtm_dst_len)
+		if (op == NL_GET && rtm->rtm_dst_len)
 			continue;
 
 		for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
 		     rta = RTA_NEXT(rta, na)) {
-			if (rta->rta_type != RTA_GATEWAY)
-				continue;
+			if (op == NL_GET) {
+				if (rta->rta_type != RTA_GATEWAY)
+					continue;
 
-			memcpy(gw, RTA_DATA(rta), RTA_PAYLOAD(rta));
-			return;
+				memcpy(gw, RTA_DATA(rta), RTA_PAYLOAD(rta));
+				return;
+			}
+
+			if (op == NL_DUP && rta->rta_type == RTA_OIF)
+				*(unsigned int *)RTA_DATA(rta) = ifi_ns;
 		}
 
 next:
 		if (nh->nlmsg_type == NLMSG_DONE)
 			break;
 	}
+
+	if (op == NL_DUP) {
+		char resp[NLBUFSIZ];
+
+		nh = (struct nlmsghdr *)buf;
+		nl_req(1, resp, nh, nlmsgs_size);
+	}
 }
 
 /**
diff --git a/netlink.h b/netlink.h
index ca4d6ef..217cf1e 100644
--- a/netlink.h
+++ b/netlink.h
@@ -6,9 +6,16 @@
 #ifndef NETLINK_H
 #define NETLINK_H
 
+enum nl_op {
+	NL_GET,
+	NL_SET,
+	NL_DUP,
+};
+
 void nl_sock_init(const struct ctx *c, bool ns);
 unsigned int nl_get_ext_if(sa_family_t af);
-void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw);
+void nl_route(enum nl_op op, unsigned int ifi, unsigned int ifi_ns,
+	      sa_family_t af, void *gw);
 void nl_addr(int ns, unsigned int ifi, sa_family_t af,
 	     void *addr, int *prefix_len, void *addr_l);
 void nl_link(int ns, unsigned int ifi, void *mac, int up, int mtu);
diff --git a/pasta.c b/pasta.c
index 2fa0168..9161a1f 100644
--- a/pasta.c
+++ b/pasta.c
@@ -273,14 +273,16 @@ void pasta_ns_conf(struct ctx *c)
 		if (c->ifi4) {
 			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
 				&c->ip4.prefix_len, NULL);
-			nl_route(1, c->pasta_ifi, AF_INET, &c->ip4.gw);
+			nl_route(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+				 &c->ip4.gw);
 		}
 
 		if (c->ifi6) {
 			int prefix_len = 64;
 			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
 				&prefix_len, NULL);
-			nl_route(1, c->pasta_ifi, AF_INET6, &c->ip6.gw);
+			nl_route(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+				 &c->ip6.gw);
 		}
 	} else {
 		nl_link(1, c->pasta_ifi, c->mac_guest, 0, 0);
-- 
@@ -273,14 +273,16 @@ void pasta_ns_conf(struct ctx *c)
 		if (c->ifi4) {
 			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
 				&c->ip4.prefix_len, NULL);
-			nl_route(1, c->pasta_ifi, AF_INET, &c->ip4.gw);
+			nl_route(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+				 &c->ip4.gw);
 		}
 
 		if (c->ifi6) {
 			int prefix_len = 64;
 			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
 				&prefix_len, NULL);
-			nl_route(1, c->pasta_ifi, AF_INET6, &c->ip6.gw);
+			nl_route(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+				 &c->ip6.gw);
 		}
 	} else {
 		nl_link(1, c->pasta_ifi, c->mac_guest, 0, 0);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/10] conf: --config-net option is for pasta mode only
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (2 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 03/10] netlink: Add functionality to copy routes from outer namespace Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-16  3:59   ` David Gibson
  2023-05-14 18:14 ` [PATCH 05/10] conf, pasta: With --config-net, copy all routes by default Stefano Brivio
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/conf.c b/conf.c
index aad2b00..bc1ae99 100644
--- a/conf.c
+++ b/conf.c
@@ -1198,7 +1198,6 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"userns",	required_argument,	NULL,		2 },
 		{"netns",	required_argument,	NULL,		3 },
 		{"netns-only",	no_argument,		&netns_only,	1 },
-		{"config-net",	no_argument,		&c->pasta_conf_ns, 1 },
 		{"ns-mac-addr",	required_argument,	NULL,		4 },
 		{"dhcp-dns",	no_argument,		NULL,		5 },
 		{"no-dhcp-dns",	no_argument,		NULL,		6 },
@@ -1212,6 +1211,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"version",	no_argument,		NULL,		14 },
 		{"outbound-if4", required_argument,	NULL,		15 },
 		{"outbound-if6", required_argument,	NULL,		16 },
+		{"config-net",	no_argument,		NULL,		17 },
 		{ 0 },
 	};
 	struct get_bound_ports_ns_arg ns_ports_arg = { .c = c };
@@ -1369,6 +1369,12 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
 				die("Invalid interface name: %s", optarg);
 
+			break;
+		case 17:
+			if (c->mode != MODE_PASTA)
+				die("--config-net is for pasta mode only");
+
+			c->pasta_conf_ns = 1;
 			break;
 		case 'd':
 			if (c->debug)
-- 
@@ -1198,7 +1198,6 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"userns",	required_argument,	NULL,		2 },
 		{"netns",	required_argument,	NULL,		3 },
 		{"netns-only",	no_argument,		&netns_only,	1 },
-		{"config-net",	no_argument,		&c->pasta_conf_ns, 1 },
 		{"ns-mac-addr",	required_argument,	NULL,		4 },
 		{"dhcp-dns",	no_argument,		NULL,		5 },
 		{"no-dhcp-dns",	no_argument,		NULL,		6 },
@@ -1212,6 +1211,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"version",	no_argument,		NULL,		14 },
 		{"outbound-if4", required_argument,	NULL,		15 },
 		{"outbound-if6", required_argument,	NULL,		16 },
+		{"config-net",	no_argument,		NULL,		17 },
 		{ 0 },
 	};
 	struct get_bound_ports_ns_arg ns_ports_arg = { .c = c };
@@ -1369,6 +1369,12 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
 				die("Invalid interface name: %s", optarg);
 
+			break;
+		case 17:
+			if (c->mode != MODE_PASTA)
+				die("--config-net is for pasta mode only");
+
+			c->pasta_conf_ns = 1;
 			break;
 		case 'd':
 			if (c->debug)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/10] conf, pasta: With --config-net, copy all routes by default
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (3 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 04/10] conf: --config-net option is for pasta mode only Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-14 18:14 ` [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway" Stefano Brivio
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Use the newly-introduced NL_DUP mode for nl_route() to copy all the
routes associated to the template interface in the outer namespace,
unless --no-copy-routes (also implied by -g) is given.

Otherwise, we can't use default gateways which are not, address-wise,
on the same subnet as the container, as reported by Callum.

Reported-by: Callum Parsey <callum@neoninteger.au>
Link: https://github.com/containers/podman/issues/18539
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 14 ++++++++++++++
 passt.1 | 10 ++++++++++
 passt.h |  4 +++-
 pasta.c |  6 ++++--
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/conf.c b/conf.c
index bc1ae99..c2a745e 100644
--- a/conf.c
+++ b/conf.c
@@ -923,6 +923,7 @@ pasta_opts:
 	info(   "  --no-netns-quit	Don't quit if filesystem-bound target");
 	info(   "  			network namespace is deleted");
 	info(   "  --config-net		Configure tap interface in namespace");
+	info(   "  --no-copy-routes	Don't copy all routes to namespace");
 	info(   "  --ns-mac-addr ADDR	Set MAC address on tap interface");
 
 	exit(EXIT_FAILURE);
@@ -1212,6 +1213,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"outbound-if4", required_argument,	NULL,		15 },
 		{"outbound-if6", required_argument,	NULL,		16 },
 		{"config-net",	no_argument,		NULL,		17 },
+		{"no-copy-routes", no_argument,		NULL,		18 },
 		{ 0 },
 	};
 	struct get_bound_ports_ns_arg ns_ports_arg = { .c = c };
@@ -1376,6 +1378,12 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			c->pasta_conf_ns = 1;
 			break;
+		case 18:
+			if (c->mode != MODE_PASTA)
+				die("--no-copy-routes is for pasta mode only");
+
+			c->no_copy_routes = 1;
+			break;
 		case 'd':
 			if (c->debug)
 				die("Multiple --debug options given");
@@ -1524,6 +1532,9 @@ void conf(struct ctx *c, int argc, char **argv)
 			}
 			break;
 		case 'g':
+			if (c->mode == MODE_PASTA)
+				c->no_copy_routes = 1;
+
 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)		&&
 			    inet_pton(AF_INET6, optarg, &c->ip6.gw)	&&
 			    !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)	&&
@@ -1658,6 +1669,9 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (*c->sock_path && c->fd_tap >= 0)
 		die("Options --socket and --fd are mutually exclusive");
 
+	if (c->mode == MODE_PASTA && c->no_copy_routes && !c->pasta_conf_ns)
+		die("Option --no-copy-routes needs --config-net");
+
 	if (!ifi4 && *c->ip4.ifname_out)
 		ifi4 = if_nametoindex(c->ip4.ifname_out);
 
diff --git a/passt.1 b/passt.1
index 19e01d5..10c96ae 100644
--- a/passt.1
+++ b/passt.1
@@ -546,6 +546,16 @@ NAME are given as target), do not exit once the network namespace is deleted.
 Configure networking in the namespace: set up addresses and routes as configured
 or sourced from the host, and bring up the tap interface.
 
+.TP
+.BR \-\-no-copy-routes
+With \-\-config-net, do not copy all the routes associated to the interface we
+derive addresses and routes from: set up only the default gateway. Implied by
+-g, \-\-gateway.
+
+Default is to copy all the routing entries from the interface in the outer
+namespace to the target namespace, translating the output interface attribute to
+the outbound interface in the namespace.
+
 .TP
 .BR \-\-ns-mac-addr " " \fIaddr
 Configure MAC address \fIaddr\fR on the tap interface in the namespace.
diff --git a/passt.h b/passt.h
index 73fe808..d314596 100644
--- a/passt.h
+++ b/passt.h
@@ -181,7 +181,8 @@ struct ip6_ctx {
  * @ip6:		IPv6 configuration
  * @pasta_ifn:		Name of namespace interface for pasta
  * @pasta_ifn:		Index of namespace interface for pasta
- * @pasta_conf_ns:	Configure namespace interface after creating it
+ * @pasta_conf_ns:	Configure namespace after creating it
+ * @no_copy_routes:	Don't copy all routes when configuring target namespace
  * @no_tcp:		Disable TCP operation
  * @tcp:		Context for TCP protocol handler
  * @no_tcp:		Disable UDP operation
@@ -240,6 +241,7 @@ struct ctx {
 	char pasta_ifn[IF_NAMESIZE];
 	unsigned int pasta_ifi;
 	int pasta_conf_ns;
+	int no_copy_routes;
 
 	int no_tcp;
 	struct tcp_ctx tcp;
diff --git a/pasta.c b/pasta.c
index 9161a1f..120cc1e 100644
--- a/pasta.c
+++ b/pasta.c
@@ -268,12 +268,14 @@ void pasta_ns_conf(struct ctx *c)
 	nl_link(1, 1 /* lo */, MAC_ZERO, 1, 0);
 
 	if (c->pasta_conf_ns) {
+		enum nl_op op_routes = c->no_copy_routes ? NL_SET : NL_DUP;
+
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
 			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
 				&c->ip4.prefix_len, NULL);
-			nl_route(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
 		}
 
@@ -281,7 +283,7 @@ void pasta_ns_conf(struct ctx *c)
 			int prefix_len = 64;
 			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
 				&prefix_len, NULL);
-			nl_route(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
 		}
 	} else {
-- 
@@ -268,12 +268,14 @@ void pasta_ns_conf(struct ctx *c)
 	nl_link(1, 1 /* lo */, MAC_ZERO, 1, 0);
 
 	if (c->pasta_conf_ns) {
+		enum nl_op op_routes = c->no_copy_routes ? NL_SET : NL_DUP;
+
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
 			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
 				&c->ip4.prefix_len, NULL);
-			nl_route(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
 		}
 
@@ -281,7 +283,7 @@ void pasta_ns_conf(struct ctx *c)
 			int prefix_len = 64;
 			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
 				&prefix_len, NULL);
-			nl_route(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
 		}
 	} else {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (4 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 05/10] conf, pasta: With --config-net, copy all routes by default Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-16  4:00   ` David Gibson
  2023-05-14 18:14 ` [PATCH 07/10] conf: Don't exit if sourced default route has no gateway Stefano Brivio
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

This reverts commit 7656a6f8888237b9e23d63666e921528b6aaf950: now, by
default, we copy all the routes associated to the outbound interface
into the routing table of the container, so there's no need for this
horrible workaround anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/conf.c b/conf.c
index c2a745e..3a2fc2d 100644
--- a/conf.c
+++ b/conf.c
@@ -634,9 +634,6 @@ static int conf_ip4_prefix(const char *arg)
 static unsigned int conf_ip4(unsigned int ifi,
 			     struct ip4_ctx *ip4, unsigned char *mac)
 {
-	in_addr_t addr, gw;
-	int shift;
-
 	if (!ifi)
 		ifi = nl_get_ext_if(AF_INET);
 
@@ -651,10 +648,8 @@ static unsigned int conf_ip4(unsigned int ifi,
 	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr))
 		nl_addr(0, ifi, AF_INET, &ip4->addr, &ip4->prefix_len, NULL);
 
-	addr = ntohl(ip4->addr.s_addr);
-	gw = ntohl(ip4->gw.s_addr);
-
 	if (!ip4->prefix_len) {
+		in_addr_t addr = ntohl(ip4->addr.s_addr);
 		if (IN_CLASSA(addr))
 			ip4->prefix_len = (32 - IN_CLASSA_NSHIFT);
 		else if (IN_CLASSB(addr))
@@ -665,24 +660,6 @@ static unsigned int conf_ip4(unsigned int ifi,
 			ip4->prefix_len = 32;
 	}
 
-	/* We might get an address with a netmask that makes the default
-	 * gateway unreachable, and in that case we would fail to configure
-	 * the default route, with --config-net, or presumably a DHCP client
-	 * in the guest or container would face the same issue.
-	 *
-	 * The host might have another route, to the default gateway itself,
-	 * fixing the situation, but we only read default routes.
-	 *
-	 * Fix up the mask to allow reaching the default gateway from our
-	 * configured address, if needed, and only if we find a non-zero
-	 * mask that makes the gateway reachable.
-	 */
-	shift = 32 - ip4->prefix_len;
-	while (shift < 32 && addr >> shift != gw >> shift)
-		shift++;
-	if (shift < 32)
-		ip4->prefix_len = 32 - shift;
-
 	memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen));
 
 	if (MAC_IS_ZERO(mac))
-- 
@@ -634,9 +634,6 @@ static int conf_ip4_prefix(const char *arg)
 static unsigned int conf_ip4(unsigned int ifi,
 			     struct ip4_ctx *ip4, unsigned char *mac)
 {
-	in_addr_t addr, gw;
-	int shift;
-
 	if (!ifi)
 		ifi = nl_get_ext_if(AF_INET);
 
@@ -651,10 +648,8 @@ static unsigned int conf_ip4(unsigned int ifi,
 	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr))
 		nl_addr(0, ifi, AF_INET, &ip4->addr, &ip4->prefix_len, NULL);
 
-	addr = ntohl(ip4->addr.s_addr);
-	gw = ntohl(ip4->gw.s_addr);
-
 	if (!ip4->prefix_len) {
+		in_addr_t addr = ntohl(ip4->addr.s_addr);
 		if (IN_CLASSA(addr))
 			ip4->prefix_len = (32 - IN_CLASSA_NSHIFT);
 		else if (IN_CLASSB(addr))
@@ -665,24 +660,6 @@ static unsigned int conf_ip4(unsigned int ifi,
 			ip4->prefix_len = 32;
 	}
 
-	/* We might get an address with a netmask that makes the default
-	 * gateway unreachable, and in that case we would fail to configure
-	 * the default route, with --config-net, or presumably a DHCP client
-	 * in the guest or container would face the same issue.
-	 *
-	 * The host might have another route, to the default gateway itself,
-	 * fixing the situation, but we only read default routes.
-	 *
-	 * Fix up the mask to allow reaching the default gateway from our
-	 * configured address, if needed, and only if we find a non-zero
-	 * mask that makes the gateway reachable.
-	 */
-	shift = 32 - ip4->prefix_len;
-	while (shift < 32 && addr >> shift != gw >> shift)
-		shift++;
-	if (shift < 32)
-		ip4->prefix_len = 32 - shift;
-
 	memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen));
 
 	if (MAC_IS_ZERO(mac))
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/10] conf: Don't exit if sourced default route has no gateway
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (5 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway" Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-14 18:14 ` [PATCH 08/10] netlink: Add functionality to copy addresses from outer namespace Stefano Brivio
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

If we use a template interface without a gateway on the default
route, we can still offer almost complete functionality, except that,
of course, we can't map the gateway address to the outer namespace or
host, and that we have no obvious server address or identifier for
use in DHCP's siaddr and option 54 (Server identifier, mandatory).

Continue, if we have a default route but no default gateway, and
imply --no-map-gw and --no-dhcp in that case. NDP responder and
DHCPv6 should be able to work as usual because we require a
link-local address to be present, and we'll fall back to that.

Together with the previous commits implementing an actual copy of
routes from the outer namespace, this should finally fix the
operation of 'pasta --config-net' for cases where we have a default
route on the host, but no default gateway, as it's the case for
tap-style routes, including typical Wireguard endpoints.

Reported-by: me@yawnt.com
Link: https://bugs.passt.top/show_bug.cgi?id=49
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 10 +++++++---
 passt.1 |  6 ++++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/conf.c b/conf.c
index 3a2fc2d..76d5f41 100644
--- a/conf.c
+++ b/conf.c
@@ -665,8 +665,7 @@ static unsigned int conf_ip4(unsigned int ifi,
 	if (MAC_IS_ZERO(mac))
 		nl_link(0, ifi, mac, 0, 0);
 
-	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->gw) ||
-	    IN4_IS_ADDR_UNSPECIFIED(&ip4->addr) ||
+	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr) ||
 	    MAC_IS_ZERO(mac))
 		return 0;
 
@@ -708,7 +707,6 @@ static unsigned int conf_ip6(unsigned int ifi,
 		nl_link(0, ifi, mac, 0, 0);
 
 	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->gw) ||
-	    IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ||
 	    IN6_IS_ADDR_UNSPECIFIED(&ip6->addr_ll) ||
 	    MAC_IS_ZERO(mac))
 		return 0;
@@ -1672,6 +1670,12 @@ void conf(struct ctx *c, int argc, char **argv)
 	    (*c->ip6.ifname_out && !c->ifi6))
 		die("External interface not usable");
 
+	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw))
+		c->no_map_gw = c->no_dhcp = 1;
+
+	if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw))
+		c->no_map_gw = 1;
+
 	/* Inbound port options can be parsed now (after IPv4/IPv6 settings) */
 	optind = 1;
 	do {
diff --git a/passt.1 b/passt.1
index 10c96ae..f965c34 100644
--- a/passt.1
+++ b/passt.1
@@ -281,7 +281,8 @@ guest or target namespace will be silently dropped.
 .TP
 .BR \-\-no-dhcp
 Disable the DHCP server. DHCP client requests coming from guest or target
-namespace will be silently dropped.
+namespace will be silently dropped. Implied if there is no gateway on the
+selected IPv4 default route.
 
 .TP
 .BR \-\-no-ndp
@@ -301,7 +302,8 @@ namespace will be ignored.
 .TP
 .BR \-\-no-map-gw
 Don't remap TCP connections and untracked UDP traffic, with the gateway address
-as destination, to the host.
+as destination, to the host. Implied if there is no gateway on the selected
+default route for any of the enabled address families.
 
 .TP
 .BR \-4 ", " \-\-ipv4-only
-- 
@@ -281,7 +281,8 @@ guest or target namespace will be silently dropped.
 .TP
 .BR \-\-no-dhcp
 Disable the DHCP server. DHCP client requests coming from guest or target
-namespace will be silently dropped.
+namespace will be silently dropped. Implied if there is no gateway on the
+selected IPv4 default route.
 
 .TP
 .BR \-\-no-ndp
@@ -301,7 +302,8 @@ namespace will be ignored.
 .TP
 .BR \-\-no-map-gw
 Don't remap TCP connections and untracked UDP traffic, with the gateway address
-as destination, to the host.
+as destination, to the host. Implied if there is no gateway on the selected
+default route for any of the enabled address families.
 
 .TP
 .BR \-4 ", " \-\-ipv4-only
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/10] netlink: Add functionality to copy addresses from outer namespace
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (6 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 07/10] conf: Don't exit if sourced default route has no gateway Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-14 18:14 ` [PATCH 09/10] conf, pasta: With --config-net, copy all addresses by default Stefano Brivio
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Similarly to what we've just done with routes, support NL_DUP for
addresses (currently not exposed): nl_addr() can optionally copy
mulitple addresses to the target namespace, by fixing up data from
the dump with appropriate flags and interface index, and repeating
it back to the kernel on the socket opened in the target namespace.

Link-local addresses are not copied: the family is set to AF_UNSPEC,
which means the kernel will ignore them. Same for addresses from a
mismatching address (pre-4.19 kernels without support for
NETLINK_GET_STRICT_CHK).

Ignore IFA_LABEL attributes by changing their type to IFA_UNSPEC,
because in general they will report mismatching names, and we don't
really need to use labels as we already know the interface index.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c    |  8 ++++---
 netlink.c | 62 +++++++++++++++++++++++++++++++++++++++++--------------
 netlink.h |  4 ++--
 pasta.c   |  8 +++----
 4 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/conf.c b/conf.c
index 76d5f41..1b6cda0 100644
--- a/conf.c
+++ b/conf.c
@@ -645,8 +645,10 @@ static unsigned int conf_ip4(unsigned int ifi,
 	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->gw))
 		nl_route(NL_GET, ifi, 0, AF_INET, &ip4->gw);
 
-	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr))
-		nl_addr(0, ifi, AF_INET, &ip4->addr, &ip4->prefix_len, NULL);
+	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr)) {
+		nl_addr(NL_GET, ifi, 0, AF_INET,
+			&ip4->addr, &ip4->prefix_len, NULL);
+	}
 
 	if (!ip4->prefix_len) {
 		in_addr_t addr = ntohl(ip4->addr.s_addr);
@@ -696,7 +698,7 @@ static unsigned int conf_ip6(unsigned int ifi,
 	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->gw))
 		nl_route(NL_GET, ifi, 0, AF_INET6, &ip6->gw);
 
-	nl_addr(0, ifi, AF_INET6,
+	nl_addr(NL_GET, ifi, 0, AF_INET6,
 		IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ? &ip6->addr : NULL,
 		&prefix_len, &ip6->addr_ll);
 
diff --git a/netlink.c b/netlink.c
index 0ff94ae..70218cd 100644
--- a/netlink.c
+++ b/netlink.c
@@ -322,19 +322,18 @@ next:
 }
 
 /**
- * nl_addr() - Get/set IP addresses
- * @ns:		Use netlink socket in namespace
- * @ifi:	Interface index
+ * nl_addr() - Get/set/copy IP addresses for given interface and address family
+ * @op:		Requested operation
+ * @ifi:	Interface index in outer network namespace
+ * @ifi_ns:	Interface index in target namespace for NL_SET, NL_DUP
  * @af:		Address family
- * @addr:	Global address to fill if zero, to set if not, ignored if NULL
+ * @addr:	Global address to fill on NL_GET, to set on NL_SET
  * @prefix_len:	Mask or prefix length, set or fetched (for IPv4)
- * @addr_l:	Link-scoped address to fill, NULL if not requested
+ * @addr_l:	Link-scoped address to fill on NL_GET
  */
-void nl_addr(int ns, unsigned int ifi, sa_family_t af,
-	     void *addr, int *prefix_len, void *addr_l)
+void nl_addr(enum nl_op op, unsigned int ifi, unsigned int ifi_ns,
+	     sa_family_t af, void *addr, int *prefix_len, void *addr_l)
 {
-	int set = addr && ((af == AF_INET6 && !IN6_IS_ADDR_UNSPECIFIED(addr)) ||
-			   (af == AF_INET && *(uint32_t *)addr));
 	struct req_t {
 		struct nlmsghdr nlh;
 		struct ifaddrmsg ifa;
@@ -353,23 +352,23 @@ void nl_addr(int ns, unsigned int ifi, sa_family_t af,
 			} a6;
 		} set;
 	} req = {
-		.nlh.nlmsg_type    = set ? RTM_NEWADDR : RTM_GETADDR,
+		.nlh.nlmsg_type    = op == NL_SET ? RTM_NEWADDR : RTM_GETADDR,
 		.nlh.nlmsg_flags   = NLM_F_REQUEST,
 		.nlh.nlmsg_len     = NLMSG_LENGTH(sizeof(struct ifaddrmsg)),
 		.nlh.nlmsg_seq     = nl_seq++,
 
 		.ifa.ifa_family    = af,
 		.ifa.ifa_index     = ifi,
-		.ifa.ifa_prefixlen = *prefix_len,
+		.ifa.ifa_prefixlen = op == NL_SET ? *prefix_len : 0,
 	};
+	ssize_t n, nlmsgs_size;
 	struct ifaddrmsg *ifa;
 	struct nlmsghdr *nh;
 	struct rtattr *rta;
 	char buf[NLBUFSIZ];
-	ssize_t n;
 	size_t na;
 
-	if (set) {
+	if (op == NL_SET) {
 		if (af == AF_INET6) {
 			size_t rta_len = RTA_LENGTH(sizeof(req.set.a6.l));
 
@@ -404,21 +403,47 @@ void nl_addr(int ns, unsigned int ifi, sa_family_t af,
 		req.nlh.nlmsg_flags |= NLM_F_DUMP;
 	}
 
-	if ((n = nl_req(ns, buf, &req, req.nlh.nlmsg_len)) < 0 || set)
+	if ((n = nl_req(op == NL_SET, buf, &req, req.nlh.nlmsg_len)) < 0)
+		return;
+
+	if (op == NL_SET)
 		return;
 
 	nh = (struct nlmsghdr *)buf;
+	nlmsgs_size = n;
+
 	for ( ; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
 		if (nh->nlmsg_type != RTM_NEWADDR)
 			goto next;
 
+		if (op == NL_DUP) {
+			nh->nlmsg_seq = nl_seq++;
+			nh->nlmsg_pid = 0;
+			nh->nlmsg_flags &= ~NLM_F_DUMP_FILTERED;
+			nh->nlmsg_flags |= NLM_F_REQUEST | NLM_F_ACK |
+					   NLM_F_CREATE;
+		}
+
 		ifa = (struct ifaddrmsg *)NLMSG_DATA(nh);
+
+		if (op == NL_DUP && (ifa->ifa_scope == RT_SCOPE_LINK ||
+				     ifa->ifa_index != ifi)) {
+			ifa->ifa_family = AF_UNSPEC;
+			goto next;
+		}
+
 		if (ifa->ifa_index != ifi)
 			goto next;
 
+		if (op == NL_DUP)
+			ifa->ifa_index = ifi_ns;
+
 		for (rta = IFA_RTA(ifa), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
 		     rta = RTA_NEXT(rta, na)) {
-			if (rta->rta_type != IFA_ADDRESS)
+			if (op == NL_DUP && rta->rta_type == IFA_LABEL)
+				rta->rta_type = IFA_UNSPEC;
+
+			if (op == NL_DUP || rta->rta_type != IFA_ADDRESS)
 				continue;
 
 			if (af == AF_INET && addr && !*(uint32_t *)addr) {
@@ -439,6 +464,13 @@ next:
 		if (nh->nlmsg_type == NLMSG_DONE)
 			break;
 	}
+
+	if (op == NL_DUP) {
+		char resp[NLBUFSIZ];
+
+		nh = (struct nlmsghdr *)buf;
+		nl_req(1, resp, nh, nlmsgs_size);
+	}
 }
 
 /**
diff --git a/netlink.h b/netlink.h
index 217cf1e..cd0e666 100644
--- a/netlink.h
+++ b/netlink.h
@@ -16,8 +16,8 @@ void nl_sock_init(const struct ctx *c, bool ns);
 unsigned int nl_get_ext_if(sa_family_t af);
 void nl_route(enum nl_op op, unsigned int ifi, unsigned int ifi_ns,
 	      sa_family_t af, void *gw);
-void nl_addr(int ns, unsigned int ifi, sa_family_t af,
-	     void *addr, int *prefix_len, void *addr_l);
+void nl_addr(enum nl_op op, unsigned int ifi, unsigned int ifi_ns,
+	     sa_family_t af, void *addr, int *prefix_len, void *addr_l);
 void nl_link(int ns, unsigned int ifi, void *mac, int up, int mtu);
 
 #endif /* NETLINK_H */
diff --git a/pasta.c b/pasta.c
index 120cc1e..749fd11 100644
--- a/pasta.c
+++ b/pasta.c
@@ -273,16 +273,16 @@ void pasta_ns_conf(struct ctx *c)
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
-			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
-				&c->ip4.prefix_len, NULL);
+			nl_addr(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+				&c->ip4.addr, &c->ip4.prefix_len, NULL);
 			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
 		}
 
 		if (c->ifi6) {
 			int prefix_len = 64;
-			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
-				&prefix_len, NULL);
+			nl_addr(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+				&c->ip6.addr, &prefix_len, NULL);
 			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
 		}
-- 
@@ -273,16 +273,16 @@ void pasta_ns_conf(struct ctx *c)
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
-			nl_addr(1, c->pasta_ifi, AF_INET, &c->ip4.addr,
-				&c->ip4.prefix_len, NULL);
+			nl_addr(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+				&c->ip4.addr, &c->ip4.prefix_len, NULL);
 			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
 		}
 
 		if (c->ifi6) {
 			int prefix_len = 64;
-			nl_addr(1, c->pasta_ifi, AF_INET6, &c->ip6.addr,
-				&prefix_len, NULL);
+			nl_addr(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+				&c->ip6.addr, &prefix_len, NULL);
 			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
 		}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/10] conf, pasta: With --config-net, copy all addresses by default
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (7 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 08/10] netlink: Add functionality to copy addresses from outer namespace Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-14 18:14 ` [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx Stefano Brivio
  2023-05-16  5:06 ` [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes David Gibson
  10 siblings, 0 replies; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Use the newly-introduced NL_DUP mode for nl_addr() to copy all the
addresses associated to the template interface in the outer
namespace, unless --no-copy-addrs (also implied by -a) is given.

This is done mostly for consistency with routes. It might partially
cover the issue at:
  https://bugs.passt.top/show_bug.cgi?id=47
  Support multiple addresses per address family

for some use cases, but not the originally intended one: we'll still
use a single outbound address (unless the routing table specifies
different preferred source addresses depending on the destination),
regardless of the address used in the target namespace.

Link: https://bugs.passt.top/show_bug.cgi?id=47
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c  | 16 ++++++++++++++--
 passt.1 |  9 +++++++++
 passt.h |  2 ++
 pasta.c |  5 +++--
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/conf.c b/conf.c
index 1b6cda0..0ee48e7 100644
--- a/conf.c
+++ b/conf.c
@@ -901,6 +901,7 @@ pasta_opts:
 	info(   "  			network namespace is deleted");
 	info(   "  --config-net		Configure tap interface in namespace");
 	info(   "  --no-copy-routes	Don't copy all routes to namespace");
+	info(   "  --no-copy-addrs	Don't copy all addresses to namespace");
 	info(   "  --ns-mac-addr ADDR	Set MAC address on tap interface");
 
 	exit(EXIT_FAILURE);
@@ -1191,6 +1192,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"outbound-if6", required_argument,	NULL,		16 },
 		{"config-net",	no_argument,		NULL,		17 },
 		{"no-copy-routes", no_argument,		NULL,		18 },
+		{"no-copy-addrs", no_argument,		NULL,		19 },
 		{ 0 },
 	};
 	struct get_bound_ports_ns_arg ns_ports_arg = { .c = c };
@@ -1361,6 +1363,12 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			c->no_copy_routes = 1;
 			break;
+		case 19:
+			if (c->mode != MODE_PASTA)
+				die("--no-copy-addrs is for pasta mode only");
+
+			c->no_copy_addrs = 1;
+			break;
 		case 'd':
 			if (c->debug)
 				die("Multiple --debug options given");
@@ -1646,8 +1654,12 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (*c->sock_path && c->fd_tap >= 0)
 		die("Options --socket and --fd are mutually exclusive");
 
-	if (c->mode == MODE_PASTA && c->no_copy_routes && !c->pasta_conf_ns)
-		die("Option --no-copy-routes needs --config-net");
+	if (c->mode == MODE_PASTA && !c->pasta_conf_ns) {
+		if (c->no_copy_routes)
+			die("Option --no-copy-routes needs --config-net");
+		if (c->no_copy_addrs)
+			die("Option --no-copy-addrs needs --config-net");
+	}
 
 	if (!ifi4 && *c->ip4.ifname_out)
 		ifi4 = if_nametoindex(c->ip4.ifname_out);
diff --git a/passt.1 b/passt.1
index f965c34..87b076d 100644
--- a/passt.1
+++ b/passt.1
@@ -558,6 +558,15 @@ Default is to copy all the routing entries from the interface in the outer
 namespace to the target namespace, translating the output interface attribute to
 the outbound interface in the namespace.
 
+.TP
+.BR \-\-no-copy-addrs
+With \-\-config-net, do not copy all the addresses associated to the interface
+we derive addresses and routes from: set up a single one. Implied by \-a,
+\-\-address.
+
+Default is to copy all the addresses, except for link-local ones, from the
+interface from the outer namespace to the target namespace.
+
 .TP
 .BR \-\-ns-mac-addr " " \fIaddr
 Configure MAC address \fIaddr\fR on the tap interface in the namespace.
diff --git a/passt.h b/passt.h
index d314596..b51a1e5 100644
--- a/passt.h
+++ b/passt.h
@@ -183,6 +183,7 @@ struct ip6_ctx {
  * @pasta_ifn:		Index of namespace interface for pasta
  * @pasta_conf_ns:	Configure namespace after creating it
  * @no_copy_routes:	Don't copy all routes when configuring target namespace
+ * @no_copy_addrs:	Don't copy all addresses when configuring namespace
  * @no_tcp:		Disable TCP operation
  * @tcp:		Context for TCP protocol handler
  * @no_tcp:		Disable UDP operation
@@ -242,6 +243,7 @@ struct ctx {
 	unsigned int pasta_ifi;
 	int pasta_conf_ns;
 	int no_copy_routes;
+	int no_copy_addrs;
 
 	int no_tcp;
 	struct tcp_ctx tcp;
diff --git a/pasta.c b/pasta.c
index 749fd11..1b54974 100644
--- a/pasta.c
+++ b/pasta.c
@@ -269,11 +269,12 @@ void pasta_ns_conf(struct ctx *c)
 
 	if (c->pasta_conf_ns) {
 		enum nl_op op_routes = c->no_copy_routes ? NL_SET : NL_DUP;
+		enum nl_op op_addrs =  c->no_copy_addrs  ? NL_SET : NL_DUP;
 
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
-			nl_addr(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+			nl_addr(op_addrs, c->ifi4, c->pasta_ifi, AF_INET,
 				&c->ip4.addr, &c->ip4.prefix_len, NULL);
 			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
@@ -281,7 +282,7 @@ void pasta_ns_conf(struct ctx *c)
 
 		if (c->ifi6) {
 			int prefix_len = 64;
-			nl_addr(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+			nl_addr(op_addrs, c->ifi6, c->pasta_ifi, AF_INET6,
 				&c->ip6.addr, &prefix_len, NULL);
 			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
-- 
@@ -269,11 +269,12 @@ void pasta_ns_conf(struct ctx *c)
 
 	if (c->pasta_conf_ns) {
 		enum nl_op op_routes = c->no_copy_routes ? NL_SET : NL_DUP;
+		enum nl_op op_addrs =  c->no_copy_addrs  ? NL_SET : NL_DUP;
 
 		nl_link(1, c->pasta_ifi, c->mac_guest, 1, c->mtu);
 
 		if (c->ifi4) {
-			nl_addr(NL_SET, c->ifi4, c->pasta_ifi, AF_INET,
+			nl_addr(op_addrs, c->ifi4, c->pasta_ifi, AF_INET,
 				&c->ip4.addr, &c->ip4.prefix_len, NULL);
 			nl_route(op_routes, c->ifi4, c->pasta_ifi, AF_INET,
 				 &c->ip4.gw);
@@ -281,7 +282,7 @@ void pasta_ns_conf(struct ctx *c)
 
 		if (c->ifi6) {
 			int prefix_len = 64;
-			nl_addr(NL_SET, c->ifi6, c->pasta_ifi, AF_INET6,
+			nl_addr(op_addrs, c->ifi6, c->pasta_ifi, AF_INET6,
 				&c->ip6.addr, &prefix_len, NULL);
 			nl_route(op_routes, c->ifi6, c->pasta_ifi, AF_INET6,
 				 &c->ip6.gw);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (8 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 09/10] conf, pasta: With --config-net, copy all addresses by default Stefano Brivio
@ 2023-05-14 18:14 ` Stefano Brivio
  2023-05-16  4:03   ` David Gibson
  2023-05-16  5:06 ` [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes David Gibson
  10 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-14 18:14 UTC (permalink / raw)
  To: passt-dev; +Cc: Callum Parsey, me, David Gibson, lemmi

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 passt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/passt.h b/passt.h
index b51a1e5..96fd27b 100644
--- a/passt.h
+++ b/passt.h
@@ -180,7 +180,7 @@ struct ip6_ctx {
  * @ifi6:		Index of template interface for IPv6, 0 if IPv6 disabled
  * @ip6:		IPv6 configuration
  * @pasta_ifn:		Name of namespace interface for pasta
- * @pasta_ifn:		Index of namespace interface for pasta
+ * @pasta_ifi:		Index of namespace interface for pasta
  * @pasta_conf_ns:	Configure namespace after creating it
  * @no_copy_routes:	Don't copy all routes when configuring target namespace
  * @no_copy_addrs:	Don't copy all addresses when configuring namespace
-- 
@@ -180,7 +180,7 @@ struct ip6_ctx {
  * @ifi6:		Index of template interface for IPv6, 0 if IPv6 disabled
  * @ip6:		IPv6 configuration
  * @pasta_ifn:		Name of namespace interface for pasta
- * @pasta_ifn:		Index of namespace interface for pasta
+ * @pasta_ifi:		Index of namespace interface for pasta
  * @pasta_conf_ns:	Configure namespace after creating it
  * @no_copy_routes:	Don't copy all routes when configuring target namespace
  * @no_copy_addrs:	Don't copy all addresses when configuring namespace
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req()
  2023-05-14 18:14 ` [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req() Stefano Brivio
@ 2023-05-16  3:23   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-16  3:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 1042 bytes --]

On Sun, May 14, 2023 at 08:14:06PM +0200, Stefano Brivio wrote:
> Fixes: fde8004ab0b4 ("netlink: Use 8 KiB * netlink message header size as response buffer")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Makes sense regardless of the rest of the series.

> ---
>  netlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/netlink.c b/netlink.c
> index b99af85..c07a13c 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -99,7 +99,7 @@ fail:
>  /**
>   * nl_req() - Send netlink request and read response
>   * @ns:		Use netlink socket in namespace
> - * @buf:	Buffer for response (at least BUFSIZ long)
> + * @buf:	Buffer for response (at least NLBUFSIZ long)
>   * @req:	Request with netlink header
>   * @len:	Request length
>   *

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/10] pasta: Improve error handling on failure to join network namespace
  2023-05-14 18:14 ` [PATCH 02/10] pasta: Improve error handling on failure to join network namespace Stefano Brivio
@ 2023-05-16  3:24   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-16  3:24 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 1817 bytes --]

On Sun, May 14, 2023 at 08:14:07PM +0200, Stefano Brivio wrote:
> In pasta_wait_for_ns(), open() failing with ENOENT is expected: we're
> busy-looping until the network namespace appears. But any other
> failure is not something we're going to recover from: return right
> away if we don't get either success or ENOENT.
> 
> Now that pasta_wait_for_ns() can actually fail, handle that in
> pasta_start_ns() by reporting the issue and exiting.
> 
> Looping on EPERM, when pasta doesn't actually have the permissions to
> join a given namespace, isn't exactly a productive thing to do.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Also makes sense regardless of the rest of the series.

> ---
>  pasta.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/pasta.c b/pasta.c
> index 3a4d704..2fa0168 100644
> --- a/pasta.c
> +++ b/pasta.c
> @@ -94,8 +94,11 @@ static int pasta_wait_for_ns(void *arg)
>  	char ns[PATH_MAX];
>  
>  	snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
> -	do
> -		while ((c->pasta_netns_fd = open(ns, flags)) < 0);
> +	while ((c->pasta_netns_fd = open(ns, flags)) < 0) {
> +		if (errno != ENOENT)
> +			return 0;
> +	}
> +
>  	while (setns(c->pasta_netns_fd, CLONE_NEWNET) &&
>  	       !close(c->pasta_netns_fd));
>  
> @@ -252,6 +255,8 @@ void pasta_start_ns(struct ctx *c, uid_t uid, gid_t gid,
>  	}
>  
>  	NS_CALL(pasta_wait_for_ns, c);
> +	if (c->pasta_netns_fd < 0)
> +		die("Failed to join network namespace");
>  }
>  
>  /**

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 04/10] conf: --config-net option is for pasta mode only
  2023-05-14 18:14 ` [PATCH 04/10] conf: --config-net option is for pasta mode only Stefano Brivio
@ 2023-05-16  3:59   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-16  3:59 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 1849 bytes --]

On Sun, May 14, 2023 at 08:14:09PM +0200, Stefano Brivio wrote:
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Also makes sense regardless of the rest of the series.

> ---
>  conf.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/conf.c b/conf.c
> index aad2b00..bc1ae99 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -1198,7 +1198,6 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"userns",	required_argument,	NULL,		2 },
>  		{"netns",	required_argument,	NULL,		3 },
>  		{"netns-only",	no_argument,		&netns_only,	1 },
> -		{"config-net",	no_argument,		&c->pasta_conf_ns, 1 },
>  		{"ns-mac-addr",	required_argument,	NULL,		4 },
>  		{"dhcp-dns",	no_argument,		NULL,		5 },
>  		{"no-dhcp-dns",	no_argument,		NULL,		6 },
> @@ -1212,6 +1211,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"version",	no_argument,		NULL,		14 },
>  		{"outbound-if4", required_argument,	NULL,		15 },
>  		{"outbound-if6", required_argument,	NULL,		16 },
> +		{"config-net",	no_argument,		NULL,		17 },
>  		{ 0 },
>  	};
>  	struct get_bound_ports_ns_arg ns_ports_arg = { .c = c };
> @@ -1369,6 +1369,12 @@ void conf(struct ctx *c, int argc, char **argv)
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
>  				die("Invalid interface name: %s", optarg);
>  
> +			break;
> +		case 17:
> +			if (c->mode != MODE_PASTA)
> +				die("--config-net is for pasta mode only");
> +
> +			c->pasta_conf_ns = 1;
>  			break;
>  		case 'd':
>  			if (c->debug)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway"
  2023-05-14 18:14 ` [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway" Stefano Brivio
@ 2023-05-16  4:00   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-16  4:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 2775 bytes --]

On Sun, May 14, 2023 at 08:14:11PM +0200, Stefano Brivio wrote:
> This reverts commit 7656a6f8888237b9e23d63666e921528b6aaf950: now, by
> default, we copy all the routes associated to the outbound interface
> into the routing table of the container, so there's no need for this
> horrible workaround anymore.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

This was never the right solution - even without the other
considerations raised by this series we probably should have
advertized an explicit host route to the gateway instead of mangling
the netmask.

> ---
>  conf.c | 25 +------------------------
>  1 file changed, 1 insertion(+), 24 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index c2a745e..3a2fc2d 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -634,9 +634,6 @@ static int conf_ip4_prefix(const char *arg)
>  static unsigned int conf_ip4(unsigned int ifi,
>  			     struct ip4_ctx *ip4, unsigned char *mac)
>  {
> -	in_addr_t addr, gw;
> -	int shift;
> -
>  	if (!ifi)
>  		ifi = nl_get_ext_if(AF_INET);
>  
> @@ -651,10 +648,8 @@ static unsigned int conf_ip4(unsigned int ifi,
>  	if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr))
>  		nl_addr(0, ifi, AF_INET, &ip4->addr, &ip4->prefix_len, NULL);
>  
> -	addr = ntohl(ip4->addr.s_addr);
> -	gw = ntohl(ip4->gw.s_addr);
> -
>  	if (!ip4->prefix_len) {
> +		in_addr_t addr = ntohl(ip4->addr.s_addr);
>  		if (IN_CLASSA(addr))
>  			ip4->prefix_len = (32 - IN_CLASSA_NSHIFT);
>  		else if (IN_CLASSB(addr))
> @@ -665,24 +660,6 @@ static unsigned int conf_ip4(unsigned int ifi,
>  			ip4->prefix_len = 32;
>  	}
>  
> -	/* We might get an address with a netmask that makes the default
> -	 * gateway unreachable, and in that case we would fail to configure
> -	 * the default route, with --config-net, or presumably a DHCP client
> -	 * in the guest or container would face the same issue.
> -	 *
> -	 * The host might have another route, to the default gateway itself,
> -	 * fixing the situation, but we only read default routes.
> -	 *
> -	 * Fix up the mask to allow reaching the default gateway from our
> -	 * configured address, if needed, and only if we find a non-zero
> -	 * mask that makes the gateway reachable.
> -	 */
> -	shift = 32 - ip4->prefix_len;
> -	while (shift < 32 && addr >> shift != gw >> shift)
> -		shift++;
> -	if (shift < 32)
> -		ip4->prefix_len = 32 - shift;
> -
>  	memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen));
>  
>  	if (MAC_IS_ZERO(mac))

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx
  2023-05-14 18:14 ` [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx Stefano Brivio
@ 2023-05-16  4:03   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-16  4:03 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 1152 bytes --]

On Sun, May 14, 2023 at 08:14:15PM +0200, Stefano Brivio wrote:
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Also makes sense regardless of the rest of the series.

> ---
>  passt.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/passt.h b/passt.h
> index b51a1e5..96fd27b 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -180,7 +180,7 @@ struct ip6_ctx {
>   * @ifi6:		Index of template interface for IPv6, 0 if IPv6 disabled
>   * @ip6:		IPv6 configuration
>   * @pasta_ifn:		Name of namespace interface for pasta
> - * @pasta_ifn:		Index of namespace interface for pasta
> + * @pasta_ifi:		Index of namespace interface for pasta
>   * @pasta_conf_ns:	Configure namespace after creating it
>   * @no_copy_routes:	Don't copy all routes when configuring target namespace
>   * @no_copy_addrs:	Don't copy all addresses when configuring namespace

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
  2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
                   ` (9 preceding siblings ...)
  2023-05-14 18:14 ` [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx Stefano Brivio
@ 2023-05-16  5:06 ` David Gibson
  2023-05-16 21:42   ` Stefano Brivio
  10 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2023-05-16  5:06 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 6342 bytes --]

On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:
> This series, along with pseudo-related fixes, enables:
> 
> - optional copy of all routes from selected interface in outer
>   namespace, to (hopefully!) fix the issue reported by Callum at:
>     https://github.com/containers/podman/issues/18539
> 
> - optional copy of all addresses, mostly for consistency. It doesn't,
>   however, enable assignment of multiple addresses in the sense
>   requested at:
>     https://bugs.passt.top/show_bug.cgi?id=47
> 
>   because the addresses still need to be present on the host, and
>   the "outer" address isn't selected depending on the address used
>   inside the container
> 
> - operation without a gateway address, to (again, hopefully) support
>   usage of Wireguard endpoints established outside the container,
>     https://bugs.passt.top/show_bug.cgi?id=49
> 
> I tested the single functionalities introduced here, but I didn't
> try to reproduce the setups where the issues were reported, so some
> help with testing is definitely fundamental here. Thanks.

I've sent reviews for some of the simpler patches in this series which
make sense even without the context of the overall aim.  I think those
can be applied immediately.  For the rest of the series, I want to
address the generalities before doing detailed review of the
implementation.

I think the basic idea here is sound: we want to expose anything
routable to the host as routable to the guest, even when the host has
a more complex routing setup that just a netmask on the "main"
interface and a default gateway within that prefix.  But I think we
want to think a bit more deeply about exactly what we need/want to
expose here.

Even with the current code, the default gateway address we advertise
to the guest is kind of meaningless: the guest cannot directly access
that gateway, everything really goes through passt on the host.  This
works because the gateway address (like everything) will ARP/NDP to
passt's host side MAC address and once the packets hit passt it
doesn't matter what the guest thought the routing was going to be.

I think we have a few choices in two more-or-less orthogonal
categories.

A) What routable prefixes do we advertise to the guest?

  A.1) Always a default route (0.0.0.0/0 and ::/0)

We tell the guest that every address is routable via the passt
interface, regardless of routing setup on the host.  This essentially
tells the guest to delegate all routing responsibility to passt.

Advantages:
  * Simple
  * No need to update anything if routing configuration on the host
    changes
Disadvantages:
  * If addresses are unroutable from the host, the guest will only
    know via ICMP/ICMPv6, rather than statically, which may be a worse
    UX on the guest side.  Plus we might need to actually implement
    those host unreachable ICMPs.
  * Might be messy if the guest has multiple interfacees - e.g. if we
    allow passt to be configured to attach to a specific host
    interface only, then we have multiple passts attached to a single
    guest: they'd all be advertising a default route.

  A.2) Copy routable prefixes from the host to the guest

We just advertise those prefixes routable to the host to the guest
(which might include an empty prefix == default route).

Advantages:
  * Guest statically knows what addresses are routable via the passt
    interface
Disadvantages:
  * What do we do with overlapping prefixes?  On the host we might
    have more specific routes pointing to a specific interface.  For
    the guest they all point to the passt interface, so what's the
    point?
  * Can we advertise an arbitrary set of static routes via all our
    mechanisms (--config-net, DHCP, NDP+DHCPv6)?  Even if we can it
    adds more complexity to that code
  * How do we update things if the host routing configuration changes?
  * What do we do if the host has source-based routing or other
    advanced stuff set up?

B) What gateway, if any, do we advertise for each route?

  B.1) Copy it from the host

Advantages:
 * Guest L3 configuration resembles that of the host
Disadvantages:
 * If the host route doesn't have a gateway we have to fall back on
   B.2 or B.3 anyway
 * Misleading: in fact everything is routed by passt and the host
   before it reaches any gateway we're listing here

  B.2) Pick an address to represent passt as gateway

Advantages:
 * Accurately represents that everything is routed by passt
 * We can make this the same as the NAT-to-host address, so we only
   have one "magic" address (per AF)
Disadvantages:
 * Have to allocate an address that's safe, which is tricky (but we
   usually want this for NAT-to-host anyway)
 * Do we want just one address, or one for each distinct gateway from
   the host?
 * If we can't pick something in the interfaces "natural" prefix, we
   will also need to advertise a static route to reach it.

  B.3) Don't advertise a gateway for any route

passt essentially proxy ARPs for the entire internet.

Advantages:
  * No need to allocate an address - in fact passt need not have any
    guest facing IP at all
  * Extends naturally if we ever have a guest<->passt transport that's
    point-to-point rather than pseudo-ethernet
Disadvantages:
  * Guest ARP / neighbour tables could get real big



The status quo is, roughly, A.1+B.1, except that we also enforce that
the host must have a default route, which sidesteps one of the
complications of B.1.  IIUC, this series is implementing A.2+B.1.

Thinking about it, I'm moderately convinced that B.1 is a bad idea.
I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
to have a more concrete guest-visible address for passt itself - but
I'm also open to B.3.

I'm not sure about A.1 vs. A.2.  I was leaning towards A.2, but on
further consideration, I feel like the fact that A.1 automatically
works for routing changes on the host might outweigh the fact that he
guest only gets limited information (ICMP) about what's routable.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
  2023-05-16  5:06 ` [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes David Gibson
@ 2023-05-16 21:42   ` Stefano Brivio
  2023-05-17  1:15     ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-16 21:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Callum Parsey, me, lemmi

On Tue, 16 May 2023 15:06:29 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:
> > This series, along with pseudo-related fixes, enables:
> > 
> > - optional copy of all routes from selected interface in outer
> >   namespace, to (hopefully!) fix the issue reported by Callum at:
> >     https://github.com/containers/podman/issues/18539
> > 
> > - optional copy of all addresses, mostly for consistency. It doesn't,
> >   however, enable assignment of multiple addresses in the sense
> >   requested at:
> >     https://bugs.passt.top/show_bug.cgi?id=47
> > 
> >   because the addresses still need to be present on the host, and
> >   the "outer" address isn't selected depending on the address used
> >   inside the container
> > 
> > - operation without a gateway address, to (again, hopefully) support
> >   usage of Wireguard endpoints established outside the container,
> >     https://bugs.passt.top/show_bug.cgi?id=49
> > 
> > I tested the single functionalities introduced here, but I didn't
> > try to reproduce the setups where the issues were reported, so some
> > help with testing is definitely fundamental here. Thanks.  
> 
> I've sent reviews for some of the simpler patches in this series which
> make sense even without the context of the overall aim.  I think those
> can be applied immediately.

Those are actually the least important patches for users -- and I can't
apply 6/10 without breaking Podman's CI plus probably a number of
deployments (that's why it comes after 5/10)... so, no, I would rather
not apply the rest for the moment.

> For the rest of the series, I want to address the generalities before
> doing detailed review of the implementation.
> 
> I think the basic idea here is sound: we want to expose anything
> routable to the host as routable to the guest, even when the host has
> a more complex routing setup that just a netmask on the "main"
> interface and a default gateway within that prefix.

The intentions behind this series are actually slightly different:

- we have a complete breakage in a seemingly common use case (I would
  even say cloud-init setups in general), and I'd like to fix that
  sooner rather than later

- this concerns only the direct configuration pasta does, with
  --config-net. What we advertise is definitely related, but not the
  same topic... to the point that the issues fixed by this series don't
  even occur with a DHCP client:
    https://github.com/containers/podman/issues/18539#issuecomment-1545023424

  And, in general, we can't advertise everything we can configure (say,
  a route without router over DHCP).

  I'd be much more careful about what we advertise. We have direct
  control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
  we need to think of possible interpretations and common half-bugs as
  well.

> But I think we want to think a bit more deeply about exactly what we
> need/want to expose here.
> 
> Even with the current code, the default gateway address we advertise
> to the guest is kind of meaningless: the guest cannot directly access
> that gateway, everything really goes through passt on the host.

In the simplest, probably most common network setups, that's actually
the gateway that connects our guest to other nodes.

For other cases, I think we should eventually implement
https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
saying that, then, we can't just use the same host route no matter what
the container chooses. We'll need to match them.

I mean, I'm not saying that the behaviour from this series is complete
and self-consistent, just that it works around obvious, urgent issues
and at the same time it looks like we'll probably need something
similar to support further use cases.

> This works because the gateway address (like everything) will ARP/NDP
> to passt's host side MAC address and once the packets hit passt it
> doesn't matter what the guest thought the routing was going to be.
> 
> I think we have a few choices in two more-or-less orthogonal
> categories.
> 
> A) What routable prefixes do we advertise to the guest?
> 
>   A.1) Always a default route (0.0.0.0/0 and ::/0)
> 
> We tell the guest that every address is routable via the passt
> interface, regardless of routing setup on the host.  This essentially
> tells the guest to delegate all routing responsibility to passt.
> 
> Advantages:
>   * Simple
>   * No need to update anything if routing configuration on the host
>     changes
> Disadvantages:
>   * If addresses are unroutable from the host, the guest will only
>     know via ICMP/ICMPv6, rather than statically, which may be a worse
>     UX on the guest side.  Plus we might need to actually implement
>     those host unreachable ICMPs.
>   * Might be messy if the guest has multiple interfacees - e.g. if we
>     allow passt to be configured to attach to a specific host
>     interface only, then we have multiple passts attached to a single
>     guest: they'd all be advertising a default route.
> 
>   A.2) Copy routable prefixes from the host to the guest

I'm having a hard time figuring out the definition of this point. How
would you define that? Strictly speaking, in the case at hand, nothing
is routable: we have a /32 address.

> We just advertise those prefixes routable to the host to the guest
> (which might include an empty prefix == default route).
> 
> Advantages:
>   * Guest statically knows what addresses are routable via the passt
>     interface
> Disadvantages:
>   * What do we do with overlapping prefixes?  On the host we might
>     have more specific routes pointing to a specific interface.  For
>     the guest they all point to the passt interface, so what's the
>     point?
>   * Can we advertise an arbitrary set of static routes via all our
>     mechanisms (--config-net, DHCP, NDP+DHCPv6)?  Even if we can it
>     adds more complexity to that code
>   * How do we update things if the host routing configuration changes?
>   * What do we do if the host has source-based routing or other
>     advanced stuff set up?
> 
> B) What gateway, if any, do we advertise for each route?
> 
>   B.1) Copy it from the host
> 
> Advantages:
>  * Guest L3 configuration resembles that of the host

...which is a fundamental design goal of passt: transparency, and
pretending it doesn't exist. Otherwise we can have a route, a bridge,
an interface, etc.

Now, while there are use cases that rely on different aspects of this
transparency (KubeVirt and service mesh integration) I understand this
might sound a bit dogmatic, because you might say there are more
important use cases (which I'm not aware of) or supposed benefits.

What's far less dogmatic, though, is how many issues we happily and
automatically avoid by relying on the sanity of the host networking
configuration.

By trying to copy it as close as possible, we avoid one very important
source of issues, which is our interpretation or possible lack of
knowledge about how applications we don't know about chose to interact
with kernel and network setups. The main case fixed by this series
shows exactly that: I think it's broken, but it works, and users
expect it to work.

And by trusting the host configuration we don't lose much: if that's
broken, almost everything else is broken anyway.

> Disadvantages:
>  * If the host route doesn't have a gateway we have to fall back on
>    B.2 or B.3 anyway

Well, they are a particular case of B.1 then: what's the disadvantage?
This is consistent (especially with this series, and especially if we
start adapting the *default* behaviours in this sense).

>  * Misleading: in fact everything is routed by passt and the host
>    before it reaches any gateway we're listing here

But passt isn't supposed to be a router...? Let's say we have multiple
routes on the host, we configure or advertise multiple routes to the
guest. Does that make passt a router? I don't think so: we're just
associating them as closely as possible, without fancy interpretations.

A router has its own routing table, passt's would simply be a copy.
Right now it has essentially none.

>   B.2) Pick an address to represent passt as gateway
> 
> Advantages:
>  * Accurately represents that everything is routed by passt

This is configurable, actually, but no, I insist that passt isn't
*functionally* routing anything, or at least that we should get as
close as possible to that.

>  * We can make this the same as the NAT-to-host address, so we only
>    have one "magic" address (per AF)

Not really, if it's configurable.

> Disadvantages:
>  * Have to allocate an address that's safe, which is tricky (but we
>    usually want this for NAT-to-host anyway)

There's a difference between picking an address by default and letting
the user configure one. Besides, at least for IPv4, I don't think such
an address exists.

>  * Do we want just one address, or one for each distinct gateway from
>    the host?
>  * If we can't pick something in the interfaces "natural" prefix, we
>    will also need to advertise a static route to reach it.
> 
>   B.3) Don't advertise a gateway for any route
> 
> passt essentially proxy ARPs for the entire internet.
> 
> Advantages:
>   * No need to allocate an address - in fact passt need not have any
>     guest facing IP at all
>   * Extends naturally if we ever have a guest<->passt transport that's
>     point-to-point rather than pseudo-ethernet
> Disadvantages:
>   * Guest ARP / neighbour tables could get real big

...it would also break a number of applications that peek at netlink
(or do ioctl()s) to check they are in fact online.

> 
> 
> The status quo is, roughly, A.1+B.1, except that we also enforce that
> the host must have a default route, which sidesteps one of the
> complications of B.1.  IIUC, this series is implementing A.2+B.1.
> 
> Thinking about it, I'm moderately convinced that B.1 is a bad idea.
> I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
> to have a more concrete guest-visible address for passt itself - but
> I'm also open to B.3.

...that, especially B.3, sounds like another tool, or at least like
another mode, because it conflicts quite a bit with design goals.

They're different from design _choices_ in the sense that that's what
I've been "selling" to users and what I and others have been
implementing in integrations so far.

> I'm not sure about A.1 vs. A.2.  I was leaning towards A.2, but on
> further consideration, I feel like the fact that A.1 automatically
> works for routing changes on the host might outweigh the fact that he
> guest only gets limited information (ICMP) about what's routable.

I don't think A.2 is doable, but even if it were, yes, I don't think
it would be worth the effort. If needed (and I never saw a request in
this sense), we could enrich ICMP/ICMPv6 handling guest- or
container-side quite a bit.

-- 
Stefano


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
  2023-05-16 21:42   ` Stefano Brivio
@ 2023-05-17  1:15     ` David Gibson
  2023-05-17  6:52       ` Stefano Brivio
  0 siblings, 1 reply; 21+ messages in thread
From: David Gibson @ 2023-05-17  1:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 15957 bytes --]

On Tue, May 16, 2023 at 11:42:09PM +0200, Stefano Brivio wrote:
> On Tue, 16 May 2023 15:06:29 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:
> > > This series, along with pseudo-related fixes, enables:
> > > 
> > > - optional copy of all routes from selected interface in outer
> > >   namespace, to (hopefully!) fix the issue reported by Callum at:
> > >     https://github.com/containers/podman/issues/18539
> > > 
> > > - optional copy of all addresses, mostly for consistency. It doesn't,
> > >   however, enable assignment of multiple addresses in the sense
> > >   requested at:
> > >     https://bugs.passt.top/show_bug.cgi?id=47
> > > 
> > >   because the addresses still need to be present on the host, and
> > >   the "outer" address isn't selected depending on the address used
> > >   inside the container
> > > 
> > > - operation without a gateway address, to (again, hopefully) support
> > >   usage of Wireguard endpoints established outside the container,
> > >     https://bugs.passt.top/show_bug.cgi?id=49
> > > 
> > > I tested the single functionalities introduced here, but I didn't
> > > try to reproduce the setups where the issues were reported, so some
> > > help with testing is definitely fundamental here. Thanks.  
> > 
> > I've sent reviews for some of the simpler patches in this series which
> > make sense even without the context of the overall aim.  I think those
> > can be applied immediately.
> 
> Those are actually the least important patches for users

Well, granted.

> -- and I can't
> apply 6/10 without breaking Podman's CI plus probably a number of
> deployments (that's why it comes after 5/10)... so, no, I would rather
> not apply the rest for the moment.

Uh.. true, 6/10 is problematic, but I think the other easy ones could
be applied safely enough.

> > For the rest of the series, I want to address the generalities before
> > doing detailed review of the implementation.
> > 
> > I think the basic idea here is sound: we want to expose anything
> > routable to the host as routable to the guest, even when the host has
> > a more complex routing setup that just a netmask on the "main"
> > interface and a default gateway within that prefix.
> 
> The intentions behind this series are actually slightly different:
> 
> - we have a complete breakage in a seemingly common use case (I would
>   even say cloud-init setups in general), and I'd like to fix that
>   sooner rather than later

Well, sure, but we should at least think about where we're going with
this longer term, so we don't box ourselves in.

> - this concerns only the direct configuration pasta does, with
>   --config-net. What we advertise is definitely related, but not the
>   same topic... to the point that the issues fixed by this series don't
>   even occur with a DHCP client:
>     https://github.com/containers/podman/issues/18539#issuecomment-1545023424

Ah, interesting.  It looks like dhclient (or rather dhclient-script, I
expect) is adding an explicit /32 route to the default gateway.  It
seems to me the best quick fix for --config-net is to do the same
thing.  Basically rather than expanding the netmask as we did in 6/10,
if the gateway address is not in the interface's netmask add a /32 or
/128 route to the gateway.

>   And, in general, we can't advertise everything we can configure (say,
>   a route without router over DHCP).

Ah, true.  The DHCP options for static routes are even more limited
than I realized.  Ok, that nixes option B.3.

>   I'd be much more careful about what we advertise. We have direct
>   control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
>   we need to think of possible interpretations and common half-bugs as
>   well.
> 
> > But I think we want to think a bit more deeply about exactly what we
> > need/want to expose here.
> > 
> > Even with the current code, the default gateway address we advertise
> > to the guest is kind of meaningless: the guest cannot directly access
> > that gateway, everything really goes through passt on the host.
> 
> In the simplest, probably most common network setups, that's actually
> the gateway that connects our guest to other nodes.

I don't understand what you mean by this.  Yes, we have the same IP
for the gateway that the host sees, but the NAT to host means that we
can't even talk to the gateway at L4.  Literally the only thing the
guest kernel will do with that gateway address is put it into ARP and
neighbour discovery packets, which passt will resolve to its own MAC,
like nearly every other IP.

> For other cases, I think we should eventually implement
> https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
> saying that, then, we can't just use the same host route no matter what
> the container chooses. We'll need to match them.

Oh.. I'm wondering if I've been confusing by using "host route" in two
different ways: one being "a route taken from the passt host system"
and the other meaning "a route to a single network host, that is /32
or /128".

I agree that we should move to allowing multiple IPs on the guest
side, but I don't see how that conflicts with the routing issue here.

> I mean, I'm not saying that the behaviour from this series is complete
> and self-consistent, just that it works around obvious, urgent issues
> and at the same time it looks like we'll probably need something
> similar to support further use cases.

Adding a /32 or /128 route to the gateway seems a simpler way to do
that to me.  Plus it matches the behaviour that DHCP seems to be doing
anyway.

> > This works because the gateway address (like everything) will ARP/NDP
> > to passt's host side MAC address and once the packets hit passt it
> > doesn't matter what the guest thought the routing was going to be.
> > 
> > I think we have a few choices in two more-or-less orthogonal
> > categories.
> > 
> > A) What routable prefixes do we advertise to the guest?
> > 
> >   A.1) Always a default route (0.0.0.0/0 and ::/0)
> > 
> > We tell the guest that every address is routable via the passt
> > interface, regardless of routing setup on the host.  This essentially
> > tells the guest to delegate all routing responsibility to passt.
> > 
> > Advantages:
> >   * Simple
> >   * No need to update anything if routing configuration on the host
> >     changes
> > Disadvantages:
> >   * If addresses are unroutable from the host, the guest will only
> >     know via ICMP/ICMPv6, rather than statically, which may be a worse
> >     UX on the guest side.  Plus we might need to actually implement
> >     those host unreachable ICMPs.
> >   * Might be messy if the guest has multiple interfacees - e.g. if we
> >     allow passt to be configured to attach to a specific host
> >     interface only, then we have multiple passts attached to a single
> >     guest: they'd all be advertising a default route.
> > 
> >   A.2) Copy routable prefixes from the host to the guest
> 
> I'm having a hard time figuring out the definition of this point. How
> would you define that? Strictly speaking, in the case at hand, nothing
> is routable: we have a /32 address.

Right.. which means that if the host is working, it must have an
additional static route - also probably /32 - telling it how to get to
the gateway.  Indeed I can see it in the bug, initial comment:
    172.31.1.1 dev ens3 proto static scope link metric 100
With A.2 we'd copy that route to the guest - or at least one with the
same prefix (which is a single address in this case).

> > We just advertise those prefixes routable to the host to the guest
> > (which might include an empty prefix == default route).
> > 
> > Advantages:
> >   * Guest statically knows what addresses are routable via the passt
> >     interface
> > Disadvantages:
> >   * What do we do with overlapping prefixes?  On the host we might
> >     have more specific routes pointing to a specific interface.  For
> >     the guest they all point to the passt interface, so what's the
> >     point?
> >   * Can we advertise an arbitrary set of static routes via all our
> >     mechanisms (--config-net, DHCP, NDP+DHCPv6)?  Even if we can it
> >     adds more complexity to that code
> >   * How do we update things if the host routing configuration changes?
> >   * What do we do if the host has source-based routing or other
> >     advanced stuff set up?
> > 
> > B) What gateway, if any, do we advertise for each route?
> > 
> >   B.1) Copy it from the host
> > 
> > Advantages:
> >  * Guest L3 configuration resembles that of the host
> 
> ...which is a fundamental design goal of passt: transparency, and
> pretending it doesn't exist. Otherwise we can have a route, a bridge,
> an interface, etc.

Well... we want to be transparent for anything visible at L4.  For
things only visible at L3 - like routes, it's not possible for things
to look 100% identical, so I think we have some wiggle room in exactly
what we do.

> Now, while there are use cases that rely on different aspects of this
> transparency (KubeVirt and service mesh integration) I understand this
> might sound a bit dogmatic, because you might say there are more
> important use cases (which I'm not aware of) or supposed benefits.
> 
> What's far less dogmatic, though, is how many issues we happily and
> automatically avoid by relying on the sanity of the host networking
> configuration.
> 
> By trying to copy it as close as possible, we avoid one very important
> source of issues, which is our interpretation or possible lack of
> knowledge about how applications we don't know about chose to interact
> with kernel and network setups. The main case fixed by this series
> shows exactly that: I think it's broken, but it works, and users
> expect it to work.
> 
> And by trusting the host configuration we don't lose much: if that's
> broken, almost everything else is broken anyway.

It's not a question of "trust" in the host configuration, it's the
fact that parts of the host configuration don't make sense in the
guest's context.  Most obviously the interface names from the host
routes can't be used in the guest.  We can and do use the same
addresses for the routers, but what does it really mean?  The guest
can't actually contact them as neighbours - when it tries they just
ARP to passt's fake MAC and the packets get routed by the host kernel
regardless of what router the guest was trying to send them to - in
fact neither passt nor the host kernel will even know what router the
guest thought it was using.

> > Disadvantages:
> >  * If the host route doesn't have a gateway we have to fall back on
> >    B.2 or B.3 anyway
> 
> Well, they are a particular case of B.1 then: what's the disadvantage?

Two cases is more complex than one.

> This is consistent (especially with this series, and especially if we
> start adapting the *default* behaviours in this sense).
> 
> >  * Misleading: in fact everything is routed by passt and the host
> >    before it reaches any gateway we're listing here
> 
> But passt isn't supposed to be a router...? Let's say we have multiple
> routes on the host, we configure or advertise multiple routes to the
> guest. Does that make passt a router? I don't think so: we're just
> associating them as closely as possible, without fancy interpretations.
> 
> A router has its own routing table, passt's would simply be a copy.
> Right now it has essentially none.

Sorry, by "passt" here I really meant the host kernel, which
absolutely will route the packets.  There's no guarantee they'll even
go next to the router the guest thought it was using, although it's
likely.

> >   B.2) Pick an address to represent passt as gateway
> > 
> > Advantages:
> >  * Accurately represents that everything is routed by passt
> 
> This is configurable, actually, but no, I insist that passt isn't
> *functionally* routing anything, or at least that we should get as
> close as possible to that.

Again, the host kernel definitely will, and there's no avoiding that.

> >  * We can make this the same as the NAT-to-host address, so we only
> >    have one "magic" address (per AF)
> 
> Not really, if it's configurable.

I mean one per passt instance, not one globally.  As opposed to the
gateway address and the NAT-to-host address being potentially
different magic addresses in a single instance.

> > Disadvantages:
> >  * Have to allocate an address that's safe, which is tricky (but we
> >    usually want this for NAT-to-host anyway)
> 
> There's a difference between picking an address by default and letting
> the user configure one. Besides, at least for IPv4, I don't think such
> an address exists.

There certainly isn't one we can use everywhere.  I think we have some
options for probing one that will be safe in a particular case.

> >  * Do we want just one address, or one for each distinct gateway from
> >    the host?
> >  * If we can't pick something in the interfaces "natural" prefix, we
> >    will also need to advertise a static route to reach it.
> > 
> >   B.3) Don't advertise a gateway for any route
> > 
> > passt essentially proxy ARPs for the entire internet.
> > 
> > Advantages:
> >   * No need to allocate an address - in fact passt need not have any
> >     guest facing IP at all
> >   * Extends naturally if we ever have a guest<->passt transport that's
> >     point-to-point rather than pseudo-ethernet
> > Disadvantages:
> >   * Guest ARP / neighbour tables could get real big
> 
> ...it would also break a number of applications that peek at netlink
> (or do ioctl()s) to check they are in fact online.

Uh.. what exactly are they looking at?  We'd still have at least one
route, they just wouldn't have gateways attached to them.  But as you
pointed out above I don't think we can do this with DHCP, which pretty
much kills it anyway.

> > The status quo is, roughly, A.1+B.1, except that we also enforce that
> > the host must have a default route, which sidesteps one of the
> > complications of B.1.  IIUC, this series is implementing A.2+B.1.
> > 
> > Thinking about it, I'm moderately convinced that B.1 is a bad idea.
> > I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
> > to have a more concrete guest-visible address for passt itself - but
> > I'm also open to B.3.
> 
> ...that, especially B.3, sounds like another tool, or at least like
> another mode, because it conflicts quite a bit with design goals.

> They're different from design _choices_ in the sense that that's what
> I've been "selling" to users and what I and others have been
> implementing in integrations so far.

So the ways L4 transparency are valuable (including guest address) are
pretty clear to me.  Are there also cases where the (partial) L3
transparency matter?  They're certainly not obvious to me

> > I'm not sure about A.1 vs. A.2.  I was leaning towards A.2, but on
> > further consideration, I feel like the fact that A.1 automatically
> > works for routing changes on the host might outweigh the fact that he
> > guest only gets limited information (ICMP) about what's routable.
> 
> I don't think A.2 is doable,

?? AFAICT this series is doing A.2

> but even if it were, yes, I don't think
> it would be worth the effort. If needed (and I never saw a request in
> this sense), we could enrich ICMP/ICMPv6 handling guest- or
> container-side quite a bit.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
  2023-05-17  1:15     ` David Gibson
@ 2023-05-17  6:52       ` Stefano Brivio
  2023-05-18  3:26         ` David Gibson
  0 siblings, 1 reply; 21+ messages in thread
From: Stefano Brivio @ 2023-05-17  6:52 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Callum Parsey, me, lemmi

On Wed, 17 May 2023 11:15:06 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, May 16, 2023 at 11:42:09PM +0200, Stefano Brivio wrote:
> > On Tue, 16 May 2023 15:06:29 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:  
> > > > This series, along with pseudo-related fixes, enables:
> > > > 
> > > > - optional copy of all routes from selected interface in outer
> > > >   namespace, to (hopefully!) fix the issue reported by Callum at:
> > > >     https://github.com/containers/podman/issues/18539
> > > > 
> > > > - optional copy of all addresses, mostly for consistency. It doesn't,
> > > >   however, enable assignment of multiple addresses in the sense
> > > >   requested at:
> > > >     https://bugs.passt.top/show_bug.cgi?id=47
> > > > 
> > > >   because the addresses still need to be present on the host, and
> > > >   the "outer" address isn't selected depending on the address used
> > > >   inside the container
> > > > 
> > > > - operation without a gateway address, to (again, hopefully) support
> > > >   usage of Wireguard endpoints established outside the container,
> > > >     https://bugs.passt.top/show_bug.cgi?id=49
> > > > 
> > > > I tested the single functionalities introduced here, but I didn't
> > > > try to reproduce the setups where the issues were reported, so some
> > > > help with testing is definitely fundamental here. Thanks.    
> > > 
> > > I've sent reviews for some of the simpler patches in this series which
> > > make sense even without the context of the overall aim.  I think those
> > > can be applied immediately.  
> > 
> > Those are actually the least important patches for users  
> 
> Well, granted.
> 
> > -- and I can't
> > apply 6/10 without breaking Podman's CI plus probably a number of
> > deployments (that's why it comes after 5/10)... so, no, I would rather
> > not apply the rest for the moment.  
> 
> Uh.. true, 6/10 is problematic, but I think the other easy ones could
> be applied safely enough.

With two hands and (worryingly close to) just 24 hours in a day, I
honestly can't picture even a quick rebase and retest for those being a
priority, while happily keeping around the issue that 5/10 fixes.

> > > For the rest of the series, I want to address the generalities before
> > > doing detailed review of the implementation.
> > > 
> > > I think the basic idea here is sound: we want to expose anything
> > > routable to the host as routable to the guest, even when the host has
> > > a more complex routing setup that just a netmask on the "main"
> > > interface and a default gateway within that prefix.  
> > 
> > The intentions behind this series are actually slightly different:
> > 
> > - we have a complete breakage in a seemingly common use case (I would
> >   even say cloud-init setups in general), and I'd like to fix that
> >   sooner rather than later  
> 
> Well, sure, but we should at least think about where we're going with
> this longer term, so we don't box ourselves in.

I don't think this is going to "box us in" -- I'm just proposing to
change this after about two years, and we can definitely change it
again, as long as things keep working. If keeping things working is
boxing us in, well, I can't see that as a bad thing.

That is, I don't see people doing screen-scraping of 'ip route show'
and writing applications around that, and surely not adapting
applications to what pasta does. Not at the moment, and surely not for
a long while.

> > - this concerns only the direct configuration pasta does, with
> >   --config-net. What we advertise is definitely related, but not the
> >   same topic... to the point that the issues fixed by this series don't
> >   even occur with a DHCP client:
> >     https://github.com/containers/podman/issues/18539#issuecomment-1545023424  
> 
> Ah, interesting.  It looks like dhclient (or rather dhclient-script, I
> expect) is adding an explicit /32 route to the default gateway.  It
> seems to me the best quick fix for --config-net is to do the same
> thing.  Basically rather than expanding the netmask as we did in 6/10,
> if the gateway address is not in the interface's netmask add a /32 or
> /128 route to the gateway.

That's the first option I considered (of course!):
  https://github.com/containers/podman/issues/18539#issuecomment-1545260780
  https://github.com/containers/podman/issues/18539#issuecomment-1546967377

and only as I started implementing it, I realised that we can have
anyway chained dependencies which aren't that easy to handle, especially
if we admit an arbitrary number of routes and we need to sort them.

Plus it's going to be 1. more code 2. actually "complicated" code. This
is stupidly simple instead.

I have some experience of fixing IPv6 FIB code in kernel with
consequences on sorting/selection, and there are just so many hidden
details involved in interpretation of Linux-style routes and ways of
shadowing them.

> >   And, in general, we can't advertise everything we can configure (say,
> >   a route without router over DHCP).  
> 
> Ah, true.  The DHCP options for static routes are even more limited
> than I realized.  Ok, that nixes option B.3.
> 
> >   I'd be much more careful about what we advertise. We have direct
> >   control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
> >   we need to think of possible interpretations and common half-bugs as
> >   well.
> >   
> > > But I think we want to think a bit more deeply about exactly what we
> > > need/want to expose here.
> > > 
> > > Even with the current code, the default gateway address we advertise
> > > to the guest is kind of meaningless: the guest cannot directly access
> > > that gateway, everything really goes through passt on the host.  
> > 
> > In the simplest, probably most common network setups, that's actually
> > the gateway that connects our guest to other nodes.  
> 
> I don't understand what you mean by this.  Yes, we have the same IP
> for the gateway that the host sees, but the NAT to host means that we
> can't even talk to the gateway at L4.

It's disabled by default in Podman. It's the default behaviour in
passt because this started from KubeVirt and that's what they expect,
but that's about it. Once the address is configurable, this is not a
valid point, in general.

A gateway doesn't need to be a host, and it's very often, functionally,
not a host. This is by design: RFC 791, 2.2:

  In a gateway the higher level protocols need not be implemented and
  the GGP functions are added to the IP module.

> Literally the only thing the guest kernel will do with that gateway
> address is put it into ARP and neighbour discovery packets, which passt
> will resolve to its own MAC, like nearly every other IP.

No, the guest kernel might also have netfilter rules, specifying that
gateway address, that were originally designed for the host, as if guest
and passt didn't exist. Those might happily use the gateway address to
represent the notion of gateway.

> > For other cases, I think we should eventually implement
> > https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
> > saying that, then, we can't just use the same host route no matter what
> > the container chooses. We'll need to match them.  
> 
> Oh.. I'm wondering if I've been confusing by using "host route" in two
> different ways: one being "a route taken from the passt host system"
> and the other meaning "a route to a single network host, that is /32
> or /128".
> 
> I agree that we should move to allowing multiple IPs on the guest
> side, but I don't see how that conflicts with the routing issue here.

It's related because, once we allow them, different host routes should
actually be used, so having them in the guest/container too,
aiming at a 1:1 mapping, should simplify things rather than become
misleading.

> > I mean, I'm not saying that the behaviour from this series is complete
> > and self-consistent, just that it works around obvious, urgent issues
> > and at the same time it looks like we'll probably need something
> > similar to support further use cases.  
> 
> Adding a /32 or /128 route to the gateway seems a simpler way to do
> that to me.  Plus it matches the behaviour that DHCP seems to be doing
> anyway.

For the reason why it's not really simple, see above. About what DHCP
clients do: that's in general not the case for udhcpc, dhcpcd, pump, or
NetworkManager. See also e1c94637ad50 ("dhcp: Send option 121 if the
default gateway is not on the assigned subnet").

As far as I know, that's just what the script used in conjunction with
ISC's dhclient does on _some_ distributions.

> > > This works because the gateway address (like everything) will ARP/NDP
> > > to passt's host side MAC address and once the packets hit passt it
> > > doesn't matter what the guest thought the routing was going to be.
> > > 
> > > I think we have a few choices in two more-or-less orthogonal
> > > categories.
> > > 
> > > A) What routable prefixes do we advertise to the guest?
> > > 
> > >   A.1) Always a default route (0.0.0.0/0 and ::/0)
> > > 
> > > We tell the guest that every address is routable via the passt
> > > interface, regardless of routing setup on the host.  This essentially
> > > tells the guest to delegate all routing responsibility to passt.
> > > 
> > > Advantages:
> > >   * Simple
> > >   * No need to update anything if routing configuration on the host
> > >     changes
> > > Disadvantages:
> > >   * If addresses are unroutable from the host, the guest will only
> > >     know via ICMP/ICMPv6, rather than statically, which may be a worse
> > >     UX on the guest side.  Plus we might need to actually implement
> > >     those host unreachable ICMPs.
> > >   * Might be messy if the guest has multiple interfacees - e.g. if we
> > >     allow passt to be configured to attach to a specific host
> > >     interface only, then we have multiple passts attached to a single
> > >     guest: they'd all be advertising a default route.
> > > 
> > >   A.2) Copy routable prefixes from the host to the guest  
> > 
> > I'm having a hard time figuring out the definition of this point. How
> > would you define that? Strictly speaking, in the case at hand, nothing
> > is routable: we have a /32 address.  
> 
> Right.. which means that if the host is working, it must have an
> additional static route - also probably /32 - telling it how to get to
> the gateway.  Indeed I can see it in the bug, initial comment:
>     172.31.1.1 dev ens3 proto static scope link metric 100
> With A.2 we'd copy that route to the guest - or at least one with the
> same prefix (which is a single address in this case).

Blindly copying routes is one thing. Figuring out what subnets are
routable and which ones aren't is a different matter, and _that_ is
what I don't consider generally feasible.

> > > We just advertise those prefixes routable to the host to the guest
> > > (which might include an empty prefix == default route).
> > > 
> > > Advantages:
> > >   * Guest statically knows what addresses are routable via the passt
> > >     interface
> > > Disadvantages:
> > >   * What do we do with overlapping prefixes?  On the host we might
> > >     have more specific routes pointing to a specific interface.  For
> > >     the guest they all point to the passt interface, so what's the
> > >     point?
> > >   * Can we advertise an arbitrary set of static routes via all our
> > >     mechanisms (--config-net, DHCP, NDP+DHCPv6)?  Even if we can it
> > >     adds more complexity to that code
> > >   * How do we update things if the host routing configuration changes?
> > >   * What do we do if the host has source-based routing or other
> > >     advanced stuff set up?
> > > 
> > > B) What gateway, if any, do we advertise for each route?
> > > 
> > >   B.1) Copy it from the host
> > > 
> > > Advantages:
> > >  * Guest L3 configuration resembles that of the host  
> > 
> > ...which is a fundamental design goal of passt: transparency, and
> > pretending it doesn't exist. Otherwise we can have a route, a bridge,
> > an interface, etc.  
> 
> Well... we want to be transparent for anything visible at L4.  For
> things only visible at L3 - like routes, it's not possible for things
> to look 100% identical, so I think we have some wiggle room in exactly
> what we do.

With this series, in most cases, things will actually be 100% identical.

But no, the transparency design goal applies especially to L3:
  https://passt.top/passt/about/#motivation

let alone ALGs, which are probably less common nowadays -- but with
service meshes (increasingly common), L3 transparency is very helpful
to have. Otherwise libslirp would be absolutely enough from that
perspective.

> > Now, while there are use cases that rely on different aspects of this
> > transparency (KubeVirt and service mesh integration) I understand this
> > might sound a bit dogmatic, because you might say there are more
> > important use cases (which I'm not aware of) or supposed benefits.
> > 
> > What's far less dogmatic, though, is how many issues we happily and
> > automatically avoid by relying on the sanity of the host networking
> > configuration.
> > 
> > By trying to copy it as close as possible, we avoid one very important
> > source of issues, which is our interpretation or possible lack of
> > knowledge about how applications we don't know about chose to interact
> > with kernel and network setups. The main case fixed by this series
> > shows exactly that: I think it's broken, but it works, and users
> > expect it to work.
> > 
> > And by trusting the host configuration we don't lose much: if that's
> > broken, almost everything else is broken anyway.  
> 
> It's not a question of "trust" in the host configuration, it's the
> fact that parts of the host configuration don't make sense in the
> guest's context.  Most obviously the interface names from the host
> routes can't be used in the guest.

By default, with containers, even interface names are copied. Indices,
of course, aren't, but that's not something users or applications
typically try to fiddle with.

> We can and do use the same
> addresses for the routers, but what does it really mean?  The guest
> can't actually contact them as neighbours - when it tries they just
> ARP to passt's fake MAC and the packets get routed by the host kernel
> regardless of what router the guest was trying to send them to - in
> fact neither passt nor the host kernel will even know what router the
> guest thought it was using.

This is an L2 matter, which was never a problem for any project using
this.

> > > Disadvantages:
> > >  * If the host route doesn't have a gateway we have to fall back on
> > >    B.2 or B.3 anyway  
> > 
> > Well, they are a particular case of B.1 then: what's the disadvantage?  
> 
> Two cases is more complex than one.

...but, with this series, we don't implement two different cases...?

> > This is consistent (especially with this series, and especially if we
> > start adapting the *default* behaviours in this sense).
> >   
> > >  * Misleading: in fact everything is routed by passt and the host
> > >    before it reaches any gateway we're listing here  
> > 
> > But passt isn't supposed to be a router...? Let's say we have multiple
> > routes on the host, we configure or advertise multiple routes to the
> > guest. Does that make passt a router? I don't think so: we're just
> > associating them as closely as possible, without fancy interpretations.
> > 
> > A router has its own routing table, passt's would simply be a copy.
> > Right now it has essentially none.  
> 
> Sorry, by "passt" here I really meant the host kernel, which
> absolutely will route the packets.  There's no guarantee they'll even
> go next to the router the guest thought it was using, although it's
> likely.

Right. I'd just try to make it as likely as possible. That doesn't come
with this series, and, for instance, me@yawnt.com already checked and
told me this series isn't enough for the "regular" Wireguard case (with
the endpoint in the outer namespace), but this clearly appears to be
getting closer to what we'll need to "naturally" support that.

> > >   B.2) Pick an address to represent passt as gateway
> > > 
> > > Advantages:
> > >  * Accurately represents that everything is routed by passt  
> > 
> > This is configurable, actually, but no, I insist that passt isn't
> > *functionally* routing anything, or at least that we should get as
> > close as possible to that.  
> 
> Again, the host kernel definitely will, and there's no avoiding that.
> 
> > >  * We can make this the same as the NAT-to-host address, so we only
> > >    have one "magic" address (per AF)  
> > 
> > Not really, if it's configurable.  
> 
> I mean one per passt instance, not one globally.  As opposed to the
> gateway address and the NAT-to-host address being potentially
> different magic addresses in a single instance.
> 
> > > Disadvantages:
> > >  * Have to allocate an address that's safe, which is tricky (but we
> > >    usually want this for NAT-to-host anyway)  
> > 
> > There's a difference between picking an address by default and letting
> > the user configure one. Besides, at least for IPv4, I don't think such
> > an address exists.  
> 
> There certainly isn't one we can use everywhere.  I think we have some
> options for probing one that will be safe in a particular case.

Well... that doesn't sound great.

> > >  * Do we want just one address, or one for each distinct gateway from
> > >    the host?
> > >  * If we can't pick something in the interfaces "natural" prefix, we
> > >    will also need to advertise a static route to reach it.
> > > 
> > >   B.3) Don't advertise a gateway for any route
> > > 
> > > passt essentially proxy ARPs for the entire internet.
> > > 
> > > Advantages:
> > >   * No need to allocate an address - in fact passt need not have any
> > >     guest facing IP at all
> > >   * Extends naturally if we ever have a guest<->passt transport that's
> > >     point-to-point rather than pseudo-ethernet
> > > Disadvantages:
> > >   * Guest ARP / neighbour tables could get real big  
> > 
> > ...it would also break a number of applications that peek at netlink
> > (or do ioctl()s) to check they are in fact online.  
> 
> Uh.. what exactly are they looking at?  We'd still have at least one
> route, they just wouldn't have gateways attached to them.  But as you
> pointed out above I don't think we can do this with DHCP, which pretty
> much kills it anyway.
> 
> > > The status quo is, roughly, A.1+B.1, except that we also enforce that
> > > the host must have a default route, which sidesteps one of the
> > > complications of B.1.  IIUC, this series is implementing A.2+B.1.
> > > 
> > > Thinking about it, I'm moderately convinced that B.1 is a bad idea.
> > > I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
> > > to have a more concrete guest-visible address for passt itself - but
> > > I'm also open to B.3.  
> > 
> > ...that, especially B.3, sounds like another tool, or at least like
> > another mode, because it conflicts quite a bit with design goals.  
> 
> > They're different from design _choices_ in the sense that that's what
> > I've been "selling" to users and what I and others have been
> > implementing in integrations so far.  
> 
> So the ways L4 transparency are valuable (including guest address) are
> pretty clear to me.  Are there also cases where the (partial) L3
> transparency matter?  They're certainly not obvious to me

Absolutely, and I thought I explained this a number of times, but...
service meshes using netfilter. Applications being moved from "host" to
containers, or from containers to VMs. That's where we want to pretend
nothing changes, and that usually causes more L3 headaches rather than
L4.

> > > I'm not sure about A.1 vs. A.2.  I was leaning towards A.2, but on
> > > further consideration, I feel like the fact that A.1 automatically
> > > works for routing changes on the host might outweigh the fact that he
> > > guest only gets limited information (ICMP) about what's routable.  
> > 
> > I don't think A.2 is doable,  
> 
> ?? AFAICT this series is doing A.2

Not really, we're not figuring out routable prefixes, just blindly
copying routing entries. It's closer to A.2 than A.1, but it's not
that, either.

-- 
Stefano


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
  2023-05-17  6:52       ` Stefano Brivio
@ 2023-05-18  3:26         ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2023-05-18  3:26 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Callum Parsey, me, lemmi

[-- Attachment #1: Type: text/plain, Size: 25054 bytes --]

On Wed, May 17, 2023 at 08:52:50AM +0200, Stefano Brivio wrote:
> On Wed, 17 May 2023 11:15:06 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, May 16, 2023 at 11:42:09PM +0200, Stefano Brivio wrote:
> > > On Tue, 16 May 2023 15:06:29 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:  
> > > > > This series, along with pseudo-related fixes, enables:
> > > > > 
> > > > > - optional copy of all routes from selected interface in outer
> > > > >   namespace, to (hopefully!) fix the issue reported by Callum at:
> > > > >     https://github.com/containers/podman/issues/18539
> > > > > 
> > > > > - optional copy of all addresses, mostly for consistency. It doesn't,
> > > > >   however, enable assignment of multiple addresses in the sense
> > > > >   requested at:
> > > > >     https://bugs.passt.top/show_bug.cgi?id=47
> > > > > 
> > > > >   because the addresses still need to be present on the host, and
> > > > >   the "outer" address isn't selected depending on the address used
> > > > >   inside the container
> > > > > 
> > > > > - operation without a gateway address, to (again, hopefully) support
> > > > >   usage of Wireguard endpoints established outside the container,
> > > > >     https://bugs.passt.top/show_bug.cgi?id=49
> > > > > 
> > > > > I tested the single functionalities introduced here, but I didn't
> > > > > try to reproduce the setups where the issues were reported, so some
> > > > > help with testing is definitely fundamental here. Thanks.    
> > > > 
> > > > I've sent reviews for some of the simpler patches in this series which
> > > > make sense even without the context of the overall aim.  I think those
> > > > can be applied immediately.  
> > > 
> > > Those are actually the least important patches for users  
> > 
> > Well, granted.
> > 
> > > -- and I can't
> > > apply 6/10 without breaking Podman's CI plus probably a number of
> > > deployments (that's why it comes after 5/10)... so, no, I would rather
> > > not apply the rest for the moment.  
> > 
> > Uh.. true, 6/10 is problematic, but I think the other easy ones could
> > be applied safely enough.
> 
> With two hands and (worryingly close to) just 24 hours in a day, I
> honestly can't picture even a quick rebase and retest for those being a
> priority, while happily keeping around the issue that 5/10 fixes.

Yeah, ok, fair enough.  This sort of thing is, of course, why I'm
spending so much effort trying to improve the testing situation.

> > > > For the rest of the series, I want to address the generalities before
> > > > doing detailed review of the implementation.
> > > > 
> > > > I think the basic idea here is sound: we want to expose anything
> > > > routable to the host as routable to the guest, even when the host has
> > > > a more complex routing setup that just a netmask on the "main"
> > > > interface and a default gateway within that prefix.  
> > > 
> > > The intentions behind this series are actually slightly different:
> > > 
> > > - we have a complete breakage in a seemingly common use case (I would
> > >   even say cloud-init setups in general), and I'd like to fix that
> > >   sooner rather than later  
> > 
> > Well, sure, but we should at least think about where we're going with
> > this longer term, so we don't box ourselves in.
> 
> I don't think this is going to "box us in" -- I'm just proposing to
> change this after about two years, and we can definitely change it
> again, as long as things keep working. If keeping things working is
> boxing us in, well, I can't see that as a bad thing.
> 
> That is, I don't see people doing screen-scraping of 'ip route show'
> and writing applications around that, and surely not adapting
> applications to what pasta does. Not at the moment, and surely not for
> a long while.

That's a good point - but also supports what I'm saying below about
having some wiggle room about what we do at L3.

> > > - this concerns only the direct configuration pasta does, with
> > >   --config-net. What we advertise is definitely related, but not the
> > >   same topic... to the point that the issues fixed by this series don't
> > >   even occur with a DHCP client:
> > >     https://github.com/containers/podman/issues/18539#issuecomment-1545023424  
> > 
> > Ah, interesting.  It looks like dhclient (or rather dhclient-script, I
> > expect) is adding an explicit /32 route to the default gateway.  It
> > seems to me the best quick fix for --config-net is to do the same
> > thing.  Basically rather than expanding the netmask as we did in 6/10,
> > if the gateway address is not in the interface's netmask add a /32 or
> > /128 route to the gateway.
> 
> That's the first option I considered (of course!):
>   https://github.com/containers/podman/issues/18539#issuecomment-1545260780
>   https://github.com/containers/podman/issues/18539#issuecomment-1546967377
> 
> and only as I started implementing it, I realised that we can have
> anyway chained dependencies which aren't that easy to handle, especially
> if we admit an arbitrary number of routes and we need to sort them.

I'm not understanding where these dependencies are coming from.  I'm
not proposing copying anything from the host here, what I'm suggesting
is that --config-net do the equivalent of:

	if ($gw not in <ifi's built in prefix>)
		ip route add $gw/32 dev $ifname

> Plus it's going to be 1. more code 2. actually "complicated" code. This
> is stupidly simple instead.
> 
> I have some experience of fixing IPv6 FIB code in kernel with
> consequences on sorting/selection, and there are just so many hidden
> details involved in interpretation of Linux-style routes and ways of
> shadowing them.

Right, this sounds like more reason to be nervous of blindly copying
routes from the host: it's hard to know how we'll have to tweak them
to make sense in the guest context.

> > >   And, in general, we can't advertise everything we can configure (say,
> > >   a route without router over DHCP).  
> > 
> > Ah, true.  The DHCP options for static routes are even more limited
> > than I realized.  Ok, that nixes option B.3.
> > 
> > >   I'd be much more careful about what we advertise. We have direct
> > >   control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
> > >   we need to think of possible interpretations and common half-bugs as
> > >   well.
> > >   
> > > > But I think we want to think a bit more deeply about exactly what we
> > > > need/want to expose here.
> > > > 
> > > > Even with the current code, the default gateway address we advertise
> > > > to the guest is kind of meaningless: the guest cannot directly access
> > > > that gateway, everything really goes through passt on the host.  
> > > 
> > > In the simplest, probably most common network setups, that's actually
> > > the gateway that connects our guest to other nodes.  
> > 
> > I don't understand what you mean by this.  Yes, we have the same IP
> > for the gateway that the host sees, but the NAT to host means that we
> > can't even talk to the gateway at L4.
> 
> It's disabled by default in Podman. It's the default behaviour in
> passt because this started from KubeVirt and that's what they expect,
> but that's about it. Once the address is configurable, this is not a
> valid point, in general.
> 
> A gateway doesn't need to be a host, and it's very often, functionally,
> not a host. This is by design: RFC 791, 2.2:
> 
>   In a gateway the higher level protocols need not be implemented and
>   the GGP functions are added to the IP module.
> 
> > Literally the only thing the guest kernel will do with that gateway
> > address is put it into ARP and neighbour discovery packets, which passt
> > will resolve to its own MAC, like nearly every other IP.
> 
> No, the guest kernel might also have netfilter rules, specifying that
> gateway address, that were originally designed for the host, as if guest
> and passt didn't exist. Those might happily use the gateway address to
> represent the notion of gateway.

Ok... trying to improve my understanding of this case.

If we are using NAT to host on the gateway address, that would
silently change the semantics of ip filter rules using that address,
which doesn't seem great.

If we're not using map-gw, and the rules are about filtering traffic
going to the gateway in it's (possible) capacity as just another host,
then those rules are still valid, regardless of whether that address
is still the gateway in the guest.

Are you considering a case where a guest-side script is building the
iptables rules based on the gateway address (which it retrieves from
ip route show, or dhcp or something)?  Or something else?


> > > For other cases, I think we should eventually implement
> > > https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
> > > saying that, then, we can't just use the same host route no matter what
> > > the container chooses. We'll need to match them.  
> > 
> > Oh.. I'm wondering if I've been confusing by using "host route" in two
> > different ways: one being "a route taken from the passt host system"
> > and the other meaning "a route to a single network host, that is /32
> > or /128".
> > 
> > I agree that we should move to allowing multiple IPs on the guest
> > side, but I don't see how that conflicts with the routing issue here.
> 
> It's related because, once we allow them, different host routes should
> actually be used, so having them in the guest/container too,
> aiming at a 1:1 mapping, should simplify things rather than become
> misleading.

But whatever routers we set in the guest, they'll all ARP back to the
same thing, so I'm not sure what difference it makes.

> > > I mean, I'm not saying that the behaviour from this series is complete
> > > and self-consistent, just that it works around obvious, urgent issues
> > > and at the same time it looks like we'll probably need something
> > > similar to support further use cases.  
> > 
> > Adding a /32 or /128 route to the gateway seems a simpler way to do
> > that to me.  Plus it matches the behaviour that DHCP seems to be doing
> > anyway.
> 
> For the reason why it's not really simple, see above. About what DHCP
> clients do: that's in general not the case for udhcpc, dhcpcd, pump, or
> NetworkManager. See also e1c94637ad50 ("dhcp: Send option 121 if the
> default gateway is not on the assigned subnet").

Oh, I'd missed that option 121 stuff.  I'm guessing dhclient(-script)
is actually setting the route in response to that, rather than it
being built in.  But now I'm even more confused.. what I'm suggesting
is basically the exact equivalent logic for config-net, so I'm not
sure why you're objecting.

[As an aside, option 121 explicitly allows advertising local routes,
which means B.3 is possible with DHCP after all - althouh maybe not
wise, since I gather option 121 isn't necessarily widely supported]

> As far as I know, that's just what the script used in conjunction with
> ISC's dhclient does on _some_ distributions.
> 
> > > > This works because the gateway address (like everything) will ARP/NDP
> > > > to passt's host side MAC address and once the packets hit passt it
> > > > doesn't matter what the guest thought the routing was going to be.
> > > > 
> > > > I think we have a few choices in two more-or-less orthogonal
> > > > categories.
> > > > 
> > > > A) What routable prefixes do we advertise to the guest?
> > > > 
> > > >   A.1) Always a default route (0.0.0.0/0 and ::/0)
> > > > 
> > > > We tell the guest that every address is routable via the passt
> > > > interface, regardless of routing setup on the host.  This essentially
> > > > tells the guest to delegate all routing responsibility to passt.
> > > > 
> > > > Advantages:
> > > >   * Simple
> > > >   * No need to update anything if routing configuration on the host
> > > >     changes
> > > > Disadvantages:
> > > >   * If addresses are unroutable from the host, the guest will only
> > > >     know via ICMP/ICMPv6, rather than statically, which may be a worse
> > > >     UX on the guest side.  Plus we might need to actually implement
> > > >     those host unreachable ICMPs.
> > > >   * Might be messy if the guest has multiple interfacees - e.g. if we
> > > >     allow passt to be configured to attach to a specific host
> > > >     interface only, then we have multiple passts attached to a single
> > > >     guest: they'd all be advertising a default route.
> > > > 
> > > >   A.2) Copy routable prefixes from the host to the guest  
> > > 
> > > I'm having a hard time figuring out the definition of this point. How
> > > would you define that? Strictly speaking, in the case at hand, nothing
> > > is routable: we have a /32 address.  
> > 
> > Right.. which means that if the host is working, it must have an
> > additional static route - also probably /32 - telling it how to get to
> > the gateway.  Indeed I can see it in the bug, initial comment:
> >     172.31.1.1 dev ens3 proto static scope link metric 100
> > With A.2 we'd copy that route to the guest - or at least one with the
> > same prefix (which is a single address in this case).
> 
> Blindly copying routes is one thing. Figuring out what subnets are
> routable and which ones aren't is a different matter, and _that_ is
> what I don't consider generally feasible.

Oh, sorry, I was unclear.  By "routable" I meant just "has a matching
entry in the host's routes" not any deeper analysis of whether they're
actually reachable.  So blindly(ish) copying routes is exactly what
I'm suggesting for A.2

> > > > We just advertise those prefixes routable to the host to the guest
> > > > (which might include an empty prefix == default route).
> > > > 
> > > > Advantages:
> > > >   * Guest statically knows what addresses are routable via the passt
> > > >     interface
> > > > Disadvantages:
> > > >   * What do we do with overlapping prefixes?  On the host we might
> > > >     have more specific routes pointing to a specific interface.  For
> > > >     the guest they all point to the passt interface, so what's the
> > > >     point?
> > > >   * Can we advertise an arbitrary set of static routes via all our
> > > >     mechanisms (--config-net, DHCP, NDP+DHCPv6)?  Even if we can it
> > > >     adds more complexity to that code
> > > >   * How do we update things if the host routing configuration changes?
> > > >   * What do we do if the host has source-based routing or other
> > > >     advanced stuff set up?
> > > > 
> > > > B) What gateway, if any, do we advertise for each route?
> > > > 
> > > >   B.1) Copy it from the host
> > > > 
> > > > Advantages:
> > > >  * Guest L3 configuration resembles that of the host  
> > > 
> > > ...which is a fundamental design goal of passt: transparency, and
> > > pretending it doesn't exist. Otherwise we can have a route, a bridge,
> > > an interface, etc.  
> > 
> > Well... we want to be transparent for anything visible at L4.  For
> > things only visible at L3 - like routes, it's not possible for things
> > to look 100% identical, so I think we have some wiggle room in exactly
> > what we do.
> 
> With this series, in most cases, things will actually be 100% identical.
> 
> But no, the transparency design goal applies especially to L3:
>   https://passt.top/passt/about/#motivation
> 
> let alone ALGs, which are probably less common nowadays -- but with
> service meshes (increasingly common), L3 transparency is very helpful
> to have. Otherwise libslirp would be absolutely enough from that
> perspective.

I'm still not seeing what concrete examples need transparency to the
*routes* rather than just the addresses at L3.

> > > Now, while there are use cases that rely on different aspects of this
> > > transparency (KubeVirt and service mesh integration) I understand this
> > > might sound a bit dogmatic, because you might say there are more
> > > important use cases (which I'm not aware of) or supposed benefits.
> > > 
> > > What's far less dogmatic, though, is how many issues we happily and
> > > automatically avoid by relying on the sanity of the host networking
> > > configuration.
> > > 
> > > By trying to copy it as close as possible, we avoid one very important
> > > source of issues, which is our interpretation or possible lack of
> > > knowledge about how applications we don't know about chose to interact
> > > with kernel and network setups. The main case fixed by this series
> > > shows exactly that: I think it's broken, but it works, and users
> > > expect it to work.
> > > 
> > > And by trusting the host configuration we don't lose much: if that's
> > > broken, almost everything else is broken anyway.  
> > 
> > It's not a question of "trust" in the host configuration, it's the
> > fact that parts of the host configuration don't make sense in the
> > guest's context.  Most obviously the interface names from the host
> > routes can't be used in the guest.
> 
> By default, with containers, even interface names are copied. Indices,
> of course, aren't, but that's not something users or applications
> typically try to fiddle with.

Huh?  If the interface name is anything other than the pasta template
interface, it won't even exist in the guest.  So how can you possibly
copy it?

> > We can and do use the same
> > addresses for the routers, but what does it really mean?  The guest
> > can't actually contact them as neighbours - when it tries they just
> > ARP to passt's fake MAC and the packets get routed by the host kernel
> > regardless of what router the guest was trying to send them to - in
> > fact neither passt nor the host kernel will even know what router the
> > guest thought it was using.
> 
> This is an L2 matter, which was never a problem for any project using
> this.

It's really not, we'll absolutely hit the IP stack for routing, not
just the datalink layer in the host kernel.

> > > > Disadvantages:
> > > >  * If the host route doesn't have a gateway we have to fall back on
> > > >    B.2 or B.3 anyway  
> > > 
> > > Well, they are a particular case of B.1 then: what's the disadvantage?  
> > 
> > Two cases is more complex than one.
> 
> ...but, with this series, we don't implement two different cases...?
> 
> > > This is consistent (especially with this series, and especially if we
> > > start adapting the *default* behaviours in this sense).
> > >   
> > > >  * Misleading: in fact everything is routed by passt and the host
> > > >    before it reaches any gateway we're listing here  
> > > 
> > > But passt isn't supposed to be a router...? Let's say we have multiple
> > > routes on the host, we configure or advertise multiple routes to the
> > > guest. Does that make passt a router? I don't think so: we're just
> > > associating them as closely as possible, without fancy interpretations.
> > > 
> > > A router has its own routing table, passt's would simply be a copy.
> > > Right now it has essentially none.  
> > 
> > Sorry, by "passt" here I really meant the host kernel, which
> > absolutely will route the packets.  There's no guarantee they'll even
> > go next to the router the guest thought it was using, although it's
> > likely.
> 
> Right. I'd just try to make it as likely as possible. That doesn't come
> with this series, and, for instance, me@yawnt.com already checked and
> told me this series isn't enough for the "regular" Wireguard case (with
> the endpoint in the outer namespace), but this clearly appears to be
> getting closer to what we'll need to "naturally" support that.
> 
> > > >   B.2) Pick an address to represent passt as gateway
> > > > 
> > > > Advantages:
> > > >  * Accurately represents that everything is routed by passt  
> > > 
> > > This is configurable, actually, but no, I insist that passt isn't
> > > *functionally* routing anything, or at least that we should get as
> > > close as possible to that.  
> > 
> > Again, the host kernel definitely will, and there's no avoiding that.
> > 
> > > >  * We can make this the same as the NAT-to-host address, so we only
> > > >    have one "magic" address (per AF)  
> > > 
> > > Not really, if it's configurable.  
> > 
> > I mean one per passt instance, not one globally.  As opposed to the
> > gateway address and the NAT-to-host address being potentially
> > different magic addresses in a single instance.
> > 
> > > > Disadvantages:
> > > >  * Have to allocate an address that's safe, which is tricky (but we
> > > >    usually want this for NAT-to-host anyway)  
> > > 
> > > There's a difference between picking an address by default and letting
> > > the user configure one. Besides, at least for IPv4, I don't think such
> > > an address exists.  
> > 
> > There certainly isn't one we can use everywhere.  I think we have some
> > options for probing one that will be safe in a particular case.
> 
> Well... that doesn't sound great.

Of course.  We're trying to allocate scarce IP space out of nowhere,
it's always going to be messy.

> > > >  * Do we want just one address, or one for each distinct gateway from
> > > >    the host?
> > > >  * If we can't pick something in the interfaces "natural" prefix, we
> > > >    will also need to advertise a static route to reach it.
> > > > 
> > > >   B.3) Don't advertise a gateway for any route
> > > > 
> > > > passt essentially proxy ARPs for the entire internet.
> > > > 
> > > > Advantages:
> > > >   * No need to allocate an address - in fact passt need not have any
> > > >     guest facing IP at all
> > > >   * Extends naturally if we ever have a guest<->passt transport that's
> > > >     point-to-point rather than pseudo-ethernet
> > > > Disadvantages:
> > > >   * Guest ARP / neighbour tables could get real big  
> > > 
> > > ...it would also break a number of applications that peek at netlink
> > > (or do ioctl()s) to check they are in fact online.  
> > 
> > Uh.. what exactly are they looking at?  We'd still have at least one
> > route, they just wouldn't have gateways attached to them.  But as you
> > pointed out above I don't think we can do this with DHCP, which pretty
> > much kills it anyway.
> > 
> > > > The status quo is, roughly, A.1+B.1, except that we also enforce that
> > > > the host must have a default route, which sidesteps one of the
> > > > complications of B.1.  IIUC, this series is implementing A.2+B.1.
> > > > 
> > > > Thinking about it, I'm moderately convinced that B.1 is a bad idea.
> > > > I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
> > > > to have a more concrete guest-visible address for passt itself - but
> > > > I'm also open to B.3.  
> > > 
> > > ...that, especially B.3, sounds like another tool, or at least like
> > > another mode, because it conflicts quite a bit with design goals.  
> > 
> > > They're different from design _choices_ in the sense that that's what
> > > I've been "selling" to users and what I and others have been
> > > implementing in integrations so far.  
> > 
> > So the ways L4 transparency are valuable (including guest address) are
> > pretty clear to me.  Are there also cases where the (partial) L3
> > transparency matter?  They're certainly not obvious to me
> 
> Absolutely, and I thought I explained this a number of times, but...
> service meshes using netfilter. Applications being moved from "host" to
> containers, or from containers to VMs. That's where we want to pretend
> nothing changes, and that usually causes more L3 headaches rather than
> L4.

I'm clear on how NAT breaks everything.  But I'm still not seeing how
the list of routes matters.

> > > > I'm not sure about A.1 vs. A.2.  I was leaning towards A.2, but on
> > > > further consideration, I feel like the fact that A.1 automatically
> > > > works for routing changes on the host might outweigh the fact that he
> > > > guest only gets limited information (ICMP) about what's routable.  
> > > 
> > > I don't think A.2 is doable,  
> > 
> > ?? AFAICT this series is doing A.2
> 
> Not really, we're not figuring out routable prefixes, just blindly
> copying routing entries. It's closer to A.2 than A.1, but it's not
> that, either.

Again, the only "figuring out of routable prefixes" I was suggesting
for A.2 is "is there an entry in the host routing table".
i.e. blindly copying route entries.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-05-18  3:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-14 18:14 [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes Stefano Brivio
2023-05-14 18:14 ` [PATCH 01/10] netlink: Fix comment about response buffer size for nl_req() Stefano Brivio
2023-05-16  3:23   ` David Gibson
2023-05-14 18:14 ` [PATCH 02/10] pasta: Improve error handling on failure to join network namespace Stefano Brivio
2023-05-16  3:24   ` David Gibson
2023-05-14 18:14 ` [PATCH 03/10] netlink: Add functionality to copy routes from outer namespace Stefano Brivio
2023-05-14 18:14 ` [PATCH 04/10] conf: --config-net option is for pasta mode only Stefano Brivio
2023-05-16  3:59   ` David Gibson
2023-05-14 18:14 ` [PATCH 05/10] conf, pasta: With --config-net, copy all routes by default Stefano Brivio
2023-05-14 18:14 ` [PATCH 06/10] Revert "conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway" Stefano Brivio
2023-05-16  4:00   ` David Gibson
2023-05-14 18:14 ` [PATCH 07/10] conf: Don't exit if sourced default route has no gateway Stefano Brivio
2023-05-14 18:14 ` [PATCH 08/10] netlink: Add functionality to copy addresses from outer namespace Stefano Brivio
2023-05-14 18:14 ` [PATCH 09/10] conf, pasta: With --config-net, copy all addresses by default Stefano Brivio
2023-05-14 18:14 ` [PATCH 10/10] passt.h: Fix description of pasta_ifi in struct ctx Stefano Brivio
2023-05-16  4:03   ` David Gibson
2023-05-16  5:06 ` [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes David Gibson
2023-05-16 21:42   ` Stefano Brivio
2023-05-17  1:15     ` David Gibson
2023-05-17  6:52       ` Stefano Brivio
2023-05-18  3:26         ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).