* [PATCH v3 0/4] Don't expose container loopback services to the host
@ 2024-10-02 5:48 David Gibson
2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
podman issue #24045 pointed out that pasta's spliced forwarding logic
can expose services within the namespace bound only to 127.0.0.1 or
::1 to the host. However, the namespace probably expects those to
only be accessible to itself, so that's probably not what we want.
This changes our forwarding logic so as not to do this. Note that the
podman tests will currently fail with this series, I've submitted
podman PR https://github.com/containers/podman/pull/24064 to fix that.
Link: https://github.com/containers/podman/issues/24045
Changes since v2:
* Add new field do structure comment
Changes since v1:
* Add --host-lo-to-ns-lo option to preserve the old behaviour
* Clarify the new behaviour in the man page
* Add some extra patches making some other corrections to the man
page
David Gibson (4):
passt.1: Mark --stderr as deprecated more prominently
passt.1: Clarify and update "Handling of local addresses" section
test: Clarify test for spliced inbound transfers
fwd: Direct inbound spliced forwards to the guest's external address
conf.c | 9 ++++++
fwd.c | 31 +++++++++++++-----
passt.1 | 75 +++++++++++++++++++++++++++-----------------
passt.h | 2 ++
test/passt_in_ns/tcp | 8 ++---
test/passt_in_ns/udp | 4 +--
test/pasta/tcp | 16 +++++-----
test/pasta/udp | 8 ++---
8 files changed, 98 insertions(+), 55 deletions(-)
--
2.46.2
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently
2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
@ 2024-10-02 5:48 ` David Gibson
2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson
` (2 subsequent siblings)
3 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add
one to make it easier to spot.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
passt.1 | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/passt.1 b/passt.1
index 79d134d..acc1f92 100644
--- a/passt.1
+++ b/passt.1
@@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change.
Default is to fork into background.
.TP
-.BR \-e ", " \-\-stderr
+.BR \-e ", " \-\-stderr " " (DEPRECATED)
This option has no effect, and is maintained for compatibility purposes only.
Note that this configuration option is \fBdeprecated\fR and will be removed in a
--
@@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change.
Default is to fork into background.
.TP
-.BR \-e ", " \-\-stderr
+.BR \-e ", " \-\-stderr " " (DEPRECATED)
This option has no effect, and is maintained for compatibility purposes only.
Note that this configuration option is \fBdeprecated\fR and will be removed in a
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section
2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson
@ 2024-10-02 5:48 ` David Gibson
2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson
2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson
3 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour. Update it accordingly.
It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host. Use "loopback
address" to be more precise about this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
passt.1 | 54 ++++++++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 26 deletions(-)
diff --git a/passt.1 b/passt.1
index acc1f92..11104e1 100644
--- a/passt.1
+++ b/passt.1
@@ -863,38 +863,40 @@ root@localhost's password:
.SH NOTES
-.SS Handling of traffic with local destination and source addresses
-
-Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address,
-depending on the configuration. Local destination or source addresses need to be
-changed before packets are delivered to the guest or target namespace: most
-operating systems would drop packets received from non-loopback interfaces with
-local addresses, and it would also be impossible for guest or target namespace
-to route answers back.
-
-For convenience, and somewhat arbitrarily, the source address on these packets
-is translated to the address of the default IPv4 or IPv6 gateway (if any) --
-this is known to be an existing, valid address on the same subnet.
-
-Loopback destination addresses are instead translated to the observed external
-address of the guest or target namespace. For IPv6 packets, if usage of a
-link-local address by guest or namespace has ever been observed, and the
-original destination address is also a link-local address, the observed
-link-local address is used. Otherwise, the observed global address is used. For
-both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses
-will be used instead.
+.SS Handling of traffic with loopback destination and source addresses
+
+Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback
+address (127.0.0.0/8 or ::1), depending on the configuration. Loopback
+destination or source addresses need to be changed before packets are
+delivered to the guest or target namespace: most operating systems
+would drop packets received with loopback addresses on non-loopback
+interfaces, and it would also be impossible for guest or target
+namespace to route answers back.
+
+For convenience, the source address on these packets is translated to
+the address specified by the \fB\-\-map-host-loopback\fR option. If
+not specified this defaults, somewhat arbitrarily, to the address of
+default IPv4 or IPv6 gateway (if any) -- this is known to be an
+existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
+\fB\-\-map-host-loopback none\fR are specified this translation is
+disabled and packets with loopback addresses are simply dropped.
+
+Loopback destination addresses are translated to the observed external
+address of the guest or target namespace. For IPv6, the observed
+link-local address is used if the translated source address is
+link-local, otherwise the observed global address is used. For both
+IPv4 and IPv6, if no addresses have been seen yet, the configured
+addresses will be used instead.
For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1,
with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
-Similarly, for traffic coming from guest or namespace, packets with destination
-address corresponding to the default gateway will have their destination address
-translated to a loopback address, if and only if a packet, in the opposite
-direction, with a loopback destination or source address, port-wise matching for
-UDP, or connection-wise for TCP, has been recently forwarded to guest or
-namespace. This behaviour can be disabled with \-\-no\-map\-gw.
+Similarly, for traffic coming from guest or namespace, packets with
+destination address corresponding to the \fB\-\-map-host-loopback\fR
+address will have their destination address translated to a loopback
+address.
.SS Handling of local traffic in pasta
--
@@ -863,38 +863,40 @@ root@localhost's password:
.SH NOTES
-.SS Handling of traffic with local destination and source addresses
-
-Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address,
-depending on the configuration. Local destination or source addresses need to be
-changed before packets are delivered to the guest or target namespace: most
-operating systems would drop packets received from non-loopback interfaces with
-local addresses, and it would also be impossible for guest or target namespace
-to route answers back.
-
-For convenience, and somewhat arbitrarily, the source address on these packets
-is translated to the address of the default IPv4 or IPv6 gateway (if any) --
-this is known to be an existing, valid address on the same subnet.
-
-Loopback destination addresses are instead translated to the observed external
-address of the guest or target namespace. For IPv6 packets, if usage of a
-link-local address by guest or namespace has ever been observed, and the
-original destination address is also a link-local address, the observed
-link-local address is used. Otherwise, the observed global address is used. For
-both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses
-will be used instead.
+.SS Handling of traffic with loopback destination and source addresses
+
+Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback
+address (127.0.0.0/8 or ::1), depending on the configuration. Loopback
+destination or source addresses need to be changed before packets are
+delivered to the guest or target namespace: most operating systems
+would drop packets received with loopback addresses on non-loopback
+interfaces, and it would also be impossible for guest or target
+namespace to route answers back.
+
+For convenience, the source address on these packets is translated to
+the address specified by the \fB\-\-map-host-loopback\fR option. If
+not specified this defaults, somewhat arbitrarily, to the address of
+default IPv4 or IPv6 gateway (if any) -- this is known to be an
+existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
+\fB\-\-map-host-loopback none\fR are specified this translation is
+disabled and packets with loopback addresses are simply dropped.
+
+Loopback destination addresses are translated to the observed external
+address of the guest or target namespace. For IPv6, the observed
+link-local address is used if the translated source address is
+link-local, otherwise the observed global address is used. For both
+IPv4 and IPv6, if no addresses have been seen yet, the configured
+addresses will be used instead.
For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1,
with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
-Similarly, for traffic coming from guest or namespace, packets with destination
-address corresponding to the default gateway will have their destination address
-translated to a loopback address, if and only if a packet, in the opposite
-direction, with a loopback destination or source address, port-wise matching for
-UDP, or connection-wise for TCP, has been recently forwarded to guest or
-namespace. This behaviour can be disabled with \-\-no\-map\-gw.
+Similarly, for traffic coming from guest or namespace, packets with
+destination address corresponding to the \fB\-\-map-host-loopback\fR
+address will have their destination address translated to a loopback
+address.
.SS Handling of local traffic in pasta
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 3/4] test: Clarify test for spliced inbound transfers
2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson
2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson
@ 2024-10-02 5:48 ` David Gibson
2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson
3 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1. This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace. That's not an inherent property for "spliced" connections
and arguably an undesirable one. Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.
Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.
Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
test/passt_in_ns/tcp | 8 ++++----
test/passt_in_ns/udp | 4 ++--
test/pasta/tcp | 16 ++++++++--------
test/pasta/udp | 8 ++++----
4 files changed, 18 insertions(+), 18 deletions(-)
diff --git a/test/passt_in_ns/tcp b/test/passt_in_ns/tcp
index aaf340e..319880b 100644
--- a/test/passt_in_ns/tcp
+++ b/test/passt_in_ns/tcp
@@ -32,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_big.bin /root/big.bin
-test TCP/IPv4: host to ns: big transfer
+test TCP/IPv4: host to ns (spliced): big transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
@@ -90,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001
guestw
guest cmp test_small.bin /root/small.bin
-test TCP/IPv4: host to ns: small transfer
+test TCP/IPv4: host to ns (spliced): small transfer
nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
@@ -146,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001
guestw
guest cmp test_big.bin /root/big.bin
-test TCP/IPv6: host to ns: big transfer
+test TCP/IPv6: host to ns (spliced): big transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
@@ -204,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001
guestw
guest cmp test_small.bin /root/small.bin
-test TCP/IPv6: host to ns: small transfer
+test TCP/IPv6: host to ns (spliced): small transfer
nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002
diff --git a/test/passt_in_ns/udp b/test/passt_in_ns/udp
index 3426ab9..791511c 100644
--- a/test/passt_in_ns/udp
+++ b/test/passt_in_ns/udp
@@ -30,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
-test UDP/IPv4: host to ns
+test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
@@ -88,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null
guestw
guest cmp test.bin /root/medium.bin
-test UDP/IPv6: host to ns
+test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
sleep 1
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
diff --git a/test/pasta/tcp b/test/pasta/tcp
index 6ab18c5..53b6f25 100644
--- a/test/pasta/tcp
+++ b/test/pasta/tcp
@@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin
set TEMP_SMALL __STATEDIR__/test_small.bin
set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin
-test TCP/IPv4: host to ns: big transfer
-nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc
+test TCP/IPv4: host to ns (spliced): big transfer
+nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
-test TCP/IPv4: host to ns: small transfer
-nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc
+test TCP/IPv4: host to ns (spliced): small transfer
+nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__
@@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003
hostw
check cmp __BASEPATH__/small.bin __TEMP_SMALL__
-test TCP/IPv6: host to ns: big transfer
-nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc
+test TCP/IPv6: host to ns (spliced): big transfer
+nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc
host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__
@@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003
hostw
check cmp __BASEPATH__/big.bin __TEMP_BIG__
-test TCP/IPv6: host to ns: small transfer
-nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc
+test TCP/IPv6: host to ns (spliced): small transfer
+nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc
host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002
nsw
check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__
diff --git a/test/pasta/udp b/test/pasta/udp
index 30e3a85..7734d02 100644
--- a/test/pasta/udp
+++ b/test/pasta/udp
@@ -17,8 +17,8 @@ htools dd socat ip jq
set TEMP __STATEDIR__/test.bin
set TEMP_NS __STATEDIR__/test_ns.bin
-test UDP/IPv4: host to ns
-nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc
+test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
+nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
@@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null
hostw
check cmp __BASEPATH__/medium.bin __TEMP__
-test UDP/IPv6: host to ns
-nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc
+test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
+nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
--
@@ -17,8 +17,8 @@ htools dd socat ip jq
set TEMP __STATEDIR__/test.bin
set TEMP_NS __STATEDIR__/test_ns.bin
-test UDP/IPv4: host to ns
-nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc
+test UDP/IPv4: host to ns (recvmmsg/sendmmsg)
+nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
@@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null
hostw
check cmp __BASEPATH__/medium.bin __TEMP__
-test UDP/IPv6: host to ns
-nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc
+test UDP/IPv6: host to ns (recvmmsg/sendmmsg)
+nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc
host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null
nsw
check cmp __BASEPATH__/medium.bin __TEMP_NS__
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
` (2 preceding siblings ...)
2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson
@ 2024-10-02 5:48 ` David Gibson
2024-10-09 13:07 ` Stefano Brivio
3 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
conf.c | 9 +++++++++
fwd.c | 31 +++++++++++++++++++++++--------
passt.1 | 23 +++++++++++++++++++----
passt.h | 2 ++
4 files changed, 53 insertions(+), 12 deletions(-)
diff --git a/conf.c b/conf.c
index 6e62510..b5318f3 100644
--- a/conf.c
+++ b/conf.c
@@ -908,6 +908,9 @@ pasta_opts:
" -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
" SPEC is as described above\n"
" default: auto\n"
+ " --host-lo-to-ns-lo DEPRECATED:\n"
+ " Translate host-loopback forwards to\n"
+ " namespace loopback\n"
" --userns NSPATH Target user namespace to join\n"
" --netns PATH|NAME Target network namespace to join\n"
" --netns-only Don't join existing user namespace\n"
@@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
{"netns-only", no_argument, NULL, 20 },
{"map-host-loopback", required_argument, NULL, 21 },
{"map-guest-addr", required_argument, NULL, 22 },
+ {"host-lo-to-ns-lo", no_argument, NULL, 23 },
{ 0 },
};
const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
conf_nat(optarg, &c->ip4.map_guest_addr,
&c->ip6.map_guest_addr, NULL);
break;
+ case 23:
+ if (c->mode != MODE_PASTA)
+ die("--host-lo-to-ns-lo is for pasta mode only");
+ c->host_lo_to_ns_lo = 1;
+ break;
case 'd':
c->debug = 1;
c->quiet = 0;
diff --git a/fwd.c b/fwd.c
index a505098..c71f5e1 100644
--- a/fwd.c
+++ b/fwd.c
@@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
(proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
/* spliceable */
- /* Preserve the specific loopback adddress used, but let the
- * kernel pick a source port on the target side
+ /* The traffic will go over the guest's 'lo' interface, but by
+ * default use its external address, so we don't inadvertently
+ * expose services that listen only on the guest's loopback
+ * address. That can be overridden by --host-lo-to-ns-lo which
+ * will instead forward to the loopback address in the guest.
+ *
+ * In either case, let the kernel pick the source address to
+ * match.
*/
- tgt->oaddr = ini->eaddr;
+ if (inany_v4(&ini->eaddr)) {
+ if (c->host_lo_to_ns_lo)
+ tgt->eaddr = inany_loopback4;
+ else
+ tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
+ tgt->oaddr = inany_any4;
+ } else {
+ if (c->host_lo_to_ns_lo)
+ tgt->eaddr = inany_loopback6;
+ else
+ tgt->eaddr.a6 = c->ip6.addr_seen;
+ tgt->oaddr = inany_any6;
+ }
+
+ /* Let the kernel pick source port */
tgt->oport = 0;
if (proto == IPPROTO_UDP)
/* But for UDP preserve the source port */
tgt->oport = ini->eport;
- if (inany_v4(&ini->eaddr))
- tgt->eaddr = inany_loopback4;
- else
- tgt->eaddr = inany_loopback6;
-
return PIF_SPLICE;
}
diff --git a/passt.1 b/passt.1
index 11104e1..332384c 100644
--- a/passt.1
+++ b/passt.1
@@ -586,6 +586,13 @@ Configure UDP port forwarding from target namespace to init namespace.
Default is \fBauto\fR.
+.TP
+.BR \-\-host-lo-to-ns-lo " " (DEPRECATED)
+If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from
+the host's loopback address will appear on the loopback address in the
+guest as well. Without this option such forwarded packets will appear
+to come from the guest's public address.
+
.TP
.BR \-\-userns " " \fIspec
Target user namespace to join, as a path. If PID is given, without this option,
@@ -874,8 +881,9 @@ interfaces, and it would also be impossible for guest or target
namespace to route answers back.
For convenience, the source address on these packets is translated to
-the address specified by the \fB\-\-map-host-loopback\fR option. If
-not specified this defaults, somewhat arbitrarily, to the address of
+the address specified by the \fB\-\-map-host-loopback\fR option (with
+some exceptions in pasta mode, see next section below). If not
+specified this defaults, somewhat arbitrarily, to the address of
default IPv4 or IPv6 gateway (if any) -- this is known to be an
existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or
\fB\-\-map-host-loopback none\fR are specified this translation is
@@ -912,8 +920,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair
of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet
transfers.
-This bypass only applies to local connections and traffic, because it's not
-possible to bind sockets to foreign addresses.
+Because it's not possible to bind sockets to foreign addresses, this
+bypass only applies to local connections and traffic. It also means
+that the address translation differs slightly from passt mode.
+Connections from loopback to loopback on the host will appear to come
+from the target namespace's public address within the guest, unless
+\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will
+appear to come from loopback in the namespace as well. The latter
+behaviour used to be the default, but is usually undesirable, since it
+can unintentionally expose namespace local services to the host.
.SS Binding to low numbered ports (well-known or system ports, up to 1023)
diff --git a/passt.h b/passt.h
index 031c9b6..f7b7a58 100644
--- a/passt.h
+++ b/passt.h
@@ -225,6 +225,7 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
+ * @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
*/
@@ -284,6 +285,7 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
+ int host_lo_to_ns_lo;
int low_wmem;
int low_rmem;
--
@@ -225,6 +225,7 @@ struct ip6_ctx {
* @no_dhcpv6: Disable DHCPv6 server
* @no_ndp: Disable NDP handler altogether
* @no_ra: Disable router advertisements
+ * @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses
* @low_wmem: Low probed net.core.wmem_max
* @low_rmem: Low probed net.core.rmem_max
*/
@@ -284,6 +285,7 @@ struct ctx {
int no_dhcpv6;
int no_ndp;
int no_ra;
+ int host_lo_to_ns_lo;
int low_wmem;
int low_rmem;
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson
@ 2024-10-09 13:07 ` Stefano Brivio
2024-10-09 20:44 ` Stefano Brivio
0 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2024-10-09 13:07 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Wed, 2 Oct 2024 15:48:26 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> In pasta mode, where addressing permits we "splice" connections, forwarding
> directly from host socket to guest/container socket without any L2 or L3
> processing. This gives us a very large performance improvement when it's
> possible.
>
> Since the traffic is from a local socket within the guest, it will go over
> the guest's 'lo' interface, and accordingly we set the guest side address
> to be the loopback address. However this has a surprising side effect:
> sometimes guests will run services that are only supposed to be used within
> the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
> forwarding exposes those services to the host, which isn't generally what
> we want.
>
> Correct this by instead forwarding inbound "splice" flows to the guest's
> external address.
>
> Link: https://github.com/containers/podman/issues/24045
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> conf.c | 9 +++++++++
> fwd.c | 31 +++++++++++++++++++++++--------
> passt.1 | 23 +++++++++++++++++++----
> passt.h | 2 ++
> 4 files changed, 53 insertions(+), 12 deletions(-)
>
> diff --git a/conf.c b/conf.c
> index 6e62510..b5318f3 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -908,6 +908,9 @@ pasta_opts:
> " -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
> " SPEC is as described above\n"
> " default: auto\n"
> + " --host-lo-to-ns-lo DEPRECATED:\n"
> + " Translate host-loopback forwards to\n"
> + " namespace loopback\n"
> " --userns NSPATH Target user namespace to join\n"
> " --netns PATH|NAME Target network namespace to join\n"
> " --netns-only Don't join existing user namespace\n"
> @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
> {"netns-only", no_argument, NULL, 20 },
> {"map-host-loopback", required_argument, NULL, 21 },
> {"map-guest-addr", required_argument, NULL, 22 },
> + {"host-lo-to-ns-lo", no_argument, NULL, 23 },
> { 0 },
> };
> const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
> conf_nat(optarg, &c->ip4.map_guest_addr,
> &c->ip6.map_guest_addr, NULL);
> break;
> + case 23:
> + if (c->mode != MODE_PASTA)
> + die("--host-lo-to-ns-lo is for pasta mode only");
> + c->host_lo_to_ns_lo = 1;
> + break;
> case 'd':
> c->debug = 1;
> c->quiet = 0;
> diff --git a/fwd.c b/fwd.c
> index a505098..c71f5e1 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> /* spliceable */
>
> - /* Preserve the specific loopback adddress used, but let the
> - * kernel pick a source port on the target side
> + /* The traffic will go over the guest's 'lo' interface, but by
> + * default use its external address, so we don't inadvertently
> + * expose services that listen only on the guest's loopback
> + * address. That can be overridden by --host-lo-to-ns-lo which
> + * will instead forward to the loopback address in the guest.
> + *
> + * In either case, let the kernel pick the source address to
> + * match.
> */
> - tgt->oaddr = ini->eaddr;
> + if (inany_v4(&ini->eaddr)) {
> + if (c->host_lo_to_ns_lo)
> + tgt->eaddr = inany_loopback4;
> + else
> + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> + tgt->oaddr = inany_any4;
> + } else {
> + if (c->host_lo_to_ns_lo)
> + tgt->eaddr = inany_loopback6;
> + else
> + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
> + tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the
"TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
sometimes (about one in three/four runs), that's what I mistakenly
reported as coming from Laurent's series at:
https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/
It hangs like this (display with >= 240 columns):
ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname ==
"enp9s0").addr_info[0].local'
│...passed. 88.198.0.164
│
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst ==
"default").gateway'
│Starting test: TCP/IPv4: ns to host (spliced): big
transfer 88.198.0.161
│?
cmp /home/sbrivio/passt/test/big.bin
/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq
-rM '.[] | select(.ifname == "enp9s0").mtu'
│...passed. 65520
│
ns$ /sbin/dhclient -6 --no-pid enp9s0
│Starting
test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr
show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] |
select(.prefixlen == 128).local] | .[0]' │? cmp
/home/sbrivio/passt/test/big.bin
/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2
│...passed.
ns$ ip -j -6 route show|jq -rM '.[] | select(.dst ==
"default").gateway'
│ fe80::1
│Starting
test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip
jq >/dev/null
│? cmp
/home/sbrivio/passt/test/small.bin
/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u
TCP4-LISTEN:10002
OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc
│...passed. ns$ socat -u
OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003
│ ns$ ip -j -4
route show|jq -rM '.[] | select(.dst == "default").gateway'
│Starting test:
TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161
│?
cmp /home/sbrivio/passt/test/small.bin
/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u
OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003
│...passed. ns$
socat -u TCP4-LISTEN:10002
OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc
│ ns$ socat
OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003
│Starting
test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route
show|jq -rM '.[] | select(.dst == "default").gateway'
│? cmp
/home/sbrivio/passt/test/small.bin
/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161
│...passed.
ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin
TCP4:88.198.0.161:10003
│ ns$ strace socat -u TCP6-LISTEN:10002
OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc
2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00
...even without strace. The client is done, the server hangs.
If I unblock this manually by re-running the same client command, the
server wakes up, writes the file, and terminates, and the test
continues normally.
Those three "received NS, sending NA" messages in the pasta pane are
printed in a short time after the test starts.
If I run this with TRACE=1 (which needs the patch I just sent), this
is pasta's debugging output for this test:
--
6.1401: pasta: epoll event on listening TCP socket 6 (events:
0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW
6.1402: Flow 0 (INI): NEW -> INI
6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ?
6.1402: Flow 0 (TGT): INI -> TGT
6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0
-> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection
(spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)):
HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
[2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)):
event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection
(spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)):
TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST
[::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
[2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun
device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA
7.0006: pasta: epoll event on namespace timer watch 12 (events:
0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP
(spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced):
cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection
(spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection
(spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on
/dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS,
sending NA 8.0006: pasta: epoll event on namespace timer watch 12
(events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at
tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at
tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device
13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006:
pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
9.2065: pasta: epoll event on connected spliced TCP socket 118 (events:
0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on
socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)):
flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection
(spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)):
CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE):
HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
[2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace
timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on
namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll
event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta:
epoll event on namespace timer watch 12 (events: 0x00000001) [...] --
Relevant parts of strace output from the client:
--
openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5
ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate
ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
connect(6, {sa_family=AF_INET6, sin6_port=htons(10002),
sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr),
sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6,
sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6,
"::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5],
[6], [], NULL, NULL) = 2 (in [5], out [6]) read(5,
"\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"...,
8192) = 8192 write(6,
"\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"...,
8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out
[6]) read(5,
"\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"...,
8192) = 8192 write(6,
"\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"...,
8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out
[6])
[...]
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "", 8192) = 0
shutdown(6, SHUT_WR) = 0
shutdown(6, SHUT_RDWR) = 0
exit_group(0) = ?
+++ exited with 0 +++
--
and from the server:
--
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET6, sin6_port=htons(10002),
sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr),
sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002),
sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr),
sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL --
If I connect from the host without a server in the namespace (but
with the port forwarded by pasta), I get a connection reset, and
if the port is not forwarded by pasta, connection refused.
But this is another case: we start connecting and accept the
connection (probably we shouldn't). Note the "No route to host"
error on the socket.
It looks somehow similar to the race I fixed with commit
f4e9f26480ef ("pasta: Disable neighbour solicitations on device
up to prevent DAD"), but it doesn't look like an invalid
c->ip6.addr_seen, because otherwise pasta would reset the
connection, I suppose.
I haven't debugged further yet. This looks like an existing
issue in pasta rather than in this series or in the tests,
but it blocks tests, so I haven't applied this yet.
--
Stefano
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-09 13:07 ` Stefano Brivio
@ 2024-10-09 20:44 ` Stefano Brivio
2024-10-10 5:57 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2024-10-09 20:44 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Wed, 9 Oct 2024 15:07:21 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:
> On Wed, 2 Oct 2024 15:48:26 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > In pasta mode, where addressing permits we "splice" connections, forwarding
> > directly from host socket to guest/container socket without any L2 or L3
> > processing. This gives us a very large performance improvement when it's
> > possible.
> >
> > Since the traffic is from a local socket within the guest, it will go over
> > the guest's 'lo' interface, and accordingly we set the guest side address
> > to be the loopback address. However this has a surprising side effect:
> > sometimes guests will run services that are only supposed to be used within
> > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
> > forwarding exposes those services to the host, which isn't generally what
> > we want.
> >
> > Correct this by instead forwarding inbound "splice" flows to the guest's
> > external address.
> >
> > Link: https://github.com/containers/podman/issues/24045
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > conf.c | 9 +++++++++
> > fwd.c | 31 +++++++++++++++++++++++--------
> > passt.1 | 23 +++++++++++++++++++----
> > passt.h | 2 ++
> > 4 files changed, 53 insertions(+), 12 deletions(-)
> >
> > diff --git a/conf.c b/conf.c
> > index 6e62510..b5318f3 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -908,6 +908,9 @@ pasta_opts:
> > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
> > " SPEC is as described above\n"
> > " default: auto\n"
> > + " --host-lo-to-ns-lo DEPRECATED:\n"
> > + " Translate host-loopback forwards to\n"
> > + " namespace loopback\n"
> > " --userns NSPATH Target user namespace to join\n"
> > " --netns PATH|NAME Target network namespace to join\n"
> > " --netns-only Don't join existing user namespace\n"
> > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
> > {"netns-only", no_argument, NULL, 20 },
> > {"map-host-loopback", required_argument, NULL, 21 },
> > {"map-guest-addr", required_argument, NULL, 22 },
> > + {"host-lo-to-ns-lo", no_argument, NULL, 23 },
> > { 0 },
> > };
> > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
> > conf_nat(optarg, &c->ip4.map_guest_addr,
> > &c->ip6.map_guest_addr, NULL);
> > break;
> > + case 23:
> > + if (c->mode != MODE_PASTA)
> > + die("--host-lo-to-ns-lo is for pasta mode only");
> > + c->host_lo_to_ns_lo = 1;
> > + break;
> > case 'd':
> > c->debug = 1;
> > c->quiet = 0;
> > diff --git a/fwd.c b/fwd.c
> > index a505098..c71f5e1 100644
> > --- a/fwd.c
> > +++ b/fwd.c
> > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > /* spliceable */
> >
> > - /* Preserve the specific loopback adddress used, but let the
> > - * kernel pick a source port on the target side
> > + /* The traffic will go over the guest's 'lo' interface, but by
> > + * default use its external address, so we don't inadvertently
> > + * expose services that listen only on the guest's loopback
> > + * address. That can be overridden by --host-lo-to-ns-lo which
> > + * will instead forward to the loopback address in the guest.
> > + *
> > + * In either case, let the kernel pick the source address to
> > + * match.
> > */
> > - tgt->oaddr = ini->eaddr;
> > + if (inany_v4(&ini->eaddr)) {
> > + if (c->host_lo_to_ns_lo)
> > + tgt->eaddr = inany_loopback4;
> > + else
> > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > + tgt->oaddr = inany_any4;
> > + } else {
> > + if (c->host_lo_to_ns_lo)
> > + tgt->eaddr = inany_loopback6;
> > + else
> > + tgt->eaddr.a6 = c->ip6.addr_seen;
>
> Either this...
>
> > + tgt->oaddr = inany_any6;
>
> or this (and not something before this patch, up to 3/4) make the
> "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> sometimes (about one in three/four runs), that's what I mistakenly
> reported as coming from Laurent's series at:
>
> https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/
>
> It hangs like this (display with >= 240 columns):
Ouch, sorry, it looks like saving something in claws-mail as draft and
sending it later means lines will be forcefully wrapped. Here's the
original test output:
ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed.
88.198.0.164 │
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer
88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed.
65520 │
ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer
ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
2a01:4f8:222:904::2 │...passed.
ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │
fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer
ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin
ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed.
ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer
88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed.
ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │
ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
88.198.0.161 │...passed.
ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │
ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer
│
──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer───────────────────────────────────
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161
fe80::1 │DNS:
host$ which ip jq >/dev/null │ 185.12.64.1
host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2
88.198.0.164 │ NAT to host ::1: fe80::1
host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6:
88.198.0.161 │ assign: 2a01:4f8:222:904::2
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1
enp9s0 │ our link-local: fe80::1
host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS:
ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2
2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA
fe80::1 │DHCP: offer to discover
host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50
host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request
host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50
host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE
host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY
host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA
host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA
host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA
host$ │
──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00
> ...even without strace. The client is done, the server hangs.
>
> If I unblock this manually by re-running the same client command, the
> server wakes up, writes the file, and terminates, and the test
> continues normally.
>
> Those three "received NS, sending NA" messages in the pasta pane are
> printed in a short time after the test starts.
>
> If I run this with TRACE=1 (which needs the patch I just sent), this
> is pasta's debugging output for this test:
>
> --
> 6.1401: pasta: epoll event on listening TCP socket 6 (events:
> 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW
> 6.1402: Flow 0 (INI): NEW -> INI
> 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ?
> 6.1402: Flow 0 (TGT): INI -> TGT
> 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0
> -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection
> (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)):
> HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)):
> event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection
> (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)):
> TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST
> [::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun
> device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA
> 7.0006: pasta: epoll event on namespace timer watch 12 (events:
> 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288
> 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP
> (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced):
> cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection
> (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection
> (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on
> /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS,
> sending NA 8.0006: pasta: epoll event on namespace timer watch 12
> (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at
> tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at
> tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device
> 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006:
> pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
> 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events:
> 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on
> socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)):
> flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection
> (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)):
> CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE):
> HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 ->
> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace
> timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on
> namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll
> event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta:
> epoll event on namespace timer watch 12 (events: 0x00000001) [...] --
This was:
6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW
6.1402: Flow 0 (INI): NEW -> INI
6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ?
6.1402: Flow 0 (TGT): INI -> TGT
6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377
6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
6.1402: NDP: received NS, sending NA
7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
7.1585: NDP: received NS, sending NA
8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
8.1825: NDP: received NS, sending NA
9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c)
9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host
9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624
9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1
9.2068: Flow 0 (TCP connection (spliced)): CLOSED
9.2068: Flow 0 (FREE): ACTIVE -> FREE
9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
[...]
> Relevant parts of strace output from the client:
>
> --
> openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5
> ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate
> ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0
> socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
> fcntl(6, F_SETFD, FD_CLOEXEC) = 0
> connect(6, {sa_family=AF_INET6, sin6_port=htons(10002),
> sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr),
> sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6,
> sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6,
> "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5],
> [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5,
> "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"...,
> 8192) = 8192 write(6,
> "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"...,
> 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out
> [6]) read(5,
> "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"...,
> 8192) = 8192 write(6,
> "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"...,
> 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out
> [6])
This was:
openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5
ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device)
fcntl(5, F_SETFD, FD_CLOEXEC) = 0
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
> [...]
>
> pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
> read(5, "", 8192) = 0
> shutdown(6, SHUT_WR) = 0
> shutdown(6, SHUT_RDWR) = 0
> exit_group(0) = ?
> +++ exited with 0 +++
> --
>
> and from the server:
>
> --
> socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
> fcntl(6, F_SETFD, FD_CLOEXEC) = 0
> setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(6, {sa_family=AF_INET6, sin6_port=htons(10002),
> sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr),
> sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0
> getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002),
> sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr),
> sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL --
And this was:
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0
listen(6, 5) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0
pselect6(7, [4 6], NULL, NULL, NULL, NULL
> If I connect from the host without a server in the namespace (but
> with the port forwarded by pasta), I get a connection reset, and
> if the port is not forwarded by pasta, connection refused.
>
> But this is another case: we start connecting and accept the
> connection (probably we shouldn't). Note the "No route to host"
> error on the socket.
>
> It looks somehow similar to the race I fixed with commit
> f4e9f26480ef ("pasta: Disable neighbour solicitations on device
> up to prevent DAD"), but it doesn't look like an invalid
> c->ip6.addr_seen, because otherwise pasta would reset the
> connection, I suppose.
>
> I haven't debugged further yet. This looks like an existing
> issue in pasta rather than in this series or in the tests,
> but it blocks tests, so I haven't applied this yet.
--
Stefano
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-09 20:44 ` Stefano Brivio
@ 2024-10-10 5:57 ` David Gibson
2024-10-16 3:15 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-10 5:57 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 5120 bytes --]
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> On Wed, 9 Oct 2024 15:07:21 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
>
> > On Wed, 2 Oct 2024 15:48:26 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > In pasta mode, where addressing permits we "splice" connections, forwarding
> > > directly from host socket to guest/container socket without any L2 or L3
> > > processing. This gives us a very large performance improvement when it's
> > > possible.
> > >
> > > Since the traffic is from a local socket within the guest, it will go over
> > > the guest's 'lo' interface, and accordingly we set the guest side address
> > > to be the loopback address. However this has a surprising side effect:
> > > sometimes guests will run services that are only supposed to be used within
> > > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
> > > forwarding exposes those services to the host, which isn't generally what
> > > we want.
> > >
> > > Correct this by instead forwarding inbound "splice" flows to the guest's
> > > external address.
> > >
> > > Link: https://github.com/containers/podman/issues/24045
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > conf.c | 9 +++++++++
> > > fwd.c | 31 +++++++++++++++++++++++--------
> > > passt.1 | 23 +++++++++++++++++++----
> > > passt.h | 2 ++
> > > 4 files changed, 53 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/conf.c b/conf.c
> > > index 6e62510..b5318f3 100644
> > > --- a/conf.c
> > > +++ b/conf.c
> > > @@ -908,6 +908,9 @@ pasta_opts:
> > > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
> > > " SPEC is as described above\n"
> > > " default: auto\n"
> > > + " --host-lo-to-ns-lo DEPRECATED:\n"
> > > + " Translate host-loopback forwards to\n"
> > > + " namespace loopback\n"
> > > " --userns NSPATH Target user namespace to join\n"
> > > " --netns PATH|NAME Target network namespace to join\n"
> > > " --netns-only Don't join existing user namespace\n"
> > > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
> > > {"netns-only", no_argument, NULL, 20 },
> > > {"map-host-loopback", required_argument, NULL, 21 },
> > > {"map-guest-addr", required_argument, NULL, 22 },
> > > + {"host-lo-to-ns-lo", no_argument, NULL, 23 },
> > > { 0 },
> > > };
> > > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> > > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
> > > conf_nat(optarg, &c->ip4.map_guest_addr,
> > > &c->ip6.map_guest_addr, NULL);
> > > break;
> > > + case 23:
> > > + if (c->mode != MODE_PASTA)
> > > + die("--host-lo-to-ns-lo is for pasta mode only");
> > > + c->host_lo_to_ns_lo = 1;
> > > + break;
> > > case 'd':
> > > c->debug = 1;
> > > c->quiet = 0;
> > > diff --git a/fwd.c b/fwd.c
> > > index a505098..c71f5e1 100644
> > > --- a/fwd.c
> > > +++ b/fwd.c
> > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > /* spliceable */
> > >
> > > - /* Preserve the specific loopback adddress used, but let the
> > > - * kernel pick a source port on the target side
> > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > + * default use its external address, so we don't inadvertently
> > > + * expose services that listen only on the guest's loopback
> > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > + * will instead forward to the loopback address in the guest.
> > > + *
> > > + * In either case, let the kernel pick the source address to
> > > + * match.
> > > */
> > > - tgt->oaddr = ini->eaddr;
> > > + if (inany_v4(&ini->eaddr)) {
> > > + if (c->host_lo_to_ns_lo)
> > > + tgt->eaddr = inany_loopback4;
> > > + else
> > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > + tgt->oaddr = inany_any4;
> > > + } else {
> > > + if (c->host_lo_to_ns_lo)
> > > + tgt->eaddr = inany_loopback6;
> > > + else
> > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> >
> > Either this...
> >
> > > + tgt->oaddr = inany_any6;
> >
> > or this (and not something before this patch, up to 3/4) make the
> > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > sometimes (about one in three/four runs), that's what I mistakenly
> > reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of
tests in a loop this afternoon, but didn't manage to reproduce. I
have administrivia that will probably fill the rest of this week, but
I'll look into this as soon as I can.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-10 5:57 ` David Gibson
@ 2024-10-16 3:15 ` David Gibson
2024-10-16 5:46 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-16 3:15 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 4514 bytes --]
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > On Wed, 9 Oct 2024 15:07:21 +0200
> > Stefano Brivio <sbrivio@redhat.com> wrote:
[snip]
> > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > /* spliceable */
> > > >
> > > > - /* Preserve the specific loopback adddress used, but let the
> > > > - * kernel pick a source port on the target side
> > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > + * default use its external address, so we don't inadvertently
> > > > + * expose services that listen only on the guest's loopback
> > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > + * will instead forward to the loopback address in the guest.
> > > > + *
> > > > + * In either case, let the kernel pick the source address to
> > > > + * match.
> > > > */
> > > > - tgt->oaddr = ini->eaddr;
> > > > + if (inany_v4(&ini->eaddr)) {
> > > > + if (c->host_lo_to_ns_lo)
> > > > + tgt->eaddr = inany_loopback4;
> > > > + else
> > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > + tgt->oaddr = inany_any4;
> > > > + } else {
> > > > + if (c->host_lo_to_ns_lo)
> > > > + tgt->eaddr = inany_loopback6;
> > > > + else
> > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > >
> > > Either this...
> > >
> > > > + tgt->oaddr = inany_any6;
> > >
> > > or this (and not something before this patch, up to 3/4) make the
> > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > sometimes (about one in three/four runs), that's what I mistakenly
> > > reported as coming from Laurent's series at:
>
> Huh, interesting. Just got back from my leave and ran that group of
> tests in a loop this afternoon, but didn't manage to reproduce. I
> have administrivia that will probably fill the rest of this week, but
> I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea
what's going on. As you say it's seeming like the address (addr_seen
== addr in this case) isn't properly ready. This is over splice, but
on the tap interface, I see the container sending NS messages for its
own address - seems like it's doing DAD. But more importantly, we're
answering those NS messages with NA messages, because we answer all
NS. i.e. we're making the DAD fail. What I'm not sure of is how this
ever worked at all. --config-net makes sense, since we disable DAD,
but our test suite has always been using NDP+DHCP instead of
--config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
happens most of the time and for IPv4 via ARP, which is used much more
rarely. I think we need to be more selective in what NS or ARP
lookups we resopnd to. The question is what approach to take:
Option 1: Answer everything apart from specific exceptions
========
Basically we'd explicitly exclude the guest/container's assigned
address but continue answering everything else. This is a bit ugly
and means we'd probably still have a similar problem if the guest just
picks an address instead of taking the assigned one. On the other
hand it means things will work in at least some cases where the guest
tries to contact remote hosts directly instead of via the default
gateway.
Option 2: Only answer for specific addresses
=========
Reverse the logic above, usually *don't* answer NS queries, unless
they have a specific address that we know is our responsibility. That
would be, afaict, the gateway address, the "our_tap" addresses and the
-map--host-loopback and --map-guest-addr addresses (usually all the
same).
This might be less robust against certain guests that don't use a
default gateway when we expect. I guess that could happen when we
have non-gateway routes copied from the host, so we'd have to handle
that too.
It would remove one more barrier towards having multiple bridged
guests behind a single passt/pasta instance (they can't reasonably
communicate with each other if passt answers all their NS / ARP
requests).
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-16 3:15 ` David Gibson
@ 2024-10-16 5:46 ` David Gibson
2024-10-16 8:39 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-16 5:46 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 4152 bytes --]
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> [snip]
> > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > /* spliceable */
> > > > >
> > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > - * kernel pick a source port on the target side
> > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > + * default use its external address, so we don't inadvertently
> > > > > + * expose services that listen only on the guest's loopback
> > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > + * will instead forward to the loopback address in the guest.
> > > > > + *
> > > > > + * In either case, let the kernel pick the source address to
> > > > > + * match.
> > > > > */
> > > > > - tgt->oaddr = ini->eaddr;
> > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > + if (c->host_lo_to_ns_lo)
> > > > > + tgt->eaddr = inany_loopback4;
> > > > > + else
> > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > + tgt->oaddr = inany_any4;
> > > > > + } else {
> > > > > + if (c->host_lo_to_ns_lo)
> > > > > + tgt->eaddr = inany_loopback6;
> > > > > + else
> > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > >
> > > > Either this...
> > > >
> > > > > + tgt->oaddr = inany_any6;
> > > >
> > > > or this (and not something before this patch, up to 3/4) make the
> > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > reported as coming from Laurent's series at:
> >
> > Huh, interesting. Just got back from my leave and ran that group of
> > tests in a loop this afternoon, but didn't manage to reproduce. I
> > have administrivia that will probably fill the rest of this week, but
> > I'll look into this as soon as I can.
>
> I reproduced the problem on passt.top, and I have a partial idea
> what's going on. As you say it's seeming like the address (addr_seen
> == addr in this case) isn't properly ready. This is over splice, but
> on the tap interface, I see the container sending NS messages for its
> own address - seems like it's doing DAD. But more importantly, we're
> answering those NS messages with NA messages, because we answer all
> NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> ever worked at all. --config-net makes sense, since we disable DAD,
> but our test suite has always been using NDP+DHCP instead of
> --config-net.
>
> So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> happens most of the time and for IPv4 via ARP, which is used much more
> rarely. I think we need to be more selective in what NS or ARP
> lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do*
exclude those from getting replies. In this case though, we're
getting NS requests for the assigned address from what looks like the
SLAAC address. So, I do think it would be wise to explicitly exclude
these: we shouldn't be giving NA responses for an address that ought
to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP
address not "taking", so the container is trying to locate it on the
network instead? Or _is_ this DAD, but under some circumstances
rather than using :: as the source address it uses another configured
address.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-16 5:46 ` David Gibson
@ 2024-10-16 8:39 ` David Gibson
2024-10-16 15:26 ` Stefano Brivio
0 siblings, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-16 8:39 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 5390 bytes --]
On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > [snip]
> > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > /* spliceable */
> > > > > >
> > > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > > - * kernel pick a source port on the target side
> > > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > > + * default use its external address, so we don't inadvertently
> > > > > > + * expose services that listen only on the guest's loopback
> > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > > + * will instead forward to the loopback address in the guest.
> > > > > > + *
> > > > > > + * In either case, let the kernel pick the source address to
> > > > > > + * match.
> > > > > > */
> > > > > > - tgt->oaddr = ini->eaddr;
> > > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > + tgt->eaddr = inany_loopback4;
> > > > > > + else
> > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > + tgt->oaddr = inany_any4;
> > > > > > + } else {
> > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > + tgt->eaddr = inany_loopback6;
> > > > > > + else
> > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > >
> > > > > Either this...
> > > > >
> > > > > > + tgt->oaddr = inany_any6;
> > > > >
> > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > reported as coming from Laurent's series at:
> > >
> > > Huh, interesting. Just got back from my leave and ran that group of
> > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > have administrivia that will probably fill the rest of this week, but
> > > I'll look into this as soon as I can.
> >
> > I reproduced the problem on passt.top, and I have a partial idea
> > what's going on. As you say it's seeming like the address (addr_seen
> > == addr in this case) isn't properly ready. This is over splice, but
> > on the tap interface, I see the container sending NS messages for its
> > own address - seems like it's doing DAD. But more importantly, we're
> > answering those NS messages with NA messages, because we answer all
> > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > ever worked at all. --config-net makes sense, since we disable DAD,
> > but our test suite has always been using NDP+DHCP instead of
> > --config-net.
> >
> > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> > happens most of the time and for IPv4 via ARP, which is used much more
> > rarely. I think we need to be more selective in what NS or ARP
> > lookups we resopnd to. The question is what approach to take:
>
> Hmm... no.. there's more to this.
>
> Usually DAD requests have :: as the source address, and we *do*
> exclude those from getting replies. In this case though, we're
> getting NS requests for the assigned address from what looks like the
> SLAAC address. So, I do think it would be wise to explicitly exclude
> these: we shouldn't be giving NA responses for an address that ought
> to belong to the guest, even if it doesn't look like a DAD.
>
> But, I'm not sure what's triggering this. Is for some reason the DHCP
> address not "taking", so the container is trying to locate it on the
> network instead? Or _is_ this DAD, but under some circumstances
> rather than using :: as the source address it uses another configured
> address.
Ok.. I've understood a bit more. While timing is a factor here, it
looks like the main reason I wasn't seeing it on my machine is what
I'd consider a bug in the Debian version of the dhclient-script:
when adding an IPv6 address, it returns without waiting for DAD to
complete (i.e. for the address to be non-tentative).
There's also an additional bug, which doesn't cause this problem, I
think, but caused some problems when I was investigating. DHCPv6
needs the link-local SLAAC address already configured and
non-tentative. The Fedora dhclient-script waits for that too at the
PREINIT6 stage, but the Debian one doesn't, meaning if you attempt
dhclient -6 immediately after starting the namespace it will fail to
bind the UDP address it needs.
I still think it's a good idea not to give NA messages for the guest
assigned address, but we'll need a different workaround for this
issue. I guess we'll have to manually wait for DAD to complete in the
DHCP tests, which will be kind of mucky :/
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-16 8:39 ` David Gibson
@ 2024-10-16 15:26 ` Stefano Brivio
2024-10-17 1:19 ` David Gibson
2024-10-17 5:06 ` David Gibson
0 siblings, 2 replies; 16+ messages in thread
From: Stefano Brivio @ 2024-10-16 15:26 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Wed, 16 Oct 2024 19:39:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > [snip]
> > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > > /* spliceable */
> > > > > > >
> > > > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > > > - * kernel pick a source port on the target side
> > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > > > + * default use its external address, so we don't inadvertently
> > > > > > > + * expose services that listen only on the guest's loopback
> > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > > > + * will instead forward to the loopback address in the guest.
> > > > > > > + *
> > > > > > > + * In either case, let the kernel pick the source address to
> > > > > > > + * match.
> > > > > > > */
> > > > > > > - tgt->oaddr = ini->eaddr;
> > > > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > + tgt->eaddr = inany_loopback4;
> > > > > > > + else
> > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > > + tgt->oaddr = inany_any4;
> > > > > > > + } else {
> > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > + tgt->eaddr = inany_loopback6;
> > > > > > > + else
> > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > > >
> > > > > > Either this...
> > > > > >
> > > > > > > + tgt->oaddr = inany_any6;
> > > > > >
> > > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > > reported as coming from Laurent's series at:
> > > >
> > > > Huh, interesting. Just got back from my leave and ran that group of
> > > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > > have administrivia that will probably fill the rest of this week, but
> > > > I'll look into this as soon as I can.
> > >
> > > I reproduced the problem on passt.top, and I have a partial idea
> > > what's going on. As you say it's seeming like the address (addr_seen
> > > == addr in this case) isn't properly ready. This is over splice, but
> > > on the tap interface, I see the container sending NS messages for its
> > > own address - seems like it's doing DAD. But more importantly, we're
> > > answering those NS messages with NA messages, because we answer all
> > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > > ever worked at all. --config-net makes sense, since we disable DAD,
> > > but our test suite has always been using NDP+DHCP instead of
> > > --config-net.
> > >
> > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> > > happens most of the time and for IPv4 via ARP, which is used much more
> > > rarely. I think we need to be more selective in what NS or ARP
> > > lookups we resopnd to. The question is what approach to take:
> >
> > Hmm... no.. there's more to this.
> >
> > Usually DAD requests have :: as the source address, and we *do*
> > exclude those from getting replies. In this case though, we're
> > getting NS requests for the assigned address from what looks like the
> > SLAAC address. So, I do think it would be wise to explicitly exclude
> > these: we shouldn't be giving NA responses for an address that ought
> > to belong to the guest, even if it doesn't look like a DAD.
> >
> > But, I'm not sure what's triggering this. Is for some reason the DHCP
> > address not "taking", so the container is trying to locate it on the
> > network instead? Or _is_ this DAD, but under some circumstances
> > rather than using :: as the source address it uses another configured
> > address.
>
> Ok.. I've understood a bit more. While timing is a factor here, it
> looks like the main reason I wasn't seeing it on my machine is what
> I'd consider a bug in the Debian version of the dhclient-script:
> when adding an IPv6 address, it returns without waiting for DAD to
> complete (i.e. for the address to be non-tentative).
Oops. On one hand, I would feel inclined to propose a fix for the
Debian and Ubuntu packages. On the other hand, I wonder if it's
universally considered a bug: the DHCPv6 client did its job at that
point, and it's debatable whether dhclient should wait for the address
to be usable before forking to background.
That is, arguably, the job of dhclient's is to request and configure an
address. It's not a network configuration daemon. There might be many
other reasons why that address is unusable, and yet dhclient is not
responsible for them.
By the way, I guess it's just an issue for test scripts like this one.
> There's also an additional bug, which doesn't cause this problem, I
> think, but caused some problems when I was investigating. DHCPv6
> needs the link-local SLAAC address already configured and
> non-tentative. The Fedora dhclient-script waits for that too at the
> PREINIT6 stage, but the Debian one doesn't, meaning if you attempt
> dhclient -6 immediately after starting the namespace it will fail to
> bind the UDP address it needs.
Right, and that's fine for us because we have a 2-second delay after
SLAAC. This looks to me a bit more like a real bug, but again, there
might be many other reasons why dhclient can't use a link-local
address. One might argue that it's the responsibility of the
user/caller to invoke dhclient when appropriate.
In that sense, you might be wondering why there's a 2-second delay
after SLAAC, but no delay after invoking dhclient -6: the reason is
that I was convinced that one wouldn't need DAD once a DHCPv6 client
configures an address. The server is already checking that, I thought.
Well, no. RFC 8415 18.2.10.1:
https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1
says:
If the client can operate with the addresses and/or prefixes obtained
from the server:
[...]
- The client MUST perform duplicate address detection as per
Section 5.4 of [RFC4862], which does list some exceptions, on each
of the received addresses in any IAs on which it has not performed
duplicate address detection during processing of any of the
previous Reply messages from the server. The client performs the
duplicate address detection before using the received addresses
for any traffic. If any of the addresses are found to be in use
on the link, the client sends a Decline message to the server for
those addresses as described in Section 18.2.8.
> I still think it's a good idea not to give NA messages for the guest
> assigned address, but we'll need a different workaround for this
> issue.
I read the rest of your reasoning about it, but the nice thing of the
current behaviour (and that's why I added that single check on the
source address in ndp()) is that the guest can really use whatever
address it wants, regardless of what we tried to configure, and we'll
resolve any other address.
If we receive a neighbour solicitation for the guest assigned address,
and the source address is not unspecified, it means that the guest is
_not_ using the assigned address, and it's actually trying to reach it.
> I guess we'll have to manually wait for DAD to complete in the
> DHCP tests, which will be kind of mucky :/
Alternatively, we could use the same trick as added by commit
f4e9f26480ef ("pasta: Disable neighbour solicitations on device up
to prevent DAD"): disable neighbour solicitations, run dhclient -6,
set 'nodad' on the address, and re-enable neighbour solicitations.
This works for me:
--
diff --git a/test/pasta/dhcp b/test/pasta/dhcp
index 41556b8..76aa723 100644
--- a/test/pasta/dhcp
+++ b/test/pasta/dhcp
@@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu'
check [ __MTU__ = 65520 ]
test DHCPv6: address
+ns ip link set dev __IFNAME__ arp off
ns /sbin/dhclient -6 --no-pid __IFNAME__
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+ns ip addr change __ADDR6__ dev __IFNAME__ nodad
+ns ip link set dev __IFNAME__ arp on
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ __ADDR6__ = __HOST_ADDR6__ ]
--
Adding 2-second delays as we have them for NDP doesn't look that bad:
$ grep --exclude-dir=demo -rn "dhclient -6"
pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__
passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__
passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__
perf/passt_tcp:117:guest dhclient -6 -x
perf/passt_tcp:118:guest dhclient -6 __IFNAME__
two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__
two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__
given that we don't need it on dhclient -x, tests would take about
12 seconds longer.
Or we could switch to the arp off / nodad / arp on approach for
everything, including SLAAC:
$ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2"
pasta/ndp:21:sleep 2
passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD
passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2
two_guests/basic:39:sleep 2
and save slightly less than 8 seconds. If you ask me, I would have a
slight preference for the nodad approach.
--
@@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu'
check [ __MTU__ = 65520 ]
test DHCPv6: address
+ns ip link set dev __IFNAME__ arp off
ns /sbin/dhclient -6 --no-pid __IFNAME__
hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+ns ip addr change __ADDR6__ dev __IFNAME__ nodad
+ns ip link set dev __IFNAME__ arp on
hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
check [ __ADDR6__ = __HOST_ADDR6__ ]
--
Adding 2-second delays as we have them for NDP doesn't look that bad:
$ grep --exclude-dir=demo -rn "dhclient -6"
pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__
passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__
passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__
perf/passt_tcp:117:guest dhclient -6 -x
perf/passt_tcp:118:guest dhclient -6 __IFNAME__
two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__
two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__
given that we don't need it on dhclient -x, tests would take about
12 seconds longer.
Or we could switch to the arp off / nodad / arp on approach for
everything, including SLAAC:
$ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2"
pasta/ndp:21:sleep 2
passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD
passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2
two_guests/basic:39:sleep 2
and save slightly less than 8 seconds. If you ask me, I would have a
slight preference for the nodad approach.
--
Stefano
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-16 15:26 ` Stefano Brivio
@ 2024-10-17 1:19 ` David Gibson
2024-10-17 8:31 ` Stefano Brivio
2024-10-17 5:06 ` David Gibson
1 sibling, 1 reply; 16+ messages in thread
From: David Gibson @ 2024-10-17 1:19 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 10052 bytes --]
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
> On Wed, 16 Oct 2024 19:39:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > [snip]
> > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > > > /* spliceable */
> > > > > > > >
> > > > > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > > > > - * kernel pick a source port on the target side
> > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > > > > + * default use its external address, so we don't inadvertently
> > > > > > > > + * expose services that listen only on the guest's loopback
> > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > > > > + * will instead forward to the loopback address in the guest.
> > > > > > > > + *
> > > > > > > > + * In either case, let the kernel pick the source address to
> > > > > > > > + * match.
> > > > > > > > */
> > > > > > > > - tgt->oaddr = ini->eaddr;
> > > > > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > + tgt->eaddr = inany_loopback4;
> > > > > > > > + else
> > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > > > + tgt->oaddr = inany_any4;
> > > > > > > > + } else {
> > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > + tgt->eaddr = inany_loopback6;
> > > > > > > > + else
> > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > > > >
> > > > > > > Either this...
> > > > > > >
> > > > > > > > + tgt->oaddr = inany_any6;
> > > > > > >
> > > > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > > > reported as coming from Laurent's series at:
> > > > >
> > > > > Huh, interesting. Just got back from my leave and ran that group of
> > > > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > > > have administrivia that will probably fill the rest of this week, but
> > > > > I'll look into this as soon as I can.
> > > >
> > > > I reproduced the problem on passt.top, and I have a partial idea
> > > > what's going on. As you say it's seeming like the address (addr_seen
> > > > == addr in this case) isn't properly ready. This is over splice, but
> > > > on the tap interface, I see the container sending NS messages for its
> > > > own address - seems like it's doing DAD. But more importantly, we're
> > > > answering those NS messages with NA messages, because we answer all
> > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > > > ever worked at all. --config-net makes sense, since we disable DAD,
> > > > but our test suite has always been using NDP+DHCP instead of
> > > > --config-net.
> > > >
> > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> > > > happens most of the time and for IPv4 via ARP, which is used much more
> > > > rarely. I think we need to be more selective in what NS or ARP
> > > > lookups we resopnd to. The question is what approach to take:
> > >
> > > Hmm... no.. there's more to this.
> > >
> > > Usually DAD requests have :: as the source address, and we *do*
> > > exclude those from getting replies. In this case though, we're
> > > getting NS requests for the assigned address from what looks like the
> > > SLAAC address. So, I do think it would be wise to explicitly exclude
> > > these: we shouldn't be giving NA responses for an address that ought
> > > to belong to the guest, even if it doesn't look like a DAD.
> > >
> > > But, I'm not sure what's triggering this. Is for some reason the DHCP
> > > address not "taking", so the container is trying to locate it on the
> > > network instead? Or _is_ this DAD, but under some circumstances
> > > rather than using :: as the source address it uses another configured
> > > address.
> >
> > Ok.. I've understood a bit more. While timing is a factor here, it
> > looks like the main reason I wasn't seeing it on my machine is what
> > I'd consider a bug in the Debian version of the dhclient-script:
> > when adding an IPv6 address, it returns without waiting for DAD to
> > complete (i.e. for the address to be non-tentative).
>
> Oops. On one hand, I would feel inclined to propose a fix for the
> Debian and Ubuntu packages. On the other hand, I wonder if it's
> universally considered a bug: the DHCPv6 client did its job at that
> point, and it's debatable whether dhclient should wait for the address
> to be usable before forking to background.
>
> That is, arguably, the job of dhclient's is to request and configure an
> address. It's not a network configuration daemon. There might be many
> other reasons why that address is unusable, and yet dhclient is not
> responsible for them.
Hrm... I guess. Counterpoints..
- Most other failures to get a usable address will result in a
visible error
- dhclient has a --dad-wait-time option which seems to imply that the
script should wait for DAD
- The upstream script version waits for DAD
In any case I filed a report for it
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
> By the way, I guess it's just an issue for test scripts like this one.
Why do you guess that?
> > There's also an additional bug, which doesn't cause this problem, I
> > think, but caused some problems when I was investigating. DHCPv6
> > needs the link-local SLAAC address already configured and
> > non-tentative. The Fedora dhclient-script waits for that too at the
> > PREINIT6 stage, but the Debian one doesn't, meaning if you attempt
> > dhclient -6 immediately after starting the namespace it will fail to
> > bind the UDP address it needs.
>
> Right, and that's fine for us because we have a 2-second delay after
> SLAAC. This looks to me a bit more like a real bug, but again, there
> might be many other reasons why dhclient can't use a link-local
> address. One might argue that it's the responsibility of the
> user/caller to invoke dhclient when appropriate.
Here I think it's a much clearer argument that it's a real bug. We
play fast and loose with it for mbuto, but dhclient can typically be
called on an interface that isn't even up: the PREINIT/PREINIT6 script
actions are specifically for this, they'll bring the interface up
ready for the client to do its thing. I'd say the script is failing
to do its job for PREINIT6 if there isn't a usable link-local address
at the end.
I filed a report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085229
> In that sense, you might be wondering why there's a 2-second delay
> after SLAAC, but no delay after invoking dhclient -6: the reason is
> that I was convinced that one wouldn't need DAD once a DHCPv6 client
> configures an address. The server is already checking that, I thought.
>
> Well, no. RFC 8415 18.2.10.1:
>
> https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1
>
> says:
>
> If the client can operate with the addresses and/or prefixes obtained
> from the server:
>
> [...]
>
> - The client MUST perform duplicate address detection as per
> Section 5.4 of [RFC4862], which does list some exceptions, on each
> of the received addresses in any IAs on which it has not performed
> duplicate address detection during processing of any of the
> previous Reply messages from the server. The client performs the
> duplicate address detection before using the received addresses
> for any traffic. If any of the addresses are found to be in use
> on the link, the client sends a Decline message to the server for
> those addresses as described in Section 18.2.8.
Indeed.
> > I still think it's a good idea not to give NA messages for the guest
> > assigned address, but we'll need a different workaround for this
> > issue.
>
> I read the rest of your reasoning about it, but the nice thing of the
> current behaviour (and that's why I added that single check on the
> source address in ndp()) is that the guest can really use whatever
> address it wants, regardless of what we tried to configure, and we'll
> resolve any other address.
Hrm. I suppose. Fwiw we already make the equivalent exclusion for ARP
> If we receive a neighbour solicitation for the guest assigned address,
> and the source address is not unspecified, it means that the guest is
> _not_ using the assigned address, and it's actually trying to reach it.
>
> > I guess we'll have to manually wait for DAD to complete in the
> > DHCP tests, which will be kind of mucky :/
>
> Alternatively, we could use the same trick as added by commit
> f4e9f26480ef ("pasta: Disable neighbour solicitations on device up
> to prevent DAD"): disable neighbour solicitations, run dhclient -6,
> set 'nodad' on the address, and re-enable neighbour solicitations.
>
> This works for me:
Ok. More complex, but faster, I guess. I'll try implementing this.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-16 15:26 ` Stefano Brivio
2024-10-17 1:19 ` David Gibson
@ 2024-10-17 5:06 ` David Gibson
1 sibling, 0 replies; 16+ messages in thread
From: David Gibson @ 2024-10-17 5:06 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 2022 bytes --]
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
[snip]
> Adding 2-second delays as we have them for NDP doesn't look that bad:
> $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns
> /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest
> /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6
> __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x
> perf/passt_tcp:118:guest dhclient -6 __IFNAME__
> two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__
> two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__
> given that we don't need it on dhclient -x, tests would take about
> 12 seconds longer.
> Or we could switch to the arp off / nodad / arp on approach for
> everything, including SLAAC:
> $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro
> --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2"
> pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1
> dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set
> dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2
> and save slightly less than 8 seconds. If you ask me, I would have a
> slight preference for the nodad approach.
Much as I don't like making the tests slower, I don't think the nodad
approach is a good idea. It's fine if we think of the NDP/DHCP only
as setting things up for the transfer tests. But they also serve the
purpose of testing the NDP and DHCP paths themselves. For that
purpose we should perform a full "normal" configuration, including
DAD. [e.g. if we really did manage to break our NS responses so that
we totally broke DAD, we'd want our tests to catch that].
I have a draft implementation of explicitly waiting for DAD to
complete both for SLAAC and for DHCPv6, patches coming shortly.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-17 1:19 ` David Gibson
@ 2024-10-17 8:31 ` Stefano Brivio
2024-10-21 1:35 ` David Gibson
0 siblings, 1 reply; 16+ messages in thread
From: Stefano Brivio @ 2024-10-17 8:31 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Thu, 17 Oct 2024 12:19:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
> > On Wed, 16 Oct 2024 19:39:40 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> > > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > [snip]
> > > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > > > > /* spliceable */
> > > > > > > > >
> > > > > > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > > > > > - * kernel pick a source port on the target side
> > > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > > > > > + * default use its external address, so we don't inadvertently
> > > > > > > > > + * expose services that listen only on the guest's loopback
> > > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > > > > > + * will instead forward to the loopback address in the guest.
> > > > > > > > > + *
> > > > > > > > > + * In either case, let the kernel pick the source address to
> > > > > > > > > + * match.
> > > > > > > > > */
> > > > > > > > > - tgt->oaddr = ini->eaddr;
> > > > > > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > > + tgt->eaddr = inany_loopback4;
> > > > > > > > > + else
> > > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > > > > + tgt->oaddr = inany_any4;
> > > > > > > > > + } else {
> > > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > > + tgt->eaddr = inany_loopback6;
> > > > > > > > > + else
> > > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > > > > >
> > > > > > > > Either this...
> > > > > > > >
> > > > > > > > > + tgt->oaddr = inany_any6;
> > > > > > > >
> > > > > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > > > > reported as coming from Laurent's series at:
> > > > > >
> > > > > > Huh, interesting. Just got back from my leave and ran that group of
> > > > > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > > > > have administrivia that will probably fill the rest of this week, but
> > > > > > I'll look into this as soon as I can.
> > > > >
> > > > > I reproduced the problem on passt.top, and I have a partial idea
> > > > > what's going on. As you say it's seeming like the address (addr_seen
> > > > > == addr in this case) isn't properly ready. This is over splice, but
> > > > > on the tap interface, I see the container sending NS messages for its
> > > > > own address - seems like it's doing DAD. But more importantly, we're
> > > > > answering those NS messages with NA messages, because we answer all
> > > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > > > > ever worked at all. --config-net makes sense, since we disable DAD,
> > > > > but our test suite has always been using NDP+DHCP instead of
> > > > > --config-net.
> > > > >
> > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> > > > > happens most of the time and for IPv4 via ARP, which is used much more
> > > > > rarely. I think we need to be more selective in what NS or ARP
> > > > > lookups we resopnd to. The question is what approach to take:
> > > >
> > > > Hmm... no.. there's more to this.
> > > >
> > > > Usually DAD requests have :: as the source address, and we *do*
> > > > exclude those from getting replies. In this case though, we're
> > > > getting NS requests for the assigned address from what looks like the
> > > > SLAAC address. So, I do think it would be wise to explicitly exclude
> > > > these: we shouldn't be giving NA responses for an address that ought
> > > > to belong to the guest, even if it doesn't look like a DAD.
> > > >
> > > > But, I'm not sure what's triggering this. Is for some reason the DHCP
> > > > address not "taking", so the container is trying to locate it on the
> > > > network instead? Or _is_ this DAD, but under some circumstances
> > > > rather than using :: as the source address it uses another configured
> > > > address.
> > >
> > > Ok.. I've understood a bit more. While timing is a factor here, it
> > > looks like the main reason I wasn't seeing it on my machine is what
> > > I'd consider a bug in the Debian version of the dhclient-script:
> > > when adding an IPv6 address, it returns without waiting for DAD to
> > > complete (i.e. for the address to be non-tentative).
> >
> > Oops. On one hand, I would feel inclined to propose a fix for the
> > Debian and Ubuntu packages. On the other hand, I wonder if it's
> > universally considered a bug: the DHCPv6 client did its job at that
> > point, and it's debatable whether dhclient should wait for the address
> > to be usable before forking to background.
> >
> > That is, arguably, the job of dhclient's is to request and configure an
> > address. It's not a network configuration daemon. There might be many
> > other reasons why that address is unusable, and yet dhclient is not
> > responsible for them.
>
> Hrm... I guess. Counterpoints..
> - Most other failures to get a usable address will result in a
> visible error
> - dhclient has a --dad-wait-time option which seems to imply that the
> script should wait for DAD
> - The upstream script version waits for DAD
>
> In any case I filed a report for it
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
>
> > By the way, I guess it's just an issue for test scripts like this one.
>
> Why do you guess that?
Because it's kind of rare that your address changes if you use DHCPv6,
I guess, so this would be relevant almost exclusively at boot.
And, at boot, if a remote peer/client happens to try to connect to the
machine where the client is running right after an address was
assigned, it must have a retry mechanism almost for sure.
--
Stefano
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
2024-10-17 8:31 ` Stefano Brivio
@ 2024-10-21 1:35 ` David Gibson
0 siblings, 0 replies; 16+ messages in thread
From: David Gibson @ 2024-10-21 1:35 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 7407 bytes --]
On Thu, Oct 17, 2024 at 10:31:22AM +0200, Stefano Brivio wrote:
> On Thu, 17 Oct 2024 12:19:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
> > > On Wed, 16 Oct 2024 19:39:40 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> > > > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > > > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > > > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > [snip]
> > > > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > > > > > /* spliceable */
> > > > > > > > > >
> > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the
> > > > > > > > > > - * kernel pick a source port on the target side
> > > > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by
> > > > > > > > > > + * default use its external address, so we don't inadvertently
> > > > > > > > > > + * expose services that listen only on the guest's loopback
> > > > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which
> > > > > > > > > > + * will instead forward to the loopback address in the guest.
> > > > > > > > > > + *
> > > > > > > > > > + * In either case, let the kernel pick the source address to
> > > > > > > > > > + * match.
> > > > > > > > > > */
> > > > > > > > > > - tgt->oaddr = ini->eaddr;
> > > > > > > > > > + if (inany_v4(&ini->eaddr)) {
> > > > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > > > + tgt->eaddr = inany_loopback4;
> > > > > > > > > > + else
> > > > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > > > > > + tgt->oaddr = inany_any4;
> > > > > > > > > > + } else {
> > > > > > > > > > + if (c->host_lo_to_ns_lo)
> > > > > > > > > > + tgt->eaddr = inany_loopback6;
> > > > > > > > > > + else
> > > > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > > > > > >
> > > > > > > > > Either this...
> > > > > > > > >
> > > > > > > > > > + tgt->oaddr = inany_any6;
> > > > > > > > >
> > > > > > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > > > > > reported as coming from Laurent's series at:
> > > > > > >
> > > > > > > Huh, interesting. Just got back from my leave and ran that group of
> > > > > > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > > > > > have administrivia that will probably fill the rest of this week, but
> > > > > > > I'll look into this as soon as I can.
> > > > > >
> > > > > > I reproduced the problem on passt.top, and I have a partial idea
> > > > > > what's going on. As you say it's seeming like the address (addr_seen
> > > > > > == addr in this case) isn't properly ready. This is over splice, but
> > > > > > on the tap interface, I see the container sending NS messages for its
> > > > > > own address - seems like it's doing DAD. But more importantly, we're
> > > > > > answering those NS messages with NA messages, because we answer all
> > > > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > > > > > ever worked at all. --config-net makes sense, since we disable DAD,
> > > > > > but our test suite has always been using NDP+DHCP instead of
> > > > > > --config-net.
> > > > > >
> > > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which
> > > > > > happens most of the time and for IPv4 via ARP, which is used much more
> > > > > > rarely. I think we need to be more selective in what NS or ARP
> > > > > > lookups we resopnd to. The question is what approach to take:
> > > > >
> > > > > Hmm... no.. there's more to this.
> > > > >
> > > > > Usually DAD requests have :: as the source address, and we *do*
> > > > > exclude those from getting replies. In this case though, we're
> > > > > getting NS requests for the assigned address from what looks like the
> > > > > SLAAC address. So, I do think it would be wise to explicitly exclude
> > > > > these: we shouldn't be giving NA responses for an address that ought
> > > > > to belong to the guest, even if it doesn't look like a DAD.
> > > > >
> > > > > But, I'm not sure what's triggering this. Is for some reason the DHCP
> > > > > address not "taking", so the container is trying to locate it on the
> > > > > network instead? Or _is_ this DAD, but under some circumstances
> > > > > rather than using :: as the source address it uses another configured
> > > > > address.
> > > >
> > > > Ok.. I've understood a bit more. While timing is a factor here, it
> > > > looks like the main reason I wasn't seeing it on my machine is what
> > > > I'd consider a bug in the Debian version of the dhclient-script:
> > > > when adding an IPv6 address, it returns without waiting for DAD to
> > > > complete (i.e. for the address to be non-tentative).
> > >
> > > Oops. On one hand, I would feel inclined to propose a fix for the
> > > Debian and Ubuntu packages. On the other hand, I wonder if it's
> > > universally considered a bug: the DHCPv6 client did its job at that
> > > point, and it's debatable whether dhclient should wait for the address
> > > to be usable before forking to background.
> > >
> > > That is, arguably, the job of dhclient's is to request and configure an
> > > address. It's not a network configuration daemon. There might be many
> > > other reasons why that address is unusable, and yet dhclient is not
> > > responsible for them.
> >
> > Hrm... I guess. Counterpoints..
> > - Most other failures to get a usable address will result in a
> > visible error
> > - dhclient has a --dad-wait-time option which seems to imply that the
> > script should wait for DAD
> > - The upstream script version waits for DAD
> >
> > In any case I filed a report for it
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
> >
> > > By the way, I guess it's just an issue for test scripts like this one.
> >
> > Why do you guess that?
>
> Because it's kind of rare that your address changes if you use DHCPv6,
> I guess, so this would be relevant almost exclusively at boot.
Hm, true. I was thinking of ephemeral containers where "boot" could
be quite common.. but those will most likely use --config-net.
> And, at boot, if a remote peer/client happens to try to connect to the
> machine where the client is running right after an address was
> assigned, it must have a retry mechanism almost for sure.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-10-21 1:39 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson
2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson
2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson
2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson
2024-10-09 13:07 ` Stefano Brivio
2024-10-09 20:44 ` Stefano Brivio
2024-10-10 5:57 ` David Gibson
2024-10-16 3:15 ` David Gibson
2024-10-16 5:46 ` David Gibson
2024-10-16 8:39 ` David Gibson
2024-10-16 15:26 ` Stefano Brivio
2024-10-17 1:19 ` David Gibson
2024-10-17 8:31 ` Stefano Brivio
2024-10-21 1:35 ` David Gibson
2024-10-17 5:06 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).