* [PATCH v3 0/4] Don't expose container loopback services to the host @ 2024-10-02 5:48 David Gibson 2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson ` (3 more replies) 0 siblings, 4 replies; 16+ messages in thread From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw) To: passt-dev, Stefano Brivio; +Cc: David Gibson podman issue #24045 pointed out that pasta's spliced forwarding logic can expose services within the namespace bound only to 127.0.0.1 or ::1 to the host. However, the namespace probably expects those to only be accessible to itself, so that's probably not what we want. This changes our forwarding logic so as not to do this. Note that the podman tests will currently fail with this series, I've submitted podman PR https://github.com/containers/podman/pull/24064 to fix that. Link: https://github.com/containers/podman/issues/24045 Changes since v2: * Add new field do structure comment Changes since v1: * Add --host-lo-to-ns-lo option to preserve the old behaviour * Clarify the new behaviour in the man page * Add some extra patches making some other corrections to the man page David Gibson (4): passt.1: Mark --stderr as deprecated more prominently passt.1: Clarify and update "Handling of local addresses" section test: Clarify test for spliced inbound transfers fwd: Direct inbound spliced forwards to the guest's external address conf.c | 9 ++++++ fwd.c | 31 +++++++++++++----- passt.1 | 75 +++++++++++++++++++++++++++----------------- passt.h | 2 ++ test/passt_in_ns/tcp | 8 ++--- test/passt_in_ns/udp | 4 +-- test/pasta/tcp | 16 +++++----- test/pasta/udp | 8 ++--- 8 files changed, 98 insertions(+), 55 deletions(-) -- 2.46.2 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently 2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson @ 2024-10-02 5:48 ` David Gibson 2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson ` (2 subsequent siblings) 3 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw) To: passt-dev, Stefano Brivio; +Cc: David Gibson The description of this option says that it's deprecated, but unlike --no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add one to make it easier to spot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- passt.1 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/passt.1 b/passt.1 index 79d134d..acc1f92 100644 --- a/passt.1 +++ b/passt.1 @@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change. Default is to fork into background. .TP -.BR \-e ", " \-\-stderr +.BR \-e ", " \-\-stderr " " (DEPRECATED) This option has no effect, and is maintained for compatibility purposes only. Note that this configuration option is \fBdeprecated\fR and will be removed in a -- @@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change. Default is to fork into background. .TP -.BR \-e ", " \-\-stderr +.BR \-e ", " \-\-stderr " " (DEPRECATED) This option has no effect, and is maintained for compatibility purposes only. Note that this configuration option is \fBdeprecated\fR and will be removed in a -- 2.46.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section 2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson 2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson @ 2024-10-02 5:48 ` David Gibson 2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson 2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson 3 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw) To: passt-dev, Stefano Brivio; +Cc: David Gibson This section didn't mention the effect of the --map-host-loopback option which now alters this behaviour. Update it accordingly. It used "local addresses" to mean specifically 127.0.0.0/8 and ::1. However, "local" could also refer to link-local addresses or to addresses of any scope which happen to be configured on the host. Use "loopback address" to be more precise about this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- passt.1 | 54 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 28 insertions(+), 26 deletions(-) diff --git a/passt.1 b/passt.1 index acc1f92..11104e1 100644 --- a/passt.1 +++ b/passt.1 @@ -863,38 +863,40 @@ root@localhost's password: .SH NOTES -.SS Handling of traffic with local destination and source addresses - -Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address, -depending on the configuration. Local destination or source addresses need to be -changed before packets are delivered to the guest or target namespace: most -operating systems would drop packets received from non-loopback interfaces with -local addresses, and it would also be impossible for guest or target namespace -to route answers back. - -For convenience, and somewhat arbitrarily, the source address on these packets -is translated to the address of the default IPv4 or IPv6 gateway (if any) -- -this is known to be an existing, valid address on the same subnet. - -Loopback destination addresses are instead translated to the observed external -address of the guest or target namespace. For IPv6 packets, if usage of a -link-local address by guest or namespace has ever been observed, and the -original destination address is also a link-local address, the observed -link-local address is used. Otherwise, the observed global address is used. For -both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses -will be used instead. +.SS Handling of traffic with loopback destination and source addresses + +Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback +address (127.0.0.0/8 or ::1), depending on the configuration. Loopback +destination or source addresses need to be changed before packets are +delivered to the guest or target namespace: most operating systems +would drop packets received with loopback addresses on non-loopback +interfaces, and it would also be impossible for guest or target +namespace to route answers back. + +For convenience, the source address on these packets is translated to +the address specified by the \fB\-\-map-host-loopback\fR option. If +not specified this defaults, somewhat arbitrarily, to the address of +default IPv4 or IPv6 gateway (if any) -- this is known to be an +existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or +\fB\-\-map-host-loopback none\fR are specified this translation is +disabled and packets with loopback addresses are simply dropped. + +Loopback destination addresses are translated to the observed external +address of the guest or target namespace. For IPv6, the observed +link-local address is used if the translated source address is +link-local, otherwise the observed global address is used. For both +IPv4 and IPv6, if no addresses have been seen yet, the configured +addresses will be used instead. For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1, with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while the last observed source address from guest or namespace is 192.0.2.2, this will be translated to a connection from 192.0.2.1 to 192.0.2.2. -Similarly, for traffic coming from guest or namespace, packets with destination -address corresponding to the default gateway will have their destination address -translated to a loopback address, if and only if a packet, in the opposite -direction, with a loopback destination or source address, port-wise matching for -UDP, or connection-wise for TCP, has been recently forwarded to guest or -namespace. This behaviour can be disabled with \-\-no\-map\-gw. +Similarly, for traffic coming from guest or namespace, packets with +destination address corresponding to the \fB\-\-map-host-loopback\fR +address will have their destination address translated to a loopback +address. .SS Handling of local traffic in pasta -- @@ -863,38 +863,40 @@ root@localhost's password: .SH NOTES -.SS Handling of traffic with local destination and source addresses - -Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address, -depending on the configuration. Local destination or source addresses need to be -changed before packets are delivered to the guest or target namespace: most -operating systems would drop packets received from non-loopback interfaces with -local addresses, and it would also be impossible for guest or target namespace -to route answers back. - -For convenience, and somewhat arbitrarily, the source address on these packets -is translated to the address of the default IPv4 or IPv6 gateway (if any) -- -this is known to be an existing, valid address on the same subnet. - -Loopback destination addresses are instead translated to the observed external -address of the guest or target namespace. For IPv6 packets, if usage of a -link-local address by guest or namespace has ever been observed, and the -original destination address is also a link-local address, the observed -link-local address is used. Otherwise, the observed global address is used. For -both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses -will be used instead. +.SS Handling of traffic with loopback destination and source addresses + +Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback +address (127.0.0.0/8 or ::1), depending on the configuration. Loopback +destination or source addresses need to be changed before packets are +delivered to the guest or target namespace: most operating systems +would drop packets received with loopback addresses on non-loopback +interfaces, and it would also be impossible for guest or target +namespace to route answers back. + +For convenience, the source address on these packets is translated to +the address specified by the \fB\-\-map-host-loopback\fR option. If +not specified this defaults, somewhat arbitrarily, to the address of +default IPv4 or IPv6 gateway (if any) -- this is known to be an +existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or +\fB\-\-map-host-loopback none\fR are specified this translation is +disabled and packets with loopback addresses are simply dropped. + +Loopback destination addresses are translated to the observed external +address of the guest or target namespace. For IPv6, the observed +link-local address is used if the translated source address is +link-local, otherwise the observed global address is used. For both +IPv4 and IPv6, if no addresses have been seen yet, the configured +addresses will be used instead. For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1, with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while the last observed source address from guest or namespace is 192.0.2.2, this will be translated to a connection from 192.0.2.1 to 192.0.2.2. -Similarly, for traffic coming from guest or namespace, packets with destination -address corresponding to the default gateway will have their destination address -translated to a loopback address, if and only if a packet, in the opposite -direction, with a loopback destination or source address, port-wise matching for -UDP, or connection-wise for TCP, has been recently forwarded to guest or -namespace. This behaviour can be disabled with \-\-no\-map\-gw. +Similarly, for traffic coming from guest or namespace, packets with +destination address corresponding to the \fB\-\-map-host-loopback\fR +address will have their destination address translated to a loopback +address. .SS Handling of local traffic in pasta -- 2.46.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 3/4] test: Clarify test for spliced inbound transfers 2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson 2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson 2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson @ 2024-10-02 5:48 ` David Gibson 2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson 3 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw) To: passt-dev, Stefano Brivio; +Cc: David Gibson The tests in pasta/tcp and pasta/udp for inbound transfers have the server listening within the namespace explicitly bound to 127.0.0.1 or ::1. This only works because of the behaviour of inbound splice connections, which always appear with both source and destination addresses as loopback in the namespace. That's not an inherent property for "spliced" connections and arguably an undesirable one. Also update the test names to make it clearer that these tests are expecting to exercise the "splice" path. Interestingly this was already correct for the equivalent passt_in_ns/*, although we also update the test names for clarity there. Note that there are similar issues in some of the podman tests, addressed in https://github.com/containers/podman/pull/24064 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- test/passt_in_ns/tcp | 8 ++++---- test/passt_in_ns/udp | 4 ++-- test/pasta/tcp | 16 ++++++++-------- test/pasta/udp | 8 ++++---- 4 files changed, 18 insertions(+), 18 deletions(-) diff --git a/test/passt_in_ns/tcp b/test/passt_in_ns/tcp index aaf340e..319880b 100644 --- a/test/passt_in_ns/tcp +++ b/test/passt_in_ns/tcp @@ -32,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv4: host to ns: big transfer +test TCP/IPv4: host to ns (spliced): big transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 @@ -90,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv4: host to ns: small transfer +test TCP/IPv4: host to ns (spliced): small transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 @@ -146,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv6: host to ns: big transfer +test TCP/IPv6: host to ns (spliced): big transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 @@ -204,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv6: host to ns: small transfer +test TCP/IPv6: host to ns (spliced): small transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 diff --git a/test/passt_in_ns/udp b/test/passt_in_ns/udp index 3426ab9..791511c 100644 --- a/test/passt_in_ns/udp +++ b/test/passt_in_ns/udp @@ -30,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv4: host to ns +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null @@ -88,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv6: host to ns +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null diff --git a/test/pasta/tcp b/test/pasta/tcp index 6ab18c5..53b6f25 100644 --- a/test/pasta/tcp +++ b/test/pasta/tcp @@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin set TEMP_SMALL __STATEDIR__/test_small.bin set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin -test TCP/IPv4: host to ns: big transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv4: host to ns (spliced): big transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv4: host to ns: small transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv4: host to ns (spliced): small transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ @@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/small.bin __TEMP_SMALL__ -test TCP/IPv6: host to ns: big transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv6: host to ns (spliced): big transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv6: host to ns: small transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv6: host to ns (spliced): small transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ diff --git a/test/pasta/udp b/test/pasta/udp index 30e3a85..7734d02 100644 --- a/test/pasta/udp +++ b/test/pasta/udp @@ -17,8 +17,8 @@ htools dd socat ip jq set TEMP __STATEDIR__/test.bin set TEMP_NS __STATEDIR__/test_ns.bin -test UDP/IPv4: host to ns -nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ @@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null hostw check cmp __BASEPATH__/medium.bin __TEMP__ -test UDP/IPv6: host to ns -nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ -- @@ -17,8 +17,8 @@ htools dd socat ip jq set TEMP __STATEDIR__/test.bin set TEMP_NS __STATEDIR__/test_ns.bin -test UDP/IPv4: host to ns -nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ @@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null hostw check cmp __BASEPATH__/medium.bin __TEMP__ -test UDP/IPv6: host to ns -nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ -- 2.46.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson ` (2 preceding siblings ...) 2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson @ 2024-10-02 5:48 ` David Gibson 2024-10-09 13:07 ` Stefano Brivio 3 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-02 5:48 UTC (permalink / raw) To: passt-dev, Stefano Brivio; +Cc: David Gibson In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen; + tgt->oaddr = inany_any6; + } + + /* Let the kernel pick source port */ tgt->oport = 0; if (proto == IPPROTO_UDP) /* But for UDP preserve the source port */ tgt->oport = ini->eport; - if (inany_v4(&ini->eaddr)) - tgt->eaddr = inany_loopback4; - else - tgt->eaddr = inany_loopback6; - return PIF_SPLICE; } diff --git a/passt.1 b/passt.1 index 11104e1..332384c 100644 --- a/passt.1 +++ b/passt.1 @@ -586,6 +586,13 @@ Configure UDP port forwarding from target namespace to init namespace. Default is \fBauto\fR. +.TP +.BR \-\-host-lo-to-ns-lo " " (DEPRECATED) +If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from +the host's loopback address will appear on the loopback address in the +guest as well. Without this option such forwarded packets will appear +to come from the guest's public address. + .TP .BR \-\-userns " " \fIspec Target user namespace to join, as a path. If PID is given, without this option, @@ -874,8 +881,9 @@ interfaces, and it would also be impossible for guest or target namespace to route answers back. For convenience, the source address on these packets is translated to -the address specified by the \fB\-\-map-host-loopback\fR option. If -not specified this defaults, somewhat arbitrarily, to the address of +the address specified by the \fB\-\-map-host-loopback\fR option (with +some exceptions in pasta mode, see next section below). If not +specified this defaults, somewhat arbitrarily, to the address of default IPv4 or IPv6 gateway (if any) -- this is known to be an existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or \fB\-\-map-host-loopback none\fR are specified this translation is @@ -912,8 +920,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet transfers. -This bypass only applies to local connections and traffic, because it's not -possible to bind sockets to foreign addresses. +Because it's not possible to bind sockets to foreign addresses, this +bypass only applies to local connections and traffic. It also means +that the address translation differs slightly from passt mode. +Connections from loopback to loopback on the host will appear to come +from the target namespace's public address within the guest, unless +\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will +appear to come from loopback in the namespace as well. The latter +behaviour used to be the default, but is usually undesirable, since it +can unintentionally expose namespace local services to the host. .SS Binding to low numbered ports (well-known or system ports, up to 1023) diff --git a/passt.h b/passt.h index 031c9b6..f7b7a58 100644 --- a/passt.h +++ b/passt.h @@ -225,6 +225,7 @@ struct ip6_ctx { * @no_dhcpv6: Disable DHCPv6 server * @no_ndp: Disable NDP handler altogether * @no_ra: Disable router advertisements + * @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses * @low_wmem: Low probed net.core.wmem_max * @low_rmem: Low probed net.core.rmem_max */ @@ -284,6 +285,7 @@ struct ctx { int no_dhcpv6; int no_ndp; int no_ra; + int host_lo_to_ns_lo; int low_wmem; int low_rmem; -- @@ -225,6 +225,7 @@ struct ip6_ctx { * @no_dhcpv6: Disable DHCPv6 server * @no_ndp: Disable NDP handler altogether * @no_ra: Disable router advertisements + * @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses * @low_wmem: Low probed net.core.wmem_max * @low_rmem: Low probed net.core.rmem_max */ @@ -284,6 +285,7 @@ struct ctx { int no_dhcpv6; int no_ndp; int no_ra; + int host_lo_to_ns_lo; int low_wmem; int low_rmem; -- 2.46.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson @ 2024-10-09 13:07 ` Stefano Brivio 2024-10-09 20:44 ` Stefano Brivio 0 siblings, 1 reply; 16+ messages in thread From: Stefano Brivio @ 2024-10-09 13:07 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 2 Oct 2024 15:48:26 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > In pasta mode, where addressing permits we "splice" connections, forwarding > directly from host socket to guest/container socket without any L2 or L3 > processing. This gives us a very large performance improvement when it's > possible. > > Since the traffic is from a local socket within the guest, it will go over > the guest's 'lo' interface, and accordingly we set the guest side address > to be the loopback address. However this has a surprising side effect: > sometimes guests will run services that are only supposed to be used within > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's > forwarding exposes those services to the host, which isn't generally what > we want. > > Correct this by instead forwarding inbound "splice" flows to the guest's > external address. > > Link: https://github.com/containers/podman/issues/24045 > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > conf.c | 9 +++++++++ > fwd.c | 31 +++++++++++++++++++++++-------- > passt.1 | 23 +++++++++++++++++++---- > passt.h | 2 ++ > 4 files changed, 53 insertions(+), 12 deletions(-) > > diff --git a/conf.c b/conf.c > index 6e62510..b5318f3 100644 > --- a/conf.c > +++ b/conf.c > @@ -908,6 +908,9 @@ pasta_opts: > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" > " SPEC is as described above\n" > " default: auto\n" > + " --host-lo-to-ns-lo DEPRECATED:\n" > + " Translate host-loopback forwards to\n" > + " namespace loopback\n" > " --userns NSPATH Target user namespace to join\n" > " --netns PATH|NAME Target network namespace to join\n" > " --netns-only Don't join existing user namespace\n" > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) > {"netns-only", no_argument, NULL, 20 }, > {"map-host-loopback", required_argument, NULL, 21 }, > {"map-guest-addr", required_argument, NULL, 22 }, > + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, > { 0 }, > }; > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) > conf_nat(optarg, &c->ip4.map_guest_addr, > &c->ip6.map_guest_addr, NULL); > break; > + case 23: > + if (c->mode != MODE_PASTA) > + die("--host-lo-to-ns-lo is for pasta mode only"); > + c->host_lo_to_ns_lo = 1; > + break; > case 'd': > c->debug = 1; > c->quiet = 0; > diff --git a/fwd.c b/fwd.c > index a505098..c71f5e1 100644 > --- a/fwd.c > +++ b/fwd.c > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > /* spliceable */ > > - /* Preserve the specific loopback adddress used, but let the > - * kernel pick a source port on the target side > + /* The traffic will go over the guest's 'lo' interface, but by > + * default use its external address, so we don't inadvertently > + * expose services that listen only on the guest's loopback > + * address. That can be overridden by --host-lo-to-ns-lo which > + * will instead forward to the loopback address in the guest. > + * > + * In either case, let the kernel pick the source address to > + * match. > */ > - tgt->oaddr = ini->eaddr; > + if (inany_v4(&ini->eaddr)) { > + if (c->host_lo_to_ns_lo) > + tgt->eaddr = inany_loopback4; > + else > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > + tgt->oaddr = inany_any4; > + } else { > + if (c->host_lo_to_ns_lo) > + tgt->eaddr = inany_loopback6; > + else > + tgt->eaddr.a6 = c->ip6.addr_seen; Either this... > + tgt->oaddr = inany_any6; or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at: https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/ It hangs like this (display with >= 240 columns): ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00 ...even without strace. The client is done, the server hangs. If I unblock this manually by re-running the same client command, the server wakes up, writes the file, and terminates, and the test continues normally. Those three "received NS, sending NA" messages in the pasta pane are printed in a short time after the test starts. If I run this with TRACE=1 (which needs the patch I just sent), this is pasta's debugging output for this test: -- 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] -- Relevant parts of strace output from the client: -- openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) [...] pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "", 8192) = 0 shutdown(6, SHUT_WR) = 0 shutdown(6, SHUT_RDWR) = 0 exit_group(0) = ? +++ exited with 0 +++ -- and from the server: -- socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL -- If I connect from the host without a server in the namespace (but with the port forwarded by pasta), I get a connection reset, and if the port is not forwarded by pasta, connection refused. But this is another case: we start connecting and accept the connection (probably we shouldn't). Note the "No route to host" error on the socket. It looks somehow similar to the race I fixed with commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"), but it doesn't look like an invalid c->ip6.addr_seen, because otherwise pasta would reset the connection, I suppose. I haven't debugged further yet. This looks like an existing issue in pasta rather than in this series or in the tests, but it blocks tests, so I haven't applied this yet. -- Stefano ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-09 13:07 ` Stefano Brivio @ 2024-10-09 20:44 ` Stefano Brivio 2024-10-10 5:57 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: Stefano Brivio @ 2024-10-09 20:44 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio <sbrivio@redhat.com> wrote: > On Wed, 2 Oct 2024 15:48:26 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > In pasta mode, where addressing permits we "splice" connections, forwarding > > directly from host socket to guest/container socket without any L2 or L3 > > processing. This gives us a very large performance improvement when it's > > possible. > > > > Since the traffic is from a local socket within the guest, it will go over > > the guest's 'lo' interface, and accordingly we set the guest side address > > to be the loopback address. However this has a surprising side effect: > > sometimes guests will run services that are only supposed to be used within > > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's > > forwarding exposes those services to the host, which isn't generally what > > we want. > > > > Correct this by instead forwarding inbound "splice" flows to the guest's > > external address. > > > > Link: https://github.com/containers/podman/issues/24045 > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > --- > > conf.c | 9 +++++++++ > > fwd.c | 31 +++++++++++++++++++++++-------- > > passt.1 | 23 +++++++++++++++++++---- > > passt.h | 2 ++ > > 4 files changed, 53 insertions(+), 12 deletions(-) > > > > diff --git a/conf.c b/conf.c > > index 6e62510..b5318f3 100644 > > --- a/conf.c > > +++ b/conf.c > > @@ -908,6 +908,9 @@ pasta_opts: > > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" > > " SPEC is as described above\n" > > " default: auto\n" > > + " --host-lo-to-ns-lo DEPRECATED:\n" > > + " Translate host-loopback forwards to\n" > > + " namespace loopback\n" > > " --userns NSPATH Target user namespace to join\n" > > " --netns PATH|NAME Target network namespace to join\n" > > " --netns-only Don't join existing user namespace\n" > > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) > > {"netns-only", no_argument, NULL, 20 }, > > {"map-host-loopback", required_argument, NULL, 21 }, > > {"map-guest-addr", required_argument, NULL, 22 }, > > + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, > > { 0 }, > > }; > > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; > > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) > > conf_nat(optarg, &c->ip4.map_guest_addr, > > &c->ip6.map_guest_addr, NULL); > > break; > > + case 23: > > + if (c->mode != MODE_PASTA) > > + die("--host-lo-to-ns-lo is for pasta mode only"); > > + c->host_lo_to_ns_lo = 1; > > + break; > > case 'd': > > c->debug = 1; > > c->quiet = 0; > > diff --git a/fwd.c b/fwd.c > > index a505098..c71f5e1 100644 > > --- a/fwd.c > > +++ b/fwd.c > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at: > > https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/ > > It hangs like this (display with >= 240 columns): Ouch, sorry, it looks like saving something in claws-mail as draft and sending it later means lines will be forcefully wrapped. Here's the original test output: ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00 > ...even without strace. The client is done, the server hangs. > > If I unblock this manually by re-running the same client command, the > server wakes up, writes the file, and terminates, and the test > continues normally. > > Those three "received NS, sending NA" messages in the pasta pane are > printed in a short time after the test starts. > > If I run this with TRACE=1 (which needs the patch I just sent), this > is pasta's debugging output for this test: > > -- > 6.1401: pasta: epoll event on listening TCP socket 6 (events: > 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW > 6.1402: Flow 0 (INI): NEW -> INI > 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? > 6.1402: Flow 0 (TGT): INI -> TGT > 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 > -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection > (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): > HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> > [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): > event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection > (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): > TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST > [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> > [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun > device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA > 7.0006: pasta: epoll event on namespace timer watch 12 (events: > 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 > 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP > (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): > cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection > (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection > (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on > /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, > sending NA 8.0006: pasta: epoll event on namespace timer watch 12 > (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at > tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at > tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device > 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: > pasta: epoll event on namespace timer watch 12 (events: 0x00000001) > 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: > 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on > socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): > flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection > (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): > CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): > HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> > [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace > timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on > namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll > event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: > epoll event on namespace timer watch 12 (events: 0x00000001) [...] -- This was: 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] > Relevant parts of strace output from the client: > > -- > openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 > ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate > ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 > socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 > fcntl(6, F_SETFD, FD_CLOEXEC) = 0 > connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), > sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), > sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, > sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, > "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], > [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, > "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., > 8192) = 8192 write(6, > "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., > 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out > [6]) read(5, > "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., > 8192) = 8192 write(6, > "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., > 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out > [6]) This was: openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) > [...] > > pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) > read(5, "", 8192) = 0 > shutdown(6, SHUT_WR) = 0 > shutdown(6, SHUT_RDWR) = 0 > exit_group(0) = ? > +++ exited with 0 +++ > -- > > and from the server: > > -- > socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 > fcntl(6, F_SETFD, FD_CLOEXEC) = 0 > setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), > sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), > sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 > getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), > sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), > sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL -- And this was: socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL > If I connect from the host without a server in the namespace (but > with the port forwarded by pasta), I get a connection reset, and > if the port is not forwarded by pasta, connection refused. > > But this is another case: we start connecting and accept the > connection (probably we shouldn't). Note the "No route to host" > error on the socket. > > It looks somehow similar to the race I fixed with commit > f4e9f26480ef ("pasta: Disable neighbour solicitations on device > up to prevent DAD"), but it doesn't look like an invalid > c->ip6.addr_seen, because otherwise pasta would reset the > connection, I suppose. > > I haven't debugged further yet. This looks like an existing > issue in pasta rather than in this series or in the tests, > but it blocks tests, so I haven't applied this yet. -- Stefano ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-09 20:44 ` Stefano Brivio @ 2024-10-10 5:57 ` David Gibson 2024-10-16 3:15 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-10 5:57 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 5120 bytes --] On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio <sbrivio@redhat.com> wrote: > > > On Wed, 2 Oct 2024 15:48:26 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > In pasta mode, where addressing permits we "splice" connections, forwarding > > > directly from host socket to guest/container socket without any L2 or L3 > > > processing. This gives us a very large performance improvement when it's > > > possible. > > > > > > Since the traffic is from a local socket within the guest, it will go over > > > the guest's 'lo' interface, and accordingly we set the guest side address > > > to be the loopback address. However this has a surprising side effect: > > > sometimes guests will run services that are only supposed to be used within > > > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's > > > forwarding exposes those services to the host, which isn't generally what > > > we want. > > > > > > Correct this by instead forwarding inbound "splice" flows to the guest's > > > external address. > > > > > > Link: https://github.com/containers/podman/issues/24045 > > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > > --- > > > conf.c | 9 +++++++++ > > > fwd.c | 31 +++++++++++++++++++++++-------- > > > passt.1 | 23 +++++++++++++++++++---- > > > passt.h | 2 ++ > > > 4 files changed, 53 insertions(+), 12 deletions(-) > > > > > > diff --git a/conf.c b/conf.c > > > index 6e62510..b5318f3 100644 > > > --- a/conf.c > > > +++ b/conf.c > > > @@ -908,6 +908,9 @@ pasta_opts: > > > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" > > > " SPEC is as described above\n" > > > " default: auto\n" > > > + " --host-lo-to-ns-lo DEPRECATED:\n" > > > + " Translate host-loopback forwards to\n" > > > + " namespace loopback\n" > > > " --userns NSPATH Target user namespace to join\n" > > > " --netns PATH|NAME Target network namespace to join\n" > > > " --netns-only Don't join existing user namespace\n" > > > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) > > > {"netns-only", no_argument, NULL, 20 }, > > > {"map-host-loopback", required_argument, NULL, 21 }, > > > {"map-guest-addr", required_argument, NULL, 22 }, > > > + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, > > > { 0 }, > > > }; > > > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; > > > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) > > > conf_nat(optarg, &c->ip4.map_guest_addr, > > > &c->ip6.map_guest_addr, NULL); > > > break; > > > + case 23: > > > + if (c->mode != MODE_PASTA) > > > + die("--host-lo-to-ns-lo is for pasta mode only"); > > > + c->host_lo_to_ns_lo = 1; > > > + break; > > > case 'd': > > > c->debug = 1; > > > c->quiet = 0; > > > diff --git a/fwd.c b/fwd.c > > > index a505098..c71f5e1 100644 > > > --- a/fwd.c > > > +++ b/fwd.c > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > /* spliceable */ > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > - * kernel pick a source port on the target side > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > + * default use its external address, so we don't inadvertently > > > + * expose services that listen only on the guest's loopback > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > + * will instead forward to the loopback address in the guest. > > > + * > > > + * In either case, let the kernel pick the source address to > > > + * match. > > > */ > > > - tgt->oaddr = ini->eaddr; > > > + if (inany_v4(&ini->eaddr)) { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback4; > > > + else > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > + tgt->oaddr = inany_any4; > > > + } else { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback6; > > > + else > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > Either this... > > > > > + tgt->oaddr = inany_any6; > > > > or this (and not something before this patch, up to 3/4) make the > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > sometimes (about one in three/four runs), that's what I mistakenly > > reported as coming from Laurent's series at: Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-10 5:57 ` David Gibson @ 2024-10-16 3:15 ` David Gibson 2024-10-16 5:46 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-16 3:15 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 4514 bytes --] On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > On Wed, 9 Oct 2024 15:07:21 +0200 > > Stefano Brivio <sbrivio@redhat.com> wrote: [snip] > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > /* spliceable */ > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > - * kernel pick a source port on the target side > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > + * default use its external address, so we don't inadvertently > > > > + * expose services that listen only on the guest's loopback > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > + * will instead forward to the loopback address in the guest. > > > > + * > > > > + * In either case, let the kernel pick the source address to > > > > + * match. > > > > */ > > > > - tgt->oaddr = ini->eaddr; > > > > + if (inany_v4(&ini->eaddr)) { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback4; > > > > + else > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > + tgt->oaddr = inany_any4; > > > > + } else { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback6; > > > > + else > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > Either this... > > > > > > > + tgt->oaddr = inany_any6; > > > > > > or this (and not something before this patch, up to 3/4) make the > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > sometimes (about one in three/four runs), that's what I mistakenly > > > reported as coming from Laurent's series at: > > Huh, interesting. Just got back from my leave and ran that group of > tests in a loop this afternoon, but didn't manage to reproduce. I > have administrivia that will probably fill the rest of this week, but > I'll look into this as soon as I can. I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take: Option 1: Answer everything apart from specific exceptions ======== Basically we'd explicitly exclude the guest/container's assigned address but continue answering everything else. This is a bit ugly and means we'd probably still have a similar problem if the guest just picks an address instead of taking the assigned one. On the other hand it means things will work in at least some cases where the guest tries to contact remote hosts directly instead of via the default gateway. Option 2: Only answer for specific addresses ========= Reverse the logic above, usually *don't* answer NS queries, unless they have a specific address that we know is our responsibility. That would be, afaict, the gateway address, the "our_tap" addresses and the -map--host-loopback and --map-guest-addr addresses (usually all the same). This might be less robust against certain guests that don't use a default gateway when we expect. I guess that could happen when we have non-gateway routes copied from the host, so we'd have to handle that too. It would remove one more barrier towards having multiple bridged guests behind a single passt/pasta instance (they can't reasonably communicate with each other if passt answers all their NS / ARP requests). -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-16 3:15 ` David Gibson @ 2024-10-16 5:46 ` David Gibson 2024-10-16 8:39 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-16 5:46 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 4152 bytes --] On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > Stefano Brivio <sbrivio@redhat.com> wrote: > [snip] > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > /* spliceable */ > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > - * kernel pick a source port on the target side > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > + * default use its external address, so we don't inadvertently > > > > > + * expose services that listen only on the guest's loopback > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > + * will instead forward to the loopback address in the guest. > > > > > + * > > > > > + * In either case, let the kernel pick the source address to > > > > > + * match. > > > > > */ > > > > > - tgt->oaddr = ini->eaddr; > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > + if (c->host_lo_to_ns_lo) > > > > > + tgt->eaddr = inany_loopback4; > > > > > + else > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > + tgt->oaddr = inany_any4; > > > > > + } else { > > > > > + if (c->host_lo_to_ns_lo) > > > > > + tgt->eaddr = inany_loopback6; > > > > > + else > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > Either this... > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > reported as coming from Laurent's series at: > > > > Huh, interesting. Just got back from my leave and ran that group of > > tests in a loop this afternoon, but didn't manage to reproduce. I > > have administrivia that will probably fill the rest of this week, but > > I'll look into this as soon as I can. > > I reproduced the problem on passt.top, and I have a partial idea > what's going on. As you say it's seeming like the address (addr_seen > == addr in this case) isn't properly ready. This is over splice, but > on the tap interface, I see the container sending NS messages for its > own address - seems like it's doing DAD. But more importantly, we're > answering those NS messages with NA messages, because we answer all > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > ever worked at all. --config-net makes sense, since we disable DAD, > but our test suite has always been using NDP+DHCP instead of > --config-net. > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > happens most of the time and for IPv4 via ARP, which is used much more > rarely. I think we need to be more selective in what NS or ARP > lookups we resopnd to. The question is what approach to take: Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-16 5:46 ` David Gibson @ 2024-10-16 8:39 ` David Gibson 2024-10-16 15:26 ` Stefano Brivio 0 siblings, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-16 8:39 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 5390 bytes --] On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > [snip] > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > /* spliceable */ > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > - * kernel pick a source port on the target side > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > + * default use its external address, so we don't inadvertently > > > > > > + * expose services that listen only on the guest's loopback > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > + * > > > > > > + * In either case, let the kernel pick the source address to > > > > > > + * match. > > > > > > */ > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > + else > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > + tgt->oaddr = inany_any4; > > > > > > + } else { > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > + else > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > Either this... > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > reported as coming from Laurent's series at: > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > have administrivia that will probably fill the rest of this week, but > > > I'll look into this as soon as I can. > > > > I reproduced the problem on passt.top, and I have a partial idea > > what's going on. As you say it's seeming like the address (addr_seen > > == addr in this case) isn't properly ready. This is over splice, but > > on the tap interface, I see the container sending NS messages for its > > own address - seems like it's doing DAD. But more importantly, we're > > answering those NS messages with NA messages, because we answer all > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > ever worked at all. --config-net makes sense, since we disable DAD, > > but our test suite has always been using NDP+DHCP instead of > > --config-net. > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > happens most of the time and for IPv4 via ARP, which is used much more > > rarely. I think we need to be more selective in what NS or ARP > > lookups we resopnd to. The question is what approach to take: > > Hmm... no.. there's more to this. > > Usually DAD requests have :: as the source address, and we *do* > exclude those from getting replies. In this case though, we're > getting NS requests for the assigned address from what looks like the > SLAAC address. So, I do think it would be wise to explicitly exclude > these: we shouldn't be giving NA responses for an address that ought > to belong to the guest, even if it doesn't look like a DAD. > > But, I'm not sure what's triggering this. Is for some reason the DHCP > address not "taking", so the container is trying to locate it on the > network instead? Or _is_ this DAD, but under some circumstances > rather than using :: as the source address it uses another configured > address. Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative). There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs. I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue. I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/ -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-16 8:39 ` David Gibson @ 2024-10-16 15:26 ` Stefano Brivio 2024-10-17 1:19 ` David Gibson 2024-10-17 5:06 ` David Gibson 0 siblings, 2 replies; 16+ messages in thread From: Stefano Brivio @ 2024-10-16 15:26 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > [snip] > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > > /* spliceable */ > > > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > > - * kernel pick a source port on the target side > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > > + * default use its external address, so we don't inadvertently > > > > > > > + * expose services that listen only on the guest's loopback > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > > + * > > > > > > > + * In either case, let the kernel pick the source address to > > > > > > > + * match. > > > > > > > */ > > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > > + else > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > > + tgt->oaddr = inany_any4; > > > > > > > + } else { > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > > + else > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > > > Either this... > > > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > > reported as coming from Laurent's series at: > > > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > > have administrivia that will probably fill the rest of this week, but > > > > I'll look into this as soon as I can. > > > > > > I reproduced the problem on passt.top, and I have a partial idea > > > what's going on. As you say it's seeming like the address (addr_seen > > > == addr in this case) isn't properly ready. This is over splice, but > > > on the tap interface, I see the container sending NS messages for its > > > own address - seems like it's doing DAD. But more importantly, we're > > > answering those NS messages with NA messages, because we answer all > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > > ever worked at all. --config-net makes sense, since we disable DAD, > > > but our test suite has always been using NDP+DHCP instead of > > > --config-net. > > > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > > happens most of the time and for IPv4 via ARP, which is used much more > > > rarely. I think we need to be more selective in what NS or ARP > > > lookups we resopnd to. The question is what approach to take: > > > > Hmm... no.. there's more to this. > > > > Usually DAD requests have :: as the source address, and we *do* > > exclude those from getting replies. In this case though, we're > > getting NS requests for the assigned address from what looks like the > > SLAAC address. So, I do think it would be wise to explicitly exclude > > these: we shouldn't be giving NA responses for an address that ought > > to belong to the guest, even if it doesn't look like a DAD. > > > > But, I'm not sure what's triggering this. Is for some reason the DHCP > > address not "taking", so the container is trying to locate it on the > > network instead? Or _is_ this DAD, but under some circumstances > > rather than using :: as the source address it uses another configured > > address. > > Ok.. I've understood a bit more. While timing is a factor here, it > looks like the main reason I wasn't seeing it on my machine is what > I'd consider a bug in the Debian version of the dhclient-script: > when adding an IPv6 address, it returns without waiting for DAD to > complete (i.e. for the address to be non-tentative). Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them. By the way, I guess it's just an issue for test scripts like this one. > There's also an additional bug, which doesn't cause this problem, I > think, but caused some problems when I was investigating. DHCPv6 > needs the link-local SLAAC address already configured and > non-tentative. The Fedora dhclient-script waits for that too at the > PREINIT6 stage, but the Debian one doesn't, meaning if you attempt > dhclient -6 immediately after starting the namespace it will fail to > bind the UDP address it needs. Right, and that's fine for us because we have a 2-second delay after SLAAC. This looks to me a bit more like a real bug, but again, there might be many other reasons why dhclient can't use a link-local address. One might argue that it's the responsibility of the user/caller to invoke dhclient when appropriate. In that sense, you might be wondering why there's a 2-second delay after SLAAC, but no delay after invoking dhclient -6: the reason is that I was convinced that one wouldn't need DAD once a DHCPv6 client configures an address. The server is already checking that, I thought. Well, no. RFC 8415 18.2.10.1: https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1 says: If the client can operate with the addresses and/or prefixes obtained from the server: [...] - The client MUST perform duplicate address detection as per Section 5.4 of [RFC4862], which does list some exceptions, on each of the received addresses in any IAs on which it has not performed duplicate address detection during processing of any of the previous Reply messages from the server. The client performs the duplicate address detection before using the received addresses for any traffic. If any of the addresses are found to be in use on the link, the client sends a Decline message to the server for those addresses as described in Section 18.2.8. > I still think it's a good idea not to give NA messages for the guest > assigned address, but we'll need a different workaround for this > issue. I read the rest of your reasoning about it, but the nice thing of the current behaviour (and that's why I added that single check on the source address in ndp()) is that the guest can really use whatever address it wants, regardless of what we tried to configure, and we'll resolve any other address. If we receive a neighbour solicitation for the guest assigned address, and the source address is not unspecified, it means that the guest is _not_ using the assigned address, and it's actually trying to reach it. > I guess we'll have to manually wait for DAD to complete in the > DHCP tests, which will be kind of mucky :/ Alternatively, we could use the same trick as added by commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"): disable neighbour solicitations, run dhclient -6, set 'nodad' on the address, and re-enable neighbour solicitations. This works for me: -- diff --git a/test/pasta/dhcp b/test/pasta/dhcp index 41556b8..76aa723 100644 --- a/test/pasta/dhcp +++ b/test/pasta/dhcp @@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu' check [ __MTU__ = 65520 ] test DHCPv6: address +ns ip link set dev __IFNAME__ arp off ns /sbin/dhclient -6 --no-pid __IFNAME__ hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]' +ns ip addr change __ADDR6__ dev __IFNAME__ nodad +ns ip link set dev __IFNAME__ arp on hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]' check [ __ADDR6__ = __HOST_ADDR6__ ] -- Adding 2-second delays as we have them for NDP doesn't look that bad: $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__ given that we don't need it on dhclient -x, tests would take about 12 seconds longer. Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC: $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2 and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach. -- @@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu' check [ __MTU__ = 65520 ] test DHCPv6: address +ns ip link set dev __IFNAME__ arp off ns /sbin/dhclient -6 --no-pid __IFNAME__ hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]' +ns ip addr change __ADDR6__ dev __IFNAME__ nodad +ns ip link set dev __IFNAME__ arp on hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]' check [ __ADDR6__ = __HOST_ADDR6__ ] -- Adding 2-second delays as we have them for NDP doesn't look that bad: $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__ given that we don't need it on dhclient -x, tests would take about 12 seconds longer. Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC: $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2 and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach. -- Stefano ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-16 15:26 ` Stefano Brivio @ 2024-10-17 1:19 ` David Gibson 2024-10-17 8:31 ` Stefano Brivio 2024-10-17 5:06 ` David Gibson 1 sibling, 1 reply; 16+ messages in thread From: David Gibson @ 2024-10-17 1:19 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 10052 bytes --] On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: > On Wed, 16 Oct 2024 19:39:40 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > > [snip] > > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > > > /* spliceable */ > > > > > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > > > - * kernel pick a source port on the target side > > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > > > + * default use its external address, so we don't inadvertently > > > > > > > > + * expose services that listen only on the guest's loopback > > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > > > + * > > > > > > > > + * In either case, let the kernel pick the source address to > > > > > > > > + * match. > > > > > > > > */ > > > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > > > + else > > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > > > + tgt->oaddr = inany_any4; > > > > > > > > + } else { > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > > > + else > > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > > > > > Either this... > > > > > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > > > reported as coming from Laurent's series at: > > > > > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > > > have administrivia that will probably fill the rest of this week, but > > > > > I'll look into this as soon as I can. > > > > > > > > I reproduced the problem on passt.top, and I have a partial idea > > > > what's going on. As you say it's seeming like the address (addr_seen > > > > == addr in this case) isn't properly ready. This is over splice, but > > > > on the tap interface, I see the container sending NS messages for its > > > > own address - seems like it's doing DAD. But more importantly, we're > > > > answering those NS messages with NA messages, because we answer all > > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > > > ever worked at all. --config-net makes sense, since we disable DAD, > > > > but our test suite has always been using NDP+DHCP instead of > > > > --config-net. > > > > > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > > > happens most of the time and for IPv4 via ARP, which is used much more > > > > rarely. I think we need to be more selective in what NS or ARP > > > > lookups we resopnd to. The question is what approach to take: > > > > > > Hmm... no.. there's more to this. > > > > > > Usually DAD requests have :: as the source address, and we *do* > > > exclude those from getting replies. In this case though, we're > > > getting NS requests for the assigned address from what looks like the > > > SLAAC address. So, I do think it would be wise to explicitly exclude > > > these: we shouldn't be giving NA responses for an address that ought > > > to belong to the guest, even if it doesn't look like a DAD. > > > > > > But, I'm not sure what's triggering this. Is for some reason the DHCP > > > address not "taking", so the container is trying to locate it on the > > > network instead? Or _is_ this DAD, but under some circumstances > > > rather than using :: as the source address it uses another configured > > > address. > > > > Ok.. I've understood a bit more. While timing is a factor here, it > > looks like the main reason I wasn't seeing it on my machine is what > > I'd consider a bug in the Debian version of the dhclient-script: > > when adding an IPv6 address, it returns without waiting for DAD to > > complete (i.e. for the address to be non-tentative). > > Oops. On one hand, I would feel inclined to propose a fix for the > Debian and Ubuntu packages. On the other hand, I wonder if it's > universally considered a bug: the DHCPv6 client did its job at that > point, and it's debatable whether dhclient should wait for the address > to be usable before forking to background. > > That is, arguably, the job of dhclient's is to request and configure an > address. It's not a network configuration daemon. There might be many > other reasons why that address is unusable, and yet dhclient is not > responsible for them. Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231 > By the way, I guess it's just an issue for test scripts like this one. Why do you guess that? > > There's also an additional bug, which doesn't cause this problem, I > > think, but caused some problems when I was investigating. DHCPv6 > > needs the link-local SLAAC address already configured and > > non-tentative. The Fedora dhclient-script waits for that too at the > > PREINIT6 stage, but the Debian one doesn't, meaning if you attempt > > dhclient -6 immediately after starting the namespace it will fail to > > bind the UDP address it needs. > > Right, and that's fine for us because we have a 2-second delay after > SLAAC. This looks to me a bit more like a real bug, but again, there > might be many other reasons why dhclient can't use a link-local > address. One might argue that it's the responsibility of the > user/caller to invoke dhclient when appropriate. Here I think it's a much clearer argument that it's a real bug. We play fast and loose with it for mbuto, but dhclient can typically be called on an interface that isn't even up: the PREINIT/PREINIT6 script actions are specifically for this, they'll bring the interface up ready for the client to do its thing. I'd say the script is failing to do its job for PREINIT6 if there isn't a usable link-local address at the end. I filed a report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085229 > In that sense, you might be wondering why there's a 2-second delay > after SLAAC, but no delay after invoking dhclient -6: the reason is > that I was convinced that one wouldn't need DAD once a DHCPv6 client > configures an address. The server is already checking that, I thought. > > Well, no. RFC 8415 18.2.10.1: > > https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1 > > says: > > If the client can operate with the addresses and/or prefixes obtained > from the server: > > [...] > > - The client MUST perform duplicate address detection as per > Section 5.4 of [RFC4862], which does list some exceptions, on each > of the received addresses in any IAs on which it has not performed > duplicate address detection during processing of any of the > previous Reply messages from the server. The client performs the > duplicate address detection before using the received addresses > for any traffic. If any of the addresses are found to be in use > on the link, the client sends a Decline message to the server for > those addresses as described in Section 18.2.8. Indeed. > > I still think it's a good idea not to give NA messages for the guest > > assigned address, but we'll need a different workaround for this > > issue. > > I read the rest of your reasoning about it, but the nice thing of the > current behaviour (and that's why I added that single check on the > source address in ndp()) is that the guest can really use whatever > address it wants, regardless of what we tried to configure, and we'll > resolve any other address. Hrm. I suppose. Fwiw we already make the equivalent exclusion for ARP > If we receive a neighbour solicitation for the guest assigned address, > and the source address is not unspecified, it means that the guest is > _not_ using the assigned address, and it's actually trying to reach it. > > > I guess we'll have to manually wait for DAD to complete in the > > DHCP tests, which will be kind of mucky :/ > > Alternatively, we could use the same trick as added by commit > f4e9f26480ef ("pasta: Disable neighbour solicitations on device up > to prevent DAD"): disable neighbour solicitations, run dhclient -6, > set 'nodad' on the address, and re-enable neighbour solicitations. > > This works for me: Ok. More complex, but faster, I guess. I'll try implementing this. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-17 1:19 ` David Gibson @ 2024-10-17 8:31 ` Stefano Brivio 2024-10-21 1:35 ` David Gibson 0 siblings, 1 reply; 16+ messages in thread From: Stefano Brivio @ 2024-10-17 8:31 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Thu, 17 Oct 2024 12:19:58 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: > > On Wed, 16 Oct 2024 19:39:40 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > > > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > > > [snip] > > > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > > > > /* spliceable */ > > > > > > > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > > > > - * kernel pick a source port on the target side > > > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > > > > + * default use its external address, so we don't inadvertently > > > > > > > > > + * expose services that listen only on the guest's loopback > > > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > > > > + * > > > > > > > > > + * In either case, let the kernel pick the source address to > > > > > > > > > + * match. > > > > > > > > > */ > > > > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > > > > + else > > > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > > > > + tgt->oaddr = inany_any4; > > > > > > > > > + } else { > > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > > > > + else > > > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > > > > > > > Either this... > > > > > > > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > > > > reported as coming from Laurent's series at: > > > > > > > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > > > > have administrivia that will probably fill the rest of this week, but > > > > > > I'll look into this as soon as I can. > > > > > > > > > > I reproduced the problem on passt.top, and I have a partial idea > > > > > what's going on. As you say it's seeming like the address (addr_seen > > > > > == addr in this case) isn't properly ready. This is over splice, but > > > > > on the tap interface, I see the container sending NS messages for its > > > > > own address - seems like it's doing DAD. But more importantly, we're > > > > > answering those NS messages with NA messages, because we answer all > > > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > > > > ever worked at all. --config-net makes sense, since we disable DAD, > > > > > but our test suite has always been using NDP+DHCP instead of > > > > > --config-net. > > > > > > > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > > > > happens most of the time and for IPv4 via ARP, which is used much more > > > > > rarely. I think we need to be more selective in what NS or ARP > > > > > lookups we resopnd to. The question is what approach to take: > > > > > > > > Hmm... no.. there's more to this. > > > > > > > > Usually DAD requests have :: as the source address, and we *do* > > > > exclude those from getting replies. In this case though, we're > > > > getting NS requests for the assigned address from what looks like the > > > > SLAAC address. So, I do think it would be wise to explicitly exclude > > > > these: we shouldn't be giving NA responses for an address that ought > > > > to belong to the guest, even if it doesn't look like a DAD. > > > > > > > > But, I'm not sure what's triggering this. Is for some reason the DHCP > > > > address not "taking", so the container is trying to locate it on the > > > > network instead? Or _is_ this DAD, but under some circumstances > > > > rather than using :: as the source address it uses another configured > > > > address. > > > > > > Ok.. I've understood a bit more. While timing is a factor here, it > > > looks like the main reason I wasn't seeing it on my machine is what > > > I'd consider a bug in the Debian version of the dhclient-script: > > > when adding an IPv6 address, it returns without waiting for DAD to > > > complete (i.e. for the address to be non-tentative). > > > > Oops. On one hand, I would feel inclined to propose a fix for the > > Debian and Ubuntu packages. On the other hand, I wonder if it's > > universally considered a bug: the DHCPv6 client did its job at that > > point, and it's debatable whether dhclient should wait for the address > > to be usable before forking to background. > > > > That is, arguably, the job of dhclient's is to request and configure an > > address. It's not a network configuration daemon. There might be many > > other reasons why that address is unusable, and yet dhclient is not > > responsible for them. > > Hrm... I guess. Counterpoints.. > - Most other failures to get a usable address will result in a > visible error > - dhclient has a --dad-wait-time option which seems to imply that the > script should wait for DAD > - The upstream script version waits for DAD > > In any case I filed a report for it > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231 > > > By the way, I guess it's just an issue for test scripts like this one. > > Why do you guess that? Because it's kind of rare that your address changes if you use DHCPv6, I guess, so this would be relevant almost exclusively at boot. And, at boot, if a remote peer/client happens to try to connect to the machine where the client is running right after an address was assigned, it must have a retry mechanism almost for sure. -- Stefano ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-17 8:31 ` Stefano Brivio @ 2024-10-21 1:35 ` David Gibson 0 siblings, 0 replies; 16+ messages in thread From: David Gibson @ 2024-10-21 1:35 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 7407 bytes --] On Thu, Oct 17, 2024 at 10:31:22AM +0200, Stefano Brivio wrote: > On Thu, 17 Oct 2024 12:19:58 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: > > > On Wed, 16 Oct 2024 19:39:40 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > > > > > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > > > > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > > > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote: > > > > > > [snip] > > > > > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > > > > > /* spliceable */ > > > > > > > > > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > > > > > - * kernel pick a source port on the target side > > > > > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > > > > > + * default use its external address, so we don't inadvertently > > > > > > > > > > + * expose services that listen only on the guest's loopback > > > > > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > > > > > + * > > > > > > > > > > + * In either case, let the kernel pick the source address to > > > > > > > > > > + * match. > > > > > > > > > > */ > > > > > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > > > > > + else > > > > > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > > > > > + tgt->oaddr = inany_any4; > > > > > > > > > > + } else { > > > > > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > > > > > + else > > > > > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > > > > > > > > > Either this... > > > > > > > > > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > > > > > reported as coming from Laurent's series at: > > > > > > > > > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > > > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > > > > > have administrivia that will probably fill the rest of this week, but > > > > > > > I'll look into this as soon as I can. > > > > > > > > > > > > I reproduced the problem on passt.top, and I have a partial idea > > > > > > what's going on. As you say it's seeming like the address (addr_seen > > > > > > == addr in this case) isn't properly ready. This is over splice, but > > > > > > on the tap interface, I see the container sending NS messages for its > > > > > > own address - seems like it's doing DAD. But more importantly, we're > > > > > > answering those NS messages with NA messages, because we answer all > > > > > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > > > > > ever worked at all. --config-net makes sense, since we disable DAD, > > > > > > but our test suite has always been using NDP+DHCP instead of > > > > > > --config-net. > > > > > > > > > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > > > > > happens most of the time and for IPv4 via ARP, which is used much more > > > > > > rarely. I think we need to be more selective in what NS or ARP > > > > > > lookups we resopnd to. The question is what approach to take: > > > > > > > > > > Hmm... no.. there's more to this. > > > > > > > > > > Usually DAD requests have :: as the source address, and we *do* > > > > > exclude those from getting replies. In this case though, we're > > > > > getting NS requests for the assigned address from what looks like the > > > > > SLAAC address. So, I do think it would be wise to explicitly exclude > > > > > these: we shouldn't be giving NA responses for an address that ought > > > > > to belong to the guest, even if it doesn't look like a DAD. > > > > > > > > > > But, I'm not sure what's triggering this. Is for some reason the DHCP > > > > > address not "taking", so the container is trying to locate it on the > > > > > network instead? Or _is_ this DAD, but under some circumstances > > > > > rather than using :: as the source address it uses another configured > > > > > address. > > > > > > > > Ok.. I've understood a bit more. While timing is a factor here, it > > > > looks like the main reason I wasn't seeing it on my machine is what > > > > I'd consider a bug in the Debian version of the dhclient-script: > > > > when adding an IPv6 address, it returns without waiting for DAD to > > > > complete (i.e. for the address to be non-tentative). > > > > > > Oops. On one hand, I would feel inclined to propose a fix for the > > > Debian and Ubuntu packages. On the other hand, I wonder if it's > > > universally considered a bug: the DHCPv6 client did its job at that > > > point, and it's debatable whether dhclient should wait for the address > > > to be usable before forking to background. > > > > > > That is, arguably, the job of dhclient's is to request and configure an > > > address. It's not a network configuration daemon. There might be many > > > other reasons why that address is unusable, and yet dhclient is not > > > responsible for them. > > > > Hrm... I guess. Counterpoints.. > > - Most other failures to get a usable address will result in a > > visible error > > - dhclient has a --dad-wait-time option which seems to imply that the > > script should wait for DAD > > - The upstream script version waits for DAD > > > > In any case I filed a report for it > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231 > > > > > By the way, I guess it's just an issue for test scripts like this one. > > > > Why do you guess that? > > Because it's kind of rare that your address changes if you use DHCPv6, > I guess, so this would be relevant almost exclusively at boot. Hm, true. I was thinking of ephemeral containers where "boot" could be quite common.. but those will most likely use --config-net. > And, at boot, if a remote peer/client happens to try to connect to the > machine where the client is running right after an address was > assigned, it must have a retry mechanism almost for sure. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address 2024-10-16 15:26 ` Stefano Brivio 2024-10-17 1:19 ` David Gibson @ 2024-10-17 5:06 ` David Gibson 1 sibling, 0 replies; 16+ messages in thread From: David Gibson @ 2024-10-17 5:06 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 2022 bytes --] On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: [snip] > Adding 2-second delays as we have them for NDP doesn't look that bad: > $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns > /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest > /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 > __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x > perf/passt_tcp:118:guest dhclient -6 __IFNAME__ > two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ > two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__ > given that we don't need it on dhclient -x, tests would take about > 12 seconds longer. > Or we could switch to the arp off / nodad / arp on approach for > everything, including SLAAC: > $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro > --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" > pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 > dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set > dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2 > and save slightly less than 8 seconds. If you ask me, I would have a > slight preference for the nodad approach. Much as I don't like making the tests slower, I don't think the nodad approach is a good idea. It's fine if we think of the NDP/DHCP only as setting things up for the transfer tests. But they also serve the purpose of testing the NDP and DHCP paths themselves. For that purpose we should perform a full "normal" configuration, including DAD. [e.g. if we really did manage to break our NS responses so that we totally broke DAD, we'd want our tests to catch that]. I have a draft implementation of explicitly waiting for DAD to complete both for SLAAC and for DHCPv6, patches coming shortly. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-10-21 1:39 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-10-02 5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson 2024-10-02 5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson 2024-10-02 5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson 2024-10-02 5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson 2024-10-02 5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson 2024-10-09 13:07 ` Stefano Brivio 2024-10-09 20:44 ` Stefano Brivio 2024-10-10 5:57 ` David Gibson 2024-10-16 3:15 ` David Gibson 2024-10-16 5:46 ` David Gibson 2024-10-16 8:39 ` David Gibson 2024-10-16 15:26 ` Stefano Brivio 2024-10-17 1:19 ` David Gibson 2024-10-17 8:31 ` Stefano Brivio 2024-10-21 1:35 ` David Gibson 2024-10-17 5:06 ` David Gibson
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).