public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
Date: Wed, 9 Oct 2024 15:07:21 +0200
Message-ID: <20241009150721.63af48f6@elisabeth>
In-Reply-To: <20241002054826.1812844-5-david@gibson.dropbear.id.au>

On Wed,  2 Oct 2024 15:48:26 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> In pasta mode, where addressing permits we "splice" connections, forwarding
> directly from host socket to guest/container socket without any L2 or L3
> processing.  This gives us a very large performance improvement when it's
> possible.
> 
> Since the traffic is from a local socket within the guest, it will go over
> the guest's 'lo' interface, and accordingly we set the guest side address
> to be the loopback address.  However this has a surprising side effect:
> sometimes guests will run services that are only supposed to be used within
> the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
> forwarding exposes those services to the host, which isn't generally what
> we want.
> 
> Correct this by instead forwarding inbound "splice" flows to the guest's
> external address.
> 
> Link: https://github.com/containers/podman/issues/24045
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c  |  9 +++++++++
>  fwd.c   | 31 +++++++++++++++++++++++--------
>  passt.1 | 23 +++++++++++++++++++----
>  passt.h |  2 ++
>  4 files changed, 53 insertions(+), 12 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 6e62510..b5318f3 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -908,6 +908,9 @@ pasta_opts:
>  		"  -U, --udp-ns SPEC	UDP port forwarding to init namespace\n"
>  		"    SPEC is as described above\n"
>  		"    default: auto\n"
> +		"  --host-lo-to-ns-lo	DEPRECATED:\n"
> +		"			Translate host-loopback forwards to\n"
> +		"			namespace loopback\n"
>  		"  --userns NSPATH 	Target user namespace to join\n"
>  		"  --netns PATH|NAME	Target network namespace to join\n"
>  		"  --netns-only		Don't join existing user namespace\n"
> @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"netns-only",	no_argument,		NULL,		20 },
>  		{"map-host-loopback", required_argument, NULL,		21 },
>  		{"map-guest-addr", required_argument,	NULL,		22 },
> +		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
>  			conf_nat(optarg, &c->ip4.map_guest_addr,
>  				 &c->ip6.map_guest_addr, NULL);
>  			break;
> +		case 23:
> +			if (c->mode != MODE_PASTA)
> +				die("--host-lo-to-ns-lo is for pasta mode only");
> +			c->host_lo_to_ns_lo = 1;
> +			break;
>  		case 'd':
>  			c->debug = 1;
>  			c->quiet = 0;
> diff --git a/fwd.c b/fwd.c
> index a505098..c71f5e1 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>  	    (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
>  		/* spliceable */
>  
> -		/* Preserve the specific loopback adddress used, but let the
> -		 * kernel pick a source port on the target side
> +		/* The traffic will go over the guest's 'lo' interface, but by
> +		 * default use its external address, so we don't inadvertently
> +		 * expose services that listen only on the guest's loopback
> +		 * address.  That can be overridden by --host-lo-to-ns-lo which
> +		 * will instead forward to the loopback address in the guest.
> +		 *
> +		 * In either case, let the kernel pick the source address to
> +		 * match.
>  		 */
> -		tgt->oaddr = ini->eaddr;
> +		if (inany_v4(&ini->eaddr)) {
> +			if (c->host_lo_to_ns_lo)
> +				tgt->eaddr = inany_loopback4;
> +			else
> +				tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> +			tgt->oaddr = inany_any4;
> +		} else {
> +			if (c->host_lo_to_ns_lo)
> +				tgt->eaddr = inany_loopback6;
> +			else
> +				tgt->eaddr.a6 = c->ip6.addr_seen;

Either this...

> +			tgt->oaddr = inany_any6;

or this (and nothing before this patch, that is, up to 3/4) makes the
"TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang
sometimes (about one in three or four runs). That's what I mistakenly
reported as coming from Laurent's series at:

  https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/

It hangs like this (four-pane test display, listed here pane by pane):

namespace pane:

  ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local'
  88.198.0.164
  ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
  88.198.0.161
  ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu'
  65520
  ns$ /sbin/dhclient -6 --no-pid enp9s0
  ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]'
  2a01:4f8:222:904::2
  ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
  fe80::1
  ns$ which socat ip jq >/dev/null
  ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc
  ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003
  ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
  88.198.0.161
  ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003
  ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc
  ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003
  ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
  88.198.0.161
  ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003
  ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace

test log pane ("pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer"):

  ...passed.

  Starting test: TCP/IPv4: ns to host (spliced): big transfer
  ? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
  ...passed.

  Starting test: TCP/IPv4: ns to host (via tap): big transfer
  ? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
  ...passed.

  Starting test: TCP/IPv4: host to ns (spliced): small transfer
  ? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin
  ...passed.

  Starting test: TCP/IPv4: ns to host (spliced): small transfer
  ? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
  ...passed.

  Starting test: TCP/IPv4: ns to host (via tap): small transfer
  ? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
  ...passed.

  Starting test: TCP/IPv6: host to ns (spliced): big transfer

host pane:

  host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
  fe80::1
  host$ which ip jq >/dev/null
  host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local'
  88.198.0.164
  host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
  88.198.0.161
  host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
  enp9s0
  host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
  2a01:4f8:222:904::2
  host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
  fe80::1
  host$ which socat ip jq >/dev/null
  host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002
  host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc
  host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc
  host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002
  host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc
  host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc
  host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace
  host$

pasta pane:

      router: 88.198.0.161
  DNS:
      185.12.64.1
      185.12.64.2
      NAT to host ::1: fe80::1
  NDP/DHCPv6:
      assign: 2a01:4f8:222:904::2
      router: fe80::1
      our link-local: fe80::1
  DNS:
      2a01:4ff:ff00::add:2
      2a01:4ff:ff00::add:1
  NDP: received RS, sending RA
  DHCP: offer to discover
      from 1e:48:6f:6e:b6:50
  DHCP: ack to request
      from 1e:48:6f:6e:b6:50
  DHCPv6: received SOLICIT, sending ADVERTISE
  DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY
  NDP: received NS, sending NA
  NDP: received NS, sending NA
  NDP: received NS, sending NA

status bar:

  Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address
  PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00

...even without strace. The client is done, the server hangs.

If I unblock this manually by re-running the same client command, the
server wakes up, writes the file, and terminates, and the test
continues normally.

Those three "received NS, sending NA" messages in the pasta pane are
printed shortly after the test starts.

If I run this with TRACE=1 (which needs the patch I just sent), this
is pasta's debugging output for this test:

--
6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001)
6.1402: Flow 0 (NEW): FREE -> NEW
6.1402: Flow 0 (INI): NEW -> INI
6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ?
6.1402: Flow 0 (TGT): INI -> TGT
6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377
6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
6.1402: NDP: received NS, sending NA
7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
7.1585: NDP: received NS, sending NA
8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
8.1825: NDP: received NS, sending NA
9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c)
9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host
9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624
9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1
9.2068: Flow 0 (TCP connection (spliced)): CLOSED
9.2068: Flow 0 (FREE): ACTIVE -> FREE
9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
[...]
--
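
As a reading aid for the "(events: 0x...)" masks in the log above: on
Linux, EPOLLIN is 0x1, EPOLLOUT is 0x4, EPOLLERR is 0x8 and EPOLLHUP is
0x10, so 0x00000001 is plain readability while 0x0000001c combines
EPOLLOUT, EPOLLERR and EPOLLHUP. A minimal sketch that decodes such a
mask follows; epoll_mask_str() is a hypothetical helper written for
this illustration, not a function from pasta:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper, not part of pasta: render an epoll event mask as
 * a '|'-separated list of flag names into buf (at most len bytes,
 * including the terminating NUL). Only the four flags relevant to the
 * log above are handled.
 */
static const char *epoll_mask_str(uint32_t events, char *buf, size_t len)
{
	static const struct {
		uint32_t bit;
		const char *name;
	} flags[] = {
		{ EPOLLIN,  "EPOLLIN"  },
		{ EPOLLOUT, "EPOLLOUT" },
		{ EPOLLERR, "EPOLLERR" },
		{ EPOLLHUP, "EPOLLHUP" },
	};
	size_t i;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(events & flags[i].bit))
			continue;

		/* Separator only between names, never leading */
		if (buf[0])
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, flags[i].name, len - strlen(buf) - 1);
	}

	return buf;
}
```

With a 64-byte buffer, epoll_mask_str(0x1c, buf, sizeof(buf)) yields
"EPOLLOUT|EPOLLERR|EPOLLHUP", which matches the event reported on
socket 118 right before the "No route to host" line.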

Relevant parts of strace output from the client:

--
openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5
ioctl(5, TCGETS, 0x7ffd600ae4a0)        = -1 ENOTTY (Inappropriate ioctl for device)
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0
pselect6(7, [5], [6], [], NULL, NULL)   = 2 (in [5], out [6])
read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL)   = 2 (in [5], out [6])
read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL)   = 2 (in [5], out [6])

[...]

pselect6(7, [5], [6], [], NULL, NULL)   = 2 (in [5], out [6])
read(5, "", 8192)                       = 0
shutdown(6, SHUT_WR)                    = 0
shutdown(6, SHUT_RDWR)                  = 0
exit_group(0)                           = ?
+++ exited with 0 +++
--

and from the server:

--
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0
listen(6, 5)                            = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0
pselect6(7, [4 6], NULL, NULL, NULL, NULL
--

If I connect from the host without a server in the namespace (but
with the port forwarded by pasta), I get a connection reset, and
if the port is not forwarded by pasta, connection refused.

But this is another case: we start connecting and accept the
connection (probably we shouldn't). Note the "No route to host"
error on the socket.
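
For context on where such an error string can come from: after a
non-blocking connect(), a failure is typically reported later as
EPOLLERR, and the pending error can then be read with
getsockopt(SO_ERROR); EHOSTUNREACH is the errno value that glibc's
strerror() renders as "No route to host". The sketch below shows only
that generic mechanism. sock_pending_error() is a hypothetical helper
for illustration, and whether pasta obtains the message exactly this
way is an assumption here:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, not pasta's code: fetch (and thereby clear) the
 * pending error on socket s. Returns the errno-style value, 0 if no
 * error is pending, or -1 if getsockopt() itself fails.
 */
static int sock_pending_error(int s)
{
	int err = 0;
	socklen_t sl = sizeof(err);

	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &sl) < 0)
		return -1;

	/* SO_ERROR is an int option, the kernel shouldn't resize it */
	assert(sl == sizeof(err));

	return err;
}
```

A handler woken up with EPOLLERR could then log
strerror(sock_pending_error(s)), which for EHOSTUNREACH gives the text
seen in the trace above.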

It looks somewhat similar to the race I fixed with commit
f4e9f26480ef ("pasta: Disable neighbour solicitations on device
up to prevent DAD"), but it doesn't look like an invalid
c->ip6.addr_seen, because otherwise pasta would reset the
connection, I suppose.

I haven't debugged further yet. This looks like an existing
issue in pasta rather than in this series or in the tests,
but it blocks tests, so I haven't applied this yet.

-- 
Stefano



Thread overview: 16+ messages
2024-10-02  5:48 [PATCH v3 0/4] Don't expose container loopback services to the host David Gibson
2024-10-02  5:48 ` [PATCH v3 1/4] passt.1: Mark --stderr as deprecated more prominently David Gibson
2024-10-02  5:48 ` [PATCH v3 2/4] passt.1: Clarify and update "Handling of local addresses" section David Gibson
2024-10-02  5:48 ` [PATCH v3 3/4] test: Clarify test for spliced inbound transfers David Gibson
2024-10-02  5:48 ` [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address David Gibson
2024-10-09 13:07   ` Stefano Brivio [this message]
2024-10-09 20:44     ` Stefano Brivio
2024-10-10  5:57       ` David Gibson
2024-10-16  3:15         ` David Gibson
2024-10-16  5:46           ` David Gibson
2024-10-16  8:39             ` David Gibson
2024-10-16 15:26               ` Stefano Brivio
2024-10-17  1:19                 ` David Gibson
2024-10-17  8:31                   ` Stefano Brivio
2024-10-21  1:35                     ` David Gibson
2024-10-17  5:06                 ` David Gibson
