From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202510 header.b=xrabcQBt; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 7EAAA5A0619 for ; Thu, 23 Oct 2025 03:42:13 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202510; t=1761183731; bh=uiJddP6JvBJxLJ5XtmZCOrDXFCiUgl/UyMdF3+cAxsM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=xrabcQBtTJWi3tbu6R1w6WqsV6g4NhyLdwmJBqj/0LgZVdqoLY72gRvqRS+KR22pW 5O1P08pgSnS8jDilmbTO69BswQzQcM6HE1ftzHeXsrVWVPb8QRw4ddZom/I2Mm3TG2 NlzD7jFVvLwwnUvVdtNkVlgp+Ds37Vk+Gn9pm/WLLtcFzRJfHDqFViykCIH+I5Wv50 pFFLQkD15vKUVgCYTUYc3keKoRLcSbgo3oWwxi3C1RvrI+vFrPAS4WS56qfH93mLig gnHpLHxgy4G4saiVnX3i1y0BJMfGKOW0SpFgr1h/LlnRrfkhNyWNqhPfV0CfkMOhPd vqJDbdG+b5UEw== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4csTKv0mDXz4wM0; Thu, 23 Oct 2025 12:42:11 +1100 (AEDT) Date: Thu, 23 Oct 2025 12:18:33 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address Message-ID: References: <20251017003447.414103-1-david@gibson.dropbear.id.au> <20251017003447.414103-4-david@gibson.dropbear.id.au> <20251021235112.17369db5@elisabeth> <20251022105916.53925523@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="88UiM/jQGtATSez+" Content-Disposition: inline In-Reply-To: <20251022105916.53925523@elisabeth> Message-ID-Hash: 7SZ45A4LPVIWVJA4DZNWVAEHR6F2VP33 X-Message-ID-Hash: 7SZ45A4LPVIWVJA4DZNWVAEHR6F2VP33 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --88UiM/jQGtATSez+ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote: > On Wed, 22 Oct 2025 11:34:40 +1100 > David Gibson wrote: >=20 > > On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote: > > > On Fri, 17 Oct 2025 11:34:47 +1100 > > > David Gibson wrote: > > > =20 > > > > Currently, outbound forwards (-T, -U) are handled by sockets bound = to the > > > > loopback address. Typically we create two sockets, one for 127.0.0= =2E1 and > > > > one for ::1. > > > >=20 > > > > This has some disadvantages: > > > > * The guest can't connect to these services using its global IP ad= dress, > > > > it must explicitly use 127.0.0.1 or ::1 (bug 100) > > > > * The guest can't even connect via 127.0.0.0/8 addresses other than > > > > 127.0.0.1 > > > > * We can't use dual-stack sockets, we have to have separate socket= s for > > > > IPv4 and IPv6. > > > >=20 > > > > The restriction exist for a reason though. If the guest has any in= terfaces > > > > other than pasta (e.g. a VPN tunnel) external hosts could reach the= host > > > > via the forwards. Especially combined with -T auto / -U auto this = would > > > > make it very easy to make a mistake with nasty security implication= s. > > > >=20 > > > > We can achieve both goals, however, if we don't bind the outbound l= istening > > > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to re= strict > > > > them to the "lo" interface. =20 > > >=20 > > > Nice trick, I didn't think of it. I wonder if doing the same host-side > > > might help solving a part of https://bugs.passt.top/show_bug.cgi?id= =3D113 > > > as well. =20 > >=20 > > I don't think we even need to do anything host side - bug 113 arises > > because of where we're listening on the guest side. >=20 > Ah, you're right, I guess I picked the wrong bug. I have a vague memory > of another one where somebody is running a DNS proxy in a container > (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1, > and we automatically bind, on the host side, to 127.0.0.1. >=20 > > So this might be > > enough to fix it all on its own. I'm not certain, because the special > > case DNS handling complicates the picture there. >=20 > I guess perhaps worth a quick test with socat if checking against > systemd-resolved isn't pratical, to see if we can close that one too? Right, I'll have a look when I can. > > > > Link: https://bugs.passt.top/show_bug.cgi?id=3D100 > > > >=20 > > > > Signed-off-by: David Gibson > > > > --- > > > > pif.c | 6 ------ > > > > tcp.c | 18 ++---------------- > > > > udp.c | 27 ++++++++++----------------- > > > > 3 files changed, 12 insertions(+), 39 deletions(-) > > > >=20 > > > > diff --git a/pif.c b/pif.c > > > > index 592fafaa..84e3ceae 100644 > > > > --- a/pif.c > > > > +++ b/pif.c > > > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_= type type, uint8_t pif, > > > > =20 > > > > ASSERT(pif_is_socket(pif)); > > > > =20 > > > > - if (pif =3D=3D PIF_SPLICE) { > > > > - /* Sanity checks */ > > > > - ASSERT(!ifname); > > > > - ASSERT(addr && inany_is_loopback(addr)); > > > > - } > > > > - > > > > if (!addr) > > > > return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), > > > > ifname, false, data); > > > > diff --git a/tcp.c b/tcp.c > > > > index 15c012d7..982c9190 100644 > > > > --- a/tcp.c > > > > +++ b/tcp.c > > > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8= _t pif, > > > > =20 > > > > return r4 < 0 ? r4 : r6; > > > > } > > > > -/** > > > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound= connections > > > > - * @c: Execution context > > > > - * @port: Port, host order > > > > - */ > > > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) > > > > -{ > > > > - ASSERT(!c->no_tcp); > > > > - > > > > - if (c->ifi4) > > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); > > > > - if (c->ifi6) > > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); > > > > -} > > > > =20 > > > > /** > > > > * tcp_ns_socks_init() - Bind sockets in namespace for outbound co= nnections > > > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) > > > > if (!bitmap_isset(c->tcp.fwd_out.map, port)) > > > > continue; > > > > =20 > > > > - tcp_ns_sock_init(c, port); > > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); =20 > > >=20 > > > I thought the "lo" string would be part of the Linux UAPI, but that's > > > not the case, and loopback_net_init() just calls: > > >=20 > > > alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup); > > >=20 > > > so I think it's relatively unproblematic to hardcode that as well, an= d it > > > looks like we can't create a second loopback interface, even though: > > >=20 > > > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name= lol; ip link show lol' > > > 1: lol: mtu 65536 qdisc noqueue state DOWN mode DEFAULT gr= oup default qlen 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 =20 > >=20 > > Hm, that is a point. *thinks*. So I believe loopback always has > > index 1, so we could potentially use that. >=20 > Right, see pasta_ns_conf(): >=20 > rc =3D nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP); >=20 > > Except that BINDTODEVICE > > takes a name, not an index. I don't think looking up the name from > > the index then using BINDTODEVICE would do quite what we want either: > > IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed > > the name of lo after we did BINDTODEVICE, then it wouldn't "follow" > > the interface name change (which is what you want for intermittently > > present interfaces, not so much here). >=20 > Yes, we would need to keep querying it. >=20 > > > I don't have any quick solution and I don't think we care enough as to > > > write a function in netlink.c fetching links with loopback type, so I= 'm > > > totally fine with this as it is. =20 > >=20 > > Yeah, given the above, I think this is another case of we can only go > > so far to stop the guest shooting itself in the foot. > >=20 > > > By the way, if we fail to use SO_BINDTODEVICE, we already defensively > > > close the socket. The only possible flaw that occurs to me is that > > > somebody could rename 'lo' and then create a link called 'lo' of a > > > different type. But that needs CAP_NET_ADMIN in the container anyway.= =20 > >=20 > > Right. And while that could expose host ports in ways we didn't > > expect, a malicious guest could already do that by running a port > > forwarder. So, again, I think this falls under the category of the > > guest being allowed to shoot itself in the foot. >=20 > Indeed. >=20 > > > > } > > > > =20 > > > > return 0; > > > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bo= ol outbound) > > > > if ((c->ifi4 && socks[port][V4] =3D=3D -1) || > > > > (c->ifi6 && socks[port][V6] =3D=3D -1)) { > > > > if (outbound) > > > > - tcp_ns_sock_init(c, port); > > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); =20 > > >=20 > > > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels= ? =20 > >=20 > > For a moment I thought we didn't need a fallback, because we'd be > > entering the guest ns and thereby gaining CAP_NET_RAW. But that's not > > the case: we only enter the guest netns for this operation, we're > > already in the userns and have dropped capabilities at this point > > (unlike the bindings we create at startup). > >=20 > > So, good question. Having a fallback would make this a *lot* messier, >=20 > Hmm, why? I didn't test this (and I'm quite confused by > inany_from_sockaddr() right now) but: Ah, I didn't think of doing the workaround at this level. That does help somewhat. >=20 > --- > diff --git a/util.c b/util.c > index c492f90..6b04040 100644 > --- a/util.c > +++ b/util.c > @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type= type, > * ("net: core: enable SO_BINDTODEVICE for non-root users"). If > * it's unsupported, don't bind the socket at all, because the > * user might rely on this to filter incoming connections. > + * > + * As an exception, if we want to bind to 'lo', approximate that > + * by binding to 127.0.0.1 (which might be the wrong loopback > + * address for IPv4) or ::1 (always correct, for IPv6). This > + * adds https://bugs.passt.top/show_bug.cgi?id=3D100 back for > + * pre-5.7 kernels. > */ > if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, > ifname, strlen(ifname))) { > char str[SOCKADDR_STRLEN]; > =20 > - ret =3D -errno; > - warn("Can't bind %s socket for %s to %s, closing", > - EPOLL_TYPE_STR(proto), > - sockaddr_ntop(sa, str, sizeof(str)), ifname); > - close(fd); > - return ret; > + if (errno =3D=3D EPERM && !strcmp(ifname, "lo")) { > + /* I just realised inany_from_sockaddr() doesn't > + * actually take a sockaddr as source...? Do we It does, the description's just not great. It takes the sockaddr as a (void *) @addr so that callers can pass a struct sockaddr_whatever without casting. I'll try to improve this. > + * have another function doing that now? > + * > + * Well, change the address in 'sa' and warn(). We need to specially handle the dual stack case, to force the caller to split into separate v4 and v6 sockets here. > + */ > + ; > + } else { > + ret =3D -errno; > + warn("Can't bind %s socket for %s to %s, closing", > + EPOLL_TYPE_STR(proto), > + sockaddr_ntop(sa, str, sizeof(str)), ifname); > + close(fd); > + return ret; > + } > } > } > =20 > --- >=20 > > and perhaps more importantly means we'd get a subtle but real > > behavioural difference based on kernel version which sounds like it > > could be pretty surprising to the user. My inclination is to say that > > -T auto / -U auto requires a kernel with that patch, but if you > > overrule me, I'll see what I can do. >=20 > See https://bugs.passt.top/show_bug.cgi?id=3D121 for a clear indication > that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19 > kernels) was the 'oldoldstable' until recently, and extended long term > support ends in 2029. Poop. > If this was a new feature, I would agree. But with this, we are > introducing a regression and we might break some setups. Yeah, ok. I'll figure something out. --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --88UiM/jQGtATSez+ Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmj5gmgACgkQzQJF27ox 2Gfvdw//cAV34hqorjfTdpsrh49Z6rGCnglX6PwcORGMtGiUf+lxy4zYGVuc2Kt4 NfKg0JNcIX73InrS8Px6/MzbAbYKpHFf8NZqSzW3onXVKuvkoo/dfOYDD1EwRDY9 FrG+xPqTXQF7QcZpfANZnY/SC1xquESMRX+0Yj5tgeAYUn7l3RMLnhg7iZgh7R2W Z2oz8jMnZVRh/4tZfA48Fa22HPvgHth1uORokpKJCFvUIha8WBitVyOrA5zmB9mP PhIXmyPXnuD4AwTcGUaDUDp2O/n1C7Vo0JxamLIIpw0Hjb0TA+fkLBn4x/tiKUDE WZVoI4jy/16aM/P1SrLXAg81pOvtXSeY+gK96rB8H7fC3RGDuEcPxBk60JR592ZE wNgSgza3zKryJNkAXAuyVdeqXqmRGhxoZ/xaZbKf415xVJ3gudZiXnPNIWti7S0H 8PRPYtVPZBtsykhV7F0vpgWh5I7KHRVD2ahRW9C4YlaOhPgZjHVk0TDDwSoajatL K2nY8UZWloF7k9fVp6lQGp8Jb53tPnvEureH5PhcIWJGD0BxmdtwYE5Q26UXejpx yNHhoe8D0wwSkwWvbosFoi5yJxdAHZ/z0WLU09Plp9AMMbnOOKhn0QsHDMt0ON/r IzYdx2Y8s5/bh12ICqQZEz90GZgiPBGGwpFDR04UVyGeJehT5WE= =c1VC -----END PGP SIGNATURE----- --88UiM/jQGtATSez+--