From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
Date: Thu, 23 Oct 2025 12:18:33 +1100 [thread overview]
Message-ID: <aPmCaQZTjfwmMv5S@zatzit> (raw)
In-Reply-To: <20251022105916.53925523@elisabeth>
[-- Attachment #1: Type: text/plain, Size: 10817 bytes --]
On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote:
> On Wed, 22 Oct 2025 11:34:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> > > On Fri, 17 Oct 2025 11:34:47 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > > > loopback address. Typically we create two sockets, one for 127.0.0.1 and
> > > > one for ::1.
> > > >
> > > > This has some disadvantages:
> > > > * The guest can't connect to these services using its global IP address,
> > > > it must explicitly use 127.0.0.1 or ::1 (bug 100)
> > > > * The guest can't even connect via 127.0.0.0/8 addresses other than
> > > > 127.0.0.1
> > > > * We can't use dual-stack sockets, we have to have separate sockets for
> > > > IPv4 and IPv6.
> > > >
> > > > The restriction exist for a reason though. If the guest has any interfaces
> > > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > > > via the forwards. Especially combined with -T auto / -U auto this would
> > > > make it very easy to make a mistake with nasty security implications.
> > > >
> > > > We can achieve both goals, however, if we don't bind the outbound listening
> > > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > > > them to the "lo" interface.
> > >
> > > Nice trick, I didn't think of it. I wonder if doing the same host-side
> > > might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113
> > > as well.
> >
> > I don't think we even need to do anything host side - bug 113 arises
> > because of where we're listening on the guest side.
>
> Ah, you're right, I guess I picked the wrong bug. I have a vague memory
> of another one where somebody is running a DNS proxy in a container
> (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1,
> and we automatically bind, on the host side, to 127.0.0.1.
>
> > So this might be
> > enough to fix it all on its own. I'm not certain, because the special
> > case DNS handling complicates the picture there.
>
> I guess perhaps worth a quick test with socat if checking against
> systemd-resolved isn't pratical, to see if we can close that one too?
Right, I'll have a look when I can.
> > > > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > > >
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > > pif.c | 6 ------
> > > > tcp.c | 18 ++----------------
> > > > udp.c | 27 ++++++++++-----------------
> > > > 3 files changed, 12 insertions(+), 39 deletions(-)
> > > >
> > > > diff --git a/pif.c b/pif.c
> > > > index 592fafaa..84e3ceae 100644
> > > > --- a/pif.c
> > > > +++ b/pif.c
> > > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > > >
> > > > ASSERT(pif_is_socket(pif));
> > > >
> > > > - if (pif == PIF_SPLICE) {
> > > > - /* Sanity checks */
> > > > - ASSERT(!ifname);
> > > > - ASSERT(addr && inany_is_loopback(addr));
> > > > - }
> > > > -
> > > > if (!addr)
> > > > return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> > > > ifname, false, data);
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 15c012d7..982c9190 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> > > >
> > > > return r4 < 0 ? r4 : r6;
> > > > }
> > > > -/**
> > > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > > > - * @c: Execution context
> > > > - * @port: Port, host order
> > > > - */
> > > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > > > -{
> > > > - ASSERT(!c->no_tcp);
> > > > -
> > > > - if (c->ifi4)
> > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > > > - if (c->ifi6)
> > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > > > -}
> > > >
> > > > /**
> > > > * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> > > > if (!bitmap_isset(c->tcp.fwd_out.map, port))
> > > > continue;
> > > >
> > > > - tcp_ns_sock_init(c, port);
> > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> > >
> > > I thought the "lo" string would be part of the Linux UAPI, but that's
> > > not the case, and loopback_net_init() just calls:
> > >
> > > alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> > >
> > > so I think it's relatively unproblematic to hardcode that as well, and it
> > > looks like we can't create a second loopback interface, even though:
> > >
> > > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> > > 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >
> > Hm, that is a point. *thinks*. So I believe loopback always has
> > index 1, so we could potentially use that.
>
> Right, see pasta_ns_conf():
>
> rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
>
> > Except that BINDTODEVICE
> > takes a name, not an index. I don't think looking up the name from
> > the index then using BINDTODEVICE would do quite what we want either:
> > IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
> > the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
> > the interface name change (which is what you want for intermittently
> > present interfaces, not so much here).
>
> Yes, we would need to keep querying it.
>
> > > I don't have any quick solution and I don't think we care enough as to
> > > write a function in netlink.c fetching links with loopback type, so I'm
> > > totally fine with this as it is.
> >
> > Yeah, given the above, I think this is another case of we can only go
> > so far to stop the guest shooting itself in the foot.
> >
> > > By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> > > close the socket. The only possible flaw that occurs to me is that
> > > somebody could rename 'lo' and then create a link called 'lo' of a
> > > different type. But that needs CAP_NET_ADMIN in the container anyway.
> >
> > Right. And while that could expose host ports in ways we didn't
> > expect, a malicious guest could already do that by running a port
> > forwarder. So, again, I think this falls under the category of the
> > guest being allowed to shoot itself in the foot.
>
> Indeed.
>
> > > > }
> > > >
> > > > return 0;
> > > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> > > > if ((c->ifi4 && socks[port][V4] == -1) ||
> > > > (c->ifi6 && socks[port][V6] == -1)) {
> > > > if (outbound)
> > > > - tcp_ns_sock_init(c, port);
> > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> > >
> > > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?
> >
> > For a moment I thought we didn't need a fallback, because we'd be
> > entering the guest ns and thereby gaining CAP_NET_RAW. But that's not
> > the case: we only enter the guest netns for this operation, we're
> > already in the userns and have dropped capabilities at this point
> > (unlike the bindings we create at startup).
> >
> > So, good question. Having a fallback would make this a *lot* messier,
>
> Hmm, why? I didn't test this (and I'm quite confused by
> inany_from_sockaddr() right now) but:
Ah, I didn't think of doing the workaround at this level. That does
help somewhat.
>
> ---
> diff --git a/util.c b/util.c
> index c492f90..6b04040 100644
> --- a/util.c
> +++ b/util.c
> @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> * ("net: core: enable SO_BINDTODEVICE for non-root users"). If
> * it's unsupported, don't bind the socket at all, because the
> * user might rely on this to filter incoming connections.
> + *
> + * As an exception, if we want to bind to 'lo', approximate that
> + * by binding to 127.0.0.1 (which might be the wrong loopback
> + * address for IPv4) or ::1 (always correct, for IPv6). This
> + * adds https://bugs.passt.top/show_bug.cgi?id=100 back for
> + * pre-5.7 kernels.
> */
> if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
> ifname, strlen(ifname))) {
> char str[SOCKADDR_STRLEN];
>
> - ret = -errno;
> - warn("Can't bind %s socket for %s to %s, closing",
> - EPOLL_TYPE_STR(proto),
> - sockaddr_ntop(sa, str, sizeof(str)), ifname);
> - close(fd);
> - return ret;
> + if (errno == EPERM && !strcmp(ifname, "lo")) {
> + /* I just realised inany_from_sockaddr() doesn't
> + * actually take a sockaddr as source...? Do we
It does, the description's just not great. It takes the sockaddr as a
(void *) @addr so that callers can pass a struct sockaddr_whatever
without casting. I'll try to improve this.
> + * have another function doing that now?
> + *
> + * Well, change the address in 'sa' and warn().
We need to specially handle the dual stack case, to force the caller
to split into separate v4 and v6 sockets here.
> + */
> + ;
> + } else {
> + ret = -errno;
> + warn("Can't bind %s socket for %s to %s, closing",
> + EPOLL_TYPE_STR(proto),
> + sockaddr_ntop(sa, str, sizeof(str)), ifname);
> + close(fd);
> + return ret;
> + }
> }
> }
>
> ---
>
> > and perhaps more importantly means we'd get a subtle but real
> > behavioural difference based on kernel version which sounds like it
> > could be pretty surprising to the user. My inclination is to say that
> > -T auto / -U auto requires a kernel with that patch, but if you
> > overrule me, I'll see what I can do.
>
> See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication
> that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19
> kernels) was the 'oldoldstable' until recently, and extended long term
> support ends in 2029.
Poop.
> If this was a new feature, I would agree. But with this, we are
> introducing a regression and we might break some setups.
Yeah, ok. I'll figure something out.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
prev parent reply other threads:[~2025-10-23 1:42 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-17 0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
2025-10-17 0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
2025-10-20 6:08 ` Stefano Brivio
2025-10-20 9:24 ` David Gibson
2025-10-20 6:09 ` Stefano Brivio
2025-10-20 9:25 ` David Gibson
2025-10-17 0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
2025-10-21 21:51 ` Stefano Brivio
2025-10-22 0:08 ` David Gibson
2025-10-17 0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
2025-10-21 21:51 ` Stefano Brivio
2025-10-22 0:34 ` David Gibson
2025-10-22 8:59 ` Stefano Brivio
2025-10-23 1:18 ` David Gibson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPmCaQZTjfwmMv5S@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).