From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH RFT] fwd: Only do inbound IPv6 NAT to map_host_loopback / map_guest_addr with matching scope
Date: Wed, 20 May 2026 02:37:14 +0200 (CEST)
Message-ID: <20260520023713.69cb52e8@elisabeth>
In-Reply-To: <agqIdYZaB9F1tA8_@zatzit>

On Mon, 18 May 2026 13:33:09 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, May 15, 2026 at 12:50:23AM +0200, Stefano Brivio wrote:
> > On Thu, 14 May 2026 14:22:51 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Thu, May 14, 2026 at 01:08:16AM +0200, Stefano Brivio wrote:
> > > > On Wed, 13 May 2026 15:04:35 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Thu, May 07, 2026 at 06:31:49AM +0200, Stefano Brivio wrote:
> > > > > > I'm sharing this mostly for debugging / investigation of:
> > > > > >
> > > > > > https://github.com/containers/container-libs/pull/755#issuecomment-4390420134
> > > > > >
> > > > > > even though the change is probably correct and needed regardless of
> > > > > > that.
> > > > > >
> > > > > > If we have map_guest_addr or map_host_loopback addresses set for IPv6,
> > > > > > before using them for inbound NAT from the host, make sure they match
> > > > > > the scope of the original packet, otherwise we might unexpectedly
> > > > > > turn global unicast addresses into link-local ones for packets coming
> > > > > > from the host itself.
> > > > > >
> > > > > > Link: https://github.com/containers/container-libs/pull/755#issuecomment-4390420134
> > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > >
> > > > > There's a real problem here. However, I don't think this patch really
> > > > > addresses it. Details below.
> > > > >
> > > > > > ---
> > > > > > fwd.c | 20 ++++++++++++++++++--
> > > > > > 1 file changed, 18 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/fwd.c b/fwd.c
> > > > > > index 0697435..d224c0a 100644
> > > > > > --- a/fwd.c
> > > > > > +++ b/fwd.c
> > > > > > @@ -974,6 +974,20 @@ uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto,
> > > > > > return PIF_HOST;
> > > > > > }
> > > > > >
> > > > > > +/**
> > > > > > + * fwd_scope6_match() - Check if the IPv6 scope of two addresses match
> > > > > > + * @a: First address
> > > > > > + * @b: Second address
> > > > > > + *
> > > > > > + * Return: true for two IPv6 link-local or both not link-local, false otherwise
> > > > > > + *
> > > > > > + * NOTE: This currently ignores any other difference in scope
> > > > > > + */
> > > > >
> > > > > Nit: we probably want this helper (or ones like it) in ip.h and/or inany.h.
> > > > >
> > > > > > +bool fwd_scope6_match(const struct in6_addr *a, const struct in6_addr *b)
> > > > > > +{
> > > > > > + return IN6_IS_ADDR_LINKLOCAL(a) == IN6_IS_ADDR_LINKLOCAL(b);
> > > > >
> > > > > This considers only linklocal vs. not linklocal. Officially those are
> > > > > the only two scopes for unicast IPv6. But... site-local unicast used
> > > > > to exist, and while it's deprecated we have once seen it in the wild.
> > > > > There's also host-local scope which I'm not sure is a term used by
> > > > > > IPv6, but is used by kernel netlink, and kind of exists in practice
> > > > > (::1 and nothing else is host local).
> > > >
> > > > Yes, see the NOTE above. I was trying to find out if this was in any
> > > > way useful (and it looks like it wasn't, at least from the current
> > > > progress of https://github.com/containers/container-libs/pull/755).
> > >
> > > Ok.
> > >
> > > > > > +}
> > > > > > +
> > > > > > /**
> > > > > > * nat_inbound() - Apply address translation for inbound (HOST to TAP)
> > > > > > * @c: Execution context
> > > > > > @@ -993,13 +1007,15 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
> > > > > > /* Specifically 127.0.0.1, not 127.0.0.0/8 */
> > > > > > *translated = inany_from_v4(c->ip4.map_host_loopback);
> > > > > > } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
> > > > > > - inany_equals6(addr, &in6addr_loopback)) {
> > > > > > + inany_equals6(addr, &in6addr_loopback) &&
> > > > > > + fwd_scope6_match(&addr->a6, &c->ip6.map_host_loopback)) {
> > > > >
> > > > > This test will always be false: we just checked that addr == ::1,
> > > > > which is not link-local (it's host-local, if we're admitting that
> > > > > category).
> > > >
> > > > Oh, right, I didn't actually test this case.
> > > >
> > > > > > translated->a6 = c->ip6.map_host_loopback;
> > > > > > } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
> > > > > > inany_equals4(addr, &c->ip4.addr)) {
> > > > > > *translated = inany_from_v4(c->ip4.map_guest_addr);
> > > > > > } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
> > > > > > - inany_equals6(addr, &c->ip6.addr)) {
> > > > > > + inany_equals6(addr, &c->ip6.addr) &&
> > > > > > + fwd_scope6_match(&addr->a6, &c->ip6.map_guest_addr)) {
> > > > >
> > > > > > This may usually be right in practice, but it's kind of by
> > > > > accident.
> > > > >
> > > > > The problem with both these checks is that they compare the scope of a
> > > > > host side address (addr) with the scope of a guest side address
> > > > > (c->map_*).
> > > >
> > > > Note that inany_equals6(addr, &c->ip6.addr) is pre-existing.
> > >
> > > Right - that's the confusing but necessary semantics of
> > > --map-guest-addr. It's a translation for the thing on the outside
> > > that has the same address as the guest does on the inside.
> > >
> > > What I didn't spot before is that makes the scope check equivalent to:
> > > fwd_scope6_match(&c->ip6.addr, &c->ip6.map_guest_addr)
> > > which we should be able to implement at conf() time (someday at
> > > address update time).
> >
> > There might be cases where one wants to have different scopes for
> > those, though, and just apply the map_guest_addr inbound remapping for
> > packets matching its scope. I'm not sure if it's useful.
>
> I don't really understand what you're saying here. By definition the
> map_guest_addr translation applies when the outside address matches
> the (assigned) inside guest address. We know the scope of the
> assigned guest address in advance, we don't have to wait until we get
> a connection.

Ah, right, if it matches map_guest_addr, it's of course the same scope
(it's the same address). So that part doesn't matter.

But, in general, I was referring to source addresses. Let's say you
have a guest with address 2001:db8::1, and now you get an inbound
packet:

  3fff::1 -> 2001:db8::2

scopes match, the behaviour is clear, we'll remap to 2001:db8::1. Then
you get another one:

  fe80::1 -> 2001:db8::2

should we remap the destination to 2001:db8::1, or to a link-local
address?

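To make the scope comparison in the two packets above concrete, here is a
minimal sketch. The names scope6_of() and scope6_match_str() are
hypothetical, not existing passt helpers, and the three-way split (host,
link, global) is an assumption that goes slightly beyond the link-local vs.
not link-local check in the patch:

```c
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical scope classes for this sketch, not passt's own types */
enum scope6 { SCOPE6_HOST, SCOPE6_LINK, SCOPE6_GLOBAL };

/* Classify a unicast address: ::1 is host scope, fe80::/10 is link scope,
 * anything else is treated as global (site-local is deliberately ignored)
 */
static enum scope6 scope6_of(const struct in6_addr *a)
{
	if (IN6_IS_ADDR_LOOPBACK(a))
		return SCOPE6_HOST;
	if (IN6_IS_ADDR_LINKLOCAL(a))
		return SCOPE6_LINK;
	return SCOPE6_GLOBAL;
}

/* Return 1 if both addresses have the same scope, 0 if they don't, -1 if
 * either string is not a valid IPv6 address
 */
static int scope6_match_str(const char *src, const char *dst)
{
	struct in6_addr s, d;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, dst, &d) != 1)
		return -1;

	return scope6_of(&s) == scope6_of(&d);
}
```

With this, scope6_match_str("3fff::1", "2001:db8::2") reports a match, while
scope6_match_str("fe80::1", "2001:db8::2") does not, which is the case the
question is about.
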
> > > > Further, this "mismatch" is actually intended (see commit message and
> > > > Podman's pull request I mentioned above), as I was trying to (quickly)
> > > > make sure that we don't turn a global unicast request into a link-local
> > > > one.
> > >
> > > Ok, I had misunderstood the problem somewhat. Looking at that github
> > > comment, I'm still pretty confused TBH :/.
> >
> > We didn't investigate further because it's really not that critical at
> > this point, as the follow-ups to that report should indicate.
>
> As I commented on github, I've now understood (I'm pretty sure) what's
> going on there. It's not actually related to map_guest_addr or
> nat_inbound() at all, but is instead the "translation of last resort"
> in fwd_nat_from_host().
>
> > > > > That's not what matters: what matters is that source and
> > > > > destination on the tap side have the same scope.
> > > >
> > > > Not for this particular issue: again, I was just trying to make sure
> > > > that a global unicast request doesn't get translated to a link-local
> > > > one.
> > >
> > > Hrm. It's not really clear to me why that's bad.
> >
> > Because it's surprising that a request to a valid global unicast
> > address that's assigned to a container, from another global
> > unicast address, gets translated to anything else. There's no need for
> > that.
>
> Isn't there? If the outside source is using the same address as the
> guest uses inside we have to apply some sort of translation (or drop
> it entirely).

I don't actually remember what happens in that case. Does Linux
handle those (at least by default) in the same way as Martian (RFC
1812) packets, or does it let them through?

> So, if we want to preserve scope as well, we have to
> pluck a global scope address from somewhere, and it's not clear how we
> can do that.

Right, that's a particular case I was ignoring. I was just saying that
in general (if the source address isn't conflicting) then it would be
good to not change the scope, if possible.

If it conflicts, then I would just pick any reasonable choice to let
packets through, including changing the scope to link-local.

> > > > This can probably be checked in an indirect and more correct way "at the
> > > > source".
> > > >
> > > > > In flow table terms,
> > > > > that is, on a single flowside oaddr and eaddr must have the same
> > > > > scope (or must they? see later).
> > > > >
> > > > > The scopes on one side of the flow don't need to match the scopes on
> > > > > the other side of the flow. In fact we need to allow them to be
> > > > > different: --map-host-loopback is *always* transforming a host-scope
> > > > > flow on the outside to something else on the inside (either link-scope
> > > > > or global-scope will work, as long as it's the same for both
> > > > > addresses). We don't do it yet, but I can imagine cases where it
> > > > > would be useful to translate a flow that's global-scope on the outside
> > > > > to local-scope on the inside (because for some reason we want to or
> > > > > have to hide the external peer's address from the guest). Or from
> > > > > local-scope on the outside to global-scope on the inside (because the
> > > > > outside flow is using local-scope addresses which are not meaningful
> > > > > to the guest).
> > > >
> > > > Yes, definitely, that might actually be a feature, I just think we
> > > > don't want to do that by default / mistake.
> > >
> > > I mean, --map-host-loopback is kind of already this.
> > >
> > > > In this case we had an inbound request to a global unicast address that
> > > > was translated for some reason (we didn't really investigate) to a
> > > > link-local destination address.
> > >
> > > Hrm, ok. I really want to understand why that happened.
> >
> > In the short term it's probably easier if you try out yourself something
> > like Paul described, because there are other more critical issues we
> > discovered later that we're tackling at the moment.
> >
> > > What was the source address?
> >
> > I *think* another global unicast address. But maybe not and that would
> > then explain the non-issue.
> >
> > > This seems like it would be handled by the selection
> > > of the guest side eaddr in fwd_nat_from_host(), which explicitly tries
> > > to match scope with the (translated) source address.
> >
> > Maybe, yes.
> >
> > > > But if it's explicit it should be allowed, by all means.
> > > >
> > > > > This has some tricky implications for what we do about assigning
> > > > > addresses for "local mode" or any future variant where we need to
> > > > > assign a guest address, but can't take one from the host. If we
> > > > > assign a link-local address, as was our plan, that implies under this
> > > > > assumption that the guest can only talk to link-local machines. In
> > > > > practice that would mean only the host (via -map-*) or in future
> > > > > things we added explicit NATs, where the guest side address is
> > > > > link-local. The guest would have no ability to contact the internet
> > > > > at large.
> > > >
> > > > I don't think that's desirable.
> > >
> > > Neither do I, but how to avoid it is not obvious to me.
> >
> > By making it a matter of preference: if there's another, more fitting
> > (in terms of scope) address, we use that. Otherwise fall back to a
> > link-local.
>
> Sure, but where would we get a global scope address from?

From the guest, *if it has one*. If not, we can happily fall back to
translating to/from link-local I'd say.

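As a rough sketch of that preference, assuming a hypothetical helper
guest_dst_pick() (not a passt function) that only decides which kind of
guest address an inbound destination should map to, and using the
unspecified address (::) to mean "the guest has no global address":

```c
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical result type: which guest address the destination maps to */
enum guest_dst { GUEST_DST_ERR = -1, GUEST_DST_GLOBAL, GUEST_DST_LL };

static enum guest_dst guest_dst_pick(const struct in6_addr *src,
				     const struct in6_addr *guest_global)
{
	/* No global address on the guest: link-local is the only option */
	if (IN6_IS_ADDR_UNSPECIFIED(guest_global))
		return GUEST_DST_LL;

	/* Link-local peer: prefer the address that matches its scope */
	if (IN6_IS_ADDR_LINKLOCAL(src))
		return GUEST_DST_LL;

	/* Otherwise keep the flow global on the guest side as well */
	return GUEST_DST_GLOBAL;
}

/* String wrapper, returns GUEST_DST_ERR on unparseable input */
static enum guest_dst guest_dst_pick_str(const char *src,
					 const char *guest_global)
{
	struct in6_addr s, g;

	if (inet_pton(AF_INET6, src, &s) != 1 ||
	    inet_pton(AF_INET6, guest_global, &g) != 1)
		return GUEST_DST_ERR;

	return guest_dst_pick(&s, &g);
}
```

This leaves out the case where the source conflicts with the guest's own
address, which needs separate handling as discussed above.
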
> > > > > At least once we have the netlink monitor, maybe that's ok. While the
> > > > > host has no address, the guest has only link local, so it can only
> > > > > talk to the host (or explicitly configured forwards/NATs). But the
> > > > > host has no connectivity anyway, so there's nothing else to talk to
> > > > > anyway. When the host gets connectivity we add a global-scope guest
> > > > > address, so it gets connectivity too.
> > > > >
> > > > > If that's not good enough, I can only see two approaches, neither of
> > > > > which look great.
> > > > >
> > > > > a) For incoming connections from the world, to a guest with only an
> > > > > LL address, we NAT *both* source and destination address (ugh, the
> > > > > bookkeeping).
> > > >
> > > > The bookkeeping is already in place though.
> > >
> > > Well.. we can DNAT easily enough, but to match scope we also need to
> > > SNAT. That means picking a link-local source address (guest oaddr)
> > > for each incoming flow. Maybe we can use our_tap_ll for that?
>
> I'd forgotten when I wrote this, but we're already doing this. That's
> what's causing the odd behaviour seen here.
>
> > Actually, we don't really need to match the scope, though. I just think
> > it's preferable when doable. So a) could be optional, and b) could be
> > the default. At that point:
>
> But as noted, I'm not sure (b) works *at all* for IPv6. It also
> wouldn't help for this case: we *cannot* use the original source
> address if it's the same as the guest's.

I wasn't including a source address "conflict" in this case. I was
suggesting to use (b) in general, but here, we'll need to do something
different, indeed.

> > > But that means all incoming connections will appear to come from there
> > > regardless of whether they are actually the same peer or not. If the
> > > guest is talking to enough peers we risk running out of source ports.
> >
> > ...this would be a marginal risk. The user enabled that explicitly.
>
> At present, they didn't, it's always on. But if we don't do that,
> that implies the guest cannot be reached from peers that have the same
> address as it.
>
> > > Or for UDP, where we preserve source port, we risk collision between
> > > flows that are separate on the outside. Theoretically, we could avoid
> > > this by assigning a distinct, link-local, guest side address for each
> > > peer. Doing _that_ is a lot of new bookkeeping which is what I was
> > > thinking of.
> > >
> > > > > Outgoing connections to the world are only possible for targets
> > > > > where we've preconfigured a NAT.
> > > > >
> > > > > b) We _do_ allow different scopes on the two guest-side addresses.
> > > > > This implies that the guest *expects and requires* their gateway
> > > > > (us) to SNAT them.
> > > > >
> > > > > I suspect that the guest simply won't allow (b) to work for IPv6, but
> > > > > it might for IPv4, since most things don't actually look for RFC3927
> > > > > addresses, and NAT is much more expected in general.
> > > >
> > > > I think it's rather complicated to define this before having played
> > > > with the netlink monitor implementation itself, but again, this is well
> > > > beyond the scope of this patch.
> > >
> > > Maybe.
> > >
> > > >
> > > > The idea here is that *if* we have a global unicast address in a
> > > > container, the mere fact that we have a link-local address in
> > > > --map-guest-addr (not actually the case, it seems, but we haven't
> > > > investigated further), shouldn't cause inbound traffic to be mapped to
> > > > that link-local address.
> > >
> > > Indeed it should not, but I don't yet see why that's happening.
> >
> > Me neither, but when Paul raised that, it looked like the most visible
> > issue we could have solved and "got everything to work". Things turned
> > out to be rather different in reality, so we didn't look for an
> > explanation, at least not yet.
> >
> > > > But note that I'm not sure if it's an actual problem or if it's even
> > > > happening at all, at this stage.
--
Stefano