From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202602 header.b=CWPiUn4F; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 5F9485A0265 for ; Wed, 20 May 2026 09:24:09 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202602; t=1779261845; bh=edZPJGXUWBdVt5qnUBpGpPzP3pZ9f2iFt4vrIgZkYXk=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=CWPiUn4Fml+G4G4i7te3hU9PioLwrN/jrVeyFEdyI37EN4b9JKrFEZigOaBWhr/Za o7sN8ns551zCZZBMrwpAc0cMH37wH1w5kbSbePDLqavveMkpibsVQfc8KMEMqn05dn 9c3Jinc+NFP+Ue5ZTCIVhqo39d4xF/I6FQZNuj8LwBhU1xFARNua/t/pQ9IA4MQMYM 16PBNFzF6JzzGlMiYJA8lZ5bMVYUSzwffw+V/gergVuVqlJ/OnwnmhGask5uJBMJHK 1Pg+THOHevQ+WqtL4FGjdOZ/D8dXtsx1VkgWf6BWOvlzJW0wp+r5zXrZwIqY6jkW6N hsNvYyIk7zgGA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4gL31x3M4kz58n8; Wed, 20 May 2026 17:24:05 +1000 (AEST) Date: Wed, 20 May 2026 17:24:01 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH RFT] fwd: Only do inbound IPv6 NAT to map_host_loopback / map_guest_addr with matching scope Message-ID: References: <20260507043149.1989693-1-sbrivio@redhat.com> <20260514010816.5ccc02de@elisabeth> <20260515005015.6e23cc47@elisabeth> <20260520023713.69cb52e8@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="ZMbmDQY8czMjVOGw" Content-Disposition: inline In-Reply-To: <20260520023713.69cb52e8@elisabeth> Message-ID-Hash: 44U77WGUXJS7T5OQSI7OWUKZ7SRVL257 X-Message-ID-Hash: 44U77WGUXJS7T5OQSI7OWUKZ7SRVL257 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; 
banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --ZMbmDQY8czMjVOGw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, May 20, 2026 at 02:37:14AM +0200, Stefano Brivio wrote: > On Mon, 18 May 2026 13:33:09 +1000 > David Gibson wrote: >=20 > > On Fri, May 15, 2026 at 12:50:23AM +0200, Stefano Brivio wrote: > > > On Thu, 14 May 2026 14:22:51 +1000 > > > David Gibson wrote: > > > =20 > > > > On Thu, May 14, 2026 at 01:08:16AM +0200, Stefano Brivio wrote: =20 > > > > > On Wed, 13 May 2026 15:04:35 +1000 > > > > > David Gibson wrote: > > > > > =20 > > > > > > On Thu, May 07, 2026 at 06:31:49AM +0200, Stefano Brivio wrote:= =20 > > > > > > > I'm sharing this mostly for debugging / investigation of: > > > > > > >=20 > > > > > > > https://github.com/containers/container-libs/pull/755#issue= comment-4390420134 > > > > > > >=20 > > > > > > > even though the change is probably correct and needed regardl= ess of > > > > > > > that. > > > > > > >=20 > > > > > > > If we have map_guest_addr or map_host_loopback addresses set = for IPv6, > > > > > > > before using them for inbound NAT from the host, make sure th= ey match > > > > > > > the scope of the original packet, otherwise we might unexpect= edly > > > > > > > turn global unicast addresses into link-local ones for packet= s coming > > > > > > > from the host itself. > > > > > > >=20 > > > > > > > Link: https://github.com/containers/container-libs/pull/755#i= ssuecomment-4390420134 > > > > > > > Signed-off-by: Stefano Brivio =20 > > > > > >=20 > > > > > > There's a real problem here. 
However, I don't think this patch= really > > > > > > addresses it. Details below. > > > > > > =20 > > > > > > > --- > > > > > > > fwd.c | 20 ++++++++++++++++++-- > > > > > > > 1 file changed, 18 insertions(+), 2 deletions(-) > > > > > > >=20 > > > > > > > diff --git a/fwd.c b/fwd.c > > > > > > > index 0697435..d224c0a 100644 > > > > > > > --- a/fwd.c > > > > > > > +++ b/fwd.c > > > > > > > @@ -974,6 +974,20 @@ uint8_t fwd_nat_from_splice(const struct= fwd_rule *rule, uint8_t proto, > > > > > > > return PIF_HOST; > > > > > > > } > > > > > > > =20 > > > > > > > +/** > > > > > > > + * fwd_scope6_match() - Check if the IPv6 scope of two addre= sses match > > > > > > > + * @a: First address > > > > > > > + * @b: Second address > > > > > > > + * > > > > > > > + * Return: true for two IPv6 link-local or both not link-loc= al, false otherwise > > > > > > > + * > > > > > > > + * NOTE: This currently ignores any other difference in scope > > > > > > > + */ =20 > > > > > >=20 > > > > > > Nit: we probably want this helper (or ones like it) in ip.h and= /or inany.h. > > > > > > =20 > > > > > > > +bool fwd_scope6_match(const struct in6_addr *a, const struct= in6_addr *b) > > > > > > > +{ > > > > > > > + return IN6_IS_ADDR_LINKLOCAL(a) =3D=3D IN6_IS_ADDR_LINKLOCA= L(b); =20 > > > > > >=20 > > > > > > This considers only linklocal vs. not linklocal. Officially th= ose are > > > > > > the only two scopes for unicast IPv6. But... site-local unicas= t used > > > > > > to exist, and while it's deprecated we have once seen it in the= wild. > > > > > > There's also host-local scope which I'm not sure is a term used= by > > > > > > IPv6, but is used by kernel netlinkg, and kind of exists in pra= ctice > > > > > > (::1 and nothing else is host local). =20 > > > > >=20 > > > > > Yes, see the NOTE above. 
I was trying to find out if this was in = any > > > > > way useful (and it looks like it wasn't, at least from the current > > > > > progress of https://github.com/containers/container-libs/pull/755= ). =20 > > > >=20 > > > > Ok. > > > > =20 > > > > > > > +} > > > > > > > + > > > > > > > /** > > > > > > > * nat_inbound() - Apply address translation for inbound (HO= ST to TAP) > > > > > > > * @c: Execution context > > > > > > > @@ -993,13 +1007,15 @@ bool nat_inbound(const struct ctx *c, = const union inany_addr *addr, > > > > > > > /* Specifically 127.0.0.1, not 127.0.0.0/8 */ > > > > > > > *translated =3D inany_from_v4(c->ip4.map_host_loopback); > > > > > > > } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopba= ck) && > > > > > > > - inany_equals6(addr, &in6addr_loopback)) { > > > > > > > + inany_equals6(addr, &in6addr_loopback) && > > > > > > > + fwd_scope6_match(&addr->a6, &c->ip6.map_host_loopback))= { =20 > > > > > >=20 > > > > > > This test will always be false: we just checked that addr =3D= =3D ::1, > > > > > > which is not link-local (it's host-local, if we're admitting th= at > > > > > > category). =20 > > > > >=20 > > > > > Oh, right, I didn't actually test this case. > > > > > =20 > > > > > > > translated->a6 =3D c->ip6.map_host_loopback; > > > > > > > } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr)= && > > > > > > > inany_equals4(addr, &c->ip4.addr)) { > > > > > > > *translated =3D inany_from_v4(c->ip4.map_guest_addr); > > > > > > > } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr)= && > > > > > > > - inany_equals6(addr, &c->ip6.addr)) { > > > > > > > + inany_equals6(addr, &c->ip6.addr) && > > > > > > > + fwd_scope6_match(&addr->a6, &c->ip6.map_guest_addr)) { = =20 > > > > > >=20 > > > > > > This may be usually be right in practice, but it's kind of by > > > > > > accident. 
> > > > > >=20 > > > > > > The problem with both these checks is that they compare the sco= pe of a > > > > > > host side address (addr) with the scope of a guest side address > > > > > > (c->map_*). =20 > > > > >=20 > > > > > Note that inany_equals6(addr, &c->ip6.addr) is pre-existing. = =20 > > > >=20 > > > > Right - that's the confusing but necessary semantics of > > > > --map-guest-addr. It's a translation for the thing on the outside > > > > that has the same address as the guest does on the inside. > > > >=20 > > > > What I didn't spot before is that makes the scope check equivalent = to: > > > > fwd_scope6_match(&c->ip6.addr, &c->ip6.map_guest_addr) > > > > which we should be able to implement at conf() time (someday at > > > > address update time). =20 > > >=20 > > > There might be cases where one wants to have different scopes for > > > those, though, and just apply the map_guest_addr inbound remapping for > > > packets matching its scope. I'm not sure if it's useful. =20 > >=20 > > I don't really understand what you're saying here. By definition the > > map_guest_addr translation applies with the outside address matches > > the (assigned) inside guest address. We know the scope of the > > assigned guest address in advance, we don't have to wait until we get > > a connection. >=20 > Ah, right, if it matches map_guest_addr, it's of course the same scope > (it's the same address). So that part doesn't matter. It's the guest address it's matching against, not map_guest_addr (that's the _output_ of the translation). But yes, it's the same address, so necessarily the same scope. > But, in general, I was referring to source addresses. So am I. For inbound flows, we generally only "translate" (as such) the source address. The destination is always set to the guest address whose scope matches the translated source address. 
> Let's say you > have a guest with address 2001:db8::1, and now you get an inbound > packet: >=20 > 3fff::1 -> 2001:db8::2 I'm not clear what those addresses are. The outside source and destination? > scopes match, the behaviour is clear, we'll remap to 2001:db8::1. Then > you get another one: We remap the destination to 2001:db8::1, in this case there's no reason to remap the source, so we'd leave it as is. >=20 > fe80::1 -> 2001:db8::2 I'm not sure that's possible - I think the kernel (on the host) will reject it. Basically whatever originated this packet can't know that 2001:db8::2 is on the same link, so it can't use a link-local address for the source. > should we remap it the destination to 2001:db8::1, or to a link-local add= ress? >=20 > > > > > Further, this "mismatch" is actually intended (see commit message= and > > > > > Podman's pull request I mentioned above), as I was trying to (qui= ckly) > > > > > make sure that we don't turn a global unicast request into a link= -local > > > > > one. =20 > > > >=20 > > > > Ok, I had misunderstood the problem somewhat. Looking at that gith= ub > > > > comment, I'm still pretty confused TBH :/. =20 > > >=20 > > > We didn't investigate further because it's really not that critical at > > > this point, as the follow-ups to that report should indicate. =20 > >=20 > > As I commented on github, I've now understood (I'm pretty sure) what's > > going on there. It's not actually related to map_guest_addr or > > nat_inbound() at all, but is instead the "translation of last resort" > > in fwd_nat_from_host(). > >=20 > > > > > > That's not what matters: what matters is that source and > > > > > > destination on the tap side have the same scope. =20 > > > > >=20 > > > > > Not for this particular issue: again, I was just trying to make s= ure > > > > > that a global unicast request doesn't get translated to a link-lo= cal > > > > > one. =20 > > > >=20 > > > > Hrm. It's not really clear to me why that's bad. 
=20 > > >=20 > > > Because it's surprising that a request to a valid global unicast > > > address that's assigned to a container, from another global > > > unicast address, gets translated to anything else. There's no need for > > > that. =20 > >=20 > > Isn't there? If the outside source is using the same address as the > > guest uses inside we have to apply some sort of translation (or drop > > it entirely). >=20 > I don't actually remember what happens in that case. Does Linux > handle those (at least by default) in the same way as Martian (RFC > 1812) packets, or does it let them thorough? I'm not certain. But it doesn't really matter: even if the kernel lets them through, the guest has no way to reply (because the reply would be routed back to the guest instead of to the original source). > > So, if we want to preserve scope as well, we have to > > pluck a global scope address from somewhere, and it's not clear how we > > can do that. >=20 > Right, that's a particular case I was ignoring. But AFAICT it's only because it was this case that we hit the odd translation in the first place. > I was just saying that > in general (if the source address isn't conflicting) then it would be > good to not change the scope, if possible. We already pick the guest-side destination address to match the scope of the (translated) guest side source. > If it conflicts, then I would just pick any reasonable choice to let > packets through, including changing the scope to link-local. I think we already do this. Unless the user specifically sets --map-guest-addr to a link local address, in which case it seems like we should do what they ask. We could permit both an LL and non-LL --map-guest-addr, maybe, and prioritise the one with matching scope. > > > > > This can probably checked in an indirect and more correct way "at= the > > > > > source".
> > > > > =20 > > > > > > In flow table terms, > > > > > > that is, on a single flowside oaddr and eaddr must have the same > > > > > > scope (or must they? see later). > > > > > >=20 > > > > > > The scopes on one side of the flow don't need to match the scop= es on > > > > > > the other side of the flow. In fact we need to allow them to be > > > > > > different: --map-host-loopback is *always* transforming a host-= scope > > > > > > flow on the outside to something else on the inside (either lin= k-scope > > > > > > or global-scope will work, as long as it's the same for both > > > > > > addresses). We don't do it yet, but I can imagine cases where = it > > > > > > would be useful to translate a flow that's global-scope on the = outside > > > > > > to local-scope on the inside (because for some reason we want t= o or > > > > > > have to hide the external peer's address from the guest). Or f= rom > > > > > > local-scope on the outside to global-scope on the inside (becau= se the > > > > > > outside flow is using local-scope addresses which are not meani= ngful > > > > > > to the guest). =20 > > > > >=20 > > > > > Yes, definitely, that might actually be a feature, I just think we > > > > > don't want to do that by default / mistake. =20 > > > >=20 > > > > I mean, --map-host-loopback is kind of already this. > > > > =20 > > > > > In this case we had an inbound request to a global unicast addres= s that > > > > > was translated for some reason (we didn't really investigate) to a > > > > > link-local destination address. =20 > > > >=20 > > > > Hrm, ok. I really want to understand why that happened. =20 > > >=20 > > > In the short term it's probably easier if you try out yourself someth= ing > > > like Paul described, because there are other more critical issues we > > > discovered later that we're tackling at the moment. > > > =20 > > > > What was the source address? =20 > > >=20 > > > I *think* another global unicast address. 
But maybe not and that would > > > then explain the non-issue. > > > =20 > > > > This seems like it would be handled by the selection > > > > of the guest side eaddr in fwd_nat_from_host(), which explicitly tr= ies > > > > to match scope with the (translated) source address. =20 > > >=20 > > > Maybe, yes. > > > =20 > > > > > But if it's explicit it should be allowed, by all means. > > > > > =20 > > > > > > This has some tricky implications for what we do about assigning > > > > > > addresses for "local mode" or any future variant where we need = to > > > > > > assign a guest address, but can't take one from the host. If we > > > > > > assign a link-local address, as was our plan, that implies unde= r this > > > > > > assumption that the guest can only talk to link-local machines.= In > > > > > > practice that would mean only the host (via -map-*) or in future > > > > > > things we added explicit NATs, where the guest side address is > > > > > > link-local. The guest would have no ability to contact the int= ernet > > > > > > at large. =20 > > > > >=20 > > > > > I don't think that's desirable. =20 > > > >=20 > > > > Neither do I, but how to avoid it is not obvious to me. =20 > > >=20 > > > By making it a matter of preference: if there's another, more fitting > > > (in terms of scope) address, we use that. Otherwise fall back to a > > > link-local. =20 > >=20 > > Sure, but where would we get a global scope address from? >=20 > From the guest, *if it has one*. If not, we can happily fall back to > translating to/from link-local I'd say. No.. the whole point is there's an address conflict, so we need a global scope address *different* from the guest's. > > > > > > At least once we have the netlink monitor, maybe that's ok. Wh= ile the > > > > > > host has no address, the guest has only link local, so it can o= nly > > > > > > talk to the host (or explicitly configured forwards/NATs). 
But= the > > > > > > host has no connectivity anyway, so there's nothing else to tal= k to > > > > > > anyway. When the host gets connectivity we add a global-scope = guest > > > > > > address, so it gets connectivity too. > > > > > >=20 > > > > > > If that's not good enough, I can only see two approaches, neith= er of > > > > > > which look great. > > > > > >=20 > > > > > > a) For incoming connections from the world, to a guest with on= ly an > > > > > > LL address, we NAT *both* source and destination address (u= gh, the > > > > > > bookkeeping). =20 > > > > >=20 > > > > > The bookkeeping is already in place though. =20 > > > >=20 > > > > Well.. we can DNAT easily enough, but to match scope we also need to > > > > SNAT. That means picking a link-local source address (guest oaddr) > > > > for each incoming flow. Maybe we can use our_tap_ll for that? =20 > >=20 > > I'd forgotten when I wrote this, but we're already doing this. That's > > what's causing the odd behaviour seen here. > >=20 > > > Actually, we don't really need to match the scope, though. I just thi= nk > > > it's preferable when doable. So a) could be optional, and b) could be > > > the default. At that point: =20 > >=20 > > But as noted, I'm not sure (b) works *at all* for IPv6. It also > > wouldn't help for this case: we *cannot* use the original source > > address if it's the same as the guest's. >=20 > I wasn't including a source address "conflict" in this case. I was > suggesting to use (b) in general, but here, we'll need to do something > different, indeed. Again, I don't know of a non-conflict case that's causing strange behaviour. > > > > But that means all incoming connections will appear to come from th= ere > > > > regardless of whether they are actually the same peer or not. If t= he > > > > guest is talking to enough peers we risk running out of source port= s. =20 > > >=20 > > > ...this would be a marginal risk. The user enabled that explicitly. 
= =20 > >=20 > > At present, they didn't, it's always on. But if we don't do that, > > that implies the guest cannot be reached from peers that have the same > > address as it. > >=20 > > > > Or for UDP, where we preserve source port, we risk collision between > > > > flows that are separate on the outside. Theoretically, we could av= oid > > > > this by assigning a distinct, link-local, guest side address for ea= ch > > > > peer. Doing _that_ is a lot of new bookkeeping which is what I was > > > > thinking of. > > > > =20 > > > > > > Outgoing connections to the world are only possible for tar= gets > > > > > > where we've preconfigured a NAT. > > > > > >=20 > > > > > > b) We _do_ allow different scopes on the two guest-side addres= ses. > > > > > > This implies that the guest *expects and requires* their ga= teway > > > > > > (us) to SNAT them. > > > > > >=20 > > > > > > I suspect that the guest simply won't allow (b) to work for IPv= 6, but > > > > > > it might for IPv4, since most things don't actually look for RF= C3927 > > > > > > addresses, and NAT is much more expected in general. =20 > > > > >=20 > > > > > I think it's rather complicated to define this before having play= ed > > > > > with the netlink monitor implementation itself, but again, this i= s well > > > > > beyond the scope of this patch. =20 > > > >=20 > > > > Maybe. > > > > =20 > > > > >=20 > > > > > The idea here is that *if* we have a global unicast address in a > > > > > container, the mere fact that we have a link-local address in > > > > > --map-guest-addr (not actually the case, it seems, but we haven't > > > > > investigated further), shouldn't cause inbound traffic to be mapp= ed to > > > > > that link-local address. =20 > > > >=20 > > > > Indeed it should not, but I don't yet see why that's happening. =20 > > >=20 > > > Me neither, but when Paul raised that, it looked like the most visible > > > issue we could have solved and "got everything to work". 
Things turned > > > out to be rather different in reality, so we didn't look for an > > > explanation, at least not yet. > > > =20 > > > > > But note that I'm not sure if it's an actual problem or if it's e= ven > > > > > happening at all, at this stage. =20 >=20 > --=20 > Stefano >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --ZMbmDQY8czMjVOGw Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmoNYZAACgkQzQJF27ox 2GfmOA/7BXPHMSYxTJPxeXUPN4xZwWAiEWLqJY7Q00q3S93diq3qV5WXxU+gkwqp eDZrKOiOXaf9YMeFKSnp3lTJMUS32c483U9ryAPq2ICA+nRJVM5Q9viIGXv4dHdO SUIcsQLpF18hs/enKHZU3B/bRGQHZlgZ2r+I3yp+3sWcHaqSFpITv29iz4mdWqgT w8JAOFSCcx2/n5/xeg4CtSDYbWHSBDooxwPsNtynBIoJXkZ0NuTuMw1hl+HJR6eb wiMaXTiCRt9303uNzUXA8kDXSGHnvattsD622PelHCWddN4U0ih6rOUFVViFjQTx Zn2funjKwaPcsmK5MGG6Xf090i4YywcveBI94vrduQXfA4M5Kl9RxI5Q9XPCBCId 5WEk/1VY9+4Y4i2Ui+Qz1bCfA+pzlvnwRyoe+E7lIJll9mwMHq0T6XXHS7YtTvPs qwmBl7GNzPf83E+wSYY4ycYU6ofUL8lDT9rZSX+KK+egmxWa82djpQCyhj8fOR/D yuPqMiTEy/eYnrCbTYrE11CH2DJTgyT/unACeF62mxqZakv0UGemdHeApchbCs5y HefNh3i6982hxXRlVFeGZlTdw3uX4eX8ogG/oOS++tOPilRB8PIYOzbp9RvDlVBF y7XL/0jTLYJdFShNOT+4EIkTOC8KHwOrtDhWhDIJxdStSqbAs8c= =wB6F -----END PGP SIGNATURE----- --ZMbmDQY8czMjVOGw--