From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Paul Holzinger
Subject: Re: [PATCH RFT] fwd: Only do inbound IPv6 NAT to map_host_loopback / map_guest_addr with matching scope
Date: Wed, 20 May 2026 02:37:14 +0200 (CEST)
Message-ID: <20260520023713.69cb52e8@elisabeth>
References: <20260507043149.1989693-1-sbrivio@redhat.com> <20260514010816.5ccc02de@elisabeth> <20260515005015.6e23cc47@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Mon, 18 May 2026 13:33:09 +1000
David Gibson wrote:

> On Fri, May 15, 2026 at 12:50:23AM +0200, Stefano Brivio wrote:
> > On Thu, 14 May 2026 14:22:51 +1000
> > David Gibson wrote:
> >
> > > On Thu, May 14, 2026 at 01:08:16AM +0200, Stefano Brivio wrote:
> > > > On Wed, 13 May 2026 15:04:35 +1000
> > > > David Gibson wrote:
> > > >
> > > > > On Thu, May 07, 2026 at 06:31:49AM +0200, Stefano Brivio wrote:
> > > > > > I'm sharing this mostly for debugging / investigation of:
> > > > > >
> > > > > >   https://github.com/containers/container-libs/pull/755#issuecomment-4390420134
> > > > > >
> > > > > > even though the change is probably correct and needed regardless of
> > > > > > that.
> > > > > >
> > > > > > If we have map_guest_addr or map_host_loopback addresses set for IPv6,
> > > > > > before using them for inbound NAT from the host, make sure they match
> > > > > > the scope of the original packet, otherwise we might unexpectedly
> > > > > > turn global unicast addresses into link-local ones for packets coming
> > > > > > from the host itself.
> > > > > >
> > > > > > Link: https://github.com/containers/container-libs/pull/755#issuecomment-4390420134
> > > > > > Signed-off-by: Stefano Brivio
> > > > >
> > > > > There's a real problem here.  However, I don't think this patch really
> > > > > addresses it.  Details below.
> > > > >
> > > > > > ---
> > > > > >  fwd.c | 20 ++++++++++++++++++--
> > > > > >  1 file changed, 18 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/fwd.c b/fwd.c
> > > > > > index 0697435..d224c0a 100644
> > > > > > --- a/fwd.c
> > > > > > +++ b/fwd.c
> > > > > > @@ -974,6 +974,20 @@ uint8_t fwd_nat_from_splice(const struct fwd_rule *rule, uint8_t proto,
> > > > > >  	return PIF_HOST;
> > > > > >  }
> > > > > >
> > > > > > +/**
> > > > > > + * fwd_scope6_match() - Check if the IPv6 scope of two addresses match
> > > > > > + * @a:		First address
> > > > > > + * @b:		Second address
> > > > > > + *
> > > > > > + * Return: true for two IPv6 link-local or both not link-local, false otherwise
> > > > > > + *
> > > > > > + * NOTE: This currently ignores any other difference in scope
> > > > > > + */
> > > > >
> > > > > Nit: we probably want this helper (or ones like it) in ip.h and/or inany.h.
> > > > >
> > > > > > +bool fwd_scope6_match(const struct in6_addr *a, const struct in6_addr *b)
> > > > > > +{
> > > > > > +	return IN6_IS_ADDR_LINKLOCAL(a) == IN6_IS_ADDR_LINKLOCAL(b);
> > > > >
> > > > > This considers only linklocal vs. not linklocal.  Officially those are
> > > > > the only two scopes for unicast IPv6.  But... site-local unicast used
> > > > > to exist, and while it's deprecated we have once seen it in the wild.
> > > > > There's also host-local scope which I'm not sure is a term used by
> > > > > IPv6, but is used by kernel netlink, and kind of exists in practice
> > > > > (::1 and nothing else is host local).
> > > >
> > > > Yes, see the NOTE above. I was trying to find out if this was in any
> > > > way useful (and it looks like it wasn't, at least from the current
> > > > progress of https://github.com/containers/container-libs/pull/755).
> > >
> > > Ok.
> > >
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * nat_inbound() - Apply address translation for inbound (HOST to TAP)
> > > > > >   * @c:		Execution context
> > > > > > @@ -993,13 +1007,15 @@ bool nat_inbound(const struct ctx *c, const union inany_addr *addr,
> > > > > >  		/* Specifically 127.0.0.1, not 127.0.0.0/8 */
> > > > > >  		*translated = inany_from_v4(c->ip4.map_host_loopback);
> > > > > >  	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback) &&
> > > > > > -		   inany_equals6(addr, &in6addr_loopback)) {
> > > > > > +		   inany_equals6(addr, &in6addr_loopback) &&
> > > > > > +		   fwd_scope6_match(&addr->a6, &c->ip6.map_host_loopback)) {
> > > > >
> > > > > This test will always be false: we just checked that addr == ::1,
> > > > > which is not link-local (it's host-local, if we're admitting that
> > > > > category).
> > > >
> > > > Oh, right, I didn't actually test this case.
> > > >
> > > > > >  		translated->a6 = c->ip6.map_host_loopback;
> > > > > >  	} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_guest_addr) &&
> > > > > >  		   inany_equals4(addr, &c->ip4.addr)) {
> > > > > >  		*translated = inany_from_v4(c->ip4.map_guest_addr);
> > > > > >  	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_guest_addr) &&
> > > > > > -		   inany_equals6(addr, &c->ip6.addr)) {
> > > > > > +		   inany_equals6(addr, &c->ip6.addr) &&
> > > > > > +		   fwd_scope6_match(&addr->a6, &c->ip6.map_guest_addr)) {
> > > > >
> > > > > This may usually be right in practice, but it's kind of by
> > > > > accident.
> > > > >
> > > > > The problem with both these checks is that they compare the scope of a
> > > > > host side address (addr) with the scope of a guest side address
> > > > > (c->map_*).
> > > >
> > > > Note that inany_equals6(addr, &c->ip6.addr) is pre-existing.
> > >
> > > Right - that's the confusing but necessary semantics of
> > > --map-guest-addr.  It's a translation for the thing on the outside
> > > that has the same address as the guest does on the inside.
> > >
> > > What I didn't spot before is that makes the scope check equivalent to:
> > > 	fwd_scope6_match(&c->ip6.addr, &c->ip6.map_guest_addr)
> > > which we should be able to implement at conf() time (someday at
> > > address update time).
> >
> > There might be cases where one wants to have different scopes for
> > those, though, and just apply the map_guest_addr inbound remapping for
> > packets matching its scope. I'm not sure if it's useful.
>
> I don't really understand what you're saying here.  By definition the
> map_guest_addr translation applies when the outside address matches
> the (assigned) inside guest address.  We know the scope of the
> assigned guest address in advance, we don't have to wait until we get
> a connection.

Ah, right, if it matches map_guest_addr, it's of course the same scope
(it's the same address). So that part doesn't matter.

But, in general, I was referring to source addresses. Let's say you
have a guest with address 2001:db8::1, and now you get an inbound
packet:

  3fff::1 -> 2001:db8::2

scopes match, the behaviour is clear, we'll remap to 2001:db8::1. Then
you get another one:

  fe80::1 -> 2001:db8::2

should we remap the destination to 2001:db8::1, or to a link-local
address?

> > > > Further, this "mismatch" is actually intended (see commit message and
> > > > Podman's pull request I mentioned above), as I was trying to (quickly)
> > > > make sure that we don't turn a global unicast request into a link-local
> > > > one.
> > >
> > > Ok, I had misunderstood the problem somewhat.  Looking at that github
> > > comment, I'm still pretty confused TBH :/.
> >
> > We didn't investigate further because it's really not that critical at
> > this point, as the follow-ups to that report should indicate.
>
> As I commented on github, I've now understood (I'm pretty sure) what's
> going on there.  It's not actually related to map_guest_addr or
> nat_inbound() at all, but is instead the "translation of last resort"
> in fwd_nat_from_host().
>
> > > > > That's not what matters: what matters is that source and
> > > > > destination on the tap side have the same scope.
> > > >
> > > > Not for this particular issue: again, I was just trying to make sure
> > > > that a global unicast request doesn't get translated to a link-local
> > > > one.
> > >
> > > Hrm.  It's not really clear to me why that's bad.
> >
> > Because it's surprising that a request to a valid global unicast
> > address that's assigned to a container, from another global
> > unicast address, gets translated to anything else. There's no need for
> > that.
>
> Isn't there?  If the outside source is using the same address as the
> guest uses inside we have to apply some sort of translation (or drop
> it entirely).

I don't actually remember what happens in that case. Does Linux handle
those (at least by default) in the same way as Martian (RFC 1812)
packets, or does it let them through?

> So, if we want to preserve scope as well, we have to
> pluck a global scope address from somewhere, and it's not clear how we
> can do that.

Right, that's a particular case I was ignoring. I was just saying that
in general (if the source address isn't conflicting) then it would be
good to not change the scope, if possible.

If it conflicts, then I would just pick any reasonable choice to let
packets through, including changing the scope to link-local.

> > > > This can probably be checked in an indirect and more correct way "at the
> > > > source".
> > > >
> > > > > In flow table terms,
> > > > > that is, on a single flowside oaddr and eaddr must have the same
> > > > > scope (or must they? see later).
> > > > >
> > > > > The scopes on one side of the flow don't need to match the scopes on
> > > > > the other side of the flow.  In fact we need to allow them to be
> > > > > different: --map-host-loopback is *always* transforming a host-scope
> > > > > flow on the outside to something else on the inside (either link-scope
> > > > > or global-scope will work, as long as it's the same for both
> > > > > addresses).  We don't do it yet, but I can imagine cases where it
> > > > > would be useful to translate a flow that's global-scope on the outside
> > > > > to local-scope on the inside (because for some reason we want to or
> > > > > have to hide the external peer's address from the guest).  Or from
> > > > > local-scope on the outside to global-scope on the inside (because the
> > > > > outside flow is using local-scope addresses which are not meaningful
> > > > > to the guest).
> > > >
> > > > Yes, definitely, that might actually be a feature, I just think we
> > > > don't want to do that by default / mistake.
> > >
> > > I mean, --map-host-loopback is kind of already this.
> > >
> > > > In this case we had an inbound request to a global unicast address that
> > > > was translated for some reason (we didn't really investigate) to a
> > > > link-local destination address.
> > >
> > > Hrm, ok.  I really want to understand why that happened.
> >
> > In the short term it's probably easier if you try out yourself something
> > like Paul described, because there are other more critical issues we
> > discovered later that we're tackling at the moment.
> >
> > > What was the source address?
> >
> > I *think* another global unicast address. But maybe not and that would
> > then explain the non-issue.
> >
> > > This seems like it would be handled by the selection
> > > of the guest side eaddr in fwd_nat_from_host(), which explicitly tries
> > > to match scope with the (translated) source address.
> >
> > Maybe, yes.
> >
> > > > But if it's explicit it should be allowed, by all means.
> > > >
> > > > > This has some tricky implications for what we do about assigning
> > > > > addresses for "local mode" or any future variant where we need to
> > > > > assign a guest address, but can't take one from the host.  If we
> > > > > assign a link-local address, as was our plan, that implies under this
> > > > > assumption that the guest can only talk to link-local machines.  In
> > > > > practice that would mean only the host (via -map-*) or in future
> > > > > things we added explicit NATs, where the guest side address is
> > > > > link-local.  The guest would have no ability to contact the internet
> > > > > at large.
> > > >
> > > > I don't think that's desirable.
> > >
> > > Neither do I, but how to avoid it is not obvious to me.
> >
> > By making it a matter of preference: if there's another, more fitting
> > (in terms of scope) address, we use that. Otherwise fall back to a
> > link-local.
>
> Sure, but where would we get a global scope address from?

From the guest, *if it has one*. If not, we can happily fall back to
translating to/from link-local I'd say.

> > > > > At least once we have the netlink monitor, maybe that's ok.  While the
> > > > > host has no address, the guest has only link local, so it can only
> > > > > talk to the host (or explicitly configured forwards/NATs).  But the
> > > > > host has no connectivity anyway, so there's nothing else to talk to
> > > > > anyway.  When the host gets connectivity we add a global-scope guest
> > > > > address, so it gets connectivity too.
> > > > >
> > > > > If that's not good enough, I can only see two approaches, neither of
> > > > > which look great.
> > > > >
> > > > >  a) For incoming connections from the world, to a guest with only an
> > > > >     LL address, we NAT *both* source and destination address (ugh, the
> > > > >     bookkeeping).
> > > >
> > > > The bookkeeping is already in place though.
> > >
> > > Well.. we can DNAT easily enough, but to match scope we also need to
> > > SNAT.  That means picking a link-local source address (guest oaddr)
> > > for each incoming flow.  Maybe we can use our_tap_ll for that?
>
> I'd forgotten when I wrote this, but we're already doing this.  That's
> what's causing the odd behaviour seen here.
>
> > Actually, we don't really need to match the scope, though. I just think
> > it's preferable when doable. So a) could be optional, and b) could be
> > the default. At that point:
>
> But as noted, I'm not sure (b) works *at all* for IPv6.  It also
> wouldn't help for this case: we *cannot* use the original source
> address if it's the same as the guest's.

I wasn't including a source address "conflict" in this case. I was
suggesting to use (b) in general, but here, we'll need to do something
different, indeed.

> > > But that means all incoming connections will appear to come from there
> > > regardless of whether they are actually the same peer or not.  If the
> > > guest is talking to enough peers we risk running out of source ports.
> >
> > ...this would be a marginal risk. The user enabled that explicitly.
>
> At present, they didn't, it's always on.  But if we don't do that,
> that implies the guest cannot be reached from peers that have the same
> address as it.
>
> > > Or for UDP, where we preserve source port, we risk collision between
> > > flows that are separate on the outside.  Theoretically, we could avoid
> > > this by assigning a distinct, link-local, guest side address for each
> > > peer.  Doing _that_ is a lot of new bookkeeping which is what I was
> > > thinking of.
> > >
> > > > >     Outgoing connections to the world are only possible for targets
> > > > >     where we've preconfigured a NAT.
> > > > >
> > > > >  b) We _do_ allow different scopes on the two guest-side addresses.
> > > > >     This implies that the guest *expects and requires* their gateway
> > > > >     (us) to SNAT them.
> > > > >
> > > > > I suspect that the guest simply won't allow (b) to work for IPv6, but
> > > > > it might for IPv4, since most things don't actually look for RFC3927
> > > > > addresses, and NAT is much more expected in general.
> > > >
> > > > I think it's rather complicated to define this before having played
> > > > with the netlink monitor implementation itself, but again, this is well
> > > > beyond the scope of this patch.
> > >
> > > Maybe.
> > >
> > > >
> > > > The idea here is that *if* we have a global unicast address in a
> > > > container, the mere fact that we have a link-local address in
> > > > --map-guest-addr (not actually the case, it seems, but we haven't
> > > > investigated further), shouldn't cause inbound traffic to be mapped to
> > > > that link-local address.
> > >
> > > Indeed it should not, but I don't yet see why that's happening.
> >
> > Me neither, but when Paul raised that, it looked like the most visible
> > issue we could have solved and "got everything to work". Things turned
> > out to be rather different in reality, so we didn't look for an
> > explanation, at least not yet.
> >
> > > > But note that I'm not sure if it's an actual problem or if it's even
> > > > happening at all, at this stage.

-- 
Stefano