Date: Wed, 17 May 2023 11:15:06 +1000
From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top, Callum Parsey, me@yawnt.com, lemmi@nerd2nerd.org
Subject: Re: [PATCH 00/10] RFC/RFT: Optionally copy all routes and addresses for pasta, allow gateway-less routes
On Tue, May 16, 2023 at 11:42:09PM +0200, Stefano Brivio wrote:
> On Tue, 16 May 2023 15:06:29 +1000
> David Gibson wrote:
>
> > On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:
> > > This series, along with pseudo-related fixes, enables:
> > >
> > > - optional copy of all routes from selected interface in outer
> > >   namespace, to (hopefully!) fix the issue reported by Callum at:
> > >   https://github.com/containers/podman/issues/18539
> > >
> > > - optional copy of all addresses, mostly for consistency. It doesn't,
> > >   however, enable assignment of multiple addresses in the sense
> > >   requested at:
> > >   https://bugs.passt.top/show_bug.cgi?id=47
> > >
> > >   because the addresses still need to be present on the host, and
> > >   the "outer" address isn't selected depending on the address used
> > >   inside the container
> > >
> > > - operation without a gateway address, to (again, hopefully) support
> > >   usage of Wireguard endpoints established outside the container,
> > >   https://bugs.passt.top/show_bug.cgi?id=49
> > >
> > > I tested the single functionalities introduced here, but I didn't
> > > try to reproduce the setups where the issues were reported, so some
> > > help with testing is definitely fundamental here. Thanks.
> >
> > I've sent reviews for some of the simpler patches in this series which
> > make sense even without the context of the overall aim. I think those
> > can be applied immediately.
>
> Those are actually the least important patches for users

Well, granted.

> -- and I can't
> apply 6/10 without breaking Podman's CI plus probably a number of
> deployments (that's why it comes after 5/10)... so, no, I would rather
> not apply the rest for the moment.

Uh.. true, 6/10 is problematic, but I think the other easy ones could
be applied safely enough.
> > For the rest of the series, I want to address the generalities before
> > doing detailed review of the implementation.
> >
> > I think the basic idea here is sound: we want to expose anything
> > routable to the host as routable to the guest, even when the host has
> > a more complex routing setup than just a netmask on the "main"
> > interface and a default gateway within that prefix.
>
> The intentions behind this series are actually slightly different:
>
> - we have a complete breakage in a seemingly common use case (I would
>   even say cloud-init setups in general), and I'd like to fix that
>   sooner rather than later

Well, sure, but we should at least think about where we're going with
this longer term, so we don't box ourselves in.

> - this concerns only the direct configuration pasta does, with
>   --config-net. What we advertise is definitely related, but not the
>   same topic... to the point that the issues fixed by this series don't
>   even occur with a DHCP client:
>   https://github.com/containers/podman/issues/18539#issuecomment-1545023424

Ah, interesting. It looks like dhclient (or rather dhclient-script, I
expect) is adding an explicit /32 route to the default gateway. It
seems to me the best quick fix for --config-net is to do the same
thing. Basically, rather than expanding the netmask as we did in 6/10,
if the gateway address is not within the interface's prefix, add a /32
or /128 route to the gateway.

> And, in general, we can't advertise everything we can configure (say,
> a route without router over DHCP).

Ah, true. The DHCP options for static routes are even more limited
than I realized. Ok, that nixes option B.3.

> I'd be much more careful about what we advertise. We have direct
> control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
> we need to think of possible interpretations and common half-bugs as
> well.
>
> > But I think we want to think a bit more deeply about exactly what we
> > need/want to expose here.
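The quick fix suggested above for --config-net comes down to one check: is the gateway inside the prefix of the address configured on the interface? If it isn't, a /32 or /128 route to the gateway goes in before the default route. Below is a minimal C sketch of that check. It assumes the address, prefix length and gateway are already known as strings and an integer. same_prefix() and gw_needs_host_route() are hypothetical helpers, not passt code. The sketch also assumes that IPv6 link-local gateways (fe80::/10) never need the extra route, since they are always on-link.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* True if the first prefix_len bits of a and b (each len bytes) match */
static bool same_prefix(const uint8_t *a, const uint8_t *b, size_t len,
			unsigned int prefix_len)
{
	size_t full = prefix_len / 8;
	unsigned int rem = prefix_len % 8;

	assert(prefix_len <= len * 8);
	if (memcmp(a, b, full))
		return false;
	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		return (a[full] & mask) == (b[full] & mask);
	}
	return true;
}

/* Hypothetical helper: 1 if the guest needs a /32 (IPv4) or /128 (IPv6)
 * route to gw before a default route via gw can be added, 0 if gw is
 * already on-link, -1 on invalid input.
 */
static int gw_needs_host_route(int af, const char *addr,
			       unsigned int prefix_len, const char *gw)
{
	uint8_t a[16], g[16];
	size_t len = (af == AF_INET) ? 4 : 16;

	if (inet_pton(af, addr, a) != 1 || inet_pton(af, gw, g) != 1)
		return -1;
	if (prefix_len > len * 8)
		return -1;

	/* Assumption: IPv6 link-local gateways (fe80::/10) are on-link */
	if (af == AF_INET6 && g[0] == 0xfe && (g[1] & 0xc0) == 0x80)
		return 0;

	return !same_prefix(a, g, len, prefix_len);
}
```

Take a made-up 172.31.5.10/32 address with the 172.31.1.1 gateway from the bug report. The check returns 1, so the guest would get a 172.31.1.1/32 link route and then the default route via 172.31.1.1. With 192.168.1.10/24 and gateway 192.168.1.1, it returns 0 and nothing extra is needed.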
> >
> > Even with the current code, the default gateway address we advertise
> > to the guest is kind of meaningless: the guest cannot directly access
> > that gateway, everything really goes through passt on the host.
>
> In the simplest, probably most common network setups, that's actually
> the gateway that connects our guest to other nodes.

I don't understand what you mean by this. Yes, we use the same IP for
the gateway that the host sees, but the NAT to host means that we
can't even talk to the gateway at L4. Literally the only thing the
guest kernel will do with that gateway address is put it into ARP and
neighbour discovery packets, which passt will resolve to its own MAC,
like nearly every other IP.

> For other cases, I think we should eventually implement
> https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
> saying that, then, we can't just use the same host route no matter what
> the container chooses. We'll need to match them.

Oh.. I'm wondering if I've caused confusion by using "host route" in
two different ways: one meaning "a route taken from the passt host
system", and the other meaning "a route to a single network host, that
is, a /32 or /128 route".

I agree that we should move to allowing multiple IPs on the guest
side, but I don't see how that conflicts with the routing issue here.

> I mean, I'm not saying that the behaviour from this series is complete
> and self-consistent, just that it works around obvious, urgent issues
> and at the same time it looks like we'll probably need something
> similar to support further use cases.

Adding a /32 or /128 route to the gateway seems a simpler way to do
that to me. Plus, it matches what the DHCP client seems to be doing
anyway.

> > This works because the gateway address (like everything) will ARP/NDP
> > to passt's host side MAC address and once the packets hit passt it
> > doesn't matter what the guest thought the routing was going to be.
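The resolution side shows why the advertised gateway address barely matters. If passt answers ARP requests for (nearly) any IPv4 address with its own MAC, the guest ends up sending everything to the same place, whatever route it picked. Below is a minimal C sketch of such a responder for an Ethernet/IPv4 ARP payload in the RFC 826 layout. The structure, arp_reply(), and the rule of not answering for the guest's own address are assumptions for illustration, not passt's actual implementation. NDP would need the equivalent for neighbour solicitations.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* ARP payload for Ethernet/IPv4 (RFC 826), fields in network byte order */
struct arp_payload {
	uint16_t htype;		/* 1: Ethernet */
	uint16_t ptype;		/* 0x0800: IPv4 */
	uint8_t hlen;		/* 6 */
	uint8_t plen;		/* 4 */
	uint16_t op;		/* 1: request, 2: reply */
	uint8_t sha[6];		/* sender hardware address */
	uint8_t spa[4];		/* sender protocol address */
	uint8_t tha[6];		/* target hardware address */
	uint8_t tpa[4];		/* target protocol address */
};

_Static_assert(sizeof(struct arp_payload) == 28, "unexpected padding");

/* Hypothetical responder: answer a request for any IPv4 address except
 * the guest's own with our_mac, so a "gateway" resolves like any other
 * address. Returns 0 if a reply was built in *rep, -1 if ignored.
 * rep must not alias req.
 */
static int arp_reply(const struct arp_payload *req, struct arp_payload *rep,
		     const uint8_t our_mac[6], const uint8_t guest_addr[4])
{
	assert(rep != req);

	if (ntohs(req->htype) != 1 || ntohs(req->ptype) != 0x0800 ||
	    req->hlen != 6 || req->plen != 4 || ntohs(req->op) != 1)
		return -1;

	/* Assumption: never claim the guest's own address */
	if (!memcmp(req->tpa, guest_addr, 4))
		return -1;

	*rep = *req;
	rep->op = htons(2);
	memcpy(rep->tha, req->sha, 6);	/* reply goes back to the asker */
	memcpy(rep->tpa, req->spa, 4);
	memcpy(rep->sha, our_mac, 6);	/* "the address you asked is us" */
	memcpy(rep->spa, req->tpa, 4);
	return 0;
}
```

The reply carries the same MAC whether the guest asked for its configured gateway or for any other neighbour. That is the sense in which the gateway address only ever shows up in ARP/NDP traffic.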
> >
> > I think we have a few choices in two more-or-less orthogonal
> > categories.
> >
> > A) What routable prefixes do we advertise to the guest?
> >
> > A.1) Always a default route (0.0.0.0/0 and ::/0)
> >
> > We tell the guest that every address is routable via the passt
> > interface, regardless of routing setup on the host. This essentially
> > tells the guest to delegate all routing responsibility to passt.
> >
> > Advantages:
> >  * Simple
> >  * No need to update anything if routing configuration on the host
> >    changes
> > Disadvantages:
> >  * If addresses are unroutable from the host, the guest will only
> >    know via ICMP/ICMPv6, rather than statically, which may be a worse
> >    UX on the guest side. Plus we might need to actually implement
> >    those host unreachable ICMPs.
> >  * Might be messy if the guest has multiple interfaces, e.g. if we
> >    allow passt to be configured to attach to a specific host
> >    interface only, then we could have multiple passts attached to a
> >    single guest: they'd all be advertising a default route.
> >
> > A.2) Copy routable prefixes from the host to the guest
>
> I'm having a hard time figuring out the definition of this point. How
> would you define that? Strictly speaking, in the case at hand, nothing
> is routable: we have a /32 address.

Right.. which means that if the host is working, it must have an
additional static route (also probably /32) telling it how to get to
the gateway. Indeed I can see it in the initial comment of the bug:

    172.31.1.1 dev ens3 proto static scope link metric 100

With A.2 we'd copy that route to the guest, or at least one with the
same prefix (which is a single address in this case).

> > We just advertise those prefixes routable to the host to the guest
> > (which might include an empty prefix == default route).
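Under A.2, the guest's routing table would hold copies of host routes such as the scope-link /32 route quoted above. The guest kernel would then pick among them by longest-prefix match. Below is a minimal C sketch of that selection over a copied IPv4 table. struct route4, route_lookup() and ip4() are hypothetical names. Addresses are kept in host byte order, and a gateway of 0 stands for an on-link route.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* One IPv4 route as it might be copied from the host; host byte order */
struct route4 {
	uint32_t dst;
	unsigned int prefix_len;
	uint32_t gw;		/* 0: no gateway, destination is on-link */
};

static uint32_t mask4(unsigned int prefix_len)
{
	assert(prefix_len <= 32);
	return prefix_len ? ~0U << (32 - prefix_len) : 0;
}

/* Parse a dotted-quad address into host byte order, 0 on error */
static uint32_t ip4(const char *s)
{
	struct in_addr a;

	if (inet_pton(AF_INET, s, &a) != 1)
		return 0;
	return ntohl(a.s_addr);
}

/* Longest-prefix match: the most specific route covering addr, or NULL */
static const struct route4 *route_lookup(const struct route4 *tbl, size_t n,
					 uint32_t addr)
{
	const struct route4 *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t m = mask4(tbl[i].prefix_len);

		if ((addr & m) != (tbl[i].dst & m))
			continue;
		if (!best || tbl[i].prefix_len > best->prefix_len)
			best = &tbl[i];
	}
	return best;
}
```

Suppose the table holds the 172.31.1.1/32 link route from the bug report plus, as an assumption here, a default route via 172.31.1.1. A lookup for 172.31.1.1 then selects the /32 route with no gateway, while any other destination, such as 8.8.8.8, falls through to the default route. In the guest, both routes would point at the same passt interface.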
> >
> > Advantages:
> >  * Guest statically knows what addresses are routable via the passt
> >    interface
> > Disadvantages:
> >  * What do we do with overlapping prefixes? On the host we might
> >    have more specific routes pointing to a specific interface. For
> >    the guest they all point to the passt interface, so what's the
> >    point?
> >  * Can we advertise an arbitrary set of static routes via all our
> >    mechanisms (--config-net, DHCP, NDP+DHCPv6)? Even if we can, it
> >    adds more complexity to that code
> >  * How do we update things if the host routing configuration changes?
> >  * What do we do if the host has source-based routing or other
> >    advanced stuff set up?
> >
> > B) What gateway, if any, do we advertise for each route?
> >
> > B.1) Copy it from the host
> >
> > Advantages:
> >  * Guest L3 configuration resembles that of the host
>
> ...which is a fundamental design goal of passt: transparency, and
> pretending it doesn't exist. Otherwise we can have a route, a bridge,
> an interface, etc.

Well... we want to be transparent for anything visible at L4. For
things only visible at L3, like routes, it's not possible for things
to look 100% identical, so I think we have some wiggle room in exactly
what we do.

> Now, while there are use cases that rely on different aspects of this
> transparency (KubeVirt and service mesh integration) I understand this
> might sound a bit dogmatic, because you might say there are more
> important use cases (which I'm not aware of) or supposed benefits.
>
> What's far less dogmatic, though, is how many issues we happily and
> automatically avoid by relying on the sanity of the host networking
> configuration.
>
> By trying to copy it as close as possible, we avoid one very important
> source of issues, which is our interpretation or possible lack of
> knowledge about how applications we don't know about chose to interact
> with kernel and network setups.
> The main case fixed by this series
> shows exactly that: I think it's broken, but it works, and users
> expect it to work.
>
> And by trusting the host configuration we don't lose much: if that's
> broken, almost everything else is broken anyway.

It's not a question of "trust" in the host configuration; it's that
parts of the host configuration don't make sense in the guest's
context. Most obviously, the interface names from the host routes
can't be used in the guest. We can and do use the same addresses for
the routers, but what does that really mean? The guest can't actually
contact them as neighbours: when it tries, they just ARP to passt's
fake MAC, and the packets get routed by the host kernel regardless of
which router the guest was trying to send them to. In fact, neither
passt nor the host kernel will even know which router the guest
thought it was using.

> > Disadvantages:
> >  * If the host route doesn't have a gateway we have to fall back on
> >    B.2 or B.3 anyway
>
> Well, they are a particular case of B.1 then: what's the disadvantage?

Two cases are more complex than one.

> This is consistent (especially with this series, and especially if we
> start adapting the *default* behaviours in this sense).
>
> >  * Misleading: in fact everything is routed by passt and the host
> >    before it reaches any gateway we're listing here
>
> But passt isn't supposed to be a router...? Let's say we have multiple
> routes on the host, we configure or advertise multiple routes to the
> guest. Does that make passt a router? I don't think so: we're just
> associating them as closely as possible, without fancy interpretations.
>
> A router has its own routing table, passt's would simply be a copy.
> Right now it has essentially none.

Sorry, by "passt" here I really meant the host kernel, which
absolutely will route the packets. There's no guarantee the next hop
will even be the router the guest thought it was using, although it's
likely.
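The interface name is the clearest example of a host route field that can't be copied as-is. Below is a minimal C sketch that turns one host IPv4 route into an iproute2 command for the guest under each gateway option in this thread: B.1 copies the gateway, B.2 uses an address representing passt, and B.3 uses none. struct host_route, enum gw_policy, guest_route_cmd() and the "eth0" guest interface name are hypothetical. Under B.1, the sketch has to fall back to a gateway-less route when the host route has no gateway, which is the extra case mentioned above. Under B.2, a passt address outside the interface's prefix would also need its own /32 route, which isn't shown here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical IPv4 route as read from the host */
struct host_route {
	struct in_addr dst;
	unsigned int prefix_len;
	struct in_addr gw;	/* INADDR_ANY: no gateway (scope link) */
	char ifname[16];	/* host interface: meaningless in the guest */
};

enum gw_policy {
	GW_COPY,	/* B.1: keep the host's gateway, if any */
	GW_PASST,	/* B.2: use an address representing passt */
	GW_NONE,	/* B.3: no gateway, everything "on-link" */
};

/* Format an "ip route add" command for the guest: r->ifname is dropped
 * and replaced by guest_if. Returns the snprintf() result, -1 on error.
 */
static int guest_route_cmd(char *buf, size_t size, const struct host_route *r,
			   enum gw_policy policy, struct in_addr passt_addr,
			   const char *guest_if)
{
	char dst[INET_ADDRSTRLEN], via[INET_ADDRSTRLEN];
	struct in_addr gw = r->gw;

	assert(r->prefix_len <= 32);
	if (policy == GW_PASST)
		gw = passt_addr;
	else if (policy == GW_NONE)
		gw.s_addr = htonl(INADDR_ANY);

	if (!inet_ntop(AF_INET, &r->dst, dst, sizeof(dst)))
		return -1;

	/* B.1 falls back to a gateway-less route if the host route has none */
	if (gw.s_addr == htonl(INADDR_ANY))
		return snprintf(buf, size, "ip route add %s/%u dev %s",
				dst, r->prefix_len, guest_if);

	if (!inet_ntop(AF_INET, &gw, via, sizeof(via)))
		return -1;
	return snprintf(buf, size, "ip route add %s/%u via %s dev %s",
			dst, r->prefix_len, via, guest_if);
}
```

Take a host default route via 172.31.1.1 on ens3. The three policies yield "ip route add 0.0.0.0/0 via 172.31.1.1 dev eth0", "ip route add 0.0.0.0/0 via 192.0.2.1 dev eth0" and "ip route add 0.0.0.0/0 dev eth0" respectively. Here 192.0.2.1 is only a placeholder for whatever address B.2 would pick.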
> > B.2) Pick an address to represent passt as gateway
> >
> > Advantages:
> >  * Accurately represents that everything is routed by passt
>
> This is configurable, actually, but no, I insist that passt isn't
> *functionally* routing anything, or at least that we should get as
> close as possible to that.

Again, the host kernel definitely will, and there's no avoiding that.

> >  * We can make this the same as the NAT-to-host address, so we only
> >    have one "magic" address (per AF)
>
> Not really, if it's configurable.

I mean one per passt instance, not one globally, as opposed to the
gateway address and the NAT-to-host address potentially being
different magic addresses within a single instance.

> > Disadvantages:
> >  * Have to allocate an address that's safe, which is tricky (but we
> >    usually want this for NAT-to-host anyway)
>
> There's a difference between picking an address by default and letting
> the user configure one. Besides, at least for IPv4, I don't think such
> an address exists.

There certainly isn't one we can use everywhere. I think we have some
options for probing for one that will be safe in a particular case.

> >  * Do we want just one address, or one for each distinct gateway from
> >    the host?
> >  * If we can't pick something in the interface's "natural" prefix, we
> >    will also need to advertise a static route to reach it.
> >
> > B.3) Don't advertise a gateway for any route
> >
> > passt essentially proxy ARPs for the entire internet.
> >
> > Advantages:
> >  * No need to allocate an address; in fact passt need not have any
> >    guest facing IP at all
> >  * Extends naturally if we ever have a guest<->passt transport that's
> >    point-to-point rather than pseudo-ethernet
> > Disadvantages:
> >  * Guest ARP / neighbour tables could get real big
>
> ...it would also break a number of applications that peek at netlink
> (or do ioctl()s) to check they are in fact online.

Uh.. what exactly are they looking at?
We'd still have at least one route; the routes just wouldn't have
gateways attached. But, as you pointed out above, I don't think we can
do this with DHCP, which pretty much kills it anyway.

> > The status quo is, roughly, A.1+B.1, except that we also enforce that
> > the host must have a default route, which sidesteps one of the
> > complications of B.1. IIUC, this series is implementing A.2+B.1.
> >
> > Thinking about it, I'm moderately convinced that B.1 is a bad idea.
> > I'm leaning towards B.2, combining it with the NAT-to-host cleanups
> > to have a more concrete guest-visible address for passt itself, but
> > I'm also open to B.3.
>
> ...that, especially B.3, sounds like another tool, or at least like
> another mode, because it conflicts quite a bit with design goals.
> They're different from design _choices_ in the sense that that's what
> I've been "selling" to users and what I and others have been
> implementing in integrations so far.

So the ways in which L4 transparency is valuable (including the guest
address) are pretty clear to me. Are there also cases where the
(partial) L3 transparency matters? They're certainly not obvious to
me.

> > I'm not sure about A.1 vs. A.2. I was leaning towards A.2, but on
> > further consideration, I feel like the fact that A.1 automatically
> > works for routing changes on the host might outweigh the fact that the
> > guest only gets limited information (ICMP) about what's routable.
>
> I don't think A.2 is doable,

?? AFAICT this series is doing A.2

> but even if it were, yes, I don't think
> it would be worth the effort. If needed (and I never saw a request in
> this sense), we could enrich ICMP/ICMPv6 handling guest- or
> container-side quite a bit.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson