From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=fail reason="key not found in DNS" header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202312 header.b=XS6bBwqd; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 26FAB5A027C for ; Wed, 21 Aug 2024 04:51:23 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1724208670; bh=u4DVoYnzYTjtWK7Q7gu9oyOKUEkhyWuJ59FFPUnFOlk=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=XS6bBwqdPDNUFNWZuPbBmqnz2Y4YFTEra8O7P2RJJqmQXFRd/dL47sNgonGpIkRFm lYyqgj6ErpVc9DtQffsffIElNre28/9CNkK1NyikoEapVLlUlywDyG74CkXrvQDB7I crKHuzVqocWJRUtEEMHQFJ/JPg6c19MikvjWjORUvU+7PZetS7g1WP/LIqUxIgpIk1 M9Ze3awPNO+KcURuPjjRLW4ahUtuLAyYpeeW4UuXVdF8+T5gegL8k4Ek+mZZc4rDy8 2D7VhCALYObxXQhaDaywuA62jHGj+h4Yb8XYxwyulE1Un7L1mEq6NIOpo8E7DTFDSt +BmkHqqHuqgGg== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4WpW7221Sdz4x8G; Wed, 21 Aug 2024 12:51:10 +1000 (AEST) Date: Wed, 21 Aug 2024 12:51:04 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 00/22] RFC: Allow configuration of special case NATs Message-ID: References: <20240816054004.1335006-1-david@gibson.dropbear.id.au> <20240819112749.63d7476d@elisabeth> <20240819150100.3bbdbb3f@elisabeth> <20240820223926.5e800f5f@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="dsvc1xQI50wVGKyp" Content-Disposition: inline In-Reply-To: <20240820223926.5e800f5f@elisabeth> Message-ID-Hash: OU7NLKEHNSPSUSITSSDWYXMD62CSJUOD X-Message-ID-Hash: OU7NLKEHNSPSUSITSSDWYXMD62CSJUOD X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --dsvc1xQI50wVGKyp Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Aug 20, 2024 at 10:39:26PM +0200, Stefano Brivio wrote: > On Tue, 20 Aug 2024 10:42:17 +1000 > David Gibson wrote: >=20 > > On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote: > > > On Mon, 19 Aug 2024 19:52:49 +1000 > > > David Gibson wrote: > > > =20 > > > > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote: =20 > > > > > On Mon, 19 Aug 2024 18:46:31 +1000 > > > > > David Gibson wrote: > > > > > =20 > > > > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote: = =20 > > > > > > > Based on Stefano's recent patch for faster tests. > > > > > > >=20 > > > > > > > Allow the user to specify which addresses are translated when= used by > > > > > > > the guest, rather than always being the gateway address or no= thing. > > > > > > > We also allow this remapping to go to the host's global addre= ss (more > > > > > > > precisely the address assigned to the guest) rather than just= host > > > > > > > loopback. > > > > > > >=20 > > > > > > > Suggestions for better names for the new options in patches 2= 0 & 22 > > > > > > > are most welcome. > > > > > > >=20 > > > > > > > Along the way to implementing that make many changes to clari= fy what > > > > > > > various addresses we track mean, fixing a number of small bug= s as > > > > > > > well. > > > > > > >=20 > > > > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_= tcp perf > > > > > > > tests. I haven't managed to figure out why it's causing the = problem, > > > > > > > or even what the exact triggering conditions are (running the= single > > > > > > > stalling iperf alone doesn't do it). Have to wrap up for tod= ay, so I > > > > > > > thought I'd get this out for review anyway. =20 > > > > > >=20 > > > > > > I've identified the bug here. IMO, it's a pre-existing problem= that > > > > > > only works by accident at the moment. The immediate fix is pre= tty > > > > > > obvious, but it raises some broader questions > > > > > >=20 > > > > > > The problem arises because of the MTU changes we make in order = to test > > > > > > throughput with different packet sizes. Specifically we change= the > > > > > > MTU to values < 1280, which implicitly disables IPv6 since it r= equires > > > > > > an MTU >=3D 1280. When we change the MTU back to a larger valu= e IPv6 is > > > > > > re-enabled, but some configuration has been lost in the meantim= e. > > > > > >=20 > > > > > > After the MTU is restored the guest reconfigures with NDP, but = does > > > > > > not re-DHCPv6. That means the guest gets a SLAAC address in th= e right > > > > > > prefix but not the exact /128 address we've tried to assign to = it. > > > > > > However, at least with the sequence of things we have in the te= sts, > > > > > > the guest never sends any packets with the new address, so passt > > > > > > doesn't update addr_seen. When the inbound connection comes we= send > > > > > > it to the assigned address instead of the guest's actual addres= s and > > > > > > the guest rejects it. =20 > > > > >=20 > > > > > I still have to take a closer look, but I'm fairly sure I hit a s= imilar > > > > > issue while I was writing these tests originally. I pondered > > > > > reconfiguring the address via DHCPv6, or using the keep_addr_on_d= own > > > > > sysctl (net.ipv6.conf..keep_addr_on_down), which was a= dded > > > > > around that time. > > > > >=20 > > > > > Then: > > > > > =20 > > > > > > This "worked" previously, because before this patch, passt would > > > > > > translate the inbound connection to have source/dest as link-lo= cal > > > > > > addresses. =20 > > > > >=20 > > > > > ...I realised that this worked and forgot about the whole issue. > > > > > =20 > > > > > > We *do* have a current addr_ll_seen because (a) it won't > > > > > > change if the guest doesn't change MAC and (b) when IPv6 is re-= enabled > > > > > > the NDP traffic the guest generates will have link-local addres= ses > > > > > > that update addr_ll_seen. With this patch, and a global addres= s for > > > > > > --map-host-loopback, we now need to send to addr_seen instead of > > > > > > addr_ll_seen, hence exposing the bug. > > > > > >=20 > > > > > > In the short term, the obvious fix would be to re-run dhclient = -6 in > > > > > > the guest after we twiddle MTU but before running IPv6 tests. = =20 > > > > >=20 > > > > > I guess setting keep_addr_on_down (even for "all" interfaces) sho= uld > > > > > work as well. =20 > > > >=20 > > > > Sounds like it. I wasn't aware of that one. > > > >=20 > > > > /me tests.. actually, no it doesn't work.. > > > >=20 > > > > # sysctl -a | grep keep_addr_on_down > > > > net.ipv6.conf.all.keep_addr_on_down =3D 1 > > > > net.ipv6.conf.default.keep_addr_on_down =3D 1 > > > > net.ipv6.conf.dummy0.keep_addr_on_down =3D 1 > > > > net.ipv6.conf.lo.keep_addr_on_down =3D 0 > > > > # ip addr add 2001:db8::1 dev dummy0 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qle= n 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group d= efault qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > inet6 2001:db8::1/128 scope global=20 > > > > valid_lft forever preferred_lft forever > > > > # ip link set dummy0 mtu 1200 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qle= n 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1200 qdisc noop state DOWN group d= efault qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > # ip link set dummy0 mtu 1500 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qle= n 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group d= efault qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > >=20 > > > > My guess is that IPv6 being deconfigured because of an unsuitable M= TU > > > > is considered a different event from a mere "down". =20 > > >=20 > > > I guess it's because they're not IFA_F_PERMANENT, because > > > addrconf_permanent_addr() has: > > >=20 > > > case NETDEV_CHANGEMTU: > > > /* if MTU under IPV6_MIN_MTU stop IPv6 on this interf= ace. */ > > > if (dev->mtu < IPV6_MIN_MTU) { > > > addrconf_ifdown(dev, dev !=3D net->loopback_d= ev); > > > break; > > > } > > >=20 > > > but addrconf_ifdown() does: > > >=20 > > > if (!keep_addr || > > > !(ifa->flags & IFA_F_PERMANENT) || > > > addr_is_local(&ifa->addr)) { > > > hlist_del_init_rcu(&ifa->addr= _lst); > > > goto restart; > > > } > > >=20 > > > I'm not sure about the logic behind that. We could actually set those > > > addresses as permanent once the DHCPv6 client configures them, if it's > > > cleaner. =20 > >=20 > > Huh. Not in the passt/VM case, though, which is where I actually > > encountered this. >=20 > I meant using ip(8) from the test script itself, but it doesn't > actually make sense: >=20 > # ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev = eth0 > Warning: permanent option is not mutable from userspace >=20 > because (RFC 3549): >=20 > IFA_F_PERMANENT For a permanent address set by the user. > When this is not set, it means the address > was dynamically created (e.g., by stateless > autoconfiguration). >=20 > So the address you used in your test _should_ have IFA_F_PERMANENT. The > plot thickens. >=20 > I just tried this, which confirms your hypothesis that bringing the > link down is a different event: >=20 > # ip addr add 2001:db8::1 dev dummy0 > # ip link set dummy0 down > # ip addr show dev dummy0 > 5: dummy0: mtu 1280 qdisc noqueue state DOWN group defa= ult qlen 1000 > link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff > inet 1.2.3.1/24 scope global dummy0 > valid_lft forever preferred_lft forever > inet6 2001:db8::1/128 scope global=20 > valid_lft forever preferred_lft forever > # ip link set dummy0 mtu 1279 > # ip addr show dev dummy0 > 5: dummy0: mtu 1279 qdisc noqueue state DOWN group defa= ult qlen 1000 > link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff > inet 1.2.3.1/24 scope global dummy0 > valid_lft forever preferred_lft forever >=20 > ...I just can't see that from the code. Ok. > > > > > > This kind of opens a question about how hard we should try to > > > > > > accomodate guests which don't configure themselves how we told = them. =20 > > > > >=20 > > > > > There's a notable distinction between guests temporarily divergin= g (in > > > > > different ways) and guests we don't configure at all. =20 > > > >=20 > > > > I'm not really sure what you're getting at here. =20 > > >=20 > > > In this case, it's not true that the guest doesn't configure itself in > > > the way we requested -- it's just a temporary diversion from that > > > configuration. =20 > >=20 > > Oh, I see. Assuming that at some point the DHCP client will re-run. > >=20 > > > Those are different cases that we can handle in different ways, I > > > think. If it's a glitch that will only happen during testing, let's > > > work around that. > > >=20 > > > But if the guest really ignores DHCPv6 information, I think we should > > > keep that working. > > > =20 > > > > > It's probably more important to ensure we use the right type of a= ddress =20 > > > >=20 > > > > "type" in what sense here? =20 > > >=20 > > > Global unicast instead of link-local. =20 > >=20 > > Ok. > >=20 > > > > > (security) rather than ensuring we somehow manage to deliver pack= ets at > > > > > any time (minor glitch otherwise), also because the one you descr= ibe is > > > > > something we're unlikely to hit outside of tests. > > > > > =20 > > > > > > Personally I'd be ok with saying that nothing works if the guest > > > > > > doesn't configure itself properly, thereby removing addr_seen a= nd > > > > > > addr_ll_seen entirely. But I think, Stefano, you've been again= st that > > > > > > idea in the past. =20 > > > > >=20 > > > > > Yes, I still think we should support guests that don't use DHCPv6= or > > > > > NDP at all, =20 > > > >=20 > > > > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to > > > > manually configure the interface in the guest to match the address > > > > you've configured with -a. Just like you'd expect to have to > > > > correctly configure your address on a real network. =20 > > >=20 > > > True, but if we make correctness as optional as possible, we'll be mo= re > > > compatible (less time spent by users fixing situations that don't > > > necessarily need fixing, less time spent by developers to look into > > > reports, no matter who's at fault). =20 > >=20 > > Eh, maybe. Unless us trying to make sense of a nonsense situation > > causes some unpredictable behaviour that breaks something else. > >=20 > > > > > or where related exchanges fail for any reason. It improves > > > > > reliability and compatibility at a small cost. In this case, I th= ink > > > > > it's a nice feature that we would resume communicating as soon as= the > > > > > guest shows its global unicast address. =20 > > > >=20 > > > > Hm, maybe. I'm not entirely convinced the cost is so small long te= rm. > > > > It's pretty badly incompatible with having multiple guests behind t= he > > > > same passt instance: such as the initial guest bridging or routing = to > > > > nested guests. =20 > > >=20 > > > Why? We will need to hash the interface/guest index anyway, for > > > outbound flows. =20 > >=20 > > If we have separate interfaces for each guest, yes. But not if we > > have multiple guests behind a single tap because the initial guest > > sets up a bridge or routing. Then we have nothing but the address. >=20 > ...but then we should have multiple addresses anyway. Yes.. that's kind of my point. > By the way, I'm > not sure we'll ever be able to support that kind of configuration. I don't see why not. It would require configuration so that it's clear what each inbound forward targets. But I don't see any inherent problem here, though there are a number of current implementation details which prevent it (addr_seen is one, replying to all arps is another). > How > does a guest set up a bridge and use passt at the same time? I'm not thinking of a bridge shared with the host, but a bridge (or routing) between nested guests or namespaces. This is essentially the "private switch with pasta uplink" case we've discussed occasionally before. It doesn't technically have to be nested guests - the guest could bridge between its uplink and a tunnel, but nested guests is the likely use case. > > > And for inbound flows, if a guest steals the address of another guest, > > > we'll give priority to the normal 'addr' versions instead of the > > > '_seen' ones, to decide how to direct traffic. =20 > >=20 > > I don't see how we'd know we're in this situation, so when to > > prioritise which address over the other. >=20 > In the set of all addr_seen and addr, we would have at least a > non-unique value. Or, practically speaking, we should refuse to set > addr_seen if it matches addr for another guest. Ah, ok. So again, assuming a static configuration of known guests, rather than a local bridge established by a guest at runtime. > > > > I'm actually not sure if encountering this bug makes me more or less > > > > in favour of addr_seen. On the one hand I think it highlights the > > > > flakiness of this approach; there are situations where we just won't > > > > know the right address. =20 > > >=20 > > > I don't understand this argument: indeed, there are such situations, > > > and they are annoying. Why should we make them more common? =20 > >=20 > > Because predictability is good, and working _most_ of the time is a > > failure of predictability. >=20 > It avoids substantial effort and frustration for everybody involved > though. The practical problem with lacking predictability is if it > makes things harder to debug, I guess, which shouldn't be the case here. >=20 > > > > On the other hand if shows a relatively > > > > plausible case where the guest won't get exactly the address we want > > > > it to (it uses NDP but not DHCPv6) > > > >=20 > > > > Hrm... actually this also shows a potential danger in the recent > > > > patches to disable DAD in the guest. With DAD enabled, when the gu= est > > > > grabs a new address, we'd expect it to emit DAD messages, which wou= ld > > > > have the side effect of updating our addr_seen (although I'm pretty > > > > sure I hit this patch before the nodad patches were applied, so that > > > > doesn't seem to be foolproof). =20 > > >=20 > > > Well, but we do that for containers with --config-net only. In that > > > case, the addresses we configure have infinite lifetime anyway. =20 > >=20 > > Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing > > its new address. >=20 > It probably did, but we ignored that anyway because DAD is done by > sending neighbour solicitations with an unspecified address as source, > for example (the "change" here drops "nodad"): >=20 > $ ./pasta --config-net -p dad.pcap > Saving packet capture to dad.pcap > # ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64 > # tshark -r dad.pcap |grep Neigh > Running as user "root" and group "root". This could be dangerous. > 10 2.642467 :: =E2=86=92 ff02::1:ff01:e9a1 ICMPv6 86 Neigh= bor Solicitation for fe80::3882:b5ff:fe01:e9a1 >=20 > and in tap6_handler() we do: >=20 > } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){ > c->ip6.addr_seen =3D *saddr; > } >=20 > ...then, in ndp(): >=20 > if (IN6_IS_ADDR_UNSPECIFIED(saddr)) > return 1; >=20 > we could set addr_seen by looking at the *target* address of the > neighbour solicitation when the source address is ::, but it's not > implemented yet. Right. I forgot the NS went out with :: as source. Snooping the NS that way again assumes that there's only one logical machine on the guest side. But since this is for addr_seen which fundamentally assumes that anyway, I guess it doesn't make anything worse. > > > Besides, I don't think we need to have addr_seen updated as quickly a= nd > > > correctly as possible just for the sake of it, we can also update it > > > when we get any other neighbour solicitation because the guest is > > > actually using the network. It's not meant to be perfect. =20 > >=20 > > If the guest is a pure server (a common case for containers AFAICT), > > then I don't know that we can expect NS messages for anything other > > than the default gateway, which is (typically) link-local and so won't > > help us to learn the new global address. >=20 > Containers running actual applications are noisy. I've only seen this > kind of problem (addr_seen not set/matching) in particularly crafted > test environments. >=20 > > > > We could maybe update addr_seen when we send RA messages to the gue= st > > > > - assuming that it will use the same host part (low 64-bits) for bo= th > > > > link-local and global addresses. Not sure if that's a widely safe > > > > assumption or not. =20 > > >=20 > > > I don't understand: what case are you trying to cover with this? =20 > >=20 > > A case just like the one in the tests: the interface bounces, and we > > get NDP traffic on the link-local address, but nothing on the global > > address before an inbound connection. >=20 > Oh, I see. I think it makes sense, even though we'll set addr_seen a > bit too early, but not enough to be a practical issue, I think. Yes, but I think snopping the NS from DAD is probably a better idea. --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --dsvc1xQI50wVGKyp Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmbFVhcACgkQzQJF27ox 2Gcddw//TIY8ih/Ot218KTYrD2P8AheGTl3xUBht7Suq2+9NCmNtjs9dtaaNRaAI RjqE8N9zTd3klLam5or/mbnnx71Re+XPHZXY8pYa+1bmy3oIm2whDdXCtIPLMgAM 0L10I+c1I3JFqQi7E5aDIAHUsQfgPcwMSAjNOWb82X9lunUMOgyEdxWONbUN4WbG j6/hsg1uOuBA94T2X8AYMZZE6jY9umD8G0KCuBRvNjj+O7vdCov4L9LYibhgApHm 6F3iMfcbkd2in3+1VnNG5cqKqTDz3PCigLENSE6CBK1OH5VL2oGHJau30Qt38PSB tm8wCtq0AcwQVkibeemo4imDY4Mgq6Ok6d7hrGC4bDe7t61yeQDrJgt8ETFNkHJ8 tx8DvElhysEn2DZ17u6uUX3AWc5gukFueADLAzzdMMtZAh6MFF1bxeAKVp0GRsvh ciF5fK86hLUZ0HEEq0eiFhj6VSShazVZYN1QQmVHUGG9d/oEUxWdNjGtRUnyG4Yi ki9otqYn+1q7mhriO8L/xJL1YRoyeW8M5mLrGwU6ZTI/pYMUa7zClIdrJOp1KZnY Qg+H116m3Wm3N/jwOceib0vO0agdBftB+gzkZvzDQeZsJA+7zLxxMAf+ht3ruL2O CLQCtQgxh7aObWoTWA7JW1r0EGlrPtGXxAzDsuHgH+r8rLB/Jgg= =oRv2 -----END PGP SIGNATURE----- --dsvc1xQI50wVGKyp--