On Tue, Aug 20, 2024 at 10:39:26PM +0200, Stefano Brivio wrote: > On Tue, 20 Aug 2024 10:42:17 +1000 > David Gibson wrote: > > > On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote: > > > On Mon, 19 Aug 2024 19:52:49 +1000 > > > David Gibson wrote: > > > > > > > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote: > > > > > On Mon, 19 Aug 2024 18:46:31 +1000 > > > > > David Gibson wrote: > > > > > > > > > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote: > > > > > > > Based on Stefano's recent patch for faster tests. > > > > > > > > > > > > > > Allow the user to specify which addresses are translated when used by > > > > > > > the guest, rather than always being the gateway address or nothing. > > > > > > > We also allow this remapping to go to the host's global address (more > > > > > > > precisely the address assigned to the guest) rather than just host > > > > > > > loopback. > > > > > > > > > > > > > > Suggestions for better names for the new options in patches 20 & 22 > > > > > > > are most welcome. > > > > > > > > > > > > > > Along the way to implementing that make many changes to clarify what > > > > > > > various addresses we track mean, fixing a number of small bugs as > > > > > > > well. > > > > > > > > > > > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf > > > > > > > tests. I haven't managed to figure out why it's causing the problem, > > > > > > > or even what the exact triggering conditions are (running the single > > > > > > > stalling iperf alone doesn't do it). Have to wrap up for today, so I > > > > > > > thought I'd get this out for review anyway. > > > > > > > > > > > > I've identified the bug here. IMO, it's a pre-existing problem that > > > > > > only works by accident at the moment. The immediate fix is pretty > > > > > > obvious, but it raises some broader questions > > > > > > > > > > > > The problem arises because of the MTU changes we make in order to test > > > > > > throughput with different packet sizes. Specifically we change the > > > > > > MTU to values < 1280, which implicitly disables IPv6 since it requires > > > > > > an MTU >= 1280. When we change the MTU back to a larger value IPv6 is > > > > > > re-enabled, but some configuration has been lost in the meantime. > > > > > > > > > > > > After the MTU is restored the guest reconfigures with NDP, but does > > > > > > not re-DHCPv6. That means the guest gets a SLAAC address in the right > > > > > > prefix but not the exact /128 address we've tried to assign to it. > > > > > > However, at least with the sequence of things we have in the tests, > > > > > > the guest never sends any packets with the new address, so passt > > > > > > doesn't update addr_seen. When the inbound connection comes we send > > > > > > it to the assigned address instead of the guest's actual address and > > > > > > the guest rejects it. > > > > > > > > > > I still have to take a closer look, but I'm fairly sure I hit a similar > > > > > issue while I was writing these tests originally. I pondered > > > > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down > > > > > sysctl (net.ipv6.conf..keep_addr_on_down), which was added > > > > > around that time. > > > > > > > > > > Then: > > > > > > > > > > > This "worked" previously, because before this patch, passt would > > > > > > translate the inbound connection to have source/dest as link-local > > > > > > addresses. > > > > > > > > > > ...I realised that this worked and forgot about the whole issue. > > > > > > > > > > > We *do* have a current addr_ll_seen because (a) it won't > > > > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled > > > > > > the NDP traffic the guest generates will have link-local addresses > > > > > > that update addr_ll_seen. With this patch, and a global address for > > > > > > --map-host-loopback, we now need to send to addr_seen instead of > > > > > > addr_ll_seen, hence exposing the bug. > > > > > > > > > > > > In the short term, the obvious fix would be to re-run dhclient -6 in > > > > > > the guest after we twiddle MTU but before running IPv6 tests. > > > > > > > > > > I guess setting keep_addr_on_down (even for "all" interfaces) should > > > > > work as well. > > > > > > > > Sounds like it. I wasn't aware of that one. > > > > > > > > /me tests.. actually, no it doesn't work.. > > > > > > > > # sysctl -a | grep keep_addr_on_down > > > > net.ipv6.conf.all.keep_addr_on_down = 1 > > > > net.ipv6.conf.default.keep_addr_on_down = 1 > > > > net.ipv6.conf.dummy0.keep_addr_on_down = 1 > > > > net.ipv6.conf.lo.keep_addr_on_down = 0 > > > > # ip addr add 2001:db8::1 dev dummy0 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > inet6 2001:db8::1/128 scope global > > > > valid_lft forever preferred_lft forever > > > > # ip link set dummy0 mtu 1200 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > # ip link set dummy0 mtu 1500 > > > > # ip a > > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 > > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > > > > > My guess is that IPv6 being deconfigured because of an unsuitable MTU > > > > is considered a different event from a mere "down". > > > > > > I guess it's because they're not IFA_F_PERMANENT, because > > > addrconf_permanent_addr() has: > > > > > > case NETDEV_CHANGEMTU: > > > /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */ > > > if (dev->mtu < IPV6_MIN_MTU) { > > > addrconf_ifdown(dev, dev != net->loopback_dev); > > > break; > > > } > > > > > > but addrconf_ifdown() does: > > > > > > if (!keep_addr || > > > !(ifa->flags & IFA_F_PERMANENT) || > > > addr_is_local(&ifa->addr)) { > > > hlist_del_init_rcu(&ifa->addr_lst); > > > goto restart; > > > } > > > > > > I'm not sure about the logic behind that. We could actually set those > > > addresses as permanent once the DHCPv6 client configures them, if it's > > > cleaner. > > > > Huh. Not in the passt/VM case, though, which is where I actually > > encountered this. > > I meant using ip(8) from the test script itself, but it doesn't > actually make sense: > > # ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0 > Warning: permanent option is not mutable from userspace > > because (RFC 3549): > > IFA_F_PERMANENT For a permanent address set by the user. > When this is not set, it means the address > was dynamically created (e.g., by stateless > autoconfiguration). > > So the address you used in your test _should_ have IFA_F_PERMANENT. The > plot thickens. > > I just tried this, which confirms your hypothesis that bringing the > link down is a different event: > > # ip addr add 2001:db8::1 dev dummy0 > # ip link set dummy0 down > # ip addr show dev dummy0 > 5: dummy0: mtu 1280 qdisc noqueue state DOWN group default qlen 1000 > link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff > inet 1.2.3.1/24 scope global dummy0 > valid_lft forever preferred_lft forever > inet6 2001:db8::1/128 scope global > valid_lft forever preferred_lft forever > # ip link set dummy0 mtu 1279 > # ip addr show dev dummy0 > 5: dummy0: mtu 1279 qdisc noqueue state DOWN group default qlen 1000 > link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff > inet 1.2.3.1/24 scope global dummy0 > valid_lft forever preferred_lft forever > > ...I just can't see that from the code. Ok. > > > > > > This kind of opens a question about how hard we should try to > > > > > > accomodate guests which don't configure themselves how we told them. > > > > > > > > > > There's a notable distinction between guests temporarily diverging (in > > > > > different ways) and guests we don't configure at all. > > > > > > > > I'm not really sure what you're getting at here. > > > > > > In this case, it's not true that the guest doesn't configure itself in > > > the way we requested -- it's just a temporary diversion from that > > > configuration. > > > > Oh, I see. Assuming that at some point the DHCP client will re-run. > > > > > Those are different cases that we can handle in different ways, I > > > think. If it's a glitch that will only happen during testing, let's > > > work around that. > > > > > > But if the guest really ignores DHCPv6 information, I think we should > > > keep that working. > > > > > > > > It's probably more important to ensure we use the right type of address > > > > > > > > "type" in what sense here? > > > > > > Global unicast instead of link-local. > > > > Ok. > > > > > > > (security) rather than ensuring we somehow manage to deliver packets at > > > > > any time (minor glitch otherwise), also because the one you describe is > > > > > something we're unlikely to hit outside of tests. > > > > > > > > > > > Personally I'd be ok with saying that nothing works if the guest > > > > > > doesn't configure itself properly, thereby removing addr_seen and > > > > > > addr_ll_seen entirely. But I think, Stefano, you've been against that > > > > > > idea in the past. > > > > > > > > > > Yes, I still think we should support guests that don't use DHCPv6 or > > > > > NDP at all, > > > > > > > > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to > > > > manually configure the interface in the guest to match the address > > > > you've configured with -a. Just like you'd expect to have to > > > > correctly configure your address on a real network. > > > > > > True, but if we make correctness as optional as possible, we'll be more > > > compatible (less time spent by users fixing situations that don't > > > necessarily need fixing, less time spent by developers to look into > > > reports, no matter who's at fault). > > > > Eh, maybe. Unless us trying to make sense of a nonsense situation > > causes some unpredictable behaviour that breaks something else. > > > > > > > or where related exchanges fail for any reason. It improves > > > > > reliability and compatibility at a small cost. In this case, I think > > > > > it's a nice feature that we would resume communicating as soon as the > > > > > guest shows its global unicast address. > > > > > > > > Hm, maybe. I'm not entirely convinced the cost is so small long term. > > > > It's pretty badly incompatible with having multiple guests behind the > > > > same passt instance: such as the initial guest bridging or routing to > > > > nested guests. > > > > > > Why? We will need to hash the interface/guest index anyway, for > > > outbound flows. > > > > If we have separate interfaces for each guest, yes. But not if we > > have multiple guests behind a single tap because the initial guest > > sets up a bridge or routing. Then we have nothing but the address. > > ...but then we should have multiple addresses anyway. Yes.. that's kind of my point. > By the way, I'm > not sure we'll ever be able to support that kind of configuration. I don't see why not. It would require configuration so that it's clear what each inbound forward targets. But I don't see any inherent problem here, though there are a number of current implementation details which prevent it (addr_seen is one, replying to all arps is another). > How > does a guest set up a bridge and use passt at the same time? I'm not thinking of a bridge shared with the host, but a bridge (or routing) between nested guests or namespaces. This is essentially the "private switch with pasta uplink" case we've discussed occasionally before. It doesn't technically have to be nested guests - the guest could bridge between its uplink and a tunnel, but nested guests is the likely use case. > > > And for inbound flows, if a guest steals the address of another guest, > > > we'll give priority to the normal 'addr' versions instead of the > > > '_seen' ones, to decide how to direct traffic. > > > > I don't see how we'd know we're in this situation, so when to > > prioritise which address over the other. > > In the set of all addr_seen and addr, we would have at least a > non-unique value. Or, practically speaking, we should refuse to set > addr_seen if it matches addr for another guest. Ah, ok. So again, assuming a static configuration of known guests, rather than a local bridge established by a guest at runtime. > > > > I'm actually not sure if encountering this bug makes me more or less > > > > in favour of addr_seen. On the one hand I think it highlights the > > > > flakiness of this approach; there are situations where we just won't > > > > know the right address. > > > > > > I don't understand this argument: indeed, there are such situations, > > > and they are annoying. Why should we make them more common? > > > > Because predictability is good, and working _most_ of the time is a > > failure of predictability. > > It avoids substantial effort and frustration for everybody involved > though. The practical problem with lacking predictability is if it > makes things harder to debug, I guess, which shouldn't be the case here. > > > > > On the other hand if shows a relatively > > > > plausible case where the guest won't get exactly the address we want > > > > it to (it uses NDP but not DHCPv6) > > > > > > > > Hrm... actually this also shows a potential danger in the recent > > > > patches to disable DAD in the guest. With DAD enabled, when the guest > > > > grabs a new address, we'd expect it to emit DAD messages, which would > > > > have the side effect of updating our addr_seen (although I'm pretty > > > > sure I hit this patch before the nodad patches were applied, so that > > > > doesn't seem to be foolproof). > > > > > > Well, but we do that for containers with --config-net only. In that > > > case, the addresses we configure have infinite lifetime anyway. > > > > Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing > > its new address. > > It probably did, but we ignored that anyway because DAD is done by > sending neighbour solicitations with an unspecified address as source, > for example (the "change" here drops "nodad"): > > $ ./pasta --config-net -p dad.pcap > Saving packet capture to dad.pcap > # ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64 > # tshark -r dad.pcap |grep Neigh > Running as user "root" and group "root". This could be dangerous. > 10 2.642467 :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1 > > and in tap6_handler() we do: > > } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){ > c->ip6.addr_seen = *saddr; > } > > ...then, in ndp(): > > if (IN6_IS_ADDR_UNSPECIFIED(saddr)) > return 1; > > we could set addr_seen by looking at the *target* address of the > neighbour solicitation when the source address is ::, but it's not > implemented yet. Right. I forgot the NS went out with :: as source. Snooping the NS that way again assumes that there's only one logical machine on the guest side. But since this is for addr_seen which fundamentally assumes that anyway, I guess it doesn't make anything worse. > > > Besides, I don't think we need to have addr_seen updated as quickly and > > > correctly as possible just for the sake of it, we can also update it > > > when we get any other neighbour solicitation because the guest is > > > actually using the network. It's not meant to be perfect. > > > > If the guest is a pure server (a common case for containers AFAICT), > > then I don't know that we can expect NS messages for anything other > > than the default gateway, which is (typically) link-local and so won't > > help us to learn the new global address. > > Containers running actual applications are noisy. I've only seen this > kind of problem (addr_seen not set/matching) in particularly crafted > test environments. > > > > > We could maybe update addr_seen when we send RA messages to the guest > > > > - assuming that it will use the same host part (low 64-bits) for both > > > > link-local and global addresses. Not sure if that's a widely safe > > > > assumption or not. > > > > > > I don't understand: what case are you trying to cover with this? > > > > A case just like the one in the tests: the interface bounces, and we > > get NDP traffic on the link-local address, but nothing on the global > > address before an inbound connection. > > Oh, I see. I think it makes sense, even though we'll set addr_seen a > bit too early, but not enough to be a practical issue, I think. Yes, but I think snopping the NS from DAD is probably a better idea. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson