On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote: > On Mon, 19 Aug 2024 19:52:49 +1000 > David Gibson wrote: > > > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote: > > > On Mon, 19 Aug 2024 18:46:31 +1000 > > > David Gibson wrote: > > > > > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote: > > > > > Based on Stefano's recent patch for faster tests. > > > > > > > > > > Allow the user to specify which addresses are translated when used by > > > > > the guest, rather than always being the gateway address or nothing. > > > > > We also allow this remapping to go to the host's global address (more > > > > > precisely the address assigned to the guest) rather than just host > > > > > loopback. > > > > > > > > > > Suggestions for better names for the new options in patches 20 & 22 > > > > > are most welcome. > > > > > > > > > > Along the way to implementing that make many changes to clarify what > > > > > various addresses we track mean, fixing a number of small bugs as > > > > > well. > > > > > > > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf > > > > > tests. I haven't managed to figure out why it's causing the problem, > > > > > or even what the exact triggering conditions are (running the single > > > > > stalling iperf alone doesn't do it). Have to wrap up for today, so I > > > > > thought I'd get this out for review anyway. > > > > > > > > I've identified the bug here. IMO, it's a pre-existing problem that > > > > only works by accident at the moment. The immediate fix is pretty > > > > obvious, but it raises some broader questions > > > > > > > > The problem arises because of the MTU changes we make in order to test > > > > throughput with different packet sizes. Specifically we change the > > > > MTU to values < 1280, which implicitly disables IPv6 since it requires > > > > an MTU >= 1280. When we change the MTU back to a larger value IPv6 is > > > > re-enabled, but some configuration has been lost in the meantime. > > > > > > > > After the MTU is restored the guest reconfigures with NDP, but does > > > > not re-DHCPv6. That means the guest gets a SLAAC address in the right > > > > prefix but not the exact /128 address we've tried to assign to it. > > > > However, at least with the sequence of things we have in the tests, > > > > the guest never sends any packets with the new address, so passt > > > > doesn't update addr_seen. When the inbound connection comes we send > > > > it to the assigned address instead of the guest's actual address and > > > > the guest rejects it. > > > > > > I still have to take a closer look, but I'm fairly sure I hit a similar > > > issue while I was writing these tests originally. I pondered > > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down > > > sysctl (net.ipv6.conf..keep_addr_on_down), which was added > > > around that time. > > > > > > Then: > > > > > > > This "worked" previously, because before this patch, passt would > > > > translate the inbound connection to have source/dest as link-local > > > > addresses. > > > > > > ...I realised that this worked and forgot about the whole issue. > > > > > > > We *do* have a current addr_ll_seen because (a) it won't > > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled > > > > the NDP traffic the guest generates will have link-local addresses > > > > that update addr_ll_seen. With this patch, and a global address for > > > > --map-host-loopback, we now need to send to addr_seen instead of > > > > addr_ll_seen, hence exposing the bug. > > > > > > > > In the short term, the obvious fix would be to re-run dhclient -6 in > > > > the guest after we twiddle MTU but before running IPv6 tests. > > > > > > I guess setting keep_addr_on_down (even for "all" interfaces) should > > > work as well. > > > > Sounds like it. I wasn't aware of that one. > > > > /me tests.. actually, no it doesn't work.. > > > > # sysctl -a | grep keep_addr_on_down > > net.ipv6.conf.all.keep_addr_on_down = 1 > > net.ipv6.conf.default.keep_addr_on_down = 1 > > net.ipv6.conf.dummy0.keep_addr_on_down = 1 > > net.ipv6.conf.lo.keep_addr_on_down = 0 > > # ip addr add 2001:db8::1 dev dummy0 > > # ip a > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > 2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > inet6 2001:db8::1/128 scope global > > valid_lft forever preferred_lft forever > > # ip link set dummy0 mtu 1200 > > # ip a > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > 2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000 > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > # ip link set dummy0 mtu 1500 > > # ip a > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > 2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000 > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > > My guess is that IPv6 being deconfigured because of an unsuitable MTU > > is considered a different event from a mere "down". > > I guess it's because they're not IFA_F_PERMANENT, because > addrconf_permanent_addr() has: > > case NETDEV_CHANGEMTU: > /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */ > if (dev->mtu < IPV6_MIN_MTU) { > addrconf_ifdown(dev, dev != net->loopback_dev); > break; > } > > but addrconf_ifdown() does: > > if (!keep_addr || > !(ifa->flags & IFA_F_PERMANENT) || > addr_is_local(&ifa->addr)) { > hlist_del_init_rcu(&ifa->addr_lst); > goto restart; > } > > I'm not sure about the logic behind that. We could actually set those > addresses as permanent once the DHCPv6 client configures them, if it's > cleaner. Huh. Not in the passt/VM case, though, which is where I actually encountered this. > > > > This kind of opens a question about how hard we should try to > > > > accomodate guests which don't configure themselves how we told them. > > > > > > There's a notable distinction between guests temporarily diverging (in > > > different ways) and guests we don't configure at all. > > > > I'm not really sure what you're getting at here. > > In this case, it's not true that the guest doesn't configure itself in > the way we requested -- it's just a temporary diversion from that > configuration. Oh, I see. Assuming that at some point the DHCP client will re-run. > Those are different cases that we can handle in different ways, I > think. If it's a glitch that will only happen during testing, let's > work around that. > > But if the guest really ignores DHCPv6 information, I think we should > keep that working. > > > > It's probably more important to ensure we use the right type of address > > > > "type" in what sense here? > > Global unicast instead of link-local. Ok. > > > (security) rather than ensuring we somehow manage to deliver packets at > > > any time (minor glitch otherwise), also because the one you describe is > > > something we're unlikely to hit outside of tests. > > > > > > > Personally I'd be ok with saying that nothing works if the guest > > > > doesn't configure itself properly, thereby removing addr_seen and > > > > addr_ll_seen entirely. But I think, Stefano, you've been against that > > > > idea in the past. > > > > > > Yes, I still think we should support guests that don't use DHCPv6 or > > > NDP at all, > > > > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to > > manually configure the interface in the guest to match the address > > you've configured with -a. Just like you'd expect to have to > > correctly configure your address on a real network. > > True, but if we make correctness as optional as possible, we'll be more > compatible (less time spent by users fixing situations that don't > necessarily need fixing, less time spent by developers to look into > reports, no matter who's at fault). Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else. > > > or where related exchanges fail for any reason. It improves > > > reliability and compatibility at a small cost. In this case, I think > > > it's a nice feature that we would resume communicating as soon as the > > > guest shows its global unicast address. > > > > Hm, maybe. I'm not entirely convinced the cost is so small long term. > > It's pretty badly incompatible with having multiple guests behind the > > same passt instance: such as the initial guest bridging or routing to > > nested guests. > > Why? We will need to hash the interface/guest index anyway, for > outbound flows. If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address. > And for inbound flows, if a guest steals the address of another guest, > we'll give priority to the normal 'addr' versions instead of the > '_seen' ones, to decide how to direct traffic. I don't see how we'd know we're in this situation, so when to prioritise which address over the other. > > I'm actually not sure if encountering this bug makes me more or less > > in favour of addr_seen. On the one hand I think it highlights the > > flakiness of this approach; there are situations where we just won't > > know the right address. > > I don't understand this argument: indeed, there are such situations, > and they are annoying. Why should we make them more common? Because predictability is good, and working _most_ of the time is a failure of predictability. > > On the other hand if shows a relatively > > plausible case where the guest won't get exactly the address we want > > it to (it uses NDP but not DHCPv6) > > > > Hrm... actually this also shows a potential danger in the recent > > patches to disable DAD in the guest. With DAD enabled, when the guest > > grabs a new address, we'd expect it to emit DAD messages, which would > > have the side effect of updating our addr_seen (although I'm pretty > > sure I hit this patch before the nodad patches were applied, so that > > doesn't seem to be foolproof). > > Well, but we do that for containers with --config-net only. In that > case, the addresses we configure have infinite lifetime anyway. Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address. > Besides, I don't think we need to have addr_seen updated as quickly and > correctly as possible just for the sake of it, we can also update it > when we get any other neighbour solicitation because the guest is > actually using the network. It's not meant to be perfect. If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address. > > We could maybe update addr_seen when we send RA messages to the guest > > - assuming that it will use the same host part (low 64-bits) for both > > link-local and global addresses. Not sure if that's a widely safe > > assumption or not. > > I don't understand: what case are you trying to cover with this? A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson