On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
> On Mon, 19 Aug 2024 19:52:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
> > > On Mon, 19 Aug 2024 18:46:31 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:  
> > > > > Based on Stefano's recent patch for faster tests.
> > > > > 
> > > > > Allow the user to specify which addresses are translated when used by
> > > > > the guest, rather than always being the gateway address or nothing.
> > > > > We also allow this remapping to go to the host's global address (more
> > > > > precisely the address assigned to the guest) rather than just host
> > > > > loopback.
> > > > > 
> > > > > Suggestions for better names for the new options in patches 20 & 22
> > > > > are most welcome.
> > > > > 
> > > > > Along the way to implementing that make many changes to clarify what
> > > > > various addresses we track mean, fixing a number of small bugs as
> > > > > well.
> > > > > 
> > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
> > > > > tests.  I haven't managed to figure out why it's causing the problem,
> > > > > or even what the exact triggering conditions are (running the single
> > > > > stalling iperf alone doesn't do it).  Have to wrap up for today, so I
> > > > > thought I'd get this out for review anyway.    
> > > > 
> > > > I've identified the bug here.  IMO, it's a pre-existing problem that
> > > > only works by accident at the moment.  The immediate fix is pretty
> > > > obvious, but it raises some broader questions
> > > > 
> > > > The problem arises because of the MTU changes we make in order to test
> > > > throughput with different packet sizes.  Specifically we change the
> > > > MTU to values < 1280, which implicitly disables IPv6 since it requires
> > > > an MTU >= 1280.  When we change the MTU back to a larger value IPv6 is
> > > > re-enabled, but some configuration has been lost in the meantime.
> > > > 
> > > > After the MTU is restored the guest reconfigures with NDP, but does
> > > > not re-DHCPv6.  That means the guest gets a SLAAC address in the right
> > > > prefix but not the exact /128 address we've tried to assign to it.
> > > > However, at least with the sequence of things we have in the tests,
> > > > the guest never sends any packets with the new address, so passt
> > > > doesn't update addr_seen.  When the inbound connection comes we send
> > > > it to the assigned address instead of the guest's actual address and
> > > > the guest rejects it.  
> > > 
> > > I still have to take a closer look, but I'm fairly sure I hit a similar
> > > issue while I was writing these tests originally. I pondered
> > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down
> > > sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added
> > > around that time.
> > > 
> > > Then:
> > >   
> > > > This "worked" previously, because before this patch, passt would
> > > > translate the inbound connection to have source/dest as link-local
> > > > addresses.  
> > > 
> > > ...I realised that this worked and forgot about the whole issue.
> > >   
> > > > We *do* have a current addr_ll_seen because (a) it won't
> > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled
> > > > the NDP traffic the guest generates will have link-local addresses
> > > > that update addr_ll_seen.  With this patch, and a global address for
> > > > --map-host-loopback, we now need to send to addr_seen instead of
> > > > addr_ll_seen, hence exposing the bug.
> > > > 
> > > > In the short term, the obvious fix would be to re-run dhclient -6 in
> > > > the guest after we twiddle MTU but before running IPv6 tests.  
> > > 
> > > I guess setting keep_addr_on_down (even for "all" interfaces) should
> > > work as well.  
> > 
> > Sounds like it.  I wasn't aware of that one.
> > 
> > /me tests..  actually, no it doesn't work..
> > 
> > # sysctl -a | grep keep_addr_on_down
> > net.ipv6.conf.all.keep_addr_on_down = 1
> > net.ipv6.conf.default.keep_addr_on_down = 1
> > net.ipv6.conf.dummy0.keep_addr_on_down = 1
> > net.ipv6.conf.lo.keep_addr_on_down = 0
> > # ip addr add 2001:db8::1 dev dummy0
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> >     inet6 2001:db8::1/128 scope global 
> >        valid_lft forever preferred_lft forever
> > # ip link set dummy0 mtu 1200
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1200 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > # ip link set dummy0 mtu 1500
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > 
> > My guess is that IPv6 being deconfigured because of an unsuitable MTU
> > is considered a different event from a mere "down".
> 
> I guess it's because they're not IFA_F_PERMANENT, because
> addrconf_permanent_addr() has:
> 
>         case NETDEV_CHANGEMTU:
>                 /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
>                 if (dev->mtu < IPV6_MIN_MTU) {
>                         addrconf_ifdown(dev, dev != net->loopback_dev);
>                         break;
>                 }
> 
> but addrconf_ifdown() does:
> 
>                                 if (!keep_addr ||
>                                     !(ifa->flags & IFA_F_PERMANENT) ||
>                                     addr_is_local(&ifa->addr)) {
>                                         hlist_del_init_rcu(&ifa->addr_lst);
>                                         goto restart;
>                                 }
> 
> I'm not sure about the logic behind that. We could actually set those
> addresses as permanent once the DHCPv6 client configures them, if it's
> cleaner.

Huh.  Not in the passt/VM case, though, which is where I actually
encountered this.

> > > > This kind of opens a question about how hard we should try to
> > > > accomodate guests which don't configure themselves how we told them.  
> > > 
> > > There's a notable distinction between guests temporarily diverging (in
> > > different ways) and guests we don't configure at all.  
> > 
> > I'm not really sure what you're getting at here.
> 
> In this case, it's not true that the guest doesn't configure itself in
> the way we requested -- it's just a temporary diversion from that
> configuration.

Oh, I see.  Assuming that at some point the DHCP client will re-run.

> Those are different cases that we can handle in different ways, I
> think. If it's a glitch that will only happen during testing, let's
> work around that.
> 
> But if the guest really ignores DHCPv6 information, I think we should
> keep that working.
> 
> > > It's probably more important to ensure we use the right type of address  
> > 
> > "type" in what sense here?
> 
> Global unicast instead of link-local.

Ok.

> > > (security) rather than ensuring we somehow manage to deliver packets at
> > > any time (minor glitch otherwise), also because the one you describe is
> > > something we're unlikely to hit outside of tests.
> > >   
> > > > Personally I'd be ok with saying that nothing works if the guest
> > > > doesn't configure itself properly, thereby removing addr_seen and
> > > > addr_ll_seen entirely.  But I think, Stefano, you've been against that
> > > > idea in the past.  
> > > 
> > > Yes, I still think we should support guests that don't use DHCPv6 or
> > > NDP at all,  
> > 
> > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to
> > manually configure the interface in the guest to match the address
> > you've configured with -a.  Just like you'd expect to have to
> > correctly configure your address on a real network.
> 
> True, but if we make correctness as optional as possible, we'll be more
> compatible (less time spent by users fixing situations that don't
> necessarily need fixing, less time spent by developers to look into
> reports, no matter who's at fault).

Eh, maybe.  Unless us trying to make sense of a nonsense situation
causes some unpredictable behaviour that breaks something else.

> > > or where related exchanges fail for any reason. It improves
> > > reliability and compatibility at a small cost. In this case, I think
> > > it's a nice feature that we would resume communicating as soon as the
> > > guest shows its global unicast address.  
> > 
> > Hm, maybe.  I'm not entirely convinced the cost is so small long term.
> > It's pretty badly incompatible with having multiple guests behind the
> > same passt instance: such as the initial guest bridging or routing to
> > nested guests.
> 
> Why? We will need to hash the interface/guest index anyway, for
> outbound flows.

If we have separate interfaces for each guest, yes.  But not if we
have multiple guests behind a single tap because the initial guest
sets up a bridge or routing.  Then we have nothing but the address.

> And for inbound flows, if a guest steals the address of another guest,
> we'll give priority to the normal 'addr' versions instead of the
> '_seen' ones, to decide how to direct traffic.

I don't see how we'd know we're in this situation, so when to
prioritise which address over the other.

> > I'm actually not sure if encountering this bug makes me more or less
> > in favour of addr_seen.  On the one hand I think it highlights the
> > flakiness of this approach; there are situations where we just won't
> > know the right address.
> 
> I don't understand this argument: indeed, there are such situations,
> and they are annoying. Why should we make them more common?

Because predictability is good, and working _most_ of the time is a
failure of predictability.

> > On the other hand if shows a relatively
> > plausible case where the guest won't get exactly the address we want
> > it to (it uses NDP but not DHCPv6)
> > 
> > Hrm... actually this also shows a potential danger in the recent
> > patches to disable DAD in the guest.  With DAD enabled, when the guest
> > grabs a new address, we'd expect it to emit DAD messages, which would
> > have the side effect of updating our addr_seen (although I'm pretty
> > sure I hit this patch before the nodad patches were applied, so that
> > doesn't seem to be foolproof).
> 
> Well, but we do that for containers with --config-net only. In that
> case, the addresses we configure have infinite lifetime anyway.

Oh, good point.  Hrm... then I'm unsure why the guest wasn't re-DADing
its new address.

> Besides, I don't think we need to have addr_seen updated as quickly and
> correctly as possible just for the sake of it, we can also update it
> when we get any other neighbour solicitation because the guest is
> actually using the network. It's not meant to be perfect.

If the guest is a pure server (a common case for containers AFAICT),
then I don't know that we can expect NS messages for anything other
than the default gateway, which is (typically) link-local and so won't
help us to learn the new global address.

> > We could maybe update addr_seen when we send RA messages to the guest
> > - assuming that it will use the same host part (low 64-bits) for both
> > link-local and global addresses.  Not sure if that's a widely safe
> > assumption or not.
> 
> I don't understand: what case are you trying to cover with this?

A case just like the one in the tests: the interface bounces, and we
get NDP traffic on the link-local address, but nothing on the global
address before an inbound connection.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson