On Tue, Aug 20, 2024 at 10:39:26PM +0200, Stefano Brivio wrote:
> On Tue, 20 Aug 2024 10:42:17 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
> > > On Mon, 19 Aug 2024 19:52:49 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:  
> > > > > On Mon, 19 Aug 2024 18:46:31 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:    
> > > > > > > Based on Stefano's recent patch for faster tests.
> > > > > > > 
> > > > > > > Allow the user to specify which addresses are translated when used by
> > > > > > > the guest, rather than always being the gateway address or nothing.
> > > > > > > We also allow this remapping to go to the host's global address (more
> > > > > > > precisely the address assigned to the guest) rather than just host
> > > > > > > loopback.
> > > > > > > 
> > > > > > > Suggestions for better names for the new options in patches 20 & 22
> > > > > > > are most welcome.
> > > > > > > 
> > > > > > > Along the way to implementing that make many changes to clarify what
> > > > > > > various addresses we track mean, fixing a number of small bugs as
> > > > > > > well.
> > > > > > > 
> > > > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
> > > > > > > tests.  I haven't managed to figure out why it's causing the problem,
> > > > > > > or even what the exact triggering conditions are (running the single
> > > > > > > stalling iperf alone doesn't do it).  Have to wrap up for today, so I
> > > > > > > thought I'd get this out for review anyway.      
> > > > > > 
> > > > > > I've identified the bug here.  IMO, it's a pre-existing problem that
> > > > > > only works by accident at the moment.  The immediate fix is pretty
> > > > > > obvious, but it raises some broader questions
> > > > > > 
> > > > > > The problem arises because of the MTU changes we make in order to test
> > > > > > throughput with different packet sizes.  Specifically we change the
> > > > > > MTU to values < 1280, which implicitly disables IPv6 since it requires
> > > > > > an MTU >= 1280.  When we change the MTU back to a larger value IPv6 is
> > > > > > re-enabled, but some configuration has been lost in the meantime.
> > > > > > 
> > > > > > After the MTU is restored the guest reconfigures with NDP, but does
> > > > > > not re-DHCPv6.  That means the guest gets a SLAAC address in the right
> > > > > > prefix but not the exact /128 address we've tried to assign to it.
> > > > > > However, at least with the sequence of things we have in the tests,
> > > > > > the guest never sends any packets with the new address, so passt
> > > > > > doesn't update addr_seen.  When the inbound connection comes we send
> > > > > > it to the assigned address instead of the guest's actual address and
> > > > > > the guest rejects it.    
> > > > > 
> > > > > I still have to take a closer look, but I'm fairly sure I hit a similar
> > > > > issue while I was writing these tests originally. I pondered
> > > > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down
> > > > > sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added
> > > > > around that time.
> > > > > 
> > > > > Then:
> > > > >     
> > > > > > This "worked" previously, because before this patch, passt would
> > > > > > translate the inbound connection to have source/dest as link-local
> > > > > > addresses.    
> > > > > 
> > > > > ...I realised that this worked and forgot about the whole issue.
> > > > >     
> > > > > > We *do* have a current addr_ll_seen because (a) it won't
> > > > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled
> > > > > > the NDP traffic the guest generates will have link-local addresses
> > > > > > that update addr_ll_seen.  With this patch, and a global address for
> > > > > > --map-host-loopback, we now need to send to addr_seen instead of
> > > > > > addr_ll_seen, hence exposing the bug.
> > > > > > 
> > > > > > In the short term, the obvious fix would be to re-run dhclient -6 in
> > > > > > the guest after we twiddle MTU but before running IPv6 tests.    
> > > > > 
> > > > > I guess setting keep_addr_on_down (even for "all" interfaces) should
> > > > > work as well.    
> > > > 
> > > > Sounds like it.  I wasn't aware of that one.
> > > > 
> > > > /me tests..  actually, no it doesn't work..
> > > > 
> > > > # sysctl -a | grep keep_addr_on_down
> > > > net.ipv6.conf.all.keep_addr_on_down = 1
> > > > net.ipv6.conf.default.keep_addr_on_down = 1
> > > > net.ipv6.conf.dummy0.keep_addr_on_down = 1
> > > > net.ipv6.conf.lo.keep_addr_on_down = 0
> > > > # ip addr add 2001:db8::1 dev dummy0
> > > > # ip a
> > > > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > > >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > > > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> > > >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > > >     inet6 2001:db8::1/128 scope global 
> > > >        valid_lft forever preferred_lft forever
> > > > # ip link set dummy0 mtu 1200
> > > > # ip a
> > > > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > > >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > > > 2: dummy0: <BROADCAST,NOARP> mtu 1200 qdisc noop state DOWN group default qlen 1000
> > > >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > > > # ip link set dummy0 mtu 1500
> > > > # ip a
> > > > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > > >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > > > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> > > >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > > > 
> > > > My guess is that IPv6 being deconfigured because of an unsuitable MTU
> > > > is considered a different event from a mere "down".  
> > > 
> > > I guess it's because they're not IFA_F_PERMANENT, because
> > > addrconf_permanent_addr() has:
> > > 
> > >         case NETDEV_CHANGEMTU:
> > >                 /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
> > >                 if (dev->mtu < IPV6_MIN_MTU) {
> > >                         addrconf_ifdown(dev, dev != net->loopback_dev);
> > >                         break;
> > >                 }
> > > 
> > > but addrconf_ifdown() does:
> > > 
> > >                                 if (!keep_addr ||
> > >                                     !(ifa->flags & IFA_F_PERMANENT) ||
> > >                                     addr_is_local(&ifa->addr)) {
> > >                                         hlist_del_init_rcu(&ifa->addr_lst);
> > >                                         goto restart;
> > >                                 }
> > > 
> > > I'm not sure about the logic behind that. We could actually set those
> > > addresses as permanent once the DHCPv6 client configures them, if it's
> > > cleaner.  
> > 
> > Huh.  Not in the passt/VM case, though, which is where I actually
> > encountered this.
> 
> I meant using ip(8) from the test script itself, but it doesn't
> actually make sense:
> 
> # ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0
> Warning: permanent option is not mutable from userspace
> 
> because (RFC 3549):
> 
>    IFA_F_PERMANENT  For a permanent address set by the user.
>                     When this is not set, it means the address
>                     was dynamically created (e.g., by stateless
>                     autoconfiguration).
> 
> So the address you used in your test _should_ have IFA_F_PERMANENT. The
> plot thickens.
> 
> I just tried this, which confirms your hypothesis that bringing the
> link down is a different event:
> 
> # ip addr add 2001:db8::1 dev dummy0
> # ip link set dummy0 down
> # ip addr show dev dummy0
> 5: dummy0: <BROADCAST,NOARP> mtu 1280 qdisc noqueue state DOWN group default qlen 1000
>     link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
>     inet 1.2.3.1/24 scope global dummy0
>        valid_lft forever preferred_lft forever
>     inet6 2001:db8::1/128 scope global 
>        valid_lft forever preferred_lft forever
> # ip link set dummy0 mtu 1279
> # ip addr show dev dummy0
> 5: dummy0: <BROADCAST,NOARP> mtu 1279 qdisc noqueue state DOWN group default qlen 1000
>     link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
>     inet 1.2.3.1/24 scope global dummy0
>        valid_lft forever preferred_lft forever
> 
> ...I just can't see that from the code.

Ok.

> > > > > > This kind of opens a question about how hard we should try to
> > > > > > accomodate guests which don't configure themselves how we told them.    
> > > > > 
> > > > > There's a notable distinction between guests temporarily diverging (in
> > > > > different ways) and guests we don't configure at all.    
> > > > 
> > > > I'm not really sure what you're getting at here.  
> > > 
> > > In this case, it's not true that the guest doesn't configure itself in
> > > the way we requested -- it's just a temporary diversion from that
> > > configuration.  
> > 
> > Oh, I see.  Assuming that at some point the DHCP client will re-run.
> > 
> > > Those are different cases that we can handle in different ways, I
> > > think. If it's a glitch that will only happen during testing, let's
> > > work around that.
> > > 
> > > But if the guest really ignores DHCPv6 information, I think we should
> > > keep that working.
> > >   
> > > > > It's probably more important to ensure we use the right type of address    
> > > > 
> > > > "type" in what sense here?  
> > > 
> > > Global unicast instead of link-local.  
> > 
> > Ok.
> > 
> > > > > (security) rather than ensuring we somehow manage to deliver packets at
> > > > > any time (minor glitch otherwise), also because the one you describe is
> > > > > something we're unlikely to hit outside of tests.
> > > > >     
> > > > > > Personally I'd be ok with saying that nothing works if the guest
> > > > > > doesn't configure itself properly, thereby removing addr_seen and
> > > > > > addr_ll_seen entirely.  But I think, Stefano, you've been against that
> > > > > > idea in the past.    
> > > > > 
> > > > > Yes, I still think we should support guests that don't use DHCPv6 or
> > > > > NDP at all,    
> > > > 
> > > > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to
> > > > manually configure the interface in the guest to match the address
> > > > you've configured with -a.  Just like you'd expect to have to
> > > > correctly configure your address on a real network.  
> > > 
> > > True, but if we make correctness as optional as possible, we'll be more
> > > compatible (less time spent by users fixing situations that don't
> > > necessarily need fixing, less time spent by developers to look into
> > > reports, no matter who's at fault).  
> > 
> > Eh, maybe.  Unless us trying to make sense of a nonsense situation
> > causes some unpredictable behaviour that breaks something else.
> > 
> > > > > or where related exchanges fail for any reason. It improves
> > > > > reliability and compatibility at a small cost. In this case, I think
> > > > > it's a nice feature that we would resume communicating as soon as the
> > > > > guest shows its global unicast address.    
> > > > 
> > > > Hm, maybe.  I'm not entirely convinced the cost is so small long term.
> > > > It's pretty badly incompatible with having multiple guests behind the
> > > > same passt instance: such as the initial guest bridging or routing to
> > > > nested guests.  
> > > 
> > > Why? We will need to hash the interface/guest index anyway, for
> > > outbound flows.  
> > 
> > If we have separate interfaces for each guest, yes.  But not if we
> > have multiple guests behind a single tap because the initial guest
> > sets up a bridge or routing.  Then we have nothing but the address.
> 
> ...but then we should have multiple addresses anyway.

Yes.. that's kind of my point.

> By the way, I'm
> not sure we'll ever be able to support that kind of configuration.

I don't see why not.  It would require configuration so that it's
clear what each inbound forward targets.  But I don't see any inherent
problem here, though there are a number of current implementation
details which prevent it (addr_seen is one, replying to all arps is
another).

> How
> does a guest set up a bridge and use passt at the same time?

I'm not thinking of a bridge shared with the host, but a bridge (or
routing) between nested guests or namespaces.  This is essentially the
"private switch with pasta uplink" case we've discussed occasionally
before.  It doesn't technically have to be nested guests - the guest
could bridge between its uplink and a tunnel, but nested guests is the
likely use case.

> > > And for inbound flows, if a guest steals the address of another guest,
> > > we'll give priority to the normal 'addr' versions instead of the
> > > '_seen' ones, to decide how to direct traffic.  
> > 
> > I don't see how we'd know we're in this situation, so when to
> > prioritise which address over the other.
> 
> In the set of all addr_seen and addr, we would have at least a
> non-unique value. Or, practically speaking, we should refuse to set
> addr_seen if it matches addr for another guest.

Ah, ok.  So again, assuming a static configuration of known guests,
rather than a local bridge established by a guest at runtime.

> > > > I'm actually not sure if encountering this bug makes me more or less
> > > > in favour of addr_seen.  On the one hand I think it highlights the
> > > > flakiness of this approach; there are situations where we just won't
> > > > know the right address.  
> > > 
> > > I don't understand this argument: indeed, there are such situations,
> > > and they are annoying. Why should we make them more common?  
> > 
> > Because predictability is good, and working _most_ of the time is a
> > failure of predictability.
> 
> It avoids substantial effort and frustration for everybody involved
> though. The practical problem with lacking predictability is if it
> makes things harder to debug, I guess, which shouldn't be the case here.
> 
> > > > On the other hand if shows a relatively
> > > > plausible case where the guest won't get exactly the address we want
> > > > it to (it uses NDP but not DHCPv6)
> > > > 
> > > > Hrm... actually this also shows a potential danger in the recent
> > > > patches to disable DAD in the guest.  With DAD enabled, when the guest
> > > > grabs a new address, we'd expect it to emit DAD messages, which would
> > > > have the side effect of updating our addr_seen (although I'm pretty
> > > > sure I hit this patch before the nodad patches were applied, so that
> > > > doesn't seem to be foolproof).  
> > > 
> > > Well, but we do that for containers with --config-net only. In that
> > > case, the addresses we configure have infinite lifetime anyway.  
> > 
> > Oh, good point.  Hrm... then I'm unsure why the guest wasn't re-DADing
> > its new address.
> 
> It probably did, but we ignored that anyway because DAD is done by
> sending neighbour solicitations with an unspecified address as source,
> for example (the "change" here drops "nodad"):
> 
> $ ./pasta --config-net -p dad.pcap
> Saving packet capture to dad.pcap
> # ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64
> # tshark -r dad.pcap |grep Neigh
> Running as user "root" and group "root". This could be dangerous.
>    10   2.642467           :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1
> 
> and in tap6_handler() we do:
> 
>                 } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
>                         c->ip6.addr_seen = *saddr;
>                 }
> 
> ...then, in ndp():
> 
>                 if (IN6_IS_ADDR_UNSPECIFIED(saddr))
>                         return 1;
> 
> we could set addr_seen by looking at the *target* address of the
> neighbour solicitation when the source address is ::, but it's not
> implemented yet.

Right.  I forgot the NS went out with :: as source.  Snooping the NS
that way again assumes that there's only one logical machine on the
guest side.  But since this is for addr_seen which fundamentally
assumes that anyway, I guess it doesn't make anything worse.

> > > Besides, I don't think we need to have addr_seen updated as quickly and
> > > correctly as possible just for the sake of it, we can also update it
> > > when we get any other neighbour solicitation because the guest is
> > > actually using the network. It's not meant to be perfect.  
> > 
> > If the guest is a pure server (a common case for containers AFAICT),
> > then I don't know that we can expect NS messages for anything other
> > than the default gateway, which is (typically) link-local and so won't
> > help us to learn the new global address.
> 
> Containers running actual applications are noisy. I've only seen this
> kind of problem (addr_seen not set/matching) in particularly crafted
> test environments.
> 
> > > > We could maybe update addr_seen when we send RA messages to the guest
> > > > - assuming that it will use the same host part (low 64-bits) for both
> > > > link-local and global addresses.  Not sure if that's a widely safe
> > > > assumption or not.  
> > > 
> > > I don't understand: what case are you trying to cover with this?  
> > 
> > A case just like the one in the tests: the interface bounces, and we
> > get NDP traffic on the link-local address, but nothing on the global
> > address before an inbound connection.
> 
> Oh, I see. I think it makes sense, even though we'll set addr_seen a
> bit too early, but not enough to be a practical issue, I think.

Yes, but I think snopping the NS from DAD is probably a better idea.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson