On Wed, Jul 20, 2022 at 10:23:22AM +0200, Stefano Brivio wrote:
> On Wed, 20 Jul 2022 17:24:18 +1000
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jul 20, 2022 at 12:47:57PM +1000, David Gibson wrote:
> > > On Tue, Jul 19, 2022 at 09:05:52PM +0200, Stefano Brivio wrote:  
> > > > On Tue, 19 Jul 2022 16:20:45 +1000
> > > > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > > >   
> > > > > On Fri, Jul 15, 2022 at 03:21:40PM +1000, David Gibson wrote:  
> > > > > > By default, passt itself attaches to the first host interface with a
> > > > > > default route.  However, when determining the host interface name the tests
> > > > > > implicitly select the *last* host interface: they use a jq expression which
> > > > > > will list all interfaces with default routes, but the way output detection
> > > > > > works in the scripts, it will only pick up the last line.
> > > > > > 
> > > > > > If there are multiple interfaces with default routes on the host, and they
> > > > > > each have a different address, this can cause spurious test
> > > > > > failures.    
> > > > > 
> > > > > It seems this change is not enough to always fix the tests when there
> > > > > are multiple default routes.  I'm still sometimes getting failures,
> > > > > now because passt itself doesn't seem to be picking the interface
> > > > > with the first default route.
> > > > > 
> > > > > I'm wondering if this is because ip(8) is sorting the output, not just
> > > > > presenting it in the same order that the underlying netlink interface
> > > > > does.  
> > > > 
> > > > I don't see that happening at least in my environment (and I also can't
> > > > see any code that would sort it, that pretty much comes from
> > > > rtnl_dump_filter_l() in lib/libnetlink.c).
> > > > 
> > > > The netlink "filter", though, is slightly different. For IPv4:
> > > > 
> > > > $ strace -e sendto ip route show >/dev/null
> > > > sendto(3, [{{len=36, type=RTM_GETROUTE, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1658257027, pid=0}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_UNSPEC, rtm_protocol=RTPROT_UNSPEC, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNSPEC, rtm_flags=0}, {{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN}}, {len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}], 156, 0, NULL, 0) = 156
> > > > 
> > > > $ strace -e sendto ./passt -f
> > > > sendto(5, {{len=28, type=RTM_GETROUTE, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=0, pid=0}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_UNSPEC, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}}, 28, 0, NULL, 0) = 28
> > > > [...]
> > > > 
> > > > and, while I don't think the FIB trie is actually descended in a
> > > > different way, we might still have a slightly different result from the
> > > > kernel (if I recall correctly, I didn't check right now).  
> > 
> > So, I've been staring at this for most of the day.  I don't think the
> > difference is because of the slightly different filters.  Instead it's
> > an easy to miss subtletly in nl_get_ext_if().
> > 
> > In nl_get_ext_if() both first_v4 and first_v6 will be set to the
> > interface which meets the criteria and comes first in the order of
> > routes returned by netlink from the kernel.  However, there's this early return:
> > 
> > 	word = (long *)has_v4;
> > 	for (i = 0; i < ARRAY_SIZE(has_v4) / sizeof(long); i++, word++) {
> > 		tmp = *word;
> > 		while ((n = ffsl(tmp))) {
> > 			int ifi = i * sizeof(long) * 8 + n - 1;
> > 
> > 			if (!first_v4)
> > 				first_v4 = ifi;
> > 
> > 			tmp &= ~(1UL << (n - 1));
> > 			if (bitmap_isset(has_v6, ifi)) {
> > 				*v4 = *v6 = IP_VERSION_ENABLED;
> > 				return ifi;    <================ HERE
> > 			}
> > 		}
> > 	}
> > 
> > Which means that if we find an interface with both a v4 and a v6
> > default route, we'll return it immediately.  That loop is across the
> > has_v4 bitmap, so we'll return the first one in *interface index order*,
> > which might not necessarily be the same as the order the routes are
> > returned from netlink.
> 
> Whoops :) ...right, sorry, I forgot about that.
> 
> > In the other cases: one of v4 or v6 is disabled, or there's no
> > interface with both v4 and v6 default routes, then we'll end up
> > returning the first interface in netlink route order.  Which makes
> > passt's overall priority of selecting interfaces pretty hard to
> > follow.
> > 
> > I've been thinking a bunch about what the order should be.  The
> > obvious cleanup to the current logic would be:
> > 	1. If there are >0 interfaces with both default v4 and v6
> > 	   routes, pick the one with the lowest metric
> > 	2. Otherwise if there are >0 interfaces with a default v4
> > 	   route, pick the one with the lowest metric
> > 	3. Otherwise if there are >0 interfaces with a default v6
> > 	   route, pick the one with the lowest metric
> > 	4. Otherwise fail.
> > 
> > ..but, I think that's the wrong choice.  For one thing it gets weird
> > in the unlikely but possible case that there are two interfaces with
> > both v4 and v6 routes, but one has a lower v4 metric and the other has
> > a lower v6 metric.  But more importantly, I think it fails the "do
> > what you want by default" test in the case of a host that has both v4
> > and v6 routing, but not on the same interface.  That's quite a
> > realistic situation if the v6 connectivity is coming from a tunnel
> > (6in4 or 6rd or 6over4 or whatever).
> 
> Definitely, I didn't consider the tunneled IPv6 case, but it looks
> still rather real even nowadays.

Yeah, I gather IPv6 is finally fairly standard in Germany, but here in
Australia I'm with basically the only residential ISP which provides v6.

> > Instead, I think we should separately pick an external interface for
> > v4 and v6 connectivity.  Yes, it's weird to have a view of two
> > different host interfaces on the same virtual interface.  It also
> > means pasta has to arbitrarily pick one of the host interface names
> > for the namespace, which will make the tests a bit uglier.  But it
> > makes picking the right interfaces a lot simpler and clearer, and does
> > what you probably want for the case of a host with tunnelled ipv6
> > connectivity.
> 
> I totally agree. It's still less weird than the alternative.

Excellent.

> We'll need to store two interface indices (say, "ifi4" and "ifi6"), but
> their usage right now is almost entirely limited to configuration and
> netlink functions.

Yes, I looked through all the uses of c->ifi, and it looks fairly
straightforward.

> One additional usage is to select a given interface for sockets bound
> to link-local addresses by means of setting .sin6_scope_id in struct
> sockaddr_in6 -- that happens in tcp_conn_from_tap() and sock_l4().
> There, we just need to switch to the new "ifi6" index.
> 
> Let me know if I should start looking into this or if you can do it
> instead.

I'll work on it.  Hoping I can get that done before heading on holiday
next week.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson