public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH 00/22] RFC: Allow configuration of special case NATs
Date: Tue, 20 Aug 2024 10:42:17 +1000	[thread overview]
Message-ID: <ZsPmaTe8JcZNsxFj@zatzit.fritz.box> (raw)
In-Reply-To: <20240819150100.3bbdbb3f@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 12638 bytes --]

On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
> On Mon, 19 Aug 2024 19:52:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
> > > On Mon, 19 Aug 2024 18:46:31 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:  
> > > > > Based on Stefano's recent patch for faster tests.
> > > > > 
> > > > > Allow the user to specify which addresses are translated when used by
> > > > > the guest, rather than always being the gateway address or nothing.
> > > > > We also allow this remapping to go to the host's global address (more
> > > > > precisely the address assigned to the guest) rather than just host
> > > > > loopback.
> > > > > 
> > > > > Suggestions for better names for the new options in patches 20 & 22
> > > > > are most welcome.
> > > > > 
> > > > > Along the way to implementing that make many changes to clarify what
> > > > > various addresses we track mean, fixing a number of small bugs as
> > > > > well.
> > > > > 
> > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
> > > > > tests.  I haven't managed to figure out why it's causing the problem,
> > > > > or even what the exact triggering conditions are (running the single
> > > > > stalling iperf alone doesn't do it).  Have to wrap up for today, so I
> > > > > thought I'd get this out for review anyway.    
> > > > 
> > > > I've identified the bug here.  IMO, it's a pre-existing problem that
> > > > only works by accident at the moment.  The immediate fix is pretty
> > > > obvious, but it raises some broader questions
> > > > 
> > > > The problem arises because of the MTU changes we make in order to test
> > > > throughput with different packet sizes.  Specifically we change the
> > > > MTU to values < 1280, which implicitly disables IPv6 since it requires
> > > > an MTU >= 1280.  When we change the MTU back to a larger value IPv6 is
> > > > re-enabled, but some configuration has been lost in the meantime.
> > > > 
> > > > After the MTU is restored the guest reconfigures with NDP, but does
> > > > not re-DHCPv6.  That means the guest gets a SLAAC address in the right
> > > > prefix but not the exact /128 address we've tried to assign to it.
> > > > However, at least with the sequence of things we have in the tests,
> > > > the guest never sends any packets with the new address, so passt
> > > > doesn't update addr_seen.  When the inbound connection comes we send
> > > > it to the assigned address instead of the guest's actual address and
> > > > the guest rejects it.  
> > > 
> > > I still have to take a closer look, but I'm fairly sure I hit a similar
> > > issue while I was writing these tests originally. I pondered
> > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down
> > > sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added
> > > around that time.
> > > 
> > > Then:
> > >   
> > > > This "worked" previously, because before this patch, passt would
> > > > translate the inbound connection to have source/dest as link-local
> > > > addresses.  
> > > 
> > > ...I realised that this worked and forgot about the whole issue.
> > >   
> > > > We *do* have a current addr_ll_seen because (a) it won't
> > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled
> > > > the NDP traffic the guest generates will have link-local addresses
> > > > that update addr_ll_seen.  With this patch, and a global address for
> > > > --map-host-loopback, we now need to send to addr_seen instead of
> > > > addr_ll_seen, hence exposing the bug.
> > > > 
> > > > In the short term, the obvious fix would be to re-run dhclient -6 in
> > > > the guest after we twiddle MTU but before running IPv6 tests.  
> > > 
> > > I guess setting keep_addr_on_down (even for "all" interfaces) should
> > > work as well.  
> > 
> > Sounds like it.  I wasn't aware of that one.
> > 
> > /me tests..  actually, no it doesn't work..
> > 
> > # sysctl -a | grep keep_addr_on_down
> > net.ipv6.conf.all.keep_addr_on_down = 1
> > net.ipv6.conf.default.keep_addr_on_down = 1
> > net.ipv6.conf.dummy0.keep_addr_on_down = 1
> > net.ipv6.conf.lo.keep_addr_on_down = 0
> > # ip addr add 2001:db8::1 dev dummy0
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> >     inet6 2001:db8::1/128 scope global 
> >        valid_lft forever preferred_lft forever
> > # ip link set dummy0 mtu 1200
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1200 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > # ip link set dummy0 mtu 1500
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> >     link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > 
> > My guess is that IPv6 being deconfigured because of an unsuitable MTU
> > is considered a different event from a mere "down".
> 
> I guess it's because they're not IFA_F_PERMANENT, because
> addrconf_permanent_addr() has:
> 
>         case NETDEV_CHANGEMTU:
>                 /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
>                 if (dev->mtu < IPV6_MIN_MTU) {
>                         addrconf_ifdown(dev, dev != net->loopback_dev);
>                         break;
>                 }
> 
> but addrconf_ifdown() does:
> 
>                                 if (!keep_addr ||
>                                     !(ifa->flags & IFA_F_PERMANENT) ||
>                                     addr_is_local(&ifa->addr)) {
>                                         hlist_del_init_rcu(&ifa->addr_lst);
>                                         goto restart;
>                                 }
> 
> I'm not sure about the logic behind that. We could actually set those
> addresses as permanent once the DHCPv6 client configures them, if it's
> cleaner.

Huh.  Not in the passt/VM case, though, which is where I actually
encountered this.

> > > > This kind of opens a question about how hard we should try to
> > > > accomodate guests which don't configure themselves how we told them.  
> > > 
> > > There's a notable distinction between guests temporarily diverging (in
> > > different ways) and guests we don't configure at all.  
> > 
> > I'm not really sure what you're getting at here.
> 
> In this case, it's not true that the guest doesn't configure itself in
> the way we requested -- it's just a temporary diversion from that
> configuration.

Oh, I see.  Assuming that at some point the DHCP client will re-run.

> Those are different cases that we can handle in different ways, I
> think. If it's a glitch that will only happen during testing, let's
> work around that.
> 
> But if the guest really ignores DHCPv6 information, I think we should
> keep that working.
> 
> > > It's probably more important to ensure we use the right type of address  
> > 
> > "type" in what sense here?
> 
> Global unicast instead of link-local.

Ok.

> > > (security) rather than ensuring we somehow manage to deliver packets at
> > > any time (minor glitch otherwise), also because the one you describe is
> > > something we're unlikely to hit outside of tests.
> > >   
> > > > Personally I'd be ok with saying that nothing works if the guest
> > > > doesn't configure itself properly, thereby removing addr_seen and
> > > > addr_ll_seen entirely.  But I think, Stefano, you've been against that
> > > > idea in the past.  
> > > 
> > > Yes, I still think we should support guests that don't use DHCPv6 or
> > > NDP at all,  
> > 
> > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to
> > manually configure the interface in the guest to match the address
> > you've configured with -a.  Just like you'd expect to have to
> > correctly configure your address on a real network.
> 
> True, but if we make correctness as optional as possible, we'll be more
> compatible (less time spent by users fixing situations that don't
> necessarily need fixing, less time spent by developers to look into
> reports, no matter who's at fault).

Eh, maybe.  Unless us trying to make sense of a nonsense situation
causes some unpredictable behaviour that breaks something else.

> > > or where related exchanges fail for any reason. It improves
> > > reliability and compatibility at a small cost. In this case, I think
> > > it's a nice feature that we would resume communicating as soon as the
> > > guest shows its global unicast address.  
> > 
> > Hm, maybe.  I'm not entirely convinced the cost is so small long term.
> > It's pretty badly incompatible with having multiple guests behind the
> > same passt instance: such as the initial guest bridging or routing to
> > nested guests.
> 
> Why? We will need to hash the interface/guest index anyway, for
> outbound flows.

If we have separate interfaces for each guest, yes.  But not if we
have multiple guests behind a single tap because the initial guest
sets up a bridge or routing.  Then we have nothing but the address.

> And for inbound flows, if a guest steals the address of another guest,
> we'll give priority to the normal 'addr' versions instead of the
> '_seen' ones, to decide how to direct traffic.

I don't see how we'd know we're in this situation, so when to
prioritise which address over the other.

> > I'm actually not sure if encountering this bug makes me more or less
> > in favour of addr_seen.  On the one hand I think it highlights the
> > flakiness of this approach; there are situations where we just won't
> > know the right address.
> 
> I don't understand this argument: indeed, there are such situations,
> and they are annoying. Why should we make them more common?

Because predictability is good, and working _most_ of the time is a
failure of predictability.

> > On the other hand if shows a relatively
> > plausible case where the guest won't get exactly the address we want
> > it to (it uses NDP but not DHCPv6)
> > 
> > Hrm... actually this also shows a potential danger in the recent
> > patches to disable DAD in the guest.  With DAD enabled, when the guest
> > grabs a new address, we'd expect it to emit DAD messages, which would
> > have the side effect of updating our addr_seen (although I'm pretty
> > sure I hit this patch before the nodad patches were applied, so that
> > doesn't seem to be foolproof).
> 
> Well, but we do that for containers with --config-net only. In that
> case, the addresses we configure have infinite lifetime anyway.

Oh, good point.  Hrm... then I'm unsure why the guest wasn't re-DADing
its new address.

> Besides, I don't think we need to have addr_seen updated as quickly and
> correctly as possible just for the sake of it, we can also update it
> when we get any other neighbour solicitation because the guest is
> actually using the network. It's not meant to be perfect.

If the guest is a pure server (a common case for containers AFAICT),
then I don't know that we can expect NS messages for anything other
than the default gateway, which is (typically) link-local and so won't
help us to learn the new global address.

> > We could maybe update addr_seen when we send RA messages to the guest
> > - assuming that it will use the same host part (low 64-bits) for both
> > link-local and global addresses.  Not sure if that's a widely safe
> > assumption or not.
> 
> I don't understand: what case are you trying to cover with this?

A case just like the one in the tests: the interface bounces, and we
get NDP traffic on the link-local address, but nothing on the global
address before an inbound connection.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2024-08-20  0:49 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-16  5:39 [PATCH 00/22] RFC: Allow configuration of special case NATs David Gibson
2024-08-16  5:39 ` [PATCH 01/22] treewide: Use "our address" instead of "forwarding address" David Gibson
2024-08-18 15:44   ` Stefano Brivio
2024-08-19  1:28     ` David Gibson
2024-08-16  5:39 ` [PATCH 02/22] util: Helper for formatting MAC addresses David Gibson
2024-08-18 15:44   ` Stefano Brivio
2024-08-19  1:29     ` David Gibson
2024-08-16  5:39 ` [PATCH 03/22] treewide: Rename MAC address fields for clarity David Gibson
2024-08-18 15:45   ` Stefano Brivio
2024-08-19  1:36     ` David Gibson
2024-08-16  5:39 ` [PATCH 04/22] treewide: Use struct assignment instead of memcpy() for IP addresses David Gibson
2024-08-18 15:45   ` Stefano Brivio
2024-08-19  1:38     ` David Gibson
2024-08-16  5:39 ` [PATCH 05/22] conf: Use array indices rather than pointers for DNS array slots David Gibson
2024-08-16  5:39 ` [PATCH 06/22] conf: More accurately count entries added in get_dns() David Gibson
2024-08-16  5:39 ` [PATCH 07/22] conf: Move DNS array bounds checks into add_dns[46] David Gibson
2024-08-16  5:39 ` [PATCH 08/22] conf: Move adding of a nameserver from resolv.conf into subfunction David Gibson
2024-08-16  5:39 ` [PATCH 09/22] conf: Correct setting of dns_match address in add_dns6() David Gibson
2024-08-16  5:39 ` [PATCH 10/22] conf: Treat --dns addresses as guest visible addresses David Gibson
2024-08-16  5:39 ` [PATCH 11/22] conf: Remove incorrect initialisation of addr_ll_seen David Gibson
2024-08-16  5:39 ` [PATCH 12/22] util: Correct sock_l4() binding for link local addresses David Gibson
2024-08-20  0:14   ` Stefano Brivio
2024-08-20  1:29     ` David Gibson
2024-08-16  5:39 ` [PATCH 13/22] treewide: Change misleading 'addr_ll' name David Gibson
2024-08-20  0:15   ` Stefano Brivio
2024-08-20  1:30     ` David Gibson
2024-08-16  5:39 ` [PATCH 14/22] Clarify which addresses in ip[46]_ctx are meaningful where David Gibson
2024-08-16  5:39 ` [PATCH 15/22] Initialise our_tap_ll to ip6.gw when suitable David Gibson
2024-08-16  5:39 ` [PATCH 16/22] fwd: Helpers to clarify what host addresses aren't guest accessible David Gibson
2024-08-20 19:56   ` Stefano Brivio
2024-08-21  1:40     ` David Gibson
2024-08-16  5:39 ` [PATCH 17/22] fwd: Split notion of "our tap address" from gateway for IPv4 David Gibson
2024-08-20 19:56   ` Stefano Brivio
2024-08-21  1:56     ` David Gibson
2024-08-16  5:39 ` [PATCH 18/22] Don't take "our" MAC address from the host David Gibson
2024-08-16  5:40 ` [PATCH 19/22] conf, fwd: Split notion of gateway/router from guest-visible host address David Gibson
2024-08-20 19:56   ` Stefano Brivio
2024-08-21  1:59     ` David Gibson
2024-08-16  5:40 ` [PATCH 20/22] conf: Allow address remapped to host to be configured David Gibson
2024-08-20 19:56   ` Stefano Brivio
2024-08-21  2:23     ` David Gibson
2024-08-16  5:40 ` [PATCH 21/22] fwd: Distinguish translatable from untranslatable addresses on inbound David Gibson
2024-08-16  5:40 ` [PATCH 22/22] fwd, conf: Allow NAT of the guest's assigned address David Gibson
2024-08-20 19:56   ` Stefano Brivio
2024-08-21  2:28     ` David Gibson
2024-08-16 14:45 ` [PATCH 00/22] RFC: Allow configuration of special case NATs Paul Holzinger
2024-08-16 15:03   ` Stefano Brivio
2024-08-17  8:01     ` David Gibson
2024-08-19  8:46 ` David Gibson
2024-08-19  9:27   ` Stefano Brivio
2024-08-19  9:52     ` David Gibson
2024-08-19 13:01       ` Stefano Brivio
2024-08-20  0:42         ` David Gibson [this message]
2024-08-20 20:39           ` Stefano Brivio
2024-08-21  2:51             ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZsPmaTe8JcZNsxFj@zatzit.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=pholzing@redhat.com \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).