From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH 00/22] RFC: Allow configuration of special case NATs
Date: Tue, 20 Aug 2024 10:42:17 +1000 [thread overview]
Message-ID: <ZsPmaTe8JcZNsxFj@zatzit.fritz.box> (raw)
In-Reply-To: <20240819150100.3bbdbb3f@elisabeth>
[-- Attachment #1: Type: text/plain, Size: 12638 bytes --]
On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
> On Mon, 19 Aug 2024 19:52:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
> > > On Mon, 19 Aug 2024 18:46:31 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
> > > > > Based on Stefano's recent patch for faster tests.
> > > > >
> > > > > Allow the user to specify which addresses are translated when used by
> > > > > the guest, rather than always being the gateway address or nothing.
> > > > > We also allow this remapping to go to the host's global address (more
> > > > > precisely the address assigned to the guest) rather than just host
> > > > > loopback.
> > > > >
> > > > > Suggestions for better names for the new options in patches 20 & 22
> > > > > are most welcome.
> > > > >
> > > > > Along the way to implementing that make many changes to clarify what
> > > > > various addresses we track mean, fixing a number of small bugs as
> > > > > well.
> > > > >
> > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
> > > > > tests. I haven't managed to figure out why it's causing the problem,
> > > > > or even what the exact triggering conditions are (running the single
> > > > > stalling iperf alone doesn't do it). Have to wrap up for today, so I
> > > > > thought I'd get this out for review anyway.
> > > >
> > > > I've identified the bug here. IMO, it's a pre-existing problem that
> > > > only works by accident at the moment. The immediate fix is pretty
> > > > obvious, but it raises some broader questions
> > > >
> > > > The problem arises because of the MTU changes we make in order to test
> > > > throughput with different packet sizes. Specifically we change the
> > > > MTU to values < 1280, which implicitly disables IPv6 since it requires
> > > > an MTU >= 1280. When we change the MTU back to a larger value IPv6 is
> > > > re-enabled, but some configuration has been lost in the meantime.
> > > >
> > > > After the MTU is restored the guest reconfigures with NDP, but does
> > > > not re-DHCPv6. That means the guest gets a SLAAC address in the right
> > > > prefix but not the exact /128 address we've tried to assign to it.
> > > > However, at least with the sequence of things we have in the tests,
> > > > the guest never sends any packets with the new address, so passt
> > > > doesn't update addr_seen. When the inbound connection comes we send
> > > > it to the assigned address instead of the guest's actual address and
> > > > the guest rejects it.
> > >
> > > I still have to take a closer look, but I'm fairly sure I hit a similar
> > > issue while I was writing these tests originally. I pondered
> > > reconfiguring the address via DHCPv6, or using the keep_addr_on_down
> > > sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added
> > > around that time.
> > >
> > > Then:
> > >
> > > > This "worked" previously, because before this patch, passt would
> > > > translate the inbound connection to have source/dest as link-local
> > > > addresses.
> > >
> > > ...I realised that this worked and forgot about the whole issue.
> > >
> > > > We *do* have a current addr_ll_seen because (a) it won't
> > > > change if the guest doesn't change MAC and (b) when IPv6 is re-enabled
> > > > the NDP traffic the guest generates will have link-local addresses
> > > > that update addr_ll_seen. With this patch, and a global address for
> > > > --map-host-loopback, we now need to send to addr_seen instead of
> > > > addr_ll_seen, hence exposing the bug.
> > > >
> > > > In the short term, the obvious fix would be to re-run dhclient -6 in
> > > > the guest after we twiddle MTU but before running IPv6 tests.
> > >
> > > I guess setting keep_addr_on_down (even for "all" interfaces) should
> > > work as well.
> >
> > Sounds like it. I wasn't aware of that one.
> >
> > /me tests.. actually, no it doesn't work..
> >
> > # sysctl -a | grep keep_addr_on_down
> > net.ipv6.conf.all.keep_addr_on_down = 1
> > net.ipv6.conf.default.keep_addr_on_down = 1
> > net.ipv6.conf.dummy0.keep_addr_on_down = 1
> > net.ipv6.conf.lo.keep_addr_on_down = 0
> > # ip addr add 2001:db8::1 dev dummy0
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > inet6 2001:db8::1/128 scope global
> > valid_lft forever preferred_lft forever
> > # ip link set dummy0 mtu 1200
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1200 qdisc noop state DOWN group default qlen 1000
> > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> > # ip link set dummy0 mtu 1500
> > # ip a
> > 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
> > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> > 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
> >
> > My guess is that IPv6 being deconfigured because of an unsuitable MTU
> > is considered a different event from a mere "down".
>
> I guess it's because they're not IFA_F_PERMANENT, because
> addrconf_permanent_addr() has:
>
> case NETDEV_CHANGEMTU:
> /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
> if (dev->mtu < IPV6_MIN_MTU) {
> addrconf_ifdown(dev, dev != net->loopback_dev);
> break;
> }
>
> but addrconf_ifdown() does:
>
> if (!keep_addr ||
> !(ifa->flags & IFA_F_PERMANENT) ||
> addr_is_local(&ifa->addr)) {
> hlist_del_init_rcu(&ifa->addr_lst);
> goto restart;
> }
>
> I'm not sure about the logic behind that. We could actually set those
> addresses as permanent once the DHCPv6 client configures them, if it's
> cleaner.
Huh. Not in the passt/VM case, though, which is where I actually
encountered this.
> > > > This kind of opens a question about how hard we should try to
> > > > accomodate guests which don't configure themselves how we told them.
> > >
> > > There's a notable distinction between guests temporarily diverging (in
> > > different ways) and guests we don't configure at all.
> >
> > I'm not really sure what you're getting at here.
>
> In this case, it's not true that the guest doesn't configure itself in
> the way we requested -- it's just a temporary diversion from that
> configuration.
Oh, I see. Assuming that at some point the DHCP client will re-run.
> Those are different cases that we can handle in different ways, I
> think. If it's a glitch that will only happen during testing, let's
> work around that.
>
> But if the guest really ignores DHCPv6 information, I think we should
> keep that working.
>
> > > It's probably more important to ensure we use the right type of address
> >
> > "type" in what sense here?
>
> Global unicast instead of link-local.
Ok.
> > > (security) rather than ensuring we somehow manage to deliver packets at
> > > any time (minor glitch otherwise), also because the one you describe is
> > > something we're unlikely to hit outside of tests.
> > >
> > > > Personally I'd be ok with saying that nothing works if the guest
> > > > doesn't configure itself properly, thereby removing addr_seen and
> > > > addr_ll_seen entirely. But I think, Stefano, you've been against that
> > > > idea in the past.
> > >
> > > Yes, I still think we should support guests that don't use DHCPv6 or
> > > NDP at all,
> >
> > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to
> > manually configure the interface in the guest to match the address
> > you've configured with -a. Just like you'd expect to have to
> > correctly configure your address on a real network.
>
> True, but if we make correctness as optional as possible, we'll be more
> compatible (less time spent by users fixing situations that don't
> necessarily need fixing, less time spent by developers to look into
> reports, no matter who's at fault).
Eh, maybe. Unless us trying to make sense of a nonsense situation
causes some unpredictable behaviour that breaks something else.
> > > or where related exchanges fail for any reason. It improves
> > > reliability and compatibility at a small cost. In this case, I think
> > > it's a nice feature that we would resume communicating as soon as the
> > > guest shows its global unicast address.
> >
> > Hm, maybe. I'm not entirely convinced the cost is so small long term.
> > It's pretty badly incompatible with having multiple guests behind the
> > same passt instance: such as the initial guest bridging or routing to
> > nested guests.
>
> Why? We will need to hash the interface/guest index anyway, for
> outbound flows.
If we have separate interfaces for each guest, yes. But not if we
have multiple guests behind a single tap because the initial guest
sets up a bridge or routing. Then we have nothing but the address.
> And for inbound flows, if a guest steals the address of another guest,
> we'll give priority to the normal 'addr' versions instead of the
> '_seen' ones, to decide how to direct traffic.
I don't see how we'd know we're in this situation, so when to
prioritise which address over the other.
> > I'm actually not sure if encountering this bug makes me more or less
> > in favour of addr_seen. On the one hand I think it highlights the
> > flakiness of this approach; there are situations where we just won't
> > know the right address.
>
> I don't understand this argument: indeed, there are such situations,
> and they are annoying. Why should we make them more common?
Because predictability is good, and working _most_ of the time is a
failure of predictability.
> > On the other hand if shows a relatively
> > plausible case where the guest won't get exactly the address we want
> > it to (it uses NDP but not DHCPv6)
> >
> > Hrm... actually this also shows a potential danger in the recent
> > patches to disable DAD in the guest. With DAD enabled, when the guest
> > grabs a new address, we'd expect it to emit DAD messages, which would
> > have the side effect of updating our addr_seen (although I'm pretty
> > sure I hit this patch before the nodad patches were applied, so that
> > doesn't seem to be foolproof).
>
> Well, but we do that for containers with --config-net only. In that
> case, the addresses we configure have infinite lifetime anyway.
Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing
its new address.
> Besides, I don't think we need to have addr_seen updated as quickly and
> correctly as possible just for the sake of it, we can also update it
> when we get any other neighbour solicitation because the guest is
> actually using the network. It's not meant to be perfect.
If the guest is a pure server (a common case for containers AFAICT),
then I don't know that we can expect NS messages for anything other
than the default gateway, which is (typically) link-local and so won't
help us to learn the new global address.
> > We could maybe update addr_seen when we send RA messages to the guest
> > - assuming that it will use the same host part (low 64-bits) for both
> > link-local and global addresses. Not sure if that's a widely safe
> > assumption or not.
>
> I don't understand: what case are you trying to cover with this?
A case just like the one in the tests: the interface bounces, and we
get NDP traffic on the link-local address, but nothing on the global
address before an inbound connection.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-08-20 0:49 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-16 5:39 [PATCH 00/22] RFC: Allow configuration of special case NATs David Gibson
2024-08-16 5:39 ` [PATCH 01/22] treewide: Use "our address" instead of "forwarding address" David Gibson
2024-08-18 15:44 ` Stefano Brivio
2024-08-19 1:28 ` David Gibson
2024-08-16 5:39 ` [PATCH 02/22] util: Helper for formatting MAC addresses David Gibson
2024-08-18 15:44 ` Stefano Brivio
2024-08-19 1:29 ` David Gibson
2024-08-16 5:39 ` [PATCH 03/22] treewide: Rename MAC address fields for clarity David Gibson
2024-08-18 15:45 ` Stefano Brivio
2024-08-19 1:36 ` David Gibson
2024-08-16 5:39 ` [PATCH 04/22] treewide: Use struct assignment instead of memcpy() for IP addresses David Gibson
2024-08-18 15:45 ` Stefano Brivio
2024-08-19 1:38 ` David Gibson
2024-08-16 5:39 ` [PATCH 05/22] conf: Use array indices rather than pointers for DNS array slots David Gibson
2024-08-16 5:39 ` [PATCH 06/22] conf: More accurately count entries added in get_dns() David Gibson
2024-08-16 5:39 ` [PATCH 07/22] conf: Move DNS array bounds checks into add_dns[46] David Gibson
2024-08-16 5:39 ` [PATCH 08/22] conf: Move adding of a nameserver from resolv.conf into subfunction David Gibson
2024-08-16 5:39 ` [PATCH 09/22] conf: Correct setting of dns_match address in add_dns6() David Gibson
2024-08-16 5:39 ` [PATCH 10/22] conf: Treat --dns addresses as guest visible addresses David Gibson
2024-08-16 5:39 ` [PATCH 11/22] conf: Remove incorrect initialisation of addr_ll_seen David Gibson
2024-08-16 5:39 ` [PATCH 12/22] util: Correct sock_l4() binding for link local addresses David Gibson
2024-08-20 0:14 ` Stefano Brivio
2024-08-20 1:29 ` David Gibson
2024-08-16 5:39 ` [PATCH 13/22] treewide: Change misleading 'addr_ll' name David Gibson
2024-08-20 0:15 ` Stefano Brivio
2024-08-20 1:30 ` David Gibson
2024-08-16 5:39 ` [PATCH 14/22] Clarify which addresses in ip[46]_ctx are meaningful where David Gibson
2024-08-16 5:39 ` [PATCH 15/22] Initialise our_tap_ll to ip6.gw when suitable David Gibson
2024-08-16 5:39 ` [PATCH 16/22] fwd: Helpers to clarify what host addresses aren't guest accessible David Gibson
2024-08-20 19:56 ` Stefano Brivio
2024-08-21 1:40 ` David Gibson
2024-08-16 5:39 ` [PATCH 17/22] fwd: Split notion of "our tap address" from gateway for IPv4 David Gibson
2024-08-20 19:56 ` Stefano Brivio
2024-08-21 1:56 ` David Gibson
2024-08-16 5:39 ` [PATCH 18/22] Don't take "our" MAC address from the host David Gibson
2024-08-16 5:40 ` [PATCH 19/22] conf, fwd: Split notion of gateway/router from guest-visible host address David Gibson
2024-08-20 19:56 ` Stefano Brivio
2024-08-21 1:59 ` David Gibson
2024-08-16 5:40 ` [PATCH 20/22] conf: Allow address remapped to host to be configured David Gibson
2024-08-20 19:56 ` Stefano Brivio
2024-08-21 2:23 ` David Gibson
2024-08-16 5:40 ` [PATCH 21/22] fwd: Distinguish translatable from untranslatable addresses on inbound David Gibson
2024-08-16 5:40 ` [PATCH 22/22] fwd, conf: Allow NAT of the guest's assigned address David Gibson
2024-08-20 19:56 ` Stefano Brivio
2024-08-21 2:28 ` David Gibson
2024-08-16 14:45 ` [PATCH 00/22] RFC: Allow configuration of special case NATs Paul Holzinger
2024-08-16 15:03 ` Stefano Brivio
2024-08-17 8:01 ` David Gibson
2024-08-19 8:46 ` David Gibson
2024-08-19 9:27 ` Stefano Brivio
2024-08-19 9:52 ` David Gibson
2024-08-19 13:01 ` Stefano Brivio
2024-08-20 0:42 ` David Gibson [this message]
2024-08-20 20:39 ` Stefano Brivio
2024-08-21 2:51 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZsPmaTe8JcZNsxFj@zatzit.fritz.box \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=pholzing@redhat.com \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).