From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=W44Uox6A; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 830145A004E for ; Tue, 20 Aug 2024 22:39:35 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1724186374; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ew7ydMxWZOrWGgjJ+sig6+rqPlvg6Du6/bnfrbNUWLU=; b=W44Uox6AetKpjFm1pcGXGeymO2d5uzQZFZ9UZZm4q477IUOTIPwEZoJphpKAP2T/vQ7R8T DRxD6wQJ+rk4F1WiJBj3kz7BUi4aITTz9coBonxZfmCsxECNwCNihXnsVDdr1Xf2lQDXvv SXE6jBYF3pRD9hAlu7xIyL+a/SQ1F/0= Received: from mail-pl1-f198.google.com (mail-pl1-f198.google.com [209.85.214.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-509-Ac39lme9McWnmBq20jgXWw-1; Tue, 20 Aug 2024 16:39:32 -0400 X-MC-Unique: Ac39lme9McWnmBq20jgXWw-1 Received: by mail-pl1-f198.google.com with SMTP id d9443c01a7336-201f0280971so58727445ad.2 for ; Tue, 20 Aug 2024 13:39:32 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1724186371; x=1724791171; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=eVuTuY5uKQOtRxjwp41AvRiHGnD2S9pZ1s4l4U+19pA=; b=jBIBcBNwNaW9EKbwRzsyKsqxNBtca7fBkpX+AJnV0TzggzMum63u9+sRw2msZaTnJG URkmX1UBXliyj51GpG2el4w5b7E6ZkP7vD/4uOWBPprpT1GgqEAF73vl0L0DU4owcSnE ZqfL0DqGBNDdnweWeYF4FeXG/4SHoHIVMsfMwRhFr2qTPV80raxvYlM3hXhwZ33GS1X8 9r4GBDTCs520bAGVFx4+0YxEhQ5USNKn2voTZuZ9Q7ZCwGlGuNfsQM/Qn+1cWsLrjK0c Hx7EHVH3rjc5CWTMOGwk1mo9ovH5iAO7lshR/O0xevf3yVEZbLrBYcBrAjgSZjQiqp+N +kyQ== X-Gm-Message-State: AOJu0Yxq8/jShRUX3rf05DdsMBIGZ9yNngR0elBBuq7I0dVZKZQg9MNi pU5r4Llx+dMn6utJr0f2llWyJoximpwCIlXG+np2BUJ0T8b2h51L+ve16o+CMW4J2aAwYkSkrSc JuoXNyv2Xy7/A90l7W5ABsFBrODYbiU7+Sf1+d9/6DiNkgGepGg== X-Received: by 2002:a17:903:32d2:b0:1fb:cf82:11b4 with SMTP id d9443c01a7336-20367af1c5emr2035495ad.6.1724186370996; Tue, 20 Aug 2024 13:39:30 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEVmmhbi7nTxoWXBo4sem0FqIawqX980ZFD0+LDpZKV3RUh/QNy30pJnZm+9a6OaCNS5zg5jw== X-Received: by 2002:a17:903:32d2:b0:1fb:cf82:11b4 with SMTP id d9443c01a7336-20367af1c5emr2035185ad.6.1724186370282; Tue, 20 Aug 2024 13:39:30 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-201f031c5e1sm82359475ad.94.2024.08.20.13.39.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 20 Aug 2024 13:39:29 -0700 (PDT) Date: Tue, 20 Aug 2024 22:39:26 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 00/22] RFC: Allow configuration of special case NATs Message-ID: <20240820223926.5e800f5f@elisabeth> In-Reply-To: References: <20240816054004.1335006-1-david@gibson.dropbear.id.au> <20240819112749.63d7476d@elisabeth> <20240819150100.3bbdbb3f@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: PYA2UXF6WD7XLW6A5JA2UOKGJUYEK2T3 X-Message-ID-Hash: PYA2UXF6WD7XLW6A5JA2UOKGJUYEK2T3 X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue, 20 Aug 2024 10:42:17 +1000 David Gibson wrote: > On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote: > > On Mon, 19 Aug 2024 19:52:49 +1000 > > David Gibson wrote: > > =20 > > > On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote: =20 > > > > On Mon, 19 Aug 2024 18:46:31 +1000 > > > > David Gibson wrote: > > > > =20 > > > > > On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote: = =20 > > > > > > Based on Stefano's recent patch for faster tests. > > > > > >=20 > > > > > > Allow the user to specify which addresses are translated when u= sed by > > > > > > the guest, rather than always being the gateway address or noth= ing. > > > > > > We also allow this remapping to go to the host's global address= (more > > > > > > precisely the address assigned to the guest) rather than just h= ost > > > > > > loopback. > > > > > >=20 > > > > > > Suggestions for better names for the new options in patches 20 = & 22 > > > > > > are most welcome. > > > > > >=20 > > > > > > Along the way to implementing that make many changes to clarify= what > > > > > > various addresses we track mean, fixing a number of small bugs = as > > > > > > well. > > > > > >=20 > > > > > > NOTE: there is a bug in 21/22 which breaks some of the passt_tc= p perf > > > > > > tests. I haven't managed to figure out why it's causing the pr= oblem, > > > > > > or even what the exact triggering conditions are (running the s= ingle > > > > > > stalling iperf alone doesn't do it). Have to wrap up for today= , so I > > > > > > thought I'd get this out for review anyway. =20 > > > > >=20 > > > > > I've identified the bug here. IMO, it's a pre-existing problem t= hat > > > > > only works by accident at the moment. The immediate fix is prett= y > > > > > obvious, but it raises some broader questions > > > > >=20 > > > > > The problem arises because of the MTU changes we make in order to= test > > > > > throughput with different packet sizes. Specifically we change t= he > > > > > MTU to values < 1280, which implicitly disables IPv6 since it req= uires > > > > > an MTU >=3D 1280. When we change the MTU back to a larger value = IPv6 is > > > > > re-enabled, but some configuration has been lost in the meantime. > > > > >=20 > > > > > After the MTU is restored the guest reconfigures with NDP, but do= es > > > > > not re-DHCPv6. That means the guest gets a SLAAC address in the = right > > > > > prefix but not the exact /128 address we've tried to assign to it= . > > > > > However, at least with the sequence of things we have in the test= s, > > > > > the guest never sends any packets with the new address, so passt > > > > > doesn't update addr_seen. When the inbound connection comes we s= end > > > > > it to the assigned address instead of the guest's actual address = and > > > > > the guest rejects it. =20 > > > >=20 > > > > I still have to take a closer look, but I'm fairly sure I hit a sim= ilar > > > > issue while I was writing these tests originally. I pondered > > > > reconfiguring the address via DHCPv6, or using the keep_addr_on_dow= n > > > > sysctl (net.ipv6.conf..keep_addr_on_down), which was add= ed > > > > around that time. > > > >=20 > > > > Then: > > > > =20 > > > > > This "worked" previously, because before this patch, passt would > > > > > translate the inbound connection to have source/dest as link-loca= l > > > > > addresses. =20 > > > >=20 > > > > ...I realised that this worked and forgot about the whole issue. > > > > =20 > > > > > We *do* have a current addr_ll_seen because (a) it won't > > > > > change if the guest doesn't change MAC and (b) when IPv6 is re-en= abled > > > > > the NDP traffic the guest generates will have link-local addresse= s > > > > > that update addr_ll_seen. With this patch, and a global address = for > > > > > --map-host-loopback, we now need to send to addr_seen instead of > > > > > addr_ll_seen, hence exposing the bug. > > > > >=20 > > > > > In the short term, the obvious fix would be to re-run dhclient -6= in > > > > > the guest after we twiddle MTU but before running IPv6 tests. = =20 > > > >=20 > > > > I guess setting keep_addr_on_down (even for "all" interfaces) shoul= d > > > > work as well. =20 > > >=20 > > > Sounds like it. I wasn't aware of that one. > > >=20 > > > /me tests.. actually, no it doesn't work.. > > >=20 > > > # sysctl -a | grep keep_addr_on_down > > > net.ipv6.conf.all.keep_addr_on_down =3D 1 > > > net.ipv6.conf.default.keep_addr_on_down =3D 1 > > > net.ipv6.conf.dummy0.keep_addr_on_down =3D 1 > > > net.ipv6.conf.lo.keep_addr_on_down =3D 0 > > > # ip addr add 2001:db8::1 dev dummy0 > > > # ip a > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen = 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group def= ault qlen 1000 > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > inet6 2001:db8::1/128 scope global=20 > > > valid_lft forever preferred_lft forever > > > # ip link set dummy0 mtu 1200 > > > # ip a > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen = 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > 2: dummy0: mtu 1200 qdisc noop state DOWN group def= ault qlen 1000 > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > > # ip link set dummy0 mtu 1500 > > > # ip a > > > 1: lo: mtu 65536 qdisc noop state DOWN group default qlen = 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > 2: dummy0: mtu 1500 qdisc noop state DOWN group def= ault qlen 1000 > > > link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff > > >=20 > > > My guess is that IPv6 being deconfigured because of an unsuitable MTU > > > is considered a different event from a mere "down". =20 > >=20 > > I guess it's because they're not IFA_F_PERMANENT, because > > addrconf_permanent_addr() has: > >=20 > > case NETDEV_CHANGEMTU: > > /* if MTU under IPV6_MIN_MTU stop IPv6 on this interfac= e. */ > > if (dev->mtu < IPV6_MIN_MTU) { > > addrconf_ifdown(dev, dev !=3D net->loopback_dev= ); > > break; > > } > >=20 > > but addrconf_ifdown() does: > >=20 > > if (!keep_addr || > > !(ifa->flags & IFA_F_PERMANENT) || > > addr_is_local(&ifa->addr)) { > > hlist_del_init_rcu(&ifa->addr_l= st); > > goto restart; > > } > >=20 > > I'm not sure about the logic behind that. We could actually set those > > addresses as permanent once the DHCPv6 client configures them, if it's > > cleaner. =20 >=20 > Huh. Not in the passt/VM case, though, which is where I actually > encountered this. I meant using ip(8) from the test script itself, but it doesn't actually make sense: # ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev et= h0 Warning: permanent option is not mutable from userspace because (RFC 3549): IFA_F_PERMANENT For a permanent address set by the user. When this is not set, it means the address was dynamically created (e.g., by stateless autoconfiguration). So the address you used in your test _should_ have IFA_F_PERMANENT. The plot thickens. I just tried this, which confirms your hypothesis that bringing the link down is a different event: # ip addr add 2001:db8::1 dev dummy0 # ip link set dummy0 down # ip addr show dev dummy0 5: dummy0: mtu 1280 qdisc noqueue state DOWN group defaul= t qlen 1000 link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff inet 1.2.3.1/24 scope global dummy0 valid_lft forever preferred_lft forever inet6 2001:db8::1/128 scope global=20 valid_lft forever preferred_lft forever # ip link set dummy0 mtu 1279 # ip addr show dev dummy0 5: dummy0: mtu 1279 qdisc noqueue state DOWN group defaul= t qlen 1000 link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff inet 1.2.3.1/24 scope global dummy0 valid_lft forever preferred_lft forever ...I just can't see that from the code. > > > > > This kind of opens a question about how hard we should try to > > > > > accomodate guests which don't configure themselves how we told th= em. =20 > > > >=20 > > > > There's a notable distinction between guests temporarily diverging = (in > > > > different ways) and guests we don't configure at all. =20 > > >=20 > > > I'm not really sure what you're getting at here. =20 > >=20 > > In this case, it's not true that the guest doesn't configure itself in > > the way we requested -- it's just a temporary diversion from that > > configuration. =20 >=20 > Oh, I see. Assuming that at some point the DHCP client will re-run. >=20 > > Those are different cases that we can handle in different ways, I > > think. If it's a glitch that will only happen during testing, let's > > work around that. > >=20 > > But if the guest really ignores DHCPv6 information, I think we should > > keep that working. > > =20 > > > > It's probably more important to ensure we use the right type of add= ress =20 > > >=20 > > > "type" in what sense here? =20 > >=20 > > Global unicast instead of link-local. =20 >=20 > Ok. >=20 > > > > (security) rather than ensuring we somehow manage to deliver packet= s at > > > > any time (minor glitch otherwise), also because the one you describ= e is > > > > something we're unlikely to hit outside of tests. > > > > =20 > > > > > Personally I'd be ok with saying that nothing works if the guest > > > > > doesn't configure itself properly, thereby removing addr_seen and > > > > > addr_ll_seen entirely. But I think, Stefano, you've been against= that > > > > > idea in the past. =20 > > > >=20 > > > > Yes, I still think we should support guests that don't use DHCPv6 o= r > > > > NDP at all, =20 > > >=20 > > > Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to > > > manually configure the interface in the guest to match the address > > > you've configured with -a. Just like you'd expect to have to > > > correctly configure your address on a real network. =20 > >=20 > > True, but if we make correctness as optional as possible, we'll be more > > compatible (less time spent by users fixing situations that don't > > necessarily need fixing, less time spent by developers to look into > > reports, no matter who's at fault). =20 >=20 > Eh, maybe. Unless us trying to make sense of a nonsense situation > causes some unpredictable behaviour that breaks something else. >=20 > > > > or where related exchanges fail for any reason. It improves > > > > reliability and compatibility at a small cost. In this case, I thin= k > > > > it's a nice feature that we would resume communicating as soon as t= he > > > > guest shows its global unicast address. =20 > > >=20 > > > Hm, maybe. I'm not entirely convinced the cost is so small long term= . > > > It's pretty badly incompatible with having multiple guests behind the > > > same passt instance: such as the initial guest bridging or routing to > > > nested guests. =20 > >=20 > > Why? We will need to hash the interface/guest index anyway, for > > outbound flows. =20 >=20 > If we have separate interfaces for each guest, yes. But not if we > have multiple guests behind a single tap because the initial guest > sets up a bridge or routing. Then we have nothing but the address. ...but then we should have multiple addresses anyway. By the way, I'm not sure we'll ever be able to support that kind of configuration. How does a guest set up a bridge and use passt at the same time? > > And for inbound flows, if a guest steals the address of another guest, > > we'll give priority to the normal 'addr' versions instead of the > > '_seen' ones, to decide how to direct traffic. =20 >=20 > I don't see how we'd know we're in this situation, so when to > prioritise which address over the other. In the set of all addr_seen and addr, we would have at least a non-unique value. Or, practically speaking, we should refuse to set addr_seen if it matches addr for another guest. > > > I'm actually not sure if encountering this bug makes me more or less > > > in favour of addr_seen. On the one hand I think it highlights the > > > flakiness of this approach; there are situations where we just won't > > > know the right address. =20 > >=20 > > I don't understand this argument: indeed, there are such situations, > > and they are annoying. Why should we make them more common? =20 >=20 > Because predictability is good, and working _most_ of the time is a > failure of predictability. It avoids substantial effort and frustration for everybody involved though. The practical problem with lacking predictability is if it makes things harder to debug, I guess, which shouldn't be the case here. > > > On the other hand if shows a relatively > > > plausible case where the guest won't get exactly the address we want > > > it to (it uses NDP but not DHCPv6) > > >=20 > > > Hrm... actually this also shows a potential danger in the recent > > > patches to disable DAD in the guest. With DAD enabled, when the gues= t > > > grabs a new address, we'd expect it to emit DAD messages, which would > > > have the side effect of updating our addr_seen (although I'm pretty > > > sure I hit this patch before the nodad patches were applied, so that > > > doesn't seem to be foolproof). =20 > >=20 > > Well, but we do that for containers with --config-net only. In that > > case, the addresses we configure have infinite lifetime anyway. =20 >=20 > Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing > its new address. It probably did, but we ignored that anyway because DAD is done by sending neighbour solicitations with an unspecified address as source, for example (the "change" here drops "nodad"): $ ./pasta --config-net -p dad.pcap Saving packet capture to dad.pcap # ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64 # tshark -r dad.pcap |grep Neigh Running as user "root" and group "root". This could be dangerous. 10 2.642467 :: =E2=86=92 ff02::1:ff01:e9a1 ICMPv6 86 Neighbo= r Solicitation for fe80::3882:b5ff:fe01:e9a1 and in tap6_handler() we do: } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){ c->ip6.addr_seen =3D *saddr; } ...then, in ndp(): if (IN6_IS_ADDR_UNSPECIFIED(saddr)) return 1; we could set addr_seen by looking at the *target* address of the neighbour solicitation when the source address is ::, but it's not implemented yet. > > Besides, I don't think we need to have addr_seen updated as quickly and > > correctly as possible just for the sake of it, we can also update it > > when we get any other neighbour solicitation because the guest is > > actually using the network. It's not meant to be perfect. =20 >=20 > If the guest is a pure server (a common case for containers AFAICT), > then I don't know that we can expect NS messages for anything other > than the default gateway, which is (typically) link-local and so won't > help us to learn the new global address. Containers running actual applications are noisy. I've only seen this kind of problem (addr_seen not set/matching) in particularly crafted test environments. > > > We could maybe update addr_seen when we send RA messages to the guest > > > - assuming that it will use the same host part (low 64-bits) for both > > > link-local and global addresses. Not sure if that's a widely safe > > > assumption or not. =20 > >=20 > > I don't understand: what case are you trying to cover with this? =20 >=20 > A case just like the one in the tests: the interface bounces, and we > get NDP traffic on the link-local address, but nothing on the global > address before an inbound connection. Oh, I see. I think it makes sense, even though we'll set addr_seen a bit too early, but not enough to be a practical issue, I think. --=20 Stefano