From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 16 Oct 2024 17:26:48 +0200
From: Stefano Brivio
To: David Gibson
Subject: Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address
Message-ID: <20241016172648.666b0f8c@elisabeth>
References: <20241002054826.1812844-1-david@gibson.dropbear.id.au> <20241002054826.1812844-5-david@gibson.dropbear.id.au> <20241009150721.63af48f6@elisabeth> <20241009224433.7fc28fc7@elisabeth>
Organization: Red Hat
CC: passt-dev@passt.top
List-Id: Development discussion and patches for passt

On Wed, 16 Oct 2024 19:39:40 +1100
David Gibson wrote:

> On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
> > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
> > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
> > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
> > > > > On Wed, 9 Oct 2024 15:07:21 +0200
> > > > > Stefano Brivio wrote:
> > > [snip]
> > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > > > > > >  		   (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > > > > > >  		/* spliceable */
> > > > > > >
> > > > > > > -		/* Preserve the specific loopback adddress used, but let the
> > > > > > > -		 * kernel pick a source port on the target side
> > > > > > > +		/* The traffic will go over the guest's 'lo' interface, but by
> > > > > > > +		 * default use its external address, so we don't inadvertently
> > > > > > > +		 * expose services that listen only on the guest's loopback
> > > > > > > +		 * address.  That can be overridden by --host-lo-to-ns-lo which
> > > > > > > +		 * will instead forward to the loopback address in the guest.
> > > > > > > +		 *
> > > > > > > +		 * In either case, let the kernel pick the source address to
> > > > > > > +		 * match.
> > > > > > >  		 */
> > > > > > > -		tgt->oaddr = ini->eaddr;
> > > > > > > +		if (inany_v4(&ini->eaddr)) {
> > > > > > > +			if (c->host_lo_to_ns_lo)
> > > > > > > +				tgt->eaddr = inany_loopback4;
> > > > > > > +			else
> > > > > > > +				tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > > > > > +			tgt->oaddr = inany_any4;
> > > > > > > +		} else {
> > > > > > > +			if (c->host_lo_to_ns_lo)
> > > > > > > +				tgt->eaddr = inany_loopback6;
> > > > > > > +			else
> > > > > > > +				tgt->eaddr.a6 = c->ip6.addr_seen;
> > > > > >
> > > > > > Either this...
> > > > > >
> > > > > > > +			tgt->oaddr = inany_any6;
> > > > > >
> > > > > > or this (and not something before this patch, up to 3/4) make the
> > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > > > > > sometimes (about one in three/four runs), that's what I mistakenly
> > > > > > reported as coming from Laurent's series at:
> > > >
> > > > Huh, interesting. Just got back from my leave and ran that group of
> > > > tests in a loop this afternoon, but didn't manage to reproduce. I
> > > > have administrivia that will probably fill the rest of this week, but
> > > > I'll look into this as soon as I can.
> > >
> > > I reproduced the problem on passt.top, and I have a partial idea
> > > what's going on. As you say it's seeming like the address (addr_seen
> > > == addr in this case) isn't properly ready. This is over splice, but
> > > on the tap interface, I see the container sending NS messages for its
> > > own address - seems like it's doing DAD. But more importantly, we're
> > > answering those NS messages with NA messages, because we answer all
> > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this
> > > ever worked at all. --config-net makes sense, since we disable DAD,
> > > but our test suite has always been using NDP+DHCP instead of
> > > --config-net.
> > >
> > > So, AFAICT, we'll always fail guest DAD attempts, both IPv6, which
> > > happens most of the time and for IPv4 via ARP, which is used much more
> > > rarely. I think we need to be more selective in what NS or ARP
> > > lookups we respond to. The question is what approach to take:
> >
> > Hmm... no.. there's more to this.
> >
> > Usually DAD requests have :: as the source address, and we *do*
> > exclude those from getting replies. In this case though, we're
> > getting NS requests for the assigned address from what looks like the
> > SLAAC address. So, I do think it would be wise to explicitly exclude
> > these: we shouldn't be giving NA responses for an address that ought
> > to belong to the guest, even if it doesn't look like a DAD.
> >
> > But, I'm not sure what's triggering this. Is for some reason the DHCP
> > address not "taking", so the container is trying to locate it on the
> > network instead? Or _is_ this DAD, but under some circumstances
> > rather than using :: as the source address it uses another configured
> > address.
>
> Ok.. I've understood a bit more. While timing is a factor here, it
> looks like the main reason I wasn't seeing it on my machine is what
> I'd consider a bug in the Debian version of the dhclient-script:
> when adding an IPv6 address, it returns without waiting for DAD to
> complete (i.e. for the address to be non-tentative).

Oops.

On one hand, I would feel inclined to propose a fix for the Debian and
Ubuntu packages.

On the other hand, I wonder if it's universally considered a bug: the
DHCPv6 client did its job at that point, and it's debatable whether
dhclient should wait for the address to be usable before forking to
background.

That is, arguably, dhclient's job is to request and configure an
address. It's not a network configuration daemon. There might be many
other reasons why that address is unusable, and yet dhclient is not
responsible for them.
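
For a caller that does want to wait for the address to become
non-tentative, a minimal sketch could look like the one below. It is
only an illustration: has_tentative and wait_dad are hypothetical
helper names, not part of dhclient-script or of the passt test suite,
and the sketch assumes that 'ip -6 addr show' from iproute2 marks
addresses with the word "tentative" while DAD is in progress.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only: they are not part of
# dhclient-script or of the passt test suite.

# has_tentative TEXT: succeed if TEXT (assumed to be the output of
# 'ip -6 addr show') lists at least one address flagged as tentative.
has_tentative() {
	printf '%s\n' "$1" | grep -q tentative
}

# wait_dad IFNAME [TRIES]: poll once per second until no address on
# IFNAME is flagged as tentative. Returns 0 once that is the case, 1 if
# TRIES polls (default: 5) were not enough.
wait_dad() {
	__ifname="$1"
	__tries="${2:-5}"

	while [ "${__tries}" -gt 0 ]; do
		has_tentative "$(ip -6 addr show dev "${__ifname}")" || return 0
		sleep 1
		__tries=$((__tries - 1))
	done

	return 1
}
```

A test script would then call something like 'wait_dad eth0' right
after dhclient -6 returns, and carry on only if it succeeds. The
interface name here is just an example.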

By the way, I guess it's just an issue for test scripts like this one.

> There's also an additional bug, which doesn't cause this problem, I
> think, but caused some problems when I was investigating. DHCPv6
> needs the link-local SLAAC address already configured and
> non-tentative. The Fedora dhclient-script waits for that too at the
> PREINIT6 stage, but the Debian one doesn't, meaning if you attempt
> dhclient -6 immediately after starting the namespace it will fail to
> bind the UDP address it needs.

Right, and that's fine for us because we have a 2-second delay after
SLAAC.

This looks to me a bit more like a real bug, but again, there might be
many other reasons why dhclient can't use a link-local address. One
might argue that it's the responsibility of the user/caller to invoke
dhclient when appropriate.

In that sense, you might be wondering why there's a 2-second delay
after SLAAC, but no delay after invoking dhclient -6: the reason is
that I was convinced that one wouldn't need DAD once a DHCPv6 client
configures an address. The server is already checking that, I thought.

Well, no. RFC 8415 18.2.10.1:

  https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1

says:

  If the client can operate with the addresses and/or prefixes obtained
  from the server:

  [...]

  -  The client MUST perform duplicate address detection as per
     Section 5.4 of [RFC4862], which does list some exceptions, on each
     of the received addresses in any IAs on which it has not performed
     duplicate address detection during processing of any of the
     previous Reply messages from the server. The client performs the
     duplicate address detection before using the received addresses
     for any traffic. If any of the addresses are found to be in use
     on the link, the client sends a Decline message to the server for
     those addresses as described in Section 18.2.8.

> I still think it's a good idea not to give NA messages for the guest
> assigned address, but we'll need a different workaround for this
> issue.

I read the rest of your reasoning about it, but the nice thing about
the current behaviour (and that's why I added that single check on the
source address in ndp()) is that the guest can really use whatever
address it wants, regardless of what we tried to configure, and we'll
resolve any other address.

If we receive a neighbour solicitation for the guest assigned address,
and the source address is not unspecified, it means that the guest is
_not_ using the assigned address, and it's actually trying to reach it.

> I guess we'll have to manually wait for DAD to complete in the
> DHCP tests, which will be kind of mucky :/

Alternatively, we could use the same trick as added by commit
f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to
prevent DAD"): disable neighbour solicitations, run dhclient -6, set
'nodad' on the address, and re-enable neighbour solicitations. This
works for me:

--
diff --git a/test/pasta/dhcp b/test/pasta/dhcp
index 41556b8..76aa723 100644
--- a/test/pasta/dhcp
+++ b/test/pasta/dhcp
@@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu'
 check [ __MTU__ = 65520 ]
 
 test DHCPv6: address
+ns ip link set dev __IFNAME__ arp off
 ns /sbin/dhclient -6 --no-pid __IFNAME__
 hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
 nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+ns ip addr change __ADDR6__ dev __IFNAME__ nodad
+ns ip link set dev __IFNAME__ arp on
 hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
 check [ __ADDR6__ = __HOST_ADDR6__ ]
 
--

Adding 2-second delays, as we have them for NDP, doesn't look that bad
either:

$ grep --exclude-dir=demo -rn "dhclient -6"
pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__
passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__
passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__
perf/passt_tcp:117:guest dhclient -6 -x
perf/passt_tcp:118:guest dhclient -6 __IFNAME__
two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__
two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__

Given that we don't need it on dhclient -x, that's six invocations, so
tests would take about 12 seconds longer.

Or we could switch to the arp off / nodad / arp on approach for
everything, including SLAAC:

$ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2"
pasta/ndp:21:sleep 2
passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD
passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2
two_guests/basic:39:sleep 2

and save slightly less than 8 seconds.

If you ask me, I would have a slight preference for the nodad approach.

-- 
Stefano