From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 7F4C55A026F for ; Fri, 15 Mar 2024 16:17:46 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1710515865; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=K9B3v6LGjwPJA2GLEWvtkPbCZunvQU/51iXmW+kYEv0=; b=bQeCzrWateOirFNtOO6mmrO685L6zrfWrFJj/2VaPs38QKfV/zSi+Y6yYzi2CJ0GzPra0y G8JQa+kYN1y5ROybvkQTwnMHZav0BrEN65V+PN6EVQ8PZ38DytdsMOhqd9m+YeJ3SwOlPh yrdVRfrPga2bDM69JQjWKHXmQJC6hNg= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-134-93FLHAg3NiCkOsvHJ0t09g-1; Fri, 15 Mar 2024 11:17:44 -0400 X-MC-Unique: 93FLHAg3NiCkOsvHJ0t09g-1 Received: by mail-ed1-f71.google.com with SMTP id 4fb4d7f45d1cf-5684345c0ebso3746888a12.1 for ; Fri, 15 Mar 2024 08:17:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1710515863; x=1711120663; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=K9B3v6LGjwPJA2GLEWvtkPbCZunvQU/51iXmW+kYEv0=; b=uilOzxRbrwlGMpBzsXTTTnaTmUsJCH6bBxZ8SEzmPRoBlP6B55SpnN8ZE/jUhf+ZHr PNPikWJkDfCUiW70PrwWF2uL32fhoEi7G/6SSsDcFCk86Lyf2Iw+pMaIvS+WttC87etG 0j4VjE8+HEqLCMnQ2SXqfrb9y0HGP91/A7ATdAiu/0e6kGUAgRda4iYx/MQSJNYHzDr4 I9Lk24SawaZ3XG6sGuFa6KQZR3EWgdOSodIyjY2e89RLhEf4/9znKhwn36cMRFJ9n8GJ Oqy0MCheIL2v/NzA0EaJiL4jR27OrNubNx24u/HjNkpFXrQT1NU6MnWnHBh3cP2MOSXS Nogw== X-Gm-Message-State: AOJu0YyTQe8D5NWJ9vQ1RyyYQP3huQbBgX14orHW0BCscTZG4mtvQxiv j28wVQhJU+ssJ19CbliopJvIaXxtlhMYPZs2MbXEN2qjfCQKE5lTyggUF/v20iRU6wUtbGn1iXX OAvfAV7eXHX+0q6HIyN0b82VDNN71nIl94AXOQ8Fku/Bn6cZXGw== X-Received: by 2002:a17:907:1b0e:b0:a46:4ea8:9df5 with SMTP id mp14-20020a1709071b0e00b00a464ea89df5mr9988131ejc.5.1710515862621; Fri, 15 Mar 2024 08:17:42 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEEuaI5/gRXLVQE6BvTyTBEkyqDRb/uHMg4zC4+0q6wm2tz0AuR07jmtcGQNrm+87jSPl8mtA== X-Received: by 2002:a17:907:1b0e:b0:a46:4ea8:9df5 with SMTP id mp14-20020a1709071b0e00b00a464ea89df5mr9988102ejc.5.1710515862123; Fri, 15 Mar 2024 08:17:42 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id dm18-20020a170907949200b00a45200fe2b5sm1775242ejc.224.2024.03.15.08.17.41 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 15 Mar 2024 08:17:41 -0700 (PDT) Date: Fri, 15 Mar 2024 16:17:05 +0100 From: Stefano Brivio To: Paul Holzinger Subject: Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Message-ID: <20240315161705.638c347d@elisabeth> In-Reply-To: <20240315155250.237cb94f@elisabeth> References: <20240315112432.382212-1-sbrivio@redhat.com> <2772fc43-252a-4dea-96fb-454d615f9d40@redhat.com> <20240315155250.237cb94f@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: R54OML2PWQPD5AHPR76QPUTYAF4WPQ2R X-Message-ID-Hash: R54OML2PWQPD5AHPR76QPUTYAF4WPQ2R X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Martin Pitt , David Gibson X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 15 Mar 2024 15:52:50 +0100 Stefano Brivio wrote: > On Fri, 15 Mar 2024 14:11:40 +0100 > Paul Holzinger wrote: > > > On 15/03/2024 12:24, Stefano Brivio wrote: > > > Martin reports that, with Fedora Linux kernel version > > > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, > > > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same > > > read() as families"), pasta doesn't exit once the network namespace > > > is gone. > > > > > > Actually, pasta is completely non-functional, at least with default > > > options, because nl_route_dup(), which duplicates routes from the > > > parent namespace into the target namespace at start-up, is stuck on > > > a second receive operation for RTM_GETROUTE. > > > > > > However, with that commit, the kernel is now able to fit the whole > > > response, including the NLMSG_DONE message, into a single datagram, > > > so no further messages will be received. > > > > > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all > > > responses to a netlink request") accidentally relied on the fact that > > > we would always get at least two datagrams as a response to > > > RTM_GETROUTE. > > > > > > That is, the test to check if we expect another datagram, is based > > > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, > > > but we'll also expect another datagram if NLMSG_OK on the last > > > message is false. But NLMSG_OK with a zero length is always false. > > > > > > The problem is that we don't distinguish if status is zero because > > > we got a NLMSG_DONE message, or because we processed all the > > > available datagram bytes. > > > > > > Introduce an explicit check on NLMSG_DONE. We should probably > > > refactor this slightly, for example by introducing a special return > > > code from nl_status(), but this is probably the least invasive fix > > > for the issue at hand. > > > > > > Reported-by: Martin Pitt > > > Link: https://github.com/containers/podman/issues/22052 > > > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") > > > Signed-off-by: Stefano Brivio > > > --- > > > netlink.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/netlink.c b/netlink.c > > > index 9e7cccb..20de9b3 100644 > > > --- a/netlink.c > > > +++ b/netlink.c > > > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > > > } > > > } > > > > > > - if (!NLMSG_OK(nh, status) || status > 0) { > > > + if (nh->nlmsg_type != NLMSG_DONE && > > > + (!NLMSG_OK(nh, status) || status > 0)) { > > > /* Process any remaining datagrams in a different > > > * buffer so we don't overwrite the first one. > > > */ > > I was about to add my tested-by when I noticed a weird thing, but that > > happens only on the new kernel as well: > > > > On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp > > src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel > > scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route > > default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 > > 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 > > 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems > > we now have the same local route duplicated for some reason? > > I can reproduce this, it looks rather weird at the moment. Let's say I > have two IPv4 routes on the host: > > $ ip ro sh > default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 > 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 > > if I stop pasta after it re-creates the default route (and not the > subnet one, yet), say, at this point (the message with padding confuses > strace so we don't get the full decoding, but it's valid): > > recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156 > sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 > > I see that the kernel already created both in the target namespace: > > # ip ro sh > default via 88.198.0.161 dev eth0 proto dhcp metric 100 > 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 Ah, no, the issue is pre-existing, you can actually see the same on an older kernel, without this patch. The subnet route is added autonomously by the kernel after we add the address via RTM_NEWADDR with a ifa_prefixlen < 32 attribute. Then we add it "again" via RTM_NEWROUTE. We could probably avoid the creation of any route on RTM_NEWADDR giving 32 or 128 as prefix length, because anyway we'll explicitly copy the subnet route later, but that needs a bit more pondering -- I'd say it's beyond the scope of this change. -- Stefano