public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Paul Holzinger <pholzing@redhat.com>
Cc: passt-dev@passt.top, Martin Pitt <mpitt@redhat.com>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
Date: Fri, 15 Mar 2024 16:17:05 +0100	[thread overview]
Message-ID: <20240315161705.638c347d@elisabeth> (raw)
In-Reply-To: <20240315155250.237cb94f@elisabeth>

On Fri, 15 Mar 2024 15:52:50 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 15 Mar 2024 14:11:40 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
> 
> > On 15/03/2024 12:24, Stefano Brivio wrote:  
> > > Martin reports that, with Fedora Linux kernel version
> > > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
> > > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
> > > read() as families"), pasta doesn't exit once the network namespace
> > > is gone.
> > >
> > > Actually, pasta is completely non-functional, at least with default
> > > options, because nl_route_dup(), which duplicates routes from the
> > > parent namespace into the target namespace at start-up, is stuck on
> > > a second receive operation for RTM_GETROUTE.
> > >
> > > However, with that commit, the kernel is now able to fit the whole
> > > response, including the NLMSG_DONE message, into a single datagram,
> > > so no further messages will be received.
> > >
> > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
> > > responses to a netlink request") accidentally relied on the fact that
> > > we would always get at least two datagrams as a response to
> > > RTM_GETROUTE.
> > >
> > > That is, the test to check if we expect another datagram, is based
> > > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
> > > but we'll also expect another datagram if NLMSG_OK on the last
> > > message is false. But NLMSG_OK with a zero length is always false.
> > >
> > > The problem is that we don't distinguish if status is zero because
> > > we got a NLMSG_DONE message, or because we processed all the
> > > available datagram bytes.
> > >
> > > Introduce an explicit check on NLMSG_DONE. We should probably
> > > refactor this slightly, for example by introducing a special return
> > > code from nl_status(), but this is probably the least invasive fix
> > > for the issue at hand.
> > >
> > > Reported-by: Martin Pitt <mpitt@redhat.com>
> > > Link: https://github.com/containers/podman/issues/22052
> > > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >   netlink.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/netlink.c b/netlink.c
> > > index 9e7cccb..20de9b3 100644
> > > --- a/netlink.c
> > > +++ b/netlink.c
> > > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> > >   		}
> > >   	}
> > >   
> > > -	if (!NLMSG_OK(nh, status) || status > 0) {
> > > +	if (nh->nlmsg_type != NLMSG_DONE &&
> > > +	    (!NLMSG_OK(nh, status) || status > 0)) {
> > >   		/* Process any remaining datagrams in a different
> > >   		 * buffer so we don't overwrite the first one.
> > >   		 */    
> > I was about to add my tested-by when I noticed a weird thing, but that 
> > happens only on the new kernel as well:
> > 
> > On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp 
> > src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel 
> > scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route 
> > default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 
> > 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 
> > 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems 
> > we now have the same local route duplicated for some reason?  
> 
> I can reproduce this, it looks rather weird at the moment. Let's say I
> have two IPv4 routes on the host:
> 
> $ ip ro sh
> default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 
> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100
> 
> if I stop pasta after it re-creates the default route (and not the
> subnet one, yet), say, at this point (the message with padding confuses
> strace so we don't get the full decoding, but it's valid):
> 
> recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], 
 [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156
> sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68
> 
> I see that the kernel already created both in the target namespace:
> 
> # ip ro sh
> default via 88.198.0.161 dev eth0 proto dhcp metric 100 
> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 

Ah, no, the issue is pre-existing, you can actually see the same on an
older kernel, without this patch. The subnet route is added
autonomously by the kernel after we add the address via RTM_NEWADDR
with a ifa_prefixlen < 32 attribute.

Then we add it "again" via RTM_NEWROUTE.

We could probably avoid the creation of any route on RTM_NEWADDR giving
32 or 128 as prefix length, because anyway we'll explicitly copy the
subnet route later, but that needs a bit more pondering -- I'd say it's
beyond the scope of this change.

-- 
Stefano


  reply	other threads:[~2024-03-15 15:17 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio
2024-03-15 13:11 ` Paul Holzinger
2024-03-15 14:52   ` Stefano Brivio
2024-03-15 15:17     ` Stefano Brivio [this message]
2024-03-15 15:21       ` Paul Holzinger
2024-03-18  3:16 ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240315161705.638c347d@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=mpitt@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=pholzing@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).