* [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
@ 2024-03-15 11:24 Stefano Brivio
2024-03-15 13:11 ` Paul Holzinger
2024-03-18 3:16 ` David Gibson
0 siblings, 2 replies; 6+ messages in thread
From: Stefano Brivio @ 2024-03-15 11:24 UTC (permalink / raw)
To: passt-dev; +Cc: Martin Pitt, Paul Holzinger, David Gibson
Martin reports that, with Fedora Linux kernel version
kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
read() as families"), pasta doesn't exit once the network namespace
is gone.
Actually, pasta is completely non-functional, at least with default
options, because nl_route_dup(), which duplicates routes from the
parent namespace into the target namespace at start-up, is stuck on
a second receive operation for RTM_GETROUTE.
However, with that commit, the kernel is now able to fit the whole
response, including the NLMSG_DONE message, into a single datagram,
so no further messages will be received.
It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
responses to a netlink request") accidentally relied on the fact that
we would always get at least two datagrams as a response to
RTM_GETROUTE.
That is, the test to check if we expect another datagram, is based
on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
but we'll also expect another datagram if NLMSG_OK on the last
message is false. But NLMSG_OK with a zero length is always false.
The problem is that we don't distinguish if status is zero because
we got a NLMSG_DONE message, or because we processed all the
available datagram bytes.
Introduce an explicit check on NLMSG_DONE. We should probably
refactor this slightly, for example by introducing a special return
code from nl_status(), but this is probably the least invasive fix
for the issue at hand.
Reported-by: Martin Pitt <mpitt@redhat.com>
Link: https://github.com/containers/podman/issues/22052
Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
netlink.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/netlink.c b/netlink.c
index 9e7cccb..20de9b3 100644
--- a/netlink.c
+++ b/netlink.c
@@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
}
}
- if (!NLMSG_OK(nh, status) || status > 0) {
+ if (nh->nlmsg_type != NLMSG_DONE &&
+ (!NLMSG_OK(nh, status) || status > 0)) {
/* Process any remaining datagrams in a different
* buffer so we don't overwrite the first one.
*/
--
@@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
}
}
- if (!NLMSG_OK(nh, status) || status > 0) {
+ if (nh->nlmsg_type != NLMSG_DONE &&
+ (!NLMSG_OK(nh, status) || status > 0)) {
/* Process any remaining datagrams in a different
* buffer so we don't overwrite the first one.
*/
--
2.39.2
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio
@ 2024-03-15 13:11 ` Paul Holzinger
2024-03-15 14:52 ` Stefano Brivio
2024-03-18 3:16 ` David Gibson
1 sibling, 1 reply; 6+ messages in thread
From: Paul Holzinger @ 2024-03-15 13:11 UTC (permalink / raw)
To: Stefano Brivio, passt-dev; +Cc: Martin Pitt, David Gibson
On 15/03/2024 12:24, Stefano Brivio wrote:
> Martin reports that, with Fedora Linux kernel version
> kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
> including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
> read() as families"), pasta doesn't exit once the network namespace
> is gone.
>
> Actually, pasta is completely non-functional, at least with default
> options, because nl_route_dup(), which duplicates routes from the
> parent namespace into the target namespace at start-up, is stuck on
> a second receive operation for RTM_GETROUTE.
>
> However, with that commit, the kernel is now able to fit the whole
> response, including the NLMSG_DONE message, into a single datagram,
> so no further messages will be received.
>
> It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
> responses to a netlink request") accidentally relied on the fact that
> we would always get at least two datagrams as a response to
> RTM_GETROUTE.
>
> That is, the test to check if we expect another datagram, is based
> on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
> but we'll also expect another datagram if NLMSG_OK on the last
> message is false. But NLMSG_OK with a zero length is always false.
>
> The problem is that we don't distinguish if status is zero because
> we got a NLMSG_DONE message, or because we processed all the
> available datagram bytes.
>
> Introduce an explicit check on NLMSG_DONE. We should probably
> refactor this slightly, for example by introducing a special return
> code from nl_status(), but this is probably the least invasive fix
> for the issue at hand.
>
> Reported-by: Martin Pitt <mpitt@redhat.com>
> Link: https://github.com/containers/podman/issues/22052
> Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> netlink.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/netlink.c b/netlink.c
> index 9e7cccb..20de9b3 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> }
> }
>
> - if (!NLMSG_OK(nh, status) || status > 0) {
> + if (nh->nlmsg_type != NLMSG_DONE &&
> + (!NLMSG_OK(nh, status) || status > 0)) {
> /* Process any remaining datagrams in a different
> * buffer so we don't overwrite the first one.
> */
I was about to add my tested-by when I noticed a weird thing, but that
happens only on the new kernel as well:
On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp
src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel
scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route
default via 192.168.122.1 dev enp1s0 proto dhcp metric 100
192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92
192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems
we now have the same local route duplicated for some reason? I am not
sure if it is caused by this patch as I cannot test versions without
this patch on a newer kernel. I can however confirm that this patch
works and it no longer hangs.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
2024-03-15 13:11 ` Paul Holzinger
@ 2024-03-15 14:52 ` Stefano Brivio
2024-03-15 15:17 ` Stefano Brivio
0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2024-03-15 14:52 UTC (permalink / raw)
To: Paul Holzinger; +Cc: passt-dev, Martin Pitt, David Gibson
On Fri, 15 Mar 2024 14:11:40 +0100
Paul Holzinger <pholzing@redhat.com> wrote:
> On 15/03/2024 12:24, Stefano Brivio wrote:
> > Martin reports that, with Fedora Linux kernel version
> > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
> > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
> > read() as families"), pasta doesn't exit once the network namespace
> > is gone.
> >
> > Actually, pasta is completely non-functional, at least with default
> > options, because nl_route_dup(), which duplicates routes from the
> > parent namespace into the target namespace at start-up, is stuck on
> > a second receive operation for RTM_GETROUTE.
> >
> > However, with that commit, the kernel is now able to fit the whole
> > response, including the NLMSG_DONE message, into a single datagram,
> > so no further messages will be received.
> >
> > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
> > responses to a netlink request") accidentally relied on the fact that
> > we would always get at least two datagrams as a response to
> > RTM_GETROUTE.
> >
> > That is, the test to check if we expect another datagram, is based
> > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
> > but we'll also expect another datagram if NLMSG_OK on the last
> > message is false. But NLMSG_OK with a zero length is always false.
> >
> > The problem is that we don't distinguish if status is zero because
> > we got a NLMSG_DONE message, or because we processed all the
> > available datagram bytes.
> >
> > Introduce an explicit check on NLMSG_DONE. We should probably
> > refactor this slightly, for example by introducing a special return
> > code from nl_status(), but this is probably the least invasive fix
> > for the issue at hand.
> >
> > Reported-by: Martin Pitt <mpitt@redhat.com>
> > Link: https://github.com/containers/podman/issues/22052
> > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > netlink.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/netlink.c b/netlink.c
> > index 9e7cccb..20de9b3 100644
> > --- a/netlink.c
> > +++ b/netlink.c
> > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> > }
> > }
> >
> > - if (!NLMSG_OK(nh, status) || status > 0) {
> > + if (nh->nlmsg_type != NLMSG_DONE &&
> > + (!NLMSG_OK(nh, status) || status > 0)) {
> > /* Process any remaining datagrams in a different
> > * buffer so we don't overwrite the first one.
> > */
> I was about to add my tested-by when I noticed a weird thing, but that
> happens only on the new kernel as well:
>
> On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp
> src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel
> scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route
> default via 192.168.122.1 dev enp1s0 proto dhcp metric 100
> 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92
> 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems
> we now have the same local route duplicated for some reason?
I can reproduce this, it looks rather weird at the moment. Let's say I
have two IPv4 routes on the host:
$ ip ro sh
default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100
88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100
if I stop pasta after it re-creates the default route (and not the
subnet one, yet), say, at this point (the message with padding confuses
strace so we don't get the full decoding, but it's valid):
recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], [{
nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156
sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68
I see that the kernel already created both in the target namespace:
# ip ro sh
default via 88.198.0.161 dev eth0 proto dhcp metric 100
88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164
...and the second one gets the RTA_PREFSRC attribute we just stripped.
I'm still trying to find out what might cause this, even if it looks
harmless.
--
Stefano
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
2024-03-15 14:52 ` Stefano Brivio
@ 2024-03-15 15:17 ` Stefano Brivio
2024-03-15 15:21 ` Paul Holzinger
0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2024-03-15 15:17 UTC (permalink / raw)
To: Paul Holzinger; +Cc: passt-dev, Martin Pitt, David Gibson
On Fri, 15 Mar 2024 15:52:50 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> On Fri, 15 Mar 2024 14:11:40 +0100
> Paul Holzinger <pholzing@redhat.com> wrote:
>
> > On 15/03/2024 12:24, Stefano Brivio wrote:
> > > Martin reports that, with Fedora Linux kernel version
> > > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
> > > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
> > > read() as families"), pasta doesn't exit once the network namespace
> > > is gone.
> > >
> > > Actually, pasta is completely non-functional, at least with default
> > > options, because nl_route_dup(), which duplicates routes from the
> > > parent namespace into the target namespace at start-up, is stuck on
> > > a second receive operation for RTM_GETROUTE.
> > >
> > > However, with that commit, the kernel is now able to fit the whole
> > > response, including the NLMSG_DONE message, into a single datagram,
> > > so no further messages will be received.
> > >
> > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
> > > responses to a netlink request") accidentally relied on the fact that
> > > we would always get at least two datagrams as a response to
> > > RTM_GETROUTE.
> > >
> > > That is, the test to check if we expect another datagram, is based
> > > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
> > > but we'll also expect another datagram if NLMSG_OK on the last
> > > message is false. But NLMSG_OK with a zero length is always false.
> > >
> > > The problem is that we don't distinguish if status is zero because
> > > we got a NLMSG_DONE message, or because we processed all the
> > > available datagram bytes.
> > >
> > > Introduce an explicit check on NLMSG_DONE. We should probably
> > > refactor this slightly, for example by introducing a special return
> > > code from nl_status(), but this is probably the least invasive fix
> > > for the issue at hand.
> > >
> > > Reported-by: Martin Pitt <mpitt@redhat.com>
> > > Link: https://github.com/containers/podman/issues/22052
> > > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > netlink.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/netlink.c b/netlink.c
> > > index 9e7cccb..20de9b3 100644
> > > --- a/netlink.c
> > > +++ b/netlink.c
> > > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> > > }
> > > }
> > >
> > > - if (!NLMSG_OK(nh, status) || status > 0) {
> > > + if (nh->nlmsg_type != NLMSG_DONE &&
> > > + (!NLMSG_OK(nh, status) || status > 0)) {
> > > /* Process any remaining datagrams in a different
> > > * buffer so we don't overwrite the first one.
> > > */
> > I was about to add my tested-by when I noticed a weird thing, but that
> > happens only on the new kernel as well:
> >
> > On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp
> > src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel
> > scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route
> > default via 192.168.122.1 dev enp1s0 proto dhcp metric 100
> > 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92
> > 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems
> > we now have the same local route duplicated for some reason?
>
> I can reproduce this, it looks rather weird at the moment. Let's say I
> have two IPv4 routes on the host:
>
> $ ip ro sh
> default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100
> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100
>
> if I stop pasta after it re-creates the default route (and not the
> subnet one, yet), say, at this point (the message with padding confuses
> strace so we don't get the full decoding, but it's valid):
>
> recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")],
[{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156
> sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68
>
> I see that the kernel already created both in the target namespace:
>
> # ip ro sh
> default via 88.198.0.161 dev eth0 proto dhcp metric 100
> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164
Ah, no, the issue is pre-existing, you can actually see the same on an
older kernel, without this patch. The subnet route is added
autonomously by the kernel after we add the address via RTM_NEWADDR
with a ifa_prefixlen < 32 attribute.
Then we add it "again" via RTM_NEWROUTE.
We could probably avoid the creation of any route on RTM_NEWADDR giving
32 or 128 as prefix length, because anyway we'll explicitly copy the
subnet route later, but that needs a bit more pondering -- I'd say it's
beyond the scope of this change.
--
Stefano
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
2024-03-15 15:17 ` Stefano Brivio
@ 2024-03-15 15:21 ` Paul Holzinger
0 siblings, 0 replies; 6+ messages in thread
From: Paul Holzinger @ 2024-03-15 15:21 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Martin Pitt, David Gibson
On 15/03/2024 16:17, Stefano Brivio wrote:
> On Fri, 15 Mar 2024 15:52:50 +0100
> Stefano Brivio <sbrivio@redhat.com> wrote:
>
>> On Fri, 15 Mar 2024 14:11:40 +0100
>> Paul Holzinger <pholzing@redhat.com> wrote:
>>
>>> On 15/03/2024 12:24, Stefano Brivio wrote:
>>>> Martin reports that, with Fedora Linux kernel version
>>>> kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
>>>> including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
>>>> read() as families"), pasta doesn't exit once the network namespace
>>>> is gone.
>>>>
>>>> Actually, pasta is completely non-functional, at least with default
>>>> options, because nl_route_dup(), which duplicates routes from the
>>>> parent namespace into the target namespace at start-up, is stuck on
>>>> a second receive operation for RTM_GETROUTE.
>>>>
>>>> However, with that commit, the kernel is now able to fit the whole
>>>> response, including the NLMSG_DONE message, into a single datagram,
>>>> so no further messages will be received.
>>>>
>>>> It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
>>>> responses to a netlink request") accidentally relied on the fact that
>>>> we would always get at least two datagrams as a response to
>>>> RTM_GETROUTE.
>>>>
>>>> That is, the test to check if we expect another datagram, is based
>>>> on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
>>>> but we'll also expect another datagram if NLMSG_OK on the last
>>>> message is false. But NLMSG_OK with a zero length is always false.
>>>>
>>>> The problem is that we don't distinguish if status is zero because
>>>> we got a NLMSG_DONE message, or because we processed all the
>>>> available datagram bytes.
>>>>
>>>> Introduce an explicit check on NLMSG_DONE. We should probably
>>>> refactor this slightly, for example by introducing a special return
>>>> code from nl_status(), but this is probably the least invasive fix
>>>> for the issue at hand.
>>>>
>>>> Reported-by: Martin Pitt <mpitt@redhat.com>
>>>> Link: https://github.com/containers/podman/issues/22052
>>>> Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
>>>> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
>>>> ---
>>>> netlink.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/netlink.c b/netlink.c
>>>> index 9e7cccb..20de9b3 100644
>>>> --- a/netlink.c
>>>> +++ b/netlink.c
>>>> @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
>>>> }
>>>> }
>>>>
>>>> - if (!NLMSG_OK(nh, status) || status > 0) {
>>>> + if (nh->nlmsg_type != NLMSG_DONE &&
>>>> + (!NLMSG_OK(nh, status) || status > 0)) {
>>>> /* Process any remaining datagrams in a different
>>>> * buffer so we don't overwrite the first one.
>>>> */
>>> I was about to add my tested-by when I noticed a weird thing, but that
>>> happens only on the new kernel as well:
>>>
>>> On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp
>>> src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel
>>> scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route
>>> default via 192.168.122.1 dev enp1s0 proto dhcp metric 100
>>> 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92
>>> 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems
>>> we now have the same local route duplicated for some reason?
>> I can reproduce this, it looks rather weird at the moment. Let's say I
>> have two IPv4 routes on the host:
>>
>> $ ip ro sh
>> default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100
>> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100
>>
>> if I stop pasta after it re-creates the default route (and not the
>> subnet one, yet), say, at this point (the message with padding confuses
>> strace so we don't get the full decoding, but it's valid):
>>
>> recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")],
> [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156
>> sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68
>>
>> I see that the kernel already created both in the target namespace:
>>
>> # ip ro sh
>> default via 88.198.0.161 dev eth0 proto dhcp metric 100
>> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164
> Ah, no, the issue is pre-existing, you can actually see the same on an
> older kernel, without this patch. The subnet route is added
> autonomously by the kernel after we add the address via RTM_NEWADDR
> with a ifa_prefixlen < 32 attribute.
>
> Then we add it "again" via RTM_NEWROUTE.
>
> We could probably avoid the creation of any route on RTM_NEWADDR giving
> 32 or 128 as prefix length, because anyway we'll explicitly copy the
> subnet route later, but that needs a bit more pondering -- I'd say it's
> beyond the scope of this change.
>
Yes, agreed. This fix looks correct and fixes the hang which is the most
important thing.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE
2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio
2024-03-15 13:11 ` Paul Holzinger
@ 2024-03-18 3:16 ` David Gibson
1 sibling, 0 replies; 6+ messages in thread
From: David Gibson @ 2024-03-18 3:16 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Martin Pitt, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]
On Fri, Mar 15, 2024 at 12:24:32PM +0100, Stefano Brivio wrote:
> Martin reports that, with Fedora Linux kernel version
> kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64,
> including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same
> read() as families"), pasta doesn't exit once the network namespace
> is gone.
>
> Actually, pasta is completely non-functional, at least with default
> options, because nl_route_dup(), which duplicates routes from the
> parent namespace into the target namespace at start-up, is stuck on
> a second receive operation for RTM_GETROUTE.
>
> However, with that commit, the kernel is now able to fit the whole
> response, including the NLMSG_DONE message, into a single datagram,
> so no further messages will be received.
>
> It turns out that commit 4d6e9d0816e2 ("netlink: Always process all
> responses to a netlink request") accidentally relied on the fact that
> we would always get at least two datagrams as a response to
> RTM_GETROUTE.
Huh, oops. I feel like the expectation that NLMSG_DONE wold appear in
its own datagram was pre-existing, althogh the conseqences of it not
being so might have been different and less serious. I recall
noticing that the code expected that, and assuming that was part of
the ABI.
> That is, the test to check if we expect another datagram, is based
> on the 'status' variable, which is 0 if we just parsed NLMSG_DONE,
> but we'll also expect another datagram if NLMSG_OK on the last
> message is false. But NLMSG_OK with a zero length is always false.
>
> The problem is that we don't distinguish if status is zero because
> we got a NLMSG_DONE message, or because we processed all the
> available datagram bytes.
>
> Introduce an explicit check on NLMSG_DONE. We should probably
> refactor this slightly, for example by introducing a special return
> code from nl_status(), but this is probably the least invasive fix
> for the issue at hand.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On the argument above. I do think we need to check over the netlink
code to handle this more carefully, though.
>
> Reported-by: Martin Pitt <mpitt@redhat.com>
> Link: https://github.com/containers/podman/issues/22052
> Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> netlink.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/netlink.c b/netlink.c
> index 9e7cccb..20de9b3 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
> }
> }
>
> - if (!NLMSG_OK(nh, status) || status > 0) {
> + if (nh->nlmsg_type != NLMSG_DONE &&
> + (!NLMSG_OK(nh, status) || status > 0)) {
> /* Process any remaining datagrams in a different
> * buffer so we don't overwrite the first one.
> */
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-03-18 3:18 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio
2024-03-15 13:11 ` Paul Holzinger
2024-03-15 14:52 ` Stefano Brivio
2024-03-15 15:17 ` Stefano Brivio
2024-03-15 15:21 ` Paul Holzinger
2024-03-18 3:16 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).