* [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE @ 2024-03-15 11:24 Stefano Brivio 2024-03-15 13:11 ` Paul Holzinger 2024-03-18 3:16 ` David Gibson 0 siblings, 2 replies; 6+ messages in thread From: Stefano Brivio @ 2024-03-15 11:24 UTC (permalink / raw) To: passt-dev; +Cc: Martin Pitt, Paul Holzinger, David Gibson Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- netlink.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/netlink.c b/netlink.c index 9e7cccb..20de9b3 100644 --- a/netlink.c +++ b/netlink.c @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */ -- @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, } } - if (!NLMSG_OK(nh, status) || status > 0) { + if (nh->nlmsg_type != NLMSG_DONE && + (!NLMSG_OK(nh, status) || status > 0)) { /* Process any remaining datagrams in a different * buffer so we don't overwrite the first one. */ -- 2.39.2 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE 2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio @ 2024-03-15 13:11 ` Paul Holzinger 2024-03-15 14:52 ` Stefano Brivio 2024-03-18 3:16 ` David Gibson 1 sibling, 1 reply; 6+ messages in thread From: Paul Holzinger @ 2024-03-15 13:11 UTC (permalink / raw) To: Stefano Brivio, passt-dev; +Cc: Martin Pitt, David Gibson On 15/03/2024 12:24, Stefano Brivio wrote: > Martin reports that, with Fedora Linux kernel version > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same > read() as families"), pasta doesn't exit once the network namespace > is gone. > > Actually, pasta is completely non-functional, at least with default > options, because nl_route_dup(), which duplicates routes from the > parent namespace into the target namespace at start-up, is stuck on > a second receive operation for RTM_GETROUTE. > > However, with that commit, the kernel is now able to fit the whole > response, including the NLMSG_DONE message, into a single datagram, > so no further messages will be received. > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all > responses to a netlink request") accidentally relied on the fact that > we would always get at least two datagrams as a response to > RTM_GETROUTE. > > That is, the test to check if we expect another datagram, is based > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, > but we'll also expect another datagram if NLMSG_OK on the last > message is false. But NLMSG_OK with a zero length is always false. > > The problem is that we don't distinguish if status is zero because > we got a NLMSG_DONE message, or because we processed all the > available datagram bytes. > > Introduce an explicit check on NLMSG_DONE. We should probably > refactor this slightly, for example by introducing a special return > code from nl_status(), but this is probably the least invasive fix > for the issue at hand. > > Reported-by: Martin Pitt <mpitt@redhat.com> > Link: https://github.com/containers/podman/issues/22052 > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > --- > netlink.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/netlink.c b/netlink.c > index 9e7cccb..20de9b3 100644 > --- a/netlink.c > +++ b/netlink.c > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > } > } > > - if (!NLMSG_OK(nh, status) || status > 0) { > + if (nh->nlmsg_type != NLMSG_DONE && > + (!NLMSG_OK(nh, status) || status > 0)) { > /* Process any remaining datagrams in a different > * buffer so we don't overwrite the first one. > */ I was about to add my tested-by when I noticed a weird thing, but that happens only on the new kernel as well: On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems we now have the same local route duplicated for some reason? I am not sure if it is caused by this patch as I cannot test versions without this patch on a newer kernel. I can however confirm that this patch works and it no longer hangs. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE 2024-03-15 13:11 ` Paul Holzinger @ 2024-03-15 14:52 ` Stefano Brivio 2024-03-15 15:17 ` Stefano Brivio 0 siblings, 1 reply; 6+ messages in thread From: Stefano Brivio @ 2024-03-15 14:52 UTC (permalink / raw) To: Paul Holzinger; +Cc: passt-dev, Martin Pitt, David Gibson On Fri, 15 Mar 2024 14:11:40 +0100 Paul Holzinger <pholzing@redhat.com> wrote: > On 15/03/2024 12:24, Stefano Brivio wrote: > > Martin reports that, with Fedora Linux kernel version > > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, > > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same > > read() as families"), pasta doesn't exit once the network namespace > > is gone. > > > > Actually, pasta is completely non-functional, at least with default > > options, because nl_route_dup(), which duplicates routes from the > > parent namespace into the target namespace at start-up, is stuck on > > a second receive operation for RTM_GETROUTE. > > > > However, with that commit, the kernel is now able to fit the whole > > response, including the NLMSG_DONE message, into a single datagram, > > so no further messages will be received. > > > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all > > responses to a netlink request") accidentally relied on the fact that > > we would always get at least two datagrams as a response to > > RTM_GETROUTE. > > > > That is, the test to check if we expect another datagram, is based > > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, > > but we'll also expect another datagram if NLMSG_OK on the last > > message is false. But NLMSG_OK with a zero length is always false. > > > > The problem is that we don't distinguish if status is zero because > > we got a NLMSG_DONE message, or because we processed all the > > available datagram bytes. > > > > Introduce an explicit check on NLMSG_DONE. We should probably > > refactor this slightly, for example by introducing a special return > > code from nl_status(), but this is probably the least invasive fix > > for the issue at hand. > > > > Reported-by: Martin Pitt <mpitt@redhat.com> > > Link: https://github.com/containers/podman/issues/22052 > > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > --- > > netlink.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/netlink.c b/netlink.c > > index 9e7cccb..20de9b3 100644 > > --- a/netlink.c > > +++ b/netlink.c > > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > > } > > } > > > > - if (!NLMSG_OK(nh, status) || status > 0) { > > + if (nh->nlmsg_type != NLMSG_DONE && > > + (!NLMSG_OK(nh, status) || status > 0)) { > > /* Process any remaining datagrams in a different > > * buffer so we don't overwrite the first one. > > */ > I was about to add my tested-by when I noticed a weird thing, but that > happens only on the new kernel as well: > > On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp > src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel > scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route > default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 > 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 > 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems > we now have the same local route duplicated for some reason? I can reproduce this, it looks rather weird at the moment. Let's say I have two IPv4 routes on the host: $ ip ro sh default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 if I stop pasta after it re-creates the default route (and not the subnet one, yet), say, at this point (the message with padding confuses strace so we don't get the full decoding, but it's valid): recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], [{ nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156 sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 I see that the kernel already created both in the target namespace: # ip ro sh default via 88.198.0.161 dev eth0 proto dhcp metric 100 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 ...and the second one gets the RTA_PREFSRC attribute we just stripped. I'm still trying to find out what might cause this, even if it looks harmless. -- Stefano ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE 2024-03-15 14:52 ` Stefano Brivio @ 2024-03-15 15:17 ` Stefano Brivio 2024-03-15 15:21 ` Paul Holzinger 0 siblings, 1 reply; 6+ messages in thread From: Stefano Brivio @ 2024-03-15 15:17 UTC (permalink / raw) To: Paul Holzinger; +Cc: passt-dev, Martin Pitt, David Gibson On Fri, 15 Mar 2024 15:52:50 +0100 Stefano Brivio <sbrivio@redhat.com> wrote: > On Fri, 15 Mar 2024 14:11:40 +0100 > Paul Holzinger <pholzing@redhat.com> wrote: > > > On 15/03/2024 12:24, Stefano Brivio wrote: > > > Martin reports that, with Fedora Linux kernel version > > > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, > > > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same > > > read() as families"), pasta doesn't exit once the network namespace > > > is gone. > > > > > > Actually, pasta is completely non-functional, at least with default > > > options, because nl_route_dup(), which duplicates routes from the > > > parent namespace into the target namespace at start-up, is stuck on > > > a second receive operation for RTM_GETROUTE. > > > > > > However, with that commit, the kernel is now able to fit the whole > > > response, including the NLMSG_DONE message, into a single datagram, > > > so no further messages will be received. > > > > > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all > > > responses to a netlink request") accidentally relied on the fact that > > > we would always get at least two datagrams as a response to > > > RTM_GETROUTE. > > > > > > That is, the test to check if we expect another datagram, is based > > > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, > > > but we'll also expect another datagram if NLMSG_OK on the last > > > message is false. But NLMSG_OK with a zero length is always false. > > > > > > The problem is that we don't distinguish if status is zero because > > > we got a NLMSG_DONE message, or because we processed all the > > > available datagram bytes. > > > > > > Introduce an explicit check on NLMSG_DONE. We should probably > > > refactor this slightly, for example by introducing a special return > > > code from nl_status(), but this is probably the least invasive fix > > > for the issue at hand. > > > > > > Reported-by: Martin Pitt <mpitt@redhat.com> > > > Link: https://github.com/containers/podman/issues/22052 > > > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > --- > > > netlink.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/netlink.c b/netlink.c > > > index 9e7cccb..20de9b3 100644 > > > --- a/netlink.c > > > +++ b/netlink.c > > > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > > > } > > > } > > > > > > - if (!NLMSG_OK(nh, status) || status > 0) { > > > + if (nh->nlmsg_type != NLMSG_DONE && > > > + (!NLMSG_OK(nh, status) || status > 0)) { > > > /* Process any remaining datagrams in a different > > > * buffer so we don't overwrite the first one. > > > */ > > I was about to add my tested-by when I noticed a weird thing, but that > > happens only on the new kernel as well: > > > > On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp > > src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel > > scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route > > default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 > > 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 > > 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems > > we now have the same local route duplicated for some reason? > > I can reproduce this, it looks rather weird at the moment. Let's say I > have two IPv4 routes on the host: > > $ ip ro sh > default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 > 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 > > if I stop pasta after it re-creates the default route (and not the > subnet one, yet), say, at this point (the message with padding confuses > strace so we don't get the full decoding, but it's valid): > > recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156 > sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 > > I see that the kernel already created both in the target namespace: > > # ip ro sh > default via 88.198.0.161 dev eth0 proto dhcp metric 100 > 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 Ah, no, the issue is pre-existing, you can actually see the same on an older kernel, without this patch. The subnet route is added autonomously by the kernel after we add the address via RTM_NEWADDR with a ifa_prefixlen < 32 attribute. Then we add it "again" via RTM_NEWROUTE. We could probably avoid the creation of any route on RTM_NEWADDR giving 32 or 128 as prefix length, because anyway we'll explicitly copy the subnet route later, but that needs a bit more pondering -- I'd say it's beyond the scope of this change. -- Stefano ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE 2024-03-15 15:17 ` Stefano Brivio @ 2024-03-15 15:21 ` Paul Holzinger 0 siblings, 0 replies; 6+ messages in thread From: Paul Holzinger @ 2024-03-15 15:21 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Martin Pitt, David Gibson On 15/03/2024 16:17, Stefano Brivio wrote: > On Fri, 15 Mar 2024 15:52:50 +0100 > Stefano Brivio <sbrivio@redhat.com> wrote: > >> On Fri, 15 Mar 2024 14:11:40 +0100 >> Paul Holzinger <pholzing@redhat.com> wrote: >> >>> On 15/03/2024 12:24, Stefano Brivio wrote: >>>> Martin reports that, with Fedora Linux kernel version >>>> kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, >>>> including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same >>>> read() as families"), pasta doesn't exit once the network namespace >>>> is gone. >>>> >>>> Actually, pasta is completely non-functional, at least with default >>>> options, because nl_route_dup(), which duplicates routes from the >>>> parent namespace into the target namespace at start-up, is stuck on >>>> a second receive operation for RTM_GETROUTE. >>>> >>>> However, with that commit, the kernel is now able to fit the whole >>>> response, including the NLMSG_DONE message, into a single datagram, >>>> so no further messages will be received. >>>> >>>> It turns out that commit 4d6e9d0816e2 ("netlink: Always process all >>>> responses to a netlink request") accidentally relied on the fact that >>>> we would always get at least two datagrams as a response to >>>> RTM_GETROUTE. >>>> >>>> That is, the test to check if we expect another datagram, is based >>>> on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, >>>> but we'll also expect another datagram if NLMSG_OK on the last >>>> message is false. But NLMSG_OK with a zero length is always false. >>>> >>>> The problem is that we don't distinguish if status is zero because >>>> we got a NLMSG_DONE message, or because we processed all the >>>> available datagram bytes. >>>> >>>> Introduce an explicit check on NLMSG_DONE. We should probably >>>> refactor this slightly, for example by introducing a special return >>>> code from nl_status(), but this is probably the least invasive fix >>>> for the issue at hand. >>>> >>>> Reported-by: Martin Pitt <mpitt@redhat.com> >>>> Link: https://github.com/containers/podman/issues/22052 >>>> Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") >>>> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> >>>> --- >>>> netlink.c | 3 ++- >>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/netlink.c b/netlink.c >>>> index 9e7cccb..20de9b3 100644 >>>> --- a/netlink.c >>>> +++ b/netlink.c >>>> @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, >>>> } >>>> } >>>> >>>> - if (!NLMSG_OK(nh, status) || status > 0) { >>>> + if (nh->nlmsg_type != NLMSG_DONE && >>>> + (!NLMSG_OK(nh, status) || status > 0)) { >>>> /* Process any remaining datagrams in a different >>>> * buffer so we don't overwrite the first one. >>>> */ >>> I was about to add my tested-by when I noticed a weird thing, but that >>> happens only on the new kernel as well: >>> >>> On the host $ ip route default via 192.168.122.1 dev enp1s0 proto dhcp >>> src 192.168.122.92 metric 100 192.168.122.0/24 dev enp1s0 proto kernel >>> scope link src 192.168.122.92 metric 100 ./pasta --config-net ip route >>> default via 192.168.122.1 dev enp1s0 proto dhcp metric 100 >>> 192.168.122.0/24 dev enp1s0 proto kernel scope link src 192.168.122.92 >>> 192.168.122.0/24 dev enp1s0 proto kernel scope link metric 100 It seems >>> we now have the same local route duplicated for some reason? >> I can reproduce this, it looks rather weird at the moment. Let's say I >> have two IPv4 routes on the host: >> >> $ ip ro sh >> default via 88.198.0.161 dev eth0 proto dhcp src 88.198.0.164 metric 100 >> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 metric 100 >> >> if I stop pasta after it re-creates the default route (and not the >> subnet one, yet), say, at this point (the message with padding confuses >> strace so we don't get the full decoding, but it's valid): >> >> recvfrom(5, [[{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_DHCP, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("88.198.0.161")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=68, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, {rtm_family=AF_INET, rtm_dst_len=27, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=8, nla_type=RTA_DST}, inet_addr("88.198.0.160")], > [{nla_len=8, nla_type=RTA_PRIORITY}, 100], [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("88.198.0.164")], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("eth0")]]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=13, nlmsg_pid=28341}, 0]], 65536, 0, NULL, NULL) = 156 >> sendto(7, [{nlmsg_len=68, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|0x400, nlmsg_seq=14, nlmsg_pid=0}, "\x02\x00\x00\x00\xfe\x10\x00\x01\x00\x00\x00\x00\x08\x00\x0f\x00\xfe\x00\x00\x00\x08\x00\x06\x00\x64\x00\x00\x00\x08\x00\x00\x00"...], 68, 0, NULL, 0) = 68 >> >> I see that the kernel already created both in the target namespace: >> >> # ip ro sh >> default via 88.198.0.161 dev eth0 proto dhcp metric 100 >> 88.198.0.160/27 dev eth0 proto kernel scope link src 88.198.0.164 > Ah, no, the issue is pre-existing, you can actually see the same on an > older kernel, without this patch. The subnet route is added > autonomously by the kernel after we add the address via RTM_NEWADDR > with a ifa_prefixlen < 32 attribute. > > Then we add it "again" via RTM_NEWROUTE. > > We could probably avoid the creation of any route on RTM_NEWADDR giving > 32 or 128 as prefix length, because anyway we'll explicitly copy the > subnet route later, but that needs a bit more pondering -- I'd say it's > beyond the scope of this change. > Yes, agreed. This fix looks correct and fixes the hang which is the most important thing. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE 2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio 2024-03-15 13:11 ` Paul Holzinger @ 2024-03-18 3:16 ` David Gibson 1 sibling, 0 replies; 6+ messages in thread From: David Gibson @ 2024-03-18 3:16 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev, Martin Pitt, Paul Holzinger [-- Attachment #1: Type: text/plain, Size: 3158 bytes --] On Fri, Mar 15, 2024 at 12:24:32PM +0100, Stefano Brivio wrote: > Martin reports that, with Fedora Linux kernel version > kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, > including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same > read() as families"), pasta doesn't exit once the network namespace > is gone. > > Actually, pasta is completely non-functional, at least with default > options, because nl_route_dup(), which duplicates routes from the > parent namespace into the target namespace at start-up, is stuck on > a second receive operation for RTM_GETROUTE. > > However, with that commit, the kernel is now able to fit the whole > response, including the NLMSG_DONE message, into a single datagram, > so no further messages will be received. > > It turns out that commit 4d6e9d0816e2 ("netlink: Always process all > responses to a netlink request") accidentally relied on the fact that > we would always get at least two datagrams as a response to > RTM_GETROUTE. Huh, oops. I feel like the expectation that NLMSG_DONE wold appear in its own datagram was pre-existing, althogh the conseqences of it not being so might have been different and less serious. I recall noticing that the code expected that, and assuming that was part of the ABI. > That is, the test to check if we expect another datagram, is based > on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, > but we'll also expect another datagram if NLMSG_OK on the last > message is false. But NLMSG_OK with a zero length is always false. > > The problem is that we don't distinguish if status is zero because > we got a NLMSG_DONE message, or because we processed all the > available datagram bytes. > > Introduce an explicit check on NLMSG_DONE. We should probably > refactor this slightly, for example by introducing a special return > code from nl_status(), but this is probably the least invasive fix > for the issue at hand. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> On the argument above. I do think we need to check over the netlink code to handle this more carefully, though. > > Reported-by: Martin Pitt <mpitt@redhat.com> > Link: https://github.com/containers/podman/issues/22052 > Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > --- > netlink.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/netlink.c b/netlink.c > index 9e7cccb..20de9b3 100644 > --- a/netlink.c > +++ b/netlink.c > @@ -525,7 +525,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, > } > } > > - if (!NLMSG_OK(nh, status) || status > 0) { > + if (nh->nlmsg_type != NLMSG_DONE && > + (!NLMSG_OK(nh, status) || status > 0)) { > /* Process any remaining datagrams in a different > * buffer so we don't overwrite the first one. > */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-03-18 3:18 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-03-15 11:24 [PATCH] netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE Stefano Brivio 2024-03-15 13:11 ` Paul Holzinger 2024-03-15 14:52 ` Stefano Brivio 2024-03-15 15:17 ` Stefano Brivio 2024-03-15 15:21 ` Paul Holzinger 2024-03-18 3:16 ` David Gibson
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).