public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95
@ 2024-09-06  5:17 David Gibson
  2024-09-06  5:17 ` [PATCH 1/6] flow: Fix incorrect hash probe in flowside_lookup() David Gibson
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

As discussed on IRC through the last day, here's a more polished
version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95
(EPOLLERR storm).  Both of those were sub-problems arising while
investigating podman bug 23686.

We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6),
based on Stefano's testing.  I ended up rewriting Stefano's draft
patch for the EPOLLERR case (bug 95, remaining patches), because I
thought of a possibility we hadn't discussed yet: we weren't getting
an error from the socket error queue, but we might be able to get one
with the SO_ERROR getsockopt().  My examination of the kernel code
suggests that's plausible, and that if that's the case using SO_ERROR
should also clear that error condition.

Link: https://bugs.passt.top/show_bug.cgi?id=94
Link: https://bugs.passt.top/show_bug.cgi?id=95
Link: https://github.com/containers/podman/issues/23686

David Gibson (6):
  flow: Fix incorrect hash probe in flowside_lookup()
  udp: Allow UDP flows to be prematurely closed
  flow: Helpers to log details of a flow
  udp: Split socket error handling out from udp_sock_recv()
  udp: Treat errors getting errors as unrecoverable
  udp: Handle more error conditions in udp_sock_errs()

 flow.c     | 53 ++++++++++++++++++++------------
 flow.h     |  7 +++++
 udp.c      | 88 +++++++++++++++++++++++++++++++++++++++++++++---------
 udp_flow.c | 18 ++++++++++-
 udp_flow.h |  4 +++
 5 files changed, 136 insertions(+), 34 deletions(-)

-- 
2.46.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/6] flow: Fix incorrect hash probe in flowside_lookup()
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06  5:17 ` [PATCH 2/6] udp: Allow UDP flows to be prematurely closed David Gibson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions.  Usually
that's implemented by flow_hash_probe().  However, due to some details we
need a second implementation in flowside_lookup().  An embarrassing
oversight in rebasing from earlier versions has mean that version is
incorrect, trying to step forward through clusters rather than backward.

In situations with the right sorts of has near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state.  If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.

Fixes: acca4235c ("flow, tcp: Generalise TCP hash table to general...")
Link: https://bugs.passt.top/show_bug.cgi?id=94

Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flow.c b/flow.c
index 02631eb2..a00e01d1 100644
--- a/flow.c
+++ b/flow.c
@@ -697,7 +697,7 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
 	       !(FLOW_PROTO(&flow->f) == proto &&
 		 flow->f.pif[sidx.sidei] == pif &&
 		 flowside_eq(&flow->f.side[sidx.sidei], side)))
-		b = (b + 1) % FLOW_HASH_SIZE;
+		b = mod_sub(b, 1, FLOW_HASH_SIZE);
 
 	return flow_hashtab[b];
 }
-- 
@@ -697,7 +697,7 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
 	       !(FLOW_PROTO(&flow->f) == proto &&
 		 flow->f.pif[sidx.sidei] == pif &&
 		 flowside_eq(&flow->f.side[sidx.sidei], side)))
-		b = (b + 1) % FLOW_HASH_SIZE;
+		b = mod_sub(b, 1, FLOW_HASH_SIZE);
 
 	return flow_hashtab[b];
 }
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/6] udp: Allow UDP flows to be prematurely closed
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
  2024-09-06  5:17 ` [PATCH 1/6] flow: Fix incorrect hash probe in flowside_lookup() David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06  5:17 ` [PATCH 3/6] flow: Helpers to log details of a flow David Gibson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Unlike TCP, UDP has no in-band signalling for the end of a flow.  So the
only way we remove flows is on a timer if they have no activity for 180s.

However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow.  We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash).  However it will still wait 3 minutes
to clear away the stale entry.

Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up.  We also publish udp_flow_close() so it
can be called from other places to abort UDP flows().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c     |  3 ++-
 udp_flow.c | 18 +++++++++++++++++-
 udp_flow.h |  4 ++++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index a00e01d1..f2de0414 100644
--- a/flow.c
+++ b/flow.c
@@ -832,7 +832,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 				closed = icmp_ping_timer(c, &flow->ping, now);
 			break;
 		case FLOW_UDP:
-			if (timer)
+			closed = udp_flow_defer(&flow->udp);
+			if (!closed && timer)
 				closed = udp_flow_timer(c, &flow->udp, now);
 			break;
 		default:
diff --git a/udp_flow.c b/udp_flow.c
index 1ff59a9b..b81be2cf 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -39,8 +39,11 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
  * @c:		Execution context
  * @uflow:	UDP flow
  */
-static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
+void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 {
+	if (uflow->closed)
+		return; /* Nothing to do */
+
 	if (uflow->s[INISIDE] >= 0) {
 		/* The listening socket needs to stay in epoll */
 		close(uflow->s[INISIDE]);
@@ -56,6 +59,8 @@ static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
 	if (!pif_is_socket(uflow->f.pif[TGTSIDE]))
 		flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE));
+
+	uflow->closed = true;
 }
 
 /**
@@ -256,6 +261,17 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 	return udp_flow_new(c, flow, -1, now);
 }
 
+/**
+ * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
+ * @uflow:	Flow to handle
+ *
+ * Return: true if the connection is ready to free, false otherwise
+ */
+bool udp_flow_defer(const struct udp_flow *uflow)
+{
+	return uflow->closed;
+}
+
 /**
  * udp_flow_timer() - Handler for timed events related to a given flow
  * @c:		Execution context
diff --git a/udp_flow.h b/udp_flow.h
index 12ddf039..9a1b059c 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -10,6 +10,7 @@
 /**
  * struct udp - Descriptor for a flow of UDP packets
  * @f:		Generic flow information
+ * @closed:	Flow is already closed
  * @ts:		Activity timestamp
  * @s:		Socket fd (or -1) for each side of the flow
  */
@@ -17,6 +18,7 @@ struct udp_flow {
 	/* Must be first element */
 	struct flow_common f;
 
+	bool closed :1;
 	time_t ts;
 	int s[SIDES];
 };
@@ -30,6 +32,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 			      const void *saddr, const void *daddr,
 			      in_port_t srcport, in_port_t dstport,
 			      const struct timespec *now);
+void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
+bool udp_flow_defer(const struct udp_flow *uflow);
 bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
 		    const struct timespec *now);
 
-- 
@@ -10,6 +10,7 @@
 /**
  * struct udp - Descriptor for a flow of UDP packets
  * @f:		Generic flow information
+ * @closed:	Flow is already closed
  * @ts:		Activity timestamp
  * @s:		Socket fd (or -1) for each side of the flow
  */
@@ -17,6 +18,7 @@ struct udp_flow {
 	/* Must be first element */
 	struct flow_common f;
 
+	bool closed :1;
 	time_t ts;
 	int s[SIDES];
 };
@@ -30,6 +32,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
 			      const void *saddr, const void *daddr,
 			      in_port_t srcport, in_port_t dstport,
 			      const struct timespec *now);
+void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
+bool udp_flow_defer(const struct udp_flow *uflow);
 bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
 		    const struct timespec *now);
 
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/6] flow: Helpers to log details of a flow
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
  2024-09-06  5:17 ` [PATCH 1/6] flow: Fix incorrect hash probe in flowside_lookup() David Gibson
  2024-09-06  5:17 ` [PATCH 2/6] udp: Allow UDP flows to be prematurely closed David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06  5:17 ` [PATCH 4/6] udp: Split socket error handling out from udp_sock_recv() David Gibson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging.  We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions.  Add some
helper functions and macros to make it easy to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 48 +++++++++++++++++++++++++++++++-----------------
 flow.h |  7 +++++++
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/flow.c b/flow.c
index f2de0414..1ea112b4 100644
--- a/flow.c
+++ b/flow.c
@@ -283,28 +283,23 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	       "Flow %u (%s): %s", flow_idx(f), type_or_state, msg);
 }
 
-/**
- * flow_set_state() - Change flow's state
- * @f:		Flow changing state
- * @state:	New state
+/** flow_log_details_() - Log the details of a flow
+ * @f:		flow to log
+ * @pri:	Log priority
+ * @state:	State to log details according to
+ *
+ * Logs the details of the flow: endpoints, interfaces, type etc.
  */
-static void flow_set_state(struct flow_common *f, enum flow_state state)
+void flow_log_details_(const struct flow_common *f, int pri,
+		       enum flow_state state)
 {
 	char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
 	char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
 	const struct flowside *ini = &f->side[INISIDE];
 	const struct flowside *tgt = &f->side[TGTSIDE];
-	uint8_t oldstate = f->state;
 
-	ASSERT(state < FLOW_NUM_STATES);
-	ASSERT(oldstate < FLOW_NUM_STATES);
-
-	f->state = state;
-	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
-		  FLOW_STATE(f));
-
-	if (MAX(state, oldstate) >= FLOW_STATE_TGT)
-		flow_log_(f, LOG_DEBUG,
+	if (state >= FLOW_STATE_TGT)
+		flow_log_(f, pri,
 			  "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
 			  pif_name(f->pif[INISIDE]),
 			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
@@ -316,8 +311,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			  tgt->oport,
 			  inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
 			  tgt->eport);
-	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
-		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
+	else if (state >= FLOW_STATE_INI)
+		flow_log_(f, pri, "%s [%s]:%hu -> [%s]:%hu => ?",
 			  pif_name(f->pif[INISIDE]),
 			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
@@ -325,6 +320,25 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			  ini->oport);
 }
 
+/**
+ * flow_set_state() - Change flow's state
+ * @f:		Flow changing state
+ * @state:	New state
+ */
+static void flow_set_state(struct flow_common *f, enum flow_state state)
+{
+	uint8_t oldstate = f->state;
+
+	ASSERT(state < FLOW_NUM_STATES);
+	ASSERT(oldstate < FLOW_NUM_STATES);
+
+	f->state = state;
+	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
+		  FLOW_STATE(f));
+
+	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
+}
+
 /**
  * flow_initiate_() - Move flow to INI, setting pif[INISIDE]
  * @flow:	Flow to change state
diff --git a/flow.h b/flow.h
index d167b654..24ba3ef0 100644
--- a/flow.h
+++ b/flow.h
@@ -264,4 +264,11 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 			flow_dbg((f), __VA_ARGS__);			\
 	} while (0)
 
+void flow_log_details_(const struct flow_common *f, int pri,
+		       enum flow_state state);
+#define flow_log_details(f_, pri) \
+	flow_log_details_(&((f_)->f), (pri), (f_)->f.state)
+#define flow_dbg_details(f_)	flow_log_details((f_), LOG_DEBUG)
+#define flow_err_details(f_)	flow_log_details((f_), LOG_ERR)
+
 #endif /* FLOW_H */
-- 
@@ -264,4 +264,11 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 			flow_dbg((f), __VA_ARGS__);			\
 	} while (0)
 
+void flow_log_details_(const struct flow_common *f, int pri,
+		       enum flow_state state);
+#define flow_log_details(f_, pri) \
+	flow_log_details_(&((f_)->f), (pri), (f_)->f.state)
+#define flow_dbg_details(f_)	flow_log_details((f_), LOG_DEBUG)
+#define flow_err_details(f_)	flow_log_details((f_), LOG_ERR)
+
 #endif /* FLOW_H */
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/6] udp: Split socket error handling out from udp_sock_recv()
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
                   ` (2 preceding siblings ...)
  2024-09-06  5:17 ` [PATCH 3/6] flow: Helpers to log details of a flow David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06  5:17 ` [PATCH 5/6] udp: Treat errors getting errors as unrecoverable David Gibson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding.  That made sense initially, since
both listening and reply sockets need to do this.  However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.

So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors".  We don't
have any of these so far, but some cases we're working on might require it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/udp.c b/udp.c
index aae6d142..fd91b539 100644
--- a/udp.c
+++ b/udp.c
@@ -436,6 +436,30 @@ static bool udp_sock_recverr(int s)
 	return true;
 }
 
+/**
+ * udp_sock_errs() - Process errors on a socket
+ * @c:		Execution context
+ * @s:		Socket to receive from
+ * @events:	epoll events bitmap
+ *
+ * Return: Number of errors handled, or < 0 if we have an unrecoverable error
+ */
+static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
+{
+	unsigned n_err = 0;
+
+	ASSERT(!c->no_udp);
+
+	if (!(events & EPOLLERR))
+		return 0; /* Nothing to do */
+
+	/* Empty the error queue */
+	while (udp_sock_recverr(s))
+		n_err++;
+
+	return n_err;
+}
+
 /**
  * udp_sock_recv() - Receive datagrams from a socket
  * @c:		Execution context
@@ -443,6 +467,8 @@ static bool udp_sock_recverr(int s)
  * @events:	epoll events bitmap
  * @mmh		mmsghdr array to receive into
  *
+ * Return: Number of datagrams received
+ *
  * #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
  */
 static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
@@ -459,12 +485,6 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 
 	ASSERT(!c->no_udp);
 
-	/* Clear any errors first */
-	if (events & EPOLLERR) {
-		while (udp_sock_recverr(s))
-			;
-	}
-
 	if (!(events & EPOLLIN))
 		return 0;
 
@@ -492,6 +512,13 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
 
+	if (udp_sock_errs(c, ref.fd, events) < 0) {
+		err("UDP: Unrecoverable error on listening socket:"
+		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
+		/* FIXME: what now?  close/re-open socket? */
+		return;
+	}
+
 	if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
 		return;
 
@@ -566,6 +593,13 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	ASSERT(!c->no_udp && uflow);
 
+	if (udp_sock_errs(c, from_s, events) < 0) {
+		flow_err(uflow, "Unrecoverable error on reply socket");
+		flow_err_details(uflow);
+		udp_flow_close(c, uflow);
+		return;
+	}
+
 	if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
 		return;
 
-- 
@@ -436,6 +436,30 @@ static bool udp_sock_recverr(int s)
 	return true;
 }
 
+/**
+ * udp_sock_errs() - Process errors on a socket
+ * @c:		Execution context
+ * @s:		Socket to receive from
+ * @events:	epoll events bitmap
+ *
+ * Return: Number of errors handled, or < 0 if we have an unrecoverable error
+ */
+static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
+{
+	unsigned n_err = 0;
+
+	ASSERT(!c->no_udp);
+
+	if (!(events & EPOLLERR))
+		return 0; /* Nothing to do */
+
+	/* Empty the error queue */
+	while (udp_sock_recverr(s))
+		n_err++;
+
+	return n_err;
+}
+
 /**
  * udp_sock_recv() - Receive datagrams from a socket
  * @c:		Execution context
@@ -443,6 +467,8 @@ static bool udp_sock_recverr(int s)
  * @events:	epoll events bitmap
  * @mmh		mmsghdr array to receive into
  *
+ * Return: Number of datagrams received
+ *
  * #syscalls recvmmsg arm:recvmmsg_time64 i686:recvmmsg_time64
  */
 static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
@@ -459,12 +485,6 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 
 	ASSERT(!c->no_udp);
 
-	/* Clear any errors first */
-	if (events & EPOLLERR) {
-		while (udp_sock_recverr(s))
-			;
-	}
-
 	if (!(events & EPOLLIN))
 		return 0;
 
@@ -492,6 +512,13 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
 
+	if (udp_sock_errs(c, ref.fd, events) < 0) {
+		err("UDP: Unrecoverable error on listening socket:"
+		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
+		/* FIXME: what now?  close/re-open socket? */
+		return;
+	}
+
 	if ((n = udp_sock_recv(c, ref.fd, events, udp_mh_recv)) <= 0)
 		return;
 
@@ -566,6 +593,13 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 
 	ASSERT(!c->no_udp && uflow);
 
+	if (udp_sock_errs(c, from_s, events) < 0) {
+		flow_err(uflow, "Unrecoverable error on reply socket");
+		flow_err_details(uflow);
+		udp_flow_close(c, uflow);
+		return;
+	}
+
 	if ((n = udp_sock_recv(c, from_s, events, udp_mh_recv)) <= 0)
 		return;
 
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 5/6] udp: Treat errors getting errors as unrecoverable
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
                   ` (3 preceding siblings ...)
  2024-09-06  5:17 ` [PATCH 4/6] udp: Split socket error handling out from udp_sock_recv() David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06  5:17 ` [PATCH 6/6] udp: Handle more error conditions in udp_sock_errs() David Gibson
  2024-09-06 11:29 ` [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 Stefano Brivio
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

We can get network errors, usually transient, reported via the socket error
queue.  However, at least theoretically, we could get errors trying to
read the queue itself.  Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/udp.c b/udp.c
index fd91b539..fe5df995 100644
--- a/udp.c
+++ b/udp.c
@@ -387,11 +387,12 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
  * udp_sock_recverr() - Receive and clear an error from a socket
  * @s:		Socket to receive from
  *
- * Return: true if errors received and processed, false if no more errors
+ * Return: 1 if error received and processed, 0 if no more errors in queue, < 0
+ *         if there was an error reading the queue
  *
  * #syscalls recvmsg
  */
-static bool udp_sock_recverr(int s)
+static int udp_sock_recverr(int s)
 {
 	const struct sock_extended_err *ee;
 	const struct cmsghdr *hdr;
@@ -408,14 +409,16 @@ static bool udp_sock_recverr(int s)
 
 	rc = recvmsg(s, &mh, MSG_ERRQUEUE);
 	if (rc < 0) {
-		if (errno != EAGAIN && errno != EWOULDBLOCK)
-			err_perror("Failed to read error queue");
-		return false;
+		if (errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+
+		err_perror("UDP: Failed to read error queue");
+		return -1;
 	}
 
 	if (!(mh.msg_flags & MSG_ERRQUEUE)) {
 		err("Missing MSG_ERRQUEUE flag reading error queue");
-		return false;
+		return -1;
 	}
 
 	hdr = CMSG_FIRSTHDR(&mh);
@@ -424,7 +427,7 @@ static bool udp_sock_recverr(int s)
 	      (hdr->cmsg_level == IPPROTO_IPV6 &&
 	       hdr->cmsg_type == IPV6_RECVERR))) {
 		err("Unexpected cmsg reading error queue");
-		return false;
+		return -1;
 	}
 
 	ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
@@ -433,7 +436,7 @@ static bool udp_sock_recverr(int s)
 	debug("%s error on UDP socket %i: %s",
 	      str_ee_origin(ee), s, strerror(ee->ee_errno));
 
-	return true;
+	return 1;
 }
 
 /**
@@ -447,6 +450,7 @@ static bool udp_sock_recverr(int s)
 static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 {
 	unsigned n_err = 0;
+	int rc;
 
 	ASSERT(!c->no_udp);
 
@@ -454,8 +458,11 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 		return 0; /* Nothing to do */
 
 	/* Empty the error queue */
-	while (udp_sock_recverr(s))
-		n_err++;
+	while ((rc = udp_sock_recverr(s)) > 0)
+		n_err += rc;
+
+	if (rc < 0)
+		return -1; /* error reading error, unrecoverable */
 
 	return n_err;
 }
-- 
@@ -387,11 +387,12 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
  * udp_sock_recverr() - Receive and clear an error from a socket
  * @s:		Socket to receive from
  *
- * Return: true if errors received and processed, false if no more errors
+ * Return: 1 if error received and processed, 0 if no more errors in queue, < 0
+ *         if there was an error reading the queue
  *
  * #syscalls recvmsg
  */
-static bool udp_sock_recverr(int s)
+static int udp_sock_recverr(int s)
 {
 	const struct sock_extended_err *ee;
 	const struct cmsghdr *hdr;
@@ -408,14 +409,16 @@ static bool udp_sock_recverr(int s)
 
 	rc = recvmsg(s, &mh, MSG_ERRQUEUE);
 	if (rc < 0) {
-		if (errno != EAGAIN && errno != EWOULDBLOCK)
-			err_perror("Failed to read error queue");
-		return false;
+		if (errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+
+		err_perror("UDP: Failed to read error queue");
+		return -1;
 	}
 
 	if (!(mh.msg_flags & MSG_ERRQUEUE)) {
 		err("Missing MSG_ERRQUEUE flag reading error queue");
-		return false;
+		return -1;
 	}
 
 	hdr = CMSG_FIRSTHDR(&mh);
@@ -424,7 +427,7 @@ static bool udp_sock_recverr(int s)
 	      (hdr->cmsg_level == IPPROTO_IPV6 &&
 	       hdr->cmsg_type == IPV6_RECVERR))) {
 		err("Unexpected cmsg reading error queue");
-		return false;
+		return -1;
 	}
 
 	ee = (const struct sock_extended_err *)CMSG_DATA(hdr);
@@ -433,7 +436,7 @@ static bool udp_sock_recverr(int s)
 	debug("%s error on UDP socket %i: %s",
 	      str_ee_origin(ee), s, strerror(ee->ee_errno));
 
-	return true;
+	return 1;
 }
 
 /**
@@ -447,6 +450,7 @@ static bool udp_sock_recverr(int s)
 static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 {
 	unsigned n_err = 0;
+	int rc;
 
 	ASSERT(!c->no_udp);
 
@@ -454,8 +458,11 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 		return 0; /* Nothing to do */
 
 	/* Empty the error queue */
-	while (udp_sock_recverr(s))
-		n_err++;
+	while ((rc = udp_sock_recverr(s)) > 0)
+		n_err += rc;
+
+	if (rc < 0)
+		return -1; /* error reading error, unrecoverable */
 
 	return n_err;
 }
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 6/6] udp: Handle more error conditions in udp_sock_errs()
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
                   ` (4 preceding siblings ...)
  2024-09-06  5:17 ` [PATCH 5/6] udp: Treat errors getting errors as unrecoverable David Gibson
@ 2024-09-06  5:17 ` David Gibson
  2024-09-06 11:29 ` [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 Stefano Brivio
  6 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2024-09-06  5:17 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

udp_sock_errs() reads out everything in the socket error queue.  However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.

One possibility is that the error is reported instead by the SO_ERROR
sockopt.  Check for that case and report it as best we can.  If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.

[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010

Link: https://bugs.passt.top/show_bug.cgi?id=95

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/udp.c b/udp.c
index fe5df995..2ba00c9c 100644
--- a/udp.c
+++ b/udp.c
@@ -450,7 +450,8 @@ static int udp_sock_recverr(int s)
 static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 {
 	unsigned n_err = 0;
-	int rc;
+	socklen_t errlen;
+	int rc, err;
 
 	ASSERT(!c->no_udp);
 
@@ -464,6 +465,24 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 	if (rc < 0)
 		return -1; /* error reading error, unrecoverable */
 
+	errlen = sizeof(err);
+	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 ||
+	    errlen != sizeof(err)) {
+		err_perror("Error reading SO_ERROR");
+		return -1;  /* error reading error, unrecoverable */
+	}
+
+	if (err) {
+		debug("Unqueued error on UDP socket %i: %s", s, strerror(err));
+		n_err++;
+	}
+
+	if (!n_err) {
+		/* EPOLLERR, but no errors to clear !? */
+		err("EPOLLERR event without reported errors on socket %i", s);
+		return -1; /* no way to clear, unrecoverable */
+	}
+
 	return n_err;
 }
 
-- 
@@ -450,7 +450,8 @@ static int udp_sock_recverr(int s)
 static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 {
 	unsigned n_err = 0;
-	int rc;
+	socklen_t errlen;
+	int rc, err;
 
 	ASSERT(!c->no_udp);
 
@@ -464,6 +465,24 @@ static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 	if (rc < 0)
 		return -1; /* error reading error, unrecoverable */
 
+	errlen = sizeof(err);
+	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 ||
+	    errlen != sizeof(err)) {
+		err_perror("Error reading SO_ERROR");
+		return -1;  /* error reading error, unrecoverable */
+	}
+
+	if (err) {
+		debug("Unqueued error on UDP socket %i: %s", s, strerror(err));
+		n_err++;
+	}
+
+	if (!n_err) {
+		/* EPOLLERR, but no errors to clear !? */
+		err("EPOLLERR event without reported errors on socket %i", s);
+		return -1; /* no way to clear, unrecoverable */
+	}
+
 	return n_err;
 }
 
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95
  2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
                   ` (5 preceding siblings ...)
  2024-09-06  5:17 ` [PATCH 6/6] udp: Handle more error conditions in udp_sock_errs() David Gibson
@ 2024-09-06 11:29 ` Stefano Brivio
  6 siblings, 0 replies; 8+ messages in thread
From: Stefano Brivio @ 2024-09-06 11:29 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri,  6 Sep 2024 15:17:04 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> As discussed on IRC through the last day, here's a more polished
> version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95
> (EPOLLERR storm).  Both of those were sub-problems arising while
> investigating podman bug 23686.
> 
> We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6),
> based on Stefano's testing.  I ended up rewriting Stefano's draft
> patch for the EPOLLERR case (bug 95, remaining patches), because I
> thought of a possibility we hadn't discussed yet: we weren't getting
> an error from the socket error queue, but we might be able to get one
> with the SO_ERROR getsockopt().  My examination of the kernel code
> suggests that's plausible, and that if that's the case using SO_ERROR
> should also clear that error condition.
> 
> Link: https://bugs.passt.top/show_bug.cgi?id=94
> Link: https://bugs.passt.top/show_bug.cgi?id=95
> Link: https://github.com/containers/podman/issues/23686

Applied.

-- 
Stefano


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-09-06 11:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-06  5:17 [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 David Gibson
2024-09-06  5:17 ` [PATCH 1/6] flow: Fix incorrect hash probe in flowside_lookup() David Gibson
2024-09-06  5:17 ` [PATCH 2/6] udp: Allow UDP flows to be prematurely closed David Gibson
2024-09-06  5:17 ` [PATCH 3/6] flow: Helpers to log details of a flow David Gibson
2024-09-06  5:17 ` [PATCH 4/6] udp: Split socket error handling out from udp_sock_recv() David Gibson
2024-09-06  5:17 ` [PATCH 5/6] udp: Treat errors getting errors as unrecoverable David Gibson
2024-09-06  5:17 ` [PATCH 6/6] udp: Handle more error conditions in udp_sock_errs() David Gibson
2024-09-06 11:29 ` [PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95 Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).