public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH 0/6] RFC: Clean up tap-side event handling
@ 2024-09-03 12:02 David Gibson
  2024-09-03 12:02 ` [PATCH 1/6] tap: Split out handling of EPOLLIN events David Gibson
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This is a draft patch working towards adding EPOLLOUT handling to the
tap code, which could then be used to "unstick" flows which have
unsent data from the socket side.  For now that's just a stub, but
makes what I think are some worthwhile cleanups to the tap side event
handling in the meantime.

David Gibson (6):
  tap: Split out handling of EPOLLIN events
  tap: Improve handling of EINTR in tap_passt_input()
  tap: Restructure in tap_pasta_input()
  tap: Don't risk truncating frames on full buffer in tap_pasta_input()
  tap: Re-introduce EPOLLET for tap connections
  tap: Stub EPOLLOUT handling

 tap.c | 131 +++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 88 insertions(+), 43 deletions(-)

-- 
2.46.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/6] tap: Split out handling of EPOLLIN events
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25   ` Stefano Brivio
  2024-09-03 12:02 ` [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input() David Gibson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 50 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 36 insertions(+), 14 deletions(-)

diff --git a/tap.c b/tap.c
index 852d8376..14c88871 100644
--- a/tap.c
+++ b/tap.c
@@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c)
 }
 
 /**
- * tap_handler_passt() - Packet handler for AF_UNIX file descriptor
+ * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
- * @events:	epoll events
  * @now:	Current timestamp
  */
-void tap_handler_passt(struct ctx *c, uint32_t events,
-		       const struct timespec *now)
+static void tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
 	ssize_t n;
 	char *p;
 
-	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
-		tap_sock_reset(c);
-		return;
-	}
-
 	tap_flush_pools();
 
 	if (partial_len) {
@@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 }
 
 /**
- * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
+ * tap_handler_passt() - Event handler for AF_UNIX file descriptor
  * @c:		Execution context
  * @events:	epoll events
  * @now:	Current timestamp
  */
-void tap_handler_pasta(struct ctx *c, uint32_t events,
+void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
+{
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
+		tap_sock_reset(c);
+		return;
+	}
+
+	if (events & EPOLLIN)
+		tap_passt_input(c, now);
+}
+
+/**
+ * tap_passt_input() - Handler for new data on the socket to qemu
+ * @c:		Execution context
+ * @now:	Current timestamp
+ */
+static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 	int ret;
 
-	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
-		die("Disconnect event on /dev/net/tun device, exiting");
-
 redo:
 	n = 0;
 
@@ -1102,6 +1108,22 @@ restart:
 	die("Error on tap device, exiting");
 }
 
+/**
+ * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
+ * @c:		Execution context
+ * @events:	epoll events
+ * @now:	Current timestamp
+ */
+void tap_handler_pasta(struct ctx *c, uint32_t events,
+		       const struct timespec *now)
+{
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
+		die("Disconnect event on /dev/net/tun device, exiting");
+
+	if (events & EPOLLIN)
+		tap_pasta_input(c, now);
+}
+
 /**
  * tap_sock_unix_open() - Create and bind AF_UNIX socket
  * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
-- 
@@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c)
 }
 
 /**
- * tap_handler_passt() - Packet handler for AF_UNIX file descriptor
+ * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
- * @events:	epoll events
  * @now:	Current timestamp
  */
-void tap_handler_passt(struct ctx *c, uint32_t events,
-		       const struct timespec *now)
+static void tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
 	ssize_t n;
 	char *p;
 
-	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
-		tap_sock_reset(c);
-		return;
-	}
-
 	tap_flush_pools();
 
 	if (partial_len) {
@@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 }
 
 /**
- * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
+ * tap_handler_passt() - Event handler for AF_UNIX file descriptor
  * @c:		Execution context
  * @events:	epoll events
  * @now:	Current timestamp
  */
-void tap_handler_pasta(struct ctx *c, uint32_t events,
+void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
+{
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
+		tap_sock_reset(c);
+		return;
+	}
+
+	if (events & EPOLLIN)
+		tap_passt_input(c, now);
+}
+
+/**
+ * tap_passt_input() - Handler for new data on the socket to qemu
+ * @c:		Execution context
+ * @now:	Current timestamp
+ */
+static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 	int ret;
 
-	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
-		die("Disconnect event on /dev/net/tun device, exiting");
-
 redo:
 	n = 0;
 
@@ -1102,6 +1108,22 @@ restart:
 	die("Error on tap device, exiting");
 }
 
+/**
+ * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
+ * @c:		Execution context
+ * @events:	epoll events
+ * @now:	Current timestamp
+ */
+void tap_handler_pasta(struct ctx *c, uint32_t events,
+		       const struct timespec *now)
+{
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
+		die("Disconnect event on /dev/net/tun device, exiting");
+
+	if (events & EPOLLIN)
+		tap_pasta_input(c, now);
+}
+
 /**
  * tap_sock_unix_open() - Create and bind AF_UNIX socket
  * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
  2024-09-03 12:02 ` [PATCH 1/6] tap: Split out handling of EPOLLIN events David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25   ` Stefano Brivio
  2024-09-03 12:02 ` [PATCH 3/6] tap: Restructure in tap_pasta_input() David Gibson
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
three cases it returns from the function.  That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again.  For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tap.c b/tap.c
index 14c88871..9ee59faa 100644
--- a/tap.c
+++ b/tap.c
@@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		memmove(pkt_buf, partial_frame, partial_len);
 	}
 
-	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
-		 MSG_DONTWAIT);
+	do {
+		n = recv(c->fd_tap, pkt_buf + partial_len,
+			 TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
+	} while ((n < 0) && errno == EINTR);
+
 	if (n < 0) {
-		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+		if (errno != EAGAIN && errno != EWOULDBLOCK) {
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-- 
@@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		memmove(pkt_buf, partial_frame, partial_len);
 	}
 
-	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
-		 MSG_DONTWAIT);
+	do {
+		n = recv(c->fd_tap, pkt_buf + partial_len,
+			 TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
+	} while ((n < 0) && errno == EINTR);
+
 	if (n < 0) {
-		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+		if (errno != EAGAIN && errno != EWOULDBLOCK) {
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/6] tap: Restructure in tap_pasta_input()
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
  2024-09-03 12:02 ` [PATCH 1/6] tap: Split out handling of EPOLLIN events David Gibson
  2024-09-03 12:02 ` [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input() David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25   ` Stefano Brivio
  2024-09-03 12:02 ` [PATCH 4/6] tap: Don't risk truncating frames on full buffer " David Gibson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around.  This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass.  This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.

Along the way handle a couple of extra edge cases:
 - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
   ports where those might not have the same value
 - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/tap.c b/tap.c
index 9ee59faa..71742748 100644
--- a/tap.c
+++ b/tap.c
@@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
-	int ret;
-
-redo:
-	n = 0;
 
 	tap_flush_pools();
-restart:
-	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
 
-		if (len < (ssize_t)sizeof(struct ethhdr) ||
-		    len > (ssize_t)ETH_MAX_MTU) {
-			n += len;
-			continue;
+	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
+		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
+
+		if (len <= 0) {
+			if (errno == EINTR) {
+				len = 0;
+				continue;
+			}
+			break;
 		}
 
+		/* Ignore frames of bad length */
+		if (len < (ssize_t)sizeof(struct ethhdr) ||
+		    len > (ssize_t)ETH_MAX_MTU)
+			continue;
 
 		tap_add_packet(c, len, pkt_buf + n);
-
-		if ((n += len) == TAP_BUF_BYTES)
-			break;
 	}
 
-	if (len < 0 && errno == EINTR)
-		goto restart;
-
-	ret = errno;
+	if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
+		die("Error on tap device, exiting");
+	else if (len == 0)
+		die("EOF on tap device, exiting");
 
 	tap_handler(c, now);
-
-	if (len > 0 || ret == EAGAIN)
-		return;
-
-	if (n == TAP_BUF_BYTES)
-		goto redo;
-
-	die("Error on tap device, exiting");
 }
 
 /**
-- 
@@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
-	int ret;
-
-redo:
-	n = 0;
 
 	tap_flush_pools();
-restart:
-	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
 
-		if (len < (ssize_t)sizeof(struct ethhdr) ||
-		    len > (ssize_t)ETH_MAX_MTU) {
-			n += len;
-			continue;
+	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
+		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
+
+		if (len <= 0) {
+			if (errno == EINTR) {
+				len = 0;
+				continue;
+			}
+			break;
 		}
 
+		/* Ignore frames of bad length */
+		if (len < (ssize_t)sizeof(struct ethhdr) ||
+		    len > (ssize_t)ETH_MAX_MTU)
+			continue;
 
 		tap_add_packet(c, len, pkt_buf + n);
-
-		if ((n += len) == TAP_BUF_BYTES)
-			break;
 	}
 
-	if (len < 0 && errno == EINTR)
-		goto restart;
-
-	ret = errno;
+	if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
+		die("Error on tap device, exiting");
+	else if (len == 0)
+		die("EOF on tap device, exiting");
 
 	tap_handler(c, now);
-
-	if (len > 0 || ret == EAGAIN)
-		return;
-
-	if (n == TAP_BUF_BYTES)
-		goto redo;
-
-	die("Error on tap device, exiting");
 }
 
 /**
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
                   ` (2 preceding siblings ...)
  2024-09-03 12:02 ` [PATCH 3/6] tap: Restructure in tap_pasta_input() David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25   ` Stefano Brivio
  2024-09-03 12:02 ` [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections David Gibson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tap_pasta_input() keeps reading frames from the tap device until the
buffer is full.  However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer.  If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tap.c b/tap.c
index 71742748..2fbcef04 100644
--- a/tap.c
+++ b/tap.c
@@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 	tap_flush_pools();
 
-	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
-		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
+	for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
+		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
 
 		if (len <= 0) {
 			if (errno == EINTR) {
-- 
@@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 	tap_flush_pools();
 
-	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
-		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
+	for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
+		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
 
 		if (len <= 0) {
 			if (errno == EINTR) {
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
                   ` (3 preceding siblings ...)
  2024-09-03 12:02 ` [PATCH 4/6] tap: Don't risk truncating frames on full buffer " David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25   ` Stefano Brivio
  2024-09-03 12:02 ` [PATCH 6/6] tap: Stub EPOLLOUT handling David Gibson
  2024-09-03 19:25 ` [PATCH 0/6] RFC: Clean up tap-side event handling Stefano Brivio
  6 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only
used level-triggered events for the tap device.  Prior to that we used it
inconsistently which caused some problems.

We want to add support for EPOLLOUT events on the tap connection, and
without EPOLLET that would require toggling EPOLLOUT on and off, which is
awkward.  So, re-introduce EPOLLET, but now use it uniformly for all tap
modes.  The main change this requires is making sure on EPOLLIN we loop
until all there's no more data to process.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c
index 2fbcef04..d7d3fc19 100644
--- a/tap.c
+++ b/tap.c
@@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_passt_input(struct ctx *c, const struct timespec *now)
+static bool tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
@@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-		return;
+		return false;
 	}
 
 	p = pkt_buf;
@@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
-			return;
+			return false;
 		}
 
 		if (l2len + sizeof(uint32_t) > (size_t)n)
@@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	partial_frame = p;
 
 	tap_handler(c, now);
+
+	return true;
 }
 
 /**
@@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 	}
 
 	if (events & EPOLLIN)
-		tap_passt_input(c, now);
+		while (tap_passt_input(c, now))
+			;
 }
 
 /**
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_pasta_input(struct ctx *c, const struct timespec *now)
+static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 
@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 		die("EOF on tap device, exiting");
 
 	tap_handler(c, now);
+
+	return len > 0;
 }
 
 /**
@@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		die("Disconnect event on /dev/net/tun device, exiting");
 
 	if (events & EPOLLIN)
-		tap_pasta_input(c, now);
+		while (tap_pasta_input(c, now))
+			;
 }
 
 /**
@@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c)
 	pasta_ns_conf(c);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
@@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_passt_input(struct ctx *c, const struct timespec *now)
+static bool tap_passt_input(struct ctx *c, const struct timespec *now)
 {
 	static const char *partial_frame;
 	static ssize_t partial_len = 0;
@@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 			err_perror("Receive error on guest connection, reset");
 			tap_sock_reset(c);
 		}
-		return;
+		return false;
 	}
 
 	p = pkt_buf;
@@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
-			return;
+			return false;
 		}
 
 		if (l2len + sizeof(uint32_t) > (size_t)n)
@@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	partial_frame = p;
 
 	tap_handler(c, now);
+
+	return true;
 }
 
 /**
@@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 	}
 
 	if (events & EPOLLIN)
-		tap_passt_input(c, now);
+		while (tap_passt_input(c, now))
+			;
 }
 
 /**
  * tap_passt_input() - Handler for new data on the socket to qemu
  * @c:		Execution context
  * @now:	Current timestamp
+ *
+ * Return: true if there may be additional data to read, false otherwise
  */
-static void tap_pasta_input(struct ctx *c, const struct timespec *now)
+static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
 {
 	ssize_t n, len;
 
@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 		die("EOF on tap device, exiting");
 
 	tap_handler(c, now);
+
+	return len > 0;
 }
 
 /**
@@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		die("Disconnect event on /dev/net/tun device, exiting");
 
 	if (events & EPOLLIN)
-		tap_pasta_input(c, now);
+		while (tap_pasta_input(c, now))
+			;
 }
 
 /**
@@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c)
 	pasta_ns_conf(c);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/6] tap: Stub EPOLLOUT handling
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
                   ` (4 preceding siblings ...)
  2024-09-03 12:02 ` [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections David Gibson
@ 2024-09-03 12:02 ` David Gibson
  2024-09-03 19:25 ` [PATCH 0/6] RFC: Clean up tap-side event handling Stefano Brivio
  6 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-03 12:02 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Process EPOLLOUT events.  For now this is just a stub, and needs to be
extended to call into the various protocols to actually do something
useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tap.c b/tap.c
index d7d3fc19..3754fffe 100644
--- a/tap.c
+++ b/tap.c
@@ -932,6 +932,18 @@ void tap_handler(struct ctx *c, const struct timespec *now)
 	tap6_handler(c, pool_tap6, now);
 }
 
+/**
+ * tap_ready() - Notify the rest of passt that the tap device is ready for more data
+ * @c:		Execution context
+ * @now:	Current timestamp
+ */
+static void tap_ready(struct ctx *c, const struct timespec *now)
+{
+	(void)c;
+	(void)now;
+	/* TODO: notify protocols */
+}
+
 /**
  * tap_add_packet() - Queue/capture packet, update notion of guest MAC address
  * @c:		Execution context
@@ -1068,6 +1080,9 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 	if (events & EPOLLIN)
 		while (tap_passt_input(c, now))
 			;
+
+	if (events & EPOLLOUT)
+		tap_ready(c, now);
 }
 
 /**
@@ -1127,6 +1142,9 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 	if (events & EPOLLIN)
 		while (tap_pasta_input(c, now))
 			;
+
+	if (events & EPOLLOUT)
+		tap_ready(c, now);
 }
 
 /**
-- 
@@ -932,6 +932,18 @@ void tap_handler(struct ctx *c, const struct timespec *now)
 	tap6_handler(c, pool_tap6, now);
 }
 
+/**
+ * tap_ready() - Notify the rest of passt that the tap device is ready for more data
+ * @c:		Execution context
+ * @now:	Current timestamp
+ */
+static void tap_ready(struct ctx *c, const struct timespec *now)
+{
+	(void)c;
+	(void)now;
+	/* TODO: notify protocols */
+}
+
 /**
  * tap_add_packet() - Queue/capture packet, update notion of guest MAC address
  * @c:		Execution context
@@ -1068,6 +1080,9 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 	if (events & EPOLLIN)
 		while (tap_passt_input(c, now))
 			;
+
+	if (events & EPOLLOUT)
+		tap_ready(c, now);
 }
 
 /**
@@ -1127,6 +1142,9 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 	if (events & EPOLLIN)
 		while (tap_pasta_input(c, now))
 			;
+
+	if (events & EPOLLOUT)
+		tap_ready(c, now);
 }
 
 /**
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] tap: Split out handling of EPOLLIN events
  2024-09-03 12:02 ` [PATCH 1/6] tap: Split out handling of EPOLLIN events David Gibson
@ 2024-09-03 19:25   ` Stefano Brivio
  2024-09-04  1:17     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:30 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
> EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
> cases that may want to also handle EPOLLOUT, so in preparation explicitly
> handle EPOLLIN, moving the logic to a subfunction.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 50 ++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 36 insertions(+), 14 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 852d8376..14c88871 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c)
>  }
>  
>  /**
> - * tap_handler_passt() - Packet handler for AF_UNIX file descriptor
> + * tap_passt_input() - Handler for new data on the socket to qemu
>   * @c:		Execution context
> - * @events:	epoll events
>   * @now:	Current timestamp
>   */
> -void tap_handler_passt(struct ctx *c, uint32_t events,
> -		       const struct timespec *now)
> +static void tap_passt_input(struct ctx *c, const struct timespec *now)
>  {
>  	static const char *partial_frame;
>  	static ssize_t partial_len = 0;
>  	ssize_t n;
>  	char *p;
>  
> -	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
> -		tap_sock_reset(c);
> -		return;
> -	}
> -
>  	tap_flush_pools();
>  
>  	if (partial_len) {
> @@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
>  }
>  
>  /**
> - * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
> + * tap_handler_passt() - Event handler for AF_UNIX file descriptor
>   * @c:		Execution context
>   * @events:	epoll events
>   * @now:	Current timestamp
>   */
> -void tap_handler_pasta(struct ctx *c, uint32_t events,
> +void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now)
> +{
> +	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
> +		tap_sock_reset(c);
> +		return;
> +	}
> +
> +	if (events & EPOLLIN)
> +		tap_passt_input(c, now);
> +}
> +
> +/**
> + * tap_passt_input() - Handler for new data on the socket to qemu

tap_pasta_input(), QEMU (we could just call it hypervisor, given that
libkrun/krun also use this path).

> + * @c:		Execution context
> + * @now:	Current timestamp
> + */
> +static void tap_pasta_input(struct ctx *c, const struct timespec *now)
>  {
>  	ssize_t n, len;
>  	int ret;
>  
> -	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
> -		die("Disconnect event on /dev/net/tun device, exiting");
> -
>  redo:
>  	n = 0;
>  
> @@ -1102,6 +1108,22 @@ restart:
>  	die("Error on tap device, exiting");
>  }
>  
> +/**
> + * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor
> + * @c:		Execution context
> + * @events:	epoll events
> + * @now:	Current timestamp
> + */
> +void tap_handler_pasta(struct ctx *c, uint32_t events,
> +		       const struct timespec *now)
> +{
> +	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR))
> +		die("Disconnect event on /dev/net/tun device, exiting");
> +
> +	if (events & EPOLLIN)
> +		tap_pasta_input(c, now);
> +}
> +
>  /**
>   * tap_sock_unix_open() - Create and bind AF_UNIX socket
>   * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()
  2024-09-03 12:02 ` [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input() David Gibson
@ 2024-09-03 19:25   ` Stefano Brivio
  2024-09-04  1:30     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:31 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> When tap_passt_input() gets an error from recv() it (correctly) does not
> print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
> three cases it returns from the function.  That makes sense for EAGAIN and
> EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
> trying again.  For EINTR, however, it makes more sense to retry immediately
> - as it stands we're likely to get a renewer EPOLLIN event immediately in
> that case, since we're using level triggered signalling.
> 
> So, handle EINTR separately by immediately retrying until we succeed or
> get a different type of error.

I don't see an actual improvement: we don't know why we would get EINTR
(because of signals) so repeating the recv() right away isn't
necessarily a better choice.

I'd say whatever way we have of carrying on on EINTR is fine. If this
patch helps with brevity for the next ones, it makes sense, but
otherwise I don't see a real advantage.

Well, it's more consistent with the way we handle EINTR on other calls,
but that's about it.

> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 14c88871..9ee59faa 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
>  		memmove(pkt_buf, partial_frame, partial_len);
>  	}
>  
> -	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
> -		 MSG_DONTWAIT);
> +	do {
> +		n = recv(c->fd_tap, pkt_buf + partial_len,
> +			 TAP_BUF_BYTES - partial_len, MSG_DONTWAIT);
> +	} while ((n < 0) && errno == EINTR);
> +
>  	if (n < 0) {
> -		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
>  			err_perror("Receive error on guest connection, reset");
>  			tap_sock_reset(c);
>  		}

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] tap: Restructure in tap_pasta_input()
  2024-09-03 12:02 ` [PATCH 3/6] tap: Restructure in tap_pasta_input() David Gibson
@ 2024-09-03 19:25   ` Stefano Brivio
  2024-09-04  1:33     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:32 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> tap_pasta_input() has a rather confusing structure, using two gotos.
> Remove these by restructuring the function to have the main loop condition
> based on filling our buffer space, with errors or running out of data
> treated as the exception, rather than the other way around.  This allows
> us to handle the EINTR which triggered the 'restart' goto with a continue.
> 
> The outer 'redo' was triggered if we completely filled our buffer, to flush
> it and do another pass.  This one is unnecessary since we don't (yet) use
> EPOLLET on the tap device: if there's still more data we'll get another
> event and re-enter the loop.
> 
> Along the way handle a couple of extra edge cases:
>  - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
>    ports where those might not have the same value
>  - Detect EOF on the tap device and exit in that case
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 9ee59faa..71742748 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
>  static void tap_pasta_input(struct ctx *c, const struct timespec *now)
>  {
>  	ssize_t n, len;
> -	int ret;
> -
> -redo:
> -	n = 0;
>  
>  	tap_flush_pools();
> -restart:
> -	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
>  
> -		if (len < (ssize_t)sizeof(struct ethhdr) ||
> -		    len > (ssize_t)ETH_MAX_MTU) {
> -			n += len;
> -			continue;
> +	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
> +		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
> +
> +		if (len <= 0) {
> +			if (errno == EINTR) {

You might be checking errno when read() returns 0, in which case it's
not set. I guess you should zero out errno before read() if you want to
keep this, or, alternatively, directly handle len == 0 here.

> +				len = 0;
> +				continue;
> +			}
> +			break;
>  		}
>  
> +		/* Ignore frames of bad length */
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU)
> +			continue;
>  
>  		tap_add_packet(c, len, pkt_buf + n);
> -
> -		if ((n += len) == TAP_BUF_BYTES)
> -			break;
>  	}
>  
> -	if (len < 0 && errno == EINTR)
> -		goto restart;
> -
> -	ret = errno;
> +	if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
> +		die("Error on tap device, exiting");
> +	else if (len == 0)
> +		die("EOF on tap device, exiting");
>  
>  	tap_handler(c, now);
> -
> -	if (len > 0 || ret == EAGAIN)
> -		return;
> -
> -	if (n == TAP_BUF_BYTES)
> -		goto redo;
> -
> -	die("Error on tap device, exiting");
>  }
>  
>  /**

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()
  2024-09-03 12:02 ` [PATCH 4/6] tap: Don't risk truncating frames on full buffer " David Gibson
@ 2024-09-03 19:25   ` Stefano Brivio
  2024-09-04  1:33     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:33 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> tap_pasta_input() keeps reading frames from the tap device until the
> buffer is full.  However, this has an ugly edge case, when we get close
> to buffer full, we will provide just the remaining space as a read()
> buffer.  If this is shorter than the next frame to read, the tap device
> will truncate the frame and discard the remainder.
> 
> Adjust the code to make sure we always have room for a maximum size frame.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 71742748..2fbcef04 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
>  
>  	tap_flush_pools();
>  
> -	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
> -		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
> +	for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {

Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I
guess, because if we have ETH_MAX_MTU bytes left, we can read another
frame. Only if we have strictly less than that we might truncate one.

In any case, not a strong preference, and perhaps this version is
actually more readable.

> +		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
>  
>  		if (len <= 0) {
>  			if (errno == EINTR) {

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections
  2024-09-03 12:02 ` [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections David Gibson
@ 2024-09-03 19:25   ` Stefano Brivio
  2024-09-04  1:36     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:34 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only
> used level-triggered events for the tap device.  Prior to that we used it
> inconsistently which caused some problems.

It didn't actually cause any issue according to your commit message for
4684f603446b itself.

> We want to add support for EPOLLOUT events on the tap connection, and
> without EPOLLET that would require toggling EPOLLOUT on and off, which is
> awkward.  So, re-introduce EPOLLET, but now use it uniformly for all tap
> modes.  The main change this requires is making sure on EPOLLIN we loop
> until all there's no more data to process.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 28 +++++++++++++++++++---------
>  1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 2fbcef04..d7d3fc19 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
>   * tap_passt_input() - Handler for new data on the socket to qemu
>   * @c:		Execution context
>   * @now:	Current timestamp
> + *
> + * Return: true if there may be additional data to read, false otherwise
>   */
> -static void tap_passt_input(struct ctx *c, const struct timespec *now)
> +static bool tap_passt_input(struct ctx *c, const struct timespec *now)
>  {
>  	static const char *partial_frame;
>  	static ssize_t partial_len = 0;
> @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
>  			err_perror("Receive error on guest connection, reset");
>  			tap_sock_reset(c);
>  		}
> -		return;
> +		return false;
>  	}
>  
>  	p = pkt_buf;
> @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
>  		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
>  			err("Bad frame size from guest, resetting connection");
>  			tap_sock_reset(c);
> -			return;
> +			return false;
>  		}
>  
>  		if (l2len + sizeof(uint32_t) > (size_t)n)
> @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
>  	partial_frame = p;
>  
>  	tap_handler(c, now);
> +
> +	return true;
>  }
>  
>  /**
> @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
>  	}
>  
>  	if (events & EPOLLIN)
> -		tap_passt_input(c, now);
> +		while (tap_passt_input(c, now))
> +			;

Nit (same below): use curly brackets for multi-line block. Or just use:

		while (tap_passt_input(c, now));

>  }
>  
>  /**
>   * tap_passt_input() - Handler for new data on the socket to qemu
>   * @c:		Execution context
>   * @now:	Current timestamp
> + *
> + * Return: true if there may be additional data to read, false otherwise
>   */
> -static void tap_pasta_input(struct ctx *c, const struct timespec *now)
> +static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
>  {
>  	ssize_t n, len;
>  
> @@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
>  		die("EOF on tap device, exiting");
>  
>  	tap_handler(c, now);
> +
> +	return len > 0;
>  }
>  
>  /**
> @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  		die("Disconnect event on /dev/net/tun device, exiting");
>  
>  	if (events & EPOLLIN)
> -		tap_pasta_input(c, now);
> +		while (tap_pasta_input(c, now))
> +			;
>  }
>  
>  /**
> @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> -	ev.events = EPOLLIN | EPOLLRDHUP;
> +	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
> @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c)
>  	pasta_ns_conf(c);
>  
>  	ref.fd = c->fd_tap;
> -	ev.events = EPOLLIN | EPOLLRDHUP;
> +	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
> @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c)
>  		else
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
>  
> -		ev.events = EPOLLIN | EPOLLRDHUP;
> +		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
>  		ev.data.u64 = ref.u64;
>  		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  		return;

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
                   ` (5 preceding siblings ...)
  2024-09-03 12:02 ` [PATCH 6/6] tap: Stub EPOLLOUT handling David Gibson
@ 2024-09-03 19:25 ` Stefano Brivio
  2024-09-04  3:17   ` David Gibson
  6 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-03 19:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue,  3 Sep 2024 22:02:29 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> This is a draft patch working towards adding EPOLLOUT handling to the
> tap code, which could then be used to "unstick" flows which have
> unsent data from the socket side.  For now that's just a stub, but
> makes what I think are some worthwhile cleanups to the tap side event
> handling in the meantime.

Except for the issue in 3/6 and nits elsewhere, it all makes sense and
tap-side EPOLLOUT handling is definitely going to be an improvement.

I wonder if it's the right moment for this kind of series, though, in
terms of future bisections, as long as we're grappling with
https://github.com/containers/podman/issues/23686 and
https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
this series doesn't fix anything.

That is, once/if we come up with fixes for those, as they might involve
setting different event masks, I'd rather have those in *before* this
series, to avoid further noise in case we manage to break something
else with those hypothetical fixes.

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/6] tap: Split out handling of EPOLLIN events
  2024-09-03 19:25   ` Stefano Brivio
@ 2024-09-04  1:17     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-04  1:17 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

On Tue, Sep 03, 2024 at 09:25:35PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:30 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
> > EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
> > cases that may want to also handle EPOLLOUT, so in preparation explicitly
> > handle EPOLLIN, moving the logic to a subfunction.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[snip]
> > +/**
> > + * tap_passt_input() - Handler for new data on the socket to qemu
> 
> tap_pasta_input(), QEMU (we could just call it hypervisor, given that
> libkrun/krun also use this path).

Updated.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input()
  2024-09-03 19:25   ` Stefano Brivio
@ 2024-09-04  1:30     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-04  1:30 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1853 bytes --]

On Tue, Sep 03, 2024 at 09:25:39PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:31 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > When tap_passt_input() gets an error from recv() it (correctly) does not
> > print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
> > three cases it returns from the function.  That makes sense for EAGAIN and
> > EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
> > trying again.  For EINTR, however, it makes more sense to retry immediately
> > - as it stands we're likely to get a renewer EPOLLIN event immediately in
> > that case, since we're using level triggered signalling.
> > 
> > So, handle EINTR separately by immediately retrying until we succeed or
> > get a different type of error.
> 
> I don't see an actual improvement: we don't know why we would get EINTR
> (because of signals) so repeating the recv() right away isn't
> necessarily a better choice.

My understanding is that EINTR means the system call was aborted
because a signal handler happened during it.  It usually doesn't
matter why we got the signal - we still need to redo the system call
and we might as well do so immediately.

> I'd say whatever way we have of carrying on on EINTR is fine. If this
> patch helps with brevity for the next ones, it makes sense, but
> otherwise I don't see a real advantage.

Right.  Specifically in order to use EPOLLET, I need to _not_ treat
EINTR the same way as EAGAIN and EWOULDBLOCK.

> Well, it's more consistent with the way we handle EINTR on other calls,
> but that's about it.

That too.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/6] tap: Restructure in tap_pasta_input()
  2024-09-03 19:25   ` Stefano Brivio
@ 2024-09-04  1:33     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-04  1:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3406 bytes --]

On Tue, Sep 03, 2024 at 09:25:42PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:32 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > tap_pasta_input() has a rather confusing structure, using two gotos.
> > Remove these by restructuring the function to have the main loop condition
> > based on filling our buffer space, with errors or running out of data
> > treated as the exception, rather than the other way around.  This allows
> > us to handle the EINTR which triggered the 'restart' goto with a continue.
> > 
> > The outer 'redo' was triggered if we completely filled our buffer, to flush
> > it and do another pass.  This one is unnecessary since we don't (yet) use
> > EPOLLET on the tap device: if there's still more data we'll get another
> > event and re-enter the loop.
> > 
> > Along the way handle a couple of extra edge cases:
> >  - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
> >    ports where those might not have the same value
> >  - Detect EOF on the tap device and exit in that case
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tap.c | 42 +++++++++++++++++-------------------------
> >  1 file changed, 17 insertions(+), 25 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 9ee59faa..71742748 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
> >  static void tap_pasta_input(struct ctx *c, const struct timespec *now)
> >  {
> >  	ssize_t n, len;
> > -	int ret;
> > -
> > -redo:
> > -	n = 0;
> >  
> >  	tap_flush_pools();
> > -restart:
> > -	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
> >  
> > -		if (len < (ssize_t)sizeof(struct ethhdr) ||
> > -		    len > (ssize_t)ETH_MAX_MTU) {
> > -			n += len;
> > -			continue;
> > +	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
> > +		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
> > +
> > +		if (len <= 0) {
> > +			if (errno == EINTR) {
> 
> You might be checking errno when read() returns 0, in which case it's
> not set. I guess you should zero out errno before read() if you want to
> keep this, or, alternatively, directly handle len == 0 here.

Good catch.  I've re-organised to avoid that.


> 
> > +				len = 0;
> > +				continue;
> > +			}
> > +			break;
> >  		}
> >  
> > +		/* Ignore frames of bad length */
> > +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> > +		    len > (ssize_t)ETH_MAX_MTU)
> > +			continue;
> >  
> >  		tap_add_packet(c, len, pkt_buf + n);
> > -
> > -		if ((n += len) == TAP_BUF_BYTES)
> > -			break;
> >  	}
> >  
> > -	if (len < 0 && errno == EINTR)
> > -		goto restart;
> > -
> > -	ret = errno;
> > +	if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
> > +		die("Error on tap device, exiting");
> > +	else if (len == 0)
> > +		die("EOF on tap device, exiting");
> >  
> >  	tap_handler(c, now);
> > -
> > -	if (len > 0 || ret == EAGAIN)
> > -		return;
> > -
> > -	if (n == TAP_BUF_BYTES)
> > -		goto redo;
> > -
> > -	die("Error on tap device, exiting");
> >  }
> >  
> >  /**
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] tap: Don't risk truncating frames on full buffer in tap_pasta_input()
  2024-09-03 19:25   ` Stefano Brivio
@ 2024-09-04  1:33     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-04  1:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1905 bytes --]

On Tue, Sep 03, 2024 at 09:25:46PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:33 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > tap_pasta_input() keeps reading frames from the tap device until the
> > buffer is full.  However, this has an ugly edge case, when we get close
> > to buffer full, we will provide just the remaining space as a read()
> > buffer.  If this is shorter than the next frame to read, the tap device
> > will truncate the frame and discard the remainder.
> > 
> > Adjust the code to make sure we always have room for a maximum size frame.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tap.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 71742748..2fbcef04 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
> >  
> >  	tap_flush_pools();
> >  
> > -	for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) {
> > -		len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n);
> > +	for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
> 
> Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I
> guess, because if we have ETH_MAX_MTU bytes left, we can read another
> frame. Only if we have strictly less than that we might truncate
> one.

Good point.  I've changed the < to <= to correct.

> 
> In any case, not a strong preference, and perhaps this version is
> actually more readable.
> 
> > +		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
> >  
> >  		if (len <= 0) {
> >  			if (errno == EINTR) {
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections
  2024-09-03 19:25   ` Stefano Brivio
@ 2024-09-04  1:36     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-09-04  1:36 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5074 bytes --]

On Tue, Sep 03, 2024 at 09:25:50PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:34 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only
> > used level-triggered events for the tap device.  Prior to that we used it
> > inconsistently which caused some problems.
> 
> It didn't actually cause any issue according to your commit message for
> 4684f603446b itself.

I've reworded this.

> 
> > We want to add support for EPOLLOUT events on the tap connection, and
> > without EPOLLET that would require toggling EPOLLOUT on and off, which is
> > awkward.  So, re-introduce EPOLLET, but now use it uniformly for all tap
> > modes.  The main change this requires is making sure on EPOLLIN we loop
> > until all there's no more data to process.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tap.c | 28 +++++++++++++++++++---------
> >  1 file changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 2fbcef04..d7d3fc19 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c)
> >   * tap_passt_input() - Handler for new data on the socket to qemu
> >   * @c:		Execution context
> >   * @now:	Current timestamp
> > + *
> > + * Return: true if there may be additional data to read, false otherwise
> >   */
> > -static void tap_passt_input(struct ctx *c, const struct timespec *now)
> > +static bool tap_passt_input(struct ctx *c, const struct timespec *now)
> >  {
> >  	static const char *partial_frame;
> >  	static ssize_t partial_len = 0;
> > @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
> >  			err_perror("Receive error on guest connection, reset");
> >  			tap_sock_reset(c);
> >  		}
> > -		return;
> > +		return false;
> >  	}
> >  
> >  	p = pkt_buf;
> > @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
> >  		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
> >  			err("Bad frame size from guest, resetting connection");
> >  			tap_sock_reset(c);
> > -			return;
> > +			return false;
> >  		}
> >  
> >  		if (l2len + sizeof(uint32_t) > (size_t)n)
> > @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
> >  	partial_frame = p;
> >  
> >  	tap_handler(c, now);
> > +
> > +	return true;
> >  }
> >  
> >  /**
> > @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
> >  	}
> >  
> >  	if (events & EPOLLIN)
> > -		tap_passt_input(c, now);
> > +		while (tap_passt_input(c, now))
> > +			;
> 
> Nit (same below): use curly brackets for multi-line block. Or just use:
> 
> 		while (tap_passt_input(c, now));

Fixed.

> >  }
> >  
> >  /**
> >   * tap_passt_input() - Handler for new data on the socket to qemu
> >   * @c:		Execution context
> >   * @now:	Current timestamp
> > + *
> > + * Return: true if there may be additional data to read, false otherwise
> >   */
> > -static void tap_pasta_input(struct ctx *c, const struct timespec *now)
> > +static bool tap_pasta_input(struct ctx *c, const struct timespec *now)
> >  {
> >  	ssize_t n, len;
> >  
> > @@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
> >  		die("EOF on tap device, exiting");
> >  
> >  	tap_handler(c, now);
> > +
> > +	return len > 0;
> >  }
> >  
> >  /**
> > @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
> >  		die("Disconnect event on /dev/net/tun device, exiting");
> >  
> >  	if (events & EPOLLIN)
> > -		tap_pasta_input(c, now);
> > +		while (tap_pasta_input(c, now))
> > +			;
> >  }
> >  
> >  /**
> > @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
> >  		trace("tap: failed to set SO_SNDBUF to %i", v);
> >  
> >  	ref.fd = c->fd_tap;
> > -	ev.events = EPOLLIN | EPOLLRDHUP;
> > +	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> >  	ev.data.u64 = ref.u64;
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> >  }
> > @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c)
> >  	pasta_ns_conf(c);
> >  
> >  	ref.fd = c->fd_tap;
> > -	ev.events = EPOLLIN | EPOLLRDHUP;
> > +	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> >  	ev.data.u64 = ref.u64;
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> >  }
> > @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c)
> >  		else
> >  			ref.type = EPOLL_TYPE_TAP_PASTA;
> >  
> > -		ev.events = EPOLLIN | EPOLLRDHUP;
> > +		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> >  		ev.data.u64 = ref.u64;
> >  		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> >  		return;
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-03 19:25 ` [PATCH 0/6] RFC: Clean up tap-side event handling Stefano Brivio
@ 2024-09-04  3:17   ` David Gibson
  2024-09-04 17:19     ` Stefano Brivio
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-04  3:17 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2397 bytes --]

On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
> On Tue,  3 Sep 2024 22:02:29 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > This is a draft patch working towards adding EPOLLOUT handling to the
> > tap code, which could then be used to "unstick" flows which have
> > unsent data from the socket side.  For now that's just a stub, but
> > makes what I think are some worthwhile cleanups to the tap side event
> > handling in the meantime.
> 
> Except for the issue in 3/6 and nits elsewhere, it all makes sense and
> tap-side EPOLLOUT handling is definitely going to be an improvement.
> 
> I wonder if it's the right moment for this kind of series, though, in
> terms of future bisections, as long as we're grappling with
> https://github.com/containers/podman/issues/23686 and
> https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
> this series doesn't fix anything.

I don't think this series will fix anything as it stands.  It is,
indirectly, aimed at addressing bug 94.  I'm struggling to figure out
what to do with bug 94, because I find it almost impossible to reason
about the current event masks in TCP.  I'd really like to simplify
them so it's clearer what's correct and not and I think the most
obvious path to doing so is using EPOLLET all the time.  That requires
some sort of kick when the tap is ready to accept more data, hence
this series as a prerequisite.

> That is, once/if we come up with fixes for those, as they might involve
> setting different event masks, I'd rather have those in *before* this
> series, to avoid further noise in case we manage to break something
> else with those hypothetical fixes.

Right, I understand the impetus.  Although as I said I find the
current TCP event handling nigh-incomprehensible so I'm not as yet
confident we can find a small fix without cleaning up the event
handling more generally.

That said, these changes to tap side event handling are a prerequisite
/ preliminary and shouldn't as yet really alter the TCP event flow.
So I don't think this series will of itself make bisection harder,
although follow on things based on it might.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-04  3:17   ` David Gibson
@ 2024-09-04 17:19     ` Stefano Brivio
  2024-09-05  0:35       ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-04 17:19 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 4 Sep 2024 13:17:53 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
> > On Tue,  3 Sep 2024 22:02:29 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > This is a draft patch working towards adding EPOLLOUT handling to the
> > > tap code, which could then be used to "unstick" flows which have
> > > unsent data from the socket side.  For now that's just a stub, but
> > > makes what I think are some worthwhile cleanups to the tap side event
> > > handling in the meantime.  
> > 
> > Except for the issue in 3/6 and nits elsewhere, it all makes sense and
> > tap-side EPOLLOUT handling is definitely going to be an improvement.
> > 
> > I wonder if it's the right moment for this kind of series, though, in
> > terms of future bisections, as long as we're grappling with
> > https://github.com/containers/podman/issues/23686 and
> > https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
> > this series doesn't fix anything.  
> 
> I don't think this series will fix anything as it stands.  It is,
> indirectly, aimed at addressing bug 94.  I'm struggling to figure out
> what to do with bug 94, because I find it almost impossible to reason
> about the current event masks in TCP.

I don't see at the moment anything indicating TCP issues other than the
one you addressed with your tentative debug patch at:

  https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc

Given that, with that patch, we had at least another report of event
storms, this time on UDP, that is, the one from:

  https://github.com/containers/podman/issues/23686#issuecomment-2324945010

I shared this other one on top:

  https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

> I'd really like to simplify
> them so it's clearer what's correct and not and I think the most
> obvious path to doing so is using EPOLLET all the time.  That requires
> some sort of kick when the tap is ready to accept more data, hence
> this series as a prerequisite.

Sure, it's going to be simpler and more robust, but on the other hand
we wouldn't notice these kind of issues.

> > That is, once/if we come up with fixes for those, as they might involve
> > setting different event masks, I'd rather have those in *before* this
> > series, to avoid further noise in case we manage to break something
> > else with those hypothetical fixes.  
> 
> Right, I understand the impetus.  Although as I said I find the
> current TCP event handling nigh-incomprehensible so I'm not as yet
> confident we can find a small fix without cleaning up the event
> handling more generally.

I'm not sure either, but I don't think we have any indication, at the
moment, that any of the issues from those two tickets have anything to
do with TCP event handling (minus the one you tentatively fixed).

> That said, these changes to tap side event handling are a prerequisite
> / preliminary and shouldn't as yet really alter the TCP event flow.
> So I don't think this series will of itself make bisection harder,
> although follow on things based on it might.

I understand that they shouldn't alter it, but if we missed something
subtle and they actually do, they'll make bisection more complicated.

If this series is only needed for switching TCP sockets to EPOLLET
(well, minus 4/6, which is a fix on its own), maybe we could wait until
you have the whole thing ready (and, hopefully, we manage to fix those
two tickets meanwhile)?

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-04 17:19     ` Stefano Brivio
@ 2024-09-05  0:35       ` David Gibson
  2024-09-05  8:32         ` Stefano Brivio
  0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-09-05  0:35 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4618 bytes --]

On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
> On Wed, 4 Sep 2024 13:17:53 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
> > > On Tue,  3 Sep 2024 22:02:29 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > This is a draft patch working towards adding EPOLLOUT handling to the
> > > > tap code, which could then be used to "unstick" flows which have
> > > > unsent data from the socket side.  For now that's just a stub, but
> > > > makes what I think are some worthwhile cleanups to the tap side event
> > > > handling in the meantime.  
> > > 
> > > Except for the issue in 3/6 and nits elsewhere, it all makes sense and
> > > tap-side EPOLLOUT handling is definitely going to be an improvement.
> > > 
> > > I wonder if it's the right moment for this kind of series, though, in
> > > terms of future bisections, as long as we're grappling with
> > > https://github.com/containers/podman/issues/23686 and
> > > https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
> > > this series doesn't fix anything.  
> > 
> > I don't think this series will fix anything as it stands.  It is,
> > indirectly, aimed at addressing bug 94.  I'm struggling to figure out
> > what to do with bug 94, because I find it almost impossible to reason
> > about the current event masks in TCP.
> 
> I don't see at the moment anything indicating TCP issues other than the
> one you addressed with your tentative debug patch at:
> 
>   https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
> 
> Given that, with that patch, we had at least another report of event
> storms, this time on UDP, that is, the one from:
> 
>   https://github.com/containers/podman/issues/23686#issuecomment-2324945010
> 
> I shared this other one on top:
> 
>   https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae

Ah, nice.

> > I'd really like to simplify
> > them so it's clearer what's correct and not and I think the most
> > obvious path to doing so is using EPOLLET all the time.  That requires
> > some sort of kick when the tap is ready to accept more data, hence
> > this series as a prerequisite.
> 
> Sure, it's going to be simpler and more robust, but on the other hand
> we wouldn't notice these kind of issues.

Uh.. I'm confused.  In what way would we not notice issues, other than
the issues not existing which.. would be good, right?

> > > That is, once/if we come up with fixes for those, as they might involve
> > > setting different event masks, I'd rather have those in *before* this
> > > series, to avoid further noise in case we manage to break something
> > > else with those hypothetical fixes.  
> > 
> > Right, I understand the impetus.  Although as I said I find the
> > current TCP event handling nigh-incomprehensible so I'm not as yet
> > confident we can find a small fix without cleaning up the event
> > handling more generally.
> 
> I'm not sure either, but I don't think we have any indication, at the
> moment, that any of the issues from those two tickets have anything to
> do with TCP event handling (minus the one you tentatively fixed).

Right, this reasoning is pretty much specific to the EPOLLRDHUP storm.
I may have written some of the descriptions before registering that
the EPOLLERR storm was UDP and therefore unrelated.

> > That said, these changes to tap side event handling are a prerequisite
> > / preliminary and shouldn't as yet really alter the TCP event flow.
> > So I don't think this series will of itself make bisection harder,
> > although follow on things based on it might.
> 
> I understand that they shouldn't alter it, but if we missed something
> subtle and they actually do, they'll make bisection more complicated.

I guess.  Seems pretty unlikely to me given this doesn't touch the TCP
events themselves.

> If this series is only needed for switching TCP sockets to EPOLLET
> (well, minus 4/6, which is a fix on its own), maybe we could wait until
> you have the whole thing ready (and, hopefully, we manage to fix those
> two tickets meanwhile)?

Right, I'm ok to wait on this until I have the whole picture including
TCP event masks as well.  That's kind of why it's an RFC.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-05  0:35       ` David Gibson
@ 2024-09-05  8:32         ` Stefano Brivio
  2024-09-05 11:33           ` Stefano Brivio
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-09-05  8:32 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 5 Sep 2024 10:35:14 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
> > On Wed, 4 Sep 2024 13:17:53 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:  
> > > > On Tue,  3 Sep 2024 22:02:29 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > This is a draft patch working towards adding EPOLLOUT handling to the
> > > > > tap code, which could then be used to "unstick" flows which have
> > > > > unsent data from the socket side.  For now that's just a stub, but
> > > > > makes what I think are some worthwhile cleanups to the tap side event
> > > > > handling in the meantime.    
> > > > 
> > > > Except for the issue in 3/6 and nits elsewhere, it all makes sense and
> > > > tap-side EPOLLOUT handling is definitely going to be an improvement.
> > > > 
> > > > I wonder if it's the right moment for this kind of series, though, in
> > > > terms of future bisections, as long as we're grappling with
> > > > https://github.com/containers/podman/issues/23686 and
> > > > https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
> > > > this series doesn't fix anything.    
> > > 
> > > I don't think this series will fix anything as it stands.  It is,
> > > indirectly, aimed at addressing bug 94.  I'm struggling to figure out
> > > what to do with bug 94, because I find it almost impossible to reason
> > > about the current event masks in TCP.  
> > 
> > I don't see at the moment anything indicating TCP issues other than the
> > one you addressed with your tentative debug patch at:
> > 
> >   https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
> > 
> > Given that, with that patch, we had at least another report of event
> > storms, this time on UDP, that is, the one from:
> > 
> >   https://github.com/containers/podman/issues/23686#issuecomment-2324945010
> > 
> > I shared this other one on top:
> > 
> >   https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae  
> 
> Ah, nice.
> 
> > > I'd really like to simplify
> > > them so it's clearer what's correct and not and I think the most
> > > obvious path to doing so is using EPOLLET all the time.  That requires
> > > some sort of kick when the tap is ready to accept more data, hence
> > > this series as a prerequisite.  
> > 
> > Sure, it's going to be simpler and more robust, but on the other hand
> > we wouldn't notice these kind of issues.  
> 
> Uh.. I'm confused.  In what way would we not notice issues, other than
> the issues not existing which.. would be good, right?

Right now, we have a condition where we fail to handle EPOLLRDHUP
before an outbound connection is established, see
https://github.com/containers/podman/issues/23686#issuecomment-233023742,
and we end up in a tight event processing loop.

I guess what we're missing in tcp_sock_handler() is clear (we should
orderly close the connection), but the tight loop didn't happen on
2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it
didn't.

If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP
event is reported just once, but that doesn't mean we solved this.

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] RFC: Clean up tap-side event handling
  2024-09-05  8:32         ` Stefano Brivio
@ 2024-09-05 11:33           ` Stefano Brivio
  0 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2024-09-05 11:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 5 Sep 2024 10:32:38 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Thu, 5 Sep 2024 10:35:14 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:  
> > > On Wed, 4 Sep 2024 13:17:53 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >     
> > > > On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:    
> > > > > On Tue,  3 Sep 2024 22:02:29 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >       
> > > > > > This is a draft patch working towards adding EPOLLOUT handling to the
> > > > > > tap code, which could then be used to "unstick" flows which have
> > > > > > unsent data from the socket side.  For now that's just a stub, but
> > > > > > makes what I think are some worthwhile cleanups to the tap side event
> > > > > > handling in the meantime.      
> > > > > 
> > > > > Except for the issue in 3/6 and nits elsewhere, it all makes sense and
> > > > > tap-side EPOLLOUT handling is definitely going to be an improvement.
> > > > > 
> > > > > I wonder if it's the right moment for this kind of series, though, in
> > > > > terms of future bisections, as long as we're grappling with
> > > > > https://github.com/containers/podman/issues/23686 and
> > > > > https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that
> > > > > this series doesn't fix anything.      
> > > > 
> > > > I don't think this series will fix anything as it stands.  It is,
> > > > indirectly, aimed at addressing bug 94.  I'm struggling to figure out
> > > > what to do with bug 94, because I find it almost impossible to reason
> > > > about the current event masks in TCP.    
> > > 
> > > I don't see at the moment anything indicating TCP issues other than the
> > > one you addressed with your tentative debug patch at:
> > > 
> > >   https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
> > > 
> > > Given that, with that patch, we had at least another report of event
> > > storms, this time on UDP, that is, the one from:
> > > 
> > >   https://github.com/containers/podman/issues/23686#issuecomment-2324945010
> > > 
> > > I shared this other one on top:
> > > 
> > >   https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae    
> > 
> > Ah, nice.
> >   
> > > > I'd really like to simplify
> > > > them so it's clearer what's correct and not and I think the most
> > > > obvious path to doing so is using EPOLLET all the time.  That requires
> > > > some sort of kick when the tap is ready to accept more data, hence
> > > > this series as a prerequisite.    
> > > 
> > > Sure, it's going to be simpler and more robust, but on the other hand
> > > we wouldn't notice these kind of issues.    
> > 
> > Uh.. I'm confused.  In what way would we not notice issues, other than
> > the issues not existing which.. would be good, right?  
> 
> Right now, we have a condition where we fail to handle EPOLLRDHUP
> before an outbound connection is established, see
> https://github.com/containers/podman/issues/23686#issuecomment-233023742,

Sorry, it's:

  https://github.com/containers/podman/issues/23686#issuecomment-2330237424

> and we end up in a tight event processing loop.
> 
> I guess what we're missing in tcp_sock_handler() is clear (we should
> orderly close the connection), but the tight loop didn't happen on
> 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it
> didn't.
> 
> If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP
> event is reported just once, but that doesn't mean we solved this.

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-09-05 11:33 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-03 12:02 [PATCH 0/6] RFC: Clean up tap-side event handling David Gibson
2024-09-03 12:02 ` [PATCH 1/6] tap: Split out handling of EPOLLIN events David Gibson
2024-09-03 19:25   ` Stefano Brivio
2024-09-04  1:17     ` David Gibson
2024-09-03 12:02 ` [PATCH 2/6] tap: Improve handling of EINTR in tap_passt_input() David Gibson
2024-09-03 19:25   ` Stefano Brivio
2024-09-04  1:30     ` David Gibson
2024-09-03 12:02 ` [PATCH 3/6] tap: Restructure in tap_pasta_input() David Gibson
2024-09-03 19:25   ` Stefano Brivio
2024-09-04  1:33     ` David Gibson
2024-09-03 12:02 ` [PATCH 4/6] tap: Don't risk truncating frames on full buffer " David Gibson
2024-09-03 19:25   ` Stefano Brivio
2024-09-04  1:33     ` David Gibson
2024-09-03 12:02 ` [PATCH 5/6] tap: Re-introduce EPOLLET for tap connections David Gibson
2024-09-03 19:25   ` Stefano Brivio
2024-09-04  1:36     ` David Gibson
2024-09-03 12:02 ` [PATCH 6/6] tap: Stub EPOLLOUT handling David Gibson
2024-09-03 19:25 ` [PATCH 0/6] RFC: Clean up tap-side event handling Stefano Brivio
2024-09-04  3:17   ` David Gibson
2024-09-04 17:19     ` Stefano Brivio
2024-09-05  0:35       ` David Gibson
2024-09-05  8:32         ` Stefano Brivio
2024-09-05 11:33           ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).