public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler
@ 2024-07-26  7:20 David Gibson
  2024-07-26  7:20 ` [PATCH 1/5] tap: Better report errors receiving from QEMU socket David Gibson
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

A long time ago Matej Hrica pointed out a possible buffer overrun when
receiving data from the qemu socket.  Stefano recently proposed a fix
for this, but I don't think it's quite right.  This series is a
different approach to fixing that problem and a few adjacent ones.

David Gibson (5):
  tap: Better report errors receiving from QEMU socket
  tap: Don't attempt to carry on if we get a bad frame length from qemu
  tap: Don't use EPOLLET on Qemu sockets
  tap: Correctly handle frames of odd length
  tap: Improve handling of partially received frames on qemu socket

 passt.h |  1 -
 tap.c   | 72 ++++++++++++++++++++++++++++++---------------------------
 util.h  | 16 +++++++++++++
 3 files changed, 54 insertions(+), 35 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] tap: Better report errors receiving from QEMU socket
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
@ 2024-07-26  7:20 ` David Gibson
  2024-07-26 11:25   ` Stefano Brivio
  2024-07-26  7:20 ` [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu David Gibson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error.  Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we realy should report
something for debugability.

Add an error message in this case.  Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 44bd4445..a2ae7c2a 100644
--- a/tap.c
+++ b/tap.c
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-	if (c->one_off) {
-		info("Client closed connection, exiting");
+	info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+	if (c->one_off)
 		exit(EXIT_SUCCESS);
-	}
 
 	/* Close the connected socket, wait for a new connection */
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
 
 	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
 	if (n < 0) {
-		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+			err_perror("Error receiving from QEMU socket");
 			tap_sock_reset(c);
+		}
 		return;
 	}
 
-- 
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-	if (c->one_off) {
-		info("Client closed connection, exiting");
+	info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+	if (c->one_off)
 		exit(EXIT_SUCCESS);
-	}
 
 	/* Close the connected socket, wait for a new connection */
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
 
 	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
 	if (n < 0) {
-		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+			err_perror("Error receiving from QEMU socket");
 			tap_sock_reset(c);
+		}
 		return;
 	}
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
  2024-07-26  7:20 ` [PATCH 1/5] tap: Better report errors receiving from QEMU socket David Gibson
@ 2024-07-26  7:20 ` David Gibson
  2024-07-26 11:26   ` Stefano Brivio
  2024-07-26  7:20 ` [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets David Gibson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on.  That sounds sensible on first blush, but
probably isn't wise in practice.  If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket.  Or more likely (c), we have a bug elswhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.

Neither (b) nor (c) is really salvageable with the same stream.  Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.

So, instead of just skipping the frame and trying to carry on, log an error
and close the socket.  As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c
index a2ae7c2a..08700f65 100644
--- a/tap.c
+++ b/tap.c
@@ -1013,7 +1013,13 @@ redo:
 	}
 
 	while (n > (ssize_t)sizeof(uint32_t)) {
-		ssize_t l2len = ntohl(*(uint32_t *)p);
+		uint32_t l2len = ntohl(*(uint32_t *)p);
+
+		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+			err("Invalid frame size from QEMU (corrupt stream?)");
+			tap_sock_reset(c);
+			return;
+		}
 
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
@@ -1027,16 +1033,8 @@ redo:
 				return;
 		}
 
-		/* Complete the partial read above before discarding a malformed
-		 * frame, otherwise the stream will be inconsistent.
-		 */
-		if (l2len < (ssize_t)sizeof(struct ethhdr) ||
-		    l2len > (ssize_t)ETH_MAX_MTU)
-			goto next;
-
 		tap_add_packet(c, l2len, p);
 
-next:
 		p += l2len;
 		n -= l2len;
 	}
-- 
@@ -1013,7 +1013,13 @@ redo:
 	}
 
 	while (n > (ssize_t)sizeof(uint32_t)) {
-		ssize_t l2len = ntohl(*(uint32_t *)p);
+		uint32_t l2len = ntohl(*(uint32_t *)p);
+
+		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+			err("Invalid frame size from QEMU (corrupt stream?)");
+			tap_sock_reset(c);
+			return;
+		}
 
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
@@ -1027,16 +1033,8 @@ redo:
 				return;
 		}
 
-		/* Complete the partial read above before discarding a malformed
-		 * frame, otherwise the stream will be inconsistent.
-		 */
-		if (l2len < (ssize_t)sizeof(struct ethhdr) ||
-		    l2len > (ssize_t)ETH_MAX_MTU)
-			goto next;
-
 		tap_add_packet(c, l2len, p);
 
-next:
 		p += l2len;
 		n -= l2len;
 	}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
  2024-07-26  7:20 ` [PATCH 1/5] tap: Better report errors receiving from QEMU socket David Gibson
  2024-07-26  7:20 ` [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu David Gibson
@ 2024-07-26  7:20 ` David Gibson
  2024-07-26  8:00   ` Stefano Brivio
  2024-07-26  7:20 ` [PATCH 4/5] tap: Correctly handle frames of odd length David Gibson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket.  It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling.  That consideration
doesn't apply to the way we handle the qemu socket however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
   we *do* set it when using pasta mode with --fd.  This inconsistency
   doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available
   is read (otherwise we may have data in the buffer but never get an event
   causing us to read it).  We do that with a rather ugly goto.

   Worse, our condition for that goto appears to be incorrect.  We'll only
   loop if rem is non-zero, which will only happen if we perform a blocking
   recv() for a partially received frame.  We'll only perform that second
   recv() if the original recv() resulted in a partially read frame.  As
   far as I can tell the original recv() could end on a frame boundary
   (never triggering the second recv()) even if there is additional data in
   the socket buffer.  In that circumstance we wouldn't goto redo and could
   leave unprocessed frames in the qemu socket buffer indefinitely.

   This doesn't seem to have caused any problems in practice, but since
   there's no obvious reason to use EPOLLET here anyway, we might as well
   get rid of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/tap.c b/tap.c
index 08700f65..df1287ad 100644
--- a/tap.c
+++ b/tap.c
@@ -989,7 +989,7 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
-	ssize_t n, rem;
+	ssize_t n;
 	char *p;
 
 	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
@@ -997,9 +997,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-redo:
 	p = pkt_buf;
-	rem = 0;
 
 	tap_flush_pools();
 
@@ -1028,7 +1026,7 @@ redo:
 		 * needs to be blocking.
 		 */
 		if (l2len > n) {
-			rem = recv(c->fd_tap, p + n, l2len - n, 0);
+			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
 			if ((n += rem) != l2len)
 				return;
 		}
@@ -1040,10 +1038,6 @@ redo:
 	}
 
 	tap_handler(c, now);
-
-	/* We can't use EPOLLET otherwise. */
-	if (rem)
-		goto redo;
 }
 
 /**
@@ -1228,7 +1222,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1317,7 +1311,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
@@ -989,7 +989,7 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
-	ssize_t n, rem;
+	ssize_t n;
 	char *p;
 
 	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
@@ -997,9 +997,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-redo:
 	p = pkt_buf;
-	rem = 0;
 
 	tap_flush_pools();
 
@@ -1028,7 +1026,7 @@ redo:
 		 * needs to be blocking.
 		 */
 		if (l2len > n) {
-			rem = recv(c->fd_tap, p + n, l2len - n, 0);
+			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
 			if ((n += rem) != l2len)
 				return;
 		}
@@ -1040,10 +1038,6 @@ redo:
 	}
 
 	tap_handler(c, now);
-
-	/* We can't use EPOLLET otherwise. */
-	if (rem)
-		goto redo;
 }
 
 /**
@@ -1228,7 +1222,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1317,7 +1311,7 @@ void tap_sock_init(struct ctx *c)
 		else
 			ref.type = EPOLL_TYPE_TAP_PASTA;
 
-		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+		ev.events = EPOLLIN | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5] tap: Correctly handle frames of odd length
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
                   ` (2 preceding siblings ...)
  2024-07-26  7:20 ` [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets David Gibson
@ 2024-07-26  7:20 ` David Gibson
  2024-07-26  7:20 ` [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket David Gibson
  2024-07-26 13:19 ` [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler Stefano Brivio
  5 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tap.c  |  2 +-
 util.h | 16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tap.c b/tap.c
index df1287ad..aca27bb9 100644
--- a/tap.c
+++ b/tap.c
@@ -1011,7 +1011,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 	}
 
 	while (n > (ssize_t)sizeof(uint32_t)) {
-		uint32_t l2len = ntohl(*(uint32_t *)p);
+		uint32_t l2len = ntohl_unaligned(p);
 
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Invalid frame size from QEMU (corrupt stream?)");
diff --git a/util.h b/util.h
index 826614cf..5a038e29 100644
--- a/util.h
+++ b/util.h
@@ -10,8 +10,10 @@
 #include <stdarg.h>
 #include <stdbool.h>
 #include <stddef.h>
+#include <stdint.h>
 #include <string.h>
 #include <signal.h>
+#include <arpa/inet.h>
 
 #include "log.h"
 
@@ -116,6 +118,20 @@
 #define	htonl_constant(x)	(__bswap_constant_32(x))
 #endif
 
+/**
+ * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
+ * @p:		Pointer to the BE value in memory
+ *
+ * Returns: Host-order value of 32-bit BE quantity at @p
+ */
+static inline uint32_t ntohl_unaligned(const void *p)
+{
+	uint32_t val;
+
+	memcpy(&val, p, sizeof(val));
+	return ntohl(val);
+}
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
-- 
@@ -10,8 +10,10 @@
 #include <stdarg.h>
 #include <stdbool.h>
 #include <stddef.h>
+#include <stdint.h>
 #include <string.h>
 #include <signal.h>
+#include <arpa/inet.h>
 
 #include "log.h"
 
@@ -116,6 +118,20 @@
 #define	htonl_constant(x)	(__bswap_constant_32(x))
 #endif
 
+/**
+ * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
+ * @p:		Pointer to the BE value in memory
+ *
+ * Returns: Host-order value of 32-bit BE quantity at @p
+ */
+static inline uint32_t ntohl_unaligned(const void *p)
+{
+	uint32_t val;
+
+	memcpy(&val, p, sizeof(val));
+	return ntohl(val);
+}
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
                   ` (3 preceding siblings ...)
  2024-07-26  7:20 ` [PATCH 4/5] tap: Correctly handle frames of odd length David Gibson
@ 2024-07-26  7:20 ` David Gibson
  2024-07-26 11:39   ` Stefano Brivio
  2024-07-26 13:19 ` [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler Stefano Brivio
  5 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2024-07-26  7:20 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If a the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt.h |  1 -
 tap.c   | 36 +++++++++++++++++++++++-------------
 2 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/passt.h b/passt.h
index 4cc2b6f0..d0f31a23 100644
--- a/passt.h
+++ b/passt.h
@@ -60,7 +60,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 #define TAP_BUF_BYTES							\
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
-#define TAP_BUF_FILL		(TAP_BUF_BYTES - ETH_MAX_MTU - sizeof(uint32_t))
 #define TAP_MSGS							\
 	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
diff --git a/tap.c b/tap.c
index aca27bb9..18eec279 100644
--- a/tap.c
+++ b/tap.c
@@ -989,6 +989,8 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
+	static const char *partial_frame;
+	static ssize_t partial_len = 0;
 	ssize_t n;
 	char *p;
 
@@ -997,11 +999,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	p = pkt_buf;
-
 	tap_flush_pools();
 
-	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
+	if (partial_len) {
+		/* We have a partial frame from an earlier pass.  Move it to the
+		 * start of the buffer, top up with new data, then process all
+		 * of it.
+		 */
+		memmove(pkt_buf, partial_frame, partial_len);
+	}
+
+	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
+		 MSG_DONTWAIT);
 	if (n < 0) {
 		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
 			err_perror("Error receiving from QEMU socket");
@@ -1010,7 +1019,10 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	while (n > (ssize_t)sizeof(uint32_t)) {
+	p = pkt_buf;
+	n += partial_len;
+
+	while (n >= (ssize_t)sizeof(uint32_t)) {
 		uint32_t l2len = ntohl_unaligned(p);
 
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
@@ -1019,24 +1031,22 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 			return;
 		}
 
+		if (l2len + sizeof(uint32_t) > (size_t)n)
+			/* Leave this incomplete frame for later */
+			break;
+
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		/* At most one packet might not fit in a single read, and this
-		 * needs to be blocking.
-		 */
-		if (l2len > n) {
-			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
-			if ((n += rem) != l2len)
-				return;
-		}
-
 		tap_add_packet(c, l2len, p);
 
 		p += l2len;
 		n -= l2len;
 	}
 
+	partial_len = n;
+	partial_frame = p;
+
 	tap_handler(c, now);
 }
 
-- 
@@ -989,6 +989,8 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
+	static const char *partial_frame;
+	static ssize_t partial_len = 0;
 	ssize_t n;
 	char *p;
 
@@ -997,11 +999,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	p = pkt_buf;
-
 	tap_flush_pools();
 
-	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
+	if (partial_len) {
+		/* We have a partial frame from an earlier pass.  Move it to the
+		 * start of the buffer, top up with new data, then process all
+		 * of it.
+		 */
+		memmove(pkt_buf, partial_frame, partial_len);
+	}
+
+	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
+		 MSG_DONTWAIT);
 	if (n < 0) {
 		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
 			err_perror("Error receiving from QEMU socket");
@@ -1010,7 +1019,10 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	while (n > (ssize_t)sizeof(uint32_t)) {
+	p = pkt_buf;
+	n += partial_len;
+
+	while (n >= (ssize_t)sizeof(uint32_t)) {
 		uint32_t l2len = ntohl_unaligned(p);
 
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
@@ -1019,24 +1031,22 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 			return;
 		}
 
+		if (l2len + sizeof(uint32_t) > (size_t)n)
+			/* Leave this incomplete frame for later */
+			break;
+
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		/* At most one packet might not fit in a single read, and this
-		 * needs to be blocking.
-		 */
-		if (l2len > n) {
-			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
-			if ((n += rem) != l2len)
-				return;
-		}
-
 		tap_add_packet(c, l2len, p);
 
 		p += l2len;
 		n -= l2len;
 	}
 
+	partial_len = n;
+	partial_frame = p;
+
 	tap_handler(c, now);
 }
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26  7:20 ` [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets David Gibson
@ 2024-07-26  8:00   ` Stefano Brivio
  2024-07-26 10:44     ` Stefano Brivio
  2024-07-26 12:12     ` David Gibson
  0 siblings, 2 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26  8:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 17:20:29 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we set EPOLLET (edge trigger) on the epoll flags for the
> connected Qemu Unix socket.  It's not clear that there's a reason for
> doing this: for TCP sockets we need to use EPOLLET, because we leave data
> in the socket buffers for our flow control handling.  That consideration
> doesn't apply to the way we handle the qemu socket however.

It significantly decreases epoll_wait() overhead on sustained data
transfers, because we can read multiple TAP_BUF_SIZE buffers at a time
instead of just one.

I can check that now again with current QEMU and kernel versions, plus
several fundamental changes in buffer handling, but I don't see a real
reason why this shouldn't have changed meanwhile.

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26  8:00   ` Stefano Brivio
@ 2024-07-26 10:44     ` Stefano Brivio
  2024-07-26 12:12     ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 10:44 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 10:00:56 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 26 Jul 2024 17:20:29 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we set EPOLLET (edge trigger) on the epoll flags for the
> > connected Qemu Unix socket.  It's not clear that there's a reason for
> > doing this: for TCP sockets we need to use EPOLLET, because we leave data
> > in the socket buffers for our flow control handling.  That consideration
> > doesn't apply to the way we handle the qemu socket however.  
> 
> It significantly decreases epoll_wait() overhead on sustained data
> transfers, because we can read multiple TAP_BUF_SIZE buffers at a time
> instead of just one.
> 
> I can check that now again with current QEMU and kernel versions, plus
> several fundamental changes in buffer handling, but I don't see a real
> reason why this shouldn't have changed meanwhile.

Surprisingly, this doesn't affect throughput at all on my setup, no
matter the packet size.

I didn't check with perf(1), though, and we probably should give all the
recent substantial changes a pass with it (it's been a while since I
last checked where overhead typically is...).

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] tap: Better report errors receiving from QEMU socket
  2024-07-26  7:20 ` [PATCH 1/5] tap: Better report errors receiving from QEMU socket David Gibson
@ 2024-07-26 11:25   ` Stefano Brivio
  2024-07-26 11:50     ` Stefano Brivio
  2024-07-26 12:02     ` David Gibson
  0 siblings, 2 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 11:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 17:20:27 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> If we get an error on recv() from the QEMU socket, we currently don't
> print any kind of error.  Although this can happen in a non-fatal situation
> such as a guest restarting, it's unusual enough that we realy should report
> something for debugability.
> 
> Add an error message in this case.  Also always report when the qemu
> connection closes for any reason, not just when it will cause us to exit
> (--one-off).
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 44bd4445..a2ae7c2a 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   */
>  static void tap_sock_reset(struct ctx *c)
>  {
> -	if (c->one_off) {
> -		info("Client closed connection, exiting");
> +	info("Client connection closed%s", c->one_off ? ", exiting" : "");
> +
> +	if (c->one_off)
>  		exit(EXIT_SUCCESS);
> -	}
>  
>  	/* Close the connected socket, wait for a new connection */
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
> @@ -1005,8 +1005,10 @@ redo:
>  
>  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
>  	if (n < 0) {
> -		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
> +		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
> +			err_perror("Error receiving from QEMU socket");

Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun
and perhaps others also use this path (in fact, the whole problem was
reported as part of the libkrun integration).

Maybe it's obvious to users anyway, but this might seriously confuse
somebody who's using e.g. krun on Asahi Linux (is QEMU running, one
might wonder):
  https://github.com/slp/krun#motivation
  https://github.com/slp/krun/blob/main/crates/krun/src/net.rs

On top of that, now that we have an error message, I guess it would be
nice to state we're resetting the connection, because it's not really
obvious: the subsequent message from tap_sock_reset() makes it look like
the client decided to close the connection on its own.

So I'm changing this to:

			err("Error receiving from guest, resetting connection");

...if you see an issue with it, I'll send another patch.

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu
  2024-07-26  7:20 ` [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu David Gibson
@ 2024-07-26 11:26   ` Stefano Brivio
  0 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 11:26 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 17:20:28 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> If we receive a too-short or too-long frame from the QEMU socket, currently
> we try to skip it and carry on.  That sounds sensible on first blush, but
> probably isn't wise in practice.  If this happens, either (a) qemu has done
> something seriously unexpected, or (b) we've received corrupt data over a
> Unix socket.  Or more likely (c), we have a bug elswhere which has put us
> out of sync with the stream, so we're trying to read something that's not a
> frame length as a frame length.
> 
> Neither (b) nor (c) is really salvageable with the same stream.  Case (a)
> might be ok, but we can no longer be confident qemu won't do something else
> we can't cope with.
> 
> So, instead of just skipping the frame and trying to carry on, log an error
> and close the socket.  As a bonus, establishing firm bounds on l2len early
> will allow simplifications to how we deal with the case where a partial
> frame is recv()ed.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tap.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index a2ae7c2a..08700f65 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1013,7 +1013,13 @@ redo:
>  	}
>  
>  	while (n > (ssize_t)sizeof(uint32_t)) {
> -		ssize_t l2len = ntohl(*(uint32_t *)p);
> +		uint32_t l2len = ntohl(*(uint32_t *)p);
> +
> +		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
> +			err("Invalid frame size from QEMU (corrupt stream?)");

Same as my comment to 1/5, let me change this to:

			err("Bad frame size from guest, resetting connection");

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket
  2024-07-26  7:20 ` [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket David Gibson
@ 2024-07-26 11:39   ` Stefano Brivio
  2024-07-26 12:33     ` David Gibson
  0 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 11:39 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 17:20:31 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Because the Unix socket to qemu is a stream socket, we have no guarantee
> of where the boundaries between recv() calls will lie.  Typically they
> will lie on frame boundaries, because that's how qemu will send then, but
> we can't rely on it.
> 
> Currently we handle this case by detecting when we have received a partial
> frame and performing a blocking recv() to get the remainder, and only then
> processing the frames. Change it so instead we save the partial frame
> persistently and include it as the first thing processed next time we
> receive data from the socket.  This handles a number of (unlikely) cases
> which previously would not be dealt with correctly:
> 
> * If qemu sent a partial frame then waited some time before sending the
>   remainder, previously we could block here for an unacceptably long time
> * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
>   doing the partial frame handling, which would put us out of sync with
>   the stream from qemu
> * If a the blocking recv() only received some of the remainder of the
>   frame, not all of it, we'd return leaving us out of sync with the
>   stream again
> 
> Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
> is probably acceptable because it's an unlikely case in practice.  If
> necessary we could mitigate this by using a true ring buffer.

I don't think that that memmove() is a problem if we have a single
recv(), even if it happens to be one memmove() for every recv() (guest
filling up the buffer, common in throughput tests and bulk transfers),
because it's very small in relative terms anyway.

I think the ringbuffer would be worth it with multiple recv() calls per
epoll wakeup, with EPOLLET.

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] tap: Better report errors receiving from QEMU socket
  2024-07-26 11:25   ` Stefano Brivio
@ 2024-07-26 11:50     ` Stefano Brivio
  2024-07-26 12:02     ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 11:50 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 13:25:08 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Fri, 26 Jul 2024 17:20:27 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > If we get an error on recv() from the QEMU socket, we currently don't
> > print any kind of error.  Although this can happen in a non-fatal situation
> > such as a guest restarting, it's unusual enough that we realy should report
> > something for debugability.
> > 
> > Add an error message in this case.  Also always report when the qemu
> > connection closes for any reason, not just when it will cause us to exit
> > (--one-off).
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tap.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 44bd4445..a2ae7c2a 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
> >   */
> >  static void tap_sock_reset(struct ctx *c)
> >  {
> > -	if (c->one_off) {
> > -		info("Client closed connection, exiting");
> > +	info("Client connection closed%s", c->one_off ? ", exiting" : "");
> > +
> > +	if (c->one_off)
> >  		exit(EXIT_SUCCESS);
> > -	}
> >  
> >  	/* Close the connected socket, wait for a new connection */
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
> > @@ -1005,8 +1005,10 @@ redo:
> >  
> >  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
> >  	if (n < 0) {
> > -		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
> > +		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
> > +			err_perror("Error receiving from QEMU socket");  
> 
> Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun
> and perhaps others also use this path (in fact, the whole problem was
> reported as part of the libkrun integration).
> 
> Maybe it's obvious to users anyway, but this might seriously confuse
> somebody who's using e.g. krun on Asahi Linux (is QEMU running, one
> might wonder):
>   https://github.com/slp/krun#motivation
>   https://github.com/slp/krun/blob/main/crates/krun/src/net.rs
> 
> On top of that, now that we have an error message, I guess it would be
> nice to state we're resetting the connection, because it's not really
> obvious: the subsequent message from tap_sock_reset() makes it look like
> the client decided to close the connection on its own.
> 
> So I'm changing this to:
> 
> 			err("Error receiving from guest, resetting connection");

...or rather:

			err_perror("Receive error on guest connection, reset");

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] tap: Better report errors receiving from QEMU socket
  2024-07-26 11:25   ` Stefano Brivio
  2024-07-26 11:50     ` Stefano Brivio
@ 2024-07-26 12:02     ` David Gibson
  1 sibling, 0 replies; 18+ messages in thread
From: David Gibson @ 2024-07-26 12:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3075 bytes --]

On Fri, Jul 26, 2024 at 01:25:08PM +0200, Stefano Brivio wrote:
> On Fri, 26 Jul 2024 17:20:27 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > If we get an error on recv() from the QEMU socket, we currently don't
> > print any kind of error.  Although this can happen in a non-fatal situation
> > such as a guest restarting, it's unusual enough that we realy should report
> > something for debugability.
> > 
> > Add an error message in this case.  Also always report when the qemu
> > connection closes for any reason, not just when it will cause us to exit
> > (--one-off).
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tap.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 44bd4445..a2ae7c2a 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
> >   */
> >  static void tap_sock_reset(struct ctx *c)
> >  {
> > -	if (c->one_off) {
> > -		info("Client closed connection, exiting");
> > +	info("Client connection closed%s", c->one_off ? ", exiting" : "");
> > +
> > +	if (c->one_off)
> >  		exit(EXIT_SUCCESS);
> > -	}
> >  
> >  	/* Close the connected socket, wait for a new connection */
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
> > @@ -1005,8 +1005,10 @@ redo:
> >  
> >  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
> >  	if (n < 0) {
> > -		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
> > +		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
> > +			err_perror("Error receiving from QEMU socket");
> 
> Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun
> and perhaps others also use this path (in fact, the whole problem was
> reported as part of the libkrun integration).

Fair point.  I'm not sure what term to use to describe specifically a
socket using this qemu-originated protocol, though.

> Maybe it's obvious to users anyway, but this might seriously confuse
> somebody who's using e.g. krun on Asahi Linux (is QEMU running, one
> might wonder):
>   https://github.com/slp/krun#motivation
>   https://github.com/slp/krun/blob/main/crates/krun/src/net.rs
> 
> On top of that, now that we have an error message, I guess it would be
> nice to state we're resetting the connection, because it's not really
> obvious: the subsequent message from tap_sock_reset() makes it look like
> the client decided to close the connection on its own.

That's why I changed the message in tap_sock_reset() from "client
closed connection" to "client connection closed".

> 
> So I'm changing this to:
> 
> 			err("Error receiving from guest, resetting connection");
> 
> ...if you see an issue with it, I'll send another patch.

Ok.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26  8:00   ` Stefano Brivio
  2024-07-26 10:44     ` Stefano Brivio
@ 2024-07-26 12:12     ` David Gibson
  2024-07-26 13:25       ` Stefano Brivio
  1 sibling, 1 reply; 18+ messages in thread
From: David Gibson @ 2024-07-26 12:12 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
> On Fri, 26 Jul 2024 17:20:29 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we set EPOLLET (edge trigger) on the epoll flags for the
> > connected Qemu Unix socket.  It's not clear that there's a reason for
> > doing this: for TCP sockets we need to use EPOLLET, because we leave data
> > in the socket buffers for our flow control handling.  That consideration
> > doesn't apply to the way we handle the qemu socket however.
> 
> It significantly decreases epoll_wait() overhead on sustained data
> transfers, because we can read multiple TAP_BUF_SIZE buffers at a time
> instead of just one.

That's a reason to keep the loop, but not EPOLLET itself, AFAICT.  I'd
be happy enough to put the loop back in as an optimization (although,
I'd prefer to avoid the goto).

> I can check that now again with current QEMU and kernel versions, plus
> several fundamental changes in buffer handling, but I don't see a real
> reason why this shouldn't have changed meanwhile.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket
  2024-07-26 11:39   ` Stefano Brivio
@ 2024-07-26 12:33     ` David Gibson
  0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2024-07-26 12:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2704 bytes --]

On Fri, Jul 26, 2024 at 01:39:13PM +0200, Stefano Brivio wrote:
> On Fri, 26 Jul 2024 17:20:31 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Because the Unix socket to qemu is a stream socket, we have no guarantee
> > of where the boundaries between recv() calls will lie.  Typically they
> > will lie on frame boundaries, because that's how qemu will send then, but
> > we can't rely on it.
> > 
> > Currently we handle this case by detecting when we have received a partial
> > frame and performing a blocking recv() to get the remainder, and only then
> > processing the frames. Change it so instead we save the partial frame
> > persistently and include it as the first thing processed next time we
> > receive data from the socket.  This handles a number of (unlikely) cases
> > which previously would not be dealt with correctly:
> > 
> > * If qemu sent a partial frame then waited some time before sending the
> >   remainder, previously we could block here for an unacceptably long time
> > * If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
> >   doing the partial frame handling, which would put us out of sync with
> >   the stream from qemu
> > * If a the blocking recv() only received some of the remainder of the
> >   frame, not all of it, we'd return leaving us out of sync with the
> >   stream again
> > 
> > Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
> > is probably acceptable because it's an unlikely case in practice.  If
> > necessary we could mitigate this by using a true ring buffer.
> 
> I don't think that that memmove() is a problem if we have a single
> recv(), even if it happens to be one memmove() for every recv() (guest
> filling up the buffer, common in throughput tests and bulk transfers),
> because it's very small in relative terms anyway.
> 
> I think the ringbuffer would be worth it with multiple recv() calls per
> epoll wakeup, with EPOLLET.

So first, as noted on the earlier patch, I don't think multiple
recv()s per wakeup requires EPOLLET, though the reverse is true.

Regardless, AFAICT the proportion of memmove()s to data received would
not vary regardless of whether we do multiple recv()s per wakeup or
the same number of recv()s split across multiple wakeups.

Of course, if the recv()s line up with frame boundaries, as we expect,
then it doesn't matter anyway, since we won't get partial frames and
we won't need memmove()s.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler
  2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
                   ` (4 preceding siblings ...)
  2024-07-26  7:20 ` [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket David Gibson
@ 2024-07-26 13:19 ` Stefano Brivio
  5 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 13:19 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 17:20:26 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> A long time ago Matej Hrica pointed out a possible buffer overrun when
> receiving data from the qemu socket.  Stefano recently proposed a fix
> for this, but I don't think it's quite right.  This series is a
> different approach to fixing that problem and a few adjacent ones.
> 
> David Gibson (5):
>   tap: Better report errors receiving from QEMU socket
>   tap: Don't attempt to carry on if we get a bad frame length from qemu
>   tap: Don't use EPOLLET on Qemu sockets
>   tap: Correctly handle frames of odd length
>   tap: Improve handling of partially received frames on qemu socket

Applied. I'm a bit nervous about 3/5 but anyway the series as a whole is
clearly better than the alternative.

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26 12:12     ` David Gibson
@ 2024-07-26 13:25       ` Stefano Brivio
  2024-07-29  1:15         ` David Gibson
  0 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2024-07-26 13:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 26 Jul 2024 22:12:27 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
> > On Fri, 26 Jul 2024 17:20:29 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Currently we set EPOLLET (edge trigger) on the epoll flags for the
> > > connected Qemu Unix socket.  It's not clear that there's a reason for
> > > doing this: for TCP sockets we need to use EPOLLET, because we leave data
> > > in the socket buffers for our flow control handling.  That consideration
> > > doesn't apply to the way we handle the qemu socket however.  
> > 
> > It significantly decreases epoll_wait() overhead on sustained data
> > transfers, because we can read multiple TAP_BUF_SIZE buffers at a time
> > instead of just one.  
> 
> That's a reason to keep the loop, but not EPOLLET itself, AFAICT.  I'd
> be happy enough to put the loop back in as an optimization (although,
> I'd prefer to avoid the goto).

True, we could actually have that loop back without EPOLLET.

But the reason why I added EPOLLET, despite the resulting complexity,
was surely increased overhead without it.

I can't remember (and unfortunately I didn't write this in that commit
message from 2021) exactly how that looked like, if we had spurious or
more frequent wake-ups or what else. Maybe that was a side effect of
something that's fixed or otherwise changed now, but still we should
give this a pass with perf(1) before we try to optimise this again (if
it even needs to be optimised, that is).

-- 
Stefano


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets
  2024-07-26 13:25       ` Stefano Brivio
@ 2024-07-29  1:15         ` David Gibson
  0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2024-07-29  1:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2419 bytes --]

On Fri, Jul 26, 2024 at 03:25:38PM +0200, Stefano Brivio wrote:
> On Fri, 26 Jul 2024 22:12:27 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
> > > On Fri, 26 Jul 2024 17:20:29 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Currently we set EPOLLET (edge trigger) on the epoll flags for the
> > > > connected Qemu Unix socket.  It's not clear that there's a reason for
> > > > doing this: for TCP sockets we need to use EPOLLET, because we leave data
> > > > in the socket buffers for our flow control handling.  That consideration
> > > > doesn't apply to the way we handle the qemu socket however.  
> > > 
> > > It significantly decreases epoll_wait() overhead on sustained data
> > > transfers, because we can read multiple TAP_BUF_SIZE buffers at a time
> > > instead of just one.  
> > 
> > That's a reason to keep the loop, but not EPOLLET itself, AFAICT.  I'd
> > be happy enough to put the loop back in as an optimization (although,
> > I'd prefer to avoid the goto).
> 
> True, we could actually have that loop back without EPOLLET.
> 
> But the reason why I added EPOLLET, despite the resulting complexity,
> was surely increased overhead without it.
>
> I can't remember (and unfortunately I didn't write this in that commit
> message from 2021) exactly how that looked like, if we had spurious or
> more frequent wake-ups or what else. Maybe that was a side effect of
> something that's fixed or otherwise changed now, but still we should
> give this a pass with perf(1) before we try to optimise this again (if
> it even needs to be optimised, that is).

I think we need to understand if and why that's still the case before
putting this back in.  I can see an obvious reason why the loop might
reduce overhead, but not why the EPOLLET flag itself would.  If
anything I'd expect level triggered events to more accurately give us
wakeups only exactly when we need them.

Note also that even looping withouth EPOLLET does have its own cost
here: it potentially allows a heavy burst of traffic from qemu to
starve processing of events on other sockets.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-07-29  1:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-26  7:20 [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler David Gibson
2024-07-26  7:20 ` [PATCH 1/5] tap: Better report errors receiving from QEMU socket David Gibson
2024-07-26 11:25   ` Stefano Brivio
2024-07-26 11:50     ` Stefano Brivio
2024-07-26 12:02     ` David Gibson
2024-07-26  7:20 ` [PATCH 2/5] tap: Don't attempt to carry on if we get a bad frame length from qemu David Gibson
2024-07-26 11:26   ` Stefano Brivio
2024-07-26  7:20 ` [PATCH 3/5] tap: Don't use EPOLLET on Qemu sockets David Gibson
2024-07-26  8:00   ` Stefano Brivio
2024-07-26 10:44     ` Stefano Brivio
2024-07-26 12:12     ` David Gibson
2024-07-26 13:25       ` Stefano Brivio
2024-07-29  1:15         ` David Gibson
2024-07-26  7:20 ` [PATCH 4/5] tap: Correctly handle frames of odd length David Gibson
2024-07-26  7:20 ` [PATCH 5/5] tap: Improve handling of partially received frames on qemu socket David Gibson
2024-07-26 11:39   ` Stefano Brivio
2024-07-26 12:33     ` David Gibson
2024-07-26 13:19 ` [PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).