public inbox for passt-dev@passt.top
* [PATCH v2 00/12] Cleanups to packet pool handling and sizing
@ 2024-12-20  8:35 David Gibson
  2024-12-20  8:35 ` [PATCH v2 01/12] test focus David Gibson
                   ` (12 more replies)
  0 siblings, 13 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

This... is not any of the things I said I would be working on.  I can
only say that a herd of very hairy yaks led me astray.  Looking at bug
66 I spotted some problems with our handling of MTUs / maximum frame
sizes.  Looking at that I found some weirdness and some real, if
minor, bugs in the sizing and handling of the packet pools.

Changes in v2:
 * Stefano convinced me that packet_check_range() is still worthwhile.
   * So don't remove it... but in looking at it I spotted various
     flaws in the checks, so address those in a number of new patches.

David Gibson (12):
  test focus
  hack: stop on fail, but not perf fail
  make passt dumpable
  packet: Use flexible array member in struct pool
  packet: Don't pass start and offset separately to
    packet_check_range()
  packet: Don't hard code maximum packet size to UINT16_MAX
  packet: Remove unhelpful packet_get_try() macro
  util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
  packet: Distinguish severities of different packet_{add,get}_do()
    errors
  packet: Move packet length checks into packet_check_range()
  tap: Don't size pool_tap[46] for the maximum number of packets
  packet: More cautious checks to avoid pointer arithmetic UB

 dhcpv6.c      |   2 +-
 ip.c          |   2 +-
 isolation.c   |   2 +-
 packet.c      | 106 ++++++++++++++++++++++----------------------------
 packet.h      |  19 ++++++---
 passt.h       |   2 -
 tap.c         |  18 +++++++--
 tap.h         |   3 +-
 test/lib/term |   1 +
 test/lib/test |   4 +-
 test/run      |  38 +++++++++---------
 util.c        |  19 +++++++++
 util.h        |  25 +++++-------
 vu_common.c   |  34 ++++++++++------
 14 files changed, 153 insertions(+), 122 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 01/12] test focus
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 02/12] hack: stop on fail, but not perf fail David Gibson
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

---
 test/run | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/test/run b/test/run
index f188d8ea..fc710475 100755
--- a/test/run
+++ b/test/run
@@ -81,9 +81,9 @@ run() {
 	test passt/shutdown
 	teardown pasta
 
-	setup pasta_options
-	test pasta_options/log_to_file
-	teardown pasta_options
+	#setup pasta_options
+	#test pasta_options/log_to_file
+	#teardown pasta_options
 
 	setup build
 	test pasta_podman/bats
@@ -132,24 +132,24 @@ run() {
 
 	VALGRIND=0
 	VHOST_USER=0
-	setup passt_in_ns
-	test passt/ndp
-	test passt_in_ns/dhcp
-	test perf/passt_tcp
-	test perf/passt_udp
-	test perf/pasta_tcp
-	test perf/pasta_udp
-	test passt_in_ns/shutdown
-	teardown passt_in_ns
+	#setup passt_in_ns
+	#test passt/ndp
+	#test passt_in_ns/dhcp
+	#test perf/passt_tcp
+	#test perf/passt_udp
+	#test perf/pasta_tcp
+	#test perf/pasta_udp
+	#test passt_in_ns/shutdown
+	#teardown passt_in_ns
 
 	VHOST_USER=1
-	setup passt_in_ns
-	test passt_vu/ndp
-	test passt_vu_in_ns/dhcp
-	test perf/passt_vu_tcp
-	test perf/passt_vu_udp
-	test passt_vu_in_ns/shutdown
-	teardown passt_in_ns
+	#setup passt_in_ns
+	#test passt_vu/ndp
+	#test passt_vu_in_ns/dhcp
+	#test perf/passt_vu_tcp
+	#test perf/passt_vu_udp
+	#test passt_vu_in_ns/shutdown
+	#teardown passt_in_ns
 
 	# TODO: Make those faster by at least pre-installing gcc and make on
 	# non-x86 images, then re-enable.
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 02/12] hack: stop on fail, but not perf fail
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
  2024-12-20  8:35 ` [PATCH v2 01/12] test focus David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 03/12] make passt dumpable David Gibson
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

---
 test/lib/term | 1 +
 test/lib/test | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/test/lib/term b/test/lib/term
index ed690de8..81191714 100755
--- a/test/lib/term
+++ b/test/lib/term
@@ -413,6 +413,7 @@ info_failed() {
 
 	[ ${FAST} -eq 1 ] || status_bar_blink
 
+	exit 1
 	pause_continue \
 		"Press any key to pause test session"		\
 		"Resuming in "					\
diff --git a/test/lib/test b/test/lib/test
index e6726bea..91729af7 100755
--- a/test/lib/test
+++ b/test/lib/test
@@ -101,7 +101,7 @@ test_one_line() {
 		IFS="${__ifs}"
 		;;
 	"test")
-		[ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
+		# [ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
 		[ ${TEST_ONE_nok} -eq 1 ] && status_test_fail
 		[ ${TEST_ONE_nok} -eq 0 ] && status_test_ok
 
@@ -374,7 +374,7 @@ test_one() {
 	[ ${DEMO} -eq 1 ] && return
 
 	[ ${TEST_ONE_skip} -eq 1 ] && status_test_skip && return
-	[ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
+	# [ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
 	[ ${TEST_ONE_nok} -eq 0 ] && status_test_ok || status_test_fail
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 03/12] make passt dumpable
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
  2024-12-20  8:35 ` [PATCH v2 01/12] test focus David Gibson
  2024-12-20  8:35 ` [PATCH v2 02/12] hack: stop on fail, but not perf fail David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 04/12] packet: Use flexible array member in struct pool David Gibson
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

---
 isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/isolation.c b/isolation.c
index c944fb35..df58bb87 100644
--- a/isolation.c
+++ b/isolation.c
@@ -377,7 +377,7 @@ void isolate_postfork(const struct ctx *c)
 {
 	struct sock_fprog prog;
 
-	prctl(PR_SET_DUMPABLE, 0);
+//	prctl(PR_SET_DUMPABLE, 0);
 
 	switch (c->mode) {
 	case MODE_PASST:
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 04/12] packet: Use flexible array member in struct pool
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (2 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 03/12] make passt dumpable David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 05/12] packet: Don't pass start and offset separately to packet_check_range() David Gibson
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros.  However, we already require C11 which
includes flexible array members, so we can do better.
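
As an illustration of the aliasing the commit message refers to (a sketch
only, not part of this patch; the real declarations go through the
PACKET_POOL_*() macros in packet.h):

	/* Generic view, as used by the packet_*() helpers; the length of
	 * pkt[] is only known at runtime */
	struct pool {
		char *buf;
		size_t buf_size;
		size_t size;
		size_t count;
		struct iovec pkt[];	/* previously struct iovec pkt[1] */
	};

	/* Sized view with the same leading layout, roughly what
	 * PACKET_POOL_DECL(name, 256, buf) expands to */
	struct pool_example_t {
		char *buf;
		size_t buf_size;
		size_t size;
		size_t count;
		struct iovec pkt[256];
	};

The sized objects are accessed through struct pool pointers by the generic
helpers.  With pkt[1], indexing pkt[idx] for idx > 0 through that alias
meant stepping past the declared bound of a one-element array; a flexible
array member states the runtime-sized-array intent directly.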

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packet.h b/packet.h
index 3f70e949..85ee5508 100644
--- a/packet.h
+++ b/packet.h
@@ -21,7 +21,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct iovec pkt[1];
+	struct iovec pkt[];
 };
 
 int vu_packet_check_range(void *buf, size_t offset, size_t len,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 05/12] packet: Don't pass start and offset separately to packet_check_range()
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (3 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 04/12] packet: Use flexible array member in struct pool David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX David Gibson
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool.  That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.

However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in.  We never use these separately, only as
(start + offset).  Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c    | 36 +++++++++++++++++++-----------------
 packet.h    |  3 +--
 vu_common.c | 11 ++++-------
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/packet.c b/packet.c
index 03a11e6a..0330b548 100644
--- a/packet.c
+++ b/packet.c
@@ -23,23 +23,22 @@
 #include "log.h"
 
 /**
- * packet_check_range() - Check if a packet memory range is valid
+ * packet_check_range() - Check if a memory range is valid for a pool
  * @p:		Packet pool
- * @offset:	Offset of data range in packet descriptor
+ * @ptr:	Start of desired data range
  * @len:	Length of desired data range
- * @start:	Start of the packet descriptor
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
  * Return: 0 if the range is valid, -1 otherwise
  */
-static int packet_check_range(const struct pool *p, size_t offset, size_t len,
-			      const char *start, const char *func, int line)
+static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
+			      const char *func, int line)
 {
 	if (p->buf_size == 0) {
 		int ret;
 
-		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
+		ret = vu_packet_check_range((void *)p->buf, ptr, len);
 
 		if (ret == -1)
 			trace("cannot find region, %s:%i", func, line);
@@ -47,16 +46,16 @@ static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 		return ret;
 	}
 
-	if (start < p->buf) {
-		trace("packet start %p before buffer start %p, "
-		      "%s:%i", (void *)start, (void *)p->buf, func, line);
+	if (ptr < p->buf) {
+		trace("packet range start %p before buffer start %p, %s:%i",
+		      (void *)ptr, (void *)p->buf, func, line);
 		return -1;
 	}
 
-	if (start + len + offset > p->buf + p->buf_size) {
-		trace("packet offset plus length %zu from size %zu, "
-		      "%s:%i", start - p->buf + len + offset,
-		      p->buf_size, func, line);
+	if (ptr + len > p->buf + p->buf_size) {
+		trace("packet range end %p after buffer end %p, %s:%i",
+		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
+		      func, line);
 		return -1;
 	}
 
@@ -81,7 +80,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (packet_check_range(p, 0, len, start, func, line))
+	if (packet_check_range(p, start, len, func, line))
 		return;
 
 	if (len > UINT16_MAX) {
@@ -110,6 +109,8 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		    size_t len, size_t *left, const char *func, int line)
 {
+	char *ptr;
+
 	if (idx >= p->size || idx >= p->count) {
 		if (func) {
 			trace("packet %zu from pool size: %zu, count: %zu, "
@@ -135,14 +136,15 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
-			       func, line))
+	ptr = (char *)p->pkt[idx].iov_base + offset;
+
+	if (packet_check_range(p, ptr, len, func, line))
 		return NULL;
 
 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
 
-	return (char *)p->pkt[idx].iov_base + offset;
+	return ptr;
 }
 
 /**
diff --git a/packet.h b/packet.h
index 85ee5508..bdc07fef 100644
--- a/packet.h
+++ b/packet.h
@@ -24,8 +24,7 @@ struct pool {
 	struct iovec pkt[];
 };
 
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start);
+int vu_packet_check_range(void *buf, const char *ptr, size_t len);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 299b5a32..531f8786 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -22,14 +22,12 @@
  * vu_packet_check_range() - Check if a given memory zone is contained in
  * 			     a mapped guest memory region
  * @buf:	Array of the available memory regions
- * @offset:	Offset of data range in packet descriptor
+ * @ptr:	Start of desired data range
  * @size:	Length of desired data range
- * @start:	Start of the packet descriptor
  *
  * Return: 0 if the zone is in a mapped memory region, -1 otherwise
  */
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start)
+int vu_packet_check_range(void *buf, const char *ptr, size_t len)
 {
 	struct vu_dev_region *dev_region;
 
@@ -37,9 +35,8 @@ int vu_packet_check_range(void *buf, size_t offset, size_t len,
 		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
 		char *m = (char *)(uintptr_t)dev_region->mmap_addr;
 
-		if (m <= start &&
-		    start + offset + len <= m + dev_region->mmap_offset +
-					       dev_region->size)
+		if (m <= ptr &&
+		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
 			return 0;
 	}
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (4 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 05/12] packet: Don't pass start and offset separately to packet_check_range() David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2025-01-01 21:54   ` Stefano Brivio
  2024-12-20  8:35 ` [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro David Gibson
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

We verify that every packet we store in a pool - and every partial packet
we retrieve from it - has a length no longer than UINT16_MAX.  This
originated in the older packet pool implementation, which stored packet
lengths in a uint16_t.  Now that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out-of-range length due to a bug
elsewhere.

However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
stored in the pool is in fact an entire frame including both L2 and any
backend specific headers.  We can exceed this in passt mode, even with the
default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
4 bytes of qemu stream length header = 65538 bytes.

Introduce our own define for the maximum length of a packet in the pool and
set it slightly larger, allowing 128 bytes for L2 and/or other backend
specific headers.  We'll use different amounts of that depending on the
tap backend, but since this is just a sanity check, the bound doesn't need
to be 100% tight.
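
To make the bound concrete (arithmetic only, not part of the patch;
assuming IP_MAX_MTU is 65535 and an 8-byte unsigned long):

	largest frame we expect (passt mode, default MTU):
		65520 + 14 + 4                           = 65538 bytes
	PACKET_MAX_LEN:
		ROUND_UP(65535 + 128, 8) = ROUND_UP(65663, 8) = 65664 bytes

so the sanity-check limit comfortably covers the largest legitimate frame.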

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c | 4 ++--
 packet.h | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/packet.c b/packet.c
index 0330b548..bcac0375 100644
--- a/packet.c
+++ b/packet.c
@@ -83,7 +83,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	if (packet_check_range(p, start, len, func, line))
 		return;
 
-	if (len > UINT16_MAX) {
+	if (len > PACKET_MAX_LEN) {
 		trace("add packet length %zu, %s:%i", len, func, line);
 		return;
 	}
@@ -119,7 +119,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > UINT16_MAX) {
+	if (len > PACKET_MAX_LEN) {
 		if (func) {
 			trace("packet data length %zu, %s:%i",
 			      len, func, line);
diff --git a/packet.h b/packet.h
index bdc07fef..05d37695 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,13 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+/* Maximum size of a single packet in a pool, including all headers.  Sized to
+ * allow a maximum size IP datagram (65535) plus some extra space for L2 and
+ * backend specific headers.  This is just for sanity checking, so doesn't need
+ * to be a tight limit.
+ */
+#define PACKET_MAX_LEN	ROUND_UP(IP_MAX_MTU + 128, sizeof(unsigned long))
+
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (5 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2025-01-01 21:54   ` Stefano Brivio
  2024-12-20  8:35 ` [PATCH v2 08/12] util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers David Gibson
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Two places in the code use the packet_get_try() variant on packet_get().
The difference is that packet_get_try() passes a NULL 'func' to
packet_get_do(), which suppresses log messages.  The places we use this
are ones where we expect to sometimes have a failure retrieving the packet
range, even in normal cases.  So, suppressing the log messages seems like
it makes sense, except:

1) It suppresses log messages on all errors.  We (somewhat) expect to hit
   the case where the requested range is not within the received packet.
   However, it also suppresses messages in cases where the requested packet
   index doesn't exist, the requested range has a nonsensical length or
   doesn't lie in even the right vague part of memory.  None of those
   latter cases are expected, and the message would be helpful if we ever
   actually hit them.

2) The suppressed messages aren't actually that disruptive.  For the case
   in ip.c, we'd log only if we run out of IPv6 packet before reaching a
   (non-option) L4 header.  That shouldn't happen in normal operation, so
   getting a message (at trace level) is not unreasonable.
   For the case in dhcpv6.c we do suppress a message every time we look for
   but don't find a specific DHCPv6 option.  That can happen in perfectly
   ok cases but, again, these are trace() level messages and DHCPv6
   transactions aren't that common.  Suppressing the message for this case
   doesn't seem worth the drawbacks of (1).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 dhcpv6.c |  2 +-
 ip.c     |  2 +-
 packet.c | 18 +++++-------------
 packet.h |  3 ---
 4 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/dhcpv6.c b/dhcpv6.c
index 0523bba8..c0d05131 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
 	if (!*offset)
 		*offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
 
-	while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) {
+	while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) {
 		unsigned int opt_len = ntohs(o->l) + sizeof(*o);
 
 		if (ntohs(o->l) > left)
diff --git a/ip.c b/ip.c
index 2cc7f654..e391f81b 100644
--- a/ip.c
+++ b/ip.c
@@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
 	if (!IPV6_NH_OPT(nh))
 		goto found;
 
-	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
+	while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) {
 		nh = o->nexthdr;
 		hdrlen = (o->hdrlen + 1) * 8;
 
diff --git a/packet.c b/packet.c
index bcac0375..c921aa15 100644
--- a/packet.c
+++ b/packet.c
@@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 	char *ptr;
 
 	if (idx >= p->size || idx >= p->count) {
-		if (func) {
-			trace("packet %zu from pool size: %zu, count: %zu, "
-			      "%s:%i", idx, p->size, p->count, func, line);
-		}
+		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
+		      idx, p->size, p->count, func, line);
 		return NULL;
 	}
 
 	if (len > PACKET_MAX_LEN) {
-		if (func) {
-			trace("packet data length %zu, %s:%i",
-			      len, func, line);
-		}
+		trace("packet data length %zu, %s:%i", len, func, line);
 		return NULL;
 	}
 
 	if (len + offset > p->pkt[idx].iov_len) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %zu, "
-			      "%s:%i", len, offset, p->pkt[idx].iov_len,
-			      func, line);
-		}
+		trace("data length %zu, offset %zu from length %zu, %s:%i",
+		      len, offset, p->pkt[idx].iov_len, func, line);
 		return NULL;
 	}
 
diff --git a/packet.h b/packet.h
index 05d37695..f95cda08 100644
--- a/packet.h
+++ b/packet.h
@@ -45,9 +45,6 @@ void pool_flush(struct pool *p);
 #define packet_get(p, idx, offset, len, left)				\
 	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
 
-#define packet_get_try(p, idx, offset, len, left)			\
-	packet_get_do(p, idx, offset, len, left, NULL, 0)
-
 #define PACKET_POOL_DECL(_name, _size, _buf)				\
 struct _name ## _t {							\
 	char *buf;							\
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 08/12] util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (6 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,get}_do() errors David Gibson
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

We already have the ASSERT() macro which will abort() passt based on a
condition.  It always has a fixed error message based on its location and
the asserted expression.  We have some upcoming cases where we want to
customise the message when hitting an assert.

Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.
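
For illustration, call sites (sketch only, variable names are placeholders)
can then attach their own context to the failure message:

	ASSERT_WITH_MSG(idx < p->count,
			"packet %zu from pool of %zu packets", idx, p->count);

	if (fd < 0)
		abort_with_msg("unexpected invalid fd %i at %s:%i",
			       fd, __func__, __LINE__);

Unlike plain ASSERT(), the message is supplied by the caller instead of
being derived from the asserted expression.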

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 util.c | 19 +++++++++++++++++++
 util.h | 25 ++++++++++---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/util.c b/util.c
index 11973c44..a62c255b 100644
--- a/util.c
+++ b/util.c
@@ -837,3 +837,22 @@ void raw_random(void *buf, size_t buflen)
 	if (random_read < buflen)
 		die("Unexpected EOF on random data source");
 }
+
+/**
+ * abort_with_msg() - Print error message and abort
+ * @fmt:	Format string
+ * @...:	Format parameters
+ */
+void abort_with_msg(const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	vlogmsg(true, false, LOG_CRIT, fmt, ap);
+	va_end(ap);
+
+	/* This may actually cause a SIGSYS instead of SIGABRT, due to seccomp,
+	 * but that will still get the job done.
+	 */
+	abort();
+}
diff --git a/util.h b/util.h
index 3fa1d125..020e75a5 100644
--- a/util.h
+++ b/util.h
@@ -67,27 +67,22 @@
 #define STRINGIFY(x)	#x
 #define STR(x)		STRINGIFY(x)
 
-#ifdef CPPCHECK_6936
+void abort_with_msg(const char *fmt, ...)
+	__attribute__((format(printf, 1, 2), noreturn));
+
 /* Some cppcheck versions get confused by aborts inside a loop, causing
  * it to give false positive uninitialised variable warnings later in
  * the function, because it doesn't realise the non-initialising path
  * already exited.  See https://trac.cppcheck.net/ticket/13227
+ *
+ * Therefore, avoid using the usual do while wrapper we use to force the macro
+ * to act like a single statement requiring a ';'.
  */
-#define ASSERT(expr)		\
-	((expr) ? (void)0 : abort())
-#else
+#define ASSERT_WITH_MSG(expr, ...)					\
+	((expr) ? (void)0 : abort_with_msg(__VA_ARGS__))
 #define ASSERT(expr)							\
-	do {								\
-		if (!(expr)) {						\
-			err("ASSERTION FAILED in %s (%s:%d): %s",	\
-			    __func__, __FILE__, __LINE__, STRINGIFY(expr)); \
-			/* This may actually SIGSYS, due to seccomp,	\
-			 * but that will still get the job done		\
-			 */						\
-			abort();					\
-		}							\
-	} while (0)
-#endif
+	ASSERT_WITH_MSG((expr), "ASSERTION FAILED in %s (%s:%d): %s",	\
+			__func__, __FILE__, __LINE__, STRINGIFY(expr))
 
 #ifdef P_tmpdir
 #define TMPDIR		P_tmpdir
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,get}_do() errors
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (7 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 08/12] util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2025-01-01 21:54   ` Stefano Brivio
  2024-12-20  8:35 ` [PATCH v2 10/12] packet: Move packet length checks into packet_check_range() David Gibson
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

packet_add() and packet_get() can fail for a number of reasons, and we
currently treat them all basically the same: we log a trace() level message
and for packet_get() we return NULL.  However, the causes of error are
quite different and suggest different handling:

1) If we run out of space in the pool on add, that's (rightly) non-fatal,
   but is an unusual situation which we might want to know about.  Promote
   its message to debug() level.

2) All packet_check_range() errors and the checks on packet length indicate
   a serious problem.  Due to either a bug in calling code, or some sort
   of memory-clobbering, we've been given addresses or sizes of packets
   that are nonsensical.  If things are this bad, we shouldn't try to carry
   on; replace these checks with asserts.

3) Requesting a packet index that doesn't exist in the pool in packet_get()
   indicates a bug in the caller - it should always check the index first.
   Replace this with an assert.

4) On packet_get(), requesting a section of the packet beyond its bounds is
   already handled correctly.  This happens routinely from some callers when
   they probe to see if certain parts of the packet are present.  At worst it
   indicates that the guest has generated a malformed packet, which we
   should quietly drop as we already do.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}
 
-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;
 
 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}
 
-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);
 
-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);
 
 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;
 
-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);
 
 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 
 	ptr = (char *)p->pkt[idx].iov_base + offset;
 
-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);
 
 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };
 
-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;
 
@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
 
 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}
 
-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
+		(void *)ptr, len, func, line);
 }
 
 /**
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 10/12] packet: Move packet length checks into packet_check_range()
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (8 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,get}_do() errors David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  8:35 ` [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets David Gibson
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Both packet_add_do() and packet_get_do() have a check on the given length,
essentially sanity checking it before validating that it's in an expected
memory region.  This can be folded into packet_check_range() which performs
similar checks for both functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/packet.c b/packet.c
index 24f12448..9e0e6555 100644
--- a/packet.c
+++ b/packet.c
@@ -37,6 +37,10 @@ static void packet_check_range(const struct pool *p,
 			       const char *ptr, size_t len,
 			       const char *func, int line)
 {
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet_check_range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);
+
 	if (p->buf_size == 0) {
 		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
 		return;
@@ -72,10 +76,6 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 
 	packet_check_range(p, start, len, func, line);
 
-	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
-			"add packet length %zu (max %zu), %s:%i",
-			len, PACKET_MAX_LEN, func, line);
-
 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
 
@@ -102,9 +102,6 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
 			"packet %zu from pool size: %zu, count: %zu, %s:%i",
 			idx, p->size, p->count, func, line);
-	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
-			"packet range length %zu (max %zu), %s:%i",
-			len, PACKET_MAX_LEN, func, line);
 
 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (9 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 10/12] packet: Move packet length checks into packet_check_range() David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2025-01-01 21:54   ` Stefano Brivio
  2024-12-20  8:35 ` [PATCH v2 12/12] packet: More cautious checks to avoid pointer arithmetic UB David Gibson
  2024-12-20  9:00 ` [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
  12 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But, we don't enforce that L2 frames
are at least ETH_ZLEN when we receive them from the tap backend, and since
we're dealing with virtual interfaces we don't have the physical Ethernet
limitations requiring that length.  Indeed it is possible to generate a
legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
the 'pasta' backend is only 42 bytes long).
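(That 42 bytes is just the bare headers with no payload: 14 bytes of
Ethernet header + 20 bytes of IPv4 header + 8 bytes of UDP header.)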

It's also unclear if this limit is sufficient for vhost-user, which isn't
limited by the size of pkt_buf as the other modes are.

We could attempt to correct the calculation, but that would leave us with
even larger arrays, which in practice rarely accumulate more than a handful
of packets.  So, instead, put an arbitrary cap on the number of packets we
can put in a batch, and if we run out of space, process and flush the
batch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c    | 13 ++++++++++++-
 packet.h    |  3 +++
 passt.h     |  2 --
 tap.c       | 18 +++++++++++++++---
 tap.h       |  3 ++-
 vu_common.c |  3 ++-
 6 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/packet.c b/packet.c
index 9e0e6555..c3b80924 100644
--- a/packet.c
+++ b/packet.c
@@ -55,6 +55,17 @@ static void packet_check_range(const struct pool *p,
 			func, line);
 }
 
+/**
+ * pool_full() - Is a packet pool full?
+ * @p:		Pointer to packet pool
+ *
+ * Return: true if the pool is full, false if more packets can be added
+ */
+bool pool_full(const struct pool *p)
+{
+	return p->count >= p->size;
+}
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -68,7 +79,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 {
 	size_t idx = p->count;
 
-	if (idx >= p->size) {
+	if (pool_full(p)) {
 		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
diff --git a/packet.h b/packet.h
index b164f77e..8d78841b 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,8 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+#include <stdbool.h>
+
 /* Maximum size of a single packet in a pool, including all headers.  Sized to
  * allow a maximum size IP datagram (65535) plus some extra space for L2 and
  * backend specific headers.  This is just for sanity checking, so doesn't need
@@ -38,6 +40,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, const size_t idx,
 		    size_t offset, size_t len, size_t *left,
 		    const char *func, int line);
+bool pool_full(const struct pool *p);
 void pool_flush(struct pool *p);
 
 #define packet_add(p, len, start)					\
diff --git a/passt.h b/passt.h
index 0dd4efa0..81b2787f 100644
--- a/passt.h
+++ b/passt.h
@@ -70,8 +70,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 #define TAP_BUF_BYTES							\
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
-#define TAP_MSGS							\
-	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
 #define PKT_BUF_BYTES		MAX(TAP_BUF_BYTES, 0)
 extern char pkt_buf		[PKT_BUF_BYTES];
diff --git a/tap.c b/tap.c
index cd32a901..4421ce4d 100644
--- a/tap.c
+++ b/tap.c
@@ -61,6 +61,8 @@
 #include "vhost_user.h"
 #include "vu_common.h"
 
+#define TAP_MSGS		256
+
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
 static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
@@ -966,8 +968,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
  * @c:		Execution context
  * @l2len:	Total L2 packet length
  * @p:		Packet buffer
+ * @now:	Current timestamp
  */
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now)
 {
 	const struct ethhdr *eh;
 
@@ -983,9 +987,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
 	switch (ntohs(eh->h_proto)) {
 	case ETH_P_ARP:
 	case ETH_P_IP:
+		if (pool_full(pool_tap4)) {
+			tap4_handler(c, pool_tap4, now);
+			pool_flush(pool_tap4);
+		}
 		packet_add(pool_tap4, l2len, p);
 		break;
 	case ETH_P_IPV6:
+		if (pool_full(pool_tap6)) {
+			tap6_handler(c, pool_tap6, now);
+			pool_flush(pool_tap6);
+		}
 		packet_add(pool_tap6, l2len, p);
 		break;
 	default:
@@ -1066,7 +1078,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		tap_add_packet(c, l2len, p);
+		tap_add_packet(c, l2len, p, now);
 
 		p += l2len;
 		n -= l2len;
@@ -1129,7 +1141,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 		    len > (ssize_t)ETH_MAX_MTU)
 			continue;
 
-		tap_add_packet(c, len, pkt_buf + n);
+		tap_add_packet(c, len, pkt_buf + n, now);
 	}
 
 	tap_handler(c, now);
diff --git a/tap.h b/tap.h
index dfbd8b9e..a1b18430 100644
--- a/tap.h
+++ b/tap.h
@@ -74,6 +74,7 @@ void tap_sock_update_pool(void *base, size_t size);
 void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now);
 
 #endif /* TAP_H */
diff --git a/vu_common.c b/vu_common.c
index 528b9b08..6f49f873 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -187,7 +187,8 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 
 		tap_add_packet(vdev->context,
 			       elem[count].out_sg[0].iov_len - hdrlen,
-			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
+			       (char *)elem[count].out_sg[0].iov_base + hdrlen,
+			       now);
 		count++;
 	}
 	tap_handler(vdev->context, now);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 12/12] packet: More cautious checks to avoid pointer arithmetic UB
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (10 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets David Gibson
@ 2024-12-20  8:35 ` David Gibson
  2024-12-20  9:00 ` [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
  12 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2024-12-20  8:35 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

packet_check_range (and vu_packet_check_range()) verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to.  However, in doing so it doesn't avoid the possibility of
an integer overflow while performing pointer arithmetic, which is UB.  In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.

To do this safely, we can't calculate the end of a memory region with
pointer addition, at least if we're treating the length as untrusted.
Instead we must work out the offset of one memory region within another
using pointer subtraction, then do integer checks against the length of
the outer region.  We then need to be careful about the order of checks so
that those integer checks can't themselves overflow.

While addressing this, correct a flaw in vu_packet_check_range() where it
only checked that the given packet piece was above the region's mmap
address, not the base of the specific section of the mmap we expect to find
it in (dev_region->mmap_addr + dev_region->mmap_offset).
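
As a sketch of the pattern (illustrative only, names simplified from the
code below):

	/* Unsafe if 'len' is untrusted: 'ptr + len' may be computed past
	 * the end of the buffer, which is undefined behaviour even before
	 * the comparison is made */
	if (ptr >= buf && ptr + len <= buf + buf_size)
		/* in range */;

	/* Safer: bound 'len' by the buffer size first, then compare the
	 * offset (a pointer subtraction within the buffer) against the
	 * space left, so neither check can overflow */
	if (ptr >= buf && len <= buf_size &&
	    (size_t)(ptr - buf) <= buf_size - len)
		/* in range */;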

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c    |  9 ++++++---
 vu_common.c | 15 +++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/packet.c b/packet.c
index c3b80924..04a7a829 100644
--- a/packet.c
+++ b/packet.c
@@ -49,9 +49,12 @@ static void packet_check_range(const struct pool *p,
 	ASSERT_WITH_MSG(ptr >= p->buf,
 			"packet range start %p before buffer start %p, %s:%i",
 			(void *)ptr, (void *)p->buf, func, line);
-	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
-			"packet range end %p after buffer end %p, %s:%i",
-			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+	ASSERT_WITH_MSG(len <= p->buf_size,
+			"packet range length %zu larger than buffer %zu, %s:%i",
+			len, p->buf_size, func, line);
+	ASSERT_WITH_MSG((size_t)(ptr - p->buf) <= p->buf_size - len,
+"packet range %p, len %zu extends beyond buffer %p, len %zu, %s:%i",
+			(void *)ptr, len, (void *)p->buf, p->buf_size,
 			func, line);
 }
 
diff --git a/vu_common.c b/vu_common.c
index 6f49f873..3f7fb29d 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -35,16 +35,23 @@ void vu_packet_check_range(void *buf, const char *ptr, size_t len,
 	struct vu_dev_region *dev_region;
 
 	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
+		uintptr_t base_addr = dev_region->mmap_addr +
+			dev_region->mmap_offset;
 		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
-		char *m = (char *)(uintptr_t)dev_region->mmap_addr;
+		const char *base = (const char *)base_addr;
 
-		if (m <= ptr &&
-		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
+		ASSERT(base_addr >= dev_region->mmap_addr);
+
+		if (len > dev_region->size)
+			continue;
+
+		if (base <= ptr &&
+		    (size_t)(ptr - base) <= dev_region->size - len)
 			return;
 	}
 
 	abort_with_msg(
-		"package range at %p, length %zd not within dev region %s:%i",
+		"packet range at %p, length %zd not within dev region, %s:%i",
 		(void *)ptr, len, func, line);
 }
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/12] Cleanups to packet pool handling and sizing
  2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
                   ` (11 preceding siblings ...)
  2024-12-20  8:35 ` [PATCH v2 12/12] packet: More cautious checks to avoid pointer arithmetic UB David Gibson
@ 2024-12-20  9:00 ` David Gibson
  2024-12-20 10:06   ` Stefano Brivio
  12 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2024-12-20  9:00 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev


On Fri, Dec 20, 2024 at 07:35:23PM +1100, David Gibson wrote:
> This... is not any of the things I said I would be working on.  I can
> only say that a herd of very hairy yaks led me astray.  Looking at bug
> 66 I spotted some problems with our handling of MTUs / maximum frame
> sizes.  Looking at that I found some weirdness and some real, if
> minor, bugs in the sizing and handling of the packet pools.
> 
> Changes in v2:
>  * Stefano convinced me that packet_check_range() is still worthwhile.
>    * So don't remove it... but in looking at it I spotted various
>      flaws in the checks, so address those in a number of new patches.
> 
> David Gibson (12):
>   test focus
>   hack: stop on fail, but not perf fail
>   make passt dumpable

Argh, oops.  Of course, 1..3 are test/debug patches that should not be
included in the series.

>   packet: Use flexible array member in struct pool
>   packet: Don't pass start and offset separately to
>     packet_check_range()
>   packet: Don't hard code maximum packet size to UINT16_MAX
>   packet: Remove unhelpful packet_get_try() macro
>   util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
>   packet: Distinguish severities of different packet_{add,git}_do()
>     errors
>   packet: Move packet length checks into packet_check_range()
>   tap: Don't size pool_tap[46] for the maximum number of packets
>   packet: More cautious checks to avoid pointer arithmetic UB
> 
>  dhcpv6.c      |   2 +-
>  ip.c          |   2 +-
>  isolation.c   |   2 +-
>  packet.c      | 106 ++++++++++++++++++++++----------------------------
>  packet.h      |  19 ++++++---
>  passt.h       |   2 -
>  tap.c         |  18 +++++++--
>  tap.h         |   3 +-
>  test/lib/term |   1 +
>  test/lib/test |   4 +-
>  test/run      |  38 +++++++++---------
>  util.c        |  19 +++++++++
>  util.h        |  25 +++++-------
>  vu_common.c   |  34 ++++++++++------
>  14 files changed, 153 insertions(+), 122 deletions(-)
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/12] Cleanups to packet pool handling and sizing
  2024-12-20  9:00 ` [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
@ 2024-12-20 10:06   ` Stefano Brivio
  0 siblings, 0 replies; 31+ messages in thread
From: Stefano Brivio @ 2024-12-20 10:06 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 20 Dec 2024 20:00:13 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Dec 20, 2024 at 07:35:23PM +1100, David Gibson wrote:
> > This... is not any of the things I said I would be working on.  I can
> > only say that a herd of very hairy yaks led me astray.  Looking at bug
> > 66 I spotted some problems with our handling of MTUs / maximum frame
> > sizes.  Looking at that I found some weirdness and some real, if
> > minor, bugs in the sizing and handling of the packet pools.
> > 
> > Changes in v2:
> >  * Stefano convinced me that packet_check_range() is still worthwhile.
> >    * So don't remove it... but in looking at it I spotted various
> >      flaws in the checks, so address those in a number of new patches.
> > 
> > David Gibson (12):
> >   test focus
> >   hack: stop on fail, but not perf fail
> >   make passt dumpable  
> 
> Argh, oops.  Of course, 1..3 are test/debug patches that should not be
> included in the series.

Yeah, it's fine, 4/12 to 12/12 apply cleanly anyway.

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
  2024-12-20  8:35 ` [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX David Gibson
@ 2025-01-01 21:54   ` Stefano Brivio
  2025-01-02  1:00     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-01 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 20 Dec 2024 19:35:29 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> We verify that every packet we store in a pool - and every partial packet
> we retrieve from it - has a length no longer than UINT16_MAX.  This
> originated in the older packet pool implementation which stored packet
> lengths in a uint16_t.  Now that packets are represented by a struct
> iovec with its size_t length, this check serves only as a sanity / security
> check that we don't have some wildly out of range length due to a bug
> elsewhere.
> 
> However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> stored in the pool is in fact an entire frame including both L2 and any
> backend specific headers.  We can exceed this in passt mode, even with the
> default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> 4 bytes of qemu stream length header = 65538 bytes.
> 
> Introduce our own define for the maximum length of a packet in the pool and
> set it slightly larger, allowing 128 bytes for L2 and/or other backend
> specific headers.  We'll use different amounts of that depending on the
> tap backend, but since this is just a sanity check, the bound doesn't need
> to be 100% tight.

I couldn't find the time to check what's the maximum amount of bytes we
can get here depending on hypervisor and interface, but if this patch
fixes an actual issue as you seem to imply, actually checking that with
QEMU and muvm would be nice.

By the way, as you mention a specific calculation, does it really make
sense to use a "good enough" value here? Can we ever exceed 65538
bytes, or can we use that as limit? It would be good to find out, while
at it.

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro
  2024-12-20  8:35 ` [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro David Gibson
@ 2025-01-01 21:54   ` Stefano Brivio
  2025-01-02  2:15     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-01 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 20 Dec 2024 19:35:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Two places in the code use the packet_get_try() variant on packet_get().
> The difference is that packet_get_try() passes a NULL 'func' to
> packet_get_do(), which suppresses log messages.  The places we use this
> are ones where we expect to sometimes have a failure retrieving the packet
> range, even in normal cases.  So, suppressing the log messages seems like
> it makes sense, except:
> 
> 1) It suppresses log messages on all errors.  We (somewhat) expect to hit
>    the case where the requested range is not within the received packet.
>    However, it also suppresses messages in cases where the requested packet
>    index doesn't exist, the requested range has a nonsensical length or
>    doesn't lie in even the right vague part of memory.  None of those
>    latter cases are expected, and the message would be helpful if we ever
>    actually hit them.
> 
> 2) The suppressed messages aren't actually that disruptive.  For the case
>    in ip.c, we'd log only if we run out of IPv6 packet before reaching a
>    (non-option) L4 header.  That shouldn't be the case in normal operation
>    so getting a message (at trace level) is not unreasonable.
>    For the case in dhcpv6.c we do suppress a message every time we look for
>    but don't find a specific DHCPv6 option.  That can happen in perfectly
>    ok cases, but, again these are trace() level and DHCPv6 transactions
>    aren't that common.  Suppressing the message for this case doesn't
>    seem worth the drawbacks of (1).

The reason why I implemented packet_get_try() is that the messages from
packet_get_do() indicate serious issues, and if I'm debugging something
by looking at traces it's not great to have messages indicating that we
hit a serious issue while we're simply validating identity associations.

It's not about the amount of logged messages, it's about the type of
message being logged and the distracting noise possibly resulting in a
substantial time waste.

About 1): dhcpv6_opt() always picks pool index 0, and the base offset
was already checked by the caller. In ipv6_l4hdr(), the index was
already validated by a previous call to packet_get(), and the starting
offset as well.

I guess the main reason to have this patch in this series is to make a
later change simpler, but I'm not sure which one, so I don't know what
to suggest as an alternative.

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  dhcpv6.c |  2 +-
>  ip.c     |  2 +-
>  packet.c | 18 +++++-------------
>  packet.h |  3 ---
>  4 files changed, 7 insertions(+), 18 deletions(-)
> 
> diff --git a/dhcpv6.c b/dhcpv6.c
> index 0523bba8..c0d05131 100644
> --- a/dhcpv6.c
> +++ b/dhcpv6.c
> @@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
>  	if (!*offset)
>  		*offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
>  
> -	while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) {
> +	while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) {
>  		unsigned int opt_len = ntohs(o->l) + sizeof(*o);
>  
>  		if (ntohs(o->l) > left)
> diff --git a/ip.c b/ip.c
> index 2cc7f654..e391f81b 100644
> --- a/ip.c
> +++ b/ip.c
> @@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
>  	if (!IPV6_NH_OPT(nh))
>  		goto found;
>  
> -	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
> +	while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) {
>  		nh = o->nexthdr;
>  		hdrlen = (o->hdrlen + 1) * 8;
>  
> diff --git a/packet.c b/packet.c
> index bcac0375..c921aa15 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
>  	char *ptr;
>  
>  	if (idx >= p->size || idx >= p->count) {
> -		if (func) {
> -			trace("packet %zu from pool size: %zu, count: %zu, "
> -			      "%s:%i", idx, p->size, p->count, func, line);
> -		}
> +		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> +		      idx, p->size, p->count, func, line);

If you change this, the documentation of the arguments for
packet_get_do() should be updated as well.

>  		return NULL;
>  	}
>  
>  	if (len > PACKET_MAX_LEN) {
> -		if (func) {
> -			trace("packet data length %zu, %s:%i",
> -			      len, func, line);
> -		}
> +		trace("packet data length %zu, %s:%i", len, func, line);
>  		return NULL;
>  	}
>  
>  	if (len + offset > p->pkt[idx].iov_len) {
> -		if (func) {
> -			trace("data length %zu, offset %zu from length %zu, "
> -			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> -			      func, line);
> -		}
> +		trace("data length %zu, offset %zu from length %zu, %s:%i",
> +		      len, offset, p->pkt[idx].iov_len, func, line);
>  		return NULL;
>  	}
>  
> diff --git a/packet.h b/packet.h
> index 05d37695..f95cda08 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -45,9 +45,6 @@ void pool_flush(struct pool *p);
>  #define packet_get(p, idx, offset, len, left)				\
>  	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
>  
> -#define packet_get_try(p, idx, offset, len, left)			\
> -	packet_get_do(p, idx, offset, len, left, NULL, 0)
> -
>  #define PACKET_POOL_DECL(_name, _size, _buf)				\
>  struct _name ## _t {							\
>  	char *buf;							\

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors
  2024-12-20  8:35 ` [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors David Gibson
@ 2025-01-01 21:54   ` Stefano Brivio
  2025-01-02  2:58     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-01 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 20 Dec 2024 19:35:32 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> packet_add() and packet_get() can fail for a number of reasons, and we
> currently treat them all basically the same: we log a trace() level message
> and for packet_get() we return NULL.  However the different causes of
> error are quite different and suggest different handling:
> 
> 1) If we run out of space in the pool on add, that's (rightly) non-fatal,
>    but is an unusual situation which we might want to know about.  Promote
>    its message to debug() level.
> 
> 2) All packet_check_range() errors and the checks on packet length indicate
>    a serious problem.  Due to either a bug in calling code, or some sort
>    of memory-clobbering, we've been given addresses or sizes of packets
>    that are nonsensical.  If things are this bad, we shouldn't try to carry
>    on; replace these with asserts.
> 
> 3) Requesting a packet index that doesn't exist in the pool in packet_get()
>    indicates a bug in the caller - it should always check the index first.
>    Replace this with an assert.

The reasons for 2) and 3) being trace() messages (they should probably
be debug() following the same reasoning as 1)) are graceful degradation
and security (availability).

If we have a packet triggering some sort of "memory clobbering", or an
issue in the caller, that might very well be just one packet or a few
of them.

I think it's a disservice to users to crash in that case, and it has
the potential to worsen a security flaw if this packet can be built on
purpose. It's also a disservice to package maintainers because it has
the potential to turn a harmless issue into an annoying situation with
associated time pressure and everything.

Those are not warn() calls, by the way, because we don't want to make
it too easy, in that (perhaps unlikely) case, to flood logs, and
possibly hide information due to rotation.

I suppose that the argument for asserts here is so that we discover
functional issues as soon as possible, which I understand of course,
but it relies on two assumptions that might not hold:

1. the crash will be reported, while a debug() message will go
   unnoticed.

   Well, the crash is more noticeable, but that doesn't mean that it
   will be reported. Users might/will try to hack something around it
   or try to survive with slirp4netns and all its flaws.

   On the other hand, a malfunction (say, failed connections) might be
   equally likely to be reported, along with debug() messages.

2. the check and the assertion themselves are correct. This was not the
   case for the two most recent severe issues (commit 61c0b0d0f199, bug
   #105) we discovered through assertions.

   Issues such as https://github.com/containers/podman/issues/22925, on
   the other hand, are good examples of how asserts saved us a lot
   of time and effort later on (compared to, say, "connections stop
   working after three days"). But I would argue that those are
   substantially different cases where assertions are checking
   something that's much more inherent to the implementation.

In this case, I would argue that there's no need to crash, so we
shouldn't. We can presumably carry on after that, and we're not leaking
any resource or sneakily leaving behind some condition that will cause
apparently unrelated issues at a later time.

> 4) On packet_get() requesting a section of the packet beyond its bounds is
>    correct already.  This happens routinely from some callers when they
>    probe to see if certain parts of the packet are present.  At worst it
>    indicates that the guest has generated a malformed packet, which we
>    should quietly drop as we already do.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  packet.c    | 71 ++++++++++++++++++++---------------------------------
>  packet.h    |  3 ++-
>  vu_common.c | 13 +++++++---
>  3 files changed, 38 insertions(+), 49 deletions(-)
> 
> diff --git a/packet.c b/packet.c
> index c921aa15..24f12448 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -30,37 +30,27 @@
>   * @func:	For tracing: name of calling function
>   * @line:	For tracing: caller line of function call
>   *
> - * Return: 0 if the range is valid, -1 otherwise
> + * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
> + * the pool).
>   */
> -static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
> -			      const char *func, int line)
> +static void packet_check_range(const struct pool *p,
> +			       const char *ptr, size_t len,
> +			       const char *func, int line)
>  {
>  	if (p->buf_size == 0) {
> -		int ret;
> -
> -		ret = vu_packet_check_range((void *)p->buf, ptr, len);
> -
> -		if (ret == -1)
> -			trace("cannot find region, %s:%i", func, line);
> -
> -		return ret;
> -	}
> -
> -	if (ptr < p->buf) {
> -		trace("packet range start %p before buffer start %p, %s:%i",
> -		      (void *)ptr, (void *)p->buf, func, line);
> -		return -1;
> -	}
> -
> -	if (ptr + len > p->buf + p->buf_size) {
> -		trace("packet range end %p after buffer end %p, %s:%i",
> -		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
> -		      func, line);
> -		return -1;
> +		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
> +		return;
>  	}
>  
> -	return 0;
> +	ASSERT_WITH_MSG(ptr >= p->buf,
> +			"packet range start %p before buffer start %p, %s:%i",
> +			(void *)ptr, (void *)p->buf, func, line);
> +	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
> +			"packet range end %p after buffer end %p, %s:%i",
> +			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
> +			func, line);
>  }
> +
>  /**
>   * packet_add_do() - Add data as packet descriptor to given pool
>   * @p:		Existing pool
> @@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
>  	size_t idx = p->count;
>  
>  	if (idx >= p->size) {
> -		trace("add packet index %zu to pool with size %zu, %s:%i",
> +		debug("add packet index %zu to pool with size %zu, %s:%i",
>  		      idx, p->size, func, line);
>  		return;
>  	}
>  
> -	if (packet_check_range(p, start, len, func, line))
> -		return;
> +	packet_check_range(p, start, len, func, line);
>  
> -	if (len > PACKET_MAX_LEN) {
> -		trace("add packet length %zu, %s:%i", len, func, line);
> -		return;
> -	}
> +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> +			"add packet length %zu (max %zu), %s:%i",
> +			len, PACKET_MAX_LEN, func, line);
>  
>  	p->pkt[idx].iov_base = (void *)start;
>  	p->pkt[idx].iov_len = len;
> @@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
>  {
>  	char *ptr;
>  
> -	if (idx >= p->size || idx >= p->count) {
> -		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> -		      idx, p->size, p->count, func, line);
> -		return NULL;
> -	}
> -
> -	if (len > PACKET_MAX_LEN) {
> -		trace("packet data length %zu, %s:%i", len, func, line);
> -		return NULL;
> -	}
> +	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
> +			"packet %zu from pool size: %zu, count: %zu, %s:%i",
> +			idx, p->size, p->count, func, line);
> +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> +			"packet range length %zu (max %zu), %s:%i",
> +			len, PACKET_MAX_LEN, func, line);
>  
>  	if (len + offset > p->pkt[idx].iov_len) {
>  		trace("data length %zu, offset %zu from length %zu, %s:%i",
> @@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
>  
>  	ptr = (char *)p->pkt[idx].iov_base + offset;
>  
> -	if (packet_check_range(p, ptr, len, func, line))
> -		return NULL;
> +	packet_check_range(p, ptr, len, func, line);
>  
>  	if (left)
>  		*left = p->pkt[idx].iov_len - offset - len;
> diff --git a/packet.h b/packet.h
> index f95cda08..b164f77e 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -31,7 +31,8 @@ struct pool {
>  	struct iovec pkt[];
>  };
>  
> -int vu_packet_check_range(void *buf, const char *ptr, size_t len);
> +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> +			   const char *func, int line);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/vu_common.c b/vu_common.c
> index 531f8786..528b9b08 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -24,10 +24,13 @@
>   * @buf:	Array of the available memory regions
>   * @ptr:	Start of desired data range
>   * @size:	Length of desired data range
> + * @func:	For tracing: name of calling function
> + * @line:	For tracing: caller line of function call

If those arguments are mandatory, I would drop the "For tracing"
specification.

>   *
> - * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> + * Aborts if the zone isn't in any of the device regions
>   */
> -int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> +			   const char *func, int line)
>  {
>  	struct vu_dev_region *dev_region;
>  
> @@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
>  
>  		if (m <= ptr &&
>  		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
> -			return 0;
> +			return;
>  	}
>  
> -	return -1;
> +	abort_with_msg(
> +		"package range at %p, length %zd not within dev region %s:%i",

s/package/packet/

> +		(void *)ptr, len, func, line);
>  }
>  
>  /**

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
  2024-12-20  8:35 ` [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets David Gibson
@ 2025-01-01 21:54   ` Stefano Brivio
  2025-01-02  3:46     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-01 21:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 20 Dec 2024 19:35:34 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we attempt to size pool_tap[46] so they have room for the maximum
> possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> are at least ETH_ZLEN when we receive them from the tap backend, and since
> we're dealing with virtual interfaces we don't have the physical Ethernet
> limitations requiring that length.  Indeed it is possible to generate a
> legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> the 'pasta' backend is only 42 bytes long).
> 
> It's also unclear if this limit is sufficient for vhost-user which isn't
> limited by the size of pkt_buf as the other modes are.
> 
> We could attempt to correct the calculation, but that would leave us with
> even larger arrays, which in practice rarely accumulate more than a handful
> of packets.  So, instead, put an arbitrary cap on the number of packets we
> can put in a batch, and if we run out of space, process and flush the
> batch.

I ran a few more tests with this, keeping TAP_MSGS at 256, and in
general I couldn't really see a difference in latency (especially for
UDP streams with small packets) or throughput. Figures from short
throughput tests (such as the ones from the test suite) look a bit more
variable, but I don't have any statistically meaningful data.

Then I looked into how many messages we might have in the array without
this change, and I realised that, with the throughput tests from the
suite, we very easily exceed the 256 limit.

Perhaps surprisingly we get the highest buffer counts with TCP transfers
and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
the extra system call overhead from forcibly flushing batches might
become rather relevant.

With lower MTUs, it looks like we have a lower CPU load and
transmissions are scheduled differently (resulting in smaller batches),
but I didn't really trace things.

So I start thinking that this has the *potential* to introduce a
performance regression in some cases and we shouldn't just assume that
some arbitrary 256 limit is good enough. I didn't check with perf(1),
though.

Right now that array takes, effectively, less than 100 KiB (it's ~5000
copies of struct iovec, 16 bytes each), and in theory that could be
~2.5 MiB (at 161319 items). Even if we double or triple that (let's
assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
have no practical effect anyway.

All in all, I think we shouldn't change this limit without a deeper
understanding of the practical impact. While this change doesn't bring
any practical advantage, the current behaviour is somewhat tested by
now, and a small limit isn't.

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
  2025-01-01 21:54   ` Stefano Brivio
@ 2025-01-02  1:00     ` David Gibson
  2025-01-02 21:59       ` Stefano Brivio
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2025-01-02  1:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4501 bytes --]

On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 19:35:29 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > We verify that every packet we store in a pool - and every partial packet
> > we retrieve from it - has a length no longer than UINT16_MAX.  This
> > originated in the older packet pool implementation which stored packet
> > lengths in a uint16_t.  Now that packets are represented by a struct
> > iovec with its size_t length, this check serves only as a sanity / security
> > check that we don't have some wildly out of range length due to a bug
> > elsewhere.
> > 
> > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > stored in the pool is in fact an entire frame including both L2 and any
> > backend specific headers.  We can exceed this in passt mode, even with the
> > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > 4 bytes of qemu stream length header = 65538 bytes.
> > 
> > Introduce our own define for the maximum length of a packet in the pool and
> > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > specific headers.  We'll use different amounts of that depending on the
> > tap backend, but since this is just a sanity check, the bound doesn't need
> > to be 100% tight.
> 
> I couldn't find the time to check what's the maximum amount of bytes we
> can get here depending on hypervisor and interface, but if this patch

So, it's a separate calculation for each backend type, and some of
them are pretty tricky.

For anything based on the kernel tap device it is 65535, because it
has an internal frame size limit of 65535, already including any L2
headers (it explicitly limits the MTU to 65535 - hard_header_len).
There is no "hardware" header.

For the qemu stream protocol it gets pretty complicated, because there
are multiple layers which could clamp the maximum size.  It doesn't
look like the socket protocol code itself imposes a limit beyond the
structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
which could be less for 32-bit systems).  AFAICT, it's not
theoretically impossible to have gigabyte frames with a weird virtual
NIC model... though obviously that wouldn't be IP, and probably not
even Ethernet.

Each virtual NIC could have its own limit.  I suspect that's going to
be in the vicinity of 64k.  But, I'm really struggling to figure out
what it is just for virtio-net, so I really don't want to try to
figure it out for all of them.  With a virtio-net NIC, I seem to be
able to set MTU all the way up to 65535 successfully, which implies a
maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
header) = 65553 at least.

Similar situation for vhost-user, where I'm finding it even more
inscrutable to figure out what limits are imposed at the sub-IP
levels.  At the moment the "hardware" header
(virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
packet.c layer, but we might have reasons to change that.

So, any sub-IP limits for qemu, I'm basically not managing to find.
However, we (more or less) only care about IP, which imposes a more
practical limit of: 65535 + L2 header size + "hardware" header size.

At present that maxes out at 65553, as above, but if we ever support
other L2 encapsulations, or other backend protocols with larger
"hardware" headers, that could change.

> fixes an actual issue as you seem to imply, actually checking that with
> QEMU and muvm would be nice.
> 
> By the way, as you mention a specific calculation, does it really make
> sense to use a "good enough" value here? Can we ever exceed 65538
> bytes, or can we use that as limit? It would be good to find out, while
> at it.

So, yes, I think we can exceed 65538.  But more significantly, trying
to make the limit tight here feels like a borderline layering
violation.  The packet layer doesn't really care about the frame size
as long as it's "sane".  Fwiw, in the draft changes I have improving
MTU handling, it's my intention that individual backends calculate
and/or enforce tighter limits of their own where practical, and
BUILD_ASSERT() that those fit within the packet layer's frame size
limit.
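
To make that concrete - a minimal sketch only, with hypothetical names
and the 128 bytes of slack from the patch description rather than the
final code:

	#include <stdint.h>

	/* packet.h: generous sanity bound for any frame stored in a pool */
	#define PACKET_MAX_LEN		(UINT16_MAX + 128)

	/* tap.c: tight per-backend bound, e.g. passt mode: 65535-byte IP
	 * datagram + 14-byte Ethernet header + 4-byte qemu stream length
	 * header = 65553 bytes
	 */
	#define L2_MAX_LEN_PASST	(UINT16_MAX + 14 + 4)

	/* checked at build time (BUILD_ASSERT() would work the same way) */
	_Static_assert(L2_MAX_LEN_PASST <= PACKET_MAX_LEN,
		       "backend frame bound must fit packet layer bound");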

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro
  2025-01-01 21:54   ` Stefano Brivio
@ 2025-01-02  2:15     ` David Gibson
  2025-01-02 22:00       ` Stefano Brivio
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2025-01-02  2:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 7328 bytes --]

On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 19:35:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Two places in the code use the packet_get_try() variant on packet_get().
> > The difference is that packet_get_try() passes a NULL 'func' to
> > packet_get_do(), which suppresses log messages.  The places we use this
> > are ones where we expect to sometimes have a failure retrieving the packet
> > range, even in normal cases.  So, suppressing the log messages seems like
> > it makes sense, except:
> > 
> > 1) It suppresses log messages on all errors.  We (somewhat) expect to hit
> >    the case where the requested range is not within the received packet.
> >    However, it also suppresses messages in cases where the requested packet
> >    index doesn't exist, the requested range has a nonsensical length or
> >    doesn't lie in even the right vague part of memory.  None of those
> >    latter cases are expected, and the message would be helpful if we ever
> >    actually hit them.
> > 
> > 2) The suppressed messages aren't actually that disruptive.  For the case
> >    in ip.c, we'd log only if we run out of IPv6 packet before reaching a
> >    (non-option) L4 header.  That shouldn't be the case in normal operation
> >    so getting a message (at trace level) is not unreasonable.
> >    For the case in dhcpv6.c we do suppress a message every time we look for
> >    but don't find a specific DHCPv6 option.  That can happen in perfectly
> >    ok cases, but, again these are trace() level and DHCPv6 transactions
> >    aren't that common.  Suppressing the message for this case doesn't
> >    seem worth the drawbacks of (1).
> 
> The reason why I implemented packet_get_try() is that the messages from
> packet_get_do() indicate serious issues, and if I'm debugging something
> by looking at traces it's not great to have messages indicating that we
> hit a serious issue while we're simply validating identity associations.

I'm not following your argument here.  It's exactly because (most of)
the message indicate serious issues that I don't want to suppress
them.  I don't know what you mean by "validating identity
associations".

> It's not about the amount of logged messages, it's about the type of
> message being logged and the distracting noise possibly resulting in a
> substantial time waste.
> 
> About 1): dhcpv6_opt() always picks pool index 0, and the base offset
> was already checked by the caller.

Right, but dhcpv6_opt() is called in a loop that only stops when it
returns NULL.  So, by construction the last call to dhcpv6_opt(),
which terminates the loop, _will_ have a failing call to packet_get().
At this point - at least assuming a correctly constructed packet - the
offset will point to just past the last option, which should be
exactly at the end of the packet.
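
To spell the shape out, here's a standalone toy - not the literal
dhcpv6.c code, the type and names are made up for illustration:

	#include <stddef.h>
	#include <stdint.h>
	#include <arpa/inet.h>

	struct opt_hdr {
		uint16_t t;	/* option type, network order */
		uint16_t l;	/* option data length, network order */
	};

	/* Walk options in payload[0..plen) looking for 'type'.  On a
	 * well-formed message the loop exits because the next header no
	 * longer fits, with *offset sitting exactly at plen - that final
	 * "failed" lookup is the normal exit, not an error.
	 */
	static struct opt_hdr *opt_find(char *payload, size_t plen,
					size_t *offset, uint16_t type)
	{
		while (*offset + sizeof(struct opt_hdr) <= plen) {
			struct opt_hdr *o;
			size_t opt_len;

			o = (struct opt_hdr *)(payload + *offset);
			opt_len = ntohs(o->l) + sizeof(*o);

			if (*offset + opt_len > plen)
				return NULL;		/* truncated option */

			if (ntohs(o->t) == type)
				return o;		/* found it */

			*offset += opt_len;		/* next option */
		}
		return NULL;	/* ran out of payload: *offset is at plen */
	}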

> In ipv6_l4hdr(), the index was
> already validated by a previous call to packet_get(), and the starting
> offset as well.

Not AFAICT, the initial packet_get just validates the basic IPv6
header.  The calls to packet_get_try() in the loop validate additional
IP options.  I don't think it will ever fail on a well-constructed
packet, but it could on a bogus (or truncated) packet, where the
nexthdr field indicates an option that's actually missing.

This is kind of my point: it will only trip on a badly constructed
packet, in which case I don't think we want to suppress messages.

> I guess the main reason to have this patch in this series is to make a
> later change simpler, but I'm not sure which one, so I don't know what
> to suggest as an alternative.

Mostly the next one - making more distinction between the different
error severities that can occur here.  Having to deal with the
possibility of func being NULL complicates things.

> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  dhcpv6.c |  2 +-
> >  ip.c     |  2 +-
> >  packet.c | 18 +++++-------------
> >  packet.h |  3 ---
> >  4 files changed, 7 insertions(+), 18 deletions(-)
> > 
> > diff --git a/dhcpv6.c b/dhcpv6.c
> > index 0523bba8..c0d05131 100644
> > --- a/dhcpv6.c
> > +++ b/dhcpv6.c
> > @@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
> >  	if (!*offset)
> >  		*offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
> >  
> > -	while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) {
> > +	while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) {
> >  		unsigned int opt_len = ntohs(o->l) + sizeof(*o);
> >  
> >  		if (ntohs(o->l) > left)
> > diff --git a/ip.c b/ip.c
> > index 2cc7f654..e391f81b 100644
> > --- a/ip.c
> > +++ b/ip.c
> > @@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> >  	if (!IPV6_NH_OPT(nh))
> >  		goto found;
> >  
> > -	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
> > +	while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) {
> >  		nh = o->nexthdr;
> >  		hdrlen = (o->hdrlen + 1) * 8;
> >  
> > diff --git a/packet.c b/packet.c
> > index bcac0375..c921aa15 100644
> > --- a/packet.c
> > +++ b/packet.c
> > @@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> >  	char *ptr;
> >  
> >  	if (idx >= p->size || idx >= p->count) {
> > -		if (func) {
> > -			trace("packet %zu from pool size: %zu, count: %zu, "
> > -			      "%s:%i", idx, p->size, p->count, func, line);
> > -		}
> > +		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> > +		      idx, p->size, p->count, func, line);
> 
> If you change this, the documentation of the arguments for
> packet_get_do() should be updated as well.

Fixed.  The doc for packet_add_do() also changed, where it was already wrong.

> >  		return NULL;
> >  	}
> >  
> >  	if (len > PACKET_MAX_LEN) {
> > -		if (func) {
> > -			trace("packet data length %zu, %s:%i",
> > -			      len, func, line);
> > -		}
> > +		trace("packet data length %zu, %s:%i", len, func, line);
> >  		return NULL;
> >  	}
> >  
> >  	if (len + offset > p->pkt[idx].iov_len) {
> > -		if (func) {
> > -			trace("data length %zu, offset %zu from length %zu, "
> > -			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> > -			      func, line);
> > -		}
> > +		trace("data length %zu, offset %zu from length %zu, %s:%i",
> > +		      len, offset, p->pkt[idx].iov_len, func, line);
> >  		return NULL;
> >  	}
> >  
> > diff --git a/packet.h b/packet.h
> > index 05d37695..f95cda08 100644
> > --- a/packet.h
> > +++ b/packet.h
> > @@ -45,9 +45,6 @@ void pool_flush(struct pool *p);
> >  #define packet_get(p, idx, offset, len, left)				\
> >  	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
> >  
> > -#define packet_get_try(p, idx, offset, len, left)			\
> > -	packet_get_do(p, idx, offset, len, left, NULL, 0)
> > -
> >  #define PACKET_POOL_DECL(_name, _size, _buf)				\
> >  struct _name ## _t {							\
> >  	char *buf;							\
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors
  2025-01-01 21:54   ` Stefano Brivio
@ 2025-01-02  2:58     ` David Gibson
  2025-01-02 22:00       ` Stefano Brivio
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2025-01-02  2:58 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 13146 bytes --]

On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 19:35:32 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > packet_add() and packet_get() can fail for a number of reasons, and we
> > currently treat them all basically the same: we log a trace() level message
> > and for packet_get() we return NULL.  However the different causes of
> > error are quite different and suggest different handling:
> > 
> > 1) If we run out of space in the pool on add, that's (rightly) non-fatal,
> >    but is an unusual situation which we might want to know about.  Promote
> >    its message to debug() level.
> > 
> > 2) All packet_check_range() errors and the checks on packet length indicate
> >    a serious problem.  Due to either a bug in calling code, or some sort
> >    of memory-clobbering, we've been given addresses or sizes of packets
> >    that are nonsensical.  If things are this bad, we shouldn't try to carry
> >    on; replace these with asserts.
> > 
> > 3) Requesting a packet index that doesn't exist in the pool in packet_get()
> >    indicates a bug in the caller - it should always check the index first.
> >    Replace this with an assert.
> 
> The reasons for 2) and 3) being trace() messages (they should probably
> be debug() following the same reasoning as 1)) are graceful degradation
> and security (availability).

Hmm..

> If we have a packet triggering some sort of "memory clobbering", or an
> issue in the caller, that might very well be just one packet or a few
> of them.

I see the case for packet_add_do().  In that case it could be that
this is just a bad packet triggering bugs in the initial processing
code, and just dropping it will be sufficient to save us from damaging
state.

For packet_get_do(), however, this indicates that we have a packet
that's out of bounds, despite the fact that we checked its bounds when
it was added to the pool.  That means we *already* have corrupted
memory, and we have no way to assess the extent of the damage.  It's
entirely possible we're already badly compromised.  Under those
circumstances I think it is much better to terminate immediately,
rather than risk doing any more hard to trace damage.

> I think it's a disservice to users to crash in that case, and it has
> the potential to worsen a security flaw if this packet can be built on
> purpose. It's also a disservice to package maintainers because it has
> the potential to turn a harmless issue into an annoying situation with
> associated time pressure and everything.

Ok, that argument makes sense to me for packet_add_do(), but again,
not for packet_get_do().

> Those are not warn() calls, by the way, because we don't want to make
> it too easy, in that (perhaps unlikely) case, to flood logs, and
> possibly hide information due to rotation.

Right, I see the argument, but it also worries me that the message -
which even if we are successfully containing the damage here
definitely indicates a bug in the caller - could easily never be seen
at trace or even debug level.

> I suppose that the argument for asserts here is so that we discover
> functional issues as soon as possible, which I understand of course,

That's the main reason.  The secondary reason is that it clarifies the
meaning of a NULL return from packet_get_do().  With this change it
means: you requested a range that's not in the packet, whereas
previously it could mean either that or "something is horribly wrong
in the pool state, and we don't know what".

> but it relies on two assumptions that might not hold:
> 
> 1. the crash will be reported, while a debug() message will go
>    unnoticed.
> 
>    Well, the crash is more noticeable, but that doesn't mean that it
>    will be reported. Users might/will try to hack something around it
>    or try to survive with slirp4netns and all its flaws.

I suppose so.

>    On the other hand, a malfunction (say, failed connections) might be
>    equally likely to be reported, along with debug() messages.

That only applies if this causes malfunctions in intended behaviour.
If it's triggered due to an attack, rather than a flaw handling
expected behaviour, then there's no reason it would be reported.

> 2. the check and the assertion themselves are correct. This was not the
>    case for the two most recent severe issues (commit 61c0b0d0f199, bug
>    #105) we discovered through assertions.

That's fair, I can see why you'd be gun shy about asserts given that.

>    Issues such as https://github.com/containers/podman/issues/22925, on
>    the other hand, are good examples of how asserts saved us a lot
>    of time and effort later on (compared to, say, "connections stop
>    working after three days"). But I would argue that those are
>    substantially different cases where assertions are checking
>    something that's much more inherent to the implementation.

Eh... I can't really see a way these assertions are different other
than that one turned out to be wrong and the other didn't.

> In this case, I would argue that there's no need to crash, so we
> shouldn't. We can presumably carry on after that, and we're not leaking
> any resource or sneakily leaving behind some condition that will cause
> apparently unrelated issues at a later time.

Again, true (plausibly) for packet_add_do(), but not for
packet_get_do().

> 
> > 4) On packet_get() requesting a section of the packet beyond its bounds is
> >    correct already.  This happens routinely from some callers when they
> >    probe to see if certain parts of the packet are present.  At worst it
> >    indicates that the guest has generated a malformed packet, which we
> >    should quietly drop as we already do.

This patch doesn't change that.  We check against the stated packet
bounds first and return NULL *before* we check against the buffer
bounds and possibly trigger an assert.
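
As a toy model of the resulting order of checks (condensed, plain
assert() standing in for ASSERT_WITH_MSG(), the pool flattened into
explicit arguments, and the sanity bound as described in patch 6):

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <sys/uio.h>

	#define PACKET_MAX_LEN	(UINT16_MAX + 128)

	static void *get_range(const struct iovec *pkt, size_t count,
			       size_t idx, size_t offset, size_t len,
			       const char *buf, size_t buf_size)
	{
		char *ptr;

		assert(idx < count);		/* caller bug: bad index */
		assert(len <= PACKET_MAX_LEN);	/* nonsensical length */

		if (offset + len > pkt[idx].iov_len)
			return NULL;		/* beyond this packet's
						 * stated bounds: expected,
						 * callers probe like this */

		ptr = (char *)pkt[idx].iov_base + offset;
		assert(ptr >= buf);			/* pool corrupt */
		assert(ptr + len <= buf + buf_size);	/* pool corrupt */

		return ptr;
	}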

Ok... here's a tentative proposal, let me know what you think:
  * I promote the packet_check_range() messages to warn(), or even
    err().  Yes, this might allow for log flooding, but I would argue
    that if we have a bug allowing this path to be triggered and
    something is frequently triggering it, we *want* that to be noisy.
    We *think* we're containing the damage here, but we don't really
    know.
  * I make packet_check_range() failures in packet_add_do() drop the
    packet and continue, as they used to
  * packet_check_range() failures in packet_get_do() will still
    assert()
  * I leave the assert()s for packet _index_ out of range

> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  packet.c    | 71 ++++++++++++++++++++---------------------------------
> >  packet.h    |  3 ++-
> >  vu_common.c | 13 +++++++---
> >  3 files changed, 38 insertions(+), 49 deletions(-)
> > 
> > diff --git a/packet.c b/packet.c
> > index c921aa15..24f12448 100644
> > --- a/packet.c
> > +++ b/packet.c
> > @@ -30,37 +30,27 @@
> >   * @func:	For tracing: name of calling function
> >   * @line:	For tracing: caller line of function call
> >   *
> > - * Return: 0 if the range is valid, -1 otherwise
> > + * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
> > + * the pool).
> >   */
> > -static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
> > -			      const char *func, int line)
> > +static void packet_check_range(const struct pool *p,
> > +			       const char *ptr, size_t len,
> > +			       const char *func, int line)
> >  {
> >  	if (p->buf_size == 0) {
> > -		int ret;
> > -
> > -		ret = vu_packet_check_range((void *)p->buf, ptr, len);
> > -
> > -		if (ret == -1)
> > -			trace("cannot find region, %s:%i", func, line);
> > -
> > -		return ret;
> > -	}
> > -
> > -	if (ptr < p->buf) {
> > -		trace("packet range start %p before buffer start %p, %s:%i",
> > -		      (void *)ptr, (void *)p->buf, func, line);
> > -		return -1;
> > -	}
> > -
> > -	if (ptr + len > p->buf + p->buf_size) {
> > -		trace("packet range end %p after buffer end %p, %s:%i",
> > -		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > -		      func, line);
> > -		return -1;
> > +		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
> > +		return;
> >  	}
> >  
> > -	return 0;
> > +	ASSERT_WITH_MSG(ptr >= p->buf,
> > +			"packet range start %p before buffer start %p, %s:%i",
> > +			(void *)ptr, (void *)p->buf, func, line);
> > +	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
> > +			"packet range end %p after buffer end %p, %s:%i",
> > +			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > +			func, line);
> >  }
> > +
> >  /**
> >   * packet_add_do() - Add data as packet descriptor to given pool
> >   * @p:		Existing pool
> > @@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
> >  	size_t idx = p->count;
> >  
> >  	if (idx >= p->size) {
> > -		trace("add packet index %zu to pool with size %zu, %s:%i",
> > +		debug("add packet index %zu to pool with size %zu, %s:%i",
> >  		      idx, p->size, func, line);
> >  		return;
> >  	}
> >  
> > -	if (packet_check_range(p, start, len, func, line))
> > -		return;
> > +	packet_check_range(p, start, len, func, line);
> >  
> > -	if (len > PACKET_MAX_LEN) {
> > -		trace("add packet length %zu, %s:%i", len, func, line);
> > -		return;
> > -	}
> > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > +			"add packet length %zu (max %zu), %s:%i",
> > +			len, PACKET_MAX_LEN, func, line);
> >  
> >  	p->pkt[idx].iov_base = (void *)start;
> >  	p->pkt[idx].iov_len = len;
> > @@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> >  {
> >  	char *ptr;
> >  
> > -	if (idx >= p->size || idx >= p->count) {
> > -		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> > -		      idx, p->size, p->count, func, line);
> > -		return NULL;
> > -	}
> > -
> > -	if (len > PACKET_MAX_LEN) {
> > -		trace("packet data length %zu, %s:%i", len, func, line);
> > -		return NULL;
> > -	}
> > +	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
> > +			"packet %zu from pool size: %zu, count: %zu, %s:%i",
> > +			idx, p->size, p->count, func, line);
> > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > +			"packet range length %zu (max %zu), %s:%i",
> > +			len, PACKET_MAX_LEN, func, line);
> >  
> >  	if (len + offset > p->pkt[idx].iov_len) {
> >  		trace("data length %zu, offset %zu from length %zu, %s:%i",
> > @@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> >  
> >  	ptr = (char *)p->pkt[idx].iov_base + offset;
> >  
> > -	if (packet_check_range(p, ptr, len, func, line))
> > -		return NULL;
> > +	packet_check_range(p, ptr, len, func, line);
> >  
> >  	if (left)
> >  		*left = p->pkt[idx].iov_len - offset - len;
> > diff --git a/packet.h b/packet.h
> > index f95cda08..b164f77e 100644
> > --- a/packet.h
> > +++ b/packet.h
> > @@ -31,7 +31,8 @@ struct pool {
> >  	struct iovec pkt[];
> >  };
> >  
> > -int vu_packet_check_range(void *buf, const char *ptr, size_t len);
> > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > +			   const char *func, int line);
> >  void packet_add_do(struct pool *p, size_t len, const char *start,
> >  		   const char *func, int line);
> >  void *packet_get_do(const struct pool *p, const size_t idx,
> > diff --git a/vu_common.c b/vu_common.c
> > index 531f8786..528b9b08 100644
> > --- a/vu_common.c
> > +++ b/vu_common.c
> > @@ -24,10 +24,13 @@
> >   * @buf:	Array of the available memory regions
> >   * @ptr:	Start of desired data range
> >   * @size:	Length of desired data range
> > + * @func:	For tracing: name of calling function
> > + * @line:	For tracing: caller line of function call
> 
> If those arguments are mandatory, I would drop the "For tracing"
> specification.
> 
> >   *
> > - * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> > + * Aborts if the zone isn't in any of the device regions
> >   */
> > -int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > +			   const char *func, int line)
> >  {
> >  	struct vu_dev_region *dev_region;
> >  
> > @@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> >  
> >  		if (m <= ptr &&
> >  		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
> > -			return 0;
> > +			return;
> >  	}
> >  
> > -	return -1;
> > +	abort_with_msg(
> > +		"package range at %p, length %zd not within dev region %s:%i",
> 
> s/package/packet/
> 
> > +		(void *)ptr, len, func, line);
> >  }
> >  
> >  /**
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
  2025-01-01 21:54   ` Stefano Brivio
@ 2025-01-02  3:46     ` David Gibson
  2025-01-02 22:00       ` Stefano Brivio
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2025-01-02  3:46 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4370 bytes --]

On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 19:35:34 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > we're dealing with virtual interfaces we don't have the physical Ethernet
> > limitations requiring that length.  Indeed it is possible to generate a
> > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > the 'pasta' backend is only 42 bytes long).
> > 
> > It's also unclear if this limit is sufficient for vhost-user which isn't
> > limited by the size of pkt_buf as the other modes are.
> > 
> > We could attempt to correct the calculation, but that would leave us with
> > even larger arrays, which in practice rarely accumulate more than a handful
> > of packets.  So, instead, put an arbitrary cap on the number of packets we
> > can put in a batch, and if we run out of space, process and flush the
> > batch.
> 
> I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> general I couldn't really see a difference in latency (especially for
> UDP streams with small packets) or throughput. Figures from short
> throughput tests (such as the ones from the test suite) look a bit more
> variable, but I don't have any statistically meaningful data.
> 
> Then I looked into how many messages we might have in the array without
> this change, and I realised that, with the throughput tests from the
> suite, we very easily exceed the 256 limit.

Ah, interesting.

> Perhaps surprisingly we get the highest buffer counts with TCP transfers
> and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> the extra system call overhead from forcibly flushing batches might
> become rather relevant.

Really?  I thought syscall overhead (as in the part that's
per-syscall, rather than per-work) was generally in the tens of µs
range, rather than the ms range.

But in any case, I'd be fine with upping the size of the array to 4k
or 8k based on that empirical data.  That's still much smaller than
the >150k we have now.
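
For reference, the shape of the capped-batch approach in the patch - a
sketch only, everything here is a hypothetical placeholder, with the cap
set to the 4k figure above:

	#include <stddef.h>

	#define BATCH_MAX	4096

	struct batch {
		size_t count;
		/* ... frame descriptors ... */
	};

	/* Stand-ins for the real tap/pool code */
	int  next_frame(const char **frame, size_t *len);
	void batch_add(struct batch *b, const char *frame, size_t len);
	void batch_process(struct batch *b);	/* handle frames, then flush */

	/* Never drop a frame for lack of pool space: when the batch fills
	 * up mid-read, process it and keep reading.
	 */
	static void read_all(struct batch *b)
	{
		const char *frame;
		size_t len;

		while (next_frame(&frame, &len)) {
			if (b->count >= BATCH_MAX)
				batch_process(b);
			batch_add(b, frame, len);
		}
		batch_process(b);	/* whatever is left over */
	}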

> With lower MTUs, it looks like we have a lower CPU load and
> transmissions are scheduled differently (resulting in smaller batches),
> but I didn't really trace things.

Ok.  I wonder if with the small MTUs we're hitting throughput
bottlenecks elsewhere which mean this particular path isn't
over-exercised.

> So I start thinking that this has the *potential* to introduce a
> performance regression in some cases and we shouldn't just assume that
> some arbitrary 256 limit is good enough. I didn't check with perf(1),
> though.
> 
> Right now that array takes, effectively, less than 100 KiB (it's ~5000
> copies of struct iovec, 16 bytes each), and in theory that could be
> ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> have no practical effect anyway.

Yeah, I guess.  I don't love the fact that currently for correctness
(not spuriously dropping packets) we rely on a fairly complex
calculation that's based on information from different layers: the
buffer size and enforcement is in the packet pool layer and is
independent of packet layout, but the minimum frame size comes from
the tap layer and depends quite specifically on which L2 encapsulation
we're using.

> All in all, I think we shouldn't change this limit without a deeper
> understanding of the practical impact. While this change doesn't bring
> any practical advantage, the current behaviour is somewhat tested by
> now, and a small limit isn't.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
  2025-01-02  1:00     ` David Gibson
@ 2025-01-02 21:59       ` Stefano Brivio
  2025-01-03  1:16         ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-02 21:59 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Jan 2025 12:00:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:29 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > We verify that every packet we store in a pool - and every partial packet
> > > we retrieve from it - has a length no longer than UINT16_MAX.  This
> > > originated in the older packet pool implementation which stored packet
> > > lengths in a uint16_t.  Now that packets are represented by a struct
> > > iovec with its size_t length, this check serves only as a sanity / security
> > > check that we don't have some wildly out of range length due to a bug
> > > elsewhere.
> > > 
> > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > stored in the pool is in fact an entire frame including both L2 and any
> > > backend specific headers.  We can exceed this in passt mode, even with the
> > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > 4 bytes of qemu stream length header = 65538 bytes.
> > > 
> > > Introduce our own define for the maximum length of a packet in the pool and
> > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > specific headers.  We'll use different amounts of that depending on the
> > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > to be 100% tight.  
> > 
> > I couldn't find the time to check what's the maximum amount of bytes we
> > can get here depending on hypervisor and interface, but if this patch  
> 
> So, it's a separate calculation for each backend type, and some of
> them are pretty tricky.
> 
> For anything based on the kernel tap device it is 65535, because it
> has an internal frame size limit of 65535, already including any L2
> headers (it explicitly limits the MTU to 65535 - hard_header_len).
> There is no "hardware" header.
> 
> For the qemu stream protocol it gets pretty complicated, because there
> are multiple layers which could clamp the maximum size.  It doesn't
> look like the socket protocol code itself imposes a limit beyond the
> structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> which could be less for 32-bit systems).  AFAICT, it's not
> theoretically impossible to have gigabyte frames with a weird virtual
> NIC model... though obviously that wouldn't be IP, and probably not
> even Ethernet.

Theoretically speaking, it could actually be IPv6 with Jumbograms. They
never really gained traction (because of Ethernet, I think) and we don't
support them, but the only attempt to deprecate them I'm aware of
didn't succeed (yet):

  https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/

...and I actually wonder if somebody will ever try to revive them for
virtual networks, where they might make somewhat more sense (you could
transfer filesystems with one packet per file or similar silly tricks).

> Each virtual NIC could have its own limit.  I suspect that's going to
> be in the vicinity of 64k.  But, I'm really struggling to figure out
> what it is just for virtio-net, so I really don't want to try to
> figure it out for all of them.  With a virtio-net NIC, I seem to be
> able to set MTU all the way up to 65535 successfully, which implies a
> maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> header) = 65553 at least.

The Layer-2 header is included (because that also happens to be
ETH_MAX_MTU, on Linux), it wouldn't be on top.

> Similar situation for vhost-user, where I'm finding it even more
> inscrutable to figure out what limits are imposed at the sub-IP
> levels.  At the moment the "hardware" header
> (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> packet.c layer, but we might have reasons to change that.
> 
> So, any sub-IP limits for qemu, I'm basically not managing to find.
> However, we (more or less) only care about IP, which imposes a more
> practical limit of: 65535 + L2 header size + "hardware" header size.
> 
> At present that maxes out at 65553, as above, but if we ever support
> other L2 encapsulations, or other backend protocols with larger
> "hardware" headers, that could change.

Okay. I was thinking of a more practical approach, based on the fact
that we only support Ethernet anyway, with essentially four types of
adapters (three virtio-net implementations, and tap), plus rather rare
reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we
could actually test things. 

Well, I did that anyway, because I'm not quite comfortable dropping the
UINT16_MAX check in packet_add_do() without testing... and yes, with
this patch we can trigger a buffer overflow with vhost-user. In detail:

- muvm, virtio-net with 65535 bytes MTU:

  - ping -s 65492 works (65534 bytes "on wire" from pcap)
  - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
    on wire (that's with a newer kernel, probably that's the reason why
    it's not the same limit as QEMU, below)

- QEMU, virtio-net without vhost-user, 65535 bytes MTU:

  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - with -s 65494: "Bad frame size from guest, resetting connection"

- QEMU, virtio-net with vhost-user, 65535 bytes MTU:

  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - ping -s 65494:

    *** buffer overflow detected ***: terminated

    without this patch, we catch that in packet_add_do() (and without
    9/12 we don't crash)

- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
  correct maximum though):

  - ping -s 65493 works (65535 bytes on wire)
  - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

So, I guess we should find out the issue with vhost-user first.

Other than that, it looks like we can reach at least 65539 bytes if we
add the "virtio-net" length descriptors, but still the frame size
itself (which is actually what matters for the functions in packet.c)
can't exceed 65535 bytes, at least from my tests.

Then, yes, I agree that it's not actually correct, even though it fits
all the use cases we have, because we *could* have an implementation
exceeding that value (at the moment, it looks like we don't).

> > fixes an actual issue as you seem to imply, actually checking that with
> > QEMU and muvm would be nice.
> > 
> > By the way, as you mention a specific calculation, does it really make
> > sense to use a "good enough" value here? Can we ever exceed 65538
> > bytes, or can we use that as limit? It would be good to find out, while
> > at it.  
> 
> So, yes, I think we can exceed 65538.  But more significantly, trying
> to make the limit tight here feels like a borderline layering
> violation.  The packet layer doesn't really care about the frame size
> as long as it's "sane".

It might still be convenient for some back-ends to define "sane" as 64
KiB. I'm really not sure if it is, I didn't look into the matching part
of vhost-user in detail.

If it's buggy because we have a user that can exceed that, sure, let's
fix it. If not... also fine by me as it's some kind of theoretical
flaw, but we shouldn't crash.

> Fwiw, in the draft changes I have improving
> MTU handling, it's my intention that individual backends calculate
> and/or enforce tighter limits of their own where practical, and
> BUILD_ASSERT() that those fit within the packet layer's frame size
> limit.

Ah, nice, that definitely sounds like an improvement.
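
For illustration, the kind of per-backend check being described might look
roughly like this (the backend name and the exact bound are made up here,
not taken from the draft series):

	/* Hypothetical per-backend bound: 65535-byte L3 payload plus a
	 * 14-byte Ethernet header plus the 4-byte qemu stream length header.
	 */
	#define TAP_PASST_MAX_FRAME	(65535 + 14 + 4)

	/* Fail the build if the backend's own limit ever exceeds the generic
	 * limit enforced by the packet pool layer.
	 */
	BUILD_ASSERT(TAP_PASST_MAX_FRAME <= PACKET_MAX_LEN);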

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro
  2025-01-02  2:15     ` David Gibson
@ 2025-01-02 22:00       ` Stefano Brivio
  2025-01-03  4:48         ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-02 22:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Jan 2025 13:15:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:30 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Two places in the code use the packet_get_try() variant on packet_get().
> > > The difference is that packet_get_try() passes a NULL 'func' to
> > > packet_get_do(), which suppresses log messages.  The places we use this
> > > > are ones where we expect to sometimes have a failure retrieving the packet
> > > range, even in normal cases.  So, suppressing the log messages seems like
> > > it makes sense, except:
> > > 
> > > 1) It suppresses log messages on all errors.  We (somewhat) expect to hit
> > >    the case where the requested range is not within the received packet.
> > > >    However, it also suppresses messages in cases where the requested packet
> > >    index doesn't exist, the requested range has a nonsensical length or
> > >    doesn't lie in even the right vague part of memory.  None of those
> > >    latter cases are expected, and the message would be helpful if we ever
> > >    actually hit them.
> > > 
> > > 2) The suppressed messages aren't actually that disruptive.  For the case
> > >    in ip.c, we'd log only if we run out of IPv6 packet before reaching a
> > >    (non-option) L4 header.  That shouldn't be the case in normal operation
> > > >    so getting a message (at trace level) is not unreasonable.
> > >    For the case in dhcpv6.c we do suppress a message every time we look for
> > >    but don't find a specific DHCPv6 option.  That can happen in perfectly
> > >    ok cases, but, again these are trace() level and DHCPv6 transactions
> > >    aren't that common.  Suppressing the message for this case doesn't
> > >    seem worth the drawbacks of (1).  
> > 
> > The reason why I implemented packet_get_try() is that the messages from
> > packet_get_do() indicate serious issues, and if I'm debugging something
> > by looking at traces it's not great to have messages indicating that we
> > hit a serious issue while we're simply validating identity associations.  
> 
> I'm not following your argument here.  It's exactly because (most of)
> the message indicate serious issues that I don't want to suppress
> them.  I don't know what you mean by "validating identity
> associations".

But dhcpv6_opt() trying to get data that doesn't exist is *not* an
issue, including not a serious one, so if I'm debugging something with
--trace and I see one of these messages I'll shout at "memory" or
"packet" badness and waste time thinking it's an actual issue.

Validating identity associations (IA_NA, IA_TA, RFC 3315) is what
dhcpv6_ia_notonlink() does. That's the most common case where we'll
routinely call dhcpv6_opt() to fetch data which isn't there.
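
As a rough sketch of that calling pattern (simplified, and only loosely
based on the real dhcpv6.c code), the option walk is a loop that only
stops once the lookup runs past the end of the packet:

	struct opt_hdr *o;
	size_t offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);

	/* Each hit advances offset past the option just found; the final
	 * call runs off the end of the packet and returns NULL, ending the
	 * loop -- that's the normal, non-error case being discussed.
	 */
	while ((o = packet_get(p, 0, offset, sizeof(*o), NULL)))
		offset += sizeof(*o) + ntohs(o->l);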

> > It's not about the amount of logged messages, it's about the type of
> > message being logged and the distracting noise possibly resulting in a
> > substantial time waste.
> > 
> > About 1): dhcpv6_opt() always picks pool index 0, and the base offset
> > was already checked by the caller.  
> 
> Right, but dhcpv6_opt() is called in a loop, that only stops when it
> returns NULL.  So, by construction the last call to dhcpv6_opt(),
> which terminates the loop, _will_ have a failing call to packet_get().
> At this point - at least assuming a correctly constructed packet - the
> offset will point to just past the last option, which should be
> exactly at the end of the packet.

Yes, I get that, and:

- I would be happy if that one were *not* reported as failure

- the calls before that one should always be enough to check if we have
  an actual issue with the packet

> > In ipv6_l4hdr(), the index was
> > already validated by a previous call to packet_get(), and the starting
> > offset as well.  
> 
> Not AFAICT, the initial packet_get just validates the basic IPv6
> header.  The calls to packet_get_try() in the loop validate additional
> IP options.  I don't think it will ever fail on a well-constructed
> packet, but it could on a bogus (or truncated) packet, where the
> nexthdr field indicates an option that's actually missing.
> 
> This is kind of my point: it will only trip on a badly constructed
> packet, in which case I don't think we want to suppress messages.

There, I used packet_get_try() because a missing option or payload
doesn't indicate a bad packet at the _data_ level.

On the other hand, it's bad at the network level anyway, because option
59 *must* be there otherwise (I just realised), so while I'd still
prefer another wording of the warning (not mentioning packet/buffer
ranges... something more network-y), I would be fine with it.

> > I guess the main reason to have this patch in this series is to make a
> > later change simpler, but I'm not sure which one, so I don't know what
> > to suggest as an alternative.  
> 
> Mostly the next one - making more distinction between the different
> error severities that can occur here.  Having to deal with the
> possibility of func being NULL complicates things.

I fail to see that much of a complication as a result, and I'd still
very much prefer to *not* have a warning for that case in dhcpv6_opt()
(and a slight preference to have a different message for ipv6_l4hdr()),
but if it really adds a ton of lines for whatever reason I'm missing,
so be it...

> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  dhcpv6.c |  2 +-
> > >  ip.c     |  2 +-
> > >  packet.c | 18 +++++-------------
> > >  packet.h |  3 ---
> > >  4 files changed, 7 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/dhcpv6.c b/dhcpv6.c
> > > index 0523bba8..c0d05131 100644
> > > --- a/dhcpv6.c
> > > +++ b/dhcpv6.c
> > > @@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
> > >  	if (!*offset)
> > >  		*offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
> > >  
> > > -	while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) {
> > > +	while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) {
> > >  		unsigned int opt_len = ntohs(o->l) + sizeof(*o);
> > >  
> > >  		if (ntohs(o->l) > left)
> > > diff --git a/ip.c b/ip.c
> > > index 2cc7f654..e391f81b 100644
> > > --- a/ip.c
> > > +++ b/ip.c
> > > @@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> > >  	if (!IPV6_NH_OPT(nh))
> > >  		goto found;
> > >  
> > > -	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
> > > +	while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) {
> > >  		nh = o->nexthdr;
> > >  		hdrlen = (o->hdrlen + 1) * 8;
> > >  
> > > diff --git a/packet.c b/packet.c
> > > index bcac0375..c921aa15 100644
> > > --- a/packet.c
> > > +++ b/packet.c
> > > @@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> > >  	char *ptr;
> > >  
> > >  	if (idx >= p->size || idx >= p->count) {
> > > -		if (func) {
> > > -			trace("packet %zu from pool size: %zu, count: %zu, "
> > > -			      "%s:%i", idx, p->size, p->count, func, line);
> > > -		}
> > > +		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> > > +		      idx, p->size, p->count, func, line);  
> > 
> > If you change this, the documentation of the arguments for
> > packet_get_do() should be updated as well.  
> 
> Fixed.  The doc for packet_add_do() also changed, where it was already wrong.
> 
> > >  		return NULL;
> > >  	}
> > >  
> > >  	if (len > PACKET_MAX_LEN) {
> > > -		if (func) {
> > > -			trace("packet data length %zu, %s:%i",
> > > -			      len, func, line);
> > > -		}
> > > +		trace("packet data length %zu, %s:%i", len, func, line);
> > >  		return NULL;
> > >  	}
> > >  
> > >  	if (len + offset > p->pkt[idx].iov_len) {
> > > -		if (func) {
> > > -			trace("data length %zu, offset %zu from length %zu, "
> > > -			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> > > -			      func, line);
> > > -		}
> > > +		trace("data length %zu, offset %zu from length %zu, %s:%i",
> > > +		      len, offset, p->pkt[idx].iov_len, func, line);
> > >  		return NULL;
> > >  	}
> > >  
> > > diff --git a/packet.h b/packet.h
> > > index 05d37695..f95cda08 100644
> > > --- a/packet.h
> > > +++ b/packet.h
> > > @@ -45,9 +45,6 @@ void pool_flush(struct pool *p);
> > >  #define packet_get(p, idx, offset, len, left)				\
> > >  	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
> > >  
> > > -#define packet_get_try(p, idx, offset, len, left)			\
> > > -	packet_get_do(p, idx, offset, len, left, NULL, 0)
> > > -
> > >  #define PACKET_POOL_DECL(_name, _size, _buf)				\
> > >  struct _name ## _t {							\
> > >  	char *buf;							\  

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors
  2025-01-02  2:58     ` David Gibson
@ 2025-01-02 22:00       ` Stefano Brivio
  2025-01-03  5:06         ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-02 22:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Jan 2025 13:58:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:32 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > packet_add() and packet_get() can fail for a number of reasons, and we
> > > currently treat them all basically the same: we log a trace() level message
> > > and for packet_get() we return NULL.  However the different causes of
> > > error are quite different and suggest different handling:
> > > 
> > > 1) If we run out of space in the pool on add, that's (rightly) non-fatal,
> > >    but is an unusual situation which we might want to know about.  Promote
> > >    it's message to debug() level.
> > > 
> > > 2) All packet_check_range() errors and the checks on packet length indicate
> > >    a serious problem.  Due to either a bug in calling code, or some sort
> > >    of memory-clobbering, we've been given addresses or sizes of packets
> > >    that are nonsensical.  If things are this bad, we shouldn't try to carry
> > >    on; replace these with asserts.
> > > 
> > > 3) Requesting a packet index that doesn't exist in the pool in packet_get()
> > >    indicates a bug in the caller - it should always check the index first.
> > >    Replace this with an assert.  
> > 
> > The reasons for 2) and 3) being trace() messages (they should probably
> > be debug() following the same reasoning as 1)) are graceful degradation
> > and security (availability).  
> 
> Hmm..
> 
> > If we have a packet triggering some sort of "memory clobbering", or an
> > issue in the caller, that might very well be just one packet or a few
> > of them.  
> 
> I see the case for packet_add_do().  In that case it could be that
> this is just a bad packet triggering bugs in the initial processing
> code, and just dropping it will be sufficient to save us from damaging
> state.
> 
> For packet_get_do(), however, this indicates that we have a packet
> that's out of bounds, despite the fact that we checked its bounds when
> it was added to the pool.  That means we *already* have corrupted
> memory, and we have no way to assess the extent of the damage.  It's
> entirely possible we're already badly compromised.  Under those
> circumstances I think it is much better to terminate immediately,
> rather than risk doing any more hard to trace damage.

It depends on which part of packet_get_do(). If we're exceeding the
packet count or the size of a pool, then I guess yes, otherwise not so
much.

> > I think it's a disservice to users to crash in that case, and it has
> > the potential to worsen a security flaw if this packet can be built on
> > purpose. It's also a disservice to package maintainers because it has
> > the potential to turn a harmless issue into an annoying situation with
> > associated time pressure and everything.  
> 
> Ok, that argument makes sense to me for packet_add_do(), but again,
> not for packet_get_do().
> 
> > Those are not warn() calls, by the way, because we don't want to make
> > it too easy, in that (perhaps unlikely) case, to flood logs, and
> > possibly hide information due to rotation.  
> 
> Right, I see the argument, but it also worries me that the message -
> which even if we are successfully containing the damage here
> definitely indicates a bug in the caller - could easily never be seen
> at trace or even debug level.
> 
> > I suppose that the argument for asserts here is so that we discover
> > functional issues as soon as possible, which I understand of course,  
> 
> That's the main reason.  The secondary reason is that it clarifies the
> meaning of a NULL return from packet_get_do().  With this change it
> means: you requested a range that's not in the packet, whereas
> previously it could mean either that or "something is horribly wrong
> in the pool state, and we don't know what".
> 
> > but it relies on two assumptions that might not hold:
> > 
> > 1. the crash will be reported, while a debug() message will go
> >    unnoticed.
> > 
> >    Well, the crash is more noticeable, but that doesn't mean that it
> >    will be reported. Users might/will try to hack something around it
> >    or try to survive with slirp4netns and all its flaws.  
> 
> I suppose so.
> 
> >    On the other hand, a malfunction (say, failed connections) might be
> >    equally likely to be reported, along with debug() messages.  
> 
> That only applies if this causes malfunctions in intended behaviour.
> If it's triggered due to an attack, rather than a flaw handling
> expected behaviour then there's no reason it would be reported.
> 
> > 2. the check and the assertion themselves are correct. This was not the
> >    case for the two most recent severe issues (commit 61c0b0d0f199, bug
> >    #105) we discovered through assertions.  
> 
> That's fair, I can see why you'd be gun shy about asserts given that.
> 
> >    Issues such as https://github.com/containers/podman/issues/22925, on
> >    the other hand, are good examples of how asserts saved us a lot
> >    of time and effort later on (compared to, say, "connections stop
> >    working after three days"). But I would argue that those are
> >    substantially different cases where assertions are checking
> >    something that's much more inherent to the implementation.  
> 
> Eh... I can't really see a way these assertions are different other
> than that one turned out to be wrong and the other didn't.

Ah, sorry, I meant that there is a substantial difference between the
ones related to Podman issue 22925 *and what might happen in
packet_get_range()*.

Not that there's a difference between those related to Podman issue
22925 and those from commit 61c0b0d0f199 and bug #105.

> > In this case, I would argue that there's no need to crash, so we
> > shouldn't. We can presumably carry on after that, and we're not leaking
> > any resource or sneakily leave behind some condition that will cause
> > apparently unrelated issues at a later time.  
> 
> Again, true (plausibly) for packet_add_do(), but not for
> packet_get_do().
> 
> > > 4) On packet_get() requesting a section of the packet beyond its bounds is
> > >    correct already.  This happens routinely from some callers when they
> > >    probe to see if certain parts of the packet are present.  At worst it
> > >    indicates that the guest has generated a malformed packet, which we
> > >    should quietly drop as we already do.  
> 
> This patch doesn't change that.  We check against the stated packet
> bounds first and return NULL *before* we check against the buffer
> bounds and possibly trigger an assert.

Yes... I wasn't actually disputing that (quote is from yourself).

> Ok... here's a tentative proposal, let me know what you think:

So, first off, I have to say that it's a bit hard to reason about
something that is not giving users any problem at the moment (well, at
least not without a part of this series), and which didn't really give
us problems in the past.

It's difficult to quantify with experience and examples (even letting
alone the priority of the whole matter for a moment...). Anyway, let me
try:

>   * I promote the packet_check_range() messages to warn(), or even
>     err().  Yes, this might allow for log flooding, but I would argue
>     that if we have a bug allowing this path to be triggered and
>     something is frequently triggering it, we *want* that to be noisy.
>     We *think* we're containing the damage here, but we don't really
>     know.

I still think it's our responsibility to not flood the system log in
*any* circumstance.

Remember that this is not necessarily *our* log where the worst that
can happen is that we hide information. Even with sane system log
configurations, one message per packet can get close to a substantial
denial of service (especially with systemd-syslogd, see also
https://github.com/systemd/systemd/issues/13425).

I guess you don't realise until you run passt without -f, for example
with libvirt, for a large number of virtual machines. Or in KubeVirt.
Or for a bunch of containers. It's a common and very much intended
usage.

It's not true in general that we don't know. We can quantify the
difference in damage very clearly if the attacker manages to make
packet_check_range() print one line for each packet (and nothing
else).

>   * I make packet_check_range() failures in packet_add_do() drop the
>     packet and continue, as they used to
>   * packet_check_range() failures in packet_get_do() will still
>     assert()

I guess the (start < p->buf) one makes sense, but not so much the one
where we check that offset plus length is within the pool, because that
one could be very well triggered by a specific packet itself.

>   * I leave the assert()s for packet _index_ out of range

Okay, yes, those make sense to me as well.

> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  packet.c    | 71 ++++++++++++++++++++---------------------------------
> > >  packet.h    |  3 ++-
> > >  vu_common.c | 13 +++++++---
> > >  3 files changed, 38 insertions(+), 49 deletions(-)
> > > 
> > > diff --git a/packet.c b/packet.c
> > > index c921aa15..24f12448 100644
> > > --- a/packet.c
> > > +++ b/packet.c
> > > @@ -30,37 +30,27 @@
> > >   * @func:	For tracing: name of calling function
> > >   * @line:	For tracing: caller line of function call
> > >   *
> > > - * Return: 0 if the range is valid, -1 otherwise
> > > + * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
> > > + * the pool).
> > >   */
> > > -static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
> > > -			      const char *func, int line)
> > > +static void packet_check_range(const struct pool *p,
> > > +			       const char *ptr, size_t len,
> > > +			       const char *func, int line)
> > >  {
> > >  	if (p->buf_size == 0) {
> > > -		int ret;
> > > -
> > > -		ret = vu_packet_check_range((void *)p->buf, ptr, len);
> > > -
> > > -		if (ret == -1)
> > > -			trace("cannot find region, %s:%i", func, line);
> > > -
> > > -		return ret;
> > > -	}
> > > -
> > > -	if (ptr < p->buf) {
> > > -		trace("packet range start %p before buffer start %p, %s:%i",
> > > -		      (void *)ptr, (void *)p->buf, func, line);
> > > -		return -1;
> > > -	}
> > > -
> > > -	if (ptr + len > p->buf + p->buf_size) {
> > > -		trace("packet range end %p after buffer end %p, %s:%i",
> > > -		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > > -		      func, line);
> > > -		return -1;
> > > +		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
> > > +		return;
> > >  	}
> > >  
> > > -	return 0;
> > > +	ASSERT_WITH_MSG(ptr >= p->buf,
> > > +			"packet range start %p before buffer start %p, %s:%i",
> > > +			(void *)ptr, (void *)p->buf, func, line);
> > > +	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
> > > +			"packet range end %p after buffer end %p, %s:%i",
> > > +			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > > +			func, line);
> > >  }
> > > +
> > >  /**
> > >   * packet_add_do() - Add data as packet descriptor to given pool
> > >   * @p:		Existing pool
> > > @@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
> > >  	size_t idx = p->count;
> > >  
> > >  	if (idx >= p->size) {
> > > -		trace("add packet index %zu to pool with size %zu, %s:%i",
> > > +		debug("add packet index %zu to pool with size %zu, %s:%i",
> > >  		      idx, p->size, func, line);
> > >  		return;
> > >  	}
> > >  
> > > -	if (packet_check_range(p, start, len, func, line))
> > > -		return;
> > > +	packet_check_range(p, start, len, func, line);
> > >  
> > > -	if (len > PACKET_MAX_LEN) {
> > > -		trace("add packet length %zu, %s:%i", len, func, line);
> > > -		return;
> > > -	}
> > > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > > +			"add packet length %zu (max %zu), %s:%i",
> > > +			len, PACKET_MAX_LEN, func, line);
> > >  
> > >  	p->pkt[idx].iov_base = (void *)start;
> > >  	p->pkt[idx].iov_len = len;
> > > @@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> > >  {
> > >  	char *ptr;
> > >  
> > > -	if (idx >= p->size || idx >= p->count) {
> > > -		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> > > -		      idx, p->size, p->count, func, line);
> > > -		return NULL;
> > > -	}
> > > -
> > > -	if (len > PACKET_MAX_LEN) {
> > > -		trace("packet data length %zu, %s:%i", len, func, line);
> > > -		return NULL;
> > > -	}
> > > +	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
> > > +			"packet %zu from pool size: %zu, count: %zu, %s:%i",
> > > +			idx, p->size, p->count, func, line);
> > > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > > +			"packet range length %zu (max %zu), %s:%i",
> > > +			len, PACKET_MAX_LEN, func, line);
> > >  
> > >  	if (len + offset > p->pkt[idx].iov_len) {
> > >  		trace("data length %zu, offset %zu from length %zu, %s:%i",
> > > @@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> > >  
> > >  	ptr = (char *)p->pkt[idx].iov_base + offset;
> > >  
> > > -	if (packet_check_range(p, ptr, len, func, line))
> > > -		return NULL;
> > > +	packet_check_range(p, ptr, len, func, line);
> > >  
> > >  	if (left)
> > >  		*left = p->pkt[idx].iov_len - offset - len;
> > > diff --git a/packet.h b/packet.h
> > > index f95cda08..b164f77e 100644
> > > --- a/packet.h
> > > +++ b/packet.h
> > > @@ -31,7 +31,8 @@ struct pool {
> > >  	struct iovec pkt[];
> > >  };
> > >  
> > > -int vu_packet_check_range(void *buf, const char *ptr, size_t len);
> > > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > > +			   const char *func, int line);
> > >  void packet_add_do(struct pool *p, size_t len, const char *start,
> > >  		   const char *func, int line);
> > >  void *packet_get_do(const struct pool *p, const size_t idx,
> > > diff --git a/vu_common.c b/vu_common.c
> > > index 531f8786..528b9b08 100644
> > > --- a/vu_common.c
> > > +++ b/vu_common.c
> > > @@ -24,10 +24,13 @@
> > >   * @buf:	Array of the available memory regions
> > >   * @ptr:	Start of desired data range
> > >   * @size:	Length of desired data range
> > > + * @func:	For tracing: name of calling function
> > > + * @line:	For tracing: caller line of function call  
> > 
> > If those arguments are mandatory, I would drop the "For tracing"
> > specification.
> >   
> > >   *
> > > - * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> > > + * Aborts if the zone isn't in any of the device regions
> > >   */
> > > -int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> > > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > > +			   const char *func, int line)
> > >  {
> > >  	struct vu_dev_region *dev_region;
> > >  
> > > @@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> > >  
> > >  		if (m <= ptr &&
> > >  		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
> > > -			return 0;
> > > +			return;
> > >  	}
> > >  
> > > -	return -1;
> > > +	abort_with_msg(
> > > +		"package range at %p, length %zd not within dev region %s:%i",  
> > 
> > s/package/packet/
> >   
> > > +		(void *)ptr, len, func, line);
> > >  }
> > >  
> > >  /**  

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
  2025-01-02  3:46     ` David Gibson
@ 2025-01-02 22:00       ` Stefano Brivio
  2025-01-03  6:06         ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefano Brivio @ 2025-01-02 22:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Jan 2025 14:46:45 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:34 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > > possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> > > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > > the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> > > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > > we're dealing with virtual interfaces we don't have the physical Ethernet
> > > limitations requiring that length.  Indeed it is possible to generate a
> > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > > the 'pasta' backend is only 42 bytes long).
> > > 
> > > It's also unclear if this limit is sufficient for vhost-user which isn't
> > > limited by the size of pkt_buf as the other modes are.
> > > 
> > > We could attempt to correct the calculation, but that would leave us with
> > > even larger arrays, which in practice rarely accumulate more than a handful
> > > of packets.  So, instead, put an arbitrary cap on the number of packets we
> > > can put in a batch, and if we run out of space, process and flush the
> > > batch.  
> > 
> > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > general I couldn't really see a difference in latency (especially for
> > UDP streams with small packets) or throughput. Figures from short
> > throughput tests (such as the ones from the test suite) look a bit more
> > variable, but I don't have any statistically meaningful data.
> > 
> > Then I looked into how many messages we might have in the array without
> > this change, and I realised that, with the throughput tests from the
> > suite, we very easily exceed the 256 limit.  
> 
> Ah, interesting.
> 
> > Perhaps surprisingly we get the highest buffer counts with TCP transfers
> > and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> > more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> > in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> > the extra system call overhead from forcibly flushing batches might
> > become rather relevant.  
> 
> Really?  I thought syscall overhead (as in the part that's
> per-syscall, rather than per-work) was generally in the tens of µs
> range, rather than the ms range.

Tens or hundreds of µs (mind that it's several of them), so there
could be just one order of magnitude between the two.

> But in any case, I'd be fine with upping the size of the array to 4k
> or 8k based on that empirical data.  That's still much smaller than
> the >150k we have now.

I would even go with 32k -- there are embedded systems with a ton of
memory but still much slower clocks compared to my typical setup. Go
figure. Again, I think we should test and profile this, ideally, but if
not, then let's pick something that's ~10x of what I see.
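
To put rough numbers on the sizes being discussed (back-of-the-envelope
only; struct iovec is 16 bytes on 64-bit, and the array sizes below are
just the values mentioned in this thread, not anything committed):

	#include <stdio.h>
	#include <sys/uio.h>

	int main(void)
	{
		const size_t sizes[] = { 256, 4096, 8192, 32768, 161319 };
		size_t i;

		/* Memory taken by pool_tap[46]'s iovec array at each size */
		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
			printf("%6zu entries: %7zu KiB\n", sizes[i],
			       sizes[i] * sizeof(struct iovec) / 1024);
		return 0;
	}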

> > With lower MTUs, it looks like we have a lower CPU load and
> > transmissions are scheduled differently (resulting in smaller batches),
> > but I didn't really trace things.  
> 
> Ok.  I wonder if with the small MTUs we're hitting throughput
> bottlenecks elsewhere which mean this particular path isn't
> over-exercised.
> 
> > So I start thinking that this has the *potential* to introduce a
> > performance regression in some cases and we shouldn't just assume that
> > some arbitrary 256 limit is good enough. I didn't check with perf(1),
> > though.
> > 
> > Right now that array takes, effectively, less than 100 KiB (it's ~5000
> > copies of struct iovec, 16 bytes each), and in theory that could be
> > ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> > assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> > have no practical effect anyway.  
> 
> Yeah, I guess.  I don't love the fact that currently for correctness
> (not spuriously dropping packets) we rely on a fairly complex
> calculation that's based on information from different layers: the
> buffer size and enforcement is in the packet pool layer and is
> independent of packet layout, but the minimum frame size comes from
> the tap layer and depends quite specifically on which L2 encapsulation
> we're using.

Well but it's exactly one line, and we're talking about the same
project and tool, not something that's separated by several API layers.

By the way: on one hand you have that. On the other hand, you're adding
an arbitrary limit that comes from a test I just did, which is also
based on information from different layers.

> > All in all, I think we shouldn't change this limit without a deeper
> > understanding of the practical impact. While this change doesn't bring
> > any practical advantage, the current behaviour is somewhat tested by
> > now, and a small limit isn't.  

...and this still applies, I think.

-- 
Stefano


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX
  2025-01-02 21:59       ` Stefano Brivio
@ 2025-01-03  1:16         ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2025-01-03  1:16 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 11339 bytes --]

On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 12:00:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:29 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > We verify that every packet we store in a pool - and every partial packet
> > > > we retrieve from it - has a length no longer than UINT16_MAX.  This
> > > > originated in the older packet pool implementation which stored packet
> > > > lengths in a uint16_t.  Now that packets are represented by a struct
> > > > iovec with its size_t length, this check serves only as a sanity / security
> > > > check that we don't have some wildly out of range length due to a bug
> > > > elsewhere.
> > > > 
> > > > However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
> > > > stored in the pool is in fact an entire frame including both L2 and any
> > > > backend specific headers.  We can exceed this in passt mode, even with the
> > > > default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
> > > > 4 bytes of qemu stream length header = 65538 bytes.
> > > > 
> > > > Introduce our own define for the maximum length of a packet in the pool and
> > > > set it slightly larger, allowing 128 bytes for L2 and/or other backend
> > > > specific headers.  We'll use different amounts of that depending on the
> > > > tap backend, but since this is just a sanity check, the bound doesn't need
> > > > to be 100% tight.  
> > > 
> > > I couldn't find the time to check what's the maximum amount of bytes we
> > > can get here depending on hypervisor and interface, but if this patch  
> > 
> > So, it's a separate calculation for each backend type, and some of
> > them are pretty tricky.
> > 
> > For anything based on the kernel tap device it is 65535, because it
> > has an internal frame size limit of 65535, already including any L2
> > headers (it explicitly limits the MTU to 65535 - hard_header_len).
> > There is no "hardware" header.
> > 
> > For the qemu stream protocol it gets pretty complicated, because there
> > are multiple layers which could clamp the maximum size.  It doesn't
> > look like the socket protocol code itself imposes a limit beyond the
> > structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
> > which could be less for 32-bit systems).  AFAICT, it's not
> > theoretically impossible to have gigabyte frames with a weird virtual
> > NIC model... though obviously that wouldn't be IP, and probably not
> > even Ethernet.
> 
> Theoretically speaking, it could actually be IPv6 with Jumbograms. They
> never really gained traction (because of Ethernet, I think) and we don't
> support them, but the only attempt to deprecate them I'm aware of
> didn't succeed (yet):
> 
>   https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
> 
> ...and I actually wonder if somebody will ever try to revive them for
> virtual networks, where they might make somewhat more sense (you could
> transfer filesystems with one packet per file or similar silly tricks).

Hm, yes.  Well, one problem at a time.  Well, ok, 2 or 3 problems at a
time.

> > Each virtual NIC could have its own limit.  I suspect that's going to
> > be in the vicinity of 64k.  But, I'm really struggling to figure out
> > what it is just for virtio-net, so I really don't want to try to
> > figure it out for all of them.  With a virtio-net NIC, I seem to be
> > able to set MTU all the way up to 65535 successfully, which implies a
> > maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol
> > header) = 65553 at least.
> 
> The Layer-2 header is included (because that also happens to be
> ETH_MAX_MTU, on Linux), it wouldn't be on top.

No.  Or if so, that's a guest side bug.  The MTU set with ip-link is
the maximum L3 size - at the default 1500, the L2 frame can be 1514
bytes.  If the driver can't send an L2 frame greater than 65535 bytes,
then it should clamp the MTU to (65535 - hard_header_len) like tuntap
already does.
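
A minimal sketch of that clamp, for illustration (the names here are made
up, they're neither the kernel's nor passt's):

	/* Largest L2 frame the backend can carry, L2 header included */
	#define MAX_L2_FRAME	65535

	/* Clamp the configurable MTU so that MTU plus L2 header still fits */
	static unsigned int clamp_mtu(unsigned int mtu,
				      unsigned int hard_header_len)
	{
		unsigned int max_mtu = MAX_L2_FRAME - hard_header_len;

		return mtu > max_mtu ? max_mtu : mtu;
	}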

I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
that can fit in an ethernet frame (that's longer), or is it the
maximum ethernet frame size (leading to an IP MTU of 65521)?  I plan
to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU
series.  I should merge that with this one, it might make the context
clearer.

AFAICT there is *no* structural limit on the size of an ethernet
frame; the length isn't in any header, it's just assumed to be
reported out of band by the hardware.  No theoretical reason that
reporting mechanism couldn't allow lengths > 65535, whether slightly
(65535 bytes of payload + header & FCS) or vastly.

> > Similar situation for vhost-user, where I'm finding it even more
> > inscrutable to figure out what limits are imposed at the sub-IP
> > levels.  At the moment the "hardware" header
> > (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the
> > packet.c layer, but we might have reasons to change that.
> > 
> > So, any sub-IP limits for qemu, I'm basically not managing to find.
> > However, we (more or less) only care about IP, which imposes a more
> > practical limit of: 65535 + L2 header size + "hardware" header size.
> > 
> > At present that maxes out at 65553, as above, but if we ever support
> > other L2 encapsulations, or other backend protocols with larger
> > "hardware" headers, that could change.
> 
> Okay. I was thinking of a more practical approach, based on the fact
> that we only support Ethernet anyway, with essentially four types of
> adapters (three virtio-net implementations, and tap), plus rather rare
> reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we
> could actually test things. 
> 
> Well, I did that anyway, because I'm not quite comfortable dropping the
> UINT16_MAX check in packet_add_do() without testing...

We're not dropping it, just raising the limit, fairly slightly.

> and yes, with
> this patch we can trigger a buffer overflow with vhost-user. In detail:
> 
> - muvm, virtio-net with 65535 bytes MTU:
> 
>   - ping -s 65492 works (65534 bytes "on wire" from pcap)
>   - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes
>     on wire (that's with a newer kernel, probably that's the reason why
>     it's not the same limit as QEMU, below)

That's a guest driver bug.  If the MTU is 65535, it should be able to
send a 65535 byte IP datagram without fragmentation.  Looks like it
needs to clamp the MTU based on L2 limitations.

> - QEMU, virtio-net without vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - with -s 65494: "Bad frame size from guest, resetting connection"

That's our check, which I plan to fix in the MTU series.

> - QEMU, virtio-net with vhost-user, 65535 bytes MTU:
> 
>   - ping -s 65493 works (65535 bytes "on wire" from pcap)
>   - ping -s 65494:
> 
>     *** buffer overflow detected ***: terminated
> 
>     without this patch, we catch that in packet_add_do() (and without
>     9/12 we don't crash)

Ouch.  That's a bug in our vhost-user code.  The easy fix would be to
clamp MTU to 65521, arguably more correct would be to decouple its
notion of maximum frame size from ETH_MAX_MTU.

> - tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
>   correct maximum though):

No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed
tap frame size.  Seems like there are other drivers that should also
be clamping their max MTU similarly, but aren't.

>   - ping -s 65493 works (65535 bytes on wire)
>   - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

This time the fragmentation is correct, because the MTU is only 65521.

> So, I guess we should find out the issue with vhost-user first.

Yeah.  I definitely need to intermingle the MTU series with this one
to get the order of operations right.

> Other than that, it looks like we can reach at least 65539 bytes if we
> add the "virtio-net" length descriptors, but still the frame size
> itself (which is actually what matters for the functions in packet.c)

Well.. sort of.  The packet.c functions really care nothing about any
of the layers, it's just a blob of data to them.  I did miss that
tap_passt_input() excludes the qemu header before inserting the frame
into the pool, so indeed, the pool layer won't currently see length
greater than 65535.  Of course, since that frame header *is* stored in
pkt_buf with all the rest, that's arguably not using the pool layer's
buffer bound checks to the full extent we could.

> can't exceed 65535 bytes, at least from my tests.
> 
> Then, yes, I agree that it's not actually correct, even though it fits
> all the use cases we have, because we *could* have an implementation
> exceeding that value (at the moment, it looks like we don't).

So, it seems like the Linux drivers might not actually generate
ethernet frames > 65535 bytes - although they don't all correctly
reduce their maximum MTU to reflect that.  I don't think we should
rely on that; AFAICT it would be reasonable for a driver + VMM
implementation to allow ethernet frames that are at least 65535 bytes
+ L2 header.  That might also allow for 16 byte 802.1q vlan L2
headers.

> > > fixes an actual issue as you seem to imply, actually checking that with
> > > QEMU and muvm would be nice.
> > > 
> > > By the way, as you mention a specific calculation, does it really make
> > > sense to use a "good enough" value here? Can we ever exceed 65538
> > > bytes, or can we use that as limit? It would be good to find out, while
> > > at it.  
> > 
> > So, yes, I think we can exceed 65538.  But more significantly, trying
> > to make the limit tight here feels like a borderline layering
> > violation.  The packet layer doesn't really care about the frame size
> > as long as it's "sane".
> 
> It might still be convenient for some back-ends to define "sane" as 64
> KiB. I'm really not sure if it is, I didn't look into the matching part
> of vhost-user in detail.

That's fair, but I don't think the pool layer should impose that limit
on the backends, because I think it's equally reasonable for another
backend to allow slightly larger frames with a 65535 byte L3 payload.
Hence setting the limit to 64k + "a little bit".
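
Concretely, the limit being described would take a shape along these lines
(the 128 bytes of headroom is the figure from the patch description, and
the exact expression here is only illustrative):

	/* Maximum frame length accepted by the pool layer: a full
	 * 65535-byte L3 payload plus room for L2 and backend headers.
	 */
	#define PACKET_MAX_LEN	((size_t)UINT16_MAX + 128)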

> If it's buggy because we have a user that can exceed that, sure, let's
> fix it. If not... also fine by me as it's some kind of theoretical
> flaw, but we shouldn't crash.
> 
> > Fwiw, in the draft changes I have improving
> > MTU handling, it's my intention that individual backends calculate
> > and/or enforce tighter limits of their own where practical, and
> > BUILD_ASSERT() that those fit within the packet layer's frame size
> > limit.
> 
> Ah, nice, that definitely sounds like an improvement.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro
  2025-01-02 22:00       ` Stefano Brivio
@ 2025-01-03  4:48         ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2025-01-03  4:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6767 bytes --]

On Thu, Jan 02, 2025 at 11:00:04PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 13:15:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:30 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Two places in the code use the packet_get_try() variant on packet_get().
> > > > The difference is that packet_get_try() passes a NULL 'func' to
> > > > packet_get_do(), which suppresses log messages.  The places we use this
> > > > are ones where we expect to sometimes have a failure retrieving the packet
> > > > range, even in normal cases.  So, suppressing the log messages seems like
> > > > it makes sense, except:
> > > > 
> > > > 1) It suppresses log messages on all errors.  We (somewhat) expect to hit
> > > >    the case where the requested range is not within the received packet.
> > > >    However, it also suppresses messages in cases where the requested packet
> > > >    index doesn't exist, the requested range has a nonsensical length or
> > > >    doesn't lie in even the right vague part of memory.  None of those
> > > >    latter cases are expected, and the message would be helpful if we ever
> > > >    actually hit them.
> > > > 
> > > > 2) The suppressed messages aren't actually that disruptive.  For the case
> > > >    in ip.c, we'd log only if we run out of IPv6 packet before reaching a
> > > >    (non-option) L4 header.  That shouldn't be the case in normal operation
> > > >    so getting a message (at trace level) is not unreasonable.
> > > >    For the case in dhcpv6.c we do suppress a message every time we look for
> > > >    but don't find a specific DHCPv6 option.  That can happen in perfectly
> > > >    ok cases, but, again these are trace() level and DHCPv6 transactions
> > > >    aren't that common.  Suppressing the message for this case doesn't
> > > >    seem worth the drawbacks of (1).  
> > > 
> > > The reason why I implemented packet_get_try() is that the messages from
> > > packet_get_do() indicate serious issues, and if I'm debugging something
> > > by looking at traces it's not great to have messages indicating that we
> > > hit a serious issue while we're simply validating identity associations.  
> > 
> > I'm not following your argument here.  It's exactly because (most of)
> > the message indicate serious issues that I don't want to suppress
> > them.  I don't know what you mean by "validating identity
> > associations".
> 
> But dhcpv6_opt() trying to get data that doesn't exist is *not* an
> issue, including not a serious one, so if I'm debugging something with
> --trace and I see one of these messages I'll shout at "memory" or
> "packet" badness and waste time thinking it's an actual issue.

Oh.. I think I see the confusion.  dhcpv6_opt() trying to get data
that's not in the packet is not an issue.  dhcpv6_opt() trying to get
data that is (theoretically) within the packet, but *not* in the
buffer indicates something very bad has happened.  The former is
exactly one check; every other one is in the second class - trying to
separate those cases is the purpose of the later "different
severities" patch.

The difficulty is that passing func==NULL to indicate the "try" case
doesn't work if we want to still give useful errors for the serious
cases: we need the function name for those too.

I had been considering printing occasional trace level messages for
the ok case an acceptable tradeoff for not suppressing the messages
which are serious.  But I see your case or that being too confusing
when debugging.  I did have a draft where I used an explicit boolean
flag to enable/disable the non-serious errors, but gave up on it for
simplicity.

I'll look into a way to continue suppressing the non-serious error
here.  Maybe moving the (single) non-serious error case message into
the caller with a wrapper.
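
Something along these lines, perhaps (purely a sketch of the idea, with a
hypothetical helper name, not code from the series):

	/* Probe for an option header at @offset, staying silent when the
	 * requested range simply runs past the end of the packet -- that
	 * miss is expected.  Anything packet_get() still complains about
	 * is a genuine error and keeps its message.
	 */
	static struct opt_hdr *dhcpv6_opt_probe(const struct pool *p,
						size_t offset, size_t *left)
	{
		if (offset + sizeof(struct opt_hdr) > p->pkt[0].iov_len)
			return NULL;	/* expected miss: no log message */

		return packet_get(p, 0, offset, sizeof(struct opt_hdr), left);
	}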

> Validating identity associations (IA_NA, IA_TA, RFC 3315) is what
> dhcpv6_ia_notonlink() does. That's the most common case where we'll
> routinely call dhcpv6_opt() to fetch data which isn't there.

Ok.

> > > It's not about the amount of logged messages, it's about the type of
> > > message being logged and the distracting noise possibly resulting in a
> > > substantial time waste.
> > > 
> > > About 1): dhcpv6_opt() always picks pool index 0, and the base offset
> > > was already checked by the caller.  
> > 
> > Right, but dhcpv6_opt() is called in a loop, that only stops when it
> > returns NULL.  So, by construction the last call to dhcpv6_opt(),
> > which terminates the loop, _will_ have a failing call to packet_get().
> > At this point - at least assuming a correctly constructed packet - the
> > offset will point to just past the last option, which should be
> > exactly at the end of the packet.
> 
> Yes, I get that, and:
> 
> - I would be happy if that one were *not* reported as failure

Right, that's also my preference, but as above I compromised on this
to simplify preserving the error cases that do matter.

> - the calls before that one should always be enough to check if we have
>   an actual issue with the packet

Yes, in this case I think that's correct.

> > > In ipv6_l4hdr(), the index was
> > > already validated by a previous call to packet_get(), and the starting
> > > offset as well.  
> > 
> > Not AFAICT, the initial packet_get just validates the basic IPv6
> > header.  The calls to packet_get_try() in the loop validate additional
> > IP options.  I don't think it will ever fail on a well-constructed
> > packet, but it could on a bogus (or truncated) packet, where the
> > nexthdr field indicates an option that's actually missing.
> > 
> > This is kind of my point: it will only trip on a badly constructed
> > packet, in which case I don't think we want to suppress messages.
> 
> There, I used packet_get_try() because a missing option or payload
> doesn't indicate a bad packet at the _data_ level.

Not really sure what you mean by the data level, here.

> On the other hand, it's bad at the network level anyway, because option
> 59 *must* be there otherwise (I just realised), so while I'd still
> prefer another wording of the warning (not mentioning packet/buffer
> ranges... something more network-y), I would be fine with it.

That sounds like another argument for moving the message for the
"requested range is outside packet" case into the caller.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors
  2025-01-02 22:00       ` Stefano Brivio
@ 2025-01-03  5:06         ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2025-01-03  5:06 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 18588 bytes --]

On Thu, Jan 02, 2025 at 11:00:08PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 13:58:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:32 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > packet_add() and packet_get() can fail for a number of reasons, and we
> > > > currently treat them all basically the same: we log a trace() level message
> > > > and for packet_get() we return NULL.  However the different causes of
> > > > error are quite different and suggest different handling:
> > > > 
> > > > 1) If we run out of space in the pool on add, that's (rightly) non-fatal,
> > > >    but is an unusual situation which we might want to know about.  Promote
> > > >    it's message to debug() level.
> > > > 
> > > > 2) All packet_check_range() errors and the checks on packet length indicate
> > > >    a serious problem.  Due to either a bug in calling code, or some sort
> > > >    of memory-clobbering, we've been given addresses or sizes of packets
> > > >    that are nonsensical.  If things are this bad, we shouldn't try to carry
> > > >    on; replace these with asserts.
> > > > 
> > > > 3) Requesting a packet index that doesn't exist in the pool in packet_get()
> > > >    indicates a bug in the caller - it should always check the index first.
> > > >    Replace this with an assert.  
> > > 
> > > The reasons for 2) and 3) being trace() messages (they should probably
> > > be debug() following the same reasoning as 1)) are graceful degradation
> > > and security (availability).  
> > 
> > Hmm..
> > 
> > > If we have a packet triggering some sort of "memory clobbering", or an
> > > issue in the caller, that might very well be just one packet or a few
> > > of them.  
> > 
> > I see the case for packet_add_do().  In that case it could be that
> > this is just a bad packet triggering bugs in the initial processing
> > code, and just dropping it will be sufficient to save us from damaging
> > state.
> > 
> > For packet_get_do(), however, this indicates that we have a packet
> > that's out of bounds, despite the fact that we checked its bounds when
> > it was added to the pool.  That means we *already* have corrupted
> > memory, and we have no way to assess the extent of the damage.  It's
> > entirely possible we're already badly compromised.  Under those
> > circumstances I think it is much better to terminate immediately,
> > rather than risk doing any more hard to trace damage.
> 
> It depends on which part of packet_get_do(). If we're exceeding the
> packet count or the size of a pool, then I guess yes, otherwise not so
> much.

There's one, maybe two of the tests in packet_get_do() that don't
indicate a serious error:
	if (len + offset > p->pkt[idx].iov_len) {
is definitely ok, and you've convinced me I should try harder to make
sure this doesn't print any error (even trace()) in at least the "try"
cases.
	if (len > UINT16_MAX) {
is questionable.  Technically it could be ok, but since that parameter
is generally a constant, chances are it does indicate a real bug in
the caller.

All the other checks, from packet_check_range() must indicate
something very bad.

> > > I think it's a disservice to users to crash in that case, and it has
> > > the potential to worsen a security flaw if this packet can be built on
> > > purpose. It's also a disservice to package maintainers because it has
> > > the potential to turn a harmless issue into an annoying situation with
> > > associated time pressure and everything.  
> > 
> > Ok, that argument makes sense to me for packet_add_do(), but again,
> > not for packet_get_do().
> > 
> > > Those are not warn() calls, by the way, because we don't want to make
> > > it too easy, in that (perhaps unlikely) case, to flood logs, and
> > > possibly hide information due to rotation.  
> > 
> > Right, I see the argument, but it also worries me that the message -
> > which even if we are successfully containing the damage here
> > definitely indicates a bug in the caller - could easily never be seen
> > at trace or even debug level.
> > 
> > > I suppose that the argument for asserts here is so that we discover
> > > functional issues as soon as possible, which I understand of course,  
> > 
> > That's the main reason.  The secondary reason is that it clarifies the
> > meaning of a NULL return from packet_get_do().  With this change it
> > means: you requested a range that's not in the packet, whereas
> > previously it could mean either that or "something is horribly wrong
> > in the pool state, and we don't know what".
> > 
> > > but it relies on two assumptions that might not hold:
> > > 
> > > 1. the crash will be reported, while a debug() message will go
> > >    unnoticed.
> > > 
> > >    Well, the crash is more noticeable, but that doesn't mean that it
> > >    will be reported. Users might/will try to hack something around it
> > >    or try to survive with slirp4netns and all its flaws.  
> > 
> > I suppose so.
> > 
> > >    On the other hand, a malfunction (say, failed connections) might be
> > >    equally likely to be reported, along with debug() messages.  
> > 
> > That only applies if this causes malfunctions in intended behaviour.
> > If it's triggered due to an attack, rather than a flaw handling
> > expected behaviour then there's no reason it would be reported.
> > 
> > > 2. the check and the assertion themselves are correct. This was not the
> > >    case for the two most recent severe issues (commit 61c0b0d0f199, bug
> > >    #105) we discovered through assertions.  
> > 
> > That's fair, I can see why you'd be gun shy about asserts given that.
> > 
> > >    Issues such as https://github.com/containers/podman/issues/22925, on
> > >    the other hand, are good examples of how asserts saved us a lot
> > >    of time and effort later on (compared to, say, "connections stop
> > >    working after three days"). But I would argue that those are
> > >    substantially different cases where assertions are checking
> > >    something that's much more inherent to the implementation.  
> > 
> > Eh... I can't really see a way these assertions are different other
> > than that one turned out to be wrong and the other didn't.
> 
> Ah, sorry, I meant that there is a substantial difference between the
> ones related to Podman issue 22925 *and what might happen in
> packet_get_range()*.
> 
> Not that there's a difference between those related to Podman issue
> 22925 and those from commit 61c0b0d0f199 and bug #105.

Ok, but I'm still not really seeing how to distinguish the cases
without the benefit of hindsight.

> > > In this case, I would argue that there's no need to crash, so we
> > > shouldn't. We can presumably carry on after that, and we're not leaking
> > > any resource or sneakily leave behind some condition that will cause
> > > apparently unrelated issues at a later time.  
> > 
> > Again, true (plausibly) for packet_add_do(), but not for
> > packet_get_do().
> > 
> > > > 4) On packet_get() requesting a section of the packet beyond its bounds is
> > > >    correct already.  This happens routinely from some callers when they
> > > >    probe to see if certain parts of the packet are present.  At worst it
> > > >    indicates that the guest has generated a malformed packet, which we
> > > >    should quietly drop as we already do.  
> > 
> > This patch doesn't change that.  We check against the stated packet
> > bounds first and return NULL *before* we check against the buffer
> > bounds and possibly trigger an assert.
> 
> Yes... I wasn't actually disputing that (quote is from yourself).
> 
> > Ok... here's a tentative proposal, let me know what you think:
> 
> So, first off, I have to say that it's a bit hard to reason about
> something that is not giving users any problem at the moment (well, at
> least not without a part of this series), and which didn't really give
> us problems in the past.
> 
> It's difficult to quantify with experience and examples (even letting
> alone the priority of the whole matter for a moment...). Anyway, let me
> try:

I mean, I see your point.  This is basically following on from the MTU
stuff - I was tracing what enforces / relies on packet/datagram/frame
size limits and making sure they're consistent through the entire
chain.  At the moment we allow the user to set --mtu 65535 (i.e. 65535
byte IP datagrams, not including L2), but the pool layer (amongst
other things) doesn't allow us to actually deliver it.  This does
cause user-visible (though minor) problems, such as bug 65 & bug 66.
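
For the record, the arithmetic behind that (assuming a plain 14-byte
Ethernet header, no VLAN tag):

	65535 (IP datagram) + 14 (L2 header) = 65549 bytes
	65549 > 65535 (UINT16_MAX)

so a per-packet limit of UINT16_MAX can't hold a full-MTU frame.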

I hope merging the two series will make what's going on clearer.

> >   * I promote the packet_check_range() messages to warn(), or even
> >     err().  Yes, this might allow for log flooding, but I would argue
> >     that if we have a bug allowing this path to be triggered and
> >     something is frequently triggering it, we *want* that to be noisy.
> >     We *think* we're containing the damage here, but we don't really
> >     know.
> 
> I still think it's our responsibility to not flood the system log in
> *any* circumstance.
> 
> Remember that this is not necessarily *our* log where the worst that
> can happen is that we hide information. Even with sane system log
> configurations, one message per packet can get close to a substantial
> denial of service (especially with systemd-syslogd, see also
> https://github.com/systemd/systemd/issues/13425).
> 
> I guess you don't realise until you run passt without -f, for example
> with libvirt, for a large number of virtual machines. Or in KubeVirt.
> Or for a bunch of containers. It's a common and very much intended
> usage.

Hm, fair point.  I feel like at some point we're probably going to
have to bite the bullet and implement rate-limited messages and/or
some sort of "warn once" thing at some point.

> It's not true in general that we don't know. We can quantify the
> difference in damage very clearly if the attacker manages to make
> packet_check_range() print one line for each packet (and nothing
> else).
> 
> >   * I make packet_check_range() failures in packet_add_do() drop the
> >     packet and continue, as they used to
> >   * packet_check_range() failures in packet_get_do() will still
> >     assert()
> 
> I guess the (start < p->buf) one makes sense, but not so much the one
> where we check that offset plus length is within the pool, because that
> one could be very well triggered by a specific packet itself.

I really don't see the difference.  Since we're talking
packet_get_do() here, this means that we've *already* inserted a
packet in the pool which extends outside the buffer(s) it's supposed
to be in.  But we can't have inserted this packet with
packet_add_do(), because it checked the same things non-fatally.
Which means either a packet has been inserted bypassing
packet_add_do() or the pool metadata has been corrupted since then.
Either of those situations indicates something very, very bad.
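
Concretely, the split would look roughly like this (sketch only,
assuming packet_check_range() keeps an int return for the non-fatal
path; messages and levels not final):

	/* packet_add_do(): a bad range here may be just one bad frame,
	 * so complain and drop that frame, but keep running.
	 */
	if (packet_check_range(p, start, len, func, line)) {
		warn("dropping packet with invalid range, %s:%i",
		     func, line);
		return;
	}

	/* packet_get_do(): this range was already checked when the
	 * packet was added, so failing here means the pool state is
	 * already corrupt -- don't try to carry on.
	 */
	if (packet_check_range(p, ptr, len, func, line))
		abort_with_msg("corrupt pool state, %s:%i", func, line);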

> >   * I leave the assert()s for packet _index_ out of range
> 
> Okay, yes, those make sense to me as well.
> 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  packet.c    | 71 ++++++++++++++++++++---------------------------------
> > > >  packet.h    |  3 ++-
> > > >  vu_common.c | 13 +++++++---
> > > >  3 files changed, 38 insertions(+), 49 deletions(-)
> > > > 
> > > > diff --git a/packet.c b/packet.c
> > > > index c921aa15..24f12448 100644
> > > > --- a/packet.c
> > > > +++ b/packet.c
> > > > @@ -30,37 +30,27 @@
> > > >   * @func:	For tracing: name of calling function
> > > >   * @line:	For tracing: caller line of function call
> > > >   *
> > > > - * Return: 0 if the range is valid, -1 otherwise
> > > > + * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
> > > > + * the pool).
> > > >   */
> > > > -static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
> > > > -			      const char *func, int line)
> > > > +static void packet_check_range(const struct pool *p,
> > > > +			       const char *ptr, size_t len,
> > > > +			       const char *func, int line)
> > > >  {
> > > >  	if (p->buf_size == 0) {
> > > > -		int ret;
> > > > -
> > > > -		ret = vu_packet_check_range((void *)p->buf, ptr, len);
> > > > -
> > > > -		if (ret == -1)
> > > > -			trace("cannot find region, %s:%i", func, line);
> > > > -
> > > > -		return ret;
> > > > -	}
> > > > -
> > > > -	if (ptr < p->buf) {
> > > > -		trace("packet range start %p before buffer start %p, %s:%i",
> > > > -		      (void *)ptr, (void *)p->buf, func, line);
> > > > -		return -1;
> > > > -	}
> > > > -
> > > > -	if (ptr + len > p->buf + p->buf_size) {
> > > > -		trace("packet range end %p after buffer end %p, %s:%i",
> > > > -		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > > > -		      func, line);
> > > > -		return -1;
> > > > +		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
> > > > +		return;
> > > >  	}
> > > >  
> > > > -	return 0;
> > > > +	ASSERT_WITH_MSG(ptr >= p->buf,
> > > > +			"packet range start %p before buffer start %p, %s:%i",
> > > > +			(void *)ptr, (void *)p->buf, func, line);
> > > > +	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
> > > > +			"packet range end %p after buffer end %p, %s:%i",
> > > > +			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
> > > > +			func, line);
> > > >  }
> > > > +
> > > >  /**
> > > >   * packet_add_do() - Add data as packet descriptor to given pool
> > > >   * @p:		Existing pool
> > > > @@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
> > > >  	size_t idx = p->count;
> > > >  
> > > >  	if (idx >= p->size) {
> > > > -		trace("add packet index %zu to pool with size %zu, %s:%i",
> > > > +		debug("add packet index %zu to pool with size %zu, %s:%i",
> > > >  		      idx, p->size, func, line);
> > > >  		return;
> > > >  	}
> > > >  
> > > > -	if (packet_check_range(p, start, len, func, line))
> > > > -		return;
> > > > +	packet_check_range(p, start, len, func, line);
> > > >  
> > > > -	if (len > PACKET_MAX_LEN) {
> > > > -		trace("add packet length %zu, %s:%i", len, func, line);
> > > > -		return;
> > > > -	}
> > > > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > > > +			"add packet length %zu (max %zu), %s:%i",
> > > > +			len, PACKET_MAX_LEN, func, line);
> > > >  
> > > >  	p->pkt[idx].iov_base = (void *)start;
> > > >  	p->pkt[idx].iov_len = len;
> > > > @@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> > > >  {
> > > >  	char *ptr;
> > > >  
> > > > -	if (idx >= p->size || idx >= p->count) {
> > > > -		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
> > > > -		      idx, p->size, p->count, func, line);
> > > > -		return NULL;
> > > > -	}
> > > > -
> > > > -	if (len > PACKET_MAX_LEN) {
> > > > -		trace("packet data length %zu, %s:%i", len, func, line);
> > > > -		return NULL;
> > > > -	}
> > > > +	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
> > > > +			"packet %zu from pool size: %zu, count: %zu, %s:%i",
> > > > +			idx, p->size, p->count, func, line);
> > > > +	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
> > > > +			"packet range length %zu (max %zu), %s:%i",
> > > > +			len, PACKET_MAX_LEN, func, line);
> > > >  
> > > >  	if (len + offset > p->pkt[idx].iov_len) {
> > > >  		trace("data length %zu, offset %zu from length %zu, %s:%i",
> > > > @@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> > > >  
> > > >  	ptr = (char *)p->pkt[idx].iov_base + offset;
> > > >  
> > > > -	if (packet_check_range(p, ptr, len, func, line))
> > > > -		return NULL;
> > > > +	packet_check_range(p, ptr, len, func, line);
> > > >  
> > > >  	if (left)
> > > >  		*left = p->pkt[idx].iov_len - offset - len;
> > > > diff --git a/packet.h b/packet.h
> > > > index f95cda08..b164f77e 100644
> > > > --- a/packet.h
> > > > +++ b/packet.h
> > > > @@ -31,7 +31,8 @@ struct pool {
> > > >  	struct iovec pkt[];
> > > >  };
> > > >  
> > > > -int vu_packet_check_range(void *buf, const char *ptr, size_t len);
> > > > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > > > +			   const char *func, int line);
> > > >  void packet_add_do(struct pool *p, size_t len, const char *start,
> > > >  		   const char *func, int line);
> > > >  void *packet_get_do(const struct pool *p, const size_t idx,
> > > > diff --git a/vu_common.c b/vu_common.c
> > > > index 531f8786..528b9b08 100644
> > > > --- a/vu_common.c
> > > > +++ b/vu_common.c
> > > > @@ -24,10 +24,13 @@
> > > >   * @buf:	Array of the available memory regions
> > > >   * @ptr:	Start of desired data range
> > > >   * @size:	Length of desired data range
> > > > + * @func:	For tracing: name of calling function
> > > > + * @line:	For tracing: caller line of function call  
> > > 
> > > If those arguments are mandatory, I would drop the "For tracing"
> > > specification.
> > >   
> > > >   *
> > > > - * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> > > > + * Aborts if the zone isn't in any of the device regions
> > > >   */
> > > > -int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> > > > +void vu_packet_check_range(void *buf, const char *ptr, size_t len,
> > > > +			   const char *func, int line)
> > > >  {
> > > >  	struct vu_dev_region *dev_region;
> > > >  
> > > > @@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)
> > > >  
> > > >  		if (m <= ptr &&
> > > >  		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
> > > > -			return 0;
> > > > +			return;
> > > >  	}
> > > >  
> > > > -	return -1;
> > > > +	abort_with_msg(
> > > > +		"package range at %p, length %zd not within dev region %s:%i",  
> > > 
> > > s/package/packet/
> > >   
> > > > +		(void *)ptr, len, func, line);
> > > >  }
> > > >  
> > > >  /**  
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
  2025-01-02 22:00       ` Stefano Brivio
@ 2025-01-03  6:06         ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2025-01-03  6:06 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6276 bytes --]

On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 14:46:45 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:34 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > > > possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> > > > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > > > the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> > > > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > > > we're dealing with virtual interfaces we don't have the physical Ethernet
> > > > limitations requiring that length.  Indeed it is possible to generate a
> > > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > > > the 'pasta' backend is only 42 bytes long).
> > > > 
> > > > It's also unclear if this limit is sufficient for vhost-user which isn't
> > > > limited by the size of pkt_buf as the other modes are.
> > > > 
> > > > We could attempt to correct the calculation, but that would leave us with
> > > > even larger arrays, which in practice rarely accumulate more than a handful
> > > > of packets.  So, instead, put an arbitrary cap on the number of packets we
> > > > can put in a batch, and if we run out of space, process and flush the
> > > > batch.  
> > > 
> > > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > > general I couldn't really see a difference in latency (especially for
> > > UDP streams with small packets) or throughput. Figures from short
> > > throughput tests (such as the ones from the test suite) look a bit more
> > > variable, but I don't have any statistically meaningful data.
> > > 
> > > Then I looked into how many messages we might have in the array without
> > > this change, and I realised that, with the throughput tests from the
> > > suite, we very easily exceed the 256 limit.  
> > 
> > Ah, interesting.
> > 
> > > Perhaps surprisingly we get the highest buffer counts with TCP transfers
> > > and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> > > more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> > > in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> > > the extra system call overhead from forcibly flushing batches might
> > > become rather relevant.  
> > 
> > Really?  I thought syscall overhead (as in the part that's
> > per-syscall, rather than per-work) was generally in the tens of µs
> > range, rather than the ms range.
> 
> Tens or hundreds of µs (mind that it's several of them), so there
> could be just one order of magnitude between the two.

Hm, ok.

> > But in any case, I'd be fine with upping the size of the array to 4k
> > or 8k based on that empirical data.  That's still much smaller than
> > the >150k we have now.
> 
> I would even go with 32k -- there are embedded systems with a ton of
> memory but still much slower clocks compared to my typical setup. Go
> figure. Again, I think we should test and profile this, ideally, but if
> not, then let's pick something that's ~10x of what I see.
> 
> > > With lower MTUs, it looks like we have a lower CPU load and
> > > transmissions are scheduled differently (resulting in smaller batches),
> > > but I didn't really trace things.  
> > 
> > Ok.  I wonder if with the small MTUs we're hitting throughput
> > bottlenecks elsewhere which mean this particular path isn't
> > over-exercised.
> > 
> > > So I start thinking that this has the *potential* to introduce a
> > > performance regression in some cases and we shouldn't just assume that
> > > some arbitrary 256 limit is good enough. I didn't check with perf(1),
> > > though.
> > > 
> > > Right now that array takes, effectively, less than 100 KiB (it's ~5000
> > > copies of struct iovec, 16 bytes each), and in theory that could be
> > > ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> > > assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> > > have no practical effect anyway.  
> > 
> > Yeah, I guess.  I don't love the fact that currently for correctness
> > (not spuriously dropping packets) we rely on a fairly complex
> > calculation that's based on information from different layers: the
> > buffer size and enforcement is in the packet pool layer and is
> > independent of packet layout, but the minimum frame size comes from
> > the tap layer and depends quite specifically on which L2 encapsulation
> > we're using.
> 
> Well but it's exactly one line, and we're talking about the same
> project and tool, not something that's separated by several API layers.
> 
> By the way: on one hand you have that. On the other hand, you're adding
> an arbitrary limit that comes from a test I just did, which is also
> based on information from different layers.

The difference is that it used to be a hard limit: if we exceeded it
we dropped packets.  Now, we just do a little more work.
Actually... that makes me realise the actual size of the batches isn't
the relevant factor in choosing the size.  The amortized cost of
processing a packet will be pretty much:

	  (per byte costs) * (packet size)
	+ (per packet costs)
	+ (per batch costs) / (batch size)

So, we just need to allow (batch size) to become large enough to make
the last term negligible compared to the other two.  I find it hard to
believe that we'd need more than ~1000 to do that.
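
To put rough numbers on it (purely illustrative, not measured): with a
per-batch cost of ~100 µs (a few syscalls) and a per-packet cost on
the order of 1 µs, a batch size of 1000 amortizes the batch term down
to ~0.1 µs per packet, already well below the per-packet term, so
going much beyond that buys essentially nothing.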

> > > All in all, I think we shouldn't change this limit without a deeper
> > > understanding of the practical impact. While this change doesn't bring
> > > any practical advantage, the current behaviour is somewhat tested by
> > > now, and a small limit isn't.  
> 
> ...and this still applies, I think.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-01-03  6:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
2024-12-20  8:35 ` [PATCH v2 01/12] test focus David Gibson
2024-12-20  8:35 ` [PATCH v2 02/12] hack: stop on fail, but not perf fail David Gibson
2024-12-20  8:35 ` [PATCH v2 03/12] make passt dumpable David Gibson
2024-12-20  8:35 ` [PATCH v2 04/12] packet: Use flexible array member in struct pool David Gibson
2024-12-20  8:35 ` [PATCH v2 05/12] packet: Don't pass start and offset separately too packet_check_range() David Gibson
2024-12-20  8:35 ` [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  1:00     ` David Gibson
2025-01-02 21:59       ` Stefano Brivio
2025-01-03  1:16         ` David Gibson
2024-12-20  8:35 ` [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  2:15     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  4:48         ` David Gibson
2024-12-20  8:35 ` [PATCH v2 08/12] util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers David Gibson
2024-12-20  8:35 ` [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  2:58     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  5:06         ` David Gibson
2024-12-20  8:35 ` [PATCH v2 10/12] packet: Move packet length checks into packet_check_range() David Gibson
2024-12-20  8:35 ` [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  3:46     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  6:06         ` David Gibson
2024-12-20  8:35 ` [PATCH v2 12/12] packet: More cautious checks to avoid pointer arithmetic UB David Gibson
2024-12-20  9:00 ` [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
2024-12-20 10:06   ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).