public inbox for passt-dev@passt.top
* [PATCH 00/24] Add vhost-user support to passt.
@ 2024-02-02 14:11 Laurent Vivier
  2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
                   ` (23 more replies)
  0 siblings, 24 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

This series adds vhost-user support to passt, which allows passt to
connect to the QEMU network backend using virtqueues rather than a
socket.

With iperf3, we measure roughly a 10x throughput improvement:

  $ iperf3 -c localhost -p 10001  -t 60 -6 -u -b 50G

  socket:
  [  5]   0.00-60.04  sec  30.5 GBytes  4.36 Gbits/sec  0.065 ms  9127377/10125415 (90%)  receiver
  vhost-user:
  [  5]   0.00-60.05  sec   292 GBytes  41.8 Gbits/sec  0.007 ms  259805/9832736 (2.6%)  receiver

  $ iperf3 -c localhost -p 10001  -t 60 -4 -u -b 50G

  socket:
  [  5]   0.00-60.04  sec  36.4 GBytes  5.21 Gbits/sec  0.048 ms  7535735/8728101 (86%)  receiver
  vhost-user:
  [  5]   0.00-60.05  sec   259 GBytes  37.0 Gbits/sec  0.003 ms  142594/8616705 (1.7%)  receiver

  $ iperf3 -c localhost -p 10001  -t 60 -6

  socket:
  [  5]   0.00-60.00  sec  16.3 GBytes  2.33 Gbits/sec    0             sender
  [  5]   0.00-60.06  sec  16.3 GBytes  2.32 Gbits/sec                  receiver
  vhost-user:
  [ ID] Interval           Transfer     Bitrate         Retr
  [  5]   0.00-60.00  sec   205 GBytes  29.3 Gbits/sec    0             sender
  [  5]   0.00-60.04  sec   205 GBytes  29.3 Gbits/sec                  receiver

  $ iperf3 -c localhost -p 10001  -t 60 -4

  socket:
  [  5]   0.00-60.00  sec  16.1 GBytes  2.31 Gbits/sec    0             sender
  [  5]   0.00-60.07  sec  16.1 GBytes  2.31 Gbits/sec                  receiver
  vhost-user:
  [  5]   0.00-60.00  sec   201 GBytes  28.7 Gbits/sec    0             sender
  [  5]   0.00-60.04  sec   201 GBytes  28.7 Gbits/sec                  receiver

With QEMU, rather than connecting with:

  -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket

we will use:

  -chardev socket,id=chr0,path=/tmp/passt_1.socket
  -netdev vhost-user,id=netdev0,chardev=chr0
  -device virtio-net,netdev=netdev0
  -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
  -numa node,memdev=memfd0

The memory backend is needed to share data between passt and QEMU.

The series starts by introducing new functions to manage iovecs and
to compute checksums on unaligned memory (we cannot align buffers
provided by the guest the way we align internal passt buffers):

  iov: add some functions to manage iovec
  pcap: add pcap_iov()
  checksum: align buffers
  checksum: add csum_iov()
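
As a rough illustration of the kind of helper these patches add, the
sketch below gathers bytes from a scattered iovec into a flat buffer,
skipping an initial offset. It is modelled on iov_to_buf_full() from
patch 01/24, but gather_from_iov() and gather_example() are hypothetical
names used only here, not part of the series:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper, modelled on iov_to_buf_full(): copy up to @bytes
 * bytes, starting @offset bytes into the scatter list, into @buf.
 * Returns the number of bytes actually copied.
 */
static size_t gather_from_iov(const struct iovec *iov, unsigned int iov_cnt,
			      size_t offset, void *buf, size_t bytes)
{
	size_t done = 0;
	unsigned int i;

	for (i = 0; i < iov_cnt && done < bytes; i++) {
		size_t len;

		/* Skip elements that lie entirely before the offset */
		if (offset >= iov[i].iov_len) {
			offset -= iov[i].iov_len;
			continue;
		}

		/* Copy what this element can provide, capped at the request */
		len = iov[i].iov_len - offset;
		if (len > bytes - done)
			len = bytes - done;

		memcpy((char *)buf + done,
		       (const char *)iov[i].iov_base + offset, len);
		done += len;
		offset = 0;
	}

	return done;
}

/* Illustration only: two elements holding "hello, " and "world",
 * read 5 bytes starting at offset 5 into @out.
 */
static size_t gather_example(char *out)
{
	char a[] = "hello, ", b[] = "world";
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = 7 },
		{ .iov_base = b, .iov_len = 5 },
	};

	return gather_from_iov(iov, 2, 5, out, 5);
}
```

In gather_example(), the requested range crosses the boundary between
the two elements, so the copy is split into two memcpy() calls.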

We introduce new files ip.c and ip.h to provide generic IP functions:

  util: move IP stuff from util.[ch] to ip.[ch]
  ip: move duplicate IPv4 checksum function to ip.h

Then we extract the internal passt buffer management from the existing
TCP and UDP functions, so that these functions can be used with
guest-provided buffers:

  tcp: extract buffer management from tcp_send_flag()
  tcp: extract buffer management from tcp_conn_tap_mss()
  tcp: rename functions that manage buffers
  tcp: move buffers management functions to their own file
  tap: make tap_update_mac() generic
  tap: export pool_flush()/tapX_handler()/packet_add()
  udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX()
  udp: rename udp_sock_handler() to udp_buf_sock_handler()
  packet: replace struct desc by struct iovec

As vhost-user is a variant of passt mode, modify the code to check for
(!MODE_PASTA) rather than (MODE_PASST || MODE_VU):

  vhost-user: compare mode MODE_PASTA and not MODE_PASST
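
The comparison change can be pictured as follows. This is a minimal
sketch, assuming an enum with the three modes named above; the enum
layout and the helper name mode_is_passt_like() are assumptions made
for illustration, not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed mode enumeration: only the three names come from the text */
enum passt_modes {
	MODE_PASST,
	MODE_PASTA,
	MODE_VU,
};

/* Hypothetical helper: true for passt mode and its vhost-user variant.
 * Testing "not pasta" avoids listing every passt-like mode explicitly.
 */
static bool mode_is_passt_like(enum passt_modes mode)
{
	return mode != MODE_PASTA;
}
```

Written this way, a further passt-like mode would not require touching
each call site again.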

We introduce virtio and vhost-user management functions:

  vhost-user: introduce virtio API
  vhost-user: introduce vhost-user API

Then we add a first version of vhost-user that copies data from the
passt buffers to guest memory, and vice versa, as is done with the
socket algorithm:

  vhost-user: add vhost-user

And finally we remove the buffer copies on TX and RX (TCP/UDP):

  vhost-user: use guest buffer directly in vu_handle_tx()
  tcp: vhost-user RX nocopy
  udp: vhost-user RX nocopy
  vhost-user: remove tap_send_frames_vu()

Thanks,
Laurent

Laurent Vivier (24):
  iov: add some functions to manage iovec
  pcap: add pcap_iov()
  checksum: align buffers
  checksum: add csum_iov()
  util: move IP stuff from util.[ch] to ip.[ch]
  ip: move duplicate IPv4 checksum function to ip.h
  ip: introduce functions to compute the header part checksum for
    TCP/UDP
  tcp: extract buffer management from tcp_send_flag()
  tcp: extract buffer management from tcp_conn_tap_mss()
  tcp: rename functions that manage buffers
  tcp: move buffers management functions to their own file
  tap: make tap_update_mac() generic
  tap: export pool_flush()/tapX_handler()/packet_add()
  udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX()
  udp: rename udp_sock_handler() to udp_buf_sock_handler()
  packet: replace struct desc by struct iovec
  vhost-user: compare mode MODE_PASTA and not MODE_PASST
  vhost-user: introduce virtio API
  vhost-user: introduce vhost-user API
  vhost-user: add vhost-user
  vhost-user: use guest buffer directly in vu_handle_tx()
  tcp: vhost-user RX nocopy
  udp: vhost-user RX nocopy
  vhost-user: remove tap_send_frames_vu()

 Makefile       |    7 +-
 checksum.c     |   51 ++-
 checksum.h     |    1 +
 conf.c         |   33 +-
 dhcp.c         |    1 +
 flow.c         |    1 +
 icmp.c         |    1 +
 iov.c          |   78 ++++
 iov.h          |   46 +++
 ip.c           |   72 ++++
 ip.h           |  124 ++++++
 isolation.c    |   10 +-
 ndp.c          |    1 +
 packet.c       |   81 ++--
 packet.h       |   16 +-
 passt.c        |   18 +-
 passt.h        |   10 +
 pcap.c         |   32 ++
 pcap.h         |    1 +
 port_fwd.c     |    1 +
 qrap.c         |    1 +
 tap.c          |  226 +++++++----
 tap.h          |   13 +-
 tcp.c          |  789 ++++++------------------------------
 tcp.h          |    2 +-
 tcp_buf.c      |  569 ++++++++++++++++++++++++++
 tcp_buf.h      |   17 +
 tcp_internal.h |   78 ++++
 tcp_splice.c   |    1 +
 tcp_vu.c       |  447 +++++++++++++++++++++
 tcp_vu.h       |   10 +
 udp.c          |  171 ++++----
 udp.h          |    4 +-
 udp_internal.h |   21 +
 udp_vu.c       |  215 ++++++++++
 udp_vu.h       |    8 +
 util.c         |   55 ---
 util.h         |   83 +---
 vhost_user.c   | 1050 ++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h   |  137 +++++++
 virtio.c       |  484 ++++++++++++++++++++++
 virtio.h       |  121 ++++++
 42 files changed, 4041 insertions(+), 1046 deletions(-)
 create mode 100644 iov.c
 create mode 100644 iov.h
 create mode 100644 ip.c
 create mode 100644 ip.h
 create mode 100644 tcp_buf.c
 create mode 100644 tcp_buf.h
 create mode 100644 tcp_internal.h
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h
 create mode 100644 virtio.c
 create mode 100644 virtio.h

-- 
2.42.0



* [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  5:57   ` David Gibson
  2024-02-06 16:10   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 02/24] pcap: add pcap_iov() Laurent Vivier
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile |  4 +--
 iov.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 iov.h    | 46 +++++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 iov.c
 create mode 100644 iov.h

diff --git a/Makefile b/Makefile
index af4fa87e7e13..c1138fb91d26 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
-	util.c
+	util.c iov.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -56,7 +56,7 @@ MANPAGES = passt.1 pasta.1 qrap.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
-	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h
+	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/iov.c b/iov.c
new file mode 100644
index 000000000000..38a8e7566021
--- /dev/null
+++ b/iov.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* some parts copied from QEMU util/iov.c */
+
+#include <sys/socket.h>
+#include "util.h"
+#include "iov.h"
+
+size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
+			 size_t offset, const void *buf, size_t bytes)
+{
+	size_t done;
+	unsigned int i;
+	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
+		if (offset < iov[i].iov_len) {
+			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
+			memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
+			done += len;
+			offset = 0;
+		} else {
+			offset -= iov[i].iov_len;
+		}
+	}
+	return done;
+}
+
+size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
+		       size_t offset, void *buf, size_t bytes)
+{
+	size_t done;
+	unsigned int i;
+	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
+		if (offset < iov[i].iov_len) {
+			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
+			memcpy((char *)buf + done, (char *)iov[i].iov_base + offset, len);
+			done += len;
+			offset = 0;
+		} else {
+			offset -= iov[i].iov_len;
+		}
+	}
+	return done;
+}
+
+size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt)
+{
+	size_t len;
+	unsigned int i;
+
+	len = 0;
+	for (i = 0; i < iov_cnt; i++) {
+		len += iov[i].iov_len;
+	}
+	return len;
+}
+
+unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
+		  const struct iovec *iov, unsigned int iov_cnt,
+		  size_t offset, size_t bytes)
+{
+	size_t len;
+	unsigned int i, j;
+	for (i = 0, j = 0;
+		 i < iov_cnt && j < dst_iov_cnt && (offset || bytes); i++) {
+		if (offset >= iov[i].iov_len) {
+			offset -= iov[i].iov_len;
+			continue;
+		}
+		len = MIN(bytes, iov[i].iov_len - offset);
+
+		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
+		dst_iov[j].iov_len = len;
+		j++;
+		bytes -= len;
+		offset = 0;
+	}
+	return j;
+}
diff --git a/iov.h b/iov.h
new file mode 100644
index 000000000000..31fbf6d0e1cf
--- /dev/null
+++ b/iov.h
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* some parts copied from QEMU include/qemu/iov.h */
+
+#ifndef IOVEC_H
+#define IOVEC_H
+
+#include <unistd.h>
+#include <string.h>
+
+size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
+			 size_t offset, const void *buf, size_t bytes);
+size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
+		       size_t offset, void *buf, size_t bytes);
+
+static inline size_t iov_from_buf(const struct iovec *iov,
+				  unsigned int iov_cnt, size_t offset,
+				  const void *buf, size_t bytes)
+{
+	if (__builtin_constant_p(bytes) && iov_cnt &&
+		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
+		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
+		return bytes;
+	} else {
+		return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
+	}
+}
+
+static inline size_t iov_to_buf(const struct iovec *iov,
+				const unsigned int iov_cnt, size_t offset,
+				void *buf, size_t bytes)
+{
+	if (__builtin_constant_p(bytes) && iov_cnt &&
+		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
+		memcpy(buf, (char *)iov[0].iov_base + offset, bytes);
+		return bytes;
+	} else {
+		return iov_to_buf_full(iov, iov_cnt, offset, buf, bytes);
+	}
+}
+
+size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt);
+unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
+		  const struct iovec *iov, unsigned int iov_cnt,
+		  size_t offset, size_t bytes);
+#endif
-- 
2.42.0



* [PATCH 02/24] pcap: add pcap_iov()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
  2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:25   ` David Gibson
  2024-02-06 16:10   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 03/24] checksum: align buffers Laurent Vivier
                   ` (21 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 pcap.c | 32 ++++++++++++++++++++++++++++++++
 pcap.h |  1 +
 2 files changed, 33 insertions(+)

diff --git a/pcap.c b/pcap.c
index 501d52d4992b..b002bb01314c 100644
--- a/pcap.c
+++ b/pcap.c
@@ -31,6 +31,7 @@
 #include "util.h"
 #include "passt.h"
 #include "log.h"
+#include "iov.h"
 
 #define PCAP_VERSION_MINOR 4
 
@@ -130,6 +131,37 @@ void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset)
 	}
 }
 
+void pcap_iov(const struct iovec *iov, unsigned int n)
+{
+	struct timeval tv;
+	struct pcap_pkthdr h;
+	size_t len;
+	unsigned int i;
+
+	if (pcap_fd == -1)
+		return;
+
+	gettimeofday(&tv, NULL);
+
+	len = iov_size(iov, n);
+
+	h.tv_sec = tv.tv_sec;
+	h.tv_usec = tv.tv_usec;
+	h.caplen = h.len = len;
+
+	if (write(pcap_fd, &h, sizeof(h)) < 0) {
+		debug("Cannot write pcap header");
+		return;
+	}
+
+	for (i = 0; i < n; i++) {
+		if (write(pcap_fd, iov[i].iov_base, iov[i].iov_len) < 0) {
+			debug("Cannot log packet, iov %d length %lu\n",
+			      i, iov[i].iov_len);
+		}
+	}
+}
+
 /**
  * pcap_init() - Initialise pcap file
  * @c:		Execution context
diff --git a/pcap.h b/pcap.h
index da5a7e846b72..732a0ddf14cc 100644
--- a/pcap.h
+++ b/pcap.h
@@ -8,6 +8,7 @@
 
 void pcap(const char *pkt, size_t len);
 void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset);
+void pcap_iov(const struct iovec *iov, unsigned int n);
 void pcap_init(struct ctx *c);
 
 #endif /* PCAP_H */
-- 
2.42.0



* [PATCH 03/24] checksum: align buffers
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
  2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
  2024-02-02 14:11 ` [PATCH 02/24] pcap: add pcap_iov() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:02   ` David Gibson
  2024-02-02 14:11 ` [PATCH 04/24] checksum: add csum_iov() Laurent Vivier
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

If the buffer is not aligned, use sum_16b() only on the unaligned
leading part, and then use csum_avx2() on the remaining part.
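
A minimal sketch of how the split point can be computed, assuming the
32-byte alignment used in the diff below; unaligned_head() is a
hypothetical helper written for this illustration, not a function from
checksum.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading bytes of @buf that come before
 * the next 32-byte boundary, capped at @len. These bytes would go
 * through the scalar path, the rest through the aligned path.
 */
static size_t unaligned_head(const void *buf, size_t len)
{
	uintptr_t addr = (uintptr_t)buf;
	/* Round the address up to the next multiple of 32 */
	uintptr_t aligned = (addr + 0x1f) & ~(uintptr_t)0x1f;
	size_t pad = aligned - addr;

	return pad < len ? pad : len;
}
```

A buffer that already starts on a 32-byte boundary yields 0, so the
aligned path handles everything; a buffer shorter than the distance to
the boundary is handled by the scalar path alone.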

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 checksum.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/checksum.c b/checksum.c
index f21c9b7a14d1..c94980771c63 100644
--- a/checksum.c
+++ b/checksum.c
@@ -407,7 +407,19 @@ less_than_128_bytes:
 __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
 uint16_t csum(const void *buf, size_t len, uint32_t init)
 {
-	return (uint16_t)~csum_fold(csum_avx2(buf, len, init));
+	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;
+	unsigned int pad = align - (intptr_t)buf;
+
+	if (len < pad)
+		pad = len;
+
+	if (pad)
+		init += sum_16b(buf, pad);
+
+	if (len > pad)
+		init = csum_avx2((void *)align, len - pad, init);
+
+	return (uint16_t)~csum_fold(init);
 }
 
 #else /* __AVX2__ */
-- 
2.42.0



* [PATCH 04/24] checksum: add csum_iov()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (2 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 03/24] checksum: align buffers Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:07   ` David Gibson
  2024-02-07  9:02   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch] Laurent Vivier
                   ` (19 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 checksum.c | 39 ++++++++++++++++++++++-----------------
 checksum.h |  1 +
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/checksum.c b/checksum.c
index c94980771c63..14b6057684d9 100644
--- a/checksum.c
+++ b/checksum.c
@@ -395,17 +395,8 @@ less_than_128_bytes:
 	return (uint32_t)sum64;
 }
 
-/**
- * csum() - Compute TCP/IP-style checksum
- * @buf:	Input buffer, must be aligned to 32-byte boundary
- * @len:	Input length
- * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
- *
- * Return: 16-bit folded, complemented checksum sum
- */
-/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
 __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
-uint16_t csum(const void *buf, size_t len, uint32_t init)
+uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
 {
 	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;
 	unsigned int pad = align - (intptr_t)buf;
@@ -419,24 +410,38 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
 	if (len > pad)
 		init = csum_avx2((void *)align, len - pad, init);
 
-	return (uint16_t)~csum_fold(init);
+	return init;
 }
-
 #else /* __AVX2__ */
 
+__attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
+uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
+{
+	return sum_16b(buf, len) + init;
+}
+#endif /* !__AVX2__ */
+
 /**
  * csum() - Compute TCP/IP-style checksum
- * @buf:	Input buffer
+ * @buf:	Input buffer, must be aligned to 32-byte boundary
  * @len:	Input length
- * @sum:	Initial 32-bit checksum, 0 for no pre-computed checksum
+ * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
  *
- * Return: 16-bit folded, complemented checksum
+ * Return: 16-bit folded, complemented checksum sum
  */
 /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
 __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
 uint16_t csum(const void *buf, size_t len, uint32_t init)
 {
-	return csum_unaligned(buf, len, init);
+	return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
 }
 
-#endif /* !__AVX2__ */
+uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init)
+{
+	unsigned int i;
+
+	for (i = 0; i < n;  i++)
+		init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
+
+	return (uint16_t)~csum_fold(init);
+}
diff --git a/checksum.h b/checksum.h
index 21c0310d3804..6a20297a5826 100644
--- a/checksum.h
+++ b/checksum.h
@@ -25,5 +25,6 @@ void csum_icmp6(struct icmp6hdr *icmp6hr,
 		const struct in6_addr *saddr, const struct in6_addr *daddr,
 		const void *payload, size_t len);
 uint16_t csum(const void *buf, size_t len, uint32_t init);
+uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init);
 
 #endif /* CHECKSUM_H */
-- 
2.42.0



* [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch]
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (3 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 04/24] checksum: add csum_iov() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:13   ` David Gibson
  2024-02-02 14:11 ` [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h Laurent Vivier
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile     |  4 +--
 conf.c       |  1 +
 dhcp.c       |  1 +
 flow.c       |  1 +
 icmp.c       |  1 +
 ip.c         | 72 +++++++++++++++++++++++++++++++++++++++++++
 ip.h         | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 ndp.c        |  1 +
 port_fwd.c   |  1 +
 qrap.c       |  1 +
 tap.c        |  1 +
 tcp.c        |  1 +
 tcp_splice.c |  1 +
 udp.c        |  1 +
 util.c       | 55 ---------------------------------
 util.h       | 76 ----------------------------------------------
 16 files changed, 171 insertions(+), 133 deletions(-)
 create mode 100644 ip.c
 create mode 100644 ip.h

diff --git a/Makefile b/Makefile
index c1138fb91d26..acf37f5a2036 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
-	util.c iov.c
+	util.c iov.c ip.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -56,7 +56,7 @@ MANPAGES = passt.1 pasta.1 qrap.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
-	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h
+	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h ip.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/conf.c b/conf.c
index 5e15b665be9c..93bfda331349 100644
--- a/conf.c
+++ b/conf.c
@@ -35,6 +35,7 @@
 #include <netinet/if_ether.h>
 
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "netlink.h"
 #include "udp.h"
diff --git a/dhcp.c b/dhcp.c
index 110772867632..ff4834a3dce9 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -25,6 +25,7 @@
 #include <limits.h>
 
 #include "util.h"
+#include "ip.h"
 #include "checksum.h"
 #include "packet.h"
 #include "passt.h"
diff --git a/flow.c b/flow.c
index 5e94a7a949e5..73d52bda8774 100644
--- a/flow.c
+++ b/flow.c
@@ -11,6 +11,7 @@
 #include <string.h>
 
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "siphash.h"
 #include "inany.h"
diff --git a/icmp.c b/icmp.c
index 9434fc5a7490..3b85a8578316 100644
--- a/icmp.c
+++ b/icmp.c
@@ -33,6 +33,7 @@
 
 #include "packet.h"
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "tap.h"
 #include "log.h"
diff --git a/ip.c b/ip.c
new file mode 100644
index 000000000000..64e336139819
--- /dev/null
+++ b/ip.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * util.c - Convenience helpers
+ *
+ * Copyright (c) 2020-2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <stddef.h>
+#include "util.h"
+#include "ip.h"
+
+#define IPV6_NH_OPT(nh)							\
+	((nh) == 0   || (nh) == 43  || (nh) == 44  || (nh) == 50  ||	\
+	 (nh) == 51  || (nh) == 60  || (nh) == 135 || (nh) == 139 ||	\
+	 (nh) == 140 || (nh) == 253 || (nh) == 254)
+
+/**
+ * ipv6_l4hdr() - Find pointer to L4 header in IPv6 packet and extract protocol
+ * @p:		Packet pool, packet number @idx has IPv6 header at @offset
+ * @idx:	Index of packet in pool
+ * @offset:	Pre-calculated IPv6 header offset
+ * @proto:	Filled with L4 protocol number
+ * @dlen:	Data length (payload excluding header extensions), set on return
+ *
+ * Return: pointer to L4 header, NULL if not found
+ */
+char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
+		 size_t *dlen)
+{
+	const struct ipv6_opt_hdr *o;
+	const struct ipv6hdr *ip6h;
+	char *base;
+	int hdrlen;
+	uint8_t nh;
+
+	base = packet_get(p, idx, 0, 0, NULL);
+	ip6h = packet_get(p, idx, offset, sizeof(*ip6h), dlen);
+	if (!ip6h)
+		return NULL;
+
+	offset += sizeof(*ip6h);
+
+	nh = ip6h->nexthdr;
+	if (!IPV6_NH_OPT(nh))
+		goto found;
+
+	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
+		nh = o->nexthdr;
+		hdrlen = (o->hdrlen + 1) * 8;
+
+		if (IPV6_NH_OPT(nh))
+			offset += hdrlen;
+		else
+			goto found;
+	}
+
+	return NULL;
+
+found:
+	if (nh == 59)
+		return NULL;
+
+	*proto = nh;
+	return base + offset;
+}
diff --git a/ip.h b/ip.h
new file mode 100644
index 000000000000..b2e08bc049f3
--- /dev/null
+++ b/ip.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef IP_H
+#define IP_H
+
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+
+#define IN4_IS_ADDR_UNSPECIFIED(a) \
+	((a)->s_addr == htonl_constant(INADDR_ANY))
+#define IN4_IS_ADDR_BROADCAST(a) \
+	((a)->s_addr == htonl_constant(INADDR_BROADCAST))
+#define IN4_IS_ADDR_LOOPBACK(a) \
+	(ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
+#define IN4_IS_ADDR_MULTICAST(a) \
+	(IN_MULTICAST(ntohl((a)->s_addr)))
+#define IN4_ARE_ADDR_EQUAL(a, b) \
+	(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
+#define IN4ADDR_LOOPBACK_INIT \
+	{ .s_addr	= htonl_constant(INADDR_LOOPBACK) }
+#define IN4ADDR_ANY_INIT \
+	{ .s_addr	= htonl_constant(INADDR_ANY) }
+
+#define L2_BUF_IP4_INIT(proto)						\
+	{								\
+		.version	= 4,					\
+		.ihl		= 5,					\
+		.tos		= 0,					\
+		.tot_len	= 0,					\
+		.id		= 0,					\
+		.frag_off	= 0,					\
+		.ttl		= 0xff,					\
+		.protocol	= (proto),				\
+		.saddr		= 0,					\
+		.daddr		= 0,					\
+	}
+#define L2_BUF_IP4_PSUM(proto)	((uint32_t)htons_constant(0x4500) +	\
+				 (uint32_t)htons_constant(0xff00 | (proto)))
+
+#define L2_BUF_IP6_INIT(proto)						\
+	{								\
+		.priority	= 0,					\
+		.version	= 6,					\
+		.flow_lbl	= { 0 },				\
+		.payload_len	= 0,					\
+		.nexthdr	= (proto),				\
+		.hop_limit	= 255,					\
+		.saddr		= IN6ADDR_ANY_INIT,			\
+		.daddr		= IN6ADDR_ANY_INIT,			\
+	}
+
+struct ipv6hdr {
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wpedantic"
+#if __BYTE_ORDER == __BIG_ENDIAN
+	uint8_t			version:4,
+				priority:4;
+#else
+	uint8_t			priority:4,
+				version:4;
+#endif
+#pragma GCC diagnostic pop
+	uint8_t			flow_lbl[3];
+
+	uint16_t		payload_len;
+	uint8_t			nexthdr;
+	uint8_t			hop_limit;
+
+	struct in6_addr		saddr;
+	struct in6_addr		daddr;
+};
+
+struct ipv6_opt_hdr {
+	uint8_t			nexthdr;
+	uint8_t			hdrlen;
+	/*
+	 * TLV encoded option data follows.
+	 */
+} __attribute__((packed));	/* required for some archs */
+
+char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
+		 size_t *dlen);
+#endif /* IP_H */
diff --git a/ndp.c b/ndp.c
index 4c85ab8bcaee..c58f4b222b76 100644
--- a/ndp.c
+++ b/ndp.c
@@ -28,6 +28,7 @@
 
 #include "checksum.h"
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "tap.h"
 #include "log.h"
diff --git a/port_fwd.c b/port_fwd.c
index 6f6c836c57ad..e1ec31e2232c 100644
--- a/port_fwd.c
+++ b/port_fwd.c
@@ -21,6 +21,7 @@
 #include <stdio.h>
 
 #include "util.h"
+#include "ip.h"
 #include "port_fwd.h"
 #include "passt.h"
 #include "lineread.h"
diff --git a/qrap.c b/qrap.c
index 97f350a4bf0b..d59670621731 100644
--- a/qrap.c
+++ b/qrap.c
@@ -32,6 +32,7 @@
 #include <linux/icmpv6.h>
 
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "arp.h"
 
diff --git a/tap.c b/tap.c
index 396dee7eef25..3ea03f720d6d 100644
--- a/tap.c
+++ b/tap.c
@@ -45,6 +45,7 @@
 
 #include "checksum.h"
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "arp.h"
 #include "dhcp.h"
diff --git a/tcp.c b/tcp.c
index 905d26f6c656..4c9c5fb51c60 100644
--- a/tcp.c
+++ b/tcp.c
@@ -289,6 +289,7 @@
 
 #include "checksum.h"
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "tap.h"
 #include "siphash.h"
diff --git a/tcp_splice.c b/tcp_splice.c
index 26d32065cd47..66575ca95a1e 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -49,6 +49,7 @@
 #include <sys/socket.h>
 
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "log.h"
 #include "tcp_splice.h"
diff --git a/udp.c b/udp.c
index b5b8f8a7cd5b..d514c864ab5b 100644
--- a/udp.c
+++ b/udp.c
@@ -112,6 +112,7 @@
 
 #include "checksum.h"
 #include "util.h"
+#include "ip.h"
 #include "passt.h"
 #include "tap.h"
 #include "pcap.h"
diff --git a/util.c b/util.c
index 21b35ff94db1..f73ea1d98a09 100644
--- a/util.c
+++ b/util.c
@@ -30,61 +30,6 @@
 #include "packet.h"
 #include "log.h"
 
-#define IPV6_NH_OPT(nh)							\
-	((nh) == 0   || (nh) == 43  || (nh) == 44  || (nh) == 50  ||	\
-	 (nh) == 51  || (nh) == 60  || (nh) == 135 || (nh) == 139 ||	\
-	 (nh) == 140 || (nh) == 253 || (nh) == 254)
-
-/**
- * ipv6_l4hdr() - Find pointer to L4 header in IPv6 packet and extract protocol
- * @p:		Packet pool, packet number @idx has IPv6 header at @offset
- * @idx:	Index of packet in pool
- * @offset:	Pre-calculated IPv6 header offset
- * @proto:	Filled with L4 protocol number
- * @dlen:	Data length (payload excluding header extensions), set on return
- *
- * Return: pointer to L4 header, NULL if not found
- */
-char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
-		 size_t *dlen)
-{
-	const struct ipv6_opt_hdr *o;
-	const struct ipv6hdr *ip6h;
-	char *base;
-	int hdrlen;
-	uint8_t nh;
-
-	base = packet_get(p, idx, 0, 0, NULL);
-	ip6h = packet_get(p, idx, offset, sizeof(*ip6h), dlen);
-	if (!ip6h)
-		return NULL;
-
-	offset += sizeof(*ip6h);
-
-	nh = ip6h->nexthdr;
-	if (!IPV6_NH_OPT(nh))
-		goto found;
-
-	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
-		nh = o->nexthdr;
-		hdrlen = (o->hdrlen + 1) * 8;
-
-		if (IPV6_NH_OPT(nh))
-			offset += hdrlen;
-		else
-			goto found;
-	}
-
-	return NULL;
-
-found:
-	if (nh == 59)
-		return NULL;
-
-	*proto = nh;
-	return base + offset;
-}
-
 /**
  * sock_l4() - Create and bind socket for given L4, add to epoll list
  * @c:		Execution context
diff --git a/util.h b/util.h
index d2320f8cc99a..f7c3dfee9972 100644
--- a/util.h
+++ b/util.h
@@ -110,22 +110,6 @@
 #define	htonl_constant(x)	(__bswap_constant_32(x))
 #endif
 
-#define IN4_IS_ADDR_UNSPECIFIED(a) \
-	((a)->s_addr == htonl_constant(INADDR_ANY))
-#define IN4_IS_ADDR_BROADCAST(a) \
-	((a)->s_addr == htonl_constant(INADDR_BROADCAST))
-#define IN4_IS_ADDR_LOOPBACK(a) \
-	(ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
-#define IN4_IS_ADDR_MULTICAST(a) \
-	(IN_MULTICAST(ntohl((a)->s_addr)))
-#define IN4_ARE_ADDR_EQUAL(a, b) \
-	(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
-#define IN4ADDR_LOOPBACK_INIT \
-	{ .s_addr	= htonl_constant(INADDR_LOOPBACK) }
-#define IN4ADDR_ANY_INIT \
-	{ .s_addr	= htonl_constant(INADDR_ANY) }
-
-
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
@@ -138,34 +122,6 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 			 (void *)(arg));				\
 	} while (0)
 
-#define L2_BUF_IP4_INIT(proto)						\
-	{								\
-		.version	= 4,					\
-		.ihl		= 5,					\
-		.tos		= 0,					\
-		.tot_len	= 0,					\
-		.id		= 0,					\
-		.frag_off	= 0,					\
-		.ttl		= 0xff,					\
-		.protocol	= (proto),				\
-		.saddr		= 0,					\
-		.daddr		= 0,					\
-	}
-#define L2_BUF_IP4_PSUM(proto)	((uint32_t)htons_constant(0x4500) +	\
-				 (uint32_t)htons_constant(0xff00 | (proto)))
-
-#define L2_BUF_IP6_INIT(proto)						\
-	{								\
-		.priority	= 0,					\
-		.version	= 6,					\
-		.flow_lbl	= { 0 },				\
-		.payload_len	= 0,					\
-		.nexthdr	= (proto),				\
-		.hop_limit	= 255,					\
-		.saddr		= IN6ADDR_ANY_INIT,			\
-		.daddr		= IN6ADDR_ANY_INIT,			\
-	}
-
 #define RCVBUF_BIG		(2UL * 1024 * 1024)
 #define SNDBUF_BIG		(4UL * 1024 * 1024)
 #define SNDBUF_SMALL		(128UL * 1024)
@@ -173,45 +129,13 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 #include <net/if.h>
 #include <limits.h>
 #include <stdint.h>
-#include <netinet/ip6.h>
 
 #include "packet.h"
 
 struct ctx;
 
-struct ipv6hdr {
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wpedantic"
-#if __BYTE_ORDER == __BIG_ENDIAN
-	uint8_t			version:4,
-				priority:4;
-#else
-	uint8_t			priority:4,
-				version:4;
-#endif
-#pragma GCC diagnostic pop
-	uint8_t			flow_lbl[3];
-
-	uint16_t		payload_len;
-	uint8_t			nexthdr;
-	uint8_t			hop_limit;
-
-	struct in6_addr		saddr;
-	struct in6_addr		daddr;
-};
-
-struct ipv6_opt_hdr {
-	uint8_t			nexthdr;
-	uint8_t			hdrlen;
-	/*
-	 * TLV encoded option data follows.
-	 */
-} __attribute__((packed));	/* required for some archs */
-
 /* cppcheck-suppress funcArgNamesDifferent */
 __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
-char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
-		 size_t *dlen);
 int sock_l4(const struct ctx *c, int af, uint8_t proto,
 	    const void *bind_addr, const char *ifname, uint16_t port,
 	    uint32_t data);
-- 
2.42.0



* [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (4 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch] Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:16   ` David Gibson
  2024-02-07 10:40   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP Laurent Vivier
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

tcp.c and udp.c each contain an identical function to compute the IPv4
header checksum. Move it to ip.h as ipv4_hdr_checksum() and use it from
both files.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
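Note for readers, not part of the patch: the shortcut taken by
ipv4_hdr_checksum() can be compared against a plain sum over the whole
header. The following is a minimal standalone sketch under stated
assumptions: fold16(), build_hdr(), hdr_csum_full() and hdr_csum_partial()
are hypothetical names, fold16() only stands in for csum_fold(), and the
header is assumed to carry no options, with the constant fields set as in
L2_BUF_IP4_INIT() (tos, id and frag_off zero, ttl 0xff).

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for csum_fold(): fold carries back into 16 bits */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Build a 20-byte IPv4 header with the constant fields of L2_BUF_IP4_INIT();
 * tot_len, saddr and daddr are passed in network byte order
 */
static void build_hdr(uint8_t hdr[20], uint16_t tot_len, uint32_t saddr,
		      uint32_t daddr, uint8_t proto)
{
	memset(hdr, 0, 20);
	hdr[0] = 0x45;				/* version 4, ihl 5 */
	memcpy(hdr + 2, &tot_len, sizeof(tot_len));
	hdr[8] = 0xff;				/* ttl */
	hdr[9] = proto;
	memcpy(hdr + 12, &saddr, sizeof(saddr));
	memcpy(hdr + 16, &daddr, sizeof(daddr));
}

/* Reference: sum all ten 16-bit words as stored in memory; the check field
 * (bytes 10-11) must be zero while the checksum is being computed
 */
static uint16_t hdr_csum_full(const uint8_t hdr[20])
{
	uint32_t sum = 0;
	uint16_t w;
	int i;

	for (i = 0; i < 20; i += 2) {
		memcpy(&w, hdr + i, sizeof(w));
		sum += w;
	}

	return (uint16_t)~fold16(sum);
}

/* Shortcut: the constant words (0x4500 and 0xff00 | proto) are pre-summed,
 * only tot_len and the two addresses are added per packet
 */
static uint16_t hdr_csum_partial(uint16_t tot_len, uint32_t saddr,
				 uint32_t daddr, uint8_t proto)
{
	uint32_t sum = (uint32_t)htons(0x4500) +
		       (uint32_t)htons(0xff00 | proto);

	sum += tot_len;
	sum += (saddr >> 16) & 0xffff;
	sum += saddr & 0xffff;
	sum += (daddr >> 16) & 0xffff;
	sum += daddr & 0xffff;

	return (uint16_t)~fold16(sum);
}
```

For a header built with build_hdr(), both functions should return the same
value, which is why a single helper taking the protocol number can serve
tcp.c and udp.c alike.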
 ip.h  | 14 ++++++++++++++
 tcp.c | 23 ++---------------------
 udp.c | 19 +------------------
 3 files changed, 17 insertions(+), 39 deletions(-)

diff --git a/ip.h b/ip.h
index b2e08bc049f3..ff7902c45a95 100644
--- a/ip.h
+++ b/ip.h
@@ -9,6 +9,8 @@
 #include <netinet/ip.h>
 #include <netinet/ip6.h>
 
+#include "checksum.h"
+
 #define IN4_IS_ADDR_UNSPECIFIED(a) \
 	((a)->s_addr == htonl_constant(INADDR_ANY))
 #define IN4_IS_ADDR_BROADCAST(a) \
@@ -83,4 +85,16 @@ struct ipv6_opt_hdr {
 
 char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
 		 size_t *dlen);
+static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
+{
+	uint32_t sum = L2_BUF_IP4_PSUM(proto);
+
+	sum += iph->tot_len;
+	sum += (iph->saddr >> 16) & 0xffff;
+	sum += iph->saddr & 0xffff;
+	sum += (iph->daddr >> 16) & 0xffff;
+	sum += iph->daddr & 0xffff;
+
+	return ~csum_fold(sum);
+}
 #endif /* IP_H */
diff --git a/tcp.c b/tcp.c
index 4c9c5fb51c60..293ab12d8c21 100644
--- a/tcp.c
+++ b/tcp.c
@@ -934,23 +934,6 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
 		trace("TCP: failed to set SO_SNDBUF to %i", v);
 }
 
-/**
- * tcp_update_check_ip4() - Update IPv4 with variable parts from stored one
- * @buf:	L2 packet buffer with final IPv4 header
- */
-static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf)
-{
-	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_TCP);
-
-	sum += buf->iph.tot_len;
-	sum += (buf->iph.saddr >> 16) & 0xffff;
-	sum += buf->iph.saddr & 0xffff;
-	sum += (buf->iph.daddr >> 16) & 0xffff;
-	sum += buf->iph.daddr & 0xffff;
-
-	buf->iph.check = (uint16_t)~csum_fold(sum);
-}
-
 /**
  * tcp_update_check_tcp4() - Update TCP checksum from stored one
  * @buf:	L2 packet buffer with final IPv4 header
@@ -1393,10 +1376,8 @@ do {									\
 		b->iph.saddr = a4->s_addr;
 		b->iph.daddr = c->ip4.addr_seen.s_addr;
 
-		if (check)
-			b->iph.check = *check;
-		else
-			tcp_update_check_ip4(b);
+		b->iph.check = check ? *check :
+				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
 
 		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
 
diff --git a/udp.c b/udp.c
index d514c864ab5b..6f867df81c05 100644
--- a/udp.c
+++ b/udp.c
@@ -270,23 +270,6 @@ static void udp_invert_portmap(struct udp_port_fwd *fwd)
 	}
 }
 
-/**
- * udp_update_check4() - Update checksum with variable parts from stored one
- * @buf:	L2 packet buffer with final IPv4 header
- */
-static void udp_update_check4(struct udp4_l2_buf_t *buf)
-{
-	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_UDP);
-
-	sum += buf->iph.tot_len;
-	sum += (buf->iph.saddr >> 16) & 0xffff;
-	sum += buf->iph.saddr & 0xffff;
-	sum += (buf->iph.daddr >> 16) & 0xffff;
-	sum += buf->iph.daddr & 0xffff;
-
-	buf->iph.check = (uint16_t)~csum_fold(sum);
-}
-
 /**
  * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
@@ -614,7 +597,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
 		b->iph.saddr = b->s_in.sin_addr.s_addr;
 	}
 
-	udp_update_check4(b);
+	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
 	b->uh.source = b->s_in.sin_port;
 	b->uh.dest = htons(dstport);
 	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));
-- 
2.42.0



* [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (5 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-05  6:20   ` David Gibson
  2024-02-07 10:41   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 08/24] tcp: extract buffer management from tcp_send_flag() Laurent Vivier
                   ` (16 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

The TCP and UDP checksums are computed over the TCP/UDP header and
payload, but also over some information from the IP header (protocol,
length, source and destination addresses).

We add two functions, proto_ipv4_header_checksum() and
proto_ipv6_header_checksum(), to compute the partial checksum of this
IP header part.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
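Note for readers, not part of the patch: the IP header part mentioned above
is the usual TCP/UDP pseudo-header. The following is a minimal standalone
sketch for the IPv4 case only, under stated assumptions: sum16(), fold16(),
pseudo4_sum(), l4_csum_explicit() and l4_csum_seeded() are hypothetical
names that do not exist in passt. It compares a checksum computed over an
explicitly built 12-byte pseudo-header with one seeded from a partial sum,
the way proto_ipv4_header_checksum() is used as the initial value for csum().

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold carries of a 32-bit accumulator back into 16 bits */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Add the 16-bit words of @buf to @init; an odd last byte is zero-padded */
static uint32_t sum16(const void *buf, size_t len, uint32_t init)
{
	const uint8_t *p = buf;
	uint32_t sum = init;
	uint16_t w;

	for (; len > 1; len -= 2, p += 2) {
		memcpy(&w, p, sizeof(w));
		sum += w;
	}
	if (len) {
		w = 0;
		memcpy(&w, p, 1);
		sum += w;
	}

	return sum;
}

/* Partial sum of the IPv4 pseudo-header; addresses in network byte order,
 * @l4len (L4 header plus payload) in host byte order
 */
static uint32_t pseudo4_sum(uint32_t saddr, uint32_t daddr, uint8_t proto,
			    uint16_t l4len)
{
	uint32_t sum = htons(proto);

	sum += (saddr >> 16) & 0xffff;
	sum += saddr & 0xffff;
	sum += (daddr >> 16) & 0xffff;
	sum += daddr & 0xffff;
	sum += htons(l4len);

	return sum;
}

/* Reference: build the pseudo-header in memory, then sum it and the segment */
static uint16_t l4_csum_explicit(uint32_t saddr, uint32_t daddr, uint8_t proto,
				 const void *l4, uint16_t l4len)
{
	uint16_t n = htons(l4len);
	uint8_t ph[12];

	memcpy(ph, &saddr, 4);
	memcpy(ph + 4, &daddr, 4);
	ph[8] = 0;
	ph[9] = proto;
	memcpy(ph + 10, &n, 2);

	return (uint16_t)~fold16(sum16(l4, l4len, sum16(ph, sizeof(ph), 0)));
}

/* Same result, seeding the sum over the segment with the partial sum */
static uint16_t l4_csum_seeded(uint32_t saddr, uint32_t daddr, uint8_t proto,
			       const void *l4, uint16_t l4len)
{
	uint32_t init = pseudo4_sum(saddr, daddr, proto, l4len);

	return (uint16_t)~fold16(sum16(l4, l4len, init));
}
```

Seeding the sum this way means the checksum code only needs a pointer to the
L4 header and a length, so it no longer has to temporarily rewrite IP header
fields to make them look like a pseudo-header.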
 ip.h  | 24 ++++++++++++++++++++++++
 tcp.c | 40 +++++++++++++++-------------------------
 udp.c |  6 ++----
 3 files changed, 41 insertions(+), 29 deletions(-)

diff --git a/ip.h b/ip.h
index ff7902c45a95..87cb8dd21d2e 100644
--- a/ip.h
+++ b/ip.h
@@ -97,4 +97,28 @@ static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
 
 	return ~csum_fold(sum);
 }
+
+static inline uint32_t proto_ipv4_header_checksum(struct iphdr *iph, int proto)
+{
+	uint32_t sum = htons(proto);
+
+	sum += (iph->saddr >> 16) & 0xffff;
+	sum += iph->saddr & 0xffff;
+	sum += (iph->daddr >> 16) & 0xffff;
+	sum += iph->daddr & 0xffff;
+	sum += htons(ntohs(iph->tot_len) - 20);
+
+	return sum;
+}
+
+static inline uint32_t proto_ipv6_header_checksum(struct ipv6hdr *ip6h,
+						  int proto)
+{
+	uint32_t sum = htons(proto) + ip6h->payload_len;
+
+	sum += sum_16b(&ip6h->saddr, sizeof(ip6h->saddr));
+	sum += sum_16b(&ip6h->daddr, sizeof(ip6h->daddr));
+
+	return sum;
+}
 #endif /* IP_H */
diff --git a/tcp.c b/tcp.c
index 293ab12d8c21..2fd6bc2eda53 100644
--- a/tcp.c
+++ b/tcp.c
@@ -938,39 +938,25 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
  * tcp_update_check_tcp4() - Update TCP checksum from stored one
  * @buf:	L2 packet buffer with final IPv4 header
  */
-static void tcp_update_check_tcp4(struct tcp4_l2_buf_t *buf)
+static uint16_t tcp_update_check_tcp4(struct iphdr *iph)
 {
-	uint16_t tlen = ntohs(buf->iph.tot_len) - 20;
-	uint32_t sum = htons(IPPROTO_TCP);
+	struct tcphdr *th = (void *)(iph + 1);
+	uint16_t tlen = ntohs(iph->tot_len) - 20;
+	uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
 
-	sum += (buf->iph.saddr >> 16) & 0xffff;
-	sum += buf->iph.saddr & 0xffff;
-	sum += (buf->iph.daddr >> 16) & 0xffff;
-	sum += buf->iph.daddr & 0xffff;
-	sum += htons(ntohs(buf->iph.tot_len) - 20);
-
-	buf->th.check = 0;
-	buf->th.check = csum(&buf->th, tlen, sum);
+	return csum(th, tlen, sum);
 }
 
 /**
  * tcp_update_check_tcp6() - Calculate TCP checksum for IPv6
  * @buf:	L2 packet buffer with final IPv6 header
  */
-static void tcp_update_check_tcp6(struct tcp6_l2_buf_t *buf)
+static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
 {
-	int len = ntohs(buf->ip6h.payload_len) + sizeof(struct ipv6hdr);
-
-	buf->ip6h.hop_limit = IPPROTO_TCP;
-	buf->ip6h.version = 0;
-	buf->ip6h.nexthdr = 0;
+	struct tcphdr *th = (void *)(ip6h + 1);
+	uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
 
-	buf->th.check = 0;
-	buf->th.check = csum(&buf->ip6h, len, 0);
-
-	buf->ip6h.hop_limit = 255;
-	buf->ip6h.version = 6;
-	buf->ip6h.nexthdr = IPPROTO_TCP;
+	return csum(th, ntohs(ip6h->payload_len), sum);
 }
 
 /**
@@ -1381,7 +1367,7 @@ do {									\
 
 		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
 
-		tcp_update_check_tcp4(b);
+		b->th.check = tcp_update_check_tcp4(&b->iph);
 
 		tlen = tap_iov_len(c, &b->taph, ip_len);
 	} else {
@@ -1400,7 +1386,11 @@ do {									\
 
 		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
 
-		tcp_update_check_tcp6(b);
+		b->th.check = tcp_update_check_tcp6(&b->ip6h);
+
+		b->ip6h.hop_limit = 255;
+		b->ip6h.version = 6;
+		b->ip6h.nexthdr = IPPROTO_TCP;
 
 		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
 		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
diff --git a/udp.c b/udp.c
index 6f867df81c05..96b4e6ca9a85 100644
--- a/udp.c
+++ b/udp.c
@@ -669,10 +669,8 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
 	b->uh.source = b->s_in6.sin6_port;
 	b->uh.dest = htons(dstport);
 	b->uh.len = b->ip6h.payload_len;
-
-	b->ip6h.hop_limit = IPPROTO_UDP;
-	b->ip6h.version = b->ip6h.nexthdr = b->uh.check = 0;
-	b->uh.check = csum(&b->ip6h, ip_len, 0);
+	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
+			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
 	b->ip6h.version = 6;
 	b->ip6h.nexthdr = IPPROTO_UDP;
 	b->ip6h.hop_limit = 255;
-- 
2.42.0



* [PATCH 08/24] tcp: extract buffer management from tcp_send_flag()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (6 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  0:24   ` David Gibson
  2024-02-08 16:57   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss() Laurent Vivier
                   ` (15 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 tcp.c | 224 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 129 insertions(+), 95 deletions(-)

diff --git a/tcp.c b/tcp.c
index 2fd6bc2eda53..20ad8a4e5271 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1320,87 +1320,98 @@ void tcp_defer_handler(struct ctx *c)
 	tcp_l2_data_buf_flush(c);
 }
 
+static void tcp_set_tcp_header(struct tcphdr *th,
+			       const struct tcp_tap_conn *conn, uint32_t seq)
+{
+	th->source = htons(conn->fport);
+	th->dest = htons(conn->eport);
+	th->seq = htonl(seq);
+	th->ack_seq = htonl(conn->seq_ack_to_tap);
+	if (conn->events & ESTABLISHED)	{
+		th->window = htons(conn->wnd_to_tap);
+	} else {
+		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;
+
+		th->window = htons(MIN(wnd, USHRT_MAX));
+	}
+}
+
 /**
- * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * ipv4_fill_headers() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
  * @c:		Execution context
  * @conn:	Connection pointer
- * @p:		Pointer to any type of TCP pre-cooked buffer
+ * @iph:	Pointer to IPv4 header, immediately followed by a TCP header
  * @plen:	Payload length (including TCP header options)
  * @check:	Checksum, if already known
  * @seq:	Sequence number for this segment
  *
- * Return: frame length including L2 headers, host order
+ * Return: IP frame length including L2 headers, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
-				      void *p, size_t plen,
-				      const uint16_t *check, uint32_t seq)
+
+static size_t ipv4_fill_headers(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
+				struct iphdr *iph, size_t plen,
+				const uint16_t *check, uint32_t seq)
 {
+	struct tcphdr *th = (void *)(iph + 1);
 	const struct in_addr *a4 = inany_v4(&conn->faddr);
-	size_t ip_len, tlen;
-
-#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq)			\
-do {									\
-	b->th.source = htons(conn->fport);				\
-	b->th.dest = htons(conn->eport);				\
-	b->th.seq = htonl(seq);						\
-	b->th.ack_seq = htonl(conn->seq_ack_to_tap);			\
-	if (conn->events & ESTABLISHED)	{				\
-		b->th.window = htons(conn->wnd_to_tap);			\
-	} else {							\
-		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;	\
-									\
-		b->th.window = htons(MIN(wnd, USHRT_MAX));		\
-	}								\
-} while (0)
-
-	if (a4) {
-		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
-
-		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
-		b->iph.tot_len = htons(ip_len);
-		b->iph.saddr = a4->s_addr;
-		b->iph.daddr = c->ip4.addr_seen.s_addr;
-
-		b->iph.check = check ? *check :
-				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
-
-		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
-
-		b->th.check = tcp_update_check_tcp4(&b->iph);
-
-		tlen = tap_iov_len(c, &b->taph, ip_len);
-	} else {
-		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
+	size_t ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
 
-		ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
+	iph->tot_len = htons(ip_len);
+	iph->saddr = a4->s_addr;
+	iph->daddr = c->ip4.addr_seen.s_addr;
 
-		b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr));
-		b->ip6h.saddr = conn->faddr.a6;
-		if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr))
-			b->ip6h.daddr = c->ip6.addr_ll_seen;
-		else
-			b->ip6h.daddr = c->ip6.addr_seen;
+	iph->check = check ? *check : ipv4_hdr_checksum(iph, IPPROTO_TCP);
+
+	tcp_set_tcp_header(th, conn, seq);
+
+	th->check = tcp_update_check_tcp4(iph);
+
+	return ip_len;
+}
+
+/**
+ * ipv6_fill_headers() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @ip6h:	Pointer to IPv6 header, immediately followed by a TCP header
+ * @plen:	Payload length (including TCP header options)
+ * @check:	Checksum, if already known
+ * @seq:	Sequence number for this segment
+ *
+ * Return: IP frame length including L2 headers, host order
+ */
+
+static size_t ipv6_fill_headers(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
+				struct ipv6hdr *ip6h, size_t plen,
+				uint32_t seq)
+{
+	struct tcphdr *th = (void *)(ip6h + 1);
+	size_t ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
 
-		memset(b->ip6h.flow_lbl, 0, 3);
+	ip6h->payload_len = htons(plen + sizeof(struct tcphdr));
+	ip6h->saddr = conn->faddr.a6;
+	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
+		ip6h->daddr = c->ip6.addr_ll_seen;
+	else
+		ip6h->daddr = c->ip6.addr_seen;
 
-		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
+	memset(ip6h->flow_lbl, 0, 3);
 
-		b->th.check = tcp_update_check_tcp6(&b->ip6h);
+	tcp_set_tcp_header(th, conn, seq);
 
-		b->ip6h.hop_limit = 255;
-		b->ip6h.version = 6;
-		b->ip6h.nexthdr = IPPROTO_TCP;
+	th->check = tcp_update_check_tcp6(ip6h);
 
-		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
-		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
-		b->ip6h.flow_lbl[2] = (conn->sock >> 0) & 0xff;
+	ip6h->hop_limit = 255;
+	ip6h->version = 6;
+	ip6h->nexthdr = IPPROTO_TCP;
 
-		tlen = tap_iov_len(c, &b->taph, ip_len);
-	}
-#undef SET_TCP_HEADER_COMMON_V4_V6
+	ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf;
+	ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
+	ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
 
-	return tlen;
+	return ip_len;
 }
 
 /**
@@ -1520,27 +1531,21 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
 }
 
 /**
- * tcp_send_flag() - Send segment with flags to tap (no payload)
+ * do_tcp_send_flag() - Send segment with flags to tap (no payload)
  * @c:		Execution context
  * @conn:	Connection pointer
  * @flags:	TCP flags: if not set, send segment only if ACK is due
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+
+static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,			    struct tcphdr *th, char *data, size_t optlen)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
-	struct tcp4_l2_flags_buf_t *b4 = NULL;
-	struct tcp6_l2_flags_buf_t *b6 = NULL;
 	struct tcp_info tinfo = { 0 };
 	socklen_t sl = sizeof(tinfo);
 	int s = conn->sock;
-	size_t optlen = 0;
-	struct iovec *iov;
-	struct tcphdr *th;
-	char *data;
-	void *p;
 
 	if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
 	    !flags && conn->wnd_to_tap)
@@ -1562,26 +1567,9 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
 		return 0;
 
-	if (CONN_V4(conn)) {
-		iov = tcp4_l2_flags_iov    + tcp4_l2_flags_buf_used;
-		p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
-		th = &b4->th;
-
-		/* gcc 11.2 would complain on data = (char *)(th + 1); */
-		data = b4->opts;
-	} else {
-		iov = tcp6_l2_flags_iov    + tcp6_l2_flags_buf_used;
-		p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
-		th = &b6->th;
-		data = b6->opts;
-	}
-
 	if (flags & SYN) {
 		int mss;
 
-		/* Options: MSS, NOP and window scale (8 bytes) */
-		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
-
 		*data++ = OPT_MSS;
 		*data++ = OPT_MSS_LEN;
 
@@ -1624,9 +1612,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);
 
-	iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
-					       NULL, conn->seq_to_tap);
-
 	if (th->ack) {
 		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
 			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
@@ -1641,8 +1626,38 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (th->fin || th->syn)
 		conn->seq_to_tap++;
 
+	return 1;
+}
+
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	size_t optlen = 0;
+	struct iovec *iov;
+	size_t ip_len;
+	int ret;
+
+	/* Options: MSS, NOP and window scale (8 bytes) */
+	if (flags & SYN)
+		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
+
 	if (CONN_V4(conn)) {
+		struct tcp4_l2_flags_buf_t *b4;
+
+		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
+		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
+					   NULL, conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
+
 		if (flags & DUP_ACK) {
+
 			memcpy(b4 + 1, b4, sizeof(*b4));
 			(iov + 1)->iov_len = iov->iov_len;
 			tcp4_l2_flags_buf_used++;
@@ -1651,6 +1666,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
 			tcp_l2_flags_buf_flush(c);
 	} else {
+		struct tcp6_l2_flags_buf_t *b6;
+
+		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
+		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
+					   conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
+
 		if (flags & DUP_ACK) {
 			memcpy(b6 + 1, b6, sizeof(*b6));
 			(iov + 1)->iov_len = iov->iov_len;
@@ -2050,6 +2080,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 {
 	uint32_t *seq_update = &conn->seq_to_tap;
 	struct iovec *iov;
+	size_t ip_len;
 
 	if (CONN_V4(conn)) {
 		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
@@ -2058,9 +2089,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
 		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
 
+		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
+					   check, seq);
+
 		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
-		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
-						       check, seq);
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
 			tcp_l2_data_buf_flush(c);
 	} else if (CONN_V6(conn)) {
@@ -2069,9 +2102,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
 		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
 
+		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
+
 		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
-		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
-						       NULL, seq);
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
 			tcp_l2_data_buf_flush(c);
 	}
-- 
@@ -1320,87 +1320,98 @@ void tcp_defer_handler(struct ctx *c)
 	tcp_l2_data_buf_flush(c);
 }
 
+static void tcp_set_tcp_header(struct tcphdr *th,
+			       const struct tcp_tap_conn *conn, uint32_t seq)
+{
+	th->source = htons(conn->fport);
+	th->dest = htons(conn->eport);
+	th->seq = htonl(seq);
+	th->ack_seq = htonl(conn->seq_ack_to_tap);
+	if (conn->events & ESTABLISHED)	{
+		th->window = htons(conn->wnd_to_tap);
+	} else {
+		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;
+
+		th->window = htons(MIN(wnd, USHRT_MAX));
+	}
+}
+
 /**
- * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * ipv4_fill_headers() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
  * @c:		Execution context
  * @conn:	Connection pointer
- * @p:		Pointer to any type of TCP pre-cooked buffer
+ * @iph:	Pointer to IPv4 header, immediately followed by a TCP header
  * @plen:	Payload length (including TCP header options)
  * @check:	Checksum, if already known
  * @seq:	Sequence number for this segment
  *
- * Return: frame length including L2 headers, host order
+ * Return: IP frame length including L2 headers, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
-				      void *p, size_t plen,
-				      const uint16_t *check, uint32_t seq)
+
+static size_t ipv4_fill_headers(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
+				struct iphdr *iph, size_t plen,
+				const uint16_t *check, uint32_t seq)
 {
+	struct tcphdr *th = (void *)(iph + 1);
 	const struct in_addr *a4 = inany_v4(&conn->faddr);
-	size_t ip_len, tlen;
-
-#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq)			\
-do {									\
-	b->th.source = htons(conn->fport);				\
-	b->th.dest = htons(conn->eport);				\
-	b->th.seq = htonl(seq);						\
-	b->th.ack_seq = htonl(conn->seq_ack_to_tap);			\
-	if (conn->events & ESTABLISHED)	{				\
-		b->th.window = htons(conn->wnd_to_tap);			\
-	} else {							\
-		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;	\
-									\
-		b->th.window = htons(MIN(wnd, USHRT_MAX));		\
-	}								\
-} while (0)
-
-	if (a4) {
-		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
-
-		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
-		b->iph.tot_len = htons(ip_len);
-		b->iph.saddr = a4->s_addr;
-		b->iph.daddr = c->ip4.addr_seen.s_addr;
-
-		b->iph.check = check ? *check :
-				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
-
-		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
-
-		b->th.check = tcp_update_check_tcp4(&b->iph);
-
-		tlen = tap_iov_len(c, &b->taph, ip_len);
-	} else {
-		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
+	size_t ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
 
-		ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
+	iph->tot_len = htons(ip_len);
+	iph->saddr = a4->s_addr;
+	iph->daddr = c->ip4.addr_seen.s_addr;
 
-		b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr));
-		b->ip6h.saddr = conn->faddr.a6;
-		if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr))
-			b->ip6h.daddr = c->ip6.addr_ll_seen;
-		else
-			b->ip6h.daddr = c->ip6.addr_seen;
+	iph->check = check ? *check : ipv4_hdr_checksum(iph, IPPROTO_TCP);
+
+	tcp_set_tcp_header(th, conn, seq);
+
+	th->check = tcp_update_check_tcp4(iph);
+
+	return ip_len;
+}
+
+/**
+ * ipv6_fill_headers() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @ip6h:	Pointer to IPv6 header, immediately followed by a TCP header
+ * @plen:	Payload length (including TCP header options)
+ * @check:	Checksum, if already known
+ * @seq:	Sequence number for this segment
+ *
+ * Return: IP frame length excluding L2 headers, host order
+ */
+
+static size_t ipv6_fill_headers(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
+				struct ipv6hdr *ip6h, size_t plen,
+				uint32_t seq)
+{
+	struct tcphdr *th = (void *)(ip6h + 1);
+	size_t ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
 
-		memset(b->ip6h.flow_lbl, 0, 3);
+	ip6h->payload_len = htons(plen + sizeof(struct tcphdr));
+	ip6h->saddr = conn->faddr.a6;
+	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
+		ip6h->daddr = c->ip6.addr_ll_seen;
+	else
+		ip6h->daddr = c->ip6.addr_seen;
 
-		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
+	memset(ip6h->flow_lbl, 0, 3);
 
-		b->th.check = tcp_update_check_tcp6(&b->ip6h);
+	tcp_set_tcp_header(th, conn, seq);
 
-		b->ip6h.hop_limit = 255;
-		b->ip6h.version = 6;
-		b->ip6h.nexthdr = IPPROTO_TCP;
+	th->check = tcp_update_check_tcp6(ip6h);
 
-		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
-		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
-		b->ip6h.flow_lbl[2] = (conn->sock >> 0) & 0xff;
+	ip6h->hop_limit = 255;
+	ip6h->version = 6;
+	ip6h->nexthdr = IPPROTO_TCP;
 
-		tlen = tap_iov_len(c, &b->taph, ip_len);
-	}
-#undef SET_TCP_HEADER_COMMON_V4_V6
+	ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf;
+	ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
+	ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
 
-	return tlen;
+	return ip_len;
 }
 
 /**
@@ -1520,27 +1531,21 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
 }
 
 /**
- * tcp_send_flag() - Send segment with flags to tap (no payload)
+ * do_tcp_send_flag() - Send segment with flags to tap (no payload)
  * @c:		Execution context
  * @conn:	Connection pointer
  * @flags:	TCP flags: if not set, send segment only if ACK is due
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+
+static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
+			    struct tcphdr *th, char *data, size_t optlen)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
-	struct tcp4_l2_flags_buf_t *b4 = NULL;
-	struct tcp6_l2_flags_buf_t *b6 = NULL;
 	struct tcp_info tinfo = { 0 };
 	socklen_t sl = sizeof(tinfo);
 	int s = conn->sock;
-	size_t optlen = 0;
-	struct iovec *iov;
-	struct tcphdr *th;
-	char *data;
-	void *p;
 
 	if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
 	    !flags && conn->wnd_to_tap)
@@ -1562,26 +1567,9 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
 		return 0;
 
-	if (CONN_V4(conn)) {
-		iov = tcp4_l2_flags_iov    + tcp4_l2_flags_buf_used;
-		p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
-		th = &b4->th;
-
-		/* gcc 11.2 would complain on data = (char *)(th + 1); */
-		data = b4->opts;
-	} else {
-		iov = tcp6_l2_flags_iov    + tcp6_l2_flags_buf_used;
-		p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
-		th = &b6->th;
-		data = b6->opts;
-	}
-
 	if (flags & SYN) {
 		int mss;
 
-		/* Options: MSS, NOP and window scale (8 bytes) */
-		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
-
 		*data++ = OPT_MSS;
 		*data++ = OPT_MSS_LEN;
 
@@ -1624,9 +1612,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);
 
-	iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
-					       NULL, conn->seq_to_tap);
-
 	if (th->ack) {
 		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
 			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
@@ -1641,8 +1626,38 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (th->fin || th->syn)
 		conn->seq_to_tap++;
 
+	return 1;
+}
+
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	size_t optlen = 0;
+	struct iovec *iov;
+	size_t ip_len;
+	int ret;
+
+	/* Options: MSS, NOP and window scale (8 bytes) */
+	if (flags & SYN)
+		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
+
 	if (CONN_V4(conn)) {
+		struct tcp4_l2_flags_buf_t *b4;
+
+		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
+		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
+					   NULL, conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
+
 		if (flags & DUP_ACK) {
+
 			memcpy(b4 + 1, b4, sizeof(*b4));
 			(iov + 1)->iov_len = iov->iov_len;
 			tcp4_l2_flags_buf_used++;
@@ -1651,6 +1666,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
 			tcp_l2_flags_buf_flush(c);
 	} else {
+		struct tcp6_l2_flags_buf_t *b6;
+
+		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
+		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
+					   conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
+
 		if (flags & DUP_ACK) {
 			memcpy(b6 + 1, b6, sizeof(*b6));
 			(iov + 1)->iov_len = iov->iov_len;
@@ -2050,6 +2080,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 {
 	uint32_t *seq_update = &conn->seq_to_tap;
 	struct iovec *iov;
+	size_t ip_len;
 
 	if (CONN_V4(conn)) {
 		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
@@ -2058,9 +2089,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
 		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
 
+		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
+					   check, seq);
+
 		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
-		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
-						       check, seq);
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
 			tcp_l2_data_buf_flush(c);
 	} else if (CONN_V6(conn)) {
@@ -2069,9 +2102,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
 		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
 
+		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
+
 		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
-		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
-						       NULL, seq);
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
 			tcp_l2_data_buf_flush(c);
 	}
-- 
2.42.0



* [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (7 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 08/24] tcp: extract buffer management from tcp_send_flag() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  0:47   ` David Gibson
  2024-02-08 16:59   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 10/24] tcp: rename functions that manage buffers Laurent Vivier
                   ` (14 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 tcp.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/tcp.c b/tcp.c
index 20ad8a4e5271..cdbceed65033 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1813,6 +1813,14 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 	return s;
 }
 
+static uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)
+{
+	if (CONN_V4(conn))
+		return MSS4;
+
+	return MSS6;
+}
+
 /**
  * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
  * @conn:	Connection pointer
@@ -1832,10 +1840,7 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
 	else
 		mss = ret;
 
-	if (CONN_V4(conn))
-		mss = MIN(MSS4, mss);
-	else
-		mss = MIN(MSS6, mss);
+	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
 
 	return MIN(mss, USHRT_MAX);
 }
-- 
2.42.0



* [PATCH 10/24] tcp: rename functions that manage buffers
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (8 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  1:48   ` David Gibson
  2024-02-02 14:11 ` [PATCH 11/24] tcp: move buffers management functions to their own file Laurent Vivier
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

To separate the buffer management functions from the ones specific to
TCP management, we are going to move them to a new file. Before that,
rename them with a tcp_buf_ prefix to reflect their role.
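
As a rough illustration only (this is not code from the series), the
split between protocol logic and buffer management could look like the
sketch below. Every name in it (FRAMES_MEM, struct frame,
tcp_window_field(), tcp_buf_queue(), tcp_buf_flush()) is hypothetical
and only mirrors the tcp_buf_ naming convention. The window clamping
loosely follows what tcp_set_tcp_header() does before a connection is
established.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define FRAMES_MEM	4	/* hypothetical, kept small for illustration */

struct frame {
	uint16_t window;	/* stand-in for a full pre-cooked frame */
};

static struct frame frames[FRAMES_MEM];
static unsigned frames_used;
static unsigned flushes;

/* Protocol side: compute the TCP window field, with no knowledge of how
 * frames are stored. Before the connection is established the window is
 * sent unscaled, so it is clamped to the 16-bit header field.
 */
static uint16_t tcp_window_field(uint16_t wnd, uint8_t ws, int established)
{
	unsigned scaled;

	if (established)
		return wnd;

	scaled = (unsigned)wnd << ws;
	return scaled > USHRT_MAX ? USHRT_MAX : (uint16_t)scaled;
}

/* Buffer side: a real backend would hand queued frames to the tap here */
static void tcp_buf_flush(void)
{
	frames_used = 0;
	flushes++;
}

/* Buffer side: store one frame, flush once the static array is full */
static void tcp_buf_queue(uint16_t window)
{
	assert(frames_used < FRAMES_MEM);

	frames[frames_used++].window = window;
	if (frames_used > FRAMES_MEM - 1)
		tcp_buf_flush();
}
```

With this layout, a second backend could in principle provide its own
queue and flush functions under a different prefix while reusing
tcp_window_field() unchanged.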

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 passt.c |  2 +-
 tcp.c   | 84 ++++++++++++++++++++++++++++-----------------------------
 tcp.h   |  2 +-
 3 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/passt.c b/passt.c
index 44d3a0b0548c..10042a9b9789 100644
--- a/passt.c
+++ b/passt.c
@@ -164,7 +164,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
  */
 void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 {
-	tcp_update_l2_buf(eth_d, eth_s);
+	tcp_buf_update_l2(eth_d, eth_s);
 	udp_update_l2_buf(eth_d, eth_s);
 }
 
diff --git a/tcp.c b/tcp.c
index cdbceed65033..640209533772 100644
--- a/tcp.c
+++ b/tcp.c
@@ -383,7 +383,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
 #define ACK		(1 << 4)
 /* Flags for internal usage */
 #define DUP_ACK		(1 << 5)
-#define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
+#define ACK_IF_NEEDED	0		/* See tcp_buf_send_flag() */
 
 #define OPT_EOL		0
 #define OPT_NOP		1
@@ -960,11 +960,11 @@ static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
 }
 
 /**
- * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
+ * tcp_buf_update_l2() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
  * @eth_s:	Ethernet source address, NULL if unchanged
  */
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
+void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
 {
 	int i;
 
@@ -982,10 +982,10 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 }
 
 /**
- * tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
+ * tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
  * @c:		Execution context
  */
-static void tcp_sock4_iov_init(const struct ctx *c)
+static void tcp_buf_sock4_iov_init(const struct ctx *c)
 {
 	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
 	struct iovec *iov;
@@ -1014,10 +1014,10 @@ static void tcp_sock4_iov_init(const struct ctx *c)
 }
 
 /**
- * tcp_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
+ * tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
  * @c:		Execution context
  */
-static void tcp_sock6_iov_init(const struct ctx *c)
+static void tcp_buf_sock6_iov_init(const struct ctx *c)
 {
 	struct iovec *iov;
 	int i;
@@ -1277,10 +1277,10 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 	} while (0)
 
 /**
- * tcp_l2_flags_buf_flush() - Send out buffers for segments with no data (flags)
+ * tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
  * @c:		Execution context
  */
-static void tcp_l2_flags_buf_flush(const struct ctx *c)
+static void tcp_buf_l2_flags_flush(const struct ctx *c)
 {
 	tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
 	tcp6_l2_flags_buf_used = 0;
@@ -1290,10 +1290,10 @@ static void tcp_l2_flags_buf_flush(const struct ctx *c)
 }
 
 /**
- * tcp_l2_data_buf_flush() - Send out buffers for segments with data
+ * tcp_buf_l2_data_flush() - Send out buffers for segments with data
  * @c:		Execution context
  */
-static void tcp_l2_data_buf_flush(const struct ctx *c)
+static void tcp_buf_l2_data_flush(const struct ctx *c)
 {
 	unsigned i;
 	size_t m;
@@ -1316,8 +1316,8 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
 /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
 void tcp_defer_handler(struct ctx *c)
 {
-	tcp_l2_flags_buf_flush(c);
-	tcp_l2_data_buf_flush(c);
+	tcp_buf_l2_flags_flush(c);
+	tcp_buf_l2_data_flush(c);
 }
 
 static void tcp_set_tcp_header(struct tcphdr *th,
@@ -1629,7 +1629,7 @@ static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
 	return 1;
 }
 
-static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+static int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	size_t optlen = 0;
 	struct iovec *iov;
@@ -1664,7 +1664,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		}
 
 		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
-			tcp_l2_flags_buf_flush(c);
+			tcp_buf_l2_flags_flush(c);
 	} else {
 		struct tcp6_l2_flags_buf_t *b6;
 
@@ -1688,7 +1688,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		}
 
 		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
-			tcp_l2_flags_buf_flush(c);
+			tcp_buf_l2_flags_flush(c);
 	}
 
 	return 0;
@@ -1704,7 +1704,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
 	if (conn->events == CLOSED)
 		return;
 
-	if (!tcp_send_flag(c, conn, RST))
+	if (!tcp_buf_send_flag(c, conn, RST))
 		conn_event(c, conn, CLOSED);
 }
 
@@ -2024,7 +2024,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 	} else {
 		tcp_get_sndbuf(conn);
 
-		if (tcp_send_flag(c, conn, SYN | ACK))
+		if (tcp_buf_send_flag(c, conn, SYN | ACK))
 			return;
 
 		conn_event(c, conn, TAP_SYN_ACK_SENT);
@@ -2100,7 +2100,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
 		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
-			tcp_l2_data_buf_flush(c);
+			tcp_buf_l2_data_flush(c);
 	} else if (CONN_V6(conn)) {
 		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
 
@@ -2112,12 +2112,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
 		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
 		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
-			tcp_l2_data_buf_flush(c);
+			tcp_buf_l2_data_flush(c);
 	}
 }
 
 /**
- * tcp_data_from_sock() - Handle new data from socket, queue to tap, in window
+ * tcp_buf_data_from_sock() - Handle new data from socket, queue to tap, in window
  * @c:		Execution context
  * @conn:	Connection pointer
  *
@@ -2125,7 +2125,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+static int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
@@ -2169,7 +2169,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 
 	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
 	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
-		tcp_l2_data_buf_flush(c);
+		tcp_buf_l2_data_flush(c);
 
 		/* Silence Coverity CWE-125 false positive */
 		tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
@@ -2195,7 +2195,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 
 	if (!len) {
 		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
-			if ((ret = tcp_send_flag(c, conn, FIN | ACK))) {
+			if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
 				tcp_rst(c, conn);
 				return ret;
 			}
@@ -2378,7 +2378,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			   max_ack_seq, conn->seq_to_tap);
 		conn->seq_ack_from_tap = max_ack_seq;
 		conn->seq_to_tap = max_ack_seq;
-		tcp_data_from_sock(c, conn);
+		tcp_buf_data_from_sock(c, conn);
 	}
 
 	if (!iov_i)
@@ -2394,14 +2394,14 @@ eintr:
 			 *   Then swiftly looked away and left.
 			 */
 			conn->seq_from_tap = seq_from_tap;
-			tcp_send_flag(c, conn, ACK);
+			tcp_buf_send_flag(c, conn, ACK);
 		}
 
 		if (errno == EINTR)
 			goto eintr;
 
 		if (errno == EAGAIN || errno == EWOULDBLOCK) {
-			tcp_send_flag(c, conn, ACK_IF_NEEDED);
+			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
 			return p->count - idx;
 
 		}
@@ -2411,7 +2411,7 @@ eintr:
 	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
 		partial_send = 1;
 		conn->seq_from_tap += n;
-		tcp_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
 	} else {
 		conn->seq_from_tap += n;
 	}
@@ -2424,7 +2424,7 @@ out:
 		 */
 		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
 			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
-			tcp_send_flag(c, conn, DUP_ACK);
+			tcp_buf_send_flag(c, conn, DUP_ACK);
 		}
 		return p->count - idx;
 	}
@@ -2438,7 +2438,7 @@ out:
 
 		conn_event(c, conn, TAP_FIN_RCVD);
 	} else {
-		tcp_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
 	}
 
 	return p->count - idx;
@@ -2474,8 +2474,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
 	 */
-	tcp_data_from_sock(c, conn);
-	tcp_send_flag(c, conn, ACK);
+	tcp_buf_data_from_sock(c, conn);
+	tcp_buf_send_flag(c, conn, ACK);
 }
 
 /**
@@ -2555,7 +2555,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 			conn->seq_from_tap++;
 
 			shutdown(conn->sock, SHUT_WR);
-			tcp_send_flag(c, conn, ACK);
+			tcp_buf_send_flag(c, conn, ACK);
 			conn_event(c, conn, SOCK_FIN_SENT);
 
 			return 1;
@@ -2566,7 +2566,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 
 		tcp_tap_window_update(conn, ntohs(th->window));
 
-		tcp_data_from_sock(c, conn);
+		tcp_buf_data_from_sock(c, conn);
 
 		if (p->count - idx == 1)
 			return 1;
@@ -2596,7 +2596,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
-		tcp_send_flag(c, conn, ACK);
+		tcp_buf_send_flag(c, conn, ACK);
 		ack_due = 0;
 	}
 
@@ -2630,7 +2630,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 		return;
 	}
 
-	if (tcp_send_flag(c, conn, SYN | ACK))
+	if (tcp_buf_send_flag(c, conn, SYN | ACK))
 		return;
 
 	conn_event(c, conn, TAP_SYN_ACK_SENT);
@@ -2698,7 +2698,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
 
 	conn->wnd_from_tap = WINDOW_DEFAULT;
 
-	tcp_send_flag(c, conn, SYN);
+	tcp_buf_send_flag(c, conn, SYN);
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	tcp_get_sndbuf(conn);
@@ -2762,7 +2762,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 		return;
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
-		tcp_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
 		tcp_timer_ctl(c, conn);
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		if (!(conn->events & ESTABLISHED)) {
@@ -2778,7 +2778,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "ACK timeout, retry");
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
-			tcp_data_from_sock(c, conn);
+			tcp_buf_data_from_sock(c, conn);
 			tcp_timer_ctl(c, conn);
 		}
 	} else {
@@ -2833,7 +2833,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 			conn_event(c, conn, SOCK_FIN_RCVD);
 
 		if (events & EPOLLIN)
-			tcp_data_from_sock(c, conn);
+			tcp_buf_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
 			tcp_update_seqack_wnd(c, conn, 0, NULL);
@@ -3058,10 +3058,10 @@ int tcp_init(struct ctx *c)
 		tc_hash[b] = FLOW_SIDX_NONE;
 
 	if (c->ifi4)
-		tcp_sock4_iov_init(c);
+		tcp_buf_sock4_iov_init(c);
 
 	if (c->ifi6)
-		tcp_sock6_iov_init(c);
+		tcp_buf_sock6_iov_init(c);
 
 	memset(init_sock_pool4,		0xff,	sizeof(init_sock_pool4));
 	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
diff --git a/tcp.h b/tcp.h
index b9f546d31002..e7dbcfa2ddbd 100644
--- a/tcp.h
+++ b/tcp.h
@@ -23,7 +23,7 @@ int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
 
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
+void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s);
 
 /**
  * union tcp_epoll_ref - epoll reference portion for TCP connections
-- 
2.42.0



* [PATCH 11/24] tcp: move buffers management functions to their own file
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (9 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 10/24] tcp: rename functions that manage buffers Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-02 14:11 ` [PATCH 12/24] tap: make tap_update_mac() generic Laurent Vivier
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile       |   7 +-
 tcp.c          | 618 ++-----------------------------------------------
 tcp_buf.c      | 569 +++++++++++++++++++++++++++++++++++++++++++++
 tcp_buf.h      |  17 ++
 tcp_internal.h |  78 +++++++
 5 files changed, 689 insertions(+), 600 deletions(-)
 create mode 100644 tcp_buf.c
 create mode 100644 tcp_buf.h
 create mode 100644 tcp_internal.h

diff --git a/Makefile b/Makefile
index acf37f5a2036..bf370b6ec2e6 100644
--- a/Makefile
+++ b/Makefile
@@ -46,8 +46,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
-	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
-	util.c iov.c ip.c
+	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
+	tcp_buf.c udp.c util.c iov.c ip.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -56,7 +56,8 @@ MANPAGES = passt.1 pasta.1 qrap.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
-	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h ip.h
+	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
+	util.h iov.h ip.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/tcp.c b/tcp.c
index 640209533772..54c15087d678 100644
--- a/tcp.c
+++ b/tcp.c
@@ -300,57 +300,19 @@
 #include "flow.h"
 
 #include "flow_table.h"
+#include "tcp_internal.h"
+#include "tcp_buf.h"
 
 /* Sides of a flow as we use them in "tap" connections */
 #define	SOCKSIDE	0
 #define	TAPSIDE		1
 
-#define TCP_FRAMES_MEM			128
-#define TCP_FRAMES							\
-	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
-
 #define TCP_HASH_TABLE_LOAD		70		/* % */
 #define TCP_HASH_TABLE_SIZE		(FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD)
 
-#define MAX_WS				8
-#define MAX_WINDOW			(1 << (16 + (MAX_WS)))
-
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
 
-struct tcp4_l2_head {	/* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
-#ifdef __AVX2__
-	uint8_t pad[26];
-#else
-	uint8_t pad[2];
-#endif
-	struct tap_hdr taph;
-	struct iphdr iph;
-	struct tcphdr th;
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)));
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
-#endif
-
-struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
-#ifdef __AVX2__
-	uint8_t pad[14];
-#else
-	uint8_t pad[2];
-#endif
-	struct tap_hdr taph;
-	struct ipv6hdr ip6h;
-	struct tcphdr th;
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)));
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
-#endif
-
-#define MSS4	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp4_l2_head), 4)
-#define MSS6	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp6_l2_head), 4)
-
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
 #ifdef HAS_SND_WND
 # define KERNEL_REPORTS_SND_WND(c)	(c->tcp.kernel_snd_wnd)
@@ -372,31 +334,9 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
  */
 #define SOL_TCP				IPPROTO_TCP
 
-#define SEQ_LE(a, b)			((b) - (a) < MAX_WINDOW)
-#define SEQ_LT(a, b)			((b) - (a) - 1 < MAX_WINDOW)
-#define SEQ_GE(a, b)			((a) - (b) < MAX_WINDOW)
-#define SEQ_GT(a, b)			((a) - (b) - 1 < MAX_WINDOW)
-
-#define FIN		(1 << 0)
-#define SYN		(1 << 1)
-#define RST		(1 << 2)
-#define ACK		(1 << 4)
-/* Flags for internal usage */
-#define DUP_ACK		(1 << 5)
 #define ACK_IF_NEEDED	0		/* See tcp_buf_send_flag() */
 
-#define OPT_EOL		0
-#define OPT_NOP		1
-#define OPT_MSS		2
-#define OPT_MSS_LEN	4
-#define OPT_WS		3
-#define OPT_WS_LEN	3
-#define OPT_SACKP	4
-#define OPT_SACK	5
-#define OPT_TS		8
-
-#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
-#define CONN_V6(conn)		(!CONN_V4(conn))
+
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
 	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
@@ -433,144 +373,11 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
  */
 static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
-/**
- * tcp_buf_seq_update - Sequences to update with length of frames once sent
- * @seq:	Pointer to sequence number sent to tap-side, to be updated
- * @len:	TCP payload length
- */
-struct tcp_buf_seq_update {
-	uint32_t *seq;
-	uint16_t len;
-};
-
-/* Static buffers */
-
-/**
- * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections
- * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
- * @taph:	Tap-level headers (partially pre-filled)
- * @iph:	Pre-filled IP header (except for tot_len and saddr)
- * @uh:		Headroom for TCP header
- * @data:	Storage for TCP payload
- */
-static struct tcp4_l2_buf_t {
-#ifdef __AVX2__
-	uint8_t pad[26];	/* 0, align th to 32 bytes */
-#else
-	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
-#endif
-	struct tap_hdr taph;	/* 26				2 */
-	struct iphdr iph;	/* 44				20 */
-	struct tcphdr th;	/* 64				40 */
-	uint8_t data[MSS4];	/* 84				60 */
-				/* 65536			65532 */
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-tcp4_l2_buf[TCP_FRAMES_MEM];
-
-static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
-
-static unsigned int tcp4_l2_buf_used;
-
-/**
- * tcp6_l2_buf_t - Pre-cooked IPv6 packet buffers for tap connections
- * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
- * @taph:	Tap-level headers (partially pre-filled)
- * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
- * @th:		Headroom for TCP header
- * @data:	Storage for TCP payload
- */
-struct tcp6_l2_buf_t {
-#ifdef __AVX2__
-	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
-#else
-	uint8_t pad[2];		/*	align ip6h to 4 bytes	0 */
-#endif
-	struct tap_hdr taph;	/* 14				2 */
-	struct ipv6hdr ip6h;	/* 32				20 */
-	struct tcphdr th;	/* 72				60 */
-	uint8_t data[MSS6];	/* 92				80 */
-				/* 65536			65532 */
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-tcp6_l2_buf[TCP_FRAMES_MEM];
-
-static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
-
-static unsigned int tcp6_l2_buf_used;
-
-/* recvmsg()/sendmsg() data for tap */
-static char 		tcp_buf_discard		[MAX_WINDOW];
-static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
-
-static struct iovec	tcp4_l2_iov		[TCP_FRAMES_MEM];
-static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM];
-static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM];
-static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM];
+char		tcp_buf_discard		[MAX_WINDOW];
 
 /* sendmsg() to socket */
 static struct iovec	tcp_iov			[UIO_MAXIOV];
 
-/**
- * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags)
- * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
- * @taph:	Tap-level headers (partially pre-filled)
- * @iph:	Pre-filled IP header (except for tot_len and saddr)
- * @th:		Headroom for TCP header
- * @opts:	Headroom for TCP options
- */
-static struct tcp4_l2_flags_buf_t {
-#ifdef __AVX2__
-	uint8_t pad[26];	/* 0, align th to 32 bytes */
-#else
-	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
-#endif
-	struct tap_hdr taph;	/* 26				2 */
-	struct iphdr iph;	/* 44				20 */
-	struct tcphdr th;	/* 64				40 */
-	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-tcp4_l2_flags_buf[TCP_FRAMES_MEM];
-
-static unsigned int tcp4_l2_flags_buf_used;
-
-/**
- * tcp6_l2_flags_buf_t - IPv6 packet buffers for segments without data (flags)
- * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
- * @taph:	Tap-level headers (partially pre-filled)
- * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
- * @th:		Headroom for TCP header
- * @opts:	Headroom for TCP options
- */
-static struct tcp6_l2_flags_buf_t {
-#ifdef __AVX2__
-	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
-#else
-	uint8_t pad[2];		/*	align ip6h to 4 bytes		   0 */
-#endif
-	struct tap_hdr taph;	/* 14					   2 */
-	struct ipv6hdr ip6h;	/* 32					  20 */
-	struct tcphdr th	/* 72 */ __attribute__ ((aligned(4))); /* 60 */
-	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-tcp6_l2_flags_buf[TCP_FRAMES_MEM];
-
-static unsigned int tcp6_l2_flags_buf_used;
-
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
 /* Table for lookup from remote address, local port, remote port */
@@ -611,14 +418,6 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 	return EPOLLRDHUP;
 }
 
-static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
-			 unsigned long flag);
-#define conn_flag(c, conn, flag)					\
-	do {								\
-		flow_trace(conn, "flag at %s:%i", __func__, __LINE__);	\
-		conn_flag_do(c, conn, flag);				\
-	} while (0)
-
 /**
  * tcp_epoll_ctl() - Add/modify/delete epoll state from connection events
  * @c:		Execution context
@@ -730,8 +529,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
  * @conn:	Connection pointer
  * @flag:	Flag to set, or ~flag to unset
  */
-static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
-			 unsigned long flag)
+void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
+		  unsigned long flag)
 {
 	if (flag & (flag - 1)) {
 		int flag_index = fls(~flag);
@@ -781,8 +580,8 @@ static void tcp_hash_remove(const struct ctx *c,
  * @conn:	Connection pointer
  * @event:	Connection event
  */
-static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
-			  unsigned long event)
+void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
+		   unsigned long event)
 {
 	int prev, new, num = fls(event);
 
@@ -830,12 +629,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp_timer_ctl(c, conn);
 }
 
-#define conn_event(c, conn, event)					\
-	do {								\
-		flow_trace(conn, "event at %s:%i", __func__, __LINE__);	\
-		conn_event_do(c, conn, event);				\
-	} while (0)
-
 /**
  * tcp_rtt_dst_low() - Check if low RTT was seen for connection endpoint
  * @conn:	Connection pointer
@@ -959,91 +752,6 @@ static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
 	return csum(th, ntohs(ip6h->payload_len), sum);
 }
 
-/**
- * tcp_buf_update_l2() - Update L2 buffers with Ethernet and IPv4 addresses
- * @eth_d:	Ethernet destination address, NULL if unchanged
- * @eth_s:	Ethernet source address, NULL if unchanged
- */
-void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
-{
-	int i;
-
-	for (i = 0; i < TCP_FRAMES_MEM; i++) {
-		struct tcp4_l2_flags_buf_t *b4f = &tcp4_l2_flags_buf[i];
-		struct tcp6_l2_flags_buf_t *b6f = &tcp6_l2_flags_buf[i];
-		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
-		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
-
-		tap_update_mac(&b4->taph, eth_d, eth_s);
-		tap_update_mac(&b6->taph, eth_d, eth_s);
-		tap_update_mac(&b4f->taph, eth_d, eth_s);
-		tap_update_mac(&b6f->taph, eth_d, eth_s);
-	}
-}
-
-/**
- * tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
- * @c:		Execution context
- */
-static void tcp_buf_sock4_iov_init(const struct ctx *c)
-{
-	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
-	struct iovec *iov;
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) {
-		tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) {
-			.taph = TAP_HDR_INIT(ETH_P_IP),
-			.iph = iph,
-			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
-		};
-	}
-
-	for (i = 0; i < ARRAY_SIZE(tcp4_l2_flags_buf); i++) {
-		tcp4_l2_flags_buf[i] = (struct tcp4_l2_flags_buf_t) {
-			.taph = TAP_HDR_INIT(ETH_P_IP),
-			.iph = L2_BUF_IP4_INIT(IPPROTO_TCP)
-		};
-	}
-
-	for (i = 0, iov = tcp4_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
-		iov->iov_base = tap_iov_base(c, &tcp4_l2_buf[i].taph);
-
-	for (i = 0, iov = tcp4_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
-		iov->iov_base = tap_iov_base(c, &tcp4_l2_flags_buf[i].taph);
-}
-
-/**
- * tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
- * @c:		Execution context
- */
-static void tcp_buf_sock6_iov_init(const struct ctx *c)
-{
-	struct iovec *iov;
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(tcp6_l2_buf); i++) {
-		tcp6_l2_buf[i] = (struct tcp6_l2_buf_t) {
-			.taph = TAP_HDR_INIT(ETH_P_IPV6),
-			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP),
-			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
-		};
-	}
-
-	for (i = 0; i < ARRAY_SIZE(tcp6_l2_flags_buf); i++) {
-		tcp6_l2_flags_buf[i] = (struct tcp6_l2_flags_buf_t) {
-			.taph = TAP_HDR_INIT(ETH_P_IPV6),
-			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP)
-		};
-	}
-
-	for (i = 0, iov = tcp6_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
-		iov->iov_base = tap_iov_base(c, &tcp6_l2_buf[i].taph);
-
-	for (i = 0, iov = tcp6_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
-		iov->iov_base = tap_iov_base(c, &tcp6_l2_flags_buf[i].taph);
-}
-
 /**
  * tcp_opt_get() - Get option, and value if any, from TCP header
  * @opts:	Pointer to start of TCP options in header
@@ -1269,46 +977,6 @@ bool tcp_flow_defer(union flow *flow)
 	return true;
 }
 
-static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
-#define tcp_rst(c, conn)						\
-	do {								\
-		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
-		tcp_rst_do(c, conn);					\
-	} while (0)
-
-/**
- * tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
- * @c:		Execution context
- */
-static void tcp_buf_l2_flags_flush(const struct ctx *c)
-{
-	tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
-	tcp6_l2_flags_buf_used = 0;
-
-	tap_send_frames(c, tcp4_l2_flags_iov, tcp4_l2_flags_buf_used);
-	tcp4_l2_flags_buf_used = 0;
-}
-
-/**
- * tcp_buf_l2_data_flush() - Send out buffers for segments with data
- * @c:		Execution context
- */
-static void tcp_buf_l2_data_flush(const struct ctx *c)
-{
-	unsigned i;
-	size_t m;
-
-	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
-	for (i = 0; i < m; i++)
-		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
-	tcp6_l2_buf_used = 0;
-
-	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
-	for (i = 0; i < m; i++)
-		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
-	tcp4_l2_buf_used = 0;
-}
-
 /**
  * tcp_defer_handler() - Handler for TCP deferred tasks
  * @c:		Execution context
@@ -1348,10 +1016,10 @@ static void tcp_set_tcp_header(struct tcphdr *th,
  * Return: IP frame length including L2 headers, host order
  */
 
-static size_t ipv4_fill_headers(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
-				struct iphdr *iph, size_t plen,
-				const uint16_t *check, uint32_t seq)
+size_t ipv4_fill_headers(const struct ctx *c,
+			 const struct tcp_tap_conn *conn,
+			 struct iphdr *iph, size_t plen,
+			 const uint16_t *check, uint32_t seq)
 {
 	struct tcphdr *th = (void *)(iph + 1);
 	const struct in_addr *a4 = inany_v4(&conn->faddr);
@@ -1382,10 +1050,10 @@ static size_t ipv4_fill_headers(const struct ctx *c,
  * Return: IP frame length including L2 headers, host order
  */
 
-static size_t ipv6_fill_headers(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
-				struct ipv6hdr *ip6h, size_t plen,
-				uint32_t seq)
+size_t ipv6_fill_headers(const struct ctx *c,
+			 const struct tcp_tap_conn *conn,
+			 struct ipv6hdr *ip6h, size_t plen,
+			 uint32_t seq)
 {
 	struct tcphdr *th = (void *)(ip6h + 1);
 	size_t ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
@@ -1423,8 +1091,8 @@ static size_t ipv6_fill_headers(const struct ctx *c,
  *
  * Return: 1 if sequence or window were updated, 0 otherwise
  */
-static int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
-				 int force_seq, struct tcp_info *tinfo)
+int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
+			  int force_seq, struct tcp_info *tinfo)
 {
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap;
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
@@ -1539,7 +1207,8 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
  * Return: negative error code on connection reset, 0 otherwise
  */
 
-static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,			    struct tcphdr *th, char *data, size_t optlen)
+int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
+		     struct tcphdr *th, char *data, size_t optlen)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
@@ -1629,77 +1298,13 @@ static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
 	return 1;
 }
 
-static int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
-{
-	size_t optlen = 0;
-	struct iovec *iov;
-	size_t ip_len;
-	int ret;
-
-	/* Options: MSS, NOP and window scale (8 bytes) */
-	if (flags & SYN)
-		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
-
-	if (CONN_V4(conn)) {
-		struct tcp4_l2_flags_buf_t *b4;
-
-		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
-		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
-
-		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
-				       optlen);
-		if (ret <= 0)
-			return ret;
-
-		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
-					   NULL, conn->seq_to_tap);
-
-		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
-
-		if (flags & DUP_ACK) {
-
-			memcpy(b4 + 1, b4, sizeof(*b4));
-			(iov + 1)->iov_len = iov->iov_len;
-			tcp4_l2_flags_buf_used++;
-		}
-
-		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
-			tcp_buf_l2_flags_flush(c);
-	} else {
-		struct tcp6_l2_flags_buf_t *b6;
-
-		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
-		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
-
-		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
-				       optlen);
-		if (ret <= 0)
-			return ret;
-
-		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
-					   conn->seq_to_tap);
-
-		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
-
-		if (flags & DUP_ACK) {
-			memcpy(b6 + 1, b6, sizeof(*b6));
-			(iov + 1)->iov_len = iov->iov_len;
-			tcp6_l2_flags_buf_used++;
-		}
-
-		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
-			tcp_buf_l2_flags_flush(c);
-	}
-
-	return 0;
-}
 
 /**
  * tcp_rst_do() - Reset a tap connection: send RST segment to tap, close socket
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
+void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	if (conn->events == CLOSED)
 		return;
@@ -1813,14 +1418,6 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 	return s;
 }
 
-static uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)
-{
-	if (CONN_V4(conn))
-		return MSS4;
-
-	return MSS6;
-}
-
 /**
  * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
  * @conn:	Connection pointer
@@ -2072,179 +1669,6 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
 	return 0;
 }
 
-/**
- * tcp_data_to_tap() - Finalise (queue) highest-numbered scatter-gather buffer
- * @c:		Execution context
- * @conn:	Connection pointer
- * @plen:	Payload length at L4
- * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
- * @seq:	Sequence number to be sent
- */
-static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
-			    ssize_t plen, int no_csum, uint32_t seq)
-{
-	uint32_t *seq_update = &conn->seq_to_tap;
-	struct iovec *iov;
-	size_t ip_len;
-
-	if (CONN_V4(conn)) {
-		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
-		const uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
-
-		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
-		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
-
-		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
-					   check, seq);
-
-		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
-		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
-		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
-			tcp_buf_l2_data_flush(c);
-	} else if (CONN_V6(conn)) {
-		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
-
-		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
-		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
-
-		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
-
-		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
-		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
-		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
-			tcp_buf_l2_data_flush(c);
-	}
-}
-
-/**
- * tcp_buf_data_from_sock() - Handle new data from socket, queue to tap, in window
- * @c:		Execution context
- * @conn:	Connection pointer
- *
- * Return: negative on connection reset, 0 otherwise
- *
- * #syscalls recvmsg
- */
-static int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
-{
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
-	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
-	int sendlen, len, plen, v4 = CONN_V4(conn);
-	int s = conn->sock, i, ret = 0;
-	struct msghdr mh_sock = { 0 };
-	uint16_t mss = MSS_GET(conn);
-	uint32_t already_sent, seq;
-	struct iovec *iov;
-
-	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
-
-	if (SEQ_LT(already_sent, 0)) {
-		/* RFC 761, section 2.1. */
-		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
-			   conn->seq_ack_from_tap, conn->seq_to_tap);
-		conn->seq_to_tap = conn->seq_ack_from_tap;
-		already_sent = 0;
-	}
-
-	if (!wnd_scaled || already_sent >= wnd_scaled) {
-		conn_flag(c, conn, STALLED);
-		conn_flag(c, conn, ACK_FROM_TAP_DUE);
-		return 0;
-	}
-
-	/* Set up buffer descriptors we'll fill completely and partially. */
-	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
-	if (fill_bufs > TCP_FRAMES) {
-		fill_bufs = TCP_FRAMES;
-		iov_rem = 0;
-	} else {
-		iov_rem = (wnd_scaled - already_sent) % mss;
-	}
-
-	mh_sock.msg_iov = iov_sock;
-	mh_sock.msg_iovlen = fill_bufs + 1;
-
-	iov_sock[0].iov_base = tcp_buf_discard;
-	iov_sock[0].iov_len = already_sent;
-
-	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
-	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
-		tcp_buf_l2_data_flush(c);
-
-		/* Silence Coverity CWE-125 false positive */
-		tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
-	}
-
-	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
-		if (v4)
-			iov->iov_base = &tcp4_l2_buf[tcp4_l2_buf_used + i].data;
-		else
-			iov->iov_base = &tcp6_l2_buf[tcp6_l2_buf_used + i].data;
-		iov->iov_len = mss;
-	}
-	if (iov_rem)
-		iov_sock[fill_bufs].iov_len = iov_rem;
-
-	/* Receive into buffers, don't dequeue until acknowledged by guest. */
-	do
-		len = recvmsg(s, &mh_sock, MSG_PEEK);
-	while (len < 0 && errno == EINTR);
-
-	if (len < 0)
-		goto err;
-
-	if (!len) {
-		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
-			if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
-				tcp_rst(c, conn);
-				return ret;
-			}
-
-			conn_event(c, conn, TAP_FIN_SENT);
-		}
-
-		return 0;
-	}
-
-	sendlen = len - already_sent;
-	if (sendlen <= 0) {
-		conn_flag(c, conn, STALLED);
-		return 0;
-	}
-
-	conn_flag(c, conn, ~STALLED);
-
-	send_bufs = DIV_ROUND_UP(sendlen, mss);
-	last_len = sendlen - (send_bufs - 1) * mss;
-
-	/* Likely, some new data was acked too. */
-	tcp_update_seqack_wnd(c, conn, 0, NULL);
-
-	/* Finally, queue to tap */
-	plen = mss;
-	seq = conn->seq_to_tap;
-	for (i = 0; i < send_bufs; i++) {
-		int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
-
-		if (i == send_bufs - 1)
-			plen = last_len;
-
-		tcp_data_to_tap(c, conn, plen, no_csum, seq);
-		seq += plen;
-	}
-
-	conn_flag(c, conn, ACK_FROM_TAP_DUE);
-
-	return 0;
-
-err:
-	if (errno != EAGAIN && errno != EWOULDBLOCK) {
-		ret = -errno;
-		tcp_rst(c, conn);
-	}
-
-	return ret;
-}
 
 /**
  * tcp_data_from_tap() - tap/guest data for established connection
diff --git a/tcp_buf.c b/tcp_buf.c
new file mode 100644
index 000000000000..d70e7f810e4a
--- /dev/null
+++ b/tcp_buf.c
@@ -0,0 +1,569 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * tcp_buf.c - TCP L2-L4 translation state machine
+ *
+ * Copyright (c) 2020-2022 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <stddef.h>
+#include <stdint.h>
+#include <limits.h>
+#include <string.h>
+#include <errno.h>
+
+#include <netinet/ip.h>
+
+#include <linux/tcp.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "tap.h"
+#include "siphash.h"
+#include "inany.h"
+#include "tcp_conn.h"
+#include "tcp_internal.h"
+#include "tcp_buf.h"
+
+#define TCP_FRAMES_MEM			128
+#define TCP_FRAMES							\
+	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
+
+struct tcp4_l2_head {	/* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
+#ifdef __AVX2__
+	uint8_t pad[26];
+#else
+	uint8_t pad[2];
+#endif
+	struct tap_hdr taph;
+	struct iphdr iph;
+	struct tcphdr th;
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
+#ifdef __AVX2__
+	uint8_t pad[14];
+#else
+	uint8_t pad[2];
+#endif
+	struct tap_hdr taph;
+	struct ipv6hdr ip6h;
+	struct tcphdr th;
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+#define MSS4	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp4_l2_head), 4)
+#define MSS6	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp6_l2_head), 4)
+
+/**
+ * tcp_buf_seq_update - Sequences to update with length of frames once sent
+ * @seq:	Pointer to sequence number sent to tap-side, to be updated
+ * @len:	TCP payload length
+ */
+struct tcp_buf_seq_update {
+	uint32_t *seq;
+	uint16_t len;
+};
+
+/* Static buffers */
+
+/**
+ * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections
+ * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
+ * @taph:	Tap-level headers (partially pre-filled)
+ * @iph:	Pre-filled IP header (except for tot_len and saddr)
+ * @th:		Headroom for TCP header
+ * @data:	Storage for TCP payload
+ */
+static struct tcp4_l2_buf_t {
+#ifdef __AVX2__
+	uint8_t pad[26];	/* 0, align th to 32 bytes */
+#else
+	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
+#endif
+	struct tap_hdr taph;	/* 26				2 */
+	struct iphdr iph;	/* 44				20 */
+	struct tcphdr th;	/* 64				40 */
+	uint8_t data[MSS4];	/* 84				60 */
+				/* 65536			65532 */
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)))
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
+#endif
+tcp4_l2_buf[TCP_FRAMES_MEM];
+
+static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
+
+static unsigned int tcp4_l2_buf_used;
+
+/**
+ * tcp6_l2_buf_t - Pre-cooked IPv6 packet buffers for tap connections
+ * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
+ * @taph:	Tap-level headers (partially pre-filled)
+ * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
+ * @th:		Headroom for TCP header
+ * @data:	Storage for TCP payload
+ */
+struct tcp6_l2_buf_t {
+#ifdef __AVX2__
+	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
+#else
+	uint8_t pad[2];		/*	align ip6h to 4 bytes	0 */
+#endif
+	struct tap_hdr taph;	/* 14				2 */
+	struct ipv6hdr ip6h;	/* 32				20 */
+	struct tcphdr th;	/* 72				60 */
+	uint8_t data[MSS6];	/* 92				80 */
+				/* 65536			65532 */
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)))
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
+#endif
+tcp6_l2_buf[TCP_FRAMES_MEM];
+
+static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
+
+static unsigned int tcp6_l2_buf_used;
+
+/* recvmsg()/sendmsg() data for tap */
+static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
+
+static struct iovec	tcp4_l2_iov		[TCP_FRAMES_MEM];
+static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM];
+static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM];
+static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM];
+
+/**
+ * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags)
+ * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
+ * @taph:	Tap-level headers (partially pre-filled)
+ * @iph:	Pre-filled IP header (except for tot_len and saddr)
+ * @th:		Headroom for TCP header
+ * @opts:	Headroom for TCP options
+ */
+static struct tcp4_l2_flags_buf_t {
+#ifdef __AVX2__
+	uint8_t pad[26];	/* 0, align th to 32 bytes */
+#else
+	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
+#endif
+	struct tap_hdr taph;	/* 26				2 */
+	struct iphdr iph;	/* 44				20 */
+	struct tcphdr th;	/* 64				40 */
+	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)))
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
+#endif
+tcp4_l2_flags_buf[TCP_FRAMES_MEM];
+
+static unsigned int tcp4_l2_flags_buf_used;
+
+/**
+ * tcp6_l2_flags_buf_t - IPv6 packet buffers for segments without data (flags)
+ * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
+ * @taph:	Tap-level headers (partially pre-filled)
+ * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
+ * @th:		Headroom for TCP header
+ * @opts:	Headroom for TCP options
+ */
+static struct tcp6_l2_flags_buf_t {
+#ifdef __AVX2__
+	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
+#else
+	uint8_t pad[2];		/*	align ip6h to 4 bytes		   0 */
+#endif
+	struct tap_hdr taph;	/* 14					   2 */
+	struct ipv6hdr ip6h;	/* 32					  20 */
+	struct tcphdr th	/* 72 */ __attribute__ ((aligned(4))); /* 60 */
+	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)))
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
+#endif
+tcp6_l2_flags_buf[TCP_FRAMES_MEM];
+
+static unsigned int tcp6_l2_flags_buf_used;
+
+/**
+ * tcp_buf_update_l2() - Update L2 buffers with Ethernet addresses
+ * @eth_d:	Ethernet destination address, NULL if unchanged
+ * @eth_s:	Ethernet source address, NULL if unchanged
+ */
+void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
+{
+	int i;
+
+	for (i = 0; i < TCP_FRAMES_MEM; i++) {
+		struct tcp4_l2_flags_buf_t *b4f = &tcp4_l2_flags_buf[i];
+		struct tcp6_l2_flags_buf_t *b6f = &tcp6_l2_flags_buf[i];
+		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
+		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
+
+		tap_update_mac(&b4->taph, eth_d, eth_s);
+		tap_update_mac(&b6->taph, eth_d, eth_s);
+		tap_update_mac(&b4f->taph, eth_d, eth_s);
+		tap_update_mac(&b6f->taph, eth_d, eth_s);
+	}
+}
+
+/**
+ * tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
+ * @c:		Execution context
+ */
+void tcp_buf_sock4_iov_init(const struct ctx *c)
+{
+	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
+	struct iovec *iov;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) {
+		tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) {
+			.taph = TAP_HDR_INIT(ETH_P_IP),
+			.iph = iph,
+			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
+		};
+	}
+
+	for (i = 0; i < ARRAY_SIZE(tcp4_l2_flags_buf); i++) {
+		tcp4_l2_flags_buf[i] = (struct tcp4_l2_flags_buf_t) {
+			.taph = TAP_HDR_INIT(ETH_P_IP),
+			.iph = L2_BUF_IP4_INIT(IPPROTO_TCP)
+		};
+	}
+
+	for (i = 0, iov = tcp4_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
+		iov->iov_base = tap_iov_base(c, &tcp4_l2_buf[i].taph);
+
+	for (i = 0, iov = tcp4_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
+		iov->iov_base = tap_iov_base(c, &tcp4_l2_flags_buf[i].taph);
+}
+
+/**
+ * tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
+ * @c:		Execution context
+ */
+void tcp_buf_sock6_iov_init(const struct ctx *c)
+{
+	struct iovec *iov;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tcp6_l2_buf); i++) {
+		tcp6_l2_buf[i] = (struct tcp6_l2_buf_t) {
+			.taph = TAP_HDR_INIT(ETH_P_IPV6),
+			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP),
+			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
+		};
+	}
+
+	for (i = 0; i < ARRAY_SIZE(tcp6_l2_flags_buf); i++) {
+		tcp6_l2_flags_buf[i] = (struct tcp6_l2_flags_buf_t) {
+			.taph = TAP_HDR_INIT(ETH_P_IPV6),
+			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP)
+		};
+	}
+
+	for (i = 0, iov = tcp6_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
+		iov->iov_base = tap_iov_base(c, &tcp6_l2_buf[i].taph);
+
+	for (i = 0, iov = tcp6_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
+		iov->iov_base = tap_iov_base(c, &tcp6_l2_flags_buf[i].taph);
+}
+
+/**
+ * tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
+ * @c:		Execution context
+ */
+void tcp_buf_l2_flags_flush(const struct ctx *c)
+{
+	tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
+	tcp6_l2_flags_buf_used = 0;
+
+	tap_send_frames(c, tcp4_l2_flags_iov, tcp4_l2_flags_buf_used);
+	tcp4_l2_flags_buf_used = 0;
+}
+
+/**
+ * tcp_buf_l2_data_flush() - Send out buffers for segments with data
+ * @c:		Execution context
+ */
+void tcp_buf_l2_data_flush(const struct ctx *c)
+{
+	unsigned i;
+	size_t m;
+
+	m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used);
+	for (i = 0; i < m; i++)
+		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
+	tcp6_l2_buf_used = 0;
+
+	m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used);
+	for (i = 0; i < m; i++)
+		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
+	tcp4_l2_buf_used = 0;
+}
+
+int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	size_t optlen = 0;
+	struct iovec *iov;
+	size_t ip_len;
+	int ret;
+
+	/* Options: MSS, NOP and window scale (8 bytes) */
+	if (flags & SYN)
+		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
+
+	if (CONN_V4(conn)) {
+		struct tcp4_l2_flags_buf_t *b4;
+
+		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
+		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
+					   NULL, conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
+
+		if (flags & DUP_ACK) {
+
+			memcpy(b4 + 1, b4, sizeof(*b4));
+			(iov + 1)->iov_len = iov->iov_len;
+			tcp4_l2_flags_buf_used++;
+		}
+
+		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
+			tcp_buf_l2_flags_flush(c);
+	} else {
+		struct tcp6_l2_flags_buf_t *b6;
+
+		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
+		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
+
+		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
+				       optlen);
+		if (ret <= 0)
+			return ret;
+
+		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
+					   conn->seq_to_tap);
+
+		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
+
+		if (flags & DUP_ACK) {
+			memcpy(b6 + 1, b6, sizeof(*b6));
+			(iov + 1)->iov_len = iov->iov_len;
+			tcp6_l2_flags_buf_used++;
+		}
+
+		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
+			tcp_buf_l2_flags_flush(c);
+	}
+
+	return 0;
+}
+
+uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)
+{
+	if (CONN_V4(conn))
+		return MSS4;
+
+	return MSS6;
+}
+
+/**
+ * tcp_data_to_tap() - Finalise (queue) highest-numbered scatter-gather buffer
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @plen:	Payload length at L4
+ * @no_csum:	Don't compute IPv4 checksum, use the one from previous buffer
+ * @seq:	Sequence number to be sent
+ */
+static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
+			    ssize_t plen, int no_csum, uint32_t seq)
+{
+	uint32_t *seq_update = &conn->seq_to_tap;
+	struct iovec *iov;
+	size_t ip_len;
+
+	if (CONN_V4(conn)) {
+		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
+		const uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
+
+		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
+		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
+
+		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
+					   check, seq);
+
+		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
+		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
+			tcp_buf_l2_data_flush(c);
+	} else if (CONN_V6(conn)) {
+		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
+
+		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
+		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
+
+		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
+
+		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
+		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
+		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
+			tcp_buf_l2_data_flush(c);
+	}
+}
+
+/**
+ * tcp_buf_data_from_sock() - Handle new data from socket, queue to tap, in window
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ *
+ * Return: negative on connection reset, 0 otherwise
+ *
+ * #syscalls recvmsg
+ */
+int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
+	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
+	int sendlen, len, plen, v4 = CONN_V4(conn);
+	int s = conn->sock, i, ret = 0;
+	struct msghdr mh_sock = { 0 };
+	uint16_t mss = MSS_GET(conn);
+	uint32_t already_sent, seq;
+	struct iovec *iov;
+
+	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
+
+	if (SEQ_LT(already_sent, 0)) {
+		/* RFC 761, section 2.1. */
+		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
+			   conn->seq_ack_from_tap, conn->seq_to_tap);
+		conn->seq_to_tap = conn->seq_ack_from_tap;
+		already_sent = 0;
+	}
+
+	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
+		return 0;
+	}
+
+	/* Set up buffer descriptors we'll fill completely and partially. */
+	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
+	if (fill_bufs > TCP_FRAMES) {
+		fill_bufs = TCP_FRAMES;
+		iov_rem = 0;
+	} else {
+		iov_rem = (wnd_scaled - already_sent) % mss;
+	}
+
+	mh_sock.msg_iov = iov_sock;
+	mh_sock.msg_iovlen = fill_bufs + 1;
+
+	iov_sock[0].iov_base = tcp_buf_discard;
+	iov_sock[0].iov_len = already_sent;
+
+	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
+	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
+		tcp_buf_l2_data_flush(c);
+
+		/* Silence Coverity CWE-125 false positive */
+		tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
+	}
+
+	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
+		if (v4)
+			iov->iov_base = &tcp4_l2_buf[tcp4_l2_buf_used + i].data;
+		else
+			iov->iov_base = &tcp6_l2_buf[tcp6_l2_buf_used + i].data;
+		iov->iov_len = mss;
+	}
+	if (iov_rem)
+		iov_sock[fill_bufs].iov_len = iov_rem;
+
+	/* Receive into buffers, don't dequeue until acknowledged by guest. */
+	do
+		len = recvmsg(s, &mh_sock, MSG_PEEK);
+	while (len < 0 && errno == EINTR);
+
+	if (len < 0)
+		goto err;
+
+	if (!len) {
+		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+			if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
+				tcp_rst(c, conn);
+				return ret;
+			}
+
+			conn_event(c, conn, TAP_FIN_SENT);
+		}
+
+		return 0;
+	}
+
+	sendlen = len - already_sent;
+	if (sendlen <= 0) {
+		conn_flag(c, conn, STALLED);
+		return 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
+
+	send_bufs = DIV_ROUND_UP(sendlen, mss);
+	last_len = sendlen - (send_bufs - 1) * mss;
+
+	/* Likely, some new data was acked too. */
+	tcp_update_seqack_wnd(c, conn, 0, NULL);
+
+	/* Finally, queue to tap */
+	plen = mss;
+	seq = conn->seq_to_tap;
+	for (i = 0; i < send_bufs; i++) {
+		int no_csum = i && i != send_bufs - 1 && tcp4_l2_buf_used;
+
+		if (i == send_bufs - 1)
+			plen = last_len;
+
+		tcp_data_to_tap(c, conn, plen, no_csum, seq);
+		seq += plen;
+	}
+
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+	return 0;
+
+err:
+	if (errno != EAGAIN && errno != EWOULDBLOCK) {
+		ret = -errno;
+		tcp_rst(c, conn);
+	}
+
+	return ret;
+}
diff --git a/tcp_buf.h b/tcp_buf.h
new file mode 100644
index 000000000000..d23031252002
--- /dev/null
+++ b/tcp_buf.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef TCP_BUF_H
+#define TCP_BUF_H
+
+void tcp_buf_sock4_iov_init(const struct ctx *c);
+void tcp_buf_sock6_iov_init(const struct ctx *c);
+void tcp_buf_l2_flags_flush(const struct ctx *c);
+void tcp_buf_l2_data_flush(const struct ctx *c);
+uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn);
+int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
+int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
+
+#endif /* TCP_BUF_H */
diff --git a/tcp_internal.h b/tcp_internal.h
new file mode 100644
index 000000000000..36eb2463dd5a
--- /dev/null
+++ b/tcp_internal.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef TCP_INTERNAL_H
+#define TCP_INTERNAL_H
+
+#define MAX_WS				8
+#define MAX_WINDOW			(1 << (16 + (MAX_WS)))
+
+#define SEQ_LE(a, b)			((b) - (a) < MAX_WINDOW)
+#define SEQ_LT(a, b)			((b) - (a) - 1 < MAX_WINDOW)
+#define SEQ_GE(a, b)			((a) - (b) < MAX_WINDOW)
+#define SEQ_GT(a, b)			((a) - (b) - 1 < MAX_WINDOW)
+
+#define FIN		(1 << 0)
+#define SYN		(1 << 1)
+#define RST		(1 << 2)
+#define ACK		(1 << 4)
+
+/* Flags for internal usage */
+#define DUP_ACK		(1 << 5)
+#define OPT_EOL		0
+#define OPT_NOP		1
+#define OPT_MSS		2
+#define OPT_MSS_LEN	4
+#define OPT_WS		3
+#define OPT_WS_LEN	3
+#define OPT_SACKP	4
+#define OPT_SACK	5
+#define OPT_TS		8
+
+#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
+#define CONN_V6(conn)		(!CONN_V4(conn))
+
+extern char tcp_buf_discard[MAX_WINDOW];
+
+void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
+		  unsigned long flag);
+#define conn_flag(c, conn, flag)					\
+	do {								\
+		flow_trace(conn, "flag at %s:%i", __func__, __LINE__);	\
+		conn_flag_do(c, conn, flag);				\
+	} while (0)
+
+
+void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
+		   unsigned long event);
+#define conn_event(c, conn, event)					\
+	do {								\
+		flow_trace(conn, "event at %s:%i", __func__, __LINE__);	\
+		conn_event_do(c, conn, event);				\
+	} while (0)
+
+void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
+#define tcp_rst(c, conn)						\
+	do {								\
+		flow_dbg((conn), "TCP reset at %s:%i", __func__, __LINE__); \
+		tcp_rst_do(c, conn);					\
+	} while (0)
+
+
+size_t ipv4_fill_headers(const struct ctx *c,
+			 const struct tcp_tap_conn *conn,
+			 struct iphdr *iph, size_t plen,
+			 const uint16_t *check, uint32_t seq);
+size_t ipv6_fill_headers(const struct ctx *c,
+			 const struct tcp_tap_conn *conn,
+			 struct ipv6hdr *ip6h, size_t plen,
+			 uint32_t seq);
+
+int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
+			  int force_seq, struct tcp_info *tinfo);
+int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
+		     struct tcphdr *th, char *data, size_t optlen);
+
+#endif /* TCP_INTERNAL_H */
-- 
2.42.0
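
The SEQ_*() macros in tcp_internal.h rely on unsigned 32-bit wraparound, which is easy to misread. The following is a minimal sketch, not part of the patch, that copies the macros and exercises them across a sequence number wrap. The helper already_sent_clamped() and the function seq_demo() are hypothetical names used only for illustration; the helper mirrors the "ACK sequence gap" check in tcp_buf_data_from_sock() under the assumption that uint32_t is unsigned int.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WS		8
#define MAX_WINDOW	(1 << (16 + (MAX_WS)))

/* Copied from tcp_internal.h above; operands must be unsigned 32-bit */
#define SEQ_LE(a, b)	((b) - (a) < MAX_WINDOW)
#define SEQ_LT(a, b)	((b) - (a) - 1 < MAX_WINDOW)
#define SEQ_GE(a, b)	((a) - (b) < MAX_WINDOW)
#define SEQ_GT(a, b)	((a) - (b) - 1 < MAX_WINDOW)

/* Hypothetical helper: bytes sent but not yet acknowledged, clamped to
 * zero when the ACK is ahead of what we think we sent.
 */
uint32_t already_sent_clamped(uint32_t seq_to_tap, uint32_t seq_ack_from_tap)
{
	uint32_t already_sent = seq_to_tap - seq_ack_from_tap;

	/* True when already_sent wrapped to a "small negative" value */
	if (SEQ_LT(already_sent, 0))
		already_sent = 0;

	return already_sent;
}

/* Illustration only: comparisons stay correct across the 2^32 wrap */
void seq_demo(void)
{
	uint32_t before_wrap = 0xfffffff0u, after_wrap = 0x10u;

	assert(SEQ_LT(before_wrap, after_wrap));
	assert(!SEQ_LT(after_wrap, before_wrap));
	assert(SEQ_GT(after_wrap, before_wrap));
	assert(SEQ_LE(after_wrap, after_wrap));
	assert(!SEQ_LT(after_wrap, after_wrap));

	assert(already_sent_clamped(1100, 1000) == 100);
	assert(already_sent_clamped(1000, 1100) == 0);
}
```

The comparisons only give meaningful answers when the two values are less than MAX_WINDOW apart, which is the largest window the series allows with MAX_WS set to 8.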


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 12/24] tap: make tap_update_mac() generic
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (10 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 11/24] tcp: move buffers management functions to their own file Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  1:49   ` David Gibson
  2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

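Take a struct ethhdr pointer rather than a struct tap_hdr pointer, and
rename the function to eth_update_mac(), so that callers whose buffers
have no tap_hdr can use it too.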

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
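
To show why taking the Ethernet header directly makes the helper generic, here is a minimal sketch that is not part of the patch. All type and function names ending in _sketch, and eth_demo(), are hypothetical stand-ins: the real code uses struct ethhdr from the kernel headers, and the two frame layouts below are assumptions chosen only to show one buffer with a length prefix and one without.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN_SKETCH	6

/* Local stand-in for struct ethhdr */
struct eth_hdr_sketch {
	unsigned char h_dest[ETH_ALEN_SKETCH];
	unsigned char h_source[ETH_ALEN_SKETCH];
	unsigned short h_proto;
};

/* Assumed layout: Ethernet header preceded by a length field */
struct tap_frame_sketch {
	unsigned int vnet_len;
	struct eth_hdr_sketch eh;
};

/* Assumed layout: Ethernet header with nothing in front of it */
struct vu_frame_sketch {
	struct eth_hdr_sketch eh;
};

/* Same logic as eth_update_mac() in the patch: NULL means "unchanged" */
void eth_update_mac_sketch(struct eth_hdr_sketch *eh,
			   const unsigned char *eth_d,
			   const unsigned char *eth_s)
{
	if (eth_d)
		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
	if (eth_s)
		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
}

/* Illustration only: one helper serves both buffer layouts */
void eth_demo(void)
{
	const unsigned char guest[ETH_ALEN_SKETCH] =
		{ 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 };
	struct tap_frame_sketch tap;
	struct vu_frame_sketch vu;

	memset(&tap, 0, sizeof(tap));
	memset(&vu, 0, sizeof(vu));

	eth_update_mac_sketch(&tap.eh, guest, NULL);
	eth_update_mac_sketch(&vu.eh, guest, NULL);

	assert(!memcmp(tap.eh.h_dest, guest, ETH_ALEN_SKETCH));
	assert(!memcmp(vu.eh.h_dest, guest, ETH_ALEN_SKETCH));
	assert(tap.eh.h_source[0] == 0);	/* source left untouched */
}
```

With the old signature, only the first layout could have been passed in; with the new one the caller picks the header out of whatever buffer it owns.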
 tap.c     | 6 +++---
 tap.h     | 2 +-
 tcp_buf.c | 8 ++++----
 udp.c     | 4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/tap.c b/tap.c
index 3ea03f720d6d..29f389057ac1 100644
--- a/tap.c
+++ b/tap.c
@@ -447,13 +447,13 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
  * @eth_d:	Ethernet destination address, NULL if unchanged
  * @eth_s:	Ethernet source address, NULL if unchanged
  */
-void tap_update_mac(struct tap_hdr *taph,
+void eth_update_mac(struct ethhdr *eh,
 		    const unsigned char *eth_d, const unsigned char *eth_s)
 {
 	if (eth_d)
-		memcpy(taph->eh.h_dest, eth_d, sizeof(taph->eh.h_dest));
+		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
 	if (eth_s)
-		memcpy(taph->eh.h_source, eth_s, sizeof(taph->eh.h_source));
+		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
 }
 
 PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
diff --git a/tap.h b/tap.h
index 466d91466c3d..437b9aa2b43f 100644
--- a/tap.h
+++ b/tap.h
@@ -74,7 +74,7 @@ void tap_icmp6_send(const struct ctx *c,
 		    const void *in, size_t len);
 int tap_send(const struct ctx *c, const void *data, size_t len);
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n);
-void tap_update_mac(struct tap_hdr *taph,
+void eth_update_mac(struct ethhdr *eh,
 		    const unsigned char *eth_d, const unsigned char *eth_s);
 void tap_listen_handler(struct ctx *c, uint32_t events);
 void tap_handler_pasta(struct ctx *c, uint32_t events,
diff --git a/tcp_buf.c b/tcp_buf.c
index d70e7f810e4a..4c1f00c1d1b2 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -218,10 +218,10 @@ void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
 		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
 		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
 
-		tap_update_mac(&b4->taph, eth_d, eth_s);
-		tap_update_mac(&b6->taph, eth_d, eth_s);
-		tap_update_mac(&b4f->taph, eth_d, eth_s);
-		tap_update_mac(&b6f->taph, eth_d, eth_s);
+		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
+		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
+		eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
+		eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
 	}
 }
 
diff --git a/udp.c b/udp.c
index 96b4e6ca9a85..db635742319b 100644
--- a/udp.c
+++ b/udp.c
@@ -283,8 +283,8 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 		struct udp4_l2_buf_t *b4 = &udp4_l2_buf[i];
 		struct udp6_l2_buf_t *b6 = &udp6_l2_buf[i];
 
-		tap_update_mac(&b4->taph, eth_d, eth_s);
-		tap_update_mac(&b6->taph, eth_d, eth_s);
+		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
+		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
 	}
 }
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (11 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 12/24] tap: make tap_update_mac() generic Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-02 14:29   ` Laurent Vivier
                     ` (2 more replies)
  2024-02-02 14:11 ` [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX() Laurent Vivier
                   ` (10 subsequent siblings)
  23 siblings, 3 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

From: Laurent Vivier <laurent@vivier.eu>

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
---
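
The new packet_add_all_do() sorts incoming frames into the IPv4 or IPv6 pool by EtherType. The following is a minimal sketch of that dispatch only, not part of the patch: it leaves out the pcap() call and the guest MAC update, and folds the minimum length check (done by the callers in the patch) into the classifier. The names frame_pool(), dispatch_demo(), enum pool_sketch and the *_SKETCH constants are hypothetical; the EtherType values are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_SKETCH		14
#define ETH_P_IP_SKETCH		0x0800
#define ETH_P_ARP_SKETCH	0x0806
#define ETH_P_IPV6_SKETCH	0x86DD

enum pool_sketch { POOL_NONE, POOL_TAP4, POOL_TAP6 };

/* Pick the pool a frame would be queued to, or POOL_NONE to drop it */
enum pool_sketch frame_pool(const unsigned char *p, size_t len)
{
	uint16_t proto;

	if (len < ETH_HLEN_SKETCH)
		return POOL_NONE;

	/* EtherType is big-endian, at offset 12 of the Ethernet header */
	proto = (uint16_t)((p[12] << 8) | p[13]);

	switch (proto) {
	case ETH_P_ARP_SKETCH:
	case ETH_P_IP_SKETCH:
		return POOL_TAP4;
	case ETH_P_IPV6_SKETCH:
		return POOL_TAP6;
	default:
		return POOL_NONE;
	}
}

/* Illustration only */
void dispatch_demo(void)
{
	unsigned char frame[ETH_HLEN_SKETCH] = { 0 };

	frame[12] = 0x08; frame[13] = 0x00;
	assert(frame_pool(frame, sizeof(frame)) == POOL_TAP4);

	frame[12] = 0x08; frame[13] = 0x06;
	assert(frame_pool(frame, sizeof(frame)) == POOL_TAP4);

	frame[12] = 0x86; frame[13] = 0xdd;
	assert(frame_pool(frame, sizeof(frame)) == POOL_TAP6);

	assert(frame_pool(frame, ETH_HLEN_SKETCH - 1) == POOL_NONE);
}
```

ARP frames share the IPv4 pool, as in the switch statement of the patch, and unknown EtherTypes are silently dropped.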
 tap.c | 98 +++++++++++++++++++++++++++++------------------------------
 tap.h |  7 +++++
 2 files changed, 56 insertions(+), 49 deletions(-)

diff --git a/tap.c b/tap.c
index 29f389057ac1..5b1b61550c13 100644
--- a/tap.c
+++ b/tap.c
@@ -911,6 +911,45 @@ append:
 	return in->count;
 }
 
+void pool_flush_all(void)
+{
+	pool_flush(pool_tap4);
+	pool_flush(pool_tap6);
+}
+
+void tap_handler_all(struct ctx *c, const struct timespec *now)
+{
+	tap4_handler(c, pool_tap4, now);
+	tap6_handler(c, pool_tap6, now);
+}
+
+void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
+		       const char *func, int line)
+{
+	const struct ethhdr *eh;
+
+	pcap(p, len);
+
+	eh = (struct ethhdr *)p;
+
+	if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
+		memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
+		proto_update_l2_buf(c->mac_guest, NULL);
+	}
+
+	switch (ntohs(eh->h_proto)) {
+	case ETH_P_ARP:
+	case ETH_P_IP:
+		packet_add_do(pool_tap4, len, p, func, line);
+		break;
+	case ETH_P_IPV6:
+		packet_add_do(pool_tap6, len, p, func, line);
+		break;
+	default:
+		break;
+	}
+}
+
 /**
  * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
  * @c:		Execution context
@@ -937,7 +976,6 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
-	const struct ethhdr *eh;
 	ssize_t n, rem;
 	char *p;
 
@@ -950,8 +988,7 @@ redo:
 	p = pkt_buf;
 	rem = 0;
 
-	pool_flush(pool_tap4);
-	pool_flush(pool_tap6);
+	pool_flush_all();
 
 	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
 	if (n < 0) {
@@ -978,37 +1015,18 @@ redo:
 		/* Complete the partial read above before discarding a malformed
 		 * frame, otherwise the stream will be inconsistent.
 		 */
-		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU)
+		if (len < (ssize_t)sizeof(struct ethhdr) ||
+		    len > (ssize_t)ETH_MAX_MTU)
 			goto next;
 
-		pcap(p, len);
-
-		eh = (struct ethhdr *)p;
-
-		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
-			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
-			proto_update_l2_buf(c->mac_guest, NULL);
-		}
-
-		switch (ntohs(eh->h_proto)) {
-		case ETH_P_ARP:
-		case ETH_P_IP:
-			packet_add(pool_tap4, len, p);
-			break;
-		case ETH_P_IPV6:
-			packet_add(pool_tap6, len, p);
-			break;
-		default:
-			break;
-		}
+		packet_add_all(c, len, p);
 
 next:
 		p += len;
 		n -= len;
 	}
 
-	tap4_handler(c, pool_tap4, now);
-	tap6_handler(c, pool_tap6, now);
+	tap_handler_all(c, now);
 
 	/* We can't use EPOLLET otherwise. */
 	if (rem)
@@ -1033,35 +1051,18 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 redo:
 	n = 0;
 
-	pool_flush(pool_tap4);
-	pool_flush(pool_tap6);
+	pool_flush_all();
 restart:
 	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
-		const struct ethhdr *eh = (struct ethhdr *)(pkt_buf + n);
 
-		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU) {
+		if (len < (ssize_t)sizeof(struct ethhdr) ||
+		    len > (ssize_t)ETH_MAX_MTU) {
 			n += len;
 			continue;
 		}
 
-		pcap(pkt_buf + n, len);
 
-		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
-			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
-			proto_update_l2_buf(c->mac_guest, NULL);
-		}
-
-		switch (ntohs(eh->h_proto)) {
-		case ETH_P_ARP:
-		case ETH_P_IP:
-			packet_add(pool_tap4, len, pkt_buf + n);
-			break;
-		case ETH_P_IPV6:
-			packet_add(pool_tap6, len, pkt_buf + n);
-			break;
-		default:
-			break;
-		}
+		packet_add_all(c, len, pkt_buf + n);
 
 		if ((n += len) == TAP_BUF_BYTES)
 			break;
@@ -1072,8 +1073,7 @@ restart:
 
 	ret = errno;
 
-	tap4_handler(c, pool_tap4, now);
-	tap6_handler(c, pool_tap6, now);
+	tap_handler_all(c, now);
 
 	if (len > 0 || ret == EAGAIN)
 		return;
diff --git a/tap.h b/tap.h
index 437b9aa2b43f..7157ef37ee6e 100644
--- a/tap.h
+++ b/tap.h
@@ -82,5 +82,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 void tap_sock_init(struct ctx *c);
+void pool_flush_all(void);
+void tap_handler_all(struct ctx *c, const struct timespec *now);
+
+void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
+		       const char *func, int line);
+#define packet_add_all(c, len, p)					\
+	packet_add_all_do(c, len, p, __func__, __LINE__)
 
 #endif /* TAP_H */
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (12 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  1:59   ` David Gibson
  2024-02-11 23:16   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler() Laurent Vivier
                   ` (9 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
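
After this change, udp_update_hdr4() and udp_update_hdr6() locate the UDP header right behind the IP header they are given and return the L3 length, leaving the L2 length to the caller via tap_iov_len(). The following is a minimal sketch of the IPv4 length arithmetic only, not part of the patch. The struct definitions and the names udp4_fill_lengths() and udp_len_demo() are hypothetical stand-ins, and the sketch assumes an IPv4 header without options so that the UDP header starts at iph + 1.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-in for struct iphdr, 20 bytes without options */
struct ip4_hdr_sketch {
	uint8_t ver_ihl, tos;
	uint16_t tot_len, id, frag_off;
	uint8_t ttl, protocol;
	uint16_t check;
	uint32_t saddr, daddr;
};

/* Local stand-in for struct udphdr, 8 bytes */
struct udp_hdr_sketch {
	uint16_t source, dest, len, check;
};

/* Fill the two length fields and return the IP datagram length */
size_t udp4_fill_lengths(struct ip4_hdr_sketch *iph, size_t data_len)
{
	struct udp_hdr_sketch *uh = (struct udp_hdr_sketch *)(iph + 1);
	size_t ip_len = data_len + sizeof(*iph) + sizeof(*uh);

	iph->tot_len = htons((uint16_t)ip_len);
	uh->len = htons((uint16_t)(data_len + sizeof(*uh)));

	return ip_len;
}

/* Illustration only: 100 bytes of payload in a heap-allocated buffer */
void udp_len_demo(void)
{
	size_t data_len = 100;
	struct ip4_hdr_sketch *iph;
	struct udp_hdr_sketch *uh;

	iph = calloc(1, sizeof(*iph) + sizeof(*uh) + data_len);
	assert(iph);
	uh = (struct udp_hdr_sketch *)(iph + 1);

	assert(udp4_fill_lengths(iph, data_len) == 128);
	assert(ntohs(iph->tot_len) == 128);
	assert(ntohs(uh->len) == 108);

	free(iph);
}
```

Because the function no longer indexes udp4_l2_buf[] or udp4_l2_mh_sock[] itself, the same arithmetic can be applied to a header located in any other buffer.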
 udp.c | 126 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 73 insertions(+), 53 deletions(-)

diff --git a/udp.c b/udp.c
index db635742319b..77168fb0a2af 100644
--- a/udp.c
+++ b/udp.c
@@ -562,47 +562,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
  *
  * Return: size of tap frame with headers
  */
-static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
-			      const struct timespec *now)
+static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
+			      size_t data_len, struct sockaddr_in *s_in,
+			      in_port_t dstport, const struct timespec *now)
 {
-	struct udp4_l2_buf_t *b = &udp4_l2_buf[n];
+	struct udphdr *uh = (struct udphdr *)(iph + 1);
 	in_port_t src_port;
 	size_t ip_len;
 
-	ip_len = udp4_l2_mh_sock[n].msg_len + sizeof(b->iph) + sizeof(b->uh);
+	ip_len = data_len + sizeof(struct iphdr) + sizeof(struct udphdr);
 
-	b->iph.tot_len = htons(ip_len);
-	b->iph.daddr = c->ip4.addr_seen.s_addr;
+	iph->tot_len = htons(ip_len);
+	iph->daddr = c->ip4.addr_seen.s_addr;
 
-	src_port = ntohs(b->s_in.sin_port);
+	src_port = ntohs(s_in->sin_port);
 
 	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) &&
-	    IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.dns_host) &&
+	    IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.dns_host) &&
 	    src_port == 53) {
-		b->iph.saddr = c->ip4.dns_match.s_addr;
-	} else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) ||
-		   IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)||
-		   IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) {
-		b->iph.saddr = c->ip4.gw.s_addr;
+		iph->saddr = c->ip4.dns_match.s_addr;
+	} else if (IN4_IS_ADDR_LOOPBACK(&s_in->sin_addr) ||
+		   IN4_IS_ADDR_UNSPECIFIED(&s_in->sin_addr)||
+		   IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.addr_seen)) {
+		iph->saddr = c->ip4.gw.s_addr;
 		udp_tap_map[V4][src_port].ts = now->tv_sec;
 		udp_tap_map[V4][src_port].flags |= PORT_LOCAL;
 
-		if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen))
+		if (IN4_ARE_ADDR_EQUAL(&s_in->sin_addr.s_addr, &c->ip4.addr_seen))
 			udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK;
 		else
 			udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK;
 
 		bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port);
 	} else {
-		b->iph.saddr = b->s_in.sin_addr.s_addr;
+		iph->saddr = s_in->sin_addr.s_addr;
 	}
 
-	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
-	b->uh.source = b->s_in.sin_port;
-	b->uh.dest = htons(dstport);
-	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));
+	iph->check = ipv4_hdr_checksum(iph, IPPROTO_UDP);
+	uh->source = s_in->sin_port;
+	uh->dest = htons(dstport);
+	uh->len = htons(data_len + sizeof(struct udphdr));
 
-	return tap_iov_len(c, &b->taph, ip_len);
+	return ip_len;
 }
 
 /**
@@ -614,38 +615,39 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
  *
  * Return: size of tap frame with headers
  */
-static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
-			      const struct timespec *now)
+static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
+			      size_t data_len, struct sockaddr_in6 *s_in6,
+			      in_port_t dstport, const struct timespec *now)
 {
-	struct udp6_l2_buf_t *b = &udp6_l2_buf[n];
+	struct udphdr *uh = (struct udphdr *)(ip6h + 1);
 	struct in6_addr *src;
 	in_port_t src_port;
 	size_t ip_len;
 
-	src = &b->s_in6.sin6_addr;
-	src_port = ntohs(b->s_in6.sin6_port);
+	src = &s_in6->sin6_addr;
+	src_port = ntohs(s_in6->sin6_port);
 
-	ip_len = udp6_l2_mh_sock[n].msg_len + sizeof(b->ip6h) + sizeof(b->uh);
+	ip_len = data_len + sizeof(struct ipv6hdr) + sizeof(struct udphdr);
 
-	b->ip6h.payload_len = htons(udp6_l2_mh_sock[n].msg_len + sizeof(b->uh));
+	ip6h->payload_len = htons(data_len + sizeof(struct udphdr));
 
 	if (IN6_IS_ADDR_LINKLOCAL(src)) {
-		b->ip6h.daddr = c->ip6.addr_ll_seen;
-		b->ip6h.saddr = b->s_in6.sin6_addr;
+		ip6h->daddr = c->ip6.addr_ll_seen;
+		ip6h->saddr = s_in6->sin6_addr;
 	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) &&
 		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) &&
 		   src_port == 53) {
-		b->ip6h.daddr = c->ip6.addr_seen;
-		b->ip6h.saddr = c->ip6.dns_match;
+		ip6h->daddr = c->ip6.addr_seen;
+		ip6h->saddr = c->ip6.dns_match;
 	} else if (IN6_IS_ADDR_LOOPBACK(src)			||
 		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen)	||
 		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) {
-		b->ip6h.daddr = c->ip6.addr_ll_seen;
+		ip6h->daddr = c->ip6.addr_ll_seen;
 
 		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-			b->ip6h.saddr = c->ip6.gw;
+			ip6h->saddr = c->ip6.gw;
 		else
-			b->ip6h.saddr = c->ip6.addr_ll;
+			ip6h->saddr = c->ip6.addr_ll;
 
 		udp_tap_map[V6][src_port].ts = now->tv_sec;
 		udp_tap_map[V6][src_port].flags |= PORT_LOCAL;
@@ -662,20 +664,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
 
 		bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port);
 	} else {
-		b->ip6h.daddr = c->ip6.addr_seen;
-		b->ip6h.saddr = b->s_in6.sin6_addr;
+		ip6h->daddr = c->ip6.addr_seen;
+		ip6h->saddr = s_in6->sin6_addr;
 	}
 
-	b->uh.source = b->s_in6.sin6_port;
-	b->uh.dest = htons(dstport);
-	b->uh.len = b->ip6h.payload_len;
-	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
-			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
-	b->ip6h.version = 6;
-	b->ip6h.nexthdr = IPPROTO_UDP;
-	b->ip6h.hop_limit = 255;
+	uh->source = s_in6->sin6_port;
+	uh->dest = htons(dstport);
+	uh->len = ip6h->payload_len;
+	uh->check = csum(uh, ntohs(ip6h->payload_len),
+			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
+	ip6h->version = 6;
+	ip6h->nexthdr = IPPROTO_UDP;
+	ip6h->hop_limit = 255;
 
-	return tap_iov_len(c, &b->taph, ip_len);
+	return ip_len;
 }
 
 /**
@@ -689,6 +691,11 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
  *
  * Return: size of tap frame with headers
  */
+#pragma GCC diagnostic push
+/* ignore unaligned pointer value warning for &udp6_l2_buf[i].ip6h and
+ * &udp4_l2_buf[i].iph
+ */
+#pragma GCC diagnostic ignored "-Waddress-of-packed-member"
 static void udp_tap_send(const struct ctx *c,
 			 unsigned int start, unsigned int n,
 			 in_port_t dstport, bool v6, const struct timespec *now)
@@ -702,18 +709,31 @@ static void udp_tap_send(const struct ctx *c,
 		tap_iov = udp4_l2_iov_tap;
 
 	for (i = start; i < start + n; i++) {
-		size_t buf_len;
-
-		if (v6)
-			buf_len = udp_update_hdr6(c, i, dstport, now);
-		else
-			buf_len = udp_update_hdr4(c, i, dstport, now);
-
-		tap_iov[i].iov_len = buf_len;
+		size_t ip_len;
+
+		if (v6) {
+			ip_len = udp_update_hdr6(c, &udp6_l2_buf[i].ip6h,
+						 udp6_l2_mh_sock[i].msg_len,
+						 &udp6_l2_buf[i].s_in6, dstport,
+						 now);
+			tap_iov[i].iov_len = tap_iov_len(c,
+							 &udp6_l2_buf[i].taph,
+							 ip_len);
+		} else {
+			ip_len = udp_update_hdr4(c, &udp4_l2_buf[i].iph,
+						 udp4_l2_mh_sock[i].msg_len,
+						 &udp4_l2_buf[i].s_in,
+						 dstport, now);
+
+			tap_iov[i].iov_len = tap_iov_len(c,
+							 &udp4_l2_buf[i].taph,
+							 ip_len);
+		}
 	}
 
 	tap_send_frames(c, tap_iov + start, n);
 }
+#pragma GCC diagnostic pop
 
 /**
  * udp_sock_handler() - Handle new data from socket
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (13 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  2:14   ` David Gibson
  2024-02-02 14:11 ` [PATCH 16/24] packet: replace struct desc by struct iovec Laurent Vivier
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 passt.c | 2 +-
 udp.c   | 6 +++---
 udp.h   | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/passt.c b/passt.c
index 10042a9b9789..c70caf464e61 100644
--- a/passt.c
+++ b/passt.c
@@ -389,7 +389,7 @@ loop:
 			tcp_timer_handler(&c, ref);
 			break;
 		case EPOLL_TYPE_UDP:
-			udp_sock_handler(&c, ref, eventmask, &now);
+			udp_buf_sock_handler(&c, ref, eventmask, &now);
 			break;
 		case EPOLL_TYPE_ICMP:
 			icmp_sock_handler(&c, AF_INET, ref);
diff --git a/udp.c b/udp.c
index 77168fb0a2af..9c56168c6340 100644
--- a/udp.c
+++ b/udp.c
@@ -736,7 +736,7 @@ static void udp_tap_send(const struct ctx *c,
 #pragma GCC diagnostic pop
 
 /**
- * udp_sock_handler() - Handle new data from socket
+ * udp_buf_sock_handler() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -744,8 +744,8 @@ static void udp_tap_send(const struct ctx *c,
  *
  * #syscalls recvmmsg
  */
-void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
-		      const struct timespec *now)
+void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
+			  const struct timespec *now)
 {
 	/* For not entirely clear reasons (data locality?) pasta gets
 	 * better throughput if we receive tap datagrams one at a
diff --git a/udp.h b/udp.h
index 087e4820f93c..6c8519e87f1a 100644
--- a/udp.h
+++ b/udp.h
@@ -9,8 +9,8 @@
 #define UDP_TIMER_INTERVAL		1000 /* ms */
 
 void udp_portmap_clear(void);
-void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
-		      const struct timespec *now);
+void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
+			  const struct timespec *now);
 int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
 		    const void *saddr, const void *daddr,
 		    const struct pool *p, int idx, const struct timespec *now);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 16/24] packet: replace struct desc by struct iovec
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (14 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  2:25   ` David Gibson
  2024-02-02 14:11 ` [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST Laurent Vivier
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 packet.c | 75 +++++++++++++++++++++++++++++++-------------------------
 packet.h | 14 ++---------
 2 files changed, 43 insertions(+), 46 deletions(-)

diff --git a/packet.c b/packet.c
index ccfc84607709..af2a539a1794 100644
--- a/packet.c
+++ b/packet.c
@@ -22,6 +22,36 @@
 #include "util.h"
 #include "log.h"
 
+static int packet_check_range(const struct pool *p, size_t offset, size_t len,
+			      const char *start, const char *func, int line)
+{
+	if (start < p->buf) {
+		if (func) {
+			trace("add packet start %p before buffer start %p, "
+			      "%s:%i", (void *)start, (void *)p->buf, func, line);
+		}
+		return -1;
+	}
+
+	if (start + len + offset > p->buf + p->buf_size) {
+		if (func) {
+			trace("packet offset plus length %zu from size %zu, "
+			      "%s:%i", start - p->buf + len + offset,
+			      p->buf_size, func, line);
+		}
+		return -1;
+	}
+
+#if UINTPTR_MAX == UINT64_MAX
+	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
+		trace("add packet start %p, buffer start %p, %s:%i",
+		      (void *)start, (void *)p->buf, func, line);
+		return -1;
+	}
+#endif
+
+	return 0;
+}
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -41,34 +71,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (start < p->buf) {
-		trace("add packet start %p before buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
+	if (packet_check_range(p, 0, len, start, func, line))
 		return;
-	}
-
-	if (start + len > p->buf + p->buf_size) {
-		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
-		      (void *)start, len, (void *)(p->buf + p->buf_size),
-		      func, line);
-		return;
-	}
 
 	if (len > UINT16_MAX) {
 		trace("add packet length %zu, %s:%i", len, func, line);
 		return;
 	}
 
-#if UINTPTR_MAX == UINT64_MAX
-	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
-		trace("add packet start %p, buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
-		return;
-	}
-#endif
-
-	p->pkt[idx].offset = start - p->buf;
-	p->pkt[idx].len = len;
+	p->pkt[idx].iov_base = (void *)start;
+	p->pkt[idx].iov_len = len;
 
 	p->count++;
 }
@@ -104,28 +116,23 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (p->pkt[idx].offset + len + offset > p->buf_size) {
+	if (len + offset > p->pkt[idx].iov_len) {
 		if (func) {
-			trace("packet offset plus length %zu from size %zu, "
-			      "%s:%i", p->pkt[idx].offset + len + offset,
-			      p->buf_size, func, line);
+			trace("data length %zu, offset %zu from length %zu, "
+			      "%s:%i", len, offset, p->pkt[idx].iov_len,
+			      func, line);
 		}
 		return NULL;
 	}
 
-	if (len + offset > p->pkt[idx].len) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %u, "
-			      "%s:%i", len, offset, p->pkt[idx].len,
-			      func, line);
-		}
+	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
+			       func, line))
 		return NULL;
-	}
 
 	if (left)
-		*left = p->pkt[idx].len - offset - len;
+		*left = p->pkt[idx].iov_len - offset - len;
 
-	return p->buf + p->pkt[idx].offset + offset;
+	return (char *)p->pkt[idx].iov_base + offset;
 }
 
 /**
diff --git a/packet.h b/packet.h
index a784b07bbed5..8377dcf678bb 100644
--- a/packet.h
+++ b/packet.h
@@ -6,16 +6,6 @@
 #ifndef PACKET_H
 #define PACKET_H
 
-/**
- * struct desc - Generic offset-based descriptor within buffer
- * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
- * @len:	Length of descriptor, host order, 16-bit limit
- */
-struct desc {
-	uint32_t offset;
-	uint16_t len;
-};
-
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors
@@ -29,7 +19,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct desc pkt[1];
+	struct iovec pkt[1];
 };
 
 void packet_add_do(struct pool *p, size_t len, const char *start,
@@ -54,7 +44,7 @@ struct _name ## _t {							\
 	size_t buf_size;						\
 	size_t size;							\
 	size_t count;							\
-	struct desc pkt[_size];						\
+	struct iovec pkt[_size];					\
 }
 
 #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (15 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 16/24] packet: replace struct desc by struct iovec Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  2:29   ` David Gibson
  2024-02-02 14:11 ` [PATCH 18/24] vhost-user: introduce virtio API Laurent Vivier
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

As we are going to introduce MODE_VU, which will act like
MODE_PASST, compare against MODE_PASTA rather than adding a
comparison to MODE_VU wherever we check for MODE_PASST.
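
The reasoning can be illustrated with a minimal, self-contained sketch.
The enum and the helper frames_per_batch() below are hypothetical
stand-ins, not code from this series; only the names MODE_PASST,
MODE_PASTA and MODE_VU come from it, and the value 128 mirrors
TCP_FRAMES_MEM.

```c
#include <stddef.h>

/* Hypothetical stand-in for the mode enum (assumption for illustration) */
enum mode { MODE_PASST, MODE_PASTA, MODE_VU };

/* Mirrors TCP_FRAMES_MEM from tcp_buf.c */
#define FRAMES_MEM 128

/* Hypothetical helper: test for the one mode that behaves differently,
 * so that any passt-like mode added later takes the same branch as
 * MODE_PASST without touching this check.
 */
static size_t frames_per_batch(enum mode m)
{
	return m == MODE_PASTA ? 1 : FRAMES_MEM;
}
```

With the previous form, (m == MODE_PASST ? FRAMES_MEM : 1), MODE_VU
would have fallen into the pasta branch unless each check were extended.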

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 conf.c      | 12 ++++++------
 isolation.c | 10 +++++-----
 passt.c     |  2 +-
 tap.c       | 12 ++++++------
 tcp_buf.c   |  2 +-
 udp.c       |  2 +-
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/conf.c b/conf.c
index 93bfda331349..b6a2a1f0fdc3 100644
--- a/conf.c
+++ b/conf.c
@@ -147,7 +147,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
 		if (fwd->mode)
 			goto mode_conflict;
 
-		if (c->mode != MODE_PASST)
+		if (c->mode == MODE_PASTA)
 			die("'all' port forwarding is only allowed for passt");
 
 		fwd->mode = FWD_ALL;
@@ -1240,7 +1240,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			c->no_dhcp_dns = 0;
 			break;
 		case 6:
-			if (c->mode != MODE_PASST)
+			if (c->mode == MODE_PASTA)
 				die("--no-dhcp-dns is for passt mode only");
 
 			c->no_dhcp_dns = 1;
@@ -1252,7 +1252,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			c->no_dhcp_dns_search = 0;
 			break;
 		case 8:
-			if (c->mode != MODE_PASST)
+			if (c->mode == MODE_PASTA)
 				die("--no-dhcp-search is for passt mode only");
 
 			c->no_dhcp_dns_search = 1;
@@ -1307,7 +1307,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			break;
 		case 14:
 			fprintf(stdout,
-				c->mode == MODE_PASST ? "passt " : "pasta ");
+				c->mode == MODE_PASTA ? "pasta " : "passt ");
 			fprintf(stdout, VERSION_BLOB);
 			exit(EXIT_SUCCESS);
 		case 15:
@@ -1610,7 +1610,7 @@ void conf(struct ctx *c, int argc, char **argv)
 			v6_only = true;
 			break;
 		case '1':
-			if (c->mode != MODE_PASST)
+			if (c->mode == MODE_PASTA)
 				die("--one-off is for passt mode only");
 
 			if (c->one_off)
@@ -1657,7 +1657,7 @@ void conf(struct ctx *c, int argc, char **argv)
 	conf_ugid(runas, &uid, &gid);
 
 	if (logfile) {
-		logfile_init(c->mode == MODE_PASST ? "passt" : "pasta",
+		logfile_init(c->mode == MODE_PASTA ? "pasta" : "passt",
 			     logfile, logsize);
 	}
 
diff --git a/isolation.c b/isolation.c
index f394e93b8526..ca2c68b52ec7 100644
--- a/isolation.c
+++ b/isolation.c
@@ -312,7 +312,7 @@ int isolate_prefork(const struct ctx *c)
 	 * PID namespace. For passt, use CLONE_NEWPID anyway, in case somebody
 	 * ever gets around seccomp profiles -- there's no harm in passing it.
 	 */
-	if (!c->foreground || c->mode == MODE_PASST)
+	if (!c->foreground || c->mode != MODE_PASTA)
 		flags |= CLONE_NEWPID;
 
 	if (unshare(flags)) {
@@ -379,12 +379,12 @@ void isolate_postfork(const struct ctx *c)
 
 	prctl(PR_SET_DUMPABLE, 0);
 
-	if (c->mode == MODE_PASST) {
-		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
-		prog.filter = filter_passt;
-	} else {
+	if (c->mode == MODE_PASTA) {
 		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
 		prog.filter = filter_pasta;
+	} else {
+		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
+		prog.filter = filter_passt;
 	}
 
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
diff --git a/passt.c b/passt.c
index c70caf464e61..5056a49dec95 100644
--- a/passt.c
+++ b/passt.c
@@ -360,7 +360,7 @@ loop:
 		uint32_t eventmask = events[i].events;
 
 		trace("%s: epoll event on %s %i (events: 0x%08x)",
-		      c.mode == MODE_PASST ? "passt" : "pasta",
+		      c.mode == MODE_PASTA ? "pasta" : "passt",
 		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
 
 		switch (ref.type) {
diff --git a/tap.c b/tap.c
index 5b1b61550c13..ebe52247ad87 100644
--- a/tap.c
+++ b/tap.c
@@ -428,10 +428,10 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
 	if (!n)
 		return 0;
 
-	if (c->mode == MODE_PASST)
-		m = tap_send_frames_passt(c, iov, n);
-	else
+	if (c->mode == MODE_PASTA)
 		m = tap_send_frames_pasta(c, iov, n);
+	else
+		m = tap_send_frames_passt(c, iov, n);
 
 	if (m < n)
 		debug("tap: failed to send %zu frames of %zu", n - m, n);
@@ -1299,10 +1299,10 @@ void tap_sock_init(struct ctx *c)
 		return;
 	}
 
-	if (c->mode == MODE_PASST) {
+	if (c->mode == MODE_PASTA) {
+		tap_sock_tun_init(c);
+	} else {
 		if (c->fd_tap_listen == -1)
 			tap_sock_unix_init(c);
-	} else {
-		tap_sock_tun_init(c);
 	}
 }
diff --git a/tcp_buf.c b/tcp_buf.c
index 4c1f00c1d1b2..dff6802c5734 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -34,7 +34,7 @@
 
 #define TCP_FRAMES_MEM			128
 #define TCP_FRAMES							\
-	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
+	(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
 
 struct tcp4_l2_head {	/* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
 #ifdef __AVX2__
diff --git a/udp.c b/udp.c
index 9c56168c6340..a189c2e0b5a2 100644
--- a/udp.c
+++ b/udp.c
@@ -755,7 +755,7 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve
 	 * whether we'll use tap or splice, always go one at a time
 	 * for pasta mode.
 	 */
-	ssize_t n = (c->mode == MODE_PASST ? UDP_MAX_FRAMES : 1);
+	ssize_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
 	in_port_t dstport = ref.udp.port;
 	bool v6 = ref.udp.v6;
 	struct mmsghdr *mmh_recv;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 18/24] vhost-user: introduce virtio API
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (16 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-06  3:51   ` David Gibson
  2024-02-02 14:11 ` [PATCH 19/24] vhost-user: introduce vhost-user API Laurent Vivier
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.
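
As background for the notification logic, here is a minimal sketch of
the wraparound-safe check typically used when VIRTIO_RING_F_EVENT_IDX
is negotiated. The helper name need_event() is hypothetical; it is
assumed to behave like vring_need_event() from <linux/virtio_ring.h>,
which the patch calls from vring_notify().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: ring indexes are free-running 16-bit counters.
 * The used index moved from old_idx to new_idx; the driver asked to be
 * notified once the entry at event_idx has been used. Notify only if
 * event_idx lies in [old_idx, new_idx), computed modulo 2^16.
 */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

Casting each difference back to uint16_t keeps the comparison correct
when the counters wrap from 65535 to 0.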

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile |   4 +-
 util.h   |  11 ++
 virtio.c | 484 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 virtio.h | 121 ++++++++++++++
 4 files changed, 618 insertions(+), 2 deletions(-)
 create mode 100644 virtio.c
 create mode 100644 virtio.h

diff --git a/Makefile b/Makefile
index bf370b6ec2e6..ae1daa6b2b50 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
-	tcp_buf.c udp.c util.c iov.c ip.c
+	tcp_buf.c udp.c util.c iov.c ip.c virtio.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
 	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
-	util.h iov.h ip.h
+	util.h iov.h ip.h virtio.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/util.h b/util.h
index f7c3dfee9972..a80024e3b797 100644
--- a/util.h
+++ b/util.h
@@ -43,6 +43,9 @@
 #define ROUND_DOWN(x, y)	((x) & ~((y) - 1))
 #define ROUND_UP(x, y)		(((x) + (y) - 1) & ~((y) - 1))
 
+#define ALIGN_DOWN(n, m)	((n) / (m) * (m))
+#define ALIGN_UP(n, m)		ALIGN_DOWN((n) + (m) - 1, (m))
+
 #define MAX_FROM_BITS(n)	(((1U << (n)) - 1))
 
 #define BIT(n)			(1UL << (n))
@@ -110,6 +113,14 @@
 #define	htonl_constant(x)	(__bswap_constant_32(x))
 #endif
 
+#define barrier()		do { __asm__ __volatile__("" ::: "memory"); } while (0)
+#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
+#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
+#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
+
+#define smp_wmb()	smp_mb_release()
+#define smp_rmb()	smp_mb_acquire()
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
diff --git a/virtio.c b/virtio.c
new file mode 100644
index 000000000000..1edd4155eec2
--- /dev/null
+++ b/virtio.c
@@ -0,0 +1,484 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c */
+
+#include <stddef.h>
+#include <endian.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+
+#include "util.h"
+#include "virtio.h"
+
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/* Translate guest physical address to our virtual address.  */
+static void *vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr)
+{
+	unsigned int i;
+
+	if (*plen == 0) {
+		return NULL;
+	}
+
+	/* Find matching memory region.  */
+	for (i = 0; i < dev->nregions; i++) {
+		VuDevRegion *r = &dev->regions[i];
+
+		if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) {
+			if ((guest_addr + *plen) > (r->gpa + r->size)) {
+				*plen = r->gpa + r->size - guest_addr;
+			}
+			return (void *)(guest_addr - (uintptr_t)r->gpa +
+					(uintptr_t)r->mmap_addr + r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+static inline uint16_t vring_avail_flags(VuVirtq *vq)
+{
+	return le16toh(vq->vring.avail->flags);
+}
+
+static inline uint16_t vring_avail_idx(VuVirtq *vq)
+{
+	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
+
+	return vq->shadow_avail_idx;
+}
+
+static inline uint16_t vring_avail_ring(VuVirtq *vq, int i)
+{
+	return le16toh(vq->vring.avail->ring[i]);
+}
+
+static inline uint16_t vring_get_used_event(VuVirtq *vq)
+{
+	return vring_avail_ring(vq, vq->vring.num);
+}
+
+static bool virtqueue_get_head(VuDev *dev, VuVirtq *vq,
+		   unsigned int idx, unsigned int *head)
+{
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	*head = vring_avail_ring(vq, idx % vq->vring.num);
+
+	/* If their number is silly, that's a fatal mistake. */
+	if (*head >= vq->vring.num) {
+		vu_panic(dev, "Guest says index %u is available", *head);
+		return false;
+	}
+
+	return true;
+}
+
+static int
+virtqueue_read_indirect_desc(VuDev *dev, struct vring_desc *desc,
+			     uint64_t addr, size_t len)
+{
+	struct vring_desc *ori_desc;
+	uint64_t read_len;
+
+	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc))) {
+		return -1;
+	}
+
+	if (len == 0) {
+		return -1;
+	}
+
+	while (len) {
+		read_len = len;
+		ori_desc = vu_gpa_to_va(dev, &read_len, addr);
+		if (!ori_desc) {
+			return -1;
+		}
+
+		memcpy(desc, ori_desc, read_len);
+		len -= read_len;
+		addr += read_len;
+		desc += read_len / sizeof(struct vring_desc);
+	}
+
+	return 0;
+}
+
+enum {
+	VIRTQUEUE_READ_DESC_ERROR = -1,
+	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
+	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
+};
+
+static int
+virtqueue_read_next_desc(VuDev *dev, struct vring_desc *desc,
+			 int i, unsigned int max, unsigned int *next)
+{
+	/* If this descriptor says it doesn't chain, we're done. */
+	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT)) {
+		return VIRTQUEUE_READ_DESC_DONE;
+	}
+
+	/* Check they're not leading us off end of descriptors. */
+	*next = le16toh(desc[i].next);
+	/* Make sure compiler knows to grab that: we don't want it changing! */
+	smp_wmb();
+
+	if (*next >= max) {
+		vu_panic(dev, "Desc next is %u", *next);
+		return VIRTQUEUE_READ_DESC_ERROR;
+	}
+
+	return VIRTQUEUE_READ_DESC_MORE;
+}
+
+bool vu_queue_empty(VuDev *dev, VuVirtq *vq)
+{
+	if (dev->broken ||
+		!vq->vring.avail) {
+		return true;
+	}
+
+	if (vq->shadow_avail_idx != vq->last_avail_idx) {
+		return false;
+	}
+
+	return vring_avail_idx(vq) == vq->last_avail_idx;
+}
+
+static bool vring_notify(VuDev *dev, VuVirtq *vq)
+{
+	uint16_t old, new;
+	bool v;
+
+	/* We need to expose used array entries before checking used event. */
+	smp_mb();
+
+	/* Always notify when queue is empty (when the feature is acknowledged) */
+	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
+		!vq->inuse && vu_queue_empty(dev, vq)) {
+		return true;
+	}
+
+	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
+	}
+
+	v = vq->signalled_used_valid;
+	vq->signalled_used_valid = true;
+	old = vq->signalled_used;
+	new = vq->signalled_used = vq->used_idx;
+	return !v || vring_need_event(vring_get_used_event(vq), new, old);
+}
+
+void vu_queue_notify(VuDev *dev, VuVirtq *vq)
+{
+	if (dev->broken || !vq->vring.avail) {
+		return;
+	}
+
+	if (!vring_notify(dev, vq)) {
+		debug("skipped notify...");
+		return;
+	}
+
+	if (eventfd_write(vq->call_fd, 1) < 0) {
+		vu_panic(dev, "Error writing eventfd: %s", strerror(errno));
+	}
+}
+
+static inline void vring_set_avail_event(VuVirtq *vq, uint16_t val)
+{
+	uint16_t val_le = htole16(val);
+
+	if (!vq->notification) {
+		return;
+	}
+
+	memcpy(&vq->vring.used->ring[vq->vring.num], &val_le, sizeof(uint16_t));
+}
+
+static bool virtqueue_map_desc(VuDev *dev,
+			       unsigned int *p_num_sg, struct iovec *iov,
+			       unsigned int max_num_sg,
+			       uint64_t pa, size_t sz)
+{
+	unsigned num_sg = *p_num_sg;
+
+	ASSERT(num_sg <= max_num_sg);
+
+	if (!sz) {
+		vu_panic(dev, "virtio: zero sized buffers are not allowed");
+		return false;
+	}
+
+	while (sz) {
+		uint64_t len = sz;
+
+		if (num_sg == max_num_sg) {
+			vu_panic(dev, "virtio: too many descriptors in indirect table");
+			return false;
+		}
+
+		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
+		if (iov[num_sg].iov_base == NULL) {
+			vu_panic(dev, "virtio: invalid address for buffers");
+			return false;
+		}
+		iov[num_sg].iov_len = len;
+		num_sg++;
+		sz -= len;
+		pa += len;
+	}
+
+	*p_num_sg = num_sg;
+	return true;
+}
+
+static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num, unsigned char *buffer)
+{
+	VuVirtqElement *elem;
+	size_t in_sg_ofs = ALIGN_UP(sz, __alignof__(elem->in_sg[0]));
+	size_t out_sg_ofs = in_sg_ofs + in_num * sizeof(elem->in_sg[0]);
+	size_t out_sg_end = out_sg_ofs + out_num * sizeof(elem->out_sg[0]);
+
+	if (out_sg_end > 65536)
+		return NULL;
+
+	elem = (void *)buffer;
+	elem->out_num = out_num;
+	elem->in_num = in_num;
+	elem->in_sg = (struct iovec *)((uintptr_t)elem + in_sg_ofs);
+	elem->out_sg = (struct iovec *)((uintptr_t)elem + out_sg_ofs);
+	return elem;
+}
+
+static void *
+vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz, unsigned char *buffer)
+{
+	struct vring_desc *desc = vq->vring.desc;
+	uint64_t desc_addr, read_len;
+	unsigned int desc_len;
+	unsigned int max = vq->vring.num;
+	unsigned int i = idx;
+	VuVirtqElement *elem;
+	unsigned int out_num = 0, in_num = 0;
+	struct iovec iov[VIRTQUEUE_MAX_SIZE];
+	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
+	int rc;
+
+	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
+		if (le32toh(desc[i].len) % sizeof(struct vring_desc)) {
+			vu_panic(dev, "Invalid size for indirect buffer table");
+			return NULL;
+		}
+
+		/* loop over the indirect descriptor table */
+		desc_addr = le64toh(desc[i].addr);
+		desc_len = le32toh(desc[i].len);
+		max = desc_len / sizeof(struct vring_desc);
+		read_len = desc_len;
+		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
+		if (desc && read_len != desc_len) {
+			/* Failed to use zero copy */
+			desc = NULL;
+			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len)) {
+				desc = desc_buf;
+			}
+		}
+		if (!desc) {
+			vu_panic(dev, "Invalid indirect buffer table");
+			return NULL;
+		}
+		i = 0;
+	}
+
+	/* Collect all the descriptors */
+	do {
+		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
+			if (!virtqueue_map_desc(dev, &in_num, iov + out_num,
+						VIRTQUEUE_MAX_SIZE - out_num,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len))) {
+				return NULL;
+			}
+		} else {
+			if (in_num) {
+				vu_panic(dev, "Incorrect order for descriptors");
+				return NULL;
+			}
+			if (!virtqueue_map_desc(dev, &out_num, iov,
+						VIRTQUEUE_MAX_SIZE,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len))) {
+				return NULL;
+			}
+		}
+
+		/* If we've got too many, that implies a descriptor loop. */
+		if ((in_num + out_num) > max) {
+			vu_panic(dev, "Looped descriptor");
+			return NULL;
+		}
+		rc = virtqueue_read_next_desc(dev, desc, i, max, &i);
+	} while (rc == VIRTQUEUE_READ_DESC_MORE);
+
+	if (rc == VIRTQUEUE_READ_DESC_ERROR) {
+		vu_panic(dev, "read descriptor error");
+		return NULL;
+	}
+
+	/* Now copy what we have collected and mapped */
+	elem = virtqueue_alloc_element(sz, out_num, in_num, buffer);
+	if (!elem) {
+		return NULL;
+	}
+	elem->index = idx;
+	for (i = 0; i < out_num; i++) {
+		elem->out_sg[i] = iov[i];
+	}
+	for (i = 0; i < in_num; i++) {
+		elem->in_sg[i] = iov[out_num + i];
+	}
+
+	return elem;
+}
+
+void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz, unsigned char *buffer)
+{
+	unsigned int head;
+	VuVirtqElement *elem;
+
+	if (dev->broken || !vq->vring.avail) {
+		return NULL;
+	}
+
+	if (vu_queue_empty(dev, vq)) {
+		return NULL;
+	}
+	/*
+	 * Needed after vu_queue_empty(), see comment in
+	 * virtqueue_num_heads() in QEMU's libvhost-user.c.
+	 */
+	smp_rmb();
+
+	if (vq->inuse >= vq->vring.num) {
+		vu_panic(dev, "Virtqueue size exceeded");
+		return NULL;
+	}
+
+	if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
+		return NULL;
+	}
+
+	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+		vring_set_avail_event(vq, vq->last_avail_idx);
+	}
+
+	elem = vu_queue_map_desc(dev, vq, head, sz, buffer);
+
+	if (!elem) {
+		return NULL;
+	}
+
+	vq->inuse++;
+
+	return elem;
+}
+
+void vu_queue_detach_element(VuDev *dev, VuVirtq *vq,
+			     unsigned int index, size_t len)
+{
+	(void)dev;
+	(void)index;
+	(void)len;
+
+	vq->inuse--;
+	/* unmap, when DMA support is added */
+}
+
+void vu_queue_unpop(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len)
+{
+	vq->last_avail_idx--;
+	vu_queue_detach_element(dev, vq, index, len);
+}
+
+bool vu_queue_rewind(VuDev *dev, VuVirtq *vq, unsigned int num)
+{
+	(void)dev;
+	if (num > vq->inuse) {
+		return false;
+	}
+	vq->last_avail_idx -= num;
+	vq->inuse -= num;
+	return true;
+}
+
+static inline void vring_used_write(VuVirtq *vq,
+				    struct vring_used_elem *uelem, int i)
+{
+	struct vring_used *used = vq->vring.used;
+
+	used->ring[i] = *uelem;
+}
+
+void vu_queue_fill_by_index(VuDev *dev, VuVirtq *vq, unsigned int index,
+			  unsigned int len, unsigned int idx)
+{
+	struct vring_used_elem uelem;
+
+	if (dev->broken || !vq->vring.avail)
+		return;
+
+	idx = (idx + vq->used_idx) % vq->vring.num;
+
+	uelem.id = htole32(index);
+	uelem.len = htole32(len);
+	vring_used_write(vq, &uelem, idx);
+}
+
+void vu_queue_fill(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem,
+		   unsigned int len, unsigned int idx)
+{
+	vu_queue_fill_by_index(dev, vq, elem->index, len, idx);
+}
+
+static inline void vring_used_idx_set(VuVirtq *vq, uint16_t val)
+{
+	vq->vring.used->idx = htole16(val);
+
+	vq->used_idx = val;
+}
+
+void vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count)
+{
+	uint16_t old, new;
+
+	if (dev->broken ||
+		!vq->vring.avail) {
+		return;
+	}
+
+	/* Make sure buffer is written before we update index. */
+	smp_wmb();
+
+	old = vq->used_idx;
+	new = old + count;
+	vring_used_idx_set(vq, new);
+	vq->inuse -= count;
+	if ((int16_t)(new - vq->signalled_used) < (uint16_t)(new - old)) {
+		vq->signalled_used_valid = false;
+	}
+}
+
+void vu_queue_push(VuDev *dev, VuVirtq *vq,
+		   VuVirtqElement *elem, unsigned int len)
+{
+	vu_queue_fill(dev, vq, elem, len, 0);
+	vu_queue_flush(dev, vq, 1);
+}
+
diff --git a/virtio.h b/virtio.h
new file mode 100644
index 000000000000..e334355b0f30
--- /dev/null
+++ b/virtio.h
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+//
+/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.h */
+
+#ifndef VIRTIO_H
+#define VIRTIO_H
+
+#include <stdbool.h>
+#include <linux/vhost_types.h>
+
+#define VIRTQUEUE_MAX_SIZE 1024
+
+#define vu_panic(vdev, ...)		\
+	do {				\
+		(vdev)->broken = true;	\
+		err( __VA_ARGS__ );	\
+	} while (0)
+
+typedef struct VuRing {
+	unsigned int num;
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	uint64_t log_guest_addr;
+	uint32_t flags;
+} VuRing;
+
+typedef struct VuVirtq {
+	VuRing vring;
+
+	/* Next head to pop */
+	uint16_t last_avail_idx;
+
+	/* Last avail_idx read from VQ. */
+	uint16_t shadow_avail_idx;
+
+	uint16_t used_idx;
+
+	/* Last used index value we have signalled on */
+	uint16_t signalled_used;
+
+	/* Whether signalled_used holds a valid value */
+	bool signalled_used_valid;
+
+	bool notification;
+
+	unsigned int inuse;
+
+	int call_fd;
+	int kick_fd;
+	int err_fd;
+	unsigned int enable;
+	bool started;
+
+	/* Guest addresses of our ring */
+	struct vhost_vring_addr vra;
+} VuVirtq;
+
+typedef struct VuDevRegion {
+	uint64_t gpa;
+	uint64_t size;
+	uint64_t qva;
+	uint64_t mmap_offset;
+	uint64_t mmap_addr;
+} VuDevRegion;
+
+#define VHOST_USER_MAX_QUEUES 2
+
+/*
+ * Set a reasonable maximum number of ram slots, which will be supported by
+ * any architecture.
+ */
+#define VHOST_USER_MAX_RAM_SLOTS 32
+
+typedef struct VuDev {
+	uint32_t nregions;
+	VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS];
+	VuVirtq vq[VHOST_USER_MAX_QUEUES];
+	uint64_t features;
+	uint64_t protocol_features;
+	bool broken;
+	int hdrlen;
+} VuDev;
+
+typedef struct VuVirtqElement {
+	unsigned int index;
+	unsigned int out_num;
+	unsigned int in_num;
+	struct iovec *in_sg;
+	struct iovec *out_sg;
+} VuVirtqElement;
+
+static inline bool has_feature(uint64_t features, unsigned int fbit)
+{
+	return !!(features & (1ULL << fbit));
+}
+
+static inline bool vu_has_feature(VuDev *vdev, unsigned int fbit)
+{
+	return has_feature(vdev->features, fbit);
+}
+
+static inline bool vu_has_protocol_feature(VuDev *vdev, unsigned int fbit)
+{
+	return has_feature(vdev->protocol_features, fbit);
+}
+
+bool vu_queue_empty(VuDev *dev, VuVirtq *vq);
+void vu_queue_notify(VuDev *dev, VuVirtq *vq);
+void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz, unsigned char *buffer);
+void vu_queue_detach_element(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len);
+void vu_queue_unpop(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len);
+bool vu_queue_rewind(VuDev *dev, VuVirtq *vq, unsigned int num);
+
+void vu_queue_fill_by_index(VuDev *dev, VuVirtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx);
+void vu_queue_fill(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem, unsigned int len,
+		   unsigned int idx);
+void vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count);
+void vu_queue_push(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem, unsigned int len);
+#endif /* VIRTIO_H */
-- 
2.42.0


* [PATCH 19/24] vhost-user: introduce vhost-user API
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (17 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 18/24] vhost-user: introduce virtio API Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-07  2:13   ` David Gibson
  2024-02-02 14:11 ` [PATCH 20/24] vhost-user: add vhost-user Laurent Vivier
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add vhost_user.c and vhost_user.h, which define the functions needed
to implement a vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
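
Note for readers of the memory-table handling: the address translation that
qva_to_va() performs can be sketched in isolation as below. This is only an
illustration under stated assumptions, not code from the series: struct
region, translate() and the sample values in the usage note are hypothetical,
and only the arithmetic follows the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VuDevRegion, reduced to the fields used here */
struct region {
	uint64_t qva;		/* start of the region in QEMU's address space */
	uint64_t size;		/* length of the region in bytes */
	uint64_t mmap_addr;	/* where the backend mapped the region's fd */
	uint64_t mmap_offset;	/* offset of the region inside that mapping */
};

/* Return the backend-side address for qemu_addr, or 0 if no region matches */
static uint64_t translate(const struct region *r, size_t n, uint64_t qemu_addr)
{
	size_t i;

	assert(r != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* written as a subtraction so that qva + size cannot overflow */
		if (qemu_addr >= r[i].qva && qemu_addr - r[i].qva < r[i].size)
			return qemu_addr - r[i].qva + r[i].mmap_addr +
			       r[i].mmap_offset;
	}

	return 0;
}
```

With a region { .qva = 0x8000, .size = 0x2000, .mmap_addr = 0x900000,
.mmap_offset = 0x100 }, the address 0x8010 would translate to 0x900110, and
an address outside every region yields 0, which corresponds to the NULL
return in the patch.
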
 Makefile     |    4 +-
 passt.c      |    2 +
 passt.h      |    8 +
 tap.c        |    2 +-
 tap.h        |    3 +
 vhost_user.c | 1050 ++++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h |  139 +++++++
 7 files changed, 1205 insertions(+), 3 deletions(-)
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h
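
Note on the reply path: vu_send_reply() rewrites the header flags before
writing the message back. The sketch below shows that bit manipulation on its
own. The SKETCH_* constants and both helper names are hypothetical; the values
are assumed from the vhost-user protocol description (version in bits 0-1,
reply in bit 2, need-reply in bit 3) and should be checked against
vhost_user.h in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag layout, see the note above; not copied from vhost_user.h */
#define SKETCH_VERSION		0x1u
#define SKETCH_VERSION_MASK	0x3u
#define SKETCH_REPLY_MASK	(0x1u << 2)
#define SKETCH_NEED_REPLY_MASK	(0x1u << 3)

/* Build reply flags: force the protocol version and mark the message a reply */
static uint32_t reply_flags(uint32_t flags)
{
	flags &= ~SKETCH_VERSION_MASK;
	flags |= SKETCH_VERSION;
	flags |= SKETCH_REPLY_MASK;

	return flags;
}

/* True if the front-end asked for an explicit acknowledgement */
static bool wants_ack(uint32_t flags)
{
	return (flags & SKETCH_NEED_REPLY_MASK) != 0;
}
```

Since vmsg_set_reply_u64() clears the flags first, the replies built that way
would carry only the version and reply bits under these assumptions.
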

diff --git a/Makefile b/Makefile
index ae1daa6b2b50..2016b071ddf2 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
-	tcp_buf.c udp.c util.c iov.c ip.c virtio.c
+	tcp_buf.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
 	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
-	util.h iov.h ip.h virtio.h
+	util.h iov.h ip.h virtio.h vhost_user.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/passt.c b/passt.c
index 5056a49dec95..95034d73381f 100644
--- a/passt.c
+++ b/passt.c
@@ -72,6 +72,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_PASTA]	= "/dev/net/tun device",
 	[EPOLL_TYPE_TAP_PASST]	= "connected qemu socket",
 	[EPOLL_TYPE_TAP_LISTEN]	= "listening qemu socket",
+	[EPOLL_TYPE_VHOST_CMD]	= "vhost-user command socket",
+	[EPOLL_TYPE_VHOST_KICK]	= "vhost-user kick socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
diff --git a/passt.h b/passt.h
index a9e8f15af0e1..6ed1d0b19e82 100644
--- a/passt.h
+++ b/passt.h
@@ -42,6 +42,7 @@ union epoll_ref;
 #include "port_fwd.h"
 #include "tcp.h"
 #include "udp.h"
+#include "vhost_user.h"
 
 /**
  * enum epoll_type - Different types of fds we poll over
@@ -71,6 +72,10 @@ enum epoll_type {
 	EPOLL_TYPE_TAP_PASST,
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
+	/* vhost-user command socket */
+	EPOLL_TYPE_VHOST_CMD,
+	/* vhost-user kick event socket */
+	EPOLL_TYPE_VHOST_KICK,
 
 	EPOLL_NUM_TYPES,
 };
@@ -303,6 +308,9 @@ struct ctx {
 
 	int low_wmem;
 	int low_rmem;
+
+	/* vhost-user */
+	struct VuDev vdev;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/tap.c b/tap.c
index ebe52247ad87..936206e53637 100644
--- a/tap.c
+++ b/tap.c
@@ -954,7 +954,7 @@ void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
  * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
  * @c:		Execution context
  */
-static void tap_sock_reset(struct ctx *c)
+void tap_sock_reset(struct ctx *c)
 {
 	if (c->one_off) {
 		info("Client closed connection, exiting");
diff --git a/tap.h b/tap.h
index 7157ef37ee6e..ee839d4f09dc 100644
--- a/tap.h
+++ b/tap.h
@@ -81,12 +81,15 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
+void tap_sock_reset(struct ctx *c);
 void tap_sock_init(struct ctx *c);
 void pool_flush_all(void);
 void tap_handler_all(struct ctx *c, const struct timespec *now);
 
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
+void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
+		       const char *func, int line);
 #define packet_add_all(p, len, start)					\
 	packet_add_all_do(p, len, start, __func__, __LINE__)
 
diff --git a/vhost_user.c b/vhost_user.c
new file mode 100644
index 000000000000..2acd72398e3a
--- /dev/null
+++ b/vhost_user.c
@@ -0,0 +1,1050 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* some parts from QEMU subprojects/libvhost-user/libvhost-user.c */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <time.h>
+#include <net/ethernet.h>
+#include <netinet/in.h>
+#include <sys/epoll.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <linux/vhost_types.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "passt.h"
+#include "tap.h"
+#include "vhost_user.h"
+
+#define VHOST_USER_VERSION 1
+
+static unsigned char buffer[VHOST_USER_MAX_QUEUES][65536];
+
+void vu_print_capabilities(void)
+{
+	printf("{\n");
+	printf("  \"type\": \"net\"\n");
+	printf("}\n");
+	exit(EXIT_SUCCESS);
+}
+
+static const char *
+vu_request_to_string(unsigned int req)
+{
+#define REQ(req) [req] = #req
+	static const char *vu_request_str[] = {
+		REQ(VHOST_USER_NONE),
+		REQ(VHOST_USER_GET_FEATURES),
+		REQ(VHOST_USER_SET_FEATURES),
+		REQ(VHOST_USER_SET_OWNER),
+		REQ(VHOST_USER_RESET_OWNER),
+		REQ(VHOST_USER_SET_MEM_TABLE),
+		REQ(VHOST_USER_SET_LOG_BASE),
+		REQ(VHOST_USER_SET_LOG_FD),
+		REQ(VHOST_USER_SET_VRING_NUM),
+		REQ(VHOST_USER_SET_VRING_ADDR),
+		REQ(VHOST_USER_SET_VRING_BASE),
+		REQ(VHOST_USER_GET_VRING_BASE),
+		REQ(VHOST_USER_SET_VRING_KICK),
+		REQ(VHOST_USER_SET_VRING_CALL),
+		REQ(VHOST_USER_SET_VRING_ERR),
+		REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
+		REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
+		REQ(VHOST_USER_GET_QUEUE_NUM),
+		REQ(VHOST_USER_SET_VRING_ENABLE),
+		REQ(VHOST_USER_SEND_RARP),
+		REQ(VHOST_USER_NET_SET_MTU),
+		REQ(VHOST_USER_SET_BACKEND_REQ_FD),
+		REQ(VHOST_USER_IOTLB_MSG),
+		REQ(VHOST_USER_SET_VRING_ENDIAN),
+		REQ(VHOST_USER_GET_CONFIG),
+		REQ(VHOST_USER_SET_CONFIG),
+		REQ(VHOST_USER_POSTCOPY_ADVISE),
+		REQ(VHOST_USER_POSTCOPY_LISTEN),
+		REQ(VHOST_USER_POSTCOPY_END),
+		REQ(VHOST_USER_GET_INFLIGHT_FD),
+		REQ(VHOST_USER_SET_INFLIGHT_FD),
+		REQ(VHOST_USER_GPU_SET_SOCKET),
+		REQ(VHOST_USER_VRING_KICK),
+		REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
+		REQ(VHOST_USER_ADD_MEM_REG),
+		REQ(VHOST_USER_REM_MEM_REG),
+		REQ(VHOST_USER_MAX),
+	};
+#undef REQ
+
+	if (req < VHOST_USER_MAX) {
+		return vu_request_str[req];
+	} else {
+		return "unknown";
+	}
+}
+
+/* Translate qemu virtual address to our virtual address.  */
+static void *qva_to_va(VuDev *dev, uint64_t qemu_addr)
+{
+	unsigned int i;
+
+	/* Find matching memory region.  */
+	for (i = 0; i < dev->nregions; i++) {
+		VuDevRegion *r = &dev->regions[i];
+
+		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
+			return (void *)(uintptr_t)
+			(qemu_addr - r->qva + r->mmap_addr + r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+static void
+vmsg_close_fds(VhostUserMsg *vmsg)
+{
+	int i;
+
+	for (i = 0; i < vmsg->fd_num; i++)
+		close(vmsg->fds[i]);
+}
+
+static void vu_remove_watch(VuDev *vdev, int fd)
+{
+	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
+}
+
+/* Set reply payload.u64 and clear request flags and fd_num */
+static void vmsg_set_reply_u64(struct VhostUserMsg *vmsg, uint64_t val)
+{
+	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
+	vmsg->hdr.size = sizeof(vmsg->payload.u64);
+	vmsg->payload.u64 = val;
+	vmsg->fd_num = 0;
+}
+
+static ssize_t vu_message_read_default(VuDev *dev, int conn_fd, struct VhostUserMsg *vmsg)
+{
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
+		     sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+		.msg_controllen = sizeof(control),
+	};
+	size_t fd_size;
+	struct cmsghdr *cmsg;
+	ssize_t ret, sz_payload;
+
+	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
+	if (ret < 0) {
+		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+		vu_panic(dev, "Error while recvmsg: %s", strerror(errno));
+		goto out;
+	}
+
+	vmsg->fd_num = 0;
+	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+		if (cmsg->cmsg_level == SOL_SOCKET &&
+		    cmsg->cmsg_type == SCM_RIGHTS) {
+			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
+			vmsg->fd_num = fd_size / sizeof(int);
+			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
+			break;
+		}
+	}
+
+	sz_payload = vmsg->hdr.size;
+	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
+		vu_panic(dev,
+			 "Error: too big message request: %d, size: vmsg->size: %zd, "
+			 "while sizeof(vmsg->payload) = %zu",
+			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
+		goto out;
+	}
+
+	if (sz_payload) {
+		do {
+			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
+		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));
+
+		if (ret < sz_payload) {
+			vu_panic(dev, "Error while reading: %s", strerror(errno));
+			goto out;
+		}
+	}
+
+	return 1;
+out:
+	vmsg_close_fds(vmsg);
+
+	return -ECONNRESET;
+}
+
+static int vu_message_write(VuDev *dev, int conn_fd, struct VhostUserMsg *vmsg)
+{
+	int rc;
+	uint8_t *p = (uint8_t *)vmsg;
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+	};
+	struct cmsghdr *cmsg;
+
+	memset(control, 0, sizeof(control));
+	assert(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
+	if (vmsg->fd_num > 0) {
+		size_t fdsize = vmsg->fd_num * sizeof(int);
+		msg.msg_controllen = CMSG_SPACE(fdsize);
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(fdsize);
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
+	} else {
+		msg.msg_controllen = 0;
+	}
+
+	do {
+		rc = sendmsg(conn_fd, &msg, 0);
+	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
+
+	if (vmsg->hdr.size) {
+		do {
+			if (vmsg->data) {
+				rc = write(conn_fd, vmsg->data, vmsg->hdr.size);
+			} else {
+				rc = write(conn_fd, p + VHOST_USER_HDR_SIZE, vmsg->hdr.size);
+			}
+		} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
+	}
+
+	if (rc <= 0) {
+		vu_panic(dev, "Error while writing: %s", strerror(errno));
+		return false;
+	}
+
+	return true;
+}
+
+static int vu_send_reply(VuDev *dev, int conn_fd, struct VhostUserMsg *msg)
+{
+	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
+	msg->hdr.flags |= VHOST_USER_VERSION;
+	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
+
+	return vu_message_write(dev, conn_fd, msg);
+}
+
+static bool vu_get_features_exec(struct VhostUserMsg *msg)
+{
+	uint64_t features =
+		1ULL << VIRTIO_F_VERSION_1 |
+		1ULL << VIRTIO_NET_F_MRG_RXBUF |
+		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
+
+	vmsg_set_reply_u64(msg, features);
+
+	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
+
+	return true;
+}
+
+static void
+vu_set_enable_all_rings(VuDev *vdev, bool enabled)
+{
+	uint16_t i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		vdev->vq[i].enable = enabled;
+	}
+}
+
+static bool
+vu_set_features_exec(VuDev *vdev, struct VhostUserMsg *msg)
+{
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vdev->features = msg->payload.u64;
+	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+		/*
+		 * We only support devices conforming to VIRTIO 1.0 or
+		 * later
+		 */
+		vu_panic(vdev, "virtio legacy devices aren't supported by passt");
+		return false;
+	}
+
+	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES)) {
+		vu_set_enable_all_rings(vdev, true);
+	}
+
+	/* virtio-net features */
+
+	if (vu_has_feature(vdev, VIRTIO_F_VERSION_1) ||
+	    vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
+		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	} else {
+		vdev->hdrlen = sizeof(struct virtio_net_hdr);
+	}
+
+	return false;
+}
+
+static bool
+vu_set_owner_exec(void)
+{
+	return false;
+}
+
+static bool map_ring(VuDev *vdev, VuVirtq *vq)
+{
+	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
+	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
+	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
+
+	debug("Setting virtq addresses:");
+	debug("    vring_desc  at %p", (void *)vq->vring.desc);
+	debug("    vring_used  at %p", (void *)vq->vring.used);
+	debug("    vring_avail at %p", (void *)vq->vring.avail);
+
+	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
+/*
+ * #syscalls:passt mmap munmap
+ */
+
+static bool vu_set_mem_table_exec(VuDev *vdev,
+				  struct VhostUserMsg *msg)
+{
+	unsigned int i;
+	struct VhostUserMemory m = msg->payload.memory, *memory = &m;
+
+	for (i = 0; i < vdev->nregions; i++) {
+		VuDevRegion *r = &vdev->regions[i];
+		void *m = (void *) (uintptr_t) r->mmap_addr;
+
+		if (m)
+			munmap(m, r->size + r->mmap_offset);
+	}
+	vdev->nregions = memory->nregions;
+
+	debug("Nregions: %u", memory->nregions);
+	for (i = 0; i < vdev->nregions; i++) {
+		void *mmap_addr;
+		VhostUserMemory_region *msg_region = &memory->regions[i];
+		VuDevRegion *dev_region = &vdev->regions[i];
+
+		debug("Region %d", i);
+		debug("    guest_phys_addr: 0x%016"PRIx64,
+		      msg_region->guest_phys_addr);
+		debug("    memory_size:     0x%016"PRIx64,
+		      msg_region->memory_size);
+		debug("    userspace_addr   0x%016"PRIx64,
+		      msg_region->userspace_addr);
+		debug("    mmap_offset      0x%016"PRIx64,
+		      msg_region->mmap_offset);
+
+		dev_region->gpa = msg_region->guest_phys_addr;
+		dev_region->size = msg_region->memory_size;
+		dev_region->qva = msg_region->userspace_addr;
+		dev_region->mmap_offset = msg_region->mmap_offset;
+
+		/* We don't use offset argument of mmap() since the
+		 * mapped address has to be page aligned, and we use huge
+		 * pages.  */
+		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+				 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
+				 msg->fds[i], 0);
+
+		if (mmap_addr == MAP_FAILED) {
+			vu_panic(vdev, "region mmap error: %s", strerror(errno));
+		} else {
+			dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+			debug("    mmap_addr:       0x%016"PRIx64,
+			      dev_region->mmap_addr);
+		}
+
+		close(msg->fds[i]);
+	}
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		if (vdev->vq[i].vring.desc) {
+			if (map_ring(vdev, &vdev->vq[i])) {
+				vu_panic(vdev, "remapping queue %d during setmemtable", i);
+			}
+		}
+	}
+
+	return false;
+}
+
+static bool vu_set_vring_num_exec(VuDev *vdev,
+				  struct VhostUserMsg *msg)
+{
+	unsigned int index = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", index);
+	debug("State.num:   %u", num);
+	vdev->vq[index].vring.num = num;
+
+	return false;
+}
+
+static bool vu_set_vring_addr_exec(VuDev *vdev,
+				   struct VhostUserMsg *msg)
+{
+	struct vhost_vring_addr addr = msg->payload.addr, *vra = &addr;
+	unsigned int index = vra->index;
+	VuVirtq *vq = &vdev->vq[index];
+
+	debug("vhost_vring_addr:");
+	debug("    index:  %d", vra->index);
+	debug("    flags:  %d", vra->flags);
+	debug("    desc_user_addr:   0x%016" PRIx64, (uint64_t)vra->desc_user_addr);
+	debug("    used_user_addr:   0x%016" PRIx64, (uint64_t)vra->used_user_addr);
+	debug("    avail_user_addr:  0x%016" PRIx64, (uint64_t)vra->avail_user_addr);
+	debug("    log_guest_addr:   0x%016" PRIx64, (uint64_t)vra->log_guest_addr);
+
+	vq->vra = *vra;
+	vq->vring.flags = vra->flags;
+	vq->vring.log_guest_addr = vra->log_guest_addr;
+
+	if (map_ring(vdev, vq)) {
+		vu_panic(vdev, "Invalid vring_addr message");
+		return false;
+	}
+
+	vq->used_idx = le16toh(vq->vring.used->idx);
+
+	if (vq->last_avail_idx != vq->used_idx) {
+		debug("Last avail index != used index: %u != %u",
+		      vq->last_avail_idx, vq->used_idx);
+	}
+
+	return false;
+}
+
+static bool vu_set_vring_base_exec(VuDev *vdev,
+				   struct VhostUserMsg *msg)
+{
+	unsigned int index = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", index);
+	debug("State.num:   %u", num);
+	vdev->vq[index].shadow_avail_idx = vdev->vq[index].last_avail_idx = num;
+
+	return false;
+}
+
+static bool vu_get_vring_base_exec(VuDev *vdev,
+				   struct VhostUserMsg *msg)
+{
+	unsigned int index = msg->payload.state.index;
+
+	debug("State.index: %u", index);
+	msg->payload.state.num = vdev->vq[index].last_avail_idx;
+	msg->hdr.size = sizeof(msg->payload.state);
+
+	vdev->vq[index].started = false;
+
+	if (vdev->vq[index].call_fd != -1) {
+		close(vdev->vq[index].call_fd);
+		vdev->vq[index].call_fd = -1;
+	}
+	if (vdev->vq[index].kick_fd != -1) {
+		vu_remove_watch(vdev,  vdev->vq[index].kick_fd);
+		close(vdev->vq[index].kick_fd);
+		vdev->vq[index].kick_fd = -1;
+	}
+
+	return true;
+}
+
+static void vu_set_watch(VuDev *vdev, int fd)
+{
+	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
+	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	ev.events = EPOLLIN;
+	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
+}
+
+int vu_send(const struct ctx *c, const void *buf, size_t size)
+{
+	VuDev *vdev = (VuDev *)&c->vdev;
+	size_t hdrlen = vdev->hdrlen;
+	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	unsigned int indexes[VIRTQUEUE_MAX_SIZE];
+	size_t lens[VIRTQUEUE_MAX_SIZE];
+	size_t offset;
+	int i, j;
+	__virtio16 *num_buffers_ptr;
+
+	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		err("Got packet, but RX virtq is not enabled or not started.");
+		return 0;
+	}
+
+	offset = 0;
+	i = 0;
+	num_buffers_ptr = NULL;
+	while (offset < size) {
+		VuVirtqElement *elem;
+		size_t len;
+		int total;
+
+		total = 0;
+
+		if (i == VIRTQUEUE_MAX_SIZE) {
+			err("virtio-net unexpected long buffer chain");
+			goto err;
+		}
+
+		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement),
+				    buffer[VHOST_USER_RX_QUEUE]);
+		if (!elem) {
+			if (!vdev->broken) {
+				eventfd_t kick_data;
+				ssize_t rc;
+				int status;
+
+				/* wait for the kernel to put new entries in the queue */
+
+				status = fcntl(vq->kick_fd, F_GETFL);
+				if (status != -1) {
+					fcntl(vq->kick_fd, F_SETFL, status & ~O_NONBLOCK);
+					rc =  eventfd_read(vq->kick_fd, &kick_data);
+					fcntl(vq->kick_fd, F_SETFL, status);
+					if (rc != -1)
+						continue;
+				}
+			}
+			if (i) {
+				err("virtio-net unexpected empty queue: "
+				    "i %d mergeable %d offset %zu, size %zu, "
+				    "features 0x%" PRIx64,
+				    i, vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF),
+				    offset, size, vdev->features);
+			}
+			offset = -1;
+			goto err;
+		}
+
+		if (elem->in_num < 1) {
+			err("virtio-net receive queue contains no in buffers");
+			vu_queue_detach_element(vdev, vq, elem->index, 0);
+			offset = -1;
+			goto err;
+		}
+
+		if (i == 0) {
+			struct virtio_net_hdr hdr = {
+				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+			};
+
+			ASSERT(offset == 0);
+			ASSERT(elem->in_sg[0].iov_len >= hdrlen);
+
+			len = iov_from_buf(elem->in_sg, elem->in_num, 0, &hdr, sizeof hdr);
+
+			num_buffers_ptr = (__virtio16 *)((char *)elem->in_sg[0].iov_base +
+							 len);
+
+			total += hdrlen;
+		}
+
+		len = iov_from_buf(elem->in_sg, elem->in_num, total, (char *)buf + offset,
+				   size - offset);
+
+		total += len;
+		offset += len;
+
+		/* If buffers can't be merged, at this point we
+		 * must have consumed the complete packet.
+		 * Otherwise, drop it.
+		 */
+		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) && offset < size) {
+			vu_queue_unpop(vdev, vq, elem->index, total);
+			goto err;
+		}
+
+		indexes[i] = elem->index;
+		lens[i] = total;
+		i++;
+	}
+
+	if (num_buffers_ptr && vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
+		*num_buffers_ptr = htole16(i);
+	}
+
+	for (j = 0; j < i; j++) {
+		debug("filling total %zu idx %d", lens[j], j);
+		vu_queue_fill_by_index(vdev, vq, indexes[j], lens[j], j);
+	}
+
+	vu_queue_flush(vdev, vq, i);
+	vu_queue_notify(vdev, vq);
+
+	debug("sent %zu", offset);
+
+	return offset;
+err:
+	for (j = 0; j < i; j++) {
+		vu_queue_detach_element(vdev, vq, indexes[j], lens[j]);
+	}
+
+	return offset;
+}
+
+size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov, size_t n)
+{
+	size_t i;
+	int ret;
+
+	debug("tap_send_frames_vu n %zu", n);
+
+	for (i = 0; i < n; i++) {
+		ret = vu_send(c, iov[i].iov_base, iov[i].iov_len);
+		if (ret < 0)
+			break;
+	}
+	debug("count %zu", i);
+	return i;
+}
+
+static void vu_handle_tx(VuDev *vdev, int index)
+{
+	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
+	VuVirtq *vq = &vdev->vq[index];
+	int hdrlen = vdev->hdrlen;
+	struct timespec now;
+	char *p;
+	size_t n;
+
+	if (index % 2 != VHOST_USER_TX_QUEUE) {
+		debug("index %d is not a TX queue", index);
+		return;
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &now);
+
+	p = pkt_buf;
+
+	pool_flush_all();
+
+	while (1) {
+		VuVirtqElement *elem;
+		unsigned int out_num;
+		struct iovec sg[VIRTQUEUE_MAX_SIZE], *out_sg;
+
+		ASSERT(index == VHOST_USER_TX_QUEUE);
+		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer[index]);
+		if (!elem) {
+			break;
+		}
+
+		out_num = elem->out_num;
+		out_sg = elem->out_sg;
+		if (out_num < 1) {
+			debug("virtio-net header not in first element");
+			break;
+		}
+
+		if (hdrlen) {
+			unsigned sg_num;
+
+			sg_num = iov_copy(sg, ARRAY_SIZE(sg), out_sg, out_num,
+					  hdrlen, -1);
+			out_num = sg_num;
+			out_sg = sg;
+		}
+
+		n = iov_to_buf(out_sg, out_num, 0, p, TAP_BUF_FILL);
+
+		packet_add_all(c, n, p);
+
+		p += n;
+
+		vu_queue_push(vdev, vq, elem, 0);
+		vu_queue_notify(vdev, vq);
+	}
+	tap_handler_all(c, &now);
+}
+
+void vu_kick_cb(struct ctx *c, union epoll_ref ref)
+{
+	VuDev *vdev = &c->vdev;
+	eventfd_t kick_data;
+	ssize_t rc;
+	int index;
+
+	for (index = 0; index < VHOST_USER_MAX_QUEUES; index++)
+		if (c->vdev.vq[index].kick_fd == ref.fd)
+			break;
+
+	if (index == VHOST_USER_MAX_QUEUES)
+		return;
+
+	rc =  eventfd_read(ref.fd, &kick_data);
+	if (rc == -1) {
+		vu_panic(vdev, "kick eventfd_read(): %s", strerror(errno));
+		vu_remove_watch(vdev, ref.fd);
+	} else {
+		debug("Got kick_data: %016"PRIx64" idx:%d",
+		      kick_data, index);
+		if (index % 2 == VHOST_USER_TX_QUEUE)
+			vu_handle_tx(vdev, index);
+	}
+}
+
+static bool vu_check_queue_msg_file(VuDev *vdev, struct VhostUserMsg *msg)
+{
+	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+
+	if (index >= VHOST_USER_MAX_QUEUES) {
+		vmsg_close_fds(msg);
+		vu_panic(vdev, "Invalid queue index: %u", index);
+		return false;
+	}
+
+	if (nofd) {
+		vmsg_close_fds(msg);
+		return true;
+	}
+
+	if (msg->fd_num != 1) {
+		vmsg_close_fds(msg);
+		vu_panic(vdev, "Invalid fds in request: %d", msg->hdr.request);
+		return false;
+	}
+
+	return true;
+}
+
+static bool vu_set_vring_kick_exec(VuDev *vdev,
+				   struct VhostUserMsg *msg)
+{
+	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	if (!vu_check_queue_msg_file(vdev, msg))
+		return false;
+
+	if (vdev->vq[index].kick_fd != -1) {
+		vu_remove_watch(vdev, vdev->vq[index].kick_fd);
+		close(vdev->vq[index].kick_fd);
+		vdev->vq[index].kick_fd = -1;
+	}
+
+	vdev->vq[index].kick_fd = nofd ? -1 : msg->fds[0];
+	debug("Got kick_fd: %d for vq: %d", vdev->vq[index].kick_fd, index);
+
+	vdev->vq[index].started = true;
+
+	if (vdev->vq[index].kick_fd != -1 && index % 2 == VHOST_USER_TX_QUEUE) {
+		vu_set_watch(vdev, vdev->vq[index].kick_fd);
+		debug("Waiting for kicks on fd: %d for vq: %d",
+		      vdev->vq[index].kick_fd, index);
+	}
+
+	return false;
+}
+
+static bool vu_set_vring_call_exec(VuDev *vdev,
+				   struct VhostUserMsg *msg)
+{
+	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	if (!vu_check_queue_msg_file(vdev, msg))
+		return false;
+
+	if (vdev->vq[index].call_fd != -1) {
+		close(vdev->vq[index].call_fd);
+		vdev->vq[index].call_fd = -1;
+	}
+
+	vdev->vq[index].call_fd = nofd ? -1 : msg->fds[0];
+
+	/* in case of I/O hang after reconnecting */
+	if (vdev->vq[index].call_fd != -1) {
+		eventfd_write(msg->fds[0], 1);
+	}
+
+	debug("Got call_fd: %d for vq: %d", vdev->vq[index].call_fd, index);
+
+	return false;
+}
+
+static bool vu_set_vring_err_exec(VuDev *vdev,
+				  struct VhostUserMsg *msg)
+{
+	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	if (!vu_check_queue_msg_file(vdev, msg))
+		return false;
+
+	if (vdev->vq[index].err_fd != -1) {
+		close(vdev->vq[index].err_fd);
+		vdev->vq[index].err_fd = -1;
+	}
+
+	vdev->vq[index].err_fd = nofd ? -1 : msg->fds[0];
+
+	return false;
+}
+
+static bool vu_get_protocol_features_exec(struct VhostUserMsg *msg)
+{
+	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
+
+	vmsg_set_reply_u64(msg, features);
+
+	return true;
+}
+
+static bool vu_set_protocol_features_exec(VuDev *vdev, struct VhostUserMsg *msg)
+{
+	uint64_t features = msg->payload.u64;
+
+	debug("u64: 0x%016"PRIx64, features);
+
+	vdev->protocol_features = msg->payload.u64;
+
+	if (vu_has_protocol_feature(vdev,
+				    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS) &&
+	    (!vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_BACKEND_REQ) ||
+	     !vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_REPLY_ACK))) {
+		/*
+		 * The use case for using messages for kick/call is simulation, to make
+		 * the kick and call synchronous. To actually get that behaviour, both
+		 * of the other features are required.
+		 * Theoretically, one could use only kick messages, or do them without
+		 * having F_REPLY_ACK, but too many (possibly pending) messages on the
+		 * socket will eventually cause the master to hang, to avoid this in
+		 * scenarios where not desired enforce that the settings are in a way
+		 * that actually enables the simulation case.
+		 */
+		vu_panic(vdev,
+			 "F_IN_BAND_NOTIFICATIONS requires F_BACKEND_REQ && F_REPLY_ACK");
+		return false;
+	}
+
+	return false;
+}
+
+
+static bool vu_get_queue_num_exec(struct VhostUserMsg *msg)
+{
+	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
+	return true;
+}
+
+static bool vu_set_vring_enable_exec(VuDev *vdev, struct VhostUserMsg *msg)
+{
+	unsigned int index = msg->payload.state.index;
+	unsigned int enable = msg->payload.state.num;
+
+	debug("State.index:  %u", index);
+	debug("State.enable: %u", enable);
+
+	if (index >= VHOST_USER_MAX_QUEUES) {
+		vu_panic(vdev, "Invalid vring_enable index: %u", index);
+		return false;
+	}
+
+	vdev->vq[index].enable = enable;
+	return false;
+}
+
+void vu_init(struct ctx *c)
+{
+	int i;
+
+	c->vdev.hdrlen = 0;
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
+		c->vdev.vq[i] = (VuVirtq){
+			.call_fd = -1,
+			.kick_fd = -1,
+			.err_fd = -1,
+			.notification = true,
+		};
+}
+
+static void vu_cleanup(VuDev *vdev)
+{
+	unsigned int i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		VuVirtq *vq = &vdev->vq[i];
+
+		vq->started = false;
+		vq->notification = true;
+
+		if (vq->call_fd != -1) {
+			close(vq->call_fd);
+			vq->call_fd = -1;
+		}
+		if (vq->err_fd != -1) {
+			close(vq->err_fd);
+			vq->err_fd = -1;
+		}
+		if (vq->kick_fd != -1) {
+			vu_remove_watch(vdev, vq->kick_fd);
+			close(vq->kick_fd);
+			vq->kick_fd = -1;
+		}
+
+		vq->vring.desc = 0;
+		vq->vring.used = 0;
+		vq->vring.avail = 0;
+	}
+	vdev->hdrlen = 0;
+
+	for (i = 0; i < vdev->nregions; i++) {
+		VuDevRegion *r = &vdev->regions[i];
+		void *m = (void *) (uintptr_t) r->mmap_addr;
+
+		if (m)
+			munmap(m, r->size + r->mmap_offset);
+	}
+	vdev->nregions = 0;
+}
+
+/**
+ * tap_handler_vu() - Packet handler for vhost-user
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void tap_handler_vu(struct ctx *c, uint32_t events)
+{
+	VuDev *dev = &c->vdev;
+	struct VhostUserMsg msg = { 0 };
+	bool need_reply, reply_requested;
+	int ret;
+
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
+		tap_sock_reset(c);
+		return;
+	}
+
+
+	ret = vu_message_read_default(dev, c->fd_tap, &msg);
+	if (ret <= 0) {
+		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+			tap_sock_reset(c);
+		return;
+	}
+	debug("================ Vhost user message ================");
+	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
+		msg.hdr.request);
+	debug("Flags:   0x%x", msg.hdr.flags);
+	debug("Size:    %u", msg.hdr.size);
+
+	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
+	switch (msg.hdr.request) {
+	case VHOST_USER_GET_FEATURES:
+		reply_requested = vu_get_features_exec(&msg);
+		break;
+	case VHOST_USER_SET_FEATURES:
+		reply_requested = vu_set_features_exec(dev, &msg);
+		break;
+	case VHOST_USER_GET_PROTOCOL_FEATURES:
+		reply_requested = vu_get_protocol_features_exec(&msg);
+		break;
+	case VHOST_USER_SET_PROTOCOL_FEATURES:
+		reply_requested = vu_set_protocol_features_exec(dev, &msg);
+		break;
+	case VHOST_USER_GET_QUEUE_NUM:
+		reply_requested = vu_get_queue_num_exec(&msg);
+		break;
+	case VHOST_USER_SET_OWNER:
+		reply_requested = vu_set_owner_exec();
+		break;
+	case VHOST_USER_SET_MEM_TABLE:
+		reply_requested = vu_set_mem_table_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_NUM:
+		reply_requested = vu_set_vring_num_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ADDR:
+		reply_requested = vu_set_vring_addr_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_BASE:
+		reply_requested = vu_set_vring_base_exec(dev, &msg);
+		break;
+	case VHOST_USER_GET_VRING_BASE:
+		reply_requested = vu_get_vring_base_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_KICK:
+		reply_requested = vu_set_vring_kick_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_CALL:
+		reply_requested = vu_set_vring_call_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ERR:
+		reply_requested = vu_set_vring_err_exec(dev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ENABLE:
+		reply_requested = vu_set_vring_enable_exec(dev, &msg);
+		break;
+	case VHOST_USER_NONE:
+		vu_cleanup(dev);
+		return;
+	default:
+		vu_panic(dev, "Unhandled request: %d", msg.hdr.request);
+		return;
+	}
+
+	if (!reply_requested && need_reply) {
+		msg.payload.u64 = 0;
+		msg.hdr.flags = 0;
+		msg.hdr.size = sizeof(msg.payload.u64);
+		msg.fd_num = 0;
+		reply_requested = true;
+	}
+
+	if (reply_requested)
+		ret = vu_send_reply(dev, c->fd_tap, &msg);
+	free(msg.data);
+}
diff --git a/vhost_user.h b/vhost_user.h
new file mode 100644
index 000000000000..25f0b617ab40
--- /dev/null
+++ b/vhost_user.h
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* some parts from subprojects/libvhost-user/libvhost-user.h */
+
+#ifndef VHOST_USER_H
+#define VHOST_USER_H
+
+#include "virtio.h"
+#include "iov.h"
+
+#define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+#define VHOST_MEMORY_BASELINE_NREGIONS 8
+
+enum vhost_user_protocol_feature {
+	VHOST_USER_PROTOCOL_F_MQ = 0,
+	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
+	VHOST_USER_PROTOCOL_F_RARP = 2,
+	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
+	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
+	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
+	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
+	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
+	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
+	VHOST_USER_PROTOCOL_F_CONFIG = 9,
+	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
+	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
+	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
+	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+
+	VHOST_USER_PROTOCOL_F_MAX
+};
+
+enum vhost_user_request {
+	VHOST_USER_NONE = 0,
+	VHOST_USER_GET_FEATURES = 1,
+	VHOST_USER_SET_FEATURES = 2,
+	VHOST_USER_SET_OWNER = 3,
+	VHOST_USER_RESET_OWNER = 4,
+	VHOST_USER_SET_MEM_TABLE = 5,
+	VHOST_USER_SET_LOG_BASE = 6,
+	VHOST_USER_SET_LOG_FD = 7,
+	VHOST_USER_SET_VRING_NUM = 8,
+	VHOST_USER_SET_VRING_ADDR = 9,
+	VHOST_USER_SET_VRING_BASE = 10,
+	VHOST_USER_GET_VRING_BASE = 11,
+	VHOST_USER_SET_VRING_KICK = 12,
+	VHOST_USER_SET_VRING_CALL = 13,
+	VHOST_USER_SET_VRING_ERR = 14,
+	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+	VHOST_USER_GET_QUEUE_NUM = 17,
+	VHOST_USER_SET_VRING_ENABLE = 18,
+	VHOST_USER_SEND_RARP = 19,
+	VHOST_USER_NET_SET_MTU = 20,
+	VHOST_USER_SET_BACKEND_REQ_FD = 21,
+	VHOST_USER_IOTLB_MSG = 22,
+	VHOST_USER_SET_VRING_ENDIAN = 23,
+	VHOST_USER_GET_CONFIG = 24,
+	VHOST_USER_SET_CONFIG = 25,
+	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
+	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
+	VHOST_USER_POSTCOPY_ADVISE  = 28,
+	VHOST_USER_POSTCOPY_LISTEN  = 29,
+	VHOST_USER_POSTCOPY_END     = 30,
+	VHOST_USER_GET_INFLIGHT_FD = 31,
+	VHOST_USER_SET_INFLIGHT_FD = 32,
+	VHOST_USER_GPU_SET_SOCKET = 33,
+	VHOST_USER_VRING_KICK = 35,
+	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
+	VHOST_USER_ADD_MEM_REG = 37,
+	VHOST_USER_REM_MEM_REG = 38,
+	VHOST_USER_MAX
+};
+
+typedef struct {
+	enum vhost_user_request request;
+
+#define VHOST_USER_VERSION_MASK     0x3
+#define VHOST_USER_REPLY_MASK       (0x1 << 2)
+#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
+	uint32_t flags;
+	uint32_t size; /* the following payload size */
+} __attribute__ ((__packed__)) vhost_user_header;
+
+typedef struct VhostUserMemory_region {
+	uint64_t guest_phys_addr;
+	uint64_t memory_size;
+	uint64_t userspace_addr;
+	uint64_t mmap_offset;
+} VhostUserMemory_region;
+
+struct VhostUserMemory {
+	uint32_t nregions;
+	uint32_t padding;
+	struct VhostUserMemory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
+};
+
+typedef union {
+#define VHOST_USER_VRING_IDX_MASK   0xff
+#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
+	uint64_t u64;
+	struct vhost_vring_state state;
+	struct vhost_vring_addr addr;
+	struct VhostUserMemory memory;
+} vhost_user_payload;
+
+typedef struct VhostUserMsg {
+	vhost_user_header hdr;
+	vhost_user_payload payload;
+
+	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
+	int fd_num;
+	uint8_t *data;
+} __attribute__ ((__packed__)) VhostUserMsg;
+#define VHOST_USER_HDR_SIZE sizeof(vhost_user_header)
+
+#define VHOST_USER_RX_QUEUE 0
+#define VHOST_USER_TX_QUEUE 1
+
+static inline bool vu_queue_enabled(const VuVirtq *vq)
+{
+	return vq->enable;
+}
+
+static inline bool vu_queue_started(const VuVirtq *vq)
+{
+	return vq->started;
+}
+
+size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov,
+			  size_t n);
+int vu_send(const struct ctx *c, const void *data, size_t len);
+void vu_print_capabilities(void);
+void vu_init(struct ctx *c);
+void vu_kick_cb(struct ctx *c, union epoll_ref ref);
+void tap_handler_vu(struct ctx *c, uint32_t events);
+#endif /* VHOST_USER_H */
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 20/24] vhost-user: add vhost-user
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (18 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 19/24] vhost-user: introduce vhost-user API Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-07  2:40   ` David Gibson
  2024-02-11 23:19   ` Stefano Brivio
  2024-02-02 14:11 ` [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx() Laurent Vivier
                   ` (3 subsequent siblings)
  23 siblings, 2 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
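Note on the tcp.c and udp.c hunks further down: the L4 checksum is now
filled in only when the mode is not MODE_VU, or when packet capture is
enabled. The stand-alone sketch below restates that condition outside of
passt. The names sketch_mode and sketch_need_l4_csum() are illustrative
only and not part of this patch; pcap_path stands in for c->pcap, which
the patch tests with *c->pcap (a non-empty string).

```c
#include <stdbool.h>

/* Illustrative stand-in for enum passt_modes; not the passt definition */
enum sketch_mode {
	SKETCH_MODE_PASST,
	SKETCH_MODE_PASTA,
	SKETCH_MODE_VU,
};

/*
 * Hypothetical helper mirroring the condition "c->mode != MODE_VU || *c->pcap":
 * compute the L4 checksum unless we are in vhost-user mode with no capture
 * file configured (empty path).
 */
static bool sketch_need_l4_csum(enum sketch_mode mode, const char *pcap_path)
{
	return mode != SKETCH_MODE_VU || *pcap_path != '\0';
}
```

With this reading, only the combination of vhost-user mode and an empty
capture path leaves the checksum field at zero; every other combination
behaves as before the patch.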
 conf.c  | 20 ++++++++++++++--
 passt.c |  7 ++++++
 passt.h |  1 +
 tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
 tcp.c   |  8 +++++--
 udp.c   |  6 +++--
 6 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/conf.c b/conf.c
index b6a2a1f0fdc3..40aa9519f8a6 100644
--- a/conf.c
+++ b/conf.c
@@ -44,6 +44,7 @@
 #include "lineread.h"
 #include "isolation.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /**
  * next_chunk - Return the next piece of a string delimited by a character
@@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
 		info(   "  -I, --ns-ifname NAME	namespace interface name");
 		info(   "    default: same interface name as external one");
 	} else {
-		info(   "  -s, --socket PATH	UNIX domain socket path");
+		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");
 		info(   "    default: probe free path starting from "
 		     UNIX_SOCK_PATH, 1);
+		info(   "  --vhost-user		Enable vhost-user mode");
+		info(   "    UNIX domain socket is provided by -s option");
+		info(   "  --print-capabilities	Print back-end capabilities in JSON format");
 	}
 
 	info(   "  -F, --fd FD		Use FD as pre-opened connected socket");
@@ -1123,6 +1127,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"help",	no_argument,		NULL,		'h' },
 		{"socket",	required_argument,	NULL,		's' },
 		{"fd",		required_argument,	NULL,		'F' },
+		{"socket-path",	required_argument,	NULL,		's' }, /* vhost-user mandatory */
 		{"ns-ifname",	required_argument,	NULL,		'I' },
 		{"pcap",	required_argument,	NULL,		'p' },
 		{"pid",		required_argument,	NULL,		'P' },
@@ -1169,6 +1174,8 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"config-net",	no_argument,		NULL,		17 },
 		{"no-copy-routes", no_argument,		NULL,		18 },
 		{"no-copy-addrs", no_argument,		NULL,		19 },
+		{"vhost-user",	no_argument,		NULL,		20 },
+		{"print-capabilities", no_argument,	NULL,		21 }, /* vhost-user mandatory */
 		{ 0 },
 	};
 	char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
@@ -1328,7 +1335,6 @@ void conf(struct ctx *c, int argc, char **argv)
 				       sizeof(c->ip6.ifname_out), "%s", optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
 				die("Invalid interface name: %s", optarg);
-
 			break;
 		case 17:
 			if (c->mode != MODE_PASTA)
@@ -1350,6 +1356,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			warn("--no-copy-addrs will be dropped soon");
 			c->no_copy_addrs = copy_addrs_opt = true;
 			break;
+		case 20:
+			if (c->mode == MODE_PASTA) {
+				err("--vhost-user is for passt mode only");
+				usage(argv[0]);
+			}
+			c->mode = MODE_VU;
+			break;
+		case 21:
+			vu_print_capabilities();
+			break;
 		case 'd':
 			if (c->debug)
 				die("Multiple --debug options given");
diff --git a/passt.c b/passt.c
index 95034d73381f..952aded12848 100644
--- a/passt.c
+++ b/passt.c
@@ -282,6 +282,7 @@ int main(int argc, char **argv)
 	quit_fd = pasta_netns_quit_init(&c);
 
 	tap_sock_init(&c);
+	vu_init(&c);
 
 	secret_init(&c);
 
@@ -399,6 +400,12 @@ loop:
 		case EPOLL_TYPE_ICMPV6:
 			icmp_sock_handler(&c, AF_INET6, ref);
 			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			tap_handler_vu(&c, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(&c, ref);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 6ed1d0b19e82..4e0100d51a4d 100644
--- a/passt.h
+++ b/passt.h
@@ -141,6 +141,7 @@ struct fqdn {
 enum passt_modes {
 	MODE_PASST,
 	MODE_PASTA,
+	MODE_VU,
 };
 
 /**
diff --git a/tap.c b/tap.c
index 936206e53637..c2a917bc00ca 100644
--- a/tap.c
+++ b/tap.c
@@ -57,6 +57,7 @@
 #include "packet.h"
 #include "tap.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
@@ -75,19 +76,22 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
  */
 int tap_send(const struct ctx *c, const void *data, size_t len)
 {
-	pcap(data, len);
+	int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
+	uint32_t vnet_len = htonl(len);
 
-	if (c->mode == MODE_PASST) {
-		int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
-		uint32_t vnet_len = htonl(len);
+	pcap(data, len);
 
+	switch (c->mode) {
+	case MODE_PASST:
 		if (send(c->fd_tap, &vnet_len, 4, flags) < 0)
 			return -1;
-
 		return send(c->fd_tap, data, len, flags);
+	case MODE_PASTA:
+		return write(c->fd_tap, (char *)data, len);
+	case MODE_VU:
+		return vu_send(c, data, len);
 	}
-
-	return write(c->fd_tap, (char *)data, len);
+	return 0;
 }
 
 /**
@@ -428,10 +432,20 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
 	if (!n)
 		return 0;
 
-	if (c->mode == MODE_PASTA)
+	switch (c->mode) {
+	case MODE_PASTA:
 		m = tap_send_frames_pasta(c, iov, n);
-	else
+		break;
+	case MODE_PASST:
 		m = tap_send_frames_passt(c, iov, n);
+		break;
+	case MODE_VU:
+		m = tap_send_frames_vu(c, iov, n);
+		break;
+	default:
+		m = 0;
+		break;
+	}
 
 	if (m < n)
 		debug("tap: failed to send %zu frames of %zu", n - m, n);
@@ -1149,11 +1163,17 @@ static void tap_sock_unix_init(struct ctx *c)
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
 
-	info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
-	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
-	     addr.sun_path);
-	info("or qrap, for earlier qemu versions:");
-	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	if (c->mode == MODE_VU) {
+		info("You can start qemu with:");
+		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
+		     addr.sun_path);
+	} else {
+		info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
+		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
+		     addr.sun_path);
+		info("or qrap, for earlier qemu versions:");
+		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	}
 }
 
 /**
@@ -1163,7 +1183,7 @@ static void tap_sock_unix_init(struct ctx *c)
  */
 void tap_listen_handler(struct ctx *c, uint32_t events)
 {
-	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
+	union epoll_ref ref;
 	struct epoll_event ev = { 0 };
 	int v = INT_MAX / 2;
 	struct ucred ucred;
@@ -1204,7 +1224,13 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+	if (c->mode == MODE_VU) {
+		ref.type = EPOLL_TYPE_VHOST_CMD;
+		ev.events = EPOLLIN | EPOLLRDHUP;
+	} else {
+		ref.type = EPOLL_TYPE_TAP_PASST;
+		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
+	}
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1288,12 +1314,21 @@ void tap_sock_init(struct ctx *c)
 
 		ASSERT(c->one_off);
 		ref.fd = c->fd_tap;
-		if (c->mode == MODE_PASST)
+		switch (c->mode) {
+		case MODE_PASST:
 			ref.type = EPOLL_TYPE_TAP_PASST;
-		else
+			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+			break;
+		case MODE_PASTA:
 			ref.type = EPOLL_TYPE_TAP_PASTA;
+			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+			break;
+		case MODE_VU:
+			ref.type = EPOLL_TYPE_VHOST_CMD;
+			ev.events = EPOLLIN | EPOLLRDHUP;
+			break;
+		}
 
-		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
 		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 		return;
diff --git a/tcp.c b/tcp.c
index 54c15087d678..b6aca9f37f19 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1033,7 +1033,9 @@ size_t ipv4_fill_headers(const struct ctx *c,
 
 	tcp_set_tcp_header(th, conn, seq);
 
-	th->check = tcp_update_check_tcp4(iph);
+	th->check = 0;
+	if (c->mode != MODE_VU || *c->pcap)
+		th->check = tcp_update_check_tcp4(iph);
 
 	return ip_len;
 }
@@ -1069,7 +1071,9 @@ size_t ipv6_fill_headers(const struct ctx *c,
 
 	tcp_set_tcp_header(th, conn, seq);
 
-	th->check = tcp_update_check_tcp6(ip6h);
+	th->check = 0;
+	if (c->mode != MODE_VU || *c->pcap)
+		th->check = tcp_update_check_tcp6(ip6h);
 
 	ip6h->hop_limit = 255;
 	ip6h->version = 6;
diff --git a/udp.c b/udp.c
index a189c2e0b5a2..799a10989a91 100644
--- a/udp.c
+++ b/udp.c
@@ -671,8 +671,10 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
 	uh->source = s_in6->sin6_port;
 	uh->dest = htons(dstport);
 	uh->len = ip6h->payload_len;
-	uh->check = csum(uh, ntohs(ip6h->payload_len),
-			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
+	uh->check = 0;
+	if (c->mode != MODE_VU || *c->pcap)
+		uh->check = csum(uh, ntohs(ip6h->payload_len),
+				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
 	ip6h->version = 6;
 	ip6h->nexthdr = IPPROTO_UDP;
 	ip6h->hop_limit = 255;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (19 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 20/24] vhost-user: add vhost-user Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-09  4:26   ` David Gibson
  2024-02-02 14:11 ` [PATCH 22/24] tcp: vhost-user RX nocopy Laurent Vivier
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Use the guest buffers directly in vu_handle_tx() instead of copying them
to pkt_buf, and check that each buffer address lies within the mmap'ed
guest memory.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
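For reference, the range check added in vhost_user.c walks an array of
memory regions terminated by an entry whose mmap_addr is 0, and accepts a
buffer only if it falls inside one of them. A minimal stand-alone sketch
of that lookup follows. struct sketch_region and sketch_check_range() are
illustrative names, not passt code; the sketch uses integer addresses
rather than pointers, and it keeps the strict '<' comparison used in the
patch, so a buffer ending exactly at the end of a region is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for VuDevRegion: only the fields the check reads */
struct sketch_region {
	uint64_t mmap_addr;	/* start of the mapping, 0 terminates the array */
	uint64_t mmap_offset;	/* offset of the region inside the mapping */
	uint64_t size;		/* size of the region */
};

/*
 * Return 0 if [start, start + offset + len) lies inside one region of the
 * zero-terminated array @r, -1 otherwise.
 */
static int sketch_check_range(const struct sketch_region *r, uint64_t start,
			      size_t offset, size_t len)
{
	for (; r->mmap_addr; r++) {
		uint64_t end = r->mmap_addr + r->mmap_offset + r->size;

		if (r->mmap_addr <= start && start + offset + len < end)
			return 0;
	}

	return -1;
}
```

The terminator entry is what lets the pool keep buf_size at 0 and pass
only the base of the region array, as tap_sock_update_buf() does in this
patch.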
 packet.c     |  6 +++++
 packet.h     |  2 ++
 tap.c        | 39 +++++++++++++++++++++++++++----
 tap.h        |  1 +
 vhost_user.c | 66 ++++++++++++++++++++++++++++++++--------------------
 5 files changed, 84 insertions(+), 30 deletions(-)

diff --git a/packet.c b/packet.c
index af2a539a1794..3c5fc39df6d7 100644
--- a/packet.c
+++ b/packet.c
@@ -25,6 +25,12 @@
 static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 			      const char *start, const char *func, int line)
 {
+	ASSERT(p->buf);
+
+	if (p->buf_size == 0)
+		return vu_packet_check_range((void *)p->buf, offset, len, start,
+					     func, line);
+
 	if (start < p->buf) {
 		if (func) {
 			trace("add packet start %p before buffer start %p, "
diff --git a/packet.h b/packet.h
index 8377dcf678bb..0aec6d9410aa 100644
--- a/packet.h
+++ b/packet.h
@@ -22,6 +22,8 @@ struct pool {
 	struct iovec pkt[1];
 };
 
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start, const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/tap.c b/tap.c
index c2a917bc00ca..930e48689497 100644
--- a/tap.c
+++ b/tap.c
@@ -626,7 +626,7 @@ resume:
 		if (!eh)
 			continue;
 		if (ntohs(eh->h_proto) == ETH_P_ARP) {
-			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
+			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
 
 			packet_add(pkt, l2_len, (char *)eh);
 			arp(c, pkt);
@@ -656,7 +656,7 @@ resume:
 			continue;
 
 		if (iph->protocol == IPPROTO_ICMP) {
-			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
+			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
 
 			if (c->no_icmp)
 				continue;
@@ -675,7 +675,7 @@ resume:
 			continue;
 
 		if (iph->protocol == IPPROTO_UDP) {
-			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
+			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
 
 			packet_add(pkt, l2_len, (char *)eh);
 			if (dhcp(c, pkt))
@@ -815,7 +815,7 @@ resume:
 		}
 
 		if (proto == IPPROTO_ICMPV6) {
-			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
+			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
 
 			if (c->no_icmp)
 				continue;
@@ -839,7 +839,7 @@ resume:
 		uh = (struct udphdr *)l4h;
 
 		if (proto == IPPROTO_UDP) {
-			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
+			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
 
 			packet_add(pkt, l4_len, l4h);
 
@@ -1291,6 +1291,23 @@ static void tap_sock_tun_init(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
 
+void tap_sock_update_buf(void *base, size_t size)
+{
+	int i;
+
+	pool_tap4_storage.buf = base;
+	pool_tap4_storage.buf_size = size;
+	pool_tap6_storage.buf = base;
+	pool_tap6_storage.buf_size = size;
+
+	for (i = 0; i < TAP_SEQS; i++) {
+		tap4_l4[i].p.buf = base;
+		tap4_l4[i].p.buf_size = size;
+		tap6_l4[i].p.buf = base;
+		tap6_l4[i].p.buf_size = size;
+	}
+}
+
 /**
  * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
  * @c:		Execution context
@@ -1302,10 +1319,22 @@ void tap_sock_init(struct ctx *c)
 
 	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
 	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
+	if (c->mode == MODE_VU) {
+		pool_tap4_storage.buf = NULL;
+		pool_tap4_storage.buf_size = 0;
+		pool_tap6_storage.buf = NULL;
+		pool_tap6_storage.buf_size = 0;
+	}
 
 	for (i = 0; i < TAP_SEQS; i++) {
 		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
 		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
+		if (c->mode == MODE_VU) {
+			tap4_l4[i].p.buf = NULL;
+			tap4_l4[i].p.buf_size = 0;
+			tap6_l4[i].p.buf = NULL;
+			tap6_l4[i].p.buf_size = 0;
+		}
 	}
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
diff --git a/tap.h b/tap.h
index ee839d4f09dc..6823c9b32313 100644
--- a/tap.h
+++ b/tap.h
@@ -82,6 +82,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 void tap_sock_reset(struct ctx *c);
+void tap_sock_update_buf(void *base, size_t size);
 void tap_sock_init(struct ctx *c);
 void pool_flush_all(void);
 void tap_handler_all(struct ctx *c, const struct timespec *now);
diff --git a/vhost_user.c b/vhost_user.c
index 2acd72398e3a..9cc07c8312c0 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -334,6 +334,25 @@ static bool map_ring(VuDev *vdev, VuVirtq *vq)
 	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
 }
 
+int vu_packet_check_range(void *buf, size_t offset, size_t len, const char *start,
+			  const char *func, int line)
+{
+	VuDevRegion *dev_region;
+
+	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
+		if ((char *)dev_region->mmap_addr <= start &&
+		    start + offset + len < (char *)dev_region->mmap_addr +
+					   dev_region->mmap_offset +
+					   dev_region->size)
+			return 0;
+	}
+	if (func) {
+		trace("cannot find region, %s:%i", func, line);
+	}
+
+	return -1;
+}
+
 /*
  * #syscalls:passt mmap munmap
  */
@@ -400,6 +419,12 @@ static bool vu_set_mem_table_exec(VuDev *vdev,
 		}
 	}
 
+	/* XXX */
+	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
+	vdev->regions[vdev->nregions].mmap_addr = 0; /* mark EOF for vu_packet_check_range() */
+
+	tap_sock_update_buf(vdev->regions, 0);
+
 	return false;
 }
 
@@ -650,8 +675,8 @@ static void vu_handle_tx(VuDev *vdev, int index)
 	VuVirtq *vq = &vdev->vq[index];
 	int hdrlen = vdev->hdrlen;
 	struct timespec now;
-	char *p;
-	size_t n;
+	unsigned int indexes[VIRTQUEUE_MAX_SIZE];
+	int count;
 
 	if (index % 2 != VHOST_USER_TX_QUEUE) {
 		debug("index %d is not an TX queue", index);
@@ -660,14 +685,11 @@ static void vu_handle_tx(VuDev *vdev, int index)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
-	p = pkt_buf;
-
 	pool_flush_all();
 
+	count = 0;
 	while (1) {
 		VuVirtqElement *elem;
-		unsigned int out_num;
-		struct iovec sg[VIRTQUEUE_MAX_SIZE], *out_sg;
 
 		ASSERT(index == VHOST_USER_TX_QUEUE);
 		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer[index]);
@@ -675,32 +697,26 @@ static void vu_handle_tx(VuDev *vdev, int index)
 			break;
 		}
 
-		out_num = elem->out_num;
-		out_sg = elem->out_sg;
-		if (out_num < 1) {
+		if (elem->out_num < 1) {
 			debug("virtio-net header not in first element");
 			break;
 		}
+		ASSERT(elem->out_num == 1);
 
-		if (hdrlen) {
-			unsigned sg_num;
-
-			sg_num = iov_copy(sg, ARRAY_SIZE(sg), out_sg, out_num,
-					  hdrlen, -1);
-			out_num = sg_num;
-			out_sg = sg;
-		}
-
-		n = iov_to_buf(out_sg, out_num, 0, p, TAP_BUF_FILL);
-
-		packet_add_all(c, n, p);
-
-		p += n;
+		packet_add_all(c, elem->out_sg[0].iov_len - hdrlen,
+			       (char *)elem->out_sg[0].iov_base + hdrlen);
+		indexes[count] = elem->index;
+		count++;
+	}
+	tap_handler_all(c, &now);
 
-		vu_queue_push(vdev, vq, elem, 0);
+	if (count) {
+		int i;
+		for (i = 0; i < count; i++)
+			vu_queue_fill_by_index(vdev, vq, indexes[i], 0, i);
+		vu_queue_flush(vdev, vq, count);
 		vu_queue_notify(vdev, vq);
 	}
-	tap_handler_all(c, &now);
 }
 
 void vu_kick_cb(struct ctx *c, union epoll_ref ref)
-- 
2.42.0



* [PATCH 22/24] tcp: vhost-user RX nocopy
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (20 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx() Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-09  4:57   ` David Gibson
  2024-02-02 14:11 ` [PATCH 23/24] udp: " Laurent Vivier
  2024-02-02 14:11 ` [PATCH 24/24] vhost-user: remove tap_send_frames_vu() Laurent Vivier
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
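Note on the pcap path: in vhost-user mode the TCP checksum is only filled
in when a capture file is configured, and a segment can then span several
guest buffers, so the sum is taken over an iovec array (csum_iov() in the
diff below). The following is a minimal sketch of that idea, not the helper
used by this series: sketch_csum_iov() is a hypothetical name, it sums byte
by byte for clarity rather than speed, and it returns the checksum in host
order, so a caller would apply htons() before storing it in a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Illustrative only: sketch_csum_iov() is a hypothetical name, it is not
 * the csum_iov() helper this series calls.
 *
 * Internet (one's complement) checksum over scattered buffers. @init is a
 * partial sum, e.g. over a pseudo-header, accumulated the same way.
 */
uint16_t sketch_csum_iov(const struct iovec *iov, size_t n, uint32_t init)
{
	uint64_t sum = init;
	int odd = 0;	/* next byte is the low byte of a 16-bit word */
	size_t i, j;

	assert(iov || !n);

	for (i = 0; i < n; i++) {
		const uint8_t *p = iov[i].iov_base;

		/* The byte position is tracked across buffers, so a buffer
		 * ending on an odd offset does not misalign the next one.
		 */
		for (j = 0; j < iov[i].iov_len; j++) {
			sum += odd ? (uint32_t)p[j] : (uint32_t)p[j] << 8;
			odd = !odd;
		}
	}

	/* Fold the carries back into the low 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Splitting the same bytes differently across buffers must not change the
result, which is the property the multi-buffer case depends on.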
 Makefile |   6 +-
 tcp.c    |  66 +++++---
 tcp_vu.c | 447 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_vu.h |  10 ++
 4 files changed, 502 insertions(+), 27 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h

diff --git a/Makefile b/Makefile
index 2016b071ddf2..f7a403d19b61 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
-	tcp_buf.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
+	tcp_buf.c tcp_vu.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -56,8 +56,8 @@ MANPAGES = passt.1 pasta.1 qrap.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
-	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
-	util.h iov.h ip.h virtio.h vhost_user.h
+	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_vu.h tcp_internal.h \
+	udp.h util.h iov.h ip.h virtio.h vhost_user.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/tcp.c b/tcp.c
index b6aca9f37f19..e829e12fe7c2 100644
--- a/tcp.c
+++ b/tcp.c
@@ -302,6 +302,7 @@
 #include "flow_table.h"
 #include "tcp_internal.h"
 #include "tcp_buf.h"
+#include "tcp_vu.h"
 
 /* Sides of a flow as we use them in "tap" connections */
 #define	SOCKSIDE	0
@@ -1034,7 +1035,7 @@ size_t ipv4_fill_headers(const struct ctx *c,
 	tcp_set_tcp_header(th, conn, seq);
 
 	th->check = 0;
-	if (c->mode != MODE_VU || *c->pcap)
+	if (c->mode != MODE_VU)
 		th->check = tcp_update_check_tcp4(iph);
 
 	return ip_len;
@@ -1072,7 +1073,7 @@ size_t ipv6_fill_headers(const struct ctx *c,
 	tcp_set_tcp_header(th, conn, seq);
 
 	th->check = 0;
-	if (c->mode != MODE_VU || *c->pcap)
+	if (c->mode != MODE_VU)
 		th->check = tcp_update_check_tcp6(ip6h);
 
 	ip6h->hop_limit = 255;
@@ -1302,6 +1303,12 @@ int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
 	return 1;
 }
 
+int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	if (c->mode == MODE_VU)
+		return tcp_vu_send_flag(c, conn, flags);
+	return tcp_buf_send_flag(c, conn, flags);
+}
 
 /**
  * tcp_rst_do() - Reset a tap connection: send RST segment to tap, close socket
@@ -1313,7 +1320,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
 	if (conn->events == CLOSED)
 		return;
 
-	if (!tcp_buf_send_flag(c, conn, RST))
+	if (!tcp_send_flag(c, conn, RST))
 		conn_event(c, conn, CLOSED);
 }
 
@@ -1430,7 +1437,8 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
  *
  * Return: clamped MSS value
  */
-static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
+static uint16_t tcp_conn_tap_mss(const struct ctx *c,
+				 const struct tcp_tap_conn *conn,
 				 const char *opts, size_t optlen)
 {
 	unsigned int mss;
@@ -1441,7 +1449,10 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
 	else
 		mss = ret;
 
-	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
+	if (c->mode == MODE_VU)
+		mss = MIN(tcp_vu_conn_tap_mss(conn), mss);
+	else
+		mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
 
 	return MIN(mss, USHRT_MAX);
 }
@@ -1568,7 +1579,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 
 	conn->wnd_to_tap = WINDOW_DEFAULT;
 
-	mss = tcp_conn_tap_mss(conn, opts, optlen);
+	mss = tcp_conn_tap_mss(c, conn, opts, optlen);
 	if (setsockopt(s, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)))
 		flow_trace(conn, "failed to set TCP_MAXSEG on socket %i", s);
 	MSS_SET(conn, mss);
@@ -1625,7 +1636,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 	} else {
 		tcp_get_sndbuf(conn);
 
-		if (tcp_buf_send_flag(c, conn, SYN | ACK))
+		if (tcp_send_flag(c, conn, SYN | ACK))
 			return;
 
 		conn_event(c, conn, TAP_SYN_ACK_SENT);
@@ -1673,6 +1684,13 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
 	return 0;
 }
 
+static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	if (c->mode == MODE_VU)
+		return tcp_vu_data_from_sock(c, conn);
+
+	return tcp_buf_data_from_sock(c, conn);
+}
 
 /**
  * tcp_data_from_tap() - tap/guest data for established connection
@@ -1806,7 +1824,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			   max_ack_seq, conn->seq_to_tap);
 		conn->seq_ack_from_tap = max_ack_seq;
 		conn->seq_to_tap = max_ack_seq;
-		tcp_buf_data_from_sock(c, conn);
+		tcp_data_from_sock(c, conn);
 	}
 
 	if (!iov_i)
@@ -1822,14 +1840,14 @@ eintr:
 			 *   Then swiftly looked away and left.
 			 */
 			conn->seq_from_tap = seq_from_tap;
-			tcp_buf_send_flag(c, conn, ACK);
+			tcp_send_flag(c, conn, ACK);
 		}
 
 		if (errno == EINTR)
 			goto eintr;
 
 		if (errno == EAGAIN || errno == EWOULDBLOCK) {
-			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
+			tcp_send_flag(c, conn, ACK_IF_NEEDED);
 			return p->count - idx;
 
 		}
@@ -1839,7 +1857,7 @@ eintr:
 	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
 		partial_send = 1;
 		conn->seq_from_tap += n;
-		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
 	} else {
 		conn->seq_from_tap += n;
 	}
@@ -1852,7 +1870,7 @@ out:
 		 */
 		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
 			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
-			tcp_buf_send_flag(c, conn, DUP_ACK);
+			tcp_send_flag(c, conn, DUP_ACK);
 		}
 		return p->count - idx;
 	}
@@ -1866,7 +1884,7 @@ out:
 
 		conn_event(c, conn, TAP_FIN_RCVD);
 	} else {
-		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
 	}
 
 	return p->count - idx;
@@ -1891,7 +1909,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	if (!(conn->wnd_from_tap >>= conn->ws_from_tap))
 		conn->wnd_from_tap = 1;
 
-	MSS_SET(conn, tcp_conn_tap_mss(conn, opts, optlen));
+	MSS_SET(conn, tcp_conn_tap_mss(c, conn, opts, optlen));
 
 	conn->seq_init_from_tap = ntohl(th->seq) + 1;
 	conn->seq_from_tap = conn->seq_init_from_tap;
@@ -1902,8 +1920,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
 	 */
-	tcp_buf_data_from_sock(c, conn);
-	tcp_buf_send_flag(c, conn, ACK);
+	tcp_data_from_sock(c, conn);
+	tcp_send_flag(c, conn, ACK);
 }
 
 /**
@@ -1983,7 +2001,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 			conn->seq_from_tap++;
 
 			shutdown(conn->sock, SHUT_WR);
-			tcp_buf_send_flag(c, conn, ACK);
+			tcp_send_flag(c, conn, ACK);
 			conn_event(c, conn, SOCK_FIN_SENT);
 
 			return 1;
@@ -1994,7 +2012,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 
 		tcp_tap_window_update(conn, ntohs(th->window));
 
-		tcp_buf_data_from_sock(c, conn);
+		tcp_data_from_sock(c, conn);
 
 		if (p->count - idx == 1)
 			return 1;
@@ -2024,7 +2042,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
 		shutdown(conn->sock, SHUT_WR);
 		conn_event(c, conn, SOCK_FIN_SENT);
-		tcp_buf_send_flag(c, conn, ACK);
+		tcp_send_flag(c, conn, ACK);
 		ack_due = 0;
 	}
 
@@ -2058,7 +2076,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 		return;
 	}
 
-	if (tcp_buf_send_flag(c, conn, SYN | ACK))
+	if (tcp_send_flag(c, conn, SYN | ACK))
 		return;
 
 	conn_event(c, conn, TAP_SYN_ACK_SENT);
@@ -2126,7 +2144,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
 
 	conn->wnd_from_tap = WINDOW_DEFAULT;
 
-	tcp_buf_send_flag(c, conn, SYN);
+	tcp_send_flag(c, conn, SYN);
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	tcp_get_sndbuf(conn);
@@ -2190,7 +2208,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 		return;
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
-		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
+		tcp_send_flag(c, conn, ACK_IF_NEEDED);
 		tcp_timer_ctl(c, conn);
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		if (!(conn->events & ESTABLISHED)) {
@@ -2206,7 +2224,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "ACK timeout, retry");
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
-			tcp_buf_data_from_sock(c, conn);
+			tcp_data_from_sock(c, conn);
 			tcp_timer_ctl(c, conn);
 		}
 	} else {
@@ -2261,7 +2279,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 			conn_event(c, conn, SOCK_FIN_RCVD);
 
 		if (events & EPOLLIN)
-			tcp_buf_data_from_sock(c, conn);
+			tcp_data_from_sock(c, conn);
 
 		if (events & EPOLLOUT)
 			tcp_update_seqack_wnd(c, conn, 0, NULL);
diff --git a/tcp_vu.c b/tcp_vu.c
new file mode 100644
index 000000000000..ed59b21cabdc
--- /dev/null
+++ b/tcp_vu.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <errno.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <netinet/ip.h>
+
+#include <sys/socket.h>
+
+#include <linux/tcp.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "siphash.h"
+#include "inany.h"
+#include "vhost_user.h"
+#include "tcp.h"
+#include "pcap.h"
+#include "flow.h"
+#include "tcp_conn.h"
+#include "flow_table.h"
+#include "tcp_vu.h"
+#include "tcp_internal.h"
+#include "checksum.h"
+
+#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
+#define CONN_V6(conn)		(!CONN_V4(conn))
+
+/* vhost-user */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static unsigned char buffer[65536];
+static struct iovec	iov_vu			[VIRTQUEUE_MAX_SIZE];
+static unsigned int	indexes			[VIRTQUEUE_MAX_SIZE];
+
+uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn)
+{
+	(void)conn;
+	return USHRT_MAX;
+}
+
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	VuDev *vdev = (VuDev *)&c->vdev;
+	VuVirtqElement *elem;
+	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct virtio_net_hdr_mrg_rxbuf *vh;
+	size_t tlen, vnet_hdrlen, ip_len, optlen = 0;
+	struct ethhdr *eh;
+	int ret;
+	int nb_ack;
+
+	elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
+	if (!elem)
+		return 0;
+
+	if (elem->in_num < 1) {
+		err("virtio-net receive queue contains no in buffers");
+		vu_queue_rewind(vdev, vq, 1);
+		return 0;
+	}
+
+	/* Options: MSS, NOP and window scale (8 bytes) */
+	if (flags & SYN)
+		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
+
+	vh = elem->in_sg[0].iov_base;
+
+	vh->hdr = vu_header;
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+		vh->num_buffers = htole16(1);
+	} else {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr);
+	}
+	eh = (struct ethhdr *)((char *)elem->in_sg[0].iov_base + vnet_hdrlen);
+
+	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+	if (CONN_V4(conn)) {
+		struct iphdr *iph = (struct iphdr *)(eh + 1);
+		struct tcphdr *th = (struct tcphdr *)(iph + 1);
+		char *data = (char *)(th + 1);
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		*th = (struct tcphdr){
+			.doff = sizeof(struct tcphdr) / 4,
+			.ack = 1
+		};
+
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+
+		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vdev, vq, 1);
+			return ret;
+		}
+
+		ip_len = ipv4_fill_headers(c, conn, iph, optlen, NULL,
+					   conn->seq_to_tap);
+
+		tlen = ip_len + sizeof(struct ethhdr);
+
+		if (*c->pcap) {
+			uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
+
+			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
+		}
+	} else {
+		struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
+		struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
+		char *data = (char *)(th + 1);
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		*th = (struct tcphdr){
+			.doff = sizeof(struct tcphdr) / 4,
+			.ack = 1
+		};
+
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vdev, vq, 1);
+			return ret;
+		}
+
+		ip_len = ipv6_fill_headers(c, conn, ip6h, optlen,
+					   conn->seq_to_tap);
+
+		tlen = ip_len + sizeof(struct ethhdr);
+
+		if (*c->pcap) {
+			uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
+
+			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
+		}
+	}
+
+	pcap((void *)eh, tlen);
+
+	tlen += vnet_hdrlen;
+	vu_queue_fill(vdev, vq, elem, tlen, 0);
+	nb_ack = 1;
+
+	if (flags & DUP_ACK) {
+		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
+		if (elem) {
+			if (elem->in_num < 1 || elem->in_sg[0].iov_len < tlen) {
+				vu_queue_rewind(vdev, vq, 1);
+			} else {
+				memcpy(elem->in_sg[0].iov_base, vh, tlen);
+				nb_ack++;
+			}
+		}
+	}
+
+	vu_queue_flush(vdev, vq, nb_ack);
+	vu_queue_notify(vdev, vq);
+
+	return 0;
+}
+
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
+	uint32_t already_sent;
+	VuDev *vdev = (VuDev *)&c->vdev;
+	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int s = conn->sock, v4 = CONN_V4(conn);
+	int i, ret = 0, iov_count, iov_used;
+	struct msghdr mh_sock = { 0 };
+	size_t l2_hdrlen, vnet_hdrlen, fillsize;
+	ssize_t len;
+	uint16_t *check;
+	uint16_t mss = MSS_GET(conn);
+	int num_buffers;
+	int segment_size;
+	struct iovec *first;
+	bool has_mrg_rxbuf;
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		err("Got packet, but RX virtqueue not usable yet");
+		return 0;
+	}
+
+	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
+
+	if (SEQ_LT(already_sent, 0)) {
+		/* RFC 761, section 2.1. */
+		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
+			   conn->seq_ack_from_tap, conn->seq_to_tap);
+		conn->seq_to_tap = conn->seq_ack_from_tap;
+		already_sent = 0;
+	}
+
+	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
+		return 0;
+	}
+
+	/* Set up buffer descriptors we'll fill completely and partially. */
+
+	fillsize = wnd_scaled;
+
+	iov_vu[0].iov_base = tcp_buf_discard;
+	iov_vu[0].iov_len = already_sent;
+	fillsize -= already_sent;
+
+	has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
+	if (has_mrg_rxbuf) {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	} else {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr);
+	}
+	l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct tcphdr);
+	if (v4) {
+		l2_hdrlen += sizeof(struct iphdr);
+	} else {
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	}
+
+	iov_count = 0;
+	segment_size = 0;
+	while (fillsize > 0 && iov_count < VIRTQUEUE_MAX_SIZE - 1) {
+		VuVirtqElement *elem;
+
+		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
+		if (!elem)
+			break;
+
+		if (elem->in_num < 1) {
+			err("virtio-net receive queue contains no in buffers");
+			goto err;
+		}
+
+		ASSERT(elem->in_num == 1);
+		ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
+
+		indexes[iov_count] = elem->index;
+
+		if (segment_size == 0) {
+			iov_vu[iov_count + 1].iov_base =
+					(char *)elem->in_sg[0].iov_base + l2_hdrlen;
+			iov_vu[iov_count + 1].iov_len =
+					elem->in_sg[0].iov_len - l2_hdrlen;
+		} else {
+			iov_vu[iov_count + 1].iov_base = elem->in_sg[0].iov_base;
+			iov_vu[iov_count + 1].iov_len = elem->in_sg[0].iov_len;
+		}
+
+		if (iov_vu[iov_count + 1].iov_len > fillsize)
+			iov_vu[iov_count + 1].iov_len = fillsize;
+
+		segment_size += iov_vu[iov_count + 1].iov_len;
+		if (!has_mrg_rxbuf) {
+			segment_size = 0;
+		} else if (segment_size >= mss) {
+			iov_vu[iov_count + 1].iov_len -= segment_size - mss;
+			segment_size = 0;
+		}
+		fillsize -= iov_vu[iov_count + 1].iov_len;
+
+		iov_count++;
+	}
+	if (iov_count == 0)
+		return 0;
+
+	mh_sock.msg_iov = iov_vu;
+	mh_sock.msg_iovlen = iov_count + 1;
+
+	do
+		len = recvmsg(s, &mh_sock, MSG_PEEK);
+	while (len < 0 && errno == EINTR);
+
+	if (len < 0)
+		goto err;
+
+	if (!len) {
+		vu_queue_rewind(vdev, vq, iov_count);
+		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+			if ((ret = tcp_vu_send_flag(c, conn, FIN | ACK))) {
+				tcp_rst(c, conn);
+				return ret;
+			}
+
+			conn_event(c, conn, TAP_FIN_SENT);
+		}
+
+		return 0;
+	}
+
+	len -= already_sent;
+	if (len <= 0) {
+		conn_flag(c, conn, STALLED);
+		vu_queue_rewind(vdev, vq, iov_count);
+		return 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
+
+	/* Likely, some new data was acked too. */
+	tcp_update_seqack_wnd(c, conn, 0, NULL);
+
+	/* initialize headers */
+	iov_used = 0;
+	num_buffers = 0;
+	check = NULL;
+	segment_size = 0;
+	for (i = 0; i < iov_count && len; i++) {
+
+		if (segment_size == 0)
+			first = &iov_vu[i + 1];
+
+		if (iov_vu[i + 1].iov_len > (size_t)len)
+			iov_vu[i + 1].iov_len = len;
+
+		len -= iov_vu[i + 1].iov_len;
+		iov_used++;
+
+		segment_size += iov_vu[i + 1].iov_len;
+		num_buffers++;
+
+		if (segment_size >= mss || len == 0 ||
+		    i + 1 == iov_count || !has_mrg_rxbuf) {
+
+			struct ethhdr *eh;
+			struct virtio_net_hdr_mrg_rxbuf *vh;
+			char *base = (char *)first->iov_base - l2_hdrlen;
+			size_t size = first->iov_len + l2_hdrlen;
+
+			vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
+
+			vh->hdr = vu_header;
+			if (has_mrg_rxbuf)
+				vh->num_buffers = htole16(num_buffers);
+
+			eh = (struct ethhdr *)((char *)base + vnet_hdrlen);
+
+			memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+			memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+			/* initialize header */
+			if (v4) {
+				struct iphdr *iph = (struct iphdr *)(eh + 1);
+				struct tcphdr *th = (struct tcphdr *)(iph + 1);
+
+				eh->h_proto = htons(ETH_P_IP);
+
+				*th = (struct tcphdr){
+					.doff = sizeof(struct tcphdr) / 4,
+					.ack = 1
+				};
+
+				*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+
+				ipv4_fill_headers(c, conn, iph, segment_size,
+						len ? check : NULL, conn->seq_to_tap);
+
+				if (*c->pcap) {
+					uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
+
+					first->iov_base = th;
+					first->iov_len = size - l2_hdrlen + sizeof(*th);
+
+					th->check = csum_iov(first, num_buffers, sum);
+				}
+
+				check = &iph->check;
+			} else {
+				struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
+				struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
+
+				eh->h_proto = htons(ETH_P_IPV6);
+
+				*th = (struct tcphdr){
+					.doff = sizeof(struct tcphdr) / 4,
+					.ack = 1
+				};
+
+				*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+				ipv6_fill_headers(c, conn, ip6h, segment_size,
+						conn->seq_to_tap);
+				if (*c->pcap) {
+					uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
+
+					first->iov_base = th;
+					first->iov_len = size - l2_hdrlen + sizeof(*th);
+
+					th->check = csum_iov(first, num_buffers, sum);
+				}
+			}
+
+			/* set iov for pcap logging */
+			first->iov_base = eh;
+			first->iov_len = size - vnet_hdrlen;
+
+			pcap_iov(first, num_buffers);
+
+			/* set iov_len for vu_queue_fill_by_index(); */
+
+			first->iov_base = base;
+			first->iov_len = size;
+
+			conn->seq_to_tap += segment_size;
+
+			segment_size = 0;
+			num_buffers = 0;
+		}
+	}
+
+	/* release unused buffers */
+	vu_queue_rewind(vdev, vq, iov_count - iov_used);
+
+	/* send packets */
+	for (i = 0; i < iov_used; i++) {
+		vu_queue_fill_by_index(vdev, vq, indexes[i],
+				       iov_vu[i + 1].iov_len, i);
+	}
+
+	vu_queue_flush(vdev, vq, iov_used);
+	vu_queue_notify(vdev, vq);
+
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+	return 0;
+err:
+	vu_queue_rewind(vdev, vq, iov_count);
+
+	if (errno != EAGAIN && errno != EWOULDBLOCK) {
+		ret = -errno;
+		tcp_rst(c, conn);
+	}
+
+	return ret;
+}
diff --git a/tcp_vu.h b/tcp_vu.h
new file mode 100644
index 000000000000..8045a6e3edb8
--- /dev/null
+++ b/tcp_vu.h
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#ifndef TCP_VU_H
+#define TCP_VU_H
+
+uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn);
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
+
+#endif /* TCP_VU_H */
-- 
2.42.0



* [PATCH 23/24] udp: vhost-user RX nocopy
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (21 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 22/24] tcp: vhost-user RX nocopy Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-09  5:00   ` David Gibson
  2024-02-02 14:11 ` [PATCH 24/24] vhost-user: remove tap_send_frames_vu() Laurent Vivier
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
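Note on buffer accounting: udp_vu_sock_handler() below pops guest buffers
before it knows the datagram size, then, after recvmsg(), shrinks the iovec
to the received length and returns the buffers it did not need to the
queue. A minimal sketch of that trimming step follows; sketch_iov_trim() is
a hypothetical helper name, not a function from this series, and unlike the
loop in the diff it also stops at the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Illustrative only: sketch_iov_trim() is a hypothetical helper.
 *
 * Shrink @iov so that it describes exactly @len bytes and return how many
 * entries are in use. Entries past the returned count are untouched and
 * correspond to the buffers a caller would give back to the virtqueue.
 */
size_t sketch_iov_trim(struct iovec *iov, size_t count, size_t len)
{
	size_t used = 0;

	assert(iov || !count);

	while (len && used < count) {
		/* The last buffer in use is usually only partly filled */
		if (iov[used].iov_len > len)
			iov[used].iov_len = len;

		len -= iov[used].iov_len;
		used++;
	}

	return used;
}
```

With three 100-byte buffers and 150 bytes received, two entries are in use
and the second is shortened to 50 bytes. A zero-length datagram yields zero
entries in this sketch, a case a caller would have to treat separately
because the headers still need a buffer.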
 Makefile       |   4 +-
 passt.c        |   5 +-
 passt.h        |   1 +
 udp.c          |  23 +++---
 udp_internal.h |  21 +++++
 udp_vu.c       | 215 +++++++++++++++++++++++++++++++++++++++++++++++++
 udp_vu.h       |   8 ++
 7 files changed, 262 insertions(+), 15 deletions(-)
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h

diff --git a/Makefile b/Makefile
index f7a403d19b61..1d2b5dbfe085 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
 	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
 	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
-	tcp_buf.c tcp_vu.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
+	tcp_buf.c tcp_vu.c udp.c udp_vu.c util.c iov.c ip.c virtio.c vhost_user.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
 	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
 	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
 	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_vu.h tcp_internal.h \
-	udp.h util.h iov.h ip.h virtio.h vhost_user.h
+	udp.h udp_internal.h udp_vu.h util.h iov.h ip.h virtio.h vhost_user.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/passt.c b/passt.c
index 952aded12848..a5abd5c4fc03 100644
--- a/passt.c
+++ b/passt.c
@@ -392,7 +392,10 @@ loop:
 			tcp_timer_handler(&c, ref);
 			break;
 		case EPOLL_TYPE_UDP:
-			udp_buf_sock_handler(&c, ref, eventmask, &now);
+			if (c.mode == MODE_VU)
+				udp_vu_sock_handler(&c, ref, eventmask, &now);
+			else
+				udp_buf_sock_handler(&c, ref, eventmask, &now);
 			break;
 		case EPOLL_TYPE_ICMP:
 			icmp_sock_handler(&c, AF_INET, ref);
diff --git a/passt.h b/passt.h
index 4e0100d51a4d..04f4af8fd72e 100644
--- a/passt.h
+++ b/passt.h
@@ -42,6 +42,7 @@ union epoll_ref;
 #include "port_fwd.h"
 #include "tcp.h"
 #include "udp.h"
+#include "udp_vu.h"
 #include "vhost_user.h"
 
 /**
diff --git a/udp.c b/udp.c
index 799a10989a91..da67d0cfa46b 100644
--- a/udp.c
+++ b/udp.c
@@ -117,9 +117,7 @@
 #include "tap.h"
 #include "pcap.h"
 #include "log.h"
-
-#define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
-#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+#include "udp_internal.h"
 
 /**
  * struct udp_tap_port - Port tracking based on tap-facing source port
@@ -227,11 +225,11 @@ static struct mmsghdr	udp6_l2_mh_sock		[UDP_MAX_FRAMES];
 static struct iovec	udp4_iov_splice		[UDP_MAX_FRAMES];
 static struct iovec	udp6_iov_splice		[UDP_MAX_FRAMES];
 
-static struct sockaddr_in udp4_localname = {
+struct sockaddr_in udp4_localname = {
 	.sin_family = AF_INET,
 	.sin_addr = IN4ADDR_LOOPBACK_INIT,
 };
-static struct sockaddr_in6 udp6_localname = {
+struct sockaddr_in6 udp6_localname = {
 	.sin6_family = AF_INET6,
 	.sin6_addr = IN6ADDR_LOOPBACK_INIT,
 };
@@ -562,9 +560,9 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
  *
  * Return: size of tap frame with headers
  */
-static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
-			      size_t data_len, struct sockaddr_in *s_in,
-			      in_port_t dstport, const struct timespec *now)
+size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
+		       size_t data_len, struct sockaddr_in *s_in,
+		       in_port_t dstport, const struct timespec *now)
 {
 	struct udphdr *uh = (struct udphdr *)(iph + 1);
 	in_port_t src_port;
@@ -602,6 +600,7 @@ static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
 	uh->source = s_in->sin_port;
 	uh->dest = htons(dstport);
 	uh->len= htons(data_len + sizeof(struct udphdr));
+	uh->check = 0;
 
 	return ip_len;
 }
@@ -615,9 +614,9 @@ static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
  *
  * Return: size of tap frame with headers
  */
-static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
-			      size_t data_len, struct sockaddr_in6 *s_in6,
-			      in_port_t dstport, const struct timespec *now)
+size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
+		       size_t data_len, struct sockaddr_in6 *s_in6,
+		       in_port_t dstport, const struct timespec *now)
 {
 	struct udphdr *uh = (struct udphdr *)(ip6h + 1);
 	struct in6_addr *src;
@@ -672,7 +671,7 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
 	uh->dest = htons(dstport);
 	uh->len = ip6h->payload_len;
 	uh->check = 0;
-	if (c->mode != MODE_VU || *c->pcap)
+	if (c->mode != MODE_VU)
 		uh->check = csum(uh, ntohs(ip6h->payload_len),
 				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
 	ip6h->version = 6;
diff --git a/udp_internal.h b/udp_internal.h
new file mode 100644
index 000000000000..a09f3c69da42
--- /dev/null
+++ b/udp_internal.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef UDP_INTERNAL_H
+#define UDP_INTERNAL_H
+
+#define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
+extern struct sockaddr_in udp4_localname;
+extern struct sockaddr_in6 udp6_localname;
+
+size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
+		       size_t data_len, struct sockaddr_in *s_in,
+		       in_port_t dstport, const struct timespec *now);
+size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
+		       size_t data_len, struct sockaddr_in6 *s_in6,
+		       in_port_t dstport, const struct timespec *now);
+#endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
new file mode 100644
index 000000000000..c0f4cb90abd2
--- /dev/null
+++ b/udp_vu.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <unistd.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include "checksum.h"
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "pcap.h"
+#include "log.h"
+#include "vhost_user.h"
+#include "udp_internal.h"
+#include "udp_vu.h"
+
+/* vhost-user */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static unsigned char buffer[65536];
+static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
+static unsigned int     indexes		[VIRTQUEUE_MAX_SIZE];
+
+void udp_vu_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
+			 const struct timespec *now)
+{
+	VuDev *vdev = (VuDev *)&c->vdev;
+	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	size_t l2_hdrlen, vnet_hdrlen, fillsize;
+	ssize_t data_len;
+	in_port_t dstport = ref.udp.port;
+	bool has_mrg_rxbuf, v6 = ref.udp.v6;
+	struct msghdr msg;
+	int i, iov_count, iov_used, virtqueue_max;
+
+	if (c->no_udp || !(events & EPOLLIN))
+		return;
+
+	has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
+	if (has_mrg_rxbuf) {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+		virtqueue_max = VIRTQUEUE_MAX_SIZE;
+	} else {
+		vnet_hdrlen = sizeof(struct virtio_net_hdr);
+		virtqueue_max = 1;
+	}
+	l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct udphdr);
+
+	if (v6) {
+		l2_hdrlen += sizeof(struct ipv6hdr);
+
+		udp6_localname.sin6_port = htons(dstport);
+		msg.msg_name = &udp6_localname;
+		msg.msg_namelen = sizeof(udp6_localname);
+	} else {
+		l2_hdrlen += sizeof(struct iphdr);
+
+		udp4_localname.sin_port = htons(dstport);
+		msg.msg_name = &udp4_localname;
+		msg.msg_namelen = sizeof(udp4_localname);
+	}
+
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = 0;
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		struct virtio_net_hdr_mrg_rxbuf *vh;
+		struct ethhdr *eh;
+		char *base;
+		size_t size;
+
+		fillsize = USHRT_MAX;
+		iov_count = 0;
+		while (fillsize && iov_count < virtqueue_max) {
+			VuVirtqElement *elem;
+
+			elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
+			if (!elem)
+				break;
+
+			if (elem->in_num < 1) {
+				err("virtio-net receive queue contains no in buffers");
+				vu_queue_rewind(vdev, vq, iov_count);
+				return;
+			}
+			ASSERT(elem->in_num == 1);
+			ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
+
+			indexes[iov_count] = elem->index;
+			if (iov_count == 0) {
+				iov_vu[0].iov_base = (char *)elem->in_sg[0].iov_base + l2_hdrlen;
+				iov_vu[0].iov_len = elem->in_sg[0].iov_len - l2_hdrlen;
+			} else {
+				iov_vu[iov_count].iov_base = elem->in_sg[0].iov_base;
+				iov_vu[iov_count].iov_len = elem->in_sg[0].iov_len;
+			}
+
+			if (iov_vu[iov_count].iov_len > fillsize)
+				iov_vu[iov_count].iov_len = fillsize;
+
+			fillsize -= iov_vu[iov_count].iov_len;
+
+			iov_count++;
+		}
+		if (iov_count == 0)
+			break;
+
+		msg.msg_iov = iov_vu;
+		msg.msg_iovlen = iov_count;
+
+		data_len = recvmsg(ref.fd, &msg, 0);
+		if (data_len < 0) {
+			vu_queue_rewind(vdev, vq, iov_count);
+			return;
+		}
+
+		iov_used = 0;
+		size = data_len;
+		while (size) {
+			if (iov_vu[iov_used].iov_len > size)
+				iov_vu[iov_used].iov_len = size;
+
+			size -= iov_vu[iov_used].iov_len;
+			iov_used++;
+		}
+
+		base = (char *)iov_vu[0].iov_base - l2_hdrlen;
+		size = iov_vu[0].iov_len + l2_hdrlen;
+
+		/* release unused buffers */
+		vu_queue_rewind(vdev, vq, iov_count - iov_used);
+
+		/* vnet_header */
+		vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
+		vh->hdr = vu_header;
+		if (has_mrg_rxbuf)
+			vh->num_buffers = htole16(iov_used);
+
+		/* ethernet header */
+		eh = (struct ethhdr *)(base + vnet_hdrlen);
+
+		memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+		memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+		/* initialize header */
+		if (v6) {
+			struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
+			struct udphdr *uh = (struct udphdr *)(ip6h + 1);
+			uint32_t sum;
+
+			eh->h_proto = htons(ETH_P_IPV6);
+
+			*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
+
+			udp_update_hdr6(c, ip6h, data_len, &udp6_localname,
+					dstport, now);
+			if (*c->pcap) {
+				sum = proto_ipv6_header_checksum(ip6h, IPPROTO_UDP);
+
+				iov_vu[0].iov_base = uh;
+				iov_vu[0].iov_len = size - l2_hdrlen + sizeof(*uh);
+				uh->check = csum_iov(iov_vu, iov_used, sum);
+			} else {
+				/* 0 checksum is invalid with IPv6/UDP */
+				uh->check = 0xFFFF;
+			}
+		} else {
+			struct iphdr *iph = (struct iphdr *)(eh + 1);
+			struct udphdr *uh = (struct udphdr *)(iph + 1);
+			uint32_t sum;
+
+			eh->h_proto = htons(ETH_P_IP);
+
+			*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
+
+			udp_update_hdr4(c, iph, data_len, &udp4_localname,
+					dstport, now);
+			if (*c->pcap) {
+				sum = proto_ipv4_header_checksum(iph, IPPROTO_UDP);
+
+				iov_vu[0].iov_base = uh;
+				iov_vu[0].iov_len = size - l2_hdrlen + sizeof(*uh);
+				uh->check = csum_iov(iov_vu, iov_used, sum);
+			}
+		}
+
+		/* set iov for pcap logging */
+		iov_vu[0].iov_base = base + vnet_hdrlen;
+		iov_vu[0].iov_len = size - vnet_hdrlen;
+		pcap_iov(iov_vu, iov_used);
+
+		/* set iov_len for vu_queue_fill_by_index(); */
+		iov_vu[0].iov_base = base;
+		iov_vu[0].iov_len = size;
+
+		/* send packets */
+		for (i = 0; i < iov_used; i++)
+			vu_queue_fill_by_index(vdev, vq, indexes[i],
+					       iov_vu[i].iov_len, i);
+
+		vu_queue_flush(vdev, vq, iov_used);
+		vu_queue_notify(vdev, vq);
+	}
+}
diff --git a/udp_vu.h b/udp_vu.h
new file mode 100644
index 000000000000..e01ce047ee0a
--- /dev/null
+++ b/udp_vu.h
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#ifndef UDP_VU_H
+#define UDP_VU_H
+
+void udp_vu_sock_handler(const struct ctx *c, union epoll_ref ref,
+			 uint32_t events, const struct timespec *now);
+#endif /* UDP_VU_H */
-- 
2.42.0



* [PATCH 24/24] vhost-user: remove tap_send_frames_vu()
  2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
                   ` (22 preceding siblings ...)
  2024-02-02 14:11 ` [PATCH 23/24] udp: " Laurent Vivier
@ 2024-02-02 14:11 ` Laurent Vivier
  2024-02-09  5:01   ` David Gibson
  23 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

As TCP and UDP now use vhost-user directly, we don't need this function
anymore. Other protocols (ICMP, ARP, DHCP, ...) use tap_send()/vu_send().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 tap.c        |  3 +--
 vhost_user.c | 16 ----------------
 vhost_user.h |  2 --
 3 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/tap.c b/tap.c
index 930e48689497..ed1744f72e37 100644
--- a/tap.c
+++ b/tap.c
@@ -440,8 +440,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
 		m = tap_send_frames_passt(c, iov, n);
 		break;
 	case MODE_VU:
-		m = tap_send_frames_vu(c, iov, n);
-		break;
+		ASSERT(0);
 	default:
 		m = 0;
 		break;
diff --git a/vhost_user.c b/vhost_user.c
index 9cc07c8312c0..4ceeeb58f792 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -653,22 +653,6 @@ err:
 	return offset;
 }
 
-size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov, size_t n)
-{
-	size_t i;
-	int ret;
-
-	debug("tap_send_frames_vu n %zd", n);
-
-	for (i = 0; i < n; i++) {
-		ret = vu_send(c, iov[i].iov_base, iov[i].iov_len);
-		if (ret < 0)
-			break;
-	}
-	debug("count %zd", i);
-	return i;
-}
-
 static void vu_handle_tx(VuDev *vdev, int index)
 {
 	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
diff --git a/vhost_user.h b/vhost_user.h
index 25f0b617ab40..44678ddabef4 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -129,8 +129,6 @@ static inline bool vu_queue_started(const VuVirtq *vq)
 	return vq->started;
 }
 
-size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov,
-			  size_t n);
 int vu_send(const struct ctx *c, const void *data, size_t len);
 void vu_print_capabilities(void);
 void vu_init(struct ctx *c);
-- 
2.42.0



* Re: [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add()
  2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
@ 2024-02-02 14:29   ` Laurent Vivier
  2024-02-06  1:52   ` David Gibson
  2024-02-11 23:15   ` Stefano Brivio
  2 siblings, 0 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-02 14:29 UTC (permalink / raw)
  To: passt-dev

On 2/2/24 15:11, Laurent Vivier wrote:
> From: Laurent Vivier <laurent@vivier.eu>
> 
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

I will fix S-o-B and author in v2.

Thanks,
Laurent

> ---
>   tap.c | 98 +++++++++++++++++++++++++++++------------------------------
>   tap.h |  7 +++++
>   2 files changed, 56 insertions(+), 49 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 29f389057ac1..5b1b61550c13 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -911,6 +911,45 @@ append:
>   	return in->count;
>   }
>   
> +void pool_flush_all(void)
> +{
> +	pool_flush(pool_tap4);
> +	pool_flush(pool_tap6);
> +}
> +
> +void tap_handler_all(struct ctx *c, const struct timespec *now)
> +{
> +	tap4_handler(c, pool_tap4, now);
> +	tap6_handler(c, pool_tap6, now);
> +}
> +
> +void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
> +		       const char *func, int line)
> +{
> +	const struct ethhdr *eh;
> +
> +	pcap(p, len);
> +
> +	eh = (struct ethhdr *)p;
> +
> +	if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> +		memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> +		proto_update_l2_buf(c->mac_guest, NULL);
> +	}
> +
> +	switch (ntohs(eh->h_proto)) {
> +	case ETH_P_ARP:
> +	case ETH_P_IP:
> +		packet_add_do(pool_tap4, len, p, func, line);
> +		break;
> +	case ETH_P_IPV6:
> +		packet_add_do(pool_tap6, len, p, func, line);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>   /**
>    * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>    * @c:		Execution context
> @@ -937,7 +976,6 @@ static void tap_sock_reset(struct ctx *c)
>   void tap_handler_passt(struct ctx *c, uint32_t events,
>   		       const struct timespec *now)
>   {
> -	const struct ethhdr *eh;
>   	ssize_t n, rem;
>   	char *p;
>   
> @@ -950,8 +988,7 @@ redo:
>   	p = pkt_buf;
>   	rem = 0;
>   
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>   
>   	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
>   	if (n < 0) {
> @@ -978,37 +1015,18 @@ redo:
>   		/* Complete the partial read above before discarding a malformed
>   		 * frame, otherwise the stream will be inconsistent.
>   		 */
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU)
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU)
>   			goto next;
>   
> -		pcap(p, len);
> -
> -		eh = (struct ethhdr *)p;
> -
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, p);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, p);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, p);
>   
>   next:
>   		p += len;
>   		n -= len;
>   	}
>   
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>   
>   	/* We can't use EPOLLET otherwise. */
>   	if (rem)
> @@ -1033,35 +1051,18 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>   redo:
>   	n = 0;
>   
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>   restart:
>   	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
> -		const struct ethhdr *eh = (struct ethhdr *)(pkt_buf + n);
>   
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU) {
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU) {
>   			n += len;
>   			continue;
>   		}
>   
> -		pcap(pkt_buf + n, len);
>   
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, pkt_buf + n);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, pkt_buf + n);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, pkt_buf + n);
>   
>   		if ((n += len) == TAP_BUF_BYTES)
>   			break;
> @@ -1072,8 +1073,7 @@ restart:
>   
>   	ret = errno;
>   
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>   
>   	if (len > 0 || ret == EAGAIN)
>   		return;
> diff --git a/tap.h b/tap.h
> index 437b9aa2b43f..7157ef37ee6e 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -82,5 +82,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>   void tap_handler_passt(struct ctx *c, uint32_t events,
>   		       const struct timespec *now);
>   void tap_sock_init(struct ctx *c);
> +void pool_flush_all(void);
> +void tap_handler_all(struct ctx *c, const struct timespec *now);
> +
> +void packet_add_do(struct pool *p, size_t len, const char *start,
> +		   const char *func, int line);
> +#define packet_add_all(p, len, start)					\
> +	packet_add_all_do(p, len, start, __func__, __LINE__)
>   
>   #endif /* TAP_H */



* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
@ 2024-02-05  5:57   ` David Gibson
  2024-02-06 14:28     ` Laurent Vivier
  2024-02-06 16:10   ` Stefano Brivio
  1 sibling, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-05  5:57 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev


On Fri, Feb 02, 2024 at 03:11:28PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile |  4 +--
>  iov.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  iov.h    | 46 +++++++++++++++++++++++++++++++++
>  3 files changed, 126 insertions(+), 2 deletions(-)
>  create mode 100644 iov.c
>  create mode 100644 iov.h
> 
> diff --git a/Makefile b/Makefile
> index af4fa87e7e13..c1138fb91d26 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
> -	util.c
> +	util.c iov.c

I think we've been maintaining these in alphabetical order so far.

>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -56,7 +56,7 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
> -	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h
> +	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/iov.c b/iov.c
> new file mode 100644
> index 000000000000..38a8e7566021
> --- /dev/null
> +++ b/iov.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later

I believe we need an actual copyright / authorship notice as well as
the SPDX comment.


> +
> +/* some parts copied from QEMU util/iov.c */
> +
> +#include <sys/socket.h>
> +#include "util.h"
> +#include "iov.h"
> +

Function comments would be really helpful here.  It took me a while to
figure out what this does from just the name and implementation.

> +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
> +			 size_t offset, const void *buf, size_t bytes)
> +{
> +	size_t done;
> +	unsigned int i;

passt style is to order local declarations from longest to shortest.

> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {

Not immediately seeing why you need the 'offset ||' part of the condition.

> +		if (offset < iov[i].iov_len) {
> +			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
> +			memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
> +			done += len;
> +			offset = 0;
> +		} else {
> +			offset -= iov[i].iov_len;
> +		}
> +	}
> +	return done;
> +}
> +
> +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,

const modifier on an int isn't very useful.

> +		       size_t offset, void *buf, size_t bytes)
> +{
> +	size_t done;
> +	unsigned int i;
> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
> +		if (offset < iov[i].iov_len) {
> +			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
> +			memcpy((char *)buf + done, (char *)iov[i].iov_base + offset, len);
> +			done += len;
> +			offset = 0;
> +		} else {
> +			offset -= iov[i].iov_len;
> +		}
> +	}
> +	return done;
> +}
> +
> +size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt)
> +{
> +	size_t len;
> +	unsigned int i;
> +
> +	len = 0;

Can be an initialiser.

> +	for (i = 0; i < iov_cnt; i++) {
> +		len += iov[i].iov_len;
> +	}
> +	return len;
> +}
> +
> +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
> +		  const struct iovec *iov, unsigned int iov_cnt,
> +		  size_t offset, size_t bytes)
> +{
> +	size_t len;
> +	unsigned int i, j;
> +	for (i = 0, j = 0;
> +		 i < iov_cnt && j < dst_iov_cnt && (offset || bytes); i++) {
> +		if (offset >= iov[i].iov_len) {
> +			offset -= iov[i].iov_len;
> +			continue;
> +		}
> +		len = MIN(bytes, iov[i].iov_len - offset);
> +
> +		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
> +		dst_iov[j].iov_len = len;
> +		j++;
> +		bytes -= len;
> +		offset = 0;
> +	}
> +	return j;
> +}

Small concern about the interface to iov_copy().  If dst_iov_cnt <
iov_cnt and the chunk of the input iovec you want doesn't fit in the
destination it will silently truncate - you can't tell if this has
happened from the return value.  If the assumption is that
dst_iov_cnt >= iov_cnt, then there's not really any need to pass it.
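
For illustration only, and not part of the patch: a minimal sketch of
one way the truncation could be made visible to the caller.  The name
iov_copy_checked() and the @truncated out-parameter are hypothetical;
the slicing logic follows the iov_copy() quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical variant of iov_copy(): same slicing of @iov into @dst_iov,
 * but reports through @truncated whether the copy stopped because the
 * destination array ran out of entries.
 */
static unsigned iov_copy_checked(struct iovec *dst_iov, unsigned dst_iov_cnt,
				 const struct iovec *iov, unsigned iov_cnt,
				 size_t offset, size_t bytes, bool *truncated)
{
	unsigned i, j;

	assert(truncated);

	for (i = 0, j = 0; i < iov_cnt && j < dst_iov_cnt && bytes; i++) {
		size_t len;

		/* Skip source entries that lie entirely before @offset */
		if (offset >= iov[i].iov_len) {
			offset -= iov[i].iov_len;
			continue;
		}

		len = iov[i].iov_len - offset;
		if (len > bytes)
			len = bytes;

		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
		dst_iov[j].iov_len = len;
		j++;
		bytes -= len;
		offset = 0;
	}

	/* Bytes still wanted and source entries left over: the loop ended
	 * only because @dst_iov was full.
	 */
	*truncated = bytes && i < iov_cnt;

	return j;
}
```

A caller could then treat a set @truncated flag as an error instead of
silently handling a short copy.  Whether a flag, a negative return
value, or dropping dst_iov_cnt altogether fits passt best is left open
here.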

> diff --git a/iov.h b/iov.h
> new file mode 100644
> index 000000000000..31fbf6d0e1cf
> --- /dev/null
> +++ b/iov.h
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts copied from QEMU include/qemu/iov.h */
> +
> +#ifndef IOVEC_H
> +#define IOVEC_H
> +
> +#include <unistd.h>
> +#include <string.h>
> +
> +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
> +			 size_t offset, const void *buf, size_t bytes);
> +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
> +		       size_t offset, void *buf, size_t bytes);
> +
> +static inline size_t iov_from_buf(const struct iovec *iov,
> +				  unsigned int iov_cnt, size_t offset,
> +				  const void *buf, size_t bytes)
> +{
> +	if (__builtin_constant_p(bytes) && iov_cnt &&
> +		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
> +		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
> +		return bytes;

Do you have an idea of how much difference this optimized path makes?

> +	} else {
> +		return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
> +	}
> +}
> +
> +static inline size_t iov_to_buf(const struct iovec *iov,
> +				const unsigned int iov_cnt, size_t offset,
> +				void *buf, size_t bytes)
> +{
> +	if (__builtin_constant_p(bytes) && iov_cnt &&
> +		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
> +		memcpy(buf, (char *)iov[0].iov_base + offset, bytes);
> +		return bytes;
> +	} else {
> +		return iov_to_buf_full(iov, iov_cnt, offset, buf, bytes);
> +	}
> +}
> +
> +size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt);
> +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
> +		  const struct iovec *iov, unsigned int iov_cnt,
> +		  size_t offset, size_t bytes);
> +#endif

Normal to put a /* IOVEC_H */ comment on the final #endif.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH 03/24] checksum: align buffers
  2024-02-02 14:11 ` [PATCH 03/24] checksum: align buffers Laurent Vivier
@ 2024-02-05  6:02   ` David Gibson
  2024-02-07  9:01     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:02 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev


On Fri, Feb 02, 2024 at 03:11:30PM +0100, Laurent Vivier wrote:
> If the buffer is not aligned, use sum_16b() only on the unaligned
> part, and then use csum_avx2() on the remaining part.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  checksum.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/checksum.c b/checksum.c
> index f21c9b7a14d1..c94980771c63 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -407,7 +407,19 @@ less_than_128_bytes:
>  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
>  uint16_t csum(const void *buf, size_t len, uint32_t init)
>  {
> -	return (uint16_t)~csum_fold(csum_avx2(buf, len, init));
> +	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;

Wonder if it's worth adding an ALIGN_UP macro.
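
For illustration only: a minimal sketch of what such a macro might look
like, assuming power-of-two alignments.  ALIGN_UP() as written here and
the csum_head_pad() helper are hypothetical and not taken from the passt
tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round @x up to the next multiple of @a.
 * @a must be a power of two and have the same type as @x.
 */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Number of leading bytes of @buf before the first 32-byte boundary,
 * that is, the part the quoted patch hands to sum_16b().
 */
static unsigned int csum_head_pad(const void *buf)
{
	uintptr_t start = (uintptr_t)buf;
	uintptr_t pad = ALIGN_UP(start, (uintptr_t)32) - start;

	assert(pad < 32);

	return (unsigned int)pad;
}
```

With a helper like this, the open-coded "+ 0x1f) & ~0x1f" expression in
csum() would reduce to a single ALIGN_UP() call.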

> +	unsigned int pad = align - (intptr_t)buf;
> +
> +	if (len < pad)
> +		pad = len;
> +
> +	if (pad)
> +		init += sum_16b(buf, pad);
> +
> +	if (len > pad)
> +		init = csum_avx2((void *)align, len - pad, init);
> +
> +	return (uint16_t)~csum_fold(init);
>  }
>  
>  #else /* __AVX2__ */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH 04/24] checksum: add csum_iov()
  2024-02-02 14:11 ` [PATCH 04/24] checksum: add csum_iov() Laurent Vivier
@ 2024-02-05  6:07   ` David Gibson
  2024-02-07  9:02   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:07 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev


On Fri, Feb 02, 2024 at 03:11:31PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  checksum.c | 39 ++++++++++++++++++++++-----------------
>  checksum.h |  1 +
>  2 files changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/checksum.c b/checksum.c
> index c94980771c63..14b6057684d9 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -395,17 +395,8 @@ less_than_128_bytes:
>  	return (uint32_t)sum64;
>  }
>  
> -/**
> - * csum() - Compute TCP/IP-style checksum
> - * @buf:	Input buffer, must be aligned to 32-byte boundary
> - * @len:	Input length
> - * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
> - *
> - * Return: 16-bit folded, complemented checksum sum
> - */
> -/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
>  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
> -uint16_t csum(const void *buf, size_t len, uint32_t init)
> +uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)

I'm wondering if this might be a little clearer with a feed() /
final() interface something like siphash.
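
For illustration only: a minimal, portable sketch of what a feed() /
final() style checksum interface could look like.  The names
csum_state, csum_init(), csum_feed() and csum_final() are hypothetical,
and this byte-wise loop is not the AVX2 path from the patch.  It tracks
byte parity across calls, so chunks of odd length are summed correctly,
and it returns the checksum as a numeric value to be stored in network
byte order (for example with htons()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running state for an Internet (RFC 1071) checksum */
struct csum_state {
	uint32_t sum;	/* Partially folded sum of big-endian 16-bit words */
	int odd;	/* Non-zero if an odd number of bytes was fed so far */
};

static void csum_init(struct csum_state *s)
{
	s->sum = 0;
	s->odd = 0;
}

/* Add @len bytes from @buf to the running sum; may be called repeatedly */
static void csum_feed(struct csum_state *s, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	size_t i;

	assert(s && (buf || !len));

	for (i = 0; i < len; i++) {
		/* Even stream offsets hold the high byte of each word */
		s->sum += s->odd ? p[i] : (uint32_t)p[i] << 8;
		s->odd = !s->odd;

		/* Fold the carry back in so the accumulator cannot overflow */
		s->sum = (s->sum & 0xffff) + (s->sum >> 16);
	}
}

/* Return the folded, complemented checksum for everything fed so far */
static uint16_t csum_final(const struct csum_state *s)
{
	uint32_t sum = s->sum;

	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Under this shape, csum_iov() would become one csum_feed() call per iovec
entry followed by a single csum_final().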

>  {
>  	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;
>  	unsigned int pad = align - (intptr_t)buf;
> @@ -419,24 +410,38 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
>  	if (len > pad)
>  		init = csum_avx2((void *)align, len - pad, init);
>  
> -	return (uint16_t)~csum_fold(init);
> +	return init;
>  }
> -
>  #else /* __AVX2__ */
>  
> +__attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
> +uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
> +{
> +	return sum_16b(buf, len) + init;
> +}
> +#endif /* !__AVX2__ */
> +
>  /**
>   * csum() - Compute TCP/IP-style checksum
> - * @buf:	Input buffer
> + * @buf:	Input buffer, must be aligned to 32-byte boundary

I thought the point of the previous patch was that this didn't have to
be 32-byte aligned any more.

>   * @len:	Input length
> - * @sum:	Initial 32-bit checksum, 0 for no pre-computed checksum
> + * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
>   *
> - * Return: 16-bit folded, complemented checksum
> + * Return: 16-bit folded, complemented checksum sum

"checksum sum"

>   */
>  /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
>  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
>  uint16_t csum(const void *buf, size_t len, uint32_t init)
>  {
> -	return csum_unaligned(buf, len, init);
> +	return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
>  }
>  
> -#endif /* !__AVX2__ */
> +uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < n;  i++)
> +		init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
> +
> +	return (uint16_t)~csum_fold(init);
> +}
> diff --git a/checksum.h b/checksum.h
> index 21c0310d3804..6a20297a5826 100644
> --- a/checksum.h
> +++ b/checksum.h
> @@ -25,5 +25,6 @@ void csum_icmp6(struct icmp6hdr *icmp6hr,
>  		const struct in6_addr *saddr, const struct in6_addr *daddr,
>  		const void *payload, size_t len);
>  uint16_t csum(const void *buf, size_t len, uint32_t init);
> +uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init);
>  
>  #endif /* CHECKSUM_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch]
  2024-02-02 14:11 ` [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch] Laurent Vivier
@ 2024-02-05  6:13   ` David Gibson
  2024-02-07  9:03     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:13 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev


On Fri, Feb 02, 2024 at 03:11:32PM +0100, Laurent Vivier wrote:

As with previous patches, could really do with a rationale in the
commit message.

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |  4 +--
>  conf.c       |  1 +
>  dhcp.c       |  1 +
>  flow.c       |  1 +
>  icmp.c       |  1 +
>  ip.c         | 72 +++++++++++++++++++++++++++++++++++++++++++
>  ip.h         | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  ndp.c        |  1 +
>  port_fwd.c   |  1 +
>  qrap.c       |  1 +
>  tap.c        |  1 +
>  tcp.c        |  1 +
>  tcp_splice.c |  1 +
>  udp.c        |  1 +
>  util.c       | 55 ---------------------------------
>  util.h       | 76 ----------------------------------------------
>  16 files changed, 171 insertions(+), 133 deletions(-)
>  create mode 100644 ip.c
>  create mode 100644 ip.h
> 
> diff --git a/Makefile b/Makefile
> index c1138fb91d26..acf37f5a2036 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
> -	util.c iov.c
> +	util.c iov.c ip.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -56,7 +56,7 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
> -	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h
> +	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h ip.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/conf.c b/conf.c
> index 5e15b665be9c..93bfda331349 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -35,6 +35,7 @@
>  #include <netinet/if_ether.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "netlink.h"
>  #include "udp.h"
> diff --git a/dhcp.c b/dhcp.c
> index 110772867632..ff4834a3dce9 100644
> --- a/dhcp.c
> +++ b/dhcp.c
> @@ -25,6 +25,7 @@
>  #include <limits.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "checksum.h"
>  #include "packet.h"
>  #include "passt.h"
> diff --git a/flow.c b/flow.c
> index 5e94a7a949e5..73d52bda8774 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -11,6 +11,7 @@
>  #include <string.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "siphash.h"
>  #include "inany.h"
> diff --git a/icmp.c b/icmp.c
> index 9434fc5a7490..3b85a8578316 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -33,6 +33,7 @@
>  
>  #include "packet.h"
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "tap.h"
>  #include "log.h"
> diff --git a/ip.c b/ip.c
> new file mode 100644
> index 000000000000..64e336139819
> --- /dev/null
> +++ b/ip.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + *  for qemu/UNIX domain socket mode
> + *
> + * PASTA - Pack A Subtle Tap Abstraction
> + *  for network namespace/tap device mode
> + *
> + * util.c - Convenience helpers

Needs an update for the move.

> + *
> + * Copyright (c) 2020-2021 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#include <stddef.h>
> +#include "util.h"
> +#include "ip.h"
> +
> +#define IPV6_NH_OPT(nh)							\
> +	((nh) == 0   || (nh) == 43  || (nh) == 44  || (nh) == 50  ||	\
> +	 (nh) == 51  || (nh) == 60  || (nh) == 135 || (nh) == 139 ||	\
> +	 (nh) == 140 || (nh) == 253 || (nh) == 254)
> +
> +/**
> + * ipv6_l4hdr() - Find pointer to L4 header in IPv6 packet and extract protocol
> + * @p:		Packet pool, packet number @idx has IPv6 header at @offset
> + * @idx:	Index of packet in pool
> + * @offset:	Pre-calculated IPv6 header offset
> + * @proto:	Filled with L4 protocol number
> + * @dlen:	Data length (payload excluding header extensions), set on return
> + *
> + * Return: pointer to L4 header, NULL if not found
> + */
> +char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> +		 size_t *dlen)
> +{
> +	const struct ipv6_opt_hdr *o;
> +	const struct ipv6hdr *ip6h;
> +	char *base;
> +	int hdrlen;
> +	uint8_t nh;
> +
> +	base = packet_get(p, idx, 0, 0, NULL);
> +	ip6h = packet_get(p, idx, offset, sizeof(*ip6h), dlen);
> +	if (!ip6h)
> +		return NULL;
> +
> +	offset += sizeof(*ip6h);
> +
> +	nh = ip6h->nexthdr;
> +	if (!IPV6_NH_OPT(nh))
> +		goto found;
> +
> +	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
> +		nh = o->nexthdr;
> +		hdrlen = (o->hdrlen + 1) * 8;
> +
> +		if (IPV6_NH_OPT(nh))
> +			offset += hdrlen;
> +		else
> +			goto found;
> +	}
> +
> +	return NULL;
> +
> +found:
> +	if (nh == 59)
> +		return NULL;
> +
> +	*proto = nh;
> +	return base + offset;
> +}
> diff --git a/ip.h b/ip.h
> new file mode 100644
> index 000000000000..b2e08bc049f3
> --- /dev/null
> +++ b/ip.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2021 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef IP_H
> +#define IP_H
> +
> +#include <netinet/ip.h>
> +#include <netinet/ip6.h>
> +
> +#define IN4_IS_ADDR_UNSPECIFIED(a) \
> +	((a)->s_addr == htonl_constant(INADDR_ANY))
> +#define IN4_IS_ADDR_BROADCAST(a) \
> +	((a)->s_addr == htonl_constant(INADDR_BROADCAST))
> +#define IN4_IS_ADDR_LOOPBACK(a) \
> +	(ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
> +#define IN4_IS_ADDR_MULTICAST(a) \
> +	(IN_MULTICAST(ntohl((a)->s_addr)))
> +#define IN4_ARE_ADDR_EQUAL(a, b) \
> +	(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
> +#define IN4ADDR_LOOPBACK_INIT \
> +	{ .s_addr	= htonl_constant(INADDR_LOOPBACK) }
> +#define IN4ADDR_ANY_INIT \
> +	{ .s_addr	= htonl_constant(INADDR_ANY) }

I'm hoping to eliminate some of these with increased use of inany.h,
but that's not really relevant to you.

> +#define L2_BUF_IP4_INIT(proto)						\
> +	{								\
> +		.version	= 4,					\
> +		.ihl		= 5,					\
> +		.tos		= 0,					\
> +		.tot_len	= 0,					\
> +		.id		= 0,					\
> +		.frag_off	= 0,					\
> +		.ttl		= 0xff,					\
> +		.protocol	= (proto),				\
> +		.saddr		= 0,					\
> +		.daddr		= 0,					\
> +	}
> +#define L2_BUF_IP4_PSUM(proto)	((uint32_t)htons_constant(0x4500) +	\
> +				 (uint32_t)htons_constant(0xff00 | (proto)))
> +
> +#define L2_BUF_IP6_INIT(proto)						\
> +	{								\
> +		.priority	= 0,					\
> +		.version	= 6,					\
> +		.flow_lbl	= { 0 },				\
> +		.payload_len	= 0,					\
> +		.nexthdr	= (proto),				\
> +		.hop_limit	= 255,					\
> +		.saddr		= IN6ADDR_ANY_INIT,			\
> +		.daddr		= IN6ADDR_ANY_INIT,			\
> +	}
> +
> +struct ipv6hdr {

Not really in scope for this patch, but I have wondered if we should
try to use struct ip6_hdr from netinet/ip6.h instead of our own
version (derived, I think, from the kernel one).
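For illustration only, here is a minimal sketch of what filling the libc header type could look like. The helper name ip6_hdr_init() is hypothetical and not part of passt; the field names (ip6_vfc, ip6_plen, ip6_nxt, ip6_hlim) are the ones glibc's <netinet/ip6.h> provides for struct ip6_hdr. It assumes the same defaults as L2_BUF_IP6_INIT() above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <stdint.h>
#include <string.h>

/* The fixed IPv6 header is 40 bytes, same as the local struct ipv6hdr */
static_assert(sizeof(struct ip6_hdr) == 40, "IPv6 fixed header is 40 bytes");

/* Hypothetical helper: fill a libc struct ip6_hdr with the defaults that
 * L2_BUF_IP6_INIT() uses for the local struct ipv6hdr.
 */
static void ip6_hdr_init(struct ip6_hdr *h, uint8_t proto)
{
	memset(h, 0, sizeof(*h));	/* addresses and flow label all zero */
	h->ip6_vfc  = 6 << 4;		/* version 6 in the upper nibble */
	h->ip6_plen = 0;		/* payload length, filled in later */
	h->ip6_nxt  = proto;
	h->ip6_hlim = 255;
}
```

Note that the libc type has no bitfields, so the version is set through the upper nibble of ip6_vfc, and no endianness #if or -Wpedantic pragma is needed.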

> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wpedantic"
> +#if __BYTE_ORDER == __BIG_ENDIAN
> +	uint8_t			version:4,
> +				priority:4;
> +#else
> +	uint8_t			priority:4,
> +				version:4;
> +#endif
> +#pragma GCC diagnostic pop
> +	uint8_t			flow_lbl[3];
> +
> +	uint16_t		payload_len;
> +	uint8_t			nexthdr;
> +	uint8_t			hop_limit;
> +
> +	struct in6_addr		saddr;
> +	struct in6_addr		daddr;
> +};
> +
> +struct ipv6_opt_hdr {
> +	uint8_t			nexthdr;
> +	uint8_t			hdrlen;
> +	/*
> +	 * TLV encoded option data follows.
> +	 */
> +} __attribute__((packed));	/* required for some archs */
> +
> +char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> +		 size_t *dlen);
> +#endif /* IP_H */
> diff --git a/ndp.c b/ndp.c
> index 4c85ab8bcaee..c58f4b222b76 100644
> --- a/ndp.c
> +++ b/ndp.c
> @@ -28,6 +28,7 @@
>  
>  #include "checksum.h"
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "tap.h"
>  #include "log.h"
> diff --git a/port_fwd.c b/port_fwd.c
> index 6f6c836c57ad..e1ec31e2232c 100644
> --- a/port_fwd.c
> +++ b/port_fwd.c
> @@ -21,6 +21,7 @@
>  #include <stdio.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "port_fwd.h"
>  #include "passt.h"
>  #include "lineread.h"
> diff --git a/qrap.c b/qrap.c
> index 97f350a4bf0b..d59670621731 100644
> --- a/qrap.c
> +++ b/qrap.c
> @@ -32,6 +32,7 @@
>  #include <linux/icmpv6.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "arp.h"
>  
> diff --git a/tap.c b/tap.c
> index 396dee7eef25..3ea03f720d6d 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -45,6 +45,7 @@
>  
>  #include "checksum.h"
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "arp.h"
>  #include "dhcp.h"
> diff --git a/tcp.c b/tcp.c
> index 905d26f6c656..4c9c5fb51c60 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -289,6 +289,7 @@
>  
>  #include "checksum.h"
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "tap.h"
>  #include "siphash.h"
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 26d32065cd47..66575ca95a1e 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -49,6 +49,7 @@
>  #include <sys/socket.h>
>  
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "log.h"
>  #include "tcp_splice.h"
> diff --git a/udp.c b/udp.c
> index b5b8f8a7cd5b..d514c864ab5b 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -112,6 +112,7 @@
>  
>  #include "checksum.h"
>  #include "util.h"
> +#include "ip.h"
>  #include "passt.h"
>  #include "tap.h"
>  #include "pcap.h"
> diff --git a/util.c b/util.c
> index 21b35ff94db1..f73ea1d98a09 100644
> --- a/util.c
> +++ b/util.c
> @@ -30,61 +30,6 @@
>  #include "packet.h"
>  #include "log.h"
>  
> -#define IPV6_NH_OPT(nh)							\
> -	((nh) == 0   || (nh) == 43  || (nh) == 44  || (nh) == 50  ||	\
> -	 (nh) == 51  || (nh) == 60  || (nh) == 135 || (nh) == 139 ||	\
> -	 (nh) == 140 || (nh) == 253 || (nh) == 254)
> -
> -/**
> - * ipv6_l4hdr() - Find pointer to L4 header in IPv6 packet and extract protocol
> - * @p:		Packet pool, packet number @idx has IPv6 header at @offset
> - * @idx:	Index of packet in pool
> - * @offset:	Pre-calculated IPv6 header offset
> - * @proto:	Filled with L4 protocol number
> - * @dlen:	Data length (payload excluding header extensions), set on return
> - *
> - * Return: pointer to L4 header, NULL if not found
> - */
> -char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> -		 size_t *dlen)
> -{
> -	const struct ipv6_opt_hdr *o;
> -	const struct ipv6hdr *ip6h;
> -	char *base;
> -	int hdrlen;
> -	uint8_t nh;
> -
> -	base = packet_get(p, idx, 0, 0, NULL);
> -	ip6h = packet_get(p, idx, offset, sizeof(*ip6h), dlen);
> -	if (!ip6h)
> -		return NULL;
> -
> -	offset += sizeof(*ip6h);
> -
> -	nh = ip6h->nexthdr;
> -	if (!IPV6_NH_OPT(nh))
> -		goto found;
> -
> -	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
> -		nh = o->nexthdr;
> -		hdrlen = (o->hdrlen + 1) * 8;
> -
> -		if (IPV6_NH_OPT(nh))
> -			offset += hdrlen;
> -		else
> -			goto found;
> -	}
> -
> -	return NULL;
> -
> -found:
> -	if (nh == 59)
> -		return NULL;
> -
> -	*proto = nh;
> -	return base + offset;
> -}
> -
>  /**
>   * sock_l4() - Create and bind socket for given L4, add to epoll list
>   * @c:		Execution context
> diff --git a/util.h b/util.h
> index d2320f8cc99a..f7c3dfee9972 100644
> --- a/util.h
> +++ b/util.h
> @@ -110,22 +110,6 @@
>  #define	htonl_constant(x)	(__bswap_constant_32(x))
>  #endif
>  
> -#define IN4_IS_ADDR_UNSPECIFIED(a) \
> -	((a)->s_addr == htonl_constant(INADDR_ANY))
> -#define IN4_IS_ADDR_BROADCAST(a) \
> -	((a)->s_addr == htonl_constant(INADDR_BROADCAST))
> -#define IN4_IS_ADDR_LOOPBACK(a) \
> -	(ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
> -#define IN4_IS_ADDR_MULTICAST(a) \
> -	(IN_MULTICAST(ntohl((a)->s_addr)))
> -#define IN4_ARE_ADDR_EQUAL(a, b) \
> -	(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
> -#define IN4ADDR_LOOPBACK_INIT \
> -	{ .s_addr	= htonl_constant(INADDR_LOOPBACK) }
> -#define IN4ADDR_ANY_INIT \
> -	{ .s_addr	= htonl_constant(INADDR_ANY) }
> -
> -
>  #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
>  int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  	     void *arg);
> @@ -138,34 +122,6 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  			 (void *)(arg));				\
>  	} while (0)
>  
> -#define L2_BUF_IP4_INIT(proto)						\
> -	{								\
> -		.version	= 4,					\
> -		.ihl		= 5,					\
> -		.tos		= 0,					\
> -		.tot_len	= 0,					\
> -		.id		= 0,					\
> -		.frag_off	= 0,					\
> -		.ttl		= 0xff,					\
> -		.protocol	= (proto),				\
> -		.saddr		= 0,					\
> -		.daddr		= 0,					\
> -	}
> -#define L2_BUF_IP4_PSUM(proto)	((uint32_t)htons_constant(0x4500) +	\
> -				 (uint32_t)htons_constant(0xff00 | (proto)))
> -
> -#define L2_BUF_IP6_INIT(proto)						\
> -	{								\
> -		.priority	= 0,					\
> -		.version	= 6,					\
> -		.flow_lbl	= { 0 },				\
> -		.payload_len	= 0,					\
> -		.nexthdr	= (proto),				\
> -		.hop_limit	= 255,					\
> -		.saddr		= IN6ADDR_ANY_INIT,			\
> -		.daddr		= IN6ADDR_ANY_INIT,			\
> -	}
> -
>  #define RCVBUF_BIG		(2UL * 1024 * 1024)
>  #define SNDBUF_BIG		(4UL * 1024 * 1024)
>  #define SNDBUF_SMALL		(128UL * 1024)
> @@ -173,45 +129,13 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  #include <net/if.h>
>  #include <limits.h>
>  #include <stdint.h>
> -#include <netinet/ip6.h>
>  
>  #include "packet.h"
>  
>  struct ctx;
>  
> -struct ipv6hdr {
> -#pragma GCC diagnostic push
> -#pragma GCC diagnostic ignored "-Wpedantic"
> -#if __BYTE_ORDER == __BIG_ENDIAN
> -	uint8_t			version:4,
> -				priority:4;
> -#else
> -	uint8_t			priority:4,
> -				version:4;
> -#endif
> -#pragma GCC diagnostic pop
> -	uint8_t			flow_lbl[3];
> -
> -	uint16_t		payload_len;
> -	uint8_t			nexthdr;
> -	uint8_t			hop_limit;
> -
> -	struct in6_addr		saddr;
> -	struct in6_addr		daddr;
> -};
> -
> -struct ipv6_opt_hdr {
> -	uint8_t			nexthdr;
> -	uint8_t			hdrlen;
> -	/*
> -	 * TLV encoded option data follows.
> -	 */
> -} __attribute__((packed));	/* required for some archs */
> -
>  /* cppcheck-suppress funcArgNamesDifferent */
>  __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
> -char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> -		 size_t *dlen);
>  int sock_l4(const struct ctx *c, int af, uint8_t proto,
>  	    const void *bind_addr, const char *ifname, uint16_t port,
>  	    uint32_t data);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h
  2024-02-02 14:11 ` [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h Laurent Vivier
@ 2024-02-05  6:16   ` David Gibson
  2024-02-07 10:40   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:16 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:33PM +0100, Laurent Vivier wrote:
> We can find the same function to compute the IPv4 header
> checksum in tcp.c and udp.c
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Nice clean up.

> ---
>  ip.h  | 14 ++++++++++++++
>  tcp.c | 23 ++---------------------
>  udp.c | 19 +------------------
>  3 files changed, 17 insertions(+), 39 deletions(-)
> 
> diff --git a/ip.h b/ip.h
> index b2e08bc049f3..ff7902c45a95 100644
> --- a/ip.h
> +++ b/ip.h
> @@ -9,6 +9,8 @@
>  #include <netinet/ip.h>
>  #include <netinet/ip6.h>
>  
> +#include "checksum.h"
> +
>  #define IN4_IS_ADDR_UNSPECIFIED(a) \
>  	((a)->s_addr == htonl_constant(INADDR_ANY))
>  #define IN4_IS_ADDR_BROADCAST(a) \
> @@ -83,4 +85,16 @@ struct ipv6_opt_hdr {
>  
>  char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
>  		 size_t *dlen);
> +static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
> +{
> +	uint32_t sum = L2_BUF_IP4_PSUM(proto);
> +
> +	sum += iph->tot_len;
> +	sum += (iph->saddr >> 16) & 0xffff;
> +	sum += iph->saddr & 0xffff;
> +	sum += (iph->daddr >> 16) & 0xffff;
> +	sum += iph->daddr & 0xffff;
> +
> +	return ~csum_fold(sum);
> +}
>  #endif /* IP_H */
> diff --git a/tcp.c b/tcp.c
> index 4c9c5fb51c60..293ab12d8c21 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -934,23 +934,6 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>  		trace("TCP: failed to set SO_SNDBUF to %i", v);
>  }
>  
> -/**
> - * tcp_update_check_ip4() - Update IPv4 with variable parts from stored one
> - * @buf:	L2 packet buffer with final IPv4 header
> - */
> -static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf)
> -{
> -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_TCP);
> -
> -	sum += buf->iph.tot_len;
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -
> -	buf->iph.check = (uint16_t)~csum_fold(sum);
> -}
> -
>  /**
>   * tcp_update_check_tcp4() - Update TCP checksum from stored one
>   * @buf:	L2 packet buffer with final IPv4 header
> @@ -1393,10 +1376,8 @@ do {									\
>  		b->iph.saddr = a4->s_addr;
>  		b->iph.daddr = c->ip4.addr_seen.s_addr;
>  
> -		if (check)
> -			b->iph.check = *check;
> -		else
> -			tcp_update_check_ip4(b);
> +		b->iph.check = check ? *check :
> +				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> diff --git a/udp.c b/udp.c
> index d514c864ab5b..6f867df81c05 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -270,23 +270,6 @@ static void udp_invert_portmap(struct udp_port_fwd *fwd)
>  	}
>  }
>  
> -/**
> - * udp_update_check4() - Update checksum with variable parts from stored one
> - * @buf:	L2 packet buffer with final IPv4 header
> - */
> -static void udp_update_check4(struct udp4_l2_buf_t *buf)
> -{
> -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_UDP);
> -
> -	sum += buf->iph.tot_len;
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -
> -	buf->iph.check = (uint16_t)~csum_fold(sum);
> -}
> -
>  /**
>   * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
>   * @eth_d:	Ethernet destination address, NULL if unchanged
> @@ -614,7 +597,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
>  		b->iph.saddr = b->s_in.sin_addr.s_addr;
>  	}
>  
> -	udp_update_check4(b);
> +	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
>  	b->uh.source = b->s_in.sin_port;
>  	b->uh.dest = htons(dstport);
>  	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP
  2024-02-02 14:11 ` [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP Laurent Vivier
@ 2024-02-05  6:20   ` David Gibson
  2024-02-07 10:41   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:20 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:34PM +0100, Laurent Vivier wrote:
> The TCP and UDP checksums are computed using the data in the TCP/UDP
> payload but also some informations in the IP header (protocol,
> length, source and destination addresses).
> 
> We add two functions, proto_ipv4_header_checksum() and
> proto_ipv6_header_checksum(), to compute the checksum of the IP
> header part.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  ip.h  | 24 ++++++++++++++++++++++++
>  tcp.c | 40 +++++++++++++++-------------------------
>  udp.c |  6 ++----
>  3 files changed, 41 insertions(+), 29 deletions(-)
> 
> diff --git a/ip.h b/ip.h
> index ff7902c45a95..87cb8dd21d2e 100644
> --- a/ip.h
> +++ b/ip.h
> @@ -97,4 +97,28 @@ static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
>  
>  	return ~csum_fold(sum);
>  }
> +

Function comment.

> +static inline uint32_t proto_ipv4_header_checksum(struct iphdr *iph, int proto)

Logic looks fine, but I don't love the name, since this isn't a
complete checksum, but just an intermediate result we use to compute
the final checksum.  Maybe use the 'psum' name we have in some other
places for "partial sum".
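To make the "partial sum" distinction concrete, here is a rough standalone sketch, not the passt code: the names ipv4_pseudo_psum(), fold16() and l4_checksum() are made up for this example. The first function returns only the unfolded pseudo-header sum; the real checksum exists only once the L4 data has been added, the result folded and complemented.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Partial (unfolded) sum over the IPv4 pseudo-header.
 * @saddr and @daddr are in network byte order, @l4len in host order.
 */
static uint32_t ipv4_pseudo_psum(uint32_t saddr, uint32_t daddr,
				 uint8_t proto, uint16_t l4len)
{
	uint32_t psum = htons(proto);

	psum += (saddr >> 16) & 0xffff;
	psum += saddr & 0xffff;
	psum += (daddr >> 16) & 0xffff;
	psum += daddr & 0xffff;
	psum += htons(l4len);

	return psum;
}

/* Fold a 32-bit partial sum into 16 bits with end-around carry */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Final checksum: add the L4 data to the partial sum, fold, complement */
static uint16_t l4_checksum(const void *l4, size_t len, uint32_t psum)
{
	const uint8_t *p = l4;
	uint16_t w;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&w, p + i, sizeof(w));
		psum += w;
	}

	if (len & 1) {		/* trailing odd byte, zero-padded */
		w = 0;
		memcpy(&w, p + i, 1);
		psum += w;
	}

	return (uint16_t)~fold16(psum);
}
```

With this split, a caller would zero the checksum field, call l4_checksum() with the value from ipv4_pseudo_psum(), and store the result; running l4_checksum() again over the finished segment then yields 0.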

> +{
> +	uint32_t sum = htons(proto);
> +
> +	sum += (iph->saddr >> 16) & 0xffff;
> +	sum += iph->saddr & 0xffff;
> +	sum += (iph->daddr >> 16) & 0xffff;
> +	sum += iph->daddr & 0xffff;
> +	sum += htons(ntohs(iph->tot_len) - 20);
> +
> +	return sum;
> +}
> +
> +static inline uint32_t proto_ipv6_header_checksum(struct ipv6hdr *ip6h,
> +						  int proto)
> +{
> +	uint32_t sum = htons(proto) + ip6h->payload_len;
> +
> +	sum += sum_16b(&ip6h->saddr, sizeof(ip6h->saddr));
> +	sum += sum_16b(&ip6h->daddr, sizeof(ip6h->daddr));
> +
> +	return sum;
> +}
>  #endif /* IP_H */
> diff --git a/tcp.c b/tcp.c
> index 293ab12d8c21..2fd6bc2eda53 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -938,39 +938,25 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>   * tcp_update_check_tcp4() - Update TCP checksum from stored one
>   * @buf:	L2 packet buffer with final IPv4 header
>   */
> -static void tcp_update_check_tcp4(struct tcp4_l2_buf_t *buf)
> +static uint16_t tcp_update_check_tcp4(struct iphdr *iph)
>  {
> -	uint16_t tlen = ntohs(buf->iph.tot_len) - 20;
> -	uint32_t sum = htons(IPPROTO_TCP);
> +	struct tcphdr *th = (void *)(iph + 1);
> +	uint16_t tlen = ntohs(iph->tot_len) - 20;
> +	uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
>  
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -	sum += htons(ntohs(buf->iph.tot_len) - 20);
> -
> -	buf->th.check = 0;
> -	buf->th.check = csum(&buf->th, tlen, sum);
> +	return csum(th, tlen, sum);
>  }
>  
>  /**
>   * tcp_update_check_tcp6() - Calculate TCP checksum for IPv6
>   * @buf:	L2 packet buffer with final IPv6 header
>   */
> -static void tcp_update_check_tcp6(struct tcp6_l2_buf_t *buf)
> +static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
>  {
> -	int len = ntohs(buf->ip6h.payload_len) + sizeof(struct ipv6hdr);
> -
> -	buf->ip6h.hop_limit = IPPROTO_TCP;
> -	buf->ip6h.version = 0;
> -	buf->ip6h.nexthdr = 0;
> +	struct tcphdr *th = (void *)(ip6h + 1);
> +	uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
>  
> -	buf->th.check = 0;
> -	buf->th.check = csum(&buf->ip6h, len, 0);
> -
> -	buf->ip6h.hop_limit = 255;
> -	buf->ip6h.version = 6;
> -	buf->ip6h.nexthdr = IPPROTO_TCP;
> +	return csum(th, ntohs(ip6h->payload_len), sum);
>  }
>  
>  /**
> @@ -1381,7 +1367,7 @@ do {									\
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> -		tcp_update_check_tcp4(b);
> +		b->th.check = tcp_update_check_tcp4(&b->iph);
>  
>  		tlen = tap_iov_len(c, &b->taph, ip_len);
>  	} else {
> @@ -1400,7 +1386,11 @@ do {									\
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> -		tcp_update_check_tcp6(b);
> +		b->th.check = tcp_update_check_tcp6(&b->ip6h);
> +
> +		b->ip6h.hop_limit = 255;
> +		b->ip6h.version = 6;
> +		b->ip6h.nexthdr = IPPROTO_TCP;
>  
>  		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
>  		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
> diff --git a/udp.c b/udp.c
> index 6f867df81c05..96b4e6ca9a85 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -669,10 +669,8 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>  	b->uh.source = b->s_in6.sin6_port;
>  	b->uh.dest = htons(dstport);
>  	b->uh.len = b->ip6h.payload_len;
> -
> -	b->ip6h.hop_limit = IPPROTO_UDP;
> -	b->ip6h.version = b->ip6h.nexthdr = b->uh.check = 0;
> -	b->uh.check = csum(&b->ip6h, ip_len, 0);
> +	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
> +			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
>  	b->ip6h.version = 6;
>  	b->ip6h.nexthdr = IPPROTO_UDP;
>  	b->ip6h.hop_limit = 255;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [PATCH 02/24] pcap: add pcap_iov()
  2024-02-02 14:11 ` [PATCH 02/24] pcap: add pcap_iov() Laurent Vivier
@ 2024-02-05  6:25   ` David Gibson
  2024-02-06 16:10   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-05  6:25 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:29PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  pcap.c | 32 ++++++++++++++++++++++++++++++++
>  pcap.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/pcap.c b/pcap.c
> index 501d52d4992b..b002bb01314c 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -31,6 +31,7 @@
>  #include "util.h"
>  #include "passt.h"
>  #include "log.h"
> +#include "iov.h"
>  
>  #define PCAP_VERSION_MINOR 4
>  
> @@ -130,6 +131,37 @@ void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset)
>  	}
>  }


Function comment.

> +void pcap_iov(const struct iovec *iov, unsigned int n)
> +{
> +	struct timeval tv;
> +	struct pcap_pkthdr h;
> +	size_t len;
> +	unsigned int i;
> +
> +	if (pcap_fd == -1)
> +		return;
> +
> +	gettimeofday(&tv, NULL);
> +
> +	len = iov_size(iov, n);
> +
> +	h.tv_sec = tv.tv_sec;
> +	h.tv_usec = tv.tv_usec;
> +	h.caplen = h.len = len;
> +
> +	if (write(pcap_fd, &h, sizeof(h)) < 0) {
> +		debug("Cannot write pcap header");
> +		return;
> +	}

It would be nice to have a common helper for writing the header used
by pcap_iov() and pcap_frame().
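Something along these lines is what I have in mind, as a sketch only. The names struct pcap_rec_hdr and pcap_write_hdr() are hypothetical, and the layout is assumed to be the classic pcap per-record header (four 32-bit fields), which should match what pcap.c already writes.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout: classic pcap per-record header, four 32-bit fields */
struct pcap_rec_hdr {
	uint32_t tv_sec;
	uint32_t tv_usec;
	uint32_t caplen;
	uint32_t len;
};

/* Hypothetical common helper: write one record header to @fd.
 * Return: 0 on success, -1 on error or short write
 */
static int pcap_write_hdr(int fd, const struct timeval *tv, size_t len)
{
	struct pcap_rec_hdr h = {
		.tv_sec  = (uint32_t)tv->tv_sec,
		.tv_usec = (uint32_t)tv->tv_usec,
		.caplen  = (uint32_t)len,
		.len     = (uint32_t)len,
	};

	if (write(fd, &h, sizeof(h)) != (ssize_t)sizeof(h))
		return -1;

	return 0;
}
```

Both callers would then only differ in how they write the payload that follows the header.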

> +	for (i = 0; i < n; i++) {
> +		if (write(pcap_fd, iov[i].iov_base, iov[i].iov_len) < 0) {

You should be able to avoid the loop by using writev(), no?

Although, both here and in the existing code, we should technically be
looping on short write()s to make sure we've written everything.
That's more awkward with an iov.
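To show where the awkwardness comes from, here is a rough sketch of a writev() loop that copes with short writes. The name writev_all() is made up for this example, and it assumes the caller can hand over a modifiable iovec array, which is not the case for the const array pcap_iov() takes, so a copy would be needed there.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: write all @n buffers of @iov to @fd, retrying on
 * short writes. The array is modified in place as data is consumed.
 * Return: 0 on success, -1 on error or lack of progress
 */
static int writev_all(int fd, struct iovec *iov, int n)
{
	while (n > 0) {
		ssize_t rc;

		if (!iov->iov_len) {	/* skip empty buffers */
			iov++;
			n--;
			continue;
		}

		rc = writev(fd, iov, n);
		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (rc == 0)		/* no progress, give up */
			return -1;

		/* Drop the buffers that were written completely */
		while (n > 0 && (size_t)rc >= iov->iov_len) {
			rc -= iov->iov_len;
			iov++;
			n--;
		}

		/* Short write inside a buffer: resume from where it stopped */
		if (n > 0) {
			iov->iov_base = (char *)iov->iov_base + rc;
			iov->iov_len -= rc;
		}
	}

	return 0;
}
```

The bookkeeping after each writev() call is the part that a plain write() loop over a single buffer doesn't need.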

> +			debug("Cannot log packet, iov %d length %lu\n",
> +			      i, iov[i].iov_len);
> +		}
> +	}
> +}
> +
>  /**
>   * pcap_init() - Initialise pcap file
>   * @c:		Execution context
> diff --git a/pcap.h b/pcap.h
> index da5a7e846b72..732a0ddf14cc 100644
> --- a/pcap.h
> +++ b/pcap.h
> @@ -8,6 +8,7 @@
>  
>  void pcap(const char *pkt, size_t len);
>  void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset);
> +void pcap_iov(const struct iovec *iov, unsigned int n);
>  void pcap_init(struct ctx *c);
>  
>  #endif /* PCAP_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [PATCH 08/24] tcp: extract buffer management from tcp_send_flag()
  2024-02-02 14:11 ` [PATCH 08/24] tcp: extract buffer management from tcp_send_flag() Laurent Vivier
@ 2024-02-06  0:24   ` David Gibson
  2024-02-08 16:57   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-06  0:24 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:35PM +0100, Laurent Vivier wrote:

This definitely needs a commit message to explain what you're trying
to achieve here.  Without further context "buffer management" suggests
to me allocation / freeing / placement of buffers, whereas what's
actually being moved here is code related to the construction of
headers within the buffers.

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  tcp.c | 224 +++++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 129 insertions(+), 95 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 2fd6bc2eda53..20ad8a4e5271 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1320,87 +1320,98 @@ void tcp_defer_handler(struct ctx *c)
>  	tcp_l2_data_buf_flush(c);
>  }
>  

Function comment.

> +static void tcp_set_tcp_header(struct tcphdr *th,
> +			       const struct tcp_tap_conn *conn, uint32_t seq)
> +{
> +	th->source = htons(conn->fport);
> +	th->dest = htons(conn->eport);
> +	th->seq = htonl(seq);
> +	th->ack_seq = htonl(conn->seq_ack_to_tap);
> +	if (conn->events & ESTABLISHED)	{
> +		th->window = htons(conn->wnd_to_tap);
> +	} else {
> +		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;
> +
> +		th->window = htons(MIN(wnd, USHRT_MAX));
> +	}
> +}
> +
>  /**
> - * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
> + * ipv4_fill_headers() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
>   * @c:		Execution context
>   * @conn:	Connection pointer
> - * @p:		Pointer to any type of TCP pre-cooked buffer
> + * @iph:	Pointer to IPv4 header, immediately followed by a TCP header
>   * @plen:	Payload length (including TCP header options)
>   * @check:	Checksum, if already known
>   * @seq:	Sequence number for this segment
>   *
> - * Return: frame length including L2 headers, host order
> + * Return: IP frame length including L2 headers, host order

This is an odd case where adding an extra descriptor is making it less
clear to me.  Without context, "frame" suggests to me an entire L2
frame containing whatever inner protocol.  But I'm not immediately
sure what "IP frame" means: is it an L2 frame which happens to have IP
inside, or does it mean just the IP packet without the L2 headers?
The rest of the sentence clarifies it's the first, but it still throws
me for a moment.

>   */
> -static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> -				      const struct tcp_tap_conn *conn,
> -				      void *p, size_t plen,
> -				      const uint16_t *check, uint32_t seq)
> +
> +static size_t ipv4_fill_headers(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
> +				struct iphdr *iph, size_t plen,
> +				const uint16_t *check, uint32_t seq)

I really like this re-organization of the header filling code.  I
wonder if splitting it out into its own patch might make the remainder
of this patch easier to follow.

>  {
> +	struct tcphdr *th = (void *)(iph + 1);
>  	const struct in_addr *a4 = inany_v4(&conn->faddr);
> -	size_t ip_len, tlen;
> -
> -#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq)			\
> -do {									\
> -	b->th.source = htons(conn->fport);				\
> -	b->th.dest = htons(conn->eport);				\
> -	b->th.seq = htonl(seq);						\
> -	b->th.ack_seq = htonl(conn->seq_ack_to_tap);			\
> -	if (conn->events & ESTABLISHED)	{				\
> -		b->th.window = htons(conn->wnd_to_tap);			\
> -	} else {							\
> -		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;	\
> -									\
> -		b->th.window = htons(MIN(wnd, USHRT_MAX));		\
> -	}								\
> -} while (0)
> -
> -	if (a4) {
> -		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
> -
> -		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
> -		b->iph.tot_len = htons(ip_len);
> -		b->iph.saddr = a4->s_addr;
> -		b->iph.daddr = c->ip4.addr_seen.s_addr;
> -
> -		b->iph.check = check ? *check :
> -				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
> -
> -		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
> -
> -		b->th.check = tcp_update_check_tcp4(&b->iph);
> -
> -		tlen = tap_iov_len(c, &b->taph, ip_len);
> -	} else {
> -		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
> +	size_t ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
>  
> -		ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
> +	iph->tot_len = htons(ip_len);
> +	iph->saddr = a4->s_addr;
> +	iph->daddr = c->ip4.addr_seen.s_addr;
>  
> -		b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr));
> -		b->ip6h.saddr = conn->faddr.a6;
> -		if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr))
> -			b->ip6h.daddr = c->ip6.addr_ll_seen;
> -		else
> -			b->ip6h.daddr = c->ip6.addr_seen;
> +	iph->check = check ? *check : ipv4_hdr_checksum(iph, IPPROTO_TCP);
> +
> +	tcp_set_tcp_header(th, conn, seq);
> +
> +	th->check = tcp_update_check_tcp4(iph);
> +
> +	return ip_len;
> +}
> +
> +/**
> + * ipv6_fill_headers() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @ip6h:	Pointer to IPv6 header, immediately followed by a TCP header
> + * @plen:	Payload length (including TCP header options)
> + * @check:	Checksum, if already known
> + * @seq:	Sequence number for this segment
> + *
> + * Return: IP frame length including L2 headers, host order
> + */
> +
> +static size_t ipv6_fill_headers(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
> +				struct ipv6hdr *ip6h, size_t plen,
> +				uint32_t seq)
> +{
> +	struct tcphdr *th = (void *)(ip6h + 1);
> +	size_t ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
>  
> -		memset(b->ip6h.flow_lbl, 0, 3);
> +	ip6h->payload_len = htons(plen + sizeof(struct tcphdr));
> +	ip6h->saddr = conn->faddr.a6;
> +	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
> +		ip6h->daddr = c->ip6.addr_ll_seen;
> +	else
> +		ip6h->daddr = c->ip6.addr_seen;
>  
> -		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
> +	memset(ip6h->flow_lbl, 0, 3);
>  
> -		b->th.check = tcp_update_check_tcp6(&b->ip6h);
> +	tcp_set_tcp_header(th, conn, seq);
>  
> -		b->ip6h.hop_limit = 255;
> -		b->ip6h.version = 6;
> -		b->ip6h.nexthdr = IPPROTO_TCP;
> +	th->check = tcp_update_check_tcp6(ip6h);
>  
> -		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
> -		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
> -		b->ip6h.flow_lbl[2] = (conn->sock >> 0) & 0xff;
> +	ip6h->hop_limit = 255;
> +	ip6h->version = 6;
> +	ip6h->nexthdr = IPPROTO_TCP;
>  
> -		tlen = tap_iov_len(c, &b->taph, ip_len);
> -	}
> -#undef SET_TCP_HEADER_COMMON_V4_V6
> +	ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf;
> +	ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
> +	ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
>  
> -	return tlen;
> +	return ip_len;
>  }
>  
>  /**
> @@ -1520,27 +1531,21 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
>  }
>  
>  /**
> - * tcp_send_flag() - Send segment with flags to tap (no payload)
> + * do_tcp_send_flag() - Send segment with flags to tap (no payload)

As a rule, I don't love "do_X" as a function name.  I particularly
dislike it here because AFAICT this function isn't actually the one
that "does" the sending of the flag - that happens with the
tcp_l2_flags_buf_flush() in the caller.

>   * @c:		Execution context
>   * @conn:	Connection pointer
>   * @flags:	TCP flags: if not set, send segment only if ACK is due
>   *
>   * Return: negative error code on connection reset, 0 otherwise
>   */
> -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +
> +static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,			    struct tcphdr *th, char *data, size_t optlen)
>  {
>  	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
>  	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
> -	struct tcp4_l2_flags_buf_t *b4 = NULL;
> -	struct tcp6_l2_flags_buf_t *b6 = NULL;
>  	struct tcp_info tinfo = { 0 };
>  	socklen_t sl = sizeof(tinfo);
>  	int s = conn->sock;
> -	size_t optlen = 0;
> -	struct iovec *iov;
> -	struct tcphdr *th;
> -	char *data;
> -	void *p;
>  
>  	if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
>  	    !flags && conn->wnd_to_tap)
> @@ -1562,26 +1567,9 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
>  		return 0;
>  
> -	if (CONN_V4(conn)) {
> -		iov = tcp4_l2_flags_iov    + tcp4_l2_flags_buf_used;
> -		p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> -		th = &b4->th;
> -
> -		/* gcc 11.2 would complain on data = (char *)(th + 1); */
> -		data = b4->opts;
> -	} else {
> -		iov = tcp6_l2_flags_iov    + tcp6_l2_flags_buf_used;
> -		p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> -		th = &b6->th;
> -		data = b6->opts;
> -	}
> -
>  	if (flags & SYN) {
>  		int mss;
>  
> -		/* Options: MSS, NOP and window scale (8 bytes) */
> -		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
> -
>  		*data++ = OPT_MSS;
>  		*data++ = OPT_MSS_LEN;
>  
> @@ -1624,9 +1612,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	th->syn = !!(flags & SYN);
>  	th->fin = !!(flags & FIN);
>  
> -	iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
> -					       NULL, conn->seq_to_tap);
> -
>  	if (th->ack) {
>  		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
>  			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
> @@ -1641,8 +1626,38 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	if (th->fin || th->syn)
>  		conn->seq_to_tap++;
>  
> +	return 1;
> +}
> +
> +static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	size_t optlen = 0;
> +	struct iovec *iov;
> +	size_t ip_len;
> +	int ret;
> +
> +	/* Options: MSS, NOP and window scale (8 bytes) */
> +	if (flags & SYN)
> +		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
> +
>  	if (CONN_V4(conn)) {
> +		struct tcp4_l2_flags_buf_t *b4;
> +
> +		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
> +		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> +
> +		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
> +				       optlen);
> +		if (ret <= 0)
> +			return ret;
> +
> +		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
> +					   NULL, conn->seq_to_tap);
> +
> +		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
> +
>  		if (flags & DUP_ACK) {
> +
>  			memcpy(b4 + 1, b4, sizeof(*b4));
>  			(iov + 1)->iov_len = iov->iov_len;
>  			tcp4_l2_flags_buf_used++;
> @@ -1651,6 +1666,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
>  			tcp_l2_flags_buf_flush(c);
>  	} else {
> +		struct tcp6_l2_flags_buf_t *b6;
> +
> +		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
> +		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> +
> +		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
> +				       optlen);
> +		if (ret <= 0)
> +			return ret;
> +
> +		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
> +					   conn->seq_to_tap);
> +
> +		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
> +
>  		if (flags & DUP_ACK) {
>  			memcpy(b6 + 1, b6, sizeof(*b6));
>  			(iov + 1)->iov_len = iov->iov_len;
> @@ -2050,6 +2080,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  {
>  	uint32_t *seq_update = &conn->seq_to_tap;
>  	struct iovec *iov;
> +	size_t ip_len;
>  
>  	if (CONN_V4(conn)) {
>  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> @@ -2058,9 +2089,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
>  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
>  
> +		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
> +					   check, seq);
> +
>  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> -						       check, seq);
> +		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
>  			tcp_l2_data_buf_flush(c);
>  	} else if (CONN_V6(conn)) {
> @@ -2069,9 +2102,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
>  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
>  
> +		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
> +
>  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> -						       NULL, seq);
> +		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
>  			tcp_l2_data_buf_flush(c);
>  	}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss()
  2024-02-02 14:11 ` [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss() Laurent Vivier
@ 2024-02-06  0:47   ` David Gibson
  2024-02-08 16:59   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-06  0:47 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:36PM +0100, Laurent Vivier wrote:

Even more than the previous patch, this doesn't really seem like
"buffer management" to me.  I'd say rather that this is extracting
the maximum MSS calculation into a helper.

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  tcp.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 20ad8a4e5271..cdbceed65033 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1813,6 +1813,14 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>  	return s;
>  }
>  

Function comment.
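
Something along these lines, in the comment style already used in
tcp.c, would do.  This is only a sketch: the comment wording, the
stand-in struct and the MSS4/MSS6 values are assumptions made so the
fragment builds on its own, they are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the sketch is self-contained: in tcp.c, MSS4/MSS6 are derived
 * from the buffer layouts and CONN_V4() inspects the real connection
 * structure. The values and the 'v4' field below are placeholders.
 */
#define MSS4	1460
#define MSS6	1440

struct tcp_tap_conn {
	bool v4;	/* hypothetical field, for this sketch only */
};

#define CONN_V4(conn)	((conn)->v4)

/**
 * tcp_buf_conn_tap_mss() - Get the largest MSS the L2 buffers can carry
 * @conn:	Connection pointer
 *
 * Return: maximum segment size for the IP version of @conn
 */
static uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)
{
	assert(conn);

	if (CONN_V4(conn))
		return MSS4;

	return MSS6;
}
```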

> +static uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)
> +{
> +	if (CONN_V4(conn))
> +		return MSS4;
> +
> +	return MSS6;
> +}
> +
>  /**
>   * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
>   * @conn:	Connection pointer
> @@ -1832,10 +1840,7 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
>  	else
>  		mss = ret;
>  
> -	if (CONN_V4(conn))
> -		mss = MIN(MSS4, mss);
> -	else
> -		mss = MIN(MSS6, mss);
> +	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
>  
>  	return MIN(mss, USHRT_MAX);
>  }

* Re: [PATCH 10/24] tcp: rename functions that manage buffers
  2024-02-02 14:11 ` [PATCH 10/24] tcp: rename functions that manage buffers Laurent Vivier
@ 2024-02-06  1:48   ` David Gibson
  2024-02-08 17:10     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-06  1:48 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:37PM +0100, Laurent Vivier wrote:
> To separate these functions from the ones specific to TCP management,
> we are going to move it to a new file, but before that update their names
> to reflect their role.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  passt.c |  2 +-
>  tcp.c   | 84 ++++++++++++++++++++++++++++-----------------------------
>  tcp.h   |  2 +-
>  3 files changed, 44 insertions(+), 44 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index 44d3a0b0548c..10042a9b9789 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -164,7 +164,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
>   */
>  void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
>  {
> -	tcp_update_l2_buf(eth_d, eth_s);
> +	tcp_buf_update_l2(eth_d, eth_s);
>  	udp_update_l2_buf(eth_d, eth_s);
>  }
>  
> diff --git a/tcp.c b/tcp.c
> index cdbceed65033..640209533772 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -383,7 +383,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
>  #define ACK		(1 << 4)
>  /* Flags for internal usage */
>  #define DUP_ACK		(1 << 5)
> -#define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> +#define ACK_IF_NEEDED	0		/* See tcp_buf_send_flag() */
>  
>  #define OPT_EOL		0
>  #define OPT_NOP		1
> @@ -960,11 +960,11 @@ static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
>  }
>  
>  /**
> - * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
> + * tcp_buf_update_l2() - Update L2 buffers with Ethernet and IPv4 addresses
>   * @eth_d:	Ethernet destination address, NULL if unchanged
>   * @eth_s:	Ethernet source address, NULL if unchanged
>   */
> -void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> +void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
>  {
>  	int i;
>  
> @@ -982,10 +982,10 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
>  }
>  
>  /**
> - * tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
> + * tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
>   * @c:		Execution context
>   */
> -static void tcp_sock4_iov_init(const struct ctx *c)
> +static void tcp_buf_sock4_iov_init(const struct ctx *c)
>  {
>  	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
>  	struct iovec *iov;
> @@ -1014,10 +1014,10 @@ static void tcp_sock4_iov_init(const struct ctx *c)
>  }
>  
>  /**
> - * tcp_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
> + * tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
>   * @c:		Execution context
>   */
> -static void tcp_sock6_iov_init(const struct ctx *c)
> +static void tcp_buf_sock6_iov_init(const struct ctx *c)
>  {
>  	struct iovec *iov;
>  	int i;
> @@ -1277,10 +1277,10 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
>  	} while (0)
>  
>  /**
> - * tcp_l2_flags_buf_flush() - Send out buffers for segments with no data (flags)
> + * tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
>   * @c:		Execution context
>   */
> -static void tcp_l2_flags_buf_flush(const struct ctx *c)
> +static void tcp_buf_l2_flags_flush(const struct ctx *c)
>  {
>  	tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
>  	tcp6_l2_flags_buf_used = 0;
> @@ -1290,10 +1290,10 @@ static void tcp_l2_flags_buf_flush(const struct ctx *c)
>  }
>  
>  /**
> - * tcp_l2_data_buf_flush() - Send out buffers for segments with data
> + * tcp_buf_l2_data_flush() - Send out buffers for segments with data
>   * @c:		Execution context
>   */
> -static void tcp_l2_data_buf_flush(const struct ctx *c)
> +static void tcp_buf_l2_data_flush(const struct ctx *c)
>  {
>  	unsigned i;
>  	size_t m;
> @@ -1316,8 +1316,8 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
>  /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
>  void tcp_defer_handler(struct ctx *c)
>  {
> -	tcp_l2_flags_buf_flush(c);
> -	tcp_l2_data_buf_flush(c);
> +	tcp_buf_l2_flags_flush(c);
> +	tcp_buf_l2_data_flush(c);
>  }
>  
>  static void tcp_set_tcp_header(struct tcphdr *th,
> @@ -1629,7 +1629,7 @@ static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
>  	return 1;
>  }
>  
> -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +static int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)

The functions above could reasonably be said to be part of the buffer
management, but I'm not convinced on this one - its primary purpose
is to, well, send a flag, so it uses the buffers, but I wouldn't
really say it manages them.

>  {
>  	size_t optlen = 0;
>  	struct iovec *iov;
> @@ -1664,7 +1664,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		}
>  
>  		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
> -			tcp_l2_flags_buf_flush(c);
> +			tcp_buf_l2_flags_flush(c);
>  	} else {
>  		struct tcp6_l2_flags_buf_t *b6;
>  
> @@ -1688,7 +1688,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		}
>  
>  		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
> -			tcp_l2_flags_buf_flush(c);
> +			tcp_buf_l2_flags_flush(c);
>  	}
>  
>  	return 0;
> @@ -1704,7 +1704,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
>  	if (conn->events == CLOSED)
>  		return;
>  
> -	if (!tcp_send_flag(c, conn, RST))
> +	if (!tcp_buf_send_flag(c, conn, RST))
>  		conn_event(c, conn, CLOSED);
>  }
>  
> @@ -2024,7 +2024,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>  	} else {
>  		tcp_get_sndbuf(conn);
>  
> -		if (tcp_send_flag(c, conn, SYN | ACK))
> +		if (tcp_buf_send_flag(c, conn, SYN | ACK))
>  			return;
>  
>  		conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -2100,7 +2100,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
>  		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
> -			tcp_l2_data_buf_flush(c);
> +			tcp_buf_l2_data_flush(c);
>  	} else if (CONN_V6(conn)) {
>  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
>  
> @@ -2112,12 +2112,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
>  		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
> -			tcp_l2_data_buf_flush(c);
> +			tcp_buf_l2_data_flush(c);
>  	}
>  }
>  
>  /**
> - * tcp_data_from_sock() - Handle new data from socket, queue to tap, in window
> + * tcp_buf_data_from_sock() - Handle new data from socket, queue to tap, in window

Same with this one.

>   * @c:		Execution context
>   * @conn:	Connection pointer
>   *
> @@ -2125,7 +2125,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * #syscalls recvmsg
>   */
> -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +static int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> @@ -2169,7 +2169,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  
>  	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
>  	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
> -		tcp_l2_data_buf_flush(c);
> +		tcp_buf_l2_data_flush(c);
>  
>  		/* Silence Coverity CWE-125 false positive */
>  		tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
> @@ -2195,7 +2195,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  
>  	if (!len) {
>  		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> -			if ((ret = tcp_send_flag(c, conn, FIN | ACK))) {
> +			if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
>  				tcp_rst(c, conn);
>  				return ret;
>  			}
> @@ -2378,7 +2378,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			   max_ack_seq, conn->seq_to_tap);
>  		conn->seq_ack_from_tap = max_ack_seq;
>  		conn->seq_to_tap = max_ack_seq;
> -		tcp_data_from_sock(c, conn);
> +		tcp_buf_data_from_sock(c, conn);
>  	}
>  
>  	if (!iov_i)
> @@ -2394,14 +2394,14 @@ eintr:
>  			 *   Then swiftly looked away and left.
>  			 */
>  			conn->seq_from_tap = seq_from_tap;
> -			tcp_send_flag(c, conn, ACK);
> +			tcp_buf_send_flag(c, conn, ACK);
>  		}
>  
>  		if (errno == EINTR)
>  			goto eintr;
>  
>  		if (errno == EAGAIN || errno == EWOULDBLOCK) {
> -			tcp_send_flag(c, conn, ACK_IF_NEEDED);
> +			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
>  			return p->count - idx;
>  
>  		}
> @@ -2411,7 +2411,7 @@ eintr:
>  	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
>  		partial_send = 1;
>  		conn->seq_from_tap += n;
> -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
>  	} else {
>  		conn->seq_from_tap += n;
>  	}
> @@ -2424,7 +2424,7 @@ out:
>  		 */
>  		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
>  			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
> -			tcp_send_flag(c, conn, DUP_ACK);
> +			tcp_buf_send_flag(c, conn, DUP_ACK);
>  		}
>  		return p->count - idx;
>  	}
> @@ -2438,7 +2438,7 @@ out:
>  
>  		conn_event(c, conn, TAP_FIN_RCVD);
>  	} else {
> -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
>  	}
>  
>  	return p->count - idx;
> @@ -2474,8 +2474,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
>  	 */
> -	tcp_data_from_sock(c, conn);
> -	tcp_send_flag(c, conn, ACK);
> +	tcp_buf_data_from_sock(c, conn);
> +	tcp_buf_send_flag(c, conn, ACK);
>  }
>  
>  /**
> @@ -2555,7 +2555,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  			conn->seq_from_tap++;
>  
>  			shutdown(conn->sock, SHUT_WR);
> -			tcp_send_flag(c, conn, ACK);
> +			tcp_buf_send_flag(c, conn, ACK);
>  			conn_event(c, conn, SOCK_FIN_SENT);
>  
>  			return 1;
> @@ -2566,7 +2566,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  
>  		tcp_tap_window_update(conn, ntohs(th->window));
>  
> -		tcp_data_from_sock(c, conn);
> +		tcp_buf_data_from_sock(c, conn);
>  
>  		if (p->count - idx == 1)
>  			return 1;
> @@ -2596,7 +2596,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
>  		shutdown(conn->sock, SHUT_WR);
>  		conn_event(c, conn, SOCK_FIN_SENT);
> -		tcp_send_flag(c, conn, ACK);
> +		tcp_buf_send_flag(c, conn, ACK);
>  		ack_due = 0;
>  	}
>  
> @@ -2630,7 +2630,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>  		return;
>  	}
>  
> -	if (tcp_send_flag(c, conn, SYN | ACK))
> +	if (tcp_buf_send_flag(c, conn, SYN | ACK))
>  		return;
>  
>  	conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -2698,7 +2698,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
>  
>  	conn->wnd_from_tap = WINDOW_DEFAULT;
>  
> -	tcp_send_flag(c, conn, SYN);
> +	tcp_buf_send_flag(c, conn, SYN);
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  
>  	tcp_get_sndbuf(conn);
> @@ -2762,7 +2762,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  		return;
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
>  		tcp_timer_ctl(c, conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
> @@ -2778,7 +2778,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> -			tcp_data_from_sock(c, conn);
> +			tcp_buf_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
>  	} else {
> @@ -2833,7 +2833,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  			conn_event(c, conn, SOCK_FIN_RCVD);
>  
>  		if (events & EPOLLIN)
> -			tcp_data_from_sock(c, conn);
> +			tcp_buf_data_from_sock(c, conn);
>  
>  		if (events & EPOLLOUT)
>  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> @@ -3058,10 +3058,10 @@ int tcp_init(struct ctx *c)
>  		tc_hash[b] = FLOW_SIDX_NONE;
>  
>  	if (c->ifi4)
> -		tcp_sock4_iov_init(c);
> +		tcp_buf_sock4_iov_init(c);
>  
>  	if (c->ifi6)
> -		tcp_sock6_iov_init(c);
> +		tcp_buf_sock6_iov_init(c);
>  
>  	memset(init_sock_pool4,		0xff,	sizeof(init_sock_pool4));
>  	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
> diff --git a/tcp.h b/tcp.h
> index b9f546d31002..e7dbcfa2ddbd 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -23,7 +23,7 @@ int tcp_init(struct ctx *c);
>  void tcp_timer(struct ctx *c, const struct timespec *now);
>  void tcp_defer_handler(struct ctx *c);
>  
> -void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
> +void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s);
>  
>  /**
>   * union tcp_epoll_ref - epoll reference portion for TCP connections

* Re: [PATCH 12/24] tap: make tap_update_mac() generic
  2024-02-02 14:11 ` [PATCH 12/24] tap: make tap_update_mac() generic Laurent Vivier
@ 2024-02-06  1:49   ` David Gibson
  2024-02-08 17:10     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-06  1:49 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:39PM +0100, Laurent Vivier wrote:
> Use ethhdr rather than tap_hdr.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I'd be happy to see this applied immediately, in advance of the rest
of the series.
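
To illustrate why taking the Ethernet header directly is more
generic, here is a minimal sketch.  The struct names and the buffer
layout are hypothetical (local stand-ins, not passt's definitions);
the point is only that any buffer embedding an Ethernet header can
use the helper, whatever precedes that header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for struct ethhdr from <linux/if_ether.h> */
struct ethhdr_sketch {
	unsigned char h_dest[6];
	unsigned char h_source[6];
	unsigned short h_proto;
};

/* Hypothetical buffer whose leading header is not a struct tap_hdr */
struct other_buf_sketch {
	unsigned int len_prefix;
	struct ethhdr_sketch eh;
};

/* Same logic as the helper in the patch: NULL means "leave unchanged" */
static void eth_update_mac_sketch(struct ethhdr_sketch *eh,
				  const unsigned char *eth_d,
				  const unsigned char *eth_s)
{
	assert(eh);

	if (eth_d)
		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
	if (eth_s)
		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
}
```

A caller would pass &buf.eh for whichever buffer type it has, as the
tcp_buf.c and udp.c hunks do with &b4->taph.eh.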

> ---
>  tap.c     | 6 +++---
>  tap.h     | 2 +-
>  tcp_buf.c | 8 ++++----
>  udp.c     | 4 ++--
>  4 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 3ea03f720d6d..29f389057ac1 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -447,13 +447,13 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
>   * @eth_d:	Ethernet destination address, NULL if unchanged
>   * @eth_s:	Ethernet source address, NULL if unchanged
>   */
> -void tap_update_mac(struct tap_hdr *taph,
> +void eth_update_mac(struct ethhdr *eh,
>  		    const unsigned char *eth_d, const unsigned char *eth_s)
>  {
>  	if (eth_d)
> -		memcpy(taph->eh.h_dest, eth_d, sizeof(taph->eh.h_dest));
> +		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
>  	if (eth_s)
> -		memcpy(taph->eh.h_source, eth_s, sizeof(taph->eh.h_source));
> +		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
>  }
>  
>  PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> diff --git a/tap.h b/tap.h
> index 466d91466c3d..437b9aa2b43f 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -74,7 +74,7 @@ void tap_icmp6_send(const struct ctx *c,
>  		    const void *in, size_t len);
>  int tap_send(const struct ctx *c, const void *data, size_t len);
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n);
> -void tap_update_mac(struct tap_hdr *taph,
> +void eth_update_mac(struct ethhdr *eh,
>  		    const unsigned char *eth_d, const unsigned char *eth_s);
>  void tap_listen_handler(struct ctx *c, uint32_t events);
>  void tap_handler_pasta(struct ctx *c, uint32_t events,
> diff --git a/tcp_buf.c b/tcp_buf.c
> index d70e7f810e4a..4c1f00c1d1b2 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -218,10 +218,10 @@ void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
>  		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
>  		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
>  
> -		tap_update_mac(&b4->taph, eth_d, eth_s);
> -		tap_update_mac(&b6->taph, eth_d, eth_s);
> -		tap_update_mac(&b4f->taph, eth_d, eth_s);
> -		tap_update_mac(&b6f->taph, eth_d, eth_s);
> +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> +		eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
> +		eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
>  	}
>  }
>  
> diff --git a/udp.c b/udp.c
> index 96b4e6ca9a85..db635742319b 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -283,8 +283,8 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
>  		struct udp4_l2_buf_t *b4 = &udp4_l2_buf[i];
>  		struct udp6_l2_buf_t *b6 = &udp6_l2_buf[i];
>  
> -		tap_update_mac(&b4->taph, eth_d, eth_s);
> -		tap_update_mac(&b6->taph, eth_d, eth_s);
> +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
>  	}
>  }
>  

* Re: [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add()
  2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
  2024-02-02 14:29   ` Laurent Vivier
@ 2024-02-06  1:52   ` David Gibson
  2024-02-11 23:15   ` Stefano Brivio
  2 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-06  1:52 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev, Laurent Vivier

On Fri, Feb 02, 2024 at 03:11:40PM +0100, Laurent Vivier wrote:
> From: Laurent Vivier <laurent@vivier.eu>

Rationale?

> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

Otherwise LGTM.
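
For reference, the EtherType dispatch that the patch factors out of
both tap handlers can be sketched in isolation as below.  The enum
and function names are hypothetical and exist only for this example;
the EtherType values are the standard ones for IPv4, ARP and IPv6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_SKETCH		14	/* dest MAC + source MAC + EtherType */
#define ETH_P_IP_SKETCH		0x0800
#define ETH_P_ARP_SKETCH	0x0806
#define ETH_P_IPV6_SKETCH	0x86DD

/* Hypothetical labels standing in for pool_tap4 and pool_tap6 */
enum pool_sketch { POOL_NONE, POOL_TAP4, POOL_TAP6 };

/* Pick the pool for a frame from its EtherType (network byte order on the
 * wire, so the two bytes are combined most significant first).
 */
static enum pool_sketch pool_for_frame_sketch(const unsigned char *frame,
					      size_t len)
{
	uint16_t proto;

	assert(frame);

	if (len < ETH_HLEN_SKETCH)
		return POOL_NONE;

	proto = (uint16_t)((frame[12] << 8) | frame[13]);

	switch (proto) {
	case ETH_P_ARP_SKETCH:
	case ETH_P_IP_SKETCH:
		return POOL_TAP4;
	case ETH_P_IPV6_SKETCH:
		return POOL_TAP6;
	default:
		return POOL_NONE;
	}
}
```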

> ---
>  tap.c | 98 +++++++++++++++++++++++++++++------------------------------
>  tap.h |  7 +++++
>  2 files changed, 56 insertions(+), 49 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 29f389057ac1..5b1b61550c13 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -911,6 +911,45 @@ append:
>  	return in->count;
>  }
>  
> +void pool_flush_all(void)
> +{
> +	pool_flush(pool_tap4);
> +	pool_flush(pool_tap6);
> +}
> +
> +void tap_handler_all(struct ctx *c, const struct timespec *now)
> +{
> +	tap4_handler(c, pool_tap4, now);
> +	tap6_handler(c, pool_tap6, now);
> +}
> +
> +void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
> +		       const char *func, int line)
> +{
> +	const struct ethhdr *eh;
> +
> +	pcap(p, len);
> +
> +	eh = (struct ethhdr *)p;
> +
> +	if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> +		memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> +		proto_update_l2_buf(c->mac_guest, NULL);
> +	}
> +
> +	switch (ntohs(eh->h_proto)) {
> +	case ETH_P_ARP:
> +	case ETH_P_IP:
> +		packet_add_do(pool_tap4, len, p, func, line);
> +		break;
> +	case ETH_P_IPV6:
> +		packet_add_do(pool_tap6, len, p, func, line);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  /**
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
> @@ -937,7 +976,6 @@ static void tap_sock_reset(struct ctx *c)
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now)
>  {
> -	const struct ethhdr *eh;
>  	ssize_t n, rem;
>  	char *p;
>  
> @@ -950,8 +988,7 @@ redo:
>  	p = pkt_buf;
>  	rem = 0;
>  
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>  
>  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
>  	if (n < 0) {
> @@ -978,37 +1015,18 @@ redo:
>  		/* Complete the partial read above before discarding a malformed
>  		 * frame, otherwise the stream will be inconsistent.
>  		 */
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU)
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU)
>  			goto next;
>  
> -		pcap(p, len);
> -
> -		eh = (struct ethhdr *)p;
> -
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, p);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, p);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, p);
>  
>  next:
>  		p += len;
>  		n -= len;
>  	}
>  
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>  
>  	/* We can't use EPOLLET otherwise. */
>  	if (rem)
> @@ -1033,35 +1051,18 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  redo:
>  	n = 0;
>  
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>  restart:
>  	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
> -		const struct ethhdr *eh = (struct ethhdr *)(pkt_buf + n);
>  
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU) {
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU) {
>  			n += len;
>  			continue;
>  		}
>  
> -		pcap(pkt_buf + n, len);
>  
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, pkt_buf + n);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, pkt_buf + n);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, pkt_buf + n);
>  
>  		if ((n += len) == TAP_BUF_BYTES)
>  			break;
> @@ -1072,8 +1073,7 @@ restart:
>  
>  	ret = errno;
>  
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>  
>  	if (len > 0 || ret == EAGAIN)
>  		return;
> diff --git a/tap.h b/tap.h
> index 437b9aa2b43f..7157ef37ee6e 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -82,5 +82,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  void tap_sock_init(struct ctx *c);
> +void pool_flush_all(void);
> +void tap_handler_all(struct ctx *c, const struct timespec *now);
> +
> +void packet_add_do(struct pool *p, size_t len, const char *start,
> +		   const char *func, int line);
> +#define packet_add_all(p, len, start)					\
> +	packet_add_all_do(p, len, start, __func__, __LINE__)
>  
>  #endif /* TAP_H */

* Re: [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX()
  2024-02-02 14:11 ` [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX() Laurent Vivier
@ 2024-02-06  1:59   ` David Gibson
  2024-02-11 23:16   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-06  1:59 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:41PM +0100, Laurent Vivier wrote:

Commit message please.

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  udp.c | 126 ++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 73 insertions(+), 53 deletions(-)
> 
> diff --git a/udp.c b/udp.c
> index db635742319b..77168fb0a2af 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -562,47 +562,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
> -			      const struct timespec *now)
> +static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
> +			      size_t data_len, struct sockaddr_in *s_in,
> +			      in_port_t dstport, const struct timespec *now)

This is a much better interface, nice change.
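
As a rough sketch of what the new signature makes possible: the
length arithmetic now depends only on data_len, not on an index into
the global buffer arrays.  The names below are hypothetical, and the
header sizes assume an IPv4 header without options; values are in
host byte order, whereas the patch stores them with htons().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_HDR_LEN_SKETCH	20	/* IPv4 header, no options */
#define UDP_HDR_LEN_SKETCH	8

struct udp_len_sketch {
	size_t ip_len;		/* would go into iph->tot_len */
	uint16_t udp_len;	/* would go into uh->len */
};

/* Compute both length fields from the payload size alone */
static struct udp_len_sketch udp4_lengths_sketch(size_t data_len)
{
	struct udp_len_sketch l;

	/* Both fields are 16 bits wide on the wire */
	assert(data_len <=
	       UINT16_MAX - IPV4_HDR_LEN_SKETCH - UDP_HDR_LEN_SKETCH);

	l.ip_len = data_len + IPV4_HDR_LEN_SKETCH + UDP_HDR_LEN_SKETCH;
	l.udp_len = (uint16_t)(data_len + UDP_HDR_LEN_SKETCH);

	return l;
}
```

With that, the caller is the one that knows which buffer it is
filling and can turn ip_len into a frame length as it sees fit.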

>  {
> -	struct udp4_l2_buf_t *b = &udp4_l2_buf[n];
> +	struct udphdr *uh = (struct udphdr *)(iph + 1);
>  	in_port_t src_port;
>  	size_t ip_len;
>  
> -	ip_len = udp4_l2_mh_sock[n].msg_len + sizeof(b->iph) + sizeof(b->uh);
> +	ip_len = data_len + sizeof(struct iphdr) + sizeof(struct udphdr);
>  
> -	b->iph.tot_len = htons(ip_len);
> -	b->iph.daddr = c->ip4.addr_seen.s_addr;
> +	iph->tot_len = htons(ip_len);
> +	iph->daddr = c->ip4.addr_seen.s_addr;
>  
> -	src_port = ntohs(b->s_in.sin_port);
> +	src_port = ntohs(s_in->sin_port);
>  
>  	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) &&
> -	    IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.dns_host) &&
> +	    IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.dns_host) &&
>  	    src_port == 53) {
> -		b->iph.saddr = c->ip4.dns_match.s_addr;
> -	} else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) ||
> -		   IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)||
> -		   IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) {
> -		b->iph.saddr = c->ip4.gw.s_addr;
> +		iph->saddr = c->ip4.dns_match.s_addr;
> +	} else if (IN4_IS_ADDR_LOOPBACK(&s_in->sin_addr) ||
> +		   IN4_IS_ADDR_UNSPECIFIED(&s_in->sin_addr)||
> +		   IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.addr_seen)) {
> +		iph->saddr = c->ip4.gw.s_addr;
>  		udp_tap_map[V4][src_port].ts = now->tv_sec;
>  		udp_tap_map[V4][src_port].flags |= PORT_LOCAL;
>  
> -		if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen))
> +		if (IN4_ARE_ADDR_EQUAL(&s_in->sin_addr.s_addr, &c->ip4.addr_seen))
>  			udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK;
>  		else
>  			udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK;
>  
>  		bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port);
>  	} else {
> -		b->iph.saddr = b->s_in.sin_addr.s_addr;
> +		iph->saddr = s_in->sin_addr.s_addr;
>  	}
>  
> -	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
> -	b->uh.source = b->s_in.sin_port;
> -	b->uh.dest = htons(dstport);
> -	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));
> +	iph->check = ipv4_hdr_checksum(iph, IPPROTO_UDP);
> +	uh->source = s_in->sin_port;
> +	uh->dest = htons(dstport);
> +	uh->len= htons(data_len + sizeof(struct udphdr));
>  
> -	return tap_iov_len(c, &b->taph, ip_len);
> +	return ip_len;
>  }
>  
>  /**
> @@ -614,38 +615,39 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
> -			      const struct timespec *now)
> +static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> +			      size_t data_len, struct sockaddr_in6 *s_in6,
> +			      in_port_t dstport, const struct timespec *now)
>  {
> -	struct udp6_l2_buf_t *b = &udp6_l2_buf[n];
> +	struct udphdr *uh = (struct udphdr *)(ip6h + 1);
>  	struct in6_addr *src;
>  	in_port_t src_port;
>  	size_t ip_len;
>  
> -	src = &b->s_in6.sin6_addr;
> -	src_port = ntohs(b->s_in6.sin6_port);
> +	src = &s_in6->sin6_addr;
> +	src_port = ntohs(s_in6->sin6_port);
>  
> -	ip_len = udp6_l2_mh_sock[n].msg_len + sizeof(b->ip6h) + sizeof(b->uh);
> +	ip_len = data_len + sizeof(struct ipv6hdr) + sizeof(struct udphdr);
>  
> -	b->ip6h.payload_len = htons(udp6_l2_mh_sock[n].msg_len + sizeof(b->uh));
> +	ip6h->payload_len = htons(data_len + sizeof(struct udphdr));
>  
>  	if (IN6_IS_ADDR_LINKLOCAL(src)) {
> -		b->ip6h.daddr = c->ip6.addr_ll_seen;
> -		b->ip6h.saddr = b->s_in6.sin6_addr;
> +		ip6h->daddr = c->ip6.addr_ll_seen;
> +		ip6h->saddr = s_in6->sin6_addr;
>  	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) &&
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) &&
>  		   src_port == 53) {
> -		b->ip6h.daddr = c->ip6.addr_seen;
> -		b->ip6h.saddr = c->ip6.dns_match;
> +		ip6h->daddr = c->ip6.addr_seen;
> +		ip6h->saddr = c->ip6.dns_match;
>  	} else if (IN6_IS_ADDR_LOOPBACK(src)			||
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen)	||
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) {
> -		b->ip6h.daddr = c->ip6.addr_ll_seen;
> +		ip6h->daddr = c->ip6.addr_ll_seen;
>  
>  		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> -			b->ip6h.saddr = c->ip6.gw;
> +			ip6h->saddr = c->ip6.gw;
>  		else
> -			b->ip6h.saddr = c->ip6.addr_ll;
> +			ip6h->saddr = c->ip6.addr_ll;
>  
>  		udp_tap_map[V6][src_port].ts = now->tv_sec;
>  		udp_tap_map[V6][src_port].flags |= PORT_LOCAL;
> @@ -662,20 +664,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>  
>  		bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port);
>  	} else {
> -		b->ip6h.daddr = c->ip6.addr_seen;
> -		b->ip6h.saddr = b->s_in6.sin6_addr;
> +		ip6h->daddr = c->ip6.addr_seen;
> +		ip6h->saddr = s_in6->sin6_addr;
>  	}
>  
> -	b->uh.source = b->s_in6.sin6_port;
> -	b->uh.dest = htons(dstport);
> -	b->uh.len = b->ip6h.payload_len;
> -	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
> -			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
> -	b->ip6h.version = 6;
> -	b->ip6h.nexthdr = IPPROTO_UDP;
> -	b->ip6h.hop_limit = 255;
> +	uh->source = s_in6->sin6_port;
> +	uh->dest = htons(dstport);
> +	uh->len = ip6h->payload_len;
> +	uh->check = csum(uh, ntohs(ip6h->payload_len),
> +			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> +	ip6h->version = 6;
> +	ip6h->nexthdr = IPPROTO_UDP;
> +	ip6h->hop_limit = 255;
>  
> -	return tap_iov_len(c, &b->taph, ip_len);
> +	return ip_len;
>  }
>  
>  /**
> @@ -689,6 +691,11 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>   *
>   * Return: size of tap frame with headers
>   */
> +#pragma GCC diagnostic push
> +/* ignore unaligned pointer value warning for &udp6_l2_buf[i].ip6h and 
> + * &udp4_l2_buf[i].iph

I feel like this needs more explanation: why is it unaligned?  why
can't we make it aligned?  why is it safe to ignore the warning?
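
For illustration only, here is a minimal sketch of what -Waddress-of-packed-member guards against. The struct below is hypothetical and is not the layout of udp4_l2_buf_t or udp6_l2_buf_t; it only shows that a member of a packed struct can sit at an offset that breaks its natural alignment, and that copying the bytes out avoids forming a misaligned pointer. It says nothing about whether the real buffers are aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a packed frame buffer (not passt's real layout) */
struct demo_buf {
	uint8_t  pad;	/* pushes the next member to an odd offset */
	uint32_t field;	/* would be 4-byte aligned without "packed" */
} __attribute__((packed));

/* With "packed" there is no padding: 1 + 4 bytes */
static_assert(sizeof(struct demo_buf) == 5, "packed struct has no padding");

/* Offset of @field inside the packed struct */
static size_t demo_field_offset(void)
{
	return offsetof(struct demo_buf, field);
}

/* Read @field by copying bytes instead of taking &b->field as a uint32_t *,
 * which is the pattern the warning is about
 */
static uint32_t demo_read_field(const struct demo_buf *b)
{
	uint32_t v;

	memcpy(&v, (const char *)b + offsetof(struct demo_buf, field),
	       sizeof(v));
	return v;
}
```

Whether the pragma is safe in the patch depends on the actual offsets of the iph and ip6h members, which is the explanation being asked for.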

> + */
> +#pragma GCC diagnostic ignored "-Waddress-of-packed-member"
>  static void udp_tap_send(const struct ctx *c,
>  			 unsigned int start, unsigned int n,
>  			 in_port_t dstport, bool v6, const struct timespec *now)
> @@ -702,18 +709,31 @@ static void udp_tap_send(const struct ctx *c,
>  		tap_iov = udp4_l2_iov_tap;
>  
>  	for (i = start; i < start + n; i++) {
> -		size_t buf_len;
> -
> -		if (v6)
> -			buf_len = udp_update_hdr6(c, i, dstport, now);
> -		else
> -			buf_len = udp_update_hdr4(c, i, dstport, now);
> -
> -		tap_iov[i].iov_len = buf_len;
> +		size_t ip_len;
> +
> +		if (v6) {
> +			ip_len = udp_update_hdr6(c, &udp6_l2_buf[i].ip6h,
> +						 udp6_l2_mh_sock[i].msg_len,
> +						 &udp6_l2_buf[i].s_in6, dstport,
> +						 now);
> +			tap_iov[i].iov_len = tap_iov_len(c,
> +							 &udp6_l2_buf[i].taph,
> +							 ip_len);
> +		} else {
> +			ip_len = udp_update_hdr4(c, &udp4_l2_buf[i].iph,
> +						 udp4_l2_mh_sock[i].msg_len,
> +						 &udp4_l2_buf[i].s_in,
> +						 dstport, now);
> +
> +			tap_iov[i].iov_len = tap_iov_len(c,
> +							 &udp4_l2_buf[i].taph,
> +							 ip_len);
> +		}
>  	}
>  
>  	tap_send_frames(c, tap_iov + start, n);
>  }
> +#pragma GCC diagnostic pop
>  
>  /**
>   * udp_sock_handler() - Handle new data from socket

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler()
  2024-02-02 14:11 ` [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler() Laurent Vivier
@ 2024-02-06  2:14   ` David Gibson
  2024-02-11 23:17     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-06  2:14 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:42PM +0100, Laurent Vivier wrote:
> We are going to introduce a variant of the function to use
> vhost-user buffers rather than passt internal buffers.

Not entirely sure the new name really conveys that distinction.

> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  passt.c | 2 +-
>  udp.c   | 6 +++---
>  udp.h   | 4 ++--
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index 10042a9b9789..c70caf464e61 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -389,7 +389,7 @@ loop:
>  			tcp_timer_handler(&c, ref);
>  			break;
>  		case EPOLL_TYPE_UDP:
> -			udp_sock_handler(&c, ref, eventmask, &now);
> +			udp_buf_sock_handler(&c, ref, eventmask, &now);
>  			break;
>  		case EPOLL_TYPE_ICMP:
>  			icmp_sock_handler(&c, AF_INET, ref);
> diff --git a/udp.c b/udp.c
> index 77168fb0a2af..9c56168c6340 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -736,7 +736,7 @@ static void udp_tap_send(const struct ctx *c,
>  #pragma GCC diagnostic pop
>  
>  /**
> - * udp_sock_handler() - Handle new data from socket
> + * udp_buf_sock_handler() - Handle new data from socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -744,8 +744,8 @@ static void udp_tap_send(const struct ctx *c,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> -		      const struct timespec *now)
> +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> +			  const struct timespec *now)
>  {
>  	/* For not entirely clear reasons (data locality?) pasta gets
>  	 * better throughput if we receive tap datagrams one at a
> diff --git a/udp.h b/udp.h
> index 087e4820f93c..6c8519e87f1a 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -9,8 +9,8 @@
>  #define UDP_TIMER_INTERVAL		1000 /* ms */
>  
>  void udp_portmap_clear(void);
> -void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> -		      const struct timespec *now);
> +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> +			  const struct timespec *now);
>  int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  		    const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now);


* Re: [PATCH 16/24] packet: replace struct desc by struct iovec
  2024-02-02 14:11 ` [PATCH 16/24] packet: replace struct desc by struct iovec Laurent Vivier
@ 2024-02-06  2:25   ` David Gibson
  2024-02-11 23:18     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-06  2:25 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:43PM +0100, Laurent Vivier wrote:

Rationale please.  It's probably also worth noting that this
replaces struct desc with a larger structure - struct desc is already
padded out to 8 bytes, but on 64-bit machines struct iovec will be
larger still.
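
As a rough check of the size difference, a minimal sketch assuming a typical ABI where uint32_t is 4-byte aligned. struct desc is copied from the definition removed further down in this patch; the two helper functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Copy of the descriptor type this patch removes from packet.h */
struct desc {
	uint32_t offset;
	uint16_t len;
};

/* 4 + 2 bytes of members, padded to the 4-byte alignment of @offset */
static_assert(sizeof(struct desc) == 8, "struct desc is padded to 8 bytes");

/* Size of one pool entry before the patch */
static size_t desc_size(void)
{
	return sizeof(struct desc);
}

/* Size of one pool entry after the patch: a pointer plus a size_t */
static size_t iovec_size(void)
{
	return sizeof(struct iovec);
}
```

On an LP64 machine the second value is 16, so each entry of pkt[] doubles in size.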

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  packet.c | 75 +++++++++++++++++++++++++++++++-------------------------
>  packet.h | 14 ++---------
>  2 files changed, 43 insertions(+), 46 deletions(-)
> 
> diff --git a/packet.c b/packet.c
> index ccfc84607709..af2a539a1794 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -22,6 +22,36 @@
>  #include "util.h"
>  #include "log.h"
>  
> +static int packet_check_range(const struct pool *p, size_t offset, size_t len,
> +			      const char *start, const char *func, int line)
> +{
> +	if (start < p->buf) {
> +		if (func) {
> +			trace("add packet start %p before buffer start %p, "
> +			      "%s:%i", (void *)start, (void *)p->buf, func, line);
> +		}
> +		return -1;

I guess not really in scope for this patch, but IIUC the only place
we'd hit this is if the caller has done something badly wrong, so
possibly it should be an ASSERT().

> +	}
> +
> +	if (start + len + offset > p->buf + p->buf_size) {
> +		if (func) {
> +			trace("packet offset plus length %lu from size %lu, "
> +			      "%s:%i", start - p->buf + len + offset,
> +			      p->buf_size, func, line);
> +		}
> +		return -1;
> +	}

Same with this one.

> +
> +#if UINTPTR_MAX == UINT64_MAX
> +	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
> +		trace("add packet start %p, buffer start %p, %s:%i",
> +		      (void *)start, (void *)p->buf, func, line);
> +		return -1;
> +	}
> +#endif

This one is relevant to this patch though - because you're replacing
struct desc's 32-bit offset with the full void * iov_base of struct
iovec, this check is no longer needed.

> +
> +	return 0;
> +}
>  /**
>   * packet_add_do() - Add data as packet descriptor to given pool
>   * @p:		Existing pool
> @@ -41,34 +71,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
>  		return;
>  	}
>  
> -	if (start < p->buf) {
> -		trace("add packet start %p before buffer start %p, %s:%i",
> -		      (void *)start, (void *)p->buf, func, line);
> +	if (packet_check_range(p, 0, len, start, func, line))
>  		return;
> -	}
> -
> -	if (start + len > p->buf + p->buf_size) {
> -		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
> -		      (void *)start, len, (void *)(p->buf + p->buf_size),
> -		      func, line);
> -		return;
> -	}
>  
>  	if (len > UINT16_MAX) {
>  		trace("add packet length %zu, %s:%i", len, func, line);
>  		return;
>  	}
>  
> -#if UINTPTR_MAX == UINT64_MAX
> -	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
> -		trace("add packet start %p, buffer start %p, %s:%i",
> -		      (void *)start, (void *)p->buf, func, line);
> -		return;
> -	}
> -#endif
> -
> -	p->pkt[idx].offset = start - p->buf;
> -	p->pkt[idx].len = len;
> +	p->pkt[idx].iov_base = (void *)start;
> +	p->pkt[idx].iov_len = len;
>  
>  	p->count++;
>  }
> @@ -104,28 +116,23 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
>  		return NULL;
>  	}
>  
> -	if (p->pkt[idx].offset + len + offset > p->buf_size) {
> +	if (len + offset > p->pkt[idx].iov_len) {
>  		if (func) {
> -			trace("packet offset plus length %zu from size %zu, "
> -			      "%s:%i", p->pkt[idx].offset + len + offset,
> -			      p->buf_size, func, line);
> +			trace("data length %zu, offset %zu from length %zu, "
> +			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> +			      func, line);
>  		}
>  		return NULL;
>  	}
>  
> -	if (len + offset > p->pkt[idx].len) {
> -		if (func) {
> -			trace("data length %zu, offset %zu from length %u, "
> -			      "%s:%i", len, offset, p->pkt[idx].len,
> -			      func, line);
> -		}
> +	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
> +			       func, line))
>  		return NULL;
> -	}
>  
>  	if (left)
> -		*left = p->pkt[idx].len - offset - len;
> +		*left = p->pkt[idx].iov_len - offset - len;
>  
> -	return p->buf + p->pkt[idx].offset + offset;
> +	return (char *)p->pkt[idx].iov_base + offset;
>  }
>  
>  /**
> diff --git a/packet.h b/packet.h
> index a784b07bbed5..8377dcf678bb 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -6,16 +6,6 @@
>  #ifndef PACKET_H
>  #define PACKET_H
>  
> -/**
> - * struct desc - Generic offset-based descriptor within buffer
> - * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
> - * @len:	Length of descriptor, host order, 16-bit limit
> - */
> -struct desc {
> -	uint32_t offset;
> -	uint16_t len;
> -};
> -
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
>   * @buf:	Buffer storing packet descriptors
> @@ -29,7 +19,7 @@ struct pool {
>  	size_t buf_size;
>  	size_t size;
>  	size_t count;
> -	struct desc pkt[1];
> +	struct iovec pkt[1];
>  };
>  
>  void packet_add_do(struct pool *p, size_t len, const char *start,
> @@ -54,7 +44,7 @@ struct _name ## _t {							\
>  	size_t buf_size;						\
>  	size_t size;							\
>  	size_t count;							\
> -	struct desc pkt[_size];						\
> +	struct iovec pkt[_size];					\
>  }
>  
>  #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\


* Re: [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST
  2024-02-02 14:11 ` [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST Laurent Vivier
@ 2024-02-06  2:29   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-06  2:29 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:44PM +0100, Laurent Vivier wrote:
> As we are going to introduce the MODE_VU that will act like
> the mode MODE_PASST, compare to MODE_PASTA rather than to add
> a comparison to MODE_VU when we check for MODE_PASST.

I have vague plans to make this more sensible with many of the mode
checks replaced with checks on the pif type, but this looks fine for
now.
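
To make the motivation concrete, a minimal sketch with a hypothetical mode enum. The enumerator order and the helper names are assumptions for this illustration, and MODE_VU is only the mode the commit message says will be added later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode list, with the planned vhost-user mode appended */
enum demo_mode { MODE_PASST, MODE_PASTA, MODE_VU };

static_assert(MODE_VU != MODE_PASST, "the new mode is a distinct value");

/* Old style of check: only true for MODE_PASST itself */
static bool passt_like_old(enum demo_mode m)
{
	return m == MODE_PASST;
}

/* New style of check: true for everything that is not pasta */
static bool passt_like_new(enum demo_mode m)
{
	return m != MODE_PASTA;
}
```

With the old form, every call site would need an extra comparison against MODE_VU; with the new form, MODE_VU takes the passt path without further changes.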
 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  conf.c      | 12 ++++++------
>  isolation.c | 10 +++++-----
>  passt.c     |  2 +-
>  tap.c       | 12 ++++++------
>  tcp_buf.c   |  2 +-
>  udp.c       |  2 +-
>  6 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 93bfda331349..b6a2a1f0fdc3 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -147,7 +147,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
>  		if (fwd->mode)
>  			goto mode_conflict;
>  
> -		if (c->mode != MODE_PASST)
> +		if (c->mode == MODE_PASTA)
>  			die("'all' port forwarding is only allowed for passt");
>  
>  		fwd->mode = FWD_ALL;
> @@ -1240,7 +1240,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  			c->no_dhcp_dns = 0;
>  			break;
>  		case 6:
> -			if (c->mode != MODE_PASST)
> +			if (c->mode == MODE_PASTA)
>  				die("--no-dhcp-dns is for passt mode only");
>  
>  			c->no_dhcp_dns = 1;
> @@ -1252,7 +1252,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  			c->no_dhcp_dns_search = 0;
>  			break;
>  		case 8:
> -			if (c->mode != MODE_PASST)
> +			if (c->mode == MODE_PASTA)
>  				die("--no-dhcp-search is for passt mode only");
>  
>  			c->no_dhcp_dns_search = 1;
> @@ -1307,7 +1307,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  			break;
>  		case 14:
>  			fprintf(stdout,
> -				c->mode == MODE_PASST ? "passt " : "pasta ");
> +				c->mode == MODE_PASTA ? "pasta " : "passt ");
>  			fprintf(stdout, VERSION_BLOB);
>  			exit(EXIT_SUCCESS);
>  		case 15:
> @@ -1610,7 +1610,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  			v6_only = true;
>  			break;
>  		case '1':
> -			if (c->mode != MODE_PASST)
> +			if (c->mode == MODE_PASTA)
>  				die("--one-off is for passt mode only");
>  
>  			if (c->one_off)
> @@ -1657,7 +1657,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  	conf_ugid(runas, &uid, &gid);
>  
>  	if (logfile) {
> -		logfile_init(c->mode == MODE_PASST ? "passt" : "pasta",
> +		logfile_init(c->mode == MODE_PASTA ? "pasta" : "passt",
>  			     logfile, logsize);
>  	}
>  
> diff --git a/isolation.c b/isolation.c
> index f394e93b8526..ca2c68b52ec7 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -312,7 +312,7 @@ int isolate_prefork(const struct ctx *c)
>  	 * PID namespace. For passt, use CLONE_NEWPID anyway, in case somebody
>  	 * ever gets around seccomp profiles -- there's no harm in passing it.
>  	 */
> -	if (!c->foreground || c->mode == MODE_PASST)
> +	if (!c->foreground || c->mode != MODE_PASTA)
>  		flags |= CLONE_NEWPID;
>  
>  	if (unshare(flags)) {
> @@ -379,12 +379,12 @@ void isolate_postfork(const struct ctx *c)
>  
>  	prctl(PR_SET_DUMPABLE, 0);
>  
> -	if (c->mode == MODE_PASST) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
> -		prog.filter = filter_passt;
> -	} else {
> +	if (c->mode == MODE_PASTA) {
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
>  		prog.filter = filter_pasta;
> +	} else {
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
> +		prog.filter = filter_passt;
>  	}
>  
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/passt.c b/passt.c
> index c70caf464e61..5056a49dec95 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -360,7 +360,7 @@ loop:
>  		uint32_t eventmask = events[i].events;
>  
>  		trace("%s: epoll event on %s %i (events: 0x%08x)",
> -		      c.mode == MODE_PASST ? "passt" : "pasta",
> +		      c.mode == MODE_PASTA ? "pasta" : "passt",
>  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
>  
>  		switch (ref.type) {
> diff --git a/tap.c b/tap.c
> index 5b1b61550c13..ebe52247ad87 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -428,10 +428,10 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
>  	if (!n)
>  		return 0;
>  
> -	if (c->mode == MODE_PASST)
> -		m = tap_send_frames_passt(c, iov, n);
> -	else
> +	if (c->mode == MODE_PASTA)
>  		m = tap_send_frames_pasta(c, iov, n);
> +	else
> +		m = tap_send_frames_passt(c, iov, n);
>  
>  	if (m < n)
>  		debug("tap: failed to send %zu frames of %zu", n - m, n);
> @@ -1299,10 +1299,10 @@ void tap_sock_init(struct ctx *c)
>  		return;
>  	}
>  
> -	if (c->mode == MODE_PASST) {
> +	if (c->mode == MODE_PASTA) {
> +		tap_sock_tun_init(c);
> +	} else {
>  		if (c->fd_tap_listen == -1)
>  			tap_sock_unix_init(c);
> -	} else {
> -		tap_sock_tun_init(c);
>  	}
>  }
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 4c1f00c1d1b2..dff6802c5734 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -34,7 +34,7 @@
>  
>  #define TCP_FRAMES_MEM			128
>  #define TCP_FRAMES							\
> -	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
> +	(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
>  
>  struct tcp4_l2_head {	/* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
>  #ifdef __AVX2__
> diff --git a/udp.c b/udp.c
> index 9c56168c6340..a189c2e0b5a2 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -755,7 +755,7 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve
>  	 * whether we'll use tap or splice, always go one at a time
>  	 * for pasta mode.
>  	 */
> -	ssize_t n = (c->mode == MODE_PASST ? UDP_MAX_FRAMES : 1);
> +	ssize_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES);
>  	in_port_t dstport = ref.udp.port;
>  	bool v6 = ref.udp.v6;
>  	struct mmsghdr *mmh_recv;


* Re: [PATCH 18/24] vhost-user: introduce virtio API
  2024-02-02 14:11 ` [PATCH 18/24] vhost-user: introduce virtio API Laurent Vivier
@ 2024-02-06  3:51   ` David Gibson
  2024-02-11 23:18     ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-06  3:51 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:45PM +0100, Laurent Vivier wrote:
> Add virtio.c and virtio.h that define the functions needed
> to manage virtqueues.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

When importing a batch of code from outside, I think we need to choose
between one of two extremes:

  1) Treat this as a "vendored" dependency.  Keep the imported code
     byte-for-byte identical to the original source, and possibly have
     some integration glue in different files

  2) Fully assimilate: treat this as our own code, inspired by the
     original source.  Rewrite as much as we need to match our own
     conventions.

Currently, this is somewhere in between: we have some changes for the
passt tree (e.g. tab indents), but other things retain qemu style
(e.g. CamelCase, typedefs, and braces around single line clauses).

> ---
>  Makefile |   4 +-
>  util.h   |  11 ++
>  virtio.c | 484 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  virtio.h | 121 ++++++++++++++
>  4 files changed, 618 insertions(+), 2 deletions(-)
>  create mode 100644 virtio.c
>  create mode 100644 virtio.h
> 
> diff --git a/Makefile b/Makefile
> index bf370b6ec2e6..ae1daa6b2b50 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
> -	tcp_buf.c udp.c util.c iov.c ip.c
> +	tcp_buf.c udp.c util.c iov.c ip.c virtio.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
>  	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
> -	util.h iov.h ip.h
> +	util.h iov.h ip.h virtio.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/util.h b/util.h
> index f7c3dfee9972..a80024e3b797 100644
> --- a/util.h
> +++ b/util.h
> @@ -43,6 +43,9 @@
>  #define ROUND_DOWN(x, y)	((x) & ~((y) - 1))
>  #define ROUND_UP(x, y)		(((x) + (y) - 1) & ~((y) - 1))
>  
> +#define ALIGN_DOWN(n, m)	((n) / (m) * (m))
> +#define ALIGN_UP(n, m)		ALIGN_DOWN((n) + (m) - 1, (m))

It would be nice to move these earlier in the series and use them for
patch 3.
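
For reference, a small sketch of how the new macros differ from the existing ROUND_*() ones. The macro bodies are copied from the hunk above; the wrapper functions are hypothetical names for this illustration. The mask-based ROUND_UP() only works when the second argument is a power of two, while the division-based ALIGN_UP() works for any non-zero alignment.

```c
#include <assert.h>

/* Existing util.h macros: mask-based, power-of-two alignment only */
#define ROUND_DOWN(x, y)	((x) & ~((y) - 1))
#define ROUND_UP(x, y)		(((x) + (y) - 1) & ~((y) - 1))

/* Macros added by this patch: division-based, any non-zero alignment */
#define ALIGN_DOWN(n, m)	((n) / (m) * (m))
#define ALIGN_UP(n, m)		ALIGN_DOWN((n) + (m) - 1, (m))

/* Both forms agree when the alignment is a power of two */
static_assert(ROUND_UP(10, 8) == 16 && ALIGN_UP(10, 8) == 16,
	      "power-of-two alignment");

/* Round @x up using the mask-based macro */
static unsigned round_up_mask(unsigned x, unsigned y)
{
	return ROUND_UP(x, y);
}

/* Round @n up using the division-based macro */
static unsigned align_up_div(unsigned n, unsigned m)
{
	return ALIGN_UP(n, m);
}
```

For an alignment of 6, align_up_div(10, 6) gives 12, whereas round_up_mask(10, 6) gives 10, which is not a multiple of 6.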

>  #define MAX_FROM_BITS(n)	(((1U << (n)) - 1))
>  
>  #define BIT(n)			(1UL << (n))
> @@ -110,6 +113,14 @@
>  #define	htonl_constant(x)	(__bswap_constant_32(x))
>  #endif
>  
> +#define  barrier()		do { __asm__ __volatile__("" ::: "memory"); } while (0)
> +#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
> +#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
> +#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
> +
> +#define smp_wmb()	smp_mb_release()
> +#define smp_rmb()	smp_mb_acquire()
> +
>  #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
>  int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  	     void *arg);
> diff --git a/virtio.c b/virtio.c
> new file mode 100644
> index 000000000000..1edd4155eec2
> --- /dev/null
> +++ b/virtio.c
> @@ -0,0 +1,484 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c */
> +
> +#include <stddef.h>
> +#include <endian.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/eventfd.h>
> +#include <sys/socket.h>
> +
> +#include "util.h"
> +#include "virtio.h"
> +
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +/* Translate guest physical address to our virtual address.  */
> +static void *vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr)
> +{
> +	unsigned int i;
> +
> +	if (*plen == 0) {
> +		return NULL;
> +	}
> +
> +	/* Find matching memory region.  */
> +	for (i = 0; i < dev->nregions; i++) {
> +		VuDevRegion *r = &dev->regions[i];
> +
> +		if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) {
> +			if ((guest_addr + *plen) > (r->gpa + r->size)) {
> +				*plen = r->gpa + r->size - guest_addr;
> +			}
> +			return (void *)(guest_addr - (uintptr_t)r->gpa +
> +					(uintptr_t)r->mmap_addr + r->mmap_offset);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline uint16_t vring_avail_flags(VuVirtq *vq)
> +{
> +	return le16toh(vq->vring.avail->flags);
> +}
> +
> +static inline uint16_t vring_avail_idx(VuVirtq *vq)
> +{
> +	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
> +
> +	return vq->shadow_avail_idx;
> +}
> +
> +static inline uint16_t vring_avail_ring(VuVirtq *vq, int i)
> +{
> +	return le16toh(vq->vring.avail->ring[i]);
> +}
> +
> +static inline uint16_t vring_get_used_event(VuVirtq *vq)
> +{
> +	return vring_avail_ring(vq, vq->vring.num);
> +}
> +
> +static bool virtqueue_get_head(VuDev *dev, VuVirtq *vq,
> +		   unsigned int idx, unsigned int *head)
> +{
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen. */
> +	*head = vring_avail_ring(vq, idx % vq->vring.num);
> +
> +	/* If their number is silly, that's a fatal mistake. */
> +	if (*head >= vq->vring.num) {
> +		vu_panic(dev, "Guest says index %u is available", *head);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static int
> +virtqueue_read_indirect_desc(VuDev *dev, struct vring_desc *desc,
> +			     uint64_t addr, size_t len)
> +{
> +	struct vring_desc *ori_desc;
> +	uint64_t read_len;
> +
> +	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc))) {
> +		return -1;
> +	}
> +
> +	if (len == 0) {
> +		return -1;
> +	}
> +
> +	while (len) {
> +		read_len = len;
> +		ori_desc = vu_gpa_to_va(dev, &read_len, addr);
> +		if (!ori_desc) {
> +			return -1;
> +		}
> +
> +		memcpy(desc, ori_desc, read_len);
> +		len -= read_len;
> +		addr += read_len;
> +		desc += read_len;

Hrm... this is copied as is from qemu, but it looks wrong.  Why would
we be advancing the descriptor pointer by a number of descriptor
entries equal to the number of bytes in this chunk?
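
A minimal sketch of the pointer arithmetic in question, assuming the intent is to advance the destination by the number of bytes copied. struct demo_desc is a hypothetical stand-in with the same 16-byte shape as a split-ring descriptor, and copy_chunks() is one possible way to write the loop, not the fix chosen by QEMU or by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct vring_desc (16 bytes) */
struct demo_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

static_assert(sizeof(struct demo_desc) == 16, "descriptor is 16 bytes");

/* Bytes skipped by "d += n": n entries, i.e. n * sizeof(*d) bytes.
 * @d + @n must stay within, or one past the end of, the same array.
 */
static size_t advance_bytes(const struct demo_desc *d, size_t n)
{
	return (size_t)((const char *)(d + n) - (const char *)d);
}

/* Copy @len bytes to @dst in chunks of at most @chunk bytes, advancing
 * the destination by bytes rather than by descriptor entries
 */
static void copy_chunks(struct demo_desc *dst, const void *src, size_t len,
			size_t chunk)
{
	const char *in = src;
	char *out = (char *)dst;

	assert(chunk > 0);

	while (len) {
		size_t n = len < chunk ? len : chunk;

		memcpy(out, in, n);
		len -= n;
		in += n;
		out += n;	/* byte-wise, unlike "desc += read_len" */
	}
}
```

With a struct pointer, adding read_len moves the destination sixteen times further than the bytes actually copied, so any table that needs more than one chunk would be written to the wrong place.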

> +	}
> +
> +	return 0;
> +}
> +
> +enum {
> +	VIRTQUEUE_READ_DESC_ERROR = -1,
> +	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
> +	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
> +};
> +
> +static int
> +virtqueue_read_next_desc(VuDev *dev, struct vring_desc *desc,
> +			 int i, unsigned int max, unsigned int *next)
> +{
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT)) {
> +		return VIRTQUEUE_READ_DESC_DONE;
> +	}
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	*next = le16toh(desc[i].next);
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	smp_wmb();
> +
> +	if (*next >= max) {
> +		vu_panic(dev, "Desc next is %u", *next);
> +		return VIRTQUEUE_READ_DESC_ERROR;
> +	}
> +
> +	return VIRTQUEUE_READ_DESC_MORE;
> +}
> +
> +bool vu_queue_empty(VuDev *dev, VuVirtq *vq)
> +{
> +	if (dev->broken ||
> +		!vq->vring.avail) {
> +		return true;
> +	}
> +
> +	if (vq->shadow_avail_idx != vq->last_avail_idx) {
> +		return false;
> +	}
> +
> +	return vring_avail_idx(vq) == vq->last_avail_idx;
> +}
> +
> +static bool vring_notify(VuDev *dev, VuVirtq *vq)
> +{
> +	uint16_t old, new;
> +	bool v;
> +
> +	/* We need to expose used array entries before checking used event. */
> +	smp_mb();
> +
> +	/* Always notify when queue is empty (when feature acknowledge) */
> +	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +		!vq->inuse && vu_queue_empty(dev, vq)) {
> +		return true;
> +	}
> +
> +	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
> +		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
> +	}
> +
> +	v = vq->signalled_used_valid;
> +	vq->signalled_used_valid = true;
> +	old = vq->signalled_used;
> +	new = vq->signalled_used = vq->used_idx;
> +	return !v || vring_need_event(vring_get_used_event(vq), new, old);
> +}
> +
> +void vu_queue_notify(VuDev *dev, VuVirtq *vq)
> +{
> +	if (dev->broken || !vq->vring.avail) {
> +		return;
> +	}
> +
> +	if (!vring_notify(dev, vq)) {
> +		debug("skipped notify...");
> +		return;
> +	}
> +
> +	if (eventfd_write(vq->call_fd, 1) < 0) {
> +		vu_panic(dev, "Error writing eventfd: %s", strerror(errno));
> +	}
> +}
> +
> +static inline void vring_set_avail_event(VuVirtq *vq, uint16_t val)
> +{
> +	uint16_t val_le = htole16(val);
> +
> +	if (!vq->notification) {
> +		return;
> +	}
> +
> +	memcpy(&vq->vring.used->ring[vq->vring.num], &val_le, sizeof(uint16_t));
> +}
> +
> +static bool virtqueue_map_desc(VuDev *dev,
> +			       unsigned int *p_num_sg, struct iovec *iov,
> +			       unsigned int max_num_sg,
> +			       uint64_t pa, size_t sz)
> +{
> +	unsigned num_sg = *p_num_sg;
> +
> +	ASSERT(num_sg <= max_num_sg);
> +
> +	if (!sz) {
> +		vu_panic(dev, "virtio: zero sized buffers are not allowed");
> +		return false;
> +	}
> +
> +	while (sz) {
> +		uint64_t len = sz;
> +
> +		if (num_sg == max_num_sg) {
> +			vu_panic(dev, "virtio: too many descriptors in indirect table");
> +			return false;
> +		}
> +
> +		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
> +		if (iov[num_sg].iov_base == NULL) {
> +			vu_panic(dev, "virtio: invalid address for buffers");
> +			return false;
> +		}
> +		iov[num_sg].iov_len = len;
> +		num_sg++;
> +		sz -= len;
> +		pa += len;
> +	}
> +
> +	*p_num_sg = num_sg;
> +	return true;
> +}
> +
> +static void * virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num, unsigned char *buffer)
> +{
> +	VuVirtqElement *elem;
> +	size_t in_sg_ofs = ALIGN_UP(sz, __alignof__(elem->in_sg[0]));
> +	size_t out_sg_ofs = in_sg_ofs + in_num * sizeof(elem->in_sg[0]);
> +	size_t out_sg_end = out_sg_ofs + out_num * sizeof(elem->out_sg[0]);
> +
> +	if (out_sg_end > 65536)
> +		return NULL;
> +
> +	elem = (void *)buffer;
> +	elem->out_num = out_num;
> +	elem->in_num = in_num;
> +	elem->in_sg = (struct iovec *)((uintptr_t)elem + in_sg_ofs);
> +	elem->out_sg = (struct iovec *)((uintptr_t)elem + out_sg_ofs);
> +	return elem;
> +}
> +
> +static void *
> +vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz, unsigned char *buffer)
> +{
> +	struct vring_desc *desc = vq->vring.desc;
> +	uint64_t desc_addr, read_len;
> +	unsigned int desc_len;
> +	unsigned int max = vq->vring.num;
> +	unsigned int i = idx;
> +	VuVirtqElement *elem;
> +	unsigned int out_num = 0, in_num = 0;
> +	struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
> +	int rc;
> +
> +	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
> +		if (le32toh(desc[i].len) % sizeof(struct vring_desc)) {
> +			vu_panic(dev, "Invalid size for indirect buffer table");
> +			return NULL;
> +		}
> +
> +		/* loop over the indirect descriptor table */
> +		desc_addr = le64toh(desc[i].addr);
> +		desc_len = le32toh(desc[i].len);
> +		max = desc_len / sizeof(struct vring_desc);
> +		read_len = desc_len;
> +		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
> +		if (desc && read_len != desc_len) {
> +			/* Failed to use zero copy */
> +			desc = NULL;
> +			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len)) {
> +				desc = desc_buf;
> +			}
> +		}
> +		if (!desc) {
> +			vu_panic(dev, "Invalid indirect buffer table");
> +			return NULL;
> +		}
> +		i = 0;
> +	}
> +
> +	/* Collect all the descriptors */
> +	do {
> +		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
> +			if (!virtqueue_map_desc(dev, &in_num, iov + out_num,
> +						VIRTQUEUE_MAX_SIZE - out_num,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return NULL;
> +			}
> +		} else {
> +			if (in_num) {
> +				vu_panic(dev, "Incorrect order for descriptors");
> +				return NULL;
> +			}
> +			if (!virtqueue_map_desc(dev, &out_num, iov,
> +						VIRTQUEUE_MAX_SIZE,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return NULL;
> +			}
> +		}
> +
> +		/* If we've got too many, that implies a descriptor loop. */
> +		if ((in_num + out_num) > max) {
> +			vu_panic(dev, "Looped descriptor");
> +			return NULL;
> +		}
> +		rc = virtqueue_read_next_desc(dev, desc, i, max, &i);
> +	} while (rc == VIRTQUEUE_READ_DESC_MORE);
> +
> +	if (rc == VIRTQUEUE_READ_DESC_ERROR) {
> +		vu_panic(dev, "read descriptor error");
> +		return NULL;
> +	}
> +
> +	/* Now copy what we have collected and mapped */
> +	elem = virtqueue_alloc_element(sz, out_num, in_num, buffer);
> +	if (!elem) {
> +		return NULL;
> +	}
> +	elem->index = idx;
> +	for (i = 0; i < out_num; i++) {
> +		elem->out_sg[i] = iov[i];
> +	}
> +	for (i = 0; i < in_num; i++) {
> +		elem->in_sg[i] = iov[out_num + i];
> +	}
> +
> +	return elem;
> +}
> +
> +void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz, unsigned char *buffer)
> +{
> +	unsigned int head;
> +	VuVirtqElement *elem;
> +
> +	if (dev->broken || !vq->vring.avail) {
> +		return NULL;
> +	}
> +
> +	if (vu_queue_empty(dev, vq)) {
> +		return NULL;
> +	}
> +	/*
> +	 * Needed after virtio_queue_empty(), see comment in
> +	 * virtqueue_num_heads().
> +	 */
> +	smp_rmb();
> +
> +	if (vq->inuse >= vq->vring.num) {
> +		vu_panic(dev, "Virtqueue size exceeded");
> +		return NULL;
> +	}
> +
> +	if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
> +		return NULL;
> +	}
> +
> +	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
> +		vring_set_avail_event(vq, vq->last_avail_idx);
> +	}
> +
> +	elem = vu_queue_map_desc(dev, vq, head, sz, buffer);
> +
> +	if (!elem) {
> +		return NULL;
> +	}
> +
> +	vq->inuse++;
> +
> +	return elem;
> +}
> +
> +void vu_queue_detach_element(VuDev *dev, VuVirtq *vq,
> +			     unsigned int index, size_t len)
> +{
> +	(void)dev;
> +	(void)index;
> +	(void)len;
> +
> +	vq->inuse--;
> +	/* unmap, when DMA support is added */
> +}
> +
> +void vu_queue_unpop(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len)
> +{
> +	vq->last_avail_idx--;
> +	vu_queue_detach_element(dev, vq, index, len);
> +}
> +
> +bool vu_queue_rewind(VuDev *dev, VuVirtq *vq, unsigned int num)
> +{
> +	(void)dev;
> +	if (num > vq->inuse) {
> +		return false;
> +	}
> +	vq->last_avail_idx -= num;
> +	vq->inuse -= num;
> +	return true;
> +}
> +
> +static inline void vring_used_write(VuVirtq *vq,
> +				    struct vring_used_elem *uelem, int i)
> +{
> +	struct vring_used *used = vq->vring.used;
> +
> +	used->ring[i] = *uelem;
> +}
> +
> +void vu_queue_fill_by_index(VuDev *dev, VuVirtq *vq, unsigned int index,
> +			  unsigned int len, unsigned int idx)
> +{
> +	struct vring_used_elem uelem;
> +
> +	if (dev->broken || !vq->vring.avail)
> +		return;
> +
> +	idx = (idx + vq->used_idx) % vq->vring.num;
> +
> +	uelem.id = htole32(index);
> +	uelem.len = htole32(len);
> +	vring_used_write(vq, &uelem, idx);
> +}
> +
> +void vu_queue_fill(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem,
> +		   unsigned int len, unsigned int idx)
> +{
> +	vu_queue_fill_by_index(dev, vq, elem->index, len, idx);
> +}
> +
> +static inline void vring_used_idx_set(VuVirtq *vq, uint16_t val)
> +{
> +	vq->vring.used->idx = htole16(val);
> +
> +	vq->used_idx = val;
> +}
> +
> +void vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count)
> +{
> +	uint16_t old, new;
> +
> +	if (dev->broken ||
> +		!vq->vring.avail) {
> +		return;
> +	}
> +
> +	/* Make sure buffer is written before we update index. */
> +	smp_wmb();
> +
> +	old = vq->used_idx;
> +	new = old + count;
> +	vring_used_idx_set(vq, new);
> +	vq->inuse -= count;
> +	if ((int16_t)(new - vq->signalled_used) < (uint16_t)(new - old)) {
> +		vq->signalled_used_valid = false;
> +	}
> +}
> +
> +void vu_queue_push(VuDev *dev, VuVirtq *vq,
> +		   VuVirtqElement *elem, unsigned int len)
> +{
> +	vu_queue_fill(dev, vq, elem, len, 0);
> +	vu_queue_flush(dev, vq, 1);
> +}
> +
> diff --git a/virtio.h b/virtio.h
> new file mode 100644
> index 000000000000..e334355b0f30
> --- /dev/null
> +++ b/virtio.h
> @@ -0,0 +1,121 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +//
> +/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.h */
> +
> +#ifndef VIRTIO_H
> +#define VIRTIO_H
> +
> +#include <stdbool.h>
> +#include <linux/vhost_types.h>
> +
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +#define vu_panic(vdev, ...)		\
> +	do {				\
> +		(vdev)->broken = true;	\
> +		err( __VA_ARGS__ );	\
> +	} while (0)
> +
> +typedef struct VuRing {
> +	unsigned int num;
> +	struct vring_desc *desc;
> +	struct vring_avail *avail;
> +	struct vring_used *used;
> +	uint64_t log_guest_addr;
> +	uint32_t flags;
> +} VuRing;
> +
> +typedef struct VuVirtq {
> +	VuRing vring;
> +
> +	/* Next head to pop */
> +	uint16_t last_avail_idx;
> +
> +	/* Last avail_idx read from VQ. */
> +	uint16_t shadow_avail_idx;
> +
> +	uint16_t used_idx;
> +
> +	/* Last used index value we have signalled on */
> +	uint16_t signalled_used;
> +
> +	/* Whether signalled_used is valid */
> +	bool signalled_used_valid;
> +
> +	bool notification;
> +
> +	unsigned int inuse;
> +
> +	int call_fd;
> +	int kick_fd;
> +	int err_fd;
> +	unsigned int enable;
> +	bool started;
> +
> +	/* Guest addresses of our ring */
> +	struct vhost_vring_addr vra;
> +} VuVirtq;
> +
> +typedef struct VuDevRegion {
> +	uint64_t gpa;
> +	uint64_t size;
> +	uint64_t qva;
> +	uint64_t mmap_offset;
> +	uint64_t mmap_addr;
> +} VuDevRegion;
> +
> +#define VHOST_USER_MAX_QUEUES 2
> +
> +/*
> + * Set a reasonable maximum number of ram slots, which will be supported by
> + * any architecture.
> + */
> +#define VHOST_USER_MAX_RAM_SLOTS 32
> +
> +typedef struct VuDev {
> +	uint32_t nregions;
> +	VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS];
> +	VuVirtq vq[VHOST_USER_MAX_QUEUES];
> +	uint64_t features;
> +	uint64_t protocol_features;
> +	bool broken;
> +	int hdrlen;
> +} VuDev;
> +
> +typedef struct VuVirtqElement {
> +	unsigned int index;
> +	unsigned int out_num;
> +	unsigned int in_num;
> +	struct iovec *in_sg;
> +	struct iovec *out_sg;
> +} VuVirtqElement;
> +
> +static inline bool has_feature(uint64_t features, unsigned int fbit)
> +{
> +	return !!(features & (1ULL << fbit));
> +}
> +
> +static inline bool vu_has_feature(VuDev *vdev, unsigned int fbit)
> +{
> +	return has_feature(vdev->features, fbit);
> +}
> +
> +static inline bool vu_has_protocol_feature(VuDev *vdev, unsigned int fbit)
> +{
> +	return has_feature(vdev->protocol_features, fbit);
> +}
> +
> +bool vu_queue_empty(VuDev *dev, VuVirtq *vq);
> +void vu_queue_notify(VuDev *dev, VuVirtq *vq);
> +void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz, unsigned char *buffer);
> +void vu_queue_detach_element(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len);
> +void vu_queue_unpop(VuDev *dev, VuVirtq *vq, unsigned int index, size_t len);
> +bool vu_queue_rewind(VuDev *dev, VuVirtq *vq, unsigned int num);
> +
> +void vu_queue_fill_by_index(VuDev *dev, VuVirtq *vq, unsigned int index,
> +			    unsigned int len, unsigned int idx);
> +void vu_queue_fill(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem, unsigned int len,
> +		   unsigned int idx);
> +void vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count);
> +void vu_queue_push(VuDev *dev, VuVirtq *vq, VuVirtqElement *elem, unsigned int len);
> +#endif /* VIRTIO_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-05  5:57   ` David Gibson
@ 2024-02-06 14:28     ` Laurent Vivier
  2024-02-07  1:01       ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-06 14:28 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 2/5/24 06:57, David Gibson wrote:
> On Fri, Feb 02, 2024 at 03:11:28PM +0100, Laurent Vivier wrote:
> ...
>> diff --git a/iov.c b/iov.c
>> new file mode 100644
>> index 000000000000..38a8e7566021
>> --- /dev/null
>> +++ b/iov.c
>>
>> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
> Not immediately seeing why you need the 'offset ||' part of the condition.

In fact the loop has two purposes:

1- scan the iovec to reach byte offset in the iov (so until offset is 0)

2- copy the bytes (until done == bytes)

It could be written like this:

for (i = 0; offset && i < iov_cnt && offset >= iov[i].iov_len ; i++)
         offset -= iov[i].iov_len;

for (done = 0; done < bytes && i < iov_cnt; i++) {
     size_t len = MIN(iov[i].iov_len - offset, bytes - done);
     memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
     done += len;
}

...

>> +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
>> +		  const struct iovec *iov, unsigned int iov_cnt,
>> +		  size_t offset, size_t bytes)
>> +{
>> +	size_t len;
>> +	unsigned int i, j;
>> +	for (i = 0, j = 0;
>> +		 i < iov_cnt && j < dst_iov_cnt && (offset || bytes); i++) {
>> +		if (offset >= iov[i].iov_len) {
>> +			offset -= iov[i].iov_len;
>> +			continue;
>> +		}
>> +		len = MIN(bytes, iov[i].iov_len - offset);
>> +
>> +		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
>> +		dst_iov[j].iov_len = len;
>> +		j++;
>> +		bytes -= len;
>> +		offset = 0;
>> +	}
>> +	return j;
>> +}
> Small concern about the interface to iov_copy().  If dst_iov_cnt <
> iov_cnt and the chunk of the input iovec you want doesn't fit in the
> destination it will silently truncate - you can't tell if this has
> happened from the return value.  If the assumption is that dst_iov_cnt = iov_cnt,
> then there's not really any need to pass it.

In fact this function will be removed by "vhost-user: use guest buffer directly in
vu_handle_tx()", as it is not needed once we use guest buffers directly. So I don't think
we need to improve this.

If I remember correctly I think it behaves like passt already does with socket backend: if 
it doesn't fit, it's dropped silently.
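
For reference, one possible way to make the truncation visible to the caller is sketched below. This is only an illustration under stated assumptions: iov_copy_checked() and its 'missing' parameter are hypothetical and not part of this series, and the loop stops as soon as no bytes remain, so unlike the posted iov_copy() it never emits a zero-length entry.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical variant of iov_copy(): same slicing of the source vector,
 * but *missing reports how many of the requested bytes could not be
 * described in dst_iov (0 means nothing was truncated).
 */
static unsigned int iov_copy_checked(struct iovec *dst_iov,
				     unsigned int dst_iov_cnt,
				     const struct iovec *iov,
				     unsigned int iov_cnt,
				     size_t offset, size_t bytes,
				     size_t *missing)
{
	unsigned int i, j;

	for (i = 0, j = 0; i < iov_cnt && j < dst_iov_cnt && bytes; i++) {
		size_t len;

		/* Skip whole segments that lie before the offset */
		if (offset >= iov[i].iov_len) {
			offset -= iov[i].iov_len;
			continue;
		}

		len = iov[i].iov_len - offset;
		if (len > bytes)
			len = bytes;

		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
		dst_iov[j].iov_len = len;
		j++;
		bytes -= len;
		offset = 0;
	}

	/* Whatever is left was either truncated or beyond the source */
	*missing = bytes;
	return j;
}
```

A caller that wants to keep the current "drop silently" behaviour can ignore the extra value; one that cares can check it for non-zero and log or drop explicitly.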

I've fixed all your other comments.

Thanks,
Laurent



* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
  2024-02-05  5:57   ` David Gibson
@ 2024-02-06 16:10   ` Stefano Brivio
  2024-02-07 14:02     ` Laurent Vivier
  1 sibling, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-06 16:10 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:28 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile |  4 +--
>  iov.c    | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  iov.h    | 46 +++++++++++++++++++++++++++++++++
>  3 files changed, 126 insertions(+), 2 deletions(-)
>  create mode 100644 iov.c
>  create mode 100644 iov.h
> 
> diff --git a/Makefile b/Makefile
> index af4fa87e7e13..c1138fb91d26 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
> -	util.c
> +	util.c iov.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -56,7 +56,7 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
> -	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h
> +	tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h iov.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/iov.c b/iov.c
> new file mode 100644
> index 000000000000..38a8e7566021
> --- /dev/null
> +++ b/iov.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts copied from QEMU util/iov.c */

As David mentioned, we actually include authorship notices in passt,
and if you need an example of something with compatible licenses but
sourced from another project, see checksum.c.

> +
> +#include <sys/socket.h>

For consistency: newline between system and local includes.

> +#include "util.h"
> +#include "iov.h"
> +
> +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,

If I understood the idea behind this function correctly, maybe
iov_fill_from_buf() would be more descriptive.

> +			 size_t offset, const void *buf, size_t bytes)
> +{
> +	size_t done;
> +	unsigned int i;

Customary newline between declarations and code.

> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
> +		if (offset < iov[i].iov_len) {
> +			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
> +			memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
> +			done += len;
> +			offset = 0;
> +		} else {
> +			offset -= iov[i].iov_len;
> +		}
> +	}

I think the version you mentioned later in this thread (quoting here) is
much clearer. Some observations on it:

> for (i = 0; offset && i < iov_cnt && offset >= iov[i].iov_len ; i++)
>          offset -= iov[i].iov_len;

Do you actually need to check for offset to be non-zero? To me it looks like
offset >= iov[i].iov_len would suffice, and look more familiar.

What happens if offset is bigger than the sum of all iov[i].iov_len? To me it
looks like we would just go ahead and copy to the wrong position. But I just
had a quick look at the usage of this function, maybe it's safe to assume
that it can't ever happen.
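
To make both points concrete, here is a minimal sketch of the two-loop variant. It is not the code from the series: the function name is hypothetical, 'copied' is used instead of 'done', and an offset past the end of the vector is assumed to mean "copy nothing". Note that offset has to be reset once the first segment has been copied, which the quoted version does not do.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical name, not part of the posted patch */
static size_t iov_fill_from_buf_sketch(const struct iovec *iov,
				       unsigned int iov_cnt, size_t offset,
				       const void *buf, size_t bytes)
{
	size_t copied = 0;
	unsigned int i;

	/* Skip whole segments before the offset; no separate test for a
	 * non-zero offset is needed.
	 */
	for (i = 0; i < iov_cnt && offset >= iov[i].iov_len; i++)
		offset -= iov[i].iov_len;

	/* If the offset was past the end, i == iov_cnt here and nothing
	 * is copied.
	 */
	for (; copied < bytes && i < iov_cnt; i++) {
		size_t len = iov[i].iov_len - offset;

		if (len > bytes - copied)
			len = bytes - copied;

		memcpy((char *)iov[i].iov_base + offset,
		       (const char *)buf + copied, len);
		copied += len;
		offset = 0;	/* only the first copied segment is offset */
	}

	return copied;
}
```

With this shape, an out-of-range offset returns 0 instead of writing to the wrong position, and a short vector returns the number of bytes that actually fit.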

> 
> for (done = 0; done < bytes && i < iov_cnt; i++) {

Rather than 'done', I would call this 'copied', so that we return the
bytes copied later, instead of the "done" ones. I think it's a bit clearer.

>      size_t len = MIN(iov[i].iov_len - offset, bytes - done);

Newline between declaration and code for consistency.

>      memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
>      done += len;
> }
> 
> +	return done;
> +}
> +
> +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
> +		       size_t offset, void *buf, size_t bytes)
> +{
> +	size_t done;
> +	unsigned int i;
> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {

Same here, I think this would be easier to understand if you split it
into two loops.

> +		if (offset < iov[i].iov_len) {
> +			size_t len = MIN(iov[i].iov_len - offset, bytes - done);
> +			memcpy((char *)buf + done, (char *)iov[i].iov_base + offset, len);
> +			done += len;
> +			offset = 0;
> +		} else {
> +			offset -= iov[i].iov_len;
> +		}
> +	}

Newline here would be appreciated for consistency.

> +	return done;
> +}
> +
> +size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt)
> +{
> +	size_t len;
> +	unsigned int i;
> +
> +	len = 0;

...then it could be initialised in the declaration.

> +	for (i = 0; i < iov_cnt; i++) {
> +		len += iov[i].iov_len;
> +	}
> +	return len;
> +}
> +
> +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
> +		  const struct iovec *iov, unsigned int iov_cnt,
> +		  size_t offset, size_t bytes)
> +{
> +	size_t len;
> +	unsigned int i, j;
> +	for (i = 0, j = 0;
> +		 i < iov_cnt && j < dst_iov_cnt && (offset || bytes); i++) {

Same here, if possible, split.

> +		if (offset >= iov[i].iov_len) {
> +			offset -= iov[i].iov_len;
> +			continue;
> +		}
> +		len = MIN(bytes, iov[i].iov_len - offset);
> +
> +		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
> +		dst_iov[j].iov_len = len;
> +		j++;
> +		bytes -= len;
> +		offset = 0;
> +	}
> +	return j;
> +}
> diff --git a/iov.h b/iov.h
> new file mode 100644
> index 000000000000..31fbf6d0e1cf
> --- /dev/null
> +++ b/iov.h
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts copied from QEMU include/qemu/iov.h */
> +
> +#ifndef IOVEC_H
> +#define IOVEC_H
> +
> +#include <unistd.h>
> +#include <string.h>
> +
> +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
> +			 size_t offset, const void *buf, size_t bytes);
> +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
> +		       size_t offset, void *buf, size_t bytes);
> +
> +static inline size_t iov_from_buf(const struct iovec *iov,
> +				  unsigned int iov_cnt, size_t offset,
> +				  const void *buf, size_t bytes)
> +{

Is there a particular reason to include these two in a header? The
compiler will inline as needed if they are in a source file.

> +	if (__builtin_constant_p(bytes) && iov_cnt &&
> +		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
> +		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
> +		return bytes;
> +	} else {

We generally avoid else-after-return in passt -- clang-tidy should also
warn about it (but I haven't tried yet).

> +		return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
> +	}
> +}
> +
> +static inline size_t iov_to_buf(const struct iovec *iov,
> +				const unsigned int iov_cnt, size_t offset,
> +				void *buf, size_t bytes)
> +{
> +	if (__builtin_constant_p(bytes) && iov_cnt &&
> +		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
> +		memcpy(buf, (char *)iov[0].iov_base + offset, bytes);
> +		return bytes;
> +	} else {
> +		return iov_to_buf_full(iov, iov_cnt, offset, buf, bytes);
> +	}
> +}
> +
> +size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt);
> +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
> +		  const struct iovec *iov, unsigned int iov_cnt,
> +		  size_t offset, size_t bytes);
> +#endif

-- 
Stefano



* Re: [PATCH 02/24] pcap: add pcap_iov()
  2024-02-02 14:11 ` [PATCH 02/24] pcap: add pcap_iov() Laurent Vivier
  2024-02-05  6:25   ` David Gibson
@ 2024-02-06 16:10   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-06 16:10 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:29 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  pcap.c | 32 ++++++++++++++++++++++++++++++++
>  pcap.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/pcap.c b/pcap.c
> index 501d52d4992b..b002bb01314c 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -31,6 +31,7 @@
>  #include "util.h"
>  #include "passt.h"
>  #include "log.h"
> +#include "iov.h"
>  
>  #define PCAP_VERSION_MINOR 4
>  
> @@ -130,6 +131,37 @@ void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset)
>  	}
>  }
>  
> +void pcap_iov(const struct iovec *iov, unsigned int n)
> +{
> +	struct timeval tv;
> +	struct pcap_pkthdr h;
> +	size_t len;
> +	unsigned int i;

From longest to shortest, rationale:

  https://hisham.hm/2018/06/16/when-listing-repeated-things-make-pyramids/

> +
> +	if (pcap_fd == -1)
> +		return;
> +
> +	gettimeofday(&tv, NULL);
> +
> +	len = iov_size(iov, n);
> +
> +	h.tv_sec = tv.tv_sec;
> +	h.tv_usec = tv.tv_usec;
> +	h.caplen = h.len = len;
> +
> +	if (write(pcap_fd, &h, sizeof(h)) < 0) {
> +		debug("Cannot write pcap header");
> +		return;
> +	}
> +
> +	for (i = 0; i < n; i++) {
> +		if (write(pcap_fd, iov[i].iov_base, iov[i].iov_len) < 0) {
> +			debug("Cannot log packet, iov %d length %lu\n",

%zu instead of %lu -- that guarantees the correct width for size_t on
different architectures (it's not always 'long'). And "iov %d" followed
by the length doesn't really tell that we're printing the index of the
vector, maybe something on the lines of:

			debug("Cannot log packet, length %zu, iovec index %i\n",

?
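
As a small standalone illustration of the format specifiers (a sketch only: format_iov_error() is a hypothetical helper, and it uses %u for the index because 'i' is declared unsigned int in the patch):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds the message suggested above into 'out' */
static int format_iov_error(char *out, size_t size,
			    unsigned int idx, size_t len)
{
	/* %zu matches size_t whatever its width, %u matches unsigned int */
	return snprintf(out, size,
			"Cannot log packet, length %zu, iovec index %u",
			len, idx);
}
```

Passing a size_t to %lu happens to work where 'long' and size_t have the same width, but it is not guaranteed, which is why %zu is the portable choice.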
> +			      i, iov[i].iov_len);
> +		}
> +	}
> +}
> +
>  /**
>   * pcap_init() - Initialise pcap file
>   * @c:		Execution context
> diff --git a/pcap.h b/pcap.h
> index da5a7e846b72..732a0ddf14cc 100644
> --- a/pcap.h
> +++ b/pcap.h
> @@ -8,6 +8,7 @@
>  
>  void pcap(const char *pkt, size_t len);
>  void pcap_multiple(const struct iovec *iov, unsigned int n, size_t offset);
> +void pcap_iov(const struct iovec *iov, unsigned int n);
>  void pcap_init(struct ctx *c);
>  
>  #endif /* PCAP_H */

-- 
Stefano



* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-06 14:28     ` Laurent Vivier
@ 2024-02-07  1:01       ` David Gibson
  2024-02-07 10:00         ` Laurent Vivier
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-07  1:01 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3064 bytes --]

On Tue, Feb 06, 2024 at 03:28:10PM +0100, Laurent Vivier wrote:
> On 2/5/24 06:57, David Gibson wrote:
> > On Fri, Feb 02, 2024 at 03:11:28PM +0100, Laurent Vivier wrote:
> > ...
> > > diff --git a/iov.c b/iov.c
> > > new file mode 100644
> > > index 000000000000..38a8e7566021
> > > --- /dev/null
> > > +++ b/iov.c
> > > 
> > > +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
> > Not immediately seeing why you need the 'offset ||' part of the condition.
> 
> In fact the loop has two purposes:
> 
> 1- scan the iovec to reach byte offset in the iov (so until offset is 0)
> 
> 2- copy the bytes (until done == bytes)
> 
> It could be written like this:
> 
> for (i = 0; offset && i < iov_cnt && offset >= iov[i].iov_len ; i++)
>         offset -= iov[i].iov_len;
> 
> for (done = 0; done < bytes && i < iov_cnt; i++) {
>     size_t len = MIN(iov[i].iov_len - offset, bytes - done);
>     memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
>     done += len;
> }

Right, but done starts at 0 and will remain zero until you reach the
first segment where you need to copy something.  So, unless bytes ==
0, then done < bytes will always be true when offset != 0.  And if
bytes *is* zero, then there's nothing to do, so there's no need to
step through the vector at all.

> ...
> 
> > > +unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
> > > +		  const struct iovec *iov, unsigned int iov_cnt,
> > > +		  size_t offset, size_t bytes)
> > > +{
> > > +	size_t len;
> > > +	unsigned int i, j;
> > > +	for (i = 0, j = 0;
> > > +		 i < iov_cnt && j < dst_iov_cnt && (offset || bytes); i++) {
> > > +		if (offset >= iov[i].iov_len) {
> > > +			offset -= iov[i].iov_len;
> > > +			continue;
> > > +		}
> > > +		len = MIN(bytes, iov[i].iov_len - offset);
> > > +
> > > +		dst_iov[j].iov_base = (char *)iov[i].iov_base + offset;
> > > +		dst_iov[j].iov_len = len;
> > > +		j++;
> > > +		bytes -= len;
> > > +		offset = 0;
> > > +	}
> > > +	return j;
> > > +}
> > Small concern about the interface to iov_copy().  If dst_iov_cnt <
> > iov_cnt and the chunk of the input iovec you want doesn't fit in the
> > destination it will silently truncate - you can't tell if this has
> > happened from the return value.  If the assumption is that dst_iov_cnt = iov_cnt,
> > then there's not really any need to pass it.
> 
> In fact this function will be removed with "vhost-user: use guest buffer
> directly in vu_handle_tx()" as it is not needed when we use directly guest
> buffers. So I don't think we need to improve this.

Ok, fair enough.

> If I remember correctly I think it behaves like passt already does with
> socket backend: if it doesn't fit, it's dropped silently.
> 
> I've fixed all your other comments.
> 
> Thanks,
> Laurent
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 19/24] vhost-user: introduce vhost-user API
  2024-02-02 14:11 ` [PATCH 19/24] vhost-user: introduce vhost-user API Laurent Vivier
@ 2024-02-07  2:13   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-07  2:13 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 40823 bytes --]

On Fri, Feb 02, 2024 at 03:11:46PM +0100, Laurent Vivier wrote:
> Add vhost_user.c and vhost_user.h that define the functions needed
> to implement vhost-user backend.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |    4 +-
>  passt.c      |    2 +
>  passt.h      |    8 +
>  tap.c        |    2 +-
>  tap.h        |    3 +
>  vhost_user.c | 1050 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  vhost_user.h |  139 +++++++
>  7 files changed, 1205 insertions(+), 3 deletions(-)
>  create mode 100644 vhost_user.c
>  create mode 100644 vhost_user.h
> 
> diff --git a/Makefile b/Makefile
> index ae1daa6b2b50..2016b071ddf2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
> -	tcp_buf.c udp.c util.c iov.c ip.c virtio.c
> +	tcp_buf.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
>  	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
> -	util.h iov.h ip.h virtio.h
> +	util.h iov.h ip.h virtio.h vhost_user.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/passt.c b/passt.c
> index 5056a49dec95..95034d73381f 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -72,6 +72,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]	= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]	= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]	= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]	= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]	= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> diff --git a/passt.h b/passt.h
> index a9e8f15af0e1..6ed1d0b19e82 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -42,6 +42,7 @@ union epoll_ref;
>  #include "port_fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "vhost_user.h"
>  
>  /**
>   * enum epoll_type - Different types of fds we poll over
> @@ -71,6 +72,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
>  
>  	EPOLL_NUM_TYPES,
>  };
> @@ -303,6 +308,9 @@ struct ctx {
>  
>  	int low_wmem;
>  	int low_rmem;
> +
> +	/* vhost-user */
> +	struct VuDev vdev;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/tap.c b/tap.c
> index ebe52247ad87..936206e53637 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -954,7 +954,7 @@ void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	if (c->one_off) {
>  		info("Client closed connection, exiting");
> diff --git a/tap.h b/tap.h
> index 7157ef37ee6e..ee839d4f09dc 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -81,12 +81,15 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
> +void tap_sock_reset(struct ctx *c);
>  void tap_sock_init(struct ctx *c);
>  void pool_flush_all(void);
>  void tap_handler_all(struct ctx *c, const struct timespec *now);
>  
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
> +void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
> +		       const char *func, int line);

This looks like it should be a fixup for the earlier patch which
introduced the macro below.

>  #define packet_add_all(p, len, start)					\
>  	packet_add_all_do(p, len, start, __func__, __LINE__)
>  
> diff --git a/vhost_user.c b/vhost_user.c
> new file mode 100644
> index 000000000000..2acd72398e3a
> --- /dev/null
> +++ b/vhost_user.c
> @@ -0,0 +1,1050 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts from QEMU subprojects/libvhost-user/libvhost-user.c */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <string.h>
> +#include <assert.h>
> +#include <stdbool.h>
> +#include <inttypes.h>
> +#include <time.h>
> +#include <net/ethernet.h>
> +#include <netinet/in.h>
> +#include <sys/epoll.h>
> +#include <sys/eventfd.h>
> +#include <sys/mman.h>
> +#include <linux/vhost_types.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "tap.h"
> +#include "vhost_user.h"
> +
> +#define VHOST_USER_VERSION 1
> +
> +static unsigned char buffer[65536][VHOST_USER_MAX_QUEUES];
> +
> +void vu_print_capabilities(void)
> +{
> +	printf("{\n");
> +	printf("  \"type\": \"net\"\n");
> +	printf("}\n");
> +	exit(EXIT_SUCCESS);

It's not clear to me what we need this for.

> +}
> +
> +static const char *
> +vu_request_to_string(unsigned int req)
> +{
> +#define REQ(req) [req] = #req
> +	static const char *vu_request_str[] = {
> +		REQ(VHOST_USER_NONE),
> +		REQ(VHOST_USER_GET_FEATURES),
> +		REQ(VHOST_USER_SET_FEATURES),
> +		REQ(VHOST_USER_SET_OWNER),
> +		REQ(VHOST_USER_RESET_OWNER),
> +		REQ(VHOST_USER_SET_MEM_TABLE),
> +		REQ(VHOST_USER_SET_LOG_BASE),
> +		REQ(VHOST_USER_SET_LOG_FD),
> +		REQ(VHOST_USER_SET_VRING_NUM),
> +		REQ(VHOST_USER_SET_VRING_ADDR),
> +		REQ(VHOST_USER_SET_VRING_BASE),
> +		REQ(VHOST_USER_GET_VRING_BASE),
> +		REQ(VHOST_USER_SET_VRING_KICK),
> +		REQ(VHOST_USER_SET_VRING_CALL),
> +		REQ(VHOST_USER_SET_VRING_ERR),
> +		REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
> +		REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
> +		REQ(VHOST_USER_GET_QUEUE_NUM),
> +		REQ(VHOST_USER_SET_VRING_ENABLE),
> +		REQ(VHOST_USER_SEND_RARP),
> +		REQ(VHOST_USER_NET_SET_MTU),
> +		REQ(VHOST_USER_SET_BACKEND_REQ_FD),
> +		REQ(VHOST_USER_IOTLB_MSG),
> +		REQ(VHOST_USER_SET_VRING_ENDIAN),
> +		REQ(VHOST_USER_GET_CONFIG),
> +		REQ(VHOST_USER_SET_CONFIG),
> +		REQ(VHOST_USER_POSTCOPY_ADVISE),
> +		REQ(VHOST_USER_POSTCOPY_LISTEN),
> +		REQ(VHOST_USER_POSTCOPY_END),
> +		REQ(VHOST_USER_GET_INFLIGHT_FD),
> +		REQ(VHOST_USER_SET_INFLIGHT_FD),
> +		REQ(VHOST_USER_GPU_SET_SOCKET),
> +		REQ(VHOST_USER_VRING_KICK),
> +		REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
> +		REQ(VHOST_USER_ADD_MEM_REG),
> +		REQ(VHOST_USER_REM_MEM_REG),
> +		REQ(VHOST_USER_MAX),
> +	};
> +#undef REQ
> +
> +	if (req < VHOST_USER_MAX) {
> +		return vu_request_str[req];
> +	} else {
> +		return "unknown";
> +	}
> +}
> +
> +/* Translate qemu virtual address to our virtual address.  */

The meaning of this comment in its new context is rather unclear.

> +static void *qva_to_va(VuDev *dev, uint64_t qemu_addr)
> +{
> +	unsigned int i;
> +
> +	/* Find matching memory region.  */
> +	for (i = 0; i < dev->nregions; i++) {
> +		VuDevRegion *r = &dev->regions[i];
> +
> +		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
> +			return (void *)(uintptr_t)
> +			(qemu_addr - r->qva + r->mmap_addr + r->mmap_offset);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static void
> +vmsg_close_fds(VhostUserMsg *vmsg)
> +{
> +	int i;
> +
> +	for (i = 0; i < vmsg->fd_num; i++)
> +		close(vmsg->fds[i]);
> +}
> +
> +static void vu_remove_watch(VuDev *vdev, int fd)
> +{
> +	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
> +
> +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
> +}
> +
> +/* Set reply payload.u64 and clear request flags and fd_num */
> +static void vmsg_set_reply_u64(struct VhostUserMsg *vmsg, uint64_t val)
> +{
> +	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
> +	vmsg->hdr.size = sizeof(vmsg->payload.u64);
> +	vmsg->payload.u64 = val;
> +	vmsg->fd_num = 0;
> +}
> +
> +static ssize_t vu_message_read_default(VuDev *dev, int conn_fd, struct VhostUserMsg *vmsg)

This only appears to return 0, 1 or a negative errno, which makes
ssize_t an odd choice.

> +{
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
> +		     sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +		.msg_controllen = sizeof(control),
> +	};
> +	size_t fd_size;
> +	struct cmsghdr *cmsg;
> +	ssize_t ret, sz_payload;
> +
> +	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
> +	if (ret < 0) {
> +		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
> +			return 0;
> +		vu_panic(dev, "Error while recvmsg: %s", strerror(errno));
> +		goto out;
> +	}
> +
> +	vmsg->fd_num = 0;
> +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> +	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> +		if (cmsg->cmsg_level == SOL_SOCKET &&
> +		    cmsg->cmsg_type == SCM_RIGHTS) {
> +			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
> +			vmsg->fd_num = fd_size / sizeof(int);
> +			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
> +			break;
> +		}
> +	}
> +
> +	sz_payload = vmsg->hdr.size;
> +	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
> +		vu_panic(dev,
> +			 "Error: too big message request: %d, size: vmsg->size: %zd, "
> +			 "while sizeof(vmsg->payload) = %zu",
> +			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
> +		goto out;
> +	}
> +
> +	if (sz_payload) {
> +		do {
> +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));
> +
> +		if (ret < sz_payload) {
> +			vu_panic(dev, "Error while reading: %s", strerror(errno));
> +			goto out;
> +		}
> +	}
> +
> +	return 1;
> +out:
> +	vmsg_close_fds(vmsg);
> +
> +	return -ECONNRESET;
> +}
> +
> +static int vu_message_write(VuDev *dev, int conn_fd, struct VhostUserMsg *vmsg)
> +{
> +	int rc;
> +	uint8_t *p = (uint8_t *)vmsg;
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +	};
> +	struct cmsghdr *cmsg;
> +
> +	memset(control, 0, sizeof(control));

Redundant with the initializer AFAICT.
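
As a minimal illustration of that point (a standalone sketch, not code
from the patch; CONTROL_LEN is a made-up size standing in for the
CMSG_SPACE() expression), an array declared with "= { 0 }" already has
every element zeroed, so a memset() right after it changes nothing:

```c
#include <stdbool.h>
#include <stddef.h>

#define CONTROL_LEN 64	/* hypothetical, stands in for CMSG_SPACE(...) */

/* Returns true if an array declared with "= { 0 }" is zero throughout,
 * without any explicit memset()
 */
static bool control_starts_zeroed(void)
{
	char control[CONTROL_LEN] = { 0 };
	size_t i;

	for (i = 0; i < sizeof(control); i++)
		if (control[i])
			return false;

	return true;
}
```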

> +	assert(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
> +	if (vmsg->fd_num > 0) {
> +		size_t fdsize = vmsg->fd_num * sizeof(int);
> +		msg.msg_controllen = CMSG_SPACE(fdsize);
> +		cmsg = CMSG_FIRSTHDR(&msg);
> +		cmsg->cmsg_len = CMSG_LEN(fdsize);
> +		cmsg->cmsg_level = SOL_SOCKET;
> +		cmsg->cmsg_type = SCM_RIGHTS;
> +		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
> +	} else {
> +		msg.msg_controllen = 0;
> +	}
> +
> +	do {
> +		rc = sendmsg(conn_fd, &msg, 0);
> +	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));

Hrm.. if this first sendmsg() returns a (non-EINTR, non-EAGAIN) error,
couldn't the sending of the payload clobber rc before we actually
handle it and print an error?
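
One possible shape for avoiding that, as a rough sketch only: check
the result of the header step before starting the payload step.  The
helper names write_retry() and send_hdr_then_payload() are
hypothetical, and for brevity the sketch uses write() for both parts,
whereas the patch sends the header with sendmsg().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write() retried on EINTR/EAGAIN, as the patch does */
static ssize_t write_retry(int fd, const void *buf, size_t len)
{
	ssize_t rc;

	do {
		rc = write(fd, buf, len);
	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));

	return rc;
}

/* Send header, then payload.  Stop at the first failure, so that errno
 * still describes the step that actually failed when the caller reports it.
 */
static bool send_hdr_then_payload(int fd, const void *hdr, size_t hdrlen,
				  const void *payload, size_t paylen)
{
	if (write_retry(fd, hdr, hdrlen) <= 0)
		return false;	/* header failed: don't attempt the payload */

	if (paylen && write_retry(fd, payload, paylen) <= 0)
		return false;

	return true;
}
```

With this structure a failed header write is reported as such, instead
of being hidden by a payload write that happens to succeed.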

> +	if (vmsg->hdr.size) {
> +		do {
> +			if (vmsg->data) {
> +				rc = write(conn_fd, vmsg->data, vmsg->hdr.size);
> +			} else {
> +				rc = write(conn_fd, p + VHOST_USER_HDR_SIZE, vmsg->hdr.size);
> +			}
> +		} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
> +	}
> +
> +	if (rc <= 0) {
> +		vu_panic(dev, "Error while writing: %s", strerror(errno));
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static int vu_send_reply(VuDev *dev, int conn_fd, struct VhostUserMsg *msg)
> +{
> +	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
> +	msg->hdr.flags |= VHOST_USER_VERSION;
> +	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
> +
> +	return vu_message_write(dev, conn_fd, msg);
> +}
> +
> +static bool vu_get_features_exec(struct VhostUserMsg *msg)
> +{
> +	uint64_t features =
> +		1ULL << VIRTIO_F_VERSION_1 |
> +		1ULL << VIRTIO_NET_F_MRG_RXBUF |
> +		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	return true;
> +}
> +
> +static void
> +vu_set_enable_all_rings(VuDev *vdev, bool enabled)
> +{
> +	uint16_t i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		vdev->vq[i].enable = enabled;
> +	}
> +}
> +
> +static bool
> +vu_set_features_exec(VuDev *vdev, struct VhostUserMsg *msg)
> +{
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);

I think a lot of these debug messages need some looking over to make
sure they're actually clear if you don't already know they're related
to vhost-user.

> +
> +	vdev->features = msg->payload.u64;
> +	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> +		/*
> +		 * We only support devices conforming to VIRTIO 1.0 or
> +		 * later
> +		 */
> +		vu_panic(vdev, "virtio legacy devices aren't supported by passt");
> +		return false;
> +	}
> +
> +	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES)) {
> +		vu_set_enable_all_rings(vdev, true);
> +	}
> +
> +	/* virtio-net features */
> +
> +	if (vu_has_feature(vdev, VIRTIO_F_VERSION_1) ||

Won't this always be true given the test above?
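
To spell out the logic, here is a standalone sketch (not the patch's
code).  The SK_ prefixed names are made up for this example; the bit
numbers and the 10 and 12 byte header sizes are the usual virtio-net
ones, but treat them as assumptions here.  Once VIRTIO_F_VERSION_1 is
mandatory, only the first branch can ever be taken:

```c
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers, assumed to match the virtio specification */
#define SK_VIRTIO_NET_F_MRG_RXBUF	15
#define SK_VIRTIO_F_VERSION_1		32

/* Header sizes without and with the trailing num_buffers field */
#define SK_HDRLEN_LEGACY		10
#define SK_HDRLEN_MRG_RXBUF		12

/* Same condition as in the patch, on a bare feature bitmap */
static size_t sk_hdrlen(uint64_t features)
{
	if ((features & (1ULL << SK_VIRTIO_F_VERSION_1)) ||
	    (features & (1ULL << SK_VIRTIO_NET_F_MRG_RXBUF)))
		return SK_HDRLEN_MRG_RXBUF;

	/* Only reachable for legacy devices without mergeable buffers,
	 * which the earlier check has already rejected
	 */
	return SK_HDRLEN_LEGACY;
}
```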

> +	    vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	} else {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr);
> +	}
> +
> +	return false;
> +}
> +
> +static bool
> +vu_set_owner_exec(void)
> +{
> +	return false;
> +}
> +
> +static bool map_ring(VuDev *vdev, VuVirtq *vq)
> +{
> +	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
> +	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
> +	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
> +
> +	debug("Setting virtq addresses:");
> +	debug("    vring_desc  at %p", (void *)vq->vring.desc);
> +	debug("    vring_used  at %p", (void *)vq->vring.used);
> +	debug("    vring_avail at %p", (void *)vq->vring.avail);
> +
> +	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
> +}
> +
> +/*
> + * #syscalls:passt mmap munmap
> + */
> +
> +static bool vu_set_mem_table_exec(VuDev *vdev,
> +				  struct VhostUserMsg *msg)
> +{
> +	unsigned int i;
> +	struct VhostUserMemory m = msg->payload.memory, *memory = &m;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		VuDevRegion *r = &vdev->regions[i];
> +		void *m = (void *) (uintptr_t) r->mmap_addr;
> +
> +		if (m)
> +			munmap(m, r->size + r->mmap_offset);
> +	}
> +	vdev->nregions = memory->nregions;
> +
> +	debug("Nregions: %u", memory->nregions);
> +	for (i = 0; i < vdev->nregions; i++) {
> +		void *mmap_addr;
> +		VhostUserMemory_region *msg_region = &memory->regions[i];
> +		VuDevRegion *dev_region = &vdev->regions[i];
> +
> +		debug("Region %d", i);
> +		debug("    guest_phys_addr: 0x%016"PRIx64,
> +		      msg_region->guest_phys_addr);
> +		debug("    memory_size:     0x%016"PRIx64,
> +		      msg_region->memory_size);
> +		debug("    userspace_addr   0x%016"PRIx64,
> +		      msg_region->userspace_addr);
> +		debug("    mmap_offset      0x%016"PRIx64,
> +		      msg_region->mmap_offset);
> +
> +		dev_region->gpa = msg_region->guest_phys_addr;
> +		dev_region->size = msg_region->memory_size;
> +		dev_region->qva = msg_region->userspace_addr;
> +		dev_region->mmap_offset = msg_region->mmap_offset;
> +
> +		/* We don't use offset argument of mmap() since the
> +		 * mapped address has to be page aligned, and we use huge
> +		 * pages.  */

Is that accurate in the new context?

> +		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
> +				 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
> +				 msg->fds[i], 0);
> +
> +		if (mmap_addr == MAP_FAILED) {
> +			vu_panic(vdev, "region mmap error: %s", strerror(errno));
> +		} else {
> +			dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
> +			debug("    mmap_addr:       0x%016"PRIx64,
> +			      dev_region->mmap_addr);
> +		}
> +
> +		close(msg->fds[i]);
> +	}
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		if (vdev->vq[i].vring.desc) {
> +			if (map_ring(vdev, &vdev->vq[i])) {
> +				vu_panic(vdev, "remapping queue %d during setmemtable", i);
> +			}
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +static bool vu_set_vring_num_exec(VuDev *vdev,
> +				  struct VhostUserMsg *msg)
> +{
> +	unsigned int index = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", index);
> +	debug("State.num:   %u", num);
> +	vdev->vq[index].vring.num = num;
> +
> +	return false;
> +}
> +
> +static bool vu_set_vring_addr_exec(VuDev *vdev,
> +				   struct VhostUserMsg *msg)
> +{
> +	struct vhost_vring_addr addr = msg->payload.addr, *vra = &addr;
> +	unsigned int index = vra->index;
> +	VuVirtq *vq = &vdev->vq[index];
> +
> +	debug("vhost_vring_addr:");
> +	debug("    index:  %d", vra->index);
> +	debug("    flags:  %d", vra->flags);
> +	debug("    desc_user_addr:   0x%016" PRIx64, (uint64_t)vra->desc_user_addr);
> +	debug("    used_user_addr:   0x%016" PRIx64, (uint64_t)vra->used_user_addr);
> +	debug("    avail_user_addr:  0x%016" PRIx64, (uint64_t)vra->avail_user_addr);
> +	debug("    log_guest_addr:   0x%016" PRIx64, (uint64_t)vra->log_guest_addr);
> +
> +	vq->vra = *vra;
> +	vq->vring.flags = vra->flags;
> +	vq->vring.log_guest_addr = vra->log_guest_addr;
> +
> +	if (map_ring(vdev, vq)) {
> +		vu_panic(vdev, "Invalid vring_addr message");
> +		return false;
> +	}
> +
> +	vq->used_idx = le16toh(vq->vring.used->idx);
> +
> +	if (vq->last_avail_idx != vq->used_idx) {
> +		debug("Last avail index != used index: %u != %u",
> +		      vq->last_avail_idx, vq->used_idx);
> +	}
> +
> +	return false;
> +}
> +
> +static bool vu_set_vring_base_exec(VuDev *vdev,
> +				   struct VhostUserMsg *msg)
> +{
> +	unsigned int index = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", index);
> +	debug("State.num:   %u", num);
> +	vdev->vq[index].shadow_avail_idx = vdev->vq[index].last_avail_idx = num;
> +
> +	return false;
> +}
> +
> +static bool vu_get_vring_base_exec(VuDev *vdev,
> +				   struct VhostUserMsg *msg)
> +{
> +	unsigned int index = msg->payload.state.index;
> +
> +	debug("State.index: %u", index);
> +	msg->payload.state.num = vdev->vq[index].last_avail_idx;
> +	msg->hdr.size = sizeof(msg->payload.state);
> +
> +	vdev->vq[index].started = false;
> +
> +	if (vdev->vq[index].call_fd != -1) {
> +		close(vdev->vq[index].call_fd);
> +		vdev->vq[index].call_fd = -1;
> +	}
> +	if (vdev->vq[index].kick_fd != -1) {
> +		vu_remove_watch(vdev,  vdev->vq[index].kick_fd);
> +		close(vdev->vq[index].kick_fd);
> +		vdev->vq[index].kick_fd = -1;
> +	}
> +
> +	return true;
> +}
> +
> +static void vu_set_watch(VuDev *vdev, int fd)
> +{
> +	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
> +	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
> +	struct epoll_event ev = { 0 };
> +
> +	ev.data.u64 = ref.u64;
> +	ev.events = EPOLLIN;
> +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
> +}
> +
> +int vu_send(const struct ctx *c, const void *buf, size_t size)
> +{
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	size_t hdrlen = vdev->hdrlen;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	unsigned int indexes[VIRTQUEUE_MAX_SIZE];
> +	size_t lens[VIRTQUEUE_MAX_SIZE];
> +	size_t offset;
> +	int i, j;
> +	__virtio16 *num_buffers_ptr;
> +
> +	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");

The message doesn't seem to quite match the condition.

> +		return 0;
> +	}
> +
> +	offset = 0;
> +	i = 0;
> +	num_buffers_ptr = NULL;
> +	while (offset < size) {
> +		VuVirtqElement *elem;
> +		size_t len;
> +		int total;
> +
> +		total = 0;
> +
> +		if (i == VIRTQUEUE_MAX_SIZE) {
> +			err("virtio-net unexpected long buffer chain");
> +			goto err;
> +		}
> +
> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement),
> +				    buffer[VHOST_USER_RX_QUEUE]);
> +		if (!elem) {
> +			if (!vdev->broken) {
> +				eventfd_t kick_data;
> +				ssize_t rc;
> +				int status;
> +
> +				/* wait the kernel to put new entries in the queue */
> +
> +				status = fcntl(vq->kick_fd, F_GETFL);
> +				if (status != -1) {
> +					fcntl(vq->kick_fd, F_SETFL, status & ~O_NONBLOCK);
> +					rc =  eventfd_read(vq->kick_fd, &kick_data);

Is it safe for us to block here?
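
If blocking indefinitely turns out to be a problem, one alternative
would be a bounded wait.  The following is only a sketch under
assumptions: kick_wait_bounded() and its timeout parameter are
hypothetical, and using poll() here would also need to be allowed by
the seccomp profile.

```c
#include <poll.h>
#include <stdbool.h>
#include <sys/eventfd.h>

/* Hypothetical helper: wait at most timeout_ms for a kick on an eventfd.
 * Returns true if a kick was consumed, false on timeout or error.
 */
static bool kick_wait_bounded(int kick_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = kick_fd, .events = POLLIN };
	eventfd_t kick_data;

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return false;	/* timeout, or poll() error */

	if (!(pfd.revents & POLLIN))
		return false;	/* POLLERR, POLLHUP or POLLNVAL */

	/* The counter is readable, so this read doesn't block */
	return eventfd_read(kick_fd, &kick_data) == 0;
}
```

This leaves the O_NONBLOCK flag of the descriptor untouched, so the
fcntl() save and restore around the read would not be needed.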

> +					fcntl(vq->kick_fd, F_SETFL, status);
> +					if (rc != -1)
> +						continue;
> +				}
> +			}
> +			if (i) {
> +				err("virtio-net unexpected empty queue: "
> +				    "i %d mergeable %d offset %zd, size %zd, "
> +				    "features 0x%" PRIx64,
> +				    i, vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF),
> +				    offset, size, vdev->features);
> +			}
> +			offset = -1;
> +			goto err;
> +		}
> +
> +		if (elem->in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			vu_queue_detach_element(vdev, vq, elem->index, 0);
> +			offset = -1;
> +			goto err;
> +		}
> +
> +		if (i == 0) {
> +			struct virtio_net_hdr hdr = {
> +				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +			};
> +
> +			ASSERT(offset == 0);
> +			ASSERT(elem->in_sg[0].iov_len >= hdrlen);
> +
> +			len = iov_from_buf(elem->in_sg, elem->in_num, 0, &hdr, sizeof hdr);
> +
> +			num_buffers_ptr = (__virtio16 *)((char *)elem->in_sg[0].iov_base +
> +							 len);
> +
> +			total += hdrlen;
> +		}
> +
> +		len = iov_from_buf(elem->in_sg, elem->in_num, total, (char *)buf + offset,
> +				   size - offset);
> +
> +		total += len;
> +		offset += len;
> +
> +		/* If buffers can't be merged, at this point we
> +		 * must have consumed the complete packet.
> +		 * Otherwise, drop it.
> +		 */
> +		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) && offset < size) {
> +			vu_queue_unpop(vdev, vq, elem->index, total);
> +			goto err;
> +		}
> +
> +		indexes[i] = elem->index;
> +		lens[i] = total;
> +		i++;
> +	}
> +
> +	if (num_buffers_ptr && vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		*num_buffers_ptr = htole16(i);
> +	}
> +
> +	for (j = 0; j < i; j++) {
> +		debug("filling total %zd idx %d", lens[j], j);
> +		vu_queue_fill_by_index(vdev, vq, indexes[j], lens[j], j);
> +	}
> +
> +	vu_queue_flush(vdev, vq, i);
> +	vu_queue_notify(vdev, vq);
> +
> +	debug("sent %zu", offset);
> +
> +	return offset;
> +err:
> +	for (j = 0; j < i; j++) {
> +		vu_queue_detach_element(vdev, vq, indexes[j], lens[j]);
> +	}
> +
> +	return offset;
> +}
> +
> +size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov, size_t n)
> +{
> +	size_t i;
> +	int ret;
> +
> +	debug("tap_send_frames_vu n %zd", n);
> +
> +	for (i = 0; i < n; i++) {
> +		ret = vu_send(c, iov[i].iov_base, iov[i].iov_len);
> +		if (ret < 0)
> +			break;
> +	}
> +	debug("count %zd", i);
> +	return i;
> +}
> +
> +static void vu_handle_tx(VuDev *vdev, int index)
> +{
> +	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
> +	VuVirtq *vq = &vdev->vq[index];
> +	int hdrlen = vdev->hdrlen;
> +	struct timespec now;
> +	char *p;
> +	size_t n;
> +
> +	if (index % 2 != VHOST_USER_TX_QUEUE) {
> +		debug("index %d is not an TX queue", index);

When would this situation arise?  It seems like it ought to be
something stronger than a debug() - either a fatal error in the setup
or an ASSERT().

> +		return;
> +	}
> +
> +	clock_gettime(CLOCK_MONOTONIC, &now);

If I'm reading the next patches correctly to see how this is called,
I think you can avoid this gettime() call by passing a
'now' value down from the main epoll loop (the other tap handlers
already take one).

> +	p = pkt_buf;
> +
> +	pool_flush_all();
> +
> +	while (1) {
> +		VuVirtqElement *elem;
> +		unsigned int out_num;
> +		struct iovec sg[VIRTQUEUE_MAX_SIZE], *out_sg;
> +
> +		ASSERT(index == VHOST_USER_TX_QUEUE);

Hrm.. and isn't this redundant with the check at the start of the function?

> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer[index]);
> +		if (!elem) {
> +			break;
> +		}
> +
> +		out_num = elem->out_num;
> +		out_sg = elem->out_sg;
> +		if (out_num < 1) {
> +			debug("virtio-net header not in first element");
> +			break;
> +		}
> +
> +		if (hdrlen) {
> +			unsigned sg_num;
> +
> +			sg_num = iov_copy(sg, ARRAY_SIZE(sg), out_sg, out_num,
> +					  hdrlen, -1);
> +			out_num = sg_num;
> +			out_sg = sg;
> +		}
> +
> +		n = iov_to_buf(out_sg, out_num, 0, p, TAP_BUF_FILL);
> +
> +		packet_add_all(c, n, p);
> +
> +		p += n;
> +
> +		vu_queue_push(vdev, vq, elem, 0);
> +		vu_queue_notify(vdev, vq);
> +	}
> +	tap_handler_all(c, &now);
> +}
> +
> +void vu_kick_cb(struct ctx *c, union epoll_ref ref)
> +{
> +	VuDev *vdev = &c->vdev;
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int index;
> +
> +	for (index = 0; index < VHOST_USER_MAX_QUEUES; index++)
> +		if (c->vdev.vq[index].kick_fd == ref.fd)
> +			break;
> +
> +	if (index == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc =  eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1) {
> +		vu_panic(vdev, "kick eventfd_read(): %s", strerror(errno));
> +		vu_remove_watch(vdev, ref.fd);
> +	} else {
> +		debug("Got kick_data: %016"PRIx64" idx:%d",
> +		      kick_data, index);
> +		if (index % 2 == VHOST_USER_TX_QUEUE)

.. and here we seem to have a third check for the same thing.

> +			vu_handle_tx(vdev, index);
> +	}
> +}
> +
> +static bool vu_check_queue_msg_file(VuDev *vdev, struct VhostUserMsg *msg)
> +{
> +	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +
> +	if (index >= VHOST_USER_MAX_QUEUES) {
> +		vmsg_close_fds(msg);
> +		vu_panic(vdev, "Invalid queue index: %u", index);
> +		return false;
> +	}
> +
> +	if (nofd) {
> +		vmsg_close_fds(msg);
> +		return true;
> +	}
> +
> +	if (msg->fd_num != 1) {
> +		vmsg_close_fds(msg);
> +		vu_panic(vdev, "Invalid fds in request: %d", msg->hdr.request);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool vu_set_vring_kick_exec(VuDev *vdev,
> +				   struct VhostUserMsg *msg)
> +{
> +	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	if (!vu_check_queue_msg_file(vdev, msg))
> +		return false;
> +
> +	if (vdev->vq[index].kick_fd != -1) {
> +		vu_remove_watch(vdev, vdev->vq[index].kick_fd);
> +		close(vdev->vq[index].kick_fd);
> +		vdev->vq[index].kick_fd = -1;
> +	}
> +
> +	vdev->vq[index].kick_fd = nofd ? -1 : msg->fds[0];
> +	debug("Got kick_fd: %d for vq: %d", vdev->vq[index].kick_fd, index);
> +
> +	vdev->vq[index].started = true;
> +
> +	if (vdev->vq[index].kick_fd != -1 && index % 2 == VHOST_USER_TX_QUEUE) {
> +		vu_set_watch(vdev, vdev->vq[index].kick_fd);
> +		debug("Waiting for kicks on fd: %d for vq: %d",
> +		      vdev->vq[index].kick_fd, index);
> +	}
> +
> +	return false;
> +}
> +
> +static bool vu_set_vring_call_exec(VuDev *vdev,
> +				   struct VhostUserMsg *msg)
> +{
> +	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	if (!vu_check_queue_msg_file(vdev, msg))
> +		return false;
> +
> +	if (vdev->vq[index].call_fd != -1) {
> +		close(vdev->vq[index].call_fd);
> +		vdev->vq[index].call_fd = -1;
> +	}
> +
> +	vdev->vq[index].call_fd = nofd ? -1 : msg->fds[0];
> +
> +	/* in case of I/O hang after reconnecting */
> +	if (vdev->vq[index].call_fd != -1) {
> +		eventfd_write(msg->fds[0], 1);
> +	}
> +
> +	debug("Got call_fd: %d for vq: %d", vdev->vq[index].call_fd, index);
> +
> +	return false;
> +}
> +
> +static bool vu_set_vring_err_exec(VuDev *vdev,
> +				  struct VhostUserMsg *msg)
> +{
> +	int index = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	if (!vu_check_queue_msg_file(vdev, msg))
> +		return false;
> +
> +	if (vdev->vq[index].err_fd != -1) {
> +		close(vdev->vq[index].err_fd);
> +		vdev->vq[index].err_fd = -1;
> +	}
> +
> +	vdev->vq[index].err_fd = nofd ? -1 : msg->fds[0];
> +
> +	return false;
> +}
> +
> +static bool vu_get_protocol_features_exec(struct VhostUserMsg *msg)
> +{
> +	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	return true;
> +}
> +
> +static bool vu_set_protocol_features_exec(VuDev *vdev, struct VhostUserMsg *msg)
> +{
> +	uint64_t features = msg->payload.u64;
> +
> +	debug("u64: 0x%016"PRIx64, features);
> +
> +	vdev->protocol_features = msg->payload.u64;
> +
> +	if (vu_has_protocol_feature(vdev,
> +				    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS) &&
> +	    (!vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_BACKEND_REQ) ||
> +	     !vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_REPLY_ACK))) {
> +		/*
> +		 * The use case for using messages for kick/call is simulation, to make
> +		 * the kick and call synchronous. To actually get that behaviour, both
> +		 * of the other features are required.
> +		 * Theoretically, one could use only kick messages, or do them without
> +		 * having F_REPLY_ACK, but too many (possibly pending) messages on the
> +		 * socket will eventually cause the master to hang, to avoid this in
> +		 * scenarios where not desired enforce that the settings are in a way
> +		 * that actually enables the simulation case.
> +		 */
> +		vu_panic(vdev,
> +			 "F_IN_BAND_NOTIFICATIONS requires F_BACKEND_REQ && F_REPLY_ACK");
> +		return false;
> +	}
> +
> +	return false;
> +}
> +
> +
> +static bool vu_get_queue_num_exec(struct VhostUserMsg *msg)
> +{
> +	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
> +	return true;
> +}
> +
> +static bool vu_set_vring_enable_exec(VuDev *vdev, struct VhostUserMsg *msg)
> +{
> +	unsigned int index = msg->payload.state.index;
> +	unsigned int enable = msg->payload.state.num;
> +
> +	debug("State.index:  %u", index);
> +	debug("State.enable: %u", enable);
> +
> +	if (index >= VHOST_USER_MAX_QUEUES) {
> +		vu_panic(vdev, "Invalid vring_enable index: %u", index);
> +		return false;
> +	}
> +
> +	vdev->vq[index].enable = enable;
> +	return false;
> +}
> +
> +void vu_init(struct ctx *c)
> +{
> +	int i;
> +
> +	c->vdev.hdrlen = 0;
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
> +		c->vdev.vq[i] = (VuVirtq){
> +			.call_fd = -1,
> +			.kick_fd = -1,
> +			.err_fd = -1,
> +			.notification = true,
> +		};
> +}
> +
> +static void vu_cleanup(VuDev *vdev)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		VuVirtq *vq = &vdev->vq[i];
> +
> +		vq->started = false;
> +		vq->notification = true;
> +
> +		if (vq->call_fd != -1) {
> +			close(vq->call_fd);
> +			vq->call_fd = -1;
> +		}
> +		if (vq->err_fd != -1) {
> +			close(vq->err_fd);
> +			vq->err_fd = -1;
> +		}
> +		if (vq->kick_fd != -1) {
> +			vu_remove_watch(vdev,  vq->kick_fd);
> +			close(vq->kick_fd);
> +			vq->kick_fd = -1;
> +		}
> +
> +		vq->vring.desc = 0;
> +		vq->vring.used = 0;
> +		vq->vring.avail = 0;
> +	}
> +	vdev->hdrlen = 0;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		VuDevRegion *r = &vdev->regions[i];
> +		void *m = (void *) (uintptr_t) r->mmap_addr;
> +
> +		if (m)
> +			munmap(m, r->size + r->mmap_offset);
> +	}
> +	vdev->nregions = 0;
> +}
> +
> +/**
> + * tap_handler_vu() - Packet handler for vhost-user
> + * @c:		Execution context
> + * @events:	epoll events
> + */
> +void tap_handler_vu(struct ctx *c, uint32_t events)
> +{
> +	VuDev *dev = &c->vdev;
> +	struct VhostUserMsg msg = { 0 };
> +	bool need_reply, reply_requested;
> +	int ret;
> +
> +	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
> +		tap_sock_reset(c);
> +		return;
> +	}
> +
> +
> +	ret = vu_message_read_default(dev, c->fd_tap, &msg);
> +	if (ret <= 0) {
> +		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
> +			tap_sock_reset(c);
> +		return;
> +	}
> +	debug("================ Vhost user message ================");
> +	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
> +		msg.hdr.request);
> +	debug("Flags:   0x%x", msg.hdr.flags);
> +	debug("Size:    %u", msg.hdr.size);
> +
> +	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
> +	switch (msg.hdr.request) {
> +	case VHOST_USER_GET_FEATURES:
> +		reply_requested = vu_get_features_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_FEATURES:
> +		reply_requested = vu_set_features_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_GET_PROTOCOL_FEATURES:
> +		reply_requested = vu_get_protocol_features_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_PROTOCOL_FEATURES:
> +		reply_requested = vu_set_protocol_features_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_GET_QUEUE_NUM:
> +		reply_requested = vu_get_queue_num_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_OWNER:
> +		reply_requested = vu_set_owner_exec();
> +		break;
> +	case VHOST_USER_SET_MEM_TABLE:
> +		reply_requested = vu_set_mem_table_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_NUM:
> +		reply_requested = vu_set_vring_num_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ADDR:
> +		reply_requested = vu_set_vring_addr_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_BASE:
> +		reply_requested = vu_set_vring_base_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_GET_VRING_BASE:
> +		reply_requested = vu_get_vring_base_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_KICK:
> +		reply_requested = vu_set_vring_kick_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_CALL:
> +		reply_requested = vu_set_vring_call_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ERR:
> +		reply_requested = vu_set_vring_err_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ENABLE:
> +		reply_requested = vu_set_vring_enable_exec(dev, &msg);
> +		break;
> +	case VHOST_USER_NONE:
> +		vu_cleanup(dev);
> +		return;
> +	default:
> +		vu_panic(dev, "Unhandled request: %d", msg.hdr.request);
> +		return;
> +	}
> +
> +	if (!reply_requested && need_reply) {
> +		msg.payload.u64 = 0;
> +		msg.hdr.flags = 0;
> +		msg.hdr.size = sizeof(msg.payload.u64);
> +		msg.fd_num = 0;
> +		reply_requested = true;
> +	}
> +
> +	if (reply_requested)
> +		ret = vu_send_reply(dev, c->fd_tap, &msg);
> +	free(msg.data);
> +}
> diff --git a/vhost_user.h b/vhost_user.h
> new file mode 100644
> index 000000000000..25f0b617ab40
> --- /dev/null
> +++ b/vhost_user.h
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts from subprojects/libvhost-user/libvhost-user.h */
> +
> +#ifndef VHOST_USER_H
> +#define VHOST_USER_H
> +
> +#include "virtio.h"
> +#include "iov.h"
> +
> +#define VHOST_USER_F_PROTOCOL_FEATURES 30
> +
> +#define VHOST_MEMORY_BASELINE_NREGIONS 8
> +
> +enum vhost_user_protocol_feature {
> +	VHOST_USER_PROTOCOL_F_MQ = 0,
> +	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
> +	VHOST_USER_PROTOCOL_F_RARP = 2,
> +	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
> +	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
> +	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
> +	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
> +	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
> +	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
> +	VHOST_USER_PROTOCOL_F_CONFIG = 9,
> +	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
> +	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
> +	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
> +	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
> +
> +	VHOST_USER_PROTOCOL_F_MAX
> +};
> +
> +enum vhost_user_request {
> +	VHOST_USER_NONE = 0,
> +	VHOST_USER_GET_FEATURES = 1,
> +	VHOST_USER_SET_FEATURES = 2,
> +	VHOST_USER_SET_OWNER = 3,
> +	VHOST_USER_RESET_OWNER = 4,
> +	VHOST_USER_SET_MEM_TABLE = 5,
> +	VHOST_USER_SET_LOG_BASE = 6,
> +	VHOST_USER_SET_LOG_FD = 7,
> +	VHOST_USER_SET_VRING_NUM = 8,
> +	VHOST_USER_SET_VRING_ADDR = 9,
> +	VHOST_USER_SET_VRING_BASE = 10,
> +	VHOST_USER_GET_VRING_BASE = 11,
> +	VHOST_USER_SET_VRING_KICK = 12,
> +	VHOST_USER_SET_VRING_CALL = 13,
> +	VHOST_USER_SET_VRING_ERR = 14,
> +	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
> +	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
> +	VHOST_USER_GET_QUEUE_NUM = 17,
> +	VHOST_USER_SET_VRING_ENABLE = 18,
> +	VHOST_USER_SEND_RARP = 19,
> +	VHOST_USER_NET_SET_MTU = 20,
> +	VHOST_USER_SET_BACKEND_REQ_FD = 21,
> +	VHOST_USER_IOTLB_MSG = 22,
> +	VHOST_USER_SET_VRING_ENDIAN = 23,
> +	VHOST_USER_GET_CONFIG = 24,
> +	VHOST_USER_SET_CONFIG = 25,
> +	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
> +	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
> +	VHOST_USER_POSTCOPY_ADVISE  = 28,
> +	VHOST_USER_POSTCOPY_LISTEN  = 29,
> +	VHOST_USER_POSTCOPY_END     = 30,
> +	VHOST_USER_GET_INFLIGHT_FD = 31,
> +	VHOST_USER_SET_INFLIGHT_FD = 32,
> +	VHOST_USER_GPU_SET_SOCKET = 33,
> +	VHOST_USER_VRING_KICK = 35,
> +	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> +	VHOST_USER_ADD_MEM_REG = 37,
> +	VHOST_USER_REM_MEM_REG = 38,
> +	VHOST_USER_MAX
> +};
> +
> +typedef struct {
> +	enum vhost_user_request request;
> +
> +#define VHOST_USER_VERSION_MASK     0x3
> +#define VHOST_USER_REPLY_MASK       (0x1 << 2)
> +#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
> +	uint32_t flags;
> +	uint32_t size; /* the following payload size */
> +} __attribute__ ((__packed__)) vhost_user_header;
> +
> +typedef struct VhostUserMemory_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t memory_size;
> +	uint64_t userspace_addr;
> +	uint64_t mmap_offset;
> +} VhostUserMemory_region;
> +
> +struct VhostUserMemory {
> +	uint32_t nregions;
> +	uint32_t padding;
> +	struct VhostUserMemory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
> +};
> +
> +typedef union {
> +#define VHOST_USER_VRING_IDX_MASK   0xff
> +#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> +	uint64_t u64;
> +	struct vhost_vring_state state;
> +	struct vhost_vring_addr addr;
> +	struct VhostUserMemory memory;
> +} vhost_user_payload;
> +
> +typedef struct VhostUserMsg {
> +	vhost_user_header hdr;
> +	vhost_user_payload payload;
> +
> +	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> +	int fd_num;
> +	uint8_t *data;
> +} __attribute__ ((__packed__)) VhostUserMsg;
> +#define VHOST_USER_HDR_SIZE sizeof(vhost_user_header)
> +
> +#define VHOST_USER_RX_QUEUE 0
> +#define VHOST_USER_TX_QUEUE 1
> +
> +static inline bool vu_queue_enabled(VuVirtq *vq)
> +{
> +	return vq->enable;
> +}
> +
> +static inline bool vu_queue_started(const VuVirtq *vq)
> +{
> +	return vq->started;
> +}
> +
> +size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov,
> +			  size_t n);
> +int vu_send(const struct ctx *c, const void *data, size_t len);
> +void vu_print_capabilities(void);
> +void vu_init(struct ctx *c);
> +void vu_kick_cb(struct ctx *c, union epoll_ref ref);
> +void tap_handler_vu(struct ctx *c, uint32_t events);
> +#endif /* VHOST_USER_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-02 14:11 ` [PATCH 20/24] vhost-user: add vhost-user Laurent Vivier
@ 2024-02-07  2:40   ` David Gibson
  2024-02-11 23:19     ` Stefano Brivio
  2024-02-11 23:19   ` Stefano Brivio
  1 sibling, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-07  2:40 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 11163 bytes --]

On Fri, Feb 02, 2024 at 03:11:47PM +0100, Laurent Vivier wrote:
> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \

I think it would be wise to use different default socket names for
vhost-user than for the qemu socket protocol.  Or even to require
--socket-path: the reasons we have these rather weird default probed
paths don't apply here, AFAICT.

>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  conf.c  | 20 ++++++++++++++--
>  passt.c |  7 ++++++
>  passt.h |  1 +
>  tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
>  tcp.c   |  8 +++++--
>  udp.c   |  6 +++--
>  6 files changed, 90 insertions(+), 25 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index b6a2a1f0fdc3..40aa9519f8a6 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -44,6 +44,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
>  		info(   "  -I, --ns-ifname NAME	namespace interface name");
>  		info(   "    default: same interface name as external one");
>  	} else {
> -		info(   "  -s, --socket PATH	UNIX domain socket path");
> +		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");
>  		info(   "    default: probe free path starting from "
>  		     UNIX_SOCK_PATH, 1);
> +		info(   "  --vhost-user		Enable vhost-user mode");
> +		info(   "    UNIX domain socket is provided by -s option");
> +		info(   "  --print-capabilities	print back-end capabilities in JSON format");
>  	}
>  
>  	info(   "  -F, --fd FD		Use FD as pre-opened connected socket");
> @@ -1123,6 +1127,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"help",	no_argument,		NULL,		'h' },
>  		{"socket",	required_argument,	NULL,		's' },
>  		{"fd",		required_argument,	NULL,		'F' },
> +		{"socket-path",	required_argument,	NULL,		's' }, /* vhost-user mandatory */
>  		{"ns-ifname",	required_argument,	NULL,		'I' },
>  		{"pcap",	required_argument,	NULL,		'p' },
>  		{"pid",		required_argument,	NULL,		'P' },
> @@ -1169,6 +1174,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"config-net",	no_argument,		NULL,		17 },
>  		{"no-copy-routes", no_argument,		NULL,		18 },
>  		{"no-copy-addrs", no_argument,		NULL,		19 },
> +		{"vhost-user",	no_argument,		NULL,		20 },
> +		{"print-capabilities", no_argument,	NULL,		21 }, /* vhost-user mandatory */
>  		{ 0 },
>  	};
>  	char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
> @@ -1328,7 +1335,6 @@ void conf(struct ctx *c, int argc, char **argv)
>  				       sizeof(c->ip6.ifname_out), "%s", optarg);
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
>  				die("Invalid interface name: %s", optarg);
> -

Unrelated change

>  			break;
>  		case 17:
>  			if (c->mode != MODE_PASTA)
> @@ -1350,6 +1356,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			warn("--no-copy-addrs will be dropped soon");
>  			c->no_copy_addrs = copy_addrs_opt = true;
>  			break;
> +		case 20:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0]);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 21:
> +			vu_print_capabilities();
> +			break;
>  		case 'd':
>  			if (c->debug)
>  				die("Multiple --debug options given");
> diff --git a/passt.c b/passt.c
> index 95034d73381f..952aded12848 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -282,6 +282,7 @@ int main(int argc, char **argv)
>  	quit_fd = pasta_netns_quit_init(&c);
>  
>  	tap_sock_init(&c);
> +	vu_init(&c);
>  
>  	secret_init(&c);
>  
> @@ -399,6 +400,12 @@ loop:
>  		case EPOLL_TYPE_ICMPV6:
>  			icmp_sock_handler(&c, AF_INET6, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			tap_handler_vu(&c, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(&c, ref);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 6ed1d0b19e82..4e0100d51a4d 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -141,6 +141,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> diff --git a/tap.c b/tap.c
> index 936206e53637..c2a917bc00ca 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -57,6 +57,7 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -75,19 +76,22 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
>   */
>  int tap_send(const struct ctx *c, const void *data, size_t len)
>  {
> -	pcap(data, len);
> +	int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> +	uint32_t vnet_len = htonl(len);
>  
> -	if (c->mode == MODE_PASST) {
> -		int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> -		uint32_t vnet_len = htonl(len);
> +	pcap(data, len);
>  
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		if (send(c->fd_tap, &vnet_len, 4, flags) < 0)
>  			return -1;
> -
>  		return send(c->fd_tap, data, len, flags);
> +	case MODE_PASTA:
> +		return write(c->fd_tap, (char *)data, len);
> +	case MODE_VU:
> +		return vu_send(c, data, len);
>  	}
> -
> -	return write(c->fd_tap, (char *)data, len);
> +	return 0;
>  }
>  
>  /**
> @@ -428,10 +432,20 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
>  	if (!n)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, n);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, n);
> +		break;
> +	case MODE_VU:
> +		m = tap_send_frames_vu(c, iov, n);
> +		break;
> +	default:
> +		m = 0;
> +		break;
> +	}
>  
>  	if (m < n)
>  		debug("tap: failed to send %zu frames of %zu", n - m, n);
> @@ -1149,11 +1163,17 @@ static void tap_sock_unix_init(struct ctx *c)
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
>  
> -	info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> -	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> -	     addr.sun_path);
> -	info("or qrap, for earlier qemu versions:");
> -	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	if (c->mode == MODE_VU) {
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		     addr.sun_path);
> +	} else {
> +		info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		     addr.sun_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	}
>  }
>  
>  /**
> @@ -1163,7 +1183,7 @@ static void tap_sock_unix_init(struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
> +	union epoll_ref ref;
>  	struct epoll_event ev = { 0 };
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
> @@ -1204,7 +1224,13 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> -	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +	if (c->mode == MODE_VU) {

I'd prefer to see different epoll types for listening "old-style" unix
sockets and listening vhost-user sockets, rather than having a switch
on the global mode here.

> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +		ev.events = EPOLLIN | EPOLLRDHUP;
> +	} else {
> +		ref.type = EPOLL_TYPE_TAP_PASST;
> +		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> +	}
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
> @@ -1288,12 +1314,21 @@ void tap_sock_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			ev.events = EPOLLIN | EPOLLRDHUP;
> +			break;

Please add a default: case, even if it's just ASSERT(0).  Leaving it
out tends to make static checkers unhappy.

> +		}
>  
> -		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
>  		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  		return;
> diff --git a/tcp.c b/tcp.c
> index 54c15087d678..b6aca9f37f19 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1033,7 +1033,9 @@ size_t ipv4_fill_headers(const struct ctx *c,
>  
>  	tcp_set_tcp_header(th, conn, seq);
>  
> -	th->check = tcp_update_check_tcp4(iph);
> +	th->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		th->check = tcp_update_check_tcp4(iph);
>  
>  	return ip_len;
>  }
> @@ -1069,7 +1071,9 @@ size_t ipv6_fill_headers(const struct ctx *c,
>  
>  	tcp_set_tcp_header(th, conn, seq);
>  
> -	th->check = tcp_update_check_tcp6(ip6h);
> +	th->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		th->check = tcp_update_check_tcp6(ip6h);
>  
>  	ip6h->hop_limit = 255;
>  	ip6h->version = 6;
> diff --git a/udp.c b/udp.c
> index a189c2e0b5a2..799a10989a91 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -671,8 +671,10 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
>  	uh->source = s_in6->sin6_port;
>  	uh->dest = htons(dstport);
>  	uh->len = ip6h->payload_len;
> -	uh->check = csum(uh, ntohs(ip6h->payload_len),
> -			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> +	uh->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		uh->check = csum(uh, ntohs(ip6h->payload_len),
> +				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
>  	ip6h->version = 6;
>  	ip6h->nexthdr = IPPROTO_UDP;
>  	ip6h->hop_limit = 255;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/24] checksum: align buffers
  2024-02-05  6:02   ` David Gibson
@ 2024-02-07  9:01     ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07  9:01 UTC (permalink / raw)
  To: David Gibson, Laurent Vivier; +Cc: passt-dev

On Mon, 5 Feb 2024 17:02:00 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:30PM +0100, Laurent Vivier wrote:
> > If the buffer is not aligned, use sum_16b() only on the unaligned
> > part, and then use csum() on the remaining part
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> > ---
> >  checksum.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/checksum.c b/checksum.c
> > index f21c9b7a14d1..c94980771c63 100644
> > --- a/checksum.c
> > +++ b/checksum.c
> > @@ -407,7 +407,19 @@ less_than_128_bytes:
> >  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
> >  uint16_t csum(const void *buf, size_t len, uint32_t init)
> >  {
> > -	return (uint16_t)~csum_fold(csum_avx2(buf, len, init));
> > +	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;  
> 
> > Wonder if it's worth adding an ALIGN_UP macro.

Actually, we have it, it's called ROUND_UP (see util.h). This could be:

	intptr_t align = ROUND_UP(buf, 0x20);

...and maybe we could use sizeof(__m256i) or similar instead of 0x20.
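
A minimal sketch of that split, for illustration only: ROUND_UP_SKETCH
is a hypothetical stand-in and may not match the actual ROUND_UP
definition in util.h, and csum_head_len() is not a function in passt.
It computes how many leading bytes fall before the next aligned
boundary, that is, the part the patch hands to sum_16b() before
csum_avx2() takes over:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ROUND_UP from util.h: y must be a power of 2 */
#define ROUND_UP_SKETCH(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

/**
 * csum_head_len() - Length of the unaligned head of a buffer
 * @buf:	Input buffer, any alignment
 * @len:	Input length
 * @align:	Required alignment in bytes, power of 2 (32 for AVX2 loads)
 *
 * Return: bytes before the first aligned address, capped at @len
 */
size_t csum_head_len(const void *buf, size_t len, uintptr_t align)
{
	uintptr_t start = (uintptr_t)buf;
	size_t pad;

	assert(align && !(align & (align - 1)));

	pad = ROUND_UP_SKETCH(start, align) - start;

	return pad < len ? pad : len;
}
```

With align set to 32, a buffer starting one byte past a boundary gives
a 31-byte head, and a buffer shorter than the head is handled entirely
by the scalar path.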

> 
> > +	unsigned int pad = align - (intptr_t)buf;
> > +
> > +	if (len < pad)
> > +		pad = len;
> > +
> > +	if (pad)
> > +		init += sum_16b(buf, pad);
> > +
> > +	if (len > pad)
> > +		init = csum_avx2((void *)align, len - pad, init);
> > +
> > +	return (uint16_t)~csum_fold(init);
> >  }
> >  
> >  #else /* __AVX2__ */  
> 

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/24] checksum: add csum_iov()
  2024-02-02 14:11 ` [PATCH 04/24] checksum: add csum_iov() Laurent Vivier
  2024-02-05  6:07   ` David Gibson
@ 2024-02-07  9:02   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07  9:02 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:31 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  checksum.c | 39 ++++++++++++++++++++++-----------------
>  checksum.h |  1 +
>  2 files changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/checksum.c b/checksum.c
> index c94980771c63..14b6057684d9 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -395,17 +395,8 @@ less_than_128_bytes:
>  	return (uint32_t)sum64;
>  }
>  
> -/**
> - * csum() - Compute TCP/IP-style checksum
> - * @buf:	Input buffer, must be aligned to 32-byte boundary
> - * @len:	Input length
> - * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
> - *
> - * Return: 16-bit folded, complemented checksum sum
> - */
> -/* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
>  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
> -uint16_t csum(const void *buf, size_t len, uint32_t init)
> +uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
>  {
>  	intptr_t align = ((intptr_t)buf + 0x1f) & ~(intptr_t)0x1f;
>  	unsigned int pad = align - (intptr_t)buf;
> @@ -419,24 +410,38 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
>  	if (len > pad)
>  		init = csum_avx2((void *)align, len - pad, init);
>  
> -	return (uint16_t)~csum_fold(init);
> +	return init;
>  }
> -
>  #else /* __AVX2__ */
>  
> +__attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
> +uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
> +{
> +	return sum_16b(buf, len) + init;
> +}
> +#endif /* !__AVX2__ */
> +
>  /**
>   * csum() - Compute TCP/IP-style checksum
> - * @buf:	Input buffer
> + * @buf:	Input buffer, must be aligned to 32-byte boundary
>   * @len:	Input length
> - * @sum:	Initial 32-bit checksum, 0 for no pre-computed checksum
> + * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
>   *
> - * Return: 16-bit folded, complemented checksum
> + * Return: 16-bit folded, complemented checksum sum
>   */
>  /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */
>  __attribute__((optimize("-fno-strict-aliasing")))	/* See csum_16b() */
>  uint16_t csum(const void *buf, size_t len, uint32_t init)
>  {
> -	return csum_unaligned(buf, len, init);
> +	return (uint16_t)~csum_fold(csum_unfolded(buf, len, init));
>  }
>  
> -#endif /* !__AVX2__ */
> +uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < n;  i++)

Stray whitespace before i++.

> +		init = csum_unfolded(iov[i].iov_base, iov[i].iov_len, init);
> +
> +	return (uint16_t)~csum_fold(init);
> +}
> diff --git a/checksum.h b/checksum.h
> index 21c0310d3804..6a20297a5826 100644
> --- a/checksum.h
> +++ b/checksum.h
> @@ -25,5 +25,6 @@ void csum_icmp6(struct icmp6hdr *icmp6hr,
>  		const struct in6_addr *saddr, const struct in6_addr *daddr,
>  		const void *payload, size_t len);
>  uint16_t csum(const void *buf, size_t len, uint32_t init);
> +uint16_t csum_iov(struct iovec *iov, unsigned int n, uint32_t init);
>  
>  #endif /* CHECKSUM_H */

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch]
  2024-02-05  6:13   ` David Gibson
@ 2024-02-07  9:03     ` Stefano Brivio
  2024-02-08  0:04       ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07  9:03 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

Not related to the review of the patch itself:

On Mon, 5 Feb 2024 17:13:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:32PM +0100, Laurent Vivier wrote:
> 
> > [...]
> >
> > +struct ipv6hdr {  
> 
> Not really in scope for this patch, but I have wondered if we should
> try to use struct ip6_hdr from netinet/ip6.h instead of our own
> version (derived, I think, from the kernel one).

The reason why I went with this is that the one in netinet/ip6.h looks
fairly unusable to me: there are no explicit fields for version and
priority, and names are long and a bit obscure, as defined by RFC 3542:
does 'ctlun' actually mean "control union"?

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-07  1:01       ` David Gibson
@ 2024-02-07 10:00         ` Laurent Vivier
  0 siblings, 0 replies; 83+ messages in thread
From: Laurent Vivier @ 2024-02-07 10:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 2/7/24 02:01, David Gibson wrote:
> On Tue, Feb 06, 2024 at 03:28:10PM +0100, Laurent Vivier wrote:
>> On 2/5/24 06:57, David Gibson wrote:
>>> On Fri, Feb 02, 2024 at 03:11:28PM +0100, Laurent Vivier wrote:
>>> ...
>>>> diff --git a/iov.c b/iov.c
>>>> new file mode 100644
>>>> index 000000000000..38a8e7566021
>>>> --- /dev/null
>>>> +++ b/iov.c
>>>>
>>>> +	for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
>>> Not immediately seeing why you need the 'offset ||' part of the condition.
>> In fact the loop has two purposes:
>>
>> 1- scan the iovec to reach byte offset in the iov (so until offset is 0)
>>
>> 2- copy the bytes (until done == bytes)
>>
>> It could be written like this:
>>
>> for (i = 0; offset && i < iov_cnt && offset >= iov[i].iov_len ; i++)
>>          offset -= iov[i].iov_len;
>>
>> for (done = 0; done < bytes && i < iov_cnt; i++) {
>>      size_t len = MIN(iov[i].iov_len - offset, bytes - done);
>>      memcpy((char *)iov[i].iov_base + offset, (char *)buf + done, len);
>>      done += len;
>> }
> Right, but done starts at 0 and will remain zero until you reach the
> first segment where you need to copy something.  So, unless bytes ==
> 0, then done < bytes will always be true when offset != 0.  And if
> bytes *is* zero, then there's nothing to do, so there's no need to
> step through the vector at all.
>
Yes, you're right. I'm going to remove the "offset" test from the condition.
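
A sketch of how the loop could look with the skip and copy phases
separated, assuming nothing about the final iov.c:
iov_copy_from_buf() is a hypothetical name. Unlike the snippet quoted
above, it resets offset after the first copied segment, since the
offset only applies there:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/**
 * iov_copy_from_buf() - Copy a flat buffer into an iovec (sketch)
 * @iov:	Destination segments
 * @iov_cnt:	Number of segments
 * @offset:	Byte offset into the iovec where the copy starts
 * @buf:	Source buffer
 * @bytes:	Number of bytes to copy
 *
 * Return: number of bytes actually copied, less than @bytes if the
 *	   iovec is too short
 */
size_t iov_copy_from_buf(const struct iovec *iov, size_t iov_cnt,
			 size_t offset, const void *buf, size_t bytes)
{
	size_t i, done;

	assert(iov || !iov_cnt);

	/* Phase 1: skip whole segments that lie before the offset */
	for (i = 0; i < iov_cnt && offset >= iov[i].iov_len; i++)
		offset -= iov[i].iov_len;

	/* Phase 2: copy, with offset applying to the first segment only */
	for (done = 0; done < bytes && i < iov_cnt; i++) {
		size_t len = iov[i].iov_len - offset;

		if (len > bytes - done)
			len = bytes - done;

		memcpy((char *)iov[i].iov_base + offset,
		       (const char *)buf + done, len);
		done += len;
		offset = 0;
	}

	return done;
}
```

The skip loop needs no test on done or bytes: if bytes is zero, the
copy loop simply doesn't run.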

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h
  2024-02-02 14:11 ` [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h Laurent Vivier
  2024-02-05  6:16   ` David Gibson
@ 2024-02-07 10:40   ` Stefano Brivio
  2024-02-07 23:43     ` David Gibson
  1 sibling, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07 10:40 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:33 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> We can find the same function to compute the IPv4 header
> checksum in tcp.c and udp.c
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  ip.h  | 14 ++++++++++++++
>  tcp.c | 23 ++---------------------
>  udp.c | 19 +------------------
>  3 files changed, 17 insertions(+), 39 deletions(-)
> 
> diff --git a/ip.h b/ip.h
> index b2e08bc049f3..ff7902c45a95 100644
> --- a/ip.h
> +++ b/ip.h
> @@ -9,6 +9,8 @@
>  #include <netinet/ip.h>
>  #include <netinet/ip6.h>
>  
> +#include "checksum.h"
> +
>  #define IN4_IS_ADDR_UNSPECIFIED(a) \
>  	((a)->s_addr == htonl_constant(INADDR_ANY))
>  #define IN4_IS_ADDR_BROADCAST(a) \
> @@ -83,4 +85,16 @@ struct ipv6_opt_hdr {
>  
>  char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
>  		 size_t *dlen);
> +static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)

A function comment would be nice. A couple of doubts:

- why is this an inline in ip.h, instead of a function in checksum.c?
  That would be more natural, I think

- this would be the first Layer-4 protocol number passed as int: we use
  uint8_t elsewhere. Now, socket(2) and similar all take an int, but
  using uint8_t internally keeps large arrays such as tap4_l4 a bit
  smaller.

  The only value defined in Linux UAPI exceeding eight bits is
  IPPROTO_MPTCP, 262, because that's never on the wire (the TCP
  protocol number is used instead). And we won't meet that either.

  In practice, it doesn't matter what we use here, but still uint8_t
  would be consistent.

> +{
> +	uint32_t sum = L2_BUF_IP4_PSUM(proto);
> +
> +	sum += iph->tot_len;
> +	sum += (iph->saddr >> 16) & 0xffff;
> +	sum += iph->saddr & 0xffff;
> +	sum += (iph->daddr >> 16) & 0xffff;
> +	sum += iph->daddr & 0xffff;
> +
> +	return ~csum_fold(sum);
> +}
>  #endif /* IP_H */
> diff --git a/tcp.c b/tcp.c
> index 4c9c5fb51c60..293ab12d8c21 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -934,23 +934,6 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>  		trace("TCP: failed to set SO_SNDBUF to %i", v);
>  }
>  
> -/**
> - * tcp_update_check_ip4() - Update IPv4 with variable parts from stored one
> - * @buf:	L2 packet buffer with final IPv4 header
> - */
> -static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf)
> -{
> -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_TCP);
> -
> -	sum += buf->iph.tot_len;
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -
> -	buf->iph.check = (uint16_t)~csum_fold(sum);
> -}
> -
>  /**
>   * tcp_update_check_tcp4() - Update TCP checksum from stored one
>   * @buf:	L2 packet buffer with final IPv4 header
> @@ -1393,10 +1376,8 @@ do {									\
>  		b->iph.saddr = a4->s_addr;
>  		b->iph.daddr = c->ip4.addr_seen.s_addr;
>  
> -		if (check)
> -			b->iph.check = *check;
> -		else
> -			tcp_update_check_ip4(b);
> +		b->iph.check = check ? *check :
> +				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> diff --git a/udp.c b/udp.c
> index d514c864ab5b..6f867df81c05 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -270,23 +270,6 @@ static void udp_invert_portmap(struct udp_port_fwd *fwd)
>  	}
>  }
>  
> -/**
> - * udp_update_check4() - Update checksum with variable parts from stored one
> - * @buf:	L2 packet buffer with final IPv4 header
> - */
> -static void udp_update_check4(struct udp4_l2_buf_t *buf)
> -{
> -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_UDP);
> -
> -	sum += buf->iph.tot_len;
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -
> -	buf->iph.check = (uint16_t)~csum_fold(sum);
> -}
> -
>  /**
>   * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
>   * @eth_d:	Ethernet destination address, NULL if unchanged
> @@ -614,7 +597,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
>  		b->iph.saddr = b->s_in.sin_addr.s_addr;
>  	}
>  
> -	udp_update_check4(b);
> +	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
>  	b->uh.source = b->s_in.sin_port;
>  	b->uh.dest = htons(dstport);
>  	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP
  2024-02-02 14:11 ` [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP Laurent Vivier
  2024-02-05  6:20   ` David Gibson
@ 2024-02-07 10:41   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07 10:41 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:34 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> The TCP and UDP checksums are computed using the data in the TCP/UDP
> payload but also some information in the IP header (protocol,
> length, source and destination addresses).
> 
> We add two functions, proto_ipv4_header_checksum() and
> proto_ipv6_header_checksum(), to compute the checksum of the IP
> header part.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  ip.h  | 24 ++++++++++++++++++++++++
>  tcp.c | 40 +++++++++++++++-------------------------
>  udp.c |  6 ++----
>  3 files changed, 41 insertions(+), 29 deletions(-)
> 
> diff --git a/ip.h b/ip.h
> index ff7902c45a95..87cb8dd21d2e 100644
> --- a/ip.h
> +++ b/ip.h
> @@ -97,4 +97,28 @@ static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
>  
>  	return ~csum_fold(sum);
>  }
> +
> +static inline uint32_t proto_ipv4_header_checksum(struct iphdr *iph, int proto)

Function comments, proto can be uint8_t for consistency, and probably
all these don't need to be inlines. Interesting readings in case you
haven't come across them:

	https://lwn.net/Kernel/Index/#Inline_functions
	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?id=6d280f4d760e3bcb4a8df302afebf085b65ec982#n962
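
To illustrate the three points (function comment, uint8_t protocol,
no inline), here is a rough sketch. The name is hypothetical, and as
a simplifying assumption it takes addresses and length in host byte
order, whereas the patch works directly on the network-order fields
of struct iphdr:

```c
#include <assert.h>
#include <stdint.h>

/**
 * proto_ipv4_psum_sketch() - Unfolded sum of the IPv4 pseudo-header
 * @saddr:	Source address, host byte order (assumption of this sketch)
 * @daddr:	Destination address, host byte order
 * @proto:	Layer-4 protocol number, as it appears on the wire
 * @l4len:	Length of Layer-4 header plus payload, in bytes
 *
 * Return: 32-bit unfolded sum, usable as initial value for the L4 checksum
 */
uint32_t proto_ipv4_psum_sketch(uint32_t saddr, uint32_t daddr,
				uint8_t proto, uint16_t l4len)
{
	uint32_t sum = proto;

	sum += saddr >> 16;
	sum += saddr & 0xffff;
	sum += daddr >> 16;
	sum += daddr & 0xffff;
	sum += l4len;

	return sum;
}

/* Self-check: 192.168.0.1 to 192.168.0.2, UDP, bare 8-byte header */
void proto_ipv4_psum_selfcheck(void)
{
	assert(proto_ipv4_psum_sketch(0xc0a80001, 0xc0a80002, 17, 8) ==
	       0x1816c);
}
```
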

> +{
> +	uint32_t sum = htons(proto);
> +
> +	sum += (iph->saddr >> 16) & 0xffff;
> +	sum += iph->saddr & 0xffff;
> +	sum += (iph->daddr >> 16) & 0xffff;
> +	sum += iph->daddr & 0xffff;
> +	sum += htons(ntohs(iph->tot_len) - 20);
> +
> +	return sum;
> +}
> +
> +static inline uint32_t proto_ipv6_header_checksum(struct ipv6hdr *ip6h,
> +						  int proto)
> +{
> +	uint32_t sum = htons(proto) + ip6h->payload_len;
> +
> +	sum += sum_16b(&ip6h->saddr, sizeof(ip6h->saddr));
> +	sum += sum_16b(&ip6h->daddr, sizeof(ip6h->daddr));
> +
> +	return sum;
> +}
>  #endif /* IP_H */
> diff --git a/tcp.c b/tcp.c
> index 293ab12d8c21..2fd6bc2eda53 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -938,39 +938,25 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>   * tcp_update_check_tcp4() - Update TCP checksum from stored one
>   * @buf:	L2 packet buffer with final IPv4 header
>   */
> -static void tcp_update_check_tcp4(struct tcp4_l2_buf_t *buf)
> +static uint16_t tcp_update_check_tcp4(struct iphdr *iph)
>  {
> -	uint16_t tlen = ntohs(buf->iph.tot_len) - 20;
> -	uint32_t sum = htons(IPPROTO_TCP);
> +	struct tcphdr *th = (void *)(iph + 1);
> +	uint16_t tlen = ntohs(iph->tot_len) - 20;
> +	uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
>  
> -	sum += (buf->iph.saddr >> 16) & 0xffff;
> -	sum += buf->iph.saddr & 0xffff;
> -	sum += (buf->iph.daddr >> 16) & 0xffff;
> -	sum += buf->iph.daddr & 0xffff;
> -	sum += htons(ntohs(buf->iph.tot_len) - 20);
> -
> -	buf->th.check = 0;
> -	buf->th.check = csum(&buf->th, tlen, sum);
> +	return csum(th, tlen, sum);
>  }
>  
>  /**
>   * tcp_update_check_tcp6() - Calculate TCP checksum for IPv6
>   * @buf:	L2 packet buffer with final IPv6 header
>   */
> -static void tcp_update_check_tcp6(struct tcp6_l2_buf_t *buf)
> +static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
>  {
> -	int len = ntohs(buf->ip6h.payload_len) + sizeof(struct ipv6hdr);
> -
> -	buf->ip6h.hop_limit = IPPROTO_TCP;
> -	buf->ip6h.version = 0;
> -	buf->ip6h.nexthdr = 0;
> +	struct tcphdr *th = (void *)(ip6h + 1);
> +	uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
>  
> -	buf->th.check = 0;
> -	buf->th.check = csum(&buf->ip6h, len, 0);
> -
> -	buf->ip6h.hop_limit = 255;
> -	buf->ip6h.version = 6;
> -	buf->ip6h.nexthdr = IPPROTO_TCP;
> +	return csum(th, ntohs(ip6h->payload_len), sum);
>  }
>  
>  /**
> @@ -1381,7 +1367,7 @@ do {									\
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> -		tcp_update_check_tcp4(b);
> +		b->th.check = tcp_update_check_tcp4(&b->iph);
>  
>  		tlen = tap_iov_len(c, &b->taph, ip_len);
>  	} else {
> @@ -1400,7 +1386,11 @@ do {									\
>  
>  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
>  
> -		tcp_update_check_tcp6(b);
> +		b->th.check = tcp_update_check_tcp6(&b->ip6h);
> +
> +		b->ip6h.hop_limit = 255;
> +		b->ip6h.version = 6;
> +		b->ip6h.nexthdr = IPPROTO_TCP;
>  
>  		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
>  		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
> diff --git a/udp.c b/udp.c
> index 6f867df81c05..96b4e6ca9a85 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -669,10 +669,8 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>  	b->uh.source = b->s_in6.sin6_port;
>  	b->uh.dest = htons(dstport);
>  	b->uh.len = b->ip6h.payload_len;
> -
> -	b->ip6h.hop_limit = IPPROTO_UDP;
> -	b->ip6h.version = b->ip6h.nexthdr = b->uh.check = 0;
> -	b->uh.check = csum(&b->ip6h, ip_len, 0);
> +	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
> +			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
>  	b->ip6h.version = 6;
>  	b->ip6h.nexthdr = IPPROTO_UDP;
>  	b->ip6h.hop_limit = 255;

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-06 16:10   ` Stefano Brivio
@ 2024-02-07 14:02     ` Laurent Vivier
  2024-02-07 14:57       ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-07 14:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 2/6/24 17:10, Stefano Brivio wrote:
> On Fri,  2 Feb 2024 15:11:28 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> ...
> diff --git a/iov.h b/iov.h
> new file mode 100644
> index 000000000000..31fbf6d0e1cf
> --- /dev/null
> +++ b/iov.h
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* some parts copied from QEMU include/qemu/iov.h */
> +
> +#ifndef IOVEC_H
> +#define IOVEC_H
> +
> +#include <unistd.h>
> +#include <string.h>
> +
> +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
> +			 size_t offset, const void *buf, size_t bytes);
> +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
> +		       size_t offset, void *buf, size_t bytes);
> +
> +static inline size_t iov_from_buf(const struct iovec *iov,
> +				  unsigned int iov_cnt, size_t offset,
> +				  const void *buf, size_t bytes)
> +{
> Is there a particular reason to include these two in a header? The
> compiler will inline as needed if they are in a source file.
>
This code has been introduced in QEMU by:

commit ad523bca56a7202d2498c550a41be5c986c4d33c
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Tue Dec 22 12:03:33 2015 +0100

     iov: avoid memcpy for "simple" iov_from_buf/iov_to_buf

     memcpy can take a large amount of time for small reads and writes.
     For virtio it is a common case that the first iovec can satisfy the
     whole read or write.  In that case, and if bytes is a constant to
     avoid excessive growth of code, inline the first iteration
     into the caller.

     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
     Message-id: 1450782213-14227-1-git-send-email-pbonzini@redhat.com
     Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Is the compiler able to check "bytes" is a constant and inline the function if the 
definition is in a .c file and not in a .h ?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/24] iov: add some functions to manage iovec
  2024-02-07 14:02     ` Laurent Vivier
@ 2024-02-07 14:57       ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-07 14:57 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Wed, 7 Feb 2024 15:02:42 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 2/6/24 17:10, Stefano Brivio wrote:
> > On Fri,  2 Feb 2024 15:11:28 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> > ...
> > diff --git a/iov.h b/iov.h
> > new file mode 100644
> > index 000000000000..31fbf6d0e1cf
> > --- /dev/null
> > +++ b/iov.h
> > @@ -0,0 +1,46 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* some parts copied from QEMU include/qemu/iov.h */
> > +
> > +#ifndef IOVEC_H
> > +#define IOVEC_H
> > +
> > +#include <unistd.h>
> > +#include <string.h>
> > +
> > +size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
> > +			 size_t offset, const void *buf, size_t bytes);
> > +size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
> > +		       size_t offset, void *buf, size_t bytes);
> > +
> > +static inline size_t iov_from_buf(const struct iovec *iov,
> > +				  unsigned int iov_cnt, size_t offset,
> > +				  const void *buf, size_t bytes)
> > +{
> > Is there a particular reason to include these two in a header? The
> > compiler will inline as needed if they are in a source file.
> >  
> This code has been introduced in QEMU by:
> 
> commit ad523bca56a7202d2498c550a41be5c986c4d33c
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Tue Dec 22 12:03:33 2015 +0100
> 
>      iov: avoid memcpy for "simple" iov_from_buf/iov_to_buf
> 
>      memcpy can take a large amount of time for small reads and writes.
>      For virtio it is a common case that the first iovec can satisfy the
>      whole read or write.  In that case, and if bytes is a constant to
>      avoid excessive growth of code, inline the first iteration
>      into the caller.
> 
>      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>      Message-id: 1450782213-14227-1-git-send-email-pbonzini@redhat.com
>      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Is the compiler able to check "bytes" is a constant and inline the function if the 
> definition is in a .c file and not in a .h ?

Well, those are two different aspects, but anyway, yes, both. Having it
in a header file implies that the compiler considers the problem
separately for every compilation unit (roughly every .c file, here). If
you move it into a source file, the compiler will instead apply some
heuristics to decide if it makes sense to inline and, if so, you end
up with essentially the same result.

If I apply this on top of your series:

---
diff --git a/iov.c b/iov.c
index 38a8e75..cabd6d0 100644
--- a/iov.c
+++ b/iov.c
@@ -76,3 +76,27 @@ unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
 	}
 	return j;
 }
+
+size_t iov_from_buf(const struct iovec *iov, unsigned int iov_cnt,
+		    size_t offset, const void *buf, size_t bytes)
+{
+	if (__builtin_constant_p(bytes) && iov_cnt &&
+		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
+		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
+		return bytes;
+	} else {
+		return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
+	}
+}
+
+size_t iov_to_buf(const struct iovec *iov, const unsigned int iov_cnt,
+		  size_t offset, void *buf, size_t bytes)
+{
+	if (__builtin_constant_p(bytes) && iov_cnt &&
+		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
+		memcpy(buf, (char *)iov[0].iov_base + offset, bytes);
+		return bytes;
+	} else {
+		return iov_to_buf_full(iov, iov_cnt, offset, buf, bytes);
+	}
+}
diff --git a/iov.h b/iov.h
index 31fbf6d..598c2ba 100644
--- a/iov.h
+++ b/iov.h
@@ -12,35 +12,14 @@ size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
 			 size_t offset, const void *buf, size_t bytes);
 size_t iov_to_buf_full(const struct iovec *iov, const unsigned int iov_cnt,
 		       size_t offset, void *buf, size_t bytes);
-
-static inline size_t iov_from_buf(const struct iovec *iov,
-				  unsigned int iov_cnt, size_t offset,
-				  const void *buf, size_t bytes)
-{
-	if (__builtin_constant_p(bytes) && iov_cnt &&
-		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
-		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
-		return bytes;
-	} else {
-		return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
-	}
-}
-
-static inline size_t iov_to_buf(const struct iovec *iov,
-				const unsigned int iov_cnt, size_t offset,
-				void *buf, size_t bytes)
-{
-	if (__builtin_constant_p(bytes) && iov_cnt &&
-		offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
-		memcpy(buf, (char *)iov[0].iov_base + offset, bytes);
-		return bytes;
-	} else {
-		return iov_to_buf_full(iov, iov_cnt, offset, buf, bytes);
-	}
-}
-
 size_t iov_size(const struct iovec *iov, const unsigned int iov_cnt);
 unsigned iov_copy(struct iovec *dst_iov, unsigned int dst_iov_cnt,
 		  const struct iovec *iov, unsigned int iov_cnt,
 		  size_t offset, size_t bytes);
+
+size_t iov_from_buf(const struct iovec *iov, unsigned int iov_cnt,
+		    size_t offset, const void *buf, size_t bytes);
+
+size_t iov_to_buf(const struct iovec *iov, const unsigned int iov_cnt,
+		  size_t offset, void *buf, size_t bytes);
 #endif
---

and have a look at it:

  $ CFLAGS="-g" make
  $ objdump -DxSs passt | less

you'll see it's not inlined, but the function simply resolves to a jump
to iov_from_buf_full():

0000000000020ac0 <iov_from_buf>:
                return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
   20ac0:       e9 db fd ff ff          jmp    208a0 <iov_from_buf_full>
   20ac5:       66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
   20acc:       00 00 00 00 

because in the usage you make of it (vu_send()), elem->in_sg is never a
constant.

If we look at the AVX2 version instead:

  $ objdump -DxSs passt.avx2 | less

iov_from_buf() directly becomes iov_from_buf_full() (inlined):

000000000002dfa0 <iov_from_buf>:
{
   2dfa0:       41 57                   push   %r15
   2dfa2:       41 56                   push   %r14
   2dfa4:       41 89 f6                mov    %esi,%r14d
        for (i = 0, done = 0; (offset || done < bytes) && i < iov_cnt; i++) {
   2dfa7:       4c 89 c6                mov    %r8,%rsi

and it's still called from vu_send() -- not inlined there,
probably because iov_from_buf_full() is too big.

In the end, the compiler decides to inline iov_from_buf_full() into
iov_from_buf(), to drop the "constant" implementation from it (because it
wouldn't be used), and I guess that makes sense.
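
For reference, the pattern under discussion can be reproduced outside of
passt with a minimal, self-contained sketch. It assumes GCC or Clang (for
__builtin_constant_p()), and the slow path below is a simplified
re-implementation written for illustration, not the code from the series:

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Slow path (illustrative): skip 'offset' bytes across the vector, then
 * copy up to 'bytes' bytes from 'buf' into the remaining space.
 */
size_t iov_from_buf_full(const struct iovec *iov, unsigned int iov_cnt,
			 size_t offset, const void *buf, size_t bytes)
{
	size_t done = 0;
	unsigned int i;

	for (i = 0; (offset || done < bytes) && i < iov_cnt; i++) {
		if (offset < iov[i].iov_len) {
			size_t len = iov[i].iov_len - offset;

			if (len > bytes - done)
				len = bytes - done;
			memcpy((char *)iov[i].iov_base + offset,
			       (const char *)buf + done, len);
			done += len;
			offset = 0;
		} else {
			offset -= iov[i].iov_len;
		}
	}
	return done;
}

/* Fast path: taken only if the compiler can prove 'bytes' is a constant
 * at this call site and the first element can hold the whole copy.
 * Otherwise, fall back to the generic loop above.
 */
static inline size_t iov_from_buf(const struct iovec *iov,
				  unsigned int iov_cnt, size_t offset,
				  const void *buf, size_t bytes)
{
	if (__builtin_constant_p(bytes) && iov_cnt &&
	    offset <= iov[0].iov_len && bytes <= iov[0].iov_len - offset) {
		memcpy((char *)iov[0].iov_base + offset, buf, bytes);
		return bytes;
	}
	return iov_from_buf_full(iov, iov_cnt, offset, buf, bytes);
}
```

Both paths return the same result for the same arguments, so whether the
fast path is kept at a given call site only affects the generated code,
not the behaviour.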

-- 
Stefano


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h
  2024-02-07 10:40   ` Stefano Brivio
@ 2024-02-07 23:43     ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-07 23:43 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

[-- Attachment #1: Type: text/plain, Size: 5159 bytes --]

On Wed, Feb 07, 2024 at 11:40:40AM +0100, Stefano Brivio wrote:
> On Fri,  2 Feb 2024 15:11:33 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > We can find the same function to compute the IPv4 header
> > checksum in tcp.c and udp.c
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  ip.h  | 14 ++++++++++++++
> >  tcp.c | 23 ++---------------------
> >  udp.c | 19 +------------------
> >  3 files changed, 17 insertions(+), 39 deletions(-)
> > 
> > diff --git a/ip.h b/ip.h
> > index b2e08bc049f3..ff7902c45a95 100644
> > --- a/ip.h
> > +++ b/ip.h
> > @@ -9,6 +9,8 @@
> >  #include <netinet/ip.h>
> >  #include <netinet/ip6.h>
> >  
> > +#include "checksum.h"
> > +
> >  #define IN4_IS_ADDR_UNSPECIFIED(a) \
> >  	((a)->s_addr == htonl_constant(INADDR_ANY))
> >  #define IN4_IS_ADDR_BROADCAST(a) \
> > @@ -83,4 +85,16 @@ struct ipv6_opt_hdr {
> >  
> >  char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
> >  		 size_t *dlen);
> > +static inline uint16_t ipv4_hdr_checksum(struct iphdr *iph, int proto)
> 
> A function comment would be nice. A couple of doubts:
> 
> - why is this an inline in ip.h, instead of a function in checksum.c?
>   That would be more natural, I think
> 
> - this would be the first Layer-4 protocol number passed as int: we use
>   uint8_t elsewhere. Now, socket(2) and similar all take an int, but
>   using uint8_t internally keeps large arrays such as tap4_l4 a bit
>   smaller.
> 
>   The only value defined in Linux UAPI exceeding eight bits is
>   IPPROTO_MPTCP, 262, because that's never on the wire (the TCP
>   protocol number is used instead). And we won't meet that either.

Right.  This is pretty much explicitly the IP protocol number, which
is an 8-bit field in the IPv4 header.
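
As a rough illustration of the point, here is a standalone sketch of an
IPv4 header checksum that takes the protocol as uint8_t. It works on a raw
20-byte header (no options) rather than on passt's buffers, and the
function names are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's complement accumulator down to 16 bits */
static uint16_t csum_fold_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Checksum of a 20-byte IPv4 header, in host order. The protocol byte is
 * taken from 'proto' and the checksum field is treated as zero, so the
 * header can be passed before either of them is filled in.
 */
uint16_t ipv4_hdr_checksum_sketch(const uint8_t *hdr, uint8_t proto)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < 20; i += 2) {
		uint16_t word = (uint16_t)(hdr[i] << 8 | hdr[i + 1]);

		if (i == 8)	/* TTL and protocol */
			word = (uint16_t)(hdr[8] << 8 | proto);
		if (i == 10)	/* checksum field itself */
			word = 0;

		sum += word;
	}

	return (uint16_t)~csum_fold_sketch(sum);
}
```

The protocol occupies a single byte at offset 9 of the header, so an
8-bit type is enough to cover every value that can appear there.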

>   In practice, it doesn't matter what we use here, but still uint8_t
>   would be consistent.
> 
> > +{
> > +	uint32_t sum = L2_BUF_IP4_PSUM(proto);
> > +
> > +	sum += iph->tot_len;
> > +	sum += (iph->saddr >> 16) & 0xffff;
> > +	sum += iph->saddr & 0xffff;
> > +	sum += (iph->daddr >> 16) & 0xffff;
> > +	sum += iph->daddr & 0xffff;
> > +
> > +	return ~csum_fold(sum);
> > +}
> >  #endif /* IP_H */
> > diff --git a/tcp.c b/tcp.c
> > index 4c9c5fb51c60..293ab12d8c21 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -934,23 +934,6 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
> >  		trace("TCP: failed to set SO_SNDBUF to %i", v);
> >  }
> >  
> > -/**
> > - * tcp_update_check_ip4() - Update IPv4 with variable parts from stored one
> > - * @buf:	L2 packet buffer with final IPv4 header
> > - */
> > -static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf)
> > -{
> > -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_TCP);
> > -
> > -	sum += buf->iph.tot_len;
> > -	sum += (buf->iph.saddr >> 16) & 0xffff;
> > -	sum += buf->iph.saddr & 0xffff;
> > -	sum += (buf->iph.daddr >> 16) & 0xffff;
> > -	sum += buf->iph.daddr & 0xffff;
> > -
> > -	buf->iph.check = (uint16_t)~csum_fold(sum);
> > -}
> > -
> >  /**
> >   * tcp_update_check_tcp4() - Update TCP checksum from stored one
> >   * @buf:	L2 packet buffer with final IPv4 header
> > @@ -1393,10 +1376,8 @@ do {									\
> >  		b->iph.saddr = a4->s_addr;
> >  		b->iph.daddr = c->ip4.addr_seen.s_addr;
> >  
> > -		if (check)
> > -			b->iph.check = *check;
> > -		else
> > -			tcp_update_check_ip4(b);
> > +		b->iph.check = check ? *check :
> > +				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
> >  
> >  		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
> >  
> > diff --git a/udp.c b/udp.c
> > index d514c864ab5b..6f867df81c05 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -270,23 +270,6 @@ static void udp_invert_portmap(struct udp_port_fwd *fwd)
> >  	}
> >  }
> >  
> > -/**
> > - * udp_update_check4() - Update checksum with variable parts from stored one
> > - * @buf:	L2 packet buffer with final IPv4 header
> > - */
> > -static void udp_update_check4(struct udp4_l2_buf_t *buf)
> > -{
> > -	uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_UDP);
> > -
> > -	sum += buf->iph.tot_len;
> > -	sum += (buf->iph.saddr >> 16) & 0xffff;
> > -	sum += buf->iph.saddr & 0xffff;
> > -	sum += (buf->iph.daddr >> 16) & 0xffff;
> > -	sum += buf->iph.daddr & 0xffff;
> > -
> > -	buf->iph.check = (uint16_t)~csum_fold(sum);
> > -}
> > -
> >  /**
> >   * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
> >   * @eth_d:	Ethernet destination address, NULL if unchanged
> > @@ -614,7 +597,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
> >  		b->iph.saddr = b->s_in.sin_addr.s_addr;
> >  	}
> >  
> > -	udp_update_check4(b);
> > +	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
> >  	b->uh.source = b->s_in.sin_port;
> >  	b->uh.dest = htons(dstport);
> >  	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch]
  2024-02-07  9:03     ` Stefano Brivio
@ 2024-02-08  0:04       ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-08  0:04 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

[-- Attachment #1: Type: text/plain, Size: 1134 bytes --]

On Wed, Feb 07, 2024 at 10:03:19AM +0100, Stefano Brivio wrote:
> Not related to the review of the patch itself:
> 
> On Mon, 5 Feb 2024 17:13:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Feb 02, 2024 at 03:11:32PM +0100, Laurent Vivier wrote:
> > 
> > > [...]
> > >
> > > +struct ipv6hdr {  
> > 
> > Not really in scope for this patch, but I have wondered if we should
> > try to use struct ip6_hdr from netinet/ip6.h instead of our own
> > version (derived, I think, from the kernel one).
> 
> The reason why I went with this is that the one in netinet/ip6.h looks
> fairly unusable to me: there are no explicit fields for version and
> priority, and names are long and a bit obscure, as defined by RFC 3542:
> does 'ctlun' actually mean "control union"?

Yeah, I did wonder about that.  There are a bunch of macros to make
things not so long, but the names do seem less natural.
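
For comparison, this is roughly what filling the fixed fields looks like
with struct ip6_hdr and the shorthand macros from netinet/ip6.h. It is
only a sketch (the helper name is invented for the example), and it shows
that the version has to be written by hand into the upper nibble of
ip6_vfc, since there is no separate version field:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: initialise the fixed IPv6 header fields for a TCP
 * payload of 'payload_len' bytes, using the glibc structure and macros.
 */
void ip6_hdr_init_sketch(struct ip6_hdr *h, uint16_t payload_len)
{
	memset(h, 0, sizeof(*h));

	h->ip6_vfc = 6 << 4;		/* version 6, traffic class bits zero */
	h->ip6_plen = htons(payload_len);
	h->ip6_nxt = IPPROTO_TCP;
	h->ip6_hlim = 255;
}
```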

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/24] tcp: extract buffer management from tcp_send_flag()
  2024-02-02 14:11 ` [PATCH 08/24] tcp: extract buffer management from tcp_send_flag() Laurent Vivier
  2024-02-06  0:24   ` David Gibson
@ 2024-02-08 16:57   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-08 16:57 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:35 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  tcp.c | 224 +++++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 129 insertions(+), 95 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 2fd6bc2eda53..20ad8a4e5271 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1320,87 +1320,98 @@ void tcp_defer_handler(struct ctx *c)
>  	tcp_l2_data_buf_flush(c);
>  }
>  
> +static void tcp_set_tcp_header(struct tcphdr *th,
> +			       const struct tcp_tap_conn *conn, uint32_t seq)

Ah, nice, thanks, it adds a few lines but it's much better than that
macro soup.

I think the names of this function and the following ones, though, are
now a bit less consistent: filling and setting are pretty much the same
thing, and ipv4_fill_headers() only works for TCP packets. What about
tcp_fill_header(), tcp_fill_ipv4_header(), and tcp_fill_ipv6_header()?
Or _set_ for everything.

> +{
> +	th->source = htons(conn->fport);
> +	th->dest = htons(conn->eport);
> +	th->seq = htonl(seq);
> +	th->ack_seq = htonl(conn->seq_ack_to_tap);
> +	if (conn->events & ESTABLISHED)	{
> +		th->window = htons(conn->wnd_to_tap);
> +	} else {
> +		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;
> +
> +		th->window = htons(MIN(wnd, USHRT_MAX));
> +	}
> +}
> +
>  /**
> - * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
> + * ipv4_fill_headers() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
>   * @c:		Execution context
>   * @conn:	Connection pointer
> - * @p:		Pointer to any type of TCP pre-cooked buffer
> + * @iph:	Pointer to IPv4 header, immediately followed by a TCP header
>   * @plen:	Payload length (including TCP header options)
>   * @check:	Checksum, if already known
>   * @seq:	Sequence number for this segment
>   *
> - * Return: frame length including L2 headers, host order
> + * Return: IP frame length including L2 headers, host order
>   */
> -static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> -				      const struct tcp_tap_conn *conn,
> -				      void *p, size_t plen,
> -				      const uint16_t *check, uint32_t seq)
> +
> +static size_t ipv4_fill_headers(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
> +				struct iphdr *iph, size_t plen,
> +				const uint16_t *check, uint32_t seq)
>  {
> +	struct tcphdr *th = (void *)(iph + 1);

We should check this against gcc 11.2, because I had a warning with the
previous attempt at this, below:

> -		/* gcc 11.2 would complain on data = (char *)(th + 1); */

Besides, void * will be promoted to struct tcphdr *, but can't we just
use the right cast right away? That is,

	struct tcphdr *th = (struct tcphdr *)(iph + 1);

Either way, this should go after the next line (declarations from
longest to shortest).

>  	const struct in_addr *a4 = inany_v4(&conn->faddr);
> -	size_t ip_len, tlen;
> -
> -#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq)			\
> -do {									\
> -	b->th.source = htons(conn->fport);				\
> -	b->th.dest = htons(conn->eport);				\
> -	b->th.seq = htonl(seq);						\
> -	b->th.ack_seq = htonl(conn->seq_ack_to_tap);			\
> -	if (conn->events & ESTABLISHED)	{				\
> -		b->th.window = htons(conn->wnd_to_tap);			\
> -	} else {							\
> -		unsigned wnd = conn->wnd_to_tap << conn->ws_to_tap;	\
> -									\
> -		b->th.window = htons(MIN(wnd, USHRT_MAX));		\
> -	}								\
> -} while (0)
> -
> -	if (a4) {
> -		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
> -
> -		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
> -		b->iph.tot_len = htons(ip_len);
> -		b->iph.saddr = a4->s_addr;
> -		b->iph.daddr = c->ip4.addr_seen.s_addr;
> -
> -		b->iph.check = check ? *check :
> -				       ipv4_hdr_checksum(&b->iph, IPPROTO_TCP);
> -
> -		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
> -
> -		b->th.check = tcp_update_check_tcp4(&b->iph);
> -
> -		tlen = tap_iov_len(c, &b->taph, ip_len);
> -	} else {
> -		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
> +	size_t ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);

...and this should be the first one.

>  
> -		ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
> +	iph->tot_len = htons(ip_len);
> +	iph->saddr = a4->s_addr;
> +	iph->daddr = c->ip4.addr_seen.s_addr;
>  
> -		b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr));
> -		b->ip6h.saddr = conn->faddr.a6;
> -		if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr))
> -			b->ip6h.daddr = c->ip6.addr_ll_seen;
> -		else
> -			b->ip6h.daddr = c->ip6.addr_seen;
> +	iph->check = check ? *check : ipv4_hdr_checksum(iph, IPPROTO_TCP);
> +
> +	tcp_set_tcp_header(th, conn, seq);
> +
> +	th->check = tcp_update_check_tcp4(iph);
> +
> +	return ip_len;
> +}
> +
> +/**
> + * ipv6_fill_headers() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @ip6h:	Pointer to IPv6 header, immediately followed by a TCP header
> + * @plen:	Payload length (including TCP header options)
> + * @check:	Checksum, if already known
> + * @seq:	Sequence number for this segment
> + *
> + * Return: IP frame length including L2 headers, host order
> + */
> +
> +static size_t ipv6_fill_headers(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
> +				struct ipv6hdr *ip6h, size_t plen,
> +				uint32_t seq)
> +{
> +	struct tcphdr *th = (void *)(ip6h + 1);
> +	size_t ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
>  
> -		memset(b->ip6h.flow_lbl, 0, 3);
> +	ip6h->payload_len = htons(plen + sizeof(struct tcphdr));
> +	ip6h->saddr = conn->faddr.a6;
> +	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
> +		ip6h->daddr = c->ip6.addr_ll_seen;
> +	else
> +		ip6h->daddr = c->ip6.addr_seen;
>  
> -		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
> +	memset(ip6h->flow_lbl, 0, 3);
>  
> -		b->th.check = tcp_update_check_tcp6(&b->ip6h);
> +	tcp_set_tcp_header(th, conn, seq);
>  
> -		b->ip6h.hop_limit = 255;
> -		b->ip6h.version = 6;
> -		b->ip6h.nexthdr = IPPROTO_TCP;
> +	th->check = tcp_update_check_tcp6(ip6h);
>  
> -		b->ip6h.flow_lbl[0] = (conn->sock >> 16) & 0xf;
> -		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
> -		b->ip6h.flow_lbl[2] = (conn->sock >> 0) & 0xff;
> +	ip6h->hop_limit = 255;
> +	ip6h->version = 6;
> +	ip6h->nexthdr = IPPROTO_TCP;
>  
> -		tlen = tap_iov_len(c, &b->taph, ip_len);
> -	}
> -#undef SET_TCP_HEADER_COMMON_V4_V6
> +	ip6h->flow_lbl[0] = (conn->sock >> 16) & 0xf;
> +	ip6h->flow_lbl[1] = (conn->sock >> 8) & 0xff;
> +	ip6h->flow_lbl[2] = (conn->sock >> 0) & 0xff;
>  
> -	return tlen;
> +	return ip_len;
>  }
>  
>  /**
> @@ -1520,27 +1531,21 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
>  }
>  
>  /**
> - * tcp_send_flag() - Send segment with flags to tap (no payload)
> + * do_tcp_send_flag() - Send segment with flags to tap (no payload)
>   * @c:		Execution context
>   * @conn:	Connection pointer
>   * @flags:	TCP flags: if not set, send segment only if ACK is due
>   *
>   * Return: negative error code on connection reset, 0 otherwise

This should be adjusted -- it took me a while to realise what you mean
by 0 and 1 now.

>   */
> -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +
> +static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
> +			    struct tcphdr *th, char *data, size_t optlen)

Maybe tcp_fill_flag_header() - Prepare header for flags-only segment
(no payload)?

>  {
>  	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
>  	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
> -	struct tcp4_l2_flags_buf_t *b4 = NULL;
> -	struct tcp6_l2_flags_buf_t *b6 = NULL;
>  	struct tcp_info tinfo = { 0 };
>  	socklen_t sl = sizeof(tinfo);
>  	int s = conn->sock;
> -	size_t optlen = 0;
> -	struct iovec *iov;
> -	struct tcphdr *th;
> -	char *data;
> -	void *p;
>  
>  	if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
>  	    !flags && conn->wnd_to_tap)
> @@ -1562,26 +1567,9 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	if (!tcp_update_seqack_wnd(c, conn, flags, &tinfo) && !flags)
>  		return 0;
>  
> -	if (CONN_V4(conn)) {
> -		iov = tcp4_l2_flags_iov    + tcp4_l2_flags_buf_used;
> -		p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> -		th = &b4->th;
> -
> -		/* gcc 11.2 would complain on data = (char *)(th + 1); */
> -		data = b4->opts;
> -	} else {
> -		iov = tcp6_l2_flags_iov    + tcp6_l2_flags_buf_used;
> -		p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> -		th = &b6->th;
> -		data = b6->opts;
> -	}
> -
>  	if (flags & SYN) {
>  		int mss;
>  
> -		/* Options: MSS, NOP and window scale (8 bytes) */
> -		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
> -

This is very header-specific, whereas tcp_send_flag() isn't anymore.
Perhaps the new function could take a pointer to optlen and change it
instead?
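
Something along these lines, as a rough sketch: the helper name and
signature are hypothetical, and the option kinds and lengths are the
standard TCP ones rather than copied from tcp.c. The callee writes the SYN
options and reports their length through the pointer, so the caller
doesn't need to know the layout:

```c
#include <stddef.h>
#include <stdint.h>

/* Standard TCP option kinds and lengths, defined here for the example */
#define OPT_NOP		1
#define OPT_MSS		2
#define OPT_MSS_LEN	4
#define OPT_WS		3
#define OPT_WS_LEN	3

/* Hypothetical helper: write MSS, NOP and window scale options to 'data'
 * and store the number of bytes written in '*optlen'.
 */
void tcp_fill_syn_opts_sketch(char *data, uint16_t mss, uint8_t ws,
			      size_t *optlen)
{
	char *p = data;

	*p++ = OPT_MSS;
	*p++ = OPT_MSS_LEN;
	*p++ = (char)(mss >> 8);	/* MSS value, network order */
	*p++ = (char)(mss & 0xff);

	*p++ = OPT_NOP;			/* pad window scale to 4 bytes */
	*p++ = OPT_WS;
	*p++ = OPT_WS_LEN;
	*p++ = (char)ws;

	*optlen = (size_t)(p - data);
}
```

With this shape, the caller starts from optlen = 0 and only the function
that actually writes the options knows they take 8 bytes.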

>  		*data++ = OPT_MSS;
>  		*data++ = OPT_MSS_LEN;
>  
> @@ -1624,9 +1612,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	th->syn = !!(flags & SYN);
>  	th->fin = !!(flags & FIN);
>  
> -	iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
> -					       NULL, conn->seq_to_tap);
> -
>  	if (th->ack) {
>  		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
>  			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
> @@ -1641,8 +1626,38 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	if (th->fin || th->syn)
>  		conn->seq_to_tap++;
>  
> +	return 1;
> +}
> +
> +static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	size_t optlen = 0;
> +	struct iovec *iov;
> +	size_t ip_len;
> +	int ret;
> +
> +	/* Options: MSS, NOP and window scale (8 bytes) */
> +	if (flags & SYN)
> +		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
> +
>  	if (CONN_V4(conn)) {
> +		struct tcp4_l2_flags_buf_t *b4;
> +
> +		iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
> +		b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> +
> +		ret = do_tcp_send_flag(c, conn, flags, &b4->th, b4->opts,
> +				       optlen);
> +		if (ret <= 0)
> +			return ret;
> +
> +		ip_len = ipv4_fill_headers(c, conn, &b4->iph, optlen,
> +					   NULL, conn->seq_to_tap);
> +
> +		iov->iov_len = tap_iov_len(c, &b4->taph, ip_len);
> +
>  		if (flags & DUP_ACK) {
> +
>  			memcpy(b4 + 1, b4, sizeof(*b4));
>  			(iov + 1)->iov_len = iov->iov_len;
>  			tcp4_l2_flags_buf_used++;
> @@ -1651,6 +1666,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
>  			tcp_l2_flags_buf_flush(c);
>  	} else {
> +		struct tcp6_l2_flags_buf_t *b6;
> +
> +		iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
> +		b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> +
> +		ret = do_tcp_send_flag(c, conn, flags, &b6->th, b6->opts,
> +				       optlen);
> +		if (ret <= 0)
> +			return ret;
> +
> +		ip_len = ipv6_fill_headers(c, conn, &b6->ip6h, optlen,
> +					   conn->seq_to_tap);
> +
> +		iov->iov_len = tap_iov_len(c, &b6->taph, ip_len);
> +
>  		if (flags & DUP_ACK) {
>  			memcpy(b6 + 1, b6, sizeof(*b6));
>  			(iov + 1)->iov_len = iov->iov_len;
> @@ -2050,6 +2080,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  {
>  	uint32_t *seq_update = &conn->seq_to_tap;
>  	struct iovec *iov;
> +	size_t ip_len;
>  
>  	if (CONN_V4(conn)) {
>  		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> @@ -2058,9 +2089,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
>  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
>  
> +		ip_len = ipv4_fill_headers(c, conn, &b->iph, plen,
> +					   check, seq);
> +
>  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> -						       check, seq);
> +		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
>  			tcp_l2_data_buf_flush(c);
>  	} else if (CONN_V6(conn)) {
> @@ -2069,9 +2102,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
>  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
>  
> +		ip_len = ipv6_fill_headers(c, conn, &b->ip6h, plen, seq);
> +
>  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> -						       NULL, seq);
> +		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
>  		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
>  			tcp_l2_data_buf_flush(c);
>  	}

-- 
Stefano


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss()
  2024-02-02 14:11 ` [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss() Laurent Vivier
  2024-02-06  0:47   ` David Gibson
@ 2024-02-08 16:59   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-08 16:59 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:36 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  tcp.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 20ad8a4e5271..cdbceed65033 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1813,6 +1813,14 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>  	return s;
>  }
>  
> +static uint16_t tcp_buf_conn_tap_mss(const struct tcp_tap_conn *conn)

I was trying to propose a more descriptive name for this, then I
realised I don't understand why you need it: tcp_vu_conn_tap_mss(),
added in 22/24, simply returns USHRT_MAX. But then can't we just do
something like:

	if (c->mode == MODE_VU)
		mss = MIN(MSS_VU, mss);
	else if (CONN_V4(conn))
		...

with #define MSS_VU USHRT_MAX?
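
A minimal, self-contained sketch of the clamp suggested above. The enum, the MSS4/MSS6 values and the helper name mss_clamp() are stand-ins chosen for illustration (assumptions, not the definitions from tcp.c); only the shape of the if/else chain is taken from the suggestion:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-ins for passt definitions: the values are illustrative assumptions */
enum mode { MODE_PASST, MODE_PASTA, MODE_VU };
#define MSS4	65495U	/* hypothetical, not the value computed in tcp.c */
#define MSS6	65475U	/* hypothetical, not the value computed in tcp.c */
#define MSS_VU	USHRT_MAX
#define MIN(x, y)	(((x) < (y)) ? (x) : (y))

/* Whatever limit is picked, the result has to fit the 16-bit MSS option */
static_assert(MSS4 <= USHRT_MAX && MSS6 <= USHRT_MAX,
	      "MSS limits must fit in 16 bits");

/* mss_clamp() - Clamp the MSS from the guest to the limit of the backend
 * (hypothetical helper: in tcp.c this logic sits in tcp_conn_tap_mss())
 */
static uint16_t mss_clamp(enum mode mode, int v4, unsigned int mss)
{
	if (mode == MODE_VU)
		mss = MIN(MSS_VU, mss);
	else if (v4)
		mss = MIN(MSS4, mss);
	else
		mss = MIN(MSS6, mss);

	return MIN(mss, USHRT_MAX);
}
```

With this shape, a separate per-backend helper returning the limit would not be needed just to pick the MSS.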

> +{
> +	if (CONN_V4(conn))
> +		return MSS4;
> +
> +	return MSS6;
> +}
> +
>  /**
>   * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
>   * @conn:	Connection pointer
> @@ -1832,10 +1840,7 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
>  	else
>  		mss = ret;
>  
> -	if (CONN_V4(conn))
> -		mss = MIN(MSS4, mss);
> -	else
> -		mss = MIN(MSS6, mss);
> +	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
>  
>  	return MIN(mss, USHRT_MAX);
>  }

-- 
Stefano


* Re: [PATCH 10/24] tcp: rename functions that manage buffers
  2024-02-06  1:48   ` David Gibson
@ 2024-02-08 17:10     ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-08 17:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Tue, 6 Feb 2024 12:48:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:37PM +0100, Laurent Vivier wrote:
> > To separate these functions from the ones specific to TCP management,
> > we are going to move it to a new file, but before that update their names
> > to reflect their role.
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  passt.c |  2 +-
> >  tcp.c   | 84 ++++++++++++++++++++++++++++-----------------------------
> >  tcp.h   |  2 +-
> >  3 files changed, 44 insertions(+), 44 deletions(-)
> > 
> > diff --git a/passt.c b/passt.c
> > index 44d3a0b0548c..10042a9b9789 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -164,7 +164,7 @@ static void timer_init(struct ctx *c, const struct timespec *now)
> >   */
> >  void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> >  {
> > -	tcp_update_l2_buf(eth_d, eth_s);
> > +	tcp_buf_update_l2(eth_d, eth_s);
> >  	udp_update_l2_buf(eth_d, eth_s);
> >  }
> >  
> > diff --git a/tcp.c b/tcp.c
> > index cdbceed65033..640209533772 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -383,7 +383,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
> >  #define ACK		(1 << 4)
> >  /* Flags for internal usage */
> >  #define DUP_ACK		(1 << 5)
> > -#define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> > +#define ACK_IF_NEEDED	0		/* See tcp_buf_send_flag() */
> >  
> >  #define OPT_EOL		0
> >  #define OPT_NOP		1
> > @@ -960,11 +960,11 @@ static uint16_t tcp_update_check_tcp6(struct ipv6hdr *ip6h)
> >  }
> >  
> >  /**
> > - * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
> > + * tcp_buf_update_l2() - Update L2 buffers with Ethernet and IPv4 addresses
> >   * @eth_d:	Ethernet destination address, NULL if unchanged
> >   * @eth_s:	Ethernet source address, NULL if unchanged
> >   */
> > -void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> > +void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
> >  {
> >  	int i;
> >  
> > @@ -982,10 +982,10 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> >  }
> >  
> >  /**
> > - * tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
> > + * tcp_buf_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
> >   * @c:		Execution context
> >   */
> > -static void tcp_sock4_iov_init(const struct ctx *c)
> > +static void tcp_buf_sock4_iov_init(const struct ctx *c)
> >  {
> >  	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
> >  	struct iovec *iov;
> > @@ -1014,10 +1014,10 @@ static void tcp_sock4_iov_init(const struct ctx *c)
> >  }
> >  
> >  /**
> > - * tcp_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
> > + * tcp_buf_sock6_iov_init() - Initialise scatter-gather L2 buffers for IPv6 sockets
> >   * @c:		Execution context
> >   */
> > -static void tcp_sock6_iov_init(const struct ctx *c)
> > +static void tcp_buf_sock6_iov_init(const struct ctx *c)
> >  {
> >  	struct iovec *iov;
> >  	int i;
> > @@ -1277,10 +1277,10 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> >  	} while (0)
> >  
> >  /**
> > - * tcp_l2_flags_buf_flush() - Send out buffers for segments with no data (flags)
> > + * tcp_buf_l2_flags_flush() - Send out buffers for segments with no data (flags)
> >   * @c:		Execution context
> >   */
> > -static void tcp_l2_flags_buf_flush(const struct ctx *c)
> > +static void tcp_buf_l2_flags_flush(const struct ctx *c)
> >  {
> >  	tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used);
> >  	tcp6_l2_flags_buf_used = 0;
> > @@ -1290,10 +1290,10 @@ static void tcp_l2_flags_buf_flush(const struct ctx *c)
> >  }
> >  
> >  /**
> > - * tcp_l2_data_buf_flush() - Send out buffers for segments with data
> > + * tcp_buf_l2_data_flush() - Send out buffers for segments with data
> >   * @c:		Execution context
> >   */
> > -static void tcp_l2_data_buf_flush(const struct ctx *c)
> > +static void tcp_buf_l2_data_flush(const struct ctx *c)
> >  {
> >  	unsigned i;
> >  	size_t m;
> > @@ -1316,8 +1316,8 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> >  /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
> >  void tcp_defer_handler(struct ctx *c)
> >  {
> > -	tcp_l2_flags_buf_flush(c);
> > -	tcp_l2_data_buf_flush(c);
> > +	tcp_buf_l2_flags_flush(c);
> > +	tcp_buf_l2_data_flush(c);
> >  }
> >  
> >  static void tcp_set_tcp_header(struct tcphdr *th,
> > @@ -1629,7 +1629,7 @@ static int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
> >  	return 1;
> >  }
> >  
> > -static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> > +static int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)  
> 
> The functions above could reasonably be said to be part of the buffer
> management, but I'm not convinced on this one - its primary purpose
> is to, well, send a flag, so it uses the buffers, but I wouldn't
> really say it manages them.

Right, this and tcp_data_from_sock() below really implement TCP logic.
I'd be happy if we could avoid this patch and 11/24, but I didn't reach
the point where you need them, yet.

> >  {
> >  	size_t optlen = 0;
> >  	struct iovec *iov;
> > @@ -1664,7 +1664,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  		}
> >  
> >  		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
> > -			tcp_l2_flags_buf_flush(c);
> > +			tcp_buf_l2_flags_flush(c);
> >  	} else {
> >  		struct tcp6_l2_flags_buf_t *b6;
> >  
> > @@ -1688,7 +1688,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  		}
> >  
> >  		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
> > -			tcp_l2_flags_buf_flush(c);
> > +			tcp_buf_l2_flags_flush(c);
> >  	}
> >  
> >  	return 0;
> > @@ -1704,7 +1704,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
> >  	if (conn->events == CLOSED)
> >  		return;
> >  
> > -	if (!tcp_send_flag(c, conn, RST))
> > +	if (!tcp_buf_send_flag(c, conn, RST))
> >  		conn_event(c, conn, CLOSED);
> >  }
> >  
> > @@ -2024,7 +2024,7 @@ static void tcp_conn_from_tap(struct ctx *c,
> >  	} else {
> >  		tcp_get_sndbuf(conn);
> >  
> > -		if (tcp_send_flag(c, conn, SYN | ACK))
> > +		if (tcp_buf_send_flag(c, conn, SYN | ACK))
> >  			return;
> >  
> >  		conn_event(c, conn, TAP_SYN_ACK_SENT);
> > @@ -2100,7 +2100,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> >  		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
> >  		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
> > -			tcp_l2_data_buf_flush(c);
> > +			tcp_buf_l2_data_flush(c);
> >  	} else if (CONN_V6(conn)) {
> >  		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> >  
> > @@ -2112,12 +2112,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> >  		iov->iov_len = tap_iov_len(c, &b->taph, ip_len);
> >  		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
> > -			tcp_l2_data_buf_flush(c);
> > +			tcp_buf_l2_data_flush(c);
> >  	}
> >  }
> >  
> >  /**
> > - * tcp_data_from_sock() - Handle new data from socket, queue to tap, in window
> > + * tcp_buf_data_from_sock() - Handle new data from socket, queue to tap, in window  
> 
> Same with this one.
> 
> >   * @c:		Execution context
> >   * @conn:	Connection pointer
> >   *
> > @@ -2125,7 +2125,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * #syscalls recvmsg
> >   */
> > -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > +static int tcp_buf_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> > @@ -2169,7 +2169,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
> >  	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
> > -		tcp_l2_data_buf_flush(c);
> > +		tcp_buf_l2_data_flush(c);
> >  
> >  		/* Silence Coverity CWE-125 false positive */
> >  		tcp4_l2_buf_used = tcp6_l2_buf_used = 0;
> > @@ -2195,7 +2195,7 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  	if (!len) {
> >  		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> > -			if ((ret = tcp_send_flag(c, conn, FIN | ACK))) {
> > +			if ((ret = tcp_buf_send_flag(c, conn, FIN | ACK))) {
> >  				tcp_rst(c, conn);
> >  				return ret;
> >  			}
> > @@ -2378,7 +2378,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  			   max_ack_seq, conn->seq_to_tap);
> >  		conn->seq_ack_from_tap = max_ack_seq;
> >  		conn->seq_to_tap = max_ack_seq;
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_buf_data_from_sock(c, conn);
> >  	}
> >  
> >  	if (!iov_i)
> > @@ -2394,14 +2394,14 @@ eintr:
> >  			 *   Then swiftly looked away and left.
> >  			 */
> >  			conn->seq_from_tap = seq_from_tap;
> > -			tcp_send_flag(c, conn, ACK);
> > +			tcp_buf_send_flag(c, conn, ACK);
> >  		}
> >  
> >  		if (errno == EINTR)
> >  			goto eintr;
> >  
> >  		if (errno == EAGAIN || errno == EWOULDBLOCK) {
> > -			tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > +			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> >  			return p->count - idx;
> >  
> >  		}
> > @@ -2411,7 +2411,7 @@ eintr:
> >  	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
> >  		partial_send = 1;
> >  		conn->seq_from_tap += n;
> > -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> >  	} else {
> >  		conn->seq_from_tap += n;
> >  	}
> > @@ -2424,7 +2424,7 @@ out:
> >  		 */
> >  		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
> >  			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
> > -			tcp_send_flag(c, conn, DUP_ACK);
> > +			tcp_buf_send_flag(c, conn, DUP_ACK);
> >  		}
> >  		return p->count - idx;
> >  	}
> > @@ -2438,7 +2438,7 @@ out:
> >  
> >  		conn_event(c, conn, TAP_FIN_RCVD);
> >  	} else {
> > -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> >  	}
> >  
> >  	return p->count - idx;
> > @@ -2474,8 +2474,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> >  	/* The client might have sent data already, which we didn't
> >  	 * dequeue waiting for SYN,ACK from tap -- check now.
> >  	 */
> > -	tcp_data_from_sock(c, conn);
> > -	tcp_send_flag(c, conn, ACK);
> > +	tcp_buf_data_from_sock(c, conn);
> > +	tcp_buf_send_flag(c, conn, ACK);
> >  }
> >  
> >  /**
> > @@ -2555,7 +2555,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  			conn->seq_from_tap++;
> >  
> >  			shutdown(conn->sock, SHUT_WR);
> > -			tcp_send_flag(c, conn, ACK);
> > +			tcp_buf_send_flag(c, conn, ACK);
> >  			conn_event(c, conn, SOCK_FIN_SENT);
> >  
> >  			return 1;
> > @@ -2566,7 +2566,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  
> >  		tcp_tap_window_update(conn, ntohs(th->window));
> >  
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_buf_data_from_sock(c, conn);
> >  
> >  		if (p->count - idx == 1)
> >  			return 1;
> > @@ -2596,7 +2596,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
> >  		shutdown(conn->sock, SHUT_WR);
> >  		conn_event(c, conn, SOCK_FIN_SENT);
> > -		tcp_send_flag(c, conn, ACK);
> > +		tcp_buf_send_flag(c, conn, ACK);
> >  		ack_due = 0;
> >  	}
> >  
> > @@ -2630,7 +2630,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> >  		return;
> >  	}
> >  
> > -	if (tcp_send_flag(c, conn, SYN | ACK))
> > +	if (tcp_buf_send_flag(c, conn, SYN | ACK))
> >  		return;
> >  
> >  	conn_event(c, conn, TAP_SYN_ACK_SENT);
> > @@ -2698,7 +2698,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
> >  
> >  	conn->wnd_from_tap = WINDOW_DEFAULT;
> >  
> > -	tcp_send_flag(c, conn, SYN);
> > +	tcp_buf_send_flag(c, conn, SYN);
> >  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >  
> >  	tcp_get_sndbuf(conn);
> > @@ -2762,7 +2762,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  		return;
> >  
> >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > +		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> >  		tcp_timer_ctl(c, conn);
> >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> >  		if (!(conn->events & ESTABLISHED)) {
> > @@ -2778,7 +2778,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  			flow_dbg(conn, "ACK timeout, retry");
> >  			conn->retrans++;
> >  			conn->seq_to_tap = conn->seq_ack_from_tap;
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_buf_data_from_sock(c, conn);
> >  			tcp_timer_ctl(c, conn);
> >  		}
> >  	} else {
> > @@ -2833,7 +2833,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> >  			conn_event(c, conn, SOCK_FIN_RCVD);
> >  
> >  		if (events & EPOLLIN)
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_buf_data_from_sock(c, conn);
> >  
> >  		if (events & EPOLLOUT)
> >  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> > @@ -3058,10 +3058,10 @@ int tcp_init(struct ctx *c)
> >  		tc_hash[b] = FLOW_SIDX_NONE;
> >  
> >  	if (c->ifi4)
> > -		tcp_sock4_iov_init(c);
> > +		tcp_buf_sock4_iov_init(c);
> >  
> >  	if (c->ifi6)
> > -		tcp_sock6_iov_init(c);
> > +		tcp_buf_sock6_iov_init(c);
> >  
> >  	memset(init_sock_pool4,		0xff,	sizeof(init_sock_pool4));
> >  	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
> > diff --git a/tcp.h b/tcp.h
> > index b9f546d31002..e7dbcfa2ddbd 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -23,7 +23,7 @@ int tcp_init(struct ctx *c);
> >  void tcp_timer(struct ctx *c, const struct timespec *now);
> >  void tcp_defer_handler(struct ctx *c);
> >  
> > -void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
> > +void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s);
> >  
> >  /**
> >   * union tcp_epoll_ref - epoll reference portion for TCP connections  
> 

-- 
Stefano


* Re: [PATCH 12/24] tap: make tap_update_mac() generic
  2024-02-06  1:49   ` David Gibson
@ 2024-02-08 17:10     ` Stefano Brivio
  2024-02-09  5:02       ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-08 17:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Tue, 6 Feb 2024 12:49:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:39PM +0100, Laurent Vivier wrote:
> > Use ethhdr rather than tap_hdr.
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> I'd be happy to see this applied immediately, in advance of the rest
> of the series.

Oh, hm, do you need this for something around the flow table? I just
have one nit below:

> > ---
> >  tap.c     | 6 +++---
> >  tap.h     | 2 +-
> >  tcp_buf.c | 8 ++++----
> >  udp.c     | 4 ++--
> >  4 files changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 3ea03f720d6d..29f389057ac1 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -447,13 +447,13 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
> >   * @eth_d:	Ethernet destination address, NULL if unchanged
> >   * @eth_s:	Ethernet source address, NULL if unchanged
> >   */
> > -void tap_update_mac(struct tap_hdr *taph,
> > +void eth_update_mac(struct ethhdr *eh,

...function comment should be updated accordingly.

> >  		    const unsigned char *eth_d, const unsigned char *eth_s)
> >  {
> >  	if (eth_d)
> > -		memcpy(taph->eh.h_dest, eth_d, sizeof(taph->eh.h_dest));
> > +		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
> >  	if (eth_s)
> > -		memcpy(taph->eh.h_source, eth_s, sizeof(taph->eh.h_source));
> > +		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
> >  }
> >  
> >  PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > diff --git a/tap.h b/tap.h
> > index 466d91466c3d..437b9aa2b43f 100644
> > --- a/tap.h
> > +++ b/tap.h
> > @@ -74,7 +74,7 @@ void tap_icmp6_send(const struct ctx *c,
> >  		    const void *in, size_t len);
> >  int tap_send(const struct ctx *c, const void *data, size_t len);
> >  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n);
> > -void tap_update_mac(struct tap_hdr *taph,
> > +void eth_update_mac(struct ethhdr *eh,
> >  		    const unsigned char *eth_d, const unsigned char *eth_s);
> >  void tap_listen_handler(struct ctx *c, uint32_t events);
> >  void tap_handler_pasta(struct ctx *c, uint32_t events,
> > diff --git a/tcp_buf.c b/tcp_buf.c
> > index d70e7f810e4a..4c1f00c1d1b2 100644
> > --- a/tcp_buf.c
> > +++ b/tcp_buf.c
> > @@ -218,10 +218,10 @@ void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
> >  		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
> >  		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
> >  
> > -		tap_update_mac(&b4->taph, eth_d, eth_s);
> > -		tap_update_mac(&b6->taph, eth_d, eth_s);
> > -		tap_update_mac(&b4f->taph, eth_d, eth_s);
> > -		tap_update_mac(&b6f->taph, eth_d, eth_s);
> > +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> > +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> > +		eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
> > +		eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
> >  	}
> >  }
> >  
> > diff --git a/udp.c b/udp.c
> > index 96b4e6ca9a85..db635742319b 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -283,8 +283,8 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> >  		struct udp4_l2_buf_t *b4 = &udp4_l2_buf[i];
> >  		struct udp6_l2_buf_t *b6 = &udp6_l2_buf[i];
> >  
> > -		tap_update_mac(&b4->taph, eth_d, eth_s);
> > -		tap_update_mac(&b6->taph, eth_d, eth_s);
> > +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> > +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> >  	}
> >  }
> >    
> 

-- 
Stefano


* Re: [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx()
  2024-02-02 14:11 ` [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx() Laurent Vivier
@ 2024-02-09  4:26   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-09  4:26 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:48PM +0100, Laurent Vivier wrote:
> Check the buffer address is correctly in the mmap'ed memory.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  packet.c     |  6 +++++
>  packet.h     |  2 ++
>  tap.c        | 39 +++++++++++++++++++++++++++----
>  tap.h        |  1 +
>  vhost_user.c | 66 ++++++++++++++++++++++++++++++++--------------------
>  5 files changed, 84 insertions(+), 30 deletions(-)
> 
> diff --git a/packet.c b/packet.c
> index af2a539a1794..3c5fc39df6d7 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -25,6 +25,12 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			      const char *start, const char *func, int line)
>  {
> +	ASSERT(p->buf);
> +
> +	if (p->buf_size == 0)

So, IIUC, you're using p->buf_size == 0 essentially as a flag to
indicate that this packet pool references packets in an external
(i.e. guest, for vhost-user) pool rather than passt-allocated memory.
Some comments on the data structure definition, and/or some "theory of
operation" comments describing the two cases would probably help to
make it easier to follow.
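
As a rough sketch of what such comments might say: the field list below mirrors struct pool as it appears in packet.h around this series (an assumption, only the last field is visible in the hunk), the descriptions of the two cases are a reading of this patch, and pool_is_external() is a hypothetical helper that just gives the flag a name:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/**
 * struct pool - Pool of packets stored in, or referenced from, a buffer
 * @buf:	If @buf_size is not 0: buffer allocated by passt, all the
 *		packets must lie inside it. If @buf_size is 0: opaque pointer
 *		describing external (guest) memory, packets are checked
 *		by vu_packet_check_range() instead
 * @buf_size:	Size of @buf in bytes, 0 for pools referencing guest memory
 * @size:	Number of usable descriptors for the pool
 * @count:	Number of used descriptors for the pool
 * @pkt:	Descriptors: start address and length of each packet
 */
struct pool {
	char *buf;
	size_t buf_size;
	size_t size;
	size_t count;
	struct iovec pkt[1];
};

/* Hypothetical helper naming the flag tested by packet_check_range() */
static int pool_is_external(const struct pool *p)
{
	assert(p->buf);		/* both cases need a valid pointer */

	return p->buf_size == 0;
}
```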

> +		return vu_packet_check_range((void *)p->buf, offset, len, start,
> +					     func, line);
> +
>  	if (start < p->buf) {
>  		if (func) {
>  			trace("add packet start %p before buffer start %p, "
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..0aec6d9410aa 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -22,6 +22,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start, const char *func, int line);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/tap.c b/tap.c
> index c2a917bc00ca..930e48689497 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -626,7 +626,7 @@ resume:
>  		if (!eh)
>  			continue;
>  		if (ntohs(eh->h_proto) == ETH_P_ARP) {
> -			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
> +			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
>  
>  			packet_add(pkt, l2_len, (char *)eh);
>  			arp(c, pkt);
> @@ -656,7 +656,7 @@ resume:
>  			continue;
>  
>  		if (iph->protocol == IPPROTO_ICMP) {
> -			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
> +			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
>  
>  			if (c->no_icmp)
>  				continue;
> @@ -675,7 +675,7 @@ resume:
>  			continue;
>  
>  		if (iph->protocol == IPPROTO_UDP) {
> -			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
> +			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
>  
>  			packet_add(pkt, l2_len, (char *)eh);
>  			if (dhcp(c, pkt))
> @@ -815,7 +815,7 @@ resume:
>  		}
>  
>  		if (proto == IPPROTO_ICMPV6) {
> -			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
> +			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
>  
>  			if (c->no_icmp)
>  				continue;
> @@ -839,7 +839,7 @@ resume:
>  		uh = (struct udphdr *)l4h;
>  
>  		if (proto == IPPROTO_UDP) {
> -			PACKET_POOL_P(pkt, 1, in->buf, sizeof(pkt_buf));
> +			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
>  
>  			packet_add(pkt, l4_len, l4h);
>  
> @@ -1291,6 +1291,23 @@ static void tap_sock_tun_init(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
>  
> +void tap_sock_update_buf(void *base, size_t size)
> +{
> +	int i;
> +
> +	pool_tap4_storage.buf = base;
> +	pool_tap4_storage.buf_size = size;
> +	pool_tap6_storage.buf = base;
> +	pool_tap6_storage.buf_size = size;
> +
> +	for (i = 0; i < TAP_SEQS; i++) {
> +		tap4_l4[i].p.buf = base;
> +		tap4_l4[i].p.buf_size = size;
> +		tap6_l4[i].p.buf = base;
> +		tap6_l4[i].p.buf_size = size;
> +	}
> +}
> +
>  /**
>   * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
>   * @c:		Execution context
> @@ -1302,10 +1319,22 @@ void tap_sock_init(struct ctx *c)
>  
>  	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
>  	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
> +	if (c->mode == MODE_VU) {
> +		pool_tap4_storage.buf = NULL;
> +		pool_tap4_storage.buf_size = 0;
> +		pool_tap6_storage.buf = NULL;
> +		pool_tap6_storage.buf_size = 0;
> +	}
>  
>  	for (i = 0; i < TAP_SEQS; i++) {
>  		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
>  		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> +		if (c->mode == MODE_VU) {
> +			tap4_l4[i].p.buf = NULL;
> +			tap4_l4[i].p.buf_size = 0;
> +			tap6_l4[i].p.buf = NULL;
> +			tap6_l4[i].p.buf_size = 0;
> +		}

Can't you use your tap_sock_update_buf() function above to do this
initialization?
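
Something along these lines, as a reduced sketch. The pool structure, TAP_SEQS, pkt_buf and the function tap_pools_init() are simplified stand-ins (assumptions for illustration), and the assert() is not in the patch; tap_sock_update_buf() otherwise has the same body as in the hunk above:

```c
#include <assert.h>
#include <stddef.h>

#define TAP_SEQS	4	/* hypothetical, the real value is in tap.c */

enum mode { MODE_PASST, MODE_VU };	/* reduced stand-in */

struct pool {				/* reduced stand-in for struct pool */
	char *buf;
	size_t buf_size;
};

static char pkt_buf[4096];		/* stand-in for the shared buffer */
static struct pool pool_tap4_storage, pool_tap6_storage;
static struct { struct pool p; } tap4_l4[TAP_SEQS], tap6_l4[TAP_SEQS];

/* As in the patch: repoint every tap pool at @base, with size @size */
void tap_sock_update_buf(void *base, size_t size)
{
	int i;

	/* Sketch-only check: a NULL base only makes sense with size 0 */
	assert(base || !size);

	pool_tap4_storage.buf = base;
	pool_tap4_storage.buf_size = size;
	pool_tap6_storage.buf = base;
	pool_tap6_storage.buf_size = size;

	for (i = 0; i < TAP_SEQS; i++) {
		tap4_l4[i].p.buf = base;
		tap4_l4[i].p.buf_size = size;
		tap6_l4[i].p.buf = base;
		tap6_l4[i].p.buf_size = size;
	}
}

/* Reduced tap_sock_init(): pools start on pkt_buf, vhost-user mode then
 * clears them through the helper instead of open-coded assignments
 */
void tap_pools_init(enum mode mode)
{
	tap_sock_update_buf(pkt_buf, sizeof(pkt_buf));

	if (mode == MODE_VU)
		tap_sock_update_buf(NULL, 0);
}
```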

>  	}
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
> diff --git a/tap.h b/tap.h
> index ee839d4f09dc..6823c9b32313 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -82,6 +82,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_buf(void *base, size_t size);
>  void tap_sock_init(struct ctx *c);
>  void pool_flush_all(void);
>  void tap_handler_all(struct ctx *c, const struct timespec *now);
> diff --git a/vhost_user.c b/vhost_user.c
> index 2acd72398e3a..9cc07c8312c0 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -334,6 +334,25 @@ static bool map_ring(VuDev *vdev, VuVirtq *vq)
>  	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
>  }
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len, const char *start,
> +			  const char *func, int line)
> +{
> +	VuDevRegion *dev_region;
> +

Ah... and, IIUC, in the indirect buffer case, the buf pointer in the
pool is a pointer to a vector of VuDevRegion rather than a buffer.
I think I'd prefer to see struct pool changed to include a union to
make it clear that there are two quite different interpretations of
the buf pointer.
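
One possible shape for that, purely as a sketch: the explicit type tag, the member names and the two accessors are hypothetical, and struct vu_region is a reduced stand-in for VuDevRegion. The descriptor fields of the real struct pool are left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for VuDevRegion, the real one is in vhost_user.h */
struct vu_region {
	uint64_t mmap_addr;
	uint64_t size;
};

/* Hypothetical discriminator, not present in the patch */
enum pool_type { POOL_LINEAR, POOL_VU };

struct pool {
	enum pool_type type;
	union {
		struct {
			char *buf;		/* passt-allocated buffer */
			size_t buf_size;	/* its size in bytes */
		} linear;
		struct vu_region *regions;	/* guest memory regions */
	} u;
};

/* Accessors make a mix-up of the two interpretations fail loudly */
static char *pool_linear_buf(const struct pool *p)
{
	assert(p->type == POOL_LINEAR);
	return p->u.linear.buf;
}

static struct vu_region *pool_vu_regions(const struct pool *p)
{
	assert(p->type == POOL_VU);
	return p->u.regions;
}
```

An explicit tag costs one field, but it avoids overloading buf_size == 0 as the only hint about what the pointer means.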

> +	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
> +		if ((char *)dev_region->mmap_addr <= start &&
> +		    start + offset + len < (char *)dev_region->mmap_addr +
> +					   dev_region->mmap_offset +
> +					   dev_region->size)
> +			return 0;
> +	}
> +	if (func) {
> +		trace("cannot find region, %s:%i", func, line);
> +	}
> +
> +	return -1;
> +}
> +
>  /*
>   * #syscalls:passt mmap munmap
>   */
> @@ -400,6 +419,12 @@ static bool vu_set_mem_table_exec(VuDev *vdev,
>  		}
>  	}
>  
> +	/* XXX */

What's this XXX for?

> +	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
> +	vdev->regions[vdev->nregions].mmap_addr = 0; /* mark EOF for vu_packet_check_range() */
> +
> +	tap_sock_update_buf(vdev->regions, 0);

If you use a union, you could make the pool point to a whole VuDev
with nregions as well as the actual region list and bounds check
without needing this hack.
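
A sketch of a check bounded by nregions rather than by a zero mmap_addr terminator. struct vu_dev, struct vu_region, VU_MAX_REGIONS and the function name are reduced stand-ins (assumptions, not the types from vhost_user.h); the comparison itself is kept as in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VU_MAX_REGIONS	8	/* stand-in for VHOST_USER_MAX_RAM_SLOTS */

struct vu_region {		/* reduced stand-in for VuDevRegion */
	uintptr_t mmap_addr;
	size_t mmap_offset;
	size_t size;
};

struct vu_dev {			/* reduced stand-in for VuDev */
	unsigned int nregions;
	struct vu_region regions[VU_MAX_REGIONS];
};

/* Return 0 if @len bytes at @start + @offset fit in one region, -1 if not.
 * The loop is bounded by @nregions, so no end marker is needed.
 */
int vu_check_range(const struct vu_dev *vdev, const char *start,
		   size_t offset, size_t len)
{
	unsigned int i;

	assert(vdev->nregions <= VU_MAX_REGIONS);

	for (i = 0; i < vdev->nregions; i++) {
		const struct vu_region *r = &vdev->regions[i];
		uintptr_t end = r->mmap_addr + r->mmap_offset + r->size;

		/* Same comparison as in the patch */
		if (r->mmap_addr <= (uintptr_t)start &&
		    (uintptr_t)start + offset + len < end)
			return 0;
	}

	return -1;
}
```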

> +
>  	return false;
>  }
>  
> @@ -650,8 +675,8 @@ static void vu_handle_tx(VuDev *vdev, int index)
>  	VuVirtq *vq = &vdev->vq[index];
>  	int hdrlen = vdev->hdrlen;
>  	struct timespec now;
> -	char *p;
> -	size_t n;
> +	unsigned int indexes[VIRTQUEUE_MAX_SIZE];
> +	int count;
>  
>  	if (index % 2 != VHOST_USER_TX_QUEUE) {
>  		debug("index %d is not an TX queue", index);
> @@ -660,14 +685,11 @@ static void vu_handle_tx(VuDev *vdev, int index)
>  
>  	clock_gettime(CLOCK_MONOTONIC, &now);
>  
> -	p = pkt_buf;
> -
>  	pool_flush_all();
>  
> +	count = 0;
>  	while (1) {
>  		VuVirtqElement *elem;
> -		unsigned int out_num;
> -		struct iovec sg[VIRTQUEUE_MAX_SIZE], *out_sg;
>  
>  		ASSERT(index == VHOST_USER_TX_QUEUE);
>  		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer[index]);
> @@ -675,32 +697,26 @@ static void vu_handle_tx(VuDev *vdev, int index)
>  			break;
>  		}
>  
> -		out_num = elem->out_num;
> -		out_sg = elem->out_sg;
> -		if (out_num < 1) {
> +		if (elem->out_num < 1) {

The change from out_num local to elem->out_num seems like an unrelated
stylistic change that could be folded into the earlier patch.

>  			debug("virtio-net header not in first element");
>  			break;
>  		}
> +		ASSERT(elem->out_num == 1);
>  
> -		if (hdrlen) {
> -			unsigned sg_num;
> -
> -			sg_num = iov_copy(sg, ARRAY_SIZE(sg), out_sg, out_num,
> -					  hdrlen, -1);
> -			out_num = sg_num;
> -			out_sg = sg;
> -		}
> -
> -		n = iov_to_buf(out_sg, out_num, 0, p, TAP_BUF_FILL);
> -
> -		packet_add_all(c, n, p);
> -
> -		p += n;
> +		packet_add_all(c, elem->out_sg[0].iov_len - hdrlen,
> +			       (char *)elem->out_sg[0].iov_base + hdrlen);
> +		indexes[count] = elem->index;
> +		count++;
> +	}
> +	tap_handler_all(c, &now);
>  
> -		vu_queue_push(vdev, vq, elem, 0);
> +	if (count) {
> +		int i;
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill_by_index(vdev, vq, indexes[i], 0, i);
> +		vu_queue_flush(vdev, vq, count);
>  		vu_queue_notify(vdev, vq);
>  	}
> -	tap_handler_all(c, &now);
>  }
>  
>  void vu_kick_cb(struct ctx *c, union epoll_ref ref)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 22/24] tcp: vhost-user RX nocopy
  2024-02-02 14:11 ` [PATCH 22/24] tcp: vhost-user RX nocopy Laurent Vivier
@ 2024-02-09  4:57   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-09  4:57 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:49PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile |   6 +-
>  tcp.c    |  66 +++++---
>  tcp_vu.c | 447 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h |  10 ++
>  4 files changed, 502 insertions(+), 27 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
> 
> diff --git a/Makefile b/Makefile
> index 2016b071ddf2..f7a403d19b61 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
> -	tcp_buf.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
> +	tcp_buf.c tcp_vu.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -56,8 +56,8 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
> -	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
> -	util.h iov.h ip.h virtio.h vhost_user.h
> +	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_vu.h tcp_internal.h \
> +	udp.h util.h iov.h ip.h virtio.h vhost_user.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/tcp.c b/tcp.c
> index b6aca9f37f19..e829e12fe7c2 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -302,6 +302,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>  
>  /* Sides of a flow as we use them in "tap" connections */
>  #define	SOCKSIDE	0
> @@ -1034,7 +1035,7 @@ size_t ipv4_fill_headers(const struct ctx *c,
>  	tcp_set_tcp_header(th, conn, seq);
>  
>  	th->check = 0;
> -	if (c->mode != MODE_VU || *c->pcap)
> +	if (c->mode != MODE_VU)
>  		th->check = tcp_update_check_tcp4(iph);
>  
>  	return ip_len;
> @@ -1072,7 +1073,7 @@ size_t ipv6_fill_headers(const struct ctx *c,
>  	tcp_set_tcp_header(th, conn, seq);
>  
>  	th->check = 0;
> -	if (c->mode != MODE_VU || *c->pcap)
> +	if (c->mode != MODE_VU)
>  		th->check = tcp_update_check_tcp6(ip6h);
>  
>  	ip6h->hop_limit = 255;
> @@ -1302,6 +1303,12 @@ int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
>  	return 1;
>  }
>  
> +int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +	return tcp_buf_send_flag(c, conn, flags);

Your previous renames to "tcp_buf" make some more sense to me now.
It's not so much that the "tcp_buf" functions are related to buffer
management but they belong to the (linear, passt-managed) buffer
implementation of TCP.  I see the rationale, but I still don't really
like the name - I don't think that the connection from "tcp_buf" to
"TCP code specific to several but not all L2 interface
implementations" is at all obvious.  Not that a good way of conveying
that quickly occurs to me.  For the time being, I'm inclined to just
stick with "tcp", or maybe "tcp_default" for the existing (tuntap &
qemu socket) implementations and tcp_vu for the new ones.  That can
maybe be cleaned up with a more systematic division of L2 interface types
(on my list...).

> +}
>  
>  /**
>   * tcp_rst_do() - Reset a tap connection: send RST segment to tap, close socket
> @@ -1313,7 +1320,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
>  	if (conn->events == CLOSED)
>  		return;
>  
> -	if (!tcp_buf_send_flag(c, conn, RST))
> +	if (!tcp_send_flag(c, conn, RST))
>  		conn_event(c, conn, CLOSED);
>  }
>  
> @@ -1430,7 +1437,8 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>   *
>   * Return: clamped MSS value
>   */
> -static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
> +static uint16_t tcp_conn_tap_mss(const struct ctx *c,
> +				 const struct tcp_tap_conn *conn,
>  				 const char *opts, size_t optlen)
>  {
>  	unsigned int mss;
> @@ -1441,7 +1449,10 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
>  	else
>  		mss = ret;
>  
> -	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
> +	if (c->mode == MODE_VU)
> +		mss = MIN(tcp_vu_conn_tap_mss(conn), mss);
> +	else
> +		mss = MIN(tcp_buf_conn_tap_mss(conn), mss);

This seems oddly complex.  What are the actual circumstances in which
the VU mss would differ from other cases?

>  	return MIN(mss, USHRT_MAX);
>  }
> @@ -1568,7 +1579,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>  
>  	conn->wnd_to_tap = WINDOW_DEFAULT;
>  
> -	mss = tcp_conn_tap_mss(conn, opts, optlen);
> +	mss = tcp_conn_tap_mss(c, conn, opts, optlen);
>  	if (setsockopt(s, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)))
>  		flow_trace(conn, "failed to set TCP_MAXSEG on socket %i", s);
>  	MSS_SET(conn, mss);
> @@ -1625,7 +1636,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>  	} else {
>  		tcp_get_sndbuf(conn);
>  
> -		if (tcp_buf_send_flag(c, conn, SYN | ACK))
> +		if (tcp_send_flag(c, conn, SYN | ACK))
>  			return;
>  
>  		conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -1673,6 +1684,13 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>  	return 0;
>  }
>  
> +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
> +	return tcp_buf_data_from_sock(c, conn);
> +}
>  
>  /**
>   * tcp_data_from_tap() - tap/guest data for established connection
> @@ -1806,7 +1824,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			   max_ack_seq, conn->seq_to_tap);
>  		conn->seq_ack_from_tap = max_ack_seq;
>  		conn->seq_to_tap = max_ack_seq;
> -		tcp_buf_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn);

In particular, having changed all these calls from tcp_ to tcp_buf_ and
now changing them back seems like churn that it would be nice to
avoid.

>  	}
>  
>  	if (!iov_i)
> @@ -1822,14 +1840,14 @@ eintr:
>  			 *   Then swiftly looked away and left.
>  			 */
>  			conn->seq_from_tap = seq_from_tap;
> -			tcp_buf_send_flag(c, conn, ACK);
> +			tcp_send_flag(c, conn, ACK);
>  		}
>  
>  		if (errno == EINTR)
>  			goto eintr;
>  
>  		if (errno == EAGAIN || errno == EWOULDBLOCK) {
> -			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +			tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  			return p->count - idx;
>  
>  		}
> @@ -1839,7 +1857,7 @@ eintr:
>  	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
>  		partial_send = 1;
>  		conn->seq_from_tap += n;
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  	} else {
>  		conn->seq_from_tap += n;
>  	}
> @@ -1852,7 +1870,7 @@ out:
>  		 */
>  		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
>  			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
> -			tcp_buf_send_flag(c, conn, DUP_ACK);
> +			tcp_send_flag(c, conn, DUP_ACK);
>  		}
>  		return p->count - idx;
>  	}
> @@ -1866,7 +1884,7 @@ out:
>  
>  		conn_event(c, conn, TAP_FIN_RCVD);
>  	} else {
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  	}
>  
>  	return p->count - idx;
> @@ -1891,7 +1909,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	if (!(conn->wnd_from_tap >>= conn->ws_from_tap))
>  		conn->wnd_from_tap = 1;
>  
> -	MSS_SET(conn, tcp_conn_tap_mss(conn, opts, optlen));
> +	MSS_SET(conn, tcp_conn_tap_mss(c, conn, opts, optlen));
>  
>  	conn->seq_init_from_tap = ntohl(th->seq) + 1;
>  	conn->seq_from_tap = conn->seq_init_from_tap;
> @@ -1902,8 +1920,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
>  	 */
> -	tcp_buf_data_from_sock(c, conn);
> -	tcp_buf_send_flag(c, conn, ACK);
> +	tcp_data_from_sock(c, conn);
> +	tcp_send_flag(c, conn, ACK);
>  }
>  
>  /**
> @@ -1983,7 +2001,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  			conn->seq_from_tap++;
>  
>  			shutdown(conn->sock, SHUT_WR);
> -			tcp_buf_send_flag(c, conn, ACK);
> +			tcp_send_flag(c, conn, ACK);
>  			conn_event(c, conn, SOCK_FIN_SENT);
>  
>  			return 1;
> @@ -1994,7 +2012,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  
>  		tcp_tap_window_update(conn, ntohs(th->window));
>  
> -		tcp_buf_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn);
>  
>  		if (p->count - idx == 1)
>  			return 1;
> @@ -2024,7 +2042,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
>  		shutdown(conn->sock, SHUT_WR);
>  		conn_event(c, conn, SOCK_FIN_SENT);
> -		tcp_buf_send_flag(c, conn, ACK);
> +		tcp_send_flag(c, conn, ACK);
>  		ack_due = 0;
>  	}
>  
> @@ -2058,7 +2076,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>  		return;
>  	}
>  
> -	if (tcp_buf_send_flag(c, conn, SYN | ACK))
> +	if (tcp_send_flag(c, conn, SYN | ACK))
>  		return;
>  
>  	conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -2126,7 +2144,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
>  
>  	conn->wnd_from_tap = WINDOW_DEFAULT;
>  
> -	tcp_buf_send_flag(c, conn, SYN);
> +	tcp_send_flag(c, conn, SYN);
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  
>  	tcp_get_sndbuf(conn);
> @@ -2190,7 +2208,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  		return;
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  		tcp_timer_ctl(c, conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
> @@ -2206,7 +2224,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> -			tcp_buf_data_from_sock(c, conn);
> +			tcp_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
>  	} else {
> @@ -2261,7 +2279,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  			conn_event(c, conn, SOCK_FIN_RCVD);
>  
>  		if (events & EPOLLIN)
> -			tcp_buf_data_from_sock(c, conn);
> +			tcp_data_from_sock(c, conn);
>  
>  		if (events & EPOLLOUT)
>  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..ed59b21cabdc
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later

This file needs a copyright notice.


> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include <netinet/ip.h>
> +
> +#include <sys/socket.h>
> +
> +#include <linux/tcp.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +
> +#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
> +#define CONN_V6(conn)		(!CONN_V4(conn))

I don't love having these duplicated in two .c files.  However, it
might become irrelevant as I move towards having v4/v6 become implicit
in the common flow addresses.

> +/* vhost-user */
> +static const struct virtio_net_hdr vu_header = {
> +	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +};
> +
> +static unsigned char buffer[65536];
> +static struct iovec	iov_vu			[VIRTQUEUE_MAX_SIZE];
> +static unsigned int	indexes			[VIRTQUEUE_MAX_SIZE];
> +
> +uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn)
> +{
> +	(void)conn;
> +	return USHRT_MAX;
> +}
> +
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	VuVirtqElement *elem;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct virtio_net_hdr_mrg_rxbuf *vh;
> +	size_t tlen, vnet_hdrlen, ip_len, optlen = 0;
> +	struct ethhdr *eh;
> +	int ret;
> +	int nb_ack;
> +
> +	elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +	if (!elem)
> +		return 0;
> +
> +	if (elem->in_num < 1) {
> +		err("virtio-net receive queue contains no in buffers");
> +		vu_queue_rewind(vdev, vq, 1);
> +		return 0;
> +	}
> +
> +	/* Options: MSS, NOP and window scale (8 bytes) */
> +	if (flags & SYN)
> +		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;

Given the number of subtle TCP bugs we've had to squash, it would be
really nice if we could avoid duplicating TCP logic between paths.
Could we make some abstraction that takes an iov, but can also be
called from the non-vu case with a 1-entry iov representing a single
buffer?
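
A minimal sketch of that shape, assuming hypothetical helper names
(sketch_fill_iov() and sketch_fill_buf() are made up for illustration and
do not exist in passt).  The shared logic only ever sees an iov, and the
linear-buffer path wraps its single buffer in a 1-entry iov before calling
it:

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper, not part of passt: copy up to @len bytes from @src
 * into the buffers described by @iov, in order.
 * Returns the number of bytes actually copied.
 */
size_t sketch_fill_iov(const struct iovec *iov, size_t iov_cnt,
		       const void *src, size_t len)
{
	const char *p = src;
	size_t done = 0, i;

	for (i = 0; i < iov_cnt && done < len; i++) {
		size_t n = iov[i].iov_len;

		if (n > len - done)
			n = len - done;

		memcpy(iov[i].iov_base, p + done, n);
		done += n;
	}

	return done;
}

/* Hypothetical wrapper for the non-vu case: one linear buffer is simply
 * described as a 1-entry iov, so both paths share sketch_fill_iov().
 */
size_t sketch_fill_buf(void *buf, size_t buf_len, const void *src, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = buf_len };

	return sketch_fill_iov(&iov, 1, src, len);
}
```

The same pattern would apply to whatever builds the TCP headers: the code
that knows about sequence numbers and options would be written once
against the iov form.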

> +	vh = elem->in_sg[0].iov_base;
> +
> +	vh->hdr = vu_header;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +		vh->num_buffers = htole16(1);
> +	} else {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr);
> +	}
> +	eh = (struct ethhdr *)((char *)elem->in_sg[0].iov_base + vnet_hdrlen);

Ah... hmm.. I was already hoping to clean up the handling of the
different L2 and below headers for the different "tap" types.  We
basically have ugly hacks to deal with the difference between tuntap
(plain ethernet) and qemu socket (ethernet + length header).  Now we're
adding vhost-user (ethernet + vhost header), which is a similar issue.
Abstracting this could also make it pretty easy to support further
"tap" interfaces: a different hypervisor socket transfer with a
slightly different header, tuntap in "tun" mode (raw IP without
ethernet headers), SLIP or PPP,
...
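
One possible shape for such an abstraction, as a sketch only: the enum
and function names below are hypothetical and not part of passt, and the
32-bit length prefix for the qemu socket framing is an assumption.  The
idea is that a single function answers "how many bytes precede the IP
header for this kind of tap?", so callers don't need per-mode pointer
arithmetic:

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/if_ether.h>
#include <linux/virtio_net.h>

/* Hypothetical enumeration of "tap" framings; these names are not passt's */
enum sketch_tap_type {
	SKETCH_TAP_TUNTAP,	/* plain Ethernet frames */
	SKETCH_TAP_QEMU_SOCKET,	/* length prefix, then Ethernet */
	SKETCH_TAP_VHOST_USER,	/* virtio-net header, then Ethernet */
	SKETCH_TAP_TUN,		/* raw IP, no Ethernet header */
};

/* Number of bytes that precede the IP header for a given framing.
 * @mrg_rxbuf only matters for vhost-user, where it selects the larger
 * virtio-net header variant.
 */
size_t sketch_l2_hdrlen(enum sketch_tap_type type, int mrg_rxbuf)
{
	switch (type) {
	case SKETCH_TAP_TUNTAP:
		return ETH_HLEN;
	case SKETCH_TAP_QEMU_SOCKET:
		/* Assumption: a 32-bit frame length prefix */
		return sizeof(uint32_t) + ETH_HLEN;
	case SKETCH_TAP_VHOST_USER:
		if (mrg_rxbuf)
			return sizeof(struct virtio_net_hdr_mrg_rxbuf) +
			       ETH_HLEN;
		return sizeof(struct virtio_net_hdr) + ETH_HLEN;
	case SKETCH_TAP_TUN:
		return 0;
	}

	return 0;
}
```

A new framing would then mean one more enum value and one more case,
rather than another copy of the header-building code.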

> +	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct iphdr *iph = (struct iphdr *)(eh + 1);
> +		struct tcphdr *th = (struct tcphdr *)(iph + 1);

Hmm.. did I miss the logic to check that there's room for the vhost +
ethernet + IP + TCP headers in the first iov element?
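
For comparison, tcp_vu_data_from_sock() further down does
ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen), whereas this function only
tests elem->in_num < 1.  A sketch of the kind of check I'd expect here,
using a hypothetical helper name (sketch_iov_has_hdr_room() is not an
existing passt function) and leaving the caller to compute the total
header length:

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical check: does the first element of @iov offer at least
 * @hdrlen contiguous bytes, so that all headers can be written through
 * plain pointer arithmetic on iov[0].iov_base?
 */
bool sketch_iov_has_hdr_room(const struct iovec *iov, size_t iov_cnt,
			     size_t hdrlen)
{
	return iov_cnt >= 1 && iov[0].iov_base && iov[0].iov_len >= hdrlen;
}
```

On failure the caller would presumably rewind the queue and bail out, the
same way the in_num < 1 case is handled.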

> +		char *data = (char *)(th + 1);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*th = (struct tcphdr){
> +			.doff = sizeof(struct tcphdr) / 4,
> +			.ack = 1
> +		};
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vdev, vq, 1);
> +			return ret;
> +		}
> +
> +		ip_len = ipv4_fill_headers(c, conn, iph, optlen, NULL,
> +					   conn->seq_to_tap);
> +
> +		tlen =  ip_len + sizeof(struct ethhdr);
> +
> +		if (*c->pcap) {
> +			uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
> +
> +			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
> +		}
> +	} else {
> +		struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
> +		struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
> +		char *data = (char *)(th + 1);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*th = (struct tcphdr){
> +			.doff = sizeof(struct tcphdr) / 4,
> +			.ack = 1
> +		};
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vdev, vq, 1);
> +			return ret;
> +		}
> +
> +		ip_len = ipv6_fill_headers(c, conn, ip6h, optlen,
> +					   conn->seq_to_tap);
> +
> +		tlen =  ip_len + sizeof(struct ethhdr);
> +
> +		if (*c->pcap) {
> +			uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
> +
> +			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
> +		}
> +	}
> +
> +	pcap((void *)eh, tlen);
> +
> +	tlen += vnet_hdrlen;
> +	vu_queue_fill(vdev, vq, elem, tlen, 0);
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +		if (elem) {
> +			if (elem->in_num < 1 || elem->in_sg[0].iov_len < tlen) {
> +				vu_queue_rewind(vdev, vq, 1);
> +			} else {
> +				memcpy(elem->in_sg[0].iov_base, vh, tlen);
> +				nb_ack++;
> +			}
> +		}
> +	}
> +
> +	vu_queue_flush(vdev, vq, nb_ack);
> +	vu_queue_notify(vdev, vq);
> +
> +	return 0;
> +}
> +
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	uint32_t already_sent;
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int s = conn->sock, v4 = CONN_V4(conn);
> +	int i, ret = 0, iov_count, iov_used;
> +	struct msghdr mh_sock = { 0 };
> +	size_t l2_hdrlen, vnet_hdrlen, fillsize;
> +	ssize_t len;
> +	uint16_t *check;
> +	uint16_t mss = MSS_GET(conn);
> +	int num_buffers;
> +	int segment_size;
> +	struct iovec *first;
> +	bool has_mrg_rxbuf;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */
> +
> +	fillsize = wnd_scaled;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
> +	if (has_mrg_rxbuf) {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	} else {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr);
> +	}

passt style (unlike qemu) does not put braces on 1-line blocks.

> +	l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct tcphdr);

That seems like a misleading variable name.  The ethernet headers are
certainly L2.  Including the lower level headers in L2 is reasonable,
but the IP and TCP headers are L3 and L4 headers respectively.
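
To illustrate the naming point, here is a sketch with one length per
layer.  The struct and function names are hypothetical, not something in
passt.  The sizes in the comments are the fixed ones: 14 bytes for
Ethernet, 20 for IPv4 without options, 40 for IPv6, 20 for TCP without
options.

```c
#include <stddef.h>

/* Hypothetical per-layer header lengths, named by the layer they belong to */
struct sketch_hdrlens {
	size_t vnet;	/* virtio-net header, below L2 */
	size_t l2;	/* Ethernet header: 14 bytes */
	size_t l3;	/* IPv4 (20 bytes, no options) or IPv6 (40 bytes) */
	size_t l4;	/* TCP header without options: 20 bytes */
};

/* Offset of the TCP payload from the start of the first buffer */
size_t sketch_payload_offset(const struct sketch_hdrlens *h)
{
	return h->vnet + h->l2 + h->l3 + h->l4;
}
```

Something like "hdrlen" or "payload_offset" for the sum would describe
what the variable actually holds.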

> +	if (v4) {
> +		l2_hdrlen += sizeof(struct iphdr);
> +	} else {
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	}
> +
> +	iov_count = 0;
> +	segment_size = 0;
> +	while (fillsize > 0 && iov_count < VIRTQUEUE_MAX_SIZE - 1) {
> +		VuVirtqElement *elem;
> +
> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +		if (!elem)
> +			break;
> +
> +		if (elem->in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			goto err;
> +		}
> +
> +		ASSERT(elem->in_num == 1);
> +		ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
> +
> +		indexes[iov_count] = elem->index;
> +
> +		if (segment_size == 0) {
> +			iov_vu[iov_count + 1].iov_base =
> +					(char *)elem->in_sg[0].iov_base + l2_hdrlen;
> +			iov_vu[iov_count + 1].iov_len =
> +					elem->in_sg[0].iov_len - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_count + 1].iov_base = elem->in_sg[0].iov_base;
> +			iov_vu[iov_count + 1].iov_len = elem->in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_count + 1].iov_len > fillsize)
> +			 iov_vu[iov_count + 1].iov_len = fillsize;
> +
> +		segment_size += iov_vu[iov_count + 1].iov_len;
> +		if (!has_mrg_rxbuf) {
> +			segment_size = 0;
> +		} else if (segment_size >= mss) {
> +			iov_vu[iov_count + 1].iov_len -= segment_size - mss;
> +			segment_size = 0;
> +		}
> +		fillsize -= iov_vu[iov_count + 1].iov_len;
> +
> +		iov_count++;
> +	}
> +	if (iov_count == 0)
> +		return 0;
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = iov_count + 1;
> +
> +	do
> +		len = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (len < 0 && errno == EINTR);
> +
> +	if (len < 0)
> +		goto err;
> +
> +	if (!len) {
> +		vu_queue_rewind(vdev, vq, iov_count);
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			if ((ret = tcp_vu_send_flag(c, conn, FIN | ACK))) {
> +				tcp_rst(c, conn);
> +				return ret;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +
> +		return 0;
> +	}
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vdev, vq, iov_count);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	segment_size = 0;
> +	for (i = 0; i < iov_count && len; i++) {
> +
> +		if (segment_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		segment_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (segment_size >= mss || len == 0 ||
> +		    i + 1 == iov_count || !has_mrg_rxbuf) {
> +
> +			struct ethhdr *eh;
> +			struct virtio_net_hdr_mrg_rxbuf *vh;
> +			char *base = (char *)first->iov_base - l2_hdrlen;
> +			size_t size = first->iov_len + l2_hdrlen;
> +
> +			vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
> +
> +			vh->hdr = vu_header;
> +			if (has_mrg_rxbuf)
> +				vh->num_buffers = htole16(num_buffers);
> +
> +			eh = (struct ethhdr *)((char *)base + vnet_hdrlen);
> +
> +			memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +			memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +			/* initialize header */
> +			if (v4) {
> +				struct iphdr *iph = (struct iphdr *)(eh + 1);
> +				struct tcphdr *th = (struct tcphdr *)(iph + 1);
> +
> +				eh->h_proto = htons(ETH_P_IP);
> +
> +				*th = (struct tcphdr){
> +					.doff = sizeof(struct tcphdr) / 4,
> +					.ack = 1
> +				};
> +
> +				*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +				ipv4_fill_headers(c, conn, iph, segment_size,
> +						len ? check : NULL, conn->seq_to_tap);
> +
> +				if (*c->pcap) {
> +					uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
> +
> +					first->iov_base = th;
> +					first->iov_len = size - l2_hdrlen + sizeof(*th);
> +
> +					th->check = csum_iov(first, num_buffers, sum);
> +				}
> +
> +				check = &iph->check;
> +			} else {
> +				struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
> +				struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
> +
> +				eh->h_proto = htons(ETH_P_IPV6);
> +
> +				*th = (struct tcphdr){
> +					.doff = sizeof(struct tcphdr) / 4,
> +					.ack = 1
> +				};
> +
> +				*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +				ipv6_fill_headers(c, conn, ip6h, segment_size,
> +						conn->seq_to_tap);
> +				if (*c->pcap) {
> +					uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
> +
> +					first->iov_base = th;
> +					first->iov_len = size - l2_hdrlen + sizeof(*th);
> +
> +					th->check = csum_iov(first, num_buffers, sum);
> +				}
> +			}
> +
> +			/* set iov for pcap logging */
> +			first->iov_base = eh;
> +			first->iov_len = size - vnet_hdrlen;
> +
> +			pcap_iov(first, num_buffers);
> +
> +			/* set iov_len for vu_queue_fill_by_index(); */
> +
> +			first->iov_base = base;
> +			first->iov_len = size;
> +
> +			conn->seq_to_tap += segment_size;
> +
> +			segment_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vdev, vq, iov_count - iov_used);
> +
> +	/* send packets */
> +	for (i = 0; i < iov_used; i++) {
> +		vu_queue_fill_by_index(vdev, vq, indexes[i],
> +				       iov_vu[i + 1].iov_len, i);
> +	}
> +
> +	vu_queue_flush(vdev, vq, iov_used);
> +	vu_queue_notify(vdev, vq);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +err:
> +	vu_queue_rewind(vdev, vq, iov_count);
> +
> +	if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +		ret = -errno;
> +		tcp_rst(c, conn);
> +	}
> +
> +	return ret;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..8045a6e3edb8
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,10 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn);
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif  /*TCP_VU_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 23/24] udp: vhost-user RX nocopy
  2024-02-02 14:11 ` [PATCH 23/24] udp: " Laurent Vivier
@ 2024-02-09  5:00   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-09  5:00 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:50PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile       |   4 +-
>  passt.c        |   5 +-
>  passt.h        |   1 +
>  udp.c          |  23 +++---
>  udp_internal.h |  21 +++++
>  udp_vu.c       | 215 +++++++++++++++++++++++++++++++++++++++++++++++++
>  udp_vu.h       |   8 ++
>  7 files changed, 262 insertions(+), 15 deletions(-)
>  create mode 100644 udp_internal.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
> 
> diff --git a/Makefile b/Makefile
> index f7a403d19b61..1d2b5dbfe085 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
> -	tcp_buf.c tcp_vu.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
> +	tcp_buf.c tcp_vu.c udp.c udp_vu.c util.c iov.c ip.c virtio.c vhost_user.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
>  	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_vu.h tcp_internal.h \
> -	udp.h util.h iov.h ip.h virtio.h vhost_user.h
> +	udp.h udp_internal.h udp_vu.h util.h iov.h ip.h virtio.h vhost_user.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/passt.c b/passt.c
> index 952aded12848..a5abd5c4fc03 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -392,7 +392,10 @@ loop:
>  			tcp_timer_handler(&c, ref);
>  			break;
>  		case EPOLL_TYPE_UDP:
> -			udp_buf_sock_handler(&c, ref, eventmask, &now);
> +			if (c.mode == MODE_VU)
> +				udp_vu_sock_handler(&c, ref, eventmask, &now);
> +			else
> +				udp_buf_sock_handler(&c, ref, eventmask, &now);
>  			break;
>  		case EPOLL_TYPE_ICMP:
>  			icmp_sock_handler(&c, AF_INET, ref);
> diff --git a/passt.h b/passt.h
> index 4e0100d51a4d..04f4af8fd72e 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -42,6 +42,7 @@ union epoll_ref;
>  #include "port_fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"
>  #include "vhost_user.h"
>  
>  /**
> diff --git a/udp.c b/udp.c
> index 799a10989a91..da67d0cfa46b 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -117,9 +117,7 @@
>  #include "tap.h"
>  #include "pcap.h"
>  #include "log.h"
> -
> -#define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
> -#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +#include "udp_internal.h"
>  
>  /**
>   * struct udp_tap_port - Port tracking based on tap-facing source port
> @@ -227,11 +225,11 @@ static struct mmsghdr	udp6_l2_mh_sock		[UDP_MAX_FRAMES];
>  static struct iovec	udp4_iov_splice		[UDP_MAX_FRAMES];
>  static struct iovec	udp6_iov_splice		[UDP_MAX_FRAMES];
>  
> -static struct sockaddr_in udp4_localname = {
> +struct sockaddr_in udp4_localname = {
>  	.sin_family = AF_INET,
>  	.sin_addr = IN4ADDR_LOOPBACK_INIT,
>  };
> -static struct sockaddr_in6 udp6_localname = {
> +struct sockaddr_in6 udp6_localname = {
>  	.sin6_family = AF_INET6,
>  	.sin6_addr = IN6ADDR_LOOPBACK_INIT,
>  };
> @@ -562,9 +560,9 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
> -			      size_t data_len, struct sockaddr_in *s_in,
> -			      in_port_t dstport, const struct timespec *now)
> +size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
> +		       size_t data_len, struct sockaddr_in *s_in,
> +		       in_port_t dstport, const struct timespec *now)
>  {
>  	struct udphdr *uh = (struct udphdr *)(iph + 1);
>  	in_port_t src_port;
> @@ -602,6 +600,7 @@ static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
>  	uh->source = s_in->sin_port;
>  	uh->dest = htons(dstport);
>  	uh->len= htons(data_len + sizeof(struct udphdr));
> +	uh->check = 0;
>  
>  	return ip_len;
>  }
> @@ -615,9 +614,9 @@ static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> -			      size_t data_len, struct sockaddr_in6 *s_in6,
> -			      in_port_t dstport, const struct timespec *now)
> +size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> +		       size_t data_len, struct sockaddr_in6 *s_in6,
> +		       in_port_t dstport, const struct timespec *now)
>  {
>  	struct udphdr *uh = (struct udphdr *)(ip6h + 1);
>  	struct in6_addr *src;
> @@ -672,7 +671,7 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
>  	uh->dest = htons(dstport);
>  	uh->len = ip6h->payload_len;
>  	uh->check = 0;
> -	if (c->mode != MODE_VU || *c->pcap)
> +	if (c->mode != MODE_VU)
>  		uh->check = csum(uh, ntohs(ip6h->payload_len),
>  				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
>  	ip6h->version = 6;
> diff --git a/udp_internal.h b/udp_internal.h
> new file mode 100644
> index 000000000000..a09f3c69da42
> --- /dev/null
> +++ b/udp_internal.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2021 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef UDP_INTERNAL_H
> +#define UDP_INTERNAL_H
> +
> +#define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
> +#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +
> +extern struct sockaddr_in udp4_localname;
> +extern struct sockaddr_in6 udp6_localname;
> +
> +size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
> +		       size_t data_len, struct sockaddr_in *s_in,
> +		       in_port_t dstport, const struct timespec *now);
> +size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> +		       size_t data_len, struct sockaddr_in6 *s_in6,
> +		       in_port_t dstport, const struct timespec *now);
> +#endif /* UDP_INTERNAL_H */
> diff --git a/udp_vu.c b/udp_vu.c
> new file mode 100644
> index 000000000000..c0f4cb90abd2
> --- /dev/null
> +++ b/udp_vu.c
> @@ -0,0 +1,215 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include <unistd.h>
> +#include <net/ethernet.h>
> +#include <net/if.h>
> +#include <netinet/in.h>
> +#include <netinet/ip.h>
> +#include <netinet/udp.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <sys/uio.h>
> +#include <linux/virtio_net.h>
> +
> +#include "checksum.h"
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "pcap.h"
> +#include "log.h"
> +#include "vhost_user.h"
> +#include "udp_internal.h"
> +#include "udp_vu.h"
> +
> +/* vhost-user */
> +static const struct virtio_net_hdr vu_header = {
> +	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +};
> +
> +static unsigned char buffer[65536];
> +static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
> +static unsigned int     indexes		[VIRTQUEUE_MAX_SIZE];
> +
> +void udp_vu_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> +			 const struct timespec *now)

It's not *as* big a deal as for TCP, but I'm really hoping we can
abstract things to avoid more code duplication between the vu and
non-vu paths here as well.

> +{
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	size_t l2_hdrlen, vnet_hdrlen, fillsize;
> +	ssize_t data_len;
> +	in_port_t dstport = ref.udp.port;
> +	bool has_mrg_rxbuf, v6 = ref.udp.v6;
> +	struct msghdr msg;
> +	int i, iov_count, iov_used, virtqueue_max;
> +
> +	if (c->no_udp || !(events & EPOLLIN))
> +		return;
> +
> +	has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
> +	if (has_mrg_rxbuf) {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +		virtqueue_max = VIRTQUEUE_MAX_SIZE;
> +	} else {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr);
> +		virtqueue_max = 1;
> +	}
> +	l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct udphdr);
> +
> +	if (v6) {
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +
> +		udp6_localname.sin6_port = htons(dstport);
> +		msg.msg_name = &udp6_localname;
> +		msg.msg_namelen = sizeof(udp6_localname);
> +	} else {
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +		udp4_localname.sin_port = htons(dstport);
> +		msg.msg_name = &udp4_localname;
> +		msg.msg_namelen = sizeof(udp4_localname);
> +	}
> +
> +	msg.msg_control = NULL;
> +	msg.msg_controllen = 0;
> +	msg.msg_flags = 0;
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		struct virtio_net_hdr_mrg_rxbuf *vh;
> +		struct ethhdr *eh;
> +		char *base;
> +		size_t size;
> +
> +		fillsize = USHRT_MAX;
> +		iov_count = 0;
> +		while (fillsize && iov_count < virtqueue_max) {
> +			VuVirtqElement *elem;
> +
> +			elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +			if (!elem)
> +				break;
> +
> +			if (elem->in_num < 1) {
> +				err("virtio-net receive queue contains no in buffers");
> +				vu_queue_rewind(vdev, vq, iov_count);
> +				return;
> +			}
> +			ASSERT(elem->in_num == 1);
> +			ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
> +
> +			indexes[iov_count] = elem->index;
> +			if (iov_count == 0) {
> +				iov_vu[0].iov_base = (char *)elem->in_sg[0].iov_base + l2_hdrlen;
> +				iov_vu[0].iov_len = elem->in_sg[0].iov_len - l2_hdrlen;
> +			} else {
> +				iov_vu[iov_count].iov_base = elem->in_sg[0].iov_base;
> +				iov_vu[iov_count].iov_len = elem->in_sg[0].iov_len;
> +			}
> +
> +			if (iov_vu[iov_count].iov_len > fillsize)
> +				iov_vu[iov_count].iov_len = fillsize;
> +
> +			fillsize -= iov_vu[iov_count].iov_len;
> +
> +			iov_count++;
> +		}
> +		if (iov_count == 0)
> +			break;
> +
> +		msg.msg_iov = iov_vu;
> +		msg.msg_iovlen = iov_count;
> +
> +		data_len = recvmsg(ref.fd, &msg, 0);
> +		if (data_len < 0) {
> +			vu_queue_rewind(vdev, vq, iov_count);
> +			return;
> +		}
> +
> +		iov_used = 0;
> +		size = data_len;
> +		while (size) {
> +			if (iov_vu[iov_used].iov_len > size)
> +				iov_vu[iov_used].iov_len = size;
> +
> +			size -= iov_vu[iov_used].iov_len;
> +			iov_used++;
> +		}
> +
> +		base = (char *)iov_vu[0].iov_base - l2_hdrlen;
> +		size = iov_vu[0].iov_len + l2_hdrlen;
> +
> +		/* release unused buffers */
> +		vu_queue_rewind(vdev, vq, iov_count - iov_used);
> +
> +		/* vnet_header */
> +		vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
> +		vh->hdr = vu_header;
> +		if (has_mrg_rxbuf)
> +			vh->num_buffers = htole16(iov_used);
> +
> +		/* ethernet header */
> +		eh = (struct ethhdr *)(base + vnet_hdrlen);
> +
> +		memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +		memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +		/* initialize header */
> +		if (v6) {
> +			struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
> +			struct udphdr *uh = (struct udphdr *)(ip6h + 1);
> +			uint32_t sum;
> +
> +			eh->h_proto = htons(ETH_P_IPV6);
> +
> +			*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
> +
> +			udp_update_hdr6(c, ip6h, data_len, &udp6_localname,
> +					dstport, now);
> +			if (*c->pcap) {
> +				sum = proto_ipv6_header_checksum(ip6h, IPPROTO_UDP);
> +
> +				iov_vu[0].iov_base = uh;
> +				iov_vu[0].iov_len = size - l2_hdrlen + sizeof(*uh);
> +				uh->check = csum_iov(iov_vu, iov_used, sum);
> +			} else {
> +				/* 0 checksum is invalid with IPv6/UDP */
> +				uh->check = 0xFFFF;
> +			}
> +		} else {
> +			struct iphdr *iph = (struct iphdr *)(eh + 1);
> +			struct udphdr *uh = (struct udphdr *)(iph + 1);
> +			uint32_t sum;
> +
> +			eh->h_proto = htons(ETH_P_IP);
> +
> +			*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
> +
> +			udp_update_hdr4(c, iph, data_len, &udp4_localname,
> +					dstport, now);
> +			if (*c->pcap) {
> +				sum = proto_ipv4_header_checksum(iph, IPPROTO_UDP);
> +
> +				iov_vu[0].iov_base = uh;
> +				iov_vu[0].iov_len = size - l2_hdrlen + sizeof(*uh);
> +				uh->check = csum_iov(iov_vu, iov_used, sum);
> +			}
> +		}
> +
> +		/* set iov for pcap logging */
> +		iov_vu[0].iov_base = base + vnet_hdrlen;
> +		iov_vu[0].iov_len = size - vnet_hdrlen;
> +		pcap_iov(iov_vu, iov_used);
> +
> +		/* set iov_len for vu_queue_fill_by_index(); */
> +		iov_vu[0].iov_base = base;
> +		iov_vu[0].iov_len = size;
> +
> +		/* send packets */
> +		for (i = 0; i < iov_used; i++)
> +			vu_queue_fill_by_index(vdev, vq, indexes[i],
> +					       iov_vu[i].iov_len, i);
> +
> +		vu_queue_flush(vdev, vq, iov_used);
> +		vu_queue_notify(vdev, vq);
> +	}
> +}
> diff --git a/udp_vu.h b/udp_vu.h
> new file mode 100644
> index 000000000000..e01ce047ee0a
> --- /dev/null
> +++ b/udp_vu.h
> @@ -0,0 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifndef UDP_VU_H
> +#define UDP_VU_H
> +
> +void udp_vu_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			 uint32_t events, const struct timespec *now);
> +#endif /* UDP_VU_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 24/24] vhost-user: remove tap_send_frames_vu()
  2024-02-02 14:11 ` [PATCH 24/24] vhost-user: remove tap_send_frames_vu() Laurent Vivier
@ 2024-02-09  5:01   ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-09  5:01 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, Feb 02, 2024 at 03:11:51PM +0100, Laurent Vivier wrote:
> As TCP and UDP use now directly vhost-user we don't need this function
> anymore. Other protocols (ICMP, ARP, DHCP, ...) use tap_send()/vu_send()
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tap.c        |  3 +--
>  vhost_user.c | 16 ----------------
>  vhost_user.h |  2 --
>  3 files changed, 1 insertion(+), 20 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 930e48689497..ed1744f72e37 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -440,8 +440,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
>  		m = tap_send_frames_passt(c, iov, n);
>  		break;
>  	case MODE_VU:
> -		m = tap_send_frames_vu(c, iov, n);
> -		break;
> +		ASSERT(0);
>  	default:
>  		m = 0;
>  		break;
> diff --git a/vhost_user.c b/vhost_user.c
> index 9cc07c8312c0..4ceeeb58f792 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -653,22 +653,6 @@ err:
>  	return offset;
>  }
>  
> -size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov, size_t n)
> -{
> -	size_t i;
> -	int ret;
> -
> -	debug("tap_send_frames_vu n %zd", n);
> -
> -	for (i = 0; i < n; i++) {
> -		ret = vu_send(c, iov[i].iov_base, iov[i].iov_len);
> -		if (ret < 0)
> -			break;
> -	}
> -	debug("count %zd", i);
> -	return i;
> -}
> -
>  static void vu_handle_tx(VuDev *vdev, int index)
>  {
>  	struct ctx *c = (struct ctx *) ((char *)vdev - offsetof(struct ctx, vdev));
> diff --git a/vhost_user.h b/vhost_user.h
> index 25f0b617ab40..44678ddabef4 100644
> --- a/vhost_user.h
> +++ b/vhost_user.h
> @@ -129,8 +129,6 @@ static inline bool vu_queue_started(const VuVirtq *vq)
>  	return vq->started;
>  }
>  
> -size_t tap_send_frames_vu(const struct ctx *c, const struct iovec *iov,
> -			  size_t n);
>  int vu_send(const struct ctx *c, const void *data, size_t len);
>  void vu_print_capabilities(void);
>  void vu_init(struct ctx *c);


* Re: [PATCH 12/24] tap: make tap_update_mac() generic
  2024-02-08 17:10     ` Stefano Brivio
@ 2024-02-09  5:02       ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-09  5:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Thu, Feb 08, 2024 at 06:10:51PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:49:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Feb 02, 2024 at 03:11:39PM +0100, Laurent Vivier wrote:
> > > Use ethhdr rather than tap_hdr.
> > > 
> > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>  
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > I'd be happy to see this applied immediately, in advance of the rest
> > of the series.
> 
> Oh, hm, do you need this for something around the flow table? I just
> have one nit below:

Need? No, not really.  I just think it's clearer, independent of the
VU changes.

> 
> > > ---
> > >  tap.c     | 6 +++---
> > >  tap.h     | 2 +-
> > >  tcp_buf.c | 8 ++++----
> > >  udp.c     | 4 ++--
> > >  4 files changed, 10 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/tap.c b/tap.c
> > > index 3ea03f720d6d..29f389057ac1 100644
> > > --- a/tap.c
> > > +++ b/tap.c
> > > @@ -447,13 +447,13 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
> > >   * @eth_d:	Ethernet destination address, NULL if unchanged
> > >   * @eth_s:	Ethernet source address, NULL if unchanged
> > >   */
> > > -void tap_update_mac(struct tap_hdr *taph,
> > > +void eth_update_mac(struct ethhdr *eh,
> 
> ...function comment should be updated accordingly.
> 
> > >  		    const unsigned char *eth_d, const unsigned char *eth_s)
> > >  {
> > >  	if (eth_d)
> > > -		memcpy(taph->eh.h_dest, eth_d, sizeof(taph->eh.h_dest));
> > > +		memcpy(eh->h_dest, eth_d, sizeof(eh->h_dest));
> > >  	if (eth_s)
> > > -		memcpy(taph->eh.h_source, eth_s, sizeof(taph->eh.h_source));
> > > +		memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
> > >  }
> > >  
> > >  PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
> > > diff --git a/tap.h b/tap.h
> > > index 466d91466c3d..437b9aa2b43f 100644
> > > --- a/tap.h
> > > +++ b/tap.h
> > > @@ -74,7 +74,7 @@ void tap_icmp6_send(const struct ctx *c,
> > >  		    const void *in, size_t len);
> > >  int tap_send(const struct ctx *c, const void *data, size_t len);
> > >  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n);
> > > -void tap_update_mac(struct tap_hdr *taph,
> > > +void eth_update_mac(struct ethhdr *eh,
> > >  		    const unsigned char *eth_d, const unsigned char *eth_s);
> > >  void tap_listen_handler(struct ctx *c, uint32_t events);
> > >  void tap_handler_pasta(struct ctx *c, uint32_t events,
> > > diff --git a/tcp_buf.c b/tcp_buf.c
> > > index d70e7f810e4a..4c1f00c1d1b2 100644
> > > --- a/tcp_buf.c
> > > +++ b/tcp_buf.c
> > > @@ -218,10 +218,10 @@ void tcp_buf_update_l2(const unsigned char *eth_d, const unsigned char *eth_s)
> > >  		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
> > >  		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
> > >  
> > > -		tap_update_mac(&b4->taph, eth_d, eth_s);
> > > -		tap_update_mac(&b6->taph, eth_d, eth_s);
> > > -		tap_update_mac(&b4f->taph, eth_d, eth_s);
> > > -		tap_update_mac(&b6f->taph, eth_d, eth_s);
> > > +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> > > +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> > > +		eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
> > > +		eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
> > >  	}
> > >  }
> > >  
> > > diff --git a/udp.c b/udp.c
> > > index 96b4e6ca9a85..db635742319b 100644
> > > --- a/udp.c
> > > +++ b/udp.c
> > > @@ -283,8 +283,8 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> > >  		struct udp4_l2_buf_t *b4 = &udp4_l2_buf[i];
> > >  		struct udp6_l2_buf_t *b6 = &udp6_l2_buf[i];
> > >  
> > > -		tap_update_mac(&b4->taph, eth_d, eth_s);
> > > -		tap_update_mac(&b6->taph, eth_d, eth_s);
> > > +		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> > > +		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> > >  	}
> > >  }
> > >    
> > 
> 


* Re: [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add()
  2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
  2024-02-02 14:29   ` Laurent Vivier
  2024-02-06  1:52   ` David Gibson
@ 2024-02-11 23:15   ` Stefano Brivio
  2024-02-12  2:22     ` David Gibson
  2 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:15 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev, Laurent Vivier

On Fri,  2 Feb 2024 15:11:40 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> From: Laurent Vivier <laurent@vivier.eu>
> 
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> ---
>  tap.c | 98 +++++++++++++++++++++++++++++------------------------------
>  tap.h |  7 +++++
>  2 files changed, 56 insertions(+), 49 deletions(-)

I'm assuming that you need this patch to recycle those bits of "tap"
functions for usage in vhost-user code... which shows they actually
have little to do with tun/tap interfaces.

But sure, we already have stuff in there to deal with UNIX domain sockets,
so "tap" is already somewhat inconsistent.

If we use "tap" for a (long) moment to denote "anything guest/container
facing", then:

> diff --git a/tap.c b/tap.c
> index 29f389057ac1..5b1b61550c13 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -911,6 +911,45 @@ append:
>  	return in->count;
>  }
>  
> +void pool_flush_all(void)

I think that the "all" in pool_flush_all() doesn't really convey the
message of "all pools", and tap_pools_flush() would describe this
better.

All these would need function comments, by the way.

> +{
> +	pool_flush(pool_tap4);
> +	pool_flush(pool_tap6);
> +}
> +
> +void tap_handler_all(struct ctx *c, const struct timespec *now)

Same here: something like tap_pools_handler() describes better the fact
that this is not handling "everything", rather "tap pools".

> +{
> +	tap4_handler(c, pool_tap4, now);
> +	tap6_handler(c, pool_tap6, now);
> +}
> +
> +void packet_add_all_do(struct ctx *c, ssize_t len, char *p,
> +		       const char *func, int line)

...and this doesn't add "all the packets" -- it adds just one, to one
pool! What about tap_pool_add()?

About using packet_add_do() directly, and passing in 'line': I'm not
sure there's a big advantage in having the line from
tap_handler_passt() or another caller reported instead of having the
point where packet_add() is actually called.

There might also be an issue in this function (packet_add_all_do()) so
one might want to start debugging there.
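
To make the trade-off about 'func' and 'line' concrete, here is a minimal
sketch of the two options. All the demo_*() names are hypothetical
stand-ins, not passt functions: variant A forwards the caller's location
the way packet_add_all_do() does, variant B lets the wrapper report its
own location.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins, not the passt functions themselves */
static const char *last_func;	/* call site recorded by the last failure */
static int last_line;

/* Innermost helper: reports whatever call site it was handed */
static void demo_packet_add_do(size_t len, const char *func, int line)
{
	assert(func != NULL);

	if (len == 0) {		/* pretend a zero length is the error case */
		last_func = func;
		last_line = line;
		fprintf(stderr, "bad packet, %s:%i\n", func, line);
	}
}
#define demo_packet_add(len) demo_packet_add_do(len, __func__, __LINE__)

/* Variant A: forward the caller's location through the wrapper */
static void demo_tap_add_do(size_t len, const char *func, int line)
{
	demo_packet_add_do(len, func, line);
}
#define demo_tap_add(len) demo_tap_add_do(len, __func__, __LINE__)

/* Variant B: use the plain macro, so the wrapper itself is reported */
static void demo_tap_add_local(size_t len)
{
	demo_packet_add(len);
}

/* Two fake handlers, each returning the function name that got reported */
static const char *demo_handler_forwarded(void)
{
	demo_tap_add(0);
	return last_func;
}

static const char *demo_handler_local(void)
{
	demo_tap_add_local(0);
	return last_func;
}
```

With variant A the log points at the handler, with variant B at the
wrapper, which is where one would start looking if the wrapper itself is
suspect.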

> +{
> +	const struct ethhdr *eh;
> +
> +	pcap(p, len);
> +
> +	eh = (struct ethhdr *)p;
> +
> +	if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> +		memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> +		proto_update_l2_buf(c->mac_guest, NULL);
> +	}
> +
> +	switch (ntohs(eh->h_proto)) {
> +	case ETH_P_ARP:
> +	case ETH_P_IP:
> +		packet_add_do(pool_tap4, len, p, func, line);
> +		break;
> +	case ETH_P_IPV6:
> +		packet_add_do(pool_tap6, len, p, func, line);
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
>  /**
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
> @@ -937,7 +976,6 @@ static void tap_sock_reset(struct ctx *c)
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now)
>  {
> -	const struct ethhdr *eh;
>  	ssize_t n, rem;
>  	char *p;
>  
> @@ -950,8 +988,7 @@ redo:
>  	p = pkt_buf;
>  	rem = 0;
>  
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>  
>  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
>  	if (n < 0) {
> @@ -978,37 +1015,18 @@ redo:
>  		/* Complete the partial read above before discarding a malformed
>  		 * frame, otherwise the stream will be inconsistent.
>  		 */
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU)
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU)
>  			goto next;
>  
> -		pcap(p, len);
> -
> -		eh = (struct ethhdr *)p;
> -
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, p);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, p);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, p);
>  
>  next:
>  		p += len;
>  		n -= len;
>  	}
>  
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>  
>  	/* We can't use EPOLLET otherwise. */
>  	if (rem)
> @@ -1033,35 +1051,18 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  redo:
>  	n = 0;
>  
> -	pool_flush(pool_tap4);
> -	pool_flush(pool_tap6);
> +	pool_flush_all();
>  restart:
>  	while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
> -		const struct ethhdr *eh = (struct ethhdr *)(pkt_buf + n);
>  
> -		if (len < (ssize_t)sizeof(*eh) || len > (ssize_t)ETH_MAX_MTU) {
> +		if (len < (ssize_t)sizeof(struct ethhdr) ||
> +		    len > (ssize_t)ETH_MAX_MTU) {
>  			n += len;
>  			continue;
>  		}
>  
> -		pcap(pkt_buf + n, len);
>  
> -		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
> -			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
> -			proto_update_l2_buf(c->mac_guest, NULL);
> -		}
> -
> -		switch (ntohs(eh->h_proto)) {
> -		case ETH_P_ARP:
> -		case ETH_P_IP:
> -			packet_add(pool_tap4, len, pkt_buf + n);
> -			break;
> -		case ETH_P_IPV6:
> -			packet_add(pool_tap6, len, pkt_buf + n);
> -			break;
> -		default:
> -			break;
> -		}
> +		packet_add_all(c, len, pkt_buf + n);
>  
>  		if ((n += len) == TAP_BUF_BYTES)
>  			break;
> @@ -1072,8 +1073,7 @@ restart:
>  
>  	ret = errno;
>  
> -	tap4_handler(c, pool_tap4, now);
> -	tap6_handler(c, pool_tap6, now);
> +	tap_handler_all(c, now);
>  
>  	if (len > 0 || ret == EAGAIN)
>  		return;
> diff --git a/tap.h b/tap.h
> index 437b9aa2b43f..7157ef37ee6e 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -82,5 +82,12 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  void tap_sock_init(struct ctx *c);
> +void pool_flush_all(void);
> +void tap_handler_all(struct ctx *c, const struct timespec *now);
> +
> +void packet_add_do(struct pool *p, size_t len, const char *start,
> +		   const char *func, int line);
> +#define packet_add_all(p, len, start)					\
> +	packet_add_all_do(p, len, start, __func__, __LINE__)
>  
>  #endif /* TAP_H */

-- 
Stefano


* Re: [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX()
  2024-02-02 14:11 ` [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX() Laurent Vivier
  2024-02-06  1:59   ` David Gibson
@ 2024-02-11 23:16   ` Stefano Brivio
  1 sibling, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:16 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:41 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  udp.c | 126 ++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 73 insertions(+), 53 deletions(-)
> 
> diff --git a/udp.c b/udp.c
> index db635742319b..77168fb0a2af 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -562,47 +562,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
> -			      const struct timespec *now)
> +static size_t udp_update_hdr4(const struct ctx *c, struct iphdr *iph,
> +			      size_t data_len, struct sockaddr_in *s_in,
> +			      in_port_t dstport, const struct timespec *now)

Function comment should be updated to reflect the new set of parameters.

>  {
> -	struct udp4_l2_buf_t *b = &udp4_l2_buf[n];
> +	struct udphdr *uh = (struct udphdr *)(iph + 1);
>  	in_port_t src_port;
>  	size_t ip_len;
>  
> -	ip_len = udp4_l2_mh_sock[n].msg_len + sizeof(b->iph) + sizeof(b->uh);
> +	ip_len = data_len + sizeof(struct iphdr) + sizeof(struct udphdr);

ip_len takes into account the size of iph and uh, both local, so for
consistency with the rest of the codebase, + sizeof(*iph) + sizeof(*uh).

>  
> -	b->iph.tot_len = htons(ip_len);
> -	b->iph.daddr = c->ip4.addr_seen.s_addr;
> +	iph->tot_len = htons(ip_len);
> +	iph->daddr = c->ip4.addr_seen.s_addr;
>  
> -	src_port = ntohs(b->s_in.sin_port);
> +	src_port = ntohs(s_in->sin_port);
>  
>  	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) &&
> -	    IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.dns_host) &&
> +	    IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.dns_host) &&
>  	    src_port == 53) {
> -		b->iph.saddr = c->ip4.dns_match.s_addr;
> -	} else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) ||
> -		   IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)||
> -		   IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) {
> -		b->iph.saddr = c->ip4.gw.s_addr;
> +		iph->saddr = c->ip4.dns_match.s_addr;
> +	} else if (IN4_IS_ADDR_LOOPBACK(&s_in->sin_addr) ||
> +		   IN4_IS_ADDR_UNSPECIFIED(&s_in->sin_addr)||
> +		   IN4_ARE_ADDR_EQUAL(&s_in->sin_addr, &c->ip4.addr_seen)) {
> +		iph->saddr = c->ip4.gw.s_addr;
>  		udp_tap_map[V4][src_port].ts = now->tv_sec;
>  		udp_tap_map[V4][src_port].flags |= PORT_LOCAL;
>  
> -		if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen))
> +		if (IN4_ARE_ADDR_EQUAL(&s_in->sin_addr.s_addr, &c->ip4.addr_seen))
>  			udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK;
>  		else
>  			udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK;
>  
>  		bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port);
>  	} else {
> -		b->iph.saddr = b->s_in.sin_addr.s_addr;
> +		iph->saddr = s_in->sin_addr.s_addr;
>  	}
>  
> -	b->iph.check = ipv4_hdr_checksum(&b->iph, IPPROTO_UDP);
> -	b->uh.source = b->s_in.sin_port;
> -	b->uh.dest = htons(dstport);
> -	b->uh.len = htons(udp4_l2_mh_sock[n].msg_len + sizeof(b->uh));
> +	iph->check = ipv4_hdr_checksum(iph, IPPROTO_UDP);
> +	uh->source = s_in->sin_port;
> +	uh->dest = htons(dstport);
> +	uh->len= htons(data_len + sizeof(struct udphdr));

Missing whitespace.

>  
> -	return tap_iov_len(c, &b->taph, ip_len);
> +	return ip_len;
>  }
>  
>  /**
> @@ -614,38 +615,39 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
>   *
>   * Return: size of tap frame with headers
>   */
> -static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
> -			      const struct timespec *now)
> +static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> +			      size_t data_len, struct sockaddr_in6 *s_in6,
> +			      in_port_t dstport, const struct timespec *now)

Same here, function comment should be updated.

>  {
> -	struct udp6_l2_buf_t *b = &udp6_l2_buf[n];
> +	struct udphdr *uh = (struct udphdr *)(ip6h + 1);
>  	struct in6_addr *src;
>  	in_port_t src_port;
>  	size_t ip_len;
>  
> -	src = &b->s_in6.sin6_addr;
> -	src_port = ntohs(b->s_in6.sin6_port);
> +	src = &s_in6->sin6_addr;
> +	src_port = ntohs(s_in6->sin6_port);
>  
> -	ip_len = udp6_l2_mh_sock[n].msg_len + sizeof(b->ip6h) + sizeof(b->uh);
> +	ip_len = data_len + sizeof(struct ipv6hdr) + sizeof(struct udphdr);
>  
> -	b->ip6h.payload_len = htons(udp6_l2_mh_sock[n].msg_len + sizeof(b->uh));
> +	ip6h->payload_len = htons(data_len + sizeof(struct udphdr));
>  
>  	if (IN6_IS_ADDR_LINKLOCAL(src)) {
> -		b->ip6h.daddr = c->ip6.addr_ll_seen;
> -		b->ip6h.saddr = b->s_in6.sin6_addr;
> +		ip6h->daddr = c->ip6.addr_ll_seen;
> +		ip6h->saddr = s_in6->sin6_addr;
>  	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) &&
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) &&
>  		   src_port == 53) {
> -		b->ip6h.daddr = c->ip6.addr_seen;
> -		b->ip6h.saddr = c->ip6.dns_match;
> +		ip6h->daddr = c->ip6.addr_seen;
> +		ip6h->saddr = c->ip6.dns_match;
>  	} else if (IN6_IS_ADDR_LOOPBACK(src)			||
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen)	||
>  		   IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) {
> -		b->ip6h.daddr = c->ip6.addr_ll_seen;
> +		ip6h->daddr = c->ip6.addr_ll_seen;
>  
>  		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> -			b->ip6h.saddr = c->ip6.gw;
> +			ip6h->saddr = c->ip6.gw;
>  		else
> -			b->ip6h.saddr = c->ip6.addr_ll;
> +			ip6h->saddr = c->ip6.addr_ll;
>  
>  		udp_tap_map[V6][src_port].ts = now->tv_sec;
>  		udp_tap_map[V6][src_port].flags |= PORT_LOCAL;
> @@ -662,20 +664,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>  
>  		bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port);
>  	} else {
> -		b->ip6h.daddr = c->ip6.addr_seen;
> -		b->ip6h.saddr = b->s_in6.sin6_addr;
> +		ip6h->daddr = c->ip6.addr_seen;
> +		ip6h->saddr = s_in6->sin6_addr;
>  	}
>  
> -	b->uh.source = b->s_in6.sin6_port;
> -	b->uh.dest = htons(dstport);
> -	b->uh.len = b->ip6h.payload_len;
> -	b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
> -			 proto_ipv6_header_checksum(&b->ip6h, IPPROTO_UDP));
> -	b->ip6h.version = 6;
> -	b->ip6h.nexthdr = IPPROTO_UDP;
> -	b->ip6h.hop_limit = 255;
> +	uh->source = s_in6->sin6_port;
> +	uh->dest = htons(dstport);
> +	uh->len = ip6h->payload_len;
> +	uh->check = csum(uh, ntohs(ip6h->payload_len),
> +			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> +	ip6h->version = 6;
> +	ip6h->nexthdr = IPPROTO_UDP;
> +	ip6h->hop_limit = 255;
>  
> -	return tap_iov_len(c, &b->taph, ip_len);
> +	return ip_len;
>  }
>  
>  /**
> @@ -689,6 +691,11 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport,
>   *
>   * Return: size of tap frame with headers
>   */
> +#pragma GCC diagnostic push
> +/* ignore unaligned pointer value warning for &udp6_l2_buf[i].ip6h and 
> + * &udp4_l2_buf[i].iph

...this is the reason why I originally wrote these functions this way:
for AVX2 builds, we align ip6h because we need to calculate the
checksum (the UDP checksum for IPv6 needs a version of the IPv6 header
as pseudo-header).

But for non-AVX2 builds, and for IPv4, we don't need any alignment
other than 4-byte alignment of the start (s_in / s_in6). If you pass
iph or ip6h as arguments and dereference them, then they need to be
aligned (to 4 bytes) to be safe on all the architectures we might
reasonably be running on.

Passing &udp6_l2_buf[n] and dereferencing only that should work (gcc
checks are rather reliable in my experience, you don't have to go and
test this on all the possible architectures). Would that work for you?

Otherwise, we can pad as we do for AVX2 builds, but we'll waste bytes.
In this sense, the existing version is not ideal either:

$ CFLAGS="-g" make
$ pahole passt | less

[...]

struct udp4_l2_buf_t {
        struct sockaddr_in         s_in;                 /*     0    16 */
        struct tap_hdr             taph;                 /*    16    18 */
        struct iphdr               iph;                  /*    34    20 */
        struct udphdr              uh;                   /*    54     8 */
        uint8_t                    data[65507];          /*    62 65507 */

        /* size: 65572, cachelines: 1025, members: 5 */
        /* padding: 3 */
        /* last cacheline: 36 bytes */
} __attribute__((__aligned__(4)));

while in general we try to keep those structures below or at exactly
64 KiB. If we make 'data' smaller, though, we might truncate messages.

But at the same time, we need to keep struct sockaddr_in (or bigger) and
struct tap_hdr in here, plus padding for AVX2 builds.

Given that ~64 KiB messages are not common in practice, I wonder if we
could keep parallel arrays with sets of those few excess bytes we need,
using them as second items of iovec arrays, which will almost never be
used.

Anyway, this is beyond the scope of this patch. About this change
itself, I guess passing (and dereferencing) the head of the buffer, or
aligning iph / ip6h should be enough.
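
As a rough illustration of the alignment point, the sketch below rebuilds
the layout from the pahole output with stand-in types. The struct and
function names are hypothetical, and only the member sizes (16, 18, 20, 8)
come from the output above. It assumes a GCC-compatible compiler for the
packed/aligned attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct iphdr: 20 bytes, natural alignment 4 */
struct demo_iphdr {
	uint8_t  ver_ihl;
	uint8_t  tos;
	uint16_t tot_len;
	uint16_t id;
	uint16_t frag_off;
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t check;
	uint32_t saddr;
	uint32_t daddr;
};

/* Stand-in for udp4_l2_buf_t, member sizes taken from the pahole output */
struct demo_l2_buf {
	uint8_t s_in[16];
	uint8_t taph[18];
	struct demo_iphdr iph;	/* lands at offset 34 */
	uint8_t uh[8];
} __attribute__((packed, aligned(4)));

/* How far the iph member is from its natural alignment, in bytes */
static size_t demo_iph_misalignment(void)
{
	return offsetof(struct demo_l2_buf, iph) % _Alignof(struct demo_iphdr);
}

/* Access through the buffer head: the compiler knows the struct is
 * packed and emits an access that is safe for the misaligned member
 */
static void demo_set_saddr(struct demo_l2_buf *b, uint32_t addr)
{
	assert(b != NULL);
	b->iph.saddr = addr;
}
```

Taking &b->iph and dereferencing that pointer elsewhere is what loses the
"packed" information, and it is what -Waddress-of-packed-member warns
about.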

> + */
> +#pragma GCC diagnostic ignored "-Waddress-of-packed-member"
>  static void udp_tap_send(const struct ctx *c,
>  			 unsigned int start, unsigned int n,
>  			 in_port_t dstport, bool v6, const struct timespec *now)
> @@ -702,18 +709,31 @@ static void udp_tap_send(const struct ctx *c,
>  		tap_iov = udp4_l2_iov_tap;
>  
>  	for (i = start; i < start + n; i++) {
> -		size_t buf_len;
> -
> -		if (v6)
> -			buf_len = udp_update_hdr6(c, i, dstport, now);
> -		else
> -			buf_len = udp_update_hdr4(c, i, dstport, now);
> -
> -		tap_iov[i].iov_len = buf_len;
> +		size_t ip_len;
> +
> +		if (v6) {
> +			ip_len = udp_update_hdr6(c, &udp6_l2_buf[i].ip6h,
> +						 udp6_l2_mh_sock[i].msg_len,
> +						 &udp6_l2_buf[i].s_in6, dstport,
> +						 now);
> +			tap_iov[i].iov_len = tap_iov_len(c,
> +							 &udp6_l2_buf[i].taph,
> +							 ip_len);
> +		} else {
> +			ip_len = udp_update_hdr4(c, &udp4_l2_buf[i].iph,
> +						 udp4_l2_mh_sock[i].msg_len,
> +						 &udp4_l2_buf[i].s_in,
> +						 dstport, now);
> +
> +			tap_iov[i].iov_len = tap_iov_len(c,
> +							 &udp4_l2_buf[i].taph,
> +							 ip_len);
> +		}
>  	}
>  
>  	tap_send_frames(c, tap_iov + start, n);
>  }
> +#pragma GCC diagnostic pop
>  
>  /**
>   * udp_sock_handler() - Handle new data from socket

-- 
Stefano


* Re: [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler()
  2024-02-06  2:14   ` David Gibson
@ 2024-02-11 23:17     ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:17 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Tue, 6 Feb 2024 13:14:41 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:42PM +0100, Laurent Vivier wrote:
> > We are going to introduce a variant of the function to use
> > vhost-user buffers rather than passt internal buffers.  
> 
> Not entirely sure the new name really conveys that distinction.

Me neither: vhost-user uses buffers too. I'd rather keep the
udp_sock_handler() name here, I don't see a particular problem having
udp_vu_sock_handler() on top of that (unless we can avoid duplicating
code as David suggested for 23/24).

> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  passt.c | 2 +-
> >  udp.c   | 6 +++---
> >  udp.h   | 4 ++--
> >  3 files changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/passt.c b/passt.c
> > index 10042a9b9789..c70caf464e61 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -389,7 +389,7 @@ loop:
> >  			tcp_timer_handler(&c, ref);
> >  			break;
> >  		case EPOLL_TYPE_UDP:
> > -			udp_sock_handler(&c, ref, eventmask, &now);
> > +			udp_buf_sock_handler(&c, ref, eventmask, &now);
> >  			break;
> >  		case EPOLL_TYPE_ICMP:
> >  			icmp_sock_handler(&c, AF_INET, ref);
> > diff --git a/udp.c b/udp.c
> > index 77168fb0a2af..9c56168c6340 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -736,7 +736,7 @@ static void udp_tap_send(const struct ctx *c,
> >  #pragma GCC diagnostic pop
> >  
> >  /**
> > - * udp_sock_handler() - Handle new data from socket
> > + * udp_buf_sock_handler() - Handle new data from socket
> >   * @c:		Execution context
> >   * @ref:	epoll reference
> >   * @events:	epoll events bitmap
> > @@ -744,8 +744,8 @@ static void udp_tap_send(const struct ctx *c,
> >   *
> >   * #syscalls recvmmsg
> >   */
> > -void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> > -		      const struct timespec *now)
> > +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> > +			  const struct timespec *now)
> >  {
> >  	/* For not entirely clear reasons (data locality?) pasta gets
> >  	 * better throughput if we receive tap datagrams one at a
> > diff --git a/udp.h b/udp.h
> > index 087e4820f93c..6c8519e87f1a 100644
> > --- a/udp.h
> > +++ b/udp.h
> > @@ -9,8 +9,8 @@
> >  #define UDP_TIMER_INTERVAL		1000 /* ms */
> >  
> >  void udp_portmap_clear(void);
> > -void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> > -		      const struct timespec *now);
> > +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
> > +			  const struct timespec *now);
> >  int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  		    const void *saddr, const void *daddr,
> >  		    const struct pool *p, int idx, const struct timespec *now);  
> 

-- 
Stefano



* Re: [PATCH 16/24] packet: replace struct desc by struct iovec
  2024-02-06  2:25   ` David Gibson
@ 2024-02-11 23:18     ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:18 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Tue, 6 Feb 2024 13:25:18 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:43PM +0100, Laurent Vivier wrote:
> 
> Rationale please.  It's probably also worth noting that this does
> replace struct desc with a larger structure - struct desc is already
> padded out to 8 bytes, but on 64-bit machines iovec will be larger
> still.

Right, yes, that becomes 16 bytes (from 8), and those arrays are
already quite large. I wonder if we can keep struct desc, but I have no
idea how complicated it is.
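
For a rough sense of scale, here is a minimal sketch comparing the two
layouts. struct desc is copied from the packet.h hunk below; the helper
names are made up for illustration, and the 16-byte figure assumes a
64-bit target:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Copied from the packet.h hunk in this patch */
struct desc {
	uint32_t offset;
	uint16_t len;
};

/* 4 + 2 bytes of members, padded to the 4-byte alignment of @offset */
static_assert(sizeof(struct desc) == 8, "struct desc is padded to 8 bytes");

/* Hypothetical helpers: bytes taken by @n descriptors of each kind */
static inline size_t desc_array_size(size_t n)
{
	return n * sizeof(struct desc);
}

static inline size_t iovec_array_size(size_t n)
{
	return n * sizeof(struct iovec);
}
```

On a 64-bit build, struct iovec holds a pointer and a size_t, so
iovec_array_size(n) comes to 16 * n against 8 * n for
desc_array_size(n): each pool's descriptor array doubles in size.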

> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  packet.c | 75 +++++++++++++++++++++++++++++++-------------------------
> >  packet.h | 14 ++---------
> >  2 files changed, 43 insertions(+), 46 deletions(-)
> > 
> > diff --git a/packet.c b/packet.c
> > index ccfc84607709..af2a539a1794 100644
> > --- a/packet.c
> > +++ b/packet.c
> > @@ -22,6 +22,36 @@
> >  #include "util.h"
> >  #include "log.h"
> >  
> > +static int packet_check_range(const struct pool *p, size_t offset, size_t len,
> > +			      const char *start, const char *func, int line)
> > +{
> > +	if (start < p->buf) {
> > +		if (func) {
> > +			trace("add packet start %p before buffer start %p, "
> > +			      "%s:%i", (void *)start, (void *)p->buf, func, line);
> > +		}
> > +		return -1;  
> 
> I guess not really in scope for this patch, but IIUC the only place
> we'd hit this is if the caller has done something badly wrong, so
> possibly it should be an ASSERT().

I wouldn't terminate passt in any of these cases, because that can turn
a safety check into a possibility for a denial of service. These checks
are here not to avoid obvious logic issues, but rather to make them
less harmful if present.

In general, we assume that the hypervisor is able to wreck its own
connectivity, but we should make it harder for users in the guest to do
so.
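
A minimal sketch of that approach, not the code from this patch:
struct buf_range and range_check() are hypothetical names, and the
arithmetic is arranged so that it can't overflow. A failed check is
reported to the caller, which can drop the frame and carry on, instead
of terminating the process:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the buffer part of struct pool */
struct buf_range {
	const char *buf;	/* Start of backing buffer */
	size_t size;		/* Size of backing buffer */
};

/* Return 0 if @len bytes at @start + @offset lie within the buffer,
 * -1 otherwise: the caller drops the packet and keeps running.
 */
static inline int range_check(const struct buf_range *r, const char *start,
			      size_t offset, size_t len)
{
	uintptr_t base = (uintptr_t)r->buf, s = (uintptr_t)start;
	size_t avail;

	if (s < base || s - base > r->size)
		return -1;	/* Starts before or after the buffer */

	avail = r->size - (s - base);
	if (offset > avail || len > avail - offset)
		return -1;	/* Runs past the end of the buffer */

	return 0;
}
```

With an ASSERT() in place of the return -1 paths, a malformed
descriptor would be enough to stop the whole process.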

> > +	}
> > +
> > +	if (start + len + offset > p->buf + p->buf_size) {
> > +		if (func) {
> > +			trace("packet offset plus length %lu from size %lu, "
> > +			      "%s:%i", start - p->buf + len + offset,
> > +			      p->buf_size, func, line);
> > +		}
> > +		return -1;
> > +	}  
> 
> Same with this one.
> 
> > +
> > +#if UINTPTR_MAX == UINT64_MAX
> > +	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
> > +		trace("add packet start %p, buffer start %p, %s:%i",
> > +		      (void *)start, (void *)p->buf, func, line);
> > +		return -1;
> > +	}
> > +#endif  
> 
> This one is relevant to this patch though - because you're expanding
> struct desc's 32-bit offset to full void * from struct iovec, this
> check is no longer relevant.
> 
> > +
> > +	return 0;
> > +}
> >  /**
> >   * packet_add_do() - Add data as packet descriptor to given pool
> >   * @p:		Existing pool
> > @@ -41,34 +71,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
> >  		return;
> >  	}
> >  
> > -	if (start < p->buf) {
> > -		trace("add packet start %p before buffer start %p, %s:%i",
> > -		      (void *)start, (void *)p->buf, func, line);
> > +	if (packet_check_range(p, 0, len, start, func, line))
> >  		return;
> > -	}
> > -
> > -	if (start + len > p->buf + p->buf_size) {
> > -		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
> > -		      (void *)start, len, (void *)(p->buf + p->buf_size),
> > -		      func, line);
> > -		return;
> > -	}
> >  
> >  	if (len > UINT16_MAX) {
> >  		trace("add packet length %zu, %s:%i", len, func, line);
> >  		return;
> >  	}
> >  
> > -#if UINTPTR_MAX == UINT64_MAX
> > -	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
> > -		trace("add packet start %p, buffer start %p, %s:%i",
> > -		      (void *)start, (void *)p->buf, func, line);
> > -		return;
> > -	}
> > -#endif
> > -
> > -	p->pkt[idx].offset = start - p->buf;
> > -	p->pkt[idx].len = len;
> > +	p->pkt[idx].iov_base = (void *)start;
> > +	p->pkt[idx].iov_len = len;
> >  
> >  	p->count++;
> >  }
> > @@ -104,28 +116,23 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
> >  		return NULL;
> >  	}
> >  
> > -	if (p->pkt[idx].offset + len + offset > p->buf_size) {
> > +	if (len + offset > p->pkt[idx].iov_len) {
> >  		if (func) {
> > -			trace("packet offset plus length %zu from size %zu, "
> > -			      "%s:%i", p->pkt[idx].offset + len + offset,
> > -			      p->buf_size, func, line);
> > +			trace("data length %zu, offset %zu from length %zu, "
> > +			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> > +			      func, line);
> >  		}
> >  		return NULL;
> >  	}
> >  
> > -	if (len + offset > p->pkt[idx].len) {
> > -		if (func) {
> > -			trace("data length %zu, offset %zu from length %u, "
> > -			      "%s:%i", len, offset, p->pkt[idx].len,
> > -			      func, line);
> > -		}
> > +	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
> > +			       func, line))
> >  		return NULL;
> > -	}
> >  
> >  	if (left)
> > -		*left = p->pkt[idx].len - offset - len;
> > +		*left = p->pkt[idx].iov_len - offset - len;
> >  
> > -	return p->buf + p->pkt[idx].offset + offset;
> > +	return (char *)p->pkt[idx].iov_base + offset;
> >  }
> >  
> >  /**
> > diff --git a/packet.h b/packet.h
> > index a784b07bbed5..8377dcf678bb 100644
> > --- a/packet.h
> > +++ b/packet.h
> > @@ -6,16 +6,6 @@
> >  #ifndef PACKET_H
> >  #define PACKET_H
> >  
> > -/**
> > - * struct desc - Generic offset-based descriptor within buffer
> > - * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
> > - * @len:	Length of descriptor, host order, 16-bit limit
> > - */
> > -struct desc {
> > -	uint32_t offset;
> > -	uint16_t len;
> > -};
> > -
> >  /**
> >   * struct pool - Generic pool of packets stored in a buffer
> >   * @buf:	Buffer storing packet descriptors
> > @@ -29,7 +19,7 @@ struct pool {
> >  	size_t buf_size;
> >  	size_t size;
> >  	size_t count;
> > -	struct desc pkt[1];
> > +	struct iovec pkt[1];
> >  };
> >  
> >  void packet_add_do(struct pool *p, size_t len, const char *start,
> > @@ -54,7 +44,7 @@ struct _name ## _t {							\
> >  	size_t buf_size;						\
> >  	size_t size;							\
> >  	size_t count;							\
> > -	struct desc pkt[_size];						\
> > +	struct iovec pkt[_size];					\
> >  }
> >  
> >  #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\  
> 

-- 
Stefano



* Re: [PATCH 18/24] vhost-user: introduce virtio API
  2024-02-06  3:51   ` David Gibson
@ 2024-02-11 23:18     ` Stefano Brivio
  2024-02-12  2:26       ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:18 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Tue, 6 Feb 2024 14:51:31 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:45PM +0100, Laurent Vivier wrote:
> > Add virtio.c and virtio.h that define the functions needed
> > to manage virtqueues.
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>  
> 
> When importing a batch of code from outside, I think we need to choose
> between one of two extremes:
> 
>   1) Treat this as a "vendored" dependency.  Keep the imported code
>      byte-for-byte identical to the original source, and possibly have
>      some integration glue in different files
> 
>   2) Fully assimilate: treat this as our own code, inspired by the
>      original source.  Rewrite as much as we need to match our own
>      conventions.
> 
> Currently, this is somewhere in between: we have some changes for the
> passt tree (e.g. tab indents), but other things retain qemu style
> (e.g. CamelCase, typedefs, and braces around single line clauses).

I'd rather pick 2) if possible, in the hope that we can cut down on
lines of code, but I haven't really checked how much we use of this.

-- 
Stefano



* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-02 14:11 ` [PATCH 20/24] vhost-user: add vhost-user Laurent Vivier
  2024-02-07  2:40   ` David Gibson
@ 2024-02-11 23:19   ` Stefano Brivio
  2024-02-12  2:49     ` David Gibson
  1 sibling, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:19 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri,  2 Feb 2024 15:11:47 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  conf.c  | 20 ++++++++++++++--
>  passt.c |  7 ++++++
>  passt.h |  1 +
>  tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
>  tcp.c   |  8 +++++--
>  udp.c   |  6 +++--
>  6 files changed, 90 insertions(+), 25 deletions(-)

This would need a matching change in the man page, passt.1, at least
documenting the --vhost-user option and adjusting descriptions about
the guest communication interface (look for "UNIX domain" there).

> diff --git a/conf.c b/conf.c
> index b6a2a1f0fdc3..40aa9519f8a6 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -44,6 +44,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
>  		info(   "  -I, --ns-ifname NAME	namespace interface name");
>  		info(   "    default: same interface name as external one");
>  	} else {
> -		info(   "  -s, --socket PATH	UNIX domain socket path");
> +		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");

I don't get the point of --socket-path. It's handled just like -s
anyway, right? Why can't it just be -s / --socket?

>  		info(   "    default: probe free path starting from "
>  		     UNIX_SOCK_PATH, 1);
> +		info(   "  --vhost-user		Enable vhost-user mode");
> +		info(   "    UNIX domain socket is provided by -s option");
> +		info(   "  --print-capabilities	print back-end capabilities in JSON format");

Instead of introducing a new option, couldn't we have these printed
unconditionally with debug()? I guess it's debug-level stuff anyway.

>  	}
>  
>  	info(   "  -F, --fd FD		Use FD as pre-opened connected socket");
> @@ -1123,6 +1127,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"help",	no_argument,		NULL,		'h' },
>  		{"socket",	required_argument,	NULL,		's' },
>  		{"fd",		required_argument,	NULL,		'F' },
> +		{"socket-path",	required_argument,	NULL,		's' }, /* vhost-user mandatory */
>  		{"ns-ifname",	required_argument,	NULL,		'I' },
>  		{"pcap",	required_argument,	NULL,		'p' },
>  		{"pid",		required_argument,	NULL,		'P' },
> @@ -1169,6 +1174,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"config-net",	no_argument,		NULL,		17 },
>  		{"no-copy-routes", no_argument,		NULL,		18 },
>  		{"no-copy-addrs", no_argument,		NULL,		19 },
> +		{"vhost-user",	no_argument,		NULL,		20 },
> +		{"print-capabilities", no_argument,	NULL,		21 }, /* vhost-user mandatory */
>  		{ 0 },
>  	};
>  	char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
> @@ -1328,7 +1335,6 @@ void conf(struct ctx *c, int argc, char **argv)
>  				       sizeof(c->ip6.ifname_out), "%s", optarg);
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
>  				die("Invalid interface name: %s", optarg);
> -
>  			break;
>  		case 17:
>  			if (c->mode != MODE_PASTA)
> @@ -1350,6 +1356,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			warn("--no-copy-addrs will be dropped soon");
>  			c->no_copy_addrs = copy_addrs_opt = true;
>  			break;
> +		case 20:
> +			if (c->mode == MODE_PASTA) {

And, if c->mode == MODE_VU, the option is redundant (see e.g. the handling
for --one-off).
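
A minimal sketch of that handling, assuming the same pattern as for
other mode-specific options. The function name and the messages are
hypothetical, not taken from conf.c:

```c
enum passt_modes {
	MODE_PASST,
	MODE_PASTA,
	MODE_VU,
};

/* Hypothetical helper: switch to MODE_VU, rejecting pasta mode and
 * repeated options. Returns 0 on success, -1 with *reason set on error.
 */
static inline int set_vhost_user(enum passt_modes *mode, const char **reason)
{
	if (*mode == MODE_PASTA) {
		*reason = "--vhost-user is for passt mode only";
		return -1;
	}

	if (*mode == MODE_VU) {
		*reason = "Redundant --vhost-user option";
		return -1;
	}

	*mode = MODE_VU;
	return 0;
}
```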

> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0]);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 21:
> +			vu_print_capabilities();
> +			break;
>  		case 'd':
>  			if (c->debug)
>  				die("Multiple --debug options given");
> diff --git a/passt.c b/passt.c
> index 95034d73381f..952aded12848 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -282,6 +282,7 @@ int main(int argc, char **argv)
>  	quit_fd = pasta_netns_quit_init(&c);
>  
>  	tap_sock_init(&c);
> +	vu_init(&c);
>  
>  	secret_init(&c);
>  
> @@ -399,6 +400,12 @@ loop:
>  		case EPOLL_TYPE_ICMPV6:
>  			icmp_sock_handler(&c, AF_INET6, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			tap_handler_vu(&c, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(&c, ref);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 6ed1d0b19e82..4e0100d51a4d 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -141,6 +141,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> diff --git a/tap.c b/tap.c
> index 936206e53637..c2a917bc00ca 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -57,6 +57,7 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -75,19 +76,22 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
>   */
>  int tap_send(const struct ctx *c, const void *data, size_t len)
>  {
> -	pcap(data, len);
> +	int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> +	uint32_t vnet_len = htonl(len);
>  
> -	if (c->mode == MODE_PASST) {
> -		int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> -		uint32_t vnet_len = htonl(len);
> +	pcap(data, len);
>  
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		if (send(c->fd_tap, &vnet_len, 4, flags) < 0)
>  			return -1;
> -

Unrelated change.

>  		return send(c->fd_tap, data, len, flags);
> +	case MODE_PASTA:
> +		return write(c->fd_tap, (char *)data, len);
> +	case MODE_VU:
> +		return vu_send(c, data, len);
>  	}
> -
> -	return write(c->fd_tap, (char *)data, len);
> +	return 0;
>  }
>  
>  /**
> @@ -428,10 +432,20 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
>  	if (!n)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, n);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, n);
> +		break;
> +	case MODE_VU:
> +		m = tap_send_frames_vu(c, iov, n);
> +		break;
> +	default:
> +		m = 0;
> +		break;
> +	}
>  
>  	if (m < n)
>  		debug("tap: failed to send %zu frames of %zu", n - m, n);
> @@ -1149,11 +1163,17 @@ static void tap_sock_unix_init(struct ctx *c)
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
>  
> -	info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> -	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> -	     addr.sun_path);
> -	info("or qrap, for earlier qemu versions:");
> -	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	if (c->mode == MODE_VU) {
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",

We'll never make it nice, but at least we can make it shorter by using
single characters for device names (see "id=s" for the socket) and
perhaps select something reasonable, say 1G, for $RAMSIZE? I haven't
tried other ways yet.
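
As a sketch of the shorter form, with the same QEMU options as in the
hunk above, single-character IDs and 1G picked arbitrarily (the
function name is hypothetical, and this command line is not tested
against QEMU):

```c
#include <stdio.h>

/* Hypothetical helper: format a shortened QEMU hint for socket @path
 * into @buf, returning what snprintf() returns.
 */
static inline int vu_qemu_hint(char *buf, size_t size, const char *path)
{
	return snprintf(buf, size,
			"kvm ... -chardev socket,id=c,path=%s"
			" -netdev vhost-user,id=n,chardev=c"
			" -device virtio-net,netdev=n"
			" -object memory-backend-memfd,id=m,share=on,size=1G"
			" -numa node,memdev=m",
			path);
}
```

The commit message pairs -m 4G with size=4G, so a fixed size in the
hint would presumably have to come with a matching -m.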

> +		     addr.sun_path);
> +	} else {
> +		info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		     addr.sun_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	}
>  }
>  
>  /**
> @@ -1163,7 +1183,7 @@ static void tap_sock_unix_init(struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
> +	union epoll_ref ref;

...then, without the initializer, it should go after 'int v'
(declarations are ordered from longest to shortest).

>  	struct epoll_event ev = { 0 };
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
> @@ -1204,7 +1224,13 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> -	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +	if (c->mode == MODE_VU) {
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +		ev.events = EPOLLIN | EPOLLRDHUP;
> +	} else {
> +		ref.type = EPOLL_TYPE_TAP_PASST;
> +		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> +	}
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
> @@ -1288,12 +1314,21 @@ void tap_sock_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			ev.events = EPOLLIN | EPOLLRDHUP;
> +			break;
> +		}
>  
> -		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
>  		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  		return;
> diff --git a/tcp.c b/tcp.c
> index 54c15087d678..b6aca9f37f19 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1033,7 +1033,9 @@ size_t ipv4_fill_headers(const struct ctx *c,
>  
>  	tcp_set_tcp_header(th, conn, seq);
>  
> -	th->check = tcp_update_check_tcp4(iph);
> +	th->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		th->check = tcp_update_check_tcp4(iph);
>  
>  	return ip_len;
>  }
> @@ -1069,7 +1071,9 @@ size_t ipv6_fill_headers(const struct ctx *c,
>  
>  	tcp_set_tcp_header(th, conn, seq);
>  
> -	th->check = tcp_update_check_tcp6(ip6h);
> +	th->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		th->check = tcp_update_check_tcp6(ip6h);
>  
>  	ip6h->hop_limit = 255;
>  	ip6h->version = 6;
> diff --git a/udp.c b/udp.c
> index a189c2e0b5a2..799a10989a91 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -671,8 +671,10 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
>  	uh->source = s_in6->sin6_port;
>  	uh->dest = htons(dstport);
>  	uh->len = ip6h->payload_len;
> -	uh->check = csum(uh, ntohs(ip6h->payload_len),
> -			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> +	uh->check = 0;
> +	if (c->mode != MODE_VU || *c->pcap)
> +		uh->check = csum(uh, ntohs(ip6h->payload_len),
> +				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
>  	ip6h->version = 6;
>  	ip6h->nexthdr = IPPROTO_UDP;
>  	ip6h->hop_limit = 255;

-- 
Stefano



* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-07  2:40   ` David Gibson
@ 2024-02-11 23:19     ` Stefano Brivio
  2024-02-12  2:47       ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-11 23:19 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Wed, 7 Feb 2024 13:40:33 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Feb 02, 2024 at 03:11:47PM +0100, Laurent Vivier wrote:
> > add virtio and vhost-user functions to connect with QEMU.
> > 
> >   $ ./passt --vhost-user
> > 
> > and
> > 
> >   # qemu-system-x86_64 ... -m 4G \
> >         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> >         -numa node,memdev=memfd0 \
> >         -chardev socket,id=chr0,path=/tmp/passt_1.socket \  
> 
> I think it would be wise to use different default socket names for
> vhost-user than for the qemu socket protocol.

I'm not sure if there's an obvious benefit (mix them up, and nothing
will work anyway). On the other hand, that means more typing and
remembering what's the separator between "passt", "vhost", and "user".

> Or even to require
> --socket-path: the reasons we have these rather weird default probed
> paths don't apply here, AFAICT.

Why not, actually? With probed paths, you can still reasonably start
passt by *typing* its command line. I do it all the time, and I think
it's quite nice to have.

-- 
Stefano



* Re: [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add()
  2024-02-11 23:15   ` Stefano Brivio
@ 2024-02-12  2:22     ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-12  2:22 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev, Laurent Vivier

On Mon, Feb 12, 2024 at 12:15:30AM +0100, Stefano Brivio wrote:
> On Fri,  2 Feb 2024 15:11:40 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > From: Laurent Vivier <laurent@vivier.eu>
> > 
> > Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> > ---
> >  tap.c | 98 +++++++++++++++++++++++++++++------------------------------
> >  tap.h |  7 +++++
> >  2 files changed, 56 insertions(+), 49 deletions(-)
> 
> I'm assuming that you need this patch to recycle those bits of "tap"
> functions for usage in vhost-user code... which shows they actually
> have little to do with tun/tap interfaces.
> 
> But sure, we already have there stuff to deal with UNIX domain sockets,
> so "tap" is already somewhat inconsistent.
> 
> If we use "tap" for a (long) moment to denote "anything guest/container
> facing", then:

We're definitely doing that at present.  With this third "tap" option,
which is even further from tuntap, I wonder whether we should rethink
that naming convention.  I was contemplating something along the lines
of "l2if", emphasizing that what these have in common is being a
transport operating at L2, unlike the "sock" side, which operates at
L4.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 18/24] vhost-user: introduce virtio API
  2024-02-11 23:18     ` Stefano Brivio
@ 2024-02-12  2:26       ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-12  2:26 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Mon, Feb 12, 2024 at 12:18:22AM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 14:51:31 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Feb 02, 2024 at 03:11:45PM +0100, Laurent Vivier wrote:
> > > Add virtio.c and virtio.h that define the functions needed
> > > to manage virtqueues.
> > > 
> > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>  
> > 
> > When importing a batch of code from outside, I think we need to choose
> > between one of two extremes:
> > 
> >   1) Treat this as a "vendored" dependency.  Keep the imported code
> >      byte-for-byte identical to the original source, and possibly have
> >      some integration glue in different files
> > 
> >   2) Fully assimilate: treat this as our own code, inspired by the
> >      original source.  Rewrite as much as we need to match our own
> >      conventions.
> > 
> > Currently, this is somewhere in between: we have some changes for the
> > passt tree (e.g. tab indents), but other things retain qemu style
> > (e.g. CamelCase, typedefs, and braces around single line clauses).
> 
> I'd rather pick 2) if possible, in the hope that we can cut down on
> lines of code, but I haven't really checked how much we use of this.

Given what I've seen so far, (2) would also be my inclination at this
point.  I do think either one is better than something in the middle
though.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-11 23:19     ` Stefano Brivio
@ 2024-02-12  2:47       ` David Gibson
  2024-02-13 15:22         ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-12  2:47 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Mon, Feb 12, 2024 at 12:19:22AM +0100, Stefano Brivio wrote:
> On Wed, 7 Feb 2024 13:40:33 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Feb 02, 2024 at 03:11:47PM +0100, Laurent Vivier wrote:
> > > add virtio and vhost-user functions to connect with QEMU.
> > > 
> > >   $ ./passt --vhost-user
> > > 
> > > and
> > > 
> > >   # qemu-system-x86_64 ... -m 4G \
> > >         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> > >         -numa node,memdev=memfd0 \
> > >         -chardev socket,id=chr0,path=/tmp/passt_1.socket \  
> > 
> > I think it would be wise to use different default socket names for
> > vhost-user than for the qemu socket protocol.
> 
> I'm not sure if there's an obvious benefit (mix them up, and nothing
> will work anyway). On the other hand, that means more typing and
> remembering what's the separator between "passt", "vhost", and "user".
> 
> > Or even to require
> > --socket-path: the reasons we have these rather weird default probed
> > paths don't apply here, AFAICT.
> 
> Why not, actually? With probed paths, you can still reasonably start
> passt by *typing* its command line. I do it all the time, and I think
> it's quite nice to have.

Uh.. I'm not sure how this would change that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-11 23:19   ` Stefano Brivio
@ 2024-02-12  2:49     ` David Gibson
  2024-02-12 10:02       ` Laurent Vivier
  0 siblings, 1 reply; 83+ messages in thread
From: David Gibson @ 2024-02-12  2:49 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Mon, Feb 12, 2024 at 12:19:01AM +0100, Stefano Brivio wrote:
> On Fri,  2 Feb 2024 15:11:47 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > add virtio and vhost-user functions to connect with QEMU.
> > 
> >   $ ./passt --vhost-user
> > 
> > and
> > 
> >   # qemu-system-x86_64 ... -m 4G \
> >         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> >         -numa node,memdev=memfd0 \
> >         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
> >         -netdev vhost-user,id=netdev0,chardev=chr0 \
> >         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
> >         ...
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  conf.c  | 20 ++++++++++++++--
> >  passt.c |  7 ++++++
> >  passt.h |  1 +
> >  tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
> >  tcp.c   |  8 +++++--
> >  udp.c   |  6 +++--
> >  6 files changed, 90 insertions(+), 25 deletions(-)
> 
> This would need a matching change in the man page, passt.1, at least
> documenting the --vhost-user option and adjusting descriptions about
> the guest communication interface (look for "UNIX domain" there).
> 
> > diff --git a/conf.c b/conf.c
> > index b6a2a1f0fdc3..40aa9519f8a6 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -44,6 +44,7 @@
> >  #include "lineread.h"
> >  #include "isolation.h"
> >  #include "log.h"
> > +#include "vhost_user.h"
> >  
> >  /**
> >   * next_chunk - Return the next piece of a string delimited by a character
> > @@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
> >  		info(   "  -I, --ns-ifname NAME	namespace interface name");
> >  		info(   "    default: same interface name as external one");
> >  	} else {
> > -		info(   "  -s, --socket PATH	UNIX domain socket path");
> > +		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");
> 
> I don't get the point of --socket-path. It's handled just like -s
> anyway, right? Why can't it just be -s / --socket?

I believe the issue is that there's an expected command line interface
for the vhost server, which uses --socket-path for, well, the socket
path.  Hence adding an alias to the existing passt option.
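
For illustration, a minimal sketch of how such an alias works with
getopt_long(): two long names map to the same value, so a single case
handles both. The function name is hypothetical, and the table is cut
down to the two entries that matter here:

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical helper: return the path given with -s, --socket or
 * --socket-path, or NULL if none of them was passed.
 */
static inline const char *socket_path_opt(int argc, char **argv)
{
	static const struct option opts[] = {
		{ "socket",      required_argument, NULL, 's' },
		{ "socket-path", required_argument, NULL, 's' }, /* alias */
		{ 0 },
	};
	const char *path = NULL;
	int name;

	optind = 1;	/* Restart the scan from the first argument */
	while ((name = getopt_long(argc, argv, "s:", opts, NULL)) != -1) {
		if (name == 's')
			path = optarg;
	}

	return path;
}
```

A caller written for the vhost-user conventions can pass
--socket-path, while existing passt command lines with -s or --socket
keep working unchanged.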

> >  		info(   "    default: probe free path starting from "
> >  		     UNIX_SOCK_PATH, 1);
> > +		info(   "  --vhost-user		Enable vhost-user mode");
> > +		info(   "    UNIX domain socket is provided by -s option");
> > +		info(   "  --print-capabilities	print back-end capabilities in JSON format");
> 
> Instead of introducing a new option, couldn't we have these printed
> unconditionally with debug()? I guess it's debug-level stuff anyway.

Likewise, I think this option is expected by the thing which starts
the vhost server.

> >  	}
> >  
> >  	info(   "  -F, --fd FD		Use FD as pre-opened connected socket");
> > @@ -1123,6 +1127,7 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		{"help",	no_argument,		NULL,		'h' },
> >  		{"socket",	required_argument,	NULL,		's' },
> >  		{"fd",		required_argument,	NULL,		'F' },
> > +		{"socket-path",	required_argument,	NULL,		's' }, /* vhost-user mandatory */
> >  		{"ns-ifname",	required_argument,	NULL,		'I' },
> >  		{"pcap",	required_argument,	NULL,		'p' },
> >  		{"pid",		required_argument,	NULL,		'P' },
> > @@ -1169,6 +1174,8 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		{"config-net",	no_argument,		NULL,		17 },
> >  		{"no-copy-routes", no_argument,		NULL,		18 },
> >  		{"no-copy-addrs", no_argument,		NULL,		19 },
> > +		{"vhost-user",	no_argument,		NULL,		20 },
> > +		{"print-capabilities", no_argument,	NULL,		21 }, /* vhost-user mandatory */
> >  		{ 0 },
> >  	};
> >  	char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
> > @@ -1328,7 +1335,6 @@ void conf(struct ctx *c, int argc, char **argv)
> >  				       sizeof(c->ip6.ifname_out), "%s", optarg);
> >  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
> >  				die("Invalid interface name: %s", optarg);
> > -
> >  			break;
> >  		case 17:
> >  			if (c->mode != MODE_PASTA)
> > @@ -1350,6 +1356,16 @@ void conf(struct ctx *c, int argc, char **argv)
> >  			warn("--no-copy-addrs will be dropped soon");
> >  			c->no_copy_addrs = copy_addrs_opt = true;
> >  			break;
> > +		case 20:
> > +			if (c->mode == MODE_PASTA) {
> 
> And, if c->mode == MODE_VU, the option is redundant (see e.g. the handling
> for --one-off).
> 
> > +				err("--vhost-user is for passt mode only");
> > +				usage(argv[0]);
> > +			}
> > +			c->mode = MODE_VU;
> > +			break;
> > +		case 21:
> > +			vu_print_capabilities();
> > +			break;
> >  		case 'd':
> >  			if (c->debug)
> >  				die("Multiple --debug options given");
> > diff --git a/passt.c b/passt.c
> > index 95034d73381f..952aded12848 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -282,6 +282,7 @@ int main(int argc, char **argv)
> >  	quit_fd = pasta_netns_quit_init(&c);
> >  
> >  	tap_sock_init(&c);
> > +	vu_init(&c);
> >  
> >  	secret_init(&c);
> >  
> > @@ -399,6 +400,12 @@ loop:
> >  		case EPOLL_TYPE_ICMPV6:
> >  			icmp_sock_handler(&c, AF_INET6, ref);
> >  			break;
> > +		case EPOLL_TYPE_VHOST_CMD:
> > +			tap_handler_vu(&c, eventmask);
> > +			break;
> > +		case EPOLL_TYPE_VHOST_KICK:
> > +			vu_kick_cb(&c, ref);
> > +			break;
> >  		default:
> >  			/* Can't happen */
> >  			ASSERT(0);
> > diff --git a/passt.h b/passt.h
> > index 6ed1d0b19e82..4e0100d51a4d 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -141,6 +141,7 @@ struct fqdn {
> >  enum passt_modes {
> >  	MODE_PASST,
> >  	MODE_PASTA,
> > +	MODE_VU,
> >  };
> >  
> >  /**
> > diff --git a/tap.c b/tap.c
> > index 936206e53637..c2a917bc00ca 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -57,6 +57,7 @@
> >  #include "packet.h"
> >  #include "tap.h"
> >  #include "log.h"
> > +#include "vhost_user.h"
> >  
> >  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
> >  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> > @@ -75,19 +76,22 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
> >   */
> >  int tap_send(const struct ctx *c, const void *data, size_t len)
> >  {
> > -	pcap(data, len);
> > +	int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> > +	uint32_t vnet_len = htonl(len);
> >  
> > -	if (c->mode == MODE_PASST) {
> > -		int flags = MSG_NOSIGNAL | MSG_DONTWAIT;
> > -		uint32_t vnet_len = htonl(len);
> > +	pcap(data, len);
> >  
> > +	switch (c->mode) {
> > +	case MODE_PASST:
> >  		if (send(c->fd_tap, &vnet_len, 4, flags) < 0)
> >  			return -1;
> > -
> 
> Unrelated change.
> 
> >  		return send(c->fd_tap, data, len, flags);
> > +	case MODE_PASTA:
> > +		return write(c->fd_tap, (char *)data, len);
> > +	case MODE_VU:
> > +		return vu_send(c, data, len);
> >  	}
> > -
> > -	return write(c->fd_tap, (char *)data, len);
> > +	return 0;
> >  }
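
For reference, the MODE_PASST branch above sends a 4-byte length in
network byte order, then the frame itself. A minimal sketch of that
framing, written into a single buffer instead of two send() calls
(frame_encode() is a hypothetical helper, not something in passt):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of passt: write a 4-byte big-endian length
 * followed by the frame into 'out'. Returns the number of bytes written, or
 * 0 if the frame does not fit.
 */
size_t frame_encode(uint8_t *out, size_t cap, const void *data, size_t len)
{
	uint32_t vnet_len;

	assert(out && data);
	if (len > UINT32_MAX || cap < sizeof(vnet_len) ||
	    cap - sizeof(vnet_len) < len)
		return 0;

	vnet_len = htonl((uint32_t)len);	/* same conversion as tap_send() */
	memcpy(out, &vnet_len, sizeof(vnet_len));
	memcpy(out + sizeof(vnet_len), data, len);

	return sizeof(vnet_len) + len;
}
```

The MODE_PASTA and MODE_VU branches don't use this prefix: they hand the
bare frame to write() and vu_send() respectively.
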
> >  
> >  /**
> > @@ -428,10 +432,20 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n)
> >  	if (!n)
> >  		return 0;
> >  
> > -	if (c->mode == MODE_PASTA)
> > +	switch (c->mode) {
> > +	case MODE_PASTA:
> >  		m = tap_send_frames_pasta(c, iov, n);
> > -	else
> > +		break;
> > +	case MODE_PASST:
> >  		m = tap_send_frames_passt(c, iov, n);
> > +		break;
> > +	case MODE_VU:
> > +		m = tap_send_frames_vu(c, iov, n);
> > +		break;
> > +	default:
> > +		m = 0;
> > +		break;
> > +	}
> >  
> >  	if (m < n)
> >  		debug("tap: failed to send %zu frames of %zu", n - m, n);
> > @@ -1149,11 +1163,17 @@ static void tap_sock_unix_init(struct ctx *c)
> >  	ev.data.u64 = ref.u64;
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> >  
> > -	info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> > -	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> > -	     addr.sun_path);
> > -	info("or qrap, for earlier qemu versions:");
> > -	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> > +	if (c->mode == MODE_VU) {
> > +		info("You can start qemu with:");
> > +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> 
> We'll never make it nice, but at least we can make it shorter by using
> single characters for device names (see "id=s" for the socket) and
> perhaps select something reasonable, say 1G, for $RAMSIZE? I haven't
> tried other ways yet.
> 
> > +		     addr.sun_path);
> > +	} else {
> > +		info("You can now start qemu (>= 7.2, with commit 13c6be96618c):");
> > +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> > +		     addr.sun_path);
> > +		info("or qrap, for earlier qemu versions:");
> > +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> > +	}
> >  }
> >  
> >  /**
> > @@ -1163,7 +1183,7 @@ static void tap_sock_unix_init(struct ctx *c)
> >   */
> >  void tap_listen_handler(struct ctx *c, uint32_t events)
> >  {
> > -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
> > +	union epoll_ref ref;
> 
> ...then it should go after 'int v'.
> 
> >  	struct epoll_event ev = { 0 };
> >  	int v = INT_MAX / 2;
> >  	struct ucred ucred;
> > @@ -1204,7 +1224,13 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
> >  		trace("tap: failed to set SO_SNDBUF to %i", v);
> >  
> >  	ref.fd = c->fd_tap;
> > -	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> > +	if (c->mode == MODE_VU) {
> > +		ref.type = EPOLL_TYPE_VHOST_CMD;
> > +		ev.events = EPOLLIN | EPOLLRDHUP;
> > +	} else {
> > +		ref.type = EPOLL_TYPE_TAP_PASST;
> > +		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET;
> > +	}
> >  	ev.data.u64 = ref.u64;
> >  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> >  }
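
The same mode-dependent choice of epoll events (edge-triggered for the
existing modes, no EPOLLET for the vhost-user command socket) appears
again in tap_sock_init(). A sketch of how it could be expressed once,
assuming a hypothetical helper tap_epoll_events() and a local copy of
the mode enum:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Local stand-in for enum passt_modes from passt.h */
enum mode { MODE_PASST, MODE_PASTA, MODE_VU };

/* Hypothetical helper, not part of the patch: epoll event mask for the
 * guest-facing file descriptor, depending on the operating mode.
 */
uint32_t tap_epoll_events(enum mode mode)
{
	uint32_t events = EPOLLIN | EPOLLRDHUP;

	assert(mode == MODE_PASST || mode == MODE_PASTA || mode == MODE_VU);

	/* The patch keeps EPOLLET for passt and pasta, drops it for MODE_VU */
	if (mode != MODE_VU)
		events |= EPOLLET;

	return events;
}
```

Both call sites would then only need to pick ref.type and assign
ev.events = tap_epoll_events(c->mode).
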
> > @@ -1288,12 +1314,21 @@ void tap_sock_init(struct ctx *c)
> >  
> >  		ASSERT(c->one_off);
> >  		ref.fd = c->fd_tap;
> > -		if (c->mode == MODE_PASST)
> > +		switch (c->mode) {
> > +		case MODE_PASST:
> >  			ref.type = EPOLL_TYPE_TAP_PASST;
> > -		else
> > +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> > +			break;
> > +		case MODE_PASTA:
> >  			ref.type = EPOLL_TYPE_TAP_PASTA;
> > +			ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> > +			break;
> > +		case MODE_VU:
> > +			ref.type = EPOLL_TYPE_VHOST_CMD;
> > +			ev.events = EPOLLIN | EPOLLRDHUP;
> > +			break;
> > +		}
> >  
> > -		ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
> >  		ev.data.u64 = ref.u64;
> >  		epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> >  		return;
> > diff --git a/tcp.c b/tcp.c
> > index 54c15087d678..b6aca9f37f19 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1033,7 +1033,9 @@ size_t ipv4_fill_headers(const struct ctx *c,
> >  
> >  	tcp_set_tcp_header(th, conn, seq);
> >  
> > -	th->check = tcp_update_check_tcp4(iph);
> > +	th->check = 0;
> > +	if (c->mode != MODE_VU || *c->pcap)
> > +		th->check = tcp_update_check_tcp4(iph);
> >  
> >  	return ip_len;
> >  }
> > @@ -1069,7 +1071,9 @@ size_t ipv6_fill_headers(const struct ctx *c,
> >  
> >  	tcp_set_tcp_header(th, conn, seq);
> >  
> > -	th->check = tcp_update_check_tcp6(ip6h);
> > +	th->check = 0;
> > +	if (c->mode != MODE_VU || *c->pcap)
> > +		th->check = tcp_update_check_tcp6(ip6h);
> >  
> >  	ip6h->hop_limit = 255;
> >  	ip6h->version = 6;
> > diff --git a/udp.c b/udp.c
> > index a189c2e0b5a2..799a10989a91 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -671,8 +671,10 @@ static size_t udp_update_hdr6(const struct ctx *c, struct ipv6hdr *ip6h,
> >  	uh->source = s_in6->sin6_port;
> >  	uh->dest = htons(dstport);
> >  	uh->len = ip6h->payload_len;
> > -	uh->check = csum(uh, ntohs(ip6h->payload_len),
> > -			 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> > +	uh->check = 0;
> > +	if (c->mode != MODE_VU || *c->pcap)
> > +		uh->check = csum(uh, ntohs(ip6h->payload_len),
> > +				 proto_ipv6_header_checksum(ip6h, IPPROTO_UDP));
> >  	ip6h->version = 6;
> >  	ip6h->nexthdr = IPPROTO_UDP;
> >  	ip6h->hop_limit = 255;
> 
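
To spell out the condition used in the tcp.c and udp.c hunks: the
checksum field is left at zero in vhost-user mode, unless a capture
file is being written. A rough sketch, with a plain RFC 1071 sum
standing in for passt's own csum() helpers (no pseudo-header here, and
both function names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over 'len' bytes, returned in host byte order.
 * Stand-in for passt's csum(), which is not reproduced here.
 */
uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf || !len);

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)			/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Hypothetical helper mirroring "c->mode != MODE_VU || *c->pcap": skip the
 * computation only in vhost-user mode without packet capture.
 */
uint16_t l4_check(bool mode_vu, bool pcap_enabled,
		  const uint8_t *buf, size_t len)
{
	if (mode_vu && !pcap_enabled)
		return 0;

	return inet_csum(buf, len);
}
```
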

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-12  2:49     ` David Gibson
@ 2024-02-12 10:02       ` Laurent Vivier
  2024-02-12 16:56         ` Stefano Brivio
  0 siblings, 1 reply; 83+ messages in thread
From: Laurent Vivier @ 2024-02-12 10:02 UTC (permalink / raw)
  To: David Gibson, Stefano Brivio; +Cc: passt-dev

On 2/12/24 03:49, David Gibson wrote:
> On Mon, Feb 12, 2024 at 12:19:01AM +0100, Stefano Brivio wrote:
>> On Fri,  2 Feb 2024 15:11:47 +0100
>> Laurent Vivier <lvivier@redhat.com> wrote:
>>
>>> add virtio and vhost-user functions to connect with QEMU.
>>>
>>>    $ ./passt --vhost-user
>>>
>>> and
>>>
>>>    # qemu-system-x86_64 ... -m 4G \
>>>          -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>>>          -numa node,memdev=memfd0 \
>>>          -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>>>          -netdev vhost-user,id=netdev0,chardev=chr0 \
>>>          -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>>>          ...
>>>
>>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>>> ---
>>>   conf.c  | 20 ++++++++++++++--
>>>   passt.c |  7 ++++++
>>>   passt.h |  1 +
>>>   tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
>>>   tcp.c   |  8 +++++--
>>>   udp.c   |  6 +++--
>>>   6 files changed, 90 insertions(+), 25 deletions(-)
>> This would need a matching change in the man page, passt.1, at least
>> documenting the --vhost-user option and adjusting descriptions about
>> the guest communication interface (look for "UNIX domain" there).
>>
>>> diff --git a/conf.c b/conf.c
>>> index b6a2a1f0fdc3..40aa9519f8a6 100644
>>> --- a/conf.c
>>> +++ b/conf.c
>>> @@ -44,6 +44,7 @@
>>>   #include "lineread.h"
>>>   #include "isolation.h"
>>>   #include "log.h"
>>> +#include "vhost_user.h"
>>>   
>>>   /**
>>>    * next_chunk - Return the next piece of a string delimited by a character
>>> @@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
>>>   		info(   "  -I, --ns-ifname NAME	namespace interface name");
>>>   		info(   "    default: same interface name as external one");
>>>   	} else {
>>> -		info(   "  -s, --socket PATH	UNIX domain socket path");
>>> +		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");
>> I don't get the point of --socket-path. It's handled just like -s
>> anyway, right? Why can't it just be -s / --socket?
> I believe the issue is that there's an expected command line interface
> for the vhost server, which uses --socket-path for, well, the socket
> path.  Hence adding an alias to the existing passt option.
>
>>>   		info(   "    default: probe free path starting from "
>>>   		     UNIX_SOCK_PATH, 1);
>>> +		info(   "  --vhost-user		Enable vhost-user mode");
>>> +		info(   "    UNIX domain socket is provided by -s option");
>>> +		info(   "  --print-capabilities	print back-end capabilities in JSON format");
>> Instead of introducing a new option, couldn't we have these printed
>> unconditionally with debug()? I guess it's debug-level stuff anyway.
> Likewise, I think this option is expected by the thing which starts
> the vhost server.
>
Yes, these parameters are defined in the vhost-user protocol specification:

https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#backend-program-conventions
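
As an illustration of the aliasing, here is a minimal sketch with
getopt_long(): both long names return the same value, so they share
one case. This is not the conf() code; struct vu_opts and
vu_parse_opts() are hypothetical names, and only the value 21 is taken
from the patch:

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

struct vu_opts {
	const char *sock_path;	/* from -s, --socket or --socket-path */
	int print_caps;		/* set by --print-capabilities */
};

/* Hypothetical parser, not passt's conf(): returns 0 on success, -1 on an
 * unknown option or a missing argument.
 */
int vu_parse_opts(int argc, char **argv, struct vu_opts *o)
{
	static const struct option options[] = {
		{ "socket",		required_argument,	NULL, 's' },
		{ "socket-path",	required_argument,	NULL, 's' },
		{ "print-capabilities",	no_argument,		NULL, 21 },
		{ 0 },
	};
	int name;

	assert(o);
	o->sock_path = NULL;
	o->print_caps = 0;

	optind = 0;	/* glibc and musl: reinitialise for a new argv */
	opterr = 0;	/* no error messages from getopt_long() itself */

	while ((name = getopt_long(argc, argv, "s:", options, NULL)) != -1) {
		switch (name) {
		case 's':
			o->sock_path = optarg;
			break;
		case 21:
			o->print_caps = 1;
			break;
		default:
			return -1;
		}
	}

	return 0;
}
```
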

Thanks,
Laurent



* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-12 10:02       ` Laurent Vivier
@ 2024-02-12 16:56         ` Stefano Brivio
  0 siblings, 0 replies; 83+ messages in thread
From: Stefano Brivio @ 2024-02-12 16:56 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Mon, 12 Feb 2024 11:02:14 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 2/12/24 03:49, David Gibson wrote:
> > On Mon, Feb 12, 2024 at 12:19:01AM +0100, Stefano Brivio wrote:  
> >> On Fri,  2 Feb 2024 15:11:47 +0100
> >> Laurent Vivier <lvivier@redhat.com> wrote:
> >>  
> >>> add virtio and vhost-user functions to connect with QEMU.
> >>>
> >>>    $ ./passt --vhost-user
> >>>
> >>> and
> >>>
> >>>    # qemu-system-x86_64 ... -m 4G \
> >>>          -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> >>>          -numa node,memdev=memfd0 \
> >>>          -chardev socket,id=chr0,path=/tmp/passt_1.socket \
> >>>          -netdev vhost-user,id=netdev0,chardev=chr0 \
> >>>          -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
> >>>          ...
> >>>
> >>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> >>> ---
> >>>   conf.c  | 20 ++++++++++++++--
> >>>   passt.c |  7 ++++++
> >>>   passt.h |  1 +
> >>>   tap.c   | 73 ++++++++++++++++++++++++++++++++++++++++++---------------
> >>>   tcp.c   |  8 +++++--
> >>>   udp.c   |  6 +++--
> >>>   6 files changed, 90 insertions(+), 25 deletions(-)  
> >> This would need a matching change in the man page, passt.1, at least
> >> documenting the --vhost-user option and adjusting descriptions about
> >> the guest communication interface (look for "UNIX domain" there).
> >>  
> >>> diff --git a/conf.c b/conf.c
> >>> index b6a2a1f0fdc3..40aa9519f8a6 100644
> >>> --- a/conf.c
> >>> +++ b/conf.c
> >>> @@ -44,6 +44,7 @@
> >>>   #include "lineread.h"
> >>>   #include "isolation.h"
> >>>   #include "log.h"
> >>> +#include "vhost_user.h"
> >>>   
> >>>   /**
> >>>    * next_chunk - Return the next piece of a string delimited by a character
> >>> @@ -735,9 +736,12 @@ static void print_usage(const char *name, int status)
> >>>   		info(   "  -I, --ns-ifname NAME	namespace interface name");
> >>>   		info(   "    default: same interface name as external one");
> >>>   	} else {
> >>> -		info(   "  -s, --socket PATH	UNIX domain socket path");
> >>> +		info(   "  -s, --socket, --socket-path PATH	UNIX domain socket path");  
> >> I don't get the point of --socket-path. It's handled just like -s
> >> anyway, right? Why can't it just be -s / --socket?  
> > I believe the issue is that there's an expected command line interface
> > for the vhost server, which uses --socket-path for, well, the socket
> > path.  Hence adding an alias to the existing passt option.
> >  
> >>>   		info(   "    default: probe free path starting from "
> >>>   		     UNIX_SOCK_PATH, 1);
> >>> +		info(   "  --vhost-user		Enable vhost-user mode");
> >>> +		info(   "    UNIX domain socket is provided by -s option");
> >>> +		info(   "  --print-capabilities	print back-end capabilities in JSON format");  
> >> Instead of introducing a new option, couldn't we have these printed
> >> unconditionally with debug()? I guess it's debug-level stuff anyway.  
> > Likewise, I think this option is expected by the thing which starts
> > the vhost server.
>
> Yes, these parameters are defined in the vhost-user protocol specification:
> 
> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#backend-program-conventions

So, I went through the discussion from which these option names
originated:

  https://patchwork.ozlabs.org/project/qemu-devel/cover/20180713130916.4153-1-marcandre.lureau@redhat.com/#1986839

...and I don't actually see the point for our case: libvirt and krunvm
(pending) have a bit of code to start passt, and so do Podman and
rootlesskit for pasta:

  https://gitlab.com/libvirt/libvirt/-/blob/20a5f77156bb0237269008351ddf285067065516/src/qemu/qemu_passt.c#L183
  https://github.com/containers/krunvm/pull/56/commits/5fc42aa4dfde8608320e06192a7d5ec428834bb5#diff-3cd015c5feef5904a99e9a7513340c1e656649037b7442edeed2a799d242ede3R66
  https://github.com/containers/common/blob/91e0fac33e22545f6e0d99d41d315075c02576e1/libnetwork/pasta/pasta.go#L56
  https://github.com/rootless-containers/rootlesskit/blob/efee459a225b80ace3d51ae39e9b838616c3d652/pkg/network/pasta/pasta.go#L113

where option names are happily open-coded (of course!).

That is, I find it rather implausible that any of these users would
switch to some "unified" front-end that would ever need to start passt
with a standardised command line.

Anyway, okay, the cost isn't much as you already have the code changes,
so I guess we can indulge in a bit of... compliance cult and keep all
this.

-- 
Stefano



* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-12  2:47       ` David Gibson
@ 2024-02-13 15:22         ` Stefano Brivio
  2024-02-14  2:05           ` David Gibson
  0 siblings, 1 reply; 83+ messages in thread
From: Stefano Brivio @ 2024-02-13 15:22 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Mon, 12 Feb 2024 13:47:15 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Feb 12, 2024 at 12:19:22AM +0100, Stefano Brivio wrote:
> > On Wed, 7 Feb 2024 13:40:33 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Fri, Feb 02, 2024 at 03:11:47PM +0100, Laurent Vivier wrote:  
> > > > add virtio and vhost-user functions to connect with QEMU.
> > > > 
> > > >   $ ./passt --vhost-user
> > > > 
> > > > and
> > > > 
> > > >   # qemu-system-x86_64 ... -m 4G \
> > > >         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> > > >         -numa node,memdev=memfd0 \
> > > >         -chardev socket,id=chr0,path=/tmp/passt_1.socket \    
> > > 
> > > I think it would be wise to use different default socket names for
> > > vhost-user than for the qemu socket protocol.  
> > 
> > I'm not sure if there's an obvious benefit (mix them up, and nothing
> > will work anyway). On the other hand, that means more typing and
> > remembering what's the separator between "passt", "vhost", and "user".
> >   
> > > Or even to require
> > > --socket-path: the reasons we have these rather weird default probed
> > > paths don't apply here, AFAICT.  
> > 
> > Why not, actually? With probed paths, you can still reasonably start
> > passt by *typing* its command line. I do it all the time, and I think
> > it's quite nice to have.  
> 
> Uh.. I'm not sure how this would change that.

Because one would have to type:

  ./passt --vhost-user --socket-path /tmp/passt.something

instead of ./passt --vhost-user? Sure, sometimes I call my sockets
/tmp/s, but still that doubles the length of the command line.
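
For context, a rough sketch of what "probe free path" could look like,
assuming UNIX_SOCK_PATH expands to "/tmp/passt_%i.socket" (which would
match the /tmp/passt_1.socket example above). sock_path_probe() is a
hypothetical name, and checking with access() is a simplification that
is racy and not necessarily what passt does:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#define SOCK_PATH_FMT	"/tmp/passt_%i.socket"	/* assumed UNIX_SOCK_PATH */

/* Hypothetical helper: store in 'path' the first candidate, starting from
 * index 1, that does not exist yet. Returns its index, or -1 if none was
 * found up to 'max' or if 'path' is too small.
 */
int sock_path_probe(char *path, size_t size, int max)
{
	int i;

	assert(path && size);

	for (i = 1; i <= max; i++) {
		int n = snprintf(path, size, SOCK_PATH_FMT, i);

		if (n < 0 || (size_t)n >= size)
			return -1;

		if (access(path, F_OK))	/* nothing there: take this one */
			return i;
	}

	return -1;
}
```
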

-- 
Stefano



* Re: [PATCH 20/24] vhost-user: add vhost-user
  2024-02-13 15:22         ` Stefano Brivio
@ 2024-02-14  2:05           ` David Gibson
  0 siblings, 0 replies; 83+ messages in thread
From: David Gibson @ 2024-02-14  2:05 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Tue, Feb 13, 2024 at 04:22:56PM +0100, Stefano Brivio wrote:
> On Mon, 12 Feb 2024 13:47:15 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Feb 12, 2024 at 12:19:22AM +0100, Stefano Brivio wrote:
> > > On Wed, 7 Feb 2024 13:40:33 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Feb 02, 2024 at 03:11:47PM +0100, Laurent Vivier wrote:  
> > > > > add virtio and vhost-user functions to connect with QEMU.
> > > > > 
> > > > >   $ ./passt --vhost-user
> > > > > 
> > > > > and
> > > > > 
> > > > >   # qemu-system-x86_64 ... -m 4G \
> > > > >         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
> > > > >         -numa node,memdev=memfd0 \
> > > > >         -chardev socket,id=chr0,path=/tmp/passt_1.socket \    
> > > > 
> > > > I think it would be wise to use different default socket names for
> > > > vhost-user than for the qemu socket protocol.  
> > > 
> > > I'm not sure if there's an obvious benefit (mix them up, and nothing
> > > will work anyway). On the other hand, that means more typing and
> > > remembering what's the separator between "passt", "vhost", and "user".
> > >   
> > > > Or even to require
> > > > --socket-path: the reasons we have these rather weird default probed
> > > > paths don't apply here, AFAICT.  
> > > 
> > > Why not, actually? With probed paths, you can still reasonably start
> > > passt by *typing* its command line. I do it all the time, and I think
> > > it's quite nice to have.  
> > 
> > Uh.. I'm not sure how this would change that.
> 
> Because one would have to type:
> 
>   ./passt --vhost-user --socket-path /tmp/passt.something

Well.. it could be:
  ./passt --vhost-user -s foo

You can use the shorter option, and the socket doesn't have to be in
/tmp (in fact, I'd argue it's usually better not to put them there).
This also means you can use your (possibly shorter) choice of socket
name on the qemu command line.

> instead of ./passt --vhost-user? Sure, sometimes I call my sockets
> /tmp/s, but still that doubles the length of the command line.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 83+ messages
2024-02-02 14:11 [PATCH 00/24] Add vhost-user support to passt Laurent Vivier
2024-02-02 14:11 ` [PATCH 01/24] iov: add some functions to manage iovec Laurent Vivier
2024-02-05  5:57   ` David Gibson
2024-02-06 14:28     ` Laurent Vivier
2024-02-07  1:01       ` David Gibson
2024-02-07 10:00         ` Laurent Vivier
2024-02-06 16:10   ` Stefano Brivio
2024-02-07 14:02     ` Laurent Vivier
2024-02-07 14:57       ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 02/24] pcap: add pcap_iov() Laurent Vivier
2024-02-05  6:25   ` David Gibson
2024-02-06 16:10   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 03/24] checksum: align buffers Laurent Vivier
2024-02-05  6:02   ` David Gibson
2024-02-07  9:01     ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 04/24] checksum: add csum_iov() Laurent Vivier
2024-02-05  6:07   ` David Gibson
2024-02-07  9:02   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 05/24] util: move IP stuff from util.[ch] to ip.[ch] Laurent Vivier
2024-02-05  6:13   ` David Gibson
2024-02-07  9:03     ` Stefano Brivio
2024-02-08  0:04       ` David Gibson
2024-02-02 14:11 ` [PATCH 06/24] ip: move duplicate IPv4 checksum function to ip.h Laurent Vivier
2024-02-05  6:16   ` David Gibson
2024-02-07 10:40   ` Stefano Brivio
2024-02-07 23:43     ` David Gibson
2024-02-02 14:11 ` [PATCH 07/24] ip: introduce functions to compute the header part checksum for TCP/UDP Laurent Vivier
2024-02-05  6:20   ` David Gibson
2024-02-07 10:41   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 08/24] tcp: extract buffer management from tcp_send_flag() Laurent Vivier
2024-02-06  0:24   ` David Gibson
2024-02-08 16:57   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 09/24] tcp: extract buffer management from tcp_conn_tap_mss() Laurent Vivier
2024-02-06  0:47   ` David Gibson
2024-02-08 16:59   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 10/24] tcp: rename functions that manage buffers Laurent Vivier
2024-02-06  1:48   ` David Gibson
2024-02-08 17:10     ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 11/24] tcp: move buffers management functions to their own file Laurent Vivier
2024-02-02 14:11 ` [PATCH 12/24] tap: make tap_update_mac() generic Laurent Vivier
2024-02-06  1:49   ` David Gibson
2024-02-08 17:10     ` Stefano Brivio
2024-02-09  5:02       ` David Gibson
2024-02-02 14:11 ` [PATCH 13/24] tap: export pool_flush()/tapX_handler()/packet_add() Laurent Vivier
2024-02-02 14:29   ` Laurent Vivier
2024-02-06  1:52   ` David Gibson
2024-02-11 23:15   ` Stefano Brivio
2024-02-12  2:22     ` David Gibson
2024-02-02 14:11 ` [PATCH 14/24] udp: move udpX_l2_buf_t and udpX_l2_mh_sock out of udp_update_hdrX() Laurent Vivier
2024-02-06  1:59   ` David Gibson
2024-02-11 23:16   ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 15/24] udp: rename udp_sock_handler() to udp_buf_sock_handler() Laurent Vivier
2024-02-06  2:14   ` David Gibson
2024-02-11 23:17     ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 16/24] packet: replace struct desc by struct iovec Laurent Vivier
2024-02-06  2:25   ` David Gibson
2024-02-11 23:18     ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 17/24] vhost-user: compare mode MODE_PASTA and not MODE_PASST Laurent Vivier
2024-02-06  2:29   ` David Gibson
2024-02-02 14:11 ` [PATCH 18/24] vhost-user: introduce virtio API Laurent Vivier
2024-02-06  3:51   ` David Gibson
2024-02-11 23:18     ` Stefano Brivio
2024-02-12  2:26       ` David Gibson
2024-02-02 14:11 ` [PATCH 19/24] vhost-user: introduce vhost-user API Laurent Vivier
2024-02-07  2:13   ` David Gibson
2024-02-02 14:11 ` [PATCH 20/24] vhost-user: add vhost-user Laurent Vivier
2024-02-07  2:40   ` David Gibson
2024-02-11 23:19     ` Stefano Brivio
2024-02-12  2:47       ` David Gibson
2024-02-13 15:22         ` Stefano Brivio
2024-02-14  2:05           ` David Gibson
2024-02-11 23:19   ` Stefano Brivio
2024-02-12  2:49     ` David Gibson
2024-02-12 10:02       ` Laurent Vivier
2024-02-12 16:56         ` Stefano Brivio
2024-02-02 14:11 ` [PATCH 21/24] vhost-user: use guest buffer directly in vu_handle_tx() Laurent Vivier
2024-02-09  4:26   ` David Gibson
2024-02-02 14:11 ` [PATCH 22/24] tcp: vhost-user RX nocopy Laurent Vivier
2024-02-09  4:57   ` David Gibson
2024-02-02 14:11 ` [PATCH 23/24] udp: " Laurent Vivier
2024-02-09  5:00   ` David Gibson
2024-02-02 14:11 ` [PATCH 24/24] vhost-user: remove tap_send_frames_vu() Laurent Vivier
2024-02-09  5:01   ` David Gibson
