* [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element
@ 2026-03-23 16:52 Laurent Vivier
2026-03-23 16:52 ` [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum Laurent Vivier
` (7 more replies)
0 siblings, 8 replies; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
This is the TCP counterpart to the UDP multi-iov series. It converts
the TCP vhost-user receive path from direct pointer arithmetic (via
vu_eth(), vu_ip(), etc.) to the iov_tail abstraction, removing the
assumption that all headers reside in a single contiguous buffer.
With this series applied, the TCP path correctly handles virtio-net
drivers that provide multiple buffers per virtqueue element (e.g. iPXE
provides the vnet header in the first buffer and the frame payload in a
second one), matching the support already present in the UDP path.
Based-on: 20260323143151.538673-1-lvivier@redhat.com
Laurent Vivier (7):
tcp: pass ipv4h checksum, not a pointer to the checksum
tcp: use iov_tail to access headers in tcp_fill_headers()
tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare()
tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()
tcp: Use iov_tail to access headers in tcp_prepare_flags()
iov: introduce iov_memcopy()
tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag()
iov.c | 45 +++++++
iov.h | 2 +
tcp.c | 180 ++++++++++++++-----------
tcp_buf.c | 34 ++---
tcp_internal.h | 9 +-
tcp_vu.c | 354 ++++++++++++++++++++++++++++---------------------
vu_common.h | 20 ---
7 files changed, 368 insertions(+), 276 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-24 3:53 ` David Gibson
2026-03-23 16:52 ` [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers() Laurent Vivier
` (6 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
tcp_fill_headers() takes a pointer to a previously computed IPv4 header
checksum to avoid recalculating it when the payload length doesn't
change. A subsequent patch makes tcp_fill_headers() access ip4h via
with_header(), which scopes it to a temporary variable, so a pointer to
ip4h->check would become dangling after the with_header() block.
Pass the checksum by value as an int instead, using -1 as the sentinel
to indicate that the checksum should be computed from scratch (replacing
the NULL pointer sentinel). As the checksum is a uint16_t, -1 can never
collide with a valid checksum value once widened to an int.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
tcp.c | 7 +++----
tcp_buf.c | 10 +++++-----
tcp_internal.h | 3 +--
tcp_vu.c | 12 ++++++------
4 files changed, 15 insertions(+), 17 deletions(-)
diff --git a/tcp.c b/tcp.c
index b14586249c4e..158a5be0327e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -953,8 +953,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
struct ethhdr *eh,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
- const uint16_t *ip4_check, uint32_t seq,
- bool no_tcp_csum)
+ int ip4_check, uint32_t seq, bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
size_t l4len = iov_tail_size(payload) + sizeof(*th);
@@ -974,8 +973,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
ip4h->saddr = src4->s_addr;
ip4h->daddr = dst4->s_addr;
- if (ip4_check)
- ip4h->check = *ip4_check;
+ if (ip4_check != -1)
+ ip4h->check = ip4_check;
else
ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
*src4, *dst4);
diff --git a/tcp_buf.c b/tcp_buf.c
index 41965b107567..bc0f58dd7a5e 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -172,7 +172,7 @@ static void tcp_l2_buf_pad(struct iovec *iov)
*/
static void tcp_l2_buf_fill_headers(const struct ctx *c,
struct tcp_tap_conn *conn,
- struct iovec *iov, const uint16_t *check,
+ struct iovec *iov, int check,
uint32_t seq, bool no_tcp_csum)
{
struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
@@ -233,7 +233,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
if (flags & KEEPALIVE)
seq--;
- tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
+ tcp_l2_buf_fill_headers(c, conn, iov, -1, seq, false);
tcp_l2_buf_pad(iov);
@@ -270,7 +270,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ssize_t dlen, int no_csum, uint32_t seq, bool push)
{
struct tcp_payload_t *payload;
- const uint16_t *check = NULL;
+ int check = -1;
struct iovec *iov;
conn->seq_to_tap = seq + dlen;
@@ -279,9 +279,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (CONN_V4(conn)) {
if (no_csum) {
struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
- struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
+ const struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
- check = &iph->check;
+ check = iph->check;
}
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
} else if (CONN_V6(conn)) {
diff --git a/tcp_internal.h b/tcp_internal.h
index d9408852571f..bb7a6629839c 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -187,8 +187,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
struct ethhdr *eh,
struct iphdr *ip4h, struct ipv6hdr *ip6h,
struct tcphdr *th, struct iov_tail *payload,
- const uint16_t *ip4_check, uint32_t seq,
- bool no_tcp_csum);
+ int ip4_check, uint32_t seq, bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
bool force_seq, struct tcp_info_linux *tinfo);
diff --git a/tcp_vu.c b/tcp_vu.c
index 3001defb5467..a21ee3499aed 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -138,7 +138,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
seq--;
tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
- NULL, seq, !*c->pcap);
+ -1, seq, !*c->pcap);
if (*c->pcap)
pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
@@ -283,7 +283,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
*/
static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
struct iovec *iov, size_t iov_cnt,
- const uint16_t **check, bool no_tcp_csum, bool push)
+ int *check, bool no_tcp_csum, bool push)
{
const struct flowside *toside = TAPFLOW(conn);
bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
@@ -329,7 +329,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
*check, conn->seq_to_tap, no_tcp_csum);
if (ip4h)
- *check = &ip4h->check;
+ *check = ip4h->check;
}
/**
@@ -350,7 +350,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
size_t hdrlen, fillsize;
int v6 = CONN_V6(conn);
uint32_t already_sent;
- const uint16_t *check;
+ int check;
if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
debug("Got packet, but RX virtqueue not usable yet");
@@ -437,7 +437,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
*/
hdrlen = tcp_vu_hdrlen(v6);
- for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
+ for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
struct iovec *iov = &elem[head[i]].in_sg[0];
int buf_cnt = head[i + 1] - head[i];
size_t frame_size = iov_size(iov, buf_cnt);
@@ -451,7 +451,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
/* The IPv4 header checksum varies only with dlen */
if (previous_dlen != dlen)
- check = NULL;
+ check = -1;
previous_dlen = dlen;
tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
--
2.53.0
* [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
2026-03-23 16:52 ` [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-24 3:58 ` David Gibson
2026-03-23 16:52 ` [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare() Laurent Vivier
` (5 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
Instead of receiving individual pointers to each protocol header (eh,
ip4h, ip6h, th), have tcp_fill_headers() take an iov_tail starting at
the Ethernet header and walk through it using with_header() and
IOV_DROP_HEADER() to access each header in turn.
Replace the ip4h/ip6h NULL-pointer convention with a bool ipv4
parameter, and move Ethernet header filling (MAC address and ethertype)
into tcp_fill_headers() as well, since the function now owns the full
header chain.
This simplifies callers, which no longer need to extract and pass
individual header pointers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
tcp.c | 106 +++++++++++++++++++++++++++----------------------
tcp_buf.c | 16 ++------
tcp_internal.h | 4 +-
tcp_vu.c | 12 +++---
4 files changed, 70 insertions(+), 68 deletions(-)
diff --git a/tcp.c b/tcp.c
index 158a5be0327e..058792d5b184 100644
--- a/tcp.c
+++ b/tcp.c
@@ -938,11 +938,8 @@ static void tcp_fill_header(struct tcphdr *th,
* tcp_fill_headers() - Fill 802.3, IP, TCP headers
* @c: Execution context
* @conn: Connection pointer
- * @eh: Pointer to Ethernet header
- * @ip4h: Pointer to IPv4 header, or NULL
- * @ip6h: Pointer to IPv6 header, or NULL
- * @th: Pointer to TCP header
- * @payload: TCP payload
+ * @ipv4: True for IPv4, false for IPv6
+ * @payload: IOV tail starting at the Ethernet header
* @ip4_check: IPv4 checksum, if already known
* @seq: Sequence number for this segment
* @no_tcp_csum: Do not set TCP checksum
@@ -950,74 +947,89 @@ static void tcp_fill_header(struct tcphdr *th,
* Return: frame length (including L2 headers)
*/
size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
- struct ethhdr *eh,
- struct iphdr *ip4h, struct ipv6hdr *ip6h,
- struct tcphdr *th, struct iov_tail *payload,
+ bool ipv4, const struct iov_tail *payload,
int ip4_check, uint32_t seq, bool no_tcp_csum)
{
const struct flowside *tapside = TAPFLOW(conn);
- size_t l4len = iov_tail_size(payload) + sizeof(*th);
+ struct iov_tail current = *payload;
uint8_t *omac = conn->f.tap_omac;
- size_t l3len = l4len;
+ size_t l3len, l4len;
uint32_t psum = 0;
- if (ip4h) {
+ with_header(struct ethhdr, eh, &current) {
+ if (ipv4)
+ eh->h_proto = htons_constant(ETH_P_IP);
+ else
+ eh->h_proto = htons_constant(ETH_P_IPV6);
+
+ /* Find if neighbour table has a recorded MAC address */
+ if (MAC_IS_UNDEF(omac))
+ fwd_neigh_mac_get(c, &tapside->oaddr, omac);
+ eth_update_mac(eh, NULL, omac);
+ }
+ IOV_DROP_HEADER(&current, struct ethhdr);
+
+ l3len = iov_tail_size(&current);
+
+ if (ipv4) {
const struct in_addr *src4 = inany_v4(&tapside->oaddr);
const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
assert(src4 && dst4);
- l3len += + sizeof(*ip4h);
+ l4len = l3len - sizeof(struct iphdr);
- ip4h->tot_len = htons(l3len);
- ip4h->saddr = src4->s_addr;
- ip4h->daddr = dst4->s_addr;
+ with_header(struct iphdr, ip4h, &current) {
+ ip4h->tot_len = htons(l3len);
+ ip4h->saddr = src4->s_addr;
+ ip4h->daddr = dst4->s_addr;
- if (ip4_check != -1)
- ip4h->check = ip4_check;
- else
- ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
- *src4, *dst4);
+ if (ip4_check != -1)
+ ip4h->check = ip4_check;
+ else
+ ip4h->check = csum_ip4_header(l3len,
+ IPPROTO_TCP,
+ *src4, *dst4);
+ }
+ IOV_DROP_HEADER(&current, struct iphdr);
if (!no_tcp_csum) {
psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
*src4, *dst4);
}
- eh->h_proto = htons_constant(ETH_P_IP);
- }
-
- if (ip6h) {
- l3len += sizeof(*ip6h);
+ } else {
+ l4len = l3len - sizeof(struct ipv6hdr);
- ip6h->payload_len = htons(l4len);
- ip6h->saddr = tapside->oaddr.a6;
- ip6h->daddr = tapside->eaddr.a6;
+ with_header(struct ipv6hdr, ip6h, &current) {
+ ip6h->payload_len = htons(l4len);
+ ip6h->saddr = tapside->oaddr.a6;
+ ip6h->daddr = tapside->eaddr.a6;
- ip6h->hop_limit = 255;
- ip6h->version = 6;
- ip6h->nexthdr = IPPROTO_TCP;
+ ip6h->hop_limit = 255;
+ ip6h->version = 6;
+ ip6h->nexthdr = IPPROTO_TCP;
- ip6_set_flow_lbl(ip6h, conn->sock);
+ ip6_set_flow_lbl(ip6h, conn->sock);
- if (!no_tcp_csum) {
- psum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
- &ip6h->saddr,
- &ip6h->daddr);
+ if (!no_tcp_csum) {
+ psum = proto_ipv6_header_psum(l4len,
+ IPPROTO_TCP,
+ &ip6h->saddr,
+ &ip6h->daddr);
+ }
}
- eh->h_proto = htons_constant(ETH_P_IPV6);
+ IOV_DROP_HEADER(&current, struct ipv6hdr);
}
- /* Find if neighbour table has a recorded MAC address */
- if (MAC_IS_UNDEF(omac))
- fwd_neigh_mac_get(c, &tapside->oaddr, omac);
- eth_update_mac(eh, NULL, omac);
-
- tcp_fill_header(th, conn, seq);
-
- if (no_tcp_csum)
+ with_header(struct tcphdr, th, &current) {
+ tcp_fill_header(th, conn, seq);
th->check = 0;
- else
- tcp_update_csum(psum, th, payload);
+ }
+
+ if (!no_tcp_csum) {
+ with_header(struct tcphdr, th, &current)
+ th->check = csum_iov_tail(&current, psum);
+ }
return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
}
diff --git a/tcp_buf.c b/tcp_buf.c
index bc0f58dd7a5e..891043c96dcb 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -175,23 +175,15 @@ static void tcp_l2_buf_fill_headers(const struct ctx *c,
struct iovec *iov, int check,
uint32_t seq, bool no_tcp_csum)
{
- struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
- struct tcphdr th_storage, *th = IOV_REMOVE_HEADER(&tail, th_storage);
+ struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_ETH],
+ TCP_IOV_PAYLOAD + 1 - TCP_IOV_ETH, 0);
struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
const struct flowside *tapside = TAPFLOW(conn);
- const struct in_addr *a4 = inany_v4(&tapside->oaddr);
- struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base;
- struct ipv6hdr *ip6h = NULL;
- struct iphdr *ip4h = NULL;
+ bool ipv4 = inany_v4(&tapside->oaddr) != NULL;
size_t l2len;
- if (a4)
- ip4h = iov[TCP_IOV_IP].iov_base;
- else
- ip6h = iov[TCP_IOV_IP].iov_base;
+ l2len = tcp_fill_headers(c, conn, ipv4, &tail, check, seq, no_tcp_csum);
- l2len = tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &tail, check, seq,
- no_tcp_csum);
tap_hdr_update(taph, l2len);
}
diff --git a/tcp_internal.h b/tcp_internal.h
index bb7a6629839c..136e947f6e70 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -184,9 +184,7 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
struct tcp_info_linux;
size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
- struct ethhdr *eh,
- struct iphdr *ip4h, struct ipv6hdr *ip6h,
- struct tcphdr *th, struct iov_tail *payload,
+ bool ipv4, const struct iov_tail *payload,
int ip4_check, uint32_t seq, bool no_tcp_csum);
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
diff --git a/tcp_vu.c b/tcp_vu.c
index a21ee3499aed..c6206b7a689c 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -132,13 +132,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
}
vu_pad(&flags_iov[0], 1, 0, hdrlen + optlen);
- payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
if (flags & KEEPALIVE)
seq--;
- tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
- -1, seq, !*c->pcap);
+ payload = IOV_TAIL(flags_elem[0].in_sg, 1, VNET_HLEN);
+ tcp_fill_headers(c, conn, CONN_V4(conn), &payload, -1, seq, !*c->pcap);
if (*c->pcap)
pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
@@ -288,10 +287,10 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
const struct flowside *toside = TAPFLOW(conn);
bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
size_t hdrlen = tcp_vu_hdrlen(v6);
- struct iov_tail payload = IOV_TAIL(iov, iov_cnt, hdrlen);
char *base = iov[0].iov_base;
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
+ struct iov_tail payload;
struct tcphdr *th;
struct ethhdr *eh;
@@ -326,8 +325,9 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
th->ack = 1;
th->psh = push;
- tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
- *check, conn->seq_to_tap, no_tcp_csum);
+ payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
+ tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
+ no_tcp_csum);
if (ip4h)
*check = ip4h->check;
}
--
2.53.0
* [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
2026-03-23 16:52 ` [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum Laurent Vivier
2026-03-23 16:52 ` [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers() Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-25 4:46 ` David Gibson
2026-03-23 16:52 ` [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv() Laurent Vivier
` (4 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
Replace direct pointer arithmetic using vu_eth(), vu_ip(),
vu_payloadv4() and vu_payloadv6() with the iov_tail abstraction
to build Ethernet, IP and TCP headers in tcp_vu_prepare().
This removes the assumption that all headers reside in a single
contiguous iov entry.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
tcp_vu.c | 61 ++++++++++++++++++++++++++------------------------------
1 file changed, 28 insertions(+), 33 deletions(-)
diff --git a/tcp_vu.c b/tcp_vu.c
index c6206b7a689c..a39f6ea018e9 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -222,6 +222,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
/* reserve space for headers in iov */
iov = &elem[elem_cnt].in_sg[0];
assert(iov->iov_len >= hdrlen);
+
iov->iov_base = (char *)iov->iov_base + hdrlen;
iov->iov_len -= hdrlen;
head[(*head_cnt)++] = elem_cnt;
@@ -285,51 +286,45 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
int *check, bool no_tcp_csum, bool push)
{
const struct flowside *toside = TAPFLOW(conn);
- bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
- size_t hdrlen = tcp_vu_hdrlen(v6);
- char *base = iov[0].iov_base;
- struct ipv6hdr *ip6h = NULL;
- struct iphdr *ip4h = NULL;
+ bool ipv4 = inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr);
struct iov_tail payload;
- struct tcphdr *th;
- struct ethhdr *eh;
-
- /* we guess the first iovec provided by the guest can embed
- * all the headers needed by L2 frame, including any padding
- */
- assert(iov[0].iov_len >= hdrlen);
-
- eh = vu_eth(base);
- memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+ payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
+ with_header(struct ethhdr, eh, &payload)
+ memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+ IOV_DROP_HEADER(&payload, struct ethhdr);
/* initialize header */
- if (!v6) {
- eh->h_proto = htons(ETH_P_IP);
-
- ip4h = vu_ip(base);
- *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
- th = vu_payloadv4(base);
+ if (ipv4) {
+ with_header(struct iphdr, ip4h, &payload)
+ *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+ IOV_DROP_HEADER(&payload, struct iphdr);
} else {
- eh->h_proto = htons(ETH_P_IPV6);
-
- ip6h = vu_ip(base);
- *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
-
- th = vu_payloadv6(base);
+ with_header(struct ipv6hdr, ip6h, &payload)
+ *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+ IOV_DROP_HEADER(&payload, struct ipv6hdr);
}
- memset(th, 0, sizeof(*th));
- th->doff = sizeof(*th) / 4;
- th->ack = 1;
- th->psh = push;
+ with_header(struct tcphdr, th, &payload) {
+ memset(th, 0, sizeof(*th));
+ th->doff = sizeof(*th) / 4;
+ th->ack = 1;
+ th->psh = push;
+ }
payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
- tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
+ tcp_fill_headers(c, conn, ipv4, &payload, *check, conn->seq_to_tap,
no_tcp_csum);
- if (ip4h)
+ if (ipv4) {
+ struct iphdr ip4h_storage;
+ const struct iphdr *ip4h;
+
+ IOV_DROP_HEADER(&payload, struct ethhdr);
+ ip4h = IOV_PEEK_HEADER(&payload, ip4h_storage);
+
*check = ip4h->check;
+ }
}
/**
--
2.53.0
* [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
` (2 preceding siblings ...)
2026-03-23 16:52 ` [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare() Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-25 5:06 ` David Gibson
2026-03-23 16:52 ` [PATCH 5/7] tcp: Use iov_tail to access headers in tcp_prepare_flags() Laurent Vivier
` (3 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue
elements and iovecs (one iovec per element), enforced by an ASSERT.
This prevented the use of virtqueue elements with multiple buffers
(e.g. when mergeable rx buffers are not negotiated and headers are
provided in a separate buffer).
Introduce a struct vu_frame to track per-frame metadata: the range of
elements and iovecs that make up each frame, and the frame's total size.
This replaces the head[] array which only tracked element indices.
A separate iov_msg[] array is built for recvmsg() by cloning the data
portions (after stripping headers) using iov_tail helpers.
Then a frame truncation after recvmsg() properly walks the frame and
element arrays to adjust iovec counts and element counts.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
tcp_vu.c | 161 ++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 107 insertions(+), 54 deletions(-)
diff --git a/tcp_vu.c b/tcp_vu.c
index a39f6ea018e9..96cd9da1caae 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -35,9 +35,24 @@
#include "vu_common.h"
#include <time.h>
-static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
-static int head[VIRTQUEUE_MAX_SIZE + 1];
+
+/**
+ * struct vu_frame - Descriptor for a TCP frame mapped to virtqueue elements
+ * @idx_element: Index of first element in elem[] for this frame
+ * @num_element: Number of virtqueue elements used by this frame
+ * @idx_iovec: Index of first iovec in iov_vu[] for this frame
+ * @num_iovec: Number of iovecs covering this frame's buffers
+ * @size: Total frame size including all headers
+ */
+static struct vu_frame {
+ int idx_element;
+ int num_element;
+ int idx_iovec;
+ int num_iovec;
+ size_t size;
+} frame[VIRTQUEUE_MAX_SIZE];
/**
* tcp_vu_hdrlen() - Sum size of all headers, from TCP to virtio-net
@@ -173,8 +188,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
* @v6: Set for IPv6 connections
* @already_sent: Number of bytes already sent
* @fillsize: Maximum bytes to fill in guest-side receiving window
- * @iov_cnt: number of iov (output)
- * @head_cnt: Pointer to store the count of head iov entries (output)
+ * @elem_used: number of element (output)
+ * @frame_cnt: Pointer to store the number of frames (output)
*
* Return: number of bytes received from the socket, or a negative error code
* on failure.
@@ -182,58 +197,77 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
const struct tcp_tap_conn *conn, bool v6,
uint32_t already_sent, size_t fillsize,
- int *iov_cnt, int *head_cnt)
+ int *elem_used, int *frame_cnt)
{
+ static struct iovec iov_msg[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
const struct vu_dev *vdev = c->vdev;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
size_t hdrlen, iov_used;
int s = conn->sock;
+ ssize_t ret, dlen;
int elem_cnt;
- ssize_t ret;
- int i;
-
- *iov_cnt = 0;
+ int i, j;
hdrlen = tcp_vu_hdrlen(v6);
+ *elem_used = 0;
+
iov_used = 0;
elem_cnt = 0;
- *head_cnt = 0;
+ *frame_cnt = 0;
while (fillsize > 0 && elem_cnt < ARRAY_SIZE(elem) &&
- iov_used < VIRTQUEUE_MAX_SIZE) {
- size_t frame_size, dlen, in_total;
- struct iovec *iov;
+ iov_used < ARRAY_SIZE(iov_vu) &&
+ *frame_cnt < ARRAY_SIZE(frame)) {
+ size_t frame_size, in_total;
int cnt;
cnt = vu_collect(vdev, vq, &elem[elem_cnt],
ARRAY_SIZE(elem) - elem_cnt,
- &iov_vu[DISCARD_IOV_NUM + iov_used],
- VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
+ &iov_vu[iov_used],
+ ARRAY_SIZE(iov_vu) - iov_used, &in_total,
MIN(mss, fillsize) + hdrlen,
&frame_size);
if (cnt == 0)
break;
- assert((size_t)cnt == in_total); /* one iovec per element */
+
+ frame[*frame_cnt].idx_element = elem_cnt;
+ frame[*frame_cnt].num_element = cnt;
+ frame[*frame_cnt].idx_iovec = iov_used;
+ frame[*frame_cnt].num_iovec = in_total;
+ frame[*frame_cnt].size = frame_size;
+ (*frame_cnt)++;
iov_used += in_total;
- dlen = frame_size - hdrlen;
+ elem_cnt += cnt;
- /* reserve space for headers in iov */
- iov = &elem[elem_cnt].in_sg[0];
- assert(iov->iov_len >= hdrlen);
+ fillsize -= frame_size - hdrlen;
+ }
- iov->iov_base = (char *)iov->iov_base + hdrlen;
- iov->iov_len -= hdrlen;
- head[(*head_cnt)++] = elem_cnt;
+ /* build an iov array without headers */
+ for (i = 0, j = DISCARD_IOV_NUM; i < *frame_cnt &&
+ j < ARRAY_SIZE(iov_msg); i++) {
+ struct iov_tail data;
+ ssize_t cnt;
- fillsize -= dlen;
- elem_cnt += cnt;
+ data = IOV_TAIL(&iov_vu[frame[i].idx_iovec],
+ frame[i].num_iovec, 0);
+ iov_drop_header(&data, hdrlen);
+
+ cnt = iov_tail_clone(&iov_msg[j], ARRAY_SIZE(iov_msg) - j,
+ &data);
+ if (cnt == -1)
+ die("Missing entries in iov_msg");
+
+ j += cnt;
}
- if (tcp_prepare_iov(&mh_sock, iov_vu, already_sent, elem_cnt))
+ if (tcp_prepare_iov(&mh_sock, iov_msg, already_sent,
+ j - DISCARD_IOV_NUM)) {
/* Expect caller to do a TCP reset */
+ vu_queue_rewind(vq, elem_cnt);
return -1;
+ }
do
ret = recvmsg(s, &mh_sock, MSG_PEEK);
@@ -247,28 +281,47 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
if (!peek_offset_cap)
ret -= already_sent;
- i = vu_pad(&iov_vu[DISCARD_IOV_NUM], iov_used, hdrlen, ret);
+ dlen = ret;
- /* adjust head count */
- while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
- (*head_cnt)--;
+ /* truncate frame */
+ *elem_used = 0;
+ for (i = 0; i < *frame_cnt; i++) {
+ struct vu_frame *f = &frame[i];
- /* mark end of array */
- head[*head_cnt] = i;
- *iov_cnt = i;
+ if ((size_t)ret <= f->size - hdrlen) {
+ unsigned cnt;
- /* release unused buffers */
- vu_queue_rewind(vq, elem_cnt - i);
+ cnt = vu_pad(&iov_vu[f->idx_iovec], f->num_iovec,
+ 0, hdrlen + ret);
- /* restore space for headers in iov */
- for (i = 0; i < *head_cnt; i++) {
- struct iovec *iov = &elem[head[i]].in_sg[0];
+ f->size = ret + hdrlen;
+ f->num_iovec = cnt;
- iov->iov_base = (char *)iov->iov_base - hdrlen;
- iov->iov_len += hdrlen;
+ for (j = 0; j < f->num_element; j++) {
+ struct vu_virtq_element *e;
+
+ e = &elem[f->idx_element + j];
+ if (cnt <= e->in_num) {
+ e->in_num = cnt;
+ j++;
+ break;
+ }
+ cnt -= e->in_num;
+ }
+ f->num_element = j;
+ *elem_used += j;
+ i++;
+ break;
+ }
+ *elem_used += f->num_element;
+ ret -= f->size - hdrlen;
}
+ *frame_cnt = i;
- return ret;
+ /* release unused buffers */
+ vu_queue_rewind(vq, elem_cnt - *elem_used);
+
+ return dlen;
}
/**
@@ -341,7 +394,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
ssize_t len, previous_dlen;
- int i, iov_cnt, head_cnt;
+ int i, elem_cnt, frame_cnt;
size_t hdrlen, fillsize;
int v6 = CONN_V6(conn);
uint32_t already_sent;
@@ -381,7 +434,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
* data from the socket
*/
len = tcp_vu_sock_recv(c, vq, conn, v6, already_sent, fillsize,
- &iov_cnt, &head_cnt);
+ &elem_cnt, &frame_cnt);
if (len < 0) {
if (len != -EAGAIN && len != -EWOULDBLOCK) {
tcp_rst(c, conn);
@@ -395,6 +448,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
if (!len) {
+ vu_queue_rewind(vq, elem_cnt);
if (already_sent) {
conn_flag(c, conn, STALLED);
} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
@@ -432,33 +486,32 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
*/
hdrlen = tcp_vu_hdrlen(v6);
- for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
- struct iovec *iov = &elem[head[i]].in_sg[0];
- int buf_cnt = head[i + 1] - head[i];
- size_t frame_size = iov_size(iov, buf_cnt);
- bool push = i == head_cnt - 1;
+ for (i = 0, previous_dlen = -1, check = -1; i < frame_cnt; i++) {
+ struct iovec *iov = &iov_vu[frame[i].idx_iovec];
+ int iov_cnt = frame[i].num_iovec;
+ bool push = i == frame_cnt - 1;
ssize_t dlen;
- assert(frame_size >= hdrlen);
+ assert(frame[i].size >= hdrlen);
- dlen = frame_size - hdrlen;
- vu_set_vnethdr(iov->iov_base, buf_cnt);
+ dlen = frame[i].size - hdrlen;
+ vu_set_vnethdr(iov->iov_base, frame[i].num_element);
/* The IPv4 header checksum varies only with dlen */
if (previous_dlen != dlen)
check = -1;
previous_dlen = dlen;
- tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
+ tcp_vu_prepare(c, conn, iov, iov_cnt, &check, !*c->pcap, push);
if (*c->pcap)
- pcap_iov(iov, buf_cnt, VNET_HLEN);
+ pcap_iov(iov, iov_cnt, VNET_HLEN);
conn->seq_to_tap += dlen;
}
/* send packets */
- vu_flush(vdev, vq, elem, iov_cnt);
+ vu_flush(vdev, vq, elem, elem_cnt);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
--
2.53.0
* [PATCH 5/7] tcp: Use iov_tail to access headers in tcp_prepare_flags()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
` (3 preceding siblings ...)
2026-03-23 16:52 ` [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv() Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-23 16:52 ` [PATCH 6/7] iov: introduce iov_memcopy() Laurent Vivier
` (2 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
Instead of passing separate pointers to the TCP header and SYN options,
pass an iov_tail pointing to the start of the TCP payload area.
tcp_prepare_flags() then uses with_header() and IOV_DROP_HEADER() to
access the TCP header and SYN options directly within the iov, matching
the iov_tail-based approach already used by tcp_fill_headers().
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
tcp.c | 71 +++++++++++++++++++++++++++-----------------------
tcp_buf.c | 8 +++---
tcp_internal.h | 2 +-
tcp_vu.c | 9 +++----
4 files changed, 48 insertions(+), 42 deletions(-)
diff --git a/tcp.c b/tcp.c
index 058792d5b184..ad601567d39f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1295,8 +1295,7 @@ static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
* @c: Execution context
* @conn: Connection pointer
* @flags: TCP flags: if not set, send segment only if ACK is due
- * @th: TCP header to update
- * @opts: TCP option buffer (output parameter)
+ * @payload: TCP payload including TCP header
* @optlen: size of the TCP option buffer (output parameter)
*
* Return: < 0 error code on connection reset,
@@ -1304,10 +1303,11 @@ static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
* 1 otherwise
*/
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
- int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
+ int flags, const struct iov_tail *payload,
size_t *optlen)
{
struct tcp_info_linux tinfo = { 0 };
+ struct iov_tail current = *payload;
socklen_t sl = sizeof(tinfo);
int s = conn->sock;
@@ -1330,6 +1330,40 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
*optlen = 0;
if (flags & SYN) {
+ conn->ws_to_tap = MIN(MAX_WS, tinfo.tcpi_snd_wscale);
+
+ *optlen = sizeof(struct tcp_syn_opts);
+ } else {
+ flags |= ACK;
+ }
+
+ with_header(struct tcphdr, th, &current) {
+ th->doff = (sizeof(*th) + *optlen) / 4;
+
+ th->ack = !!(flags & ACK);
+ th->psh = !!(flags & PSH);
+ th->rst = !!(flags & RST);
+ th->syn = !!(flags & SYN);
+ th->fin = !!(flags & FIN);
+
+ if (th->ack) {
+ if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
+ conn->wnd_to_tap)
+ conn_flag(c, conn, ~ACK_TO_TAP_DUE);
+ else
+ conn_flag(c, conn, ACK_TO_TAP_DUE);
+ }
+
+ if (th->fin)
+ conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+ /* RFC 793, 3.1: "[...] and the first data octet is ISN+1." */
+ if (th->fin || th->syn)
+ conn->seq_to_tap++;
+ }
+ IOV_DROP_HEADER(&current, struct tcphdr);
+
+ if (*optlen) {
int mss;
if (!c->mtu) {
@@ -1348,37 +1382,10 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
mss = ROUND_DOWN(mss, PAGE_SIZE);
}
- conn->ws_to_tap = MIN(MAX_WS, tinfo.tcpi_snd_wscale);
-
- *opts = TCP_SYN_OPTS(mss, conn->ws_to_tap);
- *optlen = sizeof(*opts);
- } else {
- flags |= ACK;
+ with_header(struct tcp_syn_opts, opts, &current)
+ *opts = TCP_SYN_OPTS(mss, conn->ws_to_tap);
}
- th->doff = (sizeof(*th) + *optlen) / 4;
-
- th->ack = !!(flags & ACK);
- th->psh = !!(flags & PSH);
- th->rst = !!(flags & RST);
- th->syn = !!(flags & SYN);
- th->fin = !!(flags & FIN);
-
- if (th->ack) {
- if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
- conn->wnd_to_tap)
- conn_flag(c, conn, ~ACK_TO_TAP_DUE);
- else
- conn_flag(c, conn, ACK_TO_TAP_DUE);
- }
-
- if (th->fin)
- conn_flag(c, conn, ACK_FROM_TAP_DUE);
-
- /* RFC 793, 3.1: "[...] and the first data octet is ISN+1." */
- if (th->fin || th->syn)
- conn->seq_to_tap++;
-
return 1;
}
diff --git a/tcp_buf.c b/tcp_buf.c
index 891043c96dcb..ffd250c8798f 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -197,7 +197,7 @@ static void tcp_l2_buf_fill_headers(const struct ctx *c,
*/
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
{
- struct tcp_payload_t *payload;
+ struct iov_tail tail;
struct iovec *iov;
size_t optlen;
size_t l4len;
@@ -211,10 +211,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp_eth_hdr[tcp_payload_used]);
- payload = iov[TCP_IOV_PAYLOAD].iov_base;
+ iov[TCP_IOV_PAYLOAD].iov_len = sizeof(struct tcp_payload_t);
+ tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
seq = conn->seq_to_tap;
- ret = tcp_prepare_flags(c, conn, flags, &payload->th,
- (struct tcp_syn_opts *)&payload->data, &optlen);
+ ret = tcp_prepare_flags(c, conn, flags, &tail, &optlen);
if (ret <= 0)
return ret;
diff --git a/tcp_internal.h b/tcp_internal.h
index 136e947f6e70..9f745d0752ff 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -190,7 +190,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
bool force_seq, struct tcp_info_linux *tinfo);
int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
- int flags, struct tcphdr *th, struct tcp_syn_opts *opts,
+ int flags, const struct iov_tail *payload,
size_t *optlen);
int tcp_set_peek_offset(const struct tcp_tap_conn *conn, int offset);
diff --git a/tcp_vu.c b/tcp_vu.c
index 96cd9da1caae..d6bd6a34d6da 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -90,7 +90,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
struct ipv6hdr *ip6h = NULL;
struct iphdr *ip4h = NULL;
struct iovec flags_iov[2];
- struct tcp_syn_opts *opts;
struct iov_tail payload;
size_t optlen, hdrlen;
struct tcphdr *th;
@@ -104,13 +103,13 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
&flags_iov[0], 1, NULL,
- hdrlen + sizeof(*opts), NULL);
+ hdrlen + sizeof(struct tcp_syn_opts), NULL);
if (elem_cnt != 1)
return -1;
assert(flags_elem[0].in_num == 1);
assert(flags_elem[0].in_sg[0].iov_len >=
- MAX(hdrlen + sizeof(*opts), ETH_ZLEN + VNET_HLEN));
+ MAX(hdrlen + sizeof(struct tcp_syn_opts), ETH_ZLEN + VNET_HLEN));
vu_set_vnethdr(flags_elem[0].in_sg[0].iov_base, 1);
@@ -139,8 +138,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
th->ack = 1;
seq = conn->seq_to_tap;
- opts = (struct tcp_syn_opts *)(th + 1);
- ret = tcp_prepare_flags(c, conn, flags, th, opts, &optlen);
+ payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen - sizeof(*th));
+ ret = tcp_prepare_flags(c, conn, flags, &payload, &optlen);
if (ret <= 0) {
vu_queue_rewind(vq, 1);
return ret;
--
2.53.0
* [PATCH 6/7] iov: introduce iov_memcopy()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
` (4 preceding siblings ...)
2026-03-23 16:52 ` [PATCH 5/7] tcp: Use iov_tail to access headers in tcp_prepare_flags() Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-23 16:52 ` [PATCH 7/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag() Laurent Vivier
2026-03-25 5:07 ` [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element David Gibson
7 siblings, 0 replies; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
Copy buffer data from one iovec array to another.
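As a sketch, the copy loop's semantics can be exercised with a standalone variant (iov_skip_bytes() inlined, MIN() defined locally):

```c
#include <string.h>
#include <sys/uio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Standalone variant of iov_memcopy(): skip @offs/@dst_offs bytes in
 * each array, then copy as much as both arrays hold, crossing iovec
 * boundaries on either side independently.  Returns bytes copied.
 */
static size_t demo_iov_memcopy(struct iovec *dst, size_t dst_cnt,
			       size_t dst_offs,
			       const struct iovec *src, size_t src_cnt,
			       size_t offs)
{
	size_t total = 0;
	unsigned int i = 0, j = 0;

	while (i < src_cnt && offs >= src[i].iov_len)
		offs -= src[i++].iov_len;
	while (j < dst_cnt && dst_offs >= dst[j].iov_len)
		dst_offs -= dst[j++].iov_len;

	while (i < src_cnt && j < dst_cnt) {
		size_t len = MIN(dst[j].iov_len - dst_offs,
				 src[i].iov_len - offs);

		memcpy((char *)dst[j].iov_base + dst_offs,
		       (const char *)src[i].iov_base + offs, len);

		dst_offs += len;
		offs += len;
		total += len;

		if (dst_offs == dst[j].iov_len) {
			dst_offs = 0;
			j++;
		}
		if (offs == src[i].iov_len) {
			offs = 0;
			i++;
		}
	}

	return total;
}
```

The loop handles arbitrary splits on either side, which is what later lets a frame be duplicated into a differently fragmented descriptor.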
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
iov.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
iov.h | 2 ++
2 files changed, 48 insertions(+)
diff --git a/iov.c b/iov.c
index c1eda9941f32..55263ec10b44 100644
--- a/iov.c
+++ b/iov.c
@@ -196,6 +196,52 @@ void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
}
}
+/**
+ * iov_memcopy() - Copy data between two iovec arrays
+ * @dst_iov: Destination iovec array
+ * @dst_iov_cnt: Number of elements in destination iovec array
+ * @dst_offs: Destination offset
+ * @iov: Source iovec array
+ * @iov_cnt: Number of elements in source iovec array
+ * @offs: Source offset
+ *
+ * Return: total number of bytes copied
+ */
+/* cppcheck-suppress unusedFunction */
+size_t iov_memcopy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offs,
+ const struct iovec *iov, size_t iov_cnt, size_t offs)
+{
+ unsigned int i, j;
+ size_t total = 0;
+
+ i = iov_skip_bytes(iov, iov_cnt, offs, &offs);
+ j = iov_skip_bytes(dst_iov, dst_iov_cnt, dst_offs, &dst_offs);
+
+ /* copying data */
+ while (i < iov_cnt && j < dst_iov_cnt) {
+ size_t len = MIN(dst_iov[j].iov_len - dst_offs,
+ iov[i].iov_len - offs);
+
+ memcpy((char *)dst_iov[j].iov_base + dst_offs,
+ (const char *)iov[i].iov_base + offs, len);
+
+ dst_offs += len;
+ offs += len;
+ total += len;
+
+ if (dst_offs == dst_iov[j].iov_len) {
+ dst_offs = 0;
+ j++;
+ }
+ if (offs == iov[i].iov_len) {
+ offs = 0;
+ i++;
+ }
+ }
+
+ return total;
+}
+
/**
* iov_tail_prune() - Remove any unneeded buffers from an IOV tail
* @tail: IO vector tail (modified)
diff --git a/iov.h b/iov.h
index 4ce425ccdbe5..d3c51dc8d6da 100644
--- a/iov.h
+++ b/iov.h
@@ -32,6 +32,8 @@ size_t iov_size(const struct iovec *iov, size_t iov_cnt);
size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size);
void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
size_t length);
+size_t iov_memcopy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offs,
+ const struct iovec *iov, size_t iov_cnt, size_t offs);
/*
* DOC: Theory of Operation, struct iov_tail
--
2.53.0
* [PATCH 7/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag()
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
` (5 preceding siblings ...)
2026-03-23 16:52 ` [PATCH 6/7] iov: introduce iov_memcopy() Laurent Vivier
@ 2026-03-23 16:52 ` Laurent Vivier
2026-03-25 5:07 ` [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element David Gibson
7 siblings, 0 replies; 15+ messages in thread
From: Laurent Vivier @ 2026-03-23 16:52 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier
tcp_vu_send_flag() previously used vu_eth(), vu_ip(), vu_payloadv4() and
vu_payloadv6() to access headers via direct pointer arithmetic on a
single contiguous buffer. Replace these with iov_tail-based
with_header() and IOV_DROP_HEADER() accessors, which work correctly with
multi-buffer virtio descriptors.
Enlarge the flags_iov and flags_elem arrays to accommodate split
descriptors, and pass iov_cnt through vu_collect() so we track
how many iovecs the frame actually spans.
Extract a tcp_vu_send_dup() helper for the DUP_ACK path that uses
iov_memcopy() to duplicate a frame across potentially multi-buffer
descriptors, replacing the previous single-buffer memcpy.
With all callers converted, remove the vu_eth(), vu_ip(),
vu_payloadv4() and vu_payloadv6() inline helpers from vu_common.h,
and drop the cppcheck-suppress annotation on iov_memcopy() which is
now used.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
iov.c | 1 -
tcp_vu.c | 113 ++++++++++++++++++++++++++++------------------------
vu_common.h | 20 ----------
3 files changed, 62 insertions(+), 72 deletions(-)
diff --git a/iov.c b/iov.c
index 55263ec10b44..9638dc6de03d 100644
--- a/iov.c
+++ b/iov.c
@@ -207,7 +207,6 @@ void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
*
* Return: total number of bytes copied
*/
-/* cppcheck-suppress unusedFunction */
size_t iov_memcopy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offs,
const struct iovec *iov, size_t iov_cnt, size_t offs)
{
diff --git a/tcp_vu.c b/tcp_vu.c
index d6bd6a34d6da..776b47aea18c 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -74,6 +74,41 @@ static size_t tcp_vu_hdrlen(bool v6)
return hdrlen;
}
+/**
+ * tcp_vu_send_dup() - Duplicate a frame into a new virtqueue element
+ * @c: Execution context
+ * @vq: Receive virtqueue
+ * @dest_elem: Destination virtqueue element to collect
+ * @dest_iov: Destination iovec array for collected buffers
+ * @max_dest_iov: Maximum number of entries in @dest_iov
+ * @src_iov: Source iovec array containing the frame to duplicate
+ * @src_cnt: Number of entries in @src_iov
+ *
+ * Return: number of virtqueue elements collected (0 if none available)
+ */
+static int tcp_vu_send_dup(const struct ctx *c, struct vu_virtq *vq,
+ struct vu_virtq_element *dest_elem,
+ struct iovec *dest_iov, size_t max_dest_iov,
+ const struct iovec *src_iov, size_t src_cnt)
+{
+ size_t dest_cnt, frame_size = iov_size(src_iov, src_cnt);
+ const struct vu_dev *vdev = c->vdev;
+ int elem_cnt;
+
+ elem_cnt = vu_collect(vdev, vq, dest_elem, 1, dest_iov, max_dest_iov,
+ &dest_cnt, frame_size, NULL);
+ if (elem_cnt == 0)
+ return 0;
+
+ vu_pad(dest_iov, dest_cnt, 0, frame_size);
+ iov_memcopy(dest_iov, dest_cnt, 0, src_iov, src_cnt, 0);
+
+ if (*c->pcap)
+ pcap_iov(dest_iov, dest_cnt, VNET_HLEN);
+
+ return elem_cnt;
+}
+
/**
* tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
* @c: Execution context
@@ -87,13 +122,9 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
struct vu_dev *vdev = c->vdev;
struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
struct vu_virtq_element flags_elem[2];
- struct ipv6hdr *ip6h = NULL;
- struct iphdr *ip4h = NULL;
- struct iovec flags_iov[2];
+ size_t optlen, hdrlen, iov_cnt;
+ struct iovec flags_iov[64];
struct iov_tail payload;
- size_t optlen, hdrlen;
- struct tcphdr *th;
- struct ethhdr *eh;
uint32_t seq;
int elem_cnt;
int nb_ack;
@@ -102,77 +133,57 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
hdrlen = tcp_vu_hdrlen(CONN_V6(conn));
elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
- &flags_iov[0], 1, NULL,
+ flags_iov, ARRAY_SIZE(flags_iov), &iov_cnt,
hdrlen + sizeof(struct tcp_syn_opts), NULL);
- if (elem_cnt != 1)
+ if (elem_cnt == 0)
return -1;
- assert(flags_elem[0].in_num == 1);
- assert(flags_elem[0].in_sg[0].iov_len >=
- MAX(hdrlen + sizeof(struct tcp_syn_opts), ETH_ZLEN + VNET_HLEN));
-
- vu_set_vnethdr(flags_elem[0].in_sg[0].iov_base, 1);
+ vu_set_vnethdr(flags_elem[0].in_sg[0].iov_base, elem_cnt);
- eh = vu_eth(flags_elem[0].in_sg[0].iov_base);
+ payload = IOV_TAIL(flags_elem[0].in_sg, iov_cnt, VNET_HLEN);
- memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
- memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+ with_header(struct ethhdr, eh, &payload)
+ memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+ IOV_DROP_HEADER(&payload, struct ethhdr);
if (CONN_V4(conn)) {
- eh->h_proto = htons(ETH_P_IP);
-
- ip4h = vu_ip(flags_elem[0].in_sg[0].iov_base);
- *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
-
- th = vu_payloadv4(flags_elem[0].in_sg[0].iov_base);
+ with_header(struct iphdr, ip4h, &payload)
+ *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+ IOV_DROP_HEADER(&payload, struct iphdr);
} else {
- eh->h_proto = htons(ETH_P_IPV6);
-
- ip6h = vu_ip(flags_elem[0].in_sg[0].iov_base);
- *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
- th = vu_payloadv6(flags_elem[0].in_sg[0].iov_base);
+ with_header(struct ipv6hdr, ip6h, &payload)
+ *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+ IOV_DROP_HEADER(&payload, struct ipv6hdr);
}
- memset(th, 0, sizeof(*th));
- th->doff = sizeof(*th) / 4;
- th->ack = 1;
+ with_header(struct tcphdr, th, &payload)
+ memset(th, 0, sizeof(*th));
seq = conn->seq_to_tap;
- payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen - sizeof(*th));
ret = tcp_prepare_flags(c, conn, flags, &payload, &optlen);
if (ret <= 0) {
- vu_queue_rewind(vq, 1);
+ vu_queue_rewind(vq, elem_cnt);
return ret;
}
- vu_pad(&flags_iov[0], 1, 0, hdrlen + optlen);
+ vu_pad(flags_elem[0].in_sg, iov_cnt, 0, hdrlen + optlen);
if (flags & KEEPALIVE)
seq--;
- payload = IOV_TAIL(flags_elem[0].in_sg, 1, VNET_HLEN);
+ payload = IOV_TAIL(flags_elem[0].in_sg, iov_cnt, VNET_HLEN);
tcp_fill_headers(c, conn, CONN_V4(conn), &payload, -1, seq, !*c->pcap);
if (*c->pcap)
- pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
- nb_ack = 1;
+ pcap_iov(flags_elem[0].in_sg, iov_cnt, VNET_HLEN);
+ nb_ack = elem_cnt;
if (flags & DUP_ACK) {
- elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
- &flags_iov[1], 1, NULL,
- hdrlen + optlen, NULL);
- if (elem_cnt == 1 &&
- flags_elem[1].in_sg[0].iov_len >=
- flags_elem[0].in_sg[0].iov_len) {
- vu_pad(&flags_iov[1], 1, 0, hdrlen + optlen);
- memcpy(flags_elem[1].in_sg[0].iov_base,
- flags_elem[0].in_sg[0].iov_base,
- flags_elem[0].in_sg[0].iov_len);
- nb_ack++;
-
- if (*c->pcap)
- pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN);
- }
+ elem_cnt = tcp_vu_send_dup(c, vq, &flags_elem[elem_cnt],
+ &flags_iov[iov_cnt],
+ ARRAY_SIZE(flags_iov) - iov_cnt,
+ flags_elem[0].in_sg, iov_cnt);
+ nb_ack += elem_cnt;
}
vu_flush(vdev, vq, flags_elem, nb_ack);
diff --git a/vu_common.h b/vu_common.h
index f09dd25d3d1f..8e42ce88b2ab 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -15,26 +15,6 @@
#include "ip.h"
#include "virtio.h"
-static inline void *vu_eth(void *base)
-{
- return ((char *)base + VNET_HLEN);
-}
-
-static inline void *vu_ip(void *base)
-{
- return (struct ethhdr *)vu_eth(base) + 1;
-}
-
-static inline void *vu_payloadv4(void *base)
-{
- return (struct iphdr *)vu_ip(base) + 1;
-}
-
-static inline void *vu_payloadv6(void *base)
-{
- return (struct ipv6hdr *)vu_ip(base) + 1;
-}
-
int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
struct vu_virtq_element *elem, int max_elem,
struct iovec *in_sg, size_t max_in_sg, size_t *in_total,
--
2.53.0
* Re: [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum
2026-03-23 16:52 ` [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum Laurent Vivier
@ 2026-03-24 3:53 ` David Gibson
2026-03-24 7:56 ` Laurent Vivier
0 siblings, 1 reply; 15+ messages in thread
From: David Gibson @ 2026-03-24 3:53 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
> tcp_fill_headers() takes a pointer to a previously computed IPv4 header
> checksum to avoid recalculating it when the payload length doesn't
> change. A subsequent patch makes tcp_fill_headers() access ip4h via
> with_header() which scopes it to a temporary variable, so a pointer to
> ip4h->check would become dangling after the with_header() block.
Oof, that kind of indicates the dangers with the with_header()
structure. Is that change merged already? If not we should probably
fix it before merge rather than after the fact.
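A reduced sketch of the hazard, under the assumption (stated in the commit message) that with_header() copies the header into a block-scoped temporary and writes it back when the block ends; macro and names below are simplified, not the real iov.h implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct demo_hdr { uint16_t check; };	/* stand-in for struct iphdr */

/* Copy-in/copy-out pattern: @var aliases a block-scoped temporary,
 * flushed back to @buf when the block ends.  Any pointer into @var
 * taken inside the block dangles once the block is left.
 */
#define demo_with_header(var, buf, ...) do {			\
	struct demo_hdr var##_stor;				\
	struct demo_hdr *var = &var##_stor;			\
	memcpy(var, (buf), sizeof(*var));			\
	__VA_ARGS__						\
	memcpy((buf), var, sizeof(*var));			\
} while (0)

/* Return whether &h->check, taken inside the block, points into
 * @buf.  It doesn't: it points at the scoped temporary, so caching
 * it (as the old ip4_check pointer did) would leave it dangling.
 */
static int demo_check_aliases_buf(char *buf)
{
	int aliases_buf = 1;

	demo_with_header(h, buf, {
		h->check = 0x1234;
		aliases_buf = ((char *)&h->check == buf);
	});

	return aliases_buf;
}
```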
> Pass the checksum by value as an int instead, using -1 as the sentinel
> to indicate that the checksum should be computed from scratch (replacing
> the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be
> a valid checksum value in an int.
That said, passing this by value I think is cleaner than the pointer,
regardless of other reasons. Would it also make sense to flag
no_tcp_csum with an additional special value (a #define to -2, or
whatever) instead of using an extra parameter? Logically the checksum
parameter would be:
CALCULATE | UNNEEDED | specific value
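That could look something like this (the sentinel names are hypothetical, not existing passt identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel encoding for the ip4_check parameter, as
 * suggested above: negative values select a behaviour, non-negative
 * values carry a previously computed checksum.
 */
#define IP4_CSUM_CALCULATE	(-1)
#define IP4_CSUM_UNNEEDED	(-2)

static uint16_t demo_csum(void)
{
	return 0xbeef;	/* stand-in for csum_ip4_header() */
}

static uint16_t demo_fill_check(int ip4_check)
{
	if (ip4_check == IP4_CSUM_UNNEEDED)
		return 0;		/* caller doesn't need a valid sum */
	if (ip4_check == IP4_CSUM_CALCULATE)
		return demo_csum();	/* compute from scratch */
	return (uint16_t)ip4_check;	/* reuse the cached value */
}
```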
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
> tcp.c | 7 +++----
> tcp_buf.c | 10 +++++-----
> tcp_internal.h | 3 +--
> tcp_vu.c | 12 ++++++------
> 4 files changed, 15 insertions(+), 17 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index b14586249c4e..158a5be0327e 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -953,8 +953,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> struct ethhdr *eh,
> struct iphdr *ip4h, struct ipv6hdr *ip6h,
> struct tcphdr *th, struct iov_tail *payload,
> - const uint16_t *ip4_check, uint32_t seq,
> - bool no_tcp_csum)
> + int ip4_check, uint32_t seq, bool no_tcp_csum)
> {
> const struct flowside *tapside = TAPFLOW(conn);
> size_t l4len = iov_tail_size(payload) + sizeof(*th);
> @@ -974,8 +973,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> ip4h->saddr = src4->s_addr;
> ip4h->daddr = dst4->s_addr;
>
> - if (ip4_check)
> - ip4h->check = *ip4_check;
> + if (ip4_check != -1)
> + ip4h->check = ip4_check;
> else
> ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
> *src4, *dst4);
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 41965b107567..bc0f58dd7a5e 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -172,7 +172,7 @@ static void tcp_l2_buf_pad(struct iovec *iov)
> */
> static void tcp_l2_buf_fill_headers(const struct ctx *c,
> struct tcp_tap_conn *conn,
> - struct iovec *iov, const uint16_t *check,
> + struct iovec *iov, int check,
> uint32_t seq, bool no_tcp_csum)
> {
> struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
> @@ -233,7 +233,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> if (flags & KEEPALIVE)
> seq--;
>
> - tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
> + tcp_l2_buf_fill_headers(c, conn, iov, -1, seq, false);
>
> tcp_l2_buf_pad(iov);
>
> @@ -270,7 +270,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> ssize_t dlen, int no_csum, uint32_t seq, bool push)
> {
> struct tcp_payload_t *payload;
> - const uint16_t *check = NULL;
> + int check = -1;
> struct iovec *iov;
>
> conn->seq_to_tap = seq + dlen;
> @@ -279,9 +279,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (CONN_V4(conn)) {
> if (no_csum) {
> struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
> - struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> + const struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
>
> - check = &iph->check;
> + check = iph->check;
> }
> iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
> } else if (CONN_V6(conn)) {
> diff --git a/tcp_internal.h b/tcp_internal.h
> index d9408852571f..bb7a6629839c 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -187,8 +187,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> struct ethhdr *eh,
> struct iphdr *ip4h, struct ipv6hdr *ip6h,
> struct tcphdr *th, struct iov_tail *payload,
> - const uint16_t *ip4_check, uint32_t seq,
> - bool no_tcp_csum);
> + int ip4_check, uint32_t seq, bool no_tcp_csum);
>
> int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> bool force_seq, struct tcp_info_linux *tinfo);
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 3001defb5467..a21ee3499aed 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -138,7 +138,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> seq--;
>
> tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> - NULL, seq, !*c->pcap);
> + -1, seq, !*c->pcap);
>
> if (*c->pcap)
> pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
> @@ -283,7 +283,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> */
> static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> struct iovec *iov, size_t iov_cnt,
> - const uint16_t **check, bool no_tcp_csum, bool push)
> + int *check, bool no_tcp_csum, bool push)
> {
> const struct flowside *toside = TAPFLOW(conn);
> bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> @@ -329,7 +329,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> *check, conn->seq_to_tap, no_tcp_csum);
> if (ip4h)
> - *check = &ip4h->check;
> + *check = ip4h->check;
> }
>
> /**
> @@ -350,7 +350,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> size_t hdrlen, fillsize;
> int v6 = CONN_V6(conn);
> uint32_t already_sent;
> - const uint16_t *check;
> + int check;
>
> if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> debug("Got packet, but RX virtqueue not usable yet");
> @@ -437,7 +437,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> */
>
> hdrlen = tcp_vu_hdrlen(v6);
> - for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
> + for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
> struct iovec *iov = &elem[head[i]].in_sg[0];
> int buf_cnt = head[i + 1] - head[i];
> size_t frame_size = iov_size(iov, buf_cnt);
> @@ -451,7 +451,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>
> /* The IPv4 header checksum varies only with dlen */
> if (previous_dlen != dlen)
> - check = NULL;
> + check = -1;
> previous_dlen = dlen;
>
> tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
> --
> 2.53.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers()
2026-03-23 16:52 ` [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers() Laurent Vivier
@ 2026-03-24 3:58 ` David Gibson
0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2026-03-24 3:58 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Mon, Mar 23, 2026 at 05:52:54PM +0100, Laurent Vivier wrote:
> Instead of receiving individual pointers to each protocol header (eh,
> ip4h, ip6h, th), have tcp_fill_headers() take an iov_tail starting at
> the Ethernet header and walk through it using with_header() and
> IOV_DROP_HEADER() to access each header in turn.
>
> Replace the ip4h/ip6h NULL-pointer convention with a bool ipv4
> parameter, and move Ethernet header filling (MAC address and ethertype)
> into tcp_fill_headers() as well, since the function now owns the full
> header chain.
>
> This simplifies callers, which no longer need to extract and pass
> individual header pointers.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
> tcp.c | 106 +++++++++++++++++++++++++++----------------------
> tcp_buf.c | 16 ++------
> tcp_internal.h | 4 +-
> tcp_vu.c | 12 +++---
> 4 files changed, 70 insertions(+), 68 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 158a5be0327e..058792d5b184 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -938,11 +938,8 @@ static void tcp_fill_header(struct tcphdr *th,
> * tcp_fill_headers() - Fill 802.3, IP, TCP headers
> * @c: Execution context
> * @conn: Connection pointer
> - * @eh: Pointer to Ethernet header
> - * @ip4h: Pointer to IPv4 header, or NULL
> - * @ip6h: Pointer to IPv6 header, or NULL
> - * @th: Pointer to TCP header
> - * @payload: TCP payload
> + * @ipv4: True for IPv4, false for IPv6
> + * @payload: IOV tail starting at the Ethernet header
As with UDP, I'm not sure I like removing the separate parameters. A
little more so, since this doesn't (any more) have the ugly
udp_payload_t equivalent. At minimum @payload is no longer a good
name.
> * @ip4_check: IPv4 checksum, if already known
> * @seq: Sequence number for this segment
> * @no_tcp_csum: Do not set TCP checksum
> @@ -950,74 +947,89 @@ static void tcp_fill_header(struct tcphdr *th,
> * Return: frame length (including L2 headers)
> */
> size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> - struct ethhdr *eh,
> - struct iphdr *ip4h, struct ipv6hdr *ip6h,
> - struct tcphdr *th, struct iov_tail *payload,
> + bool ipv4, const struct iov_tail *payload,
> int ip4_check, uint32_t seq, bool no_tcp_csum)
> {
> const struct flowside *tapside = TAPFLOW(conn);
> - size_t l4len = iov_tail_size(payload) + sizeof(*th);
> + struct iov_tail current = *payload;
> uint8_t *omac = conn->f.tap_omac;
> - size_t l3len = l4len;
> + size_t l3len, l4len;
> uint32_t psum = 0;
>
> - if (ip4h) {
> + with_header(struct ethhdr, eh, &current) {
> + if (ipv4)
> + eh->h_proto = htons_constant(ETH_P_IP);
> + else
> + eh->h_proto = htons_constant(ETH_P_IPV6);
> +
> + /* Find if neighbour table has a recorded MAC address */
> + if (MAC_IS_UNDEF(omac))
> + fwd_neigh_mac_get(c, &tapside->oaddr, omac);
> + eth_update_mac(eh, NULL, omac);
> + }
> + IOV_DROP_HEADER(&current, struct ethhdr);
> +
> + l3len = iov_tail_size(&current);
> +
> + if (ipv4) {
> const struct in_addr *src4 = inany_v4(&tapside->oaddr);
> const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
>
> assert(src4 && dst4);
>
> - l3len += + sizeof(*ip4h);
> + l4len = l3len - sizeof(struct iphdr);
>
> - ip4h->tot_len = htons(l3len);
> - ip4h->saddr = src4->s_addr;
> - ip4h->daddr = dst4->s_addr;
> + with_header(struct iphdr, ip4h, &current) {
> + ip4h->tot_len = htons(l3len);
> + ip4h->saddr = src4->s_addr;
> + ip4h->daddr = dst4->s_addr;
>
> - if (ip4_check != -1)
> - ip4h->check = ip4_check;
> - else
> - ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
> - *src4, *dst4);
> + if (ip4_check != -1)
> + ip4h->check = ip4_check;
> + else
> + ip4h->check = csum_ip4_header(l3len,
> + IPPROTO_TCP,
> + *src4, *dst4);
> + }
> + IOV_DROP_HEADER(&current, struct iphdr);
>
> if (!no_tcp_csum) {
> psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
> *src4, *dst4);
> }
> - eh->h_proto = htons_constant(ETH_P_IP);
> - }
> -
> - if (ip6h) {
> - l3len += sizeof(*ip6h);
> + } else {
> + l4len = l3len - sizeof(struct ipv6hdr);
>
> - ip6h->payload_len = htons(l4len);
> - ip6h->saddr = tapside->oaddr.a6;
> - ip6h->daddr = tapside->eaddr.a6;
> + with_header(struct ipv6hdr, ip6h, &current) {
> + ip6h->payload_len = htons(l4len);
> + ip6h->saddr = tapside->oaddr.a6;
> + ip6h->daddr = tapside->eaddr.a6;
>
> - ip6h->hop_limit = 255;
> - ip6h->version = 6;
> - ip6h->nexthdr = IPPROTO_TCP;
> + ip6h->hop_limit = 255;
> + ip6h->version = 6;
> + ip6h->nexthdr = IPPROTO_TCP;
>
> - ip6_set_flow_lbl(ip6h, conn->sock);
> + ip6_set_flow_lbl(ip6h, conn->sock);
>
> - if (!no_tcp_csum) {
> - psum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
> - &ip6h->saddr,
> - &ip6h->daddr);
> + if (!no_tcp_csum) {
> + psum = proto_ipv6_header_psum(l4len,
> + IPPROTO_TCP,
> + &ip6h->saddr,
> + &ip6h->daddr);
> + }
> }
> - eh->h_proto = htons_constant(ETH_P_IPV6);
> + IOV_DROP_HEADER(&current, struct ipv6hdr);
> }
>
> - /* Find if neighbour table has a recorded MAC address */
> - if (MAC_IS_UNDEF(omac))
> - fwd_neigh_mac_get(c, &tapside->oaddr, omac);
> - eth_update_mac(eh, NULL, omac);
> -
> - tcp_fill_header(th, conn, seq);
> -
> - if (no_tcp_csum)
> + with_header(struct tcphdr, th, &current) {
> + tcp_fill_header(th, conn, seq);
> th->check = 0;
> - else
> - tcp_update_csum(psum, th, payload);
> + }
> +
> + if (!no_tcp_csum) {
> + with_header(struct tcphdr, th, &current)
> + th->check = csum_iov_tail(&current, psum);
> + }
>
> return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
> }
> diff --git a/tcp_buf.c b/tcp_buf.c
> index bc0f58dd7a5e..891043c96dcb 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -175,23 +175,15 @@ static void tcp_l2_buf_fill_headers(const struct ctx *c,
> struct iovec *iov, int check,
> uint32_t seq, bool no_tcp_csum)
> {
> - struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
> - struct tcphdr th_storage, *th = IOV_REMOVE_HEADER(&tail, th_storage);
> + struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_ETH],
> + TCP_IOV_PAYLOAD + 1 - TCP_IOV_ETH, 0);
> struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base;
> const struct flowside *tapside = TAPFLOW(conn);
> - const struct in_addr *a4 = inany_v4(&tapside->oaddr);
> - struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base;
> - struct ipv6hdr *ip6h = NULL;
> - struct iphdr *ip4h = NULL;
> + bool ipv4 = inany_v4(&tapside->oaddr) != NULL;
> size_t l2len;
>
> - if (a4)
> - ip4h = iov[TCP_IOV_IP].iov_base;
> - else
> - ip6h = iov[TCP_IOV_IP].iov_base;
> + l2len = tcp_fill_headers(c, conn, ipv4, &tail, check, seq, no_tcp_csum);
>
> - l2len = tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &tail, check, seq,
> - no_tcp_csum);
> tap_hdr_update(taph, l2len);
> }
>
> diff --git a/tcp_internal.h b/tcp_internal.h
> index bb7a6629839c..136e947f6e70 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -184,9 +184,7 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
> struct tcp_info_linux;
>
> size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> - struct ethhdr *eh,
> - struct iphdr *ip4h, struct ipv6hdr *ip6h,
> - struct tcphdr *th, struct iov_tail *payload,
> + bool ipv4, const struct iov_tail *payload,
> int ip4_check, uint32_t seq, bool no_tcp_csum);
>
> int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> diff --git a/tcp_vu.c b/tcp_vu.c
> index a21ee3499aed..c6206b7a689c 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -132,13 +132,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> }
>
> vu_pad(&flags_iov[0], 1, 0, hdrlen + optlen);
> - payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
>
> if (flags & KEEPALIVE)
> seq--;
>
> - tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> - -1, seq, !*c->pcap);
> + payload = IOV_TAIL(flags_elem[0].in_sg, 1, VNET_HLEN);
> + tcp_fill_headers(c, conn, CONN_V4(conn), &payload, -1, seq, !*c->pcap);
>
> if (*c->pcap)
> pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
> @@ -288,10 +287,10 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> const struct flowside *toside = TAPFLOW(conn);
> bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> size_t hdrlen = tcp_vu_hdrlen(v6);
> - struct iov_tail payload = IOV_TAIL(iov, iov_cnt, hdrlen);
> char *base = iov[0].iov_base;
> struct ipv6hdr *ip6h = NULL;
> struct iphdr *ip4h = NULL;
> + struct iov_tail payload;
> struct tcphdr *th;
> struct ethhdr *eh;
>
> @@ -326,8 +325,9 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> th->ack = 1;
> th->psh = push;
>
> - tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> - *check, conn->seq_to_tap, no_tcp_csum);
> + payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
> + tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
> + no_tcp_csum);
> if (ip4h)
> *check = ip4h->check;
> }
> --
> 2.53.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum
2026-03-24 3:53 ` David Gibson
@ 2026-03-24 7:56 ` Laurent Vivier
2026-03-24 23:49 ` David Gibson
0 siblings, 1 reply; 15+ messages in thread
From: Laurent Vivier @ 2026-03-24 7:56 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On 3/24/26 04:53, David Gibson wrote:
> On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
>> tcp_fill_headers() takes a pointer to a previously computed IPv4 header
>> checksum to avoid recalculating it when the payload length doesn't
>> change. A subsequent patch makes tcp_fill_headers() access ip4h via
>> with_header() which scopes it to a temporary variable, so a pointer to
>> ip4h->check would become dangling after the with_header() block.
>
> Oof, that kind of indicates the dangers with the with_header()
> structure. Is that change merged already? If not we should probably
> fix it before merge rather than after the fact.
The problem appears in the subsequent patches of the series,
so at this point we prevent it rather than fix it.
>
>> Pass the checksum by value as an int instead, using -1 as the sentinel
>> to indicate that the checksum should be computed from scratch (replacing
>> the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be
>> a valid checksum value in an int.
>
> That said, passing this by value I think is cleaner than the pointer,
> regardless of other reasons. Would it also make sense to flag
> no_tcp_csum with an additional special value (a #define to -2, or
> whatever) instead of using an extra parameter? Logically the checksum
> parameter would be:
> CALCULATE | UNNEEDED | specific value
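The tri-state encoding suggested above could be sketched roughly like this; the macro names, the helper, and the "return 0" placeholder for the unneeded case are hypothetical illustrations, not code from the series:

```c
#include <stdint.h>

/* Hypothetical names: the series uses a bare -1 for "calculate";
 * the suggestion adds a second negative sentinel for "unneeded". */
#define IP4_CSUM_CALCULATE (-1)	/* compute checksum from scratch */
#define IP4_CSUM_UNNEEDED  (-2)	/* checksum won't be read (no pcap) */

/* Sketch of how tcp_fill_headers() could branch on the parameter.
 * Because any valid checksum is a uint16_t (0..65535), it is always
 * non-negative as an int, so both sentinels are unambiguous. */
static uint16_t fill_check(int ip4_check, uint16_t computed)
{
	if (ip4_check == IP4_CSUM_UNNEEDED)
		return 0;		/* skip the work entirely */
	if (ip4_check == IP4_CSUM_CALCULATE)
		return computed;	/* stand-in for csum_ip4_header() */
	return (uint16_t)ip4_check;	/* reuse the cached value */
}
```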
I can extract the patch from this series to make this improvement on the current code.
Thanks,
Laurent
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>> ---
>> tcp.c | 7 +++----
>> tcp_buf.c | 10 +++++-----
>> tcp_internal.h | 3 +--
>> tcp_vu.c | 12 ++++++------
>> 4 files changed, 15 insertions(+), 17 deletions(-)
>>
>> diff --git a/tcp.c b/tcp.c
>> index b14586249c4e..158a5be0327e 100644
>> --- a/tcp.c
>> +++ b/tcp.c
>> @@ -953,8 +953,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
>> struct ethhdr *eh,
>> struct iphdr *ip4h, struct ipv6hdr *ip6h,
>> struct tcphdr *th, struct iov_tail *payload,
>> - const uint16_t *ip4_check, uint32_t seq,
>> - bool no_tcp_csum)
>> + int ip4_check, uint32_t seq, bool no_tcp_csum)
>> {
>> const struct flowside *tapside = TAPFLOW(conn);
>> size_t l4len = iov_tail_size(payload) + sizeof(*th);
>> @@ -974,8 +973,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
>> ip4h->saddr = src4->s_addr;
>> ip4h->daddr = dst4->s_addr;
>>
>> - if (ip4_check)
>> - ip4h->check = *ip4_check;
>> + if (ip4_check != -1)
>> + ip4h->check = ip4_check;
>> else
>> ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
>> *src4, *dst4);
>> diff --git a/tcp_buf.c b/tcp_buf.c
>> index 41965b107567..bc0f58dd7a5e 100644
>> --- a/tcp_buf.c
>> +++ b/tcp_buf.c
>> @@ -172,7 +172,7 @@ static void tcp_l2_buf_pad(struct iovec *iov)
>> */
>> static void tcp_l2_buf_fill_headers(const struct ctx *c,
>> struct tcp_tap_conn *conn,
>> - struct iovec *iov, const uint16_t *check,
>> + struct iovec *iov, int check,
>> uint32_t seq, bool no_tcp_csum)
>> {
>> struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
>> @@ -233,7 +233,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>> if (flags & KEEPALIVE)
>> seq--;
>>
>> - tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
>> + tcp_l2_buf_fill_headers(c, conn, iov, -1, seq, false);
>>
>> tcp_l2_buf_pad(iov);
>>
>> @@ -270,7 +270,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>> ssize_t dlen, int no_csum, uint32_t seq, bool push)
>> {
>> struct tcp_payload_t *payload;
>> - const uint16_t *check = NULL;
>> + int check = -1;
>> struct iovec *iov;
>>
>> conn->seq_to_tap = seq + dlen;
>> @@ -279,9 +279,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>> if (CONN_V4(conn)) {
>> if (no_csum) {
>> struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
>> - struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
>> + const struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
>>
>> - check = &iph->check;
>> + check = iph->check;
>> }
>> iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
>> } else if (CONN_V6(conn)) {
>> diff --git a/tcp_internal.h b/tcp_internal.h
>> index d9408852571f..bb7a6629839c 100644
>> --- a/tcp_internal.h
>> +++ b/tcp_internal.h
>> @@ -187,8 +187,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
>> struct ethhdr *eh,
>> struct iphdr *ip4h, struct ipv6hdr *ip6h,
>> struct tcphdr *th, struct iov_tail *payload,
>> - const uint16_t *ip4_check, uint32_t seq,
>> - bool no_tcp_csum);
>> + int ip4_check, uint32_t seq, bool no_tcp_csum);
>>
>> int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>> bool force_seq, struct tcp_info_linux *tinfo);
>> diff --git a/tcp_vu.c b/tcp_vu.c
>> index 3001defb5467..a21ee3499aed 100644
>> --- a/tcp_vu.c
>> +++ b/tcp_vu.c
>> @@ -138,7 +138,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>> seq--;
>>
>> tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
>> - NULL, seq, !*c->pcap);
>> + -1, seq, !*c->pcap);
>>
>> if (*c->pcap)
>> pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
>> @@ -283,7 +283,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
>> */
>> static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
>> struct iovec *iov, size_t iov_cnt,
>> - const uint16_t **check, bool no_tcp_csum, bool push)
>> + int *check, bool no_tcp_csum, bool push)
>> {
>> const struct flowside *toside = TAPFLOW(conn);
>> bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
>> @@ -329,7 +329,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
>> tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
>> *check, conn->seq_to_tap, no_tcp_csum);
>> if (ip4h)
>> - *check = &ip4h->check;
>> + *check = ip4h->check;
>> }
>>
>> /**
>> @@ -350,7 +350,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>> size_t hdrlen, fillsize;
>> int v6 = CONN_V6(conn);
>> uint32_t already_sent;
>> - const uint16_t *check;
>> + int check;
>>
>> if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
>> debug("Got packet, but RX virtqueue not usable yet");
>> @@ -437,7 +437,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>> */
>>
>> hdrlen = tcp_vu_hdrlen(v6);
>> - for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
>> + for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
>> struct iovec *iov = &elem[head[i]].in_sg[0];
>> int buf_cnt = head[i + 1] - head[i];
>> size_t frame_size = iov_size(iov, buf_cnt);
>> @@ -451,7 +451,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>>
>> /* The IPv4 header checksum varies only with dlen */
>> if (previous_dlen != dlen)
>> - check = NULL;
>> + check = -1;
>> previous_dlen = dlen;
>>
>> tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
>> --
>> 2.53.0
>>
>
* Re: [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum
2026-03-24 7:56 ` Laurent Vivier
@ 2026-03-24 23:49 ` David Gibson
0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2026-03-24 23:49 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Tue, Mar 24, 2026 at 08:56:48AM +0100, Laurent Vivier wrote:
> On 3/24/26 04:53, David Gibson wrote:
> > On Mon, Mar 23, 2026 at 05:52:53PM +0100, Laurent Vivier wrote:
> > > tcp_fill_headers() takes a pointer to a previously computed IPv4 header
> > > checksum to avoid recalculating it when the payload length doesn't
> > > change. A subsequent patch makes tcp_fill_headers() access ip4h via
> > > with_header() which scopes it to a temporary variable, so a pointer to
> > > ip4h->check would become dangling after the with_header() block.
> >
> > Oof, that kind of indicates the dangers with the with_header()
> > structure. Is that change merged already? If not we should probably
> > fix it before merge rather than after the fact.
>
> The problem appears in the subsequent patches of the series,
> so at this point we prevent it rather than fix it.
Oh, sorry, I misread your comment, which already said that.
> > > Pass the checksum by value as an int instead, using -1 as the sentinel
> > > to indicate that the checksum should be computed from scratch (replacing
> > > the NULL pointer sentinel). As the checksum is a uint16_t, -1 cannot be
> > > a valid checksum value in an int.
> >
> > That said, passing this by value I think is cleaner than the pointer,
> > regardless of other reasons. Would it also make sense to flag
> > no_tcp_csum with an additional special value (a #define to -2, or
> > whatever) instead of using an extra parameter? Logically the checksum
> > parameter would be:
> > CALCULATE | UNNEEDED | specific value
>
> I can extract the patch from this series to make this improvement on the current code.
>
> Thanks,
> Laurent
>
> >
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > > ---
> > > tcp.c | 7 +++----
> > > tcp_buf.c | 10 +++++-----
> > > tcp_internal.h | 3 +--
> > > tcp_vu.c | 12 ++++++------
> > > 4 files changed, 15 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index b14586249c4e..158a5be0327e 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -953,8 +953,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> > > struct ethhdr *eh,
> > > struct iphdr *ip4h, struct ipv6hdr *ip6h,
> > > struct tcphdr *th, struct iov_tail *payload,
> > > - const uint16_t *ip4_check, uint32_t seq,
> > > - bool no_tcp_csum)
> > > + int ip4_check, uint32_t seq, bool no_tcp_csum)
> > > {
> > > const struct flowside *tapside = TAPFLOW(conn);
> > > size_t l4len = iov_tail_size(payload) + sizeof(*th);
> > > @@ -974,8 +973,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> > > ip4h->saddr = src4->s_addr;
> > > ip4h->daddr = dst4->s_addr;
> > > - if (ip4_check)
> > > - ip4h->check = *ip4_check;
> > > + if (ip4_check != -1)
> > > + ip4h->check = ip4_check;
> > > else
> > > ip4h->check = csum_ip4_header(l3len, IPPROTO_TCP,
> > > *src4, *dst4);
> > > diff --git a/tcp_buf.c b/tcp_buf.c
> > > index 41965b107567..bc0f58dd7a5e 100644
> > > --- a/tcp_buf.c
> > > +++ b/tcp_buf.c
> > > @@ -172,7 +172,7 @@ static void tcp_l2_buf_pad(struct iovec *iov)
> > > */
> > > static void tcp_l2_buf_fill_headers(const struct ctx *c,
> > > struct tcp_tap_conn *conn,
> > > - struct iovec *iov, const uint16_t *check,
> > > + struct iovec *iov, int check,
> > > uint32_t seq, bool no_tcp_csum)
> > > {
> > > struct iov_tail tail = IOV_TAIL(&iov[TCP_IOV_PAYLOAD], 1, 0);
> > > @@ -233,7 +233,7 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> > > if (flags & KEEPALIVE)
> > > seq--;
> > > - tcp_l2_buf_fill_headers(c, conn, iov, NULL, seq, false);
> > > + tcp_l2_buf_fill_headers(c, conn, iov, -1, seq, false);
> > > tcp_l2_buf_pad(iov);
> > > @@ -270,7 +270,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > ssize_t dlen, int no_csum, uint32_t seq, bool push)
> > > {
> > > struct tcp_payload_t *payload;
> > > - const uint16_t *check = NULL;
> > > + int check = -1;
> > > struct iovec *iov;
> > > conn->seq_to_tap = seq + dlen;
> > > @@ -279,9 +279,9 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > if (CONN_V4(conn)) {
> > > if (no_csum) {
> > > struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
> > > - struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> > > + const struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> > > - check = &iph->check;
> > > + check = iph->check;
> > > }
> > > iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
> > > } else if (CONN_V6(conn)) {
> > > diff --git a/tcp_internal.h b/tcp_internal.h
> > > index d9408852571f..bb7a6629839c 100644
> > > --- a/tcp_internal.h
> > > +++ b/tcp_internal.h
> > > @@ -187,8 +187,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> > > struct ethhdr *eh,
> > > struct iphdr *ip4h, struct ipv6hdr *ip6h,
> > > struct tcphdr *th, struct iov_tail *payload,
> > > - const uint16_t *ip4_check, uint32_t seq,
> > > - bool no_tcp_csum);
> > > + int ip4_check, uint32_t seq, bool no_tcp_csum);
> > > int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > > bool force_seq, struct tcp_info_linux *tinfo);
> > > diff --git a/tcp_vu.c b/tcp_vu.c
> > > index 3001defb5467..a21ee3499aed 100644
> > > --- a/tcp_vu.c
> > > +++ b/tcp_vu.c
> > > @@ -138,7 +138,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> > > seq--;
> > > tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> > > - NULL, seq, !*c->pcap);
> > > + -1, seq, !*c->pcap);
> > > if (*c->pcap)
> > > pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
> > > @@ -283,7 +283,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> > > */
> > > static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> > > struct iovec *iov, size_t iov_cnt,
> > > - const uint16_t **check, bool no_tcp_csum, bool push)
> > > + int *check, bool no_tcp_csum, bool push)
> > > {
> > > const struct flowside *toside = TAPFLOW(conn);
> > > bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> > > @@ -329,7 +329,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> > > tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> > > *check, conn->seq_to_tap, no_tcp_csum);
> > > if (ip4h)
> > > - *check = &ip4h->check;
> > > + *check = ip4h->check;
> > > }
> > > /**
> > > @@ -350,7 +350,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > size_t hdrlen, fillsize;
> > > int v6 = CONN_V6(conn);
> > > uint32_t already_sent;
> > > - const uint16_t *check;
> > > + int check;
> > > if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> > > debug("Got packet, but RX virtqueue not usable yet");
> > > @@ -437,7 +437,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > */
> > > hdrlen = tcp_vu_hdrlen(v6);
> > > - for (i = 0, previous_dlen = -1, check = NULL; i < head_cnt; i++) {
> > > + for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
> > > struct iovec *iov = &elem[head[i]].in_sg[0];
> > > int buf_cnt = head[i + 1] - head[i];
> > > size_t frame_size = iov_size(iov, buf_cnt);
> > > @@ -451,7 +451,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > /* The IPv4 header checksum varies only with dlen */
> > > if (previous_dlen != dlen)
> > > - check = NULL;
> > > + check = -1;
> > > previous_dlen = dlen;
> > > tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
> > > --
> > > 2.53.0
> > >
> >
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare()
2026-03-23 16:52 ` [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare() Laurent Vivier
@ 2026-03-25 4:46 ` David Gibson
0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2026-03-25 4:46 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Mon, Mar 23, 2026 at 05:52:55PM +0100, Laurent Vivier wrote:
> Replace direct pointer arithmetic using vu_eth(), vu_ip(),
> vu_payloadv4() and vu_payloadv6() with the iov_tail abstraction
> to build Ethernet, IP and TCP headers in tcp_vu_prepare().
>
> This removes the assumption that all headers reside in a single
> contiguous iov entry.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
LGTM, with the exception of my general concerns about with_header().
> ---
> tcp_vu.c | 61 ++++++++++++++++++++++++++------------------------------
> 1 file changed, 28 insertions(+), 33 deletions(-)
>
> diff --git a/tcp_vu.c b/tcp_vu.c
> index c6206b7a689c..a39f6ea018e9 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -222,6 +222,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> /* reserve space for headers in iov */
> iov = &elem[elem_cnt].in_sg[0];
> assert(iov->iov_len >= hdrlen);
> +
Unrelated change (usually I don't mind if it's a reasonable change
adjacent to what you're doing, but this is an entirely different
function).
> iov->iov_base = (char *)iov->iov_base + hdrlen;
> iov->iov_len -= hdrlen;
> head[(*head_cnt)++] = elem_cnt;
> @@ -285,51 +286,45 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn,
> int *check, bool no_tcp_csum, bool push)
> {
> const struct flowside *toside = TAPFLOW(conn);
> - bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> - size_t hdrlen = tcp_vu_hdrlen(v6);
> - char *base = iov[0].iov_base;
> - struct ipv6hdr *ip6h = NULL;
> - struct iphdr *ip4h = NULL;
> + bool ipv4 = inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr);
> struct iov_tail payload;
> - struct tcphdr *th;
> - struct ethhdr *eh;
> -
> - /* we guess the first iovec provided by the guest can embed
> - * all the headers needed by L2 frame, including any padding
> - */
> - assert(iov[0].iov_len >= hdrlen);
> -
> - eh = vu_eth(base);
>
> - memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> + payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
> + with_header(struct ethhdr, eh, &payload)
> + memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> + IOV_DROP_HEADER(&payload, struct ethhdr);
>
> /* initialize header */
>
> - if (!v6) {
> - eh->h_proto = htons(ETH_P_IP);
> -
> - ip4h = vu_ip(base);
> - *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> - th = vu_payloadv4(base);
> + if (ipv4) {
> + with_header(struct iphdr, ip4h, &payload)
> + *ip4h = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> + IOV_DROP_HEADER(&payload, struct iphdr);
> } else {
> - eh->h_proto = htons(ETH_P_IPV6);
> -
> - ip6h = vu_ip(base);
> - *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> -
> - th = vu_payloadv6(base);
> + with_header(struct ipv6hdr, ip6h, &payload)
> + *ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> + IOV_DROP_HEADER(&payload, struct ipv6hdr);
> }
>
> - memset(th, 0, sizeof(*th));
> - th->doff = sizeof(*th) / 4;
> - th->ack = 1;
> - th->psh = push;
> + with_header(struct tcphdr, th, &payload) {
> + memset(th, 0, sizeof(*th));
> + th->doff = sizeof(*th) / 4;
> + th->ack = 1;
> + th->psh = push;
> + }
>
> payload = IOV_TAIL(iov, iov_cnt, VNET_HLEN);
> - tcp_fill_headers(c, conn, !v6, &payload, *check, conn->seq_to_tap,
> + tcp_fill_headers(c, conn, ipv4, &payload, *check, conn->seq_to_tap,
> no_tcp_csum);
> - if (ip4h)
> + if (ipv4) {
> + struct iphdr ip4h_storage;
> + const struct iphdr *ip4h;
> +
> + IOV_DROP_HEADER(&payload, struct ethhdr);
> + ip4h = IOV_PEEK_HEADER(&payload, ip4h_storage);
> +
> *check = ip4h->check;
> + }
> }
>
> /**
> --
> 2.53.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()
2026-03-23 16:52 ` [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv() Laurent Vivier
@ 2026-03-25 5:06 ` David Gibson
0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2026-03-25 5:06 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Mon, Mar 23, 2026 at 05:52:56PM +0100, Laurent Vivier wrote:
> Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue
> elements and iovecs (one iovec per element), enforced by an ASSERT.
> This prevented the use of virtqueue elements with multiple buffers
> (e.g. when mergeable rx buffers are not negotiated and headers are
> provided in a separate buffer).
>
> Introduce a struct vu_frame to track per-frame metadata: the range of
> elements and iovecs that make up each frame, and the frame's total size.
> This replaces the head[] array which only tracked element indices.
>
> A separate iov_msg[] array is built for recvmsg() by cloning the data
> portions (after stripping headers) using iov_tail helpers.
>
> Then a frame truncation after recvmsg() properly walks the frame and
> element arrays to adjust iovec counts and element counts.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
> tcp_vu.c | 161 ++++++++++++++++++++++++++++++++++++-------------------
> 1 file changed, 107 insertions(+), 54 deletions(-)
>
> diff --git a/tcp_vu.c b/tcp_vu.c
> index a39f6ea018e9..96cd9da1caae 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -35,9 +35,24 @@
> #include "vu_common.h"
> #include <time.h>
>
> -static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
> +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
> static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> -static int head[VIRTQUEUE_MAX_SIZE + 1];
> +
> +/**
> + * struct vu_frame - Descriptor for a TCP frame mapped to virtqueue elements
> + * @idx_element: Index of first element in elem[] for this frame
> + * @num_element: Number of virtqueue elements used by this frame
> + * @idx_iovec: Index of first iovec in iov_vu[] for this frame
> + * @num_iovec: Number of iovecs covering this frame's buffers
> + * @size: Total frame size including all headers
"all headers" is a bit context dependent. I assume it includes the
Ethernet header, but I'm not sure about virtio_net_hdr_mrg_rxbuf. If
it doesn't, this should be called l2len. If it does... maybe we need
to invent a term. Usually I wouldn't consider the "frame" to include
things like mrg_rxbuf (I'd consider that a "hw" descriptor followed
by the frame).
> + */
> +static struct vu_frame {
> + int idx_element;
> + int num_element;
> + int idx_iovec;
> + int num_iovec;
> + size_t size;
> +} frame[VIRTQUEUE_MAX_SIZE];
>
> /**
> * tcp_vu_hdrlen() - Sum size of all headers, from TCP to virtio-net
> @@ -173,8 +188,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> * @v6: Set for IPv6 connections
> * @already_sent: Number of bytes already sent
> * @fillsize: Maximum bytes to fill in guest-side receiving window
> - * @iov_cnt: number of iov (output)
> - * @head_cnt: Pointer to store the count of head iov entries (output)
> + * @elem_used: number of element (output)
> + * @frame_cnt: Pointer to store the number of frames (output)
> *
> * Return: number of bytes received from the socket, or a negative error code
> * on failure.
> @@ -182,58 +197,77 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> const struct tcp_tap_conn *conn, bool v6,
> uint32_t already_sent, size_t fillsize,
> - int *iov_cnt, int *head_cnt)
> + int *elem_used, int *frame_cnt)
> {
> + static struct iovec iov_msg[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
What's the rationale for making this static?
> const struct vu_dev *vdev = c->vdev;
> struct msghdr mh_sock = { 0 };
> uint16_t mss = MSS_GET(conn);
> size_t hdrlen, iov_used;
> int s = conn->sock;
> + ssize_t ret, dlen;
> int elem_cnt;
> - ssize_t ret;
> - int i;
> -
> - *iov_cnt = 0;
> + int i, j;
>
> hdrlen = tcp_vu_hdrlen(v6);
>
> + *elem_used = 0;
> +
> iov_used = 0;
> elem_cnt = 0;
> - *head_cnt = 0;
> + *frame_cnt = 0;
> while (fillsize > 0 && elem_cnt < ARRAY_SIZE(elem) &&
> - iov_used < VIRTQUEUE_MAX_SIZE) {
> - size_t frame_size, dlen, in_total;
> - struct iovec *iov;
> + iov_used < ARRAY_SIZE(iov_vu) &&
> + *frame_cnt < ARRAY_SIZE(frame)) {
> + size_t frame_size, in_total;
> int cnt;
>
> cnt = vu_collect(vdev, vq, &elem[elem_cnt],
> ARRAY_SIZE(elem) - elem_cnt,
> - &iov_vu[DISCARD_IOV_NUM + iov_used],
> - VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
> + &iov_vu[iov_used],
> + ARRAY_SIZE(iov_vu) - iov_used, &in_total,
> MIN(mss, fillsize) + hdrlen,
> &frame_size);
> if (cnt == 0)
> break;
> - assert((size_t)cnt == in_total); /* one iovec per element */
> +
> + frame[*frame_cnt].idx_element = elem_cnt;
> + frame[*frame_cnt].num_element = cnt;
> + frame[*frame_cnt].idx_iovec = iov_used;
> + frame[*frame_cnt].num_iovec = in_total;
> + frame[*frame_cnt].size = frame_size;
> + (*frame_cnt)++;
>
> iov_used += in_total;
> - dlen = frame_size - hdrlen;
> + elem_cnt += cnt;
>
> - /* reserve space for headers in iov */
> - iov = &elem[elem_cnt].in_sg[0];
> - assert(iov->iov_len >= hdrlen);
> + fillsize -= frame_size - hdrlen;
> + }
>
> - iov->iov_base = (char *)iov->iov_base + hdrlen;
> - iov->iov_len -= hdrlen;
> - head[(*head_cnt)++] = elem_cnt;
> + /* build an iov array without headers */
> + for (i = 0, j = DISCARD_IOV_NUM; i < *frame_cnt &&
> + j < ARRAY_SIZE(iov_msg); i++) {
> + struct iov_tail data;
> + ssize_t cnt;
>
> - fillsize -= dlen;
> - elem_cnt += cnt;
> + data = IOV_TAIL(&iov_vu[frame[i].idx_iovec],
> + frame[i].num_iovec, 0);
> + iov_drop_header(&data, hdrlen);
> +
> + cnt = iov_tail_clone(&iov_msg[j], ARRAY_SIZE(iov_msg) - j,
> + &data);
> + if (cnt == -1)
> + die("Missing entries in iov_msg");
IIUC that would indicate a bug in the sizing / setup of the arrays, in
which case it should be an assert().
> +
> + j += cnt;
> }
>
> - if (tcp_prepare_iov(&mh_sock, iov_vu, already_sent, elem_cnt))
> + if (tcp_prepare_iov(&mh_sock, iov_msg, already_sent,
> + j - DISCARD_IOV_NUM)) {
> /* Expect caller to do a TCP reset */
> + vu_queue_rewind(vq, elem_cnt);
> return -1;
> + }
>
> do
> ret = recvmsg(s, &mh_sock, MSG_PEEK);
> @@ -247,28 +281,47 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> if (!peek_offset_cap)
> ret -= already_sent;
>
> - i = vu_pad(&iov_vu[DISCARD_IOV_NUM], iov_used, hdrlen, ret);
> + dlen = ret;
>
> - /* adjust head count */
> - while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
> - (*head_cnt)--;
> + /* truncate frame */
> + *elem_used = 0;
> + for (i = 0; i < *frame_cnt; i++) {
> + struct vu_frame *f = &frame[i];
>
> - /* mark end of array */
> - head[*head_cnt] = i;
> - *iov_cnt = i;
> + if ((size_t)ret <= f->size - hdrlen) {
> + unsigned cnt;
>
> - /* release unused buffers */
> - vu_queue_rewind(vq, elem_cnt - i);
> + cnt = vu_pad(&iov_vu[f->idx_iovec], f->num_iovec,
> + 0, hdrlen + ret);
>
> - /* restore space for headers in iov */
> - for (i = 0; i < *head_cnt; i++) {
> - struct iovec *iov = &elem[head[i]].in_sg[0];
> + f->size = ret + hdrlen;
> + f->num_iovec = cnt;
>
> - iov->iov_base = (char *)iov->iov_base - hdrlen;
> - iov->iov_len += hdrlen;
> + for (j = 0; j < f->num_element; j++) {
> + struct vu_virtq_element *e;
> +
> + e = &elem[f->idx_element + j];
> + if (cnt <= e->in_num) {
> + e->in_num = cnt;
> + j++;
> + break;
> + }
> + cnt -= e->in_num;
> + }
> + f->num_element = j;
> + *elem_used += j;
> + i++;
> + break;
> + }
> + *elem_used += f->num_element;
> + ret -= f->size - hdrlen;
> }
> + *frame_cnt = i;
>
> - return ret;
> + /* release unused buffers */
> + vu_queue_rewind(vq, elem_cnt - *elem_used);
> +
> + return dlen;
> }
>
> /**
> @@ -341,7 +394,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> struct vu_dev *vdev = c->vdev;
> struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> ssize_t len, previous_dlen;
> - int i, iov_cnt, head_cnt;
> + int i, elem_cnt, frame_cnt;
> size_t hdrlen, fillsize;
> int v6 = CONN_V6(conn);
> uint32_t already_sent;
> @@ -381,7 +434,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> * data from the socket
> */
> len = tcp_vu_sock_recv(c, vq, conn, v6, already_sent, fillsize,
> - &iov_cnt, &head_cnt);
> + &elem_cnt, &frame_cnt);
> if (len < 0) {
> if (len != -EAGAIN && len != -EWOULDBLOCK) {
> tcp_rst(c, conn);
> @@ -395,6 +448,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (!len) {
> + vu_queue_rewind(vq, elem_cnt);
> if (already_sent) {
> conn_flag(c, conn, STALLED);
> } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> @@ -432,33 +486,32 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> */
>
> hdrlen = tcp_vu_hdrlen(v6);
> - for (i = 0, previous_dlen = -1, check = -1; i < head_cnt; i++) {
> - struct iovec *iov = &elem[head[i]].in_sg[0];
> - int buf_cnt = head[i + 1] - head[i];
> - size_t frame_size = iov_size(iov, buf_cnt);
> - bool push = i == head_cnt - 1;
> + for (i = 0, previous_dlen = -1, check = -1; i < frame_cnt; i++) {
> + struct iovec *iov = &iov_vu[frame[i].idx_iovec];
> + int iov_cnt = frame[i].num_iovec;
> + bool push = i == frame_cnt - 1;
> ssize_t dlen;
>
> - assert(frame_size >= hdrlen);
> + assert(frame[i].size >= hdrlen);
>
> - dlen = frame_size - hdrlen;
> - vu_set_vnethdr(iov->iov_base, buf_cnt);
> + dlen = frame[i].size - hdrlen;
> + vu_set_vnethdr(iov->iov_base, frame[i].num_element);
>
> /* The IPv4 header checksum varies only with dlen */
> if (previous_dlen != dlen)
> check = -1;
> previous_dlen = dlen;
>
> - tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
> + tcp_vu_prepare(c, conn, iov, iov_cnt, &check, !*c->pcap, push);
>
> if (*c->pcap)
> - pcap_iov(iov, buf_cnt, VNET_HLEN);
> + pcap_iov(iov, iov_cnt, VNET_HLEN);
>
> conn->seq_to_tap += dlen;
> }
>
> /* send packets */
> - vu_flush(vdev, vq, elem, iov_cnt);
> + vu_flush(vdev, vq, elem, elem_cnt);
>
> conn_flag(c, conn, ACK_FROM_TAP_DUE);
>
> --
> 2.53.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
` (6 preceding siblings ...)
2026-03-23 16:52 ` [PATCH 7/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag() Laurent Vivier
@ 2026-03-25 5:07 ` David Gibson
7 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2026-03-25 5:07 UTC (permalink / raw)
To: Laurent Vivier; +Cc: passt-dev
On Mon, Mar 23, 2026 at 05:52:52PM +0100, Laurent Vivier wrote:
> This is the TCP counterpart to the UDP multi-iov series. It converts
> the TCP vhost-user receive path from direct pointer arithmetic (via
> vu_eth(), vu_ip(), etc.) to the iov_tail abstraction, removing the
> assumption that all headers reside in a single contiguous buffer.
>
> With this series applied, the TCP path correctly handles virtio-net
> drivers that provide multiple buffers per virtqueue element (e.g. iPXE
> provides the vnet header in the first buffer and the frame payload in a
> second one), matching the support already present in the UDP path.
>
> Based-on: 20260323143151.538673-1-lvivier@redhat.com
I didn't finish reviewing this series. However, I'm going to stop
here and wait for the spin based on only truncating the iovs when
they're flushed.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
end of thread, other threads:[~2026-03-25 5:07 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-23 16:52 [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element Laurent Vivier
2026-03-23 16:52 ` [PATCH 1/7] tcp: pass ipv4h checksum, not a pointer to the checksum Laurent Vivier
2026-03-24 3:53 ` David Gibson
2026-03-24 7:56 ` Laurent Vivier
2026-03-24 23:49 ` David Gibson
2026-03-23 16:52 ` [PATCH 2/7] tcp: use iov_tail to access headers in tcp_fill_headers() Laurent Vivier
2026-03-24 3:58 ` David Gibson
2026-03-23 16:52 ` [PATCH 3/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_prepare() Laurent Vivier
2026-03-25 4:46 ` David Gibson
2026-03-23 16:52 ` [PATCH 4/7] tcp_vu: Support multibuffer frames in tcp_vu_sock_recv() Laurent Vivier
2026-03-25 5:06 ` David Gibson
2026-03-23 16:52 ` [PATCH 5/7] tcp: Use iov_tail to access headers in tcp_prepare_flags() Laurent Vivier
2026-03-23 16:52 ` [PATCH 6/7] iov: introduce iov_memcopy() Laurent Vivier
2026-03-23 16:52 ` [PATCH 7/7] tcp_vu: Use iov_tail helpers to build headers in tcp_vu_send_flag() Laurent Vivier
2026-03-25 5:07 ` [PATCH 0/7] vhost-user,tcp: Handle multiple iovec entries per virtqueue element David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt