From: David Gibson <david@gibson.dropbear.id.au>
To: Laurent Vivier <lvivier@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v4 1/5] vhost-user: Centralise Ethernet frame padding in vu_collect(), vu_pad() and vu_flush()
Date: Tue, 24 Mar 2026 12:56:54 +1100
Message-ID: <acHvZlr7v9NUcq34@zatzit>
In-Reply-To: <20260323143151.538673-2-lvivier@redhat.com>
On Mon, Mar 23, 2026 at 03:31:47PM +0100, Laurent Vivier wrote:
> The per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was
> only correct for single-buffer frames, and assumed the padding area always
> fell within the first iov. It also relied on each caller computing the
> right MAX(..., ETH_ZLEN + VNET_HLEN) size for vu_collect() and calling
> vu_pad() at the right point.
>
> Centralise padding logic into three shared vhost-user helpers instead:
>
> - vu_collect() now ensures at least ETH_ZLEN + VNET_HLEN bytes of buffer
> space are collected, so there is always room for a minimum-sized frame.
>
> - vu_pad() replaces the old single-iov helper with a new implementation
> that takes a full iovec array plus a 'skipped' byte count. It uses a
> new iov_memset() helper in iov.c to zero-fill the padding area across
> iovec boundaries, then calls iov_truncate() to set the logical frame
> size.
>
> - vu_flush() computes the actual frame length (accounting for
> VIRTIO_NET_F_MRG_RXBUF multi-buffer frames) and passes the padded
> length to vu_queue_fill().
>
> Callers in tcp_vu.c, udp_vu.c and vu_send_single() now use the new
> vu_pad() in place of the old pad-then-truncate sequences and the
> MAX(..., ETH_ZLEN + VNET_HLEN) size calculations passed to vu_collect().
>
> Centralising padding here will also ease the move to multi-iovec per
> element support, since there will be a single place to update.
>
> In vu_send_single(), fix padding, truncation and data copy to use the
> requested frame size rather than the total available buffer space from
> vu_collect(), which could be larger. Also add matching padding, truncation
> and explicit size to vu_collect() for the DUP_ACK path in
> tcp_vu_send_flag().
Something I'm struggling to reason about throughout this series is
the distinction between three things. I'm going to call them 1)
"available buffer" - the space we've gotten from the guest into which
we can potentially put data, 2) "used buffer" - the space we actually
filled with data or padding and 3) "packet" - the space filled with
the actual frame data, not counting padding.
Actually, maybe there are four: 1) the set of buffers we get from
vu_collect(), and 0) the buffers vu_collect() gathered before it
truncated them to the requested size.
We juggle iov variables that at various points could be one of those
three, and I'm often not clear on which one it's representing at a
given moment. An iov_truncate() can change from (1) to (2) or (2) to
(3) safely, but moving in the other direction is never _obviously_
safe - it relies on knowing where this specific iov came from and how
it was allocated.
Stefano and I were discussing last night whether an explicit and
consistent terminology for those three things through the code might
help.
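To make the one-way nature concrete, here is a toy model of the truncate
step; the name and details are mine, not the actual passt code. Moving
(1) -> (2) -> (3) is just calling it with successively smaller sizes, but
the discarded iov_len values aren't recorded anywhere, so reversing the
direction needs knowledge from outside the iovec itself:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Toy model of iov_truncate(): clamp the iovec array to @size bytes
 * and return how many entries remain in use.  The original per-entry
 * lengths are overwritten, which is why going back from (3) to (2) or
 * (2) to (1) is never obviously safe. */
static size_t iov_truncate_sketch(struct iovec *iov, size_t iov_cnt,
				  size_t size)
{
	size_t i, len = 0;

	for (i = 0; i < iov_cnt && len < size; i++) {
		if (len + iov[i].iov_len > size)
			iov[i].iov_len = size - len; /* shrink last entry */
		len += iov[i].iov_len;
	}
	return i; /* entries beyond i no longer belong to the frame */
}
```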
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
> iov.c | 26 ++++++++++++++++++
> iov.h | 2 ++
> tcp_vu.c | 22 +++++-----------
> udp_vu.c | 9 ++-----
> vu_common.c | 76 ++++++++++++++++++++++++++++++++++++-----------------
> vu_common.h | 2 +-
> 6 files changed, 90 insertions(+), 47 deletions(-)
>
> diff --git a/iov.c b/iov.c
> index ae0743931d18..8134b8c9f988 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -170,6 +170,32 @@ size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size)
> return i;
> }
>
> +/**
> + * iov_memset() - Set bytes of an IO vector to a given value
> + * @iov: IO vector
> + * @iov_cnt: Number of elements in @iov
> + * @offset: Byte offset in the iovec at which to start
> + * @c: Byte value to fill with
> + * @length: Number of bytes to set
> + * Will write less than @length bytes if it runs out of space in
> + * the iov
> + */
> +void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
> + size_t length)
> +{
> + size_t i;
> +
> + i = iov_skip_bytes(iov, iov_cnt, offset, &offset);
> +
> + for ( ; i < iov_cnt && length; i++) {
> + size_t n = MIN(iov[i].iov_len - offset, length);
> +
> + memset((char *)iov[i].iov_base + offset, c, n);
> + offset = 0;
> + length -= n;
> + }
> +}
> +
> /**
> * iov_tail_prune() - Remove any unneeded buffers from an IOV tail
> * @tail: IO vector tail (modified)
> diff --git a/iov.h b/iov.h
> index b4e50b0fca5a..d295d05b3bab 100644
> --- a/iov.h
> +++ b/iov.h
> @@ -30,6 +30,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt,
> size_t offset, void *buf, size_t bytes);
> size_t iov_size(const struct iovec *iov, size_t iov_cnt);
> size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size);
> +void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
> + size_t length);
>
> /*
> * DOC: Theory of Operation, struct iov_tail
> diff --git a/tcp_vu.c b/tcp_vu.c
> index dc0e17c0f03f..3001defb5467 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -72,12 +72,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> struct vu_dev *vdev = c->vdev;
> struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> struct vu_virtq_element flags_elem[2];
> - size_t optlen, hdrlen, l2len;
> struct ipv6hdr *ip6h = NULL;
> struct iphdr *ip4h = NULL;
> struct iovec flags_iov[2];
> struct tcp_syn_opts *opts;
> struct iov_tail payload;
> + size_t optlen, hdrlen;
> struct tcphdr *th;
> struct ethhdr *eh;
> uint32_t seq;
> @@ -89,7 +89,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>
> elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
> &flags_iov[0], 1, NULL,
> - MAX(hdrlen + sizeof(*opts), ETH_ZLEN + VNET_HLEN), NULL);
> + hdrlen + sizeof(*opts), NULL);
So vu_collect() now takes an upper bound on (3) as the parameter,
instead of an upper bound on (2). Seems reasonable, but it does have
the side effect that this value is no longer an upper bound on (1).
> if (elem_cnt != 1)
> return -1;
>
> @@ -131,7 +131,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> return ret;
> }
>
> - iov_truncate(&flags_iov[0], 1, hdrlen + optlen);
> + vu_pad(&flags_iov[0], 1, 0, hdrlen + optlen);
> payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
>
> if (flags & KEEPALIVE)
> @@ -140,9 +140,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
> NULL, seq, !*c->pcap);
>
> - l2len = optlen + hdrlen - VNET_HLEN;
> - vu_pad(&flags_elem[0].in_sg[0], l2len);
> -
> if (*c->pcap)
> pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
> nb_ack = 1;
> @@ -150,10 +147,11 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> if (flags & DUP_ACK) {
> elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
> &flags_iov[1], 1, NULL,
> - flags_elem[0].in_sg[0].iov_len, NULL);
> + hdrlen + optlen, NULL);
> if (elem_cnt == 1 &&
> flags_elem[1].in_sg[0].iov_len >=
> flags_elem[0].in_sg[0].iov_len) {
> + vu_pad(&flags_iov[1], 1, 0, hdrlen + optlen);
> memcpy(flags_elem[1].in_sg[0].iov_base,
> flags_elem[0].in_sg[0].iov_base,
> flags_elem[0].in_sg[0].iov_len);
> @@ -213,7 +211,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> ARRAY_SIZE(elem) - elem_cnt,
> &iov_vu[DISCARD_IOV_NUM + iov_used],
> VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
> - MAX(MIN(mss, fillsize) + hdrlen, ETH_ZLEN + VNET_HLEN),
> + MIN(mss, fillsize) + hdrlen,
> &frame_size);
> if (cnt == 0)
> break;
> @@ -249,8 +247,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
> if (!peek_offset_cap)
> ret -= already_sent;
>
> - /* adjust iov number and length of the last iov */
> - i = iov_truncate(&iov_vu[DISCARD_IOV_NUM], iov_used, ret);
> + i = vu_pad(&iov_vu[DISCARD_IOV_NUM], iov_used, hdrlen, ret);
>
> /* adjust head count */
> while (*head_cnt > 0 && head[*head_cnt - 1] >= i)
> @@ -446,7 +443,6 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> size_t frame_size = iov_size(iov, buf_cnt);
> bool push = i == head_cnt - 1;
> ssize_t dlen;
> - size_t l2len;
>
> assert(frame_size >= hdrlen);
>
> @@ -460,10 +456,6 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>
> tcp_vu_prepare(c, conn, iov, buf_cnt, &check, !*c->pcap, push);
>
> - /* Pad first/single buffer only, it's at least ETH_ZLEN long */
> - l2len = dlen + hdrlen - VNET_HLEN;
> - vu_pad(iov, l2len);
> -
> if (*c->pcap)
> pcap_iov(iov, buf_cnt, VNET_HLEN);
>
> diff --git a/udp_vu.c b/udp_vu.c
> index cc69654398f0..6a1e1696b19f 100644
> --- a/udp_vu.c
> +++ b/udp_vu.c
> @@ -73,8 +73,7 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
> const struct vu_dev *vdev = c->vdev;
> int elem_cnt, elem_used, iov_used;
> struct msghdr msg = { 0 };
> - size_t hdrlen, l2len;
> - size_t iov_cnt;
> + size_t iov_cnt, hdrlen;
>
> assert(!c->no_udp);
>
> @@ -117,13 +116,9 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
> iov_vu[0].iov_base = (char *)iov_vu[0].iov_base - hdrlen;
> iov_vu[0].iov_len += hdrlen;
>
> - iov_used = iov_truncate(iov_vu, iov_cnt, *dlen + hdrlen);
> + iov_used = vu_pad(iov_vu, iov_cnt, 0, *dlen + hdrlen);
> elem_used = iov_used; /* one iovec per element */
>
> - /* pad frame to 60 bytes: first buffer is at least ETH_ZLEN long */
> - l2len = *dlen + hdrlen - VNET_HLEN;
> - vu_pad(&iov_vu[0], l2len);
> -
> vu_set_vnethdr(iov_vu[0].iov_base, elem_used);
>
> /* release unused buffers */
> diff --git a/vu_common.c b/vu_common.c
> index 92381cd33c85..808eb2b69281 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -74,6 +74,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
> size_t current_iov = 0;
> int elem_cnt = 0;
>
> + size = MAX(size, ETH_ZLEN + VNET_HLEN); /* Ethernet minimum size */
> while (current_size < size && elem_cnt < max_elem &&
> current_iov < max_in_sg) {
> int ret;
> @@ -113,6 +114,25 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
> return elem_cnt;
> }
>
> +/**
> + * vu_pad() - Pad short frames to minimum Ethernet length and truncate iovec
> + * @iov: Pointer to iovec array
> + * @cnt: Number of entries in @iov
> + * @skipped: Bytes already accounted for in the frame but not in @iov
> + * @size: Data length in @iov
> + *
> + * Return: number of iovec entries after truncation
> + */
> +size_t vu_pad(struct iovec *iov, size_t cnt, size_t skipped, size_t size)
> +{
> + if (skipped + size < ETH_ZLEN + VNET_HLEN) {
> + iov_memset(iov, cnt, size, 0,
> + ETH_ZLEN + VNET_HLEN - (skipped + size));
> + }
> +
> + return iov_truncate(iov, cnt, size);
> +}
> +
> /**
> * vu_set_vnethdr() - set virtio-net headers
> * @vnethdr: Address of the header to set
> @@ -137,12 +157,32 @@ void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr, int num_buffers)
> void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> struct vu_virtq_element *elem, int elem_cnt)
> {
> - int i;
> -
> - for (i = 0; i < elem_cnt; i++) {
> - size_t elem_size = iov_size(elem[i].in_sg, elem[i].in_num);
> -
> - vu_queue_fill(vdev, vq, &elem[i], elem_size, i);
> + int i, j, num_elements;
> +
> + for (i = 0; i < elem_cnt; i += num_elements) {
> + const struct virtio_net_hdr_mrg_rxbuf *vnethdr;
> + size_t len, padding, elem_size;
> +
> + vnethdr = elem[i].in_sg[0].iov_base;
> + num_elements = le16toh(vnethdr->num_buffers);
> +
> + len = 0;
> + for (j = 0; j < num_elements - 1; j++) {
> + elem_size = iov_size(elem[i + j].in_sg,
> + elem[i + j].in_num);
> + vu_queue_fill(vdev, vq, &elem[i + j],
> + elem_size, i + j);
> + len += elem_size;
> + }
> + /* pad the last element to have an Ethernet minimum size */
> + elem_size = iov_size(elem[i + j].in_sg, elem[i + j].in_num);
> + if (ETH_ZLEN + VNET_HLEN > len + elem_size)
> + padding = ETH_ZLEN + VNET_HLEN - (len + elem_size);
> + else
> + padding = 0;
> +
> + vu_queue_fill(vdev, vq, &elem[i + j], elem_size + padding,
So here we go from (3) to (2), which, as discussed, requires quite
broad knowledge of the code to know is safe.
Hrm.. I might have missed something, but would this approach work?
The idea is to always delay iov truncation to the last possible
moment, so we never have to try reversing it.
* vu_collect() an upper bound on data length, as in this patch
* vu_collect() pads that to at least ETH_ZLEN, again as in this patch
* vu_collect() doesn't do an iov_truncate(), it's explicitly allowed
to return more buffer space than you asked for (effectively, it
returns (0))
* Protocols never iov_truncate(), but keep track of the data length
in a separate variable
* vu_flush() takes the data length in addition to the other
parameters. It does the pad and truncation immediately before
putting the iov in the queue
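To check I understand my own suggestion, here's a rough sketch of what
the flush-time pad-plus-truncate could look like, in one place. All the
names, the constants standing in for ETH_ZLEN/VNET_HLEN, and the exact
split of work are made up for illustration; this isn't a patch proposal:

```c
#include <string.h>
#include <sys/uio.h>

#define ETH_ZLEN  60	/* minimum Ethernet frame length */
#define VNET_HLEN 12	/* virtio-net header length (illustrative) */

/* Sketch of the last bullet: given the data length tracked by the
 * protocol code, pad out to the Ethernet minimum and truncate the
 * iovec, immediately before queueing.  Returns the number of iovec
 * entries making up the padded frame. */
static size_t flush_pad_and_truncate(struct iovec *iov, size_t cnt,
				     size_t datalen)
{
	size_t target = datalen;
	size_t base = 0, used = 0, i;

	if (target < ETH_ZLEN + VNET_HLEN)
		target = ETH_ZLEN + VNET_HLEN;

	for (i = 0; i < cnt && base < target; i++) {
		size_t take = iov[i].iov_len;

		if (base + take > target)
			take = target - base;	/* shrink final entry */

		/* zero any padding bytes falling inside this entry */
		if (base + take > datalen) {
			size_t start = datalen > base ? datalen - base : 0;

			memset((char *)iov[i].iov_base + start, 0,
			       take - start);
		}

		iov[i].iov_len = take;
		base += take;
		used++;
	}
	return used;
}
```

Since nothing before this point ever truncated, the iov the protocols
hand over is always (0)/(1), and we only ever move forwards from it.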
> + i + j);
> }
>
> vu_queue_flush(vdev, vq, elem_cnt);
> @@ -260,38 +300,26 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
> goto err;
> }
>
> + in_total = vu_pad(in_sg, in_total, 0, size);
> +
> vu_set_vnethdr(in_sg[0].iov_base, elem_cnt);
>
> - total -= VNET_HLEN;
> + size -= VNET_HLEN;
>
> /* copy data from the buffer to the iovec */
> - iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);
> + iov_from_buf(in_sg, in_total, VNET_HLEN, buf, size);
>
> if (*c->pcap)
> pcap_iov(in_sg, in_total, VNET_HLEN);
>
> vu_flush(vdev, vq, elem, elem_cnt);
>
> - trace("vhost-user sent %zu", total);
> + trace("vhost-user sent %zu", size);
>
> - return total;
> + return size;
> err:
> for (i = 0; i < elem_cnt; i++)
> vu_queue_detach_element(vq);
>
> return -1;
> }
> -
> -/**
> - * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
> - * @iov: Buffer in iovec array where end of 802.3 frame is stored
> - * @l2len: Layer-2 length already filled in frame
> - */
> -void vu_pad(struct iovec *iov, size_t l2len)
> -{
> - if (l2len >= ETH_ZLEN)
> - return;
> -
> - memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len);
> - iov->iov_len += ETH_ZLEN - l2len;
> -}
> diff --git a/vu_common.h b/vu_common.h
> index 7b060eb6184f..f09dd25d3d1f 100644
> --- a/vu_common.h
> +++ b/vu_common.h
> @@ -39,12 +39,12 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
> struct vu_virtq_element *elem, int max_elem,
> struct iovec *in_sg, size_t max_in_sg, size_t *in_total,
> size_t size, size_t *collected);
> +size_t vu_pad(struct iovec *iov, size_t cnt, size_t skipped, size_t size);
> void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr, int num_buffers);
> void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> struct vu_virtq_element *elem, int elem_cnt);
> void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> const struct timespec *now);
> int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> -void vu_pad(struct iovec *iov, size_t l2len);
>
> #endif /* VU_COMMON_H */
> --
> 2.53.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson