public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Laurent Vivier <lvivier@redhat.com>, passt-dev@passt.top
Subject: Re: [RFC v2] tcp: Replace TCP buffer structure by an iovec array
Date: Mon, 18 Mar 2024 09:53:29 +0100	[thread overview]
Message-ID: <20240318095329.34152c6f@elisabeth> (raw)
In-Reply-To: <ZffVdl5k0OSW7e4-@zatzit>

[Not a full review, just some follow-up comments]

On Mon, 18 Mar 2024 16:47:34 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Mar 15, 2024 at 06:51:19PM +0100, Laurent Vivier wrote:
> > To be able to provide pointers to TCP headers and IP headers without
> > worrying about alignment in the structure, split the structure into
> > several arrays and point to each part of the frame using an iovec array.
> > 
> > Using iovec also allows us to simply ignore the first entry when the
> > vnet length header is not needed.
> > 
> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> > 
> > Notes:
> >     v2:
> >       - rebased on top of David's series
> >         "Some improvements to the tap send path"  
> 
> Thanks, and sorry for the disruption.
> 
> >       - no performance improvement anymore  
> 
> Well, that's unfortunate.  I think we need to drill down and find out
> what actually made the difference here, since it's a pretty big change
> from something rather subtle.

...right, we spent quite some effort on batching everything that could
be batched, but now if one sendmsg() per frame gives higher throughput
than one call for multiple frames, I think we should understand a bit
better to which extent, and in which cases, and vaguely why.

I still have to run iperf3 with perf(1) and the v1 variant, but
meanwhile the experiments I suggested should also make sense.

> >       - remove the iovec functions of v1 introduced in tap.c to use new
> >         functions from David
> >       - don't use an array of array of iovec  
> 
> Huh.  Is there a specific reason for that?  I actually changed *to*
> using a 2d array in my in-progress patches which do something similar
> in the UDP code, based on your example.
> 
> >       - fix comments
> >       - re-introduce MSS4 and MSS6
> >       - address comments from David and Stefano
> >     
> >     I have tested only passt, not pasta  
> 
> Is there a reason for that?  If you're having trouble running the
> testsuite, we really need to fix that...
> 
> > 
> >  tap.c |   3 +-
> >  tcp.c | 479 ++++++++++++++++++++++++++++------------------------------
> >  2 files changed, 233 insertions(+), 249 deletions(-)
> > 
> > diff --git a/tap.c b/tap.c
> > index 13e4da79d690..605b472b5ca2 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -424,8 +424,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> >  		debug("tap: failed to send %zu frames of %zu",
> >  		      nframes - m, nframes);
> >  
> > -	pcap_multiple(iov, bufs_per_frame, m,
> > -		      c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> > +	pcap_multiple(iov, bufs_per_frame, m, 0);  
> 
> If you're no longer changing the semantics of tap_send_frames() from
> those in my patch, I don't think you want this change any more either.
> 
> >  	return m;
> >  }
> > diff --git a/tcp.c b/tcp.c
> > index a1860d10b15f..d5c09dbf53ab 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -318,39 +318,8 @@
> >  
> >  /* MSS rounding: see SET_MSS() */
> >  #define MSS_DEFAULT			536
> > -
> > -struct tcp4_l2_head {	/* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
> > -#ifdef __AVX2__
> > -	uint8_t pad[26];
> > -#else
> > -	uint8_t pad[2];
> > -#endif
> > -	struct tap_hdr taph;
> > -	struct iphdr iph;
> > -	struct tcphdr th;
> > -#ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)));
> > -#else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> > -#endif
> > -
> > -struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
> > -#ifdef __AVX2__
> > -	uint8_t pad[14];
> > -#else
> > -	uint8_t pad[2];
> > -#endif
> > -	struct tap_hdr taph;
> > -	struct ipv6hdr ip6h;
> > -	struct tcphdr th;
> > -#ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)));
> > -#else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> > -#endif
> > -
> > -#define MSS4	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp4_l2_head), 4)
> > -#define MSS6	ROUND_DOWN(USHRT_MAX - sizeof(struct tcp6_l2_head), 4)
> > +#define MSS4				ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN - sizeof(struct tcphdr) - sizeof(struct iphdr), sizeof(uint32_t))
> > +#define MSS6				ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN - sizeof(struct tcphdr) - sizeof(struct ipv6hdr), sizeof(uint32_t))  
> 
> I'd be inclined to base these static constants on only the IP level
> restriction (16-bit L3 packet length), i.e. replace (ETH_MAX_MTU -
> ETH_HLEN) with USHRT_MAX (or a new L3_MAX_LENGTH constant with the
> same value).

The kernel has IP_MAX_MTU for internal usage -- with its name spilling
over to userspace via comment to ETH_MAX_MTU ("same as IP_MAX_MTU"). I
would call it that way, in case.

But... is there any value in ignoring the length of the Ethernet header
here?

Thinking further about it, if one day we enable usage of, say, a tun
interface (instead of tap), we can probably send 14 bytes more. Is this
the reason why you suggest this?

Because to me, if we assume instead that we'll always have 802.3-style
frames, it makes sense to account for ETH_HLEN here.

> The L2 MTU restriction, ought to be applied at runtime,
> to properly handle the --mtu option.

Okay, but MSS4 and MSS6 are here to define buffer sizes, not to be used
at runtime (except for clamping incoming values, and there I'm not so
sure we want to ignore ETH_HLEN).

> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> >  #ifdef HAS_SND_WND
> > @@ -445,133 +414,111 @@ struct tcp_buf_seq_update {
> >  };
> >  
> >  /* Static buffers */
> > -
> >  /**
> > - * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections
> > - * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
> > - * @taph:	Tap-level headers (partially pre-filled)
> > - * @iph:	Pre-filled IP header (except for tot_len and saddr)
> > - * @uh:		Headroom for TCP header
> > - * @data:	Storage for TCP payload
> > + * struct tcp_l2_payload_t - TCP header and data to send data  
> 
> Pre-existing problem, but without a typedef, I don't think the _t
> suffix is useful.

The idea behind that suffix is that the variable is tcp4_l2_buf[],
without _t, and the type name has a _t: you can't call them the same
way. And yet, we needed to refer to the type name.

With this patch, we would still have a similar naming problem without
_t.

> I also don't think the "l2" really adds any useful
> information here.
> 
> > + * 		32 bytes aligned to be able to use AVX2 checksum
> > + * @th:		TCP header
> > + * @data:	TCP data
> >   */
> > -static struct tcp4_l2_buf_t {
> > -#ifdef __AVX2__
> > -	uint8_t pad[26];	/* 0, align th to 32 bytes */
> > -#else
> > -	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
> > -#endif
> > -	struct tap_hdr taph;	/* 26				2 */
> > -	struct iphdr iph;	/* 44				20 */
> > -	struct tcphdr th;	/* 64				40 */
> > -	uint8_t data[MSS4];	/* 84				60 */
> > -				/* 65536			65532 */
> > +struct tcp_l2_payload_t {
> > +	struct tcphdr th;
> > +	uint8_t data[65536 - sizeof(struct tcphdr)];
> >  #ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)))
> > +} __attribute__ ((packed, aligned(32)));
> >  #else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> > +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> >  #endif
> > -tcp4_l2_buf[TCP_FRAMES_MEM];
> >  
> > -static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> > -
> > -static unsigned int tcp4_l2_buf_used;
> > +static_assert(MSS4 <= sizeof(((struct tcp_l2_payload_t *)0)->data));
> > +static_assert(MSS6 <= sizeof(((struct tcp_l2_payload_t *)0)->data));  
> 
> I think you could avoid the (safe but ugly) explicit 0-dereferences by
> writing the static_assert()s in terms of the actual global variables
> once you've defined them, rather than purely the types.
> 
> >  
> >  /**
> > - * tcp6_l2_buf_t - Pre-cooked IPv6 packet buffers for tap connections
> > - * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
> > - * @taph:	Tap-level headers (partially pre-filled)
> > - * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
> > - * @th:		Headroom for TCP header
> > - * @data:	Storage for TCP payload
> > + * struct tcp_l2_flags_t - TCP header and data to send option flags  
> 
> Again, I don't think the "l2" really says anything useful.  It was a
> bit unclear what it referred to in the original, but it's even more
> out of place now that the structure doesn't include L2 headers.
> 
> > + * @th:		TCP header
> > + * @opts	TCP option flags
> >   */
> > -struct tcp6_l2_buf_t {
> > -#ifdef __AVX2__
> > -	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
> > -#else
> > -	uint8_t pad[2];		/*	align ip6h to 4 bytes	0 */
> > -#endif
> > -	struct tap_hdr taph;	/* 14				2 */
> > -	struct ipv6hdr ip6h;	/* 32				20 */
> > -	struct tcphdr th;	/* 72				60 */
> > -	uint8_t data[MSS6];	/* 92				80 */
> > -				/* 65536			65532 */
> > +struct tcp_l2_flags_t {
> > +	struct tcphdr th;
> > +	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> >  #ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)))
> > +} __attribute__ ((packed, aligned(32)));
> >  #else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> > +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> >  #endif
> > -tcp6_l2_buf[TCP_FRAMES_MEM];
> >  
> > -static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> > +static uint32_t		tcp4_vnet_len		[TCP_FRAMES_MEM];
> > +/* Ethernet header for IPv4 frames */
> > +static struct ethhdr		tcp4_eth_src;
> > +
> > +/* IPv4 headers */
> > +static struct iphdr		tcp4_l2_ip[TCP_FRAMES_MEM];
> > +/* TCP headers and data for IPv4 frames */
> > +static struct tcp_l2_payload_t tcp4_l2_payload[TCP_FRAMES_MEM];
> >  
> > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> > +static unsigned int tcp4_l2_buf_used;
> > +
> > +static uint32_t		tcp4_flags_vnet_len		[TCP_FRAMES_MEM];
> > +/* IPv4 headers for TCP option flags frames */
> > +static struct iphdr		tcp4_l2_flags_ip[TCP_FRAMES_MEM];
> > +/* TCP headers and option flags for IPv4 frames */
> > +static struct tcp_l2_flags_t	tcp4_l2_flags[TCP_FRAMES_MEM];
> > +
> > +static unsigned int tcp4_l2_flags_buf_used;
> > +
> > +static uint32_t		tcp6_vnet_len		[TCP_FRAMES_MEM];
> > +/* Ethernet header for IPv6 frames */
> > +static struct ethhdr		tcp6_eth_src;
> > +
> > +/* IPv6 headers */
> > +static struct ipv6hdr		tcp6_l2_ip[TCP_FRAMES_MEM];
> > +/* TCP headers and data for IPv6 frames */
> > +static struct tcp_l2_payload_t tcp6_l2_payload[TCP_FRAMES_MEM];  
> 
> The v6 vnet_len, ethhdr and payload arrays can be shared with IPv4,
> but that can and IMO should be a follow on patch.
> 
> > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> >  static unsigned int tcp6_l2_buf_used;
> >  
> > +static uint32_t		tcp6_flags_vnet_len		[TCP_FRAMES_MEM];
> > +/* IPv6 headers for TCP option flags frames */
> > +static struct ipv6hdr		tcp6_l2_flags_ip[TCP_FRAMES_MEM];
> > +/* TCP headers and option flags for IPv6 frames */
> > +static struct tcp_l2_flags_t	tcp6_l2_flags[TCP_FRAMES_MEM];
> > +
> > +static unsigned int tcp6_l2_flags_buf_used;
> > +
> >  /* recvmsg()/sendmsg() data for tap */
> >  static char 		tcp_buf_discard		[MAX_WINDOW];
> >  static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
> >  
> > -static struct iovec	tcp4_l2_iov		[TCP_FRAMES_MEM];
> > -static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM];
> > -static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM];
> > -static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM];
> > +/*
> > + * enum tcp_iov_parts - I/O vector parts for one TCP frame
> > + * TCP_IOV_TAP		TAP header
> > + * TCP_IOV_ETH		ethernet header
> > + * TCP_IOV_IP		IP (v4/v6) header
> > + * TCP_IOV_PAYLOAD	IP payload (TCP header + data)
> > + * TCP_IOV_NUM is the number of entries in the iovec array  
> 
> If we have these descriptions adjacent to the actual values, I don't
> think we need to duplicate them in the overall enum comment.

Right, perhaps my earlier comment on the subject was a bit misleading.
By the way, according to kernel-doc documentation (which we happily
ignored elsewhere, but now I took a chance to look at it...), we should
prefix those by @, that is:

 * @TCP_IOV_TAP		TAP Header
   ...

> > + */
> > +enum tcp_iov_parts {
> > +	/* TAP header */
> > +	TCP_IOV_TAP	= 0,
> > +	/* ethernet header */
> > +	TCP_IOV_ETH	= 1,
> > +	/* IP (v4/v6) header */
> > +	TCP_IOV_IP	= 2,
> > +	/* IP payload (TCP header + data) */
> > +	TCP_IOV_PAYLOAD	= 3,
> > +	/* number of entries in the iovec array */
> > +	TCP_IOV_NUM  
> 
> I think our convention has been to use 'FOO_NUM_XS' rather than
> 'FOO_X_NUM' for constants like this, to distinguish it a little
> visually from the individual enum values.  (e.g. PIF_NUM_TYPES,
> FLOW_NUM_TYPES).
> 
> > +};
> > +
> > +static struct iovec	tcp4_l2_iov		[TCP_FRAMES_MEM * TCP_IOV_NUM];
> > +static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM * TCP_IOV_NUM];
> > +static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM * TCP_IOV_NUM];
> > +static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM * TCP_IOV_NUM];
> >  
> >  /* sendmsg() to socket */
> >  static struct iovec	tcp_iov			[UIO_MAXIOV];
> >  
> > -/**
> > - * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags)
> > - * @pad:	Align TCP header to 32 bytes, for AVX2 checksum calculation only
> > - * @taph:	Tap-level headers (partially pre-filled)
> > - * @iph:	Pre-filled IP header (except for tot_len and saddr)
> > - * @th:		Headroom for TCP header
> > - * @opts:	Headroom for TCP options
> > - */
> > -static struct tcp4_l2_flags_buf_t {
> > -#ifdef __AVX2__
> > -	uint8_t pad[26];	/* 0, align th to 32 bytes */
> > -#else
> > -	uint8_t pad[2];		/*	align iph to 4 bytes	0 */
> > -#endif
> > -	struct tap_hdr taph;	/* 26				2 */
> > -	struct iphdr iph;	/* 44				20 */
> > -	struct tcphdr th;	/* 64				40 */
> > -	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> > -#ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)))
> > -#else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> > -#endif
> > -tcp4_l2_flags_buf[TCP_FRAMES_MEM];
> > -
> > -static unsigned int tcp4_l2_flags_buf_used;
> > -
> > -/**
> > - * tcp6_l2_flags_buf_t - IPv6 packet buffers for segments without data (flags)
> > - * @pad:	Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
> > - * @taph:	Tap-level headers (partially pre-filled)
> > - * @ip6h:	Pre-filled IP header (except for payload_len and addresses)
> > - * @th:		Headroom for TCP header
> > - * @opts:	Headroom for TCP options
> > - */
> > -static struct tcp6_l2_flags_buf_t {
> > -#ifdef __AVX2__
> > -	uint8_t pad[14];	/* 0	align ip6h to 32 bytes */
> > -#else
> > -	uint8_t pad[2];		/*	align ip6h to 4 bytes		   0 */
> > -#endif
> > -	struct tap_hdr taph;	/* 14					   2 */
> > -	struct ipv6hdr ip6h;	/* 32					  20 */
> > -	struct tcphdr th	/* 72 */ __attribute__ ((aligned(4))); /* 60 */
> > -	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> > -#ifdef __AVX2__
> > -} __attribute__ ((packed, aligned(32)))
> > -#else
> > -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> > -#endif
> > -tcp6_l2_flags_buf[TCP_FRAMES_MEM];
> > -
> > -static unsigned int tcp6_l2_flags_buf_used;
> > -
> >  #define CONN(idx)		(&(FLOW(idx)->tcp))
> >  
> >  /* Table for lookup from remote address, local port, remote port */
> > @@ -967,25 +914,14 @@ static void tcp_update_check_tcp6(struct ipv6hdr *ip6h, struct tcphdr *th)
> >  }
> >  
> >  /**
> > - * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
> > + * tcp_update_l2_buf() - Update ethernet header buffers with addresses
> >   * @eth_d:	Ethernet destination address, NULL if unchanged
> >   * @eth_s:	Ethernet source address, NULL if unchanged
> >   */
> >  void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> >  {
> > -	int i;
> > -
> > -	for (i = 0; i < TCP_FRAMES_MEM; i++) {
> > -		struct tcp4_l2_flags_buf_t *b4f = &tcp4_l2_flags_buf[i];
> > -		struct tcp6_l2_flags_buf_t *b6f = &tcp6_l2_flags_buf[i];
> > -		struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
> > -		struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
> > -
> > -		eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> > -		eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> > -		eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
> > -		eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
> > -	}
> > +	eth_update_mac(&tcp4_eth_src, eth_d, eth_s);
> > +	eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
> >  }
> >  
> >  /**
> > @@ -995,29 +931,42 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> >  static void tcp_sock4_iov_init(const struct ctx *c)
> >  {
> >  	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
> > -	struct iovec *iov;
> >  	int i;
> >  
> > -	for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) {
> > -		tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) {
> > -			.taph = TAP_HDR_INIT(ETH_P_IP),
> > -			.iph = iph,
> > -			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
> > -		};
> > -	}
> > +	tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
> > +	for (i = 0; i < TCP_FRAMES_MEM; i++) {
> > +		struct iovec *iov;
> >  
> > -	for (i = 0; i < ARRAY_SIZE(tcp4_l2_flags_buf); i++) {
> > -		tcp4_l2_flags_buf[i] = (struct tcp4_l2_flags_buf_t) {
> > -			.taph = TAP_HDR_INIT(ETH_P_IP),
> > -			.iph = L2_BUF_IP4_INIT(IPPROTO_TCP)
> > -		};
> > -	}
> > +		/* headers */
> > +		tcp4_l2_ip[i] = iph;
> > +		tcp4_l2_payload[i].th.doff = sizeof(struct tcphdr) / 4;
> > +		tcp4_l2_payload[i].th.ack = 1;
> >  
> > -	for (i = 0, iov = tcp4_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
> > -		iov->iov_base = tap_frame_base(c, &tcp4_l2_buf[i].taph);
> > +		tcp4_l2_flags_ip[i] = iph;
> > +		tcp4_l2_flags[i].th.doff = sizeof(struct tcphdr) / 4;
> > +		tcp4_l2_flags[i].th.ack = 1;
> >  
> > -	for (i = 0, iov = tcp4_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
> > -		iov->iov_base = tap_frame_base(c, &tcp4_l2_flags_buf[i].taph);
> > +		/* iovecs */
> > +		iov = &tcp4_l2_iov[i * TCP_IOV_NUM];
> > +		iov[TCP_IOV_TAP].iov_base = &tcp4_vnet_len[i];
> > +		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> > +					   sizeof(uint32_t) : 0;
> > +		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> > +		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);  
> 
> I'd prefer to see
> 	..base = &x;
> 	..len = sizeof(x);
> 
> rather than repeating the type name.
> 
> > +		iov[TCP_IOV_IP].iov_base = &tcp4_l2_ip[i];
> > +		iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> > +		iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_payload[i];
> > +
> > +		iov = &tcp4_l2_flags_iov[i * TCP_IOV_NUM];
> > +		iov[TCP_IOV_TAP].iov_base = &tcp4_flags_vnet_len[i];
> > +		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> > +					   sizeof(uint32_t) : 0;
> > +		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> > +		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> > +		iov[TCP_IOV_IP].iov_base = &tcp4_l2_flags_ip[i];
> > +		iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> > +		iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_flags[i];
> > +	}
> >  }
> >  
> >  /**
> > @@ -1026,29 +975,43 @@ static void tcp_sock4_iov_init(const struct ctx *c)
> >   */
> >  static void tcp_sock6_iov_init(const struct ctx *c)
> >  {
> > -	struct iovec *iov;
> > +	struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
> >  	int i;
> >  
> > -	for (i = 0; i < ARRAY_SIZE(tcp6_l2_buf); i++) {
> > -		tcp6_l2_buf[i] = (struct tcp6_l2_buf_t) {
> > -			.taph = TAP_HDR_INIT(ETH_P_IPV6),
> > -			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP),
> > -			.th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
> > -		};
> > -	}
> > +	tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
> > +	for (i = 0; i < TCP_FRAMES_MEM; i++) {
> > +		struct iovec *iov;
> >  
> > -	for (i = 0; i < ARRAY_SIZE(tcp6_l2_flags_buf); i++) {
> > -		tcp6_l2_flags_buf[i] = (struct tcp6_l2_flags_buf_t) {
> > -			.taph = TAP_HDR_INIT(ETH_P_IPV6),
> > -			.ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP)
> > -		};
> > -	}
> > +		/* headers */
> > +		tcp6_l2_ip[i] = ip6;
> > +		tcp6_l2_payload[i].th.doff = sizeof(struct tcphdr) / 4;
> > +		tcp6_l2_payload[i].th.ack = 1;
> >  
> > -	for (i = 0, iov = tcp6_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
> > -		iov->iov_base = tap_frame_base(c, &tcp6_l2_buf[i].taph);
> > +		tcp6_l2_flags_ip[i] = ip6;
> > +		tcp6_l2_flags[i].th.doff = sizeof(struct tcphdr) / 4;
> > +		tcp6_l2_flags[i].th .ack = 1;
> >  
> > -	for (i = 0, iov = tcp6_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
> > -		iov->iov_base = tap_frame_base(c, &tcp6_l2_flags_buf[i].taph);
> > +		/* iovecs */
> > +		iov = &tcp6_l2_iov[i * TCP_IOV_NUM];
> > +		iov[TCP_IOV_TAP].iov_base = &tcp6_vnet_len[i];
> > +		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> > +					   sizeof(uint32_t) : 0;
> > +		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
> > +		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> > +		iov[TCP_IOV_IP].iov_base = &tcp6_l2_ip[i];
> > +		iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> > +		iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_payload[i];
> > +
> > +		iov = &tcp6_l2_flags_iov[i * TCP_IOV_NUM];
> > +		iov[TCP_IOV_TAP].iov_base = &tcp6_flags_vnet_len[i];
> > +		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> > +					   sizeof(uint32_t) : 0;
> > +		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
> > +		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> > +		iov[TCP_IOV_IP].iov_base = &tcp6_l2_flags_ip[i];
> > +		iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> > +		iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_flags[i];
> > +	}
> >  }
> >  
> >  /**
> > @@ -1289,10 +1252,12 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> >   */
> >  static void tcp_l2_flags_buf_flush(const struct ctx *c)
> >  {
> > -	tap_send_frames(c, tcp6_l2_flags_iov, 1, tcp6_l2_flags_buf_used);
> > +	tap_send_frames(c, tcp6_l2_flags_iov, TCP_IOV_NUM,
> > +			tcp6_l2_flags_buf_used);
> >  	tcp6_l2_flags_buf_used = 0;
> >  
> > -	tap_send_frames(c, tcp4_l2_flags_iov, 1, tcp4_l2_flags_buf_used);
> > +	tap_send_frames(c, tcp4_l2_flags_iov, TCP_IOV_NUM,
> > +			tcp4_l2_flags_buf_used);
> >  	tcp4_l2_flags_buf_used = 0;
> >  }
> >  
> > @@ -1305,12 +1270,12 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> >  	unsigned i;
> >  	size_t m;
> >  
> > -	m = tap_send_frames(c, tcp6_l2_iov, 1, tcp6_l2_buf_used);
> > +	m = tap_send_frames(c, tcp6_l2_iov, TCP_IOV_NUM, tcp6_l2_buf_used);
> >  	for (i = 0; i < m; i++)
> >  		*tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> >  	tcp6_l2_buf_used = 0;
> >  
> > -	m = tap_send_frames(c, tcp4_l2_iov, 1, tcp4_l2_buf_used);
> > +	m = tap_send_frames(c, tcp4_l2_iov, TCP_IOV_NUM, tcp4_l2_buf_used);
> >  	for (i = 0; i < m; i++)
> >  		*tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> >  	tcp4_l2_buf_used = 0;
> > @@ -1433,7 +1398,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
> >   * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
> >   * @c:		Execution context
> >   * @conn:	Connection pointer
> > - * @p:		Pointer to any type of TCP pre-cooked buffer
> > + * @iov:	Pointer to an array of iovec of TCP pre-cooked buffer
> >   * @plen:	Payload length (including TCP header options)
> >   * @check:	Checksum, if already known
> >   * @seq:	Sequence number for this segment
> > @@ -1442,26 +1407,22 @@ static size_t tcp_fill_headers6(const struct ctx *c,
> >   */
> >  static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> >  				      const struct tcp_tap_conn *conn,
> > -				      void *p, size_t plen,
> > +				      struct iovec *iov, size_t plen,
> >  				      const uint16_t *check, uint32_t seq)
> >  {
> >  	const struct in_addr *a4 = inany_v4(&conn->faddr);
> >  	size_t ip_len, tlen;
> >  
> >  	if (a4) {
> > -		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
> > -
> > -		ip_len = tcp_fill_headers4(c, conn, &b->iph, &b->th, plen,
> > +		ip_len = tcp_fill_headers4(c, conn, iov[TCP_IOV_IP].iov_base,
> > +					   iov[TCP_IOV_PAYLOAD].iov_base, plen,
> >  					   check, seq);
> > -
> > -		tlen = tap_frame_len(c, &b->taph, ip_len);
> > +		tlen = ip_len - sizeof(struct iphdr);
> >  	} else {
> > -		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
> > -
> > -		ip_len = tcp_fill_headers6(c, conn, &b->ip6h, &b->th, plen,
> > +		ip_len = tcp_fill_headers6(c, conn, iov[TCP_IOV_IP].iov_base,
> > +					   iov[TCP_IOV_PAYLOAD].iov_base, plen,
> >  					   seq);
> > -
> > -		tlen = tap_frame_len(c, &b->taph, ip_len);
> > +		tlen = ip_len - sizeof(struct ipv6hdr);  
> 
> You're changing from returning the frame length to returning the IP
> payload length, so the function comment should be updated as well.
> 
> >  	}
> >  
> >  	return tlen;
> > @@ -1595,16 +1556,16 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  {
> >  	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
> >  	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
> > -	struct tcp4_l2_flags_buf_t *b4 = NULL;
> > -	struct tcp6_l2_flags_buf_t *b6 = NULL;
> > +	struct tcp_l2_flags_t *payload;
> >  	struct tcp_info tinfo = { 0 };
> >  	socklen_t sl = sizeof(tinfo);
> >  	int s = conn->sock;
> > +	uint32_t *vnet_len;
> >  	size_t optlen = 0;
> > -	struct iovec *iov;
> >  	struct tcphdr *th;
> > +	struct iovec *iov;
> > +	size_t ip_len;
> >  	char *data;
> > -	void *p;
> >  
> >  	if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
> >  	    !flags && conn->wnd_to_tap)
> > @@ -1627,19 +1588,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  		return 0;
> >  
> >  	if (CONN_V4(conn)) {
> > -		iov = tcp4_l2_flags_iov    + tcp4_l2_flags_buf_used;
> > -		p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> > -		th = &b4->th;
> > -
> > -		/* gcc 11.2 would complain on data = (char *)(th + 1); */
> > -		data = b4->opts;
> > +		iov = &tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++ *
> > +					 TCP_IOV_NUM];
> > +		vnet_len = iov[TCP_IOV_TAP].iov_base;
> > +		*vnet_len = sizeof(struct ethhdr) + sizeof(struct iphdr);  
> 
> You're storing a native-endian value in *vnet_len here, then a network
> endian value later.  I think that's extremely fragile and easy to mess
> up on later changes.
> 
> >  	} else {
> > -		iov = tcp6_l2_flags_iov    + tcp6_l2_flags_buf_used;
> > -		p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> > -		th = &b6->th;
> > -		data = b6->opts;
> > +		iov = &tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++ *
> > +					 TCP_IOV_NUM];
> > +		vnet_len = iov[TCP_IOV_TAP].iov_base;
> > +		*vnet_len = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
> >  	}
> >  
> > +	payload = iov[TCP_IOV_PAYLOAD].iov_base;
> > +	th = &payload->th;
> > +	data = payload->opts;
> > +
> >  	if (flags & SYN) {
> >  		int mss;
> >  
> > @@ -1688,8 +1651,10 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  	th->syn = !!(flags & SYN);
> >  	th->fin = !!(flags & FIN);
> >  
> > -	iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
> > -					       NULL, conn->seq_to_tap);
> > +	ip_len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
> > +					 conn->seq_to_tap);
> > +	iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> > +	*vnet_len = htonl(*vnet_len + ip_len);  
> 
> Fwiw, in the UDP patches I'm working on, I still called
> tap_frame_len() to set the vnet_len field, but returned a different
> length from the fill_headers() equivalent to size the relevant IOV
> segment.  That way fill_headers() does fill all the headers, including
> the vnet_len.
> 
> >  	if (th->ack) {
> >  		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
> > @@ -1705,23 +1670,29 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  	if (th->fin || th->syn)
> >  		conn->seq_to_tap++;
> >  
> > -	if (CONN_V4(conn)) {
> > -		if (flags & DUP_ACK) {
> > -			memcpy(b4 + 1, b4, sizeof(*b4));
> > -			(iov + 1)->iov_len = iov->iov_len;
> > -			tcp4_l2_flags_buf_used++;
> > +	if (flags & DUP_ACK) {
> > +		struct iovec *dup_iov;
> > +		int i;
> > +
> > +		if (CONN_V4(conn))
> > +			dup_iov = &tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++ *
> > +						     TCP_IOV_NUM];
> > +		else
> > +			dup_iov = &tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++ *
> > +						     TCP_IOV_NUM];
> > +
> > +		for (i = 0; i < TCP_IOV_NUM; i++) {
> > +			memcpy(dup_iov[i].iov_base, iov[i].iov_base,
> > +			       iov[i].iov_len);
> > +			dup_iov[i].iov_len = iov[i].iov_len;  
> 
> IIUC, the iov_len for everything except TCP_IOV_PAYLOAD should be
> uniform, so most of these won't change anything.  I get nervous about
> fragility if we write to values which are supposed to be readonly
> after initialisation.
> 
> >  		}
> > +	}
> >  
> > -		if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
> > +	if (CONN_V4(conn)) {
> > +		if (tcp4_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
> >  			tcp_l2_flags_buf_flush(c);
> >  	} else {
> > -		if (flags & DUP_ACK) {
> > -			memcpy(b6 + 1, b6, sizeof(*b6));
> > -			(iov + 1)->iov_len = iov->iov_len;
> > -			tcp6_l2_flags_buf_used++;
> > -		}
> > -
> > -		if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
> > +		if (tcp6_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
> >  			tcp_l2_flags_buf_flush(c);
> >  	}
> >  
> > @@ -2168,30 +2139,44 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			    ssize_t plen, int no_csum, uint32_t seq)
> >  {
> >  	uint32_t *seq_update = &conn->seq_to_tap;
> > +	uint32_t *vnet_len;
> >  	struct iovec *iov;
> > +	size_t ip_len;
> >  
> >  	if (CONN_V4(conn)) {
> > -		struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> > -		const uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
> > +		struct iovec *iov_prev = &tcp4_l2_iov[(tcp4_l2_buf_used - 1) * TCP_IOV_NUM];
> > +		const uint16_t *check = NULL;
> > +
> > +		if (no_csum) {
> > +			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> > +			check = &iph->check;
> > +		}
> >  
> >  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> >  		tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
> >  
> > -		iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> > -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > -						       check, seq);
> > -		if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
> > +		iov = &tcp4_l2_iov[tcp4_l2_buf_used++ * TCP_IOV_NUM];
> > +		ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, check,
> > +						 seq);
> > +		iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> > +		vnet_len = iov[TCP_IOV_TAP].iov_base;
> > +		*vnet_len = htonl(sizeof(struct ethhdr) +
> > +				  sizeof(struct iphdr) +
> > +			          ip_len);
> > +		if (tcp4_l2_buf_used > TCP_FRAMES_MEM - 1)
> >  			tcp_l2_data_buf_flush(c);
> >  	} else if (CONN_V6(conn)) {
> > -		struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> > -
> >  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> >  		tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
> >  
> > -		iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> > -		iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> > -						       NULL, seq);
> > -		if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
> > +		iov = &tcp6_l2_iov[tcp6_l2_buf_used++ * TCP_IOV_NUM];
> > +		ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, NULL, seq);
> > +		iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> > +		vnet_len = iov[TCP_IOV_TAP].iov_base;
> > +		*vnet_len = htonl(sizeof(struct ethhdr) +
> > +				  sizeof(struct ipv6hdr) +
> > +			          ip_len);
> > +		if (tcp6_l2_buf_used > TCP_FRAMES_MEM - 1)
> >  			tcp_l2_data_buf_flush(c);
> >  	}
> >  }
> > @@ -2247,8 +2232,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  	iov_sock[0].iov_base = tcp_buf_discard;
> >  	iov_sock[0].iov_len = already_sent;
> >  
> > -	if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
> > -	    (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
> > +	if (( v4 && tcp4_l2_buf_used + fill_bufs > TCP_FRAMES_MEM) ||
> > +	    (!v4 && tcp6_l2_buf_used + fill_bufs > TCP_FRAMES_MEM)) {
> >  		tcp_l2_data_buf_flush(c);
> >  
> >  		/* Silence Coverity CWE-125 false positive */
> > @@ -2257,9 +2242,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
> >  		if (v4)
> > -			iov->iov_base = &tcp4_l2_buf[tcp4_l2_buf_used + i].data;
> > +			iov->iov_base = &tcp4_l2_payload[tcp4_l2_buf_used + i].data;
> >  		else
> > -			iov->iov_base = &tcp6_l2_buf[tcp6_l2_buf_used + i].data;
> > +			iov->iov_base = &tcp6_l2_payload[tcp6_l2_buf_used + i].data;
> >  		iov->iov_len = mss;
> >  	}
> >  	if (iov_rem)  

-- 
Stefano


  reply	other threads:[~2024-03-18  8:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-15 17:51 [RFC v2] tcp: Replace TCP buffer structure by an iovec array Laurent Vivier
2024-03-18  5:47 ` David Gibson
2024-03-18  8:53   ` Stefano Brivio [this message]
2024-03-19  4:51     ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240318095329.34152c6f@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).