From: David Gibson <david@gibson.dropbear.id.au>
To: Laurent Vivier <lvivier@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [RFC v2] tcp: Replace TCP buffer structure by an iovec array
Date: Mon, 18 Mar 2024 16:47:34 +1100 [thread overview]
Message-ID: <ZffVdl5k0OSW7e4-@zatzit> (raw)
In-Reply-To: <20240315175119.1618550-1-lvivier@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 30151 bytes --]
On Fri, Mar 15, 2024 at 06:51:19PM +0100, Laurent Vivier wrote:
> To be able to provide pointers to TCP headers and IP headers without
> worrying about alignment in the structure, split the structure into
> several arrays and point to each part of the frame using an iovec array.
>
> Using iovec also allows us to simply ignore the first entry when the
> vnet length header is not needed.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>
> Notes:
> v2:
> - rebased on top of David's series
> "Some improvements to the tap send path"
Thanks, and sorry for the disruption.
> - no performance improvement anymore
Well, that's unfortunate. I think we need to drill down and find out
what actually made the difference here, since it's a pretty big change
from something rather subtle.
> - remove the iovec functions of v1 introduced in tap.c to use new
> functions from David
> - don't use an array of array of iovec
Huh. Is there a specific reason for that? I actually changed *to*
using a 2d array in my in-progress patches which do something similar
in the UDP code, based on your example.
> - fix comments
> - re-introduce MSS4 and MSS6
> - address comments from David and Stefano
>
> I have tested only passt, not pasta
Is there a reason for that? If you're having trouble running the
testsuite, we really need to fix that...
>
> tap.c | 3 +-
> tcp.c | 479 ++++++++++++++++++++++++++++------------------------------
> 2 files changed, 233 insertions(+), 249 deletions(-)
>
> diff --git a/tap.c b/tap.c
> index 13e4da79d690..605b472b5ca2 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -424,8 +424,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> debug("tap: failed to send %zu frames of %zu",
> nframes - m, nframes);
>
> - pcap_multiple(iov, bufs_per_frame, m,
> - c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> + pcap_multiple(iov, bufs_per_frame, m, 0);
If you're no longer changing the semantics of tap_send_frames() from
those in my patch, I don't think you want this change any more either.
> return m;
> }
> diff --git a/tcp.c b/tcp.c
> index a1860d10b15f..d5c09dbf53ab 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -318,39 +318,8 @@
>
> /* MSS rounding: see SET_MSS() */
> #define MSS_DEFAULT 536
> -
> -struct tcp4_l2_head { /* For MSS4 macro: keep in sync with tcp4_l2_buf_t */
> -#ifdef __AVX2__
> - uint8_t pad[26];
> -#else
> - uint8_t pad[2];
> -#endif
> - struct tap_hdr taph;
> - struct iphdr iph;
> - struct tcphdr th;
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)));
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> -#endif
> -
> -struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
> -#ifdef __AVX2__
> - uint8_t pad[14];
> -#else
> - uint8_t pad[2];
> -#endif
> - struct tap_hdr taph;
> - struct ipv6hdr ip6h;
> - struct tcphdr th;
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)));
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> -#endif
> -
> -#define MSS4 ROUND_DOWN(USHRT_MAX - sizeof(struct tcp4_l2_head), 4)
> -#define MSS6 ROUND_DOWN(USHRT_MAX - sizeof(struct tcp6_l2_head), 4)
> +#define MSS4 ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN - sizeof(struct tcphdr) - sizeof(struct iphdr), sizeof(uint32_t))
> +#define MSS6 ROUND_DOWN(ETH_MAX_MTU - ETH_HLEN - sizeof(struct tcphdr) - sizeof(struct ipv6hdr), sizeof(uint32_t))
I'd be inclined to base these static constants on only the IP level
restriction (16-bit L3 packet length), i.e. replace (ETH_MAX_MTU -
ETH_HLEN) with USHRT_MAX (or a new L3_MAX_LENGTH constant with the
same value). The L2 MTU restriction, ought to be applied at runtime,
to properly handle the --mtu option.
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> #ifdef HAS_SND_WND
> @@ -445,133 +414,111 @@ struct tcp_buf_seq_update {
> };
>
> /* Static buffers */
> -
> /**
> - * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections
> - * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only
> - * @taph: Tap-level headers (partially pre-filled)
> - * @iph: Pre-filled IP header (except for tot_len and saddr)
> - * @uh: Headroom for TCP header
> - * @data: Storage for TCP payload
> + * struct tcp_l2_payload_t - TCP header and data to send data
Pre-existing problem, but without a typedef, I don't think the _t
suffix is useful. I also don't think the "l2" really adds any useful
information here.
> + * 32 bytes aligned to be able to use AVX2 checksum
> + * @th: TCP header
> + * @data: TCP data
> */
> -static struct tcp4_l2_buf_t {
> -#ifdef __AVX2__
> - uint8_t pad[26]; /* 0, align th to 32 bytes */
> -#else
> - uint8_t pad[2]; /* align iph to 4 bytes 0 */
> -#endif
> - struct tap_hdr taph; /* 26 2 */
> - struct iphdr iph; /* 44 20 */
> - struct tcphdr th; /* 64 40 */
> - uint8_t data[MSS4]; /* 84 60 */
> - /* 65536 65532 */
> +struct tcp_l2_payload_t {
> + struct tcphdr th;
> + uint8_t data[65536 - sizeof(struct tcphdr)];
> #ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> +} __attribute__ ((packed, aligned(32)));
> #else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> #endif
> -tcp4_l2_buf[TCP_FRAMES_MEM];
>
> -static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> -
> -static unsigned int tcp4_l2_buf_used;
> +static_assert(MSS4 <= sizeof(((struct tcp_l2_payload_t *)0)->data));
> +static_assert(MSS6 <= sizeof(((struct tcp_l2_payload_t *)0)->data));
I think you could avoid the (safe but ugly) explicit 0-dereferences by
writing the static_assert()s in terms of the actual global variables
once you've defined them, rather than purely the types.
>
> /**
> - * tcp6_l2_buf_t - Pre-cooked IPv6 packet buffers for tap connections
> - * @pad: Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
> - * @taph: Tap-level headers (partially pre-filled)
> - * @ip6h: Pre-filled IP header (except for payload_len and addresses)
> - * @th: Headroom for TCP header
> - * @data: Storage for TCP payload
> + * struct tcp_l2_flags_t - TCP header and data to send option flags
Again, I don't think the "l2" really says anything useful. It was a
bit unclear what it referred to in the original, but it's even more
out of place now that the structure doesn't include L2 headers.
> + * @th: TCP header
> + * @opts TCP option flags
> */
> -struct tcp6_l2_buf_t {
> -#ifdef __AVX2__
> - uint8_t pad[14]; /* 0 align ip6h to 32 bytes */
> -#else
> - uint8_t pad[2]; /* align ip6h to 4 bytes 0 */
> -#endif
> - struct tap_hdr taph; /* 14 2 */
> - struct ipv6hdr ip6h; /* 32 20 */
> - struct tcphdr th; /* 72 60 */
> - uint8_t data[MSS6]; /* 92 80 */
> - /* 65536 65532 */
> +struct tcp_l2_flags_t {
> + struct tcphdr th;
> + char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> #ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> +} __attribute__ ((packed, aligned(32)));
> #else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> #endif
> -tcp6_l2_buf[TCP_FRAMES_MEM];
>
> -static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> +static uint32_t tcp4_vnet_len [TCP_FRAMES_MEM];
> +/* Ethernet header for IPv4 frames */
> +static struct ethhdr tcp4_eth_src;
> +
> +/* IPv4 headers */
> +static struct iphdr tcp4_l2_ip[TCP_FRAMES_MEM];
> +/* TCP headers and data for IPv4 frames */
> +static struct tcp_l2_payload_t tcp4_l2_payload[TCP_FRAMES_MEM];
>
> +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM];
> +static unsigned int tcp4_l2_buf_used;
> +
> +static uint32_t tcp4_flags_vnet_len [TCP_FRAMES_MEM];
> +/* IPv4 headers for TCP option flags frames */
> +static struct iphdr tcp4_l2_flags_ip[TCP_FRAMES_MEM];
> +/* TCP headers and option flags for IPv4 frames */
> +static struct tcp_l2_flags_t tcp4_l2_flags[TCP_FRAMES_MEM];
> +
> +static unsigned int tcp4_l2_flags_buf_used;
> +
> +static uint32_t tcp6_vnet_len [TCP_FRAMES_MEM];
> +/* Ethernet header for IPv6 frames */
> +static struct ethhdr tcp6_eth_src;
> +
> +/* IPv6 headers */
> +static struct ipv6hdr tcp6_l2_ip[TCP_FRAMES_MEM];
> +/* TCP headers and data for IPv6 frames */
> +static struct tcp_l2_payload_t tcp6_l2_payload[TCP_FRAMES_MEM];
The v6 vnet_len, ethhdr and payload arrays can be shared with IPv4,
but that can and IMO should be a follow on patch.
> +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM];
> static unsigned int tcp6_l2_buf_used;
>
> +static uint32_t tcp6_flags_vnet_len [TCP_FRAMES_MEM];
> +/* IPv6 headers for TCP option flags frames */
> +static struct ipv6hdr tcp6_l2_flags_ip[TCP_FRAMES_MEM];
> +/* TCP headers and option flags for IPv6 frames */
> +static struct tcp_l2_flags_t tcp6_l2_flags[TCP_FRAMES_MEM];
> +
> +static unsigned int tcp6_l2_flags_buf_used;
> +
> /* recvmsg()/sendmsg() data for tap */
> static char tcp_buf_discard [MAX_WINDOW];
> static struct iovec iov_sock [TCP_FRAMES_MEM + 1];
>
> -static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM];
> -static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM];
> -static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM];
> -static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM];
> +/*
> + * enum tcp_iov_parts - I/O vector parts for one TCP frame
> + * TCP_IOV_TAP TAP header
> + * TCP_IOV_ETH ethernet header
> + * TCP_IOV_IP IP (v4/v6) header
> + * TCP_IOV_PAYLOAD IP payload (TCP header + data)
> + * TCP_IOV_NUM is the number of entries in the iovec array
If we have these descriptions adjacent to the actual values, I don't
think we need to duplicate them in the overall enum comment.
> + */
> +enum tcp_iov_parts {
> + /* TAP header */
> + TCP_IOV_TAP = 0,
> + /* ethernet header */
> + TCP_IOV_ETH = 1,
> + /* IP (v4/v6) header */
> + TCP_IOV_IP = 2,
> + /* IP payload (TCP header + data) */
> + TCP_IOV_PAYLOAD = 3,
> + /* number of entries in the iovec array */
> + TCP_IOV_NUM
I think our convention has been to use 'FOO_NUM_XS' rather than
'FOO_X_NUM' for constants like this, to distinguish it a little
visually from the individual enum values. (e.g. PIF_NUM_TYPES,
FLOW_NUM_TYPES).
> +};
> +
> +static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM * TCP_IOV_NUM];
> +static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM * TCP_IOV_NUM];
> +static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM * TCP_IOV_NUM];
> +static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM * TCP_IOV_NUM];
>
> /* sendmsg() to socket */
> static struct iovec tcp_iov [UIO_MAXIOV];
>
> -/**
> - * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags)
> - * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only
> - * @taph: Tap-level headers (partially pre-filled)
> - * @iph: Pre-filled IP header (except for tot_len and saddr)
> - * @th: Headroom for TCP header
> - * @opts: Headroom for TCP options
> - */
> -static struct tcp4_l2_flags_buf_t {
> -#ifdef __AVX2__
> - uint8_t pad[26]; /* 0, align th to 32 bytes */
> -#else
> - uint8_t pad[2]; /* align iph to 4 bytes 0 */
> -#endif
> - struct tap_hdr taph; /* 26 2 */
> - struct iphdr iph; /* 44 20 */
> - struct tcphdr th; /* 64 40 */
> - char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> -#endif
> -tcp4_l2_flags_buf[TCP_FRAMES_MEM];
> -
> -static unsigned int tcp4_l2_flags_buf_used;
> -
> -/**
> - * tcp6_l2_flags_buf_t - IPv6 packet buffers for segments without data (flags)
> - * @pad: Align IPv6 header for checksum calculation to 32B (AVX2) or 4B
> - * @taph: Tap-level headers (partially pre-filled)
> - * @ip6h: Pre-filled IP header (except for payload_len and addresses)
> - * @th: Headroom for TCP header
> - * @opts: Headroom for TCP options
> - */
> -static struct tcp6_l2_flags_buf_t {
> -#ifdef __AVX2__
> - uint8_t pad[14]; /* 0 align ip6h to 32 bytes */
> -#else
> - uint8_t pad[2]; /* align ip6h to 4 bytes 0 */
> -#endif
> - struct tap_hdr taph; /* 14 2 */
> - struct ipv6hdr ip6h; /* 32 20 */
> - struct tcphdr th /* 72 */ __attribute__ ((aligned(4))); /* 60 */
> - char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> -#endif
> -tcp6_l2_flags_buf[TCP_FRAMES_MEM];
> -
> -static unsigned int tcp6_l2_flags_buf_used;
> -
> #define CONN(idx) (&(FLOW(idx)->tcp))
>
> /* Table for lookup from remote address, local port, remote port */
> @@ -967,25 +914,14 @@ static void tcp_update_check_tcp6(struct ipv6hdr *ip6h, struct tcphdr *th)
> }
>
> /**
> - * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
> + * tcp_update_l2_buf() - Update ethernet header buffers with addresses
> * @eth_d: Ethernet destination address, NULL if unchanged
> * @eth_s: Ethernet source address, NULL if unchanged
> */
> void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> {
> - int i;
> -
> - for (i = 0; i < TCP_FRAMES_MEM; i++) {
> - struct tcp4_l2_flags_buf_t *b4f = &tcp4_l2_flags_buf[i];
> - struct tcp6_l2_flags_buf_t *b6f = &tcp6_l2_flags_buf[i];
> - struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i];
> - struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i];
> -
> - eth_update_mac(&b4->taph.eh, eth_d, eth_s);
> - eth_update_mac(&b6->taph.eh, eth_d, eth_s);
> - eth_update_mac(&b4f->taph.eh, eth_d, eth_s);
> - eth_update_mac(&b6f->taph.eh, eth_d, eth_s);
> - }
> + eth_update_mac(&tcp4_eth_src, eth_d, eth_s);
> + eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
> }
>
> /**
> @@ -995,29 +931,42 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
> static void tcp_sock4_iov_init(const struct ctx *c)
> {
> struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
> - struct iovec *iov;
> int i;
>
> - for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) {
> - tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) {
> - .taph = TAP_HDR_INIT(ETH_P_IP),
> - .iph = iph,
> - .th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
> - };
> - }
> + tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
> + for (i = 0; i < TCP_FRAMES_MEM; i++) {
> + struct iovec *iov;
>
> - for (i = 0; i < ARRAY_SIZE(tcp4_l2_flags_buf); i++) {
> - tcp4_l2_flags_buf[i] = (struct tcp4_l2_flags_buf_t) {
> - .taph = TAP_HDR_INIT(ETH_P_IP),
> - .iph = L2_BUF_IP4_INIT(IPPROTO_TCP)
> - };
> - }
> + /* headers */
> + tcp4_l2_ip[i] = iph;
> + tcp4_l2_payload[i].th.doff = sizeof(struct tcphdr) / 4;
> + tcp4_l2_payload[i].th.ack = 1;
>
> - for (i = 0, iov = tcp4_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
> - iov->iov_base = tap_frame_base(c, &tcp4_l2_buf[i].taph);
> + tcp4_l2_flags_ip[i] = iph;
> + tcp4_l2_flags[i].th.doff = sizeof(struct tcphdr) / 4;
> + tcp4_l2_flags[i].th.ack = 1;
>
> - for (i = 0, iov = tcp4_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
> - iov->iov_base = tap_frame_base(c, &tcp4_l2_flags_buf[i].taph);
> + /* iovecs */
> + iov = &tcp4_l2_iov[i * TCP_IOV_NUM];
> + iov[TCP_IOV_TAP].iov_base = &tcp4_vnet_len[i];
> + iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> + sizeof(uint32_t) : 0;
> + iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
I'd prefer to see
..base = &x;
..len = sizeof(x);
rather than repeating the type name.
> + iov[TCP_IOV_IP].iov_base = &tcp4_l2_ip[i];
> + iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> + iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_payload[i];
> +
> + iov = &tcp4_l2_flags_iov[i * TCP_IOV_NUM];
> + iov[TCP_IOV_TAP].iov_base = &tcp4_flags_vnet_len[i];
> + iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> + sizeof(uint32_t) : 0;
> + iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> + iov[TCP_IOV_IP].iov_base = &tcp4_l2_flags_ip[i];
> + iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> + iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_flags[i];
> + }
> }
>
> /**
> @@ -1026,29 +975,43 @@ static void tcp_sock4_iov_init(const struct ctx *c)
> */
> static void tcp_sock6_iov_init(const struct ctx *c)
> {
> - struct iovec *iov;
> + struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP);
> int i;
>
> - for (i = 0; i < ARRAY_SIZE(tcp6_l2_buf); i++) {
> - tcp6_l2_buf[i] = (struct tcp6_l2_buf_t) {
> - .taph = TAP_HDR_INIT(ETH_P_IPV6),
> - .ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP),
> - .th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 }
> - };
> - }
> + tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
> + for (i = 0; i < TCP_FRAMES_MEM; i++) {
> + struct iovec *iov;
>
> - for (i = 0; i < ARRAY_SIZE(tcp6_l2_flags_buf); i++) {
> - tcp6_l2_flags_buf[i] = (struct tcp6_l2_flags_buf_t) {
> - .taph = TAP_HDR_INIT(ETH_P_IPV6),
> - .ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP)
> - };
> - }
> + /* headers */
> + tcp6_l2_ip[i] = ip6;
> + tcp6_l2_payload[i].th.doff = sizeof(struct tcphdr) / 4;
> + tcp6_l2_payload[i].th.ack = 1;
>
> - for (i = 0, iov = tcp6_l2_iov; i < TCP_FRAMES_MEM; i++, iov++)
> - iov->iov_base = tap_frame_base(c, &tcp6_l2_buf[i].taph);
> + tcp6_l2_flags_ip[i] = ip6;
> + tcp6_l2_flags[i].th.doff = sizeof(struct tcphdr) / 4;
> + tcp6_l2_flags[i].th .ack = 1;
>
> - for (i = 0, iov = tcp6_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++)
> - iov->iov_base = tap_frame_base(c, &tcp6_l2_flags_buf[i].taph);
> + /* iovecs */
> + iov = &tcp6_l2_iov[i * TCP_IOV_NUM];
> + iov[TCP_IOV_TAP].iov_base = &tcp6_vnet_len[i];
> + iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> + sizeof(uint32_t) : 0;
> + iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
> + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> + iov[TCP_IOV_IP].iov_base = &tcp6_l2_ip[i];
> + iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> + iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_payload[i];
> +
> + iov = &tcp6_l2_flags_iov[i * TCP_IOV_NUM];
> + iov[TCP_IOV_TAP].iov_base = &tcp6_flags_vnet_len[i];
> + iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> + sizeof(uint32_t) : 0;
> + iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
> + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> + iov[TCP_IOV_IP].iov_base = &tcp6_l2_flags_ip[i];
> + iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> + iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_flags[i];
> + }
> }
>
> /**
> @@ -1289,10 +1252,12 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
> */
> static void tcp_l2_flags_buf_flush(const struct ctx *c)
> {
> - tap_send_frames(c, tcp6_l2_flags_iov, 1, tcp6_l2_flags_buf_used);
> + tap_send_frames(c, tcp6_l2_flags_iov, TCP_IOV_NUM,
> + tcp6_l2_flags_buf_used);
> tcp6_l2_flags_buf_used = 0;
>
> - tap_send_frames(c, tcp4_l2_flags_iov, 1, tcp4_l2_flags_buf_used);
> + tap_send_frames(c, tcp4_l2_flags_iov, TCP_IOV_NUM,
> + tcp4_l2_flags_buf_used);
> tcp4_l2_flags_buf_used = 0;
> }
>
> @@ -1305,12 +1270,12 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> unsigned i;
> size_t m;
>
> - m = tap_send_frames(c, tcp6_l2_iov, 1, tcp6_l2_buf_used);
> + m = tap_send_frames(c, tcp6_l2_iov, TCP_IOV_NUM, tcp6_l2_buf_used);
> for (i = 0; i < m; i++)
> *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len;
> tcp6_l2_buf_used = 0;
>
> - m = tap_send_frames(c, tcp4_l2_iov, 1, tcp4_l2_buf_used);
> + m = tap_send_frames(c, tcp4_l2_iov, TCP_IOV_NUM, tcp4_l2_buf_used);
> for (i = 0; i < m; i++)
> *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len;
> tcp4_l2_buf_used = 0;
> @@ -1433,7 +1398,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
> * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
> * @c: Execution context
> * @conn: Connection pointer
> - * @p: Pointer to any type of TCP pre-cooked buffer
> + * @iov: Pointer to an array of iovec of TCP pre-cooked buffer
> * @plen: Payload length (including TCP header options)
> * @check: Checksum, if already known
> * @seq: Sequence number for this segment
> @@ -1442,26 +1407,22 @@ static size_t tcp_fill_headers6(const struct ctx *c,
> */
> static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> const struct tcp_tap_conn *conn,
> - void *p, size_t plen,
> + struct iovec *iov, size_t plen,
> const uint16_t *check, uint32_t seq)
> {
> const struct in_addr *a4 = inany_v4(&conn->faddr);
> size_t ip_len, tlen;
>
> if (a4) {
> - struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
> -
> - ip_len = tcp_fill_headers4(c, conn, &b->iph, &b->th, plen,
> + ip_len = tcp_fill_headers4(c, conn, iov[TCP_IOV_IP].iov_base,
> + iov[TCP_IOV_PAYLOAD].iov_base, plen,
> check, seq);
> -
> - tlen = tap_frame_len(c, &b->taph, ip_len);
> + tlen = ip_len - sizeof(struct iphdr);
> } else {
> - struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
> -
> - ip_len = tcp_fill_headers6(c, conn, &b->ip6h, &b->th, plen,
> + ip_len = tcp_fill_headers6(c, conn, iov[TCP_IOV_IP].iov_base,
> + iov[TCP_IOV_PAYLOAD].iov_base, plen,
> seq);
> -
> - tlen = tap_frame_len(c, &b->taph, ip_len);
> + tlen = ip_len - sizeof(struct ipv6hdr);
You're changing from returning the frame length to returning the IP
payload length, so the function comment should be updated as well.
> }
>
> return tlen;
> @@ -1595,16 +1556,16 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> {
> uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
> uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
> - struct tcp4_l2_flags_buf_t *b4 = NULL;
> - struct tcp6_l2_flags_buf_t *b6 = NULL;
> + struct tcp_l2_flags_t *payload;
> struct tcp_info tinfo = { 0 };
> socklen_t sl = sizeof(tinfo);
> int s = conn->sock;
> + uint32_t *vnet_len;
> size_t optlen = 0;
> - struct iovec *iov;
> struct tcphdr *th;
> + struct iovec *iov;
> + size_t ip_len;
> char *data;
> - void *p;
>
> if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
> !flags && conn->wnd_to_tap)
> @@ -1627,19 +1588,21 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> return 0;
>
> if (CONN_V4(conn)) {
> - iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used;
> - p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++;
> - th = &b4->th;
> -
> - /* gcc 11.2 would complain on data = (char *)(th + 1); */
> - data = b4->opts;
> + iov = &tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++ *
> + TCP_IOV_NUM];
> + vnet_len = iov[TCP_IOV_TAP].iov_base;
> + *vnet_len = sizeof(struct ethhdr) + sizeof(struct iphdr);
You're storing a native-endian value in *vnet_len here, then a network
endian value later. I think that's extremely fragile and easy to mess
up on later changes.
> } else {
> - iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used;
> - p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++;
> - th = &b6->th;
> - data = b6->opts;
> + iov = &tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++ *
> + TCP_IOV_NUM];
> + vnet_len = iov[TCP_IOV_TAP].iov_base;
> + *vnet_len = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
> }
>
> + payload = iov[TCP_IOV_PAYLOAD].iov_base;
> + th = &payload->th;
> + data = payload->opts;
> +
> if (flags & SYN) {
> int mss;
>
> @@ -1688,8 +1651,10 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> th->syn = !!(flags & SYN);
> th->fin = !!(flags & FIN);
>
> - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen,
> - NULL, conn->seq_to_tap);
> + ip_len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
> + conn->seq_to_tap);
> + iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> + *vnet_len = htonl(*vnet_len + ip_len);
Fwiw, in the UDP patches I'm working on, I still called
tap_frame_len() to set the vnet_len field, but returned a different
length from the fill_headers() equivalent to size the relevant IOV
segment. That way fill_headers() does fill all the headers, including
the vnet_len.
> if (th->ack) {
> if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
> @@ -1705,23 +1670,29 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> if (th->fin || th->syn)
> conn->seq_to_tap++;
>
> - if (CONN_V4(conn)) {
> - if (flags & DUP_ACK) {
> - memcpy(b4 + 1, b4, sizeof(*b4));
> - (iov + 1)->iov_len = iov->iov_len;
> - tcp4_l2_flags_buf_used++;
> + if (flags & DUP_ACK) {
> + struct iovec *dup_iov;
> + int i;
> +
> + if (CONN_V4(conn))
> + dup_iov = &tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++ *
> + TCP_IOV_NUM];
> + else
> + dup_iov = &tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++ *
> + TCP_IOV_NUM];
> +
> + for (i = 0; i < TCP_IOV_NUM; i++) {
> + memcpy(dup_iov[i].iov_base, iov[i].iov_base,
> + iov[i].iov_len);
> + dup_iov[i].iov_len = iov[i].iov_len;
IIUC, the iov_len for everything except TCP_IOV_PAYLOAD should be
uniform, so most of these won't change anything. I get nervous about
fragility if we write to values which are supposed to be readonly
after initialisation.
> }
> + }
>
> - if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2)
> + if (CONN_V4(conn)) {
> + if (tcp4_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
> tcp_l2_flags_buf_flush(c);
> } else {
> - if (flags & DUP_ACK) {
> - memcpy(b6 + 1, b6, sizeof(*b6));
> - (iov + 1)->iov_len = iov->iov_len;
> - tcp6_l2_flags_buf_used++;
> - }
> -
> - if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2)
> + if (tcp6_l2_flags_buf_used > TCP_FRAMES_MEM - 2)
> tcp_l2_flags_buf_flush(c);
> }
>
> @@ -2168,30 +2139,44 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> ssize_t plen, int no_csum, uint32_t seq)
> {
> uint32_t *seq_update = &conn->seq_to_tap;
> + uint32_t *vnet_len;
> struct iovec *iov;
> + size_t ip_len;
>
> if (CONN_V4(conn)) {
> - struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used];
> - const uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL;
> + struct iovec *iov_prev = &tcp4_l2_iov[(tcp4_l2_buf_used - 1) * TCP_IOV_NUM];
> + const uint16_t *check = NULL;
> +
> + if (no_csum) {
> + struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> + check = &iph->check;
> + }
>
> tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update;
> tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen;
>
> - iov = tcp4_l2_iov + tcp4_l2_buf_used++;
> - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> - check, seq);
> - if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1)
> + iov = &tcp4_l2_iov[tcp4_l2_buf_used++ * TCP_IOV_NUM];
> + ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, check,
> + seq);
> + iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> + vnet_len = iov[TCP_IOV_TAP].iov_base;
> + *vnet_len = htonl(sizeof(struct ethhdr) +
> + sizeof(struct iphdr) +
> + ip_len);
> + if (tcp4_l2_buf_used > TCP_FRAMES_MEM - 1)
> tcp_l2_data_buf_flush(c);
> } else if (CONN_V6(conn)) {
> - struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used];
> -
> tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update;
> tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen;
>
> - iov = tcp6_l2_iov + tcp6_l2_buf_used++;
> - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen,
> - NULL, seq);
> - if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1)
> + iov = &tcp6_l2_iov[tcp6_l2_buf_used++ * TCP_IOV_NUM];
> + ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, NULL, seq);
> + iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> + vnet_len = iov[TCP_IOV_TAP].iov_base;
> + *vnet_len = htonl(sizeof(struct ethhdr) +
> + sizeof(struct ipv6hdr) +
> + ip_len);
> + if (tcp6_l2_buf_used > TCP_FRAMES_MEM - 1)
> tcp_l2_data_buf_flush(c);
> }
> }
> @@ -2247,8 +2232,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> iov_sock[0].iov_base = tcp_buf_discard;
> iov_sock[0].iov_len = already_sent;
>
> - if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) ||
> - (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) {
> + if (( v4 && tcp4_l2_buf_used + fill_bufs > TCP_FRAMES_MEM) ||
> + (!v4 && tcp6_l2_buf_used + fill_bufs > TCP_FRAMES_MEM)) {
> tcp_l2_data_buf_flush(c);
>
> /* Silence Coverity CWE-125 false positive */
> @@ -2257,9 +2242,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>
> for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
> if (v4)
> - iov->iov_base = &tcp4_l2_buf[tcp4_l2_buf_used + i].data;
> + iov->iov_base = &tcp4_l2_payload[tcp4_l2_buf_used + i].data;
> else
> - iov->iov_base = &tcp6_l2_buf[tcp6_l2_buf_used + i].data;
> + iov->iov_base = &tcp6_l2_payload[tcp6_l2_buf_used + i].data;
> iov->iov_len = mss;
> }
> if (iov_rem)
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-03-18 5:47 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-15 17:51 [RFC v2] tcp: Replace TCP buffer structure by an iovec array Laurent Vivier
2024-03-18 5:47 ` David Gibson [this message]
2024-03-18 8:53 ` Stefano Brivio
2024-03-19 4:51 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZffVdl5k0OSW7e4-@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).