From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 3BC875A026F for ; Wed, 13 Mar 2024 12:38:08 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1710329887; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=iPV0W/WRzG/EPlBmk2gPE4A9R+sA83g16xLd/vRNNik=; b=YmmuXfPqjZIqM0Xa707FdbmA/f7vL3aVGFtuj47OzoOtfbxxLk32Vfpf5SGxXYp2A5n9FB YHFNotaMM2IpWw/QV3q1vfV71ulLy0lKQPS5FieYlHFPiuDi23xxFoXhkkwZftTMAWqOCl 3/b/FXt2cVQRVRQjw2CPEMC6EV56uEg= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-21-vVaAKxdwPt2B93Cor9bVbQ-1; Wed, 13 Mar 2024 07:38:05 -0400 X-MC-Unique: vVaAKxdwPt2B93Cor9bVbQ-1 Received: by mail-ej1-f72.google.com with SMTP id a640c23a62f3a-a4660bdb26fso27249466b.0 for ; Wed, 13 Mar 2024 04:38:04 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1710329882; x=1710934682; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=iPV0W/WRzG/EPlBmk2gPE4A9R+sA83g16xLd/vRNNik=; b=Z0vEyYu1RcGP5tfeqlO3VVSiW+m4EyV8GHY5rb1N/6UGWWYKEQ0l4mwxDV8HpPZbpb fsmYt3PtifEbCHkEyUcMxANtKdi/RUN7VmDzSqlXLBa7maQjh80CPudgtQbpUH32l3WR QnBkqlrF92fceywKtpE351BjKoOlcJBGclJto3OnWDoU1F877uWuce2euz22hhqH2NVl t38oJWW4uVWy9Cqe9qgQj1I4j3kvRrOIKcrXtn0B+xlPqqjoDYv5ge4NNrhrY5eIyq+/ Bsgg4HmgkZPu7kFfR/jvk3zza9KKq9XxAQxP3ITInQIUwtI6Gj2xQ5K80t03dyenzOLo P6zA== X-Gm-Message-State: AOJu0YxDdiG1xWnbRR9bUFtVT6gZI7ghrLp1vbSmLmYf0xFnklggCblh X2Z1psEJFhTkobJdRRbjFM96/uPvqNnsdyr17rVSBw2YJWA+wfZsHmpKjIWQNlph8x9z43fjdd/ /9qA8PwWHtG7SE8dkT2pP+hdW1DsH9p1OddPaMuGt7+ZHhknkXQX8jImWjrlfi9bsghyzZapAfG UZNZ3vjEB+ZTvZ4Hda/+W1g7pb8QYupurtYJQ= X-Received: by 2002:a17:907:c289:b0:a46:29ff:8dc1 with SMTP id tk9-20020a170907c28900b00a4629ff8dc1mr4948499ejc.19.1710329882114; Wed, 13 Mar 2024 04:38:02 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGr56TQuzNEFNG1aS+hpEniGH3YJpB+0fOg05x5uABumaFmh3PTD/PfxAq2HG/ifi8JvMEArA== X-Received: by 2002:a17:907:c289:b0:a46:29ff:8dc1 with SMTP id tk9-20020a170907c28900b00a4629ff8dc1mr4948471ejc.19.1710329881272; Wed, 13 Mar 2024 04:38:01 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id m27-20020a17090607db00b00a44b90abb1dsm4718392ejc.110.2024.03.13.04.38.00 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 13 Mar 2024 04:38:00 -0700 (PDT) Date: Wed, 13 Mar 2024 12:37:25 +0100 From: Stefano Brivio To: Laurent Vivier Subject: Re: [RFC] tcp: Replace TCP buffer structure by an iovec array Message-ID: <20240313123725.7a37f311@elisabeth> In-Reply-To: <20240311133356.1405001-1-lvivier@redhat.com> References: <20240311133356.1405001-1-lvivier@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: LVX5YOEJ6EQSCZVJUHCYFZVKFITZDMG7 X-Message-ID-Hash: LVX5YOEJ6EQSCZVJUHCYFZVKFITZDMG7 X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Mon, 11 Mar 2024 14:33:56 +0100 Laurent Vivier wrote: > To be able to provide pointers to TCP headers and IP headers without > worrying about alignment in the structure, split the structure into > several arrays and point to each part of the frame using an iovec array. > > Using iovec also allows us to simply ignore the first entry when the > vnet length header is not needed. And as the payload buffer contains > only the TCP header and the TCP data we can increase the size of the > TCP data to USHRT_MAX - sizeof(struct tcphdr). > > As a side effect, these changes improve performance by a factor of > x1.5. > > Signed-off-by: Laurent Vivier > --- > tap.c | 104 ++++++++++++++ > tap.h | 16 +++ > tcp.c | 429 ++++++++++++++++++++++++---------------------------------- > 3 files changed, 296 insertions(+), 253 deletions(-) > > diff --git a/tap.c b/tap.c > index c7b9372668ec..afd01cffbed6 100644 > --- a/tap.c > +++ b/tap.c > @@ -350,6 +350,30 @@ static size_t tap_send_frames_pasta(const struct ctx *c, > return i; > } > > +/** > + * tap_send_iov_pasta() - Send out multiple prepared frames > + * @c: Execution context > + * @iov: Array of frames, each frames is divided in an array of iovecs. > + * The first entry of the iovec is ignored Fits on single lines (I think it's more readable): * @c: Execution context * @iov: Array of frames, each one a vector of parts, TCP_IOV_VNET unused * @n: Number of frames in @iov > + * @n: Number of frames in @iov > + * > + * Return: number of frames actually sent > + */ > +static size_t tap_send_iov_pasta(const struct ctx *c, > + struct iovec iov[][TCP_IOV_NUM], size_t n) > +{ > + unsigned int i; > + > + for (i = 0; i < n; i++) { > + if (!tap_send_frames_pasta(c, &iov[i][TCP_IOV_ETH], > + TCP_IOV_NUM - TCP_IOV_ETH)) > + break; > + } > + > + return i; > + > +} > + > /** > * tap_send_frames_passt() - Send multiple frames to the passt tap > * @c: Execution context > @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c, > return i; > } > > +/** > + * tap_send_iov_passt() - Send out multiple prepared frames ...I would argue that this function prepares frames as well. Maybe: * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames > + * @c: Execution context > + * @iov: Array of frames, each frames is divided in an array of iovecs. > + * The first entry of the iovec is updated to point to an > + * uint32_t storing the frame length. * @iov: Array of frames, each one a vector of parts, TCP_IOV_VNET blank > + * @n: Number of frames in @iov > + * > + * Return: number of frames actually sent > + */ > +static size_t tap_send_iov_passt(const struct ctx *c, > + struct iovec iov[][TCP_IOV_NUM], > + size_t n) > +{ > + unsigned int i; > + > + for (i = 0; i < n; i++) { > + uint32_t vnet_len; > + int j; > + > + vnet_len = 0; This could be initialised in the declaration (yes, it's "reset" at every loop iteration). > + for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++) > + vnet_len += iov[i][j].iov_len; > + > + vnet_len = htonl(vnet_len); > + iov[i][TCP_IOV_VNET].iov_base = &vnet_len; > + iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len); > + > + if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM)) ...which would now send a single frame at a time, but actually it can already send everything in one shot because it's using sendmsg(), if you move it outside of the loop and do something like (untested): return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n); > + break; > + } > + > + return i; > + > +} > + > /** > * tap_send_frames() - Send out multiple prepared frames > * @c: Execution context > @@ -418,6 +478,50 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n) > return m; > } > > +/** > + * tap_send_iov() - Send out multiple prepared frames > + * @c: Execution context > + * @iov: Array of frames, each frames is divided in an array of iovecs. > + * iovec array is: I think this description should be moved before the defines -- it could even be an enum with fixed numbers, and I just realised that in kernel-doc enums should be documented in the same way as structs (we didn't do that so far), that is: /** * enum tcp_iov_parts - I/O vector parts for one TCP frame * @TCP_IOV_NET: vnet length: host order, 32-bit frame length descriptor * ... */ enum tcp_iov_parts { /* vnet length (host order, 32-bit length descriptor) */ TCP_IOV_VNET = 0, [...] }; > + * TCP_IOV_VNET (0) vnet length > + * TCP_IOV_ETH (1) ethernet header > + * TCP_IOV_IP (2) IP (v4/v6) header > + * TCP_IOV_PAYLOAD (3) IP payload (TCP header + data) > + * TCP_IOV_NUM (4) is the number of entries in the iovec array > + * TCP_IOV_VNET entry is updated with passt, ignored with pasta. > + * @n: Number of frames in @iov > + * > + * Return: number of frames actually sent > + */ > +size_t tap_send_iov(const struct ctx *c, struct iovec iov[][TCP_IOV_NUM], > + size_t n) > +{ > + size_t m; > + unsigned int i; > + > + if (!n) > + return 0; > + > + switch (c->mode) { > + case MODE_PASST: > + m = tap_send_iov_passt(c, iov, n); > + break; > + case MODE_PASTA: > + m = tap_send_iov_pasta(c, iov, n); > + break; > + default: > + ASSERT(0); > + } > + > + if (m < n) > + debug("tap: failed to send %zu frames of %zu", n - m, n); > + > + for (i = 0; i < m; i++) > + pcap_iov(&iov[i][TCP_IOV_ETH], TCP_IOV_NUM - TCP_IOV_ETH); > + > + return m; > +} > + > /** > * eth_update_mac() - Update tap L2 header with new Ethernet addresses > * @eh: Ethernet headers to update > diff --git a/tap.h b/tap.h > index 437b9aa2b43f..2d8e7aa1101d 100644 > --- a/tap.h > +++ b/tap.h > @@ -6,6 +6,20 @@ > #ifndef TAP_H > #define TAP_H > > +/* > + * TCP frame iovec array: > + * TCP_IOV_VNET vnet length > + * TCP_IOV_ETH ethernet header > + * TCP_IOV_IP IP (v4/v6) header > + * TCP_IOV_PAYLOAD IP payload (TCP header + data) > + * TCP_IOV_NUM is the number of entries in the iovec array > + */ > +#define TCP_IOV_VNET 0 > +#define TCP_IOV_ETH 1 > +#define TCP_IOV_IP 2 > +#define TCP_IOV_PAYLOAD 3 > +#define TCP_IOV_NUM 4 > + > /** > * struct tap_hdr - L2 and tap specific headers > * @vnet_len: Frame length (for qemu socket transport) > @@ -74,6 +88,8 @@ void tap_icmp6_send(const struct ctx *c, > const void *in, size_t len); > int tap_send(const struct ctx *c, const void *data, size_t len); > size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t n); > +size_t tap_send_iov(const struct ctx *c, struct iovec iov[][TCP_IOV_NUM], > + size_t n); > void eth_update_mac(struct ethhdr *eh, > const unsigned char *eth_d, const unsigned char *eth_s); > void tap_listen_handler(struct ctx *c, uint32_t events); > diff --git a/tcp.c b/tcp.c > index d5eedf4d0138..5c8488108ef7 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -318,39 +318,7 @@ > > /* MSS rounding: see SET_MSS() */ > #define MSS_DEFAULT 536 > - > -struct tcp4_l2_head { /* For MSS4 macro: keep in sync with tcp4_l2_buf_t */ > -#ifdef __AVX2__ > - uint8_t pad[26]; > -#else > - uint8_t pad[2]; > -#endif > - struct tap_hdr taph; > - struct iphdr iph; > - struct tcphdr th; > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))); > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))); > -#endif > - > -struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ > -#ifdef __AVX2__ > - uint8_t pad[14]; > -#else > - uint8_t pad[2]; > -#endif > - struct tap_hdr taph; > - struct ipv6hdr ip6h; > - struct tcphdr th; > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))); > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))); > -#endif I'm so glad to see this going away. > -#define MSS4 ROUND_DOWN(USHRT_MAX - sizeof(struct tcp4_l2_head), 4) > -#define MSS6 ROUND_DOWN(USHRT_MAX - sizeof(struct tcp6_l2_head), 4) > > +#define MSS (USHRT_MAX - sizeof(struct tcphdr)) We can't exceed USHRT_MAX at Layer-2 level, though, so the MSS for IPv4 should now be: #define MSS4 ROUND_DOWN(USHRT_MAX - sizeof(struct ethhdr) - sizeof(struct iphdr) - sizeof(struct tcphdr), 4) ...and similar for IPv6. > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > #ifdef HAS_SND_WND > @@ -445,133 +413,78 @@ struct tcp_buf_seq_update { > }; > > /* Static buffers */ > - > /** > - * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections > - * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only > - * @taph: Tap-level headers (partially pre-filled) > - * @iph: Pre-filled IP header (except for tot_len and saddr) > - * @uh: Headroom for TCP header > - * @data: Storage for TCP payload > + * tcp_l2_flags_t - TCP header and data to send option flags 'struct tcp_l2_flags_t - ...', by the way (pre-existing, I see). > + * @th: TCP header > + * @opts TCP option flags > */ > -static struct tcp4_l2_buf_t { > -#ifdef __AVX2__ > - uint8_t pad[26]; /* 0, align th to 32 bytes */ > -#else > - uint8_t pad[2]; /* align iph to 4 bytes 0 */ > -#endif > - struct tap_hdr taph; /* 26 2 */ > - struct iphdr iph; /* 44 20 */ > - struct tcphdr th; /* 64 40 */ > - uint8_t data[MSS4]; /* 84 60 */ > - /* 65536 65532 */ > +struct tcp_l2_flags_t { > + struct tcphdr th; > + char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; > +}; This should still be aligned (when declared below) with: #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))) #else } __attribute__ ((packed, aligned(__alignof__(unsigned int)))) #endif because yes, now you get the unsigned int alignment for free, but the 32-byte boundary for AVX2 (checksums) not really, so that needs to be there, and then for clarity I would keep the explicit attribute for the non-AVX2 case. By the way, I guess you could still combine the definition and declaration as it was done earlier. > +/** > + * tcp_l2_payload_t - TCP header and data to send data > + * 32 bytes aligned to be able to use AVX2 checksum > + * @th: TCP header > + * @data: TCP data > + */ > +struct tcp_l2_payload_t { > + struct tcphdr th; /* 20 bytes */ > + uint8_t data[MSS]; /* 65516 bytes */ > #ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))) > +} __attribute__ ((packed, aligned(32))); > #else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))) > +} __attribute__ ((packed, aligned(__alignof__(unsigned int)))); > #endif > -tcp4_l2_buf[TCP_FRAMES_MEM]; > > -static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; > +/* Ethernet header for IPv4 frames */ > +static struct ethhdr tcp4_eth_src; > > +/* IPv4 headers */ > +static struct iphdr tcp4_l2_ip[TCP_FRAMES_MEM]; > +/* TCP headers and data for IPv4 frames */ > +static struct tcp_l2_payload_t tcp4_l2_payload[TCP_FRAMES_MEM]; > + > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; > static unsigned int tcp4_l2_buf_used; > > -/** > - * tcp6_l2_buf_t - Pre-cooked IPv6 packet buffers for tap connections > - * @pad: Align IPv6 header for checksum calculation to 32B (AVX2) or 4B > - * @taph: Tap-level headers (partially pre-filled) > - * @ip6h: Pre-filled IP header (except for payload_len and addresses) > - * @th: Headroom for TCP header > - * @data: Storage for TCP payload > - */ > -struct tcp6_l2_buf_t { > -#ifdef __AVX2__ > - uint8_t pad[14]; /* 0 align ip6h to 32 bytes */ > -#else > - uint8_t pad[2]; /* align ip6h to 4 bytes 0 */ > -#endif > - struct tap_hdr taph; /* 14 2 */ > - struct ipv6hdr ip6h; /* 32 20 */ > - struct tcphdr th; /* 72 60 */ > - uint8_t data[MSS6]; /* 92 80 */ > - /* 65536 65532 */ > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))) > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))) > -#endif > -tcp6_l2_buf[TCP_FRAMES_MEM]; > +/* IPv4 headers for TCP option flags frames */ > +static struct iphdr tcp4_l2_flags_ip[TCP_FRAMES_MEM]; > +/* TCP headers and option flags for IPv4 frames */ > +static struct tcp_l2_flags_t tcp4_l2_flags[TCP_FRAMES_MEM]; > > -static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; > +static unsigned int tcp4_l2_flags_buf_used; > + > +/* Ethernet header for IPv6 frames */ > +static struct ethhdr tcp6_eth_src; > > +/* IPv6 headers */ > +static struct ipv6hdr tcp6_l2_ip[TCP_FRAMES_MEM]; > +/* TCP headers and data for IPv6 frames */ > +static struct tcp_l2_payload_t tcp6_l2_payload[TCP_FRAMES_MEM]; > + > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; > static unsigned int tcp6_l2_buf_used; > > +/* IPv6 headers for TCP option flags frames */ > +static struct ipv6hdr tcp6_l2_flags_ip[TCP_FRAMES_MEM]; > +/* TCP headers and option flags for IPv6 frames */ > +static struct tcp_l2_flags_t tcp6_l2_flags[TCP_FRAMES_MEM]; > + > +static unsigned int tcp6_l2_flags_buf_used; > + > /* recvmsg()/sendmsg() data for tap */ > static char tcp_buf_discard [MAX_WINDOW]; > static struct iovec iov_sock [TCP_FRAMES_MEM + 1]; > > -static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM]; > -static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM]; > -static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM]; > -static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM]; > +static struct iovec tcp4_l2_iov [TCP_FRAMES_MEM][TCP_IOV_NUM]; > +static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_IOV_NUM]; > +static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_IOV_NUM]; > +static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_IOV_NUM]; > > /* sendmsg() to socket */ > static struct iovec tcp_iov [UIO_MAXIOV]; > > -/** > - * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags) > - * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only > - * @taph: Tap-level headers (partially pre-filled) > - * @iph: Pre-filled IP header (except for tot_len and saddr) > - * @th: Headroom for TCP header > - * @opts: Headroom for TCP options > - */ > -static struct tcp4_l2_flags_buf_t { > -#ifdef __AVX2__ > - uint8_t pad[26]; /* 0, align th to 32 bytes */ > -#else > - uint8_t pad[2]; /* align iph to 4 bytes 0 */ > -#endif > - struct tap_hdr taph; /* 26 2 */ > - struct iphdr iph; /* 44 20 */ > - struct tcphdr th; /* 64 40 */ > - char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))) > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))) > -#endif > -tcp4_l2_flags_buf[TCP_FRAMES_MEM]; > - > -static unsigned int tcp4_l2_flags_buf_used; > - > -/** > - * tcp6_l2_flags_buf_t - IPv6 packet buffers for segments without data (flags) > - * @pad: Align IPv6 header for checksum calculation to 32B (AVX2) or 4B > - * @taph: Tap-level headers (partially pre-filled) > - * @ip6h: Pre-filled IP header (except for payload_len and addresses) > - * @th: Headroom for TCP header > - * @opts: Headroom for TCP options > - */ > -static struct tcp6_l2_flags_buf_t { > -#ifdef __AVX2__ > - uint8_t pad[14]; /* 0 align ip6h to 32 bytes */ > -#else > - uint8_t pad[2]; /* align ip6h to 4 bytes 0 */ > -#endif > - struct tap_hdr taph; /* 14 2 */ > - struct ipv6hdr ip6h; /* 32 20 */ > - struct tcphdr th /* 72 */ __attribute__ ((aligned(4))); /* 60 */ > - char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))) > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))) > -#endif > -tcp6_l2_flags_buf[TCP_FRAMES_MEM]; > - > -static unsigned int tcp6_l2_flags_buf_used; > - > #define CONN(idx) (&(FLOW(idx)->tcp)) > > /* Table for lookup from remote address, local port, remote port */ > @@ -973,19 +886,8 @@ static void tcp_update_check_tcp6(struct ipv6hdr *ip6h, struct tcphdr *th) > */ > void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) This doesn't update "L2 buffers" anymore, just the Ethernet addresses (function comment should be updated). > { > - int i; > - > - for (i = 0; i < TCP_FRAMES_MEM; i++) { > - struct tcp4_l2_flags_buf_t *b4f = &tcp4_l2_flags_buf[i]; > - struct tcp6_l2_flags_buf_t *b6f = &tcp6_l2_flags_buf[i]; > - struct tcp4_l2_buf_t *b4 = &tcp4_l2_buf[i]; > - struct tcp6_l2_buf_t *b6 = &tcp6_l2_buf[i]; > - > - eth_update_mac(&b4->taph.eh, eth_d, eth_s); > - eth_update_mac(&b6->taph.eh, eth_d, eth_s); > - eth_update_mac(&b4f->taph.eh, eth_d, eth_s); > - eth_update_mac(&b6f->taph.eh, eth_d, eth_s); > - } > + eth_update_mac(&tcp4_eth_src, eth_d, eth_s); > + eth_update_mac(&tcp6_eth_src, eth_d, eth_s); > } > > /** > @@ -995,29 +897,42 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) > static void tcp_sock4_iov_init(const struct ctx *c) > { > struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP); > - struct iovec *iov; > int i; > > - for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) { > - tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) { > - .taph = TAP_HDR_INIT(ETH_P_IP), > - .iph = iph, > - .th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 } > - }; > - } > + (void)c; ...then drop it from the parameter list? > > - for (i = 0; i < ARRAY_SIZE(tcp4_l2_flags_buf); i++) { > - tcp4_l2_flags_buf[i] = (struct tcp4_l2_flags_buf_t) { > - .taph = TAP_HDR_INIT(ETH_P_IP), > - .iph = L2_BUF_IP4_INIT(IPPROTO_TCP) > - }; > - } > + tcp4_eth_src.h_proto = htons_constant(ETH_P_IP); > + for (i = 0; i < TCP_FRAMES_MEM; i++) { > + struct iovec *iov; > + > + /* headers */ > + tcp4_l2_ip[i] = iph; > + tcp4_l2_payload[i].th = (struct tcphdr){ > + .doff = sizeof(struct tcphdr) / 4, > + .ack = 1 > + }; > > - for (i = 0, iov = tcp4_l2_iov; i < TCP_FRAMES_MEM; i++, iov++) > - iov->iov_base = tap_iov_base(c, &tcp4_l2_buf[i].taph); > + tcp4_l2_flags_ip[i] = iph; > + tcp4_l2_flags[i].th = (struct tcphdr){ > + .doff = sizeof(struct tcphdr) / 4, > + .ack = 1 > + }; > > - for (i = 0, iov = tcp4_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++) > - iov->iov_base = tap_iov_base(c, &tcp4_l2_flags_buf[i].taph); > + /* iovecs */ > + iov = tcp4_l2_iov[i]; > + iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src; > + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr); > + iov[TCP_IOV_IP].iov_base = &tcp4_l2_ip[i]; > + iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr); > + iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_payload[i]; > + > + iov = tcp4_l2_flags_iov[i]; > + iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src; > + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr); > + iov[TCP_IOV_IP].iov_base = &tcp4_l2_flags_ip[i]; > + iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr); > + iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_flags[i]; > + } > } > > /** > @@ -1026,29 +941,43 @@ static void tcp_sock4_iov_init(const struct ctx *c) > */ > static void tcp_sock6_iov_init(const struct ctx *c) > { > - struct iovec *iov; > + struct ipv6hdr ip6 = L2_BUF_IP6_INIT(IPPROTO_TCP); > int i; > > - for (i = 0; i < ARRAY_SIZE(tcp6_l2_buf); i++) { > - tcp6_l2_buf[i] = (struct tcp6_l2_buf_t) { > - .taph = TAP_HDR_INIT(ETH_P_IPV6), > - .ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP), > - .th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 } > - }; > - } > + (void)c; > > - for (i = 0; i < ARRAY_SIZE(tcp6_l2_flags_buf); i++) { > - tcp6_l2_flags_buf[i] = (struct tcp6_l2_flags_buf_t) { > - .taph = TAP_HDR_INIT(ETH_P_IPV6), > - .ip6h = L2_BUF_IP6_INIT(IPPROTO_TCP) > - }; > - } > + tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6); > + for (i = 0; i < TCP_FRAMES_MEM; i++) { > + struct iovec *iov; > + > + /* headers */ > + tcp6_l2_ip[i] = ip6; > + tcp6_l2_payload[i].th = (struct tcphdr){ > + .doff = sizeof(struct tcphdr) / 4, > + .ack = 1 > + }; > > - for (i = 0, iov = tcp6_l2_iov; i < TCP_FRAMES_MEM; i++, iov++) > - iov->iov_base = tap_iov_base(c, &tcp6_l2_buf[i].taph); > + tcp6_l2_flags_ip[i] = ip6; > + tcp6_l2_flags[i].th = (struct tcphdr){ > + .doff = sizeof(struct tcphdr) / 4, > + .ack = 1 > + }; > > - for (i = 0, iov = tcp6_l2_flags_iov; i < TCP_FRAMES_MEM; i++, iov++) > - iov->iov_base = tap_iov_base(c, &tcp6_l2_flags_buf[i].taph); > + /* iovecs */ > + iov = tcp6_l2_iov[i]; > + iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src; > + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr); > + iov[TCP_IOV_IP].iov_base = &tcp6_l2_ip[i]; > + iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr); > + iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_payload[i]; > + > + iov = tcp6_l2_flags_iov[i]; > + iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src; > + iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr); > + iov[TCP_IOV_IP].iov_base = &tcp6_l2_flags_ip[i]; > + iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr); > + iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_l2_flags[i]; > + } > } > > /** > @@ -1289,10 +1218,10 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); > */ > static void tcp_l2_flags_buf_flush(const struct ctx *c) > { > - tap_send_frames(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used); > + tap_send_iov(c, tcp6_l2_flags_iov, tcp6_l2_flags_buf_used); > tcp6_l2_flags_buf_used = 0; > > - tap_send_frames(c, tcp4_l2_flags_iov, tcp4_l2_flags_buf_used); > + tap_send_iov(c, tcp4_l2_flags_iov, tcp4_l2_flags_buf_used); > tcp4_l2_flags_buf_used = 0; > } > > @@ -1305,12 +1234,12 @@ static void tcp_l2_data_buf_flush(const struct ctx *c) > unsigned i; > size_t m; > > - m = tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); > + m = tap_send_iov(c, tcp6_l2_iov, tcp6_l2_buf_used); > for (i = 0; i < m; i++) > *tcp6_l2_buf_seq_update[i].seq += tcp6_l2_buf_seq_update[i].len; > tcp6_l2_buf_used = 0; > > - m = tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); > + m = tap_send_iov(c, tcp4_l2_iov, tcp4_l2_buf_used); > for (i = 0; i < m; i++) > *tcp4_l2_buf_seq_update[i].seq += tcp4_l2_buf_seq_update[i].len; > tcp4_l2_buf_used = 0; > @@ -1433,7 +1362,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, > * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers > * @c: Execution context > * @conn: Connection pointer > - * @p: Pointer to any type of TCP pre-cooked buffer > + * @iov: Pointer to an array of iovec of TCP pre-cooked buffer > * @plen: Payload length (including TCP header options) > * @check: Checksum, if already known > * @seq: Sequence number for this segment > @@ -1442,26 +1371,22 @@ static size_t tcp_fill_headers6(const struct ctx *c, > */ > static size_t tcp_l2_buf_fill_headers(const struct ctx *c, > const struct tcp_tap_conn *conn, > - void *p, size_t plen, > + struct iovec *iov, size_t plen, > const uint16_t *check, uint32_t seq) > { > const struct in_addr *a4 = inany_v4(&conn->faddr); > size_t ip_len, tlen; > > if (a4) { > - struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p; > - > - ip_len = tcp_fill_headers4(c, conn, &b->iph, &b->th, plen, > + ip_len = tcp_fill_headers4(c, conn, iov[TCP_IOV_IP].iov_base, > + iov[TCP_IOV_PAYLOAD].iov_base, plen, > check, seq); > - > - tlen = tap_iov_len(c, &b->taph, ip_len); > + tlen = ip_len - sizeof(struct iphdr); > } else { > - struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p; > - > - ip_len = tcp_fill_headers6(c, conn, &b->ip6h, &b->th, plen, > + ip_len = tcp_fill_headers6(c, conn, iov[TCP_IOV_IP].iov_base, > + iov[TCP_IOV_PAYLOAD].iov_base, plen, > seq); > - > - tlen = tap_iov_len(c, &b->taph, ip_len); > + tlen = ip_len - sizeof(struct ipv6hdr); > } > > return tlen; > @@ -1595,16 +1520,15 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) > { > uint32_t prev_ack_to_tap = conn->seq_ack_to_tap; > uint32_t prev_wnd_to_tap = conn->wnd_to_tap; > - struct tcp4_l2_flags_buf_t *b4 = NULL; > - struct tcp6_l2_flags_buf_t *b6 = NULL; > + struct tcp_l2_flags_t *payload; > struct tcp_info tinfo = { 0 }; > socklen_t sl = sizeof(tinfo); > int s = conn->sock; > size_t optlen = 0; > struct iovec *iov; > + struct iovec *dup_iov; This should now go before 'int s' (longest to shortest), or the declaration could be combined on the same line (it's really symmetric) with *iov. > struct tcphdr *th; > char *data; > - void *p; > > if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) && > !flags && conn->wnd_to_tap) > @@ -1627,18 +1551,15 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) > return 0; > > if (CONN_V4(conn)) { > - iov = tcp4_l2_flags_iov + tcp4_l2_flags_buf_used; > - p = b4 = tcp4_l2_flags_buf + tcp4_l2_flags_buf_used++; > - th = &b4->th; > - > - /* gcc 11.2 would complain on data = (char *)(th + 1); */ > - data = b4->opts; > + iov = tcp4_l2_flags_iov[tcp4_l2_flags_buf_used++]; > + dup_iov = tcp4_l2_flags_iov[tcp4_l2_flags_buf_used]; > } else { > - iov = tcp6_l2_flags_iov + tcp6_l2_flags_buf_used; > - p = b6 = tcp6_l2_flags_buf + tcp6_l2_flags_buf_used++; > - th = &b6->th; > - data = b6->opts; > + iov = tcp6_l2_flags_iov[tcp6_l2_flags_buf_used++]; > + dup_iov = tcp6_l2_flags_iov[tcp6_l2_flags_buf_used]; Maybe you could move the dup_iov assignments below, and group these with the increments of tcp[46]_l2_flags_buf_used, otherwise I'll scream "off-by-one" every time I see this (even though it's not actually the case). > } > + payload = iov[TCP_IOV_PAYLOAD].iov_base; > + th = &payload->th; > + data = payload->opts; > > if (flags & SYN) { > int mss; > @@ -1653,10 +1574,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) > mss = tinfo.tcpi_snd_mss; > } else { > mss = c->mtu - sizeof(struct tcphdr); > - if (CONN_V4(conn)) > - mss -= sizeof(struct iphdr); > - else > - mss -= sizeof(struct ipv6hdr); > > if (c->low_wmem && > !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) > @@ -1688,8 +1605,9 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) > th->syn = !!(flags & SYN); > th->fin = !!(flags & FIN); > > - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, p, optlen, > - NULL, conn->seq_to_tap); > + iov[TCP_IOV_PAYLOAD].iov_len = tcp_l2_buf_fill_headers(c, conn, iov, > + optlen, NULL, > + conn->seq_to_tap); > > if (th->ack) { > if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap)) > @@ -1705,23 +1623,26 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) > if (th->fin || th->syn) > conn->seq_to_tap++; > > + if (flags & DUP_ACK) { > + int i; > + for (i = 0; i < TCP_IOV_NUM; i++) { > + memcpy(dup_iov[i].iov_base, iov[i].iov_base, > + iov[i].iov_len); > + dup_iov[i].iov_len = iov[i].iov_len; > + } > + } > + > if (CONN_V4(conn)) { > - if (flags & DUP_ACK) { > - memcpy(b4 + 1, b4, sizeof(*b4)); > - (iov + 1)->iov_len = iov->iov_len; > + if (flags & DUP_ACK) > tcp4_l2_flags_buf_used++; > - } > > - if (tcp4_l2_flags_buf_used > ARRAY_SIZE(tcp4_l2_flags_buf) - 2) > + if (tcp4_l2_flags_buf_used > TCP_FRAMES_MEM - 2) > tcp_l2_flags_buf_flush(c); > } else { > - if (flags & DUP_ACK) { > - memcpy(b6 + 1, b6, sizeof(*b6)); > - (iov + 1)->iov_len = iov->iov_len; > + if (flags & DUP_ACK) > tcp6_l2_flags_buf_used++; > - } > > - if (tcp6_l2_flags_buf_used > ARRAY_SIZE(tcp6_l2_flags_buf) - 2) > + if (tcp6_l2_flags_buf_used > TCP_FRAMES_MEM - 2) > tcp_l2_flags_buf_flush(c); > } > > @@ -1887,15 +1808,14 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn, > unsigned int mss; > int ret; > > + (void)conn; /* unused */ > + > if ((ret = tcp_opt_get(opts, optlen, OPT_MSS, NULL, NULL)) < 0) > mss = MSS_DEFAULT; > else > mss = ret; > > - if (CONN_V4(conn)) > - mss = MIN(MSS4, mss); > - else > - mss = MIN(MSS6, mss); > + mss = MIN(MSS, mss); > > return MIN(mss, USHRT_MAX); > } > @@ -2171,27 +2091,30 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, > struct iovec *iov; > > if (CONN_V4(conn)) { > - struct tcp4_l2_buf_t *b = &tcp4_l2_buf[tcp4_l2_buf_used]; > - const uint16_t *check = no_csum ? &(b - 1)->iph.check : NULL; > + struct iovec *iov_prev = tcp4_l2_iov[tcp4_l2_buf_used - 1]; > + const uint16_t *check = NULL; > + > + if (no_csum) { > + struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base; > + check = &iph->check; > + } > > tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq = seq_update; > tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len = plen; > > - iov = tcp4_l2_iov + tcp4_l2_buf_used++; > - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, > - check, seq); > - if (tcp4_l2_buf_used > ARRAY_SIZE(tcp4_l2_buf) - 1) > + iov = tcp4_l2_iov[tcp4_l2_buf_used++]; > + iov[TCP_IOV_PAYLOAD].iov_len = tcp_l2_buf_fill_headers(c, conn, > + iov, plen, check, seq); > + if (tcp4_l2_buf_used > TCP_FRAMES_MEM - 1) > tcp_l2_data_buf_flush(c); > } else if (CONN_V6(conn)) { > - struct tcp6_l2_buf_t *b = &tcp6_l2_buf[tcp6_l2_buf_used]; > - > tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq = seq_update; > tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len = plen; > > - iov = tcp6_l2_iov + tcp6_l2_buf_used++; > - iov->iov_len = tcp_l2_buf_fill_headers(c, conn, b, plen, > - NULL, seq); > - if (tcp6_l2_buf_used > ARRAY_SIZE(tcp6_l2_buf) - 1) > + iov = tcp6_l2_iov[tcp6_l2_buf_used++]; > + iov[TCP_IOV_PAYLOAD].iov_len = tcp_l2_buf_fill_headers(c, conn, > + iov, plen, NULL, seq); Maybe use a temporary variable? > + if (tcp6_l2_buf_used > TCP_FRAMES_MEM - 1) > tcp_l2_data_buf_flush(c); > } > } > @@ -2247,8 +2170,8 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > iov_sock[0].iov_base = tcp_buf_discard; > iov_sock[0].iov_len = already_sent; > > - if (( v4 && tcp4_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp4_l2_buf)) || > - (!v4 && tcp6_l2_buf_used + fill_bufs > ARRAY_SIZE(tcp6_l2_buf))) { > + if (( v4 && tcp4_l2_buf_used + fill_bufs > TCP_FRAMES_MEM) || > + (!v4 && tcp6_l2_buf_used + fill_bufs > TCP_FRAMES_MEM)) { > tcp_l2_data_buf_flush(c); > > /* Silence Coverity CWE-125 false positive */ > @@ -2257,9 +2180,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > > for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) { > if (v4) > - iov->iov_base = &tcp4_l2_buf[tcp4_l2_buf_used + i].data; > + iov->iov_base = &tcp4_l2_payload[tcp4_l2_buf_used + i].data; > else > - iov->iov_base = &tcp6_l2_buf[tcp6_l2_buf_used + i].data; > + iov->iov_base = &tcp6_l2_payload[tcp6_l2_buf_used + i].data; > iov->iov_len = mss; > } > if (iov_rem) The rest looks good to me (at the moment :)). -- Stefano