From: Eugenio Perez Martin <eperezma@redhat.com>
Date: Mon, 28 Jul 2025 19:04:42 +0200
Subject: Re: [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer
To: David Gibson
Cc: passt-dev@passt.top, jasowang@redhat.com

On Thu, Jul 24, 2025 at 3:34 AM David Gibson wrote:
>
> On Wed, Jul 09, 2025 at 07:47:48PM +0200, Eugenio Pérez wrote:
> > Now both tcp_sock and tap use the circular buffer as intended.
> >
> > Very lightly tested. Especially, paths like ring full or almost full
> > that are checked before producing, like
> > tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM.
> >
> > Processing the tx buffers in a circular buffer makes namespace rx go
> > from ~11.5Gbit/s to ~17.26Gbit/s.
> >
> > TODO: Increase the tx queue length, as we spend a lot of descriptors in
> > each request. Ideally, tx size should be at least
> > bufs_per_frame*TCP_FRAMES_MEM, but maybe we get more performance with
> > bigger queues.
> >
> > TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: once to free at
> > least N used tx buffers and again in tcp_payload_flush. Maybe we can
> > optimize it.
> >
> > Signed-off-by: Eugenio Pérez
> > ---
> >  tcp_buf.c | 130 ++++++++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 106 insertions(+), 24 deletions(-)
> >
> > diff --git a/tcp_buf.c b/tcp_buf.c
> > index f74d22d..326af79 100644
> > --- a/tcp_buf.c
> > +++ b/tcp_buf.c
> > @@ -53,13 +53,66 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
> >
> >  /* References tracking the owner connection of frames in the tap outqueue */
> >  static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
> > -static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
> > +
> > +/*
> > + * sock_head: Head of buffers available for writing.
> > + * tcp_data_to_tap moves it
> > + * forward, but errors queueing to vhost can move it backwards to tap_head
> > + * again.
> > + *
> > + * tap_head: Head of buffers that have been sent to vhost. flush moves this
> > + * forward.
> > + *
> > + * tail: Chasing index. Increments when vhost uses buffers.
> > + *
> > + * _used: Independent variables to tell between full and empty.
>
> Hm.  I kind of hope there's a less bulky way of doing this.
>

The other option I know is to always keep one entry unused.

> > + */
> > +static unsigned int tcp_payload_sock_head, tcp_payload_tap_head, tcp_payload_tail, tcp_payload_sock_used, tcp_payload_tap_used;
> > +#define IS_POW2(y) (((y) > 0) && !((y) & ((y) - 1)))
>
> Worth putting this in util.h as a separate patch.
>

Agree.

> > +static_assert(ARRAY_SIZE(tcp_payload) == TCP_FRAMES_MEM, "TCP_FRAMES_MEM is not the size of tcp_payload anymore");
> > +static_assert(IS_POW2(TCP_FRAMES_MEM), "TCP_FRAMES_MEM must be a power of two");
> > +
> > +static size_t tcp_payload_cnt_to_end(size_t head, size_t tail)
> > +{
> > +	assert(head != tail);
> > +	size_t end = ARRAY_SIZE(tcp_payload) - tail;
> > +	size_t n = (head + end) % ARRAY_SIZE(tcp_payload);
> > +
> > +	return MIN(n, end);
> > +}
> > +
> > +/* Count the number of items that has been written from sock to the
> > + * curcular buffer and can be sent to tap.
>
> s/curcular/circular/g
>

Thanks for the catch, fixing in the next release!

> > + */
> > +static size_t tcp_payload_tap_cnt(void)
> > +{
> > +	return tcp_payload_sock_used - tcp_payload_tap_used;
> > +}
> >
> >  static void tcp_payload_sock_produce(size_t n)
> >  {
> > +	tcp_payload_sock_head = (tcp_payload_sock_head + n) % ARRAY_SIZE(tcp_payload);
> >  	tcp_payload_sock_used += n;
> >  }
> >
> > +/* Count the number of consecutive items that has been written from sock to the
> > + * curcular buffer and can be sent to tap without having to wrap back to the
> > + * beginning of the buffer.
> > + */
> > +static size_t tcp_payload_tap_cnt_to_end(void)
> > +{
> > +	if (tcp_payload_sock_head == tcp_payload_tap_head) {
> > +		/* empty? */
> > +		if (tcp_payload_sock_used - tcp_payload_tap_used == 0)
> > +			return 0;
> > +
> > +		/* full */
> > +		return ARRAY_SIZE(tcp_payload) - tcp_payload_tap_head;
> > +	}
> > +
> > +	return tcp_payload_cnt_to_end(tcp_payload_sock_head,
> > +				      tcp_payload_tap_head);
> > +}
> > +
> >  static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
> >
> >  /**
> > @@ -137,14 +190,13 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
> >  	}
> >  }
> >
> > -static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
> > +static void tcp_buf_free_old_tap_xmit(const struct ctx *c, size_t target)
> >  {
> > -	while (tcp_payload_tap_used) {
> > -		tap_free_old_xmit(c, tcp_payload_tap_used);
> > +	size_t n = tap_free_old_xmit(c, target);
> >
> > -		tcp_payload_tap_used = 0;
> > -		tcp_payload_sock_used = 0;
> > -	}
> > +	tcp_payload_tail = (tcp_payload_tail + n) & (ARRAY_SIZE(tcp_payload) - 1);
>
> use % instead of & here - it's consistent with other places, and the
> compiler should be able to optimize it to the same thing.
>
> > +	tcp_payload_tap_used -= n;
> > +	tcp_payload_sock_used -= n;
> >  }
> >
> >  /**
> > @@ -153,16 +205,33 @@ static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
> >   */
> >  void tcp_payload_flush(const struct ctx *c)
> >  {
> > -	size_t m;
> > +	size_t m, n = tcp_payload_tap_cnt_to_end();
> > +	struct iovec *head = &tcp_l2_iov[tcp_payload_tap_head][0];
> >
> > -	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
> > -			    tcp_payload_sock_used, true);
> > -	if (m != tcp_payload_sock_used) {
> > -		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
> > -			       tcp_payload_sock_used - m);
> > -	}
> > +	tcp_buf_free_old_tap_xmit(c, (size_t)-1);
> > +	m = tap_send_frames(c, head, TCP_NUM_IOVS, n, true);
> >  	tcp_payload_tap_used += m;
> > -	tcp_buf_free_old_tap_xmit(c);
> > +	tcp_payload_tap_head = (tcp_payload_tap_head + m) %
> > +			       ARRAY_SIZE(tcp_payload);
> > +
> > +	if (m != n) {
> > +		n = tcp_payload_tap_cnt_to_end();
> > +
> > +		tcp_revert_seq(c, &tcp_frame_conns[tcp_payload_tap_head],
> > +			       &tcp_l2_iov[tcp_payload_tap_head], n);
> > +		/*
> > +		 * circular buffer wrap case.
> > +		 * TODO: Maybe it's better to adapt tcp_revert_seq.
> > +		 */
> > +		tcp_revert_seq(c, &tcp_frame_conns[0], &tcp_l2_iov[0],
> > +			       tcp_payload_tap_cnt() - n);
> > +
> > +		tcp_payload_sock_head = tcp_payload_tap_head;
> > +		tcp_payload_sock_used = tcp_payload_tap_used;
> > +	} else if (tcp_payload_tap_cnt_to_end()) {
> > +		/* circular buffer wrap case */
> > +		tcp_payload_flush(c);
> > +	}
> >  }
> >
> >  /**
> > @@ -209,14 +278,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  	size_t optlen;
> >  	size_t l4len;
> >  	uint32_t seq;
> > +	unsigned int i = tcp_payload_sock_head;
> >  	int ret;
> >
> > -	iov = tcp_l2_iov[tcp_payload_sock_used];
> > +	iov = tcp_l2_iov[i];
> >  	if (CONN_V4(conn)) {
> > -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
> > +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
> >  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> >  	} else {
> > -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
> > +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
> >  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
> >  	}
> >
> > @@ -228,13 +298,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  		return ret;
> >
> >  	tcp_payload_sock_produce(1);
> > +	i = tcp_payload_sock_head;
> >  	l4len = optlen + sizeof(struct tcphdr);
> >  	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> >  	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
> >
> >  	if (flags & DUP_ACK) {
> > -		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
> > +		struct iovec *dup_iov = tcp_l2_iov[i];
> >  		tcp_payload_sock_produce(1);
> > +		i = tcp_payload_sock_head;
> >
> >  		memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
> >  		       iov[TCP_IOV_TAP].iov_len);
> > @@ -246,7 +318,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> >  	}
> >
> >  	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
> > +		tcp_buf_free_old_tap_xmit(c, 2);
> >  		tcp_payload_flush(c);
> > +		/* TODO how to fix this? original code didn't check for success either */
> > +		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 2);
> >  	}
> >
> >  	return 0;
> > @@ -269,16 +344,17 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	struct iovec *iov;
> >
> >  	conn->seq_to_tap = seq + dlen;
> > -	tcp_frame_conns[tcp_payload_sock_used] = conn;
> > -	iov = tcp_l2_iov[tcp_payload_sock_used];
> > +	tcp_frame_conns[tcp_payload_sock_head] = conn;
> > +	iov = tcp_l2_iov[tcp_payload_sock_head];
> >  	if (CONN_V4(conn)) {
> >  		if (no_csum) {
> > -			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
> > +			unsigned prev = (tcp_payload_sock_head - 1) % TCP_FRAMES_MEM;
> > +			struct iovec *iov_prev = tcp_l2_iov[prev];
> >  			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
> >
> >  			check = &iph->check;
> >  		}
> > -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
> > +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_head]);
> >  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
> >  	} else if (CONN_V6(conn)) {
> >  		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
> > @@ -294,8 +370,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
> >  	tcp_payload_sock_produce(1);
> >  	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
> > +		tcp_buf_free_old_tap_xmit(c, 1);
> >  		tcp_payload_flush(c);
> > +		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 1);
> >  	}
> > +
> >  }
> >
> >  /**
> > @@ -362,11 +441,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >  	}
> >
> >  	if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
> > +		tcp_buf_free_old_tap_xmit(c, fill_bufs);
> >  		tcp_payload_flush(c);
> > +		/* TODO how to report this to upper layers? */
> > +		assert(tcp_payload_sock_used + fill_bufs <= TCP_FRAMES_MEM);
> >  	}
> >
> >  	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
> > -		iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
> > +		iov->iov_base = &tcp_payload[(tcp_payload_sock_head + i) % TCP_FRAMES_MEM].data;
> >  		iov->iov_len = mss;
> >  	}
> >  	if (iov_rem)
>
> --
> David Gibson (he or they)			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
> 					| around.
> http://www.ozlabs.org/~dgibson