From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 9 Feb 2024 15:57:51 +1100
From: David Gibson
To: Laurent Vivier
Subject: Re: [PATCH 22/24] tcp: vhost-user RX nocopy
References: <20240202141151.3762941-1-lvivier@redhat.com> <20240202141151.3762941-23-lvivier@redhat.com>
In-Reply-To: <20240202141151.3762941-23-lvivier@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="kz9HI1qqlwTh7lvm"
Content-Disposition: inline
CC: passt-dev@passt.top
List-Id: Development discussion and patches for passt

--kz9HI1qqlwTh7lvm
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Feb 02, 2024 at 03:11:49PM +0100, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier
> ---
>  Makefile |   6 +-
>  tcp.c    |  66 +++++---
>  tcp_vu.c | 447 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h |  10 ++
>  4 files changed, 502 insertions(+), 27 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>
> diff --git a/Makefile b/Makefile
> index 2016b071ddf2..f7a403d19b61 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
>  	igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
>  	passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c \
> -	tcp_buf.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
> +	tcp_buf.c tcp_vu.c udp.c util.c iov.c ip.c virtio.c vhost_user.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>
> @@ -56,8 +56,8 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
>  	flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
>  	netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
> -	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_internal.h udp.h \
> -	util.h iov.h ip.h virtio.h vhost_user.h
> +	tap.h tcp.h tcp_conn.h tcp_splice.h tcp_buf.h tcp_vu.h tcp_internal.h \
> +	udp.h util.h iov.h ip.h virtio.h vhost_user.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>
>  C := \#include \nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/tcp.c b/tcp.c
> index b6aca9f37f19..e829e12fe7c2 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -302,6 +302,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>
>  /* Sides of a flow as we use them in "tap" connections */
>  #define SOCKSIDE	0
> @@ -1034,7 +1035,7 @@ size_t ipv4_fill_headers(const struct ctx *c,
>  	tcp_set_tcp_header(th, conn, seq);
>
>  	th->check = 0;
> -	if (c->mode != MODE_VU || *c->pcap)
> +	if (c->mode != MODE_VU)
>  		th->check = tcp_update_check_tcp4(iph);
>
>  	return ip_len;
> @@ -1072,7 +1073,7 @@ size_t ipv6_fill_headers(const struct ctx *c,
>  	tcp_set_tcp_header(th, conn, seq);
>
>  	th->check = 0;
> -	if (c->mode != MODE_VU || *c->pcap)
> +	if (c->mode != MODE_VU)
>  		th->check = tcp_update_check_tcp6(ip6h);
>
>  	ip6h->hop_limit = 255;
> @@ -1302,6 +1303,12 @@ int do_tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags,
>  	return 1;
>  }
>
> +int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +	return tcp_buf_send_flag(c, conn, flags);

Your previous renames to "tcp_buf" make some more sense to me now.
It's not so much that the "tcp_buf" functions are related to buffer
management, but that they belong to the (linear, passt-managed) buffer
implementation of TCP.

I see the rationale, but I still don't really like the name - I don't
think that the connection from "tcp_buf" to "TCP code specific to
several but not all L2 interface implementations" is at all obvious.
Not that a good way of conveying that quickly occurs to me.  For the
time being, I'm inclined to just stick with "tcp", or maybe
"tcp_default" for the existing (tuntap & qemu socket) implementations
and tcp_vu for the new ones.  That can maybe be cleaned up with a more
systematic division of L2 interface types (on my list...).

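For what it's worth, one shape a more systematic split could take is a
small per-mode table of entry points instead of an if/else in each
wrapper.  This is only a sketch: struct tcp_l2_ops, its field, the two
stub backends and the cut-down struct ctx are all invented here for
illustration, and the stubs just return a sign-flipped value so the
dispatch is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real struct ctx carries much more state */
enum mode { MODE_PASST, MODE_PASTA, MODE_VU, MODE_COUNT };
struct ctx { enum mode mode; };

/* Per-backend TCP entry points; struct and field names are made up */
struct tcp_l2_ops {
	int (*send_flag)(struct ctx *c, int flags);
};

/* Stub for the existing (tuntap and qemu socket) path */
static int default_send_flag(struct ctx *c, int flags)
{
	(void)c;
	return flags;
}

/* Stub for the vhost-user path; negated so the two are distinguishable */
static int vu_send_flag(struct ctx *c, int flags)
{
	(void)c;
	return -flags;
}

/* One table entry per mode, so callers never test c->mode themselves */
static const struct tcp_l2_ops tcp_l2_ops[MODE_COUNT] = {
	[MODE_PASST] = { .send_flag = default_send_flag },
	[MODE_PASTA] = { .send_flag = default_send_flag },
	[MODE_VU]    = { .send_flag = vu_send_flag },
};

static int tcp_send_flag(struct ctx *c, int flags)
{
	assert(c->mode < MODE_COUNT);
	return tcp_l2_ops[c->mode].send_flag(c, flags);
}
```

With something like this, the call sites in tcp.c would keep plain
tcp_ names and a new interface type would only add a table entry.
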
> +}
>
>  /**
>   * tcp_rst_do() - Reset a tap connection: send RST segment to tap, close socket
> @@ -1313,7 +1320,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
>  	if (conn->events == CLOSED)
>  		return;
>
> -	if (!tcp_buf_send_flag(c, conn, RST))
> +	if (!tcp_send_flag(c, conn, RST))
>  		conn_event(c, conn, CLOSED);
>  }
>
> @@ -1430,7 +1437,8 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>   *
>   * Return: clamped MSS value
>   */
> -static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
> +static uint16_t tcp_conn_tap_mss(const struct ctx *c,
> +				 const struct tcp_tap_conn *conn,
>  				 const char *opts, size_t optlen)
>  {
>  	unsigned int mss;
> @@ -1441,7 +1449,10 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn,
>  	else
>  		mss = ret;
>
> -	mss = MIN(tcp_buf_conn_tap_mss(conn), mss);
> +	if (c->mode == MODE_VU)
> +		mss = MIN(tcp_vu_conn_tap_mss(conn), mss);
> +	else
> +		mss = MIN(tcp_buf_conn_tap_mss(conn), mss);

This seems oddly complex.  What are the actual circumstances in which
the VU mss would differ from other cases?

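For reference, as far as I can see the VU branch is a no-op as posted:
tcp_vu_conn_tap_mss() further down just returns USHRT_MAX, and the
result is clamped to USHRT_MAX again on return.  A small standalone
check of that follows; MIN() is redefined locally and the helper names
are made up for the comparison, they are not in the tree.

```c
#include <assert.h>
#include <limits.h>

#define MIN(x, y)	(((x) < (y)) ? (x) : (y))

/* MODE_VU path as posted: per-mode clamp, then the final clamp */
static unsigned int mss_vu_as_posted(unsigned int mss)
{
	mss = MIN((unsigned int)USHRT_MAX, mss);
	return MIN(mss, (unsigned int)USHRT_MAX);
}

/* Same path with the per-mode clamp dropped */
static unsigned int mss_final_clamp_only(unsigned int mss)
{
	return MIN(mss, (unsigned int)USHRT_MAX);
}

/* Returns 1 if both variants agree for every value from 0 to @limit */
static int mss_clamps_agree(unsigned int limit)
{
	unsigned int mss;

	assert(limit < UINT_MAX);	/* keeps the loop below finite */

	for (mss = 0; mss <= limit; mss++)
		if (mss_vu_as_posted(mss) != mss_final_clamp_only(mss))
			return 0;

	return 1;
}
```

So unless the VU limit is expected to become something other than
USHRT_MAX, the extra branch doesn't change the result.
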
>  	return MIN(mss, USHRT_MAX);
>  }
> @@ -1568,7 +1579,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>
>  	conn->wnd_to_tap = WINDOW_DEFAULT;
>
> -	mss = tcp_conn_tap_mss(conn, opts, optlen);
> +	mss = tcp_conn_tap_mss(c, conn, opts, optlen);
>  	if (setsockopt(s, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)))
>  		flow_trace(conn, "failed to set TCP_MAXSEG on socket %i", s);
>  	MSS_SET(conn, mss);
> @@ -1625,7 +1636,7 @@ static void tcp_conn_from_tap(struct ctx *c,
>  	} else {
>  		tcp_get_sndbuf(conn);
>
> -		if (tcp_buf_send_flag(c, conn, SYN | ACK))
> +		if (tcp_send_flag(c, conn, SYN | ACK))
>  			return;
>
>  		conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -1673,6 +1684,13 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>  	return 0;
>  }
>
> +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
> +	return tcp_buf_data_from_sock(c, conn);
> +}
>
>  /**
>   * tcp_data_from_tap() - tap/guest data for established connection
> @@ -1806,7 +1824,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			   max_ack_seq, conn->seq_to_tap);
>  		conn->seq_ack_from_tap = max_ack_seq;
>  		conn->seq_to_tap = max_ack_seq;
> -		tcp_buf_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn);

In particular having changed all these calls from tcp_ to tcp_buf_ and
now changing them back seems like churn that it would be nice to
avoid.

>  	}
>
>  	if (!iov_i)
> @@ -1822,14 +1840,14 @@ eintr:
>  			 * Then swiftly looked away and left.
>  			 */
>  			conn->seq_from_tap = seq_from_tap;
> -			tcp_buf_send_flag(c, conn, ACK);
> +			tcp_send_flag(c, conn, ACK);
>  		}
>
>  		if (errno == EINTR)
>  			goto eintr;
>
>  		if (errno == EAGAIN || errno == EWOULDBLOCK) {
> -			tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +			tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  			return p->count - idx;
>
>  		}
> @@ -1839,7 +1857,7 @@ eintr:
>  	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
>  		partial_send = 1;
>  		conn->seq_from_tap += n;
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  	} else {
>  		conn->seq_from_tap += n;
>  	}
> @@ -1852,7 +1870,7 @@ out:
>  		 */
>  		if (conn->seq_dup_ack_approx != (conn->seq_from_tap & 0xff)) {
>  			conn->seq_dup_ack_approx = conn->seq_from_tap & 0xff;
> -			tcp_buf_send_flag(c, conn, DUP_ACK);
> +			tcp_send_flag(c, conn, DUP_ACK);
>  		}
>  		return p->count - idx;
>  	}
> @@ -1866,7 +1884,7 @@ out:
>
>  		conn_event(c, conn, TAP_FIN_RCVD);
>  	} else {
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  	}
>
>  	return p->count - idx;
> @@ -1891,7 +1909,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	if (!(conn->wnd_from_tap >>= conn->ws_from_tap))
>  		conn->wnd_from_tap = 1;
>
> -	MSS_SET(conn, tcp_conn_tap_mss(conn, opts, optlen));
> +	MSS_SET(conn, tcp_conn_tap_mss(c, conn, opts, optlen));
>
>  	conn->seq_init_from_tap = ntohl(th->seq) + 1;
>  	conn->seq_from_tap = conn->seq_init_from_tap;
> @@ -1902,8 +1920,8 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
>  	 */
> -	tcp_buf_data_from_sock(c, conn);
> -	tcp_buf_send_flag(c, conn, ACK);
> +	tcp_data_from_sock(c, conn);
> +	tcp_send_flag(c, conn, ACK);
>  }
>
>  /**
> @@ -1983,7 +2001,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  		conn->seq_from_tap++;
>
>  		shutdown(conn->sock, SHUT_WR);
> -		tcp_buf_send_flag(c, conn, ACK);
> +		tcp_send_flag(c, conn, ACK);
>  		conn_event(c, conn, SOCK_FIN_SENT);
>
>  		return 1;
> @@ -1994,7 +2012,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>
>  	tcp_tap_window_update(conn, ntohs(th->window));
>
> -	tcp_buf_data_from_sock(c, conn);
> +	tcp_data_from_sock(c, conn);
>
>  	if (p->count - idx == 1)
>  		return 1;
> @@ -2024,7 +2042,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  	if ((conn->events & TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_SENT)) {
>  		shutdown(conn->sock, SHUT_WR);
>  		conn_event(c, conn, SOCK_FIN_SENT);
> -		tcp_buf_send_flag(c, conn, ACK);
> +		tcp_send_flag(c, conn, ACK);
>  		ack_due = 0;
>  	}
>
> @@ -2058,7 +2076,7 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>  		return;
>  	}
>
> -	if (tcp_buf_send_flag(c, conn, SYN | ACK))
> +	if (tcp_send_flag(c, conn, SYN | ACK))
>  		return;
>
>  	conn_event(c, conn, TAP_SYN_ACK_SENT);
> @@ -2126,7 +2144,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
>
>  	conn->wnd_from_tap = WINDOW_DEFAULT;
>
> -	tcp_buf_send_flag(c, conn, SYN);
> +	tcp_send_flag(c, conn, SYN);
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>
>  	tcp_get_sndbuf(conn);
> @@ -2190,7 +2208,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  		return;
>
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		tcp_buf_send_flag(c, conn, ACK_IF_NEEDED);
> +		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>  		tcp_timer_ctl(c, conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
> @@ -2206,7 +2224,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> -			tcp_buf_data_from_sock(c, conn);
> +			tcp_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
>  	} else {
> @@ -2261,7 +2279,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  		conn_event(c, conn, SOCK_FIN_RCVD);
>
>  	if (events & EPOLLIN)
> -		tcp_buf_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn);
>
>  	if (events & EPOLLOUT)
>  		tcp_update_seqack_wnd(c, conn, 0, NULL);
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..ed59b21cabdc
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later

This needs a copyright notice.

> +#include
> +#include
> +#include
> +
> +#include
> +
> +#include
> +
> +#include
> +#include
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +
> +#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
> +#define CONN_V6(conn)		(!CONN_V4(conn))

I don't love having these duplicated in two .c files.  However, it
might become irrelevant as I move towards having v4/v6 become implicit
in the common flow addresses.

> +/* vhost-user */
> +static const struct virtio_net_hdr vu_header = {
> +	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +};
> +
> +static unsigned char buffer[65536];
> +static struct iovec iov_vu [VIRTQUEUE_MAX_SIZE];
> +static unsigned int indexes [VIRTQUEUE_MAX_SIZE];
> +
> +uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn)
> +{
> +	(void)conn;
> +	return USHRT_MAX;
> +}
> +
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	VuVirtqElement *elem;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct virtio_net_hdr_mrg_rxbuf *vh;
> +	size_t tlen, vnet_hdrlen, ip_len, optlen = 0;
> +	struct ethhdr *eh;
> +	int ret;
> +	int nb_ack;
> +
> +	elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +	if (!elem)
> +		return 0;
> +
> +	if (elem->in_num < 1) {
> +		err("virtio-net receive queue contains no in buffers");
> +		vu_queue_rewind(vdev, vq, 1);
> +		return 0;
> +	}
> +
> +	/* Options: MSS, NOP and window scale (8 bytes) */
> +	if (flags & SYN)
> +		optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;

Given the number of subtle TCP bugs we've had to squash, it would be
really nice if we could avoid duplicating TCP logic between paths.
Could we make some abstraction that takes an iov, but can also be
called from the non-vu case with a 1-entry iov representing a single
buffer?

> +	vh = elem->in_sg[0].iov_base;
> +
> +	vh->hdr = vu_header;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +		vh->num_buffers = htole16(1);
> +	} else {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr);
> +	}
> +	eh = (struct ethhdr *)((char *)elem->in_sg[0].iov_base + vnet_hdrlen);

Ah... hmm.. I was already hoping to clean up the handling of the
different L2-and-below headers for the different "tap" types.
We basically have ugly hacks to deal with the difference between
tuntap (plain ethernet) and qemu socket (ethernet + length header).
Now we're adding vhost-user (ethernet + vhost header), which is a
similar issue.

Abstracting this could also make it pretty easy to support further
"tap" interfaces: a different hypervisor socket transfer with a
slightly different header, tuntap in "tun" mode (raw IP without
ethernet headers), SLIP or PPP, ...

> +	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct iphdr *iph = (struct iphdr *)(eh + 1);
> +		struct tcphdr *th = (struct tcphdr *)(iph + 1);

Hmm.. did I miss logic to check that there's room for the vhost +
ethernet + IP + TCP headers in the first iov element?

> +		char *data = (char *)(th + 1);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*th = (struct tcphdr){
> +			.doff = sizeof(struct tcphdr) / 4,
> +			.ack = 1
> +		};
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vdev, vq, 1);
> +			return ret;
> +		}
> +
> +		ip_len = ipv4_fill_headers(c, conn, iph, optlen, NULL,
> +					   conn->seq_to_tap);
> +
> +		tlen = ip_len + sizeof(struct ethhdr);
> +
> +		if (*c->pcap) {
> +			uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
> +
> +			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
> +		}
> +	} else {
> +		struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
> +		struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
> +		char *data = (char *)(th + 1);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*th = (struct tcphdr){
> +			.doff = sizeof(struct tcphdr) / 4,
> +			.ack = 1
> +		};
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		ret = do_tcp_send_flag(c, conn, flags, th, data, optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vdev, vq, 1);
> +			return ret;
> +		}
> +
> +		ip_len = ipv6_fill_headers(c, conn, ip6h, optlen,
> +					   conn->seq_to_tap);
> +
> +		tlen = ip_len + sizeof(struct ethhdr);
> +
> +		if (*c->pcap) {
> +			uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
> +
> +			th->check = csum(th, optlen + sizeof(struct tcphdr), sum);
> +		}
> +	}
> +
> +	pcap((void *)eh, tlen);
> +
> +	tlen += vnet_hdrlen;
> +	vu_queue_fill(vdev, vq, elem, tlen, 0);
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +		if (elem) {
> +			if (elem->in_num < 1 || elem->in_sg[0].iov_len < tlen) {
> +				vu_queue_rewind(vdev, vq, 1);
> +			} else {
> +				memcpy(elem->in_sg[0].iov_base, vh, tlen);
> +				nb_ack++;
> +			}
> +		}
> +	}
> +
> +	vu_queue_flush(vdev, vq, nb_ack);
> +	vu_queue_notify(vdev, vq);
> +
> +	return 0;
> +}
> +
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	uint32_t already_sent;
> +	VuDev *vdev = (VuDev *)&c->vdev;
> +	VuVirtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int s = conn->sock, v4 = CONN_V4(conn);
> +	int i, ret = 0, iov_count, iov_used;
> +	struct msghdr mh_sock = { 0 };
> +	size_t l2_hdrlen, vnet_hdrlen, fillsize;
> +	ssize_t len;
> +	uint16_t *check;
> +	uint16_t mss = MSS_GET(conn);
> +	int num_buffers;
> +	int segment_size;
> +	struct iovec *first;
> +	bool has_mrg_rxbuf;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */
> +
> +	fillsize = wnd_scaled;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	has_mrg_rxbuf = vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF);
> +	if (has_mrg_rxbuf) {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	} else {
> +		vnet_hdrlen = sizeof(struct virtio_net_hdr);
> +	}

passt style (unlike qemu) does not put braces on 1-line blocks.

> +	l2_hdrlen = vnet_hdrlen + sizeof(struct ethhdr) + sizeof(struct tcphdr);

That seems like a misleading variable name.  The ethernet headers are
certainly L2.  Including the lower level headers in L2 is reasonable,
but the IP and TCP headers are L3 and L4 headers respectively.

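To make the naming point concrete, here is a rough sketch that keeps
the lengths per layer and only sums them where the payload offset is
needed.  The same total can also back the kind of room check asked
about above for the first iov element.  Everything named here (struct
hdr_lens, tcp_hdr_lens(), hdr_payload_offset(), hdr_fits()) is
hypothetical, and the sizes are the fixed option-less header sizes
written as constants rather than sizeof() on the real structs, so that
the snippet builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fixed header sizes without options, in bytes */
enum {
	ETH_HDR_LEN = 14,	/* matches sizeof(struct ethhdr) */
	IP4_HDR_LEN = 20,	/* matches sizeof(struct iphdr) */
	IP6_HDR_LEN = 40,	/* matches sizeof(struct ipv6hdr) */
	TCP_HDR_LEN = 20,	/* matches sizeof(struct tcphdr) */
};

/* Header sizes split by layer; struct and field names are hypothetical */
struct hdr_lens {
	size_t vnet;	/* virtio-net header, below L2 */
	size_t l2;	/* Ethernet */
	size_t l3;	/* IPv4 or IPv6 */
	size_t l4;	/* TCP */
};

static struct hdr_lens tcp_hdr_lens(size_t vnet_hdrlen, bool v4)
{
	struct hdr_lens h = {
		.vnet = vnet_hdrlen,
		.l2 = ETH_HDR_LEN,
		.l3 = v4 ? IP4_HDR_LEN : IP6_HDR_LEN,
		.l4 = TCP_HDR_LEN,
	};

	return h;
}

/* Offset of the TCP payload from the start of the first buffer */
static size_t hdr_payload_offset(const struct hdr_lens *h)
{
	assert(h);
	return h->vnet + h->l2 + h->l3 + h->l4;
}

/* True if a first buffer of @iov_len bytes has room for all headers */
static bool hdr_fits(const struct hdr_lens *h, size_t iov_len)
{
	return iov_len >= hdr_payload_offset(h);
}
```

With the 12-byte mergeable-rxbuf virtio-net header, the payload offset
comes out as 66 bytes for IPv4 and 86 bytes for IPv6.
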
> +	if (v4) {
> +		l2_hdrlen += sizeof(struct iphdr);
> +	} else {
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	}
> +
> +	iov_count = 0;
> +	segment_size = 0;
> +	while (fillsize > 0 && iov_count < VIRTQUEUE_MAX_SIZE - 1) {
> +		VuVirtqElement *elem;
> +
> +		elem = vu_queue_pop(vdev, vq, sizeof(VuVirtqElement), buffer);
> +		if (!elem)
> +			break;
> +
> +		if (elem->in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			goto err;
> +		}
> +
> +		ASSERT(elem->in_num == 1);
> +		ASSERT(elem->in_sg[0].iov_len >= l2_hdrlen);
> +
> +		indexes[iov_count] = elem->index;
> +
> +		if (segment_size == 0) {
> +			iov_vu[iov_count + 1].iov_base =
> +				(char *)elem->in_sg[0].iov_base + l2_hdrlen;
> +			iov_vu[iov_count + 1].iov_len =
> +				elem->in_sg[0].iov_len - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_count + 1].iov_base = elem->in_sg[0].iov_base;
> +			iov_vu[iov_count + 1].iov_len = elem->in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_count + 1].iov_len > fillsize)
> +			iov_vu[iov_count + 1].iov_len = fillsize;
> +
> +		segment_size += iov_vu[iov_count + 1].iov_len;
> +		if (!has_mrg_rxbuf) {
> +			segment_size = 0;
> +		} else if (segment_size >= mss) {
> +			iov_vu[iov_count + 1].iov_len -= segment_size - mss;
> +			segment_size = 0;
> +		}
> +		fillsize -= iov_vu[iov_count + 1].iov_len;
> +
> +		iov_count++;
> +	}
> +	if (iov_count == 0)
> +		return 0;
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = iov_count + 1;
> +
> +	do
> +		len = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (len < 0 && errno == EINTR);
> +
> +	if (len < 0)
> +		goto err;
> +
> +	if (!len) {
> +		vu_queue_rewind(vdev, vq, iov_count);
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			if ((ret = tcp_vu_send_flag(c, conn, FIN | ACK))) {
> +				tcp_rst(c, conn);
> +				return ret;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +
> +		return 0;
> +	}
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vdev, vq, iov_count);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	segment_size = 0;
> +	for (i = 0; i < iov_count && len; i++) {
> +
> +		if (segment_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		segment_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (segment_size >= mss || len == 0 ||
> +		    i + 1 == iov_count || !has_mrg_rxbuf) {
> +
> +			struct ethhdr *eh;
> +			struct virtio_net_hdr_mrg_rxbuf *vh;
> +			char *base = (char *)first->iov_base - l2_hdrlen;
> +			size_t size = first->iov_len + l2_hdrlen;
> +
> +			vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
> +
> +			vh->hdr = vu_header;
> +			if (has_mrg_rxbuf)
> +				vh->num_buffers = htole16(num_buffers);
> +
> +			eh = (struct ethhdr *)((char *)base + vnet_hdrlen);
> +
> +			memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +			memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +			/* initialize header */
> +			if (v4) {
> +				struct iphdr *iph = (struct iphdr *)(eh + 1);
> +				struct tcphdr *th = (struct tcphdr *)(iph + 1);
> +
> +				eh->h_proto = htons(ETH_P_IP);
> +
> +				*th = (struct tcphdr){
> +					.doff = sizeof(struct tcphdr) / 4,
> +					.ack = 1
> +				};
> +
> +				*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +				ipv4_fill_headers(c, conn, iph, segment_size,
> +						  len ? check : NULL, conn->seq_to_tap);
> +
> +				if (*c->pcap) {
> +					uint32_t sum = proto_ipv4_header_checksum(iph, IPPROTO_TCP);
> +
> +					first->iov_base = th;
> +					first->iov_len = size - l2_hdrlen + sizeof(*th);
> +
> +					th->check = csum_iov(first, num_buffers, sum);
> +				}
> +
> +				check = &iph->check;
> +			} else {
> +				struct ipv6hdr *ip6h = (struct ipv6hdr *)(eh + 1);
> +				struct tcphdr *th = (struct tcphdr *)(ip6h + 1);
> +
> +				eh->h_proto = htons(ETH_P_IPV6);
> +
> +				*th = (struct tcphdr){
> +					.doff = sizeof(struct tcphdr) / 4,
> +					.ack = 1
> +				};
> +
> +				*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +				ipv6_fill_headers(c, conn, ip6h, segment_size,
> +						  conn->seq_to_tap);
> +				if (*c->pcap) {
> +					uint32_t sum = proto_ipv6_header_checksum(ip6h, IPPROTO_TCP);
> +
> +					first->iov_base = th;
> +					first->iov_len = size - l2_hdrlen + sizeof(*th);
> +
> +					th->check = csum_iov(first, num_buffers, sum);
> +				}
> +			}
> +
> +			/* set iov for pcap logging */
> +			first->iov_base = eh;
> +			first->iov_len = size - vnet_hdrlen;
> +
> +			pcap_iov(first, num_buffers);
> +
> +			/* set iov_len for vu_queue_fill_by_index(); */
> +
> +			first->iov_base = base;
> +			first->iov_len = size;
> +
> +			conn->seq_to_tap += segment_size;
> +
> +			segment_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vdev, vq, iov_count - iov_used);
> +
> +	/* send packets */
> +	for (i = 0; i < iov_used; i++) {
> +		vu_queue_fill_by_index(vdev, vq, indexes[i],
> +				       iov_vu[i + 1].iov_len, i);
> +	}
> +
> +	vu_queue_flush(vdev, vq, iov_used);
> +	vu_queue_notify(vdev, vq);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +err:
> +	vu_queue_rewind(vdev, vq, iov_count);
> +
> +	if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +		ret = -errno;
> +		tcp_rst(c, conn);
> +	}
> +
> +	return ret;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..8045a6e3edb8
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,10 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +uint16_t tcp_vu_conn_tap_mss(const struct tcp_tap_conn *conn);
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif /*TCP_VU_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--kz9HI1qqlwTh7lvm
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmXFsM4ACgkQzQJF27ox
2Gf9OA/5ASAjxHLL53ybtfcuzAxjeYeGNvKn8pXCXQKPF59vH42bD1PpFfwDmZuH
FSvVah+HyYqCUpa6Nld1g2LCiy0+CHMyOkZTqSEtC5mddr4VeeK3WLRHOWUmReU6
NwnlB+7iv78v3Va3ZyA2HxT5BdhlAaHb0Ib8KKdDyQA6MVWpC/2rE3PdRKGGLy75
Mlz4x/njY/sd40TM6iM9vMEduisW2hd6RNJL1teHXNUED+1Gui6Uai4psnNYt0z8
pXz0+MnxRRogURkFoTxPmy74hCxbLuRpMgocFD9tt4VfWGqT2K4AvPn0b+7G5haP
vRY6SCDfm+Zn1BJpsdXYgaMnkBiHTsVWdJ33I7rRS3kfzy7fMRTieSeLW9BaH+43
lT3GGTxqeGxHY/AzBucfoVZRhdOhmZpv7SG7nyM9LudPBFeKOOgbGpbPH+vilQhu
8TjF+DsXbLguZZBn8QC+CLIccY57oBI89+XO6gFFujKEL4ZJKl2b3Qq1m4KUIEPR
T9Qa9CfU/KNl/7b7HOveuySSjD1t1EeTa1cZWz6cmDgLLnLtZFgMOI9PcoJhd0Sa
CErpzWLytrYqi8SOgDXyFIw+vdBZ2fQdkGYF/LM5YrWljbfKzCB4/7LUOdUhJpm4
nolN25E5xONP5AhLSQB/n9ib8huvNhlKY8lmXhg4XP+PI2gLemI=
=RvQl
-----END PGP SIGNATURE-----

--kz9HI1qqlwTh7lvm--