From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 17 Oct 2024 02:10:34 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: Laurent Vivier
CC: passt-dev@passt.top
Subject: Re: [PATCH v8 7/8] vhost-user: add vhost-user
Message-ID: <20241017021034.437f3757@elisabeth>
In-Reply-To: <20241010122903.1188992-8-lvivier@redhat.com>
References: <20241010122903.1188992-1-lvivier@redhat.com>
	<20241010122903.1188992-8-lvivier@redhat.com>
Organization: Red Hat
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
List-Id: Development discussion and patches for passt

On Thu, 10 Oct 2024 14:29:01 +0200
Laurent Vivier wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
> $ ./passt --vhost-user
> 
> and
> 
> # qemu-system-x86_64 ... -m 4G \
>        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>        -numa node,memdev=memfd0 \
>        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>        -netdev vhost-user,id=netdev0,chardev=chr0 \
>        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>        ...
> 
> Signed-off-by: Laurent Vivier
> ---
>  Makefile     |   6 +-
>  conf.c       |  21 ++-
>  epoll_type.h |   4 +
>  iov.c        |   1 -
>  isolation.c  |  15 +-
>  packet.c     |  11 ++
>  packet.h     |   8 +-
>  passt.1      |  10 +-
>  passt.c      |   9 +
>  passt.h      |   6 +
>  pcap.c       |   1 -
>  tap.c        |  80 +++++++--
>  tap.h        |   5 +-
>  tcp.c        |   7 +
>  tcp_vu.c     | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h     |  12 ++
>  udp.c        |  10 ++
>  udp_vu.c     | 336 ++++++++++++++++++++++++++++++++++++
>  udp_vu.h     |  13 ++
>  vhost_user.c |  37 ++--
>  vhost_user.h |   4 +-
>  virtio.c     |   5 -
>  vu_common.c  | 385 +++++++++++++++++++++++++++++++++++++++++
>  vu_common.h  |  47 +++++
>  24 files changed, 1454 insertions(+), 55 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
> 
> diff --git a/Makefile b/Makefile
> index 0e8ed60a0da1..1e8910dda1f4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c
\ > =09icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > =09ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > -=09tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c > +=09tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > +=09vhost_user.c virtio.c vu_common.c > QRAP_SRCS =3D qrap.c > SRCS =3D $(PASST_SRCS) $(QRAP_SRCS) > =20 > @@ -64,7 +65,8 @@ PASST_HEADERS =3D arch.h arp.h checksum.h conf.h dhcp.h= dhcpv6.h flow.h fwd.h \ > =09flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > =09lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.= h \ > =09siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.= h \ > -=09udp.h udp_flow.h util.h vhost_user.h virtio.h > +=09tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h= \ > +=09virtio.h vu_common.h > HEADERS =3D $(PASST_HEADERS) seccomp.h > =20 > C :=3D \#include \nstruct tcp_info x =3D { .tcpi_snd_wnd = =3D 0 }; > diff --git a/conf.c b/conf.c > index c63101970155..29d6e41f5770 100644 > --- a/conf.c > +++ b/conf.c > @@ -45,6 +45,7 @@ > #include "lineread.h" > #include "isolation.h" > #include "log.h" > +#include "vhost_user.h" > =20 > /** > * next_chunk - Return the next piece of a string delimited by a charact= er > @@ -762,9 +763,14 @@ static void usage(const char *name, FILE *f, int sta= tus) > =09=09=09" default: same interface name as external one\n"); > =09} else { > =09=09fprintf(f, > -=09=09=09" -s, --socket PATH=09UNIX domain socket path\n" > +=09=09=09" -s, --socket, --socket-path PATH=09UNIX domain socket path\n= " > =09=09=09" default: probe free path starting from " > =09=09=09UNIX_SOCK_PATH "\n", 1); > +=09=09fprintf(f, > +=09=09=09" --vhost-user=09=09Enable vhost-user mode\n" > +=09=09=09" UNIX domain socket is provided by -s option\n" > +=09=09=09" --print-capabilities=09print back-end capabilities in JSON f= ormat,\n" > +=09=09=09" only meaningful for vhost-user 
mode\n"); > =09} > =20 > =09fprintf(f, > @@ -1290,6 +1296,10 @@ void conf(struct ctx *c, int argc, char **argv) > =09=09{"map-host-loopback", required_argument, NULL,=09=0921 }, > =09=09{"map-guest-addr", required_argument,=09NULL,=09=0922 }, > =09=09{"dns-host",=09required_argument,=09NULL,=09=0924 }, > +=09=09{"vhost-user",=09no_argument,=09=09NULL,=09=0925 }, > +=09=09/* vhost-user backend program convention */ > +=09=09{"print-capabilities", no_argument,=09NULL,=09=0926 }, > +=09=09{"socket-path",=09required_argument,=09NULL,=09=09's' }, > =09=09{ 0 }, > =09}; > =09const char *logname =3D (c->mode =3D=3D MODE_PASTA) ? "pasta" : "pass= t"; > @@ -1478,6 +1488,15 @@ void conf(struct ctx *c, int argc, char **argv) > =09=09=09=09break; > =20 > =09=09=09die("Invalid host nameserver address: %s", optarg); > +=09=09case 25: > +=09=09=09if (c->mode =3D=3D MODE_PASTA) { > +=09=09=09=09err("--vhost-user is for passt mode only"); > +=09=09=09=09usage(argv[0], stdout, EXIT_SUCCESS); > +=09=09=09} > +=09=09=09c->mode =3D MODE_VU; > +=09=09=09break; > +=09=09case 26: > +=09=09=09vu_print_capabilities(); > =09=09=09break; > =09=09case 'd': > =09=09=09c->debug =3D 1; > diff --git a/epoll_type.h b/epoll_type.h > index 0ad1efa0ccec..f3ef41584757 100644 > --- a/epoll_type.h > +++ b/epoll_type.h > @@ -36,6 +36,10 @@ enum epoll_type { > =09EPOLL_TYPE_TAP_PASST, > =09/* socket listening for qemu socket connections */ > =09EPOLL_TYPE_TAP_LISTEN, > +=09/* vhost-user command socket */ > +=09EPOLL_TYPE_VHOST_CMD, > +=09/* vhost-user kick event socket */ > +=09EPOLL_TYPE_VHOST_KICK, > =20 > =09EPOLL_NUM_TYPES, > }; > diff --git a/iov.c b/iov.c > index 3f9e229a305f..3741db21790f 100644 > --- a/iov.c > +++ b/iov.c > @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n= , > * > * Returns: The number of bytes successfully copied. 
> */ > -/* cppcheck-suppress unusedFunction */ > size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt, > =09=09 size_t offset, const void *buf, size_t bytes) > { > diff --git a/isolation.c b/isolation.c > index 45fba1e68b9d..c2a3c7b7911d 100644 > --- a/isolation.c > +++ b/isolation.c > @@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c) > =20 > =09prctl(PR_SET_DUMPABLE, 0); > =20 > -=09if (c->mode =3D=3D MODE_PASTA) { > -=09=09prog.len =3D (unsigned short)ARRAY_SIZE(filter_pasta); > -=09=09prog.filter =3D filter_pasta; > -=09} else { > +=09switch (c->mode) { > +=09case MODE_PASST: > =09=09prog.len =3D (unsigned short)ARRAY_SIZE(filter_passt); > =09=09prog.filter =3D filter_passt; > +=09=09break; > +=09case MODE_PASTA: > +=09=09prog.len =3D (unsigned short)ARRAY_SIZE(filter_pasta); > +=09=09prog.filter =3D filter_pasta; > +=09=09break; > +=09case MODE_VU: > +=09=09prog.len =3D (unsigned short)ARRAY_SIZE(filter_vu); > +=09=09prog.filter =3D filter_vu; > +=09=09break; > =09} > =20 > =09if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) || > diff --git a/packet.c b/packet.c > index 37489961a37e..e5a78d079231 100644 > --- a/packet.c > +++ b/packet.c > @@ -36,6 +36,17 @@ > static int packet_check_range(const struct pool *p, size_t offset, size_= t len, > =09=09=09 const char *start, const char *func, int line) > { > +=09if (p->buf_size =3D=3D 0) { > +=09=09int ret; > + > +=09=09ret =3D vu_packet_check_range((void *)p->buf, offset, len, start); > + > +=09=09if (ret =3D=3D -1) > +=09=09=09trace("cannot find region, %s:%i", func, line); > + > +=09=09return ret; > +=09} > + > =09if (start < p->buf) { > =09=09trace("packet start %p before buffer start %p, " > =09=09 "%s:%i", (void *)start, (void *)p->buf, func, line); > diff --git a/packet.h b/packet.h > index 8377dcf678bb..3f70e949c066 100644 > --- a/packet.h > +++ b/packet.h > @@ -8,8 +8,10 @@ > =20 > /** > * struct pool - Generic pool of packets stored in a buffer > - * @buf:=09Buffer storing packet descriptors 
> - * @buf_size:=09Total size of buffer > + * @buf:=09Buffer storing packet descriptors, > + * =09=09a struct vu_dev_region array for passt vhost-user mode > + * @buf_size:=09Total size of buffer, > + * =09=090 for passt vhost-user mode > * @size:=09Number of usable descriptors for the pool > * @count:=09Number of used descriptors for the pool > * @pkt:=09Descriptors: see macros below > @@ -22,6 +24,8 @@ struct pool { > =09struct iovec pkt[1]; > }; > =20 > +int vu_packet_check_range(void *buf, size_t offset, size_t len, > +=09=09=09 const char *start); > void packet_add_do(struct pool *p, size_t len, const char *start, > =09=09 const char *func, int line); > void *packet_get_do(const struct pool *p, const size_t idx, > diff --git a/passt.1 b/passt.1 > index ef33267e9cd7..96532dd39aa2 100644 > --- a/passt.1 > +++ b/passt.1 > @@ -397,12 +397,20 @@ interface address are configured on a given host in= terface. > .SS \fBpasst\fR-only options > =20 > .TP > -.BR \-s ", " \-\-socket " " \fIpath > +.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath > Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to co= nnect to > \fBpasst\fR. > Default is to probe a free socket, not accepting connections, starting f= rom > \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR. > =20 > +.TP > +.BR \-\-vhost-user > +Enable vhost-user. The vhost-user command socket is provided by \fB--soc= ket\fR. > + > +.TP > +.BR \-\-print-capabilities > +Print back-end capabilities in JSON format, only meaningful for vhost-us= er mode. > + > .TP > .BR \-F ", " \-\-fd " " \fIFD > Pass a pre-opened, connected socket to \fBpasst\fR. 
Usually the socket i= s opened > diff --git a/passt.c b/passt.c > index 79093ee02d62..2d105e81218d 100644 > --- a/passt.c > +++ b/passt.c > @@ -52,6 +52,7 @@ > #include "arch.h" > #include "log.h" > #include "tcp_splice.h" > +#include "vu_common.h" > =20 > #define EPOLL_EVENTS=09=098 > =20 > @@ -74,6 +75,8 @@ char *epoll_type_str[] =3D { > =09[EPOLL_TYPE_TAP_PASTA]=09=09=3D "/dev/net/tun device", > =09[EPOLL_TYPE_TAP_PASST]=09=09=3D "connected qemu socket", > =09[EPOLL_TYPE_TAP_LISTEN]=09=09=3D "listening qemu socket", > +=09[EPOLL_TYPE_VHOST_CMD]=09=09=3D "vhost-user command socket", > +=09[EPOLL_TYPE_VHOST_KICK]=09=09=3D "vhost-user kick socket", > }; > static_assert(ARRAY_SIZE(epoll_type_str) =3D=3D EPOLL_NUM_TYPES, > =09 "epoll_type_str[] doesn't match enum epoll_type"); > @@ -360,6 +363,12 @@ loop: > =09=09case EPOLL_TYPE_PING: > =09=09=09icmp_sock_handler(&c, ref); > =09=09=09break; > +=09=09case EPOLL_TYPE_VHOST_CMD: > +=09=09=09vu_control_handler(c.vdev, c.fd_tap, eventmask); > +=09=09=09break; > +=09=09case EPOLL_TYPE_VHOST_KICK: > +=09=09=09vu_kick_cb(c.vdev, ref, &now); > +=09=09=09break; > =09=09default: > =09=09=09/* Can't happen */ > =09=09=09ASSERT(0); > diff --git a/passt.h b/passt.h > index 4908ed937dc8..311482d36257 100644 > --- a/passt.h > +++ b/passt.h > @@ -25,6 +25,8 @@ union epoll_ref; > #include "fwd.h" > #include "tcp.h" > #include "udp.h" > +#include "udp_vu.h" > +#include "vhost_user.h" > =20 > /* Default address for our end on the tap interface. Bit 0 of byte 0 mu= st be 0 > * (unicast) and bit 1 of byte 1 must be 1 (locally administered). 
Othe= rwise > @@ -94,6 +96,7 @@ struct fqdn { > enum passt_modes { > =09MODE_PASST, > =09MODE_PASTA, > +=09MODE_VU, > }; > =20 > /** > @@ -228,6 +231,7 @@ struct ip6_ctx { > * @freebind:=09=09Allow binding of non-local addresses for forwarding > * @low_wmem:=09=09Low probed net.core.wmem_max > * @low_rmem:=09=09Low probed net.core.rmem_max > + * @vdev:=09=09vhost-user device > */ > struct ctx { > =09enum passt_modes mode; > @@ -289,6 +293,8 @@ struct ctx { > =20 > =09int low_wmem; > =09int low_rmem; > + > +=09struct vu_dev *vdev; > }; > =20 > void proto_update_l2_buf(const unsigned char *eth_d, > diff --git a/pcap.c b/pcap.c > index 6ee6cdfd261a..718d6ad61732 100644 > --- a/pcap.c > +++ b/pcap.c > @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t fr= ame_parts, unsigned int n, > * @iovcnt:=09Number of buffers (@iov entries) > * @offset:=09Offset of the L2 frame within the full data length > */ > -/* cppcheck-suppress unusedFunction */ > void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset) > { > =09struct timespec now; > diff --git a/tap.c b/tap.c > index 4b826fdf7adc..22d19f1833f7 100644 > --- a/tap.c > +++ b/tap.c > @@ -58,6 +58,8 @@ > #include "packet.h" > #include "tap.h" > #include "log.h" > +#include "vhost_user.h" > +#include "vu_common.h" > =20 > /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handler= s */ > static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf); > @@ -78,16 +80,22 @@ void tap_send_single(const struct ctx *c, const void = *data, size_t l2len) > =09struct iovec iov[2]; > =09size_t iovcnt =3D 0; > =20 > -=09if (c->mode =3D=3D MODE_PASST) { > +=09switch (c->mode) { > +=09case MODE_PASST: > =09=09iov[iovcnt] =3D IOV_OF_LVALUE(vnet_len); > =09=09iovcnt++; > -=09} > - > -=09iov[iovcnt].iov_base =3D (void *)data; > -=09iov[iovcnt].iov_len =3D l2len; > -=09iovcnt++; > +=09=09/* fall through */ > +=09case MODE_PASTA: > +=09=09iov[iovcnt].iov_base =3D (void *)data; > +=09=09iov[iovcnt].iov_len 
=3D l2len; > +=09=09iovcnt++; > =20 > -=09tap_send_frames(c, iov, iovcnt, 1); > +=09=09tap_send_frames(c, iov, iovcnt, 1); > +=09=09break; > +=09case MODE_VU: > +=09=09vu_send_single(c, data, l2len); > +=09=09break; > +=09} > } > =20 > /** > @@ -414,10 +422,18 @@ size_t tap_send_frames(const struct ctx *c, const s= truct iovec *iov, > =09if (!nframes) > =09=09return 0; > =20 > -=09if (c->mode =3D=3D MODE_PASTA) > +=09switch (c->mode) { > +=09case MODE_PASTA: > =09=09m =3D tap_send_frames_pasta(c, iov, bufs_per_frame, nframes); > -=09else > +=09=09break; > +=09case MODE_PASST: > =09=09m =3D tap_send_frames_passt(c, iov, bufs_per_frame, nframes); > +=09=09break; > +=09case MODE_VU: > +=09=09/* fall through */ > +=09default: > +=09=09ASSERT(0); > +=09} > =20 > =09if (m < nframes) > =09=09debug("tap: failed to send %zu frames of %zu", > @@ -976,7 +992,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, cha= r *p) > * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socke= t > * @c:=09=09Execution context > */ > -static void tap_sock_reset(struct ctx *c) > +void tap_sock_reset(struct ctx *c) > { > =09info("Client connection closed%s", c->one_off ? ", exiting" : ""); > =20 > @@ -987,6 +1003,8 @@ static void tap_sock_reset(struct ctx *c) > =09epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL); > =09close(c->fd_tap); > =09c->fd_tap =3D -1; > +=09if (c->mode =3D=3D MODE_VU) > +=09=09vu_cleanup(c->vdev); > } > =20 > /** > @@ -1205,6 +1223,11 @@ static void tap_backend_show_hints(struct ctx *c) > =09=09info("or qrap, for earlier qemu versions:"); > =09=09info(" ./qrap 5 kvm ... -net socket,fd=3D5 -net nic,model=3Dvir= tio"); > =09=09break; > +=09case MODE_VU: > +=09=09info("You can start qemu with:"); > +=09=09info(" kvm ... 
-chardev socket,id=3Dchr0,path=3D%s -netdev vhos= t-user,id=3Dnetdev0,chardev=3Dchr0 -device virtio-net,netdev=3Dnetdev0 -obj= ect memory-backend-memfd,id=3Dmemfd0,share=3Don,size=3D$RAMSIZE -numa node,= memdev=3Dmemfd0\n", > +=09=09 c->sock_path); > +=09=09break; > =09} > } > =20 > @@ -1232,8 +1255,8 @@ static void tap_sock_unix_init(const struct ctx *c) > */ > void tap_listen_handler(struct ctx *c, uint32_t events) > { > -=09union epoll_ref ref =3D { .type =3D EPOLL_TYPE_TAP_PASST }; > =09struct epoll_event ev =3D { 0 }; > +=09union epoll_ref ref =3D { 0 }; > =09int v =3D INT_MAX / 2; > =09struct ucred ucred; > =09socklen_t len; > @@ -1273,6 +1296,10 @@ void tap_listen_handler(struct ctx *c, uint32_t ev= ents) > =09=09trace("tap: failed to set SO_SNDBUF to %i", v); > =20 > =09ref.fd =3D c->fd_tap; > +=09if (c->mode =3D=3D MODE_VU) > +=09=09ref.type =3D EPOLL_TYPE_VHOST_CMD; > +=09else > +=09=09ref.type =3D EPOLL_TYPE_TAP_PASST; > =09ev.events =3D EPOLLIN | EPOLLRDHUP; > =09ev.data.u64 =3D ref.u64; > =09epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); > @@ -1339,7 +1366,7 @@ static void tap_sock_tun_init(struct ctx *c) > * @base:=09Buffer base > * @size=09Buffer size > */ > -static void tap_sock_update_pool(void *base, size_t size) > +void tap_sock_update_pool(void *base, size_t size) > { > =09int i; > =20 > @@ -1353,13 +1380,15 @@ static void tap_sock_update_pool(void *base, size= _t size) > } > =20 > /** > - * tap_backend_init() - Create and set up AF_UNIX socket or > - *=09=09=09tuntap file descriptor > + * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file des= criptor > * @c:=09=09Execution context > */ > void tap_backend_init(struct ctx *c) > { > -=09tap_sock_update_pool(pkt_buf, sizeof(pkt_buf)); > +=09if (c->mode =3D=3D MODE_VU) > +=09=09tap_sock_update_pool(NULL, 0); > +=09else > +=09=09tap_sock_update_pool(pkt_buf, sizeof(pkt_buf)); > =20 > =09if (c->fd_tap !=3D -1) { /* Passed as --fd */ > =09=09struct epoll_event ev =3D { 0 }; > @@ 
-1367,10 +1396,17 @@ void tap_backend_init(struct ctx *c) > =20 > =09=09ASSERT(c->one_off); > =09=09ref.fd =3D c->fd_tap; > -=09=09if (c->mode =3D=3D MODE_PASST) > +=09=09switch (c->mode) { > +=09=09case MODE_PASST: > =09=09=09ref.type =3D EPOLL_TYPE_TAP_PASST; > -=09=09else > +=09=09=09break; > +=09=09case MODE_PASTA: > =09=09=09ref.type =3D EPOLL_TYPE_TAP_PASTA; > +=09=09=09break; > +=09=09case MODE_VU: > +=09=09=09ref.type =3D EPOLL_TYPE_VHOST_CMD; > +=09=09=09break; > +=09=09} > =20 > =09=09ev.events =3D EPOLLIN | EPOLLRDHUP; > =09=09ev.data.u64 =3D ref.u64; > @@ -1378,9 +1414,14 @@ void tap_backend_init(struct ctx *c) > =09=09return; > =09} > =20 > -=09if (c->mode =3D=3D MODE_PASTA) { > +=09switch (c->mode) { > +=09case MODE_PASTA: > =09=09tap_sock_tun_init(c); > -=09} else { > +=09=09break; > +=09case MODE_VU: > +=09=09vu_init(c); > +=09=09/* fall through */ > +=09case MODE_PASST: > =09=09tap_sock_unix_init(c); > =20 > =09=09/* In passt mode, we don't know the guest's MAC address until it > @@ -1388,6 +1429,7 @@ void tap_backend_init(struct ctx *c) > =09=09 * first packets will reach it. 
> =09=09 */ > =09=09memset(&c->guest_mac, 0xff, sizeof(c->guest_mac)); > +=09=09break; > =09} > =20 > =09tap_backend_show_hints(c); > diff --git a/tap.h b/tap.h > index 8728cc5c09c3..dfbd8b9ebd72 100644 > --- a/tap.h > +++ b/tap.h > @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx= *c, > */ > static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) > { > -=09thdr->vnet_len =3D htonl(l2len); > +=09if (thdr) > +=09=09thdr->vnet_len =3D htonl(l2len); > } > =20 > void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sp= ort, > @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, > void tap_handler_passt(struct ctx *c, uint32_t events, > =09=09 const struct timespec *now); > int tap_sock_unix_open(char *sock_path); > +void tap_sock_reset(struct ctx *c); > +void tap_sock_update_pool(void *base, size_t size); > void tap_backend_init(struct ctx *c); > void tap_flush_pools(void); > void tap_handler(struct ctx *c, const struct timespec *now); > diff --git a/tcp.c b/tcp.c > index eae02b1647e3..fd2def0d8a39 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -304,6 +304,7 @@ > #include "flow_table.h" > #include "tcp_internal.h" > #include "tcp_buf.h" > +#include "tcp_vu.h" > =20 > /* MSS rounding: see SET_MSS() */ > #define MSS_DEFAULT=09=09=09536 > @@ -1328,6 +1329,9 @@ int tcp_prepare_flags(const struct ctx *c, struct t= cp_tap_conn *conn, > static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, > =09=09=09 int flags) > { > +=09if (c->mode =3D=3D MODE_VU) > +=09=09return tcp_vu_send_flag(c, conn, flags); > + > =09return tcp_buf_send_flag(c, conn, flags); > } > =20 > @@ -1721,6 +1725,9 @@ static int tcp_sock_consume(const struct tcp_tap_co= nn *conn, uint32_t ack_seq) > */ > static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *= conn) > { > +=09if (c->mode =3D=3D MODE_VU) > +=09=09return tcp_vu_data_from_sock(c, conn); > + > =09return tcp_buf_data_from_sock(c, conn); > } > =20 > 
diff --git a/tcp_vu.c b/tcp_vu.c > new file mode 100644 > index 000000000000..1126fb39d138 > --- /dev/null > +++ b/tcp_vu.c > @@ -0,0 +1,476 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* tcp_vu.c - TCP L2 vhost-user management functions > + * > + * Copyright Red Hat > + * Author: Laurent Vivier > + */ > + > +#include > +#include > +#include > + > +#include > + > +#include > + > +#include > +#include > + > +#include "util.h" > +#include "ip.h" > +#include "passt.h" > +#include "siphash.h" > +#include "inany.h" > +#include "vhost_user.h" > +#include "tcp.h" > +#include "pcap.h" > +#include "flow.h" > +#include "tcp_conn.h" > +#include "flow_table.h" > +#include "tcp_vu.h" > +#include "tap.h" > +#include "tcp_internal.h" > +#include "checksum.h" > +#include "vu_common.h" > + > +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1]; > +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE]; > + > +/** > + * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (= TDP) > + * @v6:=09=09Set for IPv6 packet > + * > + * Return: Return the size of the header > + */ > +static size_t tcp_vu_l2_hdrlen(bool v6) > +{ > +=09size_t l2_hdrlen; > + > +=09l2_hdrlen =3D sizeof(struct ethhdr) + sizeof(struct tcphdr); > + > +=09if (v6) > +=09=09l2_hdrlen +=3D sizeof(struct ipv6hdr); > +=09else > +=09=09l2_hdrlen +=3D sizeof(struct iphdr); > + > +=09return l2_hdrlen; > +} > + > +/** > + * tcp_vu_update_check() - Calculate TCP checksum > + * @tapside:=09Address information for one side of the flow > + * @iov:=09Pointer to the array of IO vectors > + * @iov_used:=09Length of the array > + */ > +static void tcp_vu_update_check(const struct flowside *tapside, > +=09=09=09 struct iovec *iov, int iov_used) > +{ > +=09char *base =3D iov[0].iov_base; > + > +=09if (inany_v4(&tapside->oaddr)) { > +=09=09const struct iphdr *iph =3D vu_ip(base); > + > +=09=09tcp_update_check_tcp4(iph, iov, iov_used, > +=09=09=09=09 (char *)vu_payloadv4(base) - base); > +=09} else { > 
+=09=09const struct ipv6hdr *ip6h =3D vu_ip(base); > + > +=09=09tcp_update_check_tcp6(ip6h, iov, iov_used, > +=09=09=09=09 (char *)vu_payloadv6(base) - base); > +=09} > +} > + > +/** > + * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payloa= d) > + * @c:=09=09Execution context > + * @conn:=09Connection pointer > + * @flags:=09TCP flags: if not set, send segment only if ACK is due > + * > + * Return: negative error code on connection reset, 0 otherwise > + */ > +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int= flags) > +{ > +=09struct vu_dev *vdev =3D c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09const struct flowside *tapside =3D TAPFLOW(conn); > +=09size_t l2len, l4len, optlen, hdrlen; > +=09struct ethhdr *eh; > +=09int elem_cnt; > +=09int nb_ack; > +=09int ret; > + > +=09hdrlen =3D tcp_vu_l2_hdrlen(CONN_V6(conn)); > + > +=09vu_init_elem(elem, iov_vu, 2); > + > +=09elem_cnt =3D vu_collect_one_frame(vdev, vq, elem, 1, > +=09=09=09=09=09hdrlen + OPT_MSS_LEN + OPT_WS_LEN + 1, > +=09=09=09=09=090, NULL); > +=09if (elem_cnt < 1) > +=09=09return 0; > + > +=09vu_set_vnethdr(vdev, &iov_vu[0], 1, 0); > + > +=09eh =3D vu_eth(iov_vu[0].iov_base); > + > +=09memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); > +=09memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); > + > +=09if (CONN_V4(conn)) { > +=09=09struct tcp_payload_t *payload; > +=09=09struct iphdr *iph; > +=09=09uint32_t seq; > + > +=09=09eh->h_proto =3D htons(ETH_P_IP); > + > +=09=09iph =3D vu_ip(iov_vu[0].iov_base); > +=09=09*iph =3D (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP); > + > +=09=09payload =3D vu_payloadv4(iov_vu[0].iov_base); > +=09=09memset(&payload->th, 0, sizeof(payload->th)); > +=09=09payload->th.doff =3D offsetof(struct tcp_flags_t, opts) / 4; > +=09=09payload->th.ack =3D 1; > + > +=09=09seq =3D conn->seq_to_tap; > +=09=09ret =3D tcp_prepare_flags(c, conn, flags, &payload->th, > +=09=09=09=09=09(char *)payload->data, 
&optlen); > +=09=09if (ret <=3D 0) { > +=09=09=09vu_queue_rewind(vq, 1); > +=09=09=09return ret; > +=09=09} > + > +=09=09l4len =3D tcp_fill_headers4(conn, NULL, iph, payload, optlen, > +=09=09=09=09=09 NULL, seq, true); > +=09=09l2len =3D sizeof(*iph); > +=09} else { > +=09=09struct tcp_payload_t *payload; > +=09=09struct ipv6hdr *ip6h; > +=09=09uint32_t seq; > + > +=09=09eh->h_proto =3D htons(ETH_P_IPV6); > + > +=09=09ip6h =3D vu_ip(iov_vu[0].iov_base); > +=09=09*ip6h =3D (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP); > + > +=09=09payload =3D vu_payloadv6(iov_vu[0].iov_base); > +=09=09memset(&payload->th, 0, sizeof(payload->th)); > +=09=09payload->th.doff =3D offsetof(struct tcp_flags_t, opts) / 4; > +=09=09payload->th.ack =3D 1; > + > +=09=09seq =3D conn->seq_to_tap; > +=09=09ret =3D tcp_prepare_flags(c, conn, flags, &payload->th, > +=09=09=09=09=09(char *)payload->data, &optlen); > +=09=09if (ret <=3D 0) { > +=09=09=09vu_queue_rewind(vq, 1); > +=09=09=09return ret; > +=09=09} > + > +=09=09l4len =3D tcp_fill_headers6(conn, NULL, ip6h, payload, optlen, > +=09=09=09=09=09 seq, true); > +=09=09l2len =3D sizeof(*ip6h); > +=09} > +=09l2len +=3D l4len + sizeof(struct ethhdr); > + > +=09elem[0].in_sg[0].iov_len =3D l2len + > +=09=09=09=09 sizeof(struct virtio_net_hdr_mrg_rxbuf); > +=09if (*c->pcap) { > +=09=09tcp_vu_update_check(tapside, &elem[0].in_sg[0], 1); > +=09=09pcap_iov(&elem[0].in_sg[0], 1, > +=09=09=09 sizeof(struct virtio_net_hdr_mrg_rxbuf)); > +=09} > +=09nb_ack =3D 1; > + > +=09if (flags & DUP_ACK) { > +=09=09elem_cnt =3D vu_collect_one_frame(vdev, vq, &elem[1], 1, l2len, > +=09=09=09=09=09=090, NULL); > +=09=09if (elem_cnt =3D=3D 1) { > +=09=09=09memcpy(elem[1].in_sg[0].iov_base, > +=09=09=09 elem[0].in_sg[0].iov_base, l2len); > +=09=09=09vu_set_vnethdr(vdev, &elem[1].in_sg[0], 1, 0); > +=09=09=09nb_ack++; > + > +=09=09=09if (*c->pcap) > +=09=09=09=09pcap_iov(&elem[1].in_sg[0], 1, 0); > +=09=09} > +=09} > + > +=09vu_flush(vdev, vq, elem, nb_ack); > + > 
> +	return 0;
> +}
> +
> +/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @v4:		Set for IPv4 connections
> + * @fillsize:	Number of bytes we can receive

So, it's the third time I review a version of this function, and the
third time I ask myself in which sense we _can_ receive those bytes. :)

Now that I remembered: what about "Maximum bytes to fill in guest-side
receiving window"?

> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the data
> + */
> +static ssize_t tcp_vu_sock_recv(const struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct msghdr mh_sock = { 0 };
> +	uint16_t mss = MSS_GET(conn);
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int elem_cnt;
> +	ssize_t ret;
> +
> +	*dlen = 0;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +
> +	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
> +
> +	elem_cnt = vu_collect(vdev, vq, elem, VIRTQUEUE_MAX_SIZE, mss,
> +			      l2_hdrlen, fillsize);
> +	if (elem_cnt < 0) {
> +		tcp_rst(c, conn);
> +		return -ENOMEM;
> +	}
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = elem_cnt + 1;
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, elem_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {
> +		vu_queue_rewind(vq, elem_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*dlen = ret;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * tcp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @first:	Pointer to the array of IO vectors
> + * @dlen:	Packet data length
> + * @check:	Checksum, if already known
> + */
> +static void tcp_vu_prepare(const struct ctx *c,
> +			   struct tcp_tap_conn *conn, struct iovec *first,
> +			   size_t dlen, const uint16_t **check)
> +{
> +	const struct flowside *toside = TAPFLOW(conn);
> +	char *base = first->iov_base;
> +	struct ethhdr *eh;
> +
> +	/* we guess the first iovec provided by the guest can embed
> +	 * all the headers needed by L2 frame
> +	 */

What happens if it doesn't (buggy guest)? Do we have a way to make sure
it's the case?

I guess it's more straightforward to do this in tcp_vu_data_from_sock()
where we check and set iov_len (even though the implication of
VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).
> + > +=09eh =3D vu_eth(base); > + > +=09memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); > +=09memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); > + > +=09/* initialize header */ > +=09if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) { > +=09=09struct tcp_payload_t *payload; > +=09=09struct iphdr *iph; > + > +=09=09ASSERT(first[0].iov_len >=3D sizeof(struct virtio_net_hdr_mrg_rxbu= f) + > +=09=09 sizeof(struct ethhdr) + sizeof(struct iphdr) + > +=09=09 sizeof(struct tcphdr)); > + > +=09=09eh->h_proto =3D htons(ETH_P_IP); > + > +=09=09iph =3D vu_ip(base); > +=09=09*iph =3D (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP); > +=09=09payload =3D vu_payloadv4(base); > +=09=09memset(&payload->th, 0, sizeof(payload->th)); > +=09=09payload->th.doff =3D offsetof(struct tcp_payload_t, data) / 4; > +=09=09payload->th.ack =3D 1; > + > +=09=09tcp_fill_headers4(conn, NULL, iph, payload, dlen, > +=09=09=09=09 *check, conn->seq_to_tap, true); > +=09=09*check =3D &iph->check; > +=09} else { > +=09=09struct tcp_payload_t *payload; > +=09=09struct ipv6hdr *ip6h; > + > +=09=09ASSERT(first[0].iov_len >=3D sizeof(struct virtio_net_hdr_mrg_rxbu= f) + > +=09=09 sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + > +=09=09 sizeof(struct tcphdr)); > + > +=09=09eh->h_proto =3D htons(ETH_P_IPV6); > + > +=09=09ip6h =3D vu_ip(base); > +=09=09*ip6h =3D (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP); > + > +=09=09payload =3D vu_payloadv6(base); > +=09=09memset(&payload->th, 0, sizeof(payload->th)); > +=09=09payload->th.doff =3D offsetof(struct tcp_payload_t, data) / 4; > +=09=09payload->th.ack =3D 1; > + > +=09=09tcp_fill_headers6(conn, NULL, ip6h, payload, dlen, > +=09=09=09=09 conn->seq_to_tap, true); > +=09} > +} > + > +/** > + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost= -user, > + *=09=09=09 in window > + * @c:=09=09Execution context > + * @conn:=09Connection pointer > + * > + * Return: Negative on connection reset, 0 otherwise > + */ > +int 
tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn= ) > +{ > +=09uint32_t wnd_scaled =3D conn->wnd_from_tap << conn->ws_from_tap; > +=09struct vu_dev *vdev =3D c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09const struct flowside *tapside =3D TAPFLOW(conn); > +=09uint16_t mss =3D MSS_GET(conn); > +=09size_t l2_hdrlen, fillsize; > +=09int i, iov_cnt, iov_used; > +=09int v4 =3D CONN_V4(conn); > +=09uint32_t already_sent =3D 0; > +=09const uint16_t *check; > +=09struct iovec *first; > +=09int frame_size; > +=09int num_buffers; > +=09ssize_t len; > + > +=09if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) { > +=09=09flow_err(conn, > +=09=09=09 "Got packet, but RX virtqueue not usable yet"); > +=09=09return 0; > +=09} > + > +=09already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap; > + > +=09if (SEQ_LT(already_sent, 0)) { > +=09=09/* RFC 761, section 2.1. */ > +=09=09flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u", > +=09=09=09 conn->seq_ack_from_tap, conn->seq_to_tap); > +=09=09conn->seq_to_tap =3D conn->seq_ack_from_tap; > +=09=09already_sent =3D 0; > +=09} > + > +=09if (!wnd_scaled || already_sent >=3D wnd_scaled) { > +=09=09conn_flag(c, conn, STALLED); > +=09=09conn_flag(c, conn, ACK_FROM_TAP_DUE); > +=09=09return 0; > +=09} > + > +=09/* Set up buffer descriptors we'll fill completely and partially. 
*/ > + > +=09fillsize =3D wnd_scaled; > + > +=09if (peek_offset_cap) > +=09=09already_sent =3D 0; > + > +=09iov_vu[0].iov_base =3D tcp_buf_discard; > +=09iov_vu[0].iov_len =3D already_sent; > +=09fillsize -=3D already_sent; > + > +=09/* collect the buffers from vhost-user and fill them with the > +=09 * data from the socket > +=09 */ > +=09iov_cnt =3D tcp_vu_sock_recv(c, conn, v4, fillsize, &len); > +=09if (iov_cnt <=3D 0) > +=09=09return iov_cnt; > + > +=09len -=3D already_sent; > +=09if (len <=3D 0) { > +=09=09conn_flag(c, conn, STALLED); > +=09=09vu_queue_rewind(vq, iov_cnt); > +=09=09return 0; > +=09} > + > +=09conn_flag(c, conn, ~STALLED); > + > +=09/* Likely, some new data was acked too. */ > +=09tcp_update_seqack_wnd(c, conn, 0, NULL); > + > +=09/* initialize headers */ > +=09l2_hdrlen =3D tcp_vu_l2_hdrlen(!v4); > +=09iov_used =3D 0; > +=09num_buffers =3D 0; > +=09check =3D NULL; > +=09frame_size =3D 0; > + > +=09/* iov_vu is an array of buffers and the buffer size can be > +=09 * smaller than the frame size we want to use but with > +=09 * num_buffer we can merge several virtio iov buffers in one packet > +=09 * we need only to set the packet headers in the first iov and > +=09 * num_buffer to the number of iov entries ...this part is clear to me, what I don't understand is if we still have a way to guarantee that the sum of several buffers is big enough to fit frame_size bytes. > +=09 */ > +=09for (i =3D 0; i < iov_cnt && len; i++) { > + Excess newline. 
> +=09=09if (frame_size =3D=3D 0) > +=09=09=09first =3D &iov_vu[i + 1]; > + > +=09=09if (iov_vu[i + 1].iov_len > (size_t)len) > +=09=09=09iov_vu[i + 1].iov_len =3D len; > + > +=09=09len -=3D iov_vu[i + 1].iov_len; > +=09=09iov_used++; > + > +=09=09frame_size +=3D iov_vu[i + 1].iov_len; > +=09=09num_buffers++; > + > +=09=09if (frame_size >=3D mss || len =3D=3D 0 || > +=09=09 i + 1 =3D=3D iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG= _RXBUF)) { > +=09=09=09if (i + 1 =3D=3D iov_cnt) > +=09=09=09=09check =3D NULL; > + > +=09=09=09/* restore first iovec base: point to vnet header */ > +=09=09=09vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen); > + > +=09=09=09tcp_vu_prepare(c, conn, first, frame_size, &check); > +=09=09=09if (*c->pcap) { > +=09=09=09=09tcp_vu_update_check(tapside, first, num_buffers); > +=09=09=09=09pcap_iov(first, num_buffers, > +=09=09=09=09=09 sizeof(struct virtio_net_hdr_mrg_rxbuf)); > +=09=09=09} > + > +=09=09=09conn->seq_to_tap +=3D frame_size; We always increase this, even if, later... > + > +=09=09=09frame_size =3D 0; > +=09=09=09num_buffers =3D 0; > +=09=09} > +=09} > + > +=09/* release unused buffers */ > +=09vu_queue_rewind(vq, iov_cnt - iov_used); > + > +=09/* send packets */ > +=09vu_flush(vdev, vq, elem, iov_used); we fail to send packets, that is, even if vu_queue_fill_by_index() returns early because (!vq->vring.avail). We had this same issue on the non-vhost-user path until commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") (completely reworked with time). There, it was pretty bad with small (default) values for wmem_max and rmem_max. Now, I _guess_ with vhost-user it won't be so easy to hit that, because virtqueue buffers are (altogether) bigger, so we can probably fix this later, but if it's not exceedingly complicated, we should consider fixing it now. If we hit something like that, the behaviour is pretty bad, with constant retransmissions and stalls. 
The mapping between queued frames and connections is done in tcp_data_to_tap(), where tcp4_frame_conns[] and tcp6_frame_conns[] items are set to the current (highest) value of tcp4_payload_used and tcp6_payload_used. Then, if we fail to transmit some frames, tcp_revert_seq() uses those arrays to revert the seq_to_tap values. I guess you could make vu_queue_fill_by_index() return an error, propagate it, and make vu_flush() call something like tcp_revert_seq() in case. > + > +=09conn_flag(c, conn, ACK_FROM_TAP_DUE); > + > +=09return 0; > +} > diff --git a/tcp_vu.h b/tcp_vu.h > new file mode 100644 > index 000000000000..6ab6057f352a > --- /dev/null > +++ b/tcp_vu.h > @@ -0,0 +1,12 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* Copyright Red Hat > + * Author: Laurent Vivier > + */ > + > +#ifndef TCP_VU_H > +#define TCP_VU_H > + > +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int= flags); > +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn= ); > + > +#endif /*TCP_VU_H */ > diff --git a/udp.c b/udp.c > index 8fc5d8099310..1171d9d1a75b 100644 > --- a/udp.c > +++ b/udp.c > @@ -628,6 +628,11 @@ void udp_listen_sock_handler(const struct ctx *c, > =09=09=09 union epoll_ref ref, uint32_t events, > =09=09=09 const struct timespec *now) > { > +=09if (c->mode =3D=3D MODE_VU) { > +=09=09udp_vu_listen_sock_handler(c, ref, events, now); > +=09=09return; > +=09} > + > =09udp_buf_listen_sock_handler(c, ref, events, now); > } > =20 > @@ -697,6 +702,11 @@ static void udp_buf_reply_sock_handler(const struct = ctx *c, union epoll_ref ref, > void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, > =09=09=09 uint32_t events, const struct timespec *now) > { > +=09if (c->mode =3D=3D MODE_VU) { > +=09=09udp_vu_reply_sock_handler(c, ref, events, now); > +=09=09return; > +=09} > + > =09udp_buf_reply_sock_handler(c, ref, events, now); > } > =20 > diff --git a/udp_vu.c b/udp_vu.c > new file mode 100644 > index 
000000000000..3cb76945c9c1 > --- /dev/null > +++ b/udp_vu.c > @@ -0,0 +1,336 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* udp_vu.c - UDP L2 vhost-user management functions > + * > + * Copyright Red Hat > + * Author: Laurent Vivier > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "checksum.h" > +#include "util.h" > +#include "ip.h" > +#include "siphash.h" > +#include "inany.h" > +#include "passt.h" > +#include "pcap.h" > +#include "log.h" > +#include "vhost_user.h" > +#include "udp_internal.h" > +#include "flow.h" > +#include "flow_table.h" > +#include "udp_flow.h" > +#include "udp_vu.h" > +#include "vu_common.h" > + > +static struct iovec iov_vu=09=09[VIRTQUEUE_MAX_SIZE]; > +static struct vu_virtq_element=09elem=09=09[VIRTQUEUE_MAX_SIZE]; > + > +/** > + * udp_vu_l2_hdrlen() - return the size of the header in level 2 frame (= UDP) > + * @v6:=09=09Set for IPv6 packet > + * > + * Return: Return the size of the header > + */ > +static size_t udp_vu_l2_hdrlen(bool v6) > +{ > +=09size_t l2_hdrlen; > + > +=09l2_hdrlen =3D sizeof(struct ethhdr) + sizeof(struct udphdr); > + > +=09if (v6) > +=09=09l2_hdrlen +=3D sizeof(struct ipv6hdr); > +=09else > +=09=09l2_hdrlen +=3D sizeof(struct iphdr); > + > +=09return l2_hdrlen; > +} > + > +static int udp_vu_sock_init(int s, union sockaddr_inany *s_in) > +{ > +=09struct msghdr msg =3D { > +=09=09.msg_name =3D s_in, > +=09=09.msg_namelen =3D sizeof(union sockaddr_inany), > +=09}; > + > +=09return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT); > +} > + > +/** > + * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user bu= ffers > + * @c:=09=09Execution context > + * @s:=09=09Socket to receive from > + * @events:=09epoll events bitmap > + * @v6:=09=09Set for IPv6 connections > + * @datalen:=09Size of received data (output) Now it's dlen. 
> + * > + * Return: Number of iov entries used to store the datagram > + */ > +static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events, > +=09=09=09 bool v6, ssize_t *dlen) > +{ > +=09struct vu_dev *vdev =3D c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09int max_elem, iov_cnt, idx, iov_used; > +=09struct msghdr msg =3D { 0 }; > +=09size_t off, l2_hdrlen; > + > +=09ASSERT(!c->no_udp); > + > +=09if (!(events & EPOLLIN)) > +=09=09return 0; > + > +=09/* compute L2 header length */ > + > +=09if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) > +=09=09max_elem =3D VIRTQUEUE_MAX_SIZE; > +=09else > +=09=09max_elem =3D 1; > + > +=09l2_hdrlen =3D udp_vu_l2_hdrlen(v6); > + > +=09vu_init_elem(elem, iov_vu, max_elem); > + > +=09iov_cnt =3D vu_collect_one_frame(vdev, vq, elem, max_elem, > +=09=09=09 ETH_MAX_MTU - l2_hdrlen, > +=09=09=09 l2_hdrlen, NULL); The indentation is a bit weird here, I would expect the fifth and following arguments to be aligned under (. > +=09if (iov_cnt =3D=3D 0) > +=09=09return 0; > + > +=09msg.msg_iov =3D iov_vu; > +=09msg.msg_iovlen =3D iov_cnt; > + > +=09*dlen =3D recvmsg(s, &msg, 0); > +=09if (*dlen < 0) { > +=09=09vu_queue_rewind(vq, iov_cnt); > +=09=09return 0; > +=09} > + > +=09/* count the numbers of buffer filled by recvmsg() */ > +=09idx =3D iov_skip_bytes(iov_vu, iov_cnt, *dlen, &off); > + > +=09/* adjust last iov length */ > +=09if (idx < iov_cnt) > +=09=09iov_vu[idx].iov_len =3D off; > +=09iov_used =3D idx + !!off; > + > +=09/* we have at least the header */ > +=09if (iov_used =3D=3D 0) > +=09=09iov_used =3D 1; Is iov_used =3D=3D 0 the only case where we need to add 1 to it? 
> + > +=09/* release unused buffers */ > +=09vu_queue_rewind(vq, iov_cnt - iov_used); > + > +=09vu_set_vnethdr(vdev, &iov_vu[0], iov_used, l2_hdrlen); > + > +=09return iov_used; > +} > + > +/** > + * udp_vu_prepare() - Prepare the packet header > + * @c:=09=09Execution context > + * @toside:=09Address information for one side of the flow > + * @datalen:=09Packet data length dlen now. > + * > + * Return: Layer-4 length > + */ > +static size_t udp_vu_prepare(const struct ctx *c, > +=09=09=09 const struct flowside *toside, ssize_t dlen) > +{ > +=09struct ethhdr *eh; > +=09size_t l4len; > + > +=09/* ethernet header */ > +=09eh =3D vu_eth(iov_vu[0].iov_base); > + > +=09memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); > +=09memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); > + > +=09/* initialize header */ > +=09if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) { > +=09=09struct iphdr *iph =3D vu_ip(iov_vu[0].iov_base); > +=09=09struct udp_payload_t *bp =3D vu_payloadv4(iov_vu[0].iov_base); > + > +=09=09eh->h_proto =3D htons(ETH_P_IP); > + > +=09=09*iph =3D (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP); > + > +=09=09l4len =3D udp_update_hdr4(iph, bp, toside, dlen, true); > +=09} else { > +=09=09struct ipv6hdr *ip6h =3D vu_ip(iov_vu[0].iov_base); > +=09=09struct udp_payload_t *bp =3D vu_payloadv6(iov_vu[0].iov_base); > + > +=09=09eh->h_proto =3D htons(ETH_P_IPV6); > + > +=09=09*ip6h =3D (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP); > + > +=09=09l4len =3D udp_update_hdr6(ip6h, bp, toside, dlen, true); > +=09} > + > +=09return l4len; > +} > + > +/** > + * udp_vu_csum() - Calculate and set checksum for a UDP packet > + * @toside:=09ddress information for one side of the flow Address > + * @l4len:=09IPv4 Payload length Not actually passed. > + * @iov_used:=09Length of the array "Number of used iov_vu items"? Otherwise it's a bit hard to understand which array you're referring to. 
> + */ > +static void udp_vu_csum(const struct flowside *toside, int iov_used) > +{ > +=09const struct in_addr *src4 =3D inany_v4(&toside->oaddr); > +=09const struct in_addr *dst4 =3D inany_v4(&toside->eaddr); > +=09char *base =3D iov_vu[0].iov_base; > +=09struct udp_payload_t *bp; > + > +=09if (src4 && dst4) { > +=09=09bp =3D vu_payloadv4(base); > +=09=09csum_udp4(&bp->uh, *src4, *dst4, iov_vu, iov_used, > +=09=09=09 (char *)&bp->data - base); > +=09} else { > +=09=09bp =3D vu_payloadv6(base); > +=09=09csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, > +=09=09=09 iov_vu, iov_used, (char *)&bp->data - base); > +=09} > +} > + > +/** > + * udp_vu_listen_sock_handler() - Handle new data from socket > + * @c:=09=09Execution context > + * @ref:=09epoll reference > + * @events:=09epoll events bitmap > + * @now:=09Current timestamp > + */ > +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref= , > +=09=09=09=09uint32_t events, const struct timespec *now) > +{ > +=09struct vu_dev *vdev =3D c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09int i; > + > +=09if (udp_sock_errs(c, ref.fd, events) < 0) { > +=09=09err("UDP: Unrecoverable error on listening socket:" > +=09=09 " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port); > +=09=09return; > +=09} > + > +=09for (i =3D 0; i < UDP_MAX_FRAMES; i++) { > +=09=09const struct flowside *toside; > +=09=09union sockaddr_inany s_in; > +=09=09flow_sidx_t batchsidx; > +=09=09uint8_t batchpif; > +=09=09ssize_t dlen; > +=09=09int iov_used; > +=09=09bool v6; > + > +=09=09if (udp_vu_sock_init(ref.fd, &s_in) < 0) > +=09=09=09break; > + > +=09=09batchsidx =3D udp_flow_from_sock(c, ref, &s_in, now); > +=09=09batchpif =3D pif_at_sidx(batchsidx); > + > +=09=09if (batchpif !=3D PIF_TAP) { > +=09=09=09if (flow_sidx_valid(batchsidx)) { > +=09=09=09=09flow_sidx_t fromsidx =3D flow_sidx_opposite(batchsidx); > +=09=09=09=09struct udp_flow *uflow =3D udp_at_sidx(batchsidx); > + > 
+=09=09=09=09flow_err(uflow, > +=09=09=09=09=09"No support for forwarding UDP from %s to %s", > +=09=09=09=09=09pif_name(pif_at_sidx(fromsidx)), > +=09=09=09=09=09pif_name(batchpif)); > +=09=09=09} else { > +=09=09=09=09debug("Discarding 1 datagram without flow"); > +=09=09=09} > + > +=09=09=09continue; > +=09=09} > + > +=09=09toside =3D flowside_at_sidx(batchsidx); > + > +=09=09v6 =3D !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)); > + > +=09=09iov_used =3D udp_vu_sock_recv(c, ref.fd, events, v6, &dlen); > +=09=09if (iov_used <=3D 0) > +=09=09=09break; > + > +=09=09udp_vu_prepare(c, toside, dlen); > +=09=09if (*c->pcap) { > +=09=09=09udp_vu_csum(toside, iov_used); > +=09=09=09pcap_iov(iov_vu, iov_used, > +=09=09=09=09 sizeof(struct virtio_net_hdr_mrg_rxbuf)); > +=09=09} > +=09=09vu_flush(vdev, vq, elem, iov_used); > +=09} > +} > + > +/** > + * udp_vu_reply_sock_handler() - Handle new data from flow specific sock= et > + * @c:=09=09Execution context > + * @ref:=09epoll reference > + * @events:=09epoll events bitmap > + * @now:=09Current timestamp > + */ > +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref, > +=09=09=09 uint32_t events, const struct timespec *now) > +{ > +=09flow_sidx_t tosidx =3D flow_sidx_opposite(ref.flowside); > +=09const struct flowside *toside =3D flowside_at_sidx(tosidx); > +=09struct udp_flow *uflow =3D udp_at_sidx(ref.flowside); > +=09int from_s =3D uflow->s[ref.flowside.sidei]; > +=09struct vu_dev *vdev =3D c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09int i; > + > +=09ASSERT(!c->no_udp); > + > +=09if (udp_sock_errs(c, from_s, events) < 0) { > +=09=09flow_err(uflow, "Unrecoverable error on reply socket"); > +=09=09flow_err_details(uflow); > +=09=09udp_flow_close(c, uflow); > +=09=09return; > +=09} > + > +=09for (i =3D 0; i < UDP_MAX_FRAMES; i++) { > +=09=09uint8_t topif =3D pif_at_sidx(tosidx); > +=09=09ssize_t dlen; > +=09=09int iov_used; > +=09=09bool v6; > + > 
+=09=09ASSERT(uflow); > + > +=09=09if (topif !=3D PIF_TAP) { > +=09=09=09uint8_t frompif =3D pif_at_sidx(ref.flowside); > + > +=09=09=09flow_err(uflow, > +=09=09=09=09 "No support for forwarding UDP from %s to %s", > +=09=09=09=09 pif_name(frompif), pif_name(topif)); > +=09=09=09continue; > +=09=09} > + > +=09=09v6 =3D !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)); > + > +=09=09iov_used =3D udp_vu_sock_recv(c, from_s, events, v6, &dlen); > +=09=09if (iov_used <=3D 0) > +=09=09=09break; > +=09=09flow_trace(uflow, "Received 1 datagram on reply socket"); > +=09=09uflow->ts =3D now->tv_sec; > + > +=09=09udp_vu_prepare(c, toside, dlen); > +=09=09if (*c->pcap) { > +=09=09=09udp_vu_csum(toside, iov_used); > +=09=09=09pcap_iov(iov_vu, iov_used, > +=09=09=09=09 sizeof(struct virtio_net_hdr_mrg_rxbuf)); > +=09=09} > +=09=09vu_flush(vdev, vq, elem, iov_used); > +=09} > +} > diff --git a/udp_vu.h b/udp_vu.h > new file mode 100644 > index 000000000000..ba7018d3bf01 > --- /dev/null > +++ b/udp_vu.h > @@ -0,0 +1,13 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* Copyright Red Hat > + * Author: Laurent Vivier > + */ > + > +#ifndef UDP_VU_H > +#define UDP_VU_H > + > +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref= , > +=09=09=09=09uint32_t events, const struct timespec *now); > +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref, > +=09=09=09 uint32_t events, const struct timespec *now); > +#endif /* UDP_VU_H */ > diff --git a/vhost_user.c b/vhost_user.c > index 1e302926b8fe..e905f3329f71 100644 > --- a/vhost_user.c > +++ b/vhost_user.c > @@ -48,12 +48,13 @@ > /* vhost-user version we are compatible with */ > #define VHOST_USER_VERSION 1 > =20 > +static struct vu_dev vdev_storage; I see that struct vu_dev is 1564 bytes (on x86), but struct ctx is 580320 bytes (because of tcp_ctx and udp_ctx), so I wouldn't see a problem if you embedded this directly into struct ctx without a pointer. 
> + > /** > * vu_print_capabilities() - print vhost-user capabilities > * =09=09=09 this is part of the vhost-user backend > * =09=09=09 convention. > */ > -/* cppcheck-suppress unusedFunction */ > void vu_print_capabilities(void) > { > =09info("{"); > @@ -163,9 +164,7 @@ static void vmsg_close_fds(const struct vhost_user_ms= g *vmsg) > */ > static void vu_remove_watch(const struct vu_dev *vdev, int fd) > { > -=09/* Placeholder to add passt related code */ > -=09(void)vdev; > -=09(void)fd; > +=09epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL); > } > =20 > /** > @@ -487,6 +486,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vde= v, > =09=09} > =09} > =20 > +=09/* As vu_packet_check_range() has no access to the number of > +=09 * memory regions, mark the end of the array with mmap_addr =3D 0 > +=09 */ > +=09ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1); > +=09vdev->regions[vdev->nregions].mmap_addr =3D 0; > + > +=09tap_sock_update_pool(vdev->regions, 0); > + > =09return false; > } > =20 > @@ -615,9 +622,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vd= ev, > */ > static void vu_set_watch(const struct vu_dev *vdev, int fd) > { > -=09/* Placeholder to add passt related code */ > -=09(void)vdev; > -=09(void)fd; > +=09union epoll_ref ref =3D { .type =3D EPOLL_TYPE_VHOST_KICK, .fd =3D fd= }; > +=09struct epoll_event ev =3D { 0 }; > + > +=09ev.data.u64 =3D ref.u64; > +=09ev.events =3D EPOLLIN; > +=09epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev); > } > =20 > /** > @@ -829,14 +839,14 @@ static bool vu_set_vring_enable_exec(struct vu_dev = *vdev, > * @c:=09=09execution context > * @vdev:=09vhost-user device > */ > -/* cppcheck-suppress unusedFunction */ > -void vu_init(struct ctx *c, struct vu_dev *vdev) > +void vu_init(struct ctx *c) > { > =09int i; > =20 > -=09vdev->context =3D c; > +=09c->vdev =3D &vdev_storage; > +=09c->vdev->context =3D c; > =09for (i =3D 0; i < VHOST_USER_MAX_QUEUES; i++) { > -=09=09vdev->vq[i] =3D (struct 
vu_virtq){ > +=09=09c->vdev->vq[i] =3D (struct vu_virtq){ >From previous patch: missing whitespace between ) and { > =09=09=09.call_fd =3D -1, > =09=09=09.kick_fd =3D -1, > =09=09=09.err_fd =3D -1, > @@ -849,7 +859,6 @@ void vu_init(struct ctx *c, struct vu_dev *vdev) > * vu_cleanup() - Reset vhost-user device > * @vdev:=09vhost-user device > */ > -/* cppcheck-suppress unusedFunction */ > void vu_cleanup(struct vu_dev *vdev) > { > =09unsigned int i; > @@ -896,8 +905,7 @@ void vu_cleanup(struct vu_dev *vdev) > */ > static void vu_sock_reset(struct vu_dev *vdev) > { > -=09/* Placeholder to add passt related code */ > -=09(void)vdev; > +=09tap_sock_reset(vdev->context); > } > =20 > static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev, > @@ -925,7 +933,6 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_de= v *vdev, > * @fd:=09=09vhost-user message socket > * @events:=09epoll events > */ > -/* cppcheck-suppress unusedFunction */ > void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) > { > =09struct vhost_user_msg msg =3D { 0 }; > diff --git a/vhost_user.h b/vhost_user.h > index 5af349ba58b8..464ba21e962f 100644 > --- a/vhost_user.h > +++ b/vhost_user.h > @@ -183,7 +183,6 @@ struct vhost_user_msg { > * > * Return: true if the virqueue is enabled, false otherwise > */ > -/* cppcheck-suppress unusedFunction */ > static inline bool vu_queue_enabled(const struct vu_virtq *vq) > { > =09return vq->enable; > @@ -195,14 +194,13 @@ static inline bool vu_queue_enabled(const struct vu= _virtq *vq) > * > * Return: true if the virqueue is started, false otherwise > */ > -/* cppcheck-suppress unusedFunction */ > static inline bool vu_queue_started(const struct vu_virtq *vq) > { > =09return vq->started; > } > =20 > void vu_print_capabilities(void); > -void vu_init(struct ctx *c, struct vu_dev *vdev); > +void vu_init(struct ctx *c); > void vu_cleanup(struct vu_dev *vdev); > void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events); > #endif /* 
VHOST_USER_H */ > diff --git a/virtio.c b/virtio.c > index 380590afbca3..0598ff479858 100644 > --- a/virtio.c > +++ b/virtio.c > @@ -328,7 +328,6 @@ static bool vring_can_notify(const struct vu_dev *dev= , struct vu_virtq *vq) > * @dev:=09Vhost-user device > * @vq:=09=09Virtqueue > */ > -/* cppcheck-suppress unusedFunction */ > void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq) > { > =09if (!vq->vring.avail) > @@ -504,7 +503,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, stru= ct vu_virtq *vq, unsigned i > * > * Return: -1 if there is an error, 0 otherwise > */ > -/* cppcheck-suppress unusedFunction */ > int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virt= q_element *elem) > { > =09unsigned int head; > @@ -565,7 +563,6 @@ void vu_queue_unpop(struct vu_virtq *vq) > * @vq:=09=09Virtqueue > * @num:=09Number of element to unpop > */ > -/* cppcheck-suppress unusedFunction */ > bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num) > { > =09if (num > vq->inuse) > @@ -621,7 +618,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsi= gned int index, > * @len:=09Size of the element > * @idx:=09Used ring entry index > */ > -/* cppcheck-suppress unusedFunction */ > void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *e= lem, > =09=09 unsigned int len, unsigned int idx) > { > @@ -645,7 +641,6 @@ static inline void vring_used_idx_set(struct vu_virtq= *vq, uint16_t val) > * @vq:=09=09Virtqueue > * @count:=09Number of entry to flush > */ > -/* cppcheck-suppress unusedFunction */ > void vu_queue_flush(struct vu_virtq *vq, unsigned int count) > { > =09uint16_t old, new; > diff --git a/vu_common.c b/vu_common.c > new file mode 100644 > index 000000000000..4977d6af0f92 > --- /dev/null > +++ b/vu_common.c > @@ -0,0 +1,385 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* Copyright Red Hat > + * Author: Laurent Vivier > + * > + * common_vu.c - vhost-user common UDP and TCP functions > + */ > + > +#include > 
+#include > +#include > +#include > + > +#include "util.h" > +#include "passt.h" > +#include "tap.h" > +#include "vhost_user.h" > +#include "pcap.h" > +#include "vu_common.h" > + > +/** > + * vu_packet_check_range() - Check if a given memory zone is contained i= n > + * =09=09=09 a mapped guest memory region > + * @buf:=09Array of the available memory regions > + * @offset:=09Offset of data range in packet descriptor > + * @size:=09Length of desired data range > + * @start:=09Start of the packet descriptor > + * > + * Return: 0 if the zone is in a mapped memory region, -1 otherwise > + */ > +int vu_packet_check_range(void *buf, size_t offset, size_t len, > +=09=09=09 const char *start) > +{ > +=09struct vu_dev_region *dev_region; > + > +=09for (dev_region =3D buf; dev_region->mmap_addr; dev_region++) { > +=09=09/* NOLINTNEXTLINE(performance-no-int-to-ptr) */ > +=09=09char *m =3D (char *)dev_region->mmap_addr; > + > +=09=09if (m <=3D start && > +=09=09 start + offset + len <=3D m + dev_region->mmap_offset + > +=09=09=09=09=09 dev_region->size) > +=09=09=09return 0; > +=09} > + > +=09return -1; > +} > + > +/** > + * vu_init_elem() - initialize an array of virtqueue element with 1 iov = in each > + * @elem:=09Array of virtqueue element to initialize > + * @iov:=09Array of iovec to assign to virtqueue element > + * @elem_cnt:=09Number of virtqueue element > + */ > +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov, int = elem_cnt) > +{ > +=09int i; > + > +=09for (i =3D 0; i < elem_cnt; i++) { > +=09=09elem[i].out_num =3D 0; > +=09=09elem[i].out_sg =3D NULL; > +=09=09elem[i].in_num =3D 1; > +=09=09elem[i].in_sg =3D &iov[i]; > +=09} > +} > + > +/** > + * vu_collect_one_frame() - collect virtio buffers from a given virtqueu= e for > + *=09=09=09 one frame > + * @vdev:=09=09vhost-user device > + * @vq:=09=09=09virtqueue to collect from > + * @elem:=09=09Array of virtqueue element elements > + * =09=09=09each element must be initialized with one iovec entry 
> + * =09=09=09in the in_sg array. > + * @max_elem:=09=09Number of virtqueue element in the array elements (?) > + * @size:=09=09Maximum size of the data in the frame > + * @hdrlen:=09=09Size of the frame header Return: count of usable elements from virtqueue (?) > + */ > +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq, > +=09=09=09 struct vu_virtq_element *elem, int max_elem, > +=09=09=09 size_t size, size_t hdrlen, size_t *frame_size) > +{ > +=09size_t current_size =3D 0; > +=09struct iovec *iov; > +=09int elem_cnt =3D 0; > +=09int ret; > + > +=09/* header is at least virtio_net_hdr_mrg_rxbuf */ > +=09hdrlen +=3D sizeof(struct virtio_net_hdr_mrg_rxbuf); > + > +=09/* collect first (or unique) element, it will contain header */ s/unique/single/ > +=09ret =3D vu_queue_pop(vdev, vq, &elem[0]); > +=09if (ret < 0) > +=09=09goto out; > + > +=09if (elem[0].in_num < 1) { > +=09=09warn("virtio-net receive queue contains no in buffers"); > +=09=09vu_queue_detach_element(vq); > +=09=09goto out; > +=09} > + > +=09iov =3D &elem[elem_cnt].in_sg[0]; > + > +=09ASSERT(iov->iov_len >=3D hdrlen); > + > +=09/* add space for header */ > +=09iov->iov_base =3D (char *)iov->iov_base + hdrlen; > +=09iov->iov_len -=3D hdrlen; > + > +=09if (iov->iov_len > size) > +=09=09iov->iov_len =3D size; > + > +=09elem_cnt++; > +=09current_size =3D iov->iov_len; > + > +=09if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) > +=09=09goto out; > + > +=09/* if possible, coalesce buffers to reach size */ > +=09while (current_size < size && elem_cnt < max_elem) { > + > +=09=09ret =3D vu_queue_pop(vdev, vq, &elem[elem_cnt]); > +=09=09if (ret < 0) > +=09=09=09break; > + > +=09=09if (elem[elem_cnt].in_num < 1) { > +=09=09=09warn("virtio-net receive queue contains no in buffers"); > +=09=09=09vu_queue_detach_element(vq); > +=09=09=09break; > +=09=09} > + > +=09=09iov =3D &elem[elem_cnt].in_sg[0]; > + > +=09=09if (iov->iov_len > size - current_size) > +=09=09=09iov->iov_len =3D size - 
current_size; > + > +=09=09current_size +=3D iov->iov_len; > +=09=09elem_cnt++; > +=09} > + > +out: > +=09if (frame_size) > +=09=09*frame_size =3D current_size; > + > +=09return elem_cnt; > +} > + > +/** > + * vu_collect() - collect virtio buffers from a given virtqueue > + * @vdev:=09=09vhost-user device > + * @vq:=09=09=09virtqueue to collect from > + * @elem:=09=09Array of virtqueue element elements > + * =09=09=09each element must be initialized with one iovec entry > + * =09=09=09in the in_sg array. > + * @max_elem:=09=09Number of virtqueue element in the array > + * @max_frame_size:=09Maximum size of the data in the frame > + * @hdrlen:=09=09Size of the frame header > + * @size:=09=09Total size of the buffers we need to collect > + * =09=09=09(if size > max_frame_size, we collect several frame) frames * Return: number of available buffers > + */ > +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq, > +=09 struct vu_virtq_element *elem, int max_elem, > +=09 size_t max_frame_size, size_t hdrlen, size_t size) > +{ > +=09int elem_cnt =3D 0; > + > +=09while (size > 0 && elem_cnt < max_elem) { > +=09=09size_t frame_size; > +=09=09int cnt; > + > +=09=09if (max_frame_size > size) > +=09=09=09max_frame_size =3D size; > + > +=09=09cnt =3D vu_collect_one_frame(vdev, vq, > +=09=09=09=09=09 &elem[elem_cnt], max_elem - elem_cnt, > +=09=09=09=09=09 max_frame_size, hdrlen, &frame_size); > +=09=09if (cnt =3D=3D 0) > +=09=09=09break; > + > +=09=09size -=3D frame_size; > +=09=09elem_cnt +=3D cnt; > + > +=09=09if (frame_size < max_frame_size) > +=09=09=09break; > +=09} > + > +=09return elem_cnt; > +} > + > +/** > + * vu_set_vnethdr() - set virtio-net headers in a given iovec > + * @vdev:=09=09vhost-user device > + * @iov:=09=09One iovec to initialize > + * @num_buffers:=09Number of guest buffers of the frame > + * @hdrlen:=09=09Size of the frame header > + */ > +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov, > +=09=09 int num_buffers, size_t hdrlen) > +{ 
> +	struct virtio_net_hdr_mrg_rxbuf *vnethdr;
> +
> +	/* header is at least virtio_net_hdr_mrg_rxbuf */
> +	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	/* NOLINTNEXTLINE(clang-analyzer-core.UndefinedBinaryOperatorResult) */
> +	iov->iov_base = (char *)iov->iov_base - hdrlen;
> +	iov->iov_len += hdrlen;
> +
> +	vnethdr = iov->iov_base;
> +	vnethdr->hdr = VU_HEADER;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		vnethdr->num_buffers = htole16(num_buffers);
> +}
> +
> +/**
> + * vu_flush() - flush all the collected buffers to the vhost-user interface
> + * @vdev:	vhost-user device
> + * @vq:		vhost-user virtqueue
> + * @elem:	virtqueue element array to send back to the virqueue

virtqueue

> + * @iov_used:	Length of the array

It's elem_cnt now.

> + */
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_fill(vq, &elem[i], elem[i].in_sg[0].iov_len, i);
> +
> +	vu_queue_flush(vq, elem_cnt);
> +	vu_queue_notify(vdev, vq);
> +}
> +
> +/**
> + * vu_handle_tx() - Receive data from the TX virtqueue
> + * @vdev:	vhost-user device
> + * @index:	index of the virtqueue
> + * @now:	Current timestamp
> + */
> +static void vu_handle_tx(struct vu_dev *vdev, int index,
> +			 const struct timespec *now)
> +{
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +	struct vu_virtq *vq = &vdev->vq[index];
> +	int hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	int out_sg_count;
> +	int count;
> +
> +	if (!VHOST_USER_IS_QUEUE_TX(index)) {
> +		debug("vhost-user: index %d is not a TX queue", index);
> +		return;
> +	}
> +
> +	tap_flush_pools();
> +
> +	count = 0;
> +	out_sg_count = 0;
> +	while (count < VIRTQUEUE_MAX_SIZE) {
> +		int ret;
> +
> +		elem[count].out_num = 1;
> +		elem[count].out_sg = &out_sg[out_sg_count];
> +		elem[count].in_num = 0;
> +		elem[count].in_sg = NULL;
> +		ret = vu_queue_pop(vdev, vq, &elem[count]);
> +		if (ret < 0)
> +			break;
> +		out_sg_count += elem[count].out_num;
> +
> +		if (elem[count].out_num < 1) {
> +			debug("virtio-net header not in first element");
> +			break;
> +		}
> +		ASSERT(elem[count].out_num == 1);
> +
> +		tap_add_packet(vdev->context,
> +			       elem[count].out_sg[0].iov_len - hdrlen,
> +			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
> +		count++;
> +	}
> +	tap_handler(vdev->context, now);
> +
> +	if (count) {
> +		int i;
> +
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill(vq, &elem[i], 0, i);
> +		vu_queue_flush(vq, count);
> +		vu_queue_notify(vdev, vq);
> +	}
> +}
> +
> +/**
> + * vu_kick_cb() - Called on a kick event to start to receive data
> + * @vdev:	vhost-user device
> + * @ref:	epoll reference information
> + * @now:	Current timestamp
> + */
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int idx;
> +
> +	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++) {
> +		if (vdev->vq[idx].kick_fd == ref.fd)
> +			break;
> +	}
> +
> +	if (idx == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc = eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1)
> +		die_perror("vhost-user kick eventfd_read()");
> +
> +	debug("vhost-user: ot kick_data: %016"PRIx64" idx:%d",

"ot"? Missing space after "idx:".
> +	      kick_data, idx);
> +	if (VHOST_USER_IS_QUEUE_TX(idx))
> +		vu_handle_tx(vdev, idx, now);
> +}
> +
> +/**
> + * vu_send_single() - Send a buffer to the front-end using the RX virtqueue
> + * @c:		execution context
> + * @buf:	address of the buffer
> + * @size:	size of the buffer
> + *
> + * Return: number of bytes sent, -1 if there is an error

I would say it returns 0 on error.

> + */
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	size_t total;
> +	int elem_cnt, max_elem;
> +	int i;
> +
> +	debug("vu_send_single size %zu", size);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		max_elem = VIRTQUEUE_MAX_SIZE;
> +	else
> +		max_elem = 1;
> +
> +	vu_init_elem(elem, in_sg, max_elem);
> +
> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem, size,
> +					0, &total);
> +	if (total < size) {
> +		debug("vu_send_single: no space to send the data "
> +		      "elem_cnt %d size %zd", elem_cnt, total);
> +		goto err;
> +	}
> +
> +	vu_set_vnethdr(vdev, in_sg, elem_cnt, 0);
> +
> +	/* copy data from the buffer to the iovec */
> +	iov_from_buf(in_sg, elem_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf),
> +		     buf, size);
> +
> +	if (*c->pcap) {
> +		pcap_iov(in_sg, elem_cnt,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +
> +	vu_flush(vdev, vq, elem, elem_cnt);
> +
> +	debug("vhost-user sent %zu", total);
> +
> +	return total;
> +err:
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_detach_element(vq);
> +
> +	return 0;
> +}
> diff --git a/vu_common.h b/vu_common.h
> new file mode 100644
> index 000000000000..1d6048060059
> --- /dev/null
> +++ b/vu_common.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier
> + *
> + * vhost-user common UDP and TCP functions
> + */
> +
> +#ifndef VU_COMMON_H
> +#define VU_COMMON_H
> +#include
> +
> +static inline void *vu_eth(void *base)
> +{
> +	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +}
> +
> +static inline void *vu_ip(void *base)
> +{
> +	return (struct ethhdr *)vu_eth(base) + 1;
> +}
> +
> +static inline void *vu_payloadv4(void *base)
> +{
> +	return (struct iphdr *)vu_ip(base) + 1;
> +}
> +
> +static inline void *vu_payloadv6(void *base)
> +{
> +	return (struct ipv6hdr *)vu_ip(base) + 1;
> +}
> +
> +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
> +		  int elem_cnt);
> +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
> +			 struct vu_virtq_element *elem, int max_elem,
> +			 size_t size, size_t hdrlen, size_t *frame_size);
> +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
> +	       struct vu_virtq_element *elem, int max_elem,
> +	       size_t max_frame_size, size_t hdrlen, size_t size);
> +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
> +                    int num_buffers, size_t hdrlen);
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt);
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now);
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> +#endif /* VU_COMMON_H */

-- 
Stefano