Date: Thu, 19 Sep 2024 15:51:43 +0200
From: Stefano Brivio
To: Laurent Vivier
CC: passt-dev@passt.top
Subject: Re: [PATCH v5 4/4] vhost-user: add vhost-user
Message-ID: <20240919155143.437ef709@elisabeth>
In-Reply-To: <20240913162036.3635783-5-lvivier@redhat.com>
References: <20240913162036.3635783-1-lvivier@redhat.com> <20240913162036.3635783-5-lvivier@redhat.com>
Organization: Red Hat

Sorry for the delay, I wanted first to finish extending tests to run
also functional ones (not just throughput and latency) with vhost-user,
but it's taking me a bit longer than expected, so here comes the review.

By the way, by mistake I let passt run in non-vhost-user mode while QEMU
was configured to use it. This results in a loop:

  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: Failed to read msg header. Read 0 instead of 12. Original request 1.
  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: vhost_backend_init failed: Protocol error
  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: failed to init vhost_net for queue 0
  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: Failed to read msg header. Read 0 instead of 12. Original request 1.
  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: vhost_backend_init failed: Protocol error
  qemu-system-x86_64: -netdev vhost-user,id=netdev0,chardev=chr0: failed to init vhost_net for queue 0
  ...

and passt says:

  accepted connection from PID 4807
  Bad frame size from guest, resetting connection
  Client connection closed
  accepted connection from PID 4807
  Bad frame size from guest, resetting connection
  Client connection closed
  ...

while happily flooding system logs. I guess it should be fixed in QEMU
at some point: if the vhost_net initialisation fails, I don't see the
point in retrying.
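
A possible explanation for the "Bad frame size" message, as a minimal
standalone sketch and not the actual passt code: in non-vhost-user mode
each frame is preceded by a 32-bit big-endian length (vnet_len, set with
htonl()), while QEMU starts by sending a 12-byte vhost-user header, so
the first four bytes of that header would be read as a frame length. The
header layout, the sketch_* names and the size bounds below are
assumptions made for illustration only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the fixed vhost-user message header */
struct sketch_vu_hdr {
	uint32_t request;	/* 1 in the QEMU log above */
	uint32_t flags;
	uint32_t size;
};

/* QEMU reports "Read 0 instead of 12": the header is 12 bytes long */
static_assert(sizeof(struct sketch_vu_hdr) == 12, "unexpected header size");

/* Assumed bounds for a valid L2 frame, for illustration only */
#define SKETCH_MIN_FRAME	14U	/* Ethernet header */
#define SKETCH_MAX_FRAME	65535U

/* Read the first four bytes of a vhost-user header as if they were the
 * big-endian frame length used in non-vhost-user mode
 */
static uint32_t sketch_misread_len(uint32_t request)
{
	struct sketch_vu_hdr hdr = { .request = request };
	uint32_t prefix;

	memcpy(&prefix, &hdr, sizeof(prefix));
	return ntohl(prefix);
}

/* Would a frame of this length be rejected? */
static bool sketch_frame_is_bad(uint32_t l2len)
{
	return l2len < SKETCH_MIN_FRAME || l2len > SKETCH_MAX_FRAME;
}
```

Under these assumptions, request 1 reads as 16777216 on a little-endian
host and as 1 on a big-endian one. Both fall outside the assumed bounds,
which would be consistent with the connection being reset every time QEMU
retries.
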
This is without the "reconnect" option by the way:

  $ qemu-system-$(uname -m) -machine accel=kvm -M accel=kvm:tcg -m 16G -cpu host -smp 6 -kernel /home/sbrivio/nf/arch/x86/boot/bzImage -initrd /home/sbrivio/passt/test/mbuto.img -nographic -serial stdio -nodefaults -append "console=ttyS0 mitigations=off apparmor=0" -chardev socket,id=chr0,path=/tmp/passt-tests-TLZU2Y/passt_in_ns/passt.socket -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=16G -numa node,memdev=memfd0 -pidfile /tmp/passt-tests-TLZU2Y/passt_in_ns/qemu.pid -device vhost-vsock-pci,guest-cid=94557

On Fri, 13 Sep 2024 18:20:34 +0200
Laurent Vivier wrote:

> add virtio and vhost-user functions to connect with QEMU.
>
>   $ ./passt --vhost-user
>
> and
>
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
>
> Signed-off-by: Laurent Vivier
> ---
>  Makefile       |   6 +-
>  checksum.c     |   1 -
>  conf.c         |  23 +-
>  epoll_type.h   |   4 +
>  isolation.c    |  17 +-
>  packet.c       |  11 +
>  packet.h       |   8 +-
>  passt.1        |  10 +-
>  passt.c        |  26 +-
>  passt.h        |   6 +
>  pcap.c         |   1 -
>  tap.c          | 111 +++++++--
>  tap.h          |   5 +-
>  tcp.c          |  31 ++-
>  tcp_buf.c      |   8 +-
>  tcp_internal.h |   3 +-
>  tcp_vu.c       | 647 +++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h       |  12 +
>  udp.c          |  78 +++---
>  udp.h          |   8 +-
>  udp_internal.h |  34 +++
>  udp_vu.c       | 397 ++++++++++++++++++++++++++++++
>  udp_vu.h       |  13 +
>  vhost_user.c   |  32 +--
>  virtio.c       |   1 -
>  vu_common.c    |  36 +++
>  vu_common.h    |  34 +++
>  27 files changed, 1457 insertions(+), 106 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_internal.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
>
> diff --git a/Makefile b/Makefile
> index 0e8ed60a0da1..1e8910dda1f4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
> +	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> 
> @@ -64,7 +65,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h vhost_user.h virtio.h
> +	tcp_vu.h udp.h
udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> +	virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
> 
>  C := \#include \nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/checksum.c b/checksum.c
> index 006614fcbb28..aa5b7ae1cb66 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -501,7 +501,6 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
>   *
>   * Return: 16-bit folded, complemented checksum
>   */
> -/* cppcheck-suppress unusedFunction */
>  uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
>  {
>  	unsigned int i;
> diff --git a/conf.c b/conf.c
> index b27588649af3..eb8e1685713a 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -45,6 +45,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
> 
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -769,9 +770,14 @@ static void usage(const char *name, FILE *f, int status)
>  			" default: same interface name as external one\n");
>  	} else {
>  		fprintf(f,
> -			" -s, --socket PATH	UNIX domain socket path\n"
> +			" -s, --socket, --socket-path PATH	UNIX domain socket path\n"
>  			" default: probe free path starting from "
>  			UNIX_SOCK_PATH "\n", 1);
> +		fprintf(f,
> +			" --vhost-user		Enable vhost-user mode\n"
> +			" UNIX domain socket is provided by -s option\n"
> +			" --print-capabilities	print back-end capabilities in JSON format,\n"
> +			" only meaningful for vhost-user mode\n");
>  	}
> 
>  	fprintf(f,
> @@ -1291,6 +1297,10 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"netns-only",	no_argument,		NULL,		20 },
>  		{"map-host-loopback", required_argument, NULL,		21 },
>  		{"map-guest-addr", required_argument,	NULL,		22 },
> +		{"vhost-user",	no_argument,		NULL,		23 },
> +		/* vhost-user backend program convention */
> +		{"print-capabilities", no_argument,	NULL,		24 },
> +		{"socket-path",	required_argument,	NULL,		's' },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1429,7 +1439,6 @@ void conf(struct ctx *c, int argc, char **argv)
>  				 sizeof(c->ip6.ifname_out), "%s", optarg);
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
>  				die("Invalid interface name: %s", optarg);
> -
>  			break;
>  		case 17:
>  			if (c->mode != MODE_PASTA)
> @@ -1468,6 +1477,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			conf_nat(optarg, &c->ip4.map_guest_addr,
>  				 &c->ip6.map_guest_addr, NULL);
>  			break;
> +		case 23:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0], stdout, EXIT_SUCCESS);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 24:
> +			vu_print_capabilities();

I guess you should also check if (c->mode == MODE_PASTA) for this one.

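Something along these lines, as a rough standalone sketch that is not
tested against the tree: only the option values 23 and 24 and the idea
of rejecting pasta mode come from the hunk above, while the sketch_*
names and the helper itself are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for enum passt_modes, for illustration only */
enum sketch_mode { SKETCH_MODE_PASST, SKETCH_MODE_PASTA, SKETCH_MODE_VU };

/* Option values taken from the hunk above */
#define SKETCH_OPT_VHOST_USER		23
#define SKETCH_OPT_PRINT_CAPABILITIES	24

/* Return true if option @opt is acceptable in mode @mode */
static bool sketch_opt_allowed(enum sketch_mode mode, int opt)
{
	/* -1 means getopt_long() is done, it's not an option value */
	assert(opt != -1);

	switch (opt) {
	case SKETCH_OPT_VHOST_USER:
	case SKETCH_OPT_PRINT_CAPABILITIES:
		/* Both options are meant for passt, not for pasta */
		return mode != SKETCH_MODE_PASTA;
	default:
		return true;
	}
}
```

In conf() itself, the equivalent would presumably be the same err() and
usage() pair already used for case 23, placed before the call to
vu_print_capabilities().
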
> +			break;
>  		case 'd':
>  			c->debug = 1;
>  			c->quiet = 0;
> diff --git a/epoll_type.h b/epoll_type.h
> index 0ad1efa0ccec..f3ef41584757 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -36,6 +36,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
> 
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/isolation.c b/isolation.c
> index 45fba1e68b9d..3d5fd60fde46 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -377,14 +377,21 @@ void isolate_postfork(const struct ctx *c)
>  {
>  	struct sock_fprog prog;
> 
> -	prctl(PR_SET_DUMPABLE, 0);
> +	//prctl(PR_SET_DUMPABLE, 0);
> 
> -	if (c->mode == MODE_PASTA) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> -		prog.filter = filter_pasta;
> -	} else {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
>  		prog.filter = filter_passt;
> +		break;
> +	case MODE_PASTA:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> +		prog.filter = filter_pasta;
> +		break;
> +	case MODE_VU:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
> +		prog.filter = filter_vu;
> +		break;
>  	}
> 
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/packet.c b/packet.c
> index 37489961a37e..e5a78d079231 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -36,6 +36,17 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			 const char *start, const char *func, int line)
>  {
> +	if (p->buf_size == 0) {
> +		int ret;
> +
> +		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
> +
> +		if (ret == -1)
> +			trace("cannot find region, %s:%i", func,
line);
> +
> +		return ret;
> +	}
> +
>  	if (start < p->buf) {
>  		trace("packet start %p before buffer start %p, "
>  		 "%s:%i", (void *)start, (void *)p->buf, func, line);
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..3f70e949c066 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -8,8 +8,10 @@
> 
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
> - * @buf:	Buffer storing packet descriptors
> - * @buf_size:	Total size of buffer
> + * @buf:	Buffer storing packet descriptors,
> + * 		a struct vu_dev_region array for passt vhost-user mode
> + * @buf_size:	Total size of buffer,
> + * 		0 for passt vhost-user mode
>   * @size:	Number of usable descriptors for the pool
>   * @count:	Number of used descriptors for the pool
>   * @pkt:	Descriptors: see macros below
> @@ -22,6 +24,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
> 
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			 const char *start);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		 const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/passt.1 b/passt.1
> index 79d134dbe098..822714147be8 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -378,12 +378,20 @@ interface address are configured on a given host interface.
>  .SS \fBpasst\fR-only options
> 
>  .TP
> -.BR \-s ", " \-\-socket " " \fIpath
> +.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
>  Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
>  \fBpasst\fR.
>  Default is to probe a free socket, not accepting connections, starting from
>  \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
> 
> +.TP
> +.BR \-\-vhost-user

I think we should introduce this option as deprecated right away, so
that we can switch to vhost-user mode by default soon (checking if the
hypervisor sends us a vhost-user command) without having to keep this
option around.

At that point, we can add --no-vhost-user instead.

If it makes sense, you could copy the text from --stderr:

  Note that this configuration option is \fBdeprecated\fR and will be removed
  in a future version.

> +Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> +
> +.TP
> +.BR \-\-print-capabilities
> +Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index ad6f0bc32df6..b64efeaf346c 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -74,6 +74,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	 "epoll_type_str[] doesn't match enum epoll_type");
> @@ -206,6 +208,7 @@ int main(int argc, char **argv)
>  	struct rlimit limit;
>  	struct timespec now;
>  	struct sigaction sa;
> +	struct vu_dev vdev;
> 
>  	clock_gettime(CLOCK_MONOTONIC, &log_start);
> 
> @@ -262,6 +265,8 @@ int main(int argc, char **argv)
>  	pasta_netns_quit_init(&c);
> 
>  	tap_sock_init(&c);
> +	if (c.mode == MODE_VU)
> +		vu_init(&c, &vdev);
> 
>  	secret_init(&c);
> 
> @@ -352,14 +357,31 @@ loop:
>  			tcp_timer_handler(&c, ref);
>  			break;
>  		case EPOLL_TYPE_UDP_LISTEN:
> -			udp_listen_sock_handler(&c, ref, eventmask, &now);
> +			if (c.mode == MODE_VU) {

Eventually, we'll probably want to make passt more generic and to
support multiple guests, so at that point this might become
EPOLL_TYPE_UDP_VU_LISTEN if it's a socket we opened for a guest using
vhost-user. Or maybe we'll have to unify the receive paths, so this
will remain EPOLL_TYPE_UDP_LISTEN.

Either way, _if it's more convenient for you right now_, I wouldn't see
any issue in defining new EPOLL_TYPE_UDP_VU_{LISTEN,REPLY} values.

> +				udp_vu_listen_sock_handler(&c, ref, eventmask,
> +							 &now);
> +			} else {
> +				udp_buf_listen_sock_handler(&c, ref, eventmask,
> +							 &now);
> +			}
>  			break;
>  		case EPOLL_TYPE_UDP_REPLY:
> -			udp_reply_sock_handler(&c, ref, eventmask, &now);
> +			if (c.mode == MODE_VU)
> +				udp_vu_reply_sock_handler(&c, ref, eventmask,
> +							 &now);
> +			else
> +				udp_buf_reply_sock_handler(&c, ref, eventmask,
> +							 &now);
>  			break;
>  		case EPOLL_TYPE_PING:
>  			icmp_sock_handler(&c, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			vu_control_handler(&vdev, c.fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(&vdev, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 031c9b669cc4..a98f043c7e64 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -25,6 +25,8 @@ union epoll_ref;
>  #include "fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"
> +#include "vhost_user.h"
> 
>  /* Default address for our end on the tap interface. Bit 0 of byte 0 must be 0
>   * (unicast) and bit 1 of byte 1 must be 1 (locally administered).
Otherwise
> @@ -94,6 +96,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
> 
>  /**
> @@ -227,6 +230,7 @@ struct ip6_ctx {
>   * @no_ra:		Disable router advertisements
>   * @low_wmem:		Low probed net.core.wmem_max
>   * @low_rmem:		Low probed net.core.rmem_max
> + * @vdev:		vhost-user device
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -287,6 +291,8 @@ struct ctx {
> 
>  	int low_wmem;
>  	int low_rmem;
> +
> +	struct vu_dev *vdev;
>  };
> 
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/pcap.c b/pcap.c
> index 46cc4b0d72b6..7e9c56090041 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
>   *		containing packet data to write, including L2 header
>   * @iovcnt:	Number of buffers (@iov entries)
>   */
> -/* cppcheck-suppress unusedFunction */
>  void pcap_iov(const struct iovec *iov, size_t iovcnt)
>  {
>  	struct timespec now;
> diff --git a/tap.c b/tap.c
> index 41af6a6d0c85..3e1b3c13c321 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -58,6 +58,7 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
> 
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -78,16 +79,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
> 
> -	if (c->mode == MODE_PASST) {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
>  		iovcnt++;
> -	}
> -
> -	iov[iovcnt].iov_base = (void *)data;
> -	iov[iovcnt].iov_len = l2len;
> -	iovcnt++;
> +		/* fall through */
> +	case MODE_PASTA:
> +		iov[iovcnt].iov_base = (void *)data;
> +		iov[iovcnt].iov_len = l2len;
> +		iovcnt++;
> 
> -	tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov,
iovcnt, 1);
> +		break;
> +	case MODE_VU:
> +		vu_send(c->vdev, data, l2len);
> +		break;
> +	}
>  }
> 
>  /**
> @@ -406,10 +413,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  	if (!nframes)
>  		return 0;
> 
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> +		break;
> +	case MODE_VU:
> +		/* fall through */
> +	default:
> +		ASSERT(0);
> +	}
> 
>  	if (m < nframes)
>  		debug("tap: failed to send %zu frames of %zu",
> @@ -968,7 +983,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	info("Client connection closed%s", c->one_off ? ", exiting" : "");
> 
> @@ -979,6 +994,8 @@ static void tap_sock_reset(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
>  	close(c->fd_tap);
>  	c->fd_tap = -1;
> +	if (c->mode == MODE_VU)
> +		vu_cleanup(c->vdev);
>  }
> 
>  /**
> @@ -1196,11 +1213,17 @@ static void tap_sock_unix_init(struct ctx *c)
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> 
> -	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> -	info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> -	 c->sock_path);
> -	info("or qrap, for earlier qemu versions:");
> -	info(" ./qrap 5 kvm ...
-net socket,fd=5 -net nic,model=virtio");
> +	if (c->mode == MODE_VU) {
> +		info("You can start qemu with:");
> +		info(" kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		 c->sock_path);
> +	} else {
> +		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info(" kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		 c->sock_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info(" ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	}
>  }
> 
>  /**
> @@ -1210,8 +1233,8 @@ static void tap_sock_unix_init(struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
>  	struct epoll_event ev = { 0 };
> +	union epoll_ref ref;
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
>  	socklen_t len;
> @@ -1251,6 +1274,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
> 
>  	ref.fd = c->fd_tap;
> +	if (c->mode == MODE_VU)
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +	else
> +		ref.type = EPOLL_TYPE_TAP_PASST;
>  	ev.events = EPOLLIN | EPOLLRDHUP;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> @@ -1312,21 +1339,52 @@ static void tap_sock_tun_init(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
> 
> +/**
> + * tap_sock_update_buf() - Set the buffer base and size for the pool of packets
> + * @base:	Buffer base
> + * @size	Buffer size
> + */
> +void tap_sock_update_buf(void *base, size_t size)
> +{
> +	int i;
> +
> +	pool_tap4_storage.buf = base;
> +	pool_tap4_storage.buf_size = size;
> +	pool_tap6_storage.buf = base;
> +	pool_tap6_storage.buf_size = size;
> +
> +	for (i = 0; i < TAP_SEQS; i++) {
> +		tap4_l4[i].p.buf = base;
> +		tap4_l4[i].p.buf_size = size;
> +		tap6_l4[i].p.buf = base;
> +		tap6_l4[i].p.buf_size = size;
> +	}
> +}
> +
>  /**
>   * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
>   * @c:		Execution context
>   */
>  void tap_sock_init(struct ctx *c)
>  {
> -	size_t sz = sizeof(pkt_buf);
> +	size_t sz;
> +	char *buf;
>  	int i;
> 
> -	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
> -	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
> +	if (c->mode == MODE_VU) {
> +		buf = NULL;
> +		sz = 0;
> +	} else {
> +		buf = pkt_buf;
> +		sz = sizeof(pkt_buf);
> +	}
> +
> +	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, buf, sz);
> +	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, buf, sz);
> 
>  	for (i = 0; i < TAP_SEQS; i++) {
> -		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> -		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> +		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
> +		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
>  	}
> 
>  	if (c->fd_tap != -1) { /* Passed as --fd */
> @@ -1335,10 +1393,17 @@ void tap_sock_init(struct ctx *c)
> 
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			break;
> +		}
> 
>  		ev.events = EPOLLIN | EPOLLRDHUP;
>  		ev.data.u64
= ref.u64;
> diff --git a/tap.h b/tap.h
> index ec9e2acec460..c5447f7077eb 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
>   */
>  static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  {
> -	thdr->vnet_len = htonl(l2len);
> +	if (thdr)
> +		thdr->vnet_len = htonl(l2len);
>  }
> 
>  void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
> @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		 const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> +void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_buf(void *base, size_t size);
>  void tap_sock_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
> diff --git a/tcp.c b/tcp.c
> index f9fe1b9a1330..b4b8864799a8 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -304,6 +304,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
> 
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
> @@ -903,6 +904,7 @@ static void tcp_fill_header(struct tcphdr *th,
>   * @dlen:	TCP payload length
>   * @check:	Checksum, if already known
>   * @seq:	Sequence number for this segment
> + * @no_tcp_csum: Do not set TCP checksum
>   *
>   * Return: The IPv4 payload length, host order
>   */
> @@ -910,7 +912,7 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>  				struct tap_hdr *taph,
>  				struct iphdr *iph, struct tcphdr *th,
>  				size_t dlen, const uint16_t *check,
> -				uint32_t seq)
> +				uint32_t seq, bool no_tcp_csum)
>  {
>  	const struct flowside *tapside = TAPFLOW(conn);
>  	const struct in_addr *src4 = inany_v4(&tapside->oaddr);
> @@ -929,7 +931,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
> 
>  	tcp_fill_header(th, conn, seq);
> 
> -	tcp_update_check_tcp4(iph, th);
> +	if (no_tcp_csum)
> +		th->check = 0;
> +	else
> +		tcp_update_check_tcp4(iph, th);
> 
>  	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
> 
> @@ -945,13 +950,14 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>   * @dlen:	TCP payload length
>   * @check:	Checksum, if already known
>   * @seq:	Sequence number for this segment
> + * @no_tcp_csum: Do not set TCP checksum
>   *
>   * Return: The IPv6 payload length, host order
>   */
>  static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
>  				struct tap_hdr *taph,
>  				struct ipv6hdr *ip6h, struct tcphdr *th,
> -				size_t dlen, uint32_t seq)
> +				size_t dlen, uint32_t seq, bool no_tcp_csum)
>  {
>  	const struct flowside *tapside = TAPFLOW(conn);
>  	size_t l4len = dlen + sizeof(*th);
> @@ -970,7 +976,10 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
> 
>  	tcp_fill_header(th, conn, seq);
> 
> -	tcp_update_check_tcp6(ip6h, th);
> +	if (no_tcp_csum)
> +		th->check = 0;
> +	else
> +		tcp_update_check_tcp6(ip6h, th);
> 
>  	tap_hdr_update(taph, l4len + sizeof(*ip6h) + sizeof(struct ethhdr));
> 
> @@ -984,12 +993,14 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
>   * @dlen:	TCP payload length
>   * @check:	Checksum, if already known
>   * @seq:	Sequence number for this segment
> + * @no_tcp_csum: Do not set TCP checksum
>   *
>   * Return: IP payload length, host order
>   */
>  size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  			 struct iovec *iov, size_t dlen,
> -			 const uint16_t *check, uint32_t seq)
> +			 const uint16_t *check, uint32_t seq,
> +			 bool no_tcp_csum)
>  {
>  	const struct flowside *tapside = TAPFLOW(conn);
>  	const struct in_addr *a4 = inany_v4(&tapside->oaddr);
> @@ -998,13 +1009,13 @@ size_t
tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
>  					 iov[TCP_IOV_IP].iov_base,
>  					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
> -					 check, seq);
> +					 check, seq, no_tcp_csum);
>  	}
> 
>  	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
>  				 iov[TCP_IOV_IP].iov_base,
>  				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
> -				 seq);
> +				 seq, no_tcp_csum);
>  }
> 
>  /**
> @@ -1237,6 +1248,9 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>   */
>  int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
> 
> @@ -1630,6 +1644,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   */
>  static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
>  	return tcp_buf_data_from_sock(c, conn);
>  }
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 1a398461a34b..10a663bdfc26 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -320,7 +320,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		return ret;
>  	}
> 
> -	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
> +	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq, false);
>  	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> 
>  	if (flags & DUP_ACK) {
> @@ -381,7 +381,8 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp4_frame_conns[tcp4_payload_used] = conn;
> 
>  		iov = tcp4_l2_iov[tcp4_payload_used++];
> -		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
> +		l4len = tcp_l2_buf_fill_headers(conn, iov,
dlen, check, seq, > +=09=09=09=09=09=09false); > =09=09iov[TCP_IOV_PAYLOAD].iov_len =3D l4len; > =09=09if (tcp4_payload_used > TCP_FRAMES_MEM - 1) > =09=09=09tcp_payload_flush(c); > @@ -389,7 +390,8 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp= _tap_conn *conn, > =09=09tcp6_frame_conns[tcp6_payload_used] =3D conn; > =20 > =09=09iov =3D tcp6_l2_iov[tcp6_payload_used++]; > -=09=09l4len =3D tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq); > +=09=09l4len =3D tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq, > +=09=09=09=09=09=09false); > =09=09iov[TCP_IOV_PAYLOAD].iov_len =3D l4len; > =09=09if (tcp6_payload_used > TCP_FRAMES_MEM - 1) > =09=09=09tcp_payload_flush(c); > diff --git a/tcp_internal.h b/tcp_internal.h > index aa8bb64f1f33..e7fe735bfcb4 100644 > --- a/tcp_internal.h > +++ b/tcp_internal.h > @@ -91,7 +91,8 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *con= n); > =20 > size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, > =09=09=09 struct iovec *iov, size_t dlen, > -=09=09=09 const uint16_t *check, uint32_t seq); > +=09=09=09 const uint16_t *check, uint32_t seq, > +=09=09=09 bool no_tcp_csum); > int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn= , > =09=09=09 int force_seq, struct tcp_info *tinfo); > int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn, int flag= s, > diff --git a/tcp_vu.c b/tcp_vu.c > new file mode 100644 > index 000000000000..e3e32d628524 > --- /dev/null > +++ b/tcp_vu.c > @@ -0,0 +1,647 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* tcp_vu.c - TCP L2 vhost-user management functions > + * > + * Copyright Red Hat > + * Author: Laurent Vivier > + */ > + > +#include > +#include > +#include > + > +#include > + > +#include > + > +#include > +#include > + > +#include "util.h" > +#include "ip.h" > +#include "passt.h" > +#include "siphash.h" > +#include "inany.h" > +#include "vhost_user.h" > +#include "tcp.h" > +#include "pcap.h" > +#include "flow.h" > +#include 
"tcp_conn.h" > +#include "flow_table.h" > +#include "tcp_vu.h" > +#include "tcp_internal.h" > +#include "checksum.h" > +#include "vu_common.h" > + > +/** > + * struct tcp_payload_t - TCP header and data to send segments with payl= oad > + * @th:=09=09TCP header > + * @data:=09TCP data > + */ > +struct tcp_payload_t { > +=09struct tcphdr th; > +=09uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)]; > +}; > + > +/** > + * struct tcp_flags_t - TCP header and data to send zero-length > + *=09=09=09segments (flags) > + * @th:=09=09TCP header > + * @opts=09TCP options > + */ > +struct tcp_flags_t { > +=09struct tcphdr th; > +=09char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; > +}; > + > +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE]; > +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE]; > + > +/** > + * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (= TCP) > + * @v6:=09=09Set for IPv6 packet > + * > + * Return: Return the size of the header > + */ > +static size_t tcp_vu_l2_hdrlen(bool v6) > +{ > +=09size_t l2_hdrlen; > + > +=09l2_hdrlen =3D sizeof(struct virtio_net_hdr_mrg_rxbuf) + sizeof(struct= ethhdr) + > +=09=09 sizeof(struct tcphdr); > + > +=09if (v6) > +=09=09l2_hdrlen +=3D sizeof(struct ipv6hdr); > +=09else > +=09=09l2_hdrlen +=3D sizeof(struct iphdr); > + > +=09return l2_hdrlen; > +} > + > +/** > + * tcp_vu_pcap() - Capture a single frame to pcap file (TCP) > + * @c:=09=09Execution context > + * @tapside:=09Address information for one side of the flow > + * @iov:=09Pointer to the array of IO vectors > + * @iov_used:=09Length of the array > + * @l4len:=09IPv4 Payload length > + */ > +static void tcp_vu_pcap(const struct ctx *c, const struct flowside *taps= ide, 'c' should be const (unless you modify data pointed by it, but I don't see where), otherwise gcc complains: tcp.c: In function =E2=80=98tcp_send_flag=E2=80=99: tcp.c:1249:41: warning: passing argument 1 of =E2=80=98tcp_vu_send_flag=E2= =80=99 discards =E2=80=98const=E2=80=99 qualifier 
from pointer target type [-Wdiscarded-qualifiers]
 1249 |                 return tcp_vu_send_flag(c, conn, flags);
      |                                         ^
In file included from tcp.c:307:
tcp_vu.h:9:34: note: expected ‘struct ctx *’ but argument is of type ‘const struct ctx *’
    9 | int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
      |                      ~~~~~~~~~~~~^

> +			struct iovec *iov, int iov_used, size_t l4len)
> +{
> +	const struct in_addr *src = inany_v4(&tapside->oaddr);
> +	const struct in_addr *dst = inany_v4(&tapside->eaddr);
> +	char *base = iov[0].iov_base;
> +	size_t size = iov[0].iov_len;
> +	struct tcp_payload_t *bp;
> +	uint32_t sum;
> +
> +	if (!*c->pcap)
> +		return;
> +
> +	if (src && dst) {
> +		bp = vu_payloadv4(base);
> +		sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
> +					 *src, *dst);
> +	} else {
> +		bp = vu_payloadv6(base);
> +		sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
> +					 &tapside->oaddr.a6,
> +					 &tapside->eaddr.a6);
> +	}
> +	iov[0].iov_base = &bp->th;
> +	iov[0].iov_len = size - ((char *)iov[0].iov_base - base);
> +	bp->th.check = 0;
> +	bp->th.check = csum_iov(iov, iov_used, sum);
> +
> +	/* set iov for pcap logging */
> +	iov[0].iov_base = base + sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	iov[0].iov_len = size - sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	pcap_iov(iov, iov_used);
> +
> +	/* restore iov[0] */
> +	iov[0].iov_base = base;
> +	iov[0].iov_len = size;
> +}
> +
> +/**
> + * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @flags:	TCP flags: if not set, send segment only if ACK is due
> + *
> + * Return: negative error code on connection reset, 0 otherwise
> + */
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	struct vu_dev *vdev =
c->vdev; > +=09struct vu_virtq *vq =3D &vdev->vq[VHOST_USER_RX_QUEUE]; > +=09const struct flowside *tapside =3D TAPFLOW(conn); > +=09struct virtio_net_hdr_mrg_rxbuf *vh; > +=09struct iovec l2_iov[TCP_NUM_IOVS]; > +=09size_t l2len, l4len, optlen; > +=09struct iovec in_sg; > +=09struct ethhdr *eh; > +=09int nb_ack; > +=09int ret; > + > +=09elem[0].out_num =3D 0; > +=09elem[0].out_sg =3D NULL; > +=09elem[0].in_num =3D 1; > +=09elem[0].in_sg =3D &in_sg; > +=09ret =3D vu_queue_pop(vdev, vq, &elem[0]); > +=09if (ret < 0) > +=09=09return 0; > + > +=09if (elem[0].in_num < 1) { > +=09=09debug("virtio-net receive queue contains no in buffers"); > +=09=09vu_queue_rewind(vq, 1); > +=09=09return 0; > +=09} > + > +=09vh =3D elem[0].in_sg[0].iov_base; > + > +=09vh->hdr =3D VU_HEADER; > +=09if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) > +=09=09vh->num_buffers =3D htole16(1); > + > +=09l2_iov[TCP_IOV_TAP].iov_base =3D NULL; > +=09l2_iov[TCP_IOV_TAP].iov_len =3D 0; > +=09l2_iov[TCP_IOV_ETH].iov_base =3D (char *)elem[0].in_sg[0].iov_base + = sizeof(struct virtio_net_hdr_mrg_rxbuf); > +=09l2_iov[TCP_IOV_ETH].iov_len =3D sizeof(struct ethhdr); > + > +=09eh =3D l2_iov[TCP_IOV_ETH].iov_base; > + > +=09memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); > +=09memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); > + > +=09if (CONN_V4(conn)) { > +=09=09struct tcp_flags_t *payload; > +=09=09struct iphdr *iph; > +=09=09uint32_t seq; > + > +=09=09l2_iov[TCP_IOV_IP].iov_base =3D (char *)l2_iov[TCP_IOV_ETH].iov_ba= se + > +=09=09=09=09=09=09 l2_iov[TCP_IOV_ETH].iov_len; > +=09=09l2_iov[TCP_IOV_IP].iov_len =3D sizeof(struct iphdr); > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_base =3D (char *)l2_iov[TCP_IOV_IP].io= v_base + > +=09=09=09=09=09=09=09 l2_iov[TCP_IOV_IP].iov_len; > + > +=09=09eh->h_proto =3D htons(ETH_P_IP); > + > +=09=09iph =3D l2_iov[TCP_IOV_IP].iov_base; > +=09=09*iph =3D (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP); > + > +=09=09payload =3D l2_iov[TCP_IOV_PAYLOAD].iov_base; > 
+=09=09payload->th =3D (struct tcphdr){ > +=09=09=09.doff =3D offsetof(struct tcp_flags_t, opts) / 4, > +=09=09=09.ack =3D 1 > +=09=09}; > + > +=09=09seq =3D conn->seq_to_tap; > +=09=09ret =3D tcp_prepare_flags(c, conn, flags, &payload->th, payload->o= pts, &optlen); > +=09=09if (ret <=3D 0) { > +=09=09=09vu_queue_rewind(vq, 1); > +=09=09=09return ret; > +=09=09} > + > +=09=09l4len =3D tcp_l2_buf_fill_headers(conn, l2_iov, optlen, NULL, seq, > +=09=09=09=09=09=09true); > +=09=09/* keep the following assignment for clarity */ > +=09=09/* cppcheck-suppress unreadVariable */ > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_len =3D l4len; > + > +=09=09l2len =3D l4len + sizeof(*iph) + sizeof(struct ethhdr); > +=09} else { > +=09=09struct tcp_flags_t *payload; > +=09=09struct ipv6hdr *ip6h; > +=09=09uint32_t seq; > + > +=09=09l2_iov[TCP_IOV_IP].iov_base =3D (char *)l2_iov[TCP_IOV_ETH].iov_ba= se + > +=09=09=09=09=09=09 l2_iov[TCP_IOV_ETH].iov_len; > +=09=09l2_iov[TCP_IOV_IP].iov_len =3D sizeof(struct ipv6hdr); > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_base =3D (char *)l2_iov[TCP_IOV_IP].io= v_base + > +=09=09=09=09=09=09=09 l2_iov[TCP_IOV_IP].iov_len; > + > +=09=09eh->h_proto =3D htons(ETH_P_IPV6); > + > +=09=09ip6h =3D l2_iov[TCP_IOV_IP].iov_base; > +=09=09*ip6h =3D (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP); > + > +=09=09payload =3D l2_iov[TCP_IOV_PAYLOAD].iov_base; > +=09=09payload->th =3D (struct tcphdr){ > +=09=09=09.doff =3D offsetof(struct tcp_flags_t, opts) / 4, > +=09=09=09.ack =3D 1 > +=09=09}; > + > +=09=09seq =3D conn->seq_to_tap; > +=09=09ret =3D tcp_prepare_flags(c, conn, flags, &payload->th, payload->o= pts, &optlen); > +=09=09if (ret <=3D 0) { > +=09=09=09vu_queue_rewind(vq, 1); > +=09=09=09return ret; > +=09=09} > + > +=09=09l4len =3D tcp_l2_buf_fill_headers(conn, l2_iov, optlen, NULL, seq, > +=09=09=09=09=09=09true); > +=09=09/* keep the following assignment for clarity */ > +=09=09/* cppcheck-suppress unreadVariable */ > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_len =3D 
l4len;
> +
> +		l2len = l4len + sizeof(*ip6h) + sizeof(struct ethhdr);
> +	}
> +	l2len += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	ASSERT(l2len <= elem[0].in_sg[0].iov_len);
> +
> +	elem[0].in_sg[0].iov_len = l2len;
> +	tcp_vu_pcap(c, tapside, &elem[0].in_sg[0], 1, l4len);
> +
> +	vu_queue_fill(vq, &elem[0], l2len, 0);
> +	nb_ack = 1;

It took me a while to understand this, I guess "nb" means "number" (of
ACKs) but you set this to one regardless of whether you send any ACK
segment (also on SYN). What about 'count', 'segs', 'seg_count',
'pkt_count'...?

> +
> +	if (flags & DUP_ACK) {
> +		struct iovec in_sg_dup;
> +
> +		elem[1].out_num = 0;
> +		elem[1].out_sg = NULL;
> +		elem[1].in_num = 1;
> +		elem[1].in_sg = &in_sg_dup;
> +		ret = vu_queue_pop(vdev, vq, &elem[1]);
> +		if (ret == 0) {
> +			if (elem[1].in_num < 1 || elem[1].in_sg[0].iov_len < l2len) {
> +				vu_queue_rewind(vq, 1);
> +			} else {
> +				memcpy(elem[1].in_sg[0].iov_base, vh, l2len);
> +				nb_ack++;
> +
> +				tcp_vu_pcap(c, tapside, &elem[1].in_sg[0], 1,
> +					 l4len);
> +
> +				vu_queue_fill(vq, &elem[1], l2len, 1);
> +			}
> +		}
> +	}
> +
> +	vu_queue_flush(vq, nb_ack);

By the way, the comment to vu_queue_flush() is also a bit misleading,
it says "Number of entry to flush", which makes it look like an index,
while it should say "Number of entries to flush".
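
To illustrate the index/count distinction with something standalone,
here is a minimal sketch. It is not passt code: struct toy_ring,
toy_fill(), toy_flush() and toy_send() are hypothetical names made up
for this example, and they only mimic the shape of the
vu_queue_fill() / vu_queue_flush() calls quoted above, where the last
argument of the former looks like an offset and the argument of the
latter is a number of entries.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 8	/* hypothetical ring size, for illustration only */

/* Hypothetical stand-in for a used ring, NOT the vhost-user API */
struct toy_ring {
	size_t len[TOY_RING_SIZE];	/* lengths staged or published */
	size_t used_idx;		/* entries visible to the consumer */
};

/* Stage one entry at used_idx + idx: 'idx' is an offset, not a count */
void toy_fill(struct toy_ring *r, size_t len, size_t idx)
{
	assert(idx < TOY_RING_SIZE);
	r->len[(r->used_idx + idx) % TOY_RING_SIZE] = len;
}

/* Publish 'count' staged entries: 'count' is a number of entries */
void toy_flush(struct toy_ring *r, size_t count)
{
	assert(count <= TOY_RING_SIZE);
	r->used_idx += count;
}

/* Send one segment, optionally duplicated; return the segment count */
size_t toy_send(struct toy_ring *r, size_t l2len, int dup)
{
	size_t seg_count = 0;

	toy_fill(r, l2len, seg_count++);	/* offset 0 */
	if (dup)
		toy_fill(r, l2len, seg_count++);	/* offset 1 */

	toy_flush(r, seg_count);		/* 1 or 2 entries */
	return seg_count;
}
```

With a name like seg_count, the value passed to the flush step reads as
"how many segments were queued", whatever flags they carry.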
> +	vu_queue_notify(vdev, vq);
> +
> +	return 0;
> +}
> +
> +/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @v4:		Set for IPv4 connections
> + * @fillsize:	Number of bytes we can receive
> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the data , negative on failure
> + */
> +static ssize_t tcp_vu_sock_recv(struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, ssize_t *data_len)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	struct msghdr mh_sock = { 0 };
> +	uint16_t mss = MSS_GET(conn);
> +	static int in_sg_count;
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int segment_size;
> +	int iov_cnt;
> +	ssize_t ret;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +
> +	iov_cnt = 0;
> +	in_sg_count = 0;
> +	segment_size = 0;
> +	*data_len = 0;
> +	while (fillsize > 0 && iov_cnt < VIRTQUEUE_MAX_SIZE - 1 &&

I couldn't figure out why this needs to be less than
VIRTQUEUE_MAX_SIZE - 1: do we need to leave one free slot in the queue?

> +			 in_sg_count < ARRAY_SIZE(in_sg)) {

As you're assuming elem[iov_cnt].in_num == 1, this will always stop at
in_sg_count < ARRAY_SIZE(in_sg) - 1. I'm not sure if it's intended.
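
One possible reading of the first bound, judging only from the quoted
code and not confirmed by the author: iov_vu[0] is set aside for the
discard buffer and the loop writes iov_vu[iov_cnt + 1], so iov_cnt has
to stay below VIRTQUEUE_MAX_SIZE - 1 to keep that index inside the
array. The sketch below simulates just that counting, with one buffer
per element, to show that the second bound then never becomes the
limiting one. TOY_QUEUE_MAX, struct toy_iov and toy_collect() are
hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16	/* hypothetical stand-in for VIRTQUEUE_MAX_SIZE */

struct toy_iov {
	size_t len;
};

/* Collect up to 'avail' single-buffer elements into iov[1..], keeping
 * iov[0] reserved. Stores the number of scatter-gather entries consumed
 * in *sg_used and returns the number of elements collected.
 */
size_t toy_collect(struct toy_iov *iov, size_t avail, size_t buf_len,
		   size_t *sg_used)
{
	size_t in_sg_count = 0, iov_cnt = 0;

	while (avail > 0 && iov_cnt < TOY_QUEUE_MAX - 1 &&
	       in_sg_count < TOY_QUEUE_MAX) {
		in_sg_count++;			/* one buffer per element */

		/* Slot 0 is reserved, so the highest index written is
		 * iov_cnt + 1: this is what the "- 1" protects.
		 */
		assert(iov_cnt + 1 < TOY_QUEUE_MAX);
		iov[iov_cnt + 1].len = buf_len;

		iov_cnt++;
		avail--;
	}

	*sg_used = in_sg_count;
	return iov_cnt;
}
```

With unlimited buffers available, both counters end at
TOY_QUEUE_MAX - 1: the in_sg_count check is never the one that stops
the loop, which matches the observation above.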
> +		elem[iov_cnt].out_num = 0;
> +		elem[iov_cnt].out_sg = NULL;
> +		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
> +		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
> +		if (ret < 0)
> +			break;
> +
> +		if (elem[iov_cnt].in_num < 1) {
> +			warn("virtio-net receive queue contains no in buffers");
> +			break;
> +		}
> +
> +		in_sg_count += elem[iov_cnt].in_num;
> +
> +		ASSERT(elem[iov_cnt].in_num == 1);
> +		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);

Both would terminate passt on an issue from the hypervisor from which
we could probably recover. I guess those should be err() and break.

> +
> +		if (segment_size == 0) {
> +			iov_vu[iov_cnt + 1].iov_base =
> +					(char *)elem[iov_cnt].in_sg[0].iov_base + l2_hdrlen;
> +			iov_vu[iov_cnt + 1].iov_len =
> +					elem[iov_cnt].in_sg[0].iov_len - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_cnt + 1].iov_base = elem[iov_cnt].in_sg[0].iov_base;
> +			iov_vu[iov_cnt + 1].iov_len = elem[iov_cnt].in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_cnt + 1].iov_len > fillsize)
> +			iov_vu[iov_cnt + 1].iov_len = fillsize;
> +
> +		segment_size += iov_vu[iov_cnt + 1].iov_len;
> +		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			segment_size = 0;
> +		} else if (segment_size >= mss) {
> +			iov_vu[iov_cnt + 1].iov_len -= segment_size - mss;
> +			segment_size = 0;
> +		}
> +		fillsize -= iov_vu[iov_cnt + 1].iov_len;
> +
> +		iov_cnt++;
> +	}
> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = iov_cnt + 1;

I guess this should also change along with the check on peek_offset_cap
(David's comment).
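
David's comment is not quoted here, so the following is only a guess at
the shape of such a change: if nothing needs to be discarded (for
instance because the kernel already applies a peek offset), the leading
discard entry could be left out of the msghdr instead of being passed
with zero length. A minimal, self-contained sketch using a socketpair;
toy_peek_new(), toy_peek_demo() and TOY_DISCARD_MAX are hypothetical
names, not passt functions.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define TOY_DISCARD_MAX 64	/* hypothetical size of the discard buffer */

/* Peek at socket data, skipping 'already_sent' bytes through a leading
 * discard iovec, which is only passed to recvmsg() when it's needed.
 * Returns the number of new bytes copied to buf, -1 on error.
 */
ssize_t toy_peek_new(int s, size_t already_sent, char *buf, size_t buf_len)
{
	static char discard[TOY_DISCARD_MAX];
	struct iovec iov[2];
	struct msghdr mh;
	ssize_t n;

	assert(buf);
	if (already_sent > sizeof(discard))
		return -1;

	iov[0].iov_base = discard;
	iov[0].iov_len = already_sent;
	iov[1].iov_base = buf;
	iov[1].iov_len = buf_len;

	memset(&mh, 0, sizeof(mh));
	if (already_sent) {		/* discard entry plus data entry */
		mh.msg_iov = iov;
		mh.msg_iovlen = 2;
	} else {			/* nothing to skip: data entry only */
		mh.msg_iov = &iov[1];
		mh.msg_iovlen = 1;
	}

	n = recvmsg(s, &mh, MSG_PEEK);
	if (n < 0)
		return -1;

	return (size_t)n > already_sent ? n - (ssize_t)already_sent : 0;
}

/* Exercise both variants; returns 0 if the peeked bytes are as expected */
int toy_peek_demo(void)
{
	char buf[16] = { 0 };
	int sv[2], rc = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	if (send(sv[0], "hello world", 11, 0) != 11)
		goto out;

	if (toy_peek_new(sv[1], 6, buf, sizeof(buf)) != 5 ||
	    memcmp(buf, "world", 5))
		goto out;

	if (toy_peek_new(sv[1], 0, buf, sizeof(buf)) != 11 ||
	    memcmp(buf, "hello world", 11))
		goto out;

	rc = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return rc;
}
```

In terms of the quoted code, the second branch would correspond to
pointing msg_iov at &iov_vu[1] with msg_iovlen set to iov_cnt, but
that mapping is an assumption.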
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {
> +		vu_queue_rewind(vq, iov_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*data_len = ret;
> +	return iov_cnt;

On end-of-file, we return 0, as expected: no entries were used. But
otherwise, if recvmsg() returns a value that's less than iov_cnt, you
still return iov_cnt: is that intended? If yes, it doesn't fit with the
comment to this function.

> +}
> +
> +/**
> + * tcp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @first:	Pointer to the array of IO vectors
> + * @data_len:	Packet data length

...this is the payload length, I suppose? This is called segment_size
in the caller, so I guess it's that. But then we should call it 'dlen',
see commit 5566386f5f11 ("treewide: Standardise variable names for
various packet lengths").

By the way, we should probably copy the table from that commit message
(with s/plen/dlen/) somewhere in the code, at some point.

> + * @check:	Checksum, if already known
> + *
> + * Return: Level-4 length

Layer. I would call this "IPv4 payload length" for clarity.
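
For reference, the relationship between these lengths, as far as it can
be read from the code quoted in this patch (l4len = dlen +
sizeof(*th), l2len = l4len + sizeof(*iph) + sizeof(struct ethhdr)),
can be sketched as below. This is not the table from the commit
message, which isn't reproduced here. The TOY_* constants and struct
toy_lens are hypothetical, and the header sizes assume no IPv4 or TCP
options.

```c
#include <assert.h>
#include <stddef.h>

/* Header sizes without options: assumptions for this sketch */
enum {
	TOY_ETH_HLEN = 14,
	TOY_IP4_HLEN = 20,
	TOY_IP6_HLEN = 40,
	TOY_TCP_HLEN = 20,
};

struct toy_lens {
	size_t dlen;	/* L4 payload ("data") length */
	size_t l4len;	/* dlen plus TCP header */
	size_t l3len;	/* l4len plus IPv4 or IPv6 header */
	size_t l2len;	/* l3len plus Ethernet header */
};

/* Derive all frame lengths from the TCP payload length */
struct toy_lens toy_tcp_lens(size_t dlen, int v6)
{
	struct toy_lens l;

	/* Stay within a 16-bit IPv4 total length; stricter than needed
	 * for IPv6, but good enough for an illustration.
	 */
	assert(dlen <= 65535 - TOY_TCP_HLEN - TOY_IP4_HLEN);

	l.dlen = dlen;
	l.l4len = l.dlen + TOY_TCP_HLEN;
	l.l3len = l.l4len + (v6 ? TOY_IP6_HLEN : TOY_IP4_HLEN);
	l.l2len = l.l3len + TOY_ETH_HLEN;

	return l;
}
```

In these terms, the value returned by tcp_vu_prepare() would be l4len,
that is, the payload of the IP packet, and its data_len argument would
be dlen.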
> + */ > +static size_t tcp_vu_prepare(const struct ctx *c, > +=09=09=09 struct tcp_tap_conn *conn, struct iovec *first, > +=09=09=09 size_t data_len, const uint16_t **check) > +{ > +=09const struct flowside *toside =3D TAPFLOW(conn); > +=09struct iovec l2_iov[TCP_NUM_IOVS]; > +=09char *base =3D first->iov_base; > +=09struct ethhdr *eh; > +=09size_t l4len; > + > +=09/* we guess the first iovec provided by the guest can embed > +=09 * all the headers needed by L2 frame > +=09 */ > + > +=09l2_iov[TCP_IOV_TAP].iov_base =3D NULL; > +=09l2_iov[TCP_IOV_TAP].iov_len =3D 0; > +=09l2_iov[TCP_IOV_ETH].iov_base =3D base + sizeof(struct virtio_net_hdr_= mrg_rxbuf); > +=09l2_iov[TCP_IOV_ETH].iov_len =3D sizeof(struct ethhdr); > + > +=09eh =3D l2_iov[TCP_IOV_ETH].iov_base; > + > +=09memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); > +=09memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); > + > +=09/* initialize header */ > +=09if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) { > +=09=09struct tcp_payload_t *payload; > +=09=09struct iphdr *iph; > + > +=09=09ASSERT(first[0].iov_len >=3D sizeof(struct virtio_net_hdr_mrg_rxbu= f) + > +=09=09 sizeof(struct ethhdr) + sizeof(struct iphdr) + > +=09=09 sizeof(struct tcphdr)); > + > +=09=09l2_iov[TCP_IOV_IP].iov_base =3D (char *)l2_iov[TCP_IOV_ETH].iov_ba= se + > +=09=09=09=09=09=09 l2_iov[TCP_IOV_ETH].iov_len; > +=09=09l2_iov[TCP_IOV_IP].iov_len =3D sizeof(struct iphdr); > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_base =3D (char *)l2_iov[TCP_IOV_IP].io= v_base + > +=09=09=09=09=09=09=09 l2_iov[TCP_IOV_IP].iov_len; > + > + > +=09=09eh->h_proto =3D htons(ETH_P_IP); > + > +=09=09iph =3D l2_iov[TCP_IOV_IP].iov_base; > +=09=09*iph =3D (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP); > +=09=09payload =3D l2_iov[TCP_IOV_PAYLOAD].iov_base; > +=09=09payload->th =3D (struct tcphdr){ > +=09=09=09.doff =3D offsetof(struct tcp_payload_t, data) / 4, > +=09=09=09.ack =3D 1 > +=09=09}; > + > +=09=09l4len =3D tcp_l2_buf_fill_headers(conn, 
l2_iov, data_len, *check, > +=09=09=09=09=09=09conn->seq_to_tap, true); > +=09=09/* keep the following assignment for clarity */ > +=09=09/* cppcheck-suppress unreadVariable */ > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_len =3D l4len; > + > +=09=09*check =3D &iph->check; > +=09} else { > +=09=09struct tcp_payload_t *payload; > +=09=09struct ipv6hdr *ip6h; > + > +=09=09ASSERT(first[0].iov_len >=3D sizeof(struct virtio_net_hdr_mrg_rxbu= f) + > +=09=09 sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + > +=09=09 sizeof(struct tcphdr)); > + > +=09=09l2_iov[TCP_IOV_IP].iov_base =3D (char *)l2_iov[TCP_IOV_ETH].iov_ba= se + > +=09=09=09=09=09=09 l2_iov[TCP_IOV_ETH].iov_len; > +=09=09l2_iov[TCP_IOV_IP].iov_len =3D sizeof(struct ipv6hdr); > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_base =3D (char *)l2_iov[TCP_IOV_IP].io= v_base + > +=09=09=09=09=09=09=09 l2_iov[TCP_IOV_IP].iov_len; > + > + > +=09=09eh->h_proto =3D htons(ETH_P_IPV6); > + > +=09=09ip6h =3D l2_iov[TCP_IOV_IP].iov_base; > +=09=09*ip6h =3D (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP); > + > +=09=09payload =3D l2_iov[TCP_IOV_PAYLOAD].iov_base; > +=09=09payload->th =3D (struct tcphdr){ > +=09=09=09.doff =3D offsetof(struct tcp_payload_t, data) / 4, > +=09=09=09.ack =3D 1 > +=09=09}; > +; > +=09=09l4len =3D tcp_l2_buf_fill_headers(conn, l2_iov, data_len, NULL, > +=09=09=09=09=09=09conn->seq_to_tap, true); > +=09=09/* keep the following assignment for clarity */ > +=09=09/* cppcheck-suppress unreadVariable */ > +=09=09l2_iov[TCP_IOV_PAYLOAD].iov_len =3D l4len; > +=09} > + > +=09return l4len; > +} > + > +/** > + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost= -user, > + *=09=09=09 in window > + * @c:=09=09Execution context > + * @conn:=09Connection pointer > + * > + * Return: Negative on connection reset, 0 otherwise > + */ > +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) gcc isn't happy with this one either, I don't see where you modify pointed data? 
tcp.c: In function ‘tcp_data_from_sock’:
tcp.c:1645:46: warning: passing argument 1 of ‘tcp_vu_data_from_sock’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
 1645 |                 return tcp_vu_data_from_sock(c, conn);
      |                                              ^
tcp_vu.h:10:39: note: expected ‘struct ctx *’ but argument is of type ‘const struct ctx *’
   10 | int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
      |                           ~~~~~~~~~~~~^

> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	uint16_t mss = MSS_GET(conn);
> +	size_t l2_hdrlen, fillsize;
> +	int i, iov_cnt, iov_used;
> +	int v4 = CONN_V4(conn);
> +	uint32_t already_sent = 0;
> +	const uint16_t *check;
> +	struct iovec *first;
> +	int segment_size;
> +	int num_buffers;
> +	ssize_t len;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		flow_err(conn,
> +			 "Got packet, but RX virtqueue not usable yet");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			 conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially.
*/
> +
> +	fillsize = wnd_scaled;
> +
> +	if (peek_offset_cap)
> +		already_sent = 0;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	/* collect the buffers from vhost-user and fill them with the
> +	 * data from the socket
> +	 */
> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> +	if (iov_cnt <= 0)
> +		return iov_cnt;
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	segment_size = 0;
> +
> +	/* iov_vu is an array of buffers and the buffer size can be
> +	 * smaller than the segment size we want to use but with
> +	 * num_buffer we can merge several virtio iov buffers in one packet
> +	 * we need only to set the packet headers in the first iov and
> +	 * num_buffer to the number of iov entries

Wait, what? :) s/packet/packet./ and s/we/We/ should make this more
readable. What do you mean by "with num_buffer"? Does that refer to
VIRTIO_NET_F_MRG_RXBUF?

> +	 */
> +	for (i = 0; i < iov_cnt && len; i++) {
> +
> +		if (segment_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		segment_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (segment_size >= mss || len == 0 ||

Shouldn't we stop just _before_ exceeding the MSS?
Here it looks like we decide to prepare a frame only after we have
exceeded it (plus some other conditions), instead of having a look at
the next item to see if it can also fit.

> +		 i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			struct virtio_net_hdr_mrg_rxbuf *vh;
> +			size_t l4len;
> +
> +			if (i + 1 == iov_cnt)
> +				check = NULL;
> +
> +			/* restore first iovec base: point to vnet header */
> +			first->iov_base = (char *)first->iov_base - l2_hdrlen;
> +			first->iov_len = first->iov_len + l2_hdrlen;
> +
> +			vh = first->iov_base;
> +
> +			vh->hdr = VU_HEADER;
> +			if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +				vh->num_buffers = htole16(num_buffers);
> +
> +			l4len = tcp_vu_prepare(c, conn, first, segment_size, &check);
> +
> +			tcp_vu_pcap(c, tapside, first, num_buffers, l4len);
> +
> +			conn->seq_to_tap += segment_size;
> +
> +			segment_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	/* send packets */
> +	vu_send_frame(vdev, vq, elem, &iov_vu[1], iov_used);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..b433c3e0d06f
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier
> + */
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif /*TCP_VU_H */
> diff --git a/udp.c b/udp.c
> index 2ba00c9c20a8..f7b5b5eb6421 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -109,8 +109,7 @@
>  #include "pcap.h"
>  #include "log.h"
>  #include "flow_table.h"
> -
> 
-#define UDP_MAX_FRAMES=09=0932 /* max # of frames to receive at once */ > +#include "udp_internal.h" > =20 > /* "Spliced" sockets indexed by bound port (host order) */ > static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; > @@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; > =20 > /* Static buffers */ > =20 > -/** > - * struct udp_payload_t - UDP header and data for inbound messages > - * @uh:=09=09UDP header > - * @data:=09UDP data > - */ > -static struct udp_payload_t { > -=09struct udphdr uh; > -=09char data[USHRT_MAX - sizeof(struct udphdr)]; > -#ifdef __AVX2__ > -} __attribute__ ((packed, aligned(32))) > -#else > -} __attribute__ ((packed, aligned(__alignof__(unsigned int)))) > -#endif > -udp_payload[UDP_MAX_FRAMES]; > +/* UDP header and data for inbound messages */ > +static struct udp_payload_t udp_payload[UDP_MAX_FRAMES]; > =20 > /* Ethernet header for IPv4 frames */ > static struct ethhdr udp4_eth_hdr; > @@ -298,11 +285,13 @@ static void udp_splice_send(const struct ctx *c, si= ze_t start, size_t n, > * @bp:=09=09Pointer to udp_payload_t to update > * @toside:=09Flowside for destination side > * @dlen:=09Length of UDP payload > + * @no_udp_csum: Do not set UPD checksum > * > * Return: size of IPv4 payload (UDP header + data) > */ > -static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *= bp, > -=09=09=09 const struct flowside *toside, size_t dlen) > +size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, > +=09=09 const struct flowside *toside, size_t dlen, > +=09=09 bool no_udp_csum) > { > =09const struct in_addr *src =3D inany_v4(&toside->oaddr); > =09const struct in_addr *dst =3D inany_v4(&toside->eaddr); > @@ -319,7 +308,10 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, st= ruct udp_payload_t *bp, > =09bp->uh.source =3D htons(toside->oport); > =09bp->uh.dest =3D htons(toside->eport); > =09bp->uh.len =3D htons(l4len); > -=09csum_udp4(&bp->uh, *src, *dst, bp->data, dlen); > +=09if (no_udp_csum) 
> +=09=09bp->uh.check =3D 0; > +=09else > +=09=09csum_udp4(&bp->uh, *src, *dst, bp->data, dlen); > =20 > =09return l4len; > } > @@ -330,11 +322,13 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, s= truct udp_payload_t *bp, > * @bp:=09=09Pointer to udp_payload_t to update > * @toside:=09Flowside for destination side > * @dlen:=09Length of UDP payload > + * @no_udp_csum: Do not set UPD checksum > * > * Return: size of IPv6 payload (UDP header + data) > */ > -static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t= *bp, > -=09=09=09 const struct flowside *toside, size_t dlen) > +size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, > +=09=09 const struct flowside *toside, size_t dlen, > +=09=09 bool no_udp_csum) > { > =09uint16_t l4len =3D dlen + sizeof(bp->uh); > =20 > @@ -348,7 +342,16 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, = struct udp_payload_t *bp, > =09bp->uh.source =3D htons(toside->oport); > =09bp->uh.dest =3D htons(toside->eport); > =09bp->uh.len =3D ip6h->payload_len; > -=09csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, bp->data, dl= en); > +=09if (no_udp_csum) { > +=09=09/* O is an invalid checksum for UDP IPv6 and dropped by > +=09=09 * the kernel stack, even if the checksum is disabled by virtio > +=09=09 * flags. We need to put any non-zero value here. 
> +		 */
> +		bp->uh.check = 0xffff;
> +	} else {
> +		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
> +			 bp->data, dlen);
> +	}
>  
>  	return l4len;
>  }
> @@ -358,9 +361,11 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
>   * @mmh:	Receiving mmsghdr array
>   * @idx:	Index of the datagram to prepare
>   * @toside:	Flowside for destination side
> + * @no_udp_csum: Do not set UPD checksum
>   */
> -static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
> -			 const struct flowside *toside)
> +static void udp_tap_prepare(const struct mmsghdr *mmh,
> +			 unsigned idx, const struct flowside *toside,
> +			 bool no_udp_csum)
>  {
>  	struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
>  	struct udp_payload_t *bp = &udp_payload[idx];
> @@ -368,13 +373,15 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
>  	size_t l4len;
>  
>  	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
> -		l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
> +		l4len = udp_update_hdr6(&bm->ip6h, bp, toside,
> +					mmh[idx].msg_len, no_udp_csum);
>  		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
>  			 sizeof(udp6_eth_hdr));
>  		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
>  		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
>  	} else {
> -		l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
> +		l4len = udp_update_hdr4(&bm->ip4h, bp, toside,
> +					mmh[idx].msg_len, no_udp_csum);
>  		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
>  			 sizeof(udp4_eth_hdr));
>  		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
> @@ -447,7 +454,7 @@ static int udp_sock_recverr(int s)
>   *
>   * Return: Number of errors handled, or < 0 if we have an unrecoverable error
>   */
> -static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
> +int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
>  {
>  	unsigned n_err = 0;
>  	socklen_t errlen;
> @@ -524,7 +531,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>  }
>  
>  /**
> - * udp_listen_sock_handler() - Handle new data from socket
> + * udp_buf_listen_sock_handler() - Handle new data from socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -532,8 +539,8 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			 uint32_t events, const struct timespec *now)
> +void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				 uint32_t events, const struct timespec *now)
>  {
>  	const socklen_t sasize = sizeof(udp_meta[0].s_in);
>  	int n, i;
> @@ -565,7 +572,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>  				udp_splice_prepare(udp_mh_recv, i);
>  			} else if (batchpif == PIF_TAP) {
>  				udp_tap_prepare(udp_mh_recv, i,
> -						flowside_at_sidx(batchsidx));
> +						flowside_at_sidx(batchsidx),
> +						false);
>  			}
>  
>  			if (++i >= n)
> @@ -599,7 +607,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>  }
>  
>  /**
> - * udp_reply_sock_handler() - Handle new data from flow specific socket
> + * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -607,8 +615,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			 uint32_t events, const struct timespec *now)
> +void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)
>  {
>  	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
>  	const struct flowside *toside = flowside_at_sidx(tosidx);
> @@ -636,7 +644,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  		if (pif_is_socket(topif))
>  			udp_splice_prepare(udp_mh_recv, i);
>  		else if (topif == PIF_TAP)
> -			udp_tap_prepare(udp_mh_recv, i, toside);
> +			udp_tap_prepare(udp_mh_recv, i, toside, false);
>  		/* Restore sockaddr length clobbered by recvmsg() */
>  		udp_mh_recv[i].msg_hdr.msg_namelen = sizeof(udp_meta[i].s_in);
>  	}
> diff --git a/udp.h b/udp.h
> index a8e76bfe8f37..ea23fb36b637 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -9,10 +9,10 @@
>  #define UDP_TIMER_INTERVAL		1000 /* ms */
>  
>  void udp_portmap_clear(void);
> -void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			 uint32_t events, const struct timespec *now);
> -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			 uint32_t events, const struct timespec *now);
> +void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				 uint32_t events, const struct timespec *now);
> +void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
>  int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		 sa_family_t af, const void *saddr, const void *daddr,
>  		 const struct pool *p, int idx, const struct timespec *now);
> diff --git a/udp_internal.h b/udp_internal.h
> new file mode 100644
> index 000000000000..cc80e3055423
> --- /dev/null
> +++ b/udp_internal.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2021 Red Hat GmbH
> + * Author: Stefano Brivio
> + */
> +
> +#ifndef UDP_INTERNAL_H
> +#define UDP_INTERNAL_H
> +
> +#include "tap.h" /* needed by udp_meta_t */
> +
> +#define UDP_MAX_FRAMES		32 /* max # of frames to receive at once */
> +
> +/**
> + * struct udp_payload_t - UDP header and data for inbound messages
> + * @uh:		UDP header
> + * @data:	UDP data
> + */
> +struct udp_payload_t {
> +	struct udphdr uh;
> +	char data[USHRT_MAX - sizeof(struct udphdr)];
> +#ifdef __AVX2__
> +} __attribute__ ((packed, aligned(32)));
> +#else
> +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> +#endif
> +
> +size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> +		 const struct flowside *toside, size_t dlen,
> +		 bool no_udp_csum);
> +size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> + const struct flowside *toside, size_t dlen,
> +		 bool no_udp_csum);
> +int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
> +#endif /* UDP_INTERNAL_H */
> diff --git a/udp_vu.c b/udp_vu.c
> new file mode 100644
> index 000000000000..fa390dec994a
> --- /dev/null
> +++ b/udp_vu.c
> @@ -0,0 +1,397 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* udp_vu.c - UDP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include "checksum.h"
> +#include "util.h"
> +#include "ip.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "passt.h"
> +#include "pcap.h"
> +#include "log.h"
> +#include "vhost_user.h"
> +#include "udp_internal.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +#include "udp_flow.h"
> +#include "udp_vu.h"
> +#include "vu_common.h"
> +
> +static struct iovec iov_vu		[VIRTQUEUE_MAX_SIZE];
> +static struct vu_virtq_element	elem		[VIRTQUEUE_MAX_SIZE];

Why these spaces and tabs if things are not aligned anyway? It makes it
a bit difficult to read.

> +static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +static int in_sg_count;
> +
> +/**
> + * udp_vu_l2_hdrlen() - return the size of the header in level 2 frame (UDP)

layer. But it's actually the sum of all headers, up to Layer-4?

> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t udp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf) + sizeof(struct ethhdr) +
> +		 sizeof(struct udphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +static int udp_vu_sock_init(int s, union sockaddr_inany *s_in)
> +{
> +	struct msghdr msg = {
> +		.msg_name = s_in,
> +		.msg_namelen = sizeof(union sockaddr_inany),
> +	};
> +
> +	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
> +}
> +
> +/**
> + * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
> + * @c:		Execution context
> + * @s:		Socket to receive from
> + * @events:	epoll events bitmap
> + * @v6:		Set for IPv6 connections
> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the datagram
> + */
> +static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
> +			 bool v6, ssize_t *data_len)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int virtqueue_max, iov_cnt, idx, iov_used;
> +	size_t fillsize, size, off, l2_hdrlen;
> +	struct virtio_net_hdr_mrg_rxbuf *vh;
> +	struct msghdr msg = { 0 };
> +	char *base;
> +
> +	ASSERT(!c->no_udp);
> +
> +	if (!(events & EPOLLIN))
> +		return 0;
> +
> +	/* compute L2 header length */

...this is not related to virtqueue_max and VIRTIO_NET_F_MRG_RXBUF, right?

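Coming back to udp_vu_l2_hdrlen() above, here is a rough sketch of what I read the function as computing, in case it helps with naming and with the comment. The function name and the constants are placeholders made up for illustration: they assume fixed header sizes (no IPv4 options, no IPv6 extension headers), while the real code would keep using sizeof() on the actual structs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed fixed header sizes, in bytes. Each constant stands in for
 * sizeof() of the struct named in its comment.
 */
enum {
	VNET_HDR_LEN	= 12,	/* struct virtio_net_hdr_mrg_rxbuf */
	ETH_HDR_LEN	= 14,	/* struct ethhdr */
	IP4_HDR_LEN	= 20,	/* struct iphdr, no options */
	IP6_HDR_LEN	= 40,	/* struct ipv6hdr, no extension headers */
	UDP_HDR_LEN	= 8,	/* struct udphdr */
};

/**
 * vu_udp_hdrlen() - Sum of all headers preceding the UDP payload
 * @v6:		Set for IPv6 packet
 *
 * Hypothetical name: the result covers the virtio-net header plus the
 * Layer-2, Layer-3 and Layer-4 headers, not just Layer-2.
 *
 * Return: total header size in bytes
 */
static size_t vu_udp_hdrlen(bool v6)
{
	size_t len = VNET_HDR_LEN + ETH_HDR_LEN + UDP_HDR_LEN;

	len += v6 ? IP6_HDR_LEN : IP4_HDR_LEN;

	return len;
}
```

With these assumed sizes, the IPv4 case adds up to 54 bytes and the IPv6 case to 74, so the value is everything up to and including the UDP header.
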
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		virtqueue_max = VIRTQUEUE_MAX_SIZE;
> +	else
> +		virtqueue_max = 1;
> +
> +	l2_hdrlen = udp_vu_l2_hdrlen(v6);
> +
> +	fillsize = USHRT_MAX;
> +	iov_cnt = 0;
> +	in_sg_count = 0;
> +	while (fillsize && iov_cnt < virtqueue_max &&
> +			in_sg_count < ARRAY_SIZE(in_sg)) {
> +		int ret;
> +
> +		elem[iov_cnt].out_num = 0;
> +		elem[iov_cnt].out_sg = NULL;
> +		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
> +		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
> +		if (ret < 0)
> +			break;
> +		in_sg_count += elem[iov_cnt].in_num;
> +
> +		if (elem[iov_cnt].in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			vu_queue_rewind(vq, iov_cnt);
> +			return 0;
> +		}
> +		ASSERT(elem[iov_cnt].in_num == 1);
> +		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
> +
> +		if (iov_cnt == 0) {
> +			base = elem[iov_cnt].in_sg[0].iov_base;
> +			size = elem[iov_cnt].in_sg[0].iov_len;
> +
> +			/* keep space for the headers */
> +			iov_vu[0].iov_base = base + l2_hdrlen;
> +			iov_vu[0].iov_len = size - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_cnt].iov_base = elem[iov_cnt].in_sg[0].iov_base;
> +			iov_vu[iov_cnt].iov_len = elem[iov_cnt].in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_cnt].iov_len > fillsize)
> +			iov_vu[iov_cnt].iov_len = fillsize;
> +
> +		fillsize -= iov_vu[iov_cnt].iov_len;
> +
> +		iov_cnt++;
> +	}
> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	msg.msg_iov = iov_vu;
> +	msg.msg_iovlen = iov_cnt;
> +
> +	*data_len = recvmsg(s, &msg, 0);
> +	if (*data_len < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	/* restore original values */
> +	iov_vu[0].iov_base = base;
> +	iov_vu[0].iov_len = size;
> +
> +	/* count the numbers of buffer filled by recvmsg() */
> +	idx = iov_skip_bytes(iov_vu, iov_cnt, l2_hdrlen + *data_len,
> +			 &off);
> +	/* adjust last iov length */
> +	if (idx < iov_cnt)
> +		iov_vu[idx].iov_len = off;
> +	iov_used = idx + !!off;
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
> +	vh->hdr = VU_HEADER;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		vh->num_buffers = htole16(iov_used);
> +
> +	return iov_used;
> +}
> +
> +/**
> + * udp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @toside:	Address information for one side of the flow
> + * @datalen:	Packet data length
> + *
> + * Return:i Level-4 length

Same as above.

> + */
> +static size_t udp_vu_prepare(const struct ctx *c,
> +			 const struct flowside *toside, ssize_t data_len)
> +{
> +	struct ethhdr *eh;
> +	size_t l4len;
> +
> +	/* ethernet header */
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct iphdr *iph = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv4(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr4(iph, bp, toside, data_len, true);
> +	} else {
> +		struct ipv6hdr *ip6h = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv6(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr6(ip6h, bp, toside, data_len, true);
> +	}
> +
> +	return l4len;
> +}
> +
> +/**
> + * udp_vu_pcap() - Capture a single frame to pcap file (UDP)
> + * @c:		Execution context
> + * @toside:	ddress information for one side of the flow address
> + * @l4len:	IPv4 Payload length
> + * @iov_used:	Length of the array
> + */
> +static void udp_vu_pcap(const struct ctx *c, const struct flowside *toside,
> +			size_t l4len, int iov_used)
> +{
> +	const struct in_addr *src4 = inany_v4(&toside->oaddr);
> +	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
> +	char *base = iov_vu[0].iov_base;
> +	size_t size = iov_vu[0].iov_len;
> +	struct udp_payload_t *bp;
> +	uint32_t sum;
> +
> +	if (!*c->pcap)
> +		return;
> +
> +	if (src4 && dst4) {
> +		bp = vu_payloadv4(base);
> +		sum = proto_ipv4_header_psum(l4len, IPPROTO_UDP, *src4, *dst4);
> +	} else {
> +		bp = vu_payloadv6(base);
> +		sum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
> +					 &toside->oaddr.a6,
> +					 &toside->eaddr.a6);
> +		bp->uh.check = 0; /* by default, set to 0xffff */
> +	}
> +
> +	iov_vu[0].iov_base = &bp->uh;
> +	iov_vu[0].iov_len = size - ((char *)iov_vu[0].iov_base - base);
> +
> +	bp->uh.check = csum_iov(iov_vu, iov_used, sum);
> +
> +	/* set iov for pcap logging */
> +	iov_vu[0].iov_base = base + sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	iov_vu[0].iov_len = size - sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	pcap_iov(iov_vu, iov_used);
> +
> +	/* restore iov_vu[0] */
> +	iov_vu[0].iov_base = base;
> +	iov_vu[0].iov_len = size;
> +}
> +
> +/**
> + * udp_vu_listen_sock_handler() - Handle new data from socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *toside;
> +	union sockaddr_inany s_in;
> +	flow_sidx_t batchsidx;
> +	uint8_t batchpif;
> +	bool v6;
> +	int i;
> +
> +	if (udp_sock_errs(c, ref.fd, events) < 0) {
> +		err("UDP: Unrecoverable error on listening socket:"
> +		 " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
> +		return;
> +	}
> +
> +	if (udp_vu_sock_init(ref.fd, &s_in) < 0)
> +		return;
> +
> +	batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
> +	batchpif = pif_at_sidx(batchsidx);
> +
> +	if (batchpif != PIF_TAP) {
> +		if (flow_sidx_valid(batchsidx)) {
> +			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
> +			struct udp_flow *uflow = udp_at_sidx(batchsidx);
> +
> +			flow_err(uflow,
> +				 "No support for forwarding UDP from %s to %s",
> +				 pif_name(pif_at_sidx(fromsidx)),
> +				 pif_name(batchpif));
> +		} else {
> +			debug("Discarding 1 datagram without flow");
> +		}
> +
> +		return;
> +	}
> +
> +	toside = flowside_at_sidx(batchsidx);
> +
> +	v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		ssize_t data_len;
> +		size_t l4len;
> +		int iov_used;
> +
> +		iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &data_len);
> +		if (iov_used <= 0)
> +			return;
> +
> +		l4len = udp_vu_prepare(c, toside, data_len);
> +		udp_vu_pcap(c, toside, l4len, iov_used);
> +		vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
> +	}
> +}
> +
> +/**
> + * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			 uint32_t events, const struct timespec *now)
> +{
> +	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
> +	const struct flowside *toside = flowside_at_sidx(tosidx);
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
> +	int from_s = uflow->s[ref.flowside.sidei];
> +	uint8_t topif = pif_at_sidx(tosidx);
> +	bool v6;
> +	int i;
> +
> +	ASSERT(!c->no_udp);
> +	ASSERT(uflow);
> +
> +	if (udp_sock_errs(c, from_s, events) < 0) {
> +		flow_err(uflow, "Unrecoverable error on reply socket");
> +		flow_err_details(uflow);
> +		udp_flow_close(c, uflow);
> +		return;
> +	}
> +
> +	if (topif != PIF_TAP) {
> +		uint8_t frompif = pif_at_sidx(ref.flowside);
> +
> +		flow_err(uflow,
> +			 "No support for forwarding UDP from %s to %s",
> +			 pif_name(frompif), pif_name(topif));
> +		return;
> +	}
> +
> +	v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		ssize_t data_len;
> +		size_t l4len;
> +		int iov_used;
> +
> +		iov_used = udp_vu_sock_recv(c, from_s, events, v6, &data_len);
> +		if (iov_used <= 0)
> +			return;
> +		flow_trace(uflow, "Received 1 datagram on reply socket");
> +		uflow->ts = now->tv_sec;
> +
> +		l4len = udp_vu_prepare(c, toside, data_len);
> +		udp_vu_pcap(c, toside, l4len, iov_used);
> +		vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
> +	}
> +}
> diff --git a/udp_vu.h b/udp_vu.h
> new file mode 100644
> index 000000000000..ba7018d3bf01
> --- /dev/null
> +++ b/udp_vu.h
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier
> + */
> +
> +#ifndef UDP_VU_H
> +#define UDP_VU_H
> +
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			 uint32_t events, const struct timespec *now);
> +#endif /* UDP_VU_H */
> diff --git a/vhost_user.c b/vhost_user.c
> index 3b38e06f268e..0f98ee7fa7c3 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -52,7 +52,6 @@
>   * 			 this is part of the vhost-user backend
>   * 			 convention.
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_print_capabilities(void)
>  {
>  	info("{");
> @@ -162,9 +161,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
>   */
>  static void vu_remove_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
>  }
>  
>  /**
> @@ -425,7 +422,6 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
>   *
>   * Return: 0 if the zone is in a mapped memory region, -1 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_packet_check_range(void *buf, size_t offset, size_t len,
>  			 const char *start)
>  {
> @@ -515,6 +511,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
>  		}
>  	}
>  
> +	/* As vu_packet_check_range() has no access to the number of
> +	 * memory regions, mark the end of the array with mmap_addr = 0
> +	 */
> +	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
> +	vdev->regions[vdev->nregions].mmap_addr = 0;
> +
> +	tap_sock_update_buf(vdev->regions, 0);
> +
>  	return false;
>  }
>  
> @@ -643,9 +647,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
>   */
>  static void vu_set_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
> +	struct epoll_event ev = { 0 };
> +
> +	ev.data.u64 = ref.u64;
> +	ev.events = EPOLLIN;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
>  }
>  
>  /**
> @@ -685,7 +692,6 @@ static int vu_wait_queue(const struct vu_virtq *vq)
>   *
>   * Return: number of bytes sent, -1 if there is an error
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
>  {
>  	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> @@ -869,7 +875,6 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
>   * @ref:	epoll reference information
>   * @now:	Current timestamp
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
>  		const struct timespec *now)
>  {
> @@ -1104,11 +1109,11 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
>   * @c:		execution context
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_init(struct ctx *c, struct vu_dev *vdev)
>  {
>  	int i;
>  
> +	c->vdev = vdev;
>  	vdev->context = c;
>  	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
>  		vdev->vq[i] = (struct vu_virtq){
> @@ -1124,7 +1129,6 @@ void vu_init(struct ctx *c, struct vu_dev *vdev)
>   * vu_cleanup() - Reset vhost-user device
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_cleanup(struct vu_dev *vdev)
>  {
>  	unsigned int i;
> @@ -1171,8 +1175,7 @@ void vu_cleanup(struct vu_dev *vdev)
>   */
>  static void vu_sock_reset(struct vu_dev *vdev)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> +	tap_sock_reset(vdev->context);
>  }
>  
>  static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
> @@ -1200,7 +1203,6 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
>   * @fd:		vhost-user message socket
>   * @events:	epoll events
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  {
>  	struct vhost_user_msg msg = { 0 };
> diff --git a/virtio.c b/virtio.c
> index 237395396606..31e56def2c23 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -562,7 +562,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
>   * @vq:		Virtqueue
>   * @num:	Number of element to unpop
>   */
> -/* cppcheck-suppress unusedFunction */
>  bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
>  {
>  	if (num > vq->inuse)
> diff --git a/vu_common.c b/vu_common.c
> new file mode 100644
> index 000000000000..7a9caae17f42
> --- /dev/null
> +++ b/vu_common.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier
> + *
> + * common_vu.c - vhost-user common UDP and TCP functions
> + */
> +
> +#include
> +#include
> +#include
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "vhost_user.h"
> +#include "vu_common.h"
> +
> +/**
> + * vu_send_frame() - Send one frame to the vhost-user interface
> + * @vdev:	vhost-user device
> + * @vq:		vhost-user virtqueue
> + * @elem:	virtqueue element array to send back to the virqueue
> + * @iov_vu:	iovec array containing the data to send
> + * @iov_used:	Length of the array
> + */
> +void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
> +		 struct vu_virtq_element *elem, const struct iovec *iov_vu,
> +		 int iov_used)
> +{
> +	int i;
> +
> +	for (i = 0; i < iov_used; i++)
> +		vu_queue_fill(vq, &elem[i], iov_vu[i].iov_len, i);
> +
> +	vu_queue_flush(vq, iov_used);
> +	vu_queue_notify(vdev, vq);
> +}
> diff --git a/vu_common.h b/vu_common.h
> new file mode 100644
> index 000000000000..20950b44493c
> --- /dev/null
> +++ b/vu_common.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier
> + *
> + * vhost-user common UDP and TCP functions
> + */
> +
> +#ifndef VU_COMMON_H
> +#define VU_COMMON_H
> +
> +static inline void *vu_eth(void *base)
> +{
> +	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +}
> +
> +static inline void *vu_ip(void *base)
> +{
> +	return (struct ethhdr *)vu_eth(base) + 1;
> +}
> +
> +static inline void *vu_payloadv4(void *base)
> +{
> +	return (struct iphdr *)vu_ip(base) + 1;
> +}
> +
> +static inline void *vu_payloadv6(void *base)
> +{
> +	return (struct ipv6hdr *)vu_ip(base) + 1;
> +}
> +
> +void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
> +		 struct vu_virtq_element *elem, const struct iovec *iov_vu,
> +		 int iov_used);
> +#endif /* VU_COMMON_H */

-- 
Stefano