Date: Wed, 2 Apr 2025 09:16:22 +0200
From: Stefano Brivio
To: Eugenio Pérez
CC: passt-dev@passt.top, jmaloy@redhat.com, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH 2/3] tap: implement vhost_call_cb
Message-ID: <20250402091622.7cda67ba@elisabeth>
In-Reply-To: <20250401113809.1765282-3-eperezma@redhat.com>
References: <20250401113809.1765282-1-eperezma@redhat.com> <20250401113809.1765282-3-eperezma@redhat.com>
Organization: Red Hat
List-Id: Development discussion and patches for passt

A couple of general notes:

- there's no need to Cc: people already on this list unless you need
  their attention specifically (it can get quite noisy...)

- things kind of make sense to me, many are hard to evaluate at this
  early stage, so I just noted some specific comments/questions below,
  but in the sense of "being on the same page" my current answer is...
  yes, I guess so!

- I'm not reviewing 1/3 and 3/3 right away as I guess you'll revisit
  them anyway, I'm just not sure we need a separate pool... but I'm not
  sure if that's temporary either.

On Tue, 1 Apr 2025 13:38:08 +0200
Eugenio Pérez wrote:

> This is the main callback when the tap device has processed any buffer.
> Another possibility is to reuse the tap callback for this, so less code
> changes are needed.
> 
> Signed-off-by: Eugenio Pérez
> ---
>  epoll_type.h |   2 +
>  passt.c      |   4 +
>  passt.h      |  12 +-
>  tap.c        | 316 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  tap.h        |   2 +
>  5 files changed, 334 insertions(+), 2 deletions(-)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 7f2a121..6284c79 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* vhost-kernel call socket */
> +	EPOLL_TYPE_VHOST_CALL,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/passt.c b/passt.c
> index cd06772..19c5d5f 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -79,6 +79,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_VHOST_CALL]		= "vhost-kernel call socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -357,6 +358,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_VHOST_CALL:
> +			vhost_call_cb(&c, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 8f45091..eb5aa03 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -45,7 +45,7 @@ union epoll_ref;
>   * @icmp:	ICMP-specific reference part
>   * @data:	Data handled by protocol handlers
>   * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> - * @queue:	vhost-user queue index for this fd
> + * @queue:	vhost queue index for this fd
>   * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
>   */
>  union epoll_ref {
> @@ -271,11 +271,14 @@ struct ctx {
>  	int fd_tap;
>  	int fd_repair_listen;
>  	int fd_repair;
> +	/* TODO document all added fields */
> +	int fd_vhost;
>  	unsigned char our_tap_mac[ETH_ALEN];
>  	unsigned char guest_mac[ETH_ALEN];
>  	uint16_t mtu;
>  
>  	uint64_t hash_secret[2];
> +	uint64_t virtio_features;
>  
>  	int ifi4;
>  	struct ip4_ctx ip4;
> @@ -299,6 +302,13 @@ struct ctx {
>  	int no_icmp;
>  	struct icmp_ctx icmp;
>  
> +	struct {
> +		uint16_t last_used_idx;
> +
> +		int kick_fd;
> +		int call_fd;
> +	} vq[2];
> +
>  	int no_dns;
>  	int no_dns_search;
>  	int no_dhcp_dns;
> diff --git a/tap.c b/tap.c
> index ce859ba..fbe83aa 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include

Why do we need eventfds here? Is there anything peculiar in the
protocol, or is it all stuff that can be done with "regular" file
descriptors plus epoll?
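
For context on the question above, a minimal sketch of what an eventfd
looks like from the epoll side: it is an ordinary file descriptor
wrapping a 64-bit counter, so it can be registered and polled like any
socket. As far as I know, VHOST_SET_VRING_KICK and VHOST_SET_VRING_CALL
expect an eventfd in struct vhost_vring_file, which would explain the
new include. The helper names notifier_create() and
notifier_wait_and_drain() are made up for illustration and are not part
of passt.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: create a non-blocking eventfd and add it to epollfd.
 * Returns the new file descriptor, or -1 on failure.
 */
static int notifier_create(int epollfd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int fd;

	fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	if (fd < 0)
		return -1;

	ev.data.fd = fd;
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/* Hypothetical helper: wait up to timeout_ms for one ready eventfd, then
 * read it, which resets its counter to zero.
 * Returns the accumulated counter value, 0 on timeout, -1 on error.
 */
static int64_t notifier_wait_and_drain(int epollfd, int timeout_ms)
{
	struct epoll_event ev;
	eventfd_t count;
	int n;

	n = epoll_wait(epollfd, &ev, 1, timeout_ms);
	if (n < 0)
		return -1;
	if (n == 0)
		return 0;

	if (eventfd_read(ev.data.fd, &count) < 0)
		return -1;

	return (int64_t)count;
}
```

Several writes before a read are coalesced into one wakeup, and the
read returns their sum, so the reader has to look at the ring itself to
find out how much work is pending.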
>  #include
>  #include
>  #include
> @@ -82,6 +83,46 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
>  #define TAP_SEQS		128 /* Different L4 tuples in one batch */
>  #define FRAGMENT_MSG_RATE	10 /* # seconds between fragment warnings */
>  
> +#define VHOST_VIRTIO 0xAF
> +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)

Coding style (no strict requirement): align those nicely/table-like if
possible.

> +
> +char virtio_rx_pkt_buf[PKT_BUF_BYTES] __attribute__((aligned(PAGE_SIZE)));
> +static PACKET_POOL_NOINIT(pool_virtiorx4, TAP_MSGS, virtio_rx_pkt_buf);
> +static PACKET_POOL_NOINIT(pool_virtiorx6, TAP_MSGS, virtio_rx_pkt_buf);
> +
> +/* TODO is it better to have 1024 or 65520 bytes per packet? */

In general 65535 bytes (including the Ethernet header) appears to be a
good idea, but actual profiling would be nice in the long term.

> +#define VHOST_NDESCS 1024
> +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));

Coding style, here and a bit all over the place: wrap to 80 columns
("net" kernel-like style).
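
To illustrate both style notes, a sketch of how a subset of these
definitions might look once aligned and wrapped. The values are copied
from the quoted hunk and only the layout changes. The PAGE_SIZE fallback
and the choice of headers are assumptions made to keep the snippet
self-contained, and the remaining struct-based ioctls would need the
same continuation-line treatment as VHOST_NET_SET_BACKEND to fit.

```c
#include <linux/types.h>
#include <linux/vhost_types.h>
#include <linux/virtio_ring.h>
#include <sys/ioctl.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE			4096	/* assumption, sketch only */
#endif

/* Table-like alignment: one tab stop column for all values */
#define VHOST_VIRTIO			0xAF
#define VHOST_GET_FEATURES		_IOR(VHOST_VIRTIO, 0x00, __u64)
#define VHOST_SET_FEATURES		_IOW(VHOST_VIRTIO, 0x00, __u64)
#define VHOST_SET_OWNER			_IO(VHOST_VIRTIO, 0x01)
#define VHOST_SET_BACKEND_FEATURES	_IOW(VHOST_VIRTIO, 0x25, __u64)

/* Too long for one line at this alignment: continue on the next one */
#define VHOST_NET_SET_BACKEND						\
	_IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)

#define VHOST_NDESCS			1024

/* Attribute moved to its own line to stay within 80 columns */
static struct vring_desc vring_desc[2][VHOST_NDESCS]
	__attribute__((aligned(PAGE_SIZE)));
```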
> +static union {
> +	struct vring_avail avail;
> +	char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
> +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
> +static union {
> +	struct vring_used used;
> +	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
> +
> +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */

What's the "all descs ring"? A short "theory of operation" section might
help eventually.

> +#define N_VHOST_REGIONS 7
> +union {
> +	struct vhost_memory mem;
> +	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
> +} vhost_memory = {
> +	.mem = {
> +		.nregions = N_VHOST_REGIONS,
> +	},
> +};
> +
>  /**
>   * tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
>   * @c:		Execution context
> @@ -1360,6 +1401,89 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  	tap_start_connection(c);
>  }
>  
> +static void *virtqueue_get_buf(struct ctx *c, unsigned qid, unsigned *len)
> +{
> +	struct vring_used *used = !qid ? &vring_used_0.used : &vring_used_1.used;
> +	uint32_t i;
> +	uint16_t used_idx, last_used;
> +
> +	/* TODO think if this has races with previous eventfd_read */
> +	/* TODO we could improve performance with a shadow_used_idx */
> +	used_idx = le16toh(used->idx);
> +
> +	smp_rmb();
> +
> +	if (used_idx == c->vq[qid].last_used_idx) {
> +		*len = 0;
> +		return NULL;
> +	}
> +
> +	last_used = c->vq[qid].last_used_idx & VHOST_NDESCS;
> +	i = le32toh(used->ring[last_used].id);
> +	*len = le32toh(used->ring[last_used].len);
> +
> +	if (i > VHOST_NDESCS) {
> +		/* TODO imporove this, it will cause infinite loop */
> +		warn("vhost: id %u at used position %u out of range (max=%u)", i, last_used, VHOST_NDESCS);
> +		return NULL;
> +	}
> +
> +	if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {
> +		warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/VHOST_NDESCS);
> +		return NULL;
> +	}
> +
> +	/* TODO check if the id is valid and it has not been double used */
> +	c->vq[qid].last_used_idx++;
> +	return virtio_rx_pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
> +}
> +
> +/* container is tx but we receive it from vhost POV */
> +void vhost_call_cb(struct ctx *c, union epoll_ref ref, const struct timespec *now)
> +{
> +	eventfd_read(ref.fd, (eventfd_t[]){ 0 });
> +
> +	tap_flush_pools();
> +
> +	while (true) {
> +		struct virtio_net_hdr_mrg_rxbuf *hdr;
> +		unsigned len;
> +
> +		hdr = virtqueue_get_buf(c, ref.queue, &len);
> +		if (!hdr)
> +			break;
> +
> +		if (len < sizeof(*hdr)) {
> +			warn("vhost: invalid len %u", len);
> +			continue;
> +		}
> +
> +		/* TODO this will break from this moment */
> +		if (hdr->num_buffers != 1) {
> +			warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
> +			continue;
> +		}
> +
> +		/* TODO fix the v6 pool to support ipv6 */
> +		tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), pool_virtiorx4, pool_virtiorx6);
> +	}
> +
> +	tap_handler(c, now, pool_virtiorx4, pool_virtiorx6);
> +}
> +
> +/* TODO: Actually refill */
> +static void rx_pkt_refill(int kick_fd)
> +{
> +	for (unsigned i = 0; i < VHOST_NDESCS; ++i) {
> +		vring_desc[0][i].addr = (uintptr_t)virtio_rx_pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
> +		vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
> +		vring_desc[0][i].flags = VRING_DESC_F_WRITE;
> +	}
> +
> +	vring_avail_0.avail.idx = VHOST_NDESCS;
> +	eventfd_write(kick_fd, 1);
> +}
> +
>  /**
>   * tap_ns_tun() - Get tuntap fd in namespace
>   * @c:		Execution context
> @@ -1370,10 +1494,13 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>   */
>  static int tap_ns_tun(void *arg)
>  {
> +	/* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF and VHOST_NET_F_VIRTIO_NET_HDR actually */
> +	static const uint64_t features =
> +		(1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
>  	struct ifreq ifr = { .ifr_flags = IFF_TAP | IFF_NO_PI };

I kind of wonder, by the way, if IFF_TUN simplifies things here. It's
something we should already add, as an option, see also:
https://bugs.passt.top/show_bug.cgi?id=49, but if it makes your life
easier for any reason you might consider adding it right away.
>  	int flags = O_RDWR | O_NONBLOCK | O_CLOEXEC;
>  	struct ctx *c = (struct ctx *)arg;
> -	int fd, rc;
> +	int fd, vhost_fd, rc;
>  
>  	c->fd_tap = -1;
>  	memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
> @@ -1383,6 +1510,175 @@ static int tap_ns_tun(void *arg)
>  	if (fd < 0)
>  		die_perror("Failed to open() /dev/net/tun");
>  
> +	vhost_fd = open("/dev/vhost-net", flags);
> +	if (vhost_fd < 0)
> +		die_perror("Failed to open() /dev/vhost-net");

Note, mostly to my future self: this will need adjustments to AppArmor
and SELinux policies.

> +
> +	rc = ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net failed");
> +
> +	rc = ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* TODO inform more explicitely */
> +	fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
> +	fprintf(stderr, "req features: %lx\n", features);
> +	c->virtio_features &= features;
> +	if (c->virtio_features != features)
> +		die("vhost does not support required features");
> +
> +	for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {

No declarations directly in loops (it hides them somehow).
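
In practice this just means hoisting the counter to the top of the
enclosing block. A small standalone sketch, where struct vq, vqs[] and
vq_reset_all() are made-up stand-ins for the real code:

```c
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Made-up stand-in for the per-queue state in struct ctx */
struct vq {
	int kick_fd;
	int call_fd;
};

static struct vq vqs[2];

/* Mark all queue file descriptors as unset, return the number of queues */
static unsigned int vq_reset_all(void)
{
	unsigned int i;	/* declared with the other locals, not in the for */

	for (i = 0; i < ARRAY_SIZE(vqs); i++) {
		vqs[i].kick_fd = -1;
		vqs[i].call_fd = -1;
	}

	return i;
}
```

Using an unsigned counter here also avoids a signed/unsigned comparison
against the size_t that ARRAY_SIZE() yields.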
> +		struct vhost_vring_file file = {
> +			.index = i,
> +		};
> +		union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_CALL,
> +					.queue = i };
> +		struct epoll_event ev;
> +
> +		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		ref.fd = file.fd;
> +		if (file.fd < 0)
> +			die_perror("Failed to create call eventfd");
> +
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_CALL ioctl on /dev/vhost-net failed");

Same as net kernel style: if it's more than a line, even as a single
statement, use curly brackets (rationale: somebody later adds another
statement without noticing and... oops).

> +
> +		ev = (struct epoll_event){ .data.u64 = ref.u64, .events = EPOLLIN };
> +		rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +		if (rc < 0)
> +			die_perror("Failed to add call eventfd to epoll");
> +		c->vq[i].call_fd = file.fd;
> +	}
> +
> +	/* 1:1 translation */
> +	vhost_memory.mem.regions[0] = (struct vhost_memory_region){

Space between cast and initialiser, ") {", for consistency.

I'll wait until we have some kind of theory of operation / general
description before actually looking into those, I'm not sure about the
exact role of those seven regions.

> +		.guest_phys_addr = (uintptr_t)&vring_desc,
> +		/* memory size should include the last byte, but we probably never send
> +		 * ptrs there so...
> +		 */
> +		.memory_size = sizeof(vring_desc),
> +		.userspace_addr = (uintptr_t)&vring_desc,
> +	};
> +	vhost_memory.mem.regions[1] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)&vring_avail_0,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(vring_avail_0),
> +		.userspace_addr = (uintptr_t)&vring_avail_0,
> +	};
> +	vhost_memory.mem.regions[2] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)&vring_avail_1,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(vring_avail_1),
> +		.userspace_addr = (uintptr_t)&vring_avail_1,
> +	};
> +	vhost_memory.mem.regions[3] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)&vring_used_0,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(vring_avail_0),
> +		.userspace_addr = (uintptr_t)&vring_used_0,
> +	};
> +	vhost_memory.mem.regions[4] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)&vring_used_1,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(vring_avail_1),
> +		.userspace_addr = (uintptr_t)&vring_used_1,
> +	};
> +	vhost_memory.mem.regions[5] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)virtio_rx_pkt_buf,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(virtio_rx_pkt_buf),
> +		.userspace_addr = (uintptr_t)virtio_rx_pkt_buf,
> +	};
> +	vhost_memory.mem.regions[6] = (struct vhost_memory_region){
> +		.guest_phys_addr = (uintptr_t)pkt_buf,
> +		/* memory size should include the last byte, but we probably never send
> +			* ptrs there so...
> +			*/
> +		.memory_size = sizeof(pkt_buf),
> +		.userspace_addr = (uintptr_t)pkt_buf,
> +	};
> +	static_assert(6 < N_VHOST_REGIONS);
> +
> +	rc = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
> +	if (rc < 0)
> +		die_perror(
> +			"VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net failed");
> +
> +	/* TODO: probably it increases RX perf */
> +#if 0
> +	struct ifreq ifr;
> +	memset(&ifr, 0, sizeof(ifr));
> +
> +	if (ioctl(fd, TUNGETIFF, &ifr) != 0)
> +		die_perror("Unable to query TUNGETIFF on FD %d", fd);
> +	}
> +
> +	if (ifr.ifr_flags & IFF_VNET_HDR)
> +		net->dev.features &= ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
> +#endif
> +	rc = ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_addr addr = {
> +			.index = i,
> +			.desc_user_addr = (unsigned long)vring_desc[i],
> +			.avail_user_addr = i == 0 ? (unsigned long)&vring_avail_0 :
> +										(unsigned long)&vring_avail_1,
> +			.used_user_addr = i == 0 ? (unsigned long)&vring_used_0 :
> +										(unsigned long)&vring_used_1,
> +			/* GPA addr */
> +			.log_guest_addr = i == 0 ? (unsigned long)&vring_used_0 :
> +									 (unsigned long)&vring_used_1,
> +		};
> +		struct vhost_vring_state state = {
> +			.index = i,
> +			.num = VHOST_NDESCS,
> +		};
> +		struct vhost_vring_file file = {
> +			.index = i,
> +		};
> +
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_NUM ioctl on /dev/vhost-net failed");
> +
> +		fprintf(stderr, "qid: %d\n", i);
> +		fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_user_addr);
> +		fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_user_addr);
> +		fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_user_addr);
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_ADDR ioctl on /dev/vhost-net failed");
> +
> +		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		if (file.fd < 0)
> +			die_perror("Failed to create kick eventfd");
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_KICK ioctl on /dev/vhost-net failed");
> +		c->vq[i].kick_fd = file.fd;
> +	}
> +
>  	rc = ioctl(fd, (int)TUNSETIFF, &ifr);
>  	if (rc < 0)
>  		die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
> @@ -1390,7 +1686,25 @@ static int tap_ns_tun(void *arg)
>  	if (!(c->pasta_ifi = if_nametoindex(c->pasta_ifn)))
>  		die("Tap device opened but no network interface found");
>  
> +	rx_pkt_refill(c->vq[0].kick_fd);
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_file file = {
> +			.index = i,
> +			.fd = fd,
> +		};
> +
> +		fprintf(stderr, "qid: %d\n", file.index);
> +		fprintf(stderr, "tap fd: %d\n", file.fd);
> +		rc = ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_NET_SET_BACKEND ioctl on /dev/vhost-net failed");
> +	}
> +
>  	c->fd_tap = fd;
> +	c->fd_vhost = vhost_fd;
>  
>  	return 0;
>  }
> diff --git a/tap.h b/tap.h
> index 0b5ad17..91d3f62 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -69,6 +69,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  		thdr->vnet_len = htonl(l2len);
>  }
>  
> +void vhost_call_cb(struct ctx *c, union epoll_ref ref, const struct timespec *now);
> +
>  unsigned long tap_l2_max_len(const struct ctx *c);
>  void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
>  void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,

-- 
Stefano