From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=PEz80VYB; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id 31F115A0278 for ; Mon, 28 Jul 2025 18:34:26 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1753720464; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=cN2QNIZRTHnagONqfVGRgKCcI2/2rKqMuSv6RPMlj3c=; b=PEz80VYBglWzGSaO3jY392LzBbigi63t84eUlFheVC7vZVlFlc871xejESnXNsqf5IaGBT Anssa8pxI+kDl6wdLgPe3HswtMOQpbbkkyZIQ7txrxIZi7LzVjMtgwTi/o4XxAdyzcMrSr V99hByDPYFr4wiMBhrRJTYYSN950CF0= Received: from mail-pj1-f72.google.com (mail-pj1-f72.google.com [209.85.216.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-186-23iHVpRFMAC5mMbW_aFQhQ-1; Mon, 28 Jul 2025 12:34:22 -0400 X-MC-Unique: 23iHVpRFMAC5mMbW_aFQhQ-1 X-Mimecast-MFC-AGG-ID: 23iHVpRFMAC5mMbW_aFQhQ_1753720462 Received: by mail-pj1-f72.google.com with SMTP id 98e67ed59e1d1-31effad1358so58615a91.1 for ; Mon, 28 Jul 2025 09:34:22 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1753720461; x=1754325261; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=cN2QNIZRTHnagONqfVGRgKCcI2/2rKqMuSv6RPMlj3c=; b=UQSBxKaHGpCVPxjGeai8G5WaXnG+4mCzwGbUWNL5kARkFxq3vqS0fMURoHcFb05+v+ RvZvCFvn2A4SwMo7tcz0W1MlHZCe3lpDh4x8W5XUBBQvAKuQLddUBC3MJarcD33+OzM5 vmwUI+9LGu4IuKoMHyLxhFlRNieIY74u0Olh/HawKU0GUSQ9fft9ShijCKQlww+M2RLB Y3b6vWoiNOpMG4VrYHXTJYdFYspbyrgkH/ExpJJUjA/AovdV44zIsI3PoCHqF58R0AeG fD8wotRXddZUqdAwv/twmDvdyE600R8Xj5nlXl10HHXGfSoxlQ2A9boD15OGnAb+5mHF 2jXw== X-Gm-Message-State: AOJu0YxlJMKWy2jiTMwlfmuo96jvIBMbFs2lY1jDKIwdgS7EsB2hQwsQ Hl3RRYNvnDvBEcfqXKWYqMwLpU5qTTQICU+KNwBKMA5mqn7JB+o9WaHedKY37s9leh30XLDb6u5 iK0Q2slOCcBbUtoxB+g76BGDhP0IpCf7Mto8Qme2jG042bXE7qCTwovvjxXxkqtvxpq5OTwYR2u swrjC3iSOaQA4OFI6H+bdaDGY/Rwgi X-Gm-Gg: ASbGncucwbbVBqWWuxuyCIMR9C51Ec3s6GJdFRtl7JKFIVZ9ax/LGTBychZPZt4pl3u AF1AJpVty7bxkft1ohCfGXt2k7l+Qed91MdN+xJAjDGJVxyUJsOml3MJpBxU3LlYRzKGCufKEBg Ebb86csW3DyM9QXcUZtx3f X-Received: by 2002:a17:90b:3a8a:b0:31e:b772:dfcb with SMTP id 98e67ed59e1d1-31f28cdb992mr264396a91.10.1753720460878; Mon, 28 Jul 2025 09:34:20 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFF+immgcssyYX4De08yGz2MIZVZqul78LkDlTEs0GLQ5eJzlDS1CzJaHrQiwkIkqnJ4LaOSoua/JQRYrl5gdQ= X-Received: by 2002:a17:90b:3a8a:b0:31e:b772:dfcb with SMTP id 98e67ed59e1d1-31f28cdb992mr264320a91.10.1753720459975; Mon, 28 Jul 2025 09:34:19 -0700 (PDT) MIME-Version: 1.0 References: <20250709174748.3514693-1-eperezma@redhat.com> <20250709174748.3514693-2-eperezma@redhat.com> In-Reply-To: From: Eugenio Perez Martin Date: Mon, 28 Jul 2025 18:33:43 +0200 X-Gm-Features: Ac12FXwyR0WdQOjdlF6ne0ppkABAAWfI1kOxhS-9DchyOxvhENcGpst0RjqQFDw Message-ID: Subject: Re: [RFC v2 01/11] tap: implement vhost_call_cb To: David Gibson X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: oYYFcqzlqtRbAWZeJO0x7FsyzTCTWYPmBI2XmT0iBYE_1753720462 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Message-ID-Hash: YECZTSPU4VJXTO6GGE3MZDURPRTVKP7N X-Message-ID-Hash: YECZTSPU4VJXTO6GGE3MZDURPRTVKP7N X-MailFrom: eperezma@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, jasowang@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, Jul 23, 2025 at 8:57=E2=80=AFAM David Gibson wrote: > > On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio P=C3=A9rez wrote: > > Implement the rx side of the vhost-kernel, where the namespace send > > packets through its tap device and the kernel passes them to > > pasta using a virtqueue. The virtqueue is build using a > > virtual ring (vring), which includes: > > * A descriptor table (buffer info) > > * An available ring (buffers ready for the kernel) > > * A used ring (buffers that the kernel has filled) > > > > The descriptor table holds an array of buffers defined by address and > > length. The kernel writes the packets transmitted by the namespace into > > these buffers. The number of descriptors (vq size) is set by > > VHOST_NDESCS. Pasta fills this table using pkt_buf, splitting it > > evenly across all descriptors. This table is read-only for the kernel. > > > > The available ring is where pasta marks which buffers the kernel can > > use. It's read only for the kernel. It includes a ring[] array with > > the descriptor indexes and an avail->idx index. Pasta increments > > avail->idx when a new buffer is added, modulo the size of the > > virtqueue. As pasta writes the rx buffers sequentially, ring[] is > > always [0, 1, 2...] and only avail->idx is incremented when new buffers > > are available for the kernel. avail->idx can be incremented by more > > than one at a time. > > > > Pasta also notifies the kernel of new available buffers by writing to > > the kick eventfd. > > > > Once the kernel has written a frame in a descriptor it writes its index > > into used_ring->ring[] and increments the used_ring->idx accordly. > > Like the avail idx the kernel can increase it by more than one. Pasta > > gets a notification in the call eventfd, so we add it into the epoll ct= x. > > > > Pasta assumes buffers are used in order. QEMU also assumes it in the > > virtio-net migration code so it is safe. > > > > Now, vhost-kernel is designed to read the virtqueues and the buffers as > > *guest* physical addresses (GPA), not process virtual addresses (HVA). > > The way QEMU tells the translations is through the memory regions. > > Since we don't have GPAs, let's just send the memory regions as a 1:1 > > translations of the HVA. > > > > TODO: Evaluate if we can reuse the tap fd code instead of making a new > > epoll event type. > > I'm not sure this is a particularly desirable code, unless it obvious > removes a bunch of duplicated code. As a rule I'd prefer to use > different epoll event types rather than a single epoll type that then > needs to consult additional information to determine how to handle it. > Sure, it can be done that way. > > TODO: Split a new file for vhost (Stefano) > > > > Signed-off-by: Eugenio P=C3=A9rez > > --- > > RFC v2: > > * Need to integrate "now" parameter in tap_add_packet and replace > > TAP_MSGS to TAP_MSGS{4,6}. > > * Actually refill rx queue at the end of operation. > > * Explain virtqueues and memory regions theory of operation. > > * Extrack vhost_kick so it can be reused by tx. > > * Add macro for VHOST regions. > > * Only register call_cb in rx queue, as tx calls are handled just if > > needed. > > * Renamed vhost_call_cb to tap_vhost_input (David) > > * Move the inuse and last_used_idx to a writable struct from vhost tx > > function. Context is readonly for vhost tx code path. > > * Changed from inuse to num_free tracking, more aligned with kernel drv > > * Use always the same tap_pool instead of having the two splitted for v= 4 > > and v6. > > * Space between (struct vhost_memory_region) {.XXX =3D ...}. (Stefano) > > * Expand comments about memory region size (David) > > * Add some TODOs. > > > > Signed-off-by: Eugenio P=C3=A9rez > > --- > > epoll_type.h | 2 + > > passt.c | 7 +- > > passt.h | 10 +- > > tap.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++- > > tap.h | 8 ++ > > 5 files changed, 335 insertions(+), 3 deletions(-) > > > > diff --git a/epoll_type.h b/epoll_type.h > > index 12ac59b..0371c14 100644 > > --- a/epoll_type.h > > +++ b/epoll_type.h > > @@ -44,6 +44,8 @@ enum epoll_type { > > EPOLL_TYPE_REPAIR_LISTEN, > > /* TCP_REPAIR helper socket */ > > EPOLL_TYPE_REPAIR, > > + /* vhost-kernel call socket */ > > + EPOLL_TYPE_VHOST_CALL, > > > > EPOLL_NUM_TYPES, > > }; > > diff --git a/passt.c b/passt.c > > index 388d10f..0f2659c 100644 > > --- a/passt.c > > +++ b/passt.c > > @@ -79,6 +79,7 @@ char *epoll_type_str[] =3D { > > [EPOLL_TYPE_VHOST_KICK] =3D "vhost-user kick socket", > > [EPOLL_TYPE_REPAIR_LISTEN] =3D "TCP_REPAIR helper listening = socket", > > [EPOLL_TYPE_REPAIR] =3D "TCP_REPAIR helper socket", > > + [EPOLL_TYPE_VHOST_CALL] =3D "vhost-kernel call socket", > > }; > > static_assert(ARRAY_SIZE(epoll_type_str) =3D=3D EPOLL_NUM_TYPES, > > "epoll_type_str[] doesn't match enum epoll_type"); > > @@ -310,7 +311,8 @@ loop: > > > > switch (ref.type) { > > case EPOLL_TYPE_TAP_PASTA: > > - tap_handler_pasta(&c, eventmask, &now); > > + // TODO: Find a better way, maybe reuse the fd. > > + // tap_handler_pasta(&c, eventmask, &now); > > This change seems bogus. We still need this if we want to be able to > fall back to tap without vhost, which I think we want to be able to > do. I'm also not clear why you need it - if you don't want these > events, can't you just not register them (at least not with this event > type) in epoll_ctl? > Absolutely, it was commented out for testing purposes but I forgot to note = it. > > break; > > case EPOLL_TYPE_TAP_PASST: > > tap_handler_passt(&c, eventmask, &now); > > @@ -357,6 +359,9 @@ loop: > > case EPOLL_TYPE_REPAIR: > > repair_handler(&c, eventmask); > > break; > > + case EPOLL_TYPE_VHOST_CALL: > > + tap_vhost_input(&c, ref, &now); > > I'd suggest calling this tap_handler_vhost() for consistency with > tap_handler_pasta() and others. I've also found it usually ends up a > bit cleaner to pass the relevant individual members of ref, rather > than the whole thing. > I'm ok with tap_handler_vhost, but tap_vhost_input was your suggestion :) [= 1]. > > + break; > > default: > > /* Can't happen */ > > ASSERT(0); > > diff --git a/passt.h b/passt.h > > index 8693794..7bb86c4 100644 > > --- a/passt.h > > +++ b/passt.h > > @@ -45,7 +45,7 @@ union epoll_ref; > > * @icmp: ICMP-specific reference part > > * @data: Data handled by protocol handlers > > * @nsdir_fd: netns dirfd for fallback timer checking if namesp= ace is gone > > - * @queue: vhost-user queue index for this fd > > + * @queue: vhost queue index for this fd > > * @u64: Opaque reference for epoll_ctl() and epoll_wait() > > */ > > union epoll_ref { > > @@ -269,11 +269,14 @@ struct ctx { > > int fd_tap; > > int fd_repair_listen; > > int fd_repair; > > + /* TODO document all added fields */ > > + int fd_vhost; > > unsigned char our_tap_mac[ETH_ALEN]; > > unsigned char guest_mac[ETH_ALEN]; > > uint16_t mtu; > > > > uint64_t hash_secret[2]; > > + uint64_t virtio_features; > > > > int ifi4; > > struct ip4_ctx ip4; > > @@ -297,6 +300,11 @@ struct ctx { > > int no_icmp; > > struct icmp_ctx icmp; > > > > + struct { > > + int kick_fd; > > + int call_fd; > > + } vq[2]; > > Maybe not in scope for this series, but this is reminding me that we > should really split out the tap-backend specific fields into their own > union of substructures to make it clearer what fields are relevant in > which configurations. > > > > int no_dns; > > int no_dns_search; > > int no_dhcp_dns; > > diff --git a/tap.c b/tap.c > > index 6db5d88..e4a3822 100644 > > --- a/tap.c > > +++ b/tap.c > > @@ -31,6 +31,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6,= pkt_buf); > > #define TAP_SEQS 128 /* Different L4 tuples in one batch *= / > > #define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warning= s */ > > > > +#define VHOST_VIRTIO 0xAF > > +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64) > > +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64) > > +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01) > > +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_mem= ory) > > +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vri= ng_state) > > +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vri= ng_addr) > > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vri= ng_file) > > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vri= ng_file) > > +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vri= ng_file) > > +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64) > > +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vr= ing_file) > > This stuff probably belongs in linux_dep.h (if you can't get it from a > system or kernel header). > > > +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520) > > I suspect this is wrong. For starters we should definitely be using a > define, not open-coded 65520. But more importantly, 65520 is the > (default) MTU - the maximum size *of the IP packet*, excluding L2 > headers. I'm assuming these buffers will also need to contain the L2 > header, so will need to be larger. Looking at the later code, it > looks like these buffers also include the virtio_net_hdr_mrg_rxbuf > pseudo-physical header, so will need to be bigger again. > > I think you'll want a new define for this, which would be > L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header > (virtion_net_hdr_mrg_rxbuf, AFAICT). For now, I'm going to call this > L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it > later in this mail. You may also need/want to round it up to > 4/8/whatever bytes to avoid poorly aligned buffers. > Very good points, applying it for the next series. > > +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)), > > + "Number of vhost descs must be a power of two by= standard"); > > Relying on this calculation come out to a power of 2 seems kind of > fragile. Maybe just round down to a power of 2 instead? Although I > guess that could mean a bunch of wasted space. > To me it's the classical kernel include/linux/log2.h:is_power_of_2 function, but we can replace (or entirely omit) the assert for sure. > To avoid wasting space if this doesn't come neatly to a power of 2, > would it be safe to round _up_ to a power of 2, as long as we never, > ever put an index beyond what we actually have space for into the > available queue? > All descriptors are exposed to the kernel always in the rx queue, so rounding is not possible. This number must be a power of 2 in the split case, but I'll be happy if it is just a comment in the final version. > > +static struct { > > + /* Number of free descriptors */ > > + uint16_t num_free; > > + > > + /* Last used idx processed */ > > + uint16_t last_used_idx; > > +} vqs[2]; > > + > > +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((al= igned(PAGE_SIZE))); > > +static union { > > + struct vring_avail avail; > > + char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])]; > > The purpose of this idiom of a union with a char buffer based on > offsetof isn't obvious to me yet. > It's because vring_avail contains a flexible array member. In VirtIO the avail ring is defined as: struct virtq_avail { le16 flags; le16 idx; le16 ring[ /* Queue Size */ ]; /* ommiting last member here, as it is not used by pasta */ }; But the kernel uapi definition of vring_avail is: struct vring_avail { __virtio16 flags; __virtio16 idx; __virtio16 ring[]; }; So I need to allocate enough bytes for ring[]. To me the easiest way is by using char buf[offsetof(last_member_of_ring[])]. I would be happy to redefine it for pasta, as we know the size in advance, or to hide it in a macro. > > +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __a= ttribute__((aligned(PAGE_SIZE))); > > +static union { > > + struct vring_used used; > > + char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])]; > > +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __att= ribute__((aligned(PAGE_SIZE))); > > Maybe 2 entry arrays, rather than *_0 and *_1? > Flexible array members again :(. > > + > > +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */ > > +#define N_VHOST_REGIONS 6 > > +union { > > + struct vhost_memory mem; > > + char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])]= ; > > +} vhost_memory =3D { > > + .mem =3D { > > + .nregions =3D N_VHOST_REGIONS, > > + }, > > +}; > > + > > /** > > * tap_l2_max_len() - Maximum frame size (including L2 header) for cur= rent mode > > * @c: Execution context > > @@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c, > > tap_send_single(c, buf, l4len + ((char *)icmp6h - buf)); > > } > > > > +static void vhost_kick(struct vring_used *used, int kick_fd) { > > passt style puts the opening { of a function on the nect line. > Ouch, I'll fix it in the next version! > > + /* We need to expose available array entries before checking avai= l > > + * event. > > + * > > + * TODO: Does eventfd_write already do this? > > + */ > > + smp_mb(); > > + > > + if (!(used->flags & VRING_USED_F_NO_NOTIFY)) > > + eventfd_write(kick_fd, 1); > > +} > > + > > /** > > * tap_send_frames_pasta() - Send multiple frames to the pasta tap > > * @c: Execution context > > @@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t = events) > > tap_start_connection(c); > > } > > > > +static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len) > > None of these functions have descriptive comment blocks yet. I > realise this is an RFC, but it can certainly make things easier to > review if you know what to expect before reading the function. > > > +{ > > + struct vring_used *used =3D !qid ? &vring_used_0.used : &vring_us= ed_1.used; > > Using an array as suggested above would avoid this awkward construct. > > > + uint32_t i; > > + uint16_t used_idx, last_used; > > passt style orders declarations in reverse order of line length (when > possible), so this line should be before the previous one. > Got it, I'll use it from now on. > > + > > + /* TODO think if this has races with previous eventfd_read */ > > + /* TODO we could improve performance with a shadow_used_idx */ > > + used_idx =3D le16toh(used->idx); > > + > > + smp_rmb(); > > + > > + if (used_idx =3D=3D vqs[0].last_used_idx) { > > + *len =3D 0; > > + return NULL; > > + } > > + > > + last_used =3D vqs[0].last_used_idx % VHOST_NDESCS; > > + i =3D le32toh(used->ring[last_used].id); > > + *len =3D le32toh(used->ring[last_used].len); > > + > > + /* Make sure the kernel is consuming the descriptors in order */ > > + if (i !=3D last_used) { > > + die("vhost: id %u at used position %u !=3D %u", i, last_u= sed, i); > > + return NULL; > > + } > > + > > + if (*len > PKT_BUF_BYTES/VHOST_NDESCS) { > > You want your L1_MAX_LEN_VHOST (or whatever it's called) again here. > > > + warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/= VHOST_NDESCS); > > And here. > > > + return NULL; > > + } > > + > > + /* TODO check if the id is valid and it has not been double used = */ > > + vqs[0].last_used_idx++; > > + vqs[0].num_free++; > > + return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS); > > And here. > > > +} > > + > > +/* TODO this assumes the kernel consumes descriptors in order */ > > +static void rx_pkt_refill(struct ctx *c) > > +{ > > + /* TODO: tune this threshold */ > > + if (!vqs[0].num_free) > > + return; > > + > > + vring_avail_0.avail.idx +=3D vqs[0].num_free; > > + vqs[0].num_free =3D 0; > > + vhost_kick(&vring_used_0.used, c->vq[0].kick_fd); > > +} > > + > > +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct = timespec *now) > > +{ > > + eventfd_read(ref.fd, (eventfd_t[]){ 0 }); > > + > > + tap_flush_pools(); > > + > > + while (true) { > > + struct virtio_net_hdr_mrg_rxbuf *hdr; > > + unsigned len; > > + > > + hdr =3D virtqueue_get_rx_buf(ref.queue, &len); > > + if (!hdr) > > + break; > > You could use while ((hdr =3D virtqueue_get_rx_buf(...))) instead of > while (true), couldn't you? > Sure. > > + > > + if (len < sizeof(*hdr)) { > > You should check that there are enough bytes for the L2 header as well > as the mrg_rxbuf header - tap_add_packet() doesn't appear to check > that itself (maybe it should). > That's right, I thought tap_add_packet did it but now I realize it assumes there is. Why not check it in tap_add_packet, as all namespace input needs to go there? We need to adapt the function signature though. > > + warn("vhost: invalid len %u", len); > > + continue; > > + } > > + > > + /* TODO this will break from this moment */ > > + if (hdr->num_buffers !=3D 1) { > > + warn("vhost: Too many buffers %u, %zu bytes shoul= d be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS); > > L1_MAX_LEN_VHOST again. I'm ok with moving to L1_MAX_LEN_VHOST, but then another bunch of places would need L1_MAX_LEN_VHOST*VHOST_NDESCS. Or should we keep the two macros in parallel? > Also passt style keeps lines below 80 > columns. > Fixing for the next RFC. > > + continue; > > + } > > + > > + /* TODO fix the v6 pool to support ipv6 */ > > Not clear on why anything particular is needed for ipv6 here. > This comment is outdated actually, I only used pool_tap4 here. Fixing in the next revision! > > + tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), no= w); > > + } > > + > > + tap_handler(c, now); > > + rx_pkt_refill(c); > > +} > > + > > /** > > * tap_ns_tun() - Get tuntap fd in namespace > > * @c: Execution context > > @@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t= events) > > */ > > static int tap_ns_tun(void *arg) > > { > > + /* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF = and VHOST_NET_F_VIRTIO_NET_HDR actually */ > > + static const uint64_t features =3D > > + (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_= RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR); > > If I understand the code above properly, for the Rx path at least we > don't need or want F_MRG_RXBUF. So why require it here? If we want > it for the Tx path, we can omit it here and add it later in the > series. > I can try without it, sure. > > struct ifreq ifr =3D { .ifr_flags =3D IFF_TAP | IFF_NO_PI }; > > int flags =3D O_RDWR | O_NONBLOCK | O_CLOEXEC; > > struct ctx *c =3D (struct ctx *)arg; > > - int fd, rc; > > + unsigned i; > > + int fd, vhost_fd, rc; > > > > c->fd_tap =3D -1; > > memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ); > > @@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg) > > if (fd < 0) > > die_perror("Failed to open() /dev/net/tun"); > > > > + vhost_fd =3D open("/dev/vhost-net", flags); > > + if (vhost_fd < 0) > > + die_perror("Failed to open() /dev/vhost-net"); > > + > > + rc =3D ioctl(vhost_fd, VHOST_SET_OWNER, NULL); > > + if (rc < 0) > > + die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net faile= d"); > > + > > + rc =3D ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features); > > + if (rc < 0) > > + die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net fa= iled"); > > + > > + /* TODO inform more explicitely */ > > + fprintf(stderr, "vhost features: %lx\n", c->virtio_features); > > + fprintf(stderr, "req features: %lx\n", features); > > You should use info() or debug() here. > Right, I'll change. > > + c->virtio_features &=3D features; > > + if (c->virtio_features !=3D features) > > + die("vhost does not support required features"); > > + > > + for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) { > > + struct vhost_vring_file file =3D { > > + .index =3D i, > > + }; > > + union epoll_ref ref =3D { .type =3D EPOLL_TYPE_VHOST_CALL= , > > + .queue =3D i }; > > + struct epoll_event ev; > > + > > + file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > + ref.fd =3D file.fd; > > + if (file.fd < 0) > > + die_perror("Failed to create call eventfd"); > > I know it's technically harmless, but it reads better for me if you > check for the error immediately after the eventfd() call, before doing > something else with the value. > Changing in the next version. > > + > > + rc =3D ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file); > > + if (rc < 0) > > + die_perror( > > + "VHOST_SET_VRING_CALL ioctl on /dev/vhost= -net failed"); > > + > > + ev =3D (struct epoll_event){ .data.u64 =3D ref.u64, .even= ts =3D EPOLLIN }; > > + if (i =3D=3D 0) { > > + rc =3D epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.f= d, &ev); > > + if (rc < 0) > > + die_perror("Failed to add call eventfd to= epoll"); > > + } > > + c->vq[i].call_fd =3D file.fd; > > + } > > + > > + /* > > + * Unlike most of C-style APIs, userspace_addr+memory_size is > > + * also accesible by the kernel. Include a -1 to adjust. > > + */ > > +#define VHOST_MEMORY_REGION_PTR(addr, size) \ > > + (struct vhost_memory_region) { \ > > + .guest_phys_addr =3D (uintptr_t)addr, \ > > + .memory_size =3D size - 1, \ > > + .userspace_addr =3D (uintptr_t)addr, \ > > + } > > +#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeo= f(elem)) > > + > > + /* 1:1 translation */ > > + vhost_memory.mem.regions[0] =3D VHOST_MEMORY_REGION(vring_desc); > > + vhost_memory.mem.regions[1] =3D VHOST_MEMORY_REGION(vring_avail_0= ); > > + vhost_memory.mem.regions[2] =3D VHOST_MEMORY_REGION(vring_avail_1= ); > > + vhost_memory.mem.regions[3] =3D VHOST_MEMORY_REGION(vring_used_0)= ; > > + vhost_memory.mem.regions[4] =3D VHOST_MEMORY_REGION(vring_used_1)= ; > > + vhost_memory.mem.regions[5] =3D VHOST_MEMORY_REGION(pkt_buf); > > + static_assert(5 < N_VHOST_REGIONS); > > +#undef VHOST_MEMORY_REGION > > +#undef VHOST_MEMORY_REGION_PTR > > + > > + rc =3D ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem); > > + if (rc < 0) > > + die_perror( > > + "VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net fail= ed"); > > + > > + /* TODO: probably it increases RX perf */ > > +#if 0 > > + struct ifreq ifr; > > + memset(&ifr, 0, sizeof(ifr)); > > + > > + if (ioctl(fd, TUNGETIFF, &ifr) !=3D 0) > > + die_perror("Unable to query TUNGETIFF on FD %d", fd); > > + } > > + > > + if (ifr.ifr_flags & IFF_VNET_HDR) > > + net->dev.features &=3D ~(1ULL << VIRTIO_NET_F_MRG_RXBUF); > > +#endif > > + rc =3D ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features); > > + if (rc < 0) > > + die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net fa= iled"); > > + > > + /* Duplicating foreach queue to follow the exact order from QEMU = */ > > + for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) { > > + struct vhost_vring_addr addr =3D { > > + .index =3D i, > > + .desc_user_addr =3D (unsigned long)vring_desc[i], > > + .avail_user_addr =3D i =3D=3D 0 ? (unsigned long)= &vring_avail_0 : > > + = (unsigned long)&vring_avail_1, > > + .used_user_addr =3D i =3D=3D 0 ? (unsigned long)&= vring_used_0 : > > + = (unsigned long)&vring_used_1, > > + /* GPA addr */ > > + .log_guest_addr =3D i =3D=3D 0 ? (unsigned long)&= vring_used_0 : > > + = (unsigned long)&vring_used_1, > > + }; > > + struct vhost_vring_state state =3D { > > + .index =3D i, > > + .num =3D VHOST_NDESCS, > > + }; > > + struct vhost_vring_file file =3D { > > + .index =3D i, > > + }; > > + > > + rc =3D ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state); > > + if (rc < 0) > > + die_perror( > > + "VHOST_SET_VRING_NUM ioctl on /dev/vhost-= net failed"); > > + > > + fprintf(stderr, "qid: %d\n", i); > > + fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_us= er_addr); > > + fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_= user_addr); > > + fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_us= er_addr); > > + rc =3D ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr); > > + if (rc < 0) > > + die_perror( > > + "VHOST_SET_VRING_ADDR ioctl on /dev/vhost= -net failed"); > > + > > + file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); > > + if (file.fd < 0) > > + die_perror("Failed to create kick eventfd"); > > + rc =3D ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file); > > + if (rc < 0) > > + die_perror( > > + "VHOST_SET_VRING_KICK ioctl on /dev/vhost= -net failed"); > > + c->vq[i].kick_fd =3D file.fd; > > + > > + vqs[i].num_free =3D VHOST_NDESCS; > > + } > > + > > rc =3D ioctl(fd, (int)TUNSETIFF, &ifr); > > if (rc < 0) > > die_perror("TUNSETIFF ioctl on /dev/net/tun failed"); > > @@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg) > > if (!(c->pasta_ifi =3D if_nametoindex(c->pasta_ifn))) > > die("Tap device opened but no network interface found"); > > > > + for (i =3D 0; i < VHOST_NDESCS; ++i) { > > + vring_desc[0][i].addr =3D (uintptr_t)pkt_buf + i * (PKT_B= UF_BYTES/VHOST_NDESCS); > > + vring_desc[0][i].len =3D PKT_BUF_BYTES/VHOST_NDESCS; > > + vring_desc[0][i].flags =3D VRING_DESC_F_WRITE; > > + } > > + for (i =3D 0; i < VHOST_NDESCS; ++i) { > > No { } for single line blocks in passt style. > I'll change for the next version. > > + vring_avail_0.avail.ring[i] =3D i; > > + } > > + > > + rx_pkt_refill(c); > > + > > + /* Duplicating foreach queue to follow the exact order from QEMU = */ > > + for (int i =3D 0; i < ARRAY_SIZE(c->vq); i++) { > > + struct vhost_vring_file file =3D { > > + .index =3D i, > > + .fd =3D fd, > > + }; > > + > > + fprintf(stderr, "qid: %d\n", file.index); > > + fprintf(stderr, "tap fd: %d\n", file.fd); > > + rc =3D ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file); > > + if (rc < 0) > > + die_perror( > > + "VHOST_NET_SET_BACKEND ioctl on /dev/vhos= t-net failed"); > > + } > > + > > c->fd_tap =3D fd; > > + c->fd_vhost =3D vhost_fd; > > > > return 0; > > } > > diff --git a/tap.h b/tap.h > > index 6fe3d15..ff8cee5 100644 > > --- a/tap.h > > +++ b/tap.h > > @@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *th= dr, size_t l2len) > > thdr->vnet_len =3D htonl(l2len); > > } > > > > +/** > > + * tap_vhost_input() - Handler for new data on the socket to hyperviso= r vq > > + * @c: Execution context > > + * @ref: epoll reference > > + * @now: Current timestamp > > + */ > > passt convention puts these comments in the .c file not the .h, even > for exported functions. > Ouch, I missed that. > > +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct = timespec *now); > > + > > unsigned long tap_l2_max_len(const struct ctx *c); > > void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto); > > void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, > [1] https://archives.passt.top/passt-dev/Z-3oQcL28QuIh6LT@zatzit/