From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 17 Oct 2024 02:10:27 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH v7 0/8] Add vhost-user support to passt.
Message-ID: <20241017021027.2ac9ea53@elisabeth>
In-Reply-To: <20241011200730.63c97dc7@elisabeth>
References: <20241009090716.691361-1-lvivier@redhat.com>
	<20241009193726.7e2e5790@elisabeth>
	<20241010090801.23da8bff@elisabeth>
	<20241011200730.63c97dc7@elisabeth>
Organization: Red Hat
Cc: passt-dev@passt.top
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="MP_/NMSZLojx+6.eoRW/YaLijs4"

--MP_/NMSZLojx+6.eoRW/YaLijs4
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

On Fri, 11 Oct 2024 20:07:30 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> [...]
>
> > ...maybe I can try out a kernel with a version of that as
> > clear_page_rep() and see what happens.
>
> ...so I tried, it looks like this, but it doesn't boot for some reason:

I played with this a bit more. If I select the AVX2-based page clearing
with:

	if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {

instead of just irq_fpu_usable(), the kernel boots, and everything
works (also after init).

I tested this in a VM where I can't really get a baseline throughput
that's comparable to the host: iperf3 to iperf3 via loopback gives me
about 50 Gbps (instead of 70 as I get on the host), and the same iperf3
vhost-user test with outbound traffic from the nested, L2 guest yields
about 20 Gbps (instead of 25).
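For reference, the loop-based variant discussed in 1. below corresponds
roughly to the following C (a sketch reconstructed from the annotated
output and from the helpers carried by the attached patch; the store
macro here would expand to vmovdqa rather than the vmovntdq the patch
ended up with):

	#define MEMSET_AVX2_ZERO(reg) \
		asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
	#define MEMSET_AVX2_STORE(loc, reg) \
		asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))

	#define YMM_BYTES (256 / 8)
	#define BYTES_TO_YMM(x) ((x) / YMM_BYTES)

	static inline void clear_page(void *page)
	{
		unsigned long i;

		kmsan_unpoison_memory(page, PAGE_SIZE);

		/* Only touch the FPU once the kernel is fully up and when
		 * it's safe to do so from this context, otherwise keep the
		 * existing implementations:
		 */
		if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
			kernel_fpu_begin_mask(0);
			MEMSET_AVX2_ZERO(0);
			for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
				MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * i],
						  0);
			kernel_fpu_end();
			return;
		}

		/* alternative_call_2(clear_page_orig, clear_page_rep, ...)
		 * as before
		 */
	}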
1. The VMOVDQA version I was originally trying looks like this:

Samples: 39K of event 'cycles:P', Event count (approx.): 34909065261
  Children      Self  Command     Shared Object      Symbol
-   94.32%     0.87%  passt.avx2  [kernel.kallsyms]  [k] entry_SYSCALL_64
   - 93.45% entry_SYSCALL_64
      - 93.29% do_syscall_64
         - 79.66% __sys_sendmsg
            - 79.52% ___sys_sendmsg
               - 78.88% ____sys_sendmsg
                  - 78.46% tcp_sendmsg
                     - 66.75% tcp_sendmsg_locked
                        - 25.81% sk_page_frag_refill
                           - 25.73% skb_page_frag_refill
                              - 25.34% alloc_pages_mpol_noprof
                                 - 25.17% __alloc_pages_noprof
                                    - 24.91% get_page_from_freelist
                                       - 23.38% kernel_init_pages
                                            0.88% kernel_fpu_begin_mask
                        - 15.37% tcp_write_xmit
                           - 14.14% __tcp_transmit_skb
                              - 13.31% __ip_queue_xmit
                                 - 11.06% ip_finish_output2
                                    - 10.89% __dev_queue_xmit
                                       - 10.00% __local_bh_enable_ip
                                          - do_softirq.part.0
                                             - handle_softirqs
                                                - 9.86% net_rx_action
                                                   - 7.95% __napi_poll
                                                      + process_backlog
                                                   + 1.17% napi_consume_skb
                                       + 0.61% dev_hard_start_xmit
                                 - 1.56% ip_local_out
                                    - __ip_local_out
                                       - 1.29% nf_hook_slow
                                            1.00% nf_conntrack_in
                        + 14.60% _copy_from_iter
                        + 3.97% __tcp_push_pending_frames
                        + 2.42% tcp_stream_alloc_skb
                        + 2.08% tcp_wmem_schedule
                          0.64% __check_object_size
                     + 11.08% release_sock
      + 4.48% ksys_write
      + 3.57% __x64_sys_epoll_wait
      + 2.26% __x64_sys_getsockopt
        1.09% syscall_exit_to_user_mode
      + 0.90% ksys_read
        0.64% syscall_trace_enter

...that's 24.91% clock cycles spent on get_page_from_freelist() instead
of 25.61% I was getting with the original clear_page() implementation.
Checking the annotated output, it doesn't look very...
superscalar:

Samples: 39K of event 'cycles:P', 4000 Hz, Event count (approx.): 34909065261
Percent│    {
       │            return page_to_virt(page);
       │32:   mov     %r12,%rbx
       │      sub     vmemmap_base,%rbx
  0.32 │      sar     $0x6,%rbx
  0.02 │      shl     $0xc,%rbx
  0.02 │      add     page_offset_base,%rbx
       │    clear_page():
       │    if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
  0.05 │      cmpl    $0x2,system_state
  0.47 │      jbe     21
  0.01 │      call    irq_fpu_usable
  0.20 │      test    %al,%al
  0.56 │      je      21
       │    kernel_fpu_begin_mask(0);
  0.07 │      xor     %edi,%edi
       │      call    kernel_fpu_begin_mask
       │    MEMSET_AVX2_ZERO(0);
  0.06 │      vpxor   %ymm0,%ymm0,%ymm0
       │    for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
  0.58 │      lea     0x1000(%rbx),%rax
       │    MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * i], 0);
  4.96 │6f:   vmovdqa %ymm0,(%rbx)
 71.38 │      vmovdqa %ymm0,0x20(%rbx)
       │    for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
  2.81 │      add     $0x40,%rbx
  0.06 │      cmp     %rbx,%rax
 17.22 │      jne     6f
       │    kernel_fpu_end();
       │      call    kernel_fpu_end
       │    kernel_init_pages():
  0.44 │      add     $0x40,%r12
  0.55 │      cmp     %r12,%rbp
  0.07 │      jne     32
       │    clear_highpage_kasan_tagged(page + i);
       │    kasan_enable_current();
       │    }
       │8f:   pop     %rbx
  0.11 │      pop     %rbp
       │      pop     %r12
  0.01 │      jmp     __x86_return_thunk
       │98:   jmp     __x86_return_thunk

2. Let's try to unroll it:
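That is, replace the loop with one explicit store per 32-byte chunk of
the page, along these lines (a heavily abbreviated sketch; the attached
patch spells out all 128 stores):

	if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
		kernel_fpu_begin_mask(0);
		MEMSET_AVX2_ZERO(0);

		/* No induction variable and no branch left, just stores: */
		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x00], 0);
		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x01], 0);
		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x02], 0);
		/* ... */
		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7f], 0);

		kernel_fpu_end();
		return;
	}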
Samples: 39K of event 'cycles:P', Event count (approx.): 33598124504
  Children      Self  Command     Shared Object      Symbol
+   92.49%     0.33%  passt.avx2  [kernel.kallsyms]  [k] entry_SYSCALL_64_a
-   92.01%     0.47%  passt.avx2  [kernel.kallsyms]  [k] do_syscall_64
   - 91.54% do_syscall_64
      - 75.04% __sys_sendmsg
         - 74.85% ___sys_sendmsg
            - 74.26% ____sys_sendmsg
               - 73.68% tcp_sendmsg
                  - 62.69% tcp_sendmsg_locked
                     - 22.26% sk_page_frag_refill
                        - 22.14% skb_page_frag_refill
                           - 21.74% alloc_pages_mpol_noprof
                              - 21.52% __alloc_pages_noprof
                                 - 21.25% get_page_from_freelist
                                    - 20.04% prep_new_page
                                       - 19.57% clear_page
                                            0.55% kernel_fpu_begin_mask
                     + 15.04% tcp_write_xmit
                     + 13.77% _copy_from_iter
                     + 5.12% __tcp_push_pending_frames
                     + 2.05% tcp_wmem_schedule
                     + 1.86% tcp_stream_alloc_skb
                       0.73% __check_object_size
                  + 10.15% release_sock
                  + 0.62% lock_sock_nested
      + 5.63% ksys_write
      + 4.65% __x64_sys_epoll_wait
      + 2.61% __x64_sys_getsockopt
        1.21% syscall_exit_to_user_mode
      + 1.16% ksys_read
      + 0.84% syscall_trace_enter

annotated:

Samples: 39K of event 'cycles:P', 4000 Hz, Event count (approx.): 33598124504
clear_page  /proc/kcore [Percent: local period]
Percent│
       │    ffffffffb5198480 <clear_page>:
  0.06 │      push     %rbx
  0.27 │      cmpl     $0x2,0x1ab243c(%rip)
  0.07 │      mov      %rdi,%rbx
       │      ja       1b
       │ d:   mov      %rbx,%rdi
       │      call     clear_page_rep
       │      pop      %rbx
       │      jmp      srso_return_thunk
  0.03 │1b:   call     irq_fpu_usable
  0.14 │      test     %al,%al
  0.64 │      je       d
  0.04 │      xor      %edi,%edi
       │      call     kernel_fpu_begin_mask
  0.05 │      vpxor    %ymm0,%ymm0,%ymm0
  0.80 │      vmovdqa  %ymm0,(%rbx)
  1.12 │      vmovdqa  %ymm0,0x20(%rbx)
  0.06 │      vmovdqa  %ymm0,0x40(%rbx)
  1.39 │      vmovdqa  %ymm0,0x60(%rbx)
  0.24 │      vmovdqa  %ymm0,0x80(%rbx)
  0.58 │      vmovdqa  %ymm0,0xa0(%rbx)
  0.21 │      vmovdqa  %ymm0,0xc0(%rbx)
  0.77 │      vmovdqa  %ymm0,0xe0(%rbx)
  0.38 │      vmovdqa  %ymm0,0x100(%rbx)
  7.60 │      vmovdqa  %ymm0,0x120(%rbx)
  0.26 │      vmovdqa  %ymm0,0x140(%rbx)
  1.38 │      vmovdqa  %ymm0,0x160(%rbx)
  0.42 │      vmovdqa  %ymm0,0x180(%rbx)
  1.25 │      vmovdqa  %ymm0,0x1a0(%rbx)
  0.26 │      vmovdqa  %ymm0,0x1c0(%rbx)
  0.73 │      vmovdqa  %ymm0,0x1e0(%rbx)
  0.33 │      vmovdqa  %ymm0,0x200(%rbx)
  1.72 │      vmovdqa  %ymm0,0x220(%rbx)
  0.16 │      vmovdqa  %ymm0,0x240(%rbx)
  0.61 │      vmovdqa  %ymm0,0x260(%rbx)
  0.19 │      vmovdqa  %ymm0,0x280(%rbx)
  0.68 │      vmovdqa  %ymm0,0x2a0(%rbx)
  0.22 │      vmovdqa  %ymm0,0x2c0(%rbx)
  0.66 │      vmovdqa  %ymm0,0x2e0(%rbx)
  0.50 │      vmovdqa  %ymm0,0x300(%rbx)
  0.67 │      vmovdqa  %ymm0,0x320(%rbx)
  0.29 │      vmovdqa  %ymm0,0x340(%rbx)
  0.31 │      vmovdqa  %ymm0,0x360(%rbx)
  0.14 │      vmovdqa  %ymm0,0x380(%rbx)
  0.55 │      vmovdqa  %ymm0,0x3a0(%rbx)
  0.35 │      vmovdqa  %ymm0,0x3c0(%rbx)
  0.82 │      vmovdqa  %ymm0,0x3e0(%rbx)
  0.25 │      vmovdqa  %ymm0,0x400(%rbx)
  0.49 │      vmovdqa  %ymm0,0x420(%rbx)
  0.18 │      vmovdqa  %ymm0,0x440(%rbx)
  1.05 │      vmovdqa  %ymm0,0x460(%rbx)
  0.08 │      vmovdqa  %ymm0,0x480(%rbx)
  2.22 │      vmovdqa  %ymm0,0x4a0(%rbx)
  0.20 │      vmovdqa  %ymm0,0x4c0(%rbx)
  2.33 │      vmovdqa  %ymm0,0x4e0(%rbx)
  0.03 │      vmovdqa  %ymm0,0x500(%rbx)
  2.87 │      vmovdqa  %ymm0,0x520(%rbx)
  0.08 │      vmovdqa  %ymm0,0x540(%rbx)
  1.60 │      vmovdqa  %ymm0,0x560(%rbx)
  0.01 │      vmovdqa  %ymm0,0x580(%rbx)
  7.03 │      vmovdqa  %ymm0,0x5a0(%rbx)
  0.42 │      vmovdqa  %ymm0,0x5c0(%rbx)
  2.74 │      vmovdqa  %ymm0,0x5e0(%rbx)
  0.69 │      vmovdqa  %ymm0,0x600(%rbx)
  2.34 │      vmovdqa  %ymm0,0x620(%rbx)
  0.37 │      vmovdqa  %ymm0,0x640(%rbx)
  1.21 │      vmovdqa  %ymm0,0x660(%rbx)
  0.22 │      vmovdqa  %ymm0,0x680(%rbx)
  1.16 │      vmovdqa  %ymm0,0x6a0(%rbx)
  0.29 │      vmovdqa  %ymm0,0x6c0(%rbx)
  0.98 │      vmovdqa  %ymm0,0x6e0(%rbx)
  0.19 │      vmovdqa  %ymm0,0x700(%rbx)
  0.81 │      vmovdqa  %ymm0,0x720(%rbx)
  0.47 │      vmovdqa  %ymm0,0x740(%rbx)
  0.69 │      vmovdqa  %ymm0,0x760(%rbx)
  0.23 │      vmovdqa  %ymm0,0x780(%rbx)
  0.68 │      vmovdqa  %ymm0,0x7a0(%rbx)
  0.30 │      vmovdqa  %ymm0,0x7c0(%rbx)
  0.68 │      vmovdqa  %ymm0,0x7e0(%rbx)
  0.25 │      vmovdqa  %ymm0,0x800(%rbx)
  0.58 │      vmovdqa  %ymm0,0x820(%rbx)
  0.19 │      vmovdqa  %ymm0,0x840(%rbx)
  0.83 │      vmovdqa  %ymm0,0x860(%rbx)
  0.27 │      vmovdqa  %ymm0,0x880(%rbx)
  1.01 │      vmovdqa  %ymm0,0x8a0(%rbx)
  0.16 │      vmovdqa  %ymm0,0x8c0(%rbx)
  0.89 │      vmovdqa  %ymm0,0x8e0(%rbx)
  0.24 │      vmovdqa  %ymm0,0x900(%rbx)
  0.98 │      vmovdqa  %ymm0,0x920(%rbx)
  0.28 │      vmovdqa  %ymm0,0x940(%rbx)
  0.86 │      vmovdqa  %ymm0,0x960(%rbx)
  0.23 │      vmovdqa  %ymm0,0x980(%rbx)
  1.19 │      vmovdqa  %ymm0,0x9a0(%rbx)
  0.28 │      vmovdqa  %ymm0,0x9c0(%rbx)
  1.04 │      vmovdqa  %ymm0,0x9e0(%rbx)
  0.33 │      vmovdqa  %ymm0,0xa00(%rbx)
  0.90 │      vmovdqa  %ymm0,0xa20(%rbx)
  0.35 │      vmovdqa  %ymm0,0xa40(%rbx)
  0.87 │      vmovdqa  %ymm0,0xa60(%rbx)
  0.25 │      vmovdqa  %ymm0,0xa80(%rbx)
  0.89 │      vmovdqa  %ymm0,0xaa0(%rbx)
  0.28 │      vmovdqa  %ymm0,0xac0(%rbx)
  0.92 │      vmovdqa  %ymm0,0xae0(%rbx)
  0.23 │      vmovdqa  %ymm0,0xb00(%rbx)
  1.39 │      vmovdqa  %ymm0,0xb20(%rbx)
  0.29 │      vmovdqa  %ymm0,0xb40(%rbx)
  1.15 │      vmovdqa  %ymm0,0xb60(%rbx)
  0.26 │      vmovdqa  %ymm0,0xb80(%rbx)
  1.33 │      vmovdqa  %ymm0,0xba0(%rbx)
  0.29 │      vmovdqa  %ymm0,0xbc0(%rbx)
  1.05 │      vmovdqa  %ymm0,0xbe0(%rbx)
  0.25 │      vmovdqa  %ymm0,0xc00(%rbx)
  0.89 │      vmovdqa  %ymm0,0xc20(%rbx)
  0.34 │      vmovdqa  %ymm0,0xc40(%rbx)
  0.78 │      vmovdqa  %ymm0,0xc60(%rbx)
  0.40 │      vmovdqa  %ymm0,0xc80(%rbx)
  0.99 │      vmovdqa  %ymm0,0xca0(%rbx)
  0.44 │      vmovdqa  %ymm0,0xcc0(%rbx)
  1.06 │      vmovdqa  %ymm0,0xce0(%rbx)
  0.35 │      vmovdqa  %ymm0,0xd00(%rbx)
  0.85 │      vmovdqa  %ymm0,0xd20(%rbx)
  0.46 │      vmovdqa  %ymm0,0xd40(%rbx)
  0.88 │      vmovdqa  %ymm0,0xd60(%rbx)
  0.38 │      vmovdqa  %ymm0,0xd80(%rbx)
  0.82 │      vmovdqa  %ymm0,0xda0(%rbx)
  0.40 │      vmovdqa  %ymm0,0xdc0(%rbx)
  0.98 │      vmovdqa  %ymm0,0xde0(%rbx)
  0.27 │      vmovdqa  %ymm0,0xe00(%rbx)
  1.10 │      vmovdqa  %ymm0,0xe20(%rbx)
  0.25 │      vmovdqa  %ymm0,0xe40(%rbx)
  0.89 │      vmovdqa  %ymm0,0xe60(%rbx)
  0.32 │      vmovdqa  %ymm0,0xe80(%rbx)
  0.87 │      vmovdqa  %ymm0,0xea0(%rbx)
  0.22 │      vmovdqa  %ymm0,0xec0(%rbx)
  0.94 │      vmovdqa  %ymm0,0xee0(%rbx)
  0.27 │      vmovdqa  %ymm0,0xf00(%rbx)
  0.90 │      vmovdqa  %ymm0,0xf20(%rbx)
  0.28 │      vmovdqa  %ymm0,0xf40(%rbx)
  0.79 │      vmovdqa  %ymm0,0xf60(%rbx)
  0.31 │      vmovdqa  %ymm0,0xf80(%rbx)
  1.11 │      vmovdqa  %ymm0,0xfa0(%rbx)
  0.25 │      vmovdqa  %ymm0,0xfc0(%rbx)
  0.99 │      vmovdqa  %ymm0,0xfe0(%rbx)
  0.10 │      pop      %rbx
       │      jmp      0xffffffffb4e4b050

...that looks like progress: we now spend 21.25% of the clock cycles on
get_page_from_freelist() (non-AVX: 25.61%). But still, there seem to be
(somewhat unexpected) stalls. For example, after 8 VMOVDQA instructions:

  7.60 │      vmovdqa  %ymm0,0x120(%rbx)

we have one where we spend/wait a long time, and there are more later.

3. ...what if we use a non-temporal hint, that is, if we clear the page
without making it cache hot ("stream" instead of "store")?

That's vmovntdq m256, ymm ("nt" meaning non-temporal). It's not
vmovntdqa (where "a" stands for "aligned"), as one could expect from
the vmovdqa above, because there's no unaligned version. The only
vmovntdq_a_ instruction is vmovntdqa ymm, m256 (memory to register,
"stream load"), because there's an unaligned equivalent in that case.
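Concretely, the only change compared to 2. is the instruction used by
the store macro, which becomes (this is the form the attached patch
carries, keeping the MEMSET_AVX2_STORE name):

	#define MEMSET_AVX2_STORE(loc, reg) \
		asm volatile("vmovntdq %%ymm" #reg ", %0" : "=m" (loc))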
Anyway, perf output:

Samples: 39K of event 'cycles:P', Event count (approx.): 33890710610
  Children      Self  Command     Shared Object     Symbol
-   92.62%     0.88%  passt.avx2  [kernel.vmlinux]  [k] entry_SYSCALL
   - 91.74% entry_SYSCALL_64
      - 91.60% do_syscall_64
         - 75.05% __sys_sendmsg
            - 74.88% ___sys_sendmsg
               - 74.22% ____sys_sendmsg
                  - 73.65% tcp_sendmsg
                     - 61.71% tcp_sendmsg_locked
                        - 24.82% _copy_from_iter
                             24.40% rep_movs_alternative
                        - 14.69% sk_page_frag_refill
                           - 14.57% skb_page_frag_refill
                              - 14.19% alloc_pages_mpol_noprof
                                 - 14.03% __alloc_pages_noprof
                                    - 13.77% get_page_from_freelist
                                       - 12.56% prep_new_page
                                          - 12.19% clear_page
                                               0.68% kernel_fpu_begin_mask
                        - 11.12% tcp_write_xmit
                           - 10.17% __tcp_transmit_skb
                              - 9.62% __ip_queue_xmit
                                 - 8.08% ip_finish_output2
                                    - 7.96% __dev_queue_xmit
                                       - 7.26% __local_bh_enable_ip
                                          - 7.24% do_softirq.part.0
                                             - handle_softirqs
                                                - net_rx_action
                                                   + 5.80% __napi_poll
                                                   + 0.87% napi_consume_skb
                                 - 1.06% ip_local_out
                                    - 1.05% __ip_local_out
                                       - 0.90% nf_hook_slow
                                            0.66% nf_conntrack_in
                        + 4.22% __tcp_push_pending_frames
                        + 2.51% tcp_wmem_schedule
                        + 1.99% tcp_stream_alloc_skb
                          0.59% __check_object_size
                     + 11.21% release_sock
                       0.52% lock_sock_nested
      + 5.32% ksys_write
      + 4.75% __x64_sys_epoll_wait
      + 2.45% __x64_sys_getsockopt
        1.29% syscall_exit_to_user_mode
      + 1.25% ksys_read
      + 0.70% syscall_trace_enter

...finally we cut down significantly on cycles spent to clear pages,
with get_page_from_freelist() taking 13.77% of the cycles instead of
25.61%. That's about half the overhead. This makes _copy_from_iter()
the biggest consumer of cycles under tcp_sendmsg_locked(), which is
what I expected.

Does this mean that we increase the overhead there because we're
increasing the amount of cache misses there, or are we simply more
efficient? I'm not sure yet.
For completeness, annotated version of clear_page(): =E2=94=82 if (system_state >=3D SYSTEM_RUNNING && irq_fpu_usable= ()) { 0.09 =E2=94=82 1b: call irq_fpu_usable 0.11 =E2=94=82 test %al,%al 0.51 =E2=94=82 je d =E2=94=82 kernel_fpu_begin_mask(0); 0.16 =E2=94=82 xor %edi,%edi =E2=94=82 call kernel_fpu_begin_mask =E2=94=82 MEMSET_AVX2_ZERO(0); 0.05 =E2=94=82 vpxor %ymm0,%ymm0,%ymm0 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x00], 0); 0.79 =E2=94=82 vmovntdq %ymm0,(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x01], 0); 2.46 =E2=94=82 vmovntdq %ymm0,0x20(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x02], 0); 0.07 =E2=94=82 vmovntdq %ymm0,0x40(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x03], 0); 1.35 =E2=94=82 vmovntdq %ymm0,0x60(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x04], 0); 0.18 =E2=94=82 vmovntdq %ymm0,0x80(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x05], 0); 1.40 =E2=94=82 vmovntdq %ymm0,0xa0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x06], 0); 0.11 =E2=94=82 vmovntdq %ymm0,0xc0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x07], 0); 0.81 =E2=94=82 vmovntdq %ymm0,0xe0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x08], 0); 0.07 =E2=94=82 vmovntdq %ymm0,0x100(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x09], 0); 1.25 =E2=94=82 vmovntdq %ymm0,0x120(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0a], 0); 0.08 =E2=94=82 vmovntdq %ymm0,0x140(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0b], 0); 1.36 =E2=94=82 vmovntdq %ymm0,0x160(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0c], 0); 0.11 =E2=94=82 vmovntdq %ymm0,0x180(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0d], 0); 1.73 =E2=94=82 vmovntdq %ymm0,0x1a0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0e], 0); 0.09 =E2=94=82 vmovntdq %ymm0,0x1c0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x0f], 0); 0.97 =E2=94=82 vmovntdq %ymm0,0x1e0(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x10], 0); 0.07 =E2=94=82 vmovntdq %ymm0,0x200(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x11], 0); 1.25 =E2=94=82 vmovntdq %ymm0,0x220(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x12], 0); 0.14 =E2=94=82 vmovntdq %ymm0,0x240(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x13], 0); 0.79 =E2=94=82 vmovntdq %ymm0,0x260(%rbx) =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x14], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0x280(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x15], 0); =E2=96=92 1.19 =E2=94=82 vmovntdq %ymm0,0x2a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x16], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x2c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x17], 0); =E2=96=92 1.45 =E2=94=82 vmovntdq %ymm0,0x2e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x18], 0); =E2=96=92 =E2=94=82 vmovntdq %ymm0,0x300(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x19], 0); =E2=96=92 1.45 =E2=94=82 
vmovntdq %ymm0,0x320(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1a], 0); =E2=96=92 0.05 =E2=94=82 vmovntdq %ymm0,0x340(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1b], 0); =E2=96=92 1.49 =E2=94=82 vmovntdq %ymm0,0x360(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1c], 0); =E2=96=92 0.14 =E2=94=82 vmovntdq %ymm0,0x380(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1d], 0); =E2=96=92 1.34 =E2=94=82 vmovntdq %ymm0,0x3a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1e], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0x3c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x1f], 0); =E2=96=92 1.69 =E2=94=82 vmovntdq %ymm0,0x3e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x20], 0); =E2=96=92 0.16 =E2=94=82 vmovntdq %ymm0,0x400(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x21], 0); =E2=96=92 1.15 =E2=94=82 vmovntdq %ymm0,0x420(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x22], 0); =E2=96=92 0.13 =E2=94=82 vmovntdq %ymm0,0x440(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x23], 0); =E2=96=92 1.36 =E2=94=82 vmovntdq %ymm0,0x460(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x24], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x480(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x25], 0); =E2=96=92 1.01 =E2=94=82 vmovntdq %ymm0,0x4a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x26], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0x4c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x27], 0); =E2=97=86 1.53 =E2=94=82 vmovntdq %ymm0,0x4e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x28], 0); =E2=96=92 0.12 =E2=94=82 vmovntdq %ymm0,0x500(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x29], 0); =E2=96=92 1.45 =E2=94=82 vmovntdq %ymm0,0x520(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2a], 0); =E2=96=92 0.13 =E2=94=82 vmovntdq %ymm0,0x540(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2b], 0); =E2=96=92 0.97 =E2=94=82 vmovntdq %ymm0,0x560(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2c], 0); =E2=96=92 0.12 =E2=94=82 vmovntdq %ymm0,0x580(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2d], 0); =E2=96=92 1.21 =E2=94=82 vmovntdq %ymm0,0x5a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2e], 0); =E2=96=92 0.15 =E2=94=82 vmovntdq %ymm0,0x5c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x2f], 0); =E2=96=92 1.42 =E2=94=82 vmovntdq %ymm0,0x5e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x30], 0); =E2=96=92 0.19 =E2=94=82 vmovntdq %ymm0,0x600(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x31], 0); =E2=96=92 1.12 =E2=94=82 vmovntdq %ymm0,0x620(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x32], 0); =E2=96=92 0.04 =E2=94=82 vmovntdq %ymm0,0x640(%rbx) = 
=E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x33], 0); =E2=96=92 1.59 =E2=94=82 vmovntdq %ymm0,0x660(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x34], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x680(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x35], 0); =E2=96=92 1.65 =E2=94=82 vmovntdq %ymm0,0x6a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x36], 0); =E2=96=92 0.14 =E2=94=82 vmovntdq %ymm0,0x6c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x37], 0); =E2=96=92 1.00 =E2=94=82 vmovntdq %ymm0,0x6e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x38], 0); =E2=96=92 0.14 =E2=94=82 vmovntdq %ymm0,0x700(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x39], 0); =E2=96=92 1.31 =E2=94=82 vmovntdq %ymm0,0x720(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3a], 0); =E2=96=92 0.10 =E2=94=82 vmovntdq %ymm0,0x740(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3b], 0); =E2=96=92 1.21 =E2=94=82 vmovntdq %ymm0,0x760(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3c], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x780(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3d], 0); =E2=96=92 1.27 =E2=94=82 vmovntdq %ymm0,0x7a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3e], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0x7c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x3f], 0); =E2=96=92 1.28 =E2=94=82 vmovntdq %ymm0,0x7e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x40], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0x800(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x41], 0); =E2=96=92 1.32 =E2=94=82 vmovntdq %ymm0,0x820(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x42], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0x840(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x43], 0); =E2=96=92 1.43 =E2=94=82 vmovntdq %ymm0,0x860(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x44], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0x880(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x45], 0); =E2=96=92 1.21 =E2=94=82 vmovntdq %ymm0,0x8a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x46], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0x8c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x47], 0); =E2=96=92 1.09 =E2=94=82 vmovntdq %ymm0,0x8e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x48], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x900(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x49], 0); =E2=96=92 1.26 =E2=94=82 vmovntdq %ymm0,0x920(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4a], 0); =E2=96=92 0.16 =E2=94=82 vmovntdq %ymm0,0x940(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4b], 0); =E2=96=92 1.58 =E2=94=82 vmovntdq %ymm0,0x960(%rbx) = =E2=96=92 =E2=94=82 
MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4c], 0); =E2=96=92 0.05 =E2=94=82 vmovntdq %ymm0,0x980(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4d], 0); =E2=96=92 1.54 =E2=94=82 vmovntdq %ymm0,0x9a0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4e], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0x9c0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x4f], 0); =E2=96=92 1.66 =E2=94=82 vmovntdq %ymm0,0x9e0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x50], 0); =E2=96=92 0.16 =E2=94=82 vmovntdq %ymm0,0xa00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x51], 0); =E2=96=92 1.31 =E2=94=82 vmovntdq %ymm0,0xa20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x52], 0); =E2=96=92 0.20 =E2=94=82 vmovntdq %ymm0,0xa40(%rbx) = =E2=97=86 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x53], 0); =E2=96=92 1.44 =E2=94=82 vmovntdq %ymm0,0xa60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x54], 0); =E2=96=92 0.05 =E2=94=82 vmovntdq %ymm0,0xa80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x55], 0); =E2=96=92 1.52 =E2=94=82 vmovntdq %ymm0,0xaa0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x56], 0); =E2=96=92 0.21 =E2=94=82 vmovntdq %ymm0,0xac0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x57], 0); =E2=96=92 1.09 =E2=94=82 vmovntdq %ymm0,0xae0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x58], 0); =E2=96=92 0.22 =E2=94=82 vmovntdq %ymm0,0xb00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x59], 0); =E2=96=92 1.58 =E2=94=82 vmovntdq %ymm0,0xb20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5a], 0); =E2=96=92 0.12 =E2=94=82 vmovntdq %ymm0,0xb40(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5b], 0); =E2=96=92 1.46 =E2=94=82 vmovntdq %ymm0,0xb60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5c], 0); =E2=96=92 0.04 =E2=94=82 vmovntdq %ymm0,0xb80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5d], 0); =E2=96=92 1.62 =E2=94=82 vmovntdq %ymm0,0xba0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5e], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0xbc0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x5f], 0); =E2=96=92 1.71 =E2=94=82 vmovntdq %ymm0,0xbe0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x60], 0); =E2=96=92 0.19 =E2=94=82 vmovntdq %ymm0,0xc00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x61], 0); =E2=96=92 1.89 =E2=94=82 vmovntdq %ymm0,0xc20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x62], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0xc40(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x63], 0); =E2=96=92 1.98 =E2=94=82 vmovntdq %ymm0,0xc60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x64], 0); =E2=96=92 0.16 =E2=94=82 vmovntdq %ymm0,0xc80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned 
char *)page)[YMM_BYTES * = 0x65], 0); =E2=96=92 1.58 =E2=94=82 vmovntdq %ymm0,0xca0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x66], 0); =E2=96=92 0.13 =E2=94=82 vmovntdq %ymm0,0xcc0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x67], 0); =E2=96=92 1.16 =E2=94=82 vmovntdq %ymm0,0xce0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x68], 0); =E2=96=92 0.09 =E2=94=82 vmovntdq %ymm0,0xd00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x69], 0); =E2=96=92 1.67 =E2=94=82 vmovntdq %ymm0,0xd20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6a], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0xd40(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6b], 0); =E2=96=92 1.82 =E2=94=82 vmovntdq %ymm0,0xd60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6c], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0xd80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6d], 0); =E2=96=92 1.57 =E2=94=82 vmovntdq %ymm0,0xda0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6e], 0); =E2=96=92 0.02 =E2=94=82 vmovntdq %ymm0,0xdc0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x6f], 0); =E2=96=92 1.27 =E2=94=82 vmovntdq %ymm0,0xde0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x70], 0); =E2=96=92 =E2=94=82 vmovntdq %ymm0,0xe00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x71], 0); =E2=96=92 1.48 =E2=94=82 vmovntdq %ymm0,0xe20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x72], 0); =E2=96=92 0.11 =E2=94=82 vmovntdq %ymm0,0xe40(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x73], 0); =E2=96=92 1.87 =E2=94=82 vmovntdq %ymm0,0xe60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x74], 0); =E2=96=92 0.16 =E2=94=82 vmovntdq %ymm0,0xe80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x75], 0); =E2=96=92 1.45 =E2=94=82 vmovntdq %ymm0,0xea0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x76], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0xec0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x77], 0); =E2=96=92 1.65 =E2=94=82 vmovntdq %ymm0,0xee0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x78], 0); =E2=96=92 0.10 =E2=94=82 vmovntdq %ymm0,0xf00(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x79], 0); =E2=96=92 1.53 =E2=94=82 vmovntdq %ymm0,0xf20(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x7a], 0); =E2=96=92 0.07 =E2=94=82 vmovntdq %ymm0,0xf40(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x7b], 0); =E2=96=92 1.51 =E2=94=82 vmovntdq %ymm0,0xf60(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x7c], 0); =E2=96=92 0.12 =E2=94=82 vmovntdq %ymm0,0xf80(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x7d], 0); =E2=96=92 1.62 =E2=94=82 vmovntdq %ymm0,0xfa0(%rbx) = =E2=96=92 =E2=94=82 MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * = 0x7e], 
0);
  0.08 │      vmovntdq %ymm0,0xfc0(%rbx)
       │    MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7f], 0);
  1.62 │      vmovntdq %ymm0,0xfe0(%rbx)
       │    }
  0.13 │      pop      %rbx
       │    kernel_fpu_end();
       │      jmp      ffffffff8104b050

...no stalls on any particular store.

Current patch attached if you want to fry^W test on your laptop.

-- 
Stefano

--MP_/NMSZLojx+6.eoRW/YaLijs4
Content-Type: text/x-patch
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=clear_page_avx2_stream.patch

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index f3d257c45225..d753740cb06c 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -9,6 +9,7 @@
 #include
 #include
+#include

 /* duplicated to the one in bootmem.h */
 extern unsigned long max_pfn;
@@ -44,6 +45,17 @@ void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);

+#define MEMSET_AVX2_ZERO(reg) \
+	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
+#define MEMSET_AVX2_STORE(loc, reg) \
+	asm volatile("vmovntdq %%ymm" #reg ", %0" : "=m" (loc))
+
+#define YMM_BYTES (256 / 8)
+#define BYTES_TO_YMM(x) ((x) / YMM_BYTES)
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
+extern void kernel_fpu_end(void);
+extern bool irq_fpu_usable(void);
+
 static inline void clear_page(void *page)
 {
 	/*
@@ -51,6 +63,182 @@ static inline void clear_page(void *page)
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
 	kmsan_unpoison_memory(page, PAGE_SIZE);
+
+	if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
+		kernel_fpu_begin_mask(0);
+		MEMSET_AVX2_ZERO(0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x00], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x01], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x02], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x03], 0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x04], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x05], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x06], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x07], 0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x08], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x09], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0a], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0b], 0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0c], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0d], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0e], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0f], 0);
+
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x10], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x11], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x12], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x13], 0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x14], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x15], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x16], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x17], 0);
+
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x18], 0);
+		MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x19], 0);
+
MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x20], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x21], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x22], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x23], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x24], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x25], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x26], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x27], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x28], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x29], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x30], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x31], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x32], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x33], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x34], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x35], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x36], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x37], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x38], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x39], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x40], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x41], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x42], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x43], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x44], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x45], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x46], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x47], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x48], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x49], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4d], 0); + MEMSET_AVX2_STORE(((unsigned char 
*)page)[YMM_BYTES * 0x4e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x50], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x51], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x52], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x53], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x54], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x55], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x56], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x57], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x58], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x59], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x60], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x61], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x62], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x63], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x64], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x65], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x66], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x67], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x68], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x69], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6f], 0); + + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x70], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x71], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x72], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x73], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x74], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x75], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x76], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x77], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x78], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x79], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7a], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7b], 0); + + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7c], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7d], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7e], 0); + MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7f], 0); + + kernel_fpu_end(); + return; + } + alternative_call_2(clear_page_orig, clear_page_rep, X86_FEATURE_REP_GOOD, clear_page_erms, X86_FEATURE_ERMS, --MP_/NMSZLojx+6.eoRW/YaLijs4--