On Thu, Mar 14, 2024 at 05:26:17PM +0100, Stefano Brivio wrote:
> On Thu, 14 Mar 2024 16:54:02 +0100
> Laurent Vivier wrote:
> 
> > On 3/14/24 16:47, Stefano Brivio wrote:
> > > On Thu, 14 Mar 2024 15:07:48 +0100
> > > Laurent Vivier wrote:
> > > 
> > >> On 3/13/24 12:37, Stefano Brivio wrote:
> > >> ...
> > >>>> @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > >>>>  	return i;
> > >>>>  }
> > >>>>  
> > >>>> +/**
> > >>>> + * tap_send_iov_passt() - Send out multiple prepared frames
> > >>> 
> > >>> ...I would argue that this function prepares frames as well. Maybe:
> > >>> 
> > >>>  * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
> > >>> 
> > >>>> + * @c:		Execution context
> > >>>> + * @iov:	Array of frames, each frames is divided in an array of iovecs.
> > >>>> + *		The first entry of the iovec is updated to point to an
> > >>>> + *		uint32_t storing the frame length.
> > >>> 
> > >>>  * @iov:	Array of frames, each one a vector of parts, TCP_IOV_VNET blank
> > >>> 
> > >>>> + * @n:		Number of frames in @iov
> > >>>> + *
> > >>>> + * Return: number of frames actually sent
> > >>>> + */
> > >>>> +static size_t tap_send_iov_passt(const struct ctx *c,
> > >>>> +				 struct iovec iov[][TCP_IOV_NUM],
> > >>>> +				 size_t n)
> > >>>> +{
> > >>>> +	unsigned int i;
> > >>>> +
> > >>>> +	for (i = 0; i < n; i++) {
> > >>>> +		uint32_t vnet_len;
> > >>>> +		int j;
> > >>>> +
> > >>>> +		vnet_len = 0;
> > >>> 
> > >>> This could be initialised in the declaration (yes, it's "reset" at
> > >>> every loop iteration).
> > >>> 
> > >>>> +		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>>> +			vnet_len += iov[i][j].iov_len;
> > >>>> +
> > >>>> +		vnet_len = htonl(vnet_len);
> > >>>> +		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
> > >>>> +		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
> > >>>> +
> > >>>> +		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
> > >>> 
> > >>> ...which would now send a single frame at a time, but actually it can
> > >>> already send everything in one shot because it's using sendmsg(), if you
> > >>> move it outside of the loop and do something like (untested):
> > >>> 
> > >>> 	return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
> > >>> 
> > >>>> +			break;
> > >>>> +	}
> > >>>> +
> > >>>> +	return i;
> > >>>> +
> > >>>> +}
> > >>>> +
> > >> 
> > >> I tried to do something like that but I have a performance drop:
> > >> 
> > >> static size_t tap_send_iov_passt(const struct ctx *c,
> > >> 				 struct iovec iov[][TCP_IOV_NUM],
> > >> 				 size_t n)
> > >> {
> > >> 	unsigned int i;
> > >> 	uint32_t vnet_len[n];
> > >> 
> > >> 	for (i = 0; i < n; i++) {
> > >> 		int j;
> > >> 
> > >> 		vnet_len[i] = 0;
> > >> 		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >> 			vnet_len[i] += iov[i][j].iov_len;
> > >> 
> > >> 		vnet_len[i] = htonl(vnet_len[i]);
> > >> 		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> > >> 		iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
> > >> 	}
> > >> 
> > >> 	return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
> > >> }
> > >> 
> > >> iperf3 -c localhost -p 10001 -t 60 -4
> > >> 
> > >> before:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  33.0 GBytes  4.72 Gbits/sec    1            sender
> > >> [  5]   0.00-60.06  sec  33.0 GBytes  4.72 Gbits/sec                 receiver
> > >> 
> > >> after:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  18.2 GBytes  2.60 Gbits/sec    0            sender
> > >> [  5]   0.00-60.07  sec  18.2 GBytes  2.60 Gbits/sec                 receiver
> > > 
> > > Weird, it looks like doing one sendmsg() per frame results in a higher
> > > throughput than one sendmsg() per multiple frames, which sounds rather
> > > absurd. Perhaps we should start looking into what perf(1) reports, in
> > > terms of both syscall overhead and cache misses.
> > > 
> > > I'll have a look later today or tomorrow -- unless you have other
> > > ideas as to why this might happen...
> > 
> > Perhaps in the first case we only update one vnet_len, and in the second
> > case we have to update an array of vnet_len, so we use more cache lines?

We should be able to test this relatively easily, yes?  By updating all
the vnet_len values, then using a single sendmsg().

> Yes, I'm wondering if for example this:
> 
> 	iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> 
> causes a prefetch of everything pointed by iov[i][...], so we would
> prefetch (and throw away) each buffer, one by one.
> 
> Another interesting experiment to verify if this is the case could be
> to "flush" a few frames at a time (say, 4), with something like this on
> top of your original change (completely untested):
> 
> [...]
> 
> 		if (!((i + 1) % 4) &&
> 		    !tap_send_frames_passt(c, iov[i / 4], TCP_IOV_NUM * 4))
> 			break;
> 	}
> 
> 	if ((i + 1) % 4) {
> 		tap_send_frames_passt(c, iov[i / 4],
> 				      TCP_IOV_NUM * ((i + 1) % 4));
> 	}
> 
> Or maybe we could set vnet_len right after we receive data in the
> buffers.

I really hope we can avoid this.  If we want to allow IPv4<->IPv6
translation, then we can't know the vnet_len until after we've done our
routing / translation.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
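
For concreteness, the batched flush Stefano describes might be filled
in as below. This is an untested sketch, not code from the thread: the
batch size of 4, the iov[i - b] indexing, and treating a nonzero
tap_send_frames_passt() return as the whole batch having been sent
(partial sends ignored, as in the snippets above) are all assumptions;
struct ctx, the TCP_IOV_* indices and tap_send_frames_passt() are the
passt definitions quoted above.

#define BATCH	4

static size_t tap_send_iov_batched(const struct ctx *c,
                                   struct iovec iov[][TCP_IOV_NUM],
                                   size_t n)
{
        uint32_t vnet_len[BATCH];       /* headers for frames in flight */
        size_t done = 0;
        unsigned int i;

        for (i = 0; i < n; i++) {
                unsigned int b = i % BATCH; /* slot in the current batch */
                int j;

                vnet_len[b] = 0;
                for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
                        vnet_len[b] += iov[i][j].iov_len;

                vnet_len[b] = htonl(vnet_len[b]);
                iov[i][TCP_IOV_VNET].iov_base = &vnet_len[b];
                iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len[b]);

                /* Batch full: flush frames i - b .. i in one sendmsg() */
                if (b == BATCH - 1) {
                        if (!tap_send_frames_passt(c, iov[i - b],
                                                   TCP_IOV_NUM * BATCH))
                                return done;
                        done += BATCH;
                }
        }

        /* Flush the final, possibly partial, batch */
        if (n % BATCH &&
            tap_send_frames_passt(c, iov[n - n % BATCH],
                                  TCP_IOV_NUM * (n % BATCH)))
                done += n % BATCH;

        return done;
}

With this shape, all the in-flight headers live in one small vnet_len[]
scratch array (16 bytes for a batch of 4), so they should stay within a
single cache line while the sendmsg() cost is amortised over BATCH
frames; sweeping BATCH and comparing perf(1) counters against the two
variants above would show whether the prefetch theory holds.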