public inbox for passt-dev@passt.top
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>, passt-dev@passt.top
Subject: Re: [RFC] tcp: Replace TCP buffer structure by an iovec array
Date: Fri, 15 Mar 2024 10:46:55 +1000	[thread overview]
Message-ID: <ZfOaf1cgy_UP9-nR@zatzit> (raw)
In-Reply-To: <20240314172617.22c28caa@elisabeth>

On Thu, Mar 14, 2024 at 05:26:17PM +0100, Stefano Brivio wrote:
> On Thu, 14 Mar 2024 16:54:02 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > On 3/14/24 16:47, Stefano Brivio wrote:
> > > On Thu, 14 Mar 2024 15:07:48 +0100
> > > Laurent Vivier <lvivier@redhat.com> wrote:
> > >   
> > >> On 3/13/24 12:37, Stefano Brivio wrote:
> > >> ...  
> > >>>> @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > >>>>    	return i;
> > >>>>    }
> > >>>>    
> > >>>> +/**
> > >>>> + * tap_send_iov_passt() - Send out multiple prepared frames  
> > >>>
> > >>> ...I would argue that this function prepares frames as well. Maybe:
> > >>>
> > >>>    * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
> > >>>      
> > >>>> + * @c:		Execution context
> > >>>> + * @iov:	Array of frames, each frames is divided in an array of iovecs.
> > >>>> + *              The first entry of the iovec is updated to point to an
> > >>>> + *              uint32_t storing the frame length.  
> > >>>
> > >>>    * @iov:	Array of frames, each one a vector of parts, TCP_IOV_VNET blank
> > >>>      
> > >>>> + * @n:		Number of frames in @iov
> > >>>> + *
> > >>>> + * Return: number of frames actually sent
> > >>>> + */
> > >>>> +static size_t tap_send_iov_passt(const struct ctx *c,
> > >>>> +				 struct iovec iov[][TCP_IOV_NUM],
> > >>>> +				 size_t n)
> > >>>> +{
> > >>>> +	unsigned int i;
> > >>>> +
> > >>>> +	for (i = 0; i < n; i++) {
> > >>>> +		uint32_t vnet_len;
> > >>>> +		int j;
> > >>>> +
> > >>>> +		vnet_len = 0;  
> > >>>
> > >>> This could be initialised in the declaration (yes, it's "reset" at
> > >>> every loop iteration).
> > >>>      
> > >>>> +		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>>> +			vnet_len += iov[i][j].iov_len;
> > >>>> +
> > >>>> +		vnet_len = htonl(vnet_len);
> > >>>> +		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
> > >>>> +		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
> > >>>> +
> > >>>> +		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))  
> > >>>
> > >>> ...which would now send a single frame at a time, but actually it can
> > >>> already send everything in one shot because it's using sendmsg(), if you
> > >>> move it outside of the loop and do something like (untested):
> > >>>
> > >>> 	return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
> > >>>      
> > >>>> +			break;
> > >>>> +	}
> > >>>> +
> > >>>> +	return i;
> > >>>> +
> > >>>> +}
> > >>>> +  
> > >>
> > >> I tried to do something like that, but I get a performance drop:
> > >>
> > >> static size_t tap_send_iov_passt(const struct ctx *c,
> > >>                                    struct iovec iov[][TCP_IOV_NUM],
> > >>                                    size_t n)
> > >> {
> > >>           unsigned int i;
> > >>           uint32_t vnet_len[n];
> > >>
> > >>           for (i = 0; i < n; i++) {
> > >>                   int j;
> > >>
> > >>                   vnet_len[i] = 0;
> > >>                   for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>                           vnet_len[i] += iov[i][j].iov_len;
> > >>
> > >>                   vnet_len[i] = htonl(vnet_len[i]);
> > >>                   iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> > >>                   iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
> > >>           }
> > >>
> > >>           return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
> > >> }
> > >>
> > >> iperf3 -c localhost -p 10001  -t 60  -4
> > >>
> > >> before:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  33.0 GBytes  4.72 Gbits/sec    1             sender
> > >> [  5]   0.00-60.06  sec  33.0 GBytes  4.72 Gbits/sec                  receiver
> > >>
> > >> after:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  18.2 GBytes  2.60 Gbits/sec    0             sender
> > >> [  5]   0.00-60.07  sec  18.2 GBytes  2.60 Gbits/sec                  receiver  
> > > 
> > > Weird, it looks like doing one sendmsg() per frame results in a higher
> > > throughput than one sendmsg() per multiple frames, which sounds rather
> > > absurd. Perhaps we should start looking into what perf(1) reports, in
> > > terms of both syscall overhead and cache misses.
> > > 
> > > I'll have a look later today or tomorrow -- unless you have other
> > > ideas as to why this might happen...
> > 
> > Perhaps in the first case we only update one vnet_len, while in the second case
> > we have to update an array of vnet_len, so more cache lines are used?

We should be able to test this relatively easily, yes?  By updating
all the vnet_len then using a single sendmsg().
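
In case it helps, here is a completely untested sketch of the "crossed"
variant: it reuses the names from the snippets above (tap_send_frames_passt()
and the TCP_IOV_* constants are assumed to behave as they do there), keeps
the vnet_len values in an array as in the batched version, but still issues
one send per frame as in the faster version, so the cost of the extra array
can be told apart from the cost of batching the sendmsg():

static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	uint32_t vnet_len[n];	/* per-frame lengths kept in an array, as in
				 * the batched version
				 */
	size_t i;

	for (i = 0; i < n; i++) {
		int j;

		vnet_len[i] = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len[i] += iov[i][j].iov_len;

		vnet_len[i] = htonl(vnet_len[i]);
		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len[i]);

		/* ...but still one send per frame, as in the faster version */
		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
			break;
	}

	return i;
}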


> Yes, I'm wondering if for example this:
> 
> 		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> 
> causes a prefetch of everything pointed to by iov[i][...], so we would
> prefetch (and throw away) each buffer, one by one.
> 
> Another interesting experiment to verify if this is the case could be
> to "flush" a few frames at a time (say, 4), with something like this on
> top of your original change (completely untested):
> 
> 		[...]
> 
> 		if (!((i + 1) % 4) &&
> 		    !tap_send_frames_passt(c, iov[i / 4], TCP_IOV_NUM * 4))
> 			break;
> 	}
> 
> 	if ((i + 1) % 4) {
> 		tap_send_frames_passt(c, iov[i / 4],
> 				      TCP_IOV_NUM * ((i + 1) % 4));
> 	}
> 
> Or maybe we could set vnet_len right after we receive data in the
> buffers.

I really hope we can avoid this.  If we want to allow IPv4<->IPv6
translation, then we can't know the vnet_len until after we've done
our routing / translation.
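
As for the "flush a few frames at a time" idea, for reference, here is a
self-contained (and equally untested) sketch of how it could look.  It
assumes the per-frame vnet_len values live in an array, as in the batched
version above, so that every frame of a batch still points at valid data
when the batch goes out, and it assumes tap_send_frames_passt() reports
progress in iovec entries, as in the earlier snippet:

static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	uint32_t vnet_len[n];
	size_t i, sent = 0;

	for (i = 0; i < n; i++) {
		int j;

		vnet_len[i] = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len[i] += iov[i][j].iov_len;

		vnet_len[i] = htonl(vnet_len[i]);
		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len[i]);

		if ((i + 1) % 4)	/* batch of 4 not complete yet */
			continue;

		/* Flush the batch of 4 frames starting at iov[sent] */
		sent += tap_send_frames_passt(c, iov[sent],
					      TCP_IOV_NUM * 4) / TCP_IOV_NUM;
		if (sent != i + 1)	/* short write: give up */
			return sent;
	}

	/* Flush the last, partial batch, if any */
	if (sent < n)
		sent += tap_send_frames_passt(c, iov[sent],
					      TCP_IOV_NUM * (n - sent)) /
			TCP_IOV_NUM;

	return sent;
}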

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 13+ messages
2024-03-11 13:33 [RFC] tcp: Replace TCP buffer structure by an iovec array Laurent Vivier
2024-03-12 22:56 ` Stefano Brivio
2024-03-13 11:37 ` Stefano Brivio
2024-03-13 14:42   ` Laurent Vivier
2024-03-13 15:27     ` Stefano Brivio
2024-03-13 15:20   ` Laurent Vivier
2024-03-13 16:58     ` Stefano Brivio
2024-03-14 14:07   ` Laurent Vivier
2024-03-14 15:47     ` Stefano Brivio
2024-03-14 15:54       ` Laurent Vivier
2024-03-14 16:26         ` Stefano Brivio
2024-03-15  0:46           ` David Gibson [this message]
2024-03-14  4:22 ` David Gibson
