public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
Date: Fri, 3 Jan 2025 17:06:49 +1100	[thread overview]
Message-ID: <Z3d-ecXZefFcUSHe@zatzit> (raw)
In-Reply-To: <20250102230014.3dcc957d@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 6276 bytes --]

On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 14:46:45 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:34 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > > > possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> > > > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > > > the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> > > > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > > > we're dealing with virtual interfaces we don't have the physical Ethernet
> > > > limitations requiring that length.  Indeed it is possible to generate a
> > > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > > > the 'pasta' backend is only 42 bytes long).
> > > > 
> > > > It's also unclear if this limit is sufficient for vhost-user which isn't
> > > > limited by the size of pkt_buf as the other modes are.
> > > > 
> > > > We could attempt to correct the calculation, but that would leave us with
> > > > even larger arrays, which in practice rarely accumulate more than a handful
> > > > of packets.  So, instead, put an arbitrary cap on the number of packets we
> > > > can put in a batch, and if we run out of space, process and flush the
> > > > batch.  
> > > 
> > > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > > general I couldn't really see a difference in latency (especially for
> > > UDP streams with small packets) or throughput. Figures from short
> > > throughput tests (such as the ones from the test suite) look a bit more
> > > variable, but I don't have any statistically meaningful data.
> > > 
> > > Then I looked into how many messages we might have in the array without
> > > this change, and I realised that, with the throughput tests from the
> > > suite, we very easily exceed the 256 limit.  
> > 
> > Ah, interesting.
> > 
> > > Perhaps surprisingly we get the highest buffer counts with TCP transfers
> > > and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> > > more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> > > in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> > > the extra system call overhead from forcibly flushing batches might
> > > become rather relevant.  
> > 
> > Really?  I thought syscall overhead (as in the part that's
> > per-syscall, rather than per-work) was generally in the tens of µs
> > range, rather than the ms range.
> 
> Tens or hundreds of µs (mind that it's several of them), so there
> could be just one order of magnitude between the two.

Hm, ok.

> > But in any case, I'd be fine with upping the size of the array to 4k
> > or 8k based on that empirical data.  That's still much smaller than
> > the >150k we have now.
> 
> I would even go with 32k -- there are embedded systems with a ton of
> memory but still much slower clocks compared to my typical setup. Go
> figure. Again, I think we should test and profile this, ideally, but if
> not, then let's pick something that's ~10x of what I see.
> 
> > > With lower MTUs, it looks like we have a lower CPU load and
> > > transmissions are scheduled differently (resulting in smaller batches),
> > > but I didn't really trace things.  
> > 
> > Ok.  I wonder if with the small MTUs we're hitting throughput
> > bottlenecks elsewhere which mean this particular path isn't
> > over-exercised.
> > 
> > > So I start thinking that this has the *potential* to introduce a
> > > performance regression in some cases and we shouldn't just assume that
> > > some arbitrary 256 limit is good enough. I didn't check with perf(1),
> > > though.
> > > 
> > > Right now that array takes, effectively, less than 100 KiB (it's ~5000
> > > copies of struct iovec, 16 bytes each), and in theory that could be
> > > ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> > > assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> > > have no practical effect anyway.  
> > 
> > Yeah, I guess.  I don't love the fact that currently for correctness
> > (not spuriously dropping packets) we rely on a fairly complex
> > calculation that's based on information from different layers: the
> > buffer size and enforcement is in the packet pool layer and is
> > independent of packet layout, but the minimum frame size comes from
> > the tap layer and depends quite specifically on which L2 encapsulation
> > we're using.
> 
> Well but it's exactly one line, and we're talking about the same
> project and tool, not something that's separated by several API layers.
> 
> By the way: on one hand you have that. On the other hand, you're adding
> an arbitrary limit that comes from a test I just did, which is also
> based on information from different layers.

The difference is that it used to be a hard limit: if we exceeded it
we drop packets.  Now, we just do a little more work.
Actually... that makes me realise the actual size of the batches isn't
the relevant factor in choosing the size.  The amortized cost of
processing a packet will be pretty much:

	  (per byte costs) * (packet size)
	+ (per packet costs)
	+ (per batch costs) / (batch size)

So, we just need to allow (batch size) to become large enough to make
the last term negligible compared to the other two.  I find it hard to
believe that we'd need more than ~1000 to do that.

> > > All in all, I think we shouldn't change this limit without a deeper
> > > understanding of the practical impact. While this change doesn't bring
> > > any practical advantage, the current behaviour is somewhat tested by
> > > now, and a small limit isn't.  
> 
> ...and this still applies, I think.
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-01-03  6:07 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-20  8:35 [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
2024-12-20  8:35 ` [PATCH v2 01/12] test focus David Gibson
2024-12-20  8:35 ` [PATCH v2 02/12] hack: stop on fail, but not perf fail David Gibson
2024-12-20  8:35 ` [PATCH v2 03/12] make passt dumpable David Gibson
2024-12-20  8:35 ` [PATCH v2 04/12] packet: Use flexible array member in struct pool David Gibson
2024-12-20  8:35 ` [PATCH v2 05/12] packet: Don't pass start and offset separately too packet_check_range() David Gibson
2024-12-20  8:35 ` [PATCH v2 06/12] packet: Don't hard code maximum packet size to UINT16_MAX David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  1:00     ` David Gibson
2025-01-02 21:59       ` Stefano Brivio
2025-01-03  1:16         ` David Gibson
2025-01-05 23:43           ` Stefano Brivio
2024-12-20  8:35 ` [PATCH v2 07/12] packet: Remove unhelpful packet_get_try() macro David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  2:15     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  4:48         ` David Gibson
2025-01-06 10:55           ` Stefano Brivio
2024-12-20  8:35 ` [PATCH v2 08/12] util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers David Gibson
2024-12-20  8:35 ` [PATCH v2 09/12] packet: Distinguish severities of different packet_{add,git}_do() errors David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  2:58     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  5:06         ` David Gibson
2025-01-06 10:55           ` Stefano Brivio
2024-12-20  8:35 ` [PATCH v2 10/12] packet: Move packet length checks into packet_check_range() David Gibson
2024-12-20  8:35 ` [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets David Gibson
2025-01-01 21:54   ` Stefano Brivio
2025-01-02  3:46     ` David Gibson
2025-01-02 22:00       ` Stefano Brivio
2025-01-03  6:06         ` David Gibson [this message]
2024-12-20  8:35 ` [PATCH v2 12/12] packet: More cautious checks to avoid pointer arithmetic UB David Gibson
2024-12-20  9:00 ` [PATCH v2 00/12] Cleanups to packet pool handling and sizing David Gibson
2024-12-20 10:06   ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z3d-ecXZefFcUSHe@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).