On Thu, 8 Aug 2024 11:28:50 +1000
David Gibson wrote:

> On Wed, Aug 07, 2024 at 03:06:44PM +0200, Stefano Brivio wrote:
> > On Wed, 7 Aug 2024 20:51:08 +1000
> > David Gibson wrote:
> > 
> > > On Wed, Aug 07, 2024 at 12:11:26AM +0200, Stefano Brivio wrote:
> > > > On Mon, 5 Aug 2024 22:36:45 +1000
> > > > David Gibson wrote:
> > > > 
> > > > > Add a new test script to run the equivalent of the tests in build/all
> > > > > using exeter and Avocado. This new version of the tests is more robust
> > > > > than the original, since it makes a temporary copy of the source tree so
> > > > > will not be affected by concurrent manual builds.
> > > > 
> > > > I think this is much more readable than the previous Python attempt.
> > > 
> > > That's encouraging.
> > > 
> > > > On the other hand, I guess it's not an ideal candidate for a fair
> > > > comparison because this is exactly the kind of stuff where shell
> > > > scripting shines: it's a simple test that needs a few basic shell
> > > > commands.
> > > 
> > > Right.
> > > 
> > > > On that subject, the shell test is about half the lines of code (just
> > > > skipping headers, it's 48 lines instead of 90... and yes, this version
> > > 
> > > Even ignoring the fact that this case is particularly suited to shell,
> > > I don't think that's really an accurate comparison, but getting to one
> > > is pretty hard.
> > > 
> > > The existing test isn't 48 lines of shell, but of "passt test DSL".
> > > There are several hundred additional lines of shell to interpret that.
> > 
> > Yeah, but the 48 lines is all I have to look at, which is what matters
> > I would argue. That's exactly why I wrote that interpreter.
> > 
> > Here, it's 90 lines of *test file*.
> 
> Fair point. Fwiw, it's down to 77 so far for my next draft.
> 
> > > Now obviously we don't need all of that for just this test. Likewise
> > > the new Python test needs at least exeter - that's only a couple of
> > > hundred lines - but also Avocado (huge, but only a small amount is
> > > really relevant here).
> > > 
> > > > now uses a copy of the source code, but that would be two lines).
> > > 
> > > I feel like it would be a bit more than two lines, to copy exactly
> > > what you want, and to clean up after yourself.
> > 
> > host mkdir __STATEDIR__/sources
> > host cp --parents $(git ls-files) __STATEDIR__/sources
> > 
> > ...which is actually an improvement on the original as __STATEDIR__ can
> > be handled in a centralised way, if one wants to keep that after the
> > single test case, after the whole test run, or not at all.
> 
> Huh, I didn't know about cp --parents, which does exactly what's
> needed. In the Python library there are, alas, several things that do
> almost but not quite what's needed. I guess I could just invoke 'cp
> --parents' myself.
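
For what it's worth, I don't think invoking it from Python needs to be
much more than this rough, untested sketch (the function name and the
temporary directory handling are made up, just to show the idea):

    import subprocess
    import tempfile

    def copy_sources(srcdir):
        # Fresh directory playing the role of __STATEDIR__/sources here
        dstdir = tempfile.mkdtemp(prefix='passt-sources-')

        # Copy only what git tracks, preserving relative paths,
        # same as: cp --parents $(git ls-files) dstdir
        files = subprocess.run(['git', 'ls-files'], cwd=srcdir,
                               capture_output=True, text=True,
                               check=True).stdout.splitlines()
        subprocess.run(['cp', '--parents', *files, dstdir],
                       cwd=srcdir, check=True)

        return dstdir

...and then it's up to whatever owns that directory to decide whether
to keep it after the single test case, after the whole run, or not at
all, just like with __STATEDIR__ above.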

> > > > In terms of time overhead, dropping delays to make the display capture
> > > > nice (a feature that we would anyway lose with exeter plus Avocado, if
> > > > I understood correctly):
> > > 
> > > Yes. Unlike you, I'm really not convinced of the value of the display
> > > capture versus log files, at least in the majority of cases.
> > 
> > Well, but I use that...
> > 
> > By the way, openQA nowadays takes periodic screenshots. That's certainly
> > not as useful, but I'm indeed not the only one who benefits from
> > _seeing_ tests as they run instead of correlating log files from
> > different contexts, especially when you have a client, a server, and
> > what you're testing in between.
> 
> If you have to correlate multiple logs that's a pain, yes. My
> approach here is, as much as possible, to have a single "log"
> (actually stdout & stderr) from the top level test logic, so the
> logical ordering is kind of built in.

That's not necessarily helpful: if I have a client and a server, things
are much clearer to me if I have two different logs, side-by-side. Even
more so if you have a guest, a host, and a namespace "in between".

I see the difference as I'm often digging through Podman CI's logs,
where there's a single log (including stdout and stderr), because bats
doesn't offer a context functionality like we have right now. It's
sometimes really not easy to understand what's going on in Podman's
tests without copying and pasting into an editor and manually marking
things.
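
Just to make "context" a bit more concrete: the point is that each
command, and its output, is attributed to where it runs (host, guest,
namespace, ...). Even with a single stream, something as simple as
prefixing lines by context would already help a lot with logs like
those. A made-up sketch, not what the current scripts actually do:

    import subprocess

    def run_in_context(name, cmd, log):
        # Run cmd and copy its output to 'log', prefixing every line
        # with the context name, so interleaved client/server/passt
        # output can still be attributed afterwards.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            log.write(f'[{name}] {line}')
        return proc.wait()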

> > > I certainly don't think it's worth slowing down the test running in the
> > > normal case.
> > 
> > It doesn't significantly slow things down,
> 
> It does if you explicitly add delays to make the display capture nice
> as mentioned above.

Okay, I didn't realise the amount of eye-candy I left in even when
${FAST} is set (which probably only makes sense when run as './ci').
With the patch attached I get:

$ time ./run
[...]
real 17m17.686s
user 0m0.010s
sys 0m0.014s

I also cut the duration of throughput and latency tests down to one
second. After we fixed a lot of issues in passt, and some in QEMU and
the kernel, results are now surprisingly consistent.

Still, a significant part of it is Podman's tests (which I'm working on
speeding up, for the sake of Podman's own CI), and performance tests
anyway. Without those:

$ time ./run
[...]
real 5m57.612s
user 0m0.011s
sys 0m0.009s

> > but it certainly makes it
> > more complicated to run test cases in parallel... which you can't do
> > anyway for throughput and latency tests (which take 22 out of the 37
> > minutes of a current CI run), unless you set up VMs with CPU pinning and
> > cgroups, or a server farm.
> 
> So, yes, the perf tests take the majority of the runtime for CI, but
> I'm less concerned about runtime for CI tests. I'm more interested in
> runtime for a subset of functional tests you can run repeatedly while
> developing. I routinely disable the perf and other slow tests, to get
> a subset taking 5-7 minutes. That's ok, but I'm pretty confident I
> can get better coverage in significantly less time using parallel
> tests.

Probably, yes, but still I would like to point out that the difference
between five and ten minutes is not as relevant in terms of workflow as
the difference between one and five minutes.

> > I mean, I see the value of running things in parallel in a general
> > case, but I don't think you should just ignore everything else.
> > 
> > > > $ time (make clean; make passt; make clean; make pasta; make clean; make qrap; make clean; make; d=$(mktemp -d); prefix=$d make install; prefix=$d make uninstall; )
> > > > [...]
> > > > real 0m17.449s
> > > > user 0m15.616s
> > > > sys 0m2.136s
> > > 
> > > On my system:
> > > [...]
> > > real 0m20.325s
> > > user 0m15.595s
> > > sys 0m5.287s
> > > 
> > > > compared to:
> > > > 
> > > > $ time ./run
> > > > [...]
> > > > real 0m18.217s
> > > > user 0m0.010s
> > > > sys 0m0.001s
> > > > 
> > > > ...which I would call essentially no overhead. I didn't try out this
> > > > version yet, I suspect it would be somewhere in between.
> > > 
> > > Well..
> > > 
> > > $ time PYTHONPATH=test/exeter/py3 test/venv/bin/avocado run test/build/build.json
> > > [...]
> > > RESULTS : PASS 5 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
> > > JOB TIME : 10.85 s
> > > 
> > > real 0m11.000s
> > > user 0m23.439s
> > > sys 0m7.315s
> > > 
> > > Because parallel. It looks like the avocado start up time is
> > > reasonably substantial too, so that should look better with a larger
> > > set of tests.
> > 
> > With the current set of tests, I doubt it's ever going to pay off. Even
> > if you run the non-perf tests in 10% of the time, it's going to be 24
> > minutes instead of 37.
> 
> Including the perf tests, probably not. Excluding them (which is
> extremely useful when actively coding) I think it will.
> 
> > I guess it will start making sense with larger matrices of network
> > environments, or with more test cases (but really a lot of them).
> 
> We could certainly do with a lot more tests, though I expect it will
> take a while to get them.

-- 
Stefano