From: Stefano Brivio
To: passt-dev@passt.top
Subject: Re: [PATCH 2/2] test/passt_in_ns: Consistent sleep commands before starting socat client
Date: Sat, 17 Sep 2022 13:28:42 +0200
Message-ID: <20220917132842.2684a7d2@elisabeth>
In-Reply-To:

On Sat, 17 Sep 2022 20:19:21 +1000
David Gibson wrote:

> On Sat, Sep 17, 2022 at 10:44:41AM +0200, Stefano Brivio wrote:
> > On Sat, 17 Sep 2022 13:32:45 +1000
> > David Gibson wrote:
> >
> > > On Sat, Sep 17, 2022 at 01:55:34AM +0200, Stefano Brivio wrote:
> > > > There are some 'sleep 1' commands between starting the socat server
> > > > and its corresponding client to avoid races due to the server not
> > > > being ready as we start sending data.
> > > >
> > > > However, those don't cover all the cases where we might need them,
> > > > and in some cases the sleep command actually ended up before the
> > > > server even started.
> > > >
> > > > This fixes occasional failures in TCP and UDP simple transfer tests,
> > > > which became apparent with the new command dispatch mechanism.
> > > >
> > > > Signed-off-by: Stefano Brivio
> > >
> > > Heh, I have a very similar patch in my upcoming series. I used sleep
> > > 0.1 though, to avoid taking so long, which seems to be sufficient in
> > > practice.
> >
> > Just mind POSIX only specifies integers as the argument for sleep(1);
> > a pattern I commonly use is:
> >
> > 	sleep 0.1 || sleep 1
>
> Ah, right.
>
> > > I did look for a proper way to wait for the socat server to be ready,
> > > but I couldn't find anything workable.
> >
> > I guess we could enable logging and look for "starting accept loop",
> > from _xioopen_listen() in xio-listen.c.
>
> Yeah, I suppose.
>
> > Anyway, right now, I'm just trying to get the tests to complete with
> > all your pending series, because it's been a while. This doesn't look
> > too bad, we can try to improve it later.
>
> Yeah, fair enough. When I rebase I'll see if there's any refinements
> I can make relatively easily.

Okay, thanks.

> > > I thought I could retry on the
> > > client side, but that doesn't work (at least for host to guest
> > > connections) because passt is always listening on the forwarded ports,
> > > so the client won't see a connection refused from the actual server.
> > > I don't think there's any sane way to propagate connection refused in
> > > that way, although we could and probably should forward connection
> > > resets outwards.
> >
> > They should be already, and (manual) retries actually worked for me,
>
> I'm not entirely sure what you mean by manual retries. I was trying
> to use socat's retry option, but that only seems to fire on
> ECONNREFUSED.

I was watching the test execution and, as I saw the socat server
process getting stuck, I just re-ran the client via ssh -F ... with the
original command, and that worked for me every time.
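By the way, to make the readiness check above concrete, here is a
sketch of how the log-based wait and the sleep fallback could combine
in a test script. The port, file names and retry cap are made up for
the example, and I'm assuming four -d options are enough for socat to
log "starting accept loop":

	# Sketch only, not the actual test commands: start the server
	# with raised verbosity and poll its log for readiness instead
	# of sleeping for a fixed second
	socat -d -d -d -d -u TCP4-LISTEN:10001 \
		OPEN:recv.bin,creat,trunc 2>server.log &

	i=0
	while ! grep -q "starting accept loop" server.log 2>/dev/null; do
		i=$((i + 1)); [ "$i" -gt 50 ] && exit 1
		# POSIX sleep(1) only specifies integer arguments: try a
		# sub-second sleep first, fall back to a full second
		sleep 0.1 || sleep 1
	done

	socat -u OPEN:send.bin,rdonly TCP4:127.0.0.1:10001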
So, going back to the retry option, the behaviour I'm inferring is:

- server not listening, no retry option: connect() goes through, but
  anything else will get ECONNRESET (RST from passt), plus there should
  be EPOLLERR if socat checks that: the client exits

- server not listening, retry option: the client exits without
  retrying, because it doesn't get ECONNREFUSED

...I guess a cleaner behaviour on passt's side would be to delay the
accept() (that's in tcp_conn_from_sock()) until we get SYN, ACK from
the guest. But:

- we can't send a RST if we don't accept(), as we have no socket to
  close(). Maybe we could close and reopen the listening socket...? We
  need to be a bit careful about not turning that into a vector for
  DoS, though

- we can't send ICMP/ICMPv6 messages

...so we risk a connect() timeout, which is even less desirable.

In case the connection goes through, though... I actually tried in the
past to wait before we accept(), and it increased latency (of course),
so I discarded that approach, because I couldn't think of any practical
case. But here we are, so perhaps this behaviour should be there, at
least as an option (maybe even as the default).

> > but it looks like added complexity. Before even checking the
> > connection state, tcp_tap_handler() checks for the RST flag:
> >
> > 	if (th->rst) {
> > 		conn_event(c, conn, CLOSED);
> > 		return p->count;
> > 	}
> >
> > and this will abruptly close the originating socket, which implies a
> > RST, on Linux.
>
> Right, I found that... but I'm not sure it is sending an RST. Setting
> SO_LINGER with a zero timeout does send an RST, but I'm not sure it
> does so without it.

I actually checked: it did for me, without SO_LINGER.

> Even if it does, it's not really helpful for this. I found that
> sending an RST doesn't reliably cause the client socat to exit with
> failure. I'm not 100% sure what's going on, but I think what can
> happen is that the client sends everything and completes before the
> RST arrives, because it all gets buffered first in passt (or at least
> in the kernel socket buffers associated with passt). In that case the
> client doesn't detect the error.

...right.

> > We don't send out ICMP or ICMPv6 messages, even if the guest
> > additionally replied with one, because we can't ("ping" sockets are
> > just for that -- echoes). For TCP, a simple RST is probably not as
> > descriptive, but that's all we can do.
> >
> > > Btw, I've also been doing a bunch of work on the static checks -
> > > with some different options I've brought the runtime of make
> > > cppcheck from ~40 minutes down to ~40 seconds :).
> >
> > Whoa, 40 minutes? For me it was never more than a couple of minutes,
>
> Yeah, about 36 minutes to be precise, I think. There's a reason I
> haven't been attempting to run this until now.

Wow, okay, I had no idea.

> > see https://passt.top/passt/about/#continuous-integration,
> > "build/static_checkers".
> >
> > I guess it depends on system headers... but in any case that's
> > currently taking way too long, also for myself. :)
>
> Huh, interesting. I wonder if it's because it simply isn't finding
> all the system headers for you. I see you have a missingIncludeSystem
> suppression in there, and when I removed that I found it complained
> about not finding the headers on Debian until I added
> -I/usr/include/x86_64-linux-gnu. Anyway, I have patches that fix this,
> which I'll be sending soon. In any case, since the run time will be
> exponential in the number of config defines, it doesn't take a huge
> difference in the headers to make a vast difference to runtime.
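For concreteness, the kind of cppcheck invocation being described might
look like the sketch below; only the -I path and the
missingIncludeSystem suppression come from this discussion, while the
file list and remaining flags are made up:

	# Hypothetical call, not the actual passt Makefile target: pass
	# the Debian multiarch include directory explicitly so cppcheck
	# finds the system headers, making the missingIncludeSystem
	# suppression unnecessary
	cppcheck --std=c11 --error-exitcode=1 --enable=warning,style \
		-I /usr/include -I /usr/include/x86_64-linux-gnu \
		*.c *.h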
Ah, yes, the missing system headers might explain it.

> I also found that both clang-tidy and cppcheck fail for me; I have a
> bunch of fixes for this, but I'm not finished yet. There are a couple
> of tricky ones, including one with dueling errors - cppcheck says
> there's a redundant NULL check, but if I remove it, clang-tidy says
> there's a NULL pointer dereference. Still working on it.

I'm currently using Cppcheck 2.8 and LLVM 13.0.1; perhaps you have more
recent versions. I'll update them as soon as I finally get the tests to
go through. ;)

Thanks in advance for fixing those.

-- 
Stefano