From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Gibson To: passt-dev@passt.top Subject: Re: [PATCH 2/2] test/passt_in_ns: Consistent sleep commands before starting socat client Date: Mon, 19 Sep 2022 11:48:33 +1000 Message-ID: In-Reply-To: <20220917132842.2684a7d2@elisabeth> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="===============6543235428601746219==" --===============6543235428601746219== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit On Sat, Sep 17, 2022 at 01:28:42PM +0200, Stefano Brivio wrote: > On Sat, 17 Sep 2022 20:19:21 +1000 > David Gibson wrote: > > > On Sat, Sep 17, 2022 at 10:44:41AM +0200, Stefano Brivio wrote: > > > On Sat, 17 Sep 2022 13:32:45 +1000 > > > David Gibson wrote: > > > > > > > On Sat, Sep 17, 2022 at 01:55:34AM +0200, Stefano Brivio wrote: > > > > > There are some 'sleep 1' commands between starting the socat server > > > > > and its corresponding client to avoid races due to the server not > > > > > being ready as we start sending data. > > > > > > > > > > However, those don't cover all the cases where we might need them, > > > > > and in some cases the sleep command actually ended up being before > > > > > the server even starts. > > > > > > > > > > This fixes occasional failures in TCP and UDP simple transfer tests, > > > > > that became apparent with the new command dispatch mechanism. > > > > > > > > > > Signed-off-by: Stefano Brivio > > > > > > > > Heh, I have a very similar patch in my upcoming series. I used sleep > > > > 0.1 though, to avoid taking so long, which seems to be sufficient in > > > > practice. > > > > > > Just mind POSIX only specifies integers as argument for sleep(1), a > > > pattern I commonly use is: > > > > > > sleep 0.1 || sleep 1 > > > > Ah, right. > > > > > > I did look for a proper way to wait for the socat server to be ready, > > > > but I couldn't find anything workable. > > > > > > I guess we could enable logging and look for "starting accept loop", > > > from _xioopen_listen() in xio-listen.c. > > > > Yeah, I suppose. > > > > > Anyway, right now, I'm just trying to get the tests to complete with > > > all your pending series, because it's been a while. This doesn't look > > > too bad, we can try to improve it later. > > > > Yeah, fair enough. When I rebase I'll see if there's any refinements > > I can make relatively easily. > > Okay, thanks. > > > > > I thought I could retry on the > > > > client side, but that doesn't work (at least for host to guest > > > > connections) because passt is always listening on the forwarded ports, > > > > so the client won't see a connection refused from the actual server. > > > > I don't think there's any sane way to propagate connection refused in > > > > that way, although we could and probably should forward connection > > > > resets outwards. > > > > > > They should be already, and (manual) retries actually worked for me, > > > > I'm not entirely sure what you mean by manual retries. I was trying > > to use socat's retry option, but that only seems to fire on > > ECONNREFUSED. > > I was watching the text execution, then as I saw the socat server > process getting stuck, I just re-run the client with ssh -F ... with the > original command, and that worked for me every time. Ah, yeah, I'd expect that to work. The difficulty is automatically detecting the failure. > So, the behaviour I'm inferring is: > > - server not listening, no retry option: connect() goes > through, but anything else will get ECONNRESET (RST from passt), plus > there should be EPOLLERR if socat checks that: client exits That doesn't seem to be the case. We *eventually* get an ECONNRESET, but it seems like we can send some data before it happens - sometimes quite a lot. I'm assuming this is because quite a lot can fit into socket buffers before the RST makes its way back. > - server not listening, retry option: client exits without retry because > it doesn't get ECONNREFUSED Right. > ...I guess a cleaner behaviour on passt's side would be to delay the > accept() (that's in tcp_conn_from_sock()) until we get SYN, ACK from > the guest. But: I thought about that, but I don't think its workable. For starters, I don't think there's a way of explicitly refusing a connection from userspace. Not accept()ing, then closing and re-opening the listening socket *might* do it. But.. more fundamentally, due to the connection backlog, I think by the time we get the notification in userspace, the kernel has probably already completed the TCP handshake, so it's too late to signal a connection refused to the client. > - we can't send a RST if we don't accept(), as we have no socket to > close(). Maybe we could close and reopen the listening socket...? We > need to be a bit careful about not turning that into a vector for > DoS, though > > - we can't send ICMP/ICMPv6 messages > > ...so we risk a connect() timeout, which is even less desirable. > > In case the connection goes through, though... I actually tried in the > past to wait before we accept(), and it was increasing latency (of > course), so I discarded that approach, because I couldn't think of any > practical case. > > But here we are, so perhaps this behaviour should be there, at least as > an option (maybe even as default). > > > > but it looks like added complexity. Before even checking the connection > > > state, tcp_tap_handler() checks for the RST flag: > > > > > > if (th->rst) { > > > conn_event(c, conn, CLOSED); > > > return p->count; > > > } > > > > > > and this will abruptly close the originating socket, which implies a > > > RST, on Linux. > > > > Right, I found that... but I'm not sure it is sending an RST. Setting > > SO_LINGER with zero timeout does send an RST, but I'm not sure it does > > so without it. > > I actually checked, it did for me, without SO_LINGER. > > > Even if it does, it's not really helpful for this. I found that > > sending an RST doesn't reliably cause the client socat to exit with > > failure. I'm not 100% sure what's going on, but I think what can > > happen is that the client sends everything and completes before the > > RST arrives, because it all gets buffered before in passt (or at least > > in the kernel socket buffers associated with passt). In that case the > > client doesn't detect the error. > > ...right. > > > > We don't send out ICMP or ICMPv6 messages, even if the guest > > > additionally replied with one, because we can't ("ping" sockets are > > > just for that -- echoes). For TCP, a simple RST is probably not as > > > descriptive, but that's all we can do. > > > > > > > Btw, I've also been doing a bunch of work on the static checks - with > > > > some different options I've brought the runtime of make cppcheck from > > > > ~40 minutes down to ~40 seconds :). > > > > > > Whoa, 40 minutes? For me it was never more than a couple of minutes, > > > > Yeah, about 36 minutes to be precise, I think. There's a reason I > > haven't been attempting to run this until now. > > Wow, okay, I had no idea. > > > > see https://passt.top/passt/about/#continuous-integration, > > > "build/static_checkers". > > > > > > I guess it depends on system headers... but in any case that's > > > currently taking way too long, also for myself. :) > > > > Huh, interesting. I wonder if it's because it simply isn't finding > > all the system headers for you. I see you have a missingIncludeSystem > > suppression in there, and when I removed that I found it complained > > about not finding the headers on Debian until I added > > -I/usr/include/x86_64-linux-gnu. Anyway, I have patches that fix this > > which I'll be sending soon. In any case, since the run time will be > > exponential in the number of config defines, it doesn't take a huge > > difference in the headers to make a vast difference to runtime. > > Ah, yes, that might explain it. > > > I also found that both clang-tidy and cppcheck fail for me, I have a > > bunch of fixes for this, but I'm not finished yet. There's a couple > > of tricky ones, including one with dueling errors - cppcheck says > > there's a redundant NULL check, but if I remove it clang-tidy says > > there's a NULL pointer dereference. Still working on it. > > I'm currently using Cppcheck 2.8 and LLVM 13.0.1, perhaps you have more > recent versions, I'll update them as soon as I finally get the tests to > go through. ;) Thanks in advance for fixing those. Huh, odd. I'm using cppcheck 2.7 most of the time (the one packaged in Fedora 36). I also tried cppcheck 2.9 on Debian which gets some additional errors. I have LLVM 14.0.5, so that might explain that side. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --===============6543235428601746219== Content-Type: application/pgp-signature Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="signature.asc" MIME-Version: 1.0 LS0tLS1CRUdJTiBQR1AgU0lHTkFUVVJFLS0tLS0KCmlRSXpCQUVCQ0FBZEZpRUVvVUx4V3U0L1dz MGRCK1h0Z3lwWTRnRXdZU0lGQW1NbnlsMEFDZ2tRZ3lwWTRnRXcKWVNKQjRSQUF6bkZ4OFg4R0p4 TkZKQlZyR0VjdFFIMG5GSUcwbDNpU0t0dmJ4Q3BCcE9SSEJZOHZLV1JGK05mNApEVXBvWG9KeThu ekVsZjZsL2ZIMFQyamVjTVNyMGRhWWQxZFJXa1hOeVBLWGZyWVh3SmJlelJaYnV5S0NHNEhCCnVX TnNoWW1HZ3BheGZkMFhSTUJWMHljUFBaUnJiOXBaSytZSldnakkvM1RGb1BYbGpUVEJUb3c0b216 NHFRaG8Kc2RRdkdMUVBUZVJPV3ozU1RGcWxhZ0VWMW41dVhYRndYOEE5WmlDd2NQeEE3elBSYVUw eHpqMWlqNkdjRk13YQphQml1bGIvM0lyazdXcHFwVnhOSWEzWUdacEFIcDZaL1UvMkFyZDNZV1M2 bC9idUlSVERWV1l0ajd2S2NKS3hHCmhCNUFYYk91aE9UTjVvZXpHK1VKVkRCVVhMNVQ1VnZ2R0dM OFA3VWx4NFFlQ0ZiNE03bThOSWtOZWRwMXJLWEgKTjRCTFhnL0tQSlFOdDg5SUxFM0ErcXI2UU1m MFBnd0loL3V2Tnp6UkNtL3k0by9kcmtBUFdENVFNcWNmdWJxNgpwR1h4RGZaV3VxUU56ZFVWdzNH azlUTzR4WVIxbWFTaW1XYTRGdGJIWTE5SGJqOUpRcUhNUXVjUm4vRlhGODZFClVXbXE0UzV6V2dt d0xXRi81azNReFRiUVcrR1dGZWxrVmlWTno2RzArSmdYN3I3aFdwNEpVUWNqaVJsSnZGL28KUU1G TFFvVWRhSUhSd3VFL1lzWEYvbTVLaGFJU25JQ3ozZDRGOVMzcGhLODNsZ1g5NkU5YWNVWG9nRkRj RDNSTgp2SmJYSUxKbFEwK0pYVWVOYnVHZ20xUnBFMDhjeEhCUUdoZEE5TzhxQm4vSUVLMjhsNnc9 Cj0wVktWCi0tLS0tRU5EIFBHUCBTSUdOQVRVUkUtLS0tLQo= --===============6543235428601746219==--