On Sun, Feb 18, 2024 at 10:01:24PM +0100, Stefano Brivio wrote:
> On Tue,  6 Feb 2024 12:17:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > This makes a number of changes to improve error reporting while connecting
> > a new spliced socket:
> >  * We use flow_err() and similar functions so all messages include info
> >    on which specific flow was affected
> >  * We use strerror() to interpret raw error values
> >  * We now report errors on connection (at "trace" level, since this would
> >    allow spamming the logs)
> >  * We also look up and report some details on EPOLLERR events, which can
> >    include connection errors, since we use a non-blocking connect().  Again
> >    we use "trace" level since this can spam the logs.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp_splice.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 5ba9c8ea..49075e5c 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> >  		ASSERT(0);
> >  
> >  	if (conn->s[1] < 0) {
> > -		warn("Couldn't open connectable socket for splice (%d)",
> > -		     conn->s[1]);
> > +		flow_err(conn,
> > +			 "Couldn't open connectable socket for splice: %s",
> > +			 strerror(-conn->s[1]));
> 
> It took me a bit to convince myself that we actually store negative
> error codes in pools, which sounds like a neat idea, except for two
> things:

Yeah, it took me a while to discover that too :)

> - in tcp_sock_refill_pool(), once we get an error from
>   tcp_conn_new_sock(), should we really continue to call it and try to
>   get more sockets?
> 
>   Presumably, that will have something to do with some kind of resource
>   exhaustion, so perhaps we shouldn't risk making it worse and just
>   stop refilling the pool. If we do, we might have up to one negative
>   error code in the pool, I guess.

Some sort of exhaustion is certainly the most likely cause, yes.

> - if we have an error code in the pool, that error might have occurred
>   a while ago, but here, we're logging it at the moment we need that
>   socket.

Fair point.

> Maybe it would be simpler and saner to just have -1 values in the pool
> (error or no socket available) and report errors in tcp_conn_new_sock()
> instead?

Yeah, that makes sense.  I'll rework to that goal.

> I'm still reviewing patches 17/22 to 22/22, sorry for the delay.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson