From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76])
 by passt.top (Postfix) with ESMTPS id 44C485A026F
 for ; Thu, 5 Oct 2023 10:15:19 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gibson.dropbear.id.au; s=201602; t=1696493715;
 bh=40/vHDADflS0f16Sriu64L9PaAClMlKZstb0DYuVX8c=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=MgC1+yBXCcRuzvPpUbDKIzb1Fs83gkQ2kuOp4XxnkOr8CiZ5e3KemG/WDp/Nwdy82
 wZvTfWSo6TtCCRwGFQDVWFNDYVnuwnM9seBpOlsvCOjWuCb+oMiRnIma69fN6hyo9Q
 oLmi0yHSOiRu/L1mMk/l2FYXCgZ4cDqkwK84ByIY=
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
 id 4S1PW71WLDz4xQ1; Thu, 5 Oct 2023 19:15:15 +1100 (AEDT)
Date: Thu, 5 Oct 2023 18:36:01 +1100
From: David Gibson
To: Stefano Brivio
Subject: Re: [PATCH RFT 2/5] tcp: Reset STALLED flag on ACK only, check for
 pending socket data
References: <20230922220610.58767-1-sbrivio@redhat.com>
 <20230922220610.58767-3-sbrivio@redhat.com>
 <20230927190533.2fc53bbf@elisabeth>
 <20230929172015.3b5969bc@elisabeth>
 <20231005081849.7463eccc@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="bkB8qTwxdDG1Kd+b"
Content-Disposition: inline
In-Reply-To: <20231005081849.7463eccc@elisabeth>
Message-ID-Hash: GHWA7XHZK5EKV2KCLD4MYRLP2XBCN6NL
X-Message-ID-Hash: GHWA7XHZK5EKV2KCLD4MYRLP2XBCN6NL
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency;
 loop; banned-address; member-moderation; nonmember-moderation;
 administrivia; implicit-dest; max-recipients; max-size; news-moderation;
 no-subject; digests; suspicious-header
CC: Matej Hrica, passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt

--bkB8qTwxdDG1Kd+b
Content-Type: text/plain;
 charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Oct 05, 2023 at 08:18:49AM +0200, Stefano Brivio wrote:
> On Tue, 3 Oct 2023 14:20:58 +1100
> David Gibson wrote:
>
> > On Fri, Sep 29, 2023 at 05:20:15PM +0200, Stefano Brivio wrote:
> > > On Thu, 28 Sep 2023 11:48:38 +1000
> > > David Gibson wrote:
> > >
> > > > On Wed, Sep 27, 2023 at 07:05:33PM +0200, Stefano Brivio wrote:
> > > > > On Mon, 25 Sep 2023 13:07:24 +1000
> > > > > David Gibson wrote:
> > > > >
> > > > > > I think the change itself here is sound, but I have some nits to pick
> > > > > > with the description and reasoning.
> > > > > >
> > > > > > On Sat, Sep 23, 2023 at 12:06:07AM +0200, Stefano Brivio wrote:
> > > > > > > In tcp_tap_handler(), we shouldn't reset the STALLED flag (indicating
> > > > > > > that we ran out of tap-side window space, or that all available
> > > > > > > socket data is already in flight -- better names welcome!
> > > > > >
> > > > > > Hmm.. when you put it like that it makes me wonder if those two quite
> > > > > > different conditions really need the same handling. Hrm.. I guess
> > > > > > both conditions mean that we can't accept data from the socket, even
> > > > > > if it's available.
> > > > >
> > > > > Right. I mean, we can also call them differently... or maybe pick a
> > > > > name that reflects the outcome/handling instead of what happened.
> > > >
> > > > Sure, if we could think of one. Except, on second thoughts, I'm not
> > > > sure my characterization is correct. If the tap side window is full
> > > > then, indeed, we can't accept data from the socket. However if
> > > > everything we have is in flight that doesn't mean we couldn't put more
> > > > data into flight if it arrived.
> > >
> > > Right, but that's why we set EPOLLET...
> > >
> > > > That consideration, together with the way we use MSG_PEEK possibly
> > > > means that we fundamentally need to use edge-triggered interrupts -
> > > > with the additional trickiness that entails - to avoid busy polling.
> > > > Although even that only works if we get a new edge interrupt when data
> > > > is added to a buffer that's been PEEKed but not TRUNCed. If that's
> > > > not true the MSG_PEEK approach might be doomed :(.
> > >
> > > without EPOLLONESHOT, which wouldn't have this behaviour. From epoll(7):
> > >
> > >     Since even with edge-triggered epoll, multiple events can be generated
> > >     upon receipt of multiple chunks of data, the caller has the option to
> > >     specify the EPOLLONESHOT flag [...]
> > >
> > > so yes, in general, when the socket has more data, we'll get another
> > > event. I didn't test this in an isolated case, perhaps we should, but
> > > from my memory it always worked.
> >
> > Ok. That text does seem to suggest it works that way, although it's
> > not entirely clear that it must always give new events.
>
> A glimpse at the code confirms that, but... yes, I think we should test
> this more specifically, perhaps even shipping that test case under doc/.

Seems wise.

> > > On the other hand, we could actually use EPOLLONESHOT in the other
> > > case, as an optimisation, when we're waiting for an ACK from the tap
> > > side.
> >
> > Hrm.. I can't actually see a case where EPOLLONESHOT would be useful.
> > By the time we know the receiver's window has been filled, we're
> > already processing the last event that we'll be able to until the
> > window opens again. Setting EPOLLONESHOT would be requesting one more
> > event.
>
> Ah, true -- we should have it "always" set and always re-arm, which is
> messy and would probably kill any semblance of high throughput.
>
> > > > > > > ) on any
> > > > > > > event: do that only if the first packet in a batch has the ACK flag
> > > > > > > set.
> > > > > >
> > > > > > "First packet in a batch" may not be accurate here - we're looking at
> > > > > > whichever packet we were up to before calling data_from_tap(). There
> > > > > > could have been earlier packets in the receive batch that were already
> > > > > > processed.
> > > > >
> > > > > Well, it depends on what we call "batch" -- here I meant the pool of
> > > > > packets (that are passed as a batch to tcp_tap_handler()). Yes, "pool"
> > > > > would be more accurate.
> > > >
> > > > Uh.. I don't think that actually helps. Remember pools aren't queues.
> > > > The point here is that the packet we're considering is not the
> > > > first of the batch/pool/whatever, but the first of what's left.
> > >
> > > Right, yes, I actually meant the sub-pool starting from the index (now)
> > > given by the caller.
> > >
> > > > > > This also raises the question of why the first data packet should be
> > > > > > particularly privileged here.
> > > > >
> > > > > No reason other than convenience, and yes, it can be subtly wrong.
> > > > >
> > > > > > I'm wondering if what we really want to
> > > > > > check is whether data_from_tap() advanced the ack pointer at all.
> > > > >
> > > > > Right, we probably should do that instead.
> > > >
> > > > Ok.
> > > >
> > > > > > I'm not clear on when the th->ack check would ever fail in practice:
> > > > > > aren't the only normal packets in a TCP connection without ACK the
> > > > > > initial SYN or an RST? We've handled the SYN case earlier, so should
> > > > > > we just have a blanket case above this that if we get a packet with
> > > > > > !ACK, we reset the connection?
> > > > >
> > > > > One thing that's legitimate (rarely seen, but I've seen it, I don't
> > > > > remember if the Linux kernel ever does that) is a segment without ACK,
> > > > > and without data, that just updates the window (for example after a
> > > > > zero window).
> > > > >
> > > > > If the sequence received/processed so far doesn't correspond to the
> > > > > latest sequence sent, omitting the ACK flag is useful so that the
> > > > > window update is not taken as duplicate ACK (that would trigger
> > > > > retransmission).
> > > >
> > > > Ah, ok, I wasn't aware of that case.
> > >
> > > On a second thought, in that case, we just got a window update, so it's
> > > very reasonable to actually check again if we can send more. Hence the
> > > check on th->ack is bogus anyway.
> > >
> > > > > > > Make sure we check for pending socket data when we reset it:
> > > > > > > reverting back to level-triggered epoll events, as tcp_epoll_ctl()
> > > > > > > does, isn't guaranteed to actually trigger a socket event.
> > > > > >
> > > > > > Which sure seems like a kernel bug. Some weird edge conditions for
> > > > > > edge-triggered seems expected, but this doesn't seem like valid
> > > > > > level-triggered semantics.
> > > > >
> > > > > Hmm, yes, and by doing a quick isolated test actually this seems to work
> > > > > as intended in the kernel. I should drop this and try again.
> > > > >
> > > > > > Hmmm... is toggling EPOLLET even what we want. IIUC, the heart of
> > > > > > what's going on here is that we can't take more data from the socket
> > > > > > until something happens on the tap side (either the window expands, or
> > > > > > it acks some data). In which case should we be toggling EPOLLIN on
> > > > > > the socket instead? That seems more explicitly to be saying to the
> > > > > > socket side "we don't currently care if you have data available".
> > > > >
> > > > > The reason was to act on EPOLLRDHUP at the same time. But well, we
> > > > > could just mask EPOLLIN and EPOLLRDHUP, then -- I guess that would make
> > > > > more sense.
> > > >
> > > > So specifically to mask EPOLLRDHUP as well? On the grounds that if
> > > > we're still chewing on what we got already we don't yet care that
> > > > we've reached the end, yes?
> > >
> > > Right.
> >
> > Ok.
> >
> > > > So yes, explicitly masking both those
> > > > makes more sense to me.. except that as above, I suspect we can't have
> > > > level-triggered + MSG_PEEK + no busy polling all at once.
> > >
> > > Hmm, right, that's the other problem if we mask EPOLLIN: we won't get
> > > events on new data. I think EPOLLET is really what we need here, at
> > > least for the case where we are not necessarily waiting for an ACK.
> > >
> > > For the other case (window full), we can either mask EPOLLIN |
> > > EPOLLRDHUP or set EPOLLONESHOT (possibly slightly more complicated
> > > because we need to re-add the file descriptor).
> >
> > So, thinking further, what I think we want is to always set EPOLLET on
> > the TCP sockets. If all received data is in flight we don't need
> > anything special, at least assuming epoll works like we think it does
> > above: when we get more data, we get an event and check if we can send
> > more data into flight.
>
> ...I think having EPOLLET always set causes races that are potentially
> unsolvable,

Hrm.. I'm a little sceptical that we'd have unsolvable races when
always using EPOLLET that we don't already have using it sometimes
(though maybe with more subtle and complex triggering conditions). I
feel like it's easier to reason through to avoid races if we stay in a
single trigger mode.

> because we can't always read from the socket until we get
> -EAGAIN.
> That is, this will work:
>
> - recv() all data, we can't write more data to tap
> - at some point, we can write again to tap
> - more data comes, we'll wake up and continue
>
> but this won't:
>
> - partial recv(), we can't write more data to tap
> - at some point, we can write again to tap
> - no additional data comes

Well.. with the other change I'm suggesting here, once the data we did
send gets ACKed, we'll recheck the incoming socket and send more. It
will delay that remainder data a bit, but not by _that_ much. Since
in this case we're not getting more data from the socket, that
shouldn't even be all that harmful to throughput.

That said, it's not ideal. We could address that by enabling an
EPOLLOUT (possibly ONESHOT) event on the tap socket if we get a
partial send. When that event is triggered, we'd scan through
connections for any unsent data.

> > When the receive window fills we really don't care about new data
> > until it opens again, so clear EPOLLIN | EPOLLRDHUP. When the window
> > does open again - i.e. when we get an ack or window update - both
> > reenable EPOLLIN | EPOLLRDHUP and call sock_handler() to process
> > anything that's accumulated since we turned EPOLLIN off.
>
> Agreed. This should be a mere optimisation on top of the current
> behaviour, by the way.

Well, it becomes an essential change if we always enable EPOLLET.

> > Details to figure out:
> >  * Do we need to be careful about order of re-enable EPOLLIN
> >    vs. rechecking the recv() buffer?
>
> It should be done before checking I guess, so that we can end up with
> one spurious event (because we already read the data a future EPOLLIN
> will tell us about), but never a missing event (because data comes
> between recv() and epoll_ctl()).

Yes, I think that's right.

> >  * At what point would we trigger the CLAMP_WINDOW workaround in that
> >    scheme?
>
> When we read any data from the socket, with MSG_TRUNC, after the window
> full condition.

Well, yes, but what's the point at which we flag that the window full
condition has occurred? Since that's occurring on the read side
buffer, we're not directly aware of it.

> >  * Is there any impact of EPOLLET on the other events?
>
> Not that I'm aware of. EPOLLHUP and EPOLLERR are reported anyway, and
> we don't want EPOLLRDHUP to differ (in this sense) from EPOLLIN. But
> again, this is under the assumption that we do *not* always set EPOLLET.
>

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

--bkB8qTwxdDG1Kd+b
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmUeZzUACgkQzQJF27ox
2Gc4yxAAmto0aqiBqU1s8ng2Wzl5C9BobjHVykFBJTuh904GG2xICJ+GSwwaeIC9
JbFFEyDWa20Oosqk4hYpSME0BNlbpLPxgnUV6a1dNO6HkwdIlcEf0z3ehht294qO
tqMiWOAerMxnr0E51loIf4BdVaCHGKijFaphfJyaz4A5NFQVfuvTzDf3YQvl8ypc
vdBy3QqKSw7luIj3gSZt+lv5E2411e0Jen+OH/EvppjHmBptS+4j5cQ0BfCIg1Lh
Be1Wmonqn1X13IQpoAgsL6OsdNAbizk0kKJDklilpB2JEFIYN/NWFmXD18sS8SzR
TBF2gGqRSBTNcs6up9Jy7cNfnpItnpkeIeXfrq36d7NxNTiIpMvFzj1ryJs+fV8A
v/DiYVOLTYSRxvf24gj4NdCA7OeHMNy5nL8h9bMi1Wd9cnQTPUfOJ8Kjn2W4ND8o
VmLtBNoz9zcgmjQnOUkBrPUkjxYJldMmAfUNjr6FqbjYNQQpZJ6xFK1aOKHzbdA1
fSgGD1qIad1ZUtqjSqM01QwROXdGQxB/btEnkQ1QfXRik3MYR1Ns+mBxuWneEUv/
QLp2AfHIzRGRbGfAUaweX7h3Yhwbklby7jYhZB0l0QQgTrjx9K41mo4M5py0Nq+j
vwvytqEnOl7bSy9xDy8OW7euoYcyqGe8OonpiLZpBxT0hWDiC+0=
=wFM5
-----END PGP SIGNATURE-----

--bkB8qTwxdDG1Kd+b--
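
The isolated test suggested in the thread (do we get a new
edge-triggered EPOLLIN when data is added behind bytes that were only
read with MSG_PEEK?) could look roughly like the sketch below. This is
not passt code and makes a few assumptions: it uses Python's
select.epoll on Linux, an AF_UNIX socketpair() stands in for the TCP
socket (TCP probably deserves its own check), and the function name
edge_event_after_peek() is made up for illustration.

```python
import select
import socket


def edge_event_after_peek():
    """Report whether EPOLLIN fired at three points on an EPOLLET socket.

    Returns (first, idle, second):
      first  - event after the first chunk of data arrives
      idle   - event after MSG_PEEK alone, with no new data
      second - event after more data arrives behind the peeked bytes
    """
    # Assumption: an AF_UNIX stream pair stands in for a TCP connection.
    writer, reader = socket.socketpair()
    ep = select.epoll()
    try:
        reader.setblocking(False)
        ep.register(reader.fileno(), select.EPOLLIN | select.EPOLLET)

        writer.send(b"first")
        first = bool(ep.poll(1.0))

        # Look at the data without consuming it, as with MSG_PEEK in the
        # thread; the bytes stay in the receive buffer.
        reader.recv(16, socket.MSG_PEEK)
        idle = bool(ep.poll(0))

        # New data arrives while the old data is still unread.
        writer.send(b"second")
        second = bool(ep.poll(1.0))

        return first, idle, second
    finally:
        ep.close()
        writer.close()
        reader.close()
```

If epoll behaves as discussed above, this should return
(True, False, True): nothing is reported while the buffer is merely
non-empty, but a new chunk of data produces a new event even though
the earlier bytes were never consumed.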