From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=DB2Kl8uz; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 6C5355A0623 for ; Wed, 12 Feb 2025 22:39:12 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1739396351; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CsvRTCvuEfPgp7pleid5CdBLbI6hyK6+E/7H4g34RQI=; b=DB2Kl8uzNg0jpKkeLPEcXFQ/HkKGe8wBzU5YloZz2QSJtbDv8Iu6GDU3YdJnnt9+Zz+qLo DZBPnAwqZgAol4RvmeGcp2AvJ/E42LzPZA2tEHySSM77TRMgutIRp9IzF9QNW2vCFLw/WQ QrQiDwlt9SmpM9H9BHllfGk1XDU8BO0= Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-495-X9UEl6nYOCysOcp2V1Ce-w-1; Wed, 12 Feb 2025 16:39:09 -0500 X-MC-Unique: X9UEl6nYOCysOcp2V1Ce-w-1 X-Mimecast-MFC-AGG-ID: X9UEl6nYOCysOcp2V1Ce-w Received: by mail-qt1-f200.google.com with SMTP id d75a77b69052e-471a2d5b124so4723441cf.2 for ; Wed, 12 Feb 2025 13:39:09 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1739396349; x=1740001149; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=CsvRTCvuEfPgp7pleid5CdBLbI6hyK6+E/7H4g34RQI=; b=MQJiVOUKESqjvU9YylbGEVwtGiaB/K4b8auv4gKpVsGdx71i7wzn0NIAIwtcw6wsC5 LsbAaRnW77szjadbEMOiXSy1ZU02mp9BgqAfCBJpQW3JwdD/biB7hhNBCwJqdNQniFmJ LnfewecyhUzQO6BGCM0k0z0ALujifq4XKvjKLbDbfmruOx8QTzK/9k3tEodoQhHD7QMX KHBDAhcBopec/fxOON1wBTd4m7RkDTGR+6wWWndUMJN+9rOhmfvx94f0iQ8IJnkH28r1 gBhv6SxyhatuRrH4aqJg7XdoGVz6k9yp8y11yJp5M7Rh7OsBgwBjCZpl+tguLBNylSgF AsfA== X-Gm-Message-State: AOJu0Yy7zcrIKKrrLUJiDCo8OsLZ+pnWdXg0WSwOIGIbsyROcSHR2dKH 31mj/BbDPwhf8rQh/FJ8zUqY6uKqhQngRnEMAg64bNHu6BhMc0thO97W9ywKfP75Vh2L0jQhe3v u42duvVemIWiOyLYnqBkErsHAggd45Oh7DRh84+nPHcyHHSQ0Hw== X-Gm-Gg: ASbGncu7PCHdp8zbYLk8f1R/2Ku+qrlS5gFqYnAxM3dvLw3SLIl98IrZrxukVAqRzy7 0+LSq15fUrCXfaHdzFSZ7h1QN1MvNLKp3X7Hf3AsE1JdnhgJ6xLLTd31ePETuzcpKfmo5TvWmGm dQVGTHx+afui7aJHFh3sfUYclwvxrCQT0kts078gMckY3hjTYIDIqbST4o1StIMmB6oV3B7EDOG zy1XyR1cGKjTUDpQE2seSdNe1Me1oYmRtwapQBP63lhZPGLW6YmjLpiDlm6PYB2WKAY5XAWNKJC F30= X-Received: by 2002:ac8:59c1:0:b0:471:a30b:d4fc with SMTP id d75a77b69052e-471bed2ff3bmr13531911cf.2.1739396349263; Wed, 12 Feb 2025 13:39:09 -0800 (PST) X-Google-Smtp-Source: AGHT+IF2xkHrPj2agqOzKFUnaDOZuxKK+arhG2V0EIxV/KjtmgU24Pxh1vdS5a2XE2LgZETfTBVXoA== X-Received: by 2002:ac8:59c1:0:b0:471:a30b:d4fc with SMTP id d75a77b69052e-471bed2ff3bmr13531651cf.2.1739396348899; Wed, 12 Feb 2025 13:39:08 -0800 (PST) Received: from [10.0.0.215] ([24.225.235.209]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-471492a90bcsm79913951cf.31.2025.02.12.13.39.07 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 12 Feb 2025 13:39:08 -0800 (PST) Message-ID: Date: Wed, 12 Feb 2025 16:39:07 -0500 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] udp: create and send ICMPv4 to local peer when applicable To: David Gibson References: <20250209170056.1160547-1-jmaloy@redhat.com> From: Jon Maloy In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: VDg-ZKFBJG3zjhTtes3_vGuuRxs17Pi15XQd9dt6d0M_1739396349 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Message-ID-Hash: KRTCAFXUDQAR4UVJPC3ZJMO24XMSBKYC X-Message-ID-Hash: KRTCAFXUDQAR4UVJPC3ZJMO24XMSBKYC X-MailFrom: jmaloy@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On 2025-02-10 19:55, David Gibson wrote: > On Sun, Feb 09, 2025 at 12:00:56PM -0500, Jon Maloy wrote: >> When a local peer sends a UDP message to a non-existing port on an >> existing remote host, that host will return an ICMP message containing >> the error code ICMP_PORT_UNREACH, plus the header and the first eight >> bytes of the original message. If the sender socket has been connected, >> it uses this message to issue a "Connection Refused" event to the user. [...] >> +void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, >> + struct in_addr dst, size_t l4len, uint8_t proto) > > As discussed on our call, it probably makes sense to alter this to set > the Don't Fragment bit always. I have been studying the Linux code, but this is handled in so many places that it it hard to get a full grip on it. Wireshark logs seem to indicate that it is always set, and googling a bit on the subject indicates the same. I think it is safe to do it, so if you agree I can post a prepending patch setting this. > >> { >> uint16_t l3len = l4len + sizeof(*ip4h); >> >> diff --git a/tap.h b/tap.h >> index dfbd8b9..3ba00c1 100644 >> --- a/tap.h >> +++ b/tap.h >> @@ -44,6 +44,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) >> thdr->vnet_len = htonl(l2len); >> } >> [...] >> + * udp_send_conn_fail_icmp4() - Construct and send ICMP to local peer >> + * @c: Execution context >> + * @ee: Extended error descriptor >> + * @ref: epoll reference >> + * @iov: First bytes (max 8) of original UDP message body > > I don't love passing this as an iov, because that tends to imply there > might be multiple buffers, which is not the case here (both the caller > and callee require a single buffer). The real reason is that csum_udp4() requires an iov, so if we don't pass it here we will have to recreate it inside the new function. > >> + */ >> +static void udp_send_conn_fail_icmp4(const struct ctx *c, >> + const struct sock_extended_err *ee, >> + union epoll_ref ref, >> + struct iovec *iov) > > > >> +{ >> + flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside); > > This is subtly, but badly wrong. ref.flowside is only valid if the > ref is of a type which uses the flowside field. That's true for UDP > reply sockets, but *not* for UDP listening sockets, and > udp_sock_errs() is called on both. I think this function needs to > take an explicit flowside. Yes, I was a little uncertain about this. Can you give me small code snippet of how this should be done? > >> + const struct flowside *toside = flowside_at_sidx(tosidx); >> + struct in_addr oaddr = toside->oaddr.v4mapped.a4; >> + struct in_addr eaddr = toside->eaddr.v4mapped.a4; >> + struct iov_tail data = IOV_TAIL(iov, 1, 0); >> + struct { >> + struct icmphdr icmp4h; >> + struct iphdr ip4h; >> + struct udphdr uh; >> + char data[8]; >> + } msg; > > Needs a ((packed)) attribute to ensure we don't get compiler padding. > ok. > It took me a minute to work out what was going on here. Specifically > that ip4h and uh here aren't the headers of the packet we're sending > now, but the reconstructed headers of the packet that prompted the > ICMP error. Yes, I try to re-create the original message as closely as possible. > >> + size_t udplen = sizeof(msg.uh) + iov->iov_len; >> + size_t msglen = sizeof(msg.icmp4h) + sizeof(msg.ip4h) + udplen; >> + >> + memset(&msg, 0, sizeof(msg)); >> + msg.icmp4h.type = ee->ee_type; >> + msg.icmp4h.code = ee->ee_code; >> + tap_push_ip4h(&msg.ip4h, eaddr, oaddr, udplen, IPPROTO_UDP); > > This isn't quite a natural fit. It does work, but the tap_push_() > functions are kind of designed to work with a byte buffer where we > only work out the positions of headers as we construct them. Probably > a clean up for some other day. > >> + msg.uh.source = htons(toside->eport); >> + msg.uh.dest = htons(toside->oport); >> + msg.uh.len = htons(udplen); >> + csum_udp4(&msg.uh, oaddr, eaddr, &data); > > It might make sense to split a tap_push_uh4() function out from > tap_udp4_send() which you could re-use here. Good point. > >> + memcpy(&msg.data, iov->iov_base, iov->iov_len); >> + tap_icmp4_send(c, oaddr, eaddr, &msg, msglen); >> +} >> + >> /** >> * udp_sock_recverr() - Receive and clear an error from a socket >> - * @s: Socket to receive from >> + * @c: Execution context >> + * @ref: epoll reference >> * >> * Return: 1 if error received and processed, 0 if no more errors in queue, < 0 >> * if there was an error reading the queue >> * >> * #syscalls recvmsg >> */ >> -static int udp_sock_recverr(int s) >> +static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref) > > Similarly udp_sock_recverr() is used both on "listening" and reply > sockets, so it's not really safe to use the ref here. You can > explicitly pass the fd easily enough. The flow is trickier... I can pass *both* the fd and ref, alternativedly the flowside, but it looks redundant and adds extra code. If I add a flowside struct instead of ref I will need to declare that at the all the multiple locations where udp_sock_errs(). Do you have a good suggestion? > >> { >> const struct sock_extended_err *ee; >> const struct cmsghdr *hdr; >> char buf[CMSG_SPACE(sizeof(*ee))]; >> + char udp_data[8]; >> + int s = ref.fd; >> + struct iovec iov = { >> + .iov_base = udp_data, >> + .iov_len = sizeof(udp_data) [...] >> if (rc < 0) >> @@ -558,7 +607,7 @@ static void udp_buf_listen_sock_handler(const struct ctx *c, >> const socklen_t sasize = sizeof(udp_meta[0].s_in); >> int n, i; >> >> - if (udp_sock_errs(c, ref.fd, events) < 0) { >> + if (udp_sock_errs(c, ref, events) < 0) { > > ... because in the listening socket case we don't know the flow until > we've looked at the packet. I'd be tempted as a first cut to only do > the ICMP thing for errors on reply sockets. I can try that, but it would feel a little reduced. > > Listening sockets will be trickier. I think we'll have to actually > check msg_name in the received error in order to work out which flow > it belongs to. Actually, I did that first, until I realized I could extract the source IP address from the flow object. But how does that help? ///jon