On Mon, Mar 04, 2024 at 11:47:17PM +0100, Stefano Brivio wrote:
> On Mon, 4 Mar 2024 12:00:40 +0100
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > On Mon, 4 Mar 2024 12:54:12 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Fri, Mar 01, 2024 at 07:56:51AM +0100, Stefano Brivio wrote:  
> > > > On Fri, 1 Mar 2024 10:09:39 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Thu, Feb 29, 2024 at 03:15:53PM +0100, Stefano Brivio wrote:    
> > > > > > On Thu, 29 Feb 2024 09:56:25 +0100
> > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >       
> > > > > > > On Thu, 29 Feb 2024 19:49:09 +1100
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >       
> > > > > > > > On Thu, Feb 29, 2024 at 08:05:09AM +0100, Stefano Brivio wrote:        
> > > > > > > > > On Thu, 29 Feb 2024 11:38:53 +1100
> > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > >           
> > > > > > > > > > On Wed, Feb 28, 2024 at 02:26:18PM +0100, Laurent Vivier wrote:          
> > > > > > > > > > > On 2/19/24 04:08, David Gibson wrote:            
> > > > > > > > > > > > On Sat, Feb 17, 2024 at 04:07:23PM +0100, Laurent Vivier wrote:  
> > > > > > > > > > > >
> > > > > > > > > > > > [...]
> > > > > > > > > > > >          
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * proto_ipv6_header_psum() - Calculates the partial checksum of an
> > > > > > > > > > > > > + * 			      IPv6 header for UDP or TCP
> > > > > > > > > > > > > + * @payload_len:	Payload length
> > > > > > > > > > > > > + * @proto:		Protocol number
> > > > > > > > > > > > > + * @saddr:		Source address
> > > > > > > > > > > > > + * @daddr:		Destination address
> > > > > > > > > > > > > + * Returns:	Partial checksum of the IPv6 header
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
> > > > > > > > > > > > > +				struct in6_addr saddr, struct in6_addr daddr)            
> > > > > > > > > > > > 
> > > > > > > > > > > > Hrm, this is passing 2 16-byte IPv6 addresses by value, which might
> > > > > > > > > > > > not be what we want.            
> > > > > > > > > > > 
> > > > > > > > > > > The idea here is to avoid the pointer alignment problem (&ip6h->saddr and
> > > > > > > > > > > &ip6h->daddr can be misaligned).            
> > > > > > > > > > 
> > > > > > > > > > Ah, right.  That's a neat idea, but I'm not sure it really helps: I
> > > > > > > > > > think it will just move the misaligned access from inside the function
> > > > > > > > > > to the call site, where we try to marshal the parameter from something
> > > > > > > > > > unaligned.          
> > > > > > > > > 
> > > > > > > > > I haven't tested this yet, but note that this is generally okay: the
> > > > > > > > > problem is *dereferencing* an unaligned pointer. But if you load memory
> > > > > > > > > from an aligned pointer, and extract a value from this memory, it's all
> > > > > > > > > fine.          
> > > > > > > > 
> > > > > > > > Right, that's kind of what I'm getting at.  Assuming this value starts
> > > > > > > > in an unaligned buffer, then in order to pass this by value the caller
> > > > > > > > will need to load from that unaligned pointer.  AFAIK, the compiler
> > > > > > > > will base the type of loads only on the pointed to type, which isn't
> > > > > > > > changed whether we dereference in the caller or the callee.
> > > > > > > >         
> > > > > > > > > 
> > > > > > > > > Speaking MIPS, this is not safe on all CPU models:
> > > > > > > > > 
> > > > > > > > > 	la	$1, 1002   # s1 now contains the value 1002
> > > > > > > > > 	lw	$2, 0($1)  # load word from memory at 1002 + 0 into s2
> > > > > > > > > 
> > > > > > > > > but this is:
> > > > > > > > > 
> > > > > > > > > 	la	$1, 1000   # s1 now contains the value 1000
> > > > > > > > > 	la	$2, 1004   # s3 now contains the value 1004
> > > > > > > > > 	lw	$3, 0($1)  # load word from memory at 1000 + 0 into s3
> > > > > > > > > 	lw	$4, 0($3)  # load word from memory at 1004 + 0 into s4
> > > > > > > > > 	sll	$5, $3, 16 # 16-bit shift left s3 into s5
> > > > > > > > > 	srl	$6, $4, 16 # 16-bit shift right s4 into s6
> > > > > > > > > 	or	$2, $5, $6 # OR s5 and s6 into s2          
> > > > > > > > 
> > > > > > > > Right, but I don't think merely moving the dereference to the caller
> > > > > > > > will necessarily induce the compiler to generate this rather than the
> > > > > > > > former.        
> > > > > > > 
> > > > > > > Oh, oops, I didn't realise this was the case (I haven't reviewed the
> > > > > > > patch yet).      
> > > > > > 
> > > > > > ...no, that's not the case. Dereferencing 'iph' from
> > > > > > struct tcp[46]_l2_buf_t is fine:
> > > > > > 
> > > > > > struct tcp4_l2_buf_t {
> > > > > >         uint8_t                    pad[2];               /*     0     2 */
> > > > > >         struct tap_hdr             taph;                 /*     2    18 */
> > > > > >         struct iphdr               iph;                  /*    20    20 */
> > > > > > 	[...]
> > > > > > } __attribute__((__packed__));
> > > > > > 
> > > > > > struct tcp6_l2_buf_t {
> > > > > >         uint8_t                    pad[2];               /*     0     2 */
> > > > > >         struct tap_hdr             taph;                 /*     2    18 */
> > > > > >         struct ipv6hdr             ip6h;                 /*    20    40 */
> > > > > > 	[...]
> > > > > > } __attribute__((__packed__));
> > > > > > 
> > > > > > The problematic structures are the UDP buffers:
> > > > > > 
> > > > > > struct udp4_l2_buf_t {
> > > > > >         struct sockaddr_in         s_in;                 /*     0    16 */
> > > > > >         struct tap_hdr             taph;                 /*    16    18 */
> > > > > >         struct iphdr               iph;                  /*    34    20 */
> > > > > > 	[...]
> > > > > > } __attribute__((__aligned__(4)));
> > > > > > 
> > > > > > and for UDP, this patch is dereferencing buffer pointers only, not
> > > > > > pointers to headers.      
> > > > > 
> > > > > Ok... but my point remains, I'm not seeing that passing the address by
> > > > > value actually helps - it just seems to change whether we need to
> > > > > handle the unaligned load in the caller or the callee.    
> > > > 
> > > > For UDP and IPv4 (from 6/9):
> > > > 
> > > > +       b->iph.check = csum_ip4_header(b->iph.tot_len, IPPROTO_UDP,
> > > > +                                      b->iph.saddr, b->iph.daddr);
> > > > 
> > > > and for IPv6 (this patch):
> > > > 
> > > > +       b->uh.check = csum(&b->uh, ntohs(b->ip6h.payload_len),
> > > > +                          proto_ipv6_header_psum(b->ip6h.payload_len,
> > > > +                                                 IPPROTO_UDP,
> > > > +                                                 b->ip6h.saddr,
> > > > +                                                 b->ip6h.daddr));
> > > > 
> > > > these cause loads starting from 'b', which is aligned, instead of
> > > > passing 'iph' or 'ip6h', unaligned, and loading from there.    
> > > 
> > > No... the loads are still from b->ip6h.saddr, b->ip6h.daddr and
> > > b->ip6h.payload_len.  
> > 
> > It depends how we define "loading from" -- the problem, in general, is
> > not the memory location per se, the problem is dereferencing memory
> > pointers.
> > 
> > I plan to try an example on MIPS in a bit [...]
> 
> Actually, armhf first (for clarity):
> 
> $ cat align.c
> #include <stdio.h>
> #include <stdint.h>
> 
> struct disarray {
>     uint8_t oops;
>     uint32_t v1;
>     uint32_t v2;
> } __attribute__((packed, aligned(__alignof__(unsigned int))));
> 
> void f1(uint32_t *v1) {
>     *v1 += 42;
> }
> 
> uint32_t f2(uint32_t v2) {
>     return v2++;
> }
> 
> int main()
> {
>     struct disarray d = { 0x55, 0xaa, 0xaa };
> 
>     f1(&d.v1);
>     f2(d.v2);
> 
>     fprintf(stdout, "%08x %08x", d.v1, d.v2);
> }
> 
> $ arm-linux-gnueabihf-gcc-12 -g -O0 -fno-stack-protector -fomit-frame-pointer -mno-unaligned-access -o align align.c
> align.c: In function ‘main’:
> align.c:22:8: warning: taking address of packed member of ‘struct disarray’ may result in an unaligned pointer value [-Waddress-of-packed-member]
>    22 |     f1(&d.v1);
>       |        ^~~~~
> 
> $ arm-linux-gnueabihf-objdump -S --disassemble=main align
> [...]
>     f1(&d.v1);
>  562:	ab01      	add	r3, sp, #4
>  564:	3301      	adds	r3, #1
>  566:	4618      	mov	r0, r3
>  568:	f7ff ffde 	bl	528 <f1>
> [...]
> 
> before the call to f1(), the address in r3 is not aligned (we just
> added #1), despite -mno-unaligned-access. I guess gcc can only warn
> about that, but not fix it.
> 
> This:
>   https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
> 
> says:
>   -munaligned-access
>   -mno-unaligned-access
> 
>     Enables (or disables) reading and writing of 16- and 32- bit values
>     from addresses that are not 16- or 32- bit aligned. By default
>     unaligned access is disabled for all pre-ARMv6, all ARMv6-M and for
>     ARMv8-M Baseline architectures, and enabled for all other
>     architectures. If unaligned access is not enabled then words in
>     packed data structures are accessed a byte at a time.
> 
> Implying, I guess, that on those architectures unaligned accesses
> shouldn't be done. I think Thumb mode also has issues with this, by
> the way. 
> 
> And in f1() we just have a ldr from that address (passed on r0):
> void f1(uint32_t *v1) {
>  528:	b082      	sub	sp, #8
>  52a:	9001      	str	r0, [sp, #4]
>     *v1 += 42;
>  52c:	9b01      	ldr	r3, [sp, #4]
>  52e:	681b      	ldr	r3, [r3, #0]
>  530:	f103 022a 	add.w	r2, r3, #42	@ 0x2a
> 
> $ arm-linux-gnueabihf-objdump -S --disassemble=f1 align
> [...]
>     *v1 += 42;
>  52c:	9b01      	ldr	r3, [sp, #4]
>  52e:	681b      	ldr	r3, [r3, #0]
>  530:	f103 022a 	add.w	r2, r3, #42	@ 0x2a
> 
> ...but the call to f2() is fine: we load with offset 8 from the stack
> pointer, shift word right, load from offset 12, shift word left, OR:
> 
> $ arm-linux-gnueabihf-objdump -S --disassemble=main align
> [...]
>     f2(d.v2);
>  56c:	9b02      	ldr	r3, [sp, #8]
>  56e:	0a1b      	lsrs	r3, r3, #8
>  570:	f89d 200c 	ldrb.w	r2, [sp, #12]
>  574:	0612      	lsls	r2, r2, #24
>  576:	4313      	orrs	r3, r2
>  578:	4618      	mov	r0, r3
>  57a:	f7ff ffe0 	bl	53e <f2>
> [...]

Huh.  Ok, so I guess the compiler realises it's doing a load from a
packed structure and generates the necessary fixup code.  I thought it
would only consider the type of the actually loaded value.  Are you
sure it still does this correctly when optimization is enabled?
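
To make that question concrete, here is a rough sketch of what I'd try
(untested on armhf or MIPS, so take it as an assumption rather than a
result).  It reuses 'struct disarray' from your align.c; load_v2() and
the noinline attributes are my additions, there only so that the load
from the packed member and the by-value call both survive at -O2
instead of being constant-folded away:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as in align.c: v1 and v2 land on odd offsets */
struct disarray {
    uint8_t oops;
    uint32_t v1;
    uint32_t v2;
} __attribute__((packed, aligned(__alignof__(unsigned int))));

/* Sanity check: v2 really is misaligned within the structure */
static_assert(offsetof(struct disarray, v2) == 5, "v2 should be misaligned");

/* noinline (hypothetical addition): keep the by-value call visible at -O2 */
__attribute__((noinline)) uint32_t f2(uint32_t v2)
{
    return v2 + 1;
}

/* The structure comes in through a pointer, so the compiler can't know the
 * contents and has to emit a real load of the packed member before the call
 */
__attribute__((noinline)) uint32_t load_v2(const struct disarray *d)
{
    return f2(d->v2);
}
```

Building this with the same cross compilers and -O2 -c, then
disassembling load_v2(), should show whether the byte-wise (or lwl/lwr)
sequence is still there.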

> 
> Now on to MIPS (MIPS32):
> 
> $ mips-linux-gnu-gcc-12 -g -O0 -fno-stack-protector -fomit-frame-pointer -mno-unaligned-access -o align align.c
> align.c: In function ‘main’:
> align.c:22:8: warning: taking address of packed member of ‘struct disarray’ may result in an unaligned pointer value [-Waddress-of-packed-member]
>    22 |     f1(&d.v1);
>       |        ^~~~~
> 
> $ mips-linux-gnu-objdump -S --disassemble=main align
> [...]
>     f1(&d.v1);
>  7bc:	27a20019 	addiu	v0,sp,25
>  7c0:	00402025 	move	a0,v0
>  7c4:	8f82802c 	lw	v0,-32724(gp)
>  7c8:	0040c825 	move	t9,v0
>  7cc:	0411ffe0 	bal	750 <f1>
>  7d0:	00000000 	nop
>  7d4:	8fbc0010 	lw	gp,16(sp)
> [...]
> 
> '&d.v1' is passed in a0, again unaligned (stack pointer plus 25). And f1()
> uses it just like that:
> 
> $ mips-linux-gnu-objdump -S --disassemble=f1 align
> [...]
> void f1(uint32_t *v1) {
>  750:	afa40000 	sw	a0,0(sp)
>     *v1 += 42;
>  754:	8fa20000 	lw	v0,0(sp)
>  758:	8c420000 	lw	v0,0(v0)
>  75c:	2443002a 	addiu	v1,v0,42
> [...]
> 
> while the call to f2() is, again, fine:
> 
> $ mips-linux-gnu-objdump -S --disassemble=main align
>     f2(d.v2);
>  7e0:	8ba2001d 	lwl	v0,29(sp)
>  7e4:	9ba20020 	lwr	v0,32(sp)
>  7e8:	00402025 	move	a0,v0
>  7ec:	8f828030 	lw	v0,-32720(gp)
>  7f0:	0040c825 	move	t9,v0
>  7f4:	0411ffdf 	bal	774 <f2>
>  7f8:	00000000 	nop
>  7fc:	8fbc0010 	lw	gp,16(sp)
> 
> two loads, from stack pointer + 29 and stack pointer + 32. MIPS32 has lwl
> and lwr (the infamous US4814976A patent, now expired) to avoid load plus
> shift plus OR.
> 
> Now, you might argue that what I'm describing here might simply be gcc's
> behaviour, and if gcc avoids unaligned loads as long as we don't pass
> unaligned pointers around, that's not any better for us -- other compilers
> might do things differently.
> 
> And... yes, packed structures are actually a GNU extension: C standards
> don't say anything about loads like my f2(d.v2) call above, so all I'm
> showing here is that a particular compiler is fine with these accesses,
> but not unaligned pointers.
> 
> On the other hand, this seems to be a well established behaviour, and I
> don't think we could realistically drop every load of unaligned *values*.
> Unaligned pointers, we currently don't dereference any, because gcc warns
> otherwise.
> 
> So, practically speaking, I guess as long as we avoid dereferencing
> unaligned pointers, we should be fine?
> 

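On the "other compilers" concern: if we ever wanted something that
doesn't lean on the packed extension at all, a memcpy() based load is
the usual standard C fallback.  A minimal sketch, where
load_u32_unaligned() is a hypothetical helper and not something in the
tree or in this series:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * address.  memcpy() has no alignment requirement on its arguments, so no
 * uint32_t pointer to unaligned memory is ever dereferenced.
 */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

I'm not suggesting we need this for the series, it's just the shape a
portable version would take.
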
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson