Why SPUs need aligned data

SPU programming is a daunting prospect at the best of times. Can’t get useful data without first issuing some DMAs. Can’t have too much data because the code and the data together must never exceed 256KB. Can’t load/store/process anything other than a 16-byte vector. This last one is what I’m going to focus on optimizing here.

Something as simple as dereferencing a float pointer becomes complicated when the hardware only allows loading 16-byte vectors. To achieve this we need to load the 16-byte vector containing our float, then rotate the float into the leftmost position in the vector. For example, this code becomes the following assembly.

float foo(float* p)
{
	return p[0];
}
ori     $2,$3,0       // make a copy of "p"
lqd     $3,0($3)      // load the 16-bytes containing p[0]
rotqby  $3,$3,$2      // move p[0] into the leftmost slot of the vector
bi      $0            // return
But what if we know the pointer’s alignment? Even better, what if we know p is aligned to 16 bytes? The previous example is reduced to this.
lqd     $3,0($3)      // load the 16-bytes starting with p[0]
bi      $0            // return
2 instructions. 2! And one of them was a return unrelated to our pointer dereference. Well, that’s us done then. Just got to align the pointer and the compiler will generate this assembly. Unfortunately, no. C++ doesn’t allow alignment specifiers on parameters, or silently ignores the alignment when it does accept it. If C++ would let us write this, we’d be done.
float foo(float __attribute__((aligned(16)))* p)
{
	return p[0];
}

SPU Intrinsics to the Rescue

Here’s where a little intrinsic magic helps. The SPU compiler knows that vector pointers are aligned to 16 bytes. Casting the float pointer to a vector pointer and dereferencing it will give us the aligned load instruction from the assembly. The rotate instruction is still required if the float element we want isn’t on a 16-byte boundary, but now the rotate is by a constant, which is always better as it gives the compiler more information. I’ve separated this out into 2 functions. The “deref” function contains the magic and “foo” is there to show that calling it with constant offsets optimises away to the best-case code. “bar” shows how the SPU compiler can now perform common subexpression elimination to remove the multiple identical vector loads. The original “bar” code would need 8 loads and 8 rotates here. Now we have only 2 loads and 6 rotates. Note that 2 rotates were removed because rotating by 0 is redundant.

float deref(float* p, int i) __attribute__((always_inline));
float deref(float* p, int i)
{
	// only do the fast version on constant offsets.
	// the compiler can do a better job than we can with non-constant offsets
	if (__builtin_constant_p(i))
	{
		int vec_offset = (i & ~3) / 4;
		int byte_offset = (i & 3) * 4;
		qword qw = *((qword*)(p) + vec_offset);
		qw = si_rotqby(qw, si_from_int(byte_offset));
		return si_to_float(qw);
	}
	else
	{
		return p[i];
	}
}
float foo(float* p)
{
	return deref(p, 0);
}
float bar(float* p)
{
	return deref(p, 0) + deref(p, 1) + deref(p, 2) + deref(p, 3)
		+ deref(p, 4) + deref(p, 5) + deref(p, 6) + deref(p, 7);
}
// foo
lqd	$3,0($3)
bi	$0
// bar
lqd	$15,0($3)
lqd	$14,16($3)
rotqbyi	$16,$15,4
rotqbyi	$13,$15,8
rotqbyi	$11,$15,12
rotqbyi	$6,$14,4
fa	$12,$16,$15
rotqbyi	$4,$14,8
rotqbyi	$2,$14,12
fa	$10,$12,$13
fa	$8,$10,$11
fa	$7,$8,$14
fa	$5,$7,$6
fa	$3,$5,$4
fa	$3,$3,$2
bi	$0
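The index arithmetic in “deref” is easy to sanity-check on a host compiler with a portable model (plain C++, no SPU intrinsics; “deref_model” is an illustrative name, not from the original): splitting a float index into a 16-byte quadword offset and a byte offset must select the same element as ordinary indexing.

```cpp
#include <cassert>
#include <cstring>

// Host-side model of deref(): split the float index into a quadword
// offset ((i & ~3) / 4 quadwords) and a byte offset ((i & 3) * 4 bytes),
// then read the float from that position, mimicking lqd + rotqby.
float deref_model(const float* p, int i)
{
	int vec_offset = (i & ~3) / 4;  // which 16-byte quadword
	int byte_offset = (i & 3) * 4;  // byte position inside it

	unsigned char qw[16];           // the "loaded" quadword
	std::memcpy(qw, reinterpret_cast<const unsigned char*>(p) + vec_offset * 16, 16);

	float f;
	std::memcpy(&f, qw + byte_offset, sizeof(float));
	return f;
}
```

For every index, `deref_model(p, i)` should agree with `p[i]`, which is exactly the invariant the SPU version relies on.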

How to stop losing alignment information

So we’ve got a very fast way to load floats, but only if the pointer is aligned. We can add an assert to “foo” to make sure the pointer is always aligned at run-time, but I want a compile-time check. For this I’m going to turn to templates and template specialisation for help. A class AP (short for AlignedPointer) takes 2 template arguments: T for the type of the pointer, and Align for the pointer’s alignment. I’m not going to write about all the methods here; instead have a look at the code and the comments. Most of the operators in APBase are just there to allow us to drop AP in everywhere we used to have float* without needing any other changes.

#define ALIGN(a) __attribute__((aligned((a))))
#define ALWAYS_INLINE __attribute__((always_inline))

template<typename T>
class APBase
{
protected:
	T * ptr;
public:
	/* default constructor just makes a null pointer */
	ALWAYS_INLINE APBase()
	{
		this->ptr = __null;
	}
	/* cast operator allows automatic conversion back to
	the pointer */
	ALWAYS_INLINE operator T*() const
	{  
		return ptr; 
	}
	/* standard dereference ignores the alignment and returns
	the data */
	ALWAYS_INLINE T operator*() const 
	{
		return *ptr;
	}
	/* brackets operator also just behaves like standard */
	ALWAYS_INLINE T operator[](int i)
	{
		return ptr[i];
	}
	/* only allow casting pointer to int on 32-bit systems */
#ifdef _LP32
	ALWAYS_INLINE operator int()
	{
		return (int)ptr;
	}
#endif
	ALWAYS_INLINE operator long long()
	{
		return (long long)ptr;
	}
};

/* base AP class doesn't do anything special
rely on the specialisations for all the real code */
template<typename T, int Align>
class AP : public APBase<T>
{
};

/* standard float pointer can be constructed from any
float pointer and can copy construct from any other
AP<float> instance */
template<>
class AP<float, __alignof(float)> : public APBase<float>
{
public:
	ALWAYS_INLINE AP()
	{
		this->ptr = __null;
	}
	ALWAYS_INLINE AP(float* ptr)
	{
		this->ptr = ptr;
	}
	/* vector float pointers are already aligned so can copy them */
	ALWAYS_INLINE AP(vector float* ptr)
	{
		this->ptr = (float*)ptr;
	}
	// can copy any other AP<float> to this one
	template<int align>
	ALWAYS_INLINE AP(const AP<float, align>& other)
	{
		this->ptr = (float*)other;
	}
};

/* 16 byte aligned float pointer can do optimized
loads, but we need to ensure it always contains an aligned
pointer */
template<>
class AP<float, __alignof(vector float)> : public APBase<float>
{
public:
	ALWAYS_INLINE AP()
	{
		this->ptr = __null;
	}
	/* force explicit constructor to stop
	any old pointer silently being assigned here.
	if someone calls the constructor we can assume they knew what
	they were doing */
	explicit ALWAYS_INLINE AP(float* ptr)
	{
		this->ptr = ptr;
	}
	/* vector float pointers are already aligned so can copy them */
	ALWAYS_INLINE AP(vector float* ptr)
	{
		this->ptr = (float*)ptr;
	}
	/* allow aligned float to be cast back to vector float* */
	operator vector float* () const
	{  
		return (vector float*)ptr; 
	}
	/* [] operator is the optimized one from earlier */
	ALWAYS_INLINE float operator[](int i)
	{
		if (__builtin_constant_p(i))
		{
			int vec_offset = (i & ~3) / 4;
			int byte_offset = (i & 3) * 4;
			qword qw = *((qword*)(ptr) + vec_offset);
			qw = si_rotqby(qw, si_from_int(byte_offset));
			return si_to_float(qw);
		}
		else
		{
			return ptr[i];
		}
	}
	/* standard dereference is now optimal version */
	ALWAYS_INLINE float operator*() const 
	{
		return si_to_float(*((qword*)(ptr)));
	}
};

/* standard vector class with 2 constructors:
one for unaligned float pointers, and
one for aligned pointers */
class Vector
{
	union
	{
		struct
		{
			float x, y, z, w;
		};
		vector float vf;
	};
public:
	Vector(float* f)
	{
		x = f[0];
		y = f[1];
		z = f[2];
		w = f[3];
	}
	/* can just do a vector float load for aligned pointer */
	Vector(AP<float, 16> f)
	{
		vf = *(vector float*)f;
	}
} ALIGN(16);


ALIGN(16) float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
void makeVectors(Vector& a, Vector& b)
{
	AP<float, 16> ap(data);
	a = Vector(data);
	b = Vector(ap);
}

void foo(float* fp)
{
	AP<float, 4> ap0;
	AP<float, 16> ap1;
	ap0 = fp;
	ap1 = fp; // error, possibly unsafe assignment
	ap1 = (AP<float, 16>)fp;
	fp = ap0;
	fp = ap1;
	ap0 = ap1;
	ap1 = ap0; // error, also possibly unsafe
	float f = *ap0 + *ap1;
	int i = (int)ap0;
	i = (int)ap1;
	long long l = (long long)ap0;
	l = (long long)ap1;
	vector float* vf;
	vf = (vector float*)ap0; // error, can't cast 4-byte aligned pointer to vector pointer
	vf = (vector float*)ap1;
	ap0 = vf;
	ap1 = vf;
}
This post is getting quite long so I won’t post the assembly for the code above, but try it out. The constructor for “b” in “makeVectors” takes only 2 lines of assembly. Loading “a” takes almost 20 lines. I know there are better ways to load a vector from an unaligned pointer that take fewer than 20 instructions, but that’s not what I’m interested in here. Also, as “foo” shows, these classes allow for compile-time error checking for safe casts between types, particularly disallowing casting an unaligned pointer into an aligned one, and preventing casting a pointer to an int on 64-bit systems.
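The compile-time checking idea isn’t SPU-specific, and the core of it can be sketched in portable C++ (“AlignedPtr” is an illustrative name, not part of the original code): an explicit constructor blocks silent unaligned-to-aligned conversion, while conversions that only weaken the guarantee stay implicit.

```cpp
#include <cassert>
#include <cstdint>

// Portable sketch of the AP idea: Align records the pointer's
// guaranteed alignment in the type itself.
template<typename T, int Align>
class AlignedPtr
{
	T* ptr;
public:
	AlignedPtr() : ptr(nullptr) {}

	// explicit: a plain T* carries no alignment guarantee, so the
	// caller must opt in, as in the AP<float, 16> specialisation
	explicit AlignedPtr(T* p) : ptr(p)
	{
		assert(reinterpret_cast<std::uintptr_t>(p) % Align == 0);
	}

	// converting from a more-aligned pointer only weakens the
	// guarantee, so it may be implicit; the reverse direction is
	// rejected at compile time
	template<int A2>
	AlignedPtr(const AlignedPtr<T, A2>& other) : ptr(static_cast<T*>(other))
	{
		static_assert(A2 >= Align, "cannot silently strengthen an alignment guarantee");
	}

	operator T*() const { return ptr; }
	T operator[](int i) const { return ptr[i]; }
};
```

With this sketch, `AlignedPtr<float, 4> ap4 = ap16;` compiles, while assigning the other way, or constructing from a raw `float*` without the explicit cast, is a compile error, matching the behaviour demonstrated in “foo” above.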

Efficient dot-product on SPU

The Vector constructor is a nice example of where to use this class, but I wanted a more realistic test so tried it out on dot products. Below are 5 different variants of dot-product. I know there are better ways to do this, so please don’t comment about better implementations; it’s just a nice example. Not surprisingly, “dot” is bad, coming in at 32 instructions. I expected “dot4” to be best. The union trick is used by many people, and the vector in the union should have hinted to the compiler to use efficient loads to get each element, but this is not the case. “dot5” wins with 16 instructions to “dot4”’s 30. Anyway, that’s it, now to find something else to speed up.

float dot(float* a, float* b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot2(AP<float,4> a, AP<float,4> b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot3(AP<float,16> a, AP<float,16> b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot4(vector float* a, vector float* b)
{
	union
	{
		vector float vf;
		float f[4];
	} u;

	u.vf = *a * *b;
	return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
float dot5(vector float* a, vector float* b)
{
	vector float vf = *a * *b;
	AP<float, 16> ap(&vf);
	return ap[0] + ap[1] + ap[2] + ap[3];
}
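For what it’s worth, the arithmetic the variants share can be sanity-checked on a host compiler with a portable equivalent (plain arrays standing in for vector float; the names here are illustrative): the union-style horizontal sum of dot4 must agree with the plain scalar dot product.

```cpp
#include <cassert>

// scalar reference, the same arithmetic as dot()
float dot_ref(const float* a, const float* b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}

// portable stand-in for dot4's union trick: elementwise multiply
// into a local array, then sum the four lanes horizontally
float dot_union(const float* a, const float* b)
{
	float lanes[4];                 // on SPU this would alias a vector float
	for (int i = 0; i < 4; ++i)
		lanes[i] = a[i] * b[i];
	return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```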

8 Responses to Why SPUs need aligned data

  1. Jonathan says:

    Nice article 🙂

    You can (I think) get the behaviour you’re looking for by using a typedef:

    // align.c
    typedef float __attribute__((aligned(16))) float_a16;
    float foo(float_a16* p) {
    return p[0];
    }

    $ spu-elf-gcc align.c -O -S -o-
    ...
    foo:
    lqd $3,0($3)
    bi $lr
    ...

    (using fsf gcc-4.5.2)

    • petecubano says:

      Nice! Of all the things I tried, that wasn’t one of them. I didn’t mind how we get the alignment information through to GCC, I just wanted a way to do it. I would encourage all programmers to use typedefs like that instead of just float* everywhere. It’s much more expressive.

      The only issue is that it allows this. If only there was a way to get GCC to error when losing alignment info!

      void f(float* a)
      {
      foo(a);
      }

  2. Jonathan says:

    Well, you can, but that won’t work to access anything but a single element.

    Dereferencing p[1] gives a very nasty

    ai $3,$3,4
    lqd $3,0($3) // zeros low bits, so ai 4 is effectively a no-op
    bi $lr

    Attempting to declare an array containing float_a16 elements results in a compile-time error (alignment of array elements is greater than element size)

  3. petecubano says:

    Damn. Well, that’s no good. GCC should have put the 4 in the lqd instruction.

    Tried changing it to this, so that the alignment is on the pointer, but it silently ignores it.

    typedef float* __attribute__((aligned(16))) floatp_a16;
    float foo(floatp_a16 p) {
    return p[0];
    }

  4. Jonathan says:

    Am I able to ask what compiler you’re using? For deref(), I’m seeing rotqby instructions generated, not rotqbyi. It isn’t difficult to convince the compiler to generate them (a few extra explicit checks and si_rotqbyi() calls), but clearly my compiler isn’t smart enough to do so.

  5. petecubano says:

    I’m not sure if I can say, but I’m going to anyway. I’m using the Sony SPU compiler. No idea how up-to-date it is with GCC updates or how many Sony-specific fixes it contains, but that’s the one.

    Yeah, I should have used rotqbyi in my code as I had the __builtin_constant_p intrinsic anyway. Could also have checked for shift by 0 if the compiler wasn’t optimizing that away.

    • Jonathan says:

      I suspect there’s a few tweaks in there that are missing from my compiler 🙂

      The flip side of that is that dot4() and dot5() generate almost identical code for me – dot4() is slightly better because dot5() has some extra stack manipulation instructions – cruft that should be optimised out by gcc (but isn’t).

      (dot and dot2 run in 38 cycles, dot3, dot4 and dot5 in 34. With -ffast-math, dot, dot2 and dot3 run two cycles faster, making dot3 the fastest in this somewhat unrepresentative test 🙂

      I do like the class you’ve written – it performs well, it makes alignment explicit which is very useful, and it is quite neat to use. On the other hand, I think I’d still prefer to be using machine vectors directly – that avoids the alignment problem and will work directly with intrinsics (amongst other things).

      Oh – and I didn’t know about __builtin_constant_p() before I read this post. I’ll be making good use of that, I think 😀

      • petecubano says:

        I think there are tweaks missing from my compiler too, especially on dot4 which is 30 instructions, and contains many useless shifts. I’m surprised -ffast-math improves things on this example. Surely SPUs should have this option on by default as they don’t have FP exceptions and NaNs.

        I know what you mean about preferring to use vectors directly. That’s what I prefer too. I do think there may still be a use for aligned-pointer information like this, especially as people move less vector-heavy code to SPUs, but I’m happy to be proved wrong if any game devs want to comment.

        Glad you like the __builtin_constant_p intrinsic. It’s very useful here. My only worry with the way I used it here is that the compiler will think the function is quite large at first and so it may limit inlining, but the branch and half the function will always be removed. Perhaps GCC needs better ways to evaluate the cost of functions with a constant-checking branch.
