More power to the C++ gurus

I’ve seen some discussions lately on the web about POD types in C++. These are classes without complicated constructors, assignment operators, etc., and they’re extremely useful for defining things like vectors (mathematical float vectors, not the standard library container).

Trouble is, in games your expert programmer will define the vector in the most efficient POD way and everything will run well. That is, until someone less obsessed with the low-level details of C++ comes along, flips to the appropriate page of their object-oriented programming book, and adds a complex constructor to the class. Everything will continue to work, but there’s a very good chance that programmer just wreaked havoc on the frame rate.

So, here’s what I’m proposing. I want an attribute, just like the __attribute__ extension GCC already has, which says “this is a POD class, so don’t let anyone add anything to it that stops it being POD”. Note, I don’t want the attribute to make the class POD; it’s already POD by design. I want the attribute to enforce that property, to stop anyone doing anything unwanted that’s allowed by the language but will make the engine gurus angry.
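To make that concrete, here’s a sketch of how the hypothetical attribute might be used (no such attribute exists in GCC or any other compiler; the name and syntax are purely illustrative):

/* hypothetical, purely illustrative -- no compiler supports this */
struct __attribute__((pod)) Vec4
{
	float x, y, z, w;
	/* if a well-meaning colleague later adds
	       Vec4() : x(0.0f), y(0.0f), z(0.0f), w(0.0f) {}
	   the compiler rejects it: the class was declared POD and a
	   user-defined constructor makes it non-POD */
};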

Anyway, just an idea, but one that I think could really be valuable. I’ve also been thinking that a command line option to ban calling virtual functions or malloc from an __attribute__((hot)) function would be nice, or maybe a new attribute that means “so hot you can’t call virtuals from in here”.

Also, I just realised the POD attribute isn’t strictly necessary, as you could define a union containing an element of the class with a comment next to it saying that the union is guaranteeing the POD-ness of the class. But I still like the idea of having more attributes to control allowed code (and give error messages when the rules are broken), and not just hint things to the compiler as most current attributes do.
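For anyone who hasn’t seen that union trick, here’s a minimal sketch (Vec4 and the checker union are names I’ve made up): C++03 only allows POD types as union members, so the union turns any non-POD change into a compile error.

struct Vec4
{
	float x, y, z, w;
};

/* this union guarantees the POD-ness of Vec4: C++03 requires union
   members to be POD, so adding a constructor, destructor or virtual
   function to Vec4 makes this stop compiling */
union Vec4PodCheck
{
	Vec4 v;
};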


Why SPUs need aligned data

SPU programming is a daunting prospect at the best of times. Can’t get useful data without first issuing some DMAs. Can’t get too much data because the code and the data must never exceed 256KB. Can’t load/store/process anything other than a 16-byte vector. This last one is what I’m going to focus on optimizing here.
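(For context on the first of those constraints, a minimal DMA fetch sketch using the MFC calls from spu_mfcio.h; the function name and parameters are mine, and “ea” is the effective address of the source buffer in main memory.)

#include <spu_mfcio.h>

void fetch(void* ls_buf, unsigned long long ea, unsigned int size, unsigned int tag)
{
	mfc_get(ls_buf, ea, size, tag, 0, 0);	// start the DMA into local store
	mfc_write_tag_mask(1 << tag);		// select just this tag group
	mfc_read_tag_status_all();		// block until the transfer completes
}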

Something as simple as dereferencing a float pointer becomes complicated when the hardware only allows loading 16-byte vectors. To do this we need to load the 16-byte vector containing our float, then rotate the float into the leftmost slot of the vector. For example, this code becomes the following assembly.

float foo(float* p)
{
	return p[0];
}
ori     $2,$3,0       // make a copy of "p"
lqd     $3,0($3)      // load the 16-bytes containing p[0]
rotqby  $3,$3,$2      // move p[0] into the leftmost slot of the vector
bi      $0            // return
But what if we know the pointer’s alignment? Even better, what if we know p is aligned to 16? The previous example is reduced to this.
lqd     $3,0($3)      // load the 16-bytes starting with p[0]
bi      $0            // return
2 instructions. 2! And one of them was a return, unrelated to our pointer dereference. Well, that’s us done then. Just got to align the pointer and the compiler will generate this assembly. Unfortunately, no. C++ doesn’t allow alignment specifiers on parameters, or silently ignores the alignment when it does accept it. If C++ would let us write this, we’d be done.
float foo(float __attribute__((aligned((16))))* p)
{
	return p[0];
}
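As an aside, newer GCC releases (4.7 and later, so quite possibly not the SPU toolchain) provide __builtin_assume_aligned, which makes the same promise inside the function body. A sketch, untested on SPU:

float foo(float* p)
{
	// promise the compiler that p is 16-byte aligned
	float* ap = (float*)__builtin_assume_aligned(p, 16);
	return ap[0];
}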

SPU Intrinsics to the Rescue

Here’s where a little intrinsic magic helps. The SPU compiler knows that vector pointers are aligned to 16 bytes. Casting the float pointer to a vector pointer and dereferencing it will give us the aligned load instruction from the assembly. The rotate instruction is required if the float element we want isn’t on a 16-byte boundary, but now the rotate is by a constant, which is always better as it gives the compiler more information. I’ve separated this out into 2 functions. The “deref” function contains the magic, and “foo” is there to show that calling it with constant offsets optimises away to the best case code. “bar” shows how the SPU compiler can now perform common subexpression elimination to remove the multiple identical vector loads. The original “bar” code would need 8 loads and 8 rotates here. Now we have only 2 loads and 6 rotates. Note 2 rotates were removed because rotating by 0 is redundant.

#include <spu_intrinsics.h>

float deref(float* p, int i) __attribute__((always_inline));
float deref(float* p, int i)
{
	// only do the fast version on constant offsets.
	// the compiler can do a better job than we can with non-constant offsets
	if (__builtin_constant_p(i))
	{
		int vec_offset = (i & ~3) / 4;
		int byte_offset = (i & 3) * 4;
		qword qw = *((qword*)(p) + vec_offset);
		qw = si_rotqby(qw, si_from_int(byte_offset));
		return si_to_float(qw);
	}
	else
	{
		return p[i];
	}
}
float foo(float* p)
{
	return deref(p, 0);
}
float bar(float* p)
{
	return deref(p, 0) + deref(p, 1) + deref(p, 2) + deref(p, 3)
		+ deref(p, 4) + deref(p, 5) + deref(p, 6) + deref(p, 7);
}
// foo
lqd	$3,0($3)
bi	$0
//bar
lqd	$15,0($3)
lqd	$14,16($3)
rotqbyi	$16,$15,4
rotqbyi	$13,$15,8
rotqbyi	$11,$15,12
rotqbyi	$6,$14,4
fa	$12,$16,$15
rotqbyi	$4,$14,8
rotqbyi	$2,$14,12
fa	$10,$12,$13
fa	$8,$10,$11
fa	$7,$8,$14
fa	$5,$7,$6
fa	$3,$5,$4
fa	$3,$3,$2
bi	$0

How to stop losing alignment information

So we’ve got a very fast way to load floats, but only if the pointer is aligned. We can add an assert to “foo” to make sure the pointer is always aligned at run-time, but I want a compile-time check. For this I’m going to turn to templates and template specialisation for help. A class AP (short for AlignedPointer) takes 2 template arguments: T, the type of the pointer, and Align, the pointer’s alignment. I’m not going to write about all the methods here; instead, have a look at the code and the comments. Most of the operators in APBase are just there to allow us to drop in AP everywhere we used to have float* without needing any other changes.

#define ALIGN(a) __attribute__((aligned((a))))
#define ALWAYS_INLINE __attribute__((always_inline))

template<typename T>
class APBase
{
protected:
	T * ptr;
public:
	/* default constructor just makes a null pointer */
	ALWAYS_INLINE APBase()
	{
		this->ptr = __null;
	}
	/* cast operator allows automatic conversion back to
	the pointer */
	ALWAYS_INLINE operator T*() const
	{  
		return ptr; 
	}
	/* standard dereference ignores the alignment and returns
	the data */
	ALWAYS_INLINE T operator*() const 
	{
		return *ptr;
	}
	/* brackets operator also just behaves like standard */
	ALWAYS_INLINE T operator[](int i)
	{
		return ptr[i];
	}
	/* only allow casting pointer to int on 32-bit systems */
#ifndef __LP64__
	ALWAYS_INLINE operator int()
	{
		return (int)ptr;
	}
#endif
	ALWAYS_INLINE operator long long()
	{
		return (long long)ptr;
	}
};

/* base AP class doesn't do anything special
rely on the specialisations for all the real code */
template<typename T, int Align>
class AP : public APBase<T>
{
};

/* standard float pointer can be constructed from any
float pointer and can copy construct from any other
AP<float> instance */
template<>
class AP<float, __alignof(float)> : public APBase<float>
{
public:
	ALWAYS_INLINE AP()
	{
		this->ptr = __null;
	}
	ALWAYS_INLINE AP(float* ptr)
	{
		this->ptr = ptr;
	}
	/* vector float pointers are already aligned so can copy them */
	ALWAYS_INLINE AP(vector float* ptr)
	{
		this->ptr = (float*)ptr;
	}
	// can copy any other AP<float> to this one
	template<int align>
	ALWAYS_INLINE AP(const AP<float, align>& other)
	{
		this->ptr = (float*)other;
	}
};

/* 16 byte aligned float pointer can do optimized
loads, but we need to ensure it always contains an aligned
pointer */
template<>
class AP<float, __alignof(vector float)> : public APBase<float>
{
public:
	ALWAYS_INLINE AP()
	{
		this->ptr = __null;
	}
	/* force explicit constructor to stop
	any old pointer silently being assigned here.
	if someone calls the constructor we can assume they knew what
	they were doing */
	explicit ALWAYS_INLINE AP(float* ptr)
	{
		this->ptr = ptr;
	}
	/* vector float pointers are already aligned so can copy them */
	ALWAYS_INLINE AP(vector float* ptr)
	{
		this->ptr = (float*)ptr;
	}
	/* allow aligned float to be cast back to vector float* */
	ALWAYS_INLINE operator vector float* () const
	{  
		return (vector float*)ptr; 
	}
	/* [] operator is the optimized one from earlier */
	ALWAYS_INLINE float operator[](int i)
	{
		if (__builtin_constant_p(i))
		{
			int vec_offset = (i & ~3) / 4;	// qword offset from the pointer
			int byte_offset = (i & 3) * 4;	// byte offset within the qword
			qword qw = *((qword*)(ptr) + vec_offset);
			qw = si_rotqby(qw, si_from_int(byte_offset));
			return si_to_float(qw);
		}
		else
		{
			return ptr[i];
		}
	}
	/* standard dereference is now optimal version */
	ALWAYS_INLINE float operator*() const 
	{
		return si_to_float(*((qword*)(ptr)));
	}
};

/* standard vector class with 2 constructors:
one for unaligned float pointers, and
one for aligned pointers */
class Vector
{
	union
	{
		struct
		{
			float x, y, z, w;
		};
		vector float vf;
	};
public:
	Vector(float* f)
	{
		x = f[0];
		y = f[1];
		z = f[2];
		w = f[3];
	}
	/* can just do a vector float load for aligned pointer */
	Vector(AP<float, 16> f)
	{
		vf = *(vector float*)f;
	}
} ALIGN(16);


ALIGN(16) float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
void makeVectors(Vector& a, Vector& b)
{
	AP<float, 16> ap(data);
	a = Vector(data);
	b = Vector(ap);
}

void foo(float* fp)
{
	AP<float, 4> ap0;
	AP<float, 16> ap1;
	ap0 = fp;
	ap1 = fp; // error, possibly unsafe assignment
	ap1 = (AP<float, 16>)fp;
	fp = ap0;
	fp = ap1;
	ap0 = ap1;
	ap1 = ap0; // error, also possibly unsafe
	float f = *ap0 + *ap1;
	int i = (int)ap0;
	i = (int)ap1;
	long long l = (long long)ap0;
	l = (long long)ap1;
	vector float* vf;
	vf = (vector float*)ap0; // error, can't cast 4-byte aligned pointer to vector pointer
	vf = (vector float*)ap1;
	ap0 = vf;
	ap1 = vf;
}
This post is getting quite long, so I won’t post the assembly for the code above, but try it out. The constructor for “b” in “makeVectors” takes only 2 lines of assembly. Loading “a” takes almost 20 lines. I know there are better ways to load a vector from an unaligned pointer that take fewer than 20 instructions, but that’s not what I’m interested in here. Also, as “foo” shows, these classes allow for compile-time error checking of casts between types, particularly disallowing casting an unaligned pointer into an aligned one, and preventing casting a pointer to an int on 64-bit systems.
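For the cases the compiler can’t check, the run-time assert mentioned earlier still has a place. A minimal sketch (the helper name is mine, not part of the AP class) for validating a raw pointer before wrapping it:

#include <cassert>

inline AP<float, 16> assume_aligned16(float* p)
{
	// the cast to an integer type is purely for the alignment test
	assert((((unsigned long)p) & 15) == 0 && "pointer is not 16-byte aligned");
	return AP<float, 16>(p);
}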

Efficient dot-product on SPU

The Vector constructor is a nice example of where to use this class, but I wanted a more realistic test, so I tried it out on dot products. Below are 5 different variants of dot-product. I know there are better ways to do this, so please don’t comment about better implementations; it’s just a nice example. Not surprisingly, “dot1” is bad, coming in at 32 instructions. I expected “dot4” to be best. The union trick is used by many people, and the vector in the union should have hinted to the compiler to use efficient loads to get each element, but this is not the case. “dot5” wins with 16 instructions to “dot4”’s 30. Anyway, that’s it, now to find something else to speed up.

float dot1(float* a, float* b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot2(AP<float,4> a, AP<float,4> b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot3(AP<float,16> a, AP<float,16> b)
{
	return (a[0] * b[0]) + (a[1] * b[1]) + (a[2] * b[2]) + (a[3] * b[3]);
}
float dot4(vector float* a, vector float* b)
{
	union
	{
		vector float vf;
		float f[4];
	} u;

	u.vf = *a * *b;
	return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
float dot5(vector float* a, vector float* b)
{
	vector float vf = *a * *b;
	AP<float, 16> ap(&vf);
	return ap[0] + ap[1] + ap[2] + ap[3];
}

Data Oriented Design – The Numbers

There’s recently been an interesting discussion on twitter about Data Oriented Design. I’ve followed the arguments, seen the presentations, but I wanted to try it out for myself. More importantly, as a programmer obsessed with optimisation, I wanted to see if it can really make that big a difference to performance. I mean seriously, after over 30 years of x86 processor development, can a virtual call or a couple more bytes in a structure still matter?
To kick things off, I’m going to declare a couple of objects, a Car and a Truck. Using classic (good, according to uni) object oriented design, both of these can be moved, so let’s inherit both from a common Movable class. I’m then going to update all my Movable objects, as would be done once per frame in a game.
typedef ... vec_type; /* some vector type */

class Movable
{
	vec_type pos;
	vec_type speed;
public:
	virtual void update(float timestep)
	{
		pos += speed * timestep;
	}
};

class Car : public Movable
{
	void* model;
	void* wheels[4];
	void* damage;
	void* driver;
	int type;
public:
	virtual void update(float timestep)
	{
		Movable::update(timestep);
	}
};

class Truck : public Movable
{
	void* model;
	void* wheels[6];
	void* damage;
	void* driver;
	int type;
public:
	virtual void update(float timestep)
	{
		Movable::update(timestep);
	}
};
So, nothing too complex here. I’ve padded the Car and Truck with a few extra fields we’d likely see on them, just to get some useless data to fill up the cache when we update each object.
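To see how much dead weight those extra fields add, a quick check like this helps (my own snippet; the exact sizes depend on the target, the pointer width, and what vec_type really is):

#include <cstdio>

int main()
{
	// the point: Car and Truck drag bytes through the cache
	// that update() never reads
	printf("sizeof(Movable) = %u\n", (unsigned)sizeof(Movable));
	printf("sizeof(Car)     = %u\n", (unsigned)sizeof(Car));
	printf("sizeof(Truck)   = %u\n", (unsigned)sizeof(Truck));
	return 0;
}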
Now onto the UpdateManager classes. I’ve written several of these to test out the different methods of updating all the Movable objects. The worst one I could think of was a std::vector of pointers to the Cars and Trucks. All the objects are also “newed” so that they shouldn’t all be contiguous in memory, i.e., I’m going to get lots and lots of cache misses.
#include <vector>

#define NUM_OBJECTS 1000000

class VectorUpdateManager
{
	std::vector<Movable*> objects;
	/* constructors, etc */
public:
	void update(float timestep)
	{
		for (auto it = objects.cbegin(); it != objects.cend(); ++it)
		{
			auto object = *it;
			object->update(timestep);
		}
	}
};
Next up, a fixed-sized array instead of a vector, so that we don’t have the overhead of the iterator doing the update. We still have pointers so the data won’t necessarily be contiguous.
class ArrayUpdateManager
{
	Movable* objects[NUM_OBJECTS];
	/* constructors, etc */
public:
	void update(float timestep)
	{
		for (int i = 0; i < NUM_OBJECTS; i++)
		{
			objects[i]->update(timestep);
		}
	}
};
Then a fixed-sized array of each of Car and Truck, so we have no virtual calls. As the arrays hold objects and not pointers, we also now guarantee the data is contiguous. This should avoid a lot of the slowdown from cache misses.
class ObjectArraysUpdateManager
{
	Car cars[NUM_OBJECTS / 2];
	Truck trucks[NUM_OBJECTS / 2];
	/* constructors, etc */
public:
	void update(float timestep)
	{
		for (int i = 0; i < NUM_OBJECTS / 2; i++)
		{
			cars[i].update(timestep);
		}
		for (int i = 0; i < NUM_OBJECTS / 2; i++)
		{
			trucks[i].update(timestep);
		}
	}
};
And finally, the Data Oriented Design. Here we have an array of just the Movable class so no virtual calls, and the data is as small as necessary for an update.
class DataOrientedUpdateManager
{
	Movable objects[NUM_OBJECTS];
	/* constructors, etc */
public:
	void update(float timestep)
	{
		for (int i = 0; i < NUM_OBJECTS; i++)
		{
			objects[i].update(timestep);
		}
	}
};

Results

So, 4 cases. Vector of pointers, array of pointers, 2 arrays of objects, and the data oriented approach. The times, in µs, are:

Vector of pointers: 86961
Array of pointers: 77188
2 Object arrays: 71645
Data oriented: 37521

Going from std::vector to an array gets us about a 10% speedup. From an array of pointers to arrays of objects, another 7% or so. But losing all that extra bloat and going data oriented gives us close to a 100% speedup. I’m stunned by this. Like I said earlier, I’ve seen the presentations, but seeing the numbers for myself amazes me. I really thought that the enormous effort Intel and AMD have put into caches, memory buses, etc., would lessen the effect of bad coding, but I was wrong. Sure, this was a simple example, and the update method in real games would be more complex, but that would only alleviate the virtual function overhead; from these numbers, it’s the cache misses that dictate performance.
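For reference, the numbers above came from a harness along these lines (my sketch, not the exact code used; it times one full update pass in microseconds):

#include <chrono>
#include <cstdio>

template<typename Manager>
void timeUpdate(Manager& manager, float timestep)
{
	auto start = std::chrono::high_resolution_clock::now();
	manager.update(timestep);
	auto end = std::chrono::high_resolution_clock::now();
	long long us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
	printf("update took %lld us\n", us);
}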

Next up: I can get this down to 6100 µs. Yes, that’s more than 6 times faster than my best result here. And all it took was SSE and for Visual Studio or Intel to do something I still don’t understand.
