Tuesday, May 26, 2009
Monday, April 13, 2009
20090413 photograph
A photo (again quite possibly not taken by me) from a trip with the family for a stroll around Round Hill.
Taken 20051023.
Sunday, April 12, 2009
Thursday, April 9, 2009
20090409 photograph
From the deck of our previous house in Burnie.
The exif timestamp on the photo reveals that it was not taken by me - before 7am? No.
Taken 20050314.
Tuesday, April 7, 2009
20090407 photograph
We traveled to and from Hobart several times while our first child was very young. While we were stopped so that she could feed, I would run around experimenting with the (still new) camera.
Taken in Campbell Town, through trees that have since been removed.
Taken 20041217.
Monday, April 6, 2009
Friday, April 3, 2009
20090403 photograph
A terrible shot in many, many ways (dingy venue, trying to avoid the on-camera flash, focus fail etc), that also captures something quite special from the evening ;)
Taken 20041218.
Wednesday, April 1, 2009
20090401 photograph
I was going through some of my old photos recently and noticed a number that I like but don't have a particularly good excuse to show people, so here's one of them :)
This one was taken shortly after I purchased my Canon 300D (and shortly before the birth of my first child). I really like this picture - the grey of the ocean, the rock/wave connection and the colours on the rock and weed in the foreground.
I've taken the "same" photo on several other occasions, but none have gripped me the same way.
Friday, January 16, 2009
-funroll-loops
In general, C is a lousy language for expressing this kind of parallelism on the SPU. The original loop that 'inspired' this nonsense looks something like:
for (j = 0; j < num_indexes; j += 3) {
    const float *v0, *v1, *v2;
    v0 = (const float *) (vertices + indexes[j+0] * vertex_size);
    v1 = (const float *) (vertices + indexes[j+1] * vertex_size);
    v2 = (const float *) (vertices + indexes[j+2] * vertex_size);
    func(v0, v1, v2);
}
which is quite clear and straightforward to read, but with hidden complexity - the lack of quadword alignment, the way it is expressed as three separate multiply-adds, and the separation into three (unpacked) variables which are repacked inside func().
Unrolled 2
Longer than the other one, but with better odd/even balance and only one shuffle constant. Probably faster.
for (j = 0; j < num_indexes; j += 24) {
    qword* lower_qword = (qword*)&indexes[j];
    qword indices0 = lower_qword[0];
    qword indices1 = lower_qword[1];
    qword indices2 = lower_qword[2];
    qword vs0 = indices0;
    qword vs1 = si_shlqbyi(indices0, 6);
    qword vs3 = si_shlqbyi(indices1, 2);
    qword vs4 = si_shlqbyi(indices1, 8);
    qword vs6 = si_shlqbyi(indices2, 4);
    qword vs7 = si_shlqbyi(indices2, 10);
    // vs2 and vs5 straddle two qwords, so each is stitched from two pieces
    qword tmp2a = si_shlqbyi(indices0, 12);
    qword tmp2b = si_rotqmbyi(indices1, 12|16);
    qword vs2 = si_selb(tmp2a, tmp2b, si_fsmh(0x20));
    qword tmp5a = si_shlqbyi(indices1, 14);
    qword tmp5b = si_rotqmbyi(indices2, 14|16);
    qword vs5 = si_selb(tmp5a, tmp5b, si_fsmh(0x60));
    vs0 = si_shufb(vs0, vs0, SHUFB8(0,A,0,B,0,C,0,0));
    vs1 = si_shufb(vs1, vs1, SHUFB8(0,A,0,B,0,C,0,0));
    vs2 = si_shufb(vs2, vs2, SHUFB8(0,A,0,B,0,C,0,0));
    vs3 = si_shufb(vs3, vs3, SHUFB8(0,A,0,B,0,C,0,0));
    vs4 = si_shufb(vs4, vs4, SHUFB8(0,A,0,B,0,C,0,0));
    vs5 = si_shufb(vs5, vs5, SHUFB8(0,A,0,B,0,C,0,0));
    vs6 = si_shufb(vs6, vs6, SHUFB8(0,A,0,B,0,C,0,0));
    vs7 = si_shufb(vs7, vs7, SHUFB8(0,A,0,B,0,C,0,0));
    vs0 = si_mpya(vs0, vertex_sizes, verticess);
    vs1 = si_mpya(vs1, vertex_sizes, verticess);
    vs2 = si_mpya(vs2, vertex_sizes, verticess);
    vs3 = si_mpya(vs3, vertex_sizes, verticess);
    vs4 = si_mpya(vs4, vertex_sizes, verticess);
    vs5 = si_mpya(vs5, vertex_sizes, verticess);
    vs6 = si_mpya(vs6, vertex_sizes, verticess);
    vs7 = si_mpya(vs7, vertex_sizes, verticess);
    // Partial final batch: each case falls through to the ones below it
    switch(num_indexes - j) {
    default: func(vs7);
    case 21: func(vs6);
    case 18: func(vs5);
    case 15: func(vs4);
    case 12: func(vs3);
    case 9: func(vs2);
    case 6: func(vs1);
    case 3: func(vs0);
    }
}
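For anyone without an SPU to hand, the shuffle and multiply-add steps can be modelled for a single triangle in plain C. This is only a scalar sketch of the intent: the names triangle_addresses, vertex_size and base are made up for illustration, it assumes the 0 entries in SHUFB8 produce zero bytes (the macro isn't shown here), and it does not model how si_mpya treats the sign of its 16-bit operands.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one triangle's worth of work (hypothetical helper).
 * idx holds the three 16-bit indices, base stands in for the vertices
 * pointer, and out mirrors the four 32-bit lanes of the result. */
static void triangle_addresses(const uint16_t idx[3], uint32_t vertex_size,
                               uint32_t base, uint32_t out[4])
{
    /* Assumption: the multiply works on 16-bit operands. */
    assert(vertex_size <= 0xFFFFu);

    for (int lane = 0; lane < 3; lane++) {
        /* Shuffle step: widen each ushort index into its own 32-bit lane. */
        uint32_t widened = idx[lane];
        /* Multiply-add step: index * vertex_size + base. */
        out[lane] = widened * vertex_size + base;
    }

    /* The fourth lane holds a zero index, so it just ends up as base. */
    out[3] = 0u * vertex_size + base;
}
```

Feeding it indices 0, 1, 2 with a vertex_size of 32 and a base of 1000 gives three addresses 32 bytes apart, plus the unused fourth lane.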
Unrolled 1
Shortest form I've found so far. Not a good odd/even balance on the pipeline usage though.
for (j = 0; j < num_indexes; j += 24) {
    qword* lower_qword = (qword*)&indexes[j];
    qword i0 = lower_qword[0];
    qword i1 = lower_qword[1];
    qword i2 = lower_qword[2];
    qword i0r = si_rotqmbyi(i0, -2);
    qword i1r = si_rotqmbyi(i1, -2);
    qword i2r = si_rotqmbyi(i2, -2);
    qword v0 = si_mpya(i0, vertex_sizes, verticess);
    qword v1 = si_mpya(i1, vertex_sizes, verticess);
    qword v2 = si_mpya(i2, vertex_sizes, verticess);
    qword v0r = si_mpya(i0r, vertex_sizes, verticess);
    qword v1r = si_mpya(i1r, vertex_sizes, verticess);
    qword v2r = si_mpya(i2r, vertex_sizes, verticess);
    // Little constant reuse here :\
    qword vs7 = si_shufb(v2r, v2, SHUFB4(c,D,d,0));
    qword vs6 = si_shufb(v2r, v2, SHUFB4(B,b,C,0));
    qword vs5 = si_shufb(v1, v2r, SHUFB4(D,a,0,0));
    vs5 = si_shufb(vs5, v2, SHUFB4(A,B,a,0));
    qword vs4 = si_shufb(v1, v1r, SHUFB4(c,C,d,0));
    qword vs3 = si_shufb(v1, v1r, SHUFB4(A,b,B,0));
    qword vs2 = si_shufb(v0r, v0, SHUFB4(D,d,0,0));
    vs2 = si_shufb(vs2, v1r, SHUFB4(A,B,a,0));
    qword vs1 = si_shufb(v0r, v0, SHUFB4(b,C,c,0));
    qword vs0 = si_shufb(v0r, v0, SHUFB4(A,a,B,0));
    switch(num_indexes - j) {
    default: func(vs7);
    case 21: func(vs6);
    case 18: func(vs5);
    case 15: func(vs4);
    case 12: func(vs3);
    case 9: func(vs2);
    case 6: func(vs1);
    case 3: func(vs0);
    }
}
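Both unrolled versions finish with the same fall-through switch to cope with a final batch of fewer than eight triangles. Here is a portable C sketch of just that control flow, with a hypothetical tri_log and record() standing in for func(), and assuming (as the loops above do) that num_indexes is a multiple of 3. It also shows that triangles within a batch get visited last to first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical log of which triangles were visited, in visiting order. */
struct tri_log {
    size_t count;
    size_t first_index[64];
};

/* Stand-in for func(): remember the index of the triangle's first vertex. */
static void record(struct tri_log *log, size_t first_index)
{
    assert(log->count < 64);
    log->first_index[log->count++] = first_index;
}

/* Visit triangles in batches of eight (24 indexes per batch). */
static void visit_triangles(size_t num_indexes, struct tri_log *log)
{
    assert(num_indexes % 3 == 0);

    for (size_t j = 0; j < num_indexes; j += 24) {
        /* Jump in according to how many indexes remain; each case
         * deliberately falls through to the ones below it. */
        switch (num_indexes - j) {
        default: record(log, j + 21); /* 24 or more left: full batch */
        /* fall through */
        case 21: record(log, j + 18);
        /* fall through */
        case 18: record(log, j + 15);
        /* fall through */
        case 15: record(log, j + 12);
        /* fall through */
        case 12: record(log, j + 9);
        /* fall through */
        case 9:  record(log, j + 6);
        /* fall through */
        case 6:  record(log, j + 3);
        /* fall through */
        case 3:  record(log, j + 0);
        }
    }
}
```

With 30 indexes, the first pass records a full batch of eight and the second pass enters at case 6 and records the remaining two. The real loops still load and compute all eight vectors on that last pass; only the calls are skipped.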
Thursday, January 15, 2009
SPU unaligned loads
Extract three adjacent ushorts from an arbitrary array location.
(Would do a lot better unrolled, I think)
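The loop below only runs on an SPU, so as a rough illustration here is the same idea emulated in portable C on plain byte arrays: load the two aligned 16-byte blocks around the target, shift the first left by the offset, shift the second right by 16 minus the offset, and select between them. All of the names here (shift_left_bytes, shift_right_bytes, extract3_unaligned) are invented for the sketch, and it works on offsets into a buffer rather than real addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a 16-byte block towards byte 0 by n bytes, filling with zeros. */
static void shift_left_bytes(const uint8_t in[16], unsigned n, uint8_t out[16])
{
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + n < 16) ? in[i + n] : 0;
}

/* Shift a 16-byte block away from byte 0 by n bytes, filling with zeros. */
static void shift_right_bytes(const uint8_t in[16], unsigned n, uint8_t out[16])
{
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i >= n) ? in[i - n] : 0;
}

/* Copy three adjacent 16-bit values starting at byte_addr within mem,
 * touching mem only in aligned 16-byte blocks. */
static void extract3_unaligned(const uint8_t *mem, size_t mem_size,
                               size_t byte_addr, uint16_t out[3])
{
    size_t base = byte_addr & ~(size_t)15;       /* aligned block start */
    unsigned offset = (unsigned)(byte_addr & 15);
    uint8_t first[16], second[16], a[16], b[16], merged[16];

    /* Like the SPU loop, this always reads the following block too. */
    assert(base + 32 <= mem_size);
    memcpy(first, mem + base, 16);
    memcpy(second, mem + base + 16, 16);

    shift_left_bytes(first, offset, a);
    shift_right_bytes(second, 16 - offset, b);

    /* Select: the last `offset` bytes come from the second block. */
    for (unsigned i = 0; i < 16; i++)
        merged[i] = (i >= 16 - offset) ? b[i] : a[i];

    memcpy(out, merged, 3 * sizeof out[0]);
}
```

Starting at element 7 of a ushort array (byte offset 14) is the interesting case, since the three values straddle two blocks.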
for (j = 0; j < num_indexes; j += 3) {
    // Determine address of aligned qword containing indexes[j]
    qword lower_qword = si_from_ptr(&indexes[j]);
    // Load qword containing indexes[j] and successor
    qword first = si_lqd(lower_qword, 0);
    qword second = si_lqd(lower_qword, 16);
    // Calculate &indexes[j]&15 - offset of index from 16 byte alignment
    qword offset = si_andi(lower_qword, 15);
    // Generate a mask to select the appropriate parts of first and second
    // form byte select mask from (1 << offset) - 1
    qword one = si_from_uint(1);
    qword mask = si_fsmb(si_sf(one, si_shl(one, offset)));
    // Rotate first and second parts to desired locations
    // This is the key interesting bit, but I'd like to
    // think this could be improved upon...
    first = si_shlqby(first, offset);
    second = si_rotqmby(second, si_ori(offset, 16));
    // Store indexes[j],[j+1],[j+2] in vs.
    qword is = si_selb(first, second, mask);
    // Expand is to uint positioning
    is = si_shufb(is, is, SHUFB8(0,A,0,B,0,C,0,0));
    qword vs = si_mpya(is, (qword)spu_splats(vertex_size),
                       (qword)spu_splats((unsigned)vertices));
    func(vs);
}
Wednesday, January 14, 2009
20090114
Cubular - I wonder how hard it would be to make one...
25c3 - Hours of entertainment.
Perpetual calendar - Just the thing to go with my binary clock.
No great archive in the sky - Backup. (note to self: backup).
Geeks Bearing Gifts - Want.
Twitterville
