Number crunching using Single precision SSE regs [Archive] - thinBasic: Basic Programming Language Community Forum

Charles Pegge

25-07-2008, 17:25

Intel and AMD processors (in 32-bit mode) support single-precision arithmetic on four numbers at a time using the SSE registers.

Just a few caveats though:

This is ideal for vector processing, but beware the loss of accuracy with division, reciprocals and square roots. Accuracy has been compromised for the sake of speed, and I found that the Fibonacci ratio calculated in the example below is accurate to only 4 digits after applying the reciprocal (rcpps).

I also found that my AMD64 did not support operations directly from memory. All operands have to be loaded to the SSE registers first.

Also, in older processors (pre-2006), FPU instructions cannot be mingled with SSE instructions since they share part of the same hardware.

' Floating point vector maths using SIMD instructions

uses "Oxygen"

type Txyzw
x as single
y as single
z as single
w as single
end type

dim v1, v2 as Txyzw

v1.x = 1
v1.y = 2
v1.z = 4
v1.w = 8

' NB
' AMD does not support adding from memory eg addps xmm0,[#v1]
' loss of precision when doing division, square roots and reciprocals

dim SSE_Demo as string = "
'how accurate is 1/3 ?
movups xmm0, [#v1] ' load
movups xmm1, xmm0 ' copy
addps xmm0, xmm1 ' Add
addps xmm0, xmm1 ' Add
divps xmm1, xmm0 ' divide
mulps xmm1, xmm1 ' multiply
sqrtps xmm1, xmm1 ' square root
'rcpps xmm1, xmm1 ' reciprocal
movups [#v2], xmm1 ' save
'ret
'
' First number is the fibonacci ratio?
' it should be 1.618033.. but does not quite make it
movups xmm0, [#v1] ' load
movups xmm1, xmm0 ' copy
addps xmm0, xmm0 ' double
mulps xmm0, xmm0 ' square
addps xmm0, xmm1 ' Add
sqrtps xmm0, xmm0 ' square root
subps xmm0, xmm1 ' subtract
addps xmm1, xmm1 ' double
divps xmm0, xmm1 ' divide
rcpps xmm0, xmm0 ' reciprocal
movups [#v2], xmm0 ' save
ret

"

o2_asmo SSE_Demo
if len(o2_error) then
msgbox 0, "Assembly error"+$CRLF+O2_Error+$CRLF+O2_View(SSE_Demo)
else
o2_exec
msgbox 0, STR$(v2.x)+STR$(v2.y)+STR$(v2.z)+STR$(v2.w)
'msgbox 0,O2_View(SSE_Demo)
endif

kryton9

26-07-2008, 09:12

Charles, I am not sure if you knew this or not, but nVidia seems to have written a language for researchers to create programs that use the power of the GPU. Since you are a low-level power coder, you might want to check this out. I think you have access to over 100 cores in the GPU at the moment :)

It is called CUDA. Here is the link:
http://www.nvidia.com/object/cuda_what_is.html

Charles Pegge

26-07-2008, 09:54

Thanks Kent - I will investigate. The CUDA driver, which is downloading now, is quite a lump: 72 Megs!

If we can devise light-weight support for GPU calculations it will be well worth the effort.

Using the x86 SIMD instructions for matrix calculations is not ideal, giving only a 2x improvement over the FPU. There are some new instructions in SSE4 which improve matters, but these made their appearance in 2007, so most CPUs out there won't support them.

Petr Schreiber

26-07-2008, 10:58

Hi Charles,

thanks a lot, but it did not work on my AMD64. How can I tweak it to make it run? :)

Petr

Charles Pegge

26-07-2008, 11:36

Hi Petr,

You could try commenting out instruction lines to see which ones are disruptive. We know that movups and addps work from your previous demo.

My cpu is an Athlon 64 X2.

Petr Schreiber

26-07-2008, 11:44

All is ok,

I had an old Oxygen DLL in that directory, don't know why ::)

So it works well now, thanks!

Petr

Charles Pegge

26-07-2008, 12:32

Phew! :)

I've adapted an Intel example of a 4x4 matrix multiply with SSE2 instructions. I don't understand the way it shuffles data around, but I'll post it here ASAP. Perhaps you will be able to tell me how it works :).

Charles Pegge

26-07-2008, 14:13

Okay, here are two versions for 4x4 Matrix multiplication for comparison: SSE (Intel) and FPU (my code)

' Floating point vector maths using SIMD instructions
' http://download.intel.com/design/PentiumIII/sml/24504501.pdf
' 4*4 Matrix multiply

' Also includes FPU-based alternative

uses "Oxygen"

type Txyzw
x as single
y as single
z as single
w as single
end type

dim va(16),vb(16),vc(16) as single

'va(1) = 1
'va(2) = 2
'va(3) = 4
'va(4) = 8

va(1) = 1
va(5) = 2
va(9) = 4
va(13) =8

dim i as long
for i=1 to 16 : vb(i)=1 : next

dim SSE_Demo as string = "

'call fpu_matrix_mul
'ret

; see http://download.intel.com/design/PentiumIII/sml/24504501.pdf
;--------------
sse_matrix_mul:
;--------------
mov edx, #vb ; src1
mov ecx, #va ; src2
mov eax, #vc ; dst
;
movss xmm0, [edx]
movups xmm1, [ecx]
shufps xmm0, xmm0, 0
movss xmm2, [edx+4]
mulps xmm0, xmm1
shufps xmm2, xmm2, 0
movups xmm3, [ecx+16]
movss xmm7, [edx+8]
mulps xmm2, xmm3
shufps xmm7, xmm7, 0
addps xmm0, xmm2
movups xmm4, [ecx+32]
movss xmm2, [edx+12]
mulps xmm7, xmm4
shufps xmm2, xmm2, 0
addps xmm0, xmm7
movups xmm5, [ecx+48]
movss xmm6, [edx+16]
mulps xmm2, xmm5
movss xmm7, [edx+20]
shufps xmm6, xmm6, 0
addps xmm0, xmm2
shufps xmm7, xmm7, 0
movlps [eax], xmm0
movhps [eax+8], xmm0
mulps xmm7, xmm3
movss xmm0, [edx+24]
mulps xmm6, xmm1
shufps xmm0, xmm0, 0
addps xmm6, xmm7
mulps xmm0, xmm4
movss xmm2, [edx+36]
addps xmm6, xmm0
movss xmm0, [edx+28]
movss xmm7, [edx+32]
shufps xmm0, xmm0, 0
shufps xmm7, xmm7, 0
mulps xmm0, xmm5
mulps xmm7, xmm1
addps xmm6, xmm0
shufps xmm2, xmm2, 0
movlps [eax+16], xmm6
movhps [eax+24], xmm6
mulps xmm2, xmm3
movss xmm6, [edx+40]
addps xmm7, xmm2
shufps xmm6, xmm6, 0
movss xmm2, [edx+44]
mulps xmm6, xmm4
shufps xmm2, xmm2, 0
addps xmm7, xmm6
mulps xmm2, xmm5
movss xmm0, [edx+52]
addps xmm7, xmm2
shufps xmm0, xmm0, 0
movlps [eax+32], xmm7
movss xmm2, [edx+48]
movhps [eax+40], xmm7
mulps xmm0, xmm3
shufps xmm2, xmm2, 0
movss xmm6, [edx+56]
mulps xmm2, xmm1
shufps xmm6, xmm6, 0
addps xmm2, xmm0
mulps xmm6, xmm4
movss xmm7, [edx+60]
shufps xmm7, xmm7, 0
addps xmm2, xmm6
mulps xmm7, xmm5
addps xmm2, xmm7
movups [eax+48], xmm2
ret
;--------------

;--------------
fpu_matrix_mul:
;--------------
mov ecx,#va
mov edx,#vb
mov eax,#vc

block:
;-----
call column
call column
call column
call column
ret

column:
;------
call cell
call cell
call cell
call cell
add edx,16
sub ecx,16
ret

cell: ' row A * column B
;-----------------------
fld dword ptr [ecx ]
fmul dword ptr [edx ]
fld dword ptr [ecx+16]
fmul dword ptr [edx+04]
fld dword ptr [ecx+32]
fmul dword ptr [edx+08]
fld dword ptr [ecx+48]
fmul dword ptr [edx+12]
faddp st(1),st(0)
faddp st(1),st(0)
faddp st(1),st(0)
fstp dword ptr [eax]
add eax,4
add ecx,4
ret

"

o2_asmo SSE_Demo
if len(o2_error) then
msgbox 0, "Assembly error"+$CRLF+O2_Error+$CRLF+O2_View(SSE_Demo)
else
o2_exec
msgbox 0, STR$(vc(01))+STR$(vc(02))+STR$(vc(03))+STR$(vc(04))+$cr _
+STR$(vc(05))+STR$(vc(06))+STR$(vc(07))+STR$(vc(08))+$cr _
+STR$(vc(09))+STR$(vc(10))+STR$(vc(11))+STR$(vc(12))+$cr _
+STR$(vc(13))+STR$(vc(14))+STR$(vc(15))+STR$(vc(16)) '+$cr _
'msgbox 0,O2_View(SSE_Demo)
endif

Petr Schreiber

26-07-2008, 18:31

Hi Charles,

cool code indeed!
SSE performs 1.25x faster than your classic code. Not a bad result for your solution, as I would have expected SSE to make it a lot faster :)

Petr

Charles Pegge

26-07-2008, 19:09

There may be some overhead in your test loop, Petr. Intel suggests 2.1x faster than their FPU equivalent. I'm getting 0.1 secs against 0.17 secs under PB. But in any case it is not a major leap in performance. SSE2 is not flexible enough to do this matrix stuff efficiently.

There is still some additional efficiency to squeeze out of the FPU version by expanding some of the subroutines and maybe using more registers - but this one was compact and simple to write.

PS: Trouble with CUDA - Nvidia GeForce 7600 too old?

Petr Schreiber

26-07-2008, 19:16

Hi Charles,

maybe. I had SSE in buffer 1 and FPU in buffer 2, which may affect it. I used 1000000 loops in ThinBasic, so that could be it.

The 7600 is a great card, I eNVy you :) But similarly to my GF6, it has the "old" architecture. I think there are no stream processors included in our series of GeForce. CUDA runs on GF 8600, 8800 ... and higher, I think.

Petr

kryton9

27-07-2008, 00:58

I was shocked, 7600 too old? That is a surprise for sure!

Petr Schreiber

27-07-2008, 08:19

GF8, 9 and 2xx use a unified shader architecture. That means there is no vertex-only or pixel-only shader processing; instead there are general-purpose hybrid shader "units" which can handle various tasks.

Petr

kryton9

28-07-2008, 03:49

It sure can get confusing with all this graphics stuff. Amazing how game companies seem to time releases well to use available technology at release time.