View Full Version : x86 assembler...
Johannes
01-03-2011, 17:58
... y u no have clear mnemonics?
I just spent an hour debugging the hell out of my cube root function only to find out it was my internal assembler compare function that was at fault.
For future reference: use ja, jb, etc. for unsigned compares. Use jg, jl and certainly js (that s means something) for signed compares.
On the other hand: I only have to do the nth-power root and the division-with-remainder subroutine (tricky, have to set thinBasic variables directly) and the module will have the same functionality as the thinBasic include script.
Johannes
06-03-2011, 03:13
... I wish I knew how to quit you.
Rewrote the internal addition subroutine in pure assembler and calculating the 10,000th Fibonacci number (*) now runs 51 times faster.
(*) F(10000)= 3364476487 6431783266 6216120051 0754331030 2148460680 0639065647 6997468008 1442166662 3681555955 1363373402 5582065332 6808361593 7373479048 3865268263 0408924630 5643188735 4544369559 8274916066 0209988418 3933864652 7313000888 3026923567 3613135117 5792974378 5441375213 0520504347 7016022647 5831890652 7890855154 3661595829 8727968298 7510631200 5754287834 5321551510 3870818298 9697916131 2785626503 3195487140 2142875326 9818796204 6936097879 9003509623 0229102636 8131493195 2756302278 3762844154 0360584402 5721143349 6118002309 1208287046 0889239623 2883546150 5776583271 2525460935 9112820392 5285393434 6209042452 4892940390 1706233888 9910858410 6518317336 0437470737 9085526317 6432573399 3712871937 5877468974 7992630583 7065742830 1616374089 6917842637 8624212835 2581128205 1637029808 9332099905 7079200643 6742620238 9783111470 0540749984 5925036063 3560933883 8319233867 8305613643 5351892133 2797329081 3373264265 2633989763 9227234078 8292817795 3580570993 6910491754 7080893184 1056146322 3382174656 3732124822 6383092103 2977016480 5472624384 2374862411 4530938122 0656491403 2751086643 3945175121 6152654536 1333111314 0424368548 0510676584 3493523836 9596534280 7176877532 8348234345 5573667197 3139274627 3629108210 6792807847 1803532913 1176778924 6590899386 3545932789 4523777674 4061922403 3763867400 4021330343 2974969020 2832814593 3418826817 6838930720 0363479562 3117103101 2919531697 9460763273 7589253530 7725523759 4378843450 4067715555 7790564504 4301664011 9462580972 2167297586 1502696844 3146952034 6149322911 0597067624 3268515992 8347098912 8470674086 2008587135 0162603120 7190317208 6094081298 3215810772 8207635318 6624611278 2455372085 3236530577 5956430072 5177443150 5153960090 5168603220 3491632226 4088524885 2433158051 5348496224 3484829938 0905070483 4824493274 5373262456 7755879089 1871908036 6205800959 4743150052 4025327097 4699531877 0724376825 9074199396 3226598414 7498193609 2852239450 3970716544 3156421328 1576889080 5878318340 4917434556 2705202235 6484649519 6112460268 3139709750 6938264870 6613264507 6650746115 1267752274 8621598642 5307112984 4118262266 1057163515 0692600298 6170494542 5047491378 1151541399 4155067125 6271197133 2527636319 3960690289 5650288268 6083622410 8205056243 0701794976 1711212330 6607331005 9947366875
Johannes
06-03-2011, 03:21
The script I used to get those 2,090 decimals of F(10000) mentioned above.
Uses "File"
Module "BigInt"
BigInt_SetFormat(10,$SPC,FALSE)
File_Save(App_SourcePath+"fibo10k.txt",BigInt_ToString(BigInt_Fib("10000")))
A simple cut & paste. The ten-digit formatting is already in there.
this is wonderful, it gives instantaneous result
i have tried it with 200,000 digits it takes 3.663193 seconds,
i have checked the results with another math soft and it is the same.
with a million digits the program hangs in memory.
for a future mission you may want to add a primality checking to the library or even factorising feature.
Uses "File" ,"console"
Module "BigInt"
Dim T1,T2 As Quad
HiResTimer_Init
T1 = HiResTimer_Get
BigInt_SetFormat(10,$SPC,FALSE)
FILE_Save(APP_SourcePath+"fibo200k.txt",BigInt_ToString(BigInt_Fib("200000")))
T2 = HiResTimer_Get
PrintL "Script time: " & Format$((T2 - T1)/10^6, "#0.000000") & " seconds"
WaitKey
i am sure that this library can be used in schools and other educational places
Johannes
06-03-2011, 13:39
i have tried it with 200,000 digits it takes 3.663193 seconds,
i have checked the results with another math soft and it is the same.
with a million digits the program hangs in memory.
for a future mission you may want to add a primality checking to the library or even factorising feature.
Strange that 1 million runs out of memory for you because the decimal string takes up less than 209,000 bytes and the Big Integer for that number takes up 86,785 bytes. Since I need three Big Integers to calculate the Fibonacci sequence you would need "only" 458 kBytes of memory to store all that. I've included the results for fibo200k, 500k, 800k and 1M. Execution times on my system were 2.7 seconds, 17.7 seconds, 46.8 seconds and 71.9 seconds respectively.
Prime checking (or factoring, which comes down to the same thing) is a tricky business. I thought about it and rejected it for practical reasons. Up to 64-bit Big Integers it would almost take no time at all since I have specialised 32-bit routines for internal multiplication and division (factoring is done up to the square root of the original_number). But beyond that...
Programming BigInt_IsPrime, _NextPrime and _PreviousPrime is not difficult. It is difficult to keep those functions fast for really big numbers.
sorry i was too hurry to decide ,indeed it takes 1.5 minutes to finish the 1 million digits, my cpu 3.Ghz 2M cash, oldest dual core, but nowadays telling the cpu GHz is useless since the cpu speed now depends on the intel optimizations of the product
500k = 22.7 s
800k = 58.4 s
1000k = 91.19 s
Johannes
06-03-2011, 23:53
... y no have more than eight registers?
Especially because PowerBASIC won't let me use ebp or esp. Tried storing ebp in memory to free it for processing but PB wouldn't let me. I know it's "only in my mind" and a habit I picked up working with ARM assembler for over 20 years. Having sixteen all-purpose registers of which only three are reserved (program counter, link register for call, stack pointer) is so comfortable. But I know x86 architecture has a far superior memory management so I just have to learn to change my behaviour.
What I had was this.
FUNCTION BII_MulOld(BYVAL a AS STRING,BYVAL b AS STRING) AS STRING
LOCAL c AS STRING
LOCAL i AS DWORD
LOCAL j AS DWORD
LOCAL n AS DWORD
LOCAL k AS QUAD
LOCAL l AS QUAD
b = b+$NUL+$NUL+$NUL+$NUL
c = STRING$(LEN(a)+LEN(b)+4,0)
FOR i=1 TO LEN(b)-4 STEP 3
n = CVDWD(MID$(b,i,3)+$NUL)
IF n=0 THEN ITERATE FOR
k = 0
FOR j=1 TO LEN(a)-3 STEP 4
l = n*CVDWD(MID$(a,j,4))+CVDWD(MID$(c,j+i-1,4))+k
k = INT(l/BII_4g)
l = l MOD BII_4g
MID$(c,j+i-1,4) = MKDWD$(l)
NEXT j
WHILE k>0
l = CVDWD(MID$(c,j+i-1,4))+k
k = 0
IF l>=BII_4g THEN l = l-BII_4g : k = 1
MID$(c,j+i-1,4) = MKDWD$(l)
j = j+4
WEND
NEXT i
TrimHighNull(c)
FUNCTION = c
END FUNCTION
And after a lot of thinking I ended up with this.
'----------------------------------------------------------------------------
'
' Multiply two unsigned Big Integers.
'
FUNCTION BII_Mul(BYVAL a AS STRING,BYVAL b AS STRING) AS STRING
#REGISTER NONE
LOCAL c AS STRING
LOCAL i AS DWORD
' Outer loop is slowest so longest number in a.
IF LEN(a)<LEN(b) THEN SWAP a,b
' Reserve memory for product.
c = STRING$(LEN(a)+LEN(b),0)
push3
! mov eax,b ; pointer to string b
! mov ebx,[eax-4] ; length of string b
! shr ebx,2 ; number of words in b
! xor edx,edx ; offset for c
asm_mul1:
! mov ecx,[eax] ; word of b
! cmp ecx,0 ; skip if word zero
! je asm_mul5
! push eax ; put registers on stack
! push ebx
! push edx
' Inner loop multiplies a with a word from b and adds this to c.
! mov esi,a ; pointer to string a
! mov ebx,[esi-4] ; length of string a
! shr ebx,2 ; number of words in a
! mov i,ebx
! mov edi,c ; pointer to string c
! add edi,edx ; add offset
! xor ebx,ebx ; carry for first multiplication
asm_mul2:
! mov eax,[esi] ; word of a
! mul ecx ; multiply with word ofn b
! add eax,[edi] ; add to word of c
! adc edx,0 ; carry for high word
! add eax,ebx ; carry of previous multiplication
! adc edx,0 ; carry for high word
! mov [edi],eax ; word of product
! mov ebx,edx ; save high word as carry for next loop
! add esi,4 ; next word of a
! add edi,4 ; next word of c
! dec dword i ; all words in a done?
! jne asm_mul2 ; until counter is zero
' Process possible carry beyond highest word of a.
! add ebx,[edi] ; add final carry
! mov [edi],ebx ; store word
! jnc asm_mul4 ; no carry means finished
asm_mul3:
! add edi,4 ; next word of a
! inc dword [edi] ; carry set, add zero, so inc
! je asm_mul3 ; carry means herhalen
' End inner loop multiplication.
asm_mul4:
! pop edx ; get registers from stack
! pop ebx
! pop eax
asm_mul5:
! add eax,4 ; next word in b
! add edx,4 ; increase offset for c
! dec ebx ; all words in b done?
! jne asm_mul1 ; until counter is zero
pop3
' Remove possible high words zero.
TrimHighNull(c)
FUNCTION = c
END FUNCTION
It was a lot of work but I'm more than pleased with the result: calculating 3 to the power 32,767 now runs more than 243 times faster. The speed increase compared with the original (interpreted) thinBasic include script is a factor of more than 3200.