Tail recursion optimization?



Vandalf
17-03-2017, 03:23
I have a tail recursive piece of code that could potentially call itself recursively up to thousands of times, and overall be executed up to millions of times each time the program is run. This raises some concerns regarding performance, and the size of the stack available to thinBasic.
That's why I have these questions:

1. Does thinBasic optimize/eliminate tail recursive calls?
2. If the answer to the previous question is no, do I run the risk of overflowing the stack, given that the function in question takes one potentially 240-character string, one 6-character string, and one dword as arguments?
3. Is there any profiling information available for the FILE_Exists function from the FILE module? My recursive function calls it, so it could be invoked millions of times as well. I would like even a rough estimate of how long it takes.



This is the code:

uses "FILE"
$myseparator = "_-"

Function uniquefilename(unsafe As String) As String
If Not FILE_Exists(unsafe) Then Return unsafe
Return sequencefilename(getpathandfilename(unsafe), getdotext(unsafe),0)
End Function

Function sequencefilename(filename As String, dotext As String, _
                          seqnum As DWord) As String
  If Not FILE_Exists(filename & $myseparator & tob32(seqnum) & _
                     dotext) Then _
    Return filename & $myseparator & tob32(seqnum) & dotext
  Return sequencefilename(filename, dotext, seqnum+1)
End Function
tob32 is a function that encodes a number as a base 32 string. As for the other functions, getpathandfilename and getdotext, well, you can probably imagine what they do.

ErosOlmi
17-03-2017, 07:36
Ciao,

Recursion is always a potential source of problems, and it will certainly slow down execution because of the need to allocate and deallocate each function's local stack in a controlled way.
thinBasic does not create real compiled machine-code applications, so the way it handles the local stack is completely different from compiled applications.
In thinBasic, the stack is an array of pointers (one for each stacked call to a function) to hash tables created and destroyed on the fly in heap memory, the same memory used by any other allocated variable.
Every function call has its own hash table used to keep track of the function's local variables and parameters.

That said, you need to run your tests in a controlled environment and see what happens.
There are many factors that influence how many recursive calls thinBasic can withstand and how fast it executes them.
On execution speed we can work to optimize things in some way (for example by developing an external compiled DLL to do part of the job).
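As a starting point for those tests, a throwaway script along these lines could probe how deep recursion can go with parameters shaped like yours (just an untested sketch; String$ is assumed here to repeat a character):

Uses "Console"

Dim longname As String = String$(240, "x")   'dummy ~240-character file name

'---If the script dies at some depth, the last line printed tells you
'   roughly how many nested calls thinBasic survived
For depth As Long = 1000 To 50000 Step 1000
  printl "Depth", depth, "->", probe(depth, longname, ".txt", 0)
Next
WaitKey

'---Recurse "depth" times carrying parameters similar to your real function
Function probe(depth As Long, unsafe As String, dotext As String, seqnum As DWord) As Long
  If depth <= 0 Then Return 0
  Return 1 + probe(depth - 1, unsafe, dotext, seqnum + 1)
End Function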

To your questions:

1. No optimization takes place on recursive functions.
2. That risk is always there. The factors involved are so many that it is hard for me to give a definite answer; that's why I said you need to try.
3. I'm not aware of any performance issue in File_Exists, but if you suspect or have evidence that it can be optimized, I can work on it.
Internally it is implemented like the following (PowerBasic code):

FUNCTION File_Exists(BYVAL FullFileName AS STRING) AS LONG
  FUNCTION = %FALSE
  IF DIR$(FullFileName, %NORMAL OR %READONLY OR %HIDDEN OR %SYSTEM) = "" THEN EXIT FUNCTION
  FUNCTION = %TRUE
END FUNCTION

There are at least two points where I could optimize it a bit: passing the file name by reference, and testing the length of the DIR$ result instead of comparing it against an empty string.


Looking at your functions, maybe you can get some help from the File_PathSplit function: http://www.thinbasic.com/public/products/thinBasic/help/html/index.html?file_pathsplit.htm
Calling a native thinBasic function is always hundreds of times faster than calling script functions.
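For example, something along these lines should do what your getpathandfilename and getdotext helpers do (a quick sketch, to be checked against the help page; the %PATH_* equates are the ones documented for File_PathSplit):

Uses "FILE"

Dim unsafe As String = "C:\users\usr1\tst\ab.txt"

'---Drive + path + file name without extension, e.g. "C:\users\usr1\tst\ab"
Dim sBase As String = FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE)

'---Extension without the leading dot, e.g. "txt"
Dim sExt As String = FILE_PathSplit(unsafe, %PATH_EXT)

MsgBox 0, sBase & " | ." & sExt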

Regarding profiling, if your script does not crash due to recursion ... thinBasic has a rough profiler that can help you.
Just add the following line to your source code

#Profile on, %TRUE
and you will have the following: http://www.thinbasic.com/public/products/thinBasic/help/html/index.html?profile.htm
It measures every script function call, recording execution time.
It also saves a .csv file in the same directory as your script.

Hope this can help.
Keep me informed, I will be happy to help on this massive task if I can.

Ciao
Eros

Vandalf
17-03-2017, 10:29
Thanks, Eros, for the advice about the File_Pathsplit function, and for the insight into how thinBasic operates internally. From your reply, I figure that a typical program would be more likely to run out of heap memory altogether, long before it could overflow the stack.

I have to suggest, then, a seemingly easy optimization for tail calls (not just tail recursive): destroy the previous "local stack" and its pointer, and replace it with a pointer to the new local stack. This could decrease memory usage dramatically, at perhaps a slightly increased runtime cost (the time thinBasic takes to figure out if a call is a tail call or not). Or perhaps a "TAIL" or "tailcall" keyword could be introduced, to indicate to thinBasic that this optimization can be done on a given function or call.
Though I don't know if this would be actually easy to implement in thinBasic.
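In script terms, the effect I have in mind is the same as rewriting the tail call as a loop that keeps reusing one set of locals; a rough, untested sketch of what sequencefilename would boil down to:

Function sequencefilename_loop(filename As String, dotext As String, seqnum As DWord) As String
  Dim retfilename As String

  '---Instead of calling itself, the function updates its "parameters"
  '   and loops, so only one local frame is ever alive
  retfilename = filename & $myseparator & tob32(seqnum) & dotext
  While FILE_Exists(retfilename)
    seqnum += 1
    retfilename = filename & $myseparator & tob32(seqnum) & dotext
  Wend

  Return retfilename
End Function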

As for the recursive function I posted, I expect it to typically be executed less than a hundred times each time the program runs, so optimizing it isn't as high a priority as my tone might have implied. But then, you never know when you're going to find the madman who tries to create five thousand files, all named "%HOMEPATH%\a.txt" :rolleyes:
Still, I might run some tests one of these days.

Also, congratulations and thanks for the #profile directive. That's a big win for thinBasic.

ErosOlmi
20-03-2017, 20:01
I have to suggest, then, a seemingly easy optimization for tail calls (not just tail recursive): destroy the previous "local stack" and its pointer, and replace it with a pointer to the new local stack. This could decrease memory usage dramatically, at perhaps a slightly increased runtime cost (the time thinBasic takes to figure out if a call is a tail call or not). Or perhaps a "TAIL" or "tailcall" keyword could be introduced, to indicate to thinBasic that this optimization can be done on a given function or call.
Though I don't know if this would be actually easy to implement in thinBasic.
Yes, not easy at the moment, mainly because of my time. I want to release a new version with the new IDE and work on supporting any data type inside TYPE/END TYPE. That's my priority.
But I will study a solution for tail calls.


As for the recursive function I posted, I expect it to typically be executed less than a hundred times each time the program runs, so optimizing it isn't as high a priority as my tone might have implied. But then, you never know when you're going to find the madman who tries to create five thousand files, all named "%HOMEPATH%\a.txt" :rolleyes:
Still, I might run some tests one of these days.
Let me know about your tests, I'm curious.


Also, congratulations and thanks for the #profile directive. That's a big win for thinBasic.
Thanks. It slows down execution a bit but can show interesting information for medium or complex scripts.

Vandalf
21-03-2017, 12:49
For reasons unknown, my copy of thinBasic is unable to generate timing data when profiling. I did try running it as administrator. All it ever shows in the time columns is 0.0000. The rest of the profiling data are shown correctly.
In the end, I had to use a watch.

In an empty folder, uniquefilenamev1 generated 201 files in 12.5 seconds.
In an empty folder, uniquefilenamev2 generated 201 files in 16 seconds.
This was all with #profile on.
I did some more tests, but I don't count them because I forgot to note if #profile was on or off during those. Still, they showed the same tendency where v1 is faster than v2.

The difference between the two variants is that v1 uses the tob32 function, which takes a number and returns a corresponding base 32* string, while v2 uses the b32plusone function, which takes a base 32* string and returns the equivalent of converting it to a number, adding one to it, and then applying the tob32 function to that new number.

*Custom encoding, not conforming to RFC 4648.

This is the script:

Uses "FILE"
#PROFILE On
$tob32_charset = "0123456789ABCDEFGHJKMNPQRTUVWXYZ"
'0123456789ABCDEFGH JK MN PQR TUVWXYZ
%tob32_charsetptr = StrPtr($tob32_charset) As DWord

$testfilename = "C:\users\usr1\tst\ab.txt"

For i As Word=0 To 200
FILE_Append( uniquefilenamev1( $testfilename ) , "hello world" )
Next

MsgBox 0, "done"

Function uniquefilenamev1(unsafe As String) As String
If Not FILE_Exists(unsafe) Then Return unsafe
Return sequencefilenamev1( _
FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE), _
"." & FILE_PathSplit(unsafe, %PATH_EXT), _
0)
End Function

Function uniquefilenamev2(unsafe As String) As String
If Not FILE_Exists(unsafe) Then Return unsafe
Return sequencefilenamev2( _
FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE), _
"." & FILE_PathSplit(unsafe, %PATH_EXT), _
"0")
End Function

Function sequencefilenamev1(filename As String, dotext As String, seqnum As Number) As String
Dim retfilename As String = filename & "-" & tob32(seqnum) & dotext
If Not FILE_Exists(retfilename) Then Return retfilename
Return sequencefilenamev1(filename,dotext,seqnum+1)
End Function

Function sequencefilenamev2(filename As String, dotext As String, seqnum As String) As String
Dim retfilename As String = filename & "-" & seqnum & dotext
If Not FILE_Exists(retfilename) Then Return retfilename
Return sequencefilenamev2(filename,dotext,b32plusone(seqnum))
End Function

Function tob32(nu As Number) As String
Static n As DWord
Static res As String
res=""
Repeat
n=nu And 31
res= Memory_Get(%tob32_charsetptr+n,1) & res
nu=SHIFT RIGHT nu, 5
Until nu=0
Return res
End Function

Function b32plusone(thestring As String) As String
'returns a base 32 string with a value of one plus the given string's value
'the encoding it assumes is 0123456789ABCDEFGH JK MN PQR TUVWXYZ
'it is case sensitive
Static carry,tmp As Byte
Static i,j As DWord
i= Len(thestring)
j= StrPtr(thestring)-1
carry=0
tmp=Asc(thestring,i)
' HI,KL,NO,RS,9:,Z[
If(In(Poke(j+i,tmp+1),73,76,79,83,58,91)>0) Then _
If (In(Poke(j+i,tmp+2),59,92)>0) Then _
If Poke(j+i,tmp+8)<>65 Then _
carry=Poke(j+i,48)
For i =i-1 To 1 Step -1 While ((carry>0) And (Asc(thestring,i)=90))
Poke(j+i, 48)
Next

If (i<=0) Then
If (carry>0)Then Return "1" + thestring
Else
tmp=Asc(thestring,i)
If (carry>0) Then carry = 1
If(In(Poke(j+i,tmp+carry),73,76,79,83,58)>0) Then _
If Poke(j+i,tmp+2)=59 Then Poke(j+i,65)
End If

Return thestring
End Function

ErosOlmi
22-03-2017, 11:40
Ciao Vandalf,

very interesting script and requirements.
I think the profiler is not showing timings because, for some reason, it cannot use the high-resolution timing functions: https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx
In future versions I will add a check for whether high-resolution timing functions are available and, if not, automatically fall back to a less precise timer.

I've simplified your script a little in order to focus on a few aspects.
On my computer, creating 201 files in an empty directory takes about 3.5 seconds.
I will be in a two-day meeting at work, but I will get back to this over the weekend.
My idea is to try to optimize (if possible) some of the functions already used in your script and/or develop new dedicated functions in the Core or File modules.

I will let you know.

Thanks
Eros



Uses "FILE"
uses "Console"


#PROFILE On


$tob32_charset = "0123456789ABCDEFGHJKMNPQRTUVWXYZ"
'0123456789ABCDEFGH JK MN PQR TUVWXYZ
%tob32_charsetptr = StrPtr($tob32_charset) As DWord


'---Create a path under script directory where to save files
$testPath = APP_ScriptPath & "Profile_NoTiming\"
DIR_MakeAll($testPath)


'---Initial file name
$testfilename = $testPath & "ab.txt"


string sFileName


double t0, t1, t2, t9


t0 = Timer
printl "Starting time:", Time$
For i As Word=0 To 200
t1 = Timer
sFileName = uniquefilenamev1( $testfilename )
t2 = timer
printl i, sFileName, format$(T2 - T1, "#0.0000"), "secs"
FILE_append(sFileName, "hello world")
Next
printl "Ending time:", Time$
t9 = Timer


printl "All done in", format$(T9 - T0, "#0.0000"), "secs"
WaitKey




Function uniquefilenamev1(unsafe As String) As String
If Not FILE_Exists(unsafe) Then
function = unsafe
Else
function = sequencefilenamev1( _
FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE), _
"." & FILE_PathSplit(unsafe, %PATH_EXT), _
0)
end if
End Function

Function sequencefilenamev1(filename As String, dotext As String, seqnum As long) As String
string retfilename = filename & "-" & tob32(seqnum) & dotext


If Not FILE_Exists(retfilename) Then
function = retfilename
Else
function = sequencefilenamev1(filename, dotext, seqnum + 1)
end If
End Function


Function tob32(nu As long) As String
Static n As DWord
dim res As String

'res = ""
Repeat
n = nu And 31
res = Memory_Get(%tob32_charsetptr + n, 1) & res
nu = SHIFT RIGHT nu, 5
Until nu = 0

function = res
End Function

ErosOlmi
24-03-2017, 08:11
Ciao Vandalf,

the following version of the script takes about 0.4 seconds on my computer in an empty directory, and about 7.2 seconds in a directory that already contains 201 files.

The trick is to add

Static lSequence As Long

inside the uniquefilenamev1 function, in order to remember the last sequence number used and avoid restarting the check from the beginning (zero) every time.

It is a trick for this particular situation; it may not be usable everywhere.
For example, if one process creates files while another process removes them, this can create "holes" in the sequence, and the functions below would not notice them during execution. The gaps would only be filled during a subsequent run.

Let me know.
Ciao
Eros




Uses "FILE"
uses "Console"


#PROFILE On


$tob32_charset = "0123456789ABCDEFGHJKMNPQRTUVWXYZ"
'0123456789ABCDEFGH JK MN PQR TUVWXYZ
%tob32_charsetptr = StrPtr($tob32_charset) As DWord


'---Create a path under script directory where to save files
$testPath = APP_ScriptPath & "Profile_NoTiming\"
DIR_MakeAll($testPath)


'---Initial file name
$testfilename = $testPath & "ab.txt"


string sFileName


double t0, t1, t2, t9


t0 = Timer
printl "Starting time:", Time$
For i As Word=0 To 200
t1 = Timer
sFileName = uniquefilenamev1( $testfilename )
t2 = timer
printl i, sFileName, format$(T2 - T1, "#0.0000"), "secs"
FILE_append(sFileName, "hello world")
Next
printl "Ending time:", Time$
t9 = Timer


printl "All done in", format$(T9 - T0, "#0.0000"), "secs"
WaitKey




Function uniquefilenamev1(unsafe As String) As String
Static lSequence as Long

If Not FILE_Exists(unsafe) Then
function = unsafe
Else
function = sequencefilenamev1( _
FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE), _
"." & FILE_PathSplit(unsafe, %PATH_EXT), _
lSequence)
lSequence += 1
end if
End Function

Function sequencefilenamev1(filename As String, dotext As String, seqnum As long) As String
string retfilename


retfilename = filename & "-" & tob32(seqnum) & dotext
If Not FILE_Exists(retfilename) Then
function = retfilename
Else
function = sequencefilenamev1(filename, dotext, seqnum + 1)
end If

'---Non recursive version of the function---
' retfilename = filename & "-" & tob32(seqnum) & dotext
' While FILE_Exists(retfilename)
' seqnum += 1
' retfilename = filename & "-" & tob32(seqnum) & dotext
' wend
' function = retfilename

End Function


Function tob32(nu As long) As String
Static n As DWord
dim res As String

'res = ""
Repeat
n = nu And 31
res = Memory_Get(%tob32_charsetptr + n, 1) & res
nu = SHIFT RIGHT nu, 5
Until nu = 0

function = res
End Function

Vandalf
24-03-2017, 14:50
Thank you, Eros. That's a pretty clever solution.
I was thinking of a more algorithmic strategy: a binary search.
Using decimal numbers for readability, these could be the filenames checked with such a strategy:

"ab-1.txt" 'exists
"ab-2.txt" 'exists
"ab-4.txt" 'exists
"ab-8.txt" 'exists
"ab-16.txt" 'doesn't exist
"ab-12.txt" 'exists
"ab-14.txt" 'exists
"ab-15.txt" 'doesn't exist, and is the "least" filename available

This assumes that there are no gaps in the sequence of files. If gaps existed, this strategy would give erratic results, filling some gaps but not others. Your strategy is better in that regard, in my opinion: I want the numeric part of the filename to reflect (at least partially) the order of creation of the files.
Your strategy is also much better in that it resolves filenames in constant time, whereas a binary search takes logarithmic time. Both are still a massive improvement over the original, which takes linear time.
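For the record, here is a rough, untested sketch of that search, reusing tob32 and FILE_Exists from the earlier script (it assumes the caller has already handled the plain filename, and that the existing "-1"..."-n" files form a gap-free run):

Function sequencefilename_bsearch(filename As String, dotext As String) As String
  Dim lo, hi, middle As Long

  '---Exponential probing: keep doubling hi until a missing file is found
  lo = 0
  hi = 1
  While FILE_Exists(filename & "-" & tob32(hi) & dotext)
    lo = hi
    hi = hi * 2
  Wend

  '---Binary search: lo always exists (or is 0), hi never exists
  While hi - lo > 1
    middle = Int((lo + hi) / 2)   'midpoint, rounded down
    If FILE_Exists(filename & "-" & tob32(middle) & dotext) Then
      lo = middle
    Else
      hi = middle
    End If
  Wend

  Return filename & "-" & tob32(hi) & dotext
End Function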

The full background to my initial problem is this:
I am developing a web scraping program: in other words, a program that parses an HTML file (a webpage) and selectively downloads files from the hyperlinks found in it.
I quickly discovered that this program would overwrite previously downloaded files that had the same original name. I went through several quick and dirty solutions to that problem, finally settling on what I posted earlier in this thread.
Then I started to wonder about the worst-case performance: a malicious webmaster or an oblivious user prepares a webpage linking to many (hundreds or thousands of) different files, all with the same filename (or with a small number of different names). I would try to scrape that page, and my program would work for a while, but it would eventually slow down and practically freeze, simply due to the sheer number of filenames it would have to check to find the first available one.
This worst-case scenario is not too far-fetched: smartphones, for example, tend to give very simplistic default names to user-created files. I have seen hundreds of different files named "16 - 1.jpg" in Google Plus.

My next step should be to patch your solution into the web scraper. During one scraping session, when my program detected long sequences of filenames in the downloads folder, it would add their "root filename" to a dictionary. The dictionary would keep track of the length of those sequences. It would be a generalized form of your solution.
I would eventually want to also devise a strategy to speed up the initial phase of assessing the length of a sequence of filenames. Likely involving the OS_Shell function and the "dir /b" and "find /c" commands.
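As a very rough sketch of that idea — assuming thinBasic's Dictionary module exposes functions more or less like Dictionary_Create, Dictionary_Find and Dictionary_Add (names and signatures to be verified against the help file) — it could look something like this:

Uses "FILE"
Uses "Dictionary"   'assumed module name, to be verified

'---One dictionary per scraping session: root filename -> next sequence number to try
Dim pSeq As DWord = Dictionary_Create(1000)

Function uniquefilename_dict(unsafe As String) As String
  If Not FILE_Exists(unsafe) Then Return unsafe

  Dim sRoot As String = FILE_PathSplit(unsafe, %PATH_ROOTPATHFILE)
  Dim sExt As String = "." & FILE_PathSplit(unsafe, %PATH_EXT)

  '---Resume from the last sequence number recorded for this root (0 if unknown)
  Dim nSeq As Long = Val(Dictionary_Find(pSeq, sRoot))

  Dim retfilename As String = sRoot & "-" & tob32(nSeq) & sExt
  While FILE_Exists(retfilename)
    nSeq += 1
    retfilename = sRoot & "-" & tob32(nSeq) & sExt
  Wend

  '---Remember where to resume next time; Dictionary_Add may need to be
  '   a "replace" call if the key already exists - to be checked
  Dictionary_Add(pSeq, sRoot, Str$(nSeq + 1))
  Return retfilename
End Function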

So thanks again, Eros. You're being a great help.