Tokenizer- user-keys, but how ???



ReneMiner
01-11-2015, 20:22
I try the tokenizer-engine to parse a script and to recognize thinBasic-keywords and equates.

But how does it work?

I tried this way (and a few others already)



Uses "console", "Tokenizer"

Begin Const

%Token_TBKeyword = 100
%Token_TBEquate
%Token_Comment
%Token_Parenthesis
End Const


Function TBMain()

' read in all keywords:
' --- enter a valid path here if thinBasic is not installed on "C:\" !

SetupKeywords( "c:\thinBasic\thinAir\Syntax\thinBasic\thinBasic_Keywords.ini" )
' run tokenizer on this script:
Tokenize(APP_SourceName)

PrintL "------------------------- key to end"
WaitKey

End Function

Sub SetupKeywords(ByVal sFile As String)

Local allKeywords() As String

Local i As Long

Parse File sFile, allKeywords, $CRLF
Array Sort allKeywords, Descend ' brings empty elements to the end

While StrPtrLen(StrPtr(allKeywords(UBound(allKeywords)))) = 0
' remove empty elements
ReDim Preserve allKeywords(UBound(allKeywords)-1)
Wend
Array Sort allKeywords, Ascend ' now sort as needed

Tokenizer_Default_Char("#", %TOKENIZER_DEFAULT_ALPHA)
Tokenizer_Default_Char("$", %TOKENIZER_DEFAULT_ALPHA)
Tokenizer_Default_Char("%", %TOKENIZER_DEFAULT_ALPHA)
Tokenizer_Default_Char(":", %TOKENIZER_DEFAULT_NEWLINE)

Tokenizer_KeyAdd("'", %Token_Comment, 0)
Tokenizer_KeyAdd("(", %Token_Parenthesis, 1)
Tokenizer_KeyAdd(")", %Token_Parenthesis, -1)

For i = 1 To UBound(allKeywords)
Select Case Peek(Byte, StrPtr(allKeywords(i)))
Case 36, 37 ' $, %
Tokenizer_KeyAdd(allKeywords(i), %Token_TBEquate, i)
Case Else
Tokenizer_KeyAdd(allKeywords(i), %Token_TBKeyword, i)
End Select
Next

End Sub

Sub Tokenize(sFile As String)


Local sToken, sCode As String
Local lPos, lMain, lSub, lParenthesis, lLines As Long
Local pKey As DWord

sCode = Load_File(sFile)
If StrPtrLen(StrPtr(sCode)) = 0 Then Exit Sub

lPos = 1
Do
pKey = Tokenizer_GetNextToken(sCode, lPos, lMain, sToken, lSub)
Incr lLines

Select Case lMain
Case %TOKENIZER_FINISHED
Exit Do
Case %TOKENIZER_ERROR
Exit Do
Case %TOKENIZER_QUOTE
PrintL "quoted string : " & sToken
Case %TOKENIZER_DELIMITER
PrintL "delimiter : " & sToken
Case %TOKENIZER_NUMBER
PrintL "number : " & sToken
Case %TOKENIZER_EOL
PrintL "(new line)"
Case Else
Select Case Tokenizer_KeyGetMainType(pKey)
Case %Token_TBKeyword
PrintL "TBKeyword :" & sToken
Case %Token_TBEquate
PrintL "TBEquate :" & sToken
Case %Token_Parenthesis
lParenthesis += Tokenizer_KeyGetSubType(pKey)
PrintL "parenthesis :" & sToken & Str$(lParenthesis)
Case %Token_Comment
Tokenizer_MoveToEol(sCode, lPos, TRUE)
PrintL "comment"
Case Else
PrintL "other token :" & sToken
End Select
End Select

If lLines > 20 Then
PrintL "------------------- key to continue --------------"
WaitKey
lLines = 0
EndIf

Loop

If lParenthesis <> 0 Then PrintL "found unbalanced parenthesis"

End Sub



why does it not recognize my user-tokens?

ErosOlmi
12-11-2015, 00:52
Hi René,

sorry for the delay in replying, but I was going crazy over this while the problem is so simple.
When you use Tokenizer_KeyAdd(...) you must specify the key in upper case, otherwise the tokenizer is not able to find it.

So change your key load loop to something like:


For i = 1 To UBound(allKeywords)
Select Case Peek(Byte, StrPtr(allKeywords(i)))
Case 36, 37 ' $, %
Tokenizer_KeyAdd(ucase$(allKeywords(i)), %Token_TBEquate, i)
Case Else
Tokenizer_KeyAdd(ucase$(allKeywords(i)), %Token_TBKeyword, i)
End Select
Next

This is not documented in the manual.
I'm thinking of doing it automatically in the module.
I need to double check whether this change will work in all situations.
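
The failure mode can be illustrated with a small language-neutral sketch, written here in Python rather than thinBasic. It rests on an assumption that is not confirmed anywhere in this thread: that the tokenizer upper-cases each token before looking it up, while keys are stored exactly as they were passed to Tokenizer_KeyAdd. `KeyTable`, `add` and `find` are hypothetical names, not part of the module.

```python
class KeyTable:
    """Toy key store; a stand-in for the tokenizer's key dictionary."""

    def __init__(self):
        self._keys = {}

    def add(self, key, main_type, sub_type):
        # Stored exactly as given: no case conversion on insert.
        self._keys[key] = (main_type, sub_type)

    def find(self, token):
        # Assumed behaviour: the token is upper-cased before the lookup.
        return self._keys.get(token.upper())


TOKEN_TB_KEYWORD = 100

table = KeyTable()
table.add("PrintL", TOKEN_TB_KEYWORD, 1)          # mixed case, as read from the .ini file
missed = table.find("PrintL")                     # lookup uses "PRINTL", which was never stored

table.add("PrintL".upper(), TOKEN_TB_KEYWORD, 1)  # the fix: upper-case the key on insert
found = table.find("printl")                      # now found, whatever the case in the source
```

Under that assumption, doing the conversion inside the module (as considered above) would amount to moving the `.upper()` call into `add`.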

Ciao
Eros

ReneMiner
12-11-2015, 09:48
thanks for the info. Now it's getting interesting and Tokenizer much more powerful.

Now a very basic question about this: assume I make tokens from global variables or UDTs/Types/Subs/Functions in a script, and then I want to load another script to tokenize, so the tokens of the previous script are no longer valid. I was searching for something like


pKey = Tokenizer_KeyFind("MY_UNNEEDED_TOKEN")

Tokenizer_KeyRemove(pKey) ' but i did not find anything like this


What operation to perform to kill or remove a user-key from Tokenizer-Engine?


+++ one more (see example above incl. your suggested changes)
I did something as


Tokenizer_KeyAdd("'", %Token_Comment, 0)
Tokenizer_KeyAdd("(", %Token_Parenthesis, 1)
Tokenizer_KeyAdd(")", %Token_Parenthesis, -1)
' Tokenizer_KeyAdd($DQ & "console" & $DQ, %Token_Comment, 123) ' - i also tried Ucase$() here...



but these are still just delimiters. I tried quoted strings also, as shown above.

Are user-tokens limited to %Tokenizer_String-group only?

ErosOlmi
12-11-2015, 11:31
I think in the short term I can add something like Tokenizer_Reset to clear all internal Tokenizer data structures, so the programmer can reload a new grammar.

Regarding string delimiters, most of them are "predefined" and you cannot change "predefined" ones like ()' ...
I will check what I can do; maybe the possibility is already in place but not explained.

Will see this night

ErosOlmi
13-11-2015, 08:46
René,

I've checked the code of the module and I think we cannot go much further with the current setup unless we break syntax compatibility with the module functions as they are now developed.
I'm developing a Tokenizer Module Class able to wrap current functionalities and to be further developed in order to have many tokenizers (different objects of the Tokenizer class) working at the same time each with its rule.

As soon as I have something to test, I will publish here.

Ciao
Eros

ReneMiner
13-11-2015, 16:03
...
I'm developing a Tokenizer Module Class able to wrap current functionalities and to be further developed in order to have many tokenizers (different objects of the Tokenizer class) working at the same time each with its rule.

As soon as I have something to test, I will publish here.

Ciao
Eros

sounds good.
Very nice if different tokenizers could run in different setups depending on the current task :)

ErosOlmi
18-11-2015, 22:53
I'm working on Tokenizer. So far I have done some initial work in order to create a module class able to set up some features and load and check keys.
Not really something that can do interesting things yet, but it seems to be taking shape.
Now I'm working on the real tokenizer. After that I will publish a working module.

All previous features will continue to work as now.

Below is an example of the syntax:



'---
'---Loads needed modules
'---
Uses "File"
Uses "Console"
Uses "Tokenizer"


Long cBrightCyanBGBlue = %CONSOLE_FOREGROUND_BLUE | %CONSOLE_FOREGROUND_GREEN | %CONSOLE_FOREGROUND_INTENSITY | %CONSOLE_BACKGROUND_BLUE
Long cBrightRed = %CONSOLE_FOREGROUND_RED | %CONSOLE_FOREGROUND_INTENSITY




Function TBMain() As Long
Quad T0, T1 '---Performance timing
Long Counter


'---Declare a new tokenizer engine
Dim MyParser As cTokenizer


'---Instantiate new tokenizer
MyParser = New cTokenizer()


PrintL "-Configure Tokenizer---" In cBrightCyanBGBlue
'---Change some default behave in char parsing
MyParser.Default_Char("$", %TOKENIZER_DEFAULT_ALPHA)
MyParser.Default_Char("%", %TOKENIZER_DEFAULT_ALPHA)
MyParser.Default_Char(":", %TOKENIZER_DEFAULT_NEWLINE)

'---Those two lines are equivalent (59 is the ASCII code of ";")
MyParser.Default_Char(";", %TOKENIZER_DEFAULT_NEWLINE)
MyParser.Default_Code(59 , %TOKENIZER_DEFAULT_NEWLINE)

'---A set of chars can be indicated in one go using Default_Set method
MyParser.Default_Set("$%", %TOKENIZER_DEFAULT_ALPHA)
MyParser.Default_Set(":;", %TOKENIZER_DEFAULT_NEWLINE)

MyParser.Options.CaseSensitive = %FALSE
PrintL "Tokenizer option, CaseSensitive = " & MyParser.Options.CaseSensitive


PrintL
PrintL "-Loading keys---" In cBrightCyanBGBlue
'---Create a new keywords group. Assign a value >= 100
Dim MyKeys As Long Value 100

Dim sFile As String = "c:\thinBasic\thinAir\Syntax\thinBasic\thinBasic_Keywords.ini"
Dim allKeywords() As String
Dim nKeys As Long

T0 = Timer
nKeys = Parse File sFile, allKeywords, $CRLF
PrintL "Number of keys I'm going to load:", nkeys
For Counter = 1 To UBound(allKeywords)
MyParser.Key.Add(allKeywords(Counter), MyKeys, Counter)
Next
T1 = Timer
PrintL "Loading time: " & Format$(T1 - T0) & " mSec"



PrintL
PrintL "-Checking some keys---" In cBrightCyanBGBlue

PrintL "If the following few lines will return a number, all is ok"
PrintL MyParser.Contains("Dim" )
PrintL MyParser.Contains("As" )
PrintL MyParser.Contains("PrintL" )
PrintL MyParser.Contains("Uses" )
PrintL MyParser.Contains("Zer")

WaitKey


End Function

Petr Schreiber
19-11-2015, 00:02
Eros,

do you ever sleep :)? Fantastic...

I must admit I find some concepts in the original Tokenizer a bit confusing. For example, the Default* functions talk about groups, but these are character groups, while user groups are about whole tokens (strings). This could be made clearer and more separated.

For example:
MyParser.SpecialChars.Set|Get|Add|AddAscii|Remove

- Set => assigns them all at once, erasing any previous setup
- Get => returns current characters as string
- Add => add to the characters, if not present already
- AddAscii => add to the characters by ASCII code, if not present already
- Remove => removes character from SpecialChars
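
The semantics listed above can be sketched in a few lines. This is Python rather than thinBasic, and it only illustrates the proposed behaviour; `SpecialChars` here is a hypothetical class, not an existing part of the Tokenizer module.

```python
class SpecialChars:
    """Hypothetical container mirroring the proposed Set|Get|Add|AddAscii|Remove."""

    def __init__(self):
        self._chars = ""

    def set(self, chars):
        # Replace the whole setup; drop duplicates but keep first-seen order.
        self._chars = "".join(dict.fromkeys(chars))

    def get(self):
        return self._chars

    def add(self, chars):
        # Append only the characters that are not present already.
        for ch in chars:
            if ch not in self._chars:
                self._chars += ch

    def add_ascii(self, code):
        self.add(chr(code))

    def remove(self, chars):
        self._chars = "".join(ch for ch in self._chars if ch not in chars)


special = SpecialChars()
special.set("$%")
special.add("%#")       # "%" is already there, only "#" is appended
special.add_ascii(58)   # 58 is the ASCII code of ":"
special.remove("$")
result = special.get()
```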

Then we have user keywords.

I think user should not care about some reserved items in range 0..99. He could be simply provided with .CreateGroupType to receive a "handle" he can work with further. Then .DestroyGroupType could release all data for given group.

tbKeywords = myParser.CreateKeyGroup("ThinBASIC Keywords")
myParser.Keys.Add("DIM", tbKeywords)
myParser.Keys.Add("AS", tbKeywords)
myParser.Keys.Add("LONG", tbKeywords)

SubType could come as 3rd parameter, because it could be optional.

.Keys.Remove could be useful.

Just ideas, better to discuss via Skype, maybe?


Petr

ErosOlmi
19-11-2015, 00:10
In reality DEFAULTs is an internal array indexed by the ASCII code of each char.
Inside the array each letter is marked with its default type: alphabetic, numeric, delimiter, new line, ... and so on.
When you set some default you just change the type of that char.
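
A rough sketch of such a table, in Python rather than thinBasic. The constant names and the initial classification are assumptions made for illustration; the real module's defaults may differ.

```python
# One entry per ASCII code; the value is the character's default type.
ALPHA, NUMERIC, DELIMITER, NEWLINE, SPACE, DQUOTE = range(6)


def build_default_table():
    table = [DELIMITER] * 256            # assumed fallback type for everything else
    for code in range(ord("A"), ord("Z") + 1):
        table[code] = ALPHA
    for code in range(ord("a"), ord("z") + 1):
        table[code] = ALPHA
    for code in range(ord("0"), ord("9") + 1):
        table[code] = NUMERIC
    for ch in " \t":
        table[ord(ch)] = SPACE
    for ch in "\r\n":
        table[ord(ch)] = NEWLINE
    table[ord('"')] = DQUOTE
    return table


def set_default_char(table, ch, char_type):
    # Setting a default only overwrites the type stored for that one code.
    table[ord(ch)] = char_type


table = build_default_table()
before = table[ord("%")]                 # DELIMITER in this sketch
set_default_char(table, "%", ALPHA)      # "%" now counts as part of a name
set_default_char(table, ":", NEWLINE)    # ":" now ends a logical line
```

Classifying a character is then a single array lookup, which is why changing a default is cheap.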

Anyway, I agree syntax is not that elegant in this case :(
Will consider a change.

Regarding internal generation of parsing dictionary groups ... great idea!

In the meantime I've developed keys search


PrintL "Checking MainType and SubType of DIM key"
PrintL "DIM MainType: " & MyParser.Key("Dim").MainType
PrintL "DIM SubType : " & MyParser.Key("Dim").SubType


PrintTitle "-A BIG search of 1M Keys ---"
lTimer.Start
For Counter = 1 To 1000000
MyParser.Key("Dim").MainType
Next
lTimer.Stop
PrintL "1M search time in mSec: " & lTimer.Elapsed(%CTIMER_MILLISECONDS)




I'm using PowerBasic PowerCollection in order to store keys. It is not as efficient as my personal hash table but it allows me more flexibility.
Searching 1M times for a key among almost 7000 keys takes 4 seconds on my PC.
Hope to halve the time in the end :)

Thanks
Eros

ErosOlmi
19-11-2015, 00:25
do you ever sleep :)?


Today: 2 hours travel to work, 10 hours at work, 1.5 hours to return home
Some eat, 1.5 hour in programming at home.
Now I want a BIG 6 hours sleeping.
Tomorrow morning awake at 6:30
GOTO Today

oops: I've used a GOTO :evil:

Petr Schreiber
19-11-2015, 00:38
I think the timing is already very, very good - I would not worry about optimizing too much at this stage.

Good night!,
Petr

P.S. GOTO 4EVER (just not in ThinBASIC, for our coders peaceful dreams)

ErosOlmi
19-11-2015, 07:05
Regarding character setup, I've adopted the following syntax: object.Char.Set.{Alpha|NewLine|Space|Delim|Numeric|DQuote}

Example:


'---Declare a new tokenizer engine
Dim MyParser As cTokenizer


'---Instantiate new tokenizer
MyParser = New cTokenizer()

...


MyParser.Char.Set.NewLine (":;")
MyParser.Char.Set.Space ($SPC & $TAB)
MyParser.Char.Set.Delim (",.-+*/")
MyParser.Char.Set.Numeric ("0123456789")
MyParser.Char.Set.Alpha ("$%ABCDabcd")
MyParser.Char.Set.DQuote ($DQ)

...


I will now work on:

- key group creation
- a few utility functions (.keys.delete, .keys.list, ...)
- loading the string buffer of text to parse
- parsing
- scanning

After that I will release a working new Tokenizer module to test.

Ciao
Eros

Petr Schreiber
19-11-2015, 13:52
Reads great, looking forward to play more with it :)


Petr

ErosOlmi
24-11-2015, 08:35
Note: attached module removed because a new one is posted in a successive post of this thread

I've something to play with.
Just very rough, not using 100% configured options and char recognition, but just to give you an idea I will publish it right now.

You need thinBasic Beta 1.9.16.4 installed. See http://www.thinbasic.com/community/showthread.php?12600
Get the attached Tokenizer module and substitute your current one in \thinBasic\Lib\


The following example will give you an idea of how to set up the tokenizer, get info and scan a text, and of the speed and the new syntax


'---
'---Loads needed modules
'---
Uses "File"
Uses "Console"
Uses "Tokenizer"


Long cBrightCyanBGBlue = %CONSOLE_FOREGROUND_BLUE | %CONSOLE_FOREGROUND_GREEN | %CONSOLE_FOREGROUND_INTENSITY | %CONSOLE_BACKGROUND_BLUE
Long cBrightGreen = %CONSOLE_FOREGROUND_GREEN | %CONSOLE_FOREGROUND_INTENSITY
Long cBrightMagenta = %CCOLOR_FMAGENTA | %CONSOLE_FOREGROUND_INTENSITY
Long cBrightRed = %CONSOLE_FOREGROUND_RED | %CONSOLE_FOREGROUND_INTENSITY




'--------------------------------------------------------
'
'--------------------------------------------------------
Function TBMain() As Long
Long Counter


'---Create a timer
Local lTimer As cTimer
lTimer = New cTimer

'---Declare a new tokenizer engine
Dim MyParser As cTokenizer


'---Instantiate new tokenizer
MyParser = New cTokenizer()


PrintTitle "Configure Tokenizer"
'---Change some default behave in char parsing
'---Every single ASCII char can be associated to one of the predefined 6 char sets
'---A set of characters can be indicated in one go using .Char.Set. method followed by the type:
'--- NewLine, Space, Delim,Numeric, Alpha, DQuote
PrintStep "Setting character sets (if needed)"
MyParser.Char.Set.NewLine (":;")
MyParser.Char.Set.Space ($SPC & $TAB)
MyParser.Char.Set.Delim (",.-+*/")
MyParser.Char.Set.Numeric ("0123456789")
MyParser.Char.Set.Alpha ("$%ABCDabcd")
MyParser.Char.Set.DQuote ($DQ)

MyParser.Options.CaseSensitive = %FALSE
PrintStep "Tokenizer option, CaseSensitive = " & MyParser.Options.CaseSensitive


PrintTitle "Loading keys"
'---Create a new keywords group. Assign a value >= 100
Dim MyKeys As Long Value 100

Dim sFile As String = APP_Path & "thinAir\Syntax\thinBasic\thinBasic_Keywords.ini"
Dim allKeywords() As String
Dim nKeys As Long

nKeys = Parse File sFile, allKeywords, $CRLF
PrintStep "Number of keys I'm going to load: " & nkeys
lTimer.Start
For Counter = 1 To UBound(allKeywords)
MyParser.Keys.Add(allKeywords(Counter), MyKeys, Counter)
Next
lTimer.Stop
PrintStep "Number of keys loaded: " & MyParser.Keys.Count
PrintStep "Loading time Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")



PrintTitle "Checking some keys"
PrintStep "If the following few lines will return a number, all is ok"
PrintData MyParser.Keys.Contain("Dim" )
PrintData MyParser.Keys.Contain("As" )
PrintData MyParser.Keys.Contain("PrintL" )
PrintData MyParser.Keys.Contain("Uses" )
PrintData MyParser.Keys.Contain("Zer")


PrintStep "Checking MainType and SubType of DIM key"
PrintData "DIM MainType: " & MyParser.Key("Dim").MainType
PrintData "DIM SubType : " & MyParser.Key("Dim").SubType


PrintTitle "A BIG search of 100K Keys"
PrintStep "Searching for 100K keys, ..."
lTimer.Start
For Counter = 1 To 100000
MyParser.Key("Dim").MainType
Next
lTimer.Stop
PrintStep "100K search time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")


PrintTitle "-Start Scanning 100 times current script source code---"
lTimer.Start
PrintStep "Number of Tokens during scan: " & MyParser.Scan(Repeat$(100, Load_File(APP_ScriptFullName)))
lTimer.Stop
PrintStep "Scan time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")




PrintTitle "Scanning a line of text in order to see how to scan a new source and to access single tokens"
lTimer.Start
PrintStep "Number of Tokens during scan: " & MyParser.Scan("Dim MyVariable As Long")
PrintStep "Number of Tokens count : " & MyParser.Tokens.Count

For Counter = 1 To MyParser.Tokens.Count
PrintData "Token Data :" & MyParser.Token(Counter).Data
PrintData "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
PrintData "Token SubType :" & MyParser.Token(Counter).SubType
PrintData "-------------------------------------------"
Next
lTimer.Stop
PrintStep "Scan time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")


PrintWarning "All done. Press a key to finish"
WaitKey


End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintTitle(ByVal sTitle As String)
PrintL
PrintL "-" & sTitle & Repeat$(78 - Len(sTitle), "-") In cBrightCyanBGBlue
End Function

'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintStep(ByVal sStep As String)
PrintL " " & sStep In cBrightGreen
End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintData(ByVal sData As String)
PrintL " " & sData In cBrightMagenta
End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintWarning(ByVal sData As String)
PrintL
PrintL sData In cBrightRed
End Function

ReneMiner
24-11-2015, 12:16
ok, i played already a little

i tried to replace line 108 with something as this:


PrintStep "Number of Tokens during scan: " & MyParser.Scan("Dim MyVariable As Long " & $CRLF _
& "Dim another variable like " & $DQ & "something" & $DQ & " At 0" & $CRLF _
& "produce " & $DQ & "ERROR" & $CRLF _
& "another line" )


- i see it tokenizes all at once and i can check the result when done.
In my current project I scan single lines of code only (using the classic tokenizer). It's hard because of line continuation, but simpler because I don't have to calculate the line for a token and can trap errors.

What it would need is some position information.
let's say this


String n = "12345678901234567890" ' (this just for orientation)
String s = "THIS ARE" & $CRLF _
& "A FEW LOSE TOKENS"


If for example after parsing s


MyParser.Token(1).Data ' will return "THIS"
MyParser.Token(2).Data ' will return "ARE"
MyParser.Token(3).Data ' will return $CRLF (EOL, 2 Bytes) i guess,


i need something like


MyParser.Token(1).ContinueAt ' to return 5
MyParser.Token(2).ContinueAt ' return 9 here...
MyParser.Token(3).ContinueAt ' would return 11 then


so i can subtract Length of token from the position where the parser continues at to scan for the next token to find out the position of the actual token in the text,

in basic expression:



Long tokenXStartPos = MyParser.Token(X).ContinueAt - Len(MyParser.Token(X).data)


You could as well add


MyParser.Token(X).StartsAt ' to retrieve Starting-position

to make it more simple for the users but more effort for you...

Also useful could be to retrieve
MyParser.Token(X).LineNumber
( only real CRLF probably, tells on which line the token was found )
MyParser.Token(X).LinePosition
( tell if this is the 1st, 2nd, 3rd, 4th, 5th... token on the current line )

expressed in basic after parsing s (see above)


MyParser.Token(1).Data ' returns "THIS"
MyParser.Token(1).LineNumber ' returns 1
MyParser.Token(1).LinePosition ' returns 1

MyParser.Token(2).Data ' returns "ARE"
MyParser.Token(2).LineNumber ' returns 1
MyParser.Token(2).LinePosition ' returns 2

MyParser.Token(3).Data ' returns $CRLF
MyParser.Token(3).LineNumber ' returns 1
MyParser.Token(3).LinePosition ' returns 3

MyParser.Token(4).Data ' returns "A"
MyParser.Token(4).LineNumber ' returns 2
MyParser.Token(4).LinePosition ' returns 1

MyParser.Token(5).Data ' returns "FEW"
MyParser.Token(5).LineNumber ' returns 2
MyParser.Token(5).LinePosition ' returns 2

MyParser.Token(6).Data ' returns "LOSE"
MyParser.Token(6).LineNumber ' returns 2
MyParser.Token(6).LinePosition ' returns 3
' ...
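
The requested bookkeeping can be sketched with a toy scanner. This is Python rather than thinBasic, it only splits on whitespace and $CRLF, and the field names (`starts_at`, `continue_at`, `line_number`, `line_position`) are the hypothetical ones from this post, not properties the module is known to have.

```python
import re


def scan(text):
    """Split text into words and CRLF tokens, recording position details."""
    tokens = []
    line_number = 1
    line_position = 0
    # A token is either a CRLF pair or a run of non-whitespace characters.
    for match in re.finditer(r"\r\n|\S+", text):
        data = match.group()
        line_position += 1
        tokens.append({
            "data": data,
            "starts_at": match.start() + 1,     # 1-based position of the first byte
            "continue_at": match.end() + 1,     # 1-based position right after the token
            "line_number": line_number,
            "line_position": line_position,
        })
        if data == "\r\n":
            # The CRLF itself still belongs to the line it terminates.
            line_number += 1
            line_position = 0
    return tokens


tokens = scan("THIS ARE\r\nA FEW LOSE TOKENS")
for token in tokens:
    print(repr(token["data"]), token["starts_at"], token["continue_at"],
          token["line_number"], token["line_position"])
```

For every token, `starts_at` equals `continue_at` minus the token length, so storing either value is enough to recover the other.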

ErosOlmi
24-11-2015, 13:08
Hi René,

sure I can add your requests.
I already had the idea to add byte position of token but your idea to have both (start of current and next one) is better.
Also OK to get line number using only real lines.

SCAN does the job in one go but it needs more info to operate like "removecomments" or options like that before scanning.
I just wanted to be sure to be on the right track and you liked it.

Ciao
Eros

Petr Schreiber
24-11-2015, 19:28
Hi Eros,

the parsing speed is uberawesome!!! It also reads very nicely, very fluently. This really opens a whole new range of possibilities.

Just one thing I am not sure about - what is the meaning of Counter as 3rd parameter here:


MyParser.Keys.Add(allKeywords(Counter), MyKeys, Counter)

Is it something like absolute position in keyword list? I would expect the .Add to just append at end, without need to care about position. If you could explain a bit, I would appreciate - maybe I got it wrong.

Ideas:
.Keys.AddRange could allow to add whole string array of keys, to avoid need of iterating via FOR/NEXT

I also support ideas by Rene, very clever as usual :)

thank youu!
Petr

ErosOlmi
24-11-2015, 20:08
Last number is in reality your personal index to recognize your keys when encountered.

Imagine you want to create an interpreter or a recursive descent parser able to interpret numeric equations with formulas and basic keywords.
You add your keys like SIN, COS, TAN to your key dictionary, identified by 100 or whatever.

Every time the parser encounters SIN or COS or TAN, it will return that they are just KEYS of dictionary 100.
But that's not enough if you want to do something to react to SIN or react to COS or react to TAN.

You need that: SIN belongs to dictionary 100 and is ID = 1234 (for example) in order to identify that the parser has found a key of your dictionary and that key is identified by 1234 (or whatever)
The same for COS: parser must return that it found a key of dictionary 100 and that key is 654 (or whatever)
The same for TAN: parser must return that it found a key of dictionary 100 and that key is 444 (or whatever)

Of course you can always identify your key by its name (token name) but making a selection using strings is very very slow compared to numbers.
Anyway it is just an option that programmer can use or not. The ID of the keys can be an index to an internal array or can be a pointer to a function. It is up to the programmer to use or not.
I will make it optional in next version, I agree it is better.
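
A minimal sketch of that idea, in Python rather than thinBasic: each key carries its dictionary (main type) plus a numeric ID, and the program dispatches on the ID instead of comparing strings. The ID values are the arbitrary ones used above; `keys`, `handlers` and `evaluate` are names invented for this illustration.

```python
import math

MATH_FUNCTIONS = 100                       # main type ("dictionary") of the keys
ID_SIN, ID_COS, ID_TAN = 1234, 654, 444    # arbitrary per-key IDs

# What the tokenizer would hand back for a recognised key: (main type, ID).
keys = {
    "SIN": (MATH_FUNCTIONS, ID_SIN),
    "COS": (MATH_FUNCTIONS, ID_COS),
    "TAN": (MATH_FUNCTIONS, ID_TAN),
}

# The ID works as an index into a table of handlers: no string comparison needed.
handlers = {ID_SIN: math.sin, ID_COS: math.cos, ID_TAN: math.tan}


def evaluate(token, argument):
    main_type, key_id = keys[token.upper()]
    if main_type != MATH_FUNCTIONS:
        raise ValueError("not a math function: " + token)
    return handlers[key_id](argument)


result = evaluate("Cos", 0.0)
```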

Ciao
Eros

ErosOlmi
25-11-2015, 08:28
Attached to this post an updated version of Tokenizer module.
Extract thinBasic_Tokenizer.dll and substitute the one you have into \thinBasic\Lib\ directory

Note: you need thinBasic 1.9.16.4 minimum version in order to test it: http://www.thinbasic.com/community/showthread.php?12600


What's new in this version:



user defined main types MUST be created with a dedicated method with an optional description

'---Create a new keywords group.
Dim MyKeys As Long Value MyParser.NewMainType("MyKeys")

during the .Scan operation, user keys (user main types) are now identified
token information stored during the .Scan method now includes the token length and the absolute token start and end position inside the scanned string, to make retrieving the exact position easy


PrintData "Token Data :" & MyParser.Token(Counter).Data
PrintData "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
PrintData "Token SubType :" & MyParser.Token(Counter).SubType
PrintData "Token PosStart:" & MyParser.Token(Counter).PosStart
PrintData "Token PosEnd :" & MyParser.Token(Counter).PosEnd
PrintData "Token Length :" & MyParser.Token(Counter).Len



If this version is OK, I will next work on implementing René's suggestions:

- Token absolute line inside scanned source code (absolute means real lines and not line continuation)
- Token relative position inside line (1, 2, 3, ...)



Full example showing new functionalities:

'---
'---Loads needed modules
'---
Uses "File"
Uses "Console"
Uses "Tokenizer"


Long cBrightCyanBGBlue = %CONSOLE_FOREGROUND_BLUE | %CONSOLE_FOREGROUND_GREEN | %CONSOLE_FOREGROUND_INTENSITY | %CONSOLE_BACKGROUND_BLUE
Long cBrightGreen = %CONSOLE_FOREGROUND_GREEN | %CONSOLE_FOREGROUND_INTENSITY
Long cBrightMagenta = %CCOLOR_FMAGENTA | %CONSOLE_FOREGROUND_INTENSITY
Long cBrightRed = %CONSOLE_FOREGROUND_RED | %CONSOLE_FOREGROUND_INTENSITY




'--------------------------------------------------------
'
'--------------------------------------------------------
Function TBMain() As Long
Long Counter


'---Create a timer
Local lTimer As cTimer
lTimer = New cTimer

'---Declare a new tokenizer engine
Dim MyParser As cTokenizer


'---Instantiate new tokenizer
MyParser = New cTokenizer()


PrintTitle "Configure Tokenizer"
'---Change some default behave in char parsing
'---Every single ASCII char can be associated to one of the predefined 6 char sets
'---A set of characters can be indicated in one go using .Char.Set. method followed by the type:
'--- NewLine, Space, Delim,Numeric, Alpha, DQuote
PrintStep "Setting character sets (if needed)"
MyParser.Char.Set.NewLine (":;")
MyParser.Char.Set.Space ($SPC & $TAB)
MyParser.Char.Set.Delim (",.-+*/")
MyParser.Char.Set.Numeric ("0123456789")
MyParser.Char.Set.Alpha ("$%ABCDabcd")
MyParser.Char.Set.DQuote ($DQ)

MyParser.Options.CaseSensitive = %FALSE
PrintStep "Tokenizer option, CaseSensitive = " & MyParser.Options.CaseSensitive


PrintTitle "Loading keys"
'---Create a new keywords group. NEVER USE YOUR OWN MANUAL KEY codes, always use this method to get a new one
Dim MyKeys As Long Value MyParser.NewMainType("MyKeys")


Dim sFile As String = APP_Path & "thinAir\Syntax\thinBasic\thinBasic_Keywords.ini"
Dim allKeywords() As String
Dim nKeys As Long

nKeys = Parse File sFile, allKeywords, $CRLF
PrintStep "Number of keys I'm going to load: " & nkeys
lTimer.Start
For Counter = 1 To UBound(allKeywords)
MyParser.Keys.Add(allKeywords(Counter), MyKeys)
Next
lTimer.Stop
PrintStep "Number of keys loaded: " & MyParser.Keys.Count
PrintStep "Loading time Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")



PrintTitle "Checking some keys"
PrintStep "If the following few lines will return a number, all is ok"
PrintData MyParser.Keys.Contain("Dim" )
PrintData MyParser.Keys.Contain("As" )
PrintData MyParser.Keys.Contain("PrintL" )
PrintData MyParser.Keys.Contain("Uses" )
PrintData MyParser.Keys.Contain("Zer")


PrintStep "Checking MainType and SubType of DIM key"
PrintData "DIM MainType: " & MyParser.Key("Dim").MainType
PrintData "DIM SubType : " & MyParser.Key("Dim").SubType


PrintTitle "A BIG search of 100K Keys"
PrintStep "Searching for 100K keys, ..."
lTimer.Start
For Counter = 1 To 100000
MyParser.Key("Dim").MainType
Next
lTimer.Stop
PrintStep "100K search time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")


Dim sStringToScan As String


sStringToScan = Repeat$(100, Load_File(APP_ScriptFullName))


PrintTitle "-Start Scanning 100 times current script source code---"
lTimer.Start
PrintStep "Number of Tokens during scan: " & MyParser.Scan(sStringToScan)
lTimer.Stop
PrintStep "Scan time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")




sStringToScan = "Result = Sin(1) - Cos(0):"
'"123456789012345678901234567890" ' (this just for orientation)
'sStringToScan = "THIS ARE" & $CRLF &
' "A FEW LOSE TOKENS"


PrintTitle "Scanning a line of text in order to see how to scan a new source and to access single tokens"
lTimer.Start
PrintStep "Number of Tokens during scan: " & MyParser.Scan(sStringToScan)
PrintStep "Number of Tokens count : " & MyParser.Tokens.Count
lTimer.Stop
PrintStep "Scan time in Sec: " & lTimer.ElapsedToString(%CTIMER_SECONDS, "#0.0000")

For Counter = 1 To MyParser.Tokens.Count
PrintData "Token Data :" & MyParser.Token(Counter).Data
PrintData "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
PrintData "Token SubType :" & MyParser.Token(Counter).SubType
PrintData "Token PosStart:" & MyParser.Token(Counter).PosStart
PrintData "Token PosEnd :" & MyParser.Token(Counter).PosEnd
PrintData "Token Length :" & MyParser.Token(Counter).Len
PrintData "-------------------------------------------"
Next


PrintWarning "All done. Press a key to finish"
WaitKey


End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintTitle(ByVal sTitle As String)
PrintL
PrintL "-" & sTitle & Repeat$(78 - Len(sTitle), "-") In cBrightCyanBGBlue
End Function

'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintStep(ByVal sStep As String)
PrintL " " & sStep In cBrightGreen
End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintData(ByVal sData As String)
PrintL " " & sData In cBrightMagenta
End Function


'--------------------------------------------------------
'
'--------------------------------------------------------
Function PrintWarning(ByVal sData As String)
PrintL
PrintL sData In cBrightRed
End Function

ReneMiner
25-11-2015, 11:42
for the in-line position, mostly it's only about whether the current token is the first token on the current line or not.

So the previous token was either a $CRLF or it's the very first token in the text to scan...

Nice: Error tells a position too, so we can recheck ourselves at that position if there's a single $DQ...

ErosOlmi
25-11-2015, 12:21
Will see what I can do, but pretty sure I can do almost all with cTokenizer module class :)
Using a module class lets me encapsulate all options under a single instance of the class without interference with the rest of the module.


I think I will change .PosStart and .PosEnd to the more descriptive .ByteStart and .ByteEnd

ReneMiner
25-11-2015, 12:31
...err, the Byte(-Start and also Line) - idea i had too but in another sense:



Long lStart ' should be in range 1 to Strptrlen(Strptr(sText))
Long lLine ' should be in range 1 to ParseCount(sText, $CRLF)

myParser.Scan(sText, Byte lStart) ' Byte option to tell the parser to start scanning at Byte lStart
myParser.Scan(sText, Line lLine ) ' Line option to scan desired line only



Have time for a very small idea?
check this:
http://www.thinbasic.com/community/showthread.php?12615-Array-Join&p=92320#post92320

Petr Schreiber
25-11-2015, 18:46
Eros,

thanks a lot for the explanations, now I understand, very practical!
I like the new functions, thanks a lot for them!

One wish - could ToString be added for SubType as well? There is a string specified during the NewSubType, so could be returned this way.


Petr

ErosOlmi
25-11-2015, 19:03
Petr,

actually there is no string we can specify for SubType information in the Tokenizer class.
A possible string is the Token itself.

But maybe there is something I do not get from your request.
If you can give me a real life example maybe I can better understand the request.

Ciao
Eros

Petr Schreiber
25-11-2015, 23:07
Buonasera,

my apologies for confusion. Take this line:


Dim MyKeys As Long Value MyParser.NewMainType("MyKeys")

...the generation of new main type takes "MyKeys" as parameter, and produces some integer.

If there would be some kind of internal dictionary, it could match back the integer to "MyKeys".
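
Such an internal dictionary could look roughly like this. The sketch is in Python rather than thinBasic; `MainTypes`, `new_main_type` and `to_string` are hypothetical names, and starting at 100 only follows the earlier remark in this thread that user values should be >= 100.

```python
class MainTypes:
    """Hands out main type numbers and remembers the description of each."""

    FIRST_USER_TYPE = 100   # assumption: values below 100 are reserved

    def __init__(self):
        self._names = {}
        self._next = self.FIRST_USER_TYPE

    def new_main_type(self, description=""):
        value = self._next
        self._next += 1
        self._names[value] = description
        return value

    def to_string(self, value):
        # Unknown values map to an empty string rather than raising.
        return self._names.get(value, "")


types = MainTypes()
my_keys = types.new_main_type("MyKeys")
operators = types.new_main_type("Operators")
name = types.to_string(my_keys)
```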

Just idea, low priority.


Petr