Bracket parsing



Petr Schreiber
17-12-2015, 20:50
Hi Eros,

I am trying to use the new Tokenizer for parsing JSON (https://en.wikipedia.org/wiki/JSON), but I am having a hard time defining []{} as my custom types - they keep being "delimiters". Any ideas?


#MINVERSION 1.9.16.6

Uses "Console", "Tokenizer"

String buffer = "[{""a"":1, ""b"":2},{""c"":3, ""d"":4}]"

Dim tokenizer As cTokenizer
tokenizer = New cTokenizer()

Long brackets = tokenizer.NewMainType("brackets")
tokenizer.Char.Set.Delim (",:")
tokenizer.Keys.Add("[", brackets, 10)
tokenizer.Keys.Add("]", brackets, 11)
tokenizer.Keys.Add("{", brackets, 20)
tokenizer.Keys.Add("}", brackets, 21)


tokenizer.Scan(buffer)

Long i
For i = 1 To tokenizer.Tokens.Count
  If tokenizer.Token(i).Data = "[" Then ' -- No idea why this IF passes even for non "[" btw
    PrintL "Data :" & tokenizer.Token(i).Data
    PrintL "MainType:" & tokenizer.Token(i).MainType & " (" & tokenizer.Token(i).MainType.ToString & ")"
    PrintL "SubType :" & tokenizer.Token(i).SubType
    PrintL
  End If
Next

WaitKey



Petr

ErosOlmi
17-12-2015, 22:12
Ciao Petr,

Attached is a new version of the thinBasic_Tokenizer.dll module that implements

<tokenizer>.Char.Set.SubType (sDelimiters, UserSubType)

I'm sorry, but due to how I implemented the recognition of delimiters, a standard delimiter like []{} cannot be used as a key.
But using something like

tokenizer.Char.Set.SubType ("[", 10)
you can associate a subtype number (from 1 to 255) with a single-character delimiter.
In this way you can recognize the special delimiters you want to keep track of among all other delimiters.


Regarding the IF not working ... it is something I knew would occur sooner or later, but so far I have no solution. It is a little tricky.
When the thinBasic Core engine encounters an IF ... or a SELECT ... it has to decide whether the next expression evaluates to a string or a numeric expression.
To achieve this it uses the so-called "look ahead" technique: the parser saves the current pointer and then looks ahead one or two tokens to try to understand whether the next statement evaluates to a string or a numeric expression. Then it goes back to the saved pointer and decides how to go on.

But now that we have dotted notation with multiple levels, understanding whether the next expression evaluates to a string or a number is not that simple.
The parser would have to go on for many tokens, and it would also need a way to interrogate a module class, asking whether such a sequence of tokens evaluates to a string or a number.
At the moment I have no way to do that. I know it is a BIG PROBLEM and I will go on trying to find a way.

A simple workaround is to add an empty string "". In this way thinBasic will find a quote and immediately understand that what follows is a string expression

If "" & tokenizer.Token(i).Data = "[" Then
or

If "[" = tokenizer.Token(i).Data Then

otherwise it will be a numeric expression that will evaluate into

If Val(tokenizer.Token(i).Data) = Val("[") Then
and at the end the above will give

If 0 = 0 Then
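
The effect described above can be reproduced with a small toy model. This is only a sketch in Python, not thinBasic's actual parser: the helper names `looks_like_string`, `val` and `compare` are hypothetical, the two-token window is taken from the "one or two tokens" in the explanation, and `val` is a rough stand-in for a BASIC-style Val() that returns 0 for text without leading digits.

```python
def looks_like_string(tokens, max_lookahead=2):
    """Peek at the first few tokens; True if a string literal shows up."""
    return any(tok.startswith('"') for tok in tokens[:max_lookahead])


def val(text):
    """Rough stand-in for a BASIC-style Val(): leading digits, else 0."""
    digits = ""
    for ch in text.strip():
        if ch.isdigit():
            digits += ch
        else:
            break
    return int(digits) if digits else 0


def compare(tokens, left_value, right_value):
    """Compare as strings or as numbers, depending on the look-ahead."""
    if looks_like_string(tokens):
        return left_value == right_value
    # Numeric path: both sides are coerced, so "{" and "[" both become 0.
    return val(left_value) == val(right_value)


# No quote within the first two tokens: numeric comparison, 0 = 0 passes.
print(compare(["tokenizer", ".", "Token"], "{", "["))
# Leading "" puts a quote in the window: string comparison, "{" <> "[".
print(compare(['""', "&", "tokenizer"], "{", "["))
```

In this model the dotted expression is too long for the look-ahead window, which is why putting a string literal first changes the outcome.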


Here is a complete example


#MINVERSION 1.9.16.6


Uses "Console", "Tokenizer"


String buffer = "[{""a"":1, ""b"":2},{""c"":3, ""d"":4}]"


Dim tokenizer As CTOKENIZER
tokenizer = New CTOKENIZER()


Long brackets = tokenizer.NewMainType("brackets")
tokenizer.Char.Set.Delim (",:")
'tokenizer.Keys.Add("[", brackets, 10)
'tokenizer.Keys.Add("]", brackets, 11)
'tokenizer.Keys.Add("{", brackets, 20)
'tokenizer.Keys.Add("}", brackets, 21)
tokenizer.Char.Set.SubType ("[", 10)
tokenizer.Char.Set.SubType ("]", 11)
tokenizer.Char.Set.SubType ("{", 20)
tokenizer.Char.Set.SubType ("}", 21)




tokenizer.Scan(buffer)


Long i
For i = 1 To tokenizer.Tokens.Count
  If tokenizer.Token(i).MainType = %TOKENIZER_DELIMITER Then
    If tokenizer.Token(i).SubType Then
      PrintL "Data :" & tokenizer.Token(i).Data
      PrintL "MainType:" & tokenizer.Token(i).MainType & " (" & tokenizer.Token(i).MainType.ToString & ")"
      PrintL "SubType :" & tokenizer.Token(i).SubType
      PrintL
    End If
  End If
Next


WaitKey

ErosOlmi
18-12-2015, 05:35
I had another idea but with some compromise.

You can force []{} to be Alphabetic and not delimiters.
In this way you can use Keys.

The compromise is that consecutive Alphabetic chars are considered a single token until a delimiter or a space is encountered.
So [{ or }] (or any other possible combination) is considered a single token, and you need to define all the possible keys.

Ciao
Eros




#MINVERSION 1.9.16.8


Uses "Console", "Tokenizer"


String buffer = "[{""a"":1, ""b"":2},{""c"":3, ""d"":4}]"


Dim tokenizer As CTOKENIZER
tokenizer = New CTOKENIZER()


tokenizer.Char.Set.Delim (",:")
tokenizer.Char.Set.Alpha ("[]{}") '---<<< Define as alphabetic


Long brackets = tokenizer.NewMainType("brackets")
tokenizer.Keys.Add("[", brackets, 10)
tokenizer.Keys.Add("]", brackets, 11)
tokenizer.Keys.Add("{", brackets, 20)
tokenizer.Keys.Add("}", brackets, 21)
tokenizer.Keys.Add("[{", brackets, 1020)
tokenizer.Keys.Add("}]", brackets, 2111)


tokenizer.Scan(buffer)


Long i
For i = 1 To tokenizer.Tokens.Count
  If tokenizer.Token(i).MainType = brackets Then
    PrintL "Data :" & tokenizer.Token(i).Data
    PrintL "MainType:" & tokenizer.Token(i).MainType & " (" & tokenizer.Token(i).MainType.ToString & ")"
    PrintL "SubType :" & tokenizer.Token(i).SubType
    PrintL
  End If
Next


WaitKey

ErosOlmi
18-12-2015, 06:30
Attached is another version of thinBasic_Tokenizer.dll.

I've added an Alpha_Single option to specify characters that are considered alphabetic but must be taken one at a time:

<tokenizer>.Char.Set.Alpha_Single ("[]{}")
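
The difference between grouped alphabetic characters and single ones can be illustrated with a toy model. This is a sketch in Python and not the module's implementation: `bracket_tokens` is a hypothetical helper that looks only at the bracket characters and ignores everything else, which is a simplification of whatever the real tokenizer does with the surrounding text.

```python
def bracket_tokens(text, alpha="", alpha_single=""):
    """Collect bracket tokens from text.

    Characters in `alpha` are glued into runs (one token per run);
    characters in `alpha_single` are always emitted one by one.
    Every other character only ends the current run.
    """
    tokens, run = [], ""
    for ch in text:
        if ch in alpha:
            run += ch
            continue
        if run:
            tokens.append(run)
            run = ""
        if ch in alpha_single:
            tokens.append(ch)
    if run:
        tokens.append(run)
    return tokens


buffer = '[{"a":1, "b":2},{"c":3, "d":4}]'

# Grouped: adjacent brackets merge, so "[{" and "}]" need their own keys.
print(bracket_tokens(buffer, alpha="[]{}"))
# Single: every bracket is its own token, four keys are enough.
print(bracket_tokens(buffer, alpha_single="[]{}"))
```

With grouping, the number of keys grows with every bracket combination the input can contain; taking the characters singly keeps the key list fixed.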

Let me know if it works in all cases.
Ciao
Eros

Example:


#MINVERSION 1.9.16.8


Uses "Console", "Tokenizer"


String buffer = "a[{""a"":1, ""b"":2},{""c"":3, ""d"":4}]"


Dim tokenizer As CTOKENIZER
tokenizer = New CTOKENIZER()


tokenizer.Char.Set.Delim (",:")
tokenizer.Char.Set.Alpha_Single ("[]{}")


Long brackets = tokenizer.NewMainType("brackets")
tokenizer.Keys.Add("[", brackets, 10)
tokenizer.Keys.Add("]", brackets, 11)
tokenizer.Keys.Add("{", brackets, 20)
tokenizer.Keys.Add("}", brackets, 21)
'tokenizer.Keys.Add("[{", brackets, 1020)
'tokenizer.Keys.Add("}]", brackets, 2111)


tokenizer.Scan(buffer)


Long i
For i = 1 To tokenizer.Tokens.Count
  If tokenizer.Token(i).MainType = brackets Then
    PrintL "Data :" & tokenizer.Token(i).Data
    PrintL "MainType:" & tokenizer.Token(i).MainType & " (" & tokenizer.Token(i).MainType.ToString & ")"
    PrintL "SubType :" & tokenizer.Token(i).SubType
    PrintL
  End If
Next


WaitKey

Petr Schreiber
18-12-2015, 21:34
You are a saint, thank you Eros :)

So far, so good. Will let you know, if I find anything :)


Petr

ReneMiner
19-06-2022, 21:47
More than 700 of the 3000 pages in the help look like this



Navigation: ThinBASIC Modules > Tokenizer > Tokenizer Module Classes > cTokenizer >
<cTokenizer>.Token




Enter topic text here.

Approximately 300 pages, more or less, show something like





For more infomation about xyz see

MSDN: http://msdn.microsoft.com/en-us/library/ms645524(VS.85).aspx

Not to mention that these links do not lead to the desired information.

On my hard drive there are more than 1800 (!!!) scripts that I wrote myself in the past 3 years.
Probably 60 of them are working. That's about 3.3%.

75% of the scripts report bugs that are actually none: all If/EndIf, Select Case/End Select, Function/End Function, Do and Loop pairs are checked, the parenthesis counts are correct and were double-checked using 2 different programs that parse them char by char.

The interpreter was in much better shape in the years 2014 to 2016 - nesting of Loop, For, While and Repeat worked correctly down to
an unbelievable depth - today a script with 2 or 3 nested loops barely gets to the point where the innermost of 5 loops should start. But


The most frequent cause is a default value on optional parameters. This has been broken for more than 2 1/2 years - workarounds using Function_CParam are not reliable

- when variants are involved Function_CParam fails.

Also when numerals with a value of 0 or strings with the content of "" are passed - the count is incorrect.

If a script contains large blocks of comments or multiline strings, thinAir reports many errors that are not there.

For simple tasks such as counting matching parentheses, to eliminate them as the very first major source of bugs (especially parsing bugs, because the interpreter understands that something is wrong but does not check for the correct reason), no function is provided.
Yet that is the second most common reason why scripts do not run and wrong errors are diagnosed, such as a missing If for EndIf or an invalid Type declaration inside a Sub, even though the script does not contain a single Sub.
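
A check of this kind can be done outside the interpreter. The following is a minimal sketch in Python, not a thinBasic function: `check_brackets` is a hypothetical helper, and it assumes that `"` delimits string literals (with `""` as an escaped quote) and that `'` starts a comment running to the end of the line, as in the code posted earlier in this thread. Other comment styles and multiline strings are not handled.

```python
def check_brackets(source):
    """Return None if (), [] and {} are balanced, else (line, col, char)
    of the first offending closer or the innermost unclosed opener."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        in_string = False
        for col, ch in enumerate(line, start=1):
            if ch == '"':
                in_string = not in_string  # "" toggles twice, so it is skipped
            elif in_string:
                continue
            elif ch == "'":
                break  # rest of the line is a comment
            elif ch in "([{":
                stack.append((ch, line_no, col))
            elif ch in pairs:
                if not stack or stack[-1][0] != pairs[ch]:
                    return (line_no, col, ch)
                stack.pop()
    if stack:
        ch, line_no, col = stack[-1]
        return (line_no, col, ch)
    return None


print(check_brackets("If All(True, True, True) Then"))
print(check_brackets("x = Foo(1, (2)"))
```

Reporting the line and column, rather than only a count, makes it easier to rule out brackets before looking for other causes.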

A line that starts with IF, such as


If All(True, True, True)

Else

EndIf


is a real killer. The missing "Then" is not detected - the script gets started ... the consequences... try it 3, 4 or 5 times...

In other places it detects "End Function" where something else was expected, yet there are only commented lines there - not even
the keyword END anywhere in the text.

The third most common unresolved bug that crashes thinAir is a faulty error report where scriptname.tbasic.lastruntimeError.ini has a size of 300 to 400 kB. Other editors also crash when trying to load it, but I just let it check the file size and erase the file if it is more than 5 kB - in thinAir that mechanism is hidden, and after a crash it is impossible to intervene before thinAir kills itself.

Also that graph thing: it's the most useless feature to add at all. The cost in performance is not worth the interesting look without a purpose. It tells nothing except how a script can be slowed down.


And now the Tokenizer: I assigned the keyword STOP, not case-sensitive, as a member of the custom group "TERMINATE" with
a user-defined ID of -1.
MainType.ToString should return "TERMINATE", but what does it do?
It gets an ID of 12 and returns the name of an equate! How does that get in there?

I do not understand this language any more :(


Ready...>stop


number of tokens : 1
Data :stop
MainType:12 (%TOKENIZER_STRING)
SubType :0
Function to call : Action_%TOKENIZER_STRING
Ready...>

Petr Schreiber
04-07-2022, 20:04
Hi Rene,

I am sorry the docs are not complete - we are working on them piece by piece whenever our jobs allow, the volume is huge.

I can confirm the cTokenizer docs need improvement. I will consider it as the next step once I finish documenting the cAppLog interface.


Petr