Is there any existing code that tokenizes a single line?



TBQuerier
18-08-2013, 21:17
Hi,

I've been all over the web and tried code from several Basic dialects, but I can't find any simple code that takes a single line/string (not a file), which might contain quoted strings, and extracts either symbolic tokens or just the elements. Bint32 was mentioned several times, but I haven't been able to locate its source.


The type of code I'm looking for doesn't require while...wend or if...endif support. It would just need to be able to extract, e.g.,


from
sVar := "whats happening?" (or sVar = "whats happening?" or sVar="whats happening?")

into
{
"sVar",
":=",
"whats happening?"
}


Is there anything like that around? One of the problems with trying out each mention of 'tokenizer' or 'lexer' on the web is that you never know ahead of time whether it will handle quoted strings.


Sorry if this is off-topic or already addressed, but I really have searched a lot over the past month.

Petr Schreiber
19-08-2013, 09:10
Hi TBQuerier,

I think the ThinBASIC Tokenizer module can do what you ask for. Have a look at this basic example (derived from code by Eros):



Uses "Console", "Tokenizer"

Function TBMain()

  String MyBuffer           ' -- Will contain string buffer to be parsed
  Long   CurrentPosition    ' -- Current buffer pointer position
  Long   TokenMainType      ' -- Will contain current token main type
  String Token              ' -- Will contain current string token
  Long   TokenSubType       ' -- Will contain current token sub type

  ' -- Parser tuning
  %CustomKeywords       = 100
  %CustomKeyword_Var    = 1
  %CustomKeyword_String = 2
  Tokenizer_KeyAdd("VAR"   , %CustomKeywords, %CustomKeyword_Var)
  Tokenizer_KeyAdd("STRING", %CustomKeywords, %CustomKeyword_String)

  ' -- Treat ";" as a statement separator (new line)
  Tokenizer_Default_Set(";", %TOKENIZER_DEFAULT_NEWLINE)

  ' -- Prepare text for parsing
  MyBuffer = "var sVar : string;" +
             "sVar := ""whats happening"""

  ' -- Init current buffer position. THIS IS IMPORTANT
  CurrentPosition = 1

  ' -- Loops until token is end of buffer
  While TokenMainType <> %TOKENIZER_FINISHED

    ' -- Here we are. Most important point here is that all passed parameters
    '    must be a single variable and not an expression. This is necessary because
    '    parameters are passed by reference in order to return information about the token
    ' --
    '    MyBuffer         must contain the string you want to parse
    '    CurrentPosition  must be initialized to 1. After execution this parameter will contain
    '                     the current position just after the current token
    '    TokenMainType    on exit, it will contain the main type of the token found
    '    Token            on exit, it will contain the string representation of the token found
    '    TokenSubType     on exit, it will contain the sub type of the token found (if relevant)
    ' --
    Tokenizer_GetNextToken(MyBuffer, CurrentPosition, TokenMainType, Token, TokenSubType)

    ' -- Write some info
    PrintL LSet$(Token, 32) + DecodeType_ToString(TokenMainType, TokenSubType)

  Wend

  PrintL "Press any key to quit..."
  WaitKey

End Function

' -- Translates token main type (and sub type for custom keywords) to readable text
Function DecodeType_ToString( nType As Long, nSubType As Long ) As String
  String sResult

  Select Case nType
    Case %TOKENIZER_FINISHED
      Return "Tokenizer finished..."

    Case %TOKENIZER_ERROR
      sResult = "Error"

    Case %TOKENIZER_UNDEFTOK
      sResult = "Undefined token"

    Case %TOKENIZER_EOL
      sResult = "End of line"

    Case %TOKENIZER_DELIMITER
      sResult = "Delimiter"

    Case %TOKENIZER_NUMBER
      sResult = "Number"

    Case %TOKENIZER_STRING
      sResult = "String"

    Case %TOKENIZER_QUOTE
      sResult = "Quoted"

    Case %CustomKeywords
      sResult = "Custom keyword / " + Choose$(nSubType, "%CustomKeyword_Var", "%CustomKeyword_String")
  End Select

  Return sResult

End Function


You can check ThinBasic/SampleScripts/Tokenizer for 2 more examples.


Petr

ErosOlmi
19-08-2013, 10:29
Hi TBQuerier and welcome to the thinBasic community forum.

Petr, thanks a lot for the example.
It should give TBQuerier an idea of the Tokenizer module.

The Tokenizer module has been designed to be general enough to adapt to any string data, in order to identify tokens and transform them into something else.
I developed it because I too was quite frustrated searching for a simple tokenizer able to do the job without much effort spent writing grammars or things like that.

TBQuerier
19-08-2013, 12:37
Perfect, thanks Petr and Eros, I'll play around with it this week.

TBQuerier
19-08-2013, 16:14
Thanks for the foresight, Eros. We're allowed to use any of the ThinBasic DLLs with apps from other Basic dialects (Power, Pure, Oxygen, etc.). Is that correct?

ErosOlmi
19-08-2013, 16:45
In theory yes, I have no problem with that.

In practice, none of the thinBasic modules (almost all the DLLs in the \thinBasic\Lib\ directory, for example thinBasic_Tokenizer.dll) can be used without the main thinBasic core DLL, thinCore.dll, because all modules call special parsing functions that live in thinCore.dll.

So, in short, you cannot just call functions inside the modules you need as you would with a standard DLL. You need to run a thinBasic script or embed thinCore.dll as an internal scripting language in another application.
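
For anyone who needs the single-line case from the first post outside thinBasic, a hand-written scanner is fairly small. The following is only a rough sketch in Python, chosen as a neutral language that should port easily to most Basic dialects. It is not the algorithm used by the Tokenizer module. The function name tokenize_line, the operator list, and the rule that a doubled quote ("") inside a string stands for one literal quote are all assumptions made for illustration.

```python
# Minimal single-line tokenizer sketch (illustrative only; NOT the thinBasic
# Tokenizer module's algorithm). Assumptions: strings use double quotes, a
# doubled quote ("") inside a string means one literal quote, and OPERATORS
# is just an example set. Longer operators are listed before their prefixes.
OPERATORS = (":=", "<>", "<=", ">=", "=", "<", ">", "+", "-", "*", "/",
             "(", ")", ",", ":", ";")


def tokenize_line(line):
    """Return a list of (kind, text) pairs: kind is 'word', 'op' or 'string'."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch == '"':
            i += 1  # skip the opening quote
            chars = []
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        chars.append('"')  # doubled quote -> literal quote
                        i += 2
                        continue
                    break  # closing quote found
                chars.append(line[i])
                i += 1
            else:
                # Loop ended without hitting a closing quote
                raise ValueError("unterminated string")
            i += 1  # skip the closing quote
            tokens.append(("string", "".join(chars)))
        else:
            for op in OPERATORS:
                if line.startswith(op, i):
                    tokens.append(("op", op))
                    i += len(op)
                    break
            else:
                # A word runs until whitespace, a quote, or the start of an operator
                j = i
                while (j < n and not line[j].isspace() and line[j] != '"'
                       and not any(line.startswith(op, j) for op in OPERATORS)):
                    j += 1
                tokens.append(("word", line[i:j]))
                i = j
    return tokens


# Just the elements, as asked for in the first post
print([text for _, text in tokenize_line('sVar := "whats happening?"')])
# prints ['sVar', ':=', 'whats happening?']
```

Keeping the kind next to the text is a deliberate choice here: once the surrounding quotes are stripped, it is the only way to tell a quoted string from a bare word. The same three branches (whitespace, quote, operator or word) map directly onto a Select Case or If chain in a Basic dialect.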

TBQuerier
19-08-2013, 16:54
Ok, thanks Eros.