Can Parse or Split not count 'empty' tokens somehow?
EmbeddedMan
07-09-2016, 18:50
If you run
Uses "Console"
Dim x As Integer
Dim ar() As String
x = Parse("A     B", ar, Any " " + Chr$(9))
PrintL "Found " & Str$(x) & " tokens"
you will get
Found 6 tokens
This is because there are 5 spaces between the A and the B.
Is there any way for me to parse a string, and get an array of tokens out that do NOT include repeated delimiters? I'd like to just get 2 tokens out, "A" and "B", no matter what combination of whitespaces there are between them.
It feels to me like Parse and maybe Split should work this way (i.e. ignore multiple sequential occurrences of the delimiter characters) when the 'ANY' keyword is used, but they don't.
I'm just trying to parse a line from a file which contains several non-whitespace tokens, but may have complex whitespace between tokens (arbitrary combinations of spaces and tabs). How do you parse this type of line in ThinBasic to just extract the tokens, treating any sequence of multiple whitespace characters as a single delimiter?
Any ideas?
*Brian
ErosOlmi
07-09-2016, 22:32
Hi Brian,
The Parse function family is quite limited and works well only on structured data.
I suggest using the Tokenizer module; some examples can be found in the \SampleScripts\Tokenizer\ directory.
The Tokenizer module takes advantage of some of the techniques I use to parse thinBasic sources.
Help: http://www.thinbasic.com/public/products/thinBasic/help/html/index.html?tokenizer_equates.htm
Here is an example using the standard tokenizer. From this example you can create your own function tailored to your needs, for example one that fills an array with the tokens you need:
uses "Console"
uses "Tokenizer"
'---
'---Declare needed variables
'---
dim MyBuffer as string '---Will contain the string buffer to be parsed
dim CurrentPosition as long '---Current buffer pointer position
dim TokenType as long '---Will contain the current token type
dim Token as string '---Will contain the current token as a string
dim nTokens as quad '---Counts the number of tokens found
'---Fill the string buffer with some test text
MyBuffer = "A B C, d" & $tab & "(Hi there) ""I'm a string"""
'---Init current buffer position. THIS IS IMPORTANT
CurrentPosition = 1
'---Loop until the token type signals the end of the buffer
while TokenType <> %TOKENIZER_FINISHED
  '---Here we are. The most important point is that every parameter passed
  ' must be a single variable and not an expression. This is necessary because
  ' parameters are passed by reference in order to return information about the token
  '---
  ' MyBuffer must contain the string you want to parse
  ' CurrentPosition must be initialized to 1. After execution this parameter will contain
  ' the current position just after the current token
  ' TokenType on exit, it will contain the type of token found
  ' Token on exit, it will contain the string representation of the token found
  Tokenizer_GetNextToken(MyBuffer, CurrentPosition, TokenType, Token)
  '---Count the number of tokens found
  incr nTokens
  '---Write some info
  printl "Token " & nTokens & ": " & Token & " (type: " & TokenType & ")"
wend
'---Give results
printl repeat$(70, "-")
printl "Number of Tokens found: " & nTokens
waitkey
Inside \SampleScripts\Tokenizer\ you will also find an example using the cTokenizer class, which in some respects is easier and more powerful: Tokenizer_UsingModuleClass.tbasic
Let me know if it works for you or if you need more functionality.
And a great thanks for what you know.
But ... you are exaggerating :)
EmbeddedMan
07-09-2016, 23:29
Very nice. The Tokenizer does what I need it to, although it is less straightforward than the others. That's OK, it's much more powerful and generic. Thanks - my program is working very well now!
*Brian
ErosOlmi
07-09-2016, 23:45
This is a quick and dirty example using the Tokenizer class cTokenizer.
The only thing you need to do is define a New cTokenizer variable and use the .Scan method to automatically scan the string.
Then you can use FOR/NEXT to get all the token data and info.
Maybe in the future I can develop a Tokens-to-Array function that scans a string and fills an array with all the tokens found.
uses "Console"
uses "Tokenizer"
dim MyParser As new CTOKENIZER
Dim sStringToScan as string
dim Counter as long
'---Tokenize first string
sStringToScan = "A B C, d" & $tab & "(Hi there) ""I'm a string"""
printl "Number of Tokens found during scan : " & MyParser.Scan(sStringToScan)
printl "Number of Tokens count with Count method: " & MyParser.Tokens.Count
For Counter = 1 To MyParser.Tokens.Count
printl "Token Data :" & MyParser.Token(Counter).Data
printl "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
printl "Token SubType :" & MyParser.Token(Counter).SubType
printl "Token PosStart:" & MyParser.Token(Counter).PosStart
printl "Token PosEnd :" & MyParser.Token(Counter).PosEnd
printl "Token Length :" & MyParser.Token(Counter).Len
printl "-------------------------------------------"
Next
'---Tokenize another string
sStringToScan = "1 2 3 4"
printl "Number of Tokens found during scan : " & MyParser.Scan(sStringToScan)
printl "Number of Tokens count with Count method: " & MyParser.Tokens.Count
For Counter = 1 To MyParser.Tokens.Count
printl "Token Data :" & MyParser.Token(Counter).Data
printl "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
printl "Token SubType :" & MyParser.Token(Counter).SubType
printl "Token PosStart:" & MyParser.Token(Counter).PosStart
printl "Token PosEnd :" & MyParser.Token(Counter).PosEnd
printl "Token Length :" & MyParser.Token(Counter).Len
printl "-------------------------------------------"
Next
waitkey
There are many other options you can apply to the class before scanning, to classify any special tokens you want to recognize.
Check the example in \thinBasic\SampleScripts\Tokenizer\Tokenizer_UsingModuleClass.tbasic
EmbeddedMan
07-09-2016, 23:53
I wonder if it could simply be an alternate form of Parse or Split. They are _so_close_ to what I need to do - it's only in how they handle delimiters that would need to change.
*Brian
ErosOlmi
08-09-2016, 00:11
I will think of something, but not in the short term.
ANY in the PARSE functions means that any character in the delimiter string is considered a possible delimiter.
What you need instead is for any run of consecutive delimiters to be interpreted as one delimiter.
This is contrary to the idea of PARSE, which is usually used for situations like CSV files, where "1,,3,," means 5 columns.
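The CSV behaviour described here can be checked with a minimal sketch like the following. It only uses the Parse call form already shown in this thread; the variable names sCols and nCols are made up for illustration, and going by the description above it should report 5 fields.

```thinbasic
Uses "Console"

Dim nCols As Long
Dim sCols() As String

'---Single delimiter, no ANY: every comma separates two fields,
'   so the empty fields between consecutive commas are kept
nCols = Parse("1,,3,,", sCols, ",")
PrintL "Fields found: " & Str$(nCols)

WaitKey
```

Keeping the empty fields is what lets column positions stay meaningful in structured data, which is why skipping them would have to be an opt-in behaviour.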
I need to rewrite the PARSE functions almost completely. That is not a problem, but I need some time.
I will let you know.
Eros
ErosOlmi
08-09-2016, 20:24
Dear Brian,
can you please attach a file showing some of the strings you want to parse?
I'm working on a function that does what you asked, but I need a thorough real-world test.
Thanks a lot
Eros
ReneMiner
09-09-2016, 18:23
One idea for parsing without returning empty elements: simply use Replace$ on multiple occurrences of the delimiter, as in
sData = Replace$(sData, sDelimiter & sDelimiter, With sDelimiter)
repeated until the string stops getting shorter, and then parse the data.
The example below contains two functions for parsing, since I cannot use the ANY keyword as a thinBasic function parameter.
To parse without the ANY option, use Parse_2.
To parse with the ANY option, use Parse_2_ANY.
' #Filename "test_ParseMultipleDelimiters.tBasic"
Uses "Console"
PrintL "first run:" In 10
test "A     B" ' string to parse
PrintL "second run:" In 10
test "A " & $CRLF & " B" ' string to parse
PrintL
' ' uncomment for testing error
' PrintL "now test Error:"
' Dim s() As String
' Parse_2 "A B", s, ""
PrintL Repeat$(30, "-") & "> key to end " In 42
WaitKey
' ----------------------------------------------------------------------------
Function test( ByVal sTest As String )
' ----------------------------------------------------------------------------
' sTest the string to be parsed
' used delimiters here: $SPC & $CRLF
Local x As Long
Local ar() As String
' call parse, using the thinCore-function:
PrintL
PrintL "1. PARSE " & $DQ & sTest & $DQ & ", Any $SPC & $CRLF" In 15
x = Parse sTest, ar, Any $SPC & $CRLF
Print_Results x, ar
' -----------------------------------------------
' call parse_2, does not allow ANY
PrintL
PrintL "2. Parse_2 " & $DQ & sTest & $DQ & ", $SPC (no ANY-option available)" In 15
x = Parse_2 sTest, ar, $SPC
Print_Results x, ar
' -----------------------------------------------
' call parse_2_ANY
PrintL
PrintL "3. Parse_2_ANY " & $DQ & sTest & $DQ & ", $SPC & $CRLF" In 15
x = Parse_2_ANY sTest, ar, $SPC & $CRLF
Print_Results x, ar
End Function
' ----------------------------------------------------------------------------
Function Parse_2( ByVal sData As String, _
ByRef sResult() As String, _
ByVal sDelimiter As String _
) As Long
' ----------------------------------------------------------------------------
' this will parse sData delimited by sDelimiter
' into sResult without empty elements, no ANY-option:
' returns count of parsed elements
Local lenS As Long
If sDelimiter = "" Then
MsgBox 0, "Can not parse by """"", %MB_OK Or %MB_ICONERROR, "Parse_2: Invalid delimiter"
Stop
EndIf
Do
lenS = StrPtrLen(StrPtr(sData))
sData = Replace$(sData, sDelimiter & sDelimiter, With sDelimiter)
Loop Until StrPtrLen(StrPtr(sData)) = lenS
Function = Parse sData, sResult, sDelimiter
End Function
' ----------------------------------------------------------------------------
Function Parse_2_ANY( ByVal sData As String, _
ByRef sResult() As String, _
ByVal sDelimiter As String _
) As Long
' ----------------------------------------------------------------------------
' this will parse sData delimited by ANY sDelimiter
' into sResult without empty elements, with ANY-Option:
' returns count of parsed elements
Local lenS As Long
If sDelimiter = "" Then
MsgBox 0, "Can not parse by """"", %MB_OK Or %MB_ICONERROR, "Parse_2_ANY: Invalid delimiter"
Stop
EndIf
sData = Replace$( sData, Any sDelimiter, With ">thinBASIC_rules!<" )
Do
lenS = StrPtrLen(StrPtr(sData))
sData = Replace$(sData, ">thinBASIC_rules!<>thinBASIC_rules!<", With ">thinBASIC_rules!<")
Loop Until StrPtrLen(StrPtr(sData)) = lenS
Function = Parse sData, sResult, ">thinBASIC_rules!<"
End Function
' ----------------------------------------------------------------------------
Function Print_Results(ByVal x As Long, _
ByVal ar() As String)
' ----------------------------------------------------------------------------
' to have a uniform printout on all results
Local i As Long
PrintL
PrintL "found " & Str$(x) & " tokens:" In 11
PrintL
If x Then
For i = 1 To x
PrintL i, $DQ & ar(i) & $DQ In 14
Next
EndIf
PrintL
PrintL Repeat$(30, "-") & "> key to continue" In 28
WaitKey
PrintL $CRLF
End Function
Perhaps the syntax for Parse could be extended with another optional switch, from the current
nItems = Parse ( [File] sMainString, sArray, [Any] sDelimiter [, [Any] nFieldsDelim [, nMaxRowToCheckForNField]] )
to something like
nItems = Parse ( [File] sMainString, sArray [When Not ""], [Any] sDelimiter [, [Any] nFieldsDelim [, nMaxRowToCheckForNField]])
?
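Until a switch like that exists, much the same effect can be had by filtering the array after a normal Parse call. The following is only a sketch under assumptions: it is not an existing thinBasic feature, the names nAll and nKept are made up for illustration, and it assumes the array filled by Parse is indexed from 1, as in the Print_Results loop above.

```thinbasic
Uses "Console"

Dim ar() As String
Dim nAll As Long
Dim nKept As Long
Dim i As Long

'---Parse as usual: runs of delimiters produce empty elements
nAll = Parse("A     B", ar, Any " " + Chr$(9))

'---Compact the array in place, keeping only non-empty elements
For i = 1 To nAll
  If ar(i) <> "" Then
    nKept = nKept + 1
    ar(nKept) = ar(i)
  EndIf
Next

'---Only elements 1 to nKept are meaningful after compaction
PrintL "Kept " & Str$(nKept) & " of " & Str$(nAll) & " tokens"
For i = 1 To nKept
  PrintL $DQ & ar(i) & $DQ
Next

WaitKey
```

Unlike the Replace$ approach, this does not need a placeholder string that might also occur in the data, and it drops empty elements caused by leading or trailing delimiters as well.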