Can Parse or Split not count 'empty' tokens somehow?



EmbeddedMan
07-09-2016, 18:50
If you run



Uses "Console"
Dim x As Integer
Dim ar() As String

x = Parse("A     B", ar, Any " " + Chr$(9))
PrintL "Found " & Str$(x) & " tokens"


you will get


Found 6 tokens


This is because there are 5 spaces between the A and the B.
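
To make the empty elements visible, the same call can be followed by a loop that prints each array element in quotes. This is only a minimal sketch reusing the calls shown above; the variable names nFound and idx are illustrative, and the input is assumed to contain exactly 5 spaces.

```thinbasic
Uses "Console"

Dim nFound As Long
Dim idx As Long
Dim ar() As String

'---Same call as above: every single space or tab counts as one delimiter
nFound = Parse("A     B", ar, Any " " + Chr$(9))

'---Print each element in quotes so that empty tokens show up as ""
For idx = 1 To nFound
  PrintL Str$(idx) & ": " & $DQ & ar(idx) & $DQ
Next

WaitKey
```

The elements between "A" and "B" should come out as empty strings, one for each gap between adjacent delimiters.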

Is there any way for me to parse a string, and get an array of tokens out that do NOT include repeated delimiters? I'd like to just get 2 tokens out, "A" and "B", no matter what combination of whitespaces there are between them.

It feels to me like Parse and maybe Split should work this way (i.e. ignore multiple sequential occurrences of the delimiter characters) when the 'ANY' keyword is used, but they don't.

I'm just trying to parse a line from a file, which contains several non-whitespace tokens, but may have complex whitespace between tokens (arbitrary combinations of spaces and tabs). How do you parse this type of line in ThinBasic to just extract the tokens, treating any sequence of multiple whitespace characters as a single delimiter?

Any ideas?

*Brian

ErosOlmi
07-09-2016, 22:32
Hi Brian,

The Parse function family is quite limited and works well only on structured data.
I suggest using the Tokenizer module; some examples can be found in the \Samplescripts\Tokenizer\ directory.
The Tokenizer module takes advantage of some of the techniques I use to parse thinBasic sources.

Help: http://www.thinbasic.com/public/products/thinBasic/help/html/index.html?tokenizer_equates.htm

Here is an example using the standard tokenizer. From this example you can create your own function tailored to your needs, for example one that fills an array with just the tokens you need:

uses "Console"
uses "Tokenizer"


'---
'---Declare needed variables
'---
dim MyBuffer as string '---Will contain the string buffer to be parsed
dim CurrentPosition as long '---Current buffer pointer position
dim TokenType as long '---Will contain the current token type
dim Token as string '---Will contain the current string token


dim nTokens as quad '---In order to count number of tokens found


'---Fill the string buffer with some test data
MyBuffer = "A B C, d" & $tab & "(Hi there) ""I'm a string"""


'---Init current buffer position. THIS IS IMPORTANT
CurrentPosition = 1


'---Loops until token is end of buffer
while TokenType <> %TOKENIZER_FINISHED


'---Here we are. The most important point is that every parameter passed
' must be a single variable and not an expression. This is necessary because
' parameters are passed by reference in order to return information about the token
'---
' MyBuffer must contain the string you want to parse
' CurrentPosition must be initialized to 1. After execution this parameter will contain
' the current position just after the current token
' TokenType on exit, it will contain the type of token found
' Token on exit, it will contain the string representation of the token found
Tokenizer_GetNextToken(MyBuffer, CurrentPosition, TokenType, Token)


'---In order to count number of tokens found
incr nTokens


'---Write some info
printl "Token " & nTokens & ": " & Token & " (type: " & TokenType & ")"

wend

'---Give results
printl repeat$(70, "-")
printl "Number of Tokens found: " & nTokens


waitkey





Inside \Samplescripts\Tokenizer\ you will also find an example using the cTokenizer class, which in some respects is easier and more powerful: Tokenizer_UsingModuleClass.tbasic

Let me know if it works for you or you need more functionalities.

And a great thanks for what you know.
But ... you are exaggerating :)

EmbeddedMan
07-09-2016, 23:29
Very nice. The Tokenizer does what I need it to, although it is less straightforward than the others. That's OK, it's much more powerful and generic. Thanks - my program is working very well now!

*Brian

ErosOlmi
07-09-2016, 23:45
This is a quick and dirty example using Tokenizer class cTokenizer.
The only thing you need to do is define a New cTokenizer variable and use the .Scan method to automatically scan the string.
Then you can use FOR/NEXT to get all the token data and info.

Maybe in the future I can develop a "Tokens to Array" function that scans a string and fills an array with all the tokens found.


uses "Console"
uses "Tokenizer"


dim MyParser As new CTOKENIZER
Dim sStringToScan as string
dim Counter as long


'---Tokenize first string
sStringToScan = "A B C, d" & $tab & "(Hi there) ""I'm a string"""
printl "Number of Tokens found during scan : " & MyParser.Scan(sStringToScan)
printl "Number of Tokens count with Count method: " & MyParser.Tokens.Count

For Counter = 1 To MyParser.Tokens.Count
printl "Token Data :" & MyParser.Token(Counter).Data
printl "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
printl "Token SubType :" & MyParser.Token(Counter).SubType
printl "Token PosStart:" & MyParser.Token(Counter).PosStart
printl "Token PosEnd :" & MyParser.Token(Counter).PosEnd
printl "Token Length :" & MyParser.Token(Counter).Len
printl "-------------------------------------------"
Next




'---Tokenize another string
sStringToScan = "1 2 3 4"
printl "Number of Tokens found during scan : " & MyParser.Scan(sStringToScan)
printl "Number of Tokens count with Count method: " & MyParser.Tokens.Count

For Counter = 1 To MyParser.Tokens.Count
printl "Token Data :" & MyParser.Token(Counter).Data
printl "Token MainType:" & MyParser.Token(Counter).MainType & " (" & MyParser.Token(Counter).MainType.ToString & ")"
printl "Token SubType :" & MyParser.Token(Counter).SubType
printl "Token PosStart:" & MyParser.Token(Counter).PosStart
printl "Token PosEnd :" & MyParser.Token(Counter).PosEnd
printl "Token Length :" & MyParser.Token(Counter).Len
printl "-------------------------------------------"
Next


waitkey




There are many other options you can apply to the class before scanning, in order to classify special tokens you want to recognize.
Check the example in \thinBasic\SampleScripts\Tokenizer\Tokenizer_UsingModuleClass.tbasic

EmbeddedMan
07-09-2016, 23:53
I wonder if it could simply be an alternate form of Parse or Split. They are _so_close_ to what I need to do - it's only in how they handle delimiters that would need to change.

*Brian

ErosOlmi
08-09-2016, 00:11
I will think of something, but not in the short term.

ANY in the PARSE functions means that every character in the delimiter string is treated as a possible delimiter.
What you need instead is for any run of consecutive delimiters to be treated as a single delimiter.
This is contrary to the idea of PARSE, which is usually used for situations like CSV files, where "1,,3,," means 5 columns.
I would need to almost completely rewrite the PARSE functions. That is not a problem, but I need some time.
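
The CSV case mentioned above can be checked with a few lines. This is a minimal sketch that only reuses the Parse call form already shown in this thread; the variable names nCols and cols are illustrative.

```thinbasic
Uses "Console"

Dim nCols As Long
Dim cols() As String

'---Single-character delimiter, no ANY: empty fields are kept on purpose,
'---because in a CSV row an empty field is still a column
nCols = Parse("1,,3,,", cols, ",")

PrintL "Found " & Str$(nCols) & " columns"

WaitKey
```

According to the description above, the count reported should be 5, since the empty fields between and after the commas are kept as columns.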

I will let you know.
Eros

ErosOlmi
08-09-2016, 20:24
Dear Brian,

can you please attach a file showing some of the strings you want to parse?
I'm working on a function that does what you asked, but I need to test it thoroughly on real data.

Thanks a lot
Eros

ReneMiner
09-09-2016, 18:23
An idea for parsing without returning empty elements: simply use Replace$ on multiple occurrences of the delimiter, as in


sData = Replace$(sData, sDelimiter & sDelimiter, With sDelimiter)

and thereafter parse the data.


The example below contains 2 parse functions, since I cannot use the ANY keyword as a thinBasic function parameter.

To parse without the ANY option use Parse_2
To parse with the ANY option use Parse_2_ANY



' #Filename "test_ParseMultipleDelimiters.tBasic"


Uses "Console"

PrintL "first run:" In 10

test "A     B" ' string to parse


PrintL "second run:" In 10

test "A " & $CRLF & " B" ' string to parse

PrintL

' ' uncomment for testing error

' PrintL "now test Error:"
' Dim s() As String
' Parse_2 "A B", s, ""


PrintL Repeat$(30, "-") & "> key to end " In 42
WaitKey


' ----------------------------------------------------------------------------
Function test( ByVal sTest As String )
' ----------------------------------------------------------------------------
' sTest the string to be parsed

' used delimiters here: $SPC & $CRLF

Local x As Long
Local ar() As String

' call parse, using the thinCore-function:
PrintL
PrintL "1. PARSE " & $DQ & sTest & $DQ & ", Any $SPC & $CRLF" In 15

x = Parse sTest, ar, Any $SPC & $CRLF

Print_Results x, ar

' -----------------------------------------------

' call parse_2, does not allow ANY
PrintL
PrintL "2. Parse_2 " & $DQ & sTest & $DQ & ", $SPC (no ANY-option available)" In 15

x = Parse_2 sTest, ar, $SPC

Print_Results x, ar

' -----------------------------------------------

' call parse_2_ANY
PrintL
PrintL "3. Parse_2_ANY " & $DQ & sTest & $DQ & ", $SPC & $CRLF" In 15

x = Parse_2_ANY sTest, ar, $SPC & $CRLF

Print_Results x, ar


End Function



' ----------------------------------------------------------------------------
Function Parse_2( ByVal sData As String, _
ByRef sResult() As String, _
ByVal sDelimiter As String _
) As Long
' ----------------------------------------------------------------------------

' this will parse sData delimited by sDelimiter
' into sResult without empty elements, no ANY-option:
' returns count of parsed elements

Local lenS As Long

If sDelimiter = "" Then
MsgBox 0, "Can not parse by """"", %MB_OK Or %MB_ICONERROR, "Parse_2: Invalid delimiter"
Stop
EndIf


Do
lenS = StrPtrLen(StrPtr(sData))
sData = Replace$(sData, sDelimiter & sDelimiter, With sDelimiter)
Loop Until StrPtrLen(StrPtr(sData)) = lenS

Function = Parse sData, sResult, sDelimiter

End Function

' ----------------------------------------------------------------------------
Function Parse_2_ANY( ByVal sData As String, _
ByRef sResult() As String, _
ByVal sDelimiter As String _
) As Long
' ----------------------------------------------------------------------------

' this will parse sData delimited by ANY sDelimiter
' into sResult without empty elements, with ANY-Option:
' returns count of parsed elements

Local lenS As Long

If sDelimiter = "" Then
MsgBox 0, "Can not parse by """"", %MB_OK Or %MB_ICONERROR, "Parse_2_ANY: Invalid delimiter"
Stop
EndIf

sData = Replace$( sData, Any sDelimiter, With ">thinBASIC_rules!<" )



Do
lenS = StrPtrLen(StrPtr(sData))
sData = Replace$(sData, ">thinBASIC_rules!<>thinBASIC_rules!<", With ">thinBASIC_rules!<")
Loop Until StrPtrLen(StrPtr(sData)) = lenS


Function = Parse sData, sResult, ">thinBASIC_rules!<"

End Function


' ----------------------------------------------------------------------------
Function Print_Results(ByVal x As Long, _
ByVal ar() As String)
' ----------------------------------------------------------------------------
' to have a uniform printout on all results
Local i As Long
PrintL

PrintL "found " & Str$(x) & " tokens:" In 11
PrintL

If x Then
For i = 1 To x
PrintL i, $DQ & ar(i) & $DQ In 14
Next
EndIf
PrintL

PrintL Repeat$(30, "-") & "> key to continue" In 28
WaitKey
PrintL $CRLF

End Function
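
An alternative to collapsing delimiters before parsing is to parse first and then drop the empty elements. The sketch below is only an illustration built from calls already used in this thread; the function name Parse_NoEmpty and all variable names are hypothetical, not part of thinBasic. It compacts the non-empty tokens to the front of the array and returns how many there are; elements beyond that count are left over from the original parse and should be ignored.

```thinbasic
Uses "Console"

Dim tokens() As String
Dim nFound As Long
Dim idx As Long

'---Leading, trailing and mixed space/tab delimiters
nFound = Parse_NoEmpty("  A     B" & Chr$(9) & " C ", tokens, " " & Chr$(9))

PrintL "Found " & Str$(nFound) & " tokens"
For idx = 1 To nFound
  PrintL $DQ & tokens(idx) & $DQ
Next

WaitKey

' Hypothetical helper: parse with ANY, then keep only non-empty elements
Function Parse_NoEmpty( ByVal sData As String, _
                        ByRef sResult() As String, _
                        ByVal sDelims As String _
                      ) As Long

  Local nAll As Long   ' count returned by Parse, empty elements included
  Local i As Long      ' read position
  Local k As Long      ' write position = number of non-empty tokens so far

  nAll = Parse sData, sResult, Any sDelims

  For i = 1 To nAll
    If sResult(i) <> "" Then
      Incr k
      sResult(k) = sResult(i)   ' move token to the front of the array
    EndIf
  Next

  Function = k

End Function
```

Because the filtering happens after parsing, this is intended to also discard the empty elements produced by delimiters at the very start or end of the string, which delimiter collapsing alone would leave in place.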


Perhaps the syntax for Parse could be extended with another optional switch, from the current


nItems = Parse ( [File] sMainString, sArray, [Any] sDelimiter [, [Any] nFieldsDelim [, nMaxRowToCheckForNField]] )

to something like


nItems = Parse ( [File] sMainString, sArray [When Not ""], [Any] sDelimiter [, [Any] nFieldsDelim [, nMaxRowToCheckForNField]])


?