Idea for case-insensitive string comparisons [Archive] - thinBasic: Basic Programming Language Community Forum

Robert Hodge

17-06-2013, 01:50

It's very common to require case-insensitive comparisons of two strings.

For example,

DIM ONE AS STRING, TWO AS STRING
' ...
IF UCASE$(ONE) = UCASE$(TWO) THEN ...

This is such a common requirement that a shortcut syntax would help make code clearer.

Assume that the comparison operators could be written with ! as a suffix. Then, you would have,

=! means case-insensitive =
<>! means case-insensitive <>
<! means case-insensitive <
<=! means case-insensitive <=
>! means case-insensitive >
>=! means case-insensitive >=

So, the example above could be rewritten as,

IF ONE =! TWO THEN ...

What is interesting about using ! as a suffix is that it makes possible using it on the compound assignment operators.

Example:

ONE =! TWO ' same as ONE = UCASE$(TWO)
ONE &=! TWO ' same as ONE &= UCASE$(TWO)

etc.

ReneMiner

17-06-2013, 08:54

The direction sounds reasonable, but I don't like those hieroglyphic annotations.

What would make sense and reads understandable:

Dim a, b as String

a= "Hello123"
b= "hello123"

If SameText(a, b [,Collate Ucase]) Then
'...
Endif

Even a tB-noob can read and understand this without checking the help file, while I don't know if I could remember what "&>=!" means after a week of not using it.
The exclamation mark bothers me because some game scripting languages I know use it as NOT,
so " != " means (translated to tB) " <> " (not equal), "!<" means not smaller, " !>=< " not within, " !>< " not between, etc.
So "One &=! Two" makes me read something else. In words: "Let One Be One And Add Not Two", excluding one thing from another. Logically, the result of "One" = "One" And Add Not "Two" would be "neTw"...
Since this is about a string plus some wildcard anyway, how about "$$" or "$*" as a shortcut for the already existing Collate Ucase?

Perhaps one could use/improve already available methods instead of inventing new ones -

I would suggest adding a "Collate Ucase" switch, as known from the "Array Scan" method, to the existing memory comparison methods (where it makes sense), which would be Memory_Equals, Memory_Differs and Memory_Compare. Then one could compare not just string content stored in a dynamic string but text anywhere, like this:

' regular String-Example:
If StrPtrLen(StrPtr(a)) = StrPtrLen(StrPtr(b)) Then
If Memory_IsEqual(StrPtr(a), StrPtr(b), StrPtrLen(StrPtr(a)), Collate Ucase) Then '...
Else
' no match anyway...
Endif

' "wild" String-comparison
Long lLen = Iif( Heap_Size(pA) > Heap_Size(pB), Heap_Size(pB), Heap_Size(pA)) ' just to make the next line look shorter....

If Memory_IsEqual(pA, pB, lLen, Collate Ucase) Then '...

' Mid$ + Left$ + Right$-comparison-substitute
If Memory_IsEqual(pA + whatEverStart, pB + whatEverStart, whatEverLen, Collate Ucase) Then...

' alternative wildcard annotation
If Memory_Differs(pA, pB, lLen, $*) Then '...not the same text, case insensitive...

This way one could do case-insensitive text comparison even in "wild string space" (heap), or within just a certain section of a string, without first peeking it into a local storage variable. It would work like Mid$ with built-in Ucase$ on both parameters, but a few times faster, I guess.
One could even compare a mix of strings, heap, file-line content, dictionary content, etc. without converting anything or storing it locally first.

Robert Hodge

17-06-2013, 15:53

I can't argue the point about what is done in other languages. I am a C coder from way back, so I know all about != meaning NOT EQUAL.

A better comparison is Rexx, which does a similar (but not exactly the same) thing. They have = to mean Equal, and == to mean Exactly Equal. The "exactly" part for strings means that if leading or trailing spaces are present, they are treated as normal data and not ignored.

So, in Rexx, "ABC" = " ABC " is true, but "ABC" == " ABC " is false. In TB terms, Rexx's = is much like Trim$("ABC") = Trim$(" ABC "), while Rexx's == behaves like TB's plain =.

They use ==, >>=, <<=, << and >>. For Not Equal, they have various notations, but the one that best fits this is /==.

I am open to other punctuation, but adding alternative function names goes against the main idea, which is to keep this stuff short.

A possible character might be a suffix of $. That would give operators like =$, <>$, <$, >$, <=$, >=$ and so on. Compound assignments could be done by ONE &=$ TWO or ONE +=$ TWO, etc.

Another possible syntax is to use the ^ character. This has the nice quality that it "points up" and so it might help people remember that there is a transformation to upper case going on. This would give operators like =^, <>^, <^, >^, <=^, >=^ and so on, with compound operators ONE &=^ TWO or ONE +=^ TWO, etc. For operators like =, =^ doesn't look too bad, but <^ is just plain ugly, and so I couldn't recommend it.

Similar remarks could be made about combining operators with @ or #. The main problem is trying to find something that is still available, won't break TB's lexical scanner, and is readable without looking confusing or 'gross'.

One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.

This is readable, doesn't use the "NOT-like" character !, isn't ugly and is nice and concise. No, you couldn't understand it without reading the manual. However, this notation is not for idle passers-by who read TB code once a year; it's for people pounding out lots and lots of code, who need it to be more concise.

As for making syntax that is understood by someone who never read the manual, that is true enough as far as it goes, but at some point, you just have to read the manual. Making code readable is a fine goal, but there are limits to how far you can take that. When you simplify the syntax, that makes code more readable, too.

ReneMiner

17-06-2013, 16:12

yeah, $^ might be better than $*...

I think this is about exchanging ideas, so they can grow from other people's "but what about this or that". If you already know "!=" from C, you can imagine the confusion about "NOT" when reading "!" (the games I'm talking about are TES Part 3 and its successors, which have a C-like scripting language; that's where I know "Elsewhile" from).

Maybe for common strings it's faster to develop a solution if Ucase or 'CaseDoesNotMatter' is handled by string methods only, but those are always slower in execution. If you compare memory numerically and bitwise, ignoring the lower-case bit (32, if 64 is set), I think it will still be faster.
But I fear you are more after a shorter way to write it than improved functionality or execution speed?
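As a rough illustration of the "ignore the lower-case bit" idea, here is a minimal Python sketch (Python is used only for illustration; the function names are hypothetical and nothing here is thinBasic API). It assumes plain ASCII bytes. It also shows why the naive rule "clear 32 whenever 64 is set" needs care: it folds some punctuation pairs together, not just letters.

```python
def fold_naive(byte):
    # Naive rule from the post: clear bit 32 whenever bit 64 is set.
    return byte & ~0x20 if byte & 0x40 else byte


def fold_ascii(byte):
    # Safer rule (an assumption of this sketch): clear bit 32 only for a-z.
    return byte & ~0x20 if 0x61 <= byte <= 0x7A else byte


def equal_ignore_case(a, b, fold=fold_ascii):
    # Compare two byte strings numerically, one byte at a time.
    return len(a) == len(b) and all(fold(x) == fold(y) for x, y in zip(a, b))


print(equal_ignore_case(b"Hello123", b"hello123"))   # letters differ only in bit 32
print(equal_ignore_case(b"[", b"{", fold_naive))     # naive rule also merges "[" and "{"
print(equal_ignore_case(b"[", b"{", fold_ascii))     # range check keeps them distinct
```

The range check costs one extra comparison per byte but avoids treating "[" and "{" (or "@" and "`") as the same character.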

Perhaps another person can see... another way?

Robert Hodge

18-06-2013, 05:21

One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.

I think in terms of syntax, this one here is about as good as it's going to get.

As far as runtime efficiency, it has an advantage as well - at least, a potential one.

Consider. If I have

DIM ONE AS STRING
DIM TWO AS STRING
' assign ONE and TWO
IF UCASE$(ONE) = UCASE$(TWO) THEN ...

what does TB have to do to implement this? To my eyes, it would have to create a temporary copy of ONE and upper-case it, then make a temporary copy of TWO and upper-case it, then compare the two temporaries. But, what happens if we allow this ...

IF ONE =: TWO THEN ...

Now, we can have a case-insensitive comparison operator at a low level, so that it could translate to upper case and do the comparison a byte at a time, and so no temporaries would need to be created.

So, you get two advantages with this syntax: (a) much less code, and (b) faster execution of case-insensitive comparisons.
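The byte-at-a-time comparison described above could look roughly like the following Python sketch. It is only an illustration of the idea under the assumption of ASCII data; the names are hypothetical and this is not how thinBasic implements anything. A single three-way compare is enough to back all six proposed operators.

```python
def upper_byte(byte):
    # Map ASCII a-z to A-Z; leave every other byte unchanged.
    return byte - 32 if 0x61 <= byte <= 0x7A else byte


def compare_ignore_case(a, b):
    # Return -1, 0 or 1 without building upper-cased copies of a or b.
    for x, y in zip(a, b):
        ux, uy = upper_byte(x), upper_byte(y)
        if ux != uy:
            return -1 if ux < uy else 1   # stop at the first difference
    # The common prefix is equal, so the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


# Each proposed operator maps onto the three-way result:
#   ONE =: TWO   ->  compare_ignore_case(ONE, TWO) == 0
#   ONE <>: TWO  ->  compare_ignore_case(ONE, TWO) != 0
#   ONE <: TWO   ->  compare_ignore_case(ONE, TWO) < 0
#   ONE >=: TWO  ->  compare_ignore_case(ONE, TWO) >= 0
print(compare_ignore_case(b"Hello", b"HELLO"))
```

Because the loop returns at the first differing byte, strings that differ early are rejected without touching the rest of either string.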

ReneMiner

18-06-2013, 07:44

The colon I don't like because it has meaning: "Expression ends here" since 1977 or longer - but the chars used in the end won't matter...

I would not know how to apply the Ucase order to both sides of the expression without any parentheses. Perhaps Eros can, but I have no idea about all this; I'm only a basic end user. To process strings case-insensitively more than once, I would use a state, as known from OpenGL:

Enable Collate Ucase
'do case-insensitive string-stuff here using normal syntax as

a += b
'whatever

Disable Collate Ucase
' now case sensitive again

You could enable that at the very front of the script, never disable it, and use ordinary syntax for the whole script without any additional annotation.

Or, for a one-off expression, a short form in parentheses like

If Collate Ucase(a = b) Then PrintL "a and b have same text"
' equals Ucase$(a) = Ucase$(b)

' shorter syntax could read
If $^(a = b) Then PrintL "a and b have same text"

' other meaning "$^" as simple shortcut/replacement for Ucase$
$^(a) += $^(b) ' etc...

Michael Clease

18-06-2013, 10:41

Seems like a good enhancement to me.

one = "a"
two = "b"
three = "A"
four = "B"

IF UCASE(ONE) = UCASE(TWO) AND LCASE(THREE) = LCASE(FOUR) THEN
PRINT "You need a new computer\n"
ELSE
PRINT "Try again with like values\n"
END IF

I like this version best. It stays true to the title of the language, BASIC, and not "C" as some people seem to want it to become. Let's not reinvent the wheel; just how much hand-holding do you think people need?

The above would still be the clearest version to me; its intention is quite clear, and it even has greater flexibility.

Mike C.

Robert Hodge

18-06-2013, 17:20

This concept doesn't really demand the use of a colon, but that is really not that big a deal in terms of implementing it. The issue about colons ending a statement is really not that hard to solve.

The way it works is that, for any compiler or interpreter, they have a 'lexical' phase and a 'parsing' phase. The lexical phase 'grabs tokens' and categorizes them into known types, such as keyword, number, closures (like parens), punctuation, etc. The parser assigns meaning to these tokens.

If we had a token like =: then the lexical scanner would see '=' and then it would 'look ahead' to see if there was a colon following it. If so, it would 'grab' the two characters and treat them as a composite token of "=:" rather than an = sign followed by an end-of-statement : colon delimiter. This kind of stuff is very standard in compiler-like software, and certainly Eros knows all about this.
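The look-ahead step can be sketched in a few lines of Python. This is a toy scanner written for illustration only, not thinBasic's actual lexer; the operator list and the `tokenize` name are assumptions. The point is that trying the longest operator first lets `=:` become one token while a free-standing colon remains a statement separator.

```python
# Candidate operators, longest first, so "<=:" wins over "<=" and "<".
OPERATORS = ["<=:", ">=:", "<>:", "=:", "<:", ">:",
             "<=", ">=", "<>", "=", "<", ">", ":"]


def tokenize(source):
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isalnum() or ch in "_$":
            # Identifier, keyword or number: consume the whole word.
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] in "_$"):
                j += 1
            tokens.append(source[i:j])
            i = j
            continue
        for op in OPERATORS:
            # Look ahead: take the longest operator that matches here.
            if source.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            tokens.append(ch)   # any other single character
            i += 1
    return tokens


print(tokenize("IF ONE =: TWO THEN"))   # "=:" is one composite token
print(tokenize("A = 1 : B = 2"))        # the lone ":" still separates statements
```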

Some of the suggestions like ENABLE COLLATE UCASE, IF COLLATE UCASE, etc. are certainly possible. But, the goal of this exercise was to make case insensitive comparisons shorter, not longer. If I wanted a long expression for this, then

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

' is just as wordy as

IF COLLATE UCASE (ONE = TWO) THEN ...

' and neither of which are nearly as nice to type or easy to read as

IF ONE =: TWO THEN ...

ReneMiner

18-06-2013, 20:37

Anyway, there's already the IsLike() method, which allows string comparison with some fancy stuff around it, whatever you want: from $DQ to Chr$(34) or """, even spaces or, I dunno, whatever you can type in 5 seconds. It also has just 6 letters and two parens to type, allows wildcards or leading/trailing truncation, and has a case-insensitive switch if desired. Maybe think about using this method to get ahead with developing that Rexx module. I'm nosy :D
John you're sniffing a new victim out? :bb:

Robert Hodge

18-06-2013, 23:27

The IsLike function is certainly a possibility. You would have to treat the right-hand side of the comparison as the "pattern" string, so it would be:

DIM ONE AS STRING
DIM TWO AS STRING

' instead of ...

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

' you'd use ...

IF IsLike(ONE, TWO, %FALSE) THEN ...

The main drawback to IsLike is that if the second argument contained any pattern characters like * ? or # the comparison would not work right.
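The pitfall is easy to reproduce with any pattern matcher. The Python sketch below uses the standard `fnmatch` module purely as a stand-in for IsLike (an assumption of this example: its wildcards are `*`, `?` and `[...]`, which are not identical to IsLike's). When the right-hand string happens to contain a wildcard, the "comparison" succeeds for strings that are not equal.

```python
from fnmatch import fnmatchcase


def like_ignore_case(text, pattern):
    # Pattern match with both sides upper-cased (stand-in for IsLike).
    return fnmatchcase(text.upper(), pattern.upper())


def equal_ignore_case(one, two):
    # Plain case-insensitive equality, with no pattern semantics.
    return one.upper() == two.upper()


print(like_ignore_case("Hello", "hello"))         # behaves like equality here
print(like_ignore_case("readme.txt", "*.TXT"))    # matches, though the strings differ
print(equal_ignore_case("readme.txt", "*.TXT"))   # equality correctly says no
```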

Now, if you really wanted functional notation rather than my clever new operators, you could use EQ, NE, GT, LT, GE and LE, since these names don't appear to be taken by anything else. That would render the comparison as,

IF EQ(ONE, TWO) THEN ...

which isn't too bad. I still like my way better:

IF ONE =: TWO THEN ...

but hey - "you takes what you's can gets".

If we wanted to have case-insensitive comparisons that were also trimmed, we could use EQ_T, NE_T, GT_T, LT_T, GE_T and LE_T, or something like that. Implementing "trimmed" versions of these functions would have a lower priority though.
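For comparison, the functional family could be sketched as below in Python. The EQ/NE/... and _T names come from the post; everything else (the `_fold` helper, trimming only spaces) is an assumption of this sketch, not an existing thinBasic feature.

```python
def _fold(s):
    # Hypothetical helper: the case folding shared by all comparisons.
    return s.upper()


def EQ(a, b): return _fold(a) == _fold(b)
def NE(a, b): return _fold(a) != _fold(b)
def LT(a, b): return _fold(a) < _fold(b)
def LE(a, b): return _fold(a) <= _fold(b)
def GT(a, b): return _fold(a) > _fold(b)
def GE(a, b): return _fold(a) >= _fold(b)


def EQ_T(a, b):
    # Trimmed variant: ignore leading and trailing spaces as well as case.
    return EQ(a.strip(" "), b.strip(" "))


print(EQ("abc", "ABC"), EQ("abc", " ABC "), EQ_T("abc", " ABC "))
```

The remaining trimmed variants (NE_T, GT_T and so on) would follow the same pattern as EQ_T.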

zlatkoAB

19-06-2013, 00:35

Michael Clease

19-06-2013, 01:36

Why not just create a unit file with the commands you require? Here is an example:

Dim s As String
Dim ss As String
Dim sMsg As String

s= "T"
ss="T"

If eq(s,ss) Then
sMsg += "The strings are equal"
Else
sMsg += "The strings are not equal"
End If

Function eq(ByRef string1 As String, ByRef string2 As String)
If Ucase$(String1) = Ucase$(String2) Then
Return %TRUE
Else
Return %FALSE
End If
End Function

Function neq(ByRef string1 As String, ByRef string2 As String)
If Ucase$(String1) = Ucase$(String2) Then
Return %FALSE
Else
Return %TRUE
End If
End Function

Robert Hodge

19-06-2013, 01:50

I am not sure, but I think this native Win32 API function compares strings in a case-insensitive way..

Declare Function strcmp Lib "kernel32" Alias "lstrcmpA" (ByVal psz1 As String, ByVal psz2 As String) As Long

You are probably thinking of lstrcmpiA. However, both routines handle C-style null-terminated strings, not Basic strings which have a byte count. There are length-counted routines, such as StrCmpNIA in shlwapi.dll, that do a case-insensitive compare of at most n characters, BUT you'd somehow have to get the lengths of both strings, and somehow pass that as a parameter to the compare routine. And again, even if this would work, it would mean having to write something like,

IF StrCmpNIA(ONE,TWO,LEN(ONE)) = 0 THEN ...

' and that really isn't much better than

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

' that's why I am still holding out for ...

IF ONE =: TWO THEN ...

ReneMiner

20-06-2013, 21:17

So I made a suggestion (http://www.thinbasic.com/community/project.php?issueid=417) which would probably not have come to my mind without this discussion. It's not exactly what you wanted, but rather the improvement of existing methods (see above), because I think they could be really useful when one has to compare text content wherever in memory it is located, even if the functionality is already available to anyone who uses those gray cells inside that bubble on top of the neck :D

ErosOlmi

21-06-2013, 20:05

You are probably thinking of lstrcmpiA. However, both routines handle C-style null-terminated strings, not Basic strings which have a byte count.

All thinBasic dynamic strings are BSTR strings (OLE32 strings).
BSTR strings are NULL-terminated automatically by the OLE32 engine.
The ending NULL char does not count toward the string length; the OLE32 engine adds it just for compatibility with ASCIIZ strings.
Of course, because a BSTR can contain NULL chars anywhere inside a dynamic string, when it is passed as an ASCIIZ string everything after the first NULL is ignored.
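The difference between the length-counted view and the ASCIIZ view of the same bytes can be seen with Python's standard ctypes module. This sketch only illustrates the general behaviour of NUL-terminated strings; it does not involve OLE32 or thinBasic, and the variable names are made up for the example.

```python
import ctypes

data = b"ONE\x00TWO"                       # 7 counted bytes with an embedded NUL
buf = ctypes.create_string_buffer(data)    # a terminating NUL is appended

counted_view = buf.raw[:len(data)]         # length-counted view: all 7 bytes
asciiz_view = buf.value                    # ASCIIZ view: stops at the first NUL

print(counted_view)
print(asciiz_view)

# Two different counted strings can look identical to a C-string routine.
other = ctypes.create_string_buffer(b"ONE\x00XYZ")
print(buf.value == other.value)
```

So a routine such as lstrcmpi only sees the part before the first NUL, which is fine for ordinary text but not for strings that carry binary data.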

Calling external functions requiring ASCIIZ strings can be done very easily: just declare parameters as ASCIIZ and thinBasic will take care of necessary conversions.

The following is an example:

Uses "console"

Declare Function Are_String_Equal_Regardless_Case Lib "kernel32" Alias "lstrcmpi" (ByVal psz1 As Asciiz, ByVal psz2 As Asciiz) As Long

String sOne = "this is a string"
String sTwo = "THIS IS A STRING"

PrintL Are_String_Equal_Regardless_Case (sOne, sTwo) ' prints 0: lstrcmpi returns 0 when the strings are equal
WaitKey

Reference of lstrcmpi function: http://msdn.microsoft.com/en-us/library/windows/desktop/ms647489(v=vs.85).aspx

What about == as case insensitive string comparison?

Robert Hodge

21-06-2013, 22:49

It might work if the thinBasic user has never experienced any other language before. I have always known == to be a compare rather than an assignment. Why not put in the extra effort and follow the traditional BASIC standard like the SB examples I posted?

Pretty much everyone has written in something else, so we always carry with us our "pet favorites" as far as preferred syntax.

The == comes from C (and from there, it went to so many other places: C++, Java, Perl, etc.) as a straight comparison.
In Rexx, == means "exactly equal" and so is not only case-sensitive but must be the exact length (it won't ignore padding) and thus it's even "more" sensitive than the garden variety = compare.

For Rexx, they have a whole second set of comparisons, so there is ==, <<, >>, <<=, >>= and /==, where /= is their not-equal. (They have a hard time deciding what not-equal should be, so there are many alternatives like \= and ¬=, etc.) TB's not-equal would be a problem, since it's <>. The only thing I could think of for <> to follow the pattern of = becoming == is <<>> which looks pretty weird. That's one reason why I like =: so much; it's not only easy to type and read, but it's consistent, with every comparison ending in the : colon.

When you have a language that already makes heavy use of punctuation, trying to add new tokens that can be successfully recognized by the lexical scanner, AND be somehow meaningful, AND not be 'ugly', is really hard. The more mature the language is, the fewer "escape hatches" and "undeveloped property" are still left. You might have all kinds of cool ideas, but you start running out of places to "shoe-horn" them into the language. You have to be extremely careful not to break the parser when adding new, novel syntax.

As far as "Why not put in the extra effort and follow the traditional BASIC standard", yes, one can simply write code like,

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

and that IS standard BASIC. It never was a question of whether BASIC can DO it, only whether there could be a way to do it while at the same time writing less code. Personally, I like novel syntax, and I wouldn't be bothered at all by,

IF ONE =: TWO THEN ...

but a lot of people are bothered by changes to the language when they seem overly "radical". Of course, one man's 'radical' is another man's "cool". Depends on your point of view.