Internal (SDK mostly) Vartype-thoughts about encoding and other stuff [Archive] - thinBasic: Basic Programming Language Community Forum

ReneMiner

05-02-2022, 17:11

I had a few thoughts concerning the vartypes in the SDK, which are divided into VarMainType and VarSubType.
Just as a reminder:

%VarMainType_IsNumber = 20& ' 0x0014
%VarMainType_IsString = 30& ' 0X001E
%VarMainType_IsAsciiZ = 25& ' 0x0019
%VarMainType_IsVariant = 50& ' 0x0032
%VarMainType_IsUDT = 60&
%VarMainType_IsPTR = 70&
%VarMainType_IsObject = 80&
%VarMainType_IsClass = 90&
%VarMainType_IsFunction = 95&
%VarMainType_IsDispatch = 100&

%VarSubType_Byte = 1&
%VarSubType_Integer = 2&
%VarSubType_Word = 3&
%VarSubType_DWord = 4&
%VarSubType_Long = 5&
%VarSubType_Quad = 6&
%VarSubType_Single = 7&
%VarSubType_Double = 8&
%VarSubType_Currency = 9& ' 0x0900
%VarSubType_Ext = 10& ' 0x0A00
%VarSubType_Variant = 50& '0x3200

That's essentially it. In a little script I am currently working on, I decided to save on parameters by putting maintype and subtype into one equate. As the hex comments above already indicate, the rightmost byte holds the maintype; the subtype values only need to be multiplied by 256 (0x100) and added on. In the case of the Variant types, subtype + maintype add up to 0x3232, and we have the %VT_ equates to determine what exactly the variant currently holds, whether strings, bytes or any other type.
With the maintype at the end, the value is easy for the human reader to see and understand: 16 bits with no meaning yet + 8 bits of detail + 8 bits of major information.

the more left,
the higher the number -
the higher the detail

Remember that we use Long variables for the equates, i.e. 32 bits. That means the hex numbers up there show only half of each variable.
No additional equates are required if we know that subtype and maintype are combined like this:

%VarMainType_IsNumber + 256 * %VarSubType_Ext

and it does not take much imagination to add further information, such as

%VarSubType_Variant * 256 + %VarMainType_IsVariant + 65536 * %VT_BSTR
= 0x00083232

and we have very detailed information about this variable in one parameter. Internal functions to decode this are not difficult to create, since the bytes of the value
can be read by just placing a

Local B(4) As Byte At Varptr(myParameter)
maintype = B(1) ' lowest byte comes first in memory (little-endian x86)
subtype = B(2)

and the meaning of the other 2 bytes depends on the main- & subtype.
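
To double-check the arithmetic, here is a minimal sketch in Python (not thinBasic). The function names are hypothetical, VT_BSTR = 8 is the usual COM value, and the byte view assumes the little-endian layout a Long has on x86 Windows. Which array element holds which field depends on that byte order: the maintype sits in the byte at the lowest address.

```python
import struct

# Values copied from the equate list above; VT_BSTR = 8 is the usual COM value.
VARMAINTYPE_ISNUMBER = 20
VARMAINTYPE_ISVARIANT = 50
VARSUBTYPE_EXT = 10
VARSUBTYPE_VARIANT = 50
VT_BSTR = 8


def pack_vartype(main_type, sub_type=0, detail=0):
    """Combine maintype, subtype and detail into one 32-bit value (hypothetical helper)."""
    if not (0 <= main_type <= 0xFF and 0 <= sub_type <= 0xFF and 0 <= detail <= 0xFFFF):
        raise ValueError("field out of range")
    return main_type + 256 * sub_type + 65536 * detail


def unpack_vartype(value):
    """Split the value again with plain arithmetic: (maintype, subtype, detail)."""
    return value & 0xFF, (value >> 8) & 0xFF, (value >> 16) & 0xFFFF


def unpack_vartype_bytes(value):
    """Same result, read from the 4 bytes as they lie in memory (little-endian assumed)."""
    raw = struct.pack("<I", value & 0xFFFFFFFF)  # raw[0] is the lowest address
    return raw[0], raw[1], raw[2] | (raw[3] << 8)


ext_number = pack_vartype(VARMAINTYPE_ISNUMBER, VARSUBTYPE_EXT)
bstr_variant = pack_vartype(VARMAINTYPE_ISVARIANT, VARSUBTYPE_VARIANT, VT_BSTR)

print(f"{ext_number:08X}")    # 00000A14
print(f"{bstr_variant:08X}")  # 00083232
print(unpack_vartype_bytes(bstr_variant))  # (50, 50, 8)
```

Both decoders give the same answer; the arithmetic one does not care about byte order at all, which makes it the safer choice if the value ever travels to another platform.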

This needs no changes to the current enumeration. I have only used new names here and tried to work with names that say more than the current ones, which are mostly kept for compatibility with PowerBASIC. But to be honest: if you don't know what DWORD means, you cannot guess it from anything but WORD and the knowledge that "double" means something like "twice" in English.

Once you know that "Int" means "Integer", "U" means "Unsigned" and the number is the count of bits, "UInt32" is faster to understand for someone
who is willing to learn a language like thinBasic and doesn't know anything about PowerBASIC or other languages that use similar names.
And if they do know, maybe
"Integer" - wow, but Integer even in Visual Studio sometimes still means 16, sometimes 32 and now also 64 bits...,
"Short" is sometimes a byte, sometimes the size of a word...
Long, LongLong and ExtraLargeLongLongDingDong... what will be next?
Someone open the door please and kick 'em out :D

When we repeat the pattern in Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64 and (will there ever be one?) UInt64, the new user's learning curve goes up very fast,
since one of these always appears somewhere. If we continue the pattern, which is already present in Float32, Float64 and Float80 for floating-point numerals, we could add Dec32 or Dec64 if we had decimals. And if 64-bit decimals were available with 2 and 4 digits after the decimal separator, those might, to keep it short, become Dec2d64 and Dec4d64.
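
The naming pattern is regular enough to be generated. A small sketch in Python (not thinBasic), assuming the usual PowerBASIC/thinBasic sizes for the current numeric types; the table and both helper names are made up for illustration only.

```python
import re

# Current type name -> (kind prefix, bit count); sizes as in PowerBASIC/thinBasic.
CURRENT_TYPES = {
    "Byte": ("UInt", 8),
    "Integer": ("Int", 16),
    "Word": ("UInt", 16),
    "DWord": ("UInt", 32),
    "Long": ("Int", 32),
    "Quad": ("Int", 64),
    "Single": ("Float", 32),
    "Double": ("Float", 64),
    "Ext": ("Float", 80),
}


def systematic_name(current_name):
    """Build the pattern-based name, e.g. DWord -> UInt32 (hypothetical helper)."""
    kind, bits = CURRENT_TYPES[current_name]
    return f"{kind}{bits}"


def size_in_bytes(systematic):
    """Read the size straight off the name: the trailing number is the bit count."""
    bits = int(re.search(r"(\d+)$", systematic).group(1))
    return bits // 8


for old in CURRENT_TYPES:
    new = systematic_name(old)
    print(f"{old:8} -> {new:8} ({size_in_bytes(new)} bytes)")
```

The point of the second helper is that the size needs no lookup table: anyone (or any script) can read it from the name itself, which also works for a name like Dec2d64.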

Now we also have strings. Originally there was only one type of them, simply named String. Through all the Unicode madness, ANSI, DOS, codepages and OEM were all squeezed into the same type. Then someone saw the need for another one that does not work like the original native string: it carries no length to know its size in advance but is terminated by a character that is never used in text, the Chr$(0) glued to its end. That became AsciiZ or StringZ. Then strings became wider, using 2 bytes per char.
WAsciiZ was not invented, because those are no ASCII chars, but we have WString and WStringZ... and that's 16 bit...

Meanwhile 32 bit was required, since the company that "invented" Unicode did not really invent it. The name is wrong: it is definitely not only one code and should be titled
16Bit-Multicode. But we will not bother with names... only facts, and as clear as possible.

I started an approach that looks like this:

%VarType_Numeral = &H0014 ' Multiplied all Subtypes * 256
%VarType_Literal = &H001E ' added Subtypes + maintypes
%Vartype_LiteralZ = &H0019 ' subtype-bits of strings used to define details
' ' 3 bits used to differ the encodings with a value that is
' ' equal to the "count of used Bytes per char"
' (UTF7=1,UTF16=2,UTF32=4)
%VarType_String = &H001E 'PLAIN NOT ENCODED/UNKNOWN ENCODING OR Data
' "UTF7" here means plain 7-bit ASCII, codes 0 to 127 only
' ' if any char in utf7-string had an asc-value > 127 it has to be considered buggy
' + Bytes per char ' and requires conversion to ANSI|UTF8|OEM|DOS
' + ByteOrderMark ' ANSI (once in a while values > 127, rarely multiple times in a row > 127)
' + Encodings ' UTF8 (never a single value above 127 alone, always 2 to 4 bytes >= 128 in a row)
' where it applies (in the hope never to have 64Bit chars in a 32Bit environment)
' 8 is available to indicate the use of a BOM (byte order mark)
' leftmost bits apply to
' UTF7 UTF16 UTF32
' 128|&H80 ->UTF8 not no need
%VarType_SUTF7 = &H011E ' 64|&H40 ->ANSI thought no need
%VarType_SUTF8 = &H811E ' 32|&H20 ->OEM in detail no need
%VarType_WString = &H021E ' 16|&H10 ->DOS but used no need
%VarType_SUTF32 = &H041E ' 8|&H08 BOM BOM BOM
%VarType_AsciiZ = &H0019
' with this code the Vartype_String 0x001E (unformatted) becomes SUTF8 0x811E (UTF8 no BOM)
' SZUTF8 zero terminated utf8 0x00008119 (no BOM) 0x00008919 (same with BOM)
%VarType_WStringZ = &H0219 ' the dos/oem/ansi for the 7bit-strings and for utf16 could use the 2 leftmost bytes of the
%VarType_SZUTF32 = &H0419 ' 32-Bit vartype equate for the notation of codepage /dos etc. to "convert" once the encoding
' is known, the current unformatted 0x0000001E (String) or 0x00000019 (AsciiZ) could change
%VarType_UDT = &H003C ' to a number that tells if OEM or Codepage, Ansi or DOS-keyboard enumeration
%VarType_PTR = &H0046 ' since the left bits of the right byte could give a clue about what there is.
%VarType_Object = &H0050
%VarType_Class = &H005A
%VarType_Function = &H005F
%VarType_Dispatch = &H0064

%VarType_Byte = &H0114
%VarType_Integer = &H0214
%VarType_Word = &H0314
%VarType_DWord = &H0414
%VarType_Long = &H0514
%VarType_Quad = &H0614
%VarType_Single = &H0714
%VarType_Double = &H00000814
%VarType_Currency = &H00000914
%VarType_Ext = &H00000A14

%VarType_Variant = &H00003232
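
To make the subtype bits of the string types concrete, here is a rough sketch in Python (not thinBasic) that reads them back out of a combined value. The flag values are the ones from the remarks above (128 UTF8, 64 ANSI, 32 OEM, 16 DOS, 8 BOM, lowest 3 bits = bytes per char); the constant names and the function name are only made up for this example.

```python
MAIN_STRING = 0x1E    # length-prefixed string (%VarType_String)
MAIN_STRINGZ = 0x19   # zero-terminated string (%VarType_LiteralZ)

FLAG_UTF8 = 0x80
FLAG_ANSI = 0x40
FLAG_OEM = 0x20
FLAG_DOS = 0x10
FLAG_BOM = 0x08
BYTES_PER_CHAR_MASK = 0x07


def describe_string_type(vartype):
    """Decode the string-related bits of a combined vartype (hypothetical helper)."""
    main_type = vartype & 0xFF
    sub_type = (vartype >> 8) & 0xFF
    if main_type not in (MAIN_STRING, MAIN_STRINGZ):
        raise ValueError("not a string-derived vartype")
    return {
        "zero_terminated": main_type == MAIN_STRINGZ,
        "bytes_per_char": sub_type & BYTES_PER_CHAR_MASK,  # 0 = plain / unknown
        "has_bom": bool(sub_type & FLAG_BOM),
        "utf8": bool(sub_type & FLAG_UTF8),
        "ansi": bool(sub_type & FLAG_ANSI),
        "oem": bool(sub_type & FLAG_OEM),
        "dos": bool(sub_type & FLAG_DOS),
    }


print(describe_string_type(0x811E))  # SUTF8: 1 byte per char, UTF8 flag, no BOM
print(describe_string_type(0x8919))  # zero-terminated UTF8 with BOM
print(describe_string_type(0x021E))  # WString: 2 bytes per char
```

Because the maintype byte is never touched, code that only asks "is this a string at all?" keeps working no matter which encoding bits are set.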

As you see, the strings could include information about the bytes used per char, while the names follow the bit-count usage of the numerals.
The UTF encodings also reflect the bits. String gets a leading S, for a chain of concatenated characters, because the C is already taken for classes. Chars are no
datatype (yet), but I tend towards "Literals" (Lit8, Lit16, Lit32) if there were a need to separate in mind the use of Unicode from legacy char, wchar, OEM, ANSI and such.

As described in the remarks, the 2 left bytes of string-derived datatypes could hold the codepage or whatever, just like the variant %VT_ above, for DOS, OEM and ANSI. These all use different charsets: ANSI does not mean one certain set of characters but a collection of lookup tables for whatever keyboards. I did not study this completely, but UTF16 also probably has a 16-bit table of pages for encodings. If the last 3 bits of the (subtype) byte say "bytes used per char", where the second bit from the right (2) makes the 16 bit, then the 2 bytes on the left of our 32-bit equate are unused, so there is no question where to store those 16 bits to get string types that have all the detail included. The maintype would never change with another encoding.
Only two kinds remain: the "pre-terminated" string, whose length is stored in a DWord left of the StrPtr so we know its end in advance, and the other, with the terminating zero at its end, which will know from one of the 3 rightmost bits in the subtype whether to terminate using MKByt$(0), MKWrd$(0) or MKDwd$(0)...
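
A rough sketch of those last two ideas in Python (not thinBasic): the codepage goes into the two unused left bytes, and the terminator width follows the bytes-per-char bits. All function names are hypothetical, and codepage 1252 is just an example value.

```python
def with_codepage(vartype, codepage):
    """Store a 16-bit codepage number in the upper two bytes (hypothetical helper)."""
    if not 0 <= codepage <= 0xFFFF:
        raise ValueError("codepage must fit into 16 bits")
    return (vartype & 0xFFFF) | (codepage << 16)


def codepage_of(vartype):
    """Read the 16-bit codepage number back from the upper two bytes."""
    return (vartype >> 16) & 0xFFFF


def terminator(vartype):
    """Zero terminator as wide as one char: like MKByt$(0), MKWrd$(0) or MKDwd$(0)."""
    bytes_per_char = (vartype >> 8) & 0x07
    if bytes_per_char not in (1, 2, 4):
        raise ValueError("bytes per char must be 1, 2 or 4")
    return b"\x00" * bytes_per_char


# Zero-terminated, 1 byte per char, ANSI flag (0x40 | 1 = 0x41), maintype 0x19
ansi_z = with_codepage(0x4119, 1252)
print(f"{ansi_z:08X}")           # 04E44119
print(codepage_of(ansi_z))       # 1252
print(terminator(0x0219).hex())  # 0000
```

One thing to keep in mind: a codepage number of 32768 or more sets the top bit, so the value would read as negative when kept in a signed Long.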

Just thoughts... compatible with the future

edit- yes, improved a bit