I have a small console script with two functions. The main function reads in some Unicode code points, parses them, and is supposed to copy the bits of each UTF-32 value into the matching UTF-8 byte mask,
that is:
code points                                       size      UTF-8 byte mask
U+0000 to U+007F (7-bit ASCII range, "UTF 7")     1 byte    "0......."
U+0080 to U+07FF                                  2 bytes   "110.....10......"
U+0800 to U+FFFF                                  3 bytes   "1110....10......10......"
U+10000 to U+10FFFF                               4 bytes   "11110...10......10......10......"
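For cross-checking the bytes the script should print, the table above can be sketched with plain shifts and masks. This is Python rather than thinBasic and is only meant as a reference; the names utf8_length and encode_utf8 are made up for this sketch and are not part of the script.

```python
def utf8_length(cp: int) -> int:
    """Number of UTF-8 bytes for a code point per the table (0 = out of range)."""
    if cp < 0 or cp > 0x10FFFF:
        return 0
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def encode_utf8(cp: int) -> bytes:
    """Encode one code point; surrogates (U+D800..U+DFFF) are not rejected here."""
    n = utf8_length(cp)
    if n == 0:
        raise ValueError("code point out of range")
    if n == 1:
        return bytes([cp])
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]  # 110....., 1110...., 11110...
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte 10...... takes 6 bits
        cp >>= 6
    out.append(lead_prefix | cp)        # remaining high bits go into the lead byte
    return bytes(reversed(out))


print(encode_utf8(0x0F3A).hex(" "))  # e0 bc ba
```

Under these assumptions U+0F3A (TIBETAN MARK GUG RTAGS GYON) comes out as E0 BC BA, which is what the script should print for that line.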
The script contains a list of all Unicode characters that are parentheses, brackets, or braces.
Each of them has a sibling (its mirrored counterpart), and I store both just as code points (U+0000 to U+10FFFF); the ones in this list need 4 hex digits at most. My script is intended to use the binary string representation ("0b0101010101...") of an 8-, 16-, 24-, or 32-bit value and to place a sliding byte pointer at the far right (the LSB) of both the mask and the code point. Moving from right to left, wherever there is a dot in the mask (CHR$(46), \x2E), the rightmost unused binary digit (0 or 1) of the
code point is placed into the mask, so this simply converts UTF-32 (DWCHAR/DWIDE) values to UTF-8.
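The mask-filling idea described above can be sketched as follows, again in Python as a neutral reference and not as a fix for the thinBasic code. MASKS and fill_mask are hypothetical names; the caller is assumed to pick the byte count from the range table.

```python
MASKS = {
    1: "0.......",
    2: "110.....10......",
    3: "1110....10......10......",
    4: "11110...10......10......10......",
}


def fill_mask(cp: int, nbytes: int) -> bytes:
    """Fill the dots of the mask from right to left with the bits of cp."""
    mask = list(MASKS[nbytes])
    bits = bin(cp)[2:]        # binary digits without the "0b" prefix
    read = len(bits) - 1      # index of the rightmost unread digit
    for write in range(len(mask) - 1, -1, -1):
        if mask[write] != ".":
            continue          # fixed marker bit of the mask, keep it
        if read >= 0:
            mask[write] = bits[read]
            read -= 1
        else:
            mask[write] = "0"  # code point exhausted, pad with zeros
    if read >= 0:
        raise ValueError("code point does not fit into this mask")
    filled = "".join(mask)
    # Cut the filled mask into groups of 8 digits, one per output byte.
    return bytes(int(filled[i:i + 8], 2) for i in range(0, len(filled), 8))


print(fill_mask(0x2329, 3).hex(" "))  # e2 8c a9
```

Note that the read index and the write index move independently: the write position skips the fixed marker bits, while the read position only advances when a dot is actually filled.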
I have run into issues using BIN$(value[,nDigits]) and again with indexing array members. Some values that should be caught by a check beforehand still cause errors, for example:
If ubound(bCode) = 2 Then
Printl bCode(2)
End If
An error should be impossible here, yet it reports Error 400 for the line Printl bCode(2): the index should be between 2 and 0 (in THIS ORDER!), actual value = 2.
There are more bugs like this, so I am posting the whole script:
uses "console"
'##############################################################################
function tbmain()
console_ShowWindow(%Console_SW_Show)
string sName(),sLine()
byte bCode()
long nLines,i
quad qNum
'from those it should pick like 0028 - convert to utf7/8
' = "0028" - same for this one
' (less than u+0080 /c2 80 will not change anything but the count of bytes used for the utf7/8-char)
string sText = "0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
005B; 005D; o # LEFT SQUARE BRACKET
005D; 005B; c # RIGHT SQUARE BRACKET
007B; 007D; o # LEFT CURLY BRACKET
007D; 007B; c # RIGHT CURLY BRACKET
0F3A; 0F3B; o # TIBETAN MARK GUG RTAGS GYON
0F3B; 0F3A; c # TIBETAN MARK GUG RTAGS GYAS
0F3C; 0F3D; o # TIBETAN MARK ANG KHANG GYON
0F3D; 0F3C; c # TIBETAN MARK ANG KHANG GYAS
169B; 169C; o # OGHAM FEATHER MARK
169C; 169B; c # OGHAM REVERSED FEATHER MARK
2045; 2046; o # LEFT SQUARE BRACKET WITH QUILL
2046; 2045; c # RIGHT SQUARE BRACKET WITH QUILL
207D; 207E; o # SUPERSCRIPT LEFT PARENTHESIS
207E; 207D; c # SUPERSCRIPT RIGHT PARENTHESIS
208D; 208E; o # SUBSCRIPT LEFT PARENTHESIS
208E; 208D; c # SUBSCRIPT RIGHT PARENTHESIS
2308; 2309; o # LEFT CEILING
2309; 2308; c # RIGHT CEILING
230A; 230B; o # LEFT FLOOR
230B; 230A; c # RIGHT FLOOR
2329; 232A; o # LEFT-POINTING ANGLE BRACKET
232A; 2329; c # RIGHT-POINTING ANGLE BRACKET
2768; 2769; o # MEDIUM LEFT PARENTHESIS ORNAMENT
2769; 2768; c # MEDIUM RIGHT PARENTHESIS ORNAMENT
276A; 276B; o # MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT
276B; 276A; c # MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT
276C; 276D; o # MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT
276D; 276C; c # MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT
276E; 276F; o # HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
276F; 276E; c # HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
2770; 2771; o # HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT
2771; 2770; c # HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT
2772; 2773; o # LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT
2773; 2772; c # LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT
2774; 2775; o # MEDIUM LEFT CURLY BRACKET ORNAMENT
2775; 2774; c # MEDIUM RIGHT CURLY BRACKET ORNAMENT
27C5; 27C6; o # LEFT S-SHAPED BAG DELIMITER
27C6; 27C5; c # RIGHT S-SHAPED BAG DELIMITER
27E6; 27E7; o # MATHEMATICAL LEFT WHITE SQUARE BRACKET
27E7; 27E6; c # MATHEMATICAL RIGHT WHITE SQUARE BRACKET
27E8; 27E9; o # MATHEMATICAL LEFT ANGLE BRACKET
27E9; 27E8; c # MATHEMATICAL RIGHT ANGLE BRACKET
27EA; 27EB; o # MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
27EB; 27EA; c # MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
27EC; 27ED; o # MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
27ED; 27EC; c # MATHEMATICAL RIGHT WHITE TORTOISE SHELL BRACKET
27EE; 27EF; o # MATHEMATICAL LEFT FLATTENED PARENTHESIS
27EF; 27EE; c # MATHEMATICAL RIGHT FLATTENED PARENTHESIS
2983; 2984; o # LEFT WHITE CURLY BRACKET
2984; 2983; c # RIGHT WHITE CURLY BRACKET
2985; 2986; o # LEFT WHITE PARENTHESIS
2986; 2985; c # RIGHT WHITE PARENTHESIS
2987; 2988; o # Z NOTATION LEFT IMAGE BRACKET
2988; 2987; c # Z NOTATION RIGHT IMAGE BRACKET
2989; 298A; o # Z NOTATION LEFT BINDING BRACKET
298A; 2989; c # Z NOTATION RIGHT BINDING BRACKET
298B; 298C; o # LEFT SQUARE BRACKET WITH UNDERBAR
298C; 298B; c # RIGHT SQUARE BRACKET WITH UNDERBAR
298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
2991; 2992; o # LEFT ANGLE BRACKET WITH DOT
2992; 2991; c # RIGHT ANGLE BRACKET WITH DOT
2993; 2994; o # LEFT ARC LESS-THAN BRACKET
2994; 2993; c # RIGHT ARC GREATER-THAN BRACKET
2995; 2996; o # DOUBLE LEFT ARC GREATER-THAN BRACKET
2996; 2995; c # DOUBLE RIGHT ARC LESS-THAN BRACKET
2997; 2998; o # LEFT BLACK TORTOISE SHELL BRACKET
2998; 2997; c # RIGHT BLACK TORTOISE SHELL BRACKET
29D8; 29D9; o # LEFT WIGGLY FENCE
29D9; 29D8; c # RIGHT WIGGLY FENCE
29DA; 29DB; o # LEFT DOUBLE WIGGLY FENCE
29DB; 29DA; c # RIGHT DOUBLE WIGGLY FENCE
29FC; 29FD; o # LEFT-POINTING CURVED ANGLE BRACKET
29FD; 29FC; c # RIGHT-POINTING CURVED ANGLE BRACKET
2E22; 2E23; o # TOP LEFT HALF BRACKET
2E23; 2E22; c # TOP RIGHT HALF BRACKET
2E24; 2E25; o # BOTTOM LEFT HALF BRACKET
2E25; 2E24; c # BOTTOM RIGHT HALF BRACKET
2E26; 2E27; o # LEFT SIDEWAYS U BRACKET
2E27; 2E26; c # RIGHT SIDEWAYS U BRACKET
2E28; 2E29; o # LEFT DOUBLE PARENTHESIS
2E29; 2E28; c # RIGHT DOUBLE PARENTHESIS
2E55; 2E56; o # LEFT SQUARE BRACKET WITH STROKE
2E56; 2E55; c # RIGHT SQUARE BRACKET WITH STROKE
2E57; 2E58; o # LEFT SQUARE BRACKET WITH DOUBLE STROKE
2E58; 2E57; c # RIGHT SQUARE BRACKET WITH DOUBLE STROKE
2E59; 2E5A; o # TOP HALF LEFT PARENTHESIS
2E5A; 2E59; c # TOP HALF RIGHT PARENTHESIS
2E5B; 2E5C; o # BOTTOM HALF LEFT PARENTHESIS
2E5C; 2E5B; c # BOTTOM HALF RIGHT PARENTHESIS
3008; 3009; o # LEFT ANGLE BRACKET
3009; 3008; c # RIGHT ANGLE BRACKET
300A; 300B; o # LEFT DOUBLE ANGLE BRACKET
300B; 300A; c # RIGHT DOUBLE ANGLE BRACKET
300C; 300D; o # LEFT CORNER BRACKET
300D; 300C; c # RIGHT CORNER BRACKET
300E; 300F; o # LEFT WHITE CORNER BRACKET
300F; 300E; c # RIGHT WHITE CORNER BRACKET
3010; 3011; o # LEFT BLACK LENTICULAR BRACKET
3011; 3010; c # RIGHT BLACK LENTICULAR BRACKET
3014; 3015; o # LEFT TORTOISE SHELL BRACKET
3015; 3014; c # RIGHT TORTOISE SHELL BRACKET
3016; 3017; o # LEFT WHITE LENTICULAR BRACKET
3017; 3016; c # RIGHT WHITE LENTICULAR BRACKET
3018; 3019; o # LEFT WHITE TORTOISE SHELL BRACKET
3019; 3018; c # RIGHT WHITE TORTOISE SHELL BRACKET
301A; 301B; o # LEFT WHITE SQUARE BRACKET
301B; 301A; c # RIGHT WHITE SQUARE BRACKET
FE59; FE5A; o # SMALL LEFT PARENTHESIS
FE5A; FE59; c # SMALL RIGHT PARENTHESIS
FE5B; FE5C; o # SMALL LEFT CURLY BRACKET
FE5C; FE5B; c # SMALL RIGHT CURLY BRACKET
FE5D; FE5E; o # SMALL LEFT TORTOISE SHELL BRACKET
FE5E; FE5D; c # SMALL RIGHT TORTOISE SHELL BRACKET
FF08; FF09; o # FULLWIDTH LEFT PARENTHESIS
FF09; FF08; c # FULLWIDTH RIGHT PARENTHESIS
FF3B; FF3D; o # FULLWIDTH LEFT SQUARE BRACKET
FF3D; FF3B; c # FULLWIDTH RIGHT SQUARE BRACKET
FF5B; FF5D; o # FULLWIDTH LEFT CURLY BRACKET
FF5D; FF5B; c # FULLWIDTH RIGHT CURLY BRACKET
FF5F; FF60; o # FULLWIDTH LEFT WHITE PARENTHESIS
FF60; FF5F; c # FULLWIDTH RIGHT WHITE PARENTHESIS
FF62; FF63; o # HALFWIDTH LEFT CORNER BRACKET
FF63; FF62; c # HALFWIDTH RIGHT CORNER BRACKET"
' split into lines
nLines = parse sText, sLine, crlf
printl tstr$(nLines) & " lines"
for i = 1 to nLines
' get the 4 hex-digits on the left
sText = leftf$(sLine(i),4)
print "U+" & sText & " = "
' making it a quad for bit shift etc
qNum = val("&H" & sText )
codepoint_To_UTF8( qNum,bCode)
' bCode() is supposed to hold the UTF-8 char as is (special string types that work on the basis of encoded chars are planned)
printl "CountOf bCODE()" & str$(countof(bCode))
select case countof(bCode)
case 1
print hex$(bCode(1),2) & $spc
case 2
print hex$(bCode(1),2) & $spc & hex$(bCode(2),2)
case 3
print hex$(bCode(1),2) & $spc & hex$(bCode(2),2) & $spc & hex$(bCode(3),2)
case 4
print hex$(bCode(1),2) & $spc & hex$(bCode(2),2) & $spc & hex$(bCode(3),2) & $spc & hex$(bCode(4),2)
end select
printl " = " & rightf$(sLine(i), Lenf(sLine(i))- (2 + instr(1,sLine(i)," # ")))
next
waitkey
end function
'##############################################################################
function CODEPOINT_TO_UTF8(byval qVal As quad, byref bCode() as Byte ) as BOOLEAN
Long nBytes = iif( qVal >= 0x110000, 0,
iif( qVal >= 0x10000, 4,
iif( qVal >= 0x800, 3,
iif( qVal >= 0x80, 2,
iif( qVal >= 0, 1,
0 )))))
long nBits ,i, lLen
quad qSum
string s_OUT, sBin
sBin = bin$(qVal)
lLen = Lenf(sBin)
Local b_WRITE as byte at 0
local b_READ as byte at StrPtr(sBIN)+ lLen
redim bCode(nBytes)
select case nBytes
case 1
s_OUT = "0......."
bCode(1) = val("&B" & sBin)
return true
case 2
s_OUT = "110.....10......"
case 3
s_OUT = "1110....10......10......"
case 4
s_OUT = "11110...10......10......10......"
end select
setat( b_Write, strptr(s_OUT) + lenf(s_OUT) -1)
nBits = 1
repeat
setat( b_READ, getat(b_READ) - 1)
while b_WRITE <> 46
setat( b_WRITE, getat(b_WRITE) - 1)
nBits += 1
if nBits >= 8 then
nBits -= 8
if nBytes then bCode(nBytes) = val("&B" & memory_Get(getat(b_WRITE) - nBits, 8))
nBytes -= 1
if nBytes < 1 then return true
end if
wend
if b_WRITE = 46 then b_WRITE = b_READ
until getat(b_READ) = strptr(sBIN) or getAt(b_WRITE) = Strptr(s_OUT)
end function
' as CHARu07 Alias ChrASCII Alias (CHAR_)UTF7 "UTF7" is somewhat as a String-Type that does not accept any byte > 127 or it will transform
' ChrANSI (legacy chartype is no simple vartype but accepts trailing parenthesis wrapping a codepage-number)
' CHARu10 Alias WCHAR Alias (CHAR_)UTF16 |WSTRING
' CHARu20 Alias DWCHAR Alias (CHAR_)UTF32 | DWSTRING
' all char-sequences above are available with a fixed count of elements i.e. WSTRING*128 is sized as 128 Word-variables,
' DWSTRING * 128 is sized as 128 DWORD-variables
' CHARu08 Alias (CHAR_)UTF8 is exceptional : "Dim s As UTF8 * 128" is illegal - utf8-strings can not be of fixed size because
' every utf8-char is a dynamic array of 1 to 4 bytes and in case of using 1 byte in all chars it transforms to UTF7. Use the BSTRING
' (binary string type) for fixed size UTF8 or for ANSI when it requires to contain $NUL-chars