A bug and thinking about utf8 encoding [Archive] - thinBasic: Basic Programming Language Community Forum

xLeaves

25-05-2022, 07:45

I noticed that the changelog for versions 1.11.3.0 - 1.11.7.0 included a change to the FILE_Exists function, and then noticed that some of my scripts with Chinese file names no longer run through thinbasic.exe.

I then did some experiments and confirmed that this was due to a utf8 encoding issue, and that scripts in plain English paths were working.

However, when I converted the path to utf8 encoding and started thinbasic.exe, the script still would not run. In addition, when a script in a plain English path references external scripts through #include, the referenced path must be utf8 encoded for it to run properly.

This is only a guess, but maybe thinbasic.exe loads the script file by calling FILE_Exists, FILE_Load and similar functions, and these functions were changed to support utf8 encoding in later versions, which led to this whole problem?
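
As a rough illustration of that guess (a Python sketch, not thinBasic's actual loader; the file names and the cp936 code page are assumptions chosen for the example), the same Chinese file name produces different bytes under the system code page and under utf8, while a plain English name does not:

```python
# Hypothetical file names, used only for illustration.
CHINESE_NAME = "测试.tbasic"
ENGLISH_NAME = "test.tbasic"


def bytes_differ(name: str, ansi_codepage: str = "cp936") -> bool:
    """Return True if the name encodes differently in the code page and in utf8."""
    return name.encode(ansi_codepage) != name.encode("utf-8")


def readable_as_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes are valid utf8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Plain ASCII names are byte-identical in both encodings.
    print(bytes_differ(ENGLISH_NAME))
    # Chinese names are not, so a function that expects utf8 cannot
    # interpret a path that arrives in the system code page.
    print(bytes_differ(CHINESE_NAME))
    print(readable_as_utf8(CHINESE_NAME.encode("cp936")))
```

This would be consistent with English paths working while Chinese paths fail, but it does not prove where the conversion goes wrong inside thinbasic.exe.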

Multi-language encoding support is a great thing, and I initially suggested that the author make changes in this regard, but over time I realized that this might not be a very good choice.

OEM encoding always follows the Windows distribution and matches the preferred language of each country's version of Windows; we usually classify it as ANSI-like. The advantage of this encoding is that it costs nothing to understand and is very easy to use. Unless a project has to be available in all countries, or has to be compatible with multiple languages on one PC, unicode is not a necessary choice.

Of course, this doesn't mean I'm against unicode. One of the more frustrating things for me is that when extensions like UI, File, etc. add support for utf8, it becomes more complicated to handle multiple languages. In the last year, while training people around me to program in thinbasic, I often had to explain to them why character encoding was needed in this area, and then what character encoding was and why there were so many of them.

Of course, this confusion may be caused by the fact that thinbasic does not use utf8 consistently in all functions. If all functions were utf8, then no encoding conversion would be needed, but obviously that is not the case, which means that a lot of encoding conversion operations are needed when calling the system API, and this reduces the efficiency of software execution.

BASIC-like languages have always been known for their simplicity, so maybe it's time for me to suggest to the authors to eliminate utf8, or maybe we can make the encoding switchable? For example, the encoding could be set by some command, and after switching, the functions would automatically convert text according to the set encoding. But this is a big job and would take a long time.
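
To make the switchable idea a little more concrete, here is a minimal Python sketch. It is not thinBasic code; the names set_text_encoding, to_api_text and from_api_text are hypothetical and exist only for this example. One command sets the encoding, and every text conversion afterwards follows that setting:

```python
import codecs

# Module-wide setting; utf8 is assumed as the starting default.
_text_encoding = "utf-8"


def set_text_encoding(name: str) -> None:
    """Switch the encoding used by all later conversions."""
    global _text_encoding
    codecs.lookup(name)  # raises LookupError if the encoding is unknown
    _text_encoding = name


def to_api_text(raw: bytes) -> str:
    """Convert script-side bytes to text using the current setting."""
    return raw.decode(_text_encoding)


def from_api_text(text: str) -> bytes:
    """Convert text back to script-side bytes using the current setting."""
    return text.encode(_text_encoding)


if __name__ == "__main__":
    set_text_encoding("cp936")   # behave like a Chinese-language Windows code page
    print(to_api_text("中文".encode("cp936")))
    set_text_encoding("utf-8")   # switch back to utf8
    print(from_api_text("中文"))
```

The appeal is that the script author chooses once and the conversions happen in one place; the cost is that every function that touches text has to go through that one place.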

The above is my thinking on this matter. Either way, the problem that thinbasic.exe cannot run scripts under a non-ascii path is real, and perhaps fixing it can be given priority in a future version.

Translated with www.DeepL.com/Translator (free version)

xLeaves

25-05-2022, 08:23

I have collected the functions that now use utf8 encoding by default.

<Textbox>.Text
<ButtonName>.Text

MENU SET TEXT
MENU GET TEXT
MENU ADD STRING
MENU ADD POPUP
DIALOG NEW
CONTROL_GetText
CONTROL GET TEXT
CONTROL_SetText
CONTROL SET TEXT
DIALOG GET TEXT
DIALOG SET TEXT
Control Append Text

FILE_Exists
FILE_Load
FILE_Save
FILE_Append

Load_File
Save_File

Petr Schreiber

27-05-2022, 19:22

Hi xLeaves,

thank you very much for your message.

I fully agree there is a long way to go before thinBasic can be considered unicode friendly; we are getting there one step after another.

It is good to be aware of this limitation. I will think about how to reflect it in the documentation.

Petr

xLeaves

28-05-2022, 05:45

We may be able to do it this way: split the unicode and oem encoding operations into two groups of functions. At this stage the default would be oem encoding, and the scope of unicode support would be increased gradually in the future. This may increase the size of the program, but as hardware improves people are less sensitive to size. The increase can be expected to be a few dozen KB, which is not a very serious problem. The unicode group of functions could carry a W suffix:

<Textbox>.TextW
<ButtonName>.TextW

MENU SET TEXTW
MENU GET TEXTW
MENU ADD STRINGW
MENU ADD POPUPW
DIALOG NEWW
CONTROL_GetTextW
CONTROL GET TEXTW
CONTROL_SetTextW
CONTROL SET TEXTW
DIALOG GET TEXTW
DIALOG SET TEXTW
Control Append TextW

FILE_ExistsW
FILE_LoadW
FILE_SaveW
FILE_AppendW

Load_FileW
Save_FileW

And the following functions would keep working with the oem encoding:

<Textbox>.Text
<ButtonName>.Text

MENU SET TEXT
MENU GET TEXT
MENU ADD STRING
MENU ADD POPUP
DIALOG NEW
CONTROL_GetText
CONTROL GET TEXT
CONTROL_SetText
CONTROL SET TEXT
DIALOG GET TEXT
DIALOG SET TEXT
Control Append Text

FILE_Exists
FILE_Load
FILE_Save
FILE_Append

Load_File
Save_File

By OEM encoding I mean CP_ACP or CP_OEMCP. This is usually thought of as ANSI, but it is really the default multi-byte code page of the Windows operating system, which differs by language and country.

When I was overriding the AnsiToUTF8$ function before, I found that it used CP_ANSI or another fixed English code page instead of CP_OEM, which causes text in multi-byte code pages other than English to be converted to utf8 incorrectly.
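
The effect can be reproduced outside thinBasic. The Python sketch below is only an illustration: ansi_to_utf8 is a hypothetical stand-in, not the real AnsiToUTF8$ implementation, and cp936 and cp1252 are assumed as the Chinese and English code pages. Converting with the wrong source code page does not fail, it silently produces different text:

```python
def ansi_to_utf8(data: bytes, codepage: str) -> bytes:
    """Reinterpret bytes from the given code page and re-encode them as utf8."""
    return data.decode(codepage).encode("utf-8")


# Bytes as a Chinese-language Windows would store the text.
SOURCE_TEXT = "测试"
ANSI_BYTES = SOURCE_TEXT.encode("cp936")

if __name__ == "__main__":
    # Using the code page the bytes were written in round-trips correctly.
    good = ansi_to_utf8(ANSI_BYTES, "cp936").decode("utf-8")
    # Using a fixed English code page yields unrelated characters.
    bad = ansi_to_utf8(ANSI_BYTES, "cp1252").decode("utf-8")
    print(good == SOURCE_TEXT)
    print(bad == SOURCE_TEXT)
```

Because no error is raised in the second case, this kind of bug tends to show up only as garbled text or as files that cannot be found.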

TheInsider

14-02-2025, 21:04

Yes, thank you for this post that helped me figure out a problem. I spent two hours trying to debug why:

file_kill("test1.json")
file_append("test1.json",temp$)

and

file_save("test2.json",temp$)

were not writing out exactly the same files. Turns out the — character was the culprit and was getting saved differently.

Of course, file_save_utf8 is what was needed.
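
For anyone hitting the same thing, one plausible explanation is that the two calls wrote the character in different encodings. The Python sketch below is not thinBasic code and does not claim which function used which encoding; cp1252 is assumed as the ANSI code page. It only shows that the same dash character becomes different bytes:

```python
# U+2014 is the long dash character from the post above.
DASH = "\u2014"


def encoded_forms(ch: str) -> dict:
    """Return the bytes a character becomes under an ANSI code page and under utf8."""
    return {
        "cp1252": ch.encode("cp1252"),
        "utf-8": ch.encode("utf-8"),
    }


if __name__ == "__main__":
    forms = encoded_forms(DASH)
    # One byte in the ANSI code page, three bytes in utf8.
    print(len(forms["cp1252"]), len(forms["utf-8"]))
    # Plain ASCII text is identical in both, which is why the rest of the file matched.
    print(encoded_forms("a")["cp1252"] == encoded_forms("a")["utf-8"])
```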