Line termination LF vs CRLF vs CR



fmaxwell
01-04-2008, 03:19
Hi.

I had a multi-line file that I was processing with a thinBasic program I wrote. It came back reporting only one line was read.

The problem was that the file used Unix line terminations (no CR preceding the LF).

Would it be possible to change the FILE_LineInput function to accept any of the following as valid line terminations: LF (Unix-standard), CRLF (DOS/Windows), or CR (Macintosh)?

Secondly, would you consider adding a FILE_LineTermination function which would set the character(s) that were written to the end of each line? Examples:



FILE_LineTermination($CR) ' Write file with Macintosh line terminations
FILE_LineTermination($CRLF) ' Write file with DOS/Windows line terminations
FILE_LineTermination($LF) ' Write file with Unix line terminations


With those changes, one could read an ASCII file regardless of the system on which it was created and write a file that conformed to whichever line termination standard was needed.
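The idea of accepting any terminator on input and choosing one on output can be sketched as below. This is only an illustration in Python, not thinBasic; the function name `convert_terminators` is hypothetical and does not correspond to any thinBasic function.

```python
import re

# CRLF is listed first so a DOS terminator is matched as one line break,
# not as a CR followed by a separate LF.
_ANY_TERMINATOR = re.compile(rb"\r\n|\r|\n")


def convert_terminators(data, terminator=b"\r\n"):
    """Return data with every CRLF, CR, or LF replaced by terminator.

    Hypothetical helper: data and terminator are both bytes.
    """
    # A function is used as the replacement so the terminator bytes are
    # inserted literally, with no escape processing.
    return _ANY_TERMINATOR.sub(lambda match: terminator, data)
```

Calling `convert_terminators(buffer, b"\n")` would produce Unix-style output whatever mix of terminators the input used.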

Just a suggestion...

ErosOlmi
01-04-2008, 06:11
Good idea. I will check what I can do. I may have to change the native function.

In the meantime, one workaround is to load the whole file and parse it into an array of lines using something like the following. For files up to a few MB it should run very quickly.



uses "FILE"

'---Change file name as needed
dim MyFile as string value APP_SourceFullName

'---Will contain all lines loaded
Dim MyLines() AS STRING

'---Will count number of lines found
DIM nLines AS LONG

'---Load the full file and parse it into tokens separated by $LF
'---Returns number of tokens (in this case lines) found
'---MyLines array will have all tokens loaded inside
nLines = PARSE(file_load(MyFile), MyLines, $lf)

msgbox 0, "Lines loaded: " & nLines


Also (again as a possible alternative) consider FILE_Load (http://www.thinbasic.com/public/products/thinBasic/help/html/file_load.htm) and FILE_Save (http://www.thinbasic.com/public/products/thinBasic/help/html/file_save.htm), which work on the full file buffer. Then you can handle the file content in a string buffer.
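One caveat with splitting a whole buffer on LF alone: lines from a DOS/Windows file keep a trailing CR, and a CR-only file is not split at all. The sketch below (Python, not thinBasic; both function names are hypothetical) contrasts that with splitting on any of the three terminators.

```python
import re


def split_on_lf(buffer):
    """Split a bytes buffer on LF only, as in the workaround above."""
    return buffer.split(b"\n")


def split_on_any(buffer):
    """Split a bytes buffer on CRLF, CR, or LF (CRLF tried first)."""
    return re.split(rb"\r\n|\r|\n", buffer)


def load_lines(path):
    """Load the whole file in binary mode and split it into lines."""
    with open(path, "rb") as f:
        return split_on_any(f.read())
```

Like the workaround it illustrates, this holds the entire file in memory, so it only suits files of moderate size.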

Ciao
Eros

ErosOlmi
01-04-2008, 06:59
Just in case you didn't have the opportunity to look at the PARSE (http://www.thinbasic.com/public/products/thinBasic/help/html/parse.htm) function, it has an additional parameter (called FieldDelim) that lets you parse fields inside the parsed lines and automatically fill the array (in this case a matrix array). See the following example.

Ciao
Eros



'-------------------------------------
'---Matrix Example
'-------------------------------------
Dim MyMatrix() As String
Dim nLines As Long

'---The following line will automatically dimension and fill MyMatrix to 3 rows and 5 columns
' nLines will contain the number of lines parsed, that is 3 in this case
nLines = PARSE("1,2,3,4,5|6,7,8,9,10|A,B,C", MyMatrix, "|", ",")

MSGBOX 0, "Number of lines : " & UBound(MyMatrix(1))
MSGBOX 0, "Number of columns: " & UBound(MyMatrix(2))
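For readers unfamiliar with PARSE, the two-level split can be approximated as follows. This is a Python sketch, not thinBasic; `parse_matrix` is a hypothetical name, and padding short rows with empty strings is an assumption made so the result is rectangular (3 rows by 5 columns for the sample string), not a statement about what thinBasic stores in the unused cells.

```python
def parse_matrix(text, line_delim, field_delim):
    """Split text into rows on line_delim, then each row on field_delim.

    Hypothetical helper. Rows shorter than the widest row are padded with
    empty strings (an assumption) so every row has the same length.
    """
    rows = [line.split(field_delim) for line in text.split(line_delim)]
    width = max(len(row) for row in rows)
    return [row + [""] * (width - len(row)) for row in rows]


matrix = parse_matrix("1,2,3,4,5|6,7,8,9,10|A,B,C", "|", ",")
n_rows = len(matrix)
n_cols = len(matrix[0])
```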

ErosOlmi
01-04-2008, 20:53
Fred,

do you have a sample file to test with? If so, can you please attach it to this thread?
I think I have a solution: developing a new dedicated module able to parse, line by line, files from DOS, Unix, and Mac systems.

Thanks a lot
Eros

fmaxwell
01-04-2008, 22:35
Eros,

The file I've been working on is several million lines in length (and I can process it okay once the line terminations are CRLF), so it's not practical to include that here. It also is a bit large for the parse function to be used on the entire file. Also, each line consists of comma-separated data, so I would have to parse twice, once on LF to break it into lines, and once on commas to break it into data elements.

The problem is that I do not know, in advance, which line terminations the file will have when I receive it. It depends on whether it is transferred as FTP ASCII or Binary, whether it is copied through removable media or a mapped, shared network drive, whether they are using a Mac, Linux PC, Windows PC, Solaris, or something else.
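The requirement described here (unknown terminator, file too large to hold in memory, comma-separated fields) amounts to a single streaming pass. A minimal sketch in Python, not thinBasic, is shown below; `iter_records` is a hypothetical name, the ASCII encoding is an assumption, and the plain comma split assumes fields never contain quoted commas.

```python
def iter_records(path, encoding="ascii"):
    """Yield each line of a text file as a list of comma-separated fields.

    newline=None turns on Python's universal newlines mode, so CRLF, CR,
    and LF are all read as line ends and translated to "\n".
    """
    with open(path, "r", encoding=encoding, newline=None) as f:
        for line in f:
            # Drop the translated terminator, then split once on commas.
            yield line.rstrip("\n").split(",")
```

Because the file object is iterated lazily, only one line is held at a time, and each line is split once rather than parsing the whole buffer twice.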

Thanks.

Regards,
Fred

ErosOlmi
02-04-2008, 00:15
OK Fred. I think I have something to work with.
Please find enclosed a new module and a test script.

I've tested all your example files plus a Unix file of around 25 MB containing about 518000 lines. It took less than 2 seconds to read it line by line (without console output, of course).
I didn't have time to add documentation, but it should be easy to follow.

Copy thinBasic_FileLine.dll into your \thinBasic\Lib\ directory.
Copy the test script wherever you prefer. Change the file references inside it so that it can find the input files to test.

The module is largely untested, so apologies for any errors or GPFs. But we can improve it if we are on the right road.
The module uses the memory mapped files technique, so the size of the file should not have much influence on memory consumption.
Line separators ($CRLF, $LF, $CR) are automatically recognised, so there is no need to indicate anything. The important thing is that the files are TEXT files.
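The general technique (map the file, scan for the next CR or LF, return the bytes in between) can be sketched as follows. This is a Python illustration using the standard `mmap` module and says nothing about how the thinBasic module is actually implemented; `iter_lines_mmap` is a hypothetical name. Unlike the module as described later in this thread, this sketch yields empty lines.

```python
import mmap
import os


def iter_lines_mmap(path):
    """Yield each line of a file as bytes, without its terminator.

    CRLF, CR, and LF are all accepted, even mixed within one file.
    """
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return  # an empty file cannot be mapped and has no lines
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            pos = 0
            # Cache the next CR and LF so each byte is searched only once.
            next_cr = mm.find(b"\r")
            next_lf = mm.find(b"\n")
            while pos < size:
                if next_cr != -1 and next_cr < pos:
                    next_cr = mm.find(b"\r", pos)
                if next_lf != -1 and next_lf < pos:
                    next_lf = mm.find(b"\n", pos)
                candidates = [p for p in (next_cr, next_lf) if p != -1]
                if not candidates:
                    yield mm[pos:size]  # last line has no terminator
                    return
                end = min(candidates)
                yield mm[pos:end]
                # A CR immediately followed by LF counts as one terminator.
                pos = end + 2 if mm[end:end + 2] == b"\r\n" else end + 1
```

The operating system pages the file in on demand, so memory use stays modest even for very large files.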

Let me know about your tests.

Ciao
Eros

fmaxwell
02-04-2008, 02:13
Eros,

Initial testing at this end looks great. Seems to work with every text file thrown at it. I may try some mixed up ones that have multiple line termination types.

The need to indicate the line termination type was for the File_LinePrint function, so that it would terminate the output file lines according to the user's needs (for users writing Mac or Unix files). I do not have a current need to write files in Unix or Mac format, but I know that I will at some time and so will others. Might as well give the flexibility to specify any string as a line termination.

I will be converting my program to use the new feature you just provided (understanding that it is relatively untested) for further testing. If it works, will you be rolling the functions into the regular FILE module or will they remain separate?

Regards,
Fred

ErosOlmi
02-04-2008, 07:21
"If it works, will you be rolling the functions into the regular FILE module or will they remain separate?"


No, this will be developed as a new official thinBasic module because it doesn't use the standard I/O approach to files; instead it uses Memory Mapped Files (http://msdn2.microsoft.com/en-us/library/ms810613.aspx) to solve the problem of reading big text files. All the hard work is done by the operating system and not by the module code.

This module will be dedicated to reading files and not writing (for the moment).
Writing the output is relatively simple because you can just create a new standard file in BINARY mode and PUT (http://www.thinbasic.com/public/products/thinBasic/help/html/file_put.htm) into it what you need, with the line terminator you need, as a standard string.
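The binary-mode approach to writing can be sketched like this. It is a Python illustration, not thinBasic code; `write_lines` is a hypothetical name.

```python
def write_lines(path, lines, terminator=b"\n"):
    """Write each line (bytes) followed by the chosen terminator.

    Binary mode is what keeps the terminator exactly as given: no
    platform-specific newline translation is applied on output.
    """
    with open(path, "wb") as f:
        for line in lines:
            f.write(line)
            f.write(terminator)
```

Passing `b"\r"`, `b"\r\n"`, or `b"\n"` as the terminator would give Macintosh, DOS/Windows, or Unix output respectively, and any other byte string works the same way.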

Maybe I will change the name of the module (I'm not satisfied with the "FileLine" name :-\ ) and with it the names of its functions. Some functions may also get slightly changed syntax. So for the moment please do not develop big scripts; just test the new functions to catch bugs or suggest improvements.

Ciao
Eros

ErosOlmi
02-04-2008, 11:54
I forgot to mention a possible limitation of this module's approach.

Consecutive $CRLF, $LF, or $CR terminators (in whatever sequence they are found) are treated as a single terminator. So empty lines in files are simply ignored by the parser, and the next line with one or more characters is returned.
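The difference between the two behaviours can be seen in a small sketch. This is Python, not the module's code; both function names are hypothetical.

```python
import re


def split_skipping_empty(data):
    """Treat any run of CR/LF bytes as one separator (empty lines vanish)."""
    return re.findall(rb"[^\r\n]+", data)


def split_keeping_empty(data):
    """Treat each CRLF, CR, or LF as its own terminator (empty lines kept)."""
    lines = re.split(rb"\r\n|\r|\n", data)
    # A terminator at the very end of the data ends the last line; it does
    # not start an extra empty one.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines
```

With the first approach, line numbers reported by the reader no longer match line numbers in the original file whenever it contains blank lines.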

I will see if I can change this approach (if needed).
Eros