ARRAY SCAN results



LCSims
14-11-2011, 05:19
Greets,

I'm just learning thinBasic and hate to ask a question or request help, but after reading the Help file and checking sample scripts I have run into a problem.

Quick background: I'm working with an elevation file and I have a LOT of elevation files! First six lines are header data and then there are 1201 lines of elevation data, each with 1201 entries. Any geo-referenced point with no data gets assigned -9999. What I'm attempting to do on a batch basis is read the file into an array with Parse(FILE_Load(FileToLoad), MyMatrix(), $CRLF , $SPC). Then I use nRec = Array Scan MyMatrix() , = "-9999" to scan the array and count all instances of -9999.

The file I'm testing with has 1202 instances of -9999, but my little program tells me there are 2422 occurrences??

I would be very appreciative if someone could take a quick look and tell me where my beginner skills went awry? Many, many years ago I worked in Microsoft's PDS 7.0, but things have changed from that time and my mind isn't quite as sharp.

The file location would need to be adjusted in the thinBasic file (line 4) and please excuse some of the other code, as I put things in just to make sure I'm following along OK.

Thank you for any assistance,
Lance

ErosOlmi
17-11-2011, 19:05
Dear Lance,

sorry for the delay, but your post was automatically held for moderation by the forum anti-spam software.
This usually happens until a user has at least 4 published posts.
Anyway, I have now found it and released it from the vault area.

OK, the problem is that ARRAY SCAN works on one-dimensional arrays and not on matrices (two-dimensional arrays), so when you use it on a matrix you get incorrect results. Maybe I will add a check that raises a run-time error.
I took your code and amended it a bit so that it can count the missing data:


uses "file"
uses "console"


Dim FileToLoad As String Value APP_SourcePath & "cgn21w001.asc.txt"
dim MyMatrix() as string
dim nLines as long
dim nCols as long
dim T0, T1 as quad


Dim sBuffer As String
Dim InfoBuffer As String
Dim DataBuffer As String
Dim lPos As Long
Dim nRec As Long


'---Load the file, split off the header lines, then parse the data lines into the matrix.
'------
'---Load full file into a string buffer
PrintL "Loading file " & FileToLoad
sBuffer = FILE_Load(FileToLoad)

PrintL "getting info and data parts ..."
'---Now we have to remove first 6 lines
'---First find the 6th occurrence of $CRLF
lPos = InStr(sBuffer, $CRLF, 6)
'---Then create two buffers: one for the info part and one for the data part
InfoBuffer = LEFT$(sBuffer, lPos)
DataBuffer = Mid$(sBuffer, lPos + 2)

PrintL "Creating Matrix data ..."
'---Now we parse data buffer
Parse(DataBuffer, MyMatrix(), $CRLF , $SPC)


'--Now get the number of lines and max number of columns parsed
nLines = ubound(MyMatrix(1))
nCols = ubound(MyMatrix(2))


PrintL "Lines:", nLines, "Columns:", nCols

'---Write some info
PrintL "Searching missing data ..."


dim CountLine as long
dim CountCol as long


For CountLine = 1 To nLines
  For CountCol = 1 To nCols
    If MyMatrix(CountLine, CountCol) = "-9999" Then
      Incr nRec
    End If
  Next
Next
PrintL "Missing data found:", nRec



PrintL Repeat$(79, "-")
PrintL "Program terminated. Press any key to close."
WaitKey




Let me know and sorry again for the delay.

Ciao
Eros

PS: a matrix scan could be a nice idea.
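
As an illustration of what such a matrix scan would do, here is a minimal sketch written in Python rather than thinBasic. The names parse_matrix and count_matches are hypothetical helpers made up for this example, not thinBasic functions, and the sample data is invented. It simply visits every cell of a row-by-column matrix and counts the cells equal to the target string.

```python
# Illustrative Python sketch, not thinBasic.
# parse_matrix and count_matches are hypothetical helper names.

def parse_matrix(text, row_sep="\r\n", col_sep=" "):
    """Split text into rows and columns, skipping empty rows and cells."""
    rows = []
    for line in text.split(row_sep):
        cells = [c for c in line.split(col_sep) if c]
        if cells:
            rows.append(cells)
    return rows


def count_matches(matrix, target):
    """Count the cells equal to target in a list-of-rows matrix."""
    return sum(1 for row in matrix for cell in row if cell == target)


# Invented sample data: three rows of three values, three of them missing.
sample = "12 -9999 15\r\n-9999 -9999 20\r\n7 8 9\r\n"
grid = parse_matrix(sample)
print(count_matches(grid, "-9999"))  # 3
```

The comparison is done on strings, as in the loop above, so "-9999" only matches cells that contain exactly that text.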

ErosOlmi
17-11-2011, 19:30
Here is a version with some timing added:



uses "file"
uses "console"


Dim FileToLoad As String Value APP_SourcePath & "cgn21w001.asc.txt"
dim MyMatrix() as string
dim nLines as long
dim nCols as long
dim T0, T1 as quad


Dim sBuffer As String
Dim InfoBuffer As String
Dim DataBuffer As String
Dim lPos As Long
Dim nRec As Long
Dim MyTimer As cTimer


MyTimer = New cTimer("Timer used to store elapsed time between stages")
MyTimer.Start


'---Load the file, split off the header lines, then parse the data lines into the matrix.
'------
'---Load full file into a string buffer
PrintL "Input file: " & FileToLoad
Print "Loading file ... "
sBuffer = FILE_Load(FileToLoad)
PrintL MyTimer.Elapsed


Print "Getting info and data parts ..."
'---Now we have to remove first 6 lines
'---First find the 6th occurrence of $CRLF
lPos = InStr(sBuffer, $CRLF, 6)
'---Then create two buffers: one for the info part and one for the data part
InfoBuffer = LEFT$(sBuffer, lPos)
DataBuffer = Mid$(sBuffer, lPos + 2)
PrintL MyTimer.Elapsed
PrintL "Info size in bytes: ", Len(InfoBuffer)
PrintL "Data size in bytes: ", Len(DataBuffer)




Print "Creating Matrix data ... "
'---Now we parse data buffer
Parse(DataBuffer, MyMatrix(), $CRLF , $SPC)


'--Now get the number of lines and max number of columns parsed
nLines = ubound(MyMatrix(1))
nCols = ubound(MyMatrix(2))
PrintL MyTimer.Elapsed


PrintL "Lines:", nLines, "Columns:", nCols

'---Write some info
Print "Searching missing data ... "


dim CountLine as long
dim CountCol as long


For CountLine = 1 To nLines
  For CountCol = 1 To nCols
    If MyMatrix(CountLine, CountCol) = "-9999" Then
      Incr nRec
    End If
  Next
Next
PrintL MyTimer.Elapsed
PrintL "Missing data found:", nRec



PrintL Repeat$(79, "-")
PrintL "Total time:", MyTimer.Elapsed
MyTimer.Stop


PrintL "Program terminated. Press any key to close."
WaitKey




Almost all of the time is taken by the step that builds MyMatrix, that is, the Parse function.
Maybe I can improve this process in future thinBasic versions.

LCSims
17-11-2011, 23:36
Thank you, Eros. Your code works well, based on initial tests. Only 15,000 more files to run it on!

I guess my confusion was in defining an "Array" versus a "Matrix", but that's clear enough for me now.

My work in thinBasic has just begun and I anticipate there will be more than enough questions to get past the threshold of four posts. I posted a couple of replies about an inline If...Then...Else If issue, but those don't show on the post count.

Thanks for your work and assistance. I'll do my best to research before asking. Though I may have a couple of generalized questions in a couple of days.

Lance

ErosOlmi
17-11-2011, 23:42
LCSims wrote: "I posted a couple of replies about an inline If...Then...Else If issue, but those don't show on the post count."


You mean this post: http://www.thinbasic.com/community/project.php?issueid=296
It was made in the support area, and unfortunately that area is not counted by the anti-spam system running on this forum.
In any case you are now aboard ;)

ErosOlmi
18-11-2011, 10:49
LCSims wrote: "Thank you, Eros. Your code works well, based on initial tests. Only 15,000 more files to run it on!"


That is a lot of data.
The example you posted is an 8 MB file; times 15000 files, that is around 120 GB of data.
The posted file has 1201 rows times 1201 columns, so it has 1442401 items; times 15000 files, that gives 21636015000 items (more than 21 billion).
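
The arithmetic above can be checked with a few lines of Python (used here only as a calculator, not as thinBasic code). The 8 MB per file figure is the approximate size quoted in the post, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Quick check of the data volume estimate.
rows = cols = 1201
files = 15_000
mb_per_file = 8  # approximate size quoted in the post

items_per_file = rows * cols
total_items = items_per_file * files
total_gb = mb_per_file * files / 1000  # assuming 1 GB = 1000 MB

print(items_per_file)  # 1442401
print(total_items)     # 21636015000
print(total_gb)        # 120.0
```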

Depending on what you need to do with such a huge amount of data, consider writing a script that loads the data into a database for further analysis there.
You could make a table with a few fields:
fileID (used to trace each value back to its source file)
row
column
value
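
A table along those lines could be sketched as follows, using Python's built-in sqlite3 module purely as an illustration. The database engine, the table name elevation, and the column names file_id, row_no and col_no are assumptions made for this example (row and column are renamed to stay clear of SQL keywords), and the sample grid is invented.

```python
import sqlite3

# In-memory database for the sketch; a real run would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE elevation (
        file_id INTEGER NOT NULL,  -- identifies the source file
        row_no  INTEGER NOT NULL,
        col_no  INTEGER NOT NULL,
        value   INTEGER NOT NULL,
        PRIMARY KEY (file_id, row_no, col_no)
    )
    """
)

# Invented 2 x 3 grid standing in for one parsed file (file_id = 1).
grid = [[12, -9999, 15], [-9999, -9999, 20]]
cells = [
    (1, r, c, v)
    for r, line in enumerate(grid, start=1)
    for c, v in enumerate(line, start=1)
]
conn.executemany("INSERT INTO elevation VALUES (?, ?, ?, ?)", cells)

# Counting missing values becomes a single query.
missing = conn.execute(
    "SELECT COUNT(*) FROM elevation WHERE value = -9999"
).fetchone()[0]
print(missing)  # 3
```

With billions of rows, one row per cell becomes very large, so whether this layout is practical depends on the database and on which queries are needed.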

ErosOlmi
18-11-2011, 11:51
Dear Lance,

thanks to your example I was able to improve the execution speed of the PARSE function by almost 10 times when it is used with quadratic string buffers.

Here, the posted example, which parses an 8 MB file into a 1201 x 1201 string matrix, took 24 seconds to run.
With the new version of the PARSE function it takes 2.5 seconds.
I was doing too many string parsing operations inside a loop when I could do them just once.
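
The kind of change described, moving repeated work out of a loop, can be illustrated with a small Python sketch. This is not the actual PARSE implementation, whose internals are not shown in this thread; parse_slow and parse_fast are hypothetical functions that only demonstrate the general idea of doing a split once instead of once per row.

```python
# Illustrative only: NOT the thinBasic PARSE code.

def parse_slow(buffer, row_sep="\r\n", col_sep=" "):
    """Re-splits the whole buffer on every iteration (work repeated per row)."""
    n_rows = len(buffer.split(row_sep))
    matrix = []
    for i in range(n_rows):
        line = buffer.split(row_sep)[i]  # loop-invariant split, repeated
        matrix.append(line.split(col_sep))
    return matrix


def parse_fast(buffer, row_sep="\r\n", col_sep=" "):
    """Splits the buffer into lines once, then splits each line."""
    lines = buffer.split(row_sep)        # done once, outside the loop
    return [line.split(col_sep) for line in lines]


data = "1 2 3\r\n4 5 6\r\n7 8 9"
assert parse_slow(data) == parse_fast(data)
print(parse_fast(data)[1])  # ['4', '5', '6']
```

Both functions return the same matrix; the slow one does work proportional to the whole buffer for every row, which grows quickly on a 1201-line file.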

I will post a preliminary thinBasic 1.9 version very soon so you will be able to test it.
It should considerably change the time needed to process 15000 files.

Ciao
Eros

LCSims
18-11-2011, 23:32
Greets Eros,

Something good came out of my hacking? I'm pleased!

My data sets cover the planet, as I make add-ons for a flight simulator. The 1201x1201 sets are very low resolution compared to some other areas. The main body of the U.S. comes in 11801x11801 for one degree of latitude and longitude. Even more precise data gets much, much larger. I found thinBasic because the other variant I was using choked on the 11801x data, literally froze up. I'll be making quarter files in a lot of areas, as adjoining areas may have to be loaded and I'd hate to tax my system with three files loaded into "matrixes" of 11801x.

Right now I'm preparing to start looping through the 15,000 valid files to see how much, if any invalid data is in the files. Some is to be expected. Running on a secondary computer should take about 15 hours or so, maybe more? But that's what secondary systems are for!

Thanks again for your efforts. I'll come back next week with a couple of generalized questions.

Lance

ErosOlmi
19-11-2011, 10:52
OK, I've uploaded a preliminary thinBasic 1.9.0.0

It implements a super fast version of the PARSE function when it is used on a "quadratic" string buffer to be parsed into a matrix.
As I said, in my tests using Lance's data (attached to the first post of this thread) the parsing step went from 22 seconds to less than 1 second.

I've also introduced a new optional parameter that reduces the time spent determining the number of columns.
To determine the maximum number of columns in the string buffer, PARSE scans all the lines to find which one has the most columns, in order to dimension the matrix.
If you are sure that the number of columns is the same for all lines, this new parameter can be set to 1 to tell PARSE to use just one line to determine the number of columns.
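
The trade-off behind that parameter can be sketched in Python. This is not the PARSE function itself: max_columns and its first_line_only flag are hypothetical names, and using the first line as the single sampled line is an assumption made for the example.

```python
# Illustrative only: max_columns and first_line_only are hypothetical names.

def max_columns(buffer, row_sep="\r\n", col_sep=" ", first_line_only=False):
    """Return the column count used to dimension the matrix."""
    lines = buffer.split(row_sep)
    if first_line_only:
        lines = lines[:1]  # assumption: sample the first line only
    return max(len(line.split(col_sep)) for line in lines)


ragged = "1 2\r\n3 4 5 6\r\n7"
uniform = "1 2 3\r\n4 5 6\r\n7 8 9"

print(max_columns(ragged))                         # 4
print(max_columns(ragged, first_line_only=True))   # 2
print(max_columns(uniform, first_line_only=True))  # 3
```

On the ragged buffer the shortcut undercounts, which is why it is only safe when every line is known to have the same number of columns.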

Url is: http://www.thinbasic.biz/projects/thinbasic/thinBasic_1.9.0.0.zip

Let me know.
Eros

LCSims
19-11-2011, 22:43
Thank you for the update, Eros. I have a sub-set of data files that I use to test my programming on. One folder has 50 files, the other 92. Each file has the same structure as the file you looked at, but with 1801 rows and 1801 columns. I have my program set up to loop through all the files in a folder, with a timer running. I'll refer to 1.8.9 as old and 1.9.0 as new:

50 files: old = 1206.7 sec, new = 161.2 sec
92 files: old = 2221.6 sec, new = 313.9 sec
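
For reference, the speed-up and per-file time implied by those figures can be worked out with a short Python calculation. The function name summarize is made up for this example; the numbers are the ones reported above.

```python
# Speed-up and per-file time from the reported totals.

def summarize(n_files, old_s, new_s):
    """Return (speed-up factor, new seconds per file)."""
    return old_s / new_s, new_s / n_files


for n_files, old_s, new_s in [(50, 1206.7, 161.2), (92, 2221.6, 313.9)]:
    speedup, per_file = summarize(n_files, old_s, new_s)
    print(f"{n_files} files: {speedup:.1f}x faster, {per_file:.2f} s per file")
# 50 files: 7.5x faster, 3.22 s per file
# 92 files: 7.1x faster, 3.41 s per file
```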

A big savings, especially when working on a large scale. I didn't implement the optional parameter setting for columns, but will take a look and learn about that. As the data sets are uniform in structure, that parameter could be worthwhile.

Lance

ErosOlmi
20-11-2011, 09:49
I'm really happy I could help on this.

Next I will see if I can improve the PARSE function even more (I think I have some lines of code that can be optimized further) and possibly also make a scan function that works on matrices.
If you can post a zipped version of an 1801 x 1801 file, I will use it for testing.

Ciao
Eros

ErosOlmi
22-11-2011, 01:47
Dear Lance,

I've updated version 1.9 again because I found a bug in the PARSE function introduced by the recent update. The bug affects PARSE only when it is used for single-array parsing, not with quadratic buffers as in your case.

Anyway, it is better to update again from the usual URL.
Url is: http://www.thinbasic.biz/projects/thinbasic/thinBasic_1.9.0.0.zip

I think I've also improved speed again in PARSE function when used on quadratic buffers by another 10% or so (at least here)

Ciao.
Eros

LCSims
23-11-2011, 03:23
Greets Eros,

The speed improvement so far has been quite noticeable, but faster will be better as I'm getting ready to start file comparing. This will mean parsing two or more files at a time and should be of benefit.

I'm hoping the Zip upload works OK. It's a file with a matrix of 1801, as requested. I can make most any size, should you feel the need.

Thanks again,

Lance