My word count program is vaporware



sandyrepope
24-05-2007, 04:01
For about a week I've been trying to figure out how to tell the computer how to count the words in a text file. So far I haven't figured out anything. I'm really stumped about how to tell if something is a word.

I need some advice. Should I look for alphabetic characters and count a word whenever something else follows? Or should I just look for spaces and count them?

Any and all comments are needed!

Thanks
Sandy

Michael Hartlef
24-05-2007, 06:49
It depends on what you count as a word. Is 5!#/?3 a word? If you say yes, then count the spaces; that is the easiest approach.
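
As a rough sketch of the space-counting approach, using only functions that appear in the sample posted later in this thread (Replace$, trimfull$ and parsecount, assumed to behave as they do there). The name CountBySpaces is made up for illustration:

```thinbasic
' -- Sketch only: every run of non-space characters counts as one word,
' -- so "5!#/?3" is a word here. CountBySpaces is a hypothetical name.
function CountBySpaces( BYVAL sString as string ) as long

  ' -- Line breaks and tabs also separate words, so turn them into spaces
  sString = Replace$( sString, any $CRLF+$TAB, with $SPC)
  ' -- Drop leading/trailing spaces and reduce repeated spaces to one
  sString = trimfull$(sString)

  ' -- Each space-delimited field is now one word
  function = parsecount(sString, $SPC)

end function
```

With this rule punctuation stays attached to its neighbour, so "Well !?" would count as two words.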

ErosOlmi
24-05-2007, 08:15
Maybe the Tokenizer module and its sample scripts can help.
The only problem I see is when text is enclosed in "". In that case Tokenizer will return a single token. But I can work on this to allow both kinds of parsing: quoted strings or single tokens. This is a classic problem, so it is worth a look.

I will see if I can find some time to write a sample in the next few days (sorry, but I have very little time until the weekend).

Ciao
Eros

kryton9
24-05-2007, 09:01
A long way around, but a kind of tough coding challenge that would really test thinBasic's scripting power, would be to take your text, copy it to the clipboard, paste it into a word processor with word-counting abilities, get the count, and then give that back to the user, all in the background. It would show how to tap into existing programs with thinBasic as the script engine doing all the work.

Is this even possible Eros, or Roberto?

ErosOlmi
24-05-2007, 10:14
Yes, it is possible via clipboard commands, a COM interface, or SendKey, but that will not solve the problem, just hand it to someone else :D

The problem is a nice challenge that maybe we can organize. A common text to work on could help us get started. I remember there was a similar challenge in Power Basic. Maybe I can get the text they worked on for that challenge.

Ciao
Eros

kryton9
24-05-2007, 13:10
It is an interesting problem, as there are so many ways to solve it.

sandyrepope
24-05-2007, 16:15
I'm slowly getting ideas to try and I think I'll be able to code one of them this weekend and see if it works.

I was thinking that if the word count turns out good enough then I could add features such as keeping track of the unique words and how often they are used.

If I can get the word count program working then there are several ideas that I'll try.

I'm also thinking about having a script that can gather up words from text files and keep a master list of unique words. This list could be used in other programs such as hangman which I plan to try one day soon.

I'm also planning to see if I can build a script that will take several text files and put them together in one text file. I don't really have a need to do this but am curious as to how difficult it would be just to do.

That's all for now...
Sandy

ErosOlmi
24-05-2007, 17:04
Attached is a file containing a text file with contest rules, taken from a contest held some time ago on the Power Basic site.
I think the text file can be taken as input. Regarding the rules, we can decide on a common set here.

Ciao
Eros

kryton9
24-05-2007, 22:18
I read the rules. Wow, they thought of lots of stuff for that contest, some serious stuff. It would be a little too much work to meet all the requirements.
How about we keep ours simpler?

Petr Schreiber
30-05-2007, 21:50
Hi,

this hardly meets contest rules, but it could be useful for someone:


'=============================================================================
'= Simple word count demo =
'=============================================================================

dim NumWords as long
dim testString as string = "Hi, how is it going ? Well !? ..."

NumWords = CountWords(testString)
msgbox (0, "Text:"+$TAB+$TAB+testString+$CRLF+ _
"Word count:"+$TAB+format$(NumWords))

function CountWords( BYVAL sString as string ) as long

' -- First we will replace "weird" characters with spaces
sString = Replace$( sString, any chr$(0 to 31)+";,.?!/\^"+CHR$(123 to 255), with $SPC)
' -- Then we will reduce multiple spaces to single space
sString = trimfull$(sString)

' -- Number of words ? Just parsecount ...
function = parsecount(sString, $SPC)

end function


What's OK about the code: commas and other weird characters are not counted as valid, and it can easily be modified in case I forgot some of them :). Also, the majority of operations are not parser dependent; most of the work is done by the compiled internals of the few functions used.

What's baaad: BYVAL is a bottleneck for big and huge strings; most of the time will be spent creating a copy of the string :(


Bye,
Petr

kryton9
30-05-2007, 22:27
Petr, that is really clever. Also a great example of the powerful features of thinBasic. Really really neat!!!

sandyrepope
30-05-2007, 22:39
Psch
Thanks for posting your example. I didn't fully understand how to use the Replace$ command. Your use of it makes it a lot easier to understand all that can be done with this command.

If it's OK, I'd like to borrow some of your script to improve mine.

Thanks
Sandy

ErosOlmi
30-05-2007, 23:18
Sandy,

Petr is taking advantage of the TRIMFULL$ (http://www.thinbasic.com/public/products/thinBasic/help/html/trimfull$.htm) function: he first replaces characters that are not letters or numbers with spaces using the REPLACE$ (http://www.thinbasic.com/public/products/thinBasic/help/html/replace$.htm) function, then uses TRIMFULL$ to remove any left or right spaces from the buffer and also every repetition of 2 or more spaces inside the buffer.

Try to substitute the following:



dim testString as string = "1"


with something more ... attractive ;)




dim testString as string = "123,.()This is a word, 1+2+3+4+5+6+7+8 is equal to ... I do not know! ...#@[[]]é*é*é>_>_:;:;_:"



and see results.

Of course, that code does not consider whether "123" should count as a word or not. If not, you are in trouble, because you cannot substitute all numbers with spaces; otherwise something like "Hello123World" will be considered 2 words. In any case, a nice and very clever method.

Ciao
Eros

sandyrepope
31-05-2007, 00:43
Quote from ErosOlmi:
Of course, that code does not consider whether "123" should count as a word or not. If not, you are in trouble, because you cannot substitute all numbers with spaces; otherwise something like "Hello123World" will be considered 2 words. In any case, a nice and very clever method.

I'm not sure what others would think about this but if I were doing the word count for myself and not using a script I would count 'Hello123World' as two words. That's just how I'd look at it.

Thanks
Sandy

ErosOlmi
31-05-2007, 00:47
It is just a matter of rules. What is important for a program is to know which rules to follow.
So if your rule is that numbers inside a word are treated as delimiters, then that is OK. It is just one more rule for the program to follow.
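
A rule where digits act as delimiters could be sketched by adding the digits to the match string of the CountWords sample posted earlier in this thread. The function name CountWordsDigitsSplit is made up for illustration, and everything else is assumed to behave as in that sample:

```thinbasic
' -- Sketch only: same steps as CountWords, but digits are delimiters too.
' -- CountWordsDigitsSplit is a hypothetical name.
function CountWordsDigitsSplit( BYVAL sString as string ) as long

  ' -- Replace "weird" characters AND the digits 0-9 with spaces
  sString = Replace$( sString, any chr$(0 to 31)+";,.?!/\^"+"0123456789"+CHR$(123 to 255), with $SPC)
  ' -- Reduce multiple spaces to a single space
  sString = trimfull$(sString)

  ' -- "Hello123World" has become "Hello World" at this point
  function = parsecount(sString, $SPC)

end function
```

Under this rule a purely numeric token such as "123" disappears entirely, so whether that is acceptable is again a matter of the rules chosen.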

Ciao
Eros

Petr Schreiber
31-05-2007, 08:43
Hi sandy,

Here is how replace$ works in the sample:


sString = Replace$( sString, any chr$(0 to 31)+";,.?!/\^"+CHR$(123 to 255), with $SPC)


It will scan sString for occurrences of any of the characters in the match string and replace each of them with a space.
We could also simply remove$ them, but I thought it would be slower. This way the length of the string stays the same until trimfull$.

To see the steps, you can watch this "debug" version:


'=============================================================================
'= Simple word count demo =
'=============================================================================

dim NumWords as long
dim testString as string = "Hi, how is it going ? Well !? ... This is a text, isn't it ?"

NumWords = CountWords(testString)
msgbox (0, "Text:"+$TAB+$TAB+testString+$CRLF+ _
"Word count:"+$TAB+format$(NumWords))

function CountWords( BYVAL sString as string ) as long
local DebugInfo as string

DebugInfo = "Original:"+$TAB+$TAB+sString+$CRLF

' -- First we will replace "weird" characters with spaces
sString = Replace$( sString, any chr$(0 to 31)+";,.?!/\^"+CHR$(123 to 255), with $SPC)

DebugInfo += "Replace$:"+$TAB+sString+$CRLF

' -- Then we will reduce multiple spaces to single space
sString = trimfull$(sString)

DebugInfo += "TrimFull$:"+$TAB+$TAB+sString

msgbox (0, DebugInfo)
' -- Number of words ? Just parsecount ...
function = parsecount(sString, $SPC)

end function



Bye,
Petr

P.S. And sorry for "1" in original sample as test string, I was probably very sleepy when posting it :)

sandyrepope
31-05-2007, 15:44
Psch, Thank you for the script. The debug part sure makes it easier to see what is going on with Replace$. I just hope my scripts turn out as good as yours do.

Thanks
Sandy

Petr Schreiber
31-05-2007, 20:04
Thanks,

and good luck with your project!

Bye,
Petr