Frequency List of Words

deeverd · Post by **deeverd** » Sat Jul 28, 2007 10:59 pm

Hello Studio Forum,

At first, I was thrilled when I discovered Revolution's "50 sample working scripts" online because there is one in particular that I really need to understand:

"Listing All the Unique Words in a Piece of Text."

Unfortunately, when I press the copy button on any of the sample working scripts, nothing gets transferred to my clipboard to paste into my program in order to experiment and learn.

The documentation seems excellent with those scripts, but being very new to the whole programming scene, it is rather confusing to me to not be able to see the whole script in the order in which it appears and is executed in the program.

I've tried various methods to try to get the intact script out of the example, but alas, to no avail. Does anyone either have that particular script verbatim or can anyone offer a good working example of how to create a program that searches through a text and lists the number of times that each particular word is found?

First off, I need to be able to get a "Frequency List of Words."

Secondly, I also need to be able to find out how many times that a particular word in a database field is found within a block of text (which is probably a different program than the frequency list because in this second case, words I'm looking for might or might not be found in the text).

After spending about 3 & 1/2 weeks on this project with only minimal success, I couldn't help but to think it's about time to swallow some more pride and yell, "Help!" "SOS" and "Mayday." I do believe that becoming a beginning programmer is the world's best and fastest way to become humble.

Thanks in advance as always. These Revolution forums are lifesavers and I'm looking forward to the day that I know enough to offer some help in return to newbies who know less than me.

All the best, deeverd

Mark · Post by **Mark** » Sun Jul 29, 2007 9:01 am

Hi Deeverd,

I wonder if this application could be of any help to you:

http://economy-x-talk.com/zipfer.html

Originally, I made this application for a friend of mine. Now I update it only when I receive feature requests.

You may try the following to get a list of words:

put fld x into myVar
replace cr with space in myVar
split myVar by cr and tab
combine myVar by cr and tab

Now, myVar should contain a list of words without duplicates. Use a repeat loop with an offset function to count the words.

Best,

Mark

deeverd · Post by **deeverd** » Sun Jul 29, 2007 9:10 pm

Hi Mark,

I took a look at your website again to check out your Zipfer 1.2 program. It looks quite intriguing and it looks like something I want to have. I bet it's a tool that I can really use when I start taking some statistics courses at university (I'm just starting my fourth year in a Ph.D. program, and very soon I'm going to have to start getting heavily into statistics.)

When the student loan money comes in around five weeks from now, I'll definitely buy a copy with the support package.

In the meantime, I'll see what I can do with the script that you kindly sent. I've only recently begun delving into arrays and elements and keys, so it's all still a whole new world to me, especially since I am not a technology major.

Thanks again, deeverd
P.S. Your helpfulness has inspired me to make a contribution on the Windows forum.

Mark · Post by **Mark** » Sun Jul 29, 2007 9:31 pm

Hi Deeverd,

To use Zipfer, you really don't need to know anything about statistics. It is very straightforward and simple. It does a lot of counting for you and the few statistics only tell you whether the frequencies follow a proper Zipf curve, which may or may not be of interest to you.

Of course, I'll be happy to answer any questions about this product and I look forward to any feature requests.

Best,

Mark

deeverd · Post by **deeverd** » Thu Aug 02, 2007 8:40 pm

Hi Mark,

I've been away a few days. I have tried your script for creating a word frequency list, but I am still rather confused. The first part of the script seems to work fine, and I've included how I used it below:

on mouseUp
  global gFrequencyList

  put field "textField" into gFrequencyList
  replace CR with space in gFrequencyList
  split gFrequencyList by CR and tab
  combine gFrequencyList by CR and tab
  put gFrequencyList into field "frequencyList"
end mouseUp

The part I still don't understand is how to use a repeat loop with an offset function to count the words.

I've used an offset function to find where a particular word or chunk was located, but I don't yet grasp how to use it to count words when trying to find the number of times that each of the words occurs in a text. I could definitely use any advice you have along those lines.

Thanks in advance, deeverd

Mark · Post by **Mark** » Thu Aug 02, 2007 11:50 pm

Hi Deeverd,

You can use the wordOffset function (the word-based counterpart of offset, which counts characters) to find the position of a word, counted from a given starting point in the text.

wordOffset("Word",theText,myNumberOfFirstWord)

The third parameter is the number of words to skip, and the result is counted from the first word after those. So, when you have found the word at position x, you add x to the number of words to skip and start looking for the next instance in the remainder of the text. Repeat this until the function returns 0 (zero).

Something like this:

set the wholeMatches to true -- match whole words only, so "Word" does not match "Words"
put 0 into myNumberOfFirstWord
put 0 into myArray["Word"]
repeat forever
  put wordOffset("Word",theText,myNumberOfFirstWord) into myPos
  if myPos = 0 then exit repeat
  -- myPos is relative to the skipped words, so add it to the running total
  add myPos to myNumberOfFirstWord
  add 1 to myArray["Word"]
end repeat

(untested)

Of course, you will need to put this script inside another repeat loop to count all words in your list.
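
A minimal sketch of that outer loop, assuming the word list holds one word per line (for example, the output of the split/combine step). It uses wordOffset rather than offset so that positions are counted in words, and it sets the wholeMatches property so that "the" does not also match "there". The function name countListedWords and its parameter names are made up for illustration, and like the snippet above it is untested.

```livecode
-- Hypothetical helper: countListedWords, pText and pWordList are illustrative names.
-- Returns an array whose keys are the listed words and whose values are their counts.
function countListedWords pText, pWordList
  local tCounts, tSkip, tPos
  set the wholeMatches to true -- whole words only
  repeat for each line tWord in pWordList
    if tWord is empty then next repeat
    put 0 into tCounts[tWord]
    put 0 into tSkip -- number of words already searched
    repeat forever
      put wordOffset(tWord, pText, tSkip) into tPos
      if tPos = 0 then exit repeat
      -- tPos is relative to the skipped words, so accumulate it
      add tPos to tSkip
      add 1 to tCounts[tWord]
    end repeat
  end repeat
  return tCounts
end countListedWords
```

It could be called as `put countListedWords(field "textField", field "frequencyList") into tCounts`, followed by `combine tCounts by return and tab` to get one "word, tab, count" line per word. Note that words are delimited by whitespace, so "cat" and "cat." count as different words unless punctuation is stripped first.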

If I remember correctly, I use regular expressions to filter the text and count the number of remaining words. So, the approach I show here is not the approach used in Zipfer, but it is an easy and effective approach.

Best,

Mark

FourthWorld · Post by **FourthWorld** » Fri Aug 03, 2007 12:30 am

Here's a word count example from the MetaCard IDE:

on mouseUp
  put empty into field "result"
  answer file "Select a text file for input:"
  if it is empty then exit mouseUp
  # let user know we're working on it
  set the cursor to watch
  put it into inputFile
  open file inputFile for read
  read from file inputFile until eof
  put it into fileContent
  close file inputFile
  # wordCount is an associative array, its indexes are words
  # with the contents of each element being number of times
  # that word appears
  repeat for each word w in fileContent
    add 1 to wordCount[w]
  end repeat
  # copy all the indexes that is in the wordCount associative array
  put keys(wordCount) into keyWords
  # sort the indexes -- keyWords contains a list of elements in array
  sort keyWords
  repeat for each line l in keyWords
    put l & tab & wordCount[l] & return after displayResult
  end repeat
  put displayResult into field "result"
end mouseUp
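
The same array can also answer the second question from the first post: how often particular words occur in a text when some of them may not occur at all. The following is only a sketch under assumptions: the field names "textField", "searchWords" (one word per line) and "result" are hypothetical, and both the text and the search words are lowercased so that the lookup does not depend on case.

```livecode
-- Hypothetical handler: the field names are assumptions, not part of the MetaCard example.
on mouseUp
  local tText, tCount, tTarget, tReport
  put toLower(field "textField") into tText
  -- build the word-count array exactly as in the example above
  repeat for each word w in tText
    add 1 to tCount[w]
  end repeat
  -- look up each search word; a word that never occurred has no element,
  -- so adding 0 turns its empty value into the number 0
  repeat for each line tLine in field "searchWords"
    if tLine is empty then next repeat
    put toLower(tLine) into tTarget
    put tTarget & tab & (tCount[tTarget] + 0) & return after tReport
  end repeat
  put tReport into field "result"
end mouseUp
```

Because the text is scanned only once, this stays fast even when the list of search words is long.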

deeverd · Post by **deeverd** » Fri Aug 03, 2007 9:45 pm

Hello Studio Forum,

Just in case there are any newbies out there who are as wet behind the ears as myself when it comes to programming, I thought I'd better offer a correction to an earlier statement I made that included some sample script, lest you/they become as confused as I do at this early stage of my programming career:

In replying to Mark, I mentioned that the script I listed seemed to work OK. Actually, it doesn't. Below is a sample of script that really does do the job of extracting all the unique words from any text:

on mouseUp
  replace CR with space in field "textField"
  replace space with comma in field "textField"
  put field "textField" into myFrequencyList
  split myFrequencyList by comma and space
  combine myFrequencyList by comma and CR
  put myFrequencyList into field "frequencyList" 
  replace comma with empty in field "frequencyList"
end mouseUp

What this code does is to first create one big string out of any text field (field "textField" in this case), no matter what size, separated by commas, so that it can be turned into an array. After becoming an array, it then puts the unique words into the field of my choice, which in my example has been named field "frequencyList." Finally, I get rid of all the commas that get put into field "frequencyList" so I am left with a list that has every unique word, one per line, without any words being repeated.
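
A variant sketch of the same idea that works on a copy of the text, so the original field is left unchanged. It assumes the same field names as above; the variable names are illustrative. Instead of inserting commas, it puts each word on its own line, lets split turn the lines into array keys (which are unique by nature), and reads the keys back out.

```livecode
on mouseUp
  local tWords, tUnique
  put field "textField" into tWords -- work on a copy, not on the field
  -- put every word on its own line
  replace space with return in tWords
  replace tab with return in tWords
  -- each line becomes a key, so duplicates collapse into one element
  split tWords by return and tab
  put the keys of tWords into tUnique
  filter tUnique without empty -- drop blank lines left by repeated spaces
  sort lines of tUnique
  put tUnique into field "frequencyList"
end mouseUp
```

Sorting is optional, but the keys of an array come back in no particular order, so it makes the list easier to read.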

Anyway, it actually seems to work pretty well, although I don't know how elegant it is, but even this progress thus far would not have been accomplished so quickly without the help of Mark. Thank you.

Now hopefully, between the advice left by both Mark and FourthWorld, I'll be able to use their suggestions to figure out how to get this script to afterwards count how many times each word in the list is actually found in the original text. I'm still struggling with that part, but at least I'm halfway there.

I hope this helps someone who is attempting to do the same thing. All the best, deeverd

Mark · Post by **Mark** » Fri Aug 03, 2007 10:03 pm

Hi Deeverd,

In my opinion, using split/combine is a neat trick to get all unique words from a text. As long as you are not dealing with an enormous amount of data, it should work fine and I would even say it is elegant.

If counting the words is your main purpose, Richard's (4thWorld's) script is fast and effective.

Best,

Mark

LiveCode Forums.
