Occurrences

ivanw · Post by **ivanw** » Tue Jul 07, 2009 7:33 pm

I'm trying to find the max number of occurrences of each keyword in a list of keywords. So far I've been thinking of looping over each line and then checking whether that line is contained within each of the other lines in the list.

e.g.
fox
brown fox
quick brown fox
quick brown fox

after calculations would be:

fox 3
brown fox 2
quick brown fox 2

Would this be possible with arrays? Any other suggestions would be greatly appreciated.

Thanks,
Ivan

Klaus · Post by **Klaus** » Thu Jul 09, 2009 8:49 am

Hi Ivan,

yep, arrays are a good way to solve this!

Like this:

Code:

...
put fld "keywords" into tList
put empty into tArray
repeat for each line i in tList
  add 1 to tArray[i]
end repeat

## Build a new list with: Name of string TAB number of occurrences
put the keys of tArray into tKeys
put empty into list_of_occurrences
repeat for each line k in tKeys
  put k & TAB & tArray[k] & CR after list_of_occurrences
end repeat
delete char -1 of list_of_occurrences
### Do what you want with list_of_occurrences
...

Should be pretty fast ("repeat for each" is insanely fast!), even for looooong lists

Best

Klaus

SparkOut · Post by **SparkOut** » Thu Jul 09, 2009 11:40 am

But I'm not sure that's exactly what's wanted, is it? It counts the number of times a whole line is repeated, but not the number of lines that contain it as a substring.
I'm not really clear on the instructions but I thought that the list:
fox
brown fox
quick brown fox
quick brown fox

would contain the line "fox" four times (once in each line). The "brown fox" line appears once on its own and twice more in the subsequent lines. The "quick brown fox" line appears twice. So by my interpretation the results should be 4, 3, 2, rather than 3, 2, 2. So here's an amended version that will do what I thought it should, but I'm not at all certain that it's what is desired.

Code:

   put fld "keywords" into tList 
   put empty into tArray 
   repeat for each line i in tList 
      add 0 to tArray[i] --initialise the array with the right keys but don't count the lines yet
   end repeat 
   
   ## Build a new list with: Name of string TAB number of occurrences 
   put keys of tArray into tKeys 
   repeat for each line k in tKeys 
      repeat for each line i in tList
         if k is in i then
            add 1 to tArray[k]
         end if
      end repeat
      put k & TAB & tArray[k] & CR after list_of_occurrences 
   end repeat 
   delete char -1 of list_of_occurrences 
   ### Do what you want with list_of_occurrences
   -- or ignore it completely and just use the array keys and value to represent the count of each keyword/phrase

Oh, you also ought to do some whitespace trimming and error checking so that you don't get "duplicate" array keys created because there's a trailing space at the end of one of the lines, for example. And maybe sort the results list or the keys of the array that you'll be using to work with.
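
A rough sketch of how the trimming and sorting might be folded in, assuming the same substring-matching interpretation as above. The function name `countKeywordOccurrences` and its local variable names are made up for illustration, and it has not been tested against a real keyword list:

```livecode
-- Sketch only: countKeywordOccurrences is a hypothetical name, not a built-in
function countKeywordOccurrences pList
   local tClean, tArray, tResult, tKey, tLine

   -- Trim leading/trailing whitespace from each line and skip blank lines,
   -- so "fox" and "fox " do not end up as two separate array keys
   repeat for each line tLine in pList
      put word 1 to -1 of tLine into tKey
      if tKey is empty then next repeat
      put tKey & CR after tClean
      put 0 into tArray[tKey]
   end repeat
   delete char -1 of tClean

   -- For each unique keyword, count the lines that contain it as a substring
   repeat for each key tKey in tArray
      repeat for each line tLine in tClean
         if tKey is in tLine then add 1 to tArray[tKey]
      end repeat
      put tKey & TAB & tArray[tKey] & CR after tResult
   end repeat
   delete char -1 of tResult

   -- Sort the "keyword TAB count" lines by count, highest first
   set the itemDelimiter to TAB
   sort lines of tResult descending numeric by item 2 of each
   return tResult
end function
```

It could be called with something like `put countKeywordOccurrences(fld "keywords") into tResults`. For the four-line example list that should give fox 4, brown fox 3, quick brown fox 2. Bear in mind that "is in" is a plain substring test, so "fox" would also be counted inside a line such as "foxes".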

Klaus · Post by **Klaus** » Thu Jul 09, 2009 12:18 pm

Oh yes, after reading this again it looks like you are correct, SparkOut.

Sorry Ivan, take SparkOut's solution.

ivanw · Post by **ivanw** » Fri Jul 10, 2009 5:19 am

Many thanks Klaus & SparkOut

Indeed there was a typo in my original example - I'll test this solution and let you know how it goes.