Just One Array Intersect Question

deeverd · Post by **deeverd** » Fri Jan 18, 2008 5:51 pm

Hello again,

In my quest to come up with the fastest way to get a numbers-only return when matching the contents of a text file against a database list, I am now convinced that using the intersect command is the quickest way. However...

Here's the problem:

I first turn my text file into an array, and then I turn my database list into an array, and then I do the following:

Code: Select all

intersect textArray with databaseArray.

All this works fine, but so far all I can get a return on is the number of keys in the textArray that match with the databaseArray. What I really need to get is the amount of items/contents that remain inside each of those keys that are inside the keys of the new intersected array.

According to the Rev documentation, the contents of my keys that match should remain unchanged from the original, but no matter how many hundreds of times I tried, I can't find a way to get a return on the number of the content of items that are shelved inside those keys, or to put the number of the items found inside each of those keys into some counter field. I know there has to be an easy way, but so far it's done a great job of eluding me.
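For reference, here is a minimal sketch of the behaviour the documentation describes, using made-up sample data (the names tText and tList and their contents are hypothetical, not from my real script). The intersect command should keep only the elements of the first array whose keys also exist in the second array, and leave the contents of the kept elements alone:

```livecode
on mouseUp
   local tText, tList
   -- hypothetical sample data: key = word, element = some content
   put 3 into tText["shipwreck"]
   put 1 into tText["parrot"]
   put 7 into tText["the"]
   -- the second array only supplies keys; its elements are not used
   put true into tList["shipwreck"]
   put true into tList["parrot"]
   put true into tList["island"]
   intersect tText with tList
   -- tText should now hold only "shipwreck" and "parrot",
   -- with their original elements (3 and 1) unchanged
   put (the number of lines of the keys of tText) & comma & tText["shipwreck"]
end mouseUp
```

So if the elements come back empty after an intersect, it is worth checking whether they held anything before the intersect.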

Help! Thanks, deeverd

Mark · Post by **Mark** » Fri Jan 18, 2008 7:41 pm

Hi Deeverd,

I don't know what you are trying to do, but maybe this helps:

Code: Select all

put the number of lines of the keys of mySomeArray

Best,

Mark

deeverd · Post by **deeverd** » Fri Jan 18, 2008 8:24 pm

Hi Mark,

Thanks, but that's exactly the script I tried, plus more than a hundred variations of it, including "items," "words" and anything else I could think of.

Code: Select all

put the number of lines of the keys of myTextArray

All this script does is return the number of keys. It does not return the amount of content found inside each of those keys.

Currently, my workaround solution to this problem is to make a list of the words that intersected, and then put them into a repeat loop that looks at a copy of the text, and then replaces each of those intersect words in the text with empty. I then just compare the beginning word count of the original text to the word count after replacing those matched words with empty to come up with a number that tells me how many total matches were actually made. So I have a way that works pretty fast, but I was certainly hoping for a way to count all the contents of a key after intersecting it.

Thanks for trying. All the best, deeverd

Mark · Post by **Mark** » Fri Jan 18, 2008 11:45 pm

Deeverd,

Maybe this?

Code: Select all

put the number of words of the keys of myArray into myNrOfWords
combine myArray by return and tab -- or other delimiters
put (the number of words of myArray) - myNrOfWords into myNrOfWords
split myArray by return and tab

At the end of this little code snippet, the variable myNrOfWords contains the total number of words of all elements of the array.

If the data contain tabs or returns, you need to use different delimiters (I often use numToChar(4) and numToChar(5), for instance).

Best,

Mark

deeverd · Post by **deeverd** » Tue Jan 22, 2008 5:21 pm

Hi Mark,

Thanks heaps. I will most certainly give your script a try. I've been away from the internet for the past few days, so my apologies for not responding sooner.

The workaround "solution" I mentioned last week worked quite fast, but it was only today that I realized the obvious glitch in it. While the intersect does indeed create a superfast list of matching text (without telling me how many items that each of those keys contain... yet), only this morning it dawned on me that using the list of matching text in a repeat loop to "replace myMatchText with empty in myArrayVar" only succeeded in replacing parts of some of the words.

For instance, if a word like "rainbow" was in the matchingText, but the plural of that same word, "rainbows," was in the manuscriptText, it would replace "rainbow" with empty but leave the "s." This meant that when I did my before and after word count of the manuscript to get a numbers-only return of matches, "s" was still counted as a word, and so my count was definitely not accurate. So this is what it's like to be a programmer?! I guess I could probably try setting the match whole text to true, and see if that works, but if your script works, it would definitely be a lot faster and more efficient.
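One way to sidestep the partial-replacement problem, sketched here under my own assumptions rather than taken from anyone's script, is to stop replacing text and instead compare whole words. The function name countWholeWordMatches and its parameters are hypothetical; pMatchList is assumed to hold one match word per line:

```livecode
function countWholeWordMatches pText, pMatchList
   local tTotal
   put 0 into tTotal
   repeat for each word tWord in pText
      -- "is among the lines of" compares against whole lines,
      -- so "rainbows" does not match a line containing "rainbow"
      if tWord is among the lines of pMatchList then add 1 to tTotal
   end repeat
   return tTotal
end countWholeWordMatches
```

This scans the match list once per manuscript word, so it is likely to be slower than an array lookup on a big list, but the count is not thrown off by leftover fragments such as "s."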

Thanks again. Looking forward to telling you how it went. Cheers, deeverd

deeverd · Post by **deeverd** » Tue Jan 22, 2008 7:00 pm

Hi Mark,

I'm still only getting a return of the number of keys, but I'm not getting a return of the contents inside them.

I've put together a scaled down script of my code, so you can see what I'm basically doing, but it's still not working:

Code: Select all

on mouseUp
  
  # Just showing the user it's busy:  
  set the cursor to watch
  
  # This opens up the text file that will later be compared to the database list
  answer file "Select a text file for input:" 
  if it is empty then exit mouseUp 

  put it into textFile 
  open file textFile for read 
  read from file textFile until eof 
  put it into myTextArray 
  close file textFile 
  
  # This is my total word count of the manuscript text: 
  put the number of words of myTextArray into field "beforeCounter"
  
  # Now I'm formatting the manuscript into an array 
  replace space with comma in myTextArray
  replace comma with tab & return in myTextArray
  sort myTextArray
  put myTextArray into field "wordList"
  split myTextArray by return and tab
  
  # This opens up the database list --  
  answer file "Select a database list for comparison:" 
  if it is empty then exit mouseUp 
  
  put it into dbFile 
  open file dbFile for read 
  read from file dbFile until eof 
  put it into myDatabaseArray 
  close file dbFile 
  
  # Formatting the list to become an array:  
  replace space with comma in myDatabaseArray
  replace comma with tab & return in myDatabaseArray
  split myDatabaseArray by return and tab
  
  #Intersect takes place here:
  intersect myTextArray with myDatabaseArray
  
  # This is where I'm trying to count the contents inside the keys
  # that are leftover after the intersect takes place.
  put the number of words of the keys of myTextArray into myNrOfWords
  put myNrOfWords into field "totalMatches"
  
end mouseUp

The return I receive in field "totalMatches" is still just the number of keys.
According to the documentation, the intersect command shouldn't be causing a problem or deleting anything out of the keys, but it must.

Any suggestions?
All the best, deeverd

Mark · Post by **Mark** » Tue Jan 22, 2008 7:06 pm

Hi Deeverd,

Have you actually tried the example I gave earlier? It should work.

Best,

Mark

deeverd · Post by **deeverd** » Tue Jan 22, 2008 7:32 pm

Hi Mark,

I tried it right away with my original code with no success, but haven't tried it with the reduced code I posted as an example. There's always the chance that something I had in the bigger script got in the way. So I'll give it a try in a couple hours when I can get back to my computer with my Revolution program on it. I'll let you know right away what I find.

Thanks, deeverd

Mark · Post by **Mark** » Tue Jan 22, 2008 8:09 pm

Deeverd,

You should not try your original code! You should try my code. Take my example and apply your own data to it. That's all.

Best,

Mark

deeverd · Post by **deeverd** » Tue Jan 22, 2008 8:10 pm

Hi Mark,

I was able to hurry up and get to my computer at the start of lunch. I just tried the script you sent and did it exactly how you said, which I had honestly tried verbatim before. Here's the basic code I'm using, pasted here so that it can easily be cut and pasted elsewhere and run right away:

Code: Select all

on mouseUp 
  
  # Just showing the user it's busy: 
  set the cursor to watch 
  
  # This opens up the text file that will later be compared to the database list 
  answer file "Select a text file for input:" 
  if it is empty then exit mouseUp 
  
  put it into textFile 
  open file textFile for read 
  read from file textFile until eof 
  put it into myTextArray 
  close file textFile 
  
  # This is my total word count of the manuscript text: 
  put the number of words of myTextArray into field "beforeCounter" 
  
  # Now I'm formatting the manuscript into an array 
  replace space with comma in myTextArray 
  replace comma with tab & return in myTextArray 
  sort myTextArray 
  put myTextArray into field "wordList" 
  split myTextArray by return and tab 
  
  # This opens up the database list -- 
  answer file "Select a database list for comparison:" 
  if it is empty then exit mouseUp 
  
  put it into dbFile 
  open file dbFile for read 
  read from file dbFile until eof 
  put it into myDatabaseArray 
  close file dbFile 
  
  # Formatting the list to become an array: 
  replace space with comma in myDatabaseArray 
  replace comma with tab & return in myDatabaseArray 
  split myDatabaseArray by return and tab 
  
  #Intersect takes place here: 
  intersect myTextArray with myDatabaseArray 
  
  # This is where I'm trying to count the contents inside the keys 
  # that are leftover after the intersect takes place. 
  put the number of words of the keys of myTextArray into myNrOfWords 
  combine myTextArray by return and tab
  put (the number of words of myTextArray) - myNrOfWords into myNrOfWords
  split myTextArray by return and tab
  
  # I added these other test fields, just to get a visual of the numbers being returned:
  put myNrOfWords into field "totalMatches" 
  put myTextArray into field "field"
  put the number of words of field "field" into field "finalCount"
  
end mouseUp

When I use the script

Code: Select all

put (the number of words of myTextArray) - myNrOfWords into myNrOfWords

what it returns is 0 because it is still returning only the number of keys (but not the number of contents inside each of those keys) and subtracting that same number from itself.

I don't understand why it's not working because it looks as if it should work perfectly. I still think the intersect must somehow knock out the contents. I just don't know.

all the best, deeverd

Mark · Post by **Mark** » Wed Jan 23, 2008 12:32 am

Hi Deeverd,

This part of your script...

Code: Select all

  replace space with comma in myTextArray
  replace comma with tab & return in myTextArray
  sort myTextArray
  put myTextArray into field "wordList"
  split myTextArray by return and tab

is not completely wrong, but I would use the following alternative:

Code: Select all

  replace comma with space in myTextArray
  replace space with cr in myTextArray
  filter myTextArray without empty
  sort myTextArray
  put myTextArray into field "wordList"
  split myTextArray by return and tab

At the end of your script, you have:

Code: Select all

  put myTextArray into field "field"

Using my example, you have turned the variable myTextArray back into an array before you try to put it into a field. Don't split the variable if you have no need for an array and want to display its contents in a field. If you do need the array, split it, but combine it again before displaying the contents in a field.

Unfortunately, I STILL don't know what you want! From your first two messages, I understand you expect to get an array with keys and elements and you want to know the total number of words in the elements of the array:

"What I really need to get is the amount of items/contents that remain inside each of those keys that are inside the keys of the new intersected array."

Unfortunately, your approach results in one array with keys and empty elements! What exactly do you expect to see in the elements of the array? What data are you using exactly, and how and why would it result in an array with non-empty elements?
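To illustrate the point, here is a small sketch that runs the same formatting steps on a made-up three-word string (the variable tData and the sample text are hypothetical). Each line ends up as a word followed by a tab with nothing after it, so the split should produce keys with nothing in the elements, and a repeated word should collapse into a single key:

```livecode
on mouseUp
   local tData
   put "the cat the" into tData -- hypothetical sample text
   replace space with comma in tData
   replace comma with tab & return in tData
   -- tData is now: the<tab> / cat<tab> / the
   split tData by return and tab
   -- expect 2 keys ("the" and "cat"), each with an empty element
   put (the number of lines of the keys of tData) && "keys," && \
         "element of the = [" & tData["the"] & "]"
end mouseUp
```

That would explain why counting the words of the keys and counting the words of the combined array give the same number.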

Best,

Mark

Mark Smith · Post by **Mark Smith** » Wed Jan 23, 2008 1:19 am

Deeverd, can I just make sure I understand what you're trying to do?

This is what I think:

You have two lots of text, and you want to know how many words are common to both lots of text.

Is that right?

Best,

Mark Smith

Mark · Post by **Mark** » Wed Jan 23, 2008 1:26 am

Hi Mark,

I think that's right, but I wonder why Deeverd chose the approach he is using.

Mark

Mark Smith · Post by **Mark Smith** » Wed Jan 23, 2008 1:37 am

On second thoughts, are you after the number of common words, and also the number of words unique to each text? If so, then I've modified your script:

deeverd wrote:

Code: Select all

on mouseUp 
  
  # Just showing the user it's busy: 
  set the cursor to watch 
  
  # This opens up the text file that will later be compared to the database list 
  answer file "Select a text file for input:" 
  if it is empty then exit mouseUp 
  
  put it into textFile 
  open file textFile for read 
  read from file textFile until eof 
  put it into myTextArray 
  close file textFile 
  
  # This is my total word count of the manuscript text: 
  put the number of words of myTextArray into field "beforeCounter" 
  
  # Now I'm formatting the manuscript into an array 
  replace space with comma in myTextArray 
  replace comma with tab & return in myTextArray 
  sort myTextArray 
  put myTextArray into field "wordList" 
  split myTextArray by return and tab 

  -- here, duplicate myTextArray so you have a copy that won't be
  --changed by the intersect
  put myTextArray into origTextArray
  
  # This opens up the database list -- 
  answer file "Select a database list for comparison:" 
  if it is empty then exit mouseUp 
  
  put it into dbFile 
  open file dbFile for read 
  read from file dbFile until eof 
  put it into myDatabaseArray 
  close file dbFile 
  
  # Formatting the list to become an array: 
  replace space with comma in myDatabaseArray 
  replace comma with tab & return in myDatabaseArray 
  split myDatabaseArray by return and tab 
  
  #Intersect takes place here: 
  intersect myTextArray with myDatabaseArray 
  
  -- at this point, the keys of myTextArray are the common words, so:

  put the number of lines in the keys of myTextArray into numCommon
  put (the number of lines in the keys of origTextArray) - numCommon into numUniqueText
  put (the number of lines in the keys of myDatabaseArray) - numCommon into numUniqueDataBase


 ....

Hope I've understood

Best,

Mark Smith

deeverd · Post by **deeverd** » Wed Jan 23, 2008 4:50 pm

Hello Both Marks,

Thanks big time. I know I'm on the forum quite a bit (only, however, after hours of fruitless struggling beforehand), but it's still quite humbling each time to receive the help and consideration of strangers who are experts in programming in other parts of the world. On the other hand, there's nothing else like it...

Now to answer some of your questions. I have cut and pasted each of your scripts and tried them more than once to make sure I know exactly what they're doing. I'm now getting some interesting returns on figures that I wasn't really after, but they may prove helpful nonetheless, because I hadn't thought about those returns before. For instance, your script now returns the number of keys that are unique to the text array and the number that are unique to the database array.

Here's what I'm actually trying to get a return on:

Let's say you took any book that was in text format. Let's say that book was "Robinson Crusoe." Now let's say you have a database that contains a big list of island words as an example. If the database contained the word "shipwreck" and a match was found in the manuscript, I need to know not only how many unique words from the manuscript matched words in the database, but also how many times each of those match words occurred in the manuscript altogether.

I'm thinking that the problem so far has been that I mistakenly thought it was possible in one line of script to receive a return on all the words found in all the keys of an array. I now think I have to ask for each of those match words by name with brackets to get an actual return of the amount of contents that are found in each key. I'm not exactly sure how to call out those contents or the number of their contents, but I'm thinking it would be something like

Code: Select all

put myTextArray[myWord] after field "matchResults"

and then I could quickly get rid of the delimiters and receive a return on the number of words that were placed in the field "matchResults."

I'm thinking that if I use a repeat loop with a counter, I could put each of the match words from the intersect into the variable myWord, one at a time, to get the contents out of each key of myTextArray. There's a big possibility that by using your idea to create a copy of the manuscript array before the intersect, I can get the keys out of the copy.
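Here is a rough sketch of how that might look if the manuscript array stored a count in each element before the intersect. This is only one possible approach under my own assumptions, not something from either Mark's scripts: the function name matchCounts and the parameters pText (the manuscript) and pWordList (the database words) are hypothetical.

```livecode
function matchCounts pText, pWordList
   local tCount, tLookup, tTotal
   -- key = word from the manuscript, element = how many times it occurs
   repeat for each word tWord in pText
      add 1 to tCount[tWord]
   end repeat
   -- key = word from the database list; the element value is not used
   repeat for each word tWord in pWordList
      put true into tLookup[tWord]
   end repeat
   -- keep only the counts whose keys also appear in the database list
   intersect tCount with tLookup
   put 0 into tTotal
   repeat for each line tWord in the keys of tCount
      add tCount[tWord] to tTotal
   end repeat
   -- item 1: number of distinct matched words; item 2: total occurrences
   return (the number of lines of the keys of tCount) & comma & tTotal
end matchCounts
```

Calling it might look like `put matchCounts(myText, myDatabaseList) into field "totalMatches"`, where item 1 of the result would be the unique matches and item 2 the total occurrences. Note that the word chunk splits on whitespace, so punctuation stays attached ("shipwreck," would not match "shipwreck") unless it is stripped first.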

I'll let everyone know as soon as I can give it a try, which will be later in the day.

Oh yeah, there was the question of why I am using this approach. I've only been programming with Revolution for 11 months now (with no prior programming experience), so that probably answers a lot of that question right there. However, I spend more hours a week programming than anyone I know and have built scores of programs, so I've gotten a lot of experience in that limited time. As for this approach for this program, I've been able to successfully accomplish what I'm after in various other ways, but they took way too long to get the results (sometimes up to 5 minutes with small manuscripts of fewer than 3,000 words). Until I discovered the "intersect" command, it was way too slow to be practical in a big program that does lots of other things.

Anyway, I hope that all sheds some light on the madness of my method. Cheers, deeverd