Count # of Unique Two-Word Phrases (Strings/Items)


[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by [-hh] » Sun Apr 10, 2016 12:11 am

... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:09 pm, edited 1 time in total.
shiftLock happens

theotherbassist
Posts: 115
Joined: Thu Mar 06, 2014 9:29 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by theotherbassist » Mon Apr 11, 2016 12:04 pm

Here's the code I ultimately implemented. I didn't get to see your solutions until now as I was "unplugged" from the web for a bit. Maybe you guys can tell me how I did.

This is for 4-word "keywords," but you can change how many words the function treats as a keyword by changing this line accordingly:

Code: Select all

put trueWord m of pText & space & trueWord (m + 1) of pText & space & trueWord (m + 2) of pText & space & trueWord (m + 3) of pText & comma & comma after twoWordList
Here's everything:

Code: Select all

global twoWordList

on mouseUp
   makeWordList field "list"
   createWordFrequencyAnalysis twoWordList
   replace ",," with "," in fld "freqlist" -- collapse the double-comma delimiter for display
   put empty into twoWordList
end mouseUp

on makeWordList pTexts
   get make2WordList(pTexts) -- fills the global twoWordList as a side effect
   put it into fld "freqlist" -- make2WordList returns empty, so this just clears the field
end makeWordList

function twoWordLineSplit pText
   -- appends every overlapping 4-word phrase of pText to the global twoWordList,
   -- using a double comma (",,") as the phrase separator
   put 1 into m
   repeat until trueWord (m + 3) of pText is empty
      put trueWord m of pText & space & trueWord (m + 1) of pText & space & \
            trueWord (m + 2) of pText & space & trueWord (m + 3) of pText & \
            comma & comma after twoWordList
      add 1 to m
   end repeat
end twoWordLineSplit

function make2WordList pText
   put 1 into L
   put empty into twoWordList
   repeat until line L of pText is empty
      get twoWordLineSplit(line L of pText)
      add 2 to L -- note: advances two lines at a time, so only every other line is split
   end repeat
end make2WordList

on createWordFrequencyAnalysis pTexts
   get listFrequencies(pTexts)
   put it into fld "freqlist"
end createWordFrequencyAnalysis

function listFrequencies pText
   local tArray, tList
   set the itemDelimiter to ",," -- only affects this handler
   repeat for each item tWord in pText
      add 1 to tArray[tWord] -- count each 4-word phrase
   end repeat
   repeat for each key tKey in tArray
      if length(tKey) < 8 then next repeat -- skip very short phrases
      put tKey & comma & comma & tArray[tKey] & cr after tList -- phrase,,count
   end repeat
   sort tList descending numeric by item -1 of each -- item -1 is the count
   return line 1 to -1 of tList -- drop the trailing return
end listFrequencies
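
Since the phrase length is hard-coded in four places, here's a sketch of how the splitter could take the length as a parameter instead. The helper name nWordLineSplit is made up for illustration, and it returns the phrases rather than writing to the global:

Code: Select all

function nWordLineSplit pText, pN
   -- sketch only, not part of the stack above: pN is the phrase length
   local tPhrases, tPhrase, m, i
   put 1 into m
   repeat until trueWord (m + pN - 1) of pText is empty
      put empty into tPhrase
      repeat with i = 0 to pN - 1
         put trueWord (m + i) of pText & space after tPhrase
      end repeat
      put char 1 to -2 of tPhrase & comma & comma after tPhrases -- trim trailing space
      add 1 to m
   end repeat
   return tPhrases
end nWordLineSplit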

[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by [-hh] » Tue Apr 12, 2016 1:27 am

You were asking for speed, so we are interested in the fastest solution.
Hopefully others will chime in with their own solutions and timing comparisons.

Attached is a text with 122204 truewords, 6538 of them unique, for testing (Jane Austen: Pride and Prejudice). These are my results. Times are averages from LC 7.1.4-rc1 / LC 8.0.0-dp16. (Yes, LC 8 really is faster! Try it yourself.)

Unique words: 400 ms / 300 ms
Total truewords: 122204 / Diff. truewords: 6538

2-word phrases: 2300 ms / 1900 ms
Total 2-word phrases: 122203 / Diff. 2-word phrases: 54998

3-word-phrases: 3800 ms / 3100 ms
Total 3-word phrases: 122202 / Diff. 3-word phrases: 103295

4-word phrases: 4700 ms / 3900 ms
Total 4-word phrases: 122201 / Diff. 4-word phrases: 118713

Here is my test scenario. [It's all included in the attached test stack.]
You need a field "Input" for the text, a field (which may be a list field) "OUTPUT" for the result, and a button with the following script. [You can choose word phrases of length 1, 2, 3 or 4.]

Code: Select all

on mouseUp
  put the millisecs into m1 --> start timing
  lock screen; lock messages
  set cursor to watch
  #-- start prepare INPUT [fld "Input" is at about 65 KByte]
  put fld "INPUT" into myInput -- or any other variable
  put 0 into m ## --> each increase of m by one doubles the input <-- ##
  repeat m --> 2^m copies for testing
    put cr & myInput after myInput
  end repeat
  #-- end prepare INPUT
  put the label of btn "phraseLength" into k0
  if k0 = 1 then -- unique truewords
    repeat for each trueword w in myInput
      add 1 to countPairs[w]
    end repeat
    put the number of truewords of myInput into j
  else
    #-- =============== Essential part START: INPUT is variable myInput
    #-- create truewords array
    put 0 into j
    repeat for each trueword w in myInput
      add 1 to j; put w into myWords[j]
    end repeat
    #-- j is now the number of truewords in myInput
    #-- create phrases and count their frequency
    switch k0
      case 2 -- two word phrases
        repeat for each key k in myWords
          #-- Now don't write "exit repeat". The keys are usually not sorted!!
          if k = j then next repeat 
          add 1 to countPairs[myWords[k] & space & myWords[k+1]]
        end repeat
        break
      case 3 --three word phrases
        repeat for each key k in myWords
          if k >= j-1 then next repeat 
          add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                & space & myWords[k+2]]
        end repeat
        break
      case 4 --four word phrases
        repeat for each key k in myWords
          if k >= j-2 then next repeat 
          add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                & space & myWords[k+2] & space & myWords[k+3]]
        end repeat
        break
    end switch
  end if
  #-- "combine countPairs by cr and comma" is a little bit faster
  #-- but we want the frequency as first item for better oversight.
  repeat for each key k in countPairs
    put cr & countPairs[k] & comma& k after myOutput
  end repeat
  delete char 1 of myOutput -- is cr
  #-- =============== Essential part END : OUTPUT is variable myOutput
  #-- start prepare OUTPUT
  sort myOutput -- cosort by word pairs
  sort myOutput descending numeric by item 1 of each
  put the num of lines of myOutput into diffPairs
  put "Total " &k0& "-word phrases: " & (j-k0+1) &cr& "Diff " &k0& \
        "-word phrases: " & diffPairs & cr & myOutput into fld "OUTPUT"
  #-- end prepare OUTPUT
  unlock screen; unlock messages
  put (the millisecs - m1) &" ms" into fld "Timing" --> end timing
end mouseUp
### [-hh fecit, April 2016]
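By the way, the "combine" variant mentioned in the comment near the end of the script would look roughly like this (just a sketch; the phrase then comes first and the count second, so the sort switches to item 2):

Code: Select all

combine countPairs using cr and comma -- countPairs is now text: phrase,count
put countPairs into myOutput
sort myOutput -- presort phrases alphabetically
sort myOutput descending numeric by item 2 of each -- the count is now item 2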
The main script above is exactly my earlier routine, with one 'horrible' error removed.
Let me explain this error, because we are in the beginners forum.


I wrote

Code: Select all

-- j is the maximal value of k
repeat for each key k in myWords
  if k = j then exit repeat #<-- WRONG
  add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
This is logically wrong and may end the repeat too early, because the keys of an array are usually not in sorted order.
So it should read as follows (as it now does, correctly, above):

Code: Select all

-- j is the maximal value of k
repeat for each key k in myWords
  if k = j then next repeat #<-- CORRECT
  add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
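
If you want to see the unsorted key order for yourself, paste this little demo into a button:

Code: Select all

on mouseUp
   -- demo: key order is storage order, not numeric order
   local tA, tKeyOrder
   repeat with i = 1 to 5
      put "x" into tA[i]
   end repeat
   repeat for each key k in tA
      put k & comma after tKeyOrder
   end repeat
   put tKeyOrder -- may show e.g. "2,5,1,4,3," -- any order is possible
end mouseUp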
Attachments
truewordsPhrases.livecode.zip
My TestStack
(2.42 KiB) Downloaded 246 times
prideAndPrejudice.txt.zip
The Input.
(247.46 KiB) Downloaded 225 times
shiftLock happens

theotherbassist
Posts: 115
Joined: Thu Mar 06, 2014 9:29 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by theotherbassist » Tue Apr 12, 2016 12:34 pm

hh, really helpful. thanks. Honestly, it's quick enough for my purposes with either version. I'm sort of waiting on the "official" release of 8 to switch over--but that reservation is more just out of habit than anything else. Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.

[-hh]
VIP Livecode Opensource Backer
Posts: 2262
Joined: Thu Feb 28, 2013 11:52 pm

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by [-hh] » Tue Apr 12, 2016 2:06 pm

I would go even one step further:
Download the files, read them (or individual ones, depending on size) into a variable, and THEN pull the nodes from that variable. It may be worth testing, anyway.
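
Sketched with the revXML library, something like this (the node path is only an example; real feeds differ):

Code: Select all

-- sketch: read the downloaded file once, then work entirely in memory
put URL ("binfile:" & tFile) into tXML
put revXMLCreateTree(tXML, false, true, false) into tTreeID
put revXMLNodeContents(tTreeID, "rss/channel/item[1]/title") into tTitle
revXMLDeleteTree tTreeID -- free the tree when done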
shiftLock happens

FourthWorld
VIP Livecode Opensource Backer
Posts: 10052
Joined: Sat Apr 08, 2006 7:05 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by FourthWorld » Tue Apr 12, 2016 3:04 pm

theotherbassist wrote:hh, really helpful. thanks. Honestly, it's quick enough for my purposes with either version. I'm sort of waiting on the "official" release of 8 to switch over--but that reservation is more just out of habit than anything else. Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
The difference between processing as you download and downloading first to process later is merely the time spent on disk I/O, likely just a couple of milliseconds at most.

But the advantage of downloading first is that you have the opportunity to process a second time if needed without having to download again. The download is the longest part.

In LiveNet's RSS viewer (in the LC IDE see Plugins->GoLiveNet, or in older versions GoRevNet) I wound up breaking the process up into three steps:

1. Download the feeds to a local cache
This gives me the freedom to use different schedules for different feeds. For example, the feed for this forum and the use-livecode discussion list update every few minutes, but I only bother downloading the feed for the LC blog once a day since it rarely changes more than that. This not only saves my processor time, but is kinder to the servers providing the feeds. For all the over-design of RSS, I do try to honor the update frequency.
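
A rough sketch of per-feed schedules (URLs and intervals are placeholders, not LiveNet's actual code):

Code: Select all

on refreshFastFeed
   load URL "http://example.com/fastfeed.xml" with message "feedDownloaded"
   send "refreshFastFeed" to me in 5 * 60 seconds -- every few minutes
end refreshFastFeed

on refreshSlowFeed
   load URL "http://example.com/slowfeed.xml" with message "feedDownloaded"
   send "refreshSlowFeed" to me in 24 * 60 * 60 seconds -- once a day
end refreshSlowFeed

on feedDownloaded pURL, pStatus
   if pStatus is "cached" then -- download finished and is in the cache
      put URL pURL into tXML -- served locally, no second request
      -- ... parse tXML and update the local cache ...
      unload URL pURL
   end if
end feedDownloaded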

2. Parse into common elements
In the olden days RSS was "Really Simple Syndication", and it was true to its name, a joy to work with. Then that acronym was co-opted by the RDF crew to serve double duty as their "RDF Site Summary", which is not only needlessly complicated but alters the original focus from syndication to summarization. So not only are RSS feeds today radically different in structure (replete with evidence that few authors of RSS generators are able to read specs - "Hey guys, ever heard of CDATA?"), but the format's mandate has been bifurcated as well.

All that is just to say that RSS is complicated to parse. But despite the best efforts of the RDF gang to pervert the format, and the laudable efforts of the Atom crew to correct those mistakes, in the end we find just enough similarity among the various RSS formats that the relevant parts can be extracted and put into a unified form.

I prefer using arrays to work with them, so I store them as LSON (LiveCode serialized arrays, such as you get with arrayEncode). LSON data is super fast to pull from disk and decode, and since the result is an array it's very efficient to work with for extraction and rendering.
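
In LiveCode terms that round trip looks something like this (the file name is a placeholder):

Code: Select all

-- sketch: serialize an array to disk and read it back
on saveFeedCache tFeeds -- tFeeds is an array
   put arrayEncode(tFeeds) into URL ("binfile:" & \
         specialFolderPath("documents") & "/feeds.lson")
end saveFeedCache

function loadFeedCache
   return arrayDecode(URL ("binfile:" & \
         specialFolderPath("documents") & "/feeds.lson"))
end loadFeedCache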

3. Render for display
Most RSS feeds generated today are made with the dismaying presumption that you'll be using a multi-megabytes-of-RAM browser engine to render them. If you use an embedded browser object you'll find many aspects of dealing with today's often-broken RSS feeds a bit easier.

But if you want a stripped version of the content more akin to RSS' original mandate, giving you more control over its display, a little extra work can turn the dizzying mix of plain text, misunderstood/misapplied CDATA, and HTML you'll find in modern feeds into LC's htmlText, which can be dropped into a field.

All that said, I broke the process into three steps in large part for the unique needs of LiveNet: I wanted the shortest time I could get between opening the card and displaying the aggregate feeds, and I wanted to minimize traffic on the RSS hosts as much as possible.

So in LiveNet, steps 1 and 2 above are done on a machine in my office, and the aggregated LSON file is uploaded in compressed form every few minutes to one of my public servers, where it's grabbed whenever the user opens LiveNet.

That's probably more steps than you'd want for a local aggregator, but I do think there's benefit in using separate download schedules for different feeds, to be both more efficient for the user and kinder to the RSS host.

Whether you parse it on download or after probably doesn't matter much, but I would suggest parsing the content into an array for more efficient handling. And if you use different download schedules for each feed, it may be useful to store each in its own array, serialized to disk with arrayEncode.

You could probably dispose of the original XML once you have it converted to an array, but a habit from my analytics work compels me to keep sources in their original form just in case I later decide to do some other parsing on them. Probably not needed with ephemeral data like RSS, but sometimes old habits are hard to break. :)
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

theotherbassist
Posts: 115
Joined: Thu Mar 06, 2014 9:29 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by theotherbassist » Tue Apr 12, 2016 9:43 pm

So it's as easy as 1-2-3 then. Hah. I appreciate the detailed response--I feel a bit more confident knowing it's *kind of* a matter of taste and not necessarily better one way vs. the other when parsing limited nodes.

I'm really focusing on titles, and specifically headline news titles. I spend far too much time every day rifling through duplicate titles via Feedly/Nextgen Reader thanks to overlap and widespread use of AP by the press. I'm trying to build an app that runs an algorithm that dismantles, say, 30 headline RSS feeds across the globe into common 1-, 2-, 3-, and 4-word phrases and then rebuilds totally new titles based on word order within common phrases and the average frequency. All article titles on a given subject will be broken down and recombined into one "average title" ultimately presented back to the user. Sometimes the title will end up being one that was already in use, sometimes it'll be slightly contorted--but the point is that unless there's radically different information on the subject that necessitates two separate subject listings, the gist will be contained in the singular new "average" title. At the very least, it reduces overlap.

Yeah, you can look at google/facebook trends for this sort of thing. But you can't tailor their list of sources to your liking. I don't want the average of everything on the web--I want the average of my own go-to news outlets.

p.s. I always remove CDATA references from the XML--at least when it comes to news, it seems like the better info is regularly nested in there.
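
(In case anyone wonders, the removal itself can be as blunt as stripping the wrappers before parsing--a sketch:)

Code: Select all

-- sketch: strip CDATA wrappers so the nested markup is visible to the parser
replace "<![CDATA[" with empty in tXML
replace "]]>" with empty in tXML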

FourthWorld
VIP Livecode Opensource Backer
Posts: 10052
Joined: Sat Apr 08, 2006 7:05 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by FourthWorld » Tue Apr 12, 2016 10:00 pm

theotherbassist wrote:So it's as easy as 1-2-3 then. Hah. I appreciate the detailed response--I feel a bit more confident knowing it's *kind of* a matter of taste and not necessarily better one way vs. the other when parsing limited nodes.

I'm really focusing on titles, and specifically headline news titles. I spend far too much time every day rifling through duplicate titles via Feedly/Nextgen Reader thanks to overlap and widespread use of AP by the press. I'm trying to build an app that runs an algorithm that dismantles, say, 30 headline RSS feeds across the globe into common 1-, 2-, 3-, and 4-word phrases and then rebuilds totally new titles based on word order within common phrases and the average frequency. All article titles on a given subject will be broken down and recombined into one "average title" ultimately presented back to the user. Sometimes the title will end up being one that was already in use, sometimes it'll be slightly contorted--but the point is that unless there's radically different information on the subject that necessitates two separate subject listings, the gist will be contained in the singular new "average" title. At the very least, it reduces overlap.
That sounds like a fun project. You might consider writing a guest blog post for livecode.com about it when you've got it running. Drop a note to heather AT livecode.com - they love good development stories like that.
theotherbassist wrote:Yeah, you can look at google/facebook trends for this sort of thing. But you can't tailor their list of sources to your liking. I don't want the average of everything on the web--I want the average of my own go-to news outlets.
Amen, brother. I'm a bit of a news junkie myself. I made the LiveNet aggregator as part of a larger exploration I hope to find time to get back to soon, with a similar focus on weeding out replicated sources and finding the truly good stuff I'm most interested in.
theotherbassist wrote:p.s. I always remove CDATA references from the XML--at least when it comes to news, it seems like the better info is regularly nested in there.
Ah, if only some of those aggregators even knew what CDATA is. I've seen CDATA tags around plain text, XML, HTML, escaped HTML, and every strange mix you can imagine. I don't know how other parser writers handle it. Ah, the many times I've cursed at the feed generators I've had to work with. :) My least favorite are the corporate ones that go so far in their RSS abuse as to embed JavaScript. Argh!
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

theotherbassist
Posts: 115
Joined: Thu Mar 06, 2014 9:29 am

Re: Count # of Unique Two-Word Phrases (Strings/Items)

Post by theotherbassist » Wed Apr 20, 2016 12:12 am

theotherbassist wrote:Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
For the record, I was making a big mistake without knowing it.

I was putting XML node contents directly into lines in a field--don't do this. Each repeat iteration gets slower the further down the list (field) the target line is, because the engine has to scan from the top of the field to find that line and re-render after every write. That turns a simple linear pass into far more work than it needs to be. Now I'm putting XML node contents into an array (three lines in each element for each article--title/pubdate/link) and then using "combine using return" to put the array into the field (as UTF-8!) at the end.
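
The pattern, roughly (variable and field names here are placeholders):

Code: Select all

-- sketch: collect everything in an array, touch the field only once
-- ... put each article's title/pubdate/link block into tArticles[i] ...
combine tArticles using return
put tArticles into fld "headlines" -- one field write instead of one per article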

For a test of 10 feeds averaging 30 articles each, the direct-to-field version takes about (a rather frustrating) 1m 12s. The xml-to-array-to-list version takes about 3s.

Use arrays.
