Count # of Unique Two-Word Phrases (Strings/Items)
Re: Count # of Unique Two-Word Phrases (Strings/Items)
... was wrong, sorry ...
Last edited by [-hh] on Mon Apr 11, 2016 9:09 pm, edited 1 time in total.
shiftLock happens
Re: Count # of Unique Two-Word Phrases (Strings/Items)
Here's the code I ultimately implemented. I didn't get to see your solutions until now as I was "unplugged" from the web for a bit. Maybe you guys can tell me how I did.
This is for 4-word "keywords," but you can change the number of words considered a keyword by changing this line in the function accordingly:
Code: Select all
put trueword m of pText & space & trueword (m + 1) of pText & space & trueword (m + 2) of pText & space & trueword (m + 3) of pText & comma & comma after twoWordList
Here's everything:
Code: Select all
global twoWordList

on mouseUp
   makewordList field "list"
   createwordfrequencyanalysis twoWordList
   replace ",," with "," in fld "freqlist"
   put empty into twoWordList
end mouseUp

on makewordList pTexts
   get make2wordList(pTexts)
   put it into fld "freqlist"
end makewordList

function TwoWordLineSplit pText
   put 1 into m
   repeat until trueword (m + 3) of pText is empty
      put trueword m of pText & space & trueword (m + 1) of pText & space & trueword (m + 2) of pText & space & trueword (m + 3) of pText & comma & comma after twoWordList
      add 1 to m
   end repeat
end TwoWordLineSplit

function make2wordList pText
   put 1 into L
   put empty into twoWordList
   repeat until line L of pText is empty
      get TwoWordLineSplit(line L of pText)
      add 2 to L
   end repeat
end make2wordList

on createwordfrequencyanalysis pTexts
   get listFrequencies(pTexts)
   put it into fld "freqlist"
end createwordfrequencyanalysis

function listFrequencies pText
   local tArray, tList, tTotal
   set the itemdelimiter to ",,"
   repeat for each item tWord in pText
      add 1 to tArray[tWord] -- add 1 to the word count of each 4-word item
   end repeat
   repeat for each key tKey in tArray
      if length(tKey) < 8 then next repeat
      put tKey & comma & comma & tArray[tKey] & cr after tList -- put each unique 4-word item and its frequency into tList
   end repeat
   sort tList descending numeric by item -1 of each
   return line 1 to -1 of tList
   set the itemdelimiter to ","
end listFrequencies
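For example, a 2-word version of the split would shorten that line and the repeat condition to something like this (an untested sketch, just the same pattern with fewer truewords):
Code: Select all
   repeat until trueword (m + 1) of pText is empty
      put trueword m of pText & space & trueword (m + 1) of pText & comma & comma after twoWordList
      add 1 to m
   end repeat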
Re: Count # of Unique Two-Word Phrases (Strings/Items)
You were asking for speed, so we are interested in the fastest solution.
Hopefully others will chime in with their solutions and timing comparisons.
Attached is a text with 122204 truewords and 6538 unique truewords for testing (Jane Austen: Pride and Prejudice). These are my results. Times are averages from LC 7.1.4-rc1 / LC 8.0.0-dp16. (Yes, LC 8 is really faster! Try it yourself, please.)
Unique words: 400 ms / 300 ms
Total truewords: 122204 / Diff. truewords: 6538
2-word phrases: 2300 ms / 1900 ms
Total 2-word phrases: 122203 / Diff. 2-word phrases: 54998
3-word phrases: 3800 ms / 3100 ms
Total 3-word phrases: 122202 / Diff. 3-word phrases: 103295
4-word phrases: 4700 ms / 3900 ms
Total 4-word phrases: 122201 / Diff. 4-word phrases: 118713
Here is my test scenario. [It's all included in the attached teststack].
You need a field "Input" for the text and a field (it may be a list field) "OUTPUT" for the result, plus a button with the following script. [You can choose word phrases of length 1, 2, 3, 4.]
Code: Select all
on mouseUp
   put the millisecs into m1 --> start timing
   lock screen; lock messages
   set cursor to watch
   #-- start prepare INPUT [fld "Input" is at about 65 KByte]
   put fld "INPUT" into myInput -- or any other variable
   put 0 into m ## --> each increase of m by one doubles the input <-- ##
   repeat m --> 2^m copies for testing
      put cr & myInput after myInput
   end repeat
   #-- end prepare INPUT
   put the label of btn "phraseLength" into k0
   if k0 = 1 then -- unique truewords
      repeat for each trueword w in myInput
         add 1 to countPairs[w]
      end repeat
      put the number of truewords of myInput into j
   else
      #-- =============== Essential part START: INPUT is variable myInput
      #-- create truewords array
      put 0 into j
      repeat for each trueword w in myInput
         add 1 to j; put w into myWords[j]
      end repeat
      #-- j is now the number of truewords in myInput
      #-- create phrases and count their frequency
      switch k0
         case 2 -- two word phrases
            repeat for each key k in myWords
               #-- Now don't write "exit repeat". The keys are usually not sorted!!
               if k = j then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1]]
            end repeat
            break
         case 3 -- three word phrases
            repeat for each key k in myWords
               if k >= j-1 then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                     & space & myWords[k+2]]
            end repeat
            break
         case 4 -- four word phrases
            repeat for each key k in myWords
               if k >= j-2 then next repeat
               add 1 to countPairs[myWords[k] & space & myWords[k+1] \
                     & space & myWords[k+2] & space & myWords[k+3]]
            end repeat
            break
      end switch
   end if
   #-- "combine countPairs by cr and comma" is a little bit faster
   #-- but we want the frequency as first item for better oversight.
   repeat for each key k in countPairs
      put cr & countPairs[k] & comma & k after myOutput
   end repeat
   delete char 1 of myOutput -- is cr
   #-- =============== Essential part END : OUTPUT is variable myOutput
   #-- start prepare OUTPUT
   sort myOutput -- cosort by word pairs
   sort myOutput descending numeric by item 1 of each
   put the num of lines of myOutput into diffPairs
   put "Total " & k0 & "-word phrases: " & (j-k0+1) & cr & "Diff " & k0 & \
         "-word phrases: " & diffPairs & cr & myOutput into fld "OUTPUT"
   #-- end prepare OUTPUT
   unlock screen; unlock messages
   put (the millisecs - m1) & " ms" into fld "Timing" --> end timing
end mouseUp
### [-hh fecit, April 2016]
The above is exactly my earlier routine, with one 'horrible' error removed.
Let me explain this error, because we are in the beginner forum.
I wrote
Code: Select all
-- j is the maximal value of k
repeat for each key k in myWords
   if k = j then exit repeat #<-- WRONG
   add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
This is logically wrong and may "cut" the repeat too early, because the keys of an array are usually not sorted.
So this should read, as it is now correct above:
Code: Select all
-- j is the maximal value of k
repeat for each key k in myWords
   if k = j then next repeat #<-- CORRECT
   add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat
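Another way to sidestep the key-order question entirely is a numeric repeat, since the counter is then guaranteed to run in order (a minimal sketch with the same variables as above; I have not timed it against the "for each key" form):
Code: Select all
-- j is the number of truewords in myWords
repeat with k = 1 to j - 1
   add 1 to countPairs[myWords[k] & space & myWords[k+1]]
end repeat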
Attachments
- truewordsPhrases.livecode.zip (My TestStack, 2.42 KiB) Downloaded 246 times
- prideAndPrejudice.txt.zip (The Input, 247.46 KiB) Downloaded 225 times
shiftLock happens
Re: Count # of Unique Two-Word Phrases (Strings/Items)
hh, really helpful. thanks. Honestly, it's quick enough for my purposes with either version. I'm sort of waiting on the "official" release of 8 to switch over--but that reservation is more just out of habit than anything else. Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
Re: Count # of Unique Two-Word Phrases (Strings/Items)
I would go even one step further:
Download the files, read them (or single ones, depending on size) into a variable and THEN pull the nodes from that variable. It may be worth testing, anyway.
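A minimal sketch of that approach, assuming the revXML library is available and using a made-up feed URL and node path (real feeds will need their own paths):
Code: Select all
on mouseUp
   -- fetch once into a variable, parse once, then read as many nodes as needed
   put URL "http://www.example.com/feed.xml" into tXML -- hypothetical feed URL
   put revXMLCreateTree(tXML, false, true, false) into tTree -- build the tree from the variable
   put revXMLNodeContents(tTree, "rss/channel/item[1]/title") into tTitle -- example node path
   put revXMLNodeContents(tTree, "rss/channel/item[1]/link") into tLink
   revXMLDeleteTree tTree -- free the tree when done
end mouseUp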
shiftLock happens
Re: Count # of Unique Two-Word Phrases (Strings/Items)
theotherbassist wrote: Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
The difference between downloading and processing in one pass versus downloading first and processing later is merely the time spent on disk I/O, likely just a couple of milliseconds at most.
But the advantage of downloading first is that you have the opportunity to process a second time if needed without having to download again. The download is the longest part.
In LiveNet's RSS viewer (in the LC IDE see Plugins->GoLiveNet, or in older versions GoRevNet) I wound up breaking the process up into three steps:
1. Download the feeds to a local cache
This gives me the freedom to use different schedules for different feeds. For example, the feed for this forum and the use-livecode discussion list update every few minutes, but I only bother downloading the feed for the LC blog once a day since it rarely changes more than that. This not only saves my processor time, but is kinder to the servers providing the feeds. For all the over-design of RSS, I do try to honor the update frequency.
2. Parse into common elements
In the olden days RSS was "Really Simple Syndication", and it was true to its name, a joy to work with. Then that acronym was co-opted by the RDF crew to serve double duty as their "RDF Site Summary", which is not only needlessly complicated but also shifts the original focus from syndication to summarization. So not only are RSS feeds today radically different in structure (replete with evidence that few authors of RSS generators are able to read specs - "Hey guys, ever heard of CDATA?"), but the mandate of the format has been bifurcated as well.
All that is just to say that RSS is complicated to parse, but despite the best efforts of the RDF gang to pervert the format, and the laudable efforts of the Atom crew to correct those mistakes, in the end we find just enough similarity among the various RSS formats that the relevant parts can be extracted and put into a unified form.
I prefer using arrays to work with them, so I store them as LSON (LiveCode serialized arrays, such as you get with arrayEncode). LSON data is superfast to pull and decode from disk, and since the result is an array it's very efficient to work with for extraction and rendering.
3. Render for display
Most RSS feeds generated today are made with the dismaying presumption that you'll be using a multi-megabytes-of-RAM browser engine to render it. If you use an embedded browser object you'll find many aspects of dealing with today's often-broken RSS feeds a bit easier.
But if you want a stripped version of the content more akin to RSS' original mandate, giving you more control over its display, a little extra work can turn the dizzying mix of plain text, misunderstood/misapplied CDATA, and HTML you'll find in modern feeds into LC's htmlText, which can be dropped into a field.
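A rough sketch of how steps 2 and 3 can fit together in script, assuming each parsed feed ends up in an array with numbered keys holding "title" and "link" (the array layout, file path, and field name here are only illustrative, not LiveNet's actual code):
Code: Select all
-- step 2: cache the parsed array to disk as LSON
on cacheFeed tFeed, tPath
   put arrayEncode(tFeed) into URL ("binfile:" & tPath)
end cacheFeed

-- step 3: reload the LSON cache and render a stripped-down list in a field
on renderFeed tPath
   put arrayDecode(URL ("binfile:" & tPath)) into tFeed
   put empty into tHtml
   repeat with i = 1 to the number of elements of tFeed
      put "<p><a href=" & quote & tFeed[i]["link"] & quote & ">" & \
            tFeed[i]["title"] & "</a></p>" after tHtml
   end repeat
   set the htmlText of field "Stories" to tHtml
end renderFeed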
All that said, I broke the process into three steps in large part for the unique needs of LiveNet: I wanted the shortest time I could get between opening the card and displaying the aggregate feeds, and I wanted to minimize traffic on the RSS hosts as much as possible.
So in LiveNet steps 1 and 2 above are done on a machine in my office, and the aggregated LSON file uploaded in compressed form every few minutes to one of my public servers, where LiveNet then grabs it whenever the user opens LiveNet.
That's probably more steps than you'd want for a local aggregator, but I do think there's benefit in using separate download schedules for different feeds, to be both more efficient for the user and kinder to the RSS host.
Whether you parse it on download or after probably doesn't matter much, but I would suggest parsing the content into an array for more efficient handling. And if you use different download schedules for each feed, it may be useful to store each in its own array, serialized to disk with arrayEncode.
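For the per-feed schedules, one simple approach is to keep a "last fetched" time alongside each feed's data and skip any feed whose interval has not elapsed yet (a minimal sketch; the feed array keys shown here are made up):
Code: Select all
on updateFeeds
   global gFeeds -- gFeeds[name]["url"], ["interval"] in seconds, ["fetched"], ["xml"]
   repeat for each key tName in gFeeds
      if gFeeds[tName]["fetched"] is a number and \
            the seconds - gFeeds[tName]["fetched"] < gFeeds[tName]["interval"] then
         next repeat -- this feed is still fresh; be kind to its server
      end if
      put URL gFeeds[tName]["url"] into gFeeds[tName]["xml"]
      put the seconds into gFeeds[tName]["fetched"]
   end repeat
end updateFeeds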
You could probably dispose of the original XML once you have it converted to an array, but a habit from my analytics work compels me to keep sources in their original form just in case I later decide to do some other parsing on them. Probably not needed with ephemeral data like RSS, but sometimes old habits are hard to break.

Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Count # of Unique Two-Word Phrases (Strings/Items)
So it's as easy as 1-2-3 then. Hah. I appreciate the detailed response--I feel a bit more confident knowing it's *kind of* a matter of taste and not necessarily better one way vs. the other when parsing limited nodes.
I'm really focusing on titles, and specifically headline news titles. I spend far too much time every day rifling through duplicate titles via Feedly/Nextgen Reader thanks to overlap and widespread use of AP by the press. I'm trying to build an app that runs an algorithm that dismantles, say, 30 headline RSS feeds across the globe into common 1-, 2-, 3-, and 4-word phrases and then rebuilds totally new titles based on word order within common phrases and the average frequency. All article titles on a given subject will be broken down and recombined into one "average title" ultimately presented back to the user. Sometimes the title will end up being one that was already in use, sometimes it'll be slightly contorted--but the point is that unless there's radically different information on the subject that necessitates two separate subject listings, the gist will be contained in the singular new "average" title. At the very least, it reduces overlap.
Yeah, you can look at google/facebook trends for this sort of thing. But you can't tailor their list of sources to your liking. I don't want the average of everything on the web--I want the average of my own go-to news outlets.
p.s. I always remove CDATA references from the XML--at least when it comes to news, it seems like the better info is regularly nested in there.
Re: Count # of Unique Two-Word Phrases (Strings/Items)
theotherbassist wrote: So it's as easy as 1-2-3 then. Hah. I appreciate the detailed response--I feel a bit more confident knowing it's *kind of* a matter of taste and not necessarily better one way vs. the other when parsing limited nodes.
That sounds like a fun project. You might consider writing a guest blog post for livecode.com about it when you've got it running. Drop a note to heather AT livecode.com - they love good development stories like that.
theotherbassist wrote: Yeah, you can look at google/facebook trends for this sort of thing. But you can't tailor their list of sources to your liking. I don't want the average of everything on the web--I want the average of my own go-to news outlets.
Amen, brother. I'm a bit of a news junkie myself. I made the LiveNet aggregator as part of a larger exploration I hope to get time to get back to soon, with a similar focus on weeding out replicated sources and finding the truly good stuff I'm most interested in.
theotherbassist wrote: p.s. I always remove CDATA references from the XML--at least when it comes to news, it seems like the better info is regularly nested in there.
Ah, if only some of those aggregators even knew what CDATA is. I've seen CDATA tags around plain text, XML, HTML, escaped HTML, and every strange mix you can imagine. I don't know how other parser writers handle it. Ah, the many times I've cursed at the feed generators I've had to work with.

Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
Re: Count # of Unique Two-Word Phrases (Strings/Items)
theotherbassist wrote: Most of the "wait" time in my stack so far comes from pulling the contents of individual nodes from RSS feed XML data anyway--titles, descriptions, urls, etc. Do you think that it would be faster to save all the XML as local files and *then* read in the node content? What I'm doing now is pulling individual xml node contents directly from the web locations. Maybe making everything local in one fell swoop would cut some ms from the repeat functions that access the nodes.
For the record, I was making a big mistake without knowing it.
I was putting XML node contents directly into lines in a field--don't do this. Each repeat gets slower the further down in the list (field) the target line is, which turns a simple linear counting job into harder work than it needs to be. Now I'm putting XML node contents into an array (three lines in each element for each article--title/pubdate/link) and then using "combine using return" to put the array into the field (as UTF-8!) at the end.
For a test of 10 feeds averaging 30 articles each, the direct-to-field version takes about (a rather frustrating) 1m 12s. The xml-to-array-to-list version takes about 3s.
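A minimal sketch of the faster pattern described above, with placeholder array keys and field name (decode the downloaded bytes with textDecode first if they arrive as UTF-8):
Code: Select all
-- build everything in an array first, then write to the field exactly once
put empty into tArticles
repeat with i = 1 to the number of elements of tFeed
   put tFeed[i]["title"] & cr & tFeed[i]["pubdate"] & cr & tFeed[i]["link"] into tArticles[i]
end repeat
combine tArticles using return
put tArticles into field "Articles" -- a single field update instead of one per line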
Use arrays.