Sorting data

Whytey · Post by **Whytey** » Sun Jan 12, 2014 10:14 am

Hi, I have a large amount of data in a text box, brought down from the PubMed database, that looks something like this:

Junk text
Results: 1 to 20 of 199496
Select item 244094151.
Cloning of Soluble Human Stem Cell Factor in pET-26b(+) Vector.
Asghari S, Shekari Khaniani M, Darabi M, Mansoori Derakhshan S.
Adv Pharm Bull. 2014;4(1):91-5. doi: 10.5681/apb.2014.014. Epub 2013 Dec 23.
PMID:
<< First< Prev
Junk text

How would I go about separating the first five results from the junk text (it's all going to be variable, as it is brought down from the online database) and putting each entry in its own text box, five boxes in total?

I look forward to hearing from you! Thanks for your help, I greatly appreciate it!

dunbarx · Post by **dunbarx** » Sun Jan 12, 2014 5:50 pm

Hi.

I see Klaus has addressed this in your other post, so I will just tell you a couple of things here.

LC has the ability to "find" data in a block of text, either by word, line or item (or even by character). The hard part is trusting how that block of text is organized. In your example, my first question would be:

"are the last five lines always present in the same way, below the junk text"? If so, you can do something like:

put line -5 to the number of lines of blockOfText into field "yourResults"

Or maybe the string "results:" itself (with colon) only occurs once in the block of text in that specific location. In that case, as Klaus mentioned, the "wordOffset" is your friend.

My point here is that LC has the tools, but the uncertainty is in the formatting of the raw text. Are there aspects of that format that you can rely on? For example, if "results:" appears elsewhere in the text, the method itself will break. This is an example of how to parse data. You need reliable markers to be able to take advantage of the tools.

Craig Newman
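
As a minimal sketch of the point about reliable markers: wordOffset returns 0 when the word is not present, so a script can check for that before trusting the position. The field name "Results2" is the one used later in this thread, "yourResults" is the placeholder from the post above, and the handler itself is only an illustration, not code from any poster.

```livecode
on mouseUp
   put field "Results2" into tText
   put wordOffset("Results:", tText) into tStart
   if tStart = 0 then
      -- the marker is missing, so the layout is not what we expected
      answer "Marker 'Results:' not found."
   else
      -- copy everything from the marker to the end of the text
      put word tStart to -1 of tText into field "yourResults"
   end if
end mouseUp
```

If the marker can appear more than once, this only finds the first occurrence, which is exactly the formatting risk described above.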

Whytey · Post by **Whytey** » Sun Jan 12, 2014 6:36 pm

Hi Craig, thank you. I think I am beginning to see the wood for the trees: Results: does only appear in that block of text the one time. Still not quite sure how to write an offset function for this though, I had a try but it doesn't seem to work with the code given in my other post.

if matchText( yourText, "(?ms)'Select item 244094102.\n(.*?)\nPMID'", gotit) then
put gotit
else
put "Not found!"
end if

I'm not sure I was doing it right though, as I put the code behind a button after the find "244" in field "Results2" end mouseUp code?

Results2 is where the big block of text is now.

Hope you can help and aren't as confused as me now! Whytey

Whytey · Post by **Whytey** » Sun Jan 12, 2014 6:42 pm

I'm super new to this as I'm sure you can tell....trust me to start with a REALLY DIFFICULT APP!!!

Thierry · Post by **Thierry** » Sun Jan 12, 2014 7:19 pm

Whytey wrote:Still not quite sure how to write an offset function for this though, I had a try but it doesn't seem to work with the code given in my other post.

if matchText( yourText, "(?ms)'Select item 244094102.\n(.*?)\nPMID'", gotit) then
put gotit
else
put "Not found!"
end if

Umm, you could try replacing the two "\n" in the regular expression with "." (a dot).
(I don't know how the newlines are encoded in your text.)

Code:

 matchText( field "Results2", "(?ms)'Select item 244094102..(.*?).PMID'", gotit)

I'm not sure I was doing it right though, as I put the code behind a button after the find "244" in field "Results2" end mouseUp code?

Why aren't you posting your script?
It's a bit hard to follow you here, or at least hard for me.

Thierry

Whytey · Post by **Whytey** » Sun Jan 12, 2014 9:50 pm

Hi Thierry, I don't really have any code to show you as I'm still trying to figure out how to do this.

I can select a certain range of words from the text using this handler:
# Select a certain range of words in the text.

Code:

on mouseUp
   select word 140 to 159 of field "Results2"
end mouseUp

but this isn't really useful as it is a huge set of text and the word Results: is not going to appear in the same place every time.

I have also managed to find and select the word Results: using
# Find a certain word and select

Code:

on mouseUp
   find "Results:" in field "Results2"
   select the foundChunk
end mouseUp

but this is just the word Results:

What I am really looking for is a way to select the text from the word Results: to the word PMID:, if that is possible (I guess this is similar to the selection code I mentioned above), and copy this text into the box "Box1". I have looked at the offset function, but it seems to tell me how many times the word Results is in the text rather than allowing me to select and copy. I may just not quite understand this function yet, though; perhaps you could explain it a little more to me?

Still confused....

Thanks for your help though, this forum is great!

jacque · Post by **jacque** » Sun Jan 12, 2014 11:25 pm

If I understand the problem right:

Code:

put fld "source" into tText
put wordoffset("Results:",tText) into tStart
put wordoffset("PMID:",tText) into tEnd
put word tStart to tEnd of tText into fld "box1"

And if you're brave, this could also be condensed into:

Code:

put word wordOffset("Results:",fld "source") to wordOffset("PMID:",fld "source") of fld "source" into fld "box1"

Whytey · Post by **Whytey** » Sun Jan 12, 2014 11:43 pm

IT WORKED!!

THANK YOU SO MUCH JACQUE!!!!

jacque · Post by **jacque** » Sun Jan 12, 2014 11:49 pm

Oh good. Now you're proficient in the offset function.

Whytey · Post by **Whytey** » Sun Jan 12, 2014 11:50 pm

Let's not get ahead of ourselves.

Thierry · Post by **Thierry** » Mon Jan 13, 2014 10:13 am

jacque wrote:

Code:

put fld "source" into tText
put wordoffset("Results:",tText) into tStart
put wordoffset("PMID:",tText) into tEnd
put word tStart to tEnd of tText into fld "box1"

Hi Whytey,

Glad that jacque passed by while I was asleep.

For the record, the pattern of my previous regex can't work in your case because
of the extra quotes in the pattern, which are not in your text.
So, using jacque's piece of code as a definition, it would be rewritten as:

Code:

get matchText(tText, "(?ms)Results:(.*?)PMID:", tStart2tEndOfText) 
put tStart2tEndOfText into fld "box1"

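We can compare those two pieces of code by looking at what they have in common.

The two wordOffset lines, which locate the start and end markers:

put wordoffset("Results:",tText) into tStart
put wordoffset("PMID:",tText) into tEnd

correspond to the Results: and PMID: markers in the pattern:

matchText(tText, "(?ms)Results:(.*?)PMID:", tStart2tEndOfText)

and the line that extracts the text between the markers:

put word tStart to tEnd of tText into fld "box1"

corresponds to the (.*?) capture group, whose match is placed in tStart2tEndOfText:

matchText(tText, "(?ms)Results:(.*?)PMID:", tStart2tEndOfText)

There are a few different ways to code this; just go with what makes you feel confident.

Happy coding with LiveCode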

Thierry
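
A small sketch of the regex variant with a check on the result, since matchText returns true or false depending on whether the pattern matched. The field names "source" and "box1" are taken from jacque's example; the handler and the variable name tCaptured are only illustrative. Note that, unlike the wordOffset version, the captured text does not include the Results: and PMID: markers themselves.

```livecode
on mouseUp
   put field "source" into tText
   -- (?ms) lets "." match line breaks; (.*?) captures as little as possible
   if matchText(tText, "(?ms)Results:(.*?)PMID:", tCaptured) then
      put tCaptured into field "box1"
   else
      -- no match: leave the field alone and report it
      answer "No text found between Results: and PMID:"
   end if
end mouseUp
```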

Whytey · Post by **Whytey** » Mon Jan 13, 2014 7:47 pm

Hi Jacque, thanks for your message. I've ground to a halt again! I am now trying to put the next entry down the list into box 2; the first entry (Cloning of Soluble Human Stem Cell Factor) goes into box 1 perfectly.

The text is as follows:
Select item 244094151.
Cloning of Soluble Human Stem Cell Factor in pET-26b(+) Vector.
Asghari S, Shekari Khaniani M, Darabi M, Mansoori Derakhshan S.
Adv Pharm Bull. 2014;4(1):91-5. doi: 10.5681/apb.2014.014. Epub 2013 Dec 23.
PMID:

24409415

[PubMed]
Related citations
Select item 244094102.
Nucleostemin depletion induces post-g1 arrest apoptosis in chronic myelogenous leukemia k562 cells.
Seyed-Gogani N, Rahmati M, Zarghami N, Asvadi-Kermani I, Hoseinpour-Feyzi MA, Moosavi MA.
Adv Pharm Bull. 2014;4(1):55-60. doi: 10.5681/apb.2014.009. Epub 2013 Dec 23.
PMID:

I've twiddled with the code slightly; it is now:

Code:

on mouseUp
   put fld "Results2" into tText
   put wordoffset("244",tText) into tStart
   put wordoffset("PMID:",tText) into tEnd
   put word tStart to tEnd of tText into fld "box1"
   delete word tStart to tEnd of tText
   put wordoffset("244",tText) into tStart
   put wordoffset("item",tText) into tEnd
   delete word tStart to tEnd of tText
   put wordoffset("244",tText) into tStart
   put wordoffset("PMID:",tText) into tEnd
   put word tStart to tEnd of tText into fld "box2"
end mouseUp

but this only returns the number 24409415 in box 2. Where am I going wrong?

Thanks all for helping, you guys rock!

Whytey · Post by **Whytey** » Mon Jan 13, 2014 7:51 pm

Sorry Thierry that was to you (and Jacque too) LOL!!! coding getting to my head!

jacque · Post by **jacque** » Mon Jan 13, 2014 9:00 pm

I was wondering if that would happen. The example I gave will only find a single instance, as you discovered. To find all instances you'll want a repeat loop that uses the third, optional "skip" parameter. That tells the offset function how many words to skip before it looks for the next instance. Note that the reported result is not the distance from word 1; it is the distance from the skip point. So you must add the number of skipped words to the found integer in order to get the correct position of the current match.

Here's one way to do it:

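Code:

on lookup
  put fld "source" into tText
  put 0 into tSkip
  put 1 into x
  repeat
    -- wordOffset returns 0 when nothing is found past the skipped words
    put wordOffset("Select",tText,tSkip) into tOffset
    if tOffset = 0 then exit repeat
    put tOffset + tSkip into tStart
    -- look for the closing marker only after this entry's start
    put wordOffset("PMID:",tText,tStart) into tOffset
    if tOffset = 0 then exit repeat
    put tOffset + tStart into tEnd
    put word tStart to tEnd of tText into aFound[x]
    add 1 to x
    -- tEnd is already an absolute word position, so it becomes the new skip
    put tEnd into tSkip
  end repeat
end lookup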

I used "select" as the lookup term rather than 244 for a couple of reasons. The number will eventually change as the number of indexed articles grows. That may be far enough into the future that it won't affect your script, but it's more robust not to rely on it. The other, better reason is that "244" occurs twice after the first article, and I assume you only want the actual text of the second entry without the header above it. If you don't want the words "Select item" included in your results, you can delete the first two words of the found text before putting it into the array. Or just add 2 to tStart position after it's initially located.

This script collects the entries into an array. You'll need to place each one into a field. If your fields are named consistently, like "box 1" and "box 2" then it shouldn't be hard to use the array keys to construct and/or identify the correct field name that should hold the text. But if you need help with that, let us know.
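
One possible sketch of that last step, assuming the fields are named "box1", "box2" and so on (as in the earlier posts) and that the array is keyed 1, 2, 3, ... as in the lookup handler. The handler name fillBoxes and the parameter pFound are made up for this example.

```livecode
command fillBoxes pFound
   -- pFound is assumed to be keyed 1, 2, 3, ... as built by the lookup handler
   repeat with i = 1 to the number of elements of pFound
      put "box" & i into tFieldName
      -- skip entries that have no matching field on the card
      if there is a field tFieldName then
         put pFound[i] into field tFieldName
      end if
   end repeat
end fillBoxes
```

It could be called as the last line of the lookup handler with `fillBoxes aFound`. With only five box fields on the card, entries beyond the fifth are simply ignored.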

jacque · Post by **jacque** » Mon Jan 13, 2014 9:14 pm

Forgot to mention, you are getting the odd "244" line in field 2 because that's what follows the "PMID" line, so it matches a "244" lookup. But I'm wondering if the format of the text actually has that ID number on a separate line, or does it follow immediately after the "PMID:" ? That is, does it look like this:

Select item 244094151.
Cloning of Soluble Human Stem Cell Factor in pET-26b(+) Vector.
Asghari S, Shekari Khaniani M, Darabi M, Mansoori Derakhshan S.
Adv Pharm Bull. 2014;4(1):91-5. doi: 10.5681/apb.2014.014. Epub 2013 Dec 23.
PMID: 24409415

[PubMed]
Related citations
Select item 244094102.
Nucleostemin depletion induces post-g1 arrest apoptosis in chronic myelogenous leukemia k562 cells.
Seyed-Gogani N, Rahmati M, Zarghami N, Asvadi-Kermani I, Hoseinpour-Feyzi MA, Moosavi MA.
Adv Pharm Bull. 2014;4(1):55-60. doi: 10.5681/apb.2014.009. Epub 2013 Dec 23.
PMID:

Because if so, there's an easier way to parse out the entries than using offset.

LiveCode Forums.
