LiveCode Forums.

Posted: **Mon Mar 05, 2018 9:52 pm**

The first step in cryptanalysis, if you know the language the original text was written in, is to analyse the text for letter frequency.

Improved stack available lower down the page!

Obviously it is a relatively simple thing to adapt this stack for anyone whose language uses either
an alphabet or an abugida.
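For anyone following along without the stack, the counting step can be sketched in a few lines of Python. This is only an illustration, not the script inside the stack: the function name `letter_frequencies` and the sample sentence are made up for the example. It uses `str.isalpha()`, so it should cope with non-Latin alphabets as well; an abugida would probably need extra handling for combining vowel signs, which this sketch does not attempt.

```python
from collections import Counter

def letter_frequencies(text):
    """Return (letter, count, percent) tuples, most frequent first."""
    # Keep letters only, folding case so 'e' and 'E' count together.
    letters = [ch.upper() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, n, 100.0 * n / total) for ch, n in ranked]

# Hypothetical sample text, just to show the call.
sample = "The quick brown fox jumps over the lazy dog."
for letter, count, percent in letter_frequencies(sample)[:5]:
    print(f"{letter}: {count} ({percent:.1f}%)")
```

A short sample like this says very little about the language as a whole; the ranking only settles down with much longer input.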

Posted: **Tue Mar 06, 2018 12:38 am**

When I was a kid I was taught that the letters most commonly used in English were, in order: E, T, A, O, N, S, H, I, with a long tail from there. We used to crack Caesar ciphers with that. Comforting to see the screen shot of that sample kinda bears that out.

Posted: **Tue Mar 06, 2018 2:42 am**

Whatever happened to "E T A O I N S H R D L U"?

Or was this just the way early linotype machines dumped their letters?

Craig

Posted: **Tue Mar 06, 2018 7:19 am**

I have used a relatively short text.

I suspect to get an accurate reflection of letter frequency in any language you'd
have to analyse quite a few long texts of differing genres.

Posted: **Tue Mar 06, 2018 7:49 am**

Craig, it may be that the book I was learning from was very old. Language changes over time, and maybe that affects letter frequency. Or maybe I'm just so old I forgot the details.

Either way, we were both close enough to the results of that sample that we'd be able to sort our way through a Caesar cipher version of it.
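The Caesar trick mentioned here can be sketched roughly as follows, assuming the commonest letter in the ciphertext stands for E. That assumption only holds for reasonably long English text, and the names `caesar_shift` and `crack_caesar`, along with the sample message, are invented for this example.

```python
from collections import Counter
import string

def caesar_shift(text, shift):
    """Shift A-Z/a-z letters by `shift` places, leaving everything else alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def crack_caesar(ciphertext, assumed_top="e"):
    """Guess the shift by assuming the commonest cipher letter is `assumed_top`."""
    letters = [ch for ch in ciphertext.lower() if ch in string.ascii_lowercase]
    if not letters:
        return 0, ciphertext
    commonest = Counter(letters).most_common(1)[0][0]
    shift = (ord(commonest) - ord(assumed_top)) % 26
    # Undo the guessed shift to recover a candidate plaintext.
    return shift, caesar_shift(ciphertext, -shift)

# Hypothetical message in which E is clearly the commonest letter.
plain = "meet me here when the end seems nearer"
secret = caesar_shift(plain, 3)
print(secret)
print(crack_caesar(secret))
```

If the first guess comes out as gibberish, the usual next move is to try T, A or O as the assumed top letter via `assumed_top`.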

For a more accurate count maybe we can talk Richmond into running the algo against the Enron corpus:
https://www.cs.cmu.edu/~enron/

The tar file is only 443 MB - shouldn't take too long.

Posted: **Tue Mar 06, 2018 8:18 am**

we can talk Richmond into running the algo against the Enron corpus

Quite.

Of course you could download my stack and do that yourself.

Posted: **Tue Mar 06, 2018 8:19 am**

My (ancient) codes and ciphers book said ETAONRISH...
I guess letters fall out of popularity like names.

I would guess the content of the Enron file might be skewed towards business terminology and could (probably only very slightly) affect the letter count compared to the language as a whole. Plus U would be at a disadvantage as it would be left out of words like colour... Oh no! English isn't English any more!

Posted: **Tue Mar 06, 2018 9:34 am**

Analysing the text of a book written in the 1890s about Linguistics I came across "nifty" words
such as 'ni**er', 'd*rk*e', 'peasant', 'primitive' and 'uncivilised' . . .

While those words may be a bit dangerous to use because the Thought Police and the Politically Correct Lefty meddlers might "drum you out of the Br*wnies", analysing samples that contain words of that source STILL will
yield an idea of letter-frequency in 19th century English written texts.

As a person who has "chromatically challenged" hair (it's Ginger), I have absolutely no time for "tip-toeing through the tulips" when it comes to language: a 'spade' is a 'spade' (despite possible ambiguities) and NOT
an 'earth relocation device'.

NOW: go away and design a LiveCode stack to check people's texts for Politically Incorrect words and phrases, including an in-built updater so it is always ahead of the moving fence.

Posted: **Tue Mar 06, 2018 12:11 pm**

[Attachment: importBTN.png]

Just added the above as a Love Gift for those
who want to muck around with large corpora.

[Attachment: Letter Counter.livecode.zip]

Completely OT: my Book of the Moment is 'The Book of Bebb' which I thoroughly recommend.

[Attachment: Bebb.jpg]

Posted: **Tue Mar 06, 2018 12:36 pm**

Funny remarks about the ENRON dataset aside, this:

It contains data from about 150 users, mostly senior management of Enron, organized into folders.

Means that, having downloaded the dataset one would have to concatenate
all the constituent messages into one long text . . .

. . . a bit of a pain.

Although, from the point of view of LiveCode that wouldn't be a problem as such.

What would be a problem would be "jumping in and out" of all the folders to load the
text files into a text field in a stack.

Posted: **Tue Mar 06, 2018 12:45 pm**

Frankly, it might be better to work with corpora such as these:

http://www.natcorp.ox.ac.uk/

https://corpus.byu.edu/

Even so, in most of the cases they list these are NOT straightforward files containing texts:
they have all sorts of "guff" such as POS-tags embedded in them.

[Attachment: xml.png]

Yikes!
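For what it's worth, the embedded tags need not be a show-stopper. A rough Python sketch, assuming the corpus file is well-formed XML: the `SAMPLE` fragment below is made up for the example and is not copied from either corpus, and real files may have a different tag layout.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment in the style of a POS-tagged XML corpus.
SAMPLE = (
    '<s n="1">'
    '<w pos="ART">The </w>'
    '<w pos="SUBST">cat </w>'
    '<w pos="VERB">sat</w>'
    '<c pos="PUN">.</c>'
    '</s>'
)

def plain_text(xml_fragment):
    """Return only the text content, discarding tags and their attributes."""
    root = ET.fromstring(xml_fragment)
    return "".join(root.itertext())

text = plain_text(SAMPLE)
# Count letters in the recovered text only, so tag names never pollute the tally.
counts = Counter(ch.upper() for ch in text if ch.isalpha())
print(text)
print(counts.most_common(3))
```

The point is that the POS attributes live inside the tags, so once the tags are dropped the letter count sees only the running text.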

Posted: **Tue Mar 06, 2018 3:52 pm**

richmond62 wrote (Tue Mar 06, 2018 12:36 pm):
"Funny remarks about the ENRON dataset aside, this: 'It contains data from about 150 users, mostly senior management of Enron, organized into folders.' Means that, having downloaded the dataset one would have to concatenate all the constituent messages into one long text."

Given the size of the corpus it would be far more efficient to just traverse the folders and process each individually.

Besides, being email, unless you were doing a relationship study you'd probably want to write a filter to remove the header from each.
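To make that concrete, here is a rough Python sketch of the traverse-and-filter idea rather than a LiveCode script. It assumes each message is a plain text file whose header ends at the first blank line, and it reads files as Latin-1 so that odd bytes never raise a decoding error. Both are assumptions about the data, and the function names are invented for the example.

```python
import os
from collections import Counter

def strip_header(message):
    """Drop everything up to the first blank line (the email header block)."""
    normalised = message.replace("\r\n", "\n")
    header, sep, body = normalised.partition("\n\n")
    # If there is no blank line, treat the whole message as body.
    return body if sep else normalised

def count_letters_in_tree(root):
    """Walk `root`, adding each message body's letter counts to one running total."""
    totals = Counter()
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            # Latin-1 is an assumption; it decodes any byte sequence without errors.
            with open(path, encoding="latin-1") as handle:
                body = strip_header(handle.read())
            totals.update(ch.upper() for ch in body if ch.isalpha())
    return totals
```

Called as `count_letters_in_tree("maildir").most_common(8)`, where `maildir` is a hypothetical path to the unpacked corpus, this keeps only one running tally in memory and never builds the one long concatenated text.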

Posted: **Tue Mar 06, 2018 6:03 pm**

a relationship study

Err . . . I'm currently working on getting Bulgarian kids between 6 and 8 years old to write
English in a vaguely comprehensible hand.

Letter Frequency
