Letter Frequency
Posted: Mon Mar 05, 2018 9:52 pm
by richmond62
The first step in cryptanalysis, if you know the language the original text was written in, is to analyse the text for letter frequency.
Improved stack available lower down the page!
Obviously it is a relatively simple thing to adapt this stack for anyone whose language uses either an alphabet or an abugida.
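For anyone who wants to see the counting step outside LiveCode, here is a minimal sketch in Python. It is not the stack's own code; the function names are made up for illustration. It relies on `str.isalpha()`, which only accepts Unicode "Letter" categories, so in an abugida the dependent vowel signs (which are combining marks) would be skipped unless that test is widened.

```python
from collections import Counter

def letter_frequency(text):
    """Count letters case-insensitively, most common first."""
    # Keep only characters Python classes as letters; digits,
    # punctuation and whitespace are dropped.
    letters = [ch.lower() for ch in text if ch.isalpha()]
    return Counter(letters).most_common()

def frequency_table(text):
    """Return (letter, count, percentage) rows, most common first."""
    counts = letter_frequency(text)
    total = sum(n for _, n in counts)
    # An empty text gives an empty table, so no division by zero occurs.
    return [(ch, n, 100.0 * n / total) for ch, n in counts]

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog"
    for ch, n, pct in frequency_table(sample):
        print(f"{ch}  {n:3d}  {pct:5.1f}%")
```

Letters with equal counts come out in the order they first appear in the text, so a short sample will not give a meaningful ranking among ties.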
Re: Letter Frequency
Posted: Tue Mar 06, 2018 12:38 am
by FourthWorld
When I was a kid I was taught that the letters most commonly used in English were, in order: E, T, A, O, N, S, H, I, with a long tail from there. We used to crack Caesar ciphers with that. Comforting to see that the screenshot of that sample kinda bears that out.
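A rough sketch of that trick in Python, assuming the text is English and long enough for E to be its most frequent letter (short or unusual texts will fool it). The function names are illustrative only.

```python
from collections import Counter
import string

def shift_text(text, shift):
    """Shift A-Z and a-z by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def guess_caesar_shift(ciphertext, assumed_top="e"):
    """Guess the shift by assuming the commonest cipher letter stands for E."""
    letters = [ch for ch in ciphertext.lower() if ch in string.ascii_lowercase]
    if not letters:
        return 0  # nothing to go on
    top = Counter(letters).most_common(1)[0][0]
    return (ord(top) - ord(assumed_top)) % 26

def crack_caesar(ciphertext):
    """Undo the guessed shift; the result still needs a human sanity check."""
    return shift_text(ciphertext, -guess_caesar_shift(ciphertext))
```

If the first guess produces gibberish, the usual next move is to try the commonest cipher letter against T, A, O and so on down the list.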
Re: Letter Frequency
Posted: Tue Mar 06, 2018 2:42 am
by dunbarx
Whatever happened to "E T A O I N S H R D L U"?
Or was this just the way early linotype machines dumped their letters?
Craig
Re: Letter Frequency
Posted: Tue Mar 06, 2018 7:19 am
by richmond62
I have used a relatively short text.
I suspect that to get an accurate reflection of letter frequency in any language you'd have to analyse quite a few long texts of differing genres.
Re: Letter Frequency
Posted: Tue Mar 06, 2018 7:49 am
by FourthWorld
Craig, it may be that the book I was learning from was very old. Language changes over time, and maybe that affects letter frequency. Or maybe I'm just so old I forgot the details.

Either way, we were both close enough to the results of that sample that we'd be able to sort our way through a Caesar cipher version of it.
For a more accurate count maybe we can talk Richmond into running the algo against the Enron corpus:
https://www.cs.cmu.edu/~enron/
The tar file is only 443 MB - shouldn't take too long.

Re: Letter Frequency
Posted: Tue Mar 06, 2018 8:18 am
by richmond62
FourthWorld wrote: ↑Tue Mar 06, 2018 7:49 am
we can talk Richmond into running the algo against the Enron corpus
Quite.
Of course you could download my stack and do that yourself.

Re: Letter Frequency
Posted: Tue Mar 06, 2018 8:19 am
by SparkOut
My (ancient) codes and ciphers book said ETAONRISH...
I guess letters fall out of popularity like names.
I would guess the content of the Enron file might be skewed towards business terminology and could (probably only very slightly) affect the letter count compared to the language as a whole. Plus U would be at a disadvantage as it would be left out of words like colour... Oh no! English isn't English any more!

Re: Letter Frequency
Posted: Tue Mar 06, 2018 9:34 am
by richmond62
Analysing the text of a book written in the 1890s about Linguistics I came across "nifty" words
such as 'ni**er', 'd*rk*e', 'peasant', 'primitive' and 'uncivilised' . . .
While those words may be a bit dangerous to use because the Thought Police and the Politically Correct Lefty meddlers might "drum you out of the Br*wnies", analysing samples that contain words of that source STILL will yield an idea of letter-frequency in 19th century English written texts.
As a person who has "chromatically challenged" hair (it's Ginger), I have absolutely no time for "tip-toeing through the tulips" when it comes to language: a 'spade' is a 'spade' (despite possible ambiguities) and NOT an 'earth relocation device'.
NOW: go away and design a LiveCode stack to check people's texts for Politically Incorrect words and phrases, including an in-built updater so it is always ahead of the moving fence.

Re: Letter Frequency
Posted: Tue Mar 06, 2018 12:11 pm
by richmond62

[Attachment: importBTN.png]
Just added the above as a Love Gift for those who want to muck around with large corpora.
Completely OT: my Book of the Moment is 'The Book of Bebb' which I thoroughly recommend.

[Attachment: Bebb.jpg]
Re: Letter Frequency
Posted: Tue Mar 06, 2018 12:36 pm
by richmond62
Funny remarks about the ENRON dataset aside, this:
"It contains data from about 150 users, mostly senior management of Enron, organized into folders."
means that, having downloaded the dataset, one would have to concatenate all the constituent messages into one long text . . .
. . . a bit of a pain.
Although, from the point of view of LiveCode, that wouldn't be a problem as such.
What would be a problem would be "jumping in and out" of all the folders to load the text files into a text field in a stack.
Re: Letter Frequency
Posted: Tue Mar 06, 2018 12:45 pm
by richmond62
Frankly, it might be better to work with corpora such as these:
http://www.natcorp.ox.ac.uk/
https://corpus.byu.edu/
even if, in most of the cases they list, these are NOT straightforward files containing texts: they have all sorts of "guff" such as POS-tags embedded in them.

[Attachment: xml.png]
Yikes!
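If the corpus files are well-formed XML, the tags can be stripped before counting. A minimal sketch in Python using the standard library parser follows; the sample markup in the comment is invented for illustration and is not copied from either corpus, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def xml_text_only(xml_string):
    """Return just the text content of a well-formed XML fragment."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order and yields its text,
    # so tags and attributes such as POS labels are left behind.
    return "".join(root.itertext())

# Hypothetical markup, only roughly in the style of a POS-tagged corpus:
# <s n="1"><w pos="SUBST">Letters </w><w pos="VERB">fall </w>
# <w pos="ADV">out</w><c>.</c></s>
```

This only helps where the files really are well-formed XML; anything else would raise a parse error and need a different filter.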
Re: Letter Frequency
Posted: Tue Mar 06, 2018 3:52 pm
by FourthWorld
richmond62 wrote: ↑Tue Mar 06, 2018 12:36 pm
Funny remarks about the ENRON dataset aside, this:
"It contains data from about 150 users, mostly senior management of Enron, organized into folders."
means that, having downloaded the dataset, one would have to concatenate all the constituent messages into one long text . . .
Given the size of the corpus it would be far more efficient to just traverse the folders and process each individually.
Besides, being email, unless you were doing a relationship study you'd probably want to write a filter to remove the header from each.
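A rough sketch of that approach in Python, under a few assumptions: every file under the root folder is one plain-text message, the header ends at the first blank line (the usual email convention), and undecodable bytes can simply be replaced. The function names and the folder name in the usage note are placeholders, not anything taken from the dataset's documentation.

```python
import os
from collections import Counter

def strip_header(message):
    """Return the body: everything after the first blank line."""
    normalized = message.replace("\r\n", "\n")
    _head, sep, body = normalized.partition("\n\n")
    # With no blank line there is no recognisable header, so keep it all.
    return body if sep else normalized

def count_letters_in_tree(root):
    """Walk every folder under `root`, tallying letters one file at a time."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                body = strip_header(f.read())
            # Only the running totals are kept in memory, never the corpus.
            totals.update(ch.lower() for ch in body if ch.isalpha())
    return totals
```

Usage would be something like `count_letters_in_tree("maildir").most_common(12)`, where `"maildir"` stands in for wherever the tar file was unpacked. Quoted replies and signatures inside message bodies would still be counted, so a stricter filter might be wanted for a serious study.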
Re: Letter Frequency
Posted: Tue Mar 06, 2018 6:03 pm
by richmond62
FourthWorld wrote: ↑Tue Mar 06, 2018 3:52 pm
a relationship study
Err . . . I'm currently working on getting Bulgarian kids between 6 and 8 years old to write English in a vaguely comprehensible hand.