From: Matthew Keys
------------------------------------------------------
Does anyone know how to create a tag cloud based on the body of an email? The Google gods point me in the direction of Outlook plugins, but I'm looking for something more Linux-CLI scriptable; maybe something that could parse through an exported mailbox/folder.

===============================================================
From: Sean Brewer
------------------------------------------------------
If you can export the e-mails easily, the general algorithm is something
like this:
1. Tokenize the words in the e-mail body.
2. Remove stop words (a, an, the, etc. You can find word lists, and
libraries like NLTK have them built in).
3. Use a stemming algorithm to reduce word tokens to their free morpheme
(I think that's the correct term), e.g. convert the token "passing"
to "pass".
4. Rank the results by frequency.
That should get you in the neighborhood.
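
The four steps above can be sketched with nothing but the Python standard
library. This is a rough illustration, not a recommended implementation: the
stop-word list is a tiny made-up sample, and crude_stem is a hypothetical
suffix stripper standing in for a real stemmer or lemmatizer such as the ones
in NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "that", "this", "with", "was", "i", "you"}


def tokenize(text):
    """Step 1: lowercase the text and pull out alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def crude_stem(word):
    """Step 3 stand-in: strip a few common suffixes. Not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "pass" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(text, top=10):
    """Steps 1-4: tokenize, drop stop words, stem, rank by frequency."""
    kept = [t for t in tokenize(text) if t not in STOP_WORDS]  # step 2
    stems = [crude_stem(t) for t in kept]                      # step 3
    return Counter(stems).most_common(top)                     # step 4


if __name__ == "__main__":
    for word, count in word_frequencies("Passing the pass, he passed a pass."):
        print(count, word)
```

The (word, count) pairs it returns are what a cloud renderer would scale font
sizes by.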

===============================================================
From: Matt Keys
------------------------------------------------------
Thanks for the clues! It looks like python may be the winner this time.

===============================================================
From: Dan Lyke
------------------------------------------------------
Someone made a comment about word clouds on Facebook yesterday, and
being the smartass that I am I couldn't resist:
perl -e 'while () { $c{$

===============================================================
From: Sean Brewer
------------------------------------------------------
Actually, you want to do something called lemmatisation, not stemming.
They are related, but stemming does something slightly different.
Lemmatisation does what I described.
I can probably whip up a dirty example with python and nltk.

===============================================================
From: Sean Brewer
------------------------------------------------------
I ran across this: https://github.com/larsmans/weighwords
It might make what you want to do even easier.

===============================================================
From: Matt Keys
------------------------------------------------------
I ran across a few like that, too. I'm a bit confused as to the
difference between a word cloud and a tag cloud. I'm guessing tag clouds
presume that you've attached some form of tag to an example text, which
the code would use to sort upon whereas word clouds you just point the
code to a pile of text that has not been tagged/grouped?

===============================================================
From: Sean Brewer
------------------------------------------------------
Yeah, I think that's the difference. Code for a word cloud makes a cloud
of the most commonly used words.

===============================================================
From: Sean Brewer
------------------------------------------------------
Yeah you could do that, but you still have to do extra processing if you
want anything useful.
" | sort | uniq -c | sort -rn

===============================================================
From: Sean Brewer
------------------------------------------------------
I forgot to add that you could use all that stuff to find a probable topic
of a conversation, which is basically a tag. I thought that might be the
direction you were heading. I could be wrong.

===============================================================
From: Sean Brewer
------------------------------------------------------
Here's an example of what I'm thinking: https://gist.github.com/4324904
It's in ruby, though. I found a neat stemmer/lemmatizer algorithm and an
implementation in ruby, but not in python.
Here's example output: https://gist.github.com/4324904#comment-657626

===============================================================
From: Matt Keys
------------------------------------------------------
Nice start! I started working on it in python, pointing it at an mbox
source, but I keep getting hung up on the method of extraction. I can't
decide if I should focus on the subject or the body... or maybe I should
focus on both? The subject is usually pretty condensed to begin with and
I'm thinking that'd be the smarter place to start... but it wouldn't be
as thorough. The body throws in problems like possible multipart
messages, strange encodings, etc. It would be interesting to see
different results using a chugalug export of maybe the month of December.
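
One way to sidestep the subject-or-body decision is to pull out both and
compare the results. Below is a minimal sketch using Python's standard mailbox
and email modules. The function names are made up for illustration, and it
assumes that only text/plain parts are of interest and that undecodable bytes
can simply be replaced rather than rejected.

```python
import mailbox
from email.header import decode_header, make_header


def subject_of(msg):
    """Return the Subject header with any RFC 2047 encoded words decoded."""
    raw = msg["subject"]
    return str(make_header(decode_header(raw))) if raw else ""


def body_of(msg):
    """Join every text/plain part, walking multipart messages recursively."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue  # skips multipart containers, HTML, attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name Python does not know
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def extract(path):
    """Return a list of (subject, body) pairs for every message in an mbox."""
    box = mailbox.mbox(path)
    try:
        return [(subject_of(m), body_of(m)) for m in box]
    finally:
        box.close()
```

Feeding the subjects and the bodies through the same frequency count
separately would show how much the two clouds actually differ.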

===============================================================
From: Mike Harrison
------------------------------------------------------
Laughing.. because I had thought the same thing.
add chugalugextract.txt to: http://chugalug.org
And you can have a copy of the MySQL table that my bot attempts to extract
from email and parse into the data that makes up the Chugalug website.
It holds a little over 2k primary messages with replies.

===============================================================
From: Matt Keys
------------------------------------------------------
That'll certainly work for test data, thanks! I think this may be a good
match for the Splunk for IMAP app :)