I have been asked what my favorite Open Source Tools for Data Mining with Statistics support are. In no particular order, other than recall, here they are. Feel free to comment on these, or on any others you like that fall into the same category, and tell me why:

Last post, I mentioned how Beautiful Soup is an elegant way to parse HTML with Google Refine.

Well, it just got better thanks to Iain Sproat's latest commit to Google Refine (and his Java skills are getting better all the time!). If you pull down trunk and build, you'll see that he has integrated the jsoup.org Java library, which takes a Beautiful Soup-like approach to parsing HTML. Iain has done a great job of pushing the jsoup Element stack right up to GREL (Google Refine Expression Language) for concise usage. I love it!

Using jsoup's simple selector syntax, I was able to easily parse out company websites from LinkedIn's public pages. The expression I used says: select the div called data-table that contains the term Website, and return the htmlText of the 2nd <a href>. In Refine, indexing starts at [0], so in this case [1] gives the 2nd href link. The cookbook on the jsoup.org website and its selector-syntax reference are a great place to start learning more.
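To make that selection logic concrete, here is a rough sketch of the same idea in plain Python, using only the standard library's html.parser. It is not jsoup and not GREL, just an illustration of "find the data-table div that mentions Website, then take the link at index 1". I am assuming data-table is a class name, and the sample HTML, the DivLinkCollector class, and the link_text_at function are all made up for the example; real LinkedIn markup will differ.

```python
from html.parser import HTMLParser


class DivLinkCollector(HTMLParser):
    """Collect text and <a> link text for each <div> carrying a given class."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.depth = 0        # > 0 while we are inside a matching div
        self.in_link = False  # True while we are inside an <a> in that div
        self.blocks = []      # one {"text": ..., "links": [...]} per matching div

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.depth:
                self.depth += 1  # nested div inside the current block
            elif self.div_class in classes:
                self.depth = 1   # entering a new matching div
                self.in_link = False
                self.blocks.append({"text": "", "links": []})
        elif tag == "a" and self.depth:
            self.in_link = True
            self.blocks[-1]["links"].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.depth:
            block = self.blocks[-1]
            block["text"] += data
            if self.in_link:
                block["links"][-1] += data


def link_text_at(html, div_class="data-table", term="Website", index=1):
    """Return the text of the link at `index` (zero-based) in the first
    matching div whose text contains `term`, or None if nothing matches."""
    parser = DivLinkCollector(div_class)
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        # Case-insensitive match on the div's text.
        if term.lower() in block["text"].lower() and len(block["links"]) > index:
            return block["links"][index].strip()
    return None


# Hypothetical markup, invented purely for this illustration.
SAMPLE = """
<div class="data-table">
  <dl>
    <dt>Website</dt>
    <dd><a href="/redirect">Example label</a>
        <a href="http://www.example.com">http://www.example.com</a></dd>
  </dl>
</div>
"""

print(link_text_at(SAMPLE))           # index 1 is the 2nd link
print(link_text_at(SAMPLE, index=0))  # index 0 is the 1st link
```

The zero-based index is the part to watch: just as in Refine, asking for [1] gives you the second link, not the first.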

After spending most of the day BANGING my head against Regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML!!!

I know several of you have thought the same thing. So, I took the time today to find out where and how this could be improved, either directly in Google Refine or with an extension.
It just so happens that Google Refine already has a wonderful extension in the form of another language: Jython.

Thad Guidry was born in Louisiana and served in the US Air Force in Anchorage, Alaska for 4 years during Desert Storm. He cut his teeth on computers and programming at the early age of 9 in the University of Southwestern Louisiana (now UL) computer lab, sneaking in to play Star Trek on DEC VAX systems with the local college kids and keeping them smiling with Lemonhead candy as a bribe. He purchased his first computer, a Commodore Plus/4, with savings from picking up aluminum cans over the course of a hot summer. Today, on the job, he still hacks with databases, and off the job he volunteers his time helping non-profit organizations with their marketing and information systems needs.