The Complete Python for Text Analysis

The following set of commands assumes that you begin with a Mac OS X installation that has none of the necessities already installed. You can thus skip anything you have already done: if you have already installed Xcode, for example, skip to Step 2.

Step 1: Install the Xcode development and command line tool environment. You’ll have to get Xcode from the Mac App Store. Supposedly, you can avoid this by simply installing the command line tools (see command below), but I have come across at least one instance where it seemed like I needed to go inside Xcode itself and download and install things from within Preferences. (This was the old way of doing it.) Here’s the terminal command to install the Command Line Tools (a bit redundant, isn’t it?):

xcode-select --install

Nota bene: I continue to see warnings when installing Python and its modules when I have not installed the complete Xcode from the App Store. They look like this:

Warning: xcodebuild exists but failed to execute
Warning: Xcode does not appear to be installed; most ports will likely fail to build.

I am installing the complete setup now on another machine, and I will update this post if anything is borked.

If, like me, you have recently upgraded your operating system and things are borked, then you need to clean out the old installation(s). This means downloading the installer and running it like you did when you were young. It’s still fast and easy. The uninstallation is also fast and easy. Cleaning, however, takes some time. The steps below first document what you have installed before you clean everything out:

At this point, if you are only interested in NLP (natural language processing), you are done.

Optional: If you are going to pull anything from websites, then you can make your life easier by getting Beautiful Soup, which parses HTML for you:

sudo port install py27-beautifulsoup4

(Check for newer versions, as the number may have incremented.)
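To give a sense of what Beautiful Soup does once it is installed, here is a minimal sketch that pulls paragraph text and link targets out of an HTML string. The sample HTML, the `entry` class, and the function name `extract_text_and_links` are made up for illustration, and the parser used is Python's built-in `html.parser`, so nothing beyond Beautiful Soup itself is assumed.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Field Notes</h1>
  <p class="entry">First <b>entry</b> of the day.</p>
  <p class="entry">Second entry.</p>
  <a href="http://example.com/archive">Archive</a>
</body></html>
"""


def extract_text_and_links(html):
    """Return (paragraph texts, link targets) found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() flattens nested tags such as <b> into plain text.
    paragraphs = [p.get_text() for p in soup.find_all("p", class_="entry")]
    # href=True keeps only anchors that actually carry an href attribute.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return paragraphs, links


if __name__ == "__main__":
    texts, links = extract_text_and_links(SAMPLE_HTML)
    print(texts)
    print(links)
```

In real use the HTML string would come from a file or a web request rather than a constant.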

Step 5: If, however, you are interested in network analysis as well as topic modeling and other forms of “big” data analysis, you can also install three Python modules built for that work (NetworkX, Gensim, and pandas):
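After the installs finish, it can be worth confirming that the interpreter you are running actually sees the modules, especially on a machine with more than one Python. What follows is a minimal sketch using only the standard library. It assumes Python 3.4 or later (for `importlib.util.find_spec`), and the helper name `check_modules` is my own. Note that the names listed are import names, which are lowercase and may differ from the package names used at install time.

```python
import importlib.util

# Import names (not installer package names) for the three modules above.
MODULES = ["networkx", "gensim", "pandas"]


def check_modules(names):
    """Map each top-level module name to True if the interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(MODULES).items():
        print("{:<10} {}".format(name, "ok" if found else "MISSING"))
```

A module reported as missing usually means it was installed for a different Python than the one running the script.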

Step 6: You have a pretty powerful analytical toolkit now at your disposal. If you would like to make the user interface a bit more “friendly,” let me suggest that you also install IPython, an interactive Python interpreter, and, the best thing since someone sliced something in order to serve it, the IPython notebook:

I can’t tell you what a joy IPython notebooks are to use: you can copy complete scripts into a code cell and get results by simply hitting SHIFT + ENTER. And everything is captured for you in a space where you can also make notes on what you are doing, or, in my case, trying to do, in Markdown. Everything is saved to a modified JSON file with the extension ipynb. Even better, you can transform the file, using the nbconvert utility, into HTML or LaTeX or PDF. It is very, very nice.
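Because a notebook is JSON underneath, you can read one with nothing more than the standard library. Here is a minimal sketch that collects the source of every code cell. It assumes the version 4 notebook layout, with a top-level `cells` list (older notebooks nest their cells differently), and the sample notebook and the function name `code_cells` are made up for illustration.

```python
import json

# A hand-built stand-in for the contents of a small .ipynb file.
# Assumption: the version 4 layout, with a top-level "cells" list.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Notes\n", "Trying a word count."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["words = 'a b a'.split()\n", "print(len(words))"]},
    ],
}


def code_cells(nb):
    """Return the source of every code cell, one string per cell."""
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            src = cell.get("source", "")
            # "source" may be stored as a list of lines or as a single string.
            sources.append("".join(src) if isinstance(src, list) else src)
    return sources


if __name__ == "__main__":
    text = json.dumps(notebook)          # roughly what sits on disk
    print(code_cells(json.loads(text)))  # read it back and pull out the code
```

With a real file, you would replace the `json.dumps`/`json.loads` round trip with `json.load` on the opened notebook.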

Optional: If you want that LaTeX option for nbconvert to work, you are going to need a functional TeX installation:

sudo port install texlive-latex

Nota bene: In my experience, any TeX installation is big, so if you are in a hurry, either open up another terminal window (or tab), do something in the GUI, or go fix yourself a cup of coffee. It’s going to take a while, and unless staring at the installation log as it scrolls by is your thing, and, hey, it could be, I suggest you let the code take its course and get some other things done.

And, if you need to convert scanned documents into text, the open source OCR application Tesseract is available:
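Once Tesseract is installed, you can drive it from Python by calling its command line program. The sketch below builds the basic command, which takes an image, an output base name, and a language, and runs it only if `tesseract` is found on the PATH. The function names and the file names `scan.png` and `scan` are hypothetical, and the code assumes Python 3.5 or later for `subprocess.run`.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic Tesseract run.

    Tesseract adds the ".txt" extension to output_base on its own.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_to_text(image_path, output_base, lang="eng"):
    """Run Tesseract if it is on the PATH; return the text file path, else None."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # "scan.png" and "scan" are hypothetical file names.
    print(tesseract_command("scan.png", "scan"))
```

Keeping the command in its own function makes it easy to inspect or log before anything is actually run.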
