I want to build an analytics engine on top of an article publishing platform. More specifically, I want to track users' reading behaviour (e.g. number of views of an article, time spent with the article open, ratings, etc.), as well as statistics on the articles themselves (e.g. number of paragraphs, author, etc.).

This will have two purposes:

Present insights about users and articles

Provide recommendations to users

For the data analysis part I've been looking at Cubes, pandas and PyTables. There is a lot of data, and it is stored in MySQL tables; I'm not sure which of these packages would handle such a backend best.
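Of the three, pandas talks to a SQL backend most directly: `read_sql` can pull a query result straight into a DataFrame. A minimal sketch of what that might look like, using sqlite3 as a stand-in for MySQL (with a real setup you'd pass a MySQL connection instead; the table and column names here are invented for illustration):

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for MySQL in this sketch; with MySQL you would
# pass a mysqlclient/PyMySQL connection to read_sql instead.
# "article_views" and its columns are made-up names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_views "
    "(user_id INTEGER, article_id INTEGER, seconds_open REAL)"
)
conn.executemany(
    "INSERT INTO article_views VALUES (?, ?, ?)",
    [(1, 10, 120.0), (1, 11, 45.0), (2, 10, 300.0)],
)

# Pull only the slice you need out of the database for analysis.
views = pd.read_sql("SELECT * FROM article_views", conn)

# Per-article summary: how many views, and mean time spent open.
per_article = views.groupby("article_id")["seconds_open"].agg(["count", "mean"])
```

The point of the sketch is the division of labour: the database stores everything, and pandas only materialises the query result you are currently analysing.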

For the recommendation part, I'm simply thinking about feeding data from the data analysis engine to a clustering model.

Any recommendations on how to put all this together, or cool Python projects out there that can help me out?
Please let me know if I should give more information.

The first aspects you want to track (number of views, paragraphs, authors, time spent reading) can be computed as summary statistics, e.g. means and standard deviations. NumPy can give you a hand with computing these on n-dimensional data arrays.
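As a quick sketch of that idea (the array shape and numbers are made up: rows are users, columns are tracked quantities for one article):

```python
import numpy as np

# Hypothetical per-user matrix for one article:
# columns are (views, paragraphs read, seconds spent).
data = np.array([
    [3, 12, 180.0],
    [1,  4,  40.0],
    [5, 20, 600.0],
])

means = data.mean(axis=0)  # per-column average across users
stds = data.std(axis=0)    # per-column standard deviation
```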
– Thomas Vincent, Sep 18 '12 at 18:17

For clustering, or data mining more generally, you'll first need a relevant question to ask of the data, e.g. "how can we relate article features to readers' descriptors?", for which you could use association rule learning. If your question was more about the design of your data analysis layer, I'd advise you to separate the core analysis functions from the reporting code. Within the core analysis module, try to represent your data with only numpy arrays (which can handle strings). For the rest, it depends on the questions you want to answer, which will define your specifications.
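On the "numpy arrays can handle strings" point, a structured array is one way to keep article metadata and reading stats together in a single core-analysis structure; a small sketch (the field names and values are invented for illustration):

```python
import numpy as np

# A structured array mixes strings and numbers in one array,
# so author names can sit next to numeric reading stats.
articles = np.array(
    [("alice", 14, 350.0), ("bob", 6, 90.0)],
    dtype=[("author", "U20"), ("paragraphs", "i4"), ("mean_seconds", "f8")],
)

# Boolean indexing works on the named fields as usual.
long_reads = articles[articles["mean_seconds"] > 100]
```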
– Thomas Vincent, Sep 18 '12 at 18:33

So what you're recommending is that I keep my data in the MySQL database; whenever I need to do some statistical analysis and reporting, I pull what I need from the db into numpy data structures, then use those structures to train my machine learning models. Right?
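The last step of that pipeline could be sketched like this, with a tiny numpy-only k-means standing in for the real model (in practice scikit-learn's KMeans would be the usual choice; the feature matrix here is made up and represents rows already pulled out of the database):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means sketch: Lloyd's algorithm with numpy only."""
    # Start from k distinct rows of X as initial centres.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each row to its nearest centre (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned rows.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical reader features (articles read, seconds spent):
# two skimmers and two deep readers.
X = np.array([[1.0, 30.0], [1.2, 40.0], [9.0, 600.0], [8.5, 550.0]])
labels, centers = kmeans(X, k=2)
```

The design choice being illustrated is exactly the one in the comment above: the database holds the raw events, numpy holds the working feature matrix, and the model only ever sees the latter.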
– user1491915, Sep 19 '12 at 10:36

Where to keep your data depends on the life-cycle of your information and its volume. Keeping it all in a single place is simpler and easier to implement, but if you have a huge amount of data, it may be worth thinking about a more complex data model. It's often best to start with something simple that meets your specifications, then optimize and revise the design if actual usage requires it.
– Thomas Vincent, Sep 19 '12 at 11:34