Tag: python

For Christmas my wife and I brought each other a new FitBit One device (Amazon affiliate link included). These are small fitness tracking devices that monitor the number of steps you take, how high you climb and how well you sleep. They’re great for providing motivation to walk that extra bit further, or to take the stairs rather than the lift.

I’ve only had the device for less than a week, but already I’m feeling the benefit of the gamification on FitBit.com. As well as monitoring your fitness it also provides you with goals, achievements and competitions against your friends. The big advantage of the FitBit One over the previous models is that it syncs to recent iPhones, iPads, as well as some Android phones. This means that your computer doesn’t need to be on, and often it will sync without you having to do anything. In the worst case you just have to open the FitBit app to update your stats on the website. Battery life seems good, at about a week.

The FitBit apps sync your data directly to FitBit.com, which is great for seeing your progress quickly. They also provide an API for developers to provide interesting ways to process the data captured by the FitBit device. One glaring omission from the API is any way to get access to the minute by minute data. For a fee of $50 per year you can become a Premium member which allows you do to a CSV export of the raw data. Holding the data, collected by a user hostage is deeply suspect and FitBit should be ashamed of themselves for making this a paid for feature. I have no problem with the rest of the features in the Premium subscription being paid for, but your own raw data should be freely available.

The FitBit API does have the ability to give you the intraday data, but this is not part of the open API and instead is part of the ‘Partner API’. This does not require payment, but you do need to explain to FitBit why you need access to this API call and what you intend to do with it. I do not believe that they would give you access if your goal was to provide a free alternative to the Premium export function.

So, has the free software community provided a solution? A quick search revealed that the GitHub user Wadey had created a library that uses the urls used by the graphs on the FitBit website to extract the intraday data. Unfortunately the library hadn’t been updated in the last three years and a change to the FitBit website had broken it.

Fortunately the changes required to make it work are relatively straightforward, so a fixed version of the library is now available as andrewjw/python-fitbit. The old version of the library relied on you logging into to FitBit.com and extracting some values from the cookies. Instead I take your email address and password and fake a request to the log in page. This captures all of the cookies that are set, and will only break if the log in form elements change.

Another change I made was to extend the example dump.py script. The previous version just dumped the previous day’s values, which is not useful if you want to extract your entire history. In my new version it exports data for every day that you’ve been using your FitBit. It also incrementally updates your data dump if you run it irregularly.

If you’re using Windows you’ll need both Python and Git installed. Once you’ve done that check out my repository at github.com/andrewjw/python-fitbit. Lastly, in the newly checked out directory run python examples/dump.py <email> <password> <dump directory>.

Many websites have some form of recommendation system. While it’s simple to create a recommendation system for small amounts of data, how do you create a system that scales to huge amounts of data?

How to actually calculate the similarity of two items is a complicated topic with many possible solutions. Which one if appropriate depends on your particularly application. If you want to find out more I suggest reading the excellent Programming Collective Intelligence (Amazon affiliate link) by Toby Segaran.

We’ll take the simplest method for calculating similarity and just calculate the percentage of users who have visited both pages compared to the total number who have visited either. If we have Page 1 that was visited by user A, B and C and Page 2 that was visited by A, C and D then the A and C visited both, but A, B, C and D visited either one so the similarity is 50%.

With thousands or millions of items and millions or billions of views calculating the similarity between items becomes a difficult problem. Fortunately MongoDB’s sharding and replication allow us to scale the calculations to cope with these large datasets.

First let’s create a set of views across a number of items. A view is stored as a single document in MongoDB. You would probably want to include extra information such as the time of the view, but for our purposes this is all that is required.

The first step is to process this list of view of events so we can take a single item and get a list of all the users that have viewed it. To make sure this scales over a large number of views we’ll use MongoDB’s map/reduce functionality.

We’ll build a javascript Object where the keys are the user id and the value is the number of time that user has viewed this item. In the map function we we build an object that represents a single view and emit it using the item id as the key. MongoDB will group all the objects emitted with the same key and run the reduce function, shown below.

A reduce function takes two parameters, the key and a list of values. The values that are passed in can either be those emitted by the map function, or values returned from the reduce function. To help it scale not all of the original values will be processed at once, and the reduce function must be able to handle input from the map function or its own output. Here we output a value in the same format as the input so we don’t need to do anything special.

The final step is to run the functions we’ve just created and output the data into a new collection. Here we’re recalculating all the data each time this function is run. To scale properly you should filter the input based on the date the view occurred and merge it with the output collection, rather than replacing it as we are doing here.

Now we need calculate a matrix of similarity values, linking each item with every other item. First lets see how we can calculate the similarity of all items to one single item. Again we’ll use map/reduce to help spread the load of running this calculation. Here we’ll just use the map part of map/reduce because each input document will be represented by a single output document.

The input to our Python function is a document that was outputted by our previous map/reduce call. We build a new Javascript by interpolating some data from this document into a template function. We loop through all the users who viewed the document we’re comparing against and work out whether they have viewed both. At the end of the function we emit the percentage of users who viewed both.

reduce_func = """
function (key, values) {
return results[0];
}
"""

Because we output unique ids in the map function this reduce function will only be called with a single value so we just return that.

The last step in this function is to run the map reduce. Here as we’re running the map/reduce multiple times we need to merge the output rather than replacing it as we did before.

The final step is to loop through the output from our first map/reduce and call our second function for each item.

for doc in db.item_user_view_count.find():
similarity(doc)

A key thing to realise is that you don’t need to calculate live similarity data. Once you have even a few hundred views per item then the similarity will remain fairly consistent. In this example we step through each item in turn and calculate the similarity for it with every other item. For a million item database where each iteration of this loop takes one second the similarity data will be updated once every 11 days.

I’m not claiming that you can take the code provided here and immediately have a massively scalable system. MongoDB provides an easy to use replication and sharding system, which are plugged in to its Map/Reduce framework. What you should take away is that by using map/reduce with sharding and replication to calculate the similarity between two items we can quickly get a system that scales well with an increasing number of items and of views.

A little while ago I was asked what my biggest gripe with Django was. At the time I couldn’t think of a good answer because since I started using Django in the pre-1.0 days most of the rough edges have been smoothed. Yesterday though, I encountered an error that made me wish I thought of it at the time.

The error that was raised was AttributeError: 'NoneType' object has no attribute 'Model'. This means that rather than containing a module object, models was None. Clearly this is impossible as the class could not have been created if that was the case. Impossible or not, it was clearly happening.

Adding a print statement to the module showed that when it was imported the models variable did contain the expected module object. What that also showed was that module was being imported more than once, something that should also be impossible.

After a wild goose chase investigating reasons why the module might be imported twice I tracked it down to the load_app method in django/db/models/loading.py. The code there looks something like this:

Now I’m being a harsh here, and the exception handler does contain a comment about working out if it should reraise the exception. The issue here is that it wasn’t raising the exception, and it’s really not clear why. It turns out that I had a misspelt module name in an import statement in a different module. This raised an ImportError which was caught, hidden and then Django repeatedly attempted to import the models as they were referenced in the models of other apps. The strange exception that was originally encountered is probably an artefact of Python’s garbage collection, although how exactly it occurred is still not clear to me.

There are a number of tickets (#6379, #14130 and probably others) on this topic. A common refrain in Python is that it’s easier to ask for forgiveness than to ask for permission, and I certainly agree with Django and follow that most of the time.

I always follow the rule that try/except clauses should cover as little code as possible. Consider the following piece of code.

Which of the three attribute accesses are we actually trying to catch here? Handling exceptions like this are a useful way of implementing Duck Typing while following the easier to ask forgiveness principle. What this code doesn’t make clear is which member or method is actually optional. A better way to write this would be:

Now the code is very clear that the var variable may or may not have a member member variable. If method1 or method2 do not exist then the exception is not masked and is passed on. Now lets consider that we want to allow the method1 attribute to be optional.

try:
var.method1()
except AttributeError:
# handle error

At first glance it’s obvious that method1 is optional, but actually we’re catching too much here. If there is a bug in method1 that causes an AttributeError to raised then this will be masked and the code will treat it as if method1 didn’t exist. A better piece of code would be:

ImportErrors are similar because code can be executed, but then when an error occurs you can’t tell whether the original import failed or whether an import inside that failed. Unlike with an AttributeError there is a no easy way to rewrite the code to only catch the error you’re interested in. Python does provide some tools to divide the import process into steps, so you can tell whether the module exists before attempting to import it. In particular the imp.find_module function would be useful.

Changing Django to avoid catching the wrong ImportErrors will greatly complicate the code. It would also introduce the danger that the algorithm used would not match the one used by Python. So, what’s the moral of this story? Never catch more exceptions than you intended to, and if you get some really odd errors in your Django site watch out for ImportErrors.

A hobby project of mine would be made much easier if I could run the same code on the server as I run in the web browser. Projects like Node.js have made Javascript on the server a more realistic prospect, but I don’t want to give up on Python and Django, my preferred web development tools.

The obvious solution to this problem is to embed Javascript in Python and to call the key bits of Javascript code from Python. There are two major Javascript interpreters, Mozilla’s SpiderMonkey and Google’s V8. Unfortunately the python-spidermonkey project is dead and there’s no way of telling if it works with later version of SpiderMonkey. The PyV8 project by contrast is still undergoing active development.

Although PyV8 has a wiki page entitled How To Build it’s not simple to get the project built. They recommend using prebuilt packages, but there are none for recent version of Ubuntu. In this post I’ll describe how to build it on Ubuntu 11.11 and give a simple example of it in action.

The first step is make sure you have the appropriate packages. There may be others that are required and not part of the default install, but there are what I had to install.

sudo aptitude install scons libboost-python-dev

Next you need to checkout both the V8 and PyV8 projects using the commands below.

The key step before building PyV8 is to set the V8_HOME environment variable to the directory where you checked out the V8 code. This allows PyV8 to patch V8 and build it as a static library rather than the default dynamic library. Once you’ve set that you can use the standard Python setup.py commands to build and install the library.

In future I’ll write more detailed posts about how to use PyV8, but let’s start with a simple example. Mustache is a simple template language that is ideal when you want to create templates in Javascript. There’s actually a Python implementation of Mustache, but let’s pretend that it doesn’t exist.

To start import the PyV8 library and create a JSContext object. These are equivalent to sub-interpreters so you have several instance of your Javascript code running at once.

>>> import PyV8
>>> ctxt = PyV8.JSContext()

Before you can run any Javascript code you need enter() the context. You should also exit() it when you are complete. JSContext objects can be used with with statements to automate this, but for a console session it’s simplest to call the method explicitly. Next we call eval() to run our Javascript code, first by reading in the Mustache library and then to set up our template as a variable.

There’s much more to PyV8 than I’ve described in this post, including calling Python code from Javascript but unfortunately the V8 and PyV8 documentation is a bit lacking. I will post some more of my discoveries in future posts.

Recently I was taking part in a review of some Python code. One aspect of the code really stuck out to me. It’s not a structural issue, but a minor change in programming style that can greatly improve the maintainability of the code.

The code in general was quite good, but a code snippet similar to that given below jumped right to the top of my list of things to be fixed. Why is this so bad? Let us first consider what exceptions are and why you might use them in Python.

try:
// code
except Exception, e:
// error handling code

Exceptions are a way of breaking out the normal program flow when an ‘exceptional’ condition arises. Typically this is used when errors occur, but exceptions can also be used as an easy way to break out of normal flow during normal but unusual conditions. In a limited set of situations it can make program flow clearer.

What does this code do though? It catches all exceptions, runs the error handling code and continues like nothing has happened. In all probability it’s only one or two errors that are expected and should be handled. Any other errors should be passed on a cause the program to actually crash so it can be debugged properly.

This code has a bug, the missing e in the do_analysis call. This will raise a NameError that will be immediately captured and hidden. Other, more complicated errors could also occur and be hidden in the same way. This sort of masking will make tracking down problems like this very difficult.

To improve this code we need to consider what errors we expect the do_analysis function to raise and what we want to handle. In the ideal case it would raise an AnalysisError and then we would catch that.

In the improved code the NameError will pass through and be picked up immediately. It is likely that the cleanup function needs to be run whether or not an error has occurred. To do that we can move the call into a finally block.

This allows us to handle a very specific error and ensure that we clean up whatever error happens. Sometimes cleaning up whatever the exception (or in the event of no exception) is required, and in this case the finally block, which is always run, is the right place for this code.

We’re looking up the parameter to do_analysis in a dictionary and catching the case where index doesn’t exist. This code is also capturing too much. Not because the exception is too general, but because there is too much code in the try block.

The issue with this code is what happens if do_analysis raises a KeyError? To capture the exceptions that we’re expecting we need to only wrap the dictionary lookup in and not catch anything from the analysis call.

In this series I’ll show you how to build a search engine using standard Python tools like Django, Whoosh and CouchDB. In this post we’ll begin by creating the data structure for storing the pages in the database, and write the first parts of the webcrawler.

CouchDB’s Python library has a simple ORM system that makes it easy to convert between the JSON objects stored in the database and a Python object.

To create the class you just need to specify the names of the fields, and their type. So, what do a search engine need to store? The url is an obvious one, as is the content of the page. We also need to know when we last accessed the page. To make things easier we’ll also have a list of the urls that the page links to. One of the great advantages of a database like CouchDB is that we don’t need to create a separate table to hold the links, we can just include them directly in the main document. To help return the best pages we’ll use a page rank like algorithm to rank the page, so we also need to store that rank. Finally, as is good practice on CouchDB we’ll give the document a type field so we can write views that only target this document type.

That’s a lot of description for not a lot of code! Just add that class to your models.py file. It’s not a normal Django model, but we’re not using Django models in this project so it’s the right place to put it.

We also need to keep track of the urls that we are and aren’t allowed to access. Fortunately for us Python comes with a class, RobotFileParser which handles the parsing of the file for us. So, this becomes a much simpler model. We just need the domain name, and a pickled RobotFileParser instance. We also need to know whether we’re accessing an http or https and we’ll give it type field to distinguish it from the Page model.

We want to make the pickle/unpickle process transparent so we’ll create a property that hides the underlying pickle representation. CouchDB can’t store the binary pickle value, so we also base64 encode it and store that instead. If the object hasn’t been set yet then we create a new one on the first access.

For both pages and robots.txt files we need to know whether we should reaccess the page. We’ll do this by testing whether the we accessed the file in the last seven days of not. For Page models we do this by adding the following function which implements this check.

For the RobotsTxt we can take advantage of the last modified value stored in the RobotFileParser that we’re wrapping. This is a unix timestamp so the is_valid function needs to be a little bit different, but follows the same pattern.

To update the stored copy of a robots.txt we need to get the currently stored version, read a new one, set the last modified timestamp and then write it back to the database. To avoid hitting the same server too often we can use Django’s cache to store a value for ten seconds, and sleep if that value already exists.

In this post we’ve discussed how to represent a webpage in our database as well as keep track of what paths we are and aren’t allowed to access. We’ve also seen how to retrieve the robots.txt files and update them if they’re too old.

Now that we can test whether we’re allowed to access a URL in the next post in this series I’ll show you how to begin crawling the web and populating our database.

Ok, let’s get this out of the way right at the start – the title is a huge overstatement. This series of posts will show you how to create a search engine using standard Python tools like Django, Celery and Whoosh with CouchDB as the backend.

Celery is a message passing library that makes it really easy to run background tasks and to spread them across a number of nodes. The most recent release added the NoSQL database CouchDB as a possible backend. I’m a huge fan of CouchDB, and the idea of running both my database and message passing backend on the same software really appealed to me. Unfortunately the documentation doesn’t make it clear what you need to do to get CouchDB working, and what the downsides are. I decided to write this series partly to explain how Celery and CouchDB work, but also to experiment with using them together.

In this series I’m going to talk about setting up Celery to work with Django, using CouchDB as a backend. I’m also going to show you how to use Celery to create a web-crawler. We’ll then index the crawled pages using Whoosh and use a PageRank-like algorithm to help rank the results. Finally, we’ll attach a simple Django frontend to the search engine for querying it.

Let’s consider what we need to implement for our webcrawler to work, and be a good citizen of the internet. First and foremost is that we must be read and respect robots.txt. This is a file that specifies what areas of a site crawlers are banned from. We must also not hit a site too hard, or too often. It is very easy to write a crawler than repeatedly hits a site, and requests the same document over and over again. These are very big no-noes. Lastly we must make sure that we use a custom User Agent so our bot is identifiable.

We’ll divide the algorithm for our webcrawler into three parts. Firstly we’ll need a set of urls. The crawler picks a url, retrieves the page then store it in the database. The second stage takes the page content, parses it for links, and adds the links to the set of urls to be crawled. The final stage is to index the retrieved text. This is done by watching for pages that are retrieved by the first stage, and adding them to the full text index.

Celery’s allows you to create ‘tasks’. These are units of work that are triggered by a piece of code and then executed, after a period of time, on any node in your system. For the crawler we’ll need two seperate tasks. The first retrieves and stores a given url. When it completes it will triggers a second task, one that parses the links from the page. To begin the process we’ll need to use an external command to feed some initial urls into the system, but after that it will continuously crawl until it runs out of links. A real search engine would want to monitor its index for stale pages and reload those, but I won’t implement that in this example.

I’m going to assume that you have a decent level of knowledge about Python and Django, so you might want to read some tutorials on those first. If you’re following along at home, create yourself a blank Django project with a single app inside. You’ll also need to install django-celery, the CouchDB Python library, and have a working install of CouchDB available.