The API (Application Programming Interface) for
PyMongo, the Python driver for
MongoDB, has been pretty stable for quite some
time now. In general, I think that the driver does a great job of
exposing all of MongoDB’s functionality in a way that is both
MongoDB-ish and Pythonic (of course I’m biased — if you have suggestions
for improvements please let me know). There have been some minor tweaks
recently, like an improvement for the command
API,
but there haven’t been any huge changes in a while.

All that said, I think there is one place where PyMongo has been
deficient for a long time: its support for
GridFS. There are a couple of
problems with the GridFS implementation in previous versions (<= 1.5.2) of PyMongo:

- It is slower and less concurrency-friendly than it needs to be.

- It could be much simpler and easier to work with.

- It allows some operations (modifying existing files) that are incorrect according to the GridFS semantics.

I think that all of these deficiencies stem from one fatal flaw in the
original API design: it was trying too hard to mimic Python’s
filesystem API. Exposing file-like objects for writing and reading is
great, but things like focusing on filename as a file-handle (filename
is less important in GridFS) and allowing users to modify files (which
is not a concept supported by GridFS) were bad decisions. This post
introduces a new GridFS API for PyMongo which tries to address all of
the deficiencies in the old implementation.

The New API

The new GridFS API is available in PyMongo versions >= 1.6. There are API
docs
available for the new version of GridFS, but this post walks through the
API in a little more detail.

Almost all of the API is exposed through the GridFS class; here’s an
example instantiation:

You can also use an alternate root collection for GridFS by passing a
collection name as the second argument to GridFS. Given an
instance of GridFS, creating new files and getting data from existing
ones is easy:

The put method takes a string or file-like object
containing data to be written to a GridFS file, and returns the
\_id of the newly created file. It also accepts keyword
arguments for any of the fields available in the GridFS file spec. So to
insert the contents of the local file myimage.jpg with the content
type “image/jpeg” and the filename “myimage” we would do:

The get method takes an \_id of a file in
GridFS, and returns a file-like object (an instance of
gridfs.grid\_file.GridOut) that can be used to read that
file’s data. If there is no file with the given \_id an
exception is raised:

Advanced Usage

While put should cover most of the use cases that opening a file in
write mode used to handle, there may be some cases where you don’t want to
write all of the file’s data at once. In that case you can use the
new\_file method to get a new
gridfs.grid\_file.GridIn instance which you can write to.
When you’re done with the instance call close (or just use
Python 2.6’s with statement to handle it for you). Like
put, new\_file takes keyword arguments for any
of the values in the GridFS file spec. A cool note about both methods is
that any unrecognized keyword arguments (in this example,
location) will automatically be set as attributes on the
underlying file document:

Most of the examples above have been referencing files in GridFS by
their \_id. The old API made more of a point of emphasizing
filename as the primary mode of access. While the new API
encourages the better practice of referencing by \_id,
there are still some cases where you might need to work with files based
on filename, or where your application treats
filename as a unique identifier.

The way to work with filenames in the new API is to create a new GridFS
file with the same filename every time you need to modify a file. Since
files in GridFS store their upload date, we can always get the most
recent version of a file by filename. We can also reference by
\_id to get any previous version — we can treat GridFS as a
versioned filestore. The method to get the last version of a file by
name is get\_last\_version:

Performance

I’ve only done some very basic benchmarking of the new GridFS
implementation. It doesn’t appear to be a drastic improvement when
writing large files, but writing small files is about four times faster
in a simple benchmark of mine. This is because the simplification of the
API has reduced the per-file overhead dramatically. There are also huge
performance improvements when uploading a large file from disk or some
other file-like source (especially if it was being done naively using
the old API), as the new API automatically handles streaming from a
file-like source into chunk-sized buffers.

Reading small files is about 10% faster with the new API, while there
doesn’t seem to be much difference with reading large files. A big
difference in performance for both reads and writes is that concurrency
is vastly improved. Since the GridFS semantics are handled correctly,
there is no longer a per-file lock for access; concurrent operations
are now fully supported and safe (with the notable exception of deleting
a file, which could cause concurrent readers to see partial data).

If anybody has any ideas for benchmarks, or existing benchmarks for the
old API, I’d love to see how they compare — please leave a note in the
comments.

Deprecation of the Old API

In the current state of the master branch I have removed the old GridFS
API completely, raising an UnsupportedAPI exception when
attempts are made to use it. Normally I would go through a deprecation
window, but I’m considering just releasing like this. My reasoning is
that it’s a big change and I want people to be making it as soon as
possible. There is also a problem that mixed use of the APIs could be
unsafe (in general, any use of the old API can be unsafe, if it’s used
to overwrite existing files). Let me know your thoughts on this
decision as well. I’m hoping to get this right, as I think it will be a
big improvement to PyMongo!