Tuesday, May 7, 2013

Local PyPI Options

Having a central package repository has helped the Python community immensely by enabling sharing of reusable code. There are a few issues that arise when you start depending on such a resource, though, that may need to be solved:

make your installs resilient against Internet/PyPI issues,

speed up your installs significantly (after the first one),

prevent problems installing packages that are removed from distribution by the author,

allow installation of packages from within a firewalled environment where the host performing the installation does not have Internet access, and

allow hosting and installation of private packages.

All while imposing as little overhead on package users as possible (i.e. maintenance of a system performing the above should be either low or none).

Searching for "PyPI" on the package repository is somewhat daunting (the page of results seems to go on forever). Having done a bit of a survey of the top hits, there seem to be only a few packages relevant to the above requirements (presented here in the order that the PyPI search ranks them):

Flask-Pypi-Proxy - A semi-proxy that supports private package upload. Its dependencies are quite hefty and it does not mirror packages locally.

pyramidpypi - "This is a very simple pypi-like server written with the pyramid web framework." Pyramid is a very hefty dependency for such a simple server and it only supports private package upload.

simplepypi - a very simple local repository allowing upload of packages and installation of them.

yopypi - a "load balancer" which punts requests to a mirror automatically when the primary PyPI is unavailable.

djangopypi / djangopypi2 - are both PyPI servers acting as local repositories with the same user interface as the real thing. No proxying, though there is a manual tool infi.pypi_manager which may be used to mirror packages to a local djangopypi.

pypiserver - serves files out of local directories or redirects to the real server if not found. Handles upload of private packages. No proxying for missing packages, though it does have a facility for updating packages which are already in the local directories.

pyshop - another private repository implementation with access controls built in. It also acts as a caching proxy for packages not present locally. Hefty dependencies (Pyramid but also an SQL database).

spynepi - a proxying server with local storage which also handles local upload of private packages! Written in Twisted, using "spyne", which is some RPC mechanism; I don't know what it's got to do with PyPI serving. Hefty dependencies.

chishop - another simple local repository with upload written in Django.

ClueReleaseManager - yet another local repository, though with full meta-data support and what appears to be proxying of PyPI meta-data, but not files.

pyroxy - a proxying index server which can serve local files (but without local caching of proxied files).

scrambled - a very simple server of local files (point it at a directory and run).

devpi-server - a transparent caching proxy with local storage of the files accessed. Uses a redis database, which is an additional dependency that is a problem in my deployment scenario.

collective.eggproxy - implements caching proxy but has hefty dependencies. Also seems to be very fetch-happy, retrieving eggs I don't actually need.

A lot of the implementations above have a bunch of user controls built into them, and there's an awful lot of "simple PyPI in framework X" implementations. Most of the "proxy" solutions (save pyshop, devpi-server and collective.eggproxy) require manual download of the package files, or they just proxy their requests through to the Internet with no local file storage for speed/resilience. Those three exceptions had dependencies that prevented me from easily installing them into my target environment.

None of them fit the bill, and none appeared to be easily modifiable to do what I wanted. So I wrote my own: proxypypi :-)

When proxypypi is asked about a package it doesn't know about, it automatically goes off and fetches the file download list for the package, rewriting all references (PyPI and external) so they appear to be local. When one of those now-local package files is requested, it performs a background fetch of the file contents and serves the new file data to the pip request as it arrives (thus keeping that request alive despite its very short timeout duration).
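The reference-rewriting half of that can be sketched in a few lines. This is only an illustration of the idea, not proxypypi's actual code; the function name and the "/packages/" prefix are assumptions:

```python
import re

def rewrite_links(html, local_prefix="/packages/"):
    """Rewrite absolute download URLs in a PyPI "simple" index page so
    they appear to be local to the proxy.  Illustrative sketch only;
    names and URL scheme here are made up, not proxypypi's API."""
    def repl(match):
        url = match.group(1)
        # keep just the filename and serve it from the local prefix
        filename = url.rsplit("/", 1)[-1]
        return 'href="%s%s"' % (local_prefix, filename)
    return re.sub(r'href="(https?://[^"]+)"', repl, html)

page = '<a href="https://pypi.python.org/packages/source/d/demo/demo-1.0.tar.gz">demo-1.0.tar.gz</a>'
print(rewrite_links(page))
# -> <a href="/packages/demo-1.0.tar.gz">demo-1.0.tar.gz</a>
```

pip then follows those now-local URLs back to the proxy, which fetches the real file in the background on first request and serves it from disk thereafter.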

9 comments:

When I was looking to do the same thing, I ended up working with Brian Jones' MinistryOfPackages[1]. I helped work out a few bugs but later lost steam with the whole project. Probably a bit late for a recommendation but thought I'd point it out.

Looking at what you wrote about proxypypi, pyshop has the same functionality: it mirrors a package on its first request (it downloads it and then serves it itself; depending on how many versions it has to create entries for in the db, this might take a while). Also, on request it fetches new versions of packages already being mirrored.

Thanks for clarifying that. I've modified the blog post to give a more accurate description (though there appears to be some caching preventing me from seeing the edit ...). Please consider clarifying your project description to make it clearer that it has this capability.

An issue (for me) with pyshop is the hefty dependencies: Pyramid, for one (for such a simple web app, Pyramid is a whole lotta framework), and the SQL database. The second is a real issue with my deployment system; that's why my implementation uses just the filesystem.

When using Flask-Pypi-Proxy, the index page of the package (http://miserver/simple/Flask) won't be served from the local copy; the data will be taken from PyPI instead. This is to take into account that there might be new versions that aren't in your local repo. To change this, there is a configuration option named SHOULD_USE_EXISTING (check it here: https://flask-pypi-proxy.readthedocs.org/en/latest/installing.html#configuration)

But once a package has been downloaded, if the package exists locally then it will be returned instead of being downloaded from PyPI. For example, the first request for Flask will go to PyPI, and the package will be downloaded and stored locally. After that, if the same version of the Flask package is requested, the copy present on the server will be returned.
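For reference, flipping that behaviour is just a matter of setting the option in the proxy's configuration. A sketch of what the setting might look like; only the SHOULD_USE_EXISTING name comes from the docs linked above, the surrounding config module is an assumption:

```python
# Hypothetical Flask-Pypi-Proxy settings module (illustrative only;
# consult the project's configuration docs for the real mechanism).
SHOULD_USE_EXISTING = True  # serve the locally mirrored index page for a
                            # package instead of always asking PyPI for
                            # its version list
```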

There's also collective.eggproxy, which I found to be completely unusable (an attempt to pip install lxml using it as the index blocked while eggproxy tried to download all versions of binary eggs for all platforms, until I got tired and aborted after a mere hour, with a gig's worth of useless files in the cache).

The apps / utils recommended here are nice and full-featured, but as pointed out, they're all fairly heavy implementations for what is actually quite a simple task: making Python packages available on a private index.

It's possible to set up a "private PyPI" using nothing but a vanilla Apache vhost (with "Options +Indexes"), if you don't mind manually uploading packages to the right directory each time you have a new package/version:

I wrote an article that expands on this, going through how you can split off chunks of code from a big Python codebase, into packages that can be maintained on a "private PyPI" - hope it's useful for others who are trying to do this: