Moshe’z
Where Moshe Z knows your name
https://moshez.wordpress.com

moshez, http://moshez.wordpress.com/?p=1294, 2016-12-08T03:03:12Z

The assumption here is that you have my phone number. If you don’t have my phone number, and you think that’s an oversight on my part, please send me an e-mail at zadka.moshe@gmail.com and ask for it. If you don’t have my phone number because I don’t know you, I am usually pretty responsive on e-mail.

moshez, http://moshez.wordpress.com/?p=1283, 2016-12-02T05:34:28Z

When writing unit tests, it is good to call functions with “mocks” or “fakes”: objects with an equivalent interface but a simple, “fake” implementation. For example, instead of a real socket object, something that has recv() but returns “hello” the first time, and an empty string the second time. This is great! Instead of testing the vagaries of the other side of a socket connection, you can focus on testing your code, and force your code to handle corner cases, like recv() returning partial messages, that happen rarely on the same host (but not so rarely in more complex network environments).
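As a rough sketch of the kind of fake described above: FakeSocket and read_all are hypothetical names invented for this illustration, and the fake deliberately ignores the buffer size so that the test can script partial messages.

```python
class FakeSocket:
    """Test double exposing only recv(); hypothetical, for illustration."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def recv(self, bufsize):
        # bufsize is ignored: the test decides how the data is split up.
        if self._chunks:
            return self._chunks.pop(0)
        # An empty result means the other side closed the connection.
        return b""


def read_all(sock, bufsize=4096):
    """Hypothetical code under test: read until the peer closes."""
    parts = []
    while True:
        data = sock.recv(bufsize)
        if not data:
            break
        parts.append(data)
    return b"".join(parts)


# Force the "partial message" corner case: the greeting arrives in two pieces.
message = read_all(FakeSocket([b"hel", b"lo"]))
```

The same fake can script any split of the data, which is hard to arrange with a real socket on a single host.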

There is one OS interface which it is wise not to mock — the venerable UNIX file system. Mocking the file system is the classic case of low-ROI effort:

It is easy to isolate: if functions get a parameter of “which directory to work inside”, tests can use a per-suite temporary directory. Directories are cheap to create and destroy.

It is reliable: the file system rarely fails — and if it does, your code is likely to get weird crashes anyway.

The surface area is enormous: open(), but also os.open, os.mkdir, os.rename, os.mknod, shutil.copytree and others, plus modules calling out to C functions which call out to C’s fopen().

The first two items decrease the Return, since mocking the file system does not make the tests easier to write or the test run more reproducible, while the last one increases the Investment.
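A minimal sketch of the alternative, assuming the code under test takes the directory to work in as a parameter. write_report is a hypothetical function invented for this example; the test uses the real file system inside a throwaway directory.

```python
import os
import tempfile


def write_report(directory, name, text):
    """Hypothetical code under test: it is told which directory to use."""
    path = os.path.join(directory, name)
    tmp_path = path + ".tmp"
    # Write to a temporary name first, then rename into place.
    with open(tmp_path, "w") as fp:
        fp.write(text)
    os.rename(tmp_path, path)
    return path


# The test works against a real, temporary directory: no mocks needed.
with tempfile.TemporaryDirectory() as workdir:
    path = write_report(workdir, "report.txt", "all good\n")
    with open(path) as fp:
        content = fp.read()
    leftovers = sorted(os.listdir(workdir))
```

Note that a mock would have had to cover both open() and os.rename() to test even this small function.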

Do not mock the file system, or it will mock you back.

moshez, http://moshez.wordpress.com/?p=1254, 2016-11-19T05:11:36Z

Recently I have been talking about deploying Python, and some people had the reasonable question: if a .pex file is used for isolating dependencies, and a Docker container is used for isolating dependencies, why use both? Isn’t it redundant?

Why use containers?

I really like glyph’s explanation for containers: they isolate not just the filesystem stack but the processes and the network, giving a lot of the power that UNIX was supposed to give but missed out on. Containers isolate the file system, making it easier for code to write/read files from known locations. For example, its log files will be carefully segregated, and can be moved to arbitrary places by the operator without touching the code.

The other part is that none of the reasonable options packages Python itself. This means that a pex file would still have to be tested with multiple Pythons, and perhaps check at start-up that it is using the right interpreter. If PyPy is the right choice, it is a choice the operator would have to make and implement.

Why use pex?

Containers are an easy sell. They are right on the hype train. But if we use containers, what use is pex?

In order to explain, it is worthwhile comparing a correctly built runtime container that is not using pex, with one that is: (parts that are not relevant have been removed)

Note that in the first option, we are left with extra gunk in the /wheelhouse directory. Note also that we still have to have pip and virtualenv installed in the runtime container. Pex files bring the double-dutch philosophy to its logical conclusion: do even more of the build on the builder side, do even less of it on the runtime side.

Introduction

WSGI has been a successful standard. Very successful. It allows people to write Python applications using many frameworks (Django, Pyramid, Flask and Bottle, to name but a few) and deploy using many different servers (uwsgi, gunicorn and Apache).

Twisted makes a good WSGI container. Like Gunicorn, it is pure Python, simplifying deployment. Like Apache, it sports a production-grade web server that does not need a front end.

Modern web applications tend to be complex beasts. In order to be trusted by users, they need to have TLS support, signed by a trusted CA. They also need to transmit a lot of static resources — images, CSS and JavaScript files, even if all HTML is dynamically generated. Deploying them often requires complicated set-ups.

Containers

Container images allow us to package an application with all of its dependencies. It is often tempting to use them for configuration management as well. However, Dockerfile is a challenging language to write big parts of the application in. People writing WSGI applications probably think Python is a good programming language. The more of the application logic is in Python, the easier it is for a WSGI-based team to master it.

PEX

Pex is a way to package several Python “distributions” (sometimes informally called “Packages”, the things that are hosted by PyPI) into one file, optionally with an entry-point so that running the file will call a pre-defined function. It can take an explicit list of wheels but can also, as in our example here, take arguments compatible with the ones pip takes. The best practice is to give it a list of wheels, and build the wheels with pip wheel.

pkg_resources

The pkg_resources module allows access to files packaged in a distribution in a way that is agnostic to how the distribution was deployed. Specifically, it is possible to install a distribution as a zipped directory, instead of unpacking it into site-packages. The pex format relies on this feature of Python, so adherence to using pkg_resources to access data files is important in order to not break pex compatibility.

Let’s Encrypt

Let’s Encrypt is a free, automated, and open Certificate Authority. It has invented the ACME protocol in order to make getting secure certificates a simple operation. txacme is an implementation of an ACME client, i.e., something that asks for certificates, for Twisted applications. It uses the server endpoint plugin mechanism in order to allow any application that builds a listening endpoint to support ACME.

Twist

The twist command-line tool allows running any Twisted service plugin. Service plugins allow us to configure a service using Python, a pretty nifty language, while still allowing specific customizations at the point of use via command line parameters.

Putting it all together

Our setup.py file defines a distribution called sayhello. In it, we have three parts:

src/sayhello/wsgi.py: A simple Flask-based WSGI application

src/sayhello/data/index.html: an HTML file meant to serve as the root

src/twisted/plugins/sayhello.py: A Twist plugin

There is also some build infrastructure:

build is a Python script to run the build.

build.docker is a Dockerfile designed to build pex files, but not run as a production server.

run.docker is a Dockerfile designed for the production container.

Note that build does not push the resulting container to DockerHub.

Credits

Glyph Lefkowitz has inspired me in his blog about how to build efficient containers. He has also spoken about how deploying applications should be no more than one file copy.

Tristan Seligmann has written txacme.

Amber “Hawkowl” Brown has written “twist”, which is much better at running Twisted-based services than the older “twistd”.

Of course, all mistakes and problems here are completely my responsibility.

WSGI is a great standard. It has been amazingly successful. In order to describe how successful it is, let me describe life before WSGI. In the beginning, CGI existed. CGI was just a standard for how a web server can run a process — what environment variables to pass, and so forth. In order to write a web-based application, people would write programs that complied with CGI. At that time, Apache’s only competition was commercial web servers, and CGI allowed you to write applications that ran on both. However, starting a process for each request was slow and wasteful.

For Python applications, people wrote mod_python for Apache. It allowed people to write Python programs that ran inside the Apache process, and directly used Apache’s API to access the HTTP request details. Since Apache was the only server that mattered, that was fine. However, as more servers arrived, a standard was needed. WSGI was originally a way to run the same Django application on many servers. However, as a side effect, it also allowed the second wave of Python web application frameworks (Paste, Flask and more) to have something to run on. In order to make life easier, Python included wsgiref, a module that implemented a single-thread single-process blocking web server with the WSGI protocol.
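To make the standard concrete, here is a minimal sketch of a WSGI application together with the wsgiref server from the standard library. The names application, call_app and serve, as well as the address and port, are choices made for this illustration only.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A complete WSGI application: a callable taking environ and start_response."""
    body = b"Hello, WSGI\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app):
    """Call the application directly, the way any WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a plausible request
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


def serve(host="127.0.0.1", port=8000):
    """Run the single-threaded, blocking wsgiref server (development only)."""
    with make_server(host, port, application) as httpd:
        httpd.serve_forever()


status, body = call_app(application)
```

Because the application is only a callable, the same object can be handed to wsgiref, to a framework's development server, or to a production container.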

Development

Some web frameworks come with their own development web servers that will run their WSGI apps. Some use wsgiref. Almost always those options are carefully documented as “just for development use, do not use in production.” Wouldn’t it be nice to use the same WSGI container in both development and production, eliminating one potential source of reproduction bugs?

For ease of use, it should probably be written in Python. Luckily, “twist web --wsgi” is just such a server. To showcase how easy it is to use, twist-wsgi shows commands to run Django, Flask, Pyramid and Bottle apps as easily as running the frameworks’ built-in web servers.

Production

In production, using the Twisted WSGI container comes with several advantages. Production-grade SSL support using pyOpenSSL and cryptography allows elimination of “SSL terminators”, removing one moving piece from the equation. With third-party extensions like txsni and txacme, it allows modern support for “easy SSL”. The built-in HTTP/2 support, starting with Twisted 16.3, allows better support for parallel requests from modern browsers.

The Twisted web server also has a built-in static file server, allowing the elimination of a “front-end” web server that deals with static files by itself, and passing dynamic requests to the application server.

Twisted is also not limited to web serving. As a full-stack network application, it has support for scheduling repeated tasks, running processes and supporting other protocols (for example, a side-channel for online control). Last but not least, in order to integrate that, the language used is Python. As an example for an integrated solution, the Frankensteinian monster plugin showcases a combo web application combining 4 frameworks, a static file server and a scheduled task updating a file.

While the goal is not to encourage using four web frameworks and a couple of side services in order to greet the user and tell them what time it is, it is nice that if the need strikes this can all be integrated into one process in one language, without the need to remember how to spell “every 4 seconds” in cron or how to quote a string in the nginx configuration file.

moshez, http://moshez.wordpress.com/?p=1222, 2016-09-15T06:03:07Z

In the beginning, came the so-called “procedural” style. Data was data, and behavior, implemented as procedures, was a separate thing. Object-oriented design is the idea of bundling data and behavior into a single thing, usually called a “class”. In return for having to tie the two together, the thought went, we would get polymorphism.

Polymorphism is pretty neat. We send different objects the same message, for example, “turn yourself into a string”, and they respond appropriately — each according to their uniquely defined behavior.

But what if we could separate the data and behavior, and still get polymorphism? This is the idea behind post-object-oriented design.

In Python, we achieve this with two external packages. One is the “attr” package. This package provides a useful way to define bundles of data that still exhibit the minimum amount of behavior we do want: initialization, string representation, hashing and more.

The other is the “singledispatch” package (available as functools.singledispatch in Python 3.4+).

import attr
import singledispatch

In order to be specific, we imagine a simple protocol. The low-level details of the protocol do not concern us, but we assume some lower-level parsing allows us to communicate in dictionaries back and forth (perhaps serialized/deserialized using JSON).

Our protocol is one to send changes to a map. The only two messages are “set”, to set a key to a given value, and “delete”, to delete a key.

But this was easy! There was no need for polymorphism: we always get one type in (dictionaries), and we consult a mapping to decide which type to produce.
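The parsing step might be sketched as follows. To keep the sketch runnable with only the standard library, it uses dataclasses as a stand-in for attr, and it assumes a wire format in which a "type" key names the message; the class names Set and Delete and the function parse are also invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Set:
    """Set a key to a given value."""
    key: str
    value: str


@dataclass(frozen=True)
class Delete:
    """Delete a key."""
    key: str


# The mapping we consult to decide which type to produce.
_MESSAGE_TYPES = {"set": Set, "delete": Delete}


def parse(message):
    """Turn a dictionary from the lower-level parser into a message object."""
    fields = dict(message)      # copy, so the caller's dictionary is untouched
    kind = fields.pop("type")   # assumed wire format: "type" names the message
    return _MESSAGE_TYPES[kind](**fields)


parsed = parse({"type": "set", "key": "color", "value": "blue"})
```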

However, for serialization, we do need polymorphism. Enter our second tool — the singledispatch package. The default function is equivalent to a method defined on “object”: the ultimate super-class. Since we do not want to serialize generic objects, our default implementation errors out.

In this case, we kept the functionality “near” the code. However, note that the functionality could be implemented in a different module: these functions, even though they are polymorphic, follow Python namespace rules. This is useful: several different modules could implement “act_on”: for example, an in-memory map (as we defined above), a module using Redis or a module using a SQL database.
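A sketch of both polymorphic functions, under stated assumptions: it uses functools.singledispatch (the standard library home of singledispatch in Python 3.4+), dataclasses as a stand-in for attr, and an assumed wire format with a "type" key. Only the name act_on comes from the text; everything else is hypothetical.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class Set:
    key: str
    value: str


@dataclass(frozen=True)
class Delete:
    key: str


@singledispatch
def serialize(message):
    # Default implementation, equivalent to a method on "object": error out.
    raise TypeError("cannot serialize", message)


@serialize.register(Set)
def _serialize_set(message):
    return {"type": "set", "key": message.key, "value": message.value}


@serialize.register(Delete)
def _serialize_delete(message):
    return {"type": "delete", "key": message.key}


@singledispatch
def act_on(message, store):
    raise TypeError("cannot act on", message)


@act_on.register(Set)
def _act_on_set(message, store):
    store[message.key] = message.value


@act_on.register(Delete)
def _act_on_delete(message, store):
    del store[message.key]


# An in-memory map is just a dictionary; dispatch is on the message type.
store = {}
act_on(Set("color", "blue"), store)
act_on(Set("size", "large"), store)
act_on(Delete("size"), store)
wire = serialize(Set("color", "blue"))
```

A Redis or SQL variant would define its own act_on in its own module, with the same message classes and no changes to them.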

Actual methods are not completely obsolete. It would still be best to make methods do anything that would require private attribute access. In simple cases, as above, there is no difference between the public interface and the public implementation.

moshez, http://moshez.wordpress.com/?p=1216, 2016-08-25T03:50:15Z

When operating computers, we are often exposed to so-called “time series”. Whether it is database latency, page fault rate or total memory used, these are all exposed as numbers that are usually sampled at frequent intervals.

However, not only computer engineers are exposed to such data. It is worthwhile to know what other disciplines are exposed to such data, and what they do with it. “Earth sciences” (geology, climate, etc.) have a lot of numbers, and often need to analyze trends and make predictions. Sometimes these predictions have, literally, billions of dollars’ worth of decisions hinging on them. It is worthwhile to read some of the textbooks for students of those disciplines to see how to approach those series.

Physicians are another group that needs to visually inspect time series data. EKG data is often vital for analyzing patients’ health, especially when compared to their historical records. For that, the data needs to be saved. A lot of EKG research has been done on how to compress numerical data while still keeping it “visually the same”. While the research on that is not as rigorous, and not as settled, as the trend analysis in geology, it is still useful to look into. Indeed, even the basics are already better than so-called “roll-ups”, which preserve none of the visual distinction of the data, flattening peaks and filling valleys while keeping a score of “standard deviation” that is not as helpful as is usually hoped for.
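A small worked example of how an averaging roll-up flattens a peak. The sample values, the window width and the min/max summary are assumptions chosen for this illustration; the min/max envelope is only a simple contrast, not one of the EKG compression techniques mentioned above.

```python
def roll_up(samples, width, summarize):
    """Replace each window of `width` samples with a single summary."""
    return [
        summarize(samples[start:start + width])
        for start in range(0, len(samples), width)
    ]


def mean(window):
    return sum(window) / len(window)


def envelope(window):
    return (min(window), max(window))


# Hypothetical latency samples in milliseconds, with one obvious spike.
latency_ms = [10, 10, 10, 10, 10, 90, 10, 10]

averaged = roll_up(latency_ms, 4, mean)
extremes = roll_up(latency_ms, 4, envelope)
```

In the averaged series the spike of 90 shows up as a window of 30, which looks like a mild slowdown; the min/max pairs at least keep the fact that a peak occurred.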

moshez, http://moshez.wordpress.com/?p=1165, 2016-08-20T18:56:29Z

I was idly contemplating implementing a new Jupyter kernel. Luckily, they try to provide facilities to make it easier. Unfortunately, they made a number of suboptimal choices in their API. Fortunately, those mistakes are both common and easily avoidable.

Subclassing as API

They suggest subclassing IPython.kernel.zmq.kernelbase.Kernel. Errr… not “suggest”. It is a “required step”. The reason is probably that this class already implements 21 methods. When you subclass, make sure not to use any of these names, or things will break randomly. If you do not want to subclass, good luck figuring out what assumptions the system makes about these 21 methods, because there is no interface or even prose documentation.

Note the comment “base class increments the execution count”. This is a classic code smell: it seems this would be needed in every single overrider, which means it really belongs in the helper class, not in every kernel.

None

Of course, this means that user_expressions will sometimes be a dictionary and sometimes None. It is likely that the code will be written to anticipate one or the other, and will fail in interesting ways if None is actually sent.

Optional Overrides

As described in this section, there are also ways to make the kernel better with optional overrides. The convention used, which is nowhere explained, is that do_ methods mean you should override to make a better kernel. Nowhere is it explained why there is no default history implementation, or where to get one, or why a simple stupid implementation is wrong.

Dictionaries

All overrides return dictionaries, which get serialized directly into the underlying communication platform. This is a poor abstraction, especially when the documentation is direct links to the underlying protocol. When wrapping a protocol, it is much nicer to use an Interface as the documentation of what is assumed — and define an attr.s-based class to allow returning something which is automatically the correct type, and will fail in nice ways if a parameter is forgotten.

Summary

If you are providing an API, here are a few positive lessons based on the issues above:

You should expect interfaces, not subclasses. Use composition, not subclassing. If you want to provide a default implementation in composition, just check for a return of NotImplemented, and use the default.

Do the work of abstracting your customers from the need to use dictionaries and unwrap automatically. Use attr.s to avoid customer boilerplate.

Send all arguments. Isolate your customers from the need to come up with sane defaults.

As much as possible, try to have your interfaces be side-effect free. Instead of asking the customer to directly make a change, allow the customer to make the “needed change” be part of the return type. This will let the customers test their class much more easily.
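The lessons above can be sketched together in one small example. Everything in it is hypothetical: KernelRunner, EchoKernel and ExecuteReply are invented names, not part of Jupyter or IPython, and dataclasses stands in for attr.s so that the sketch needs only the standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecuteReply:
    """Typed return value: forgetting `status` fails at construction time."""
    status: str
    payload: tuple = ()


class EchoKernel:
    """Customer code: a plain object, no base class, no side effects."""

    def execute(self, code, user_expressions):
        return ExecuteReply(status="ok", payload=(code,))

    def history(self, session):
        # Opt out: let the helper supply its default.
        return NotImplemented


class KernelRunner:
    """Helper that wraps an implementation by composition."""

    def __init__(self, implementation):
        self._implementation = implementation
        self.execution_count = 0

    def execute(self, code, user_expressions=None):
        # Bookkeeping lives here, once, not in every kernel.
        self.execution_count += 1
        # Always send all arguments: the implementation never sees None.
        if user_expressions is None:
            user_expressions = {}
        reply = self._implementation.execute(code, user_expressions)
        # Only the helper knows about the dictionary sent over the wire.
        return {
            "status": reply.status,
            "execution_count": self.execution_count,
            "payload": list(reply.payload),
        }

    def history(self, session):
        result = self._implementation.history(session)
        if result is NotImplemented:
            return []  # default: no history
        return result


runner = KernelRunner(EchoKernel())
first = runner.execute("1 + 1")
second = runner.execute("2 + 2", user_expressions={})
```

Since EchoKernel only returns values, it can be tested by calling its methods directly, without any running kernel machinery.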

moshez, http://moshez.wordpress.com/?p=1128, 2016-06-07T05:40:57Z

Every single Python tutorial shows the pattern of

# define functions, classes,
# etc.
if __name__ == '__main__':
    main()

This is not a good pattern. If your code is not going to be in a Python module, there is no reason not to unconditionally call ‘main()’ at the bottom. So this code will only be used in modules — where it leads to unpredictable effects. If this module is imported as ‘foo’, then the identity of ‘foo.something’ and ‘__main__.something’ will be different, even though they share code.

This leads to hilarious effects like @cache decorators not doing what they are supposed to, parallel registry lists and all kinds of other issues. Hilarious unless you spend a couple of hours debugging why ‘isinstance()’ is giving incorrect results.
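The identity problem can be reproduced with a short script. This is a sketch under assumptions: the module name foo and the class Something follow the naming used above, and the script is written to a temporary directory and run as a main program in a separate interpreter.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A module that uses the usual idiom and, when run as a script,
# also gets imported under its own name.
SOURCE = textwrap.dedent('''\
    class Something:
        pass

    if __name__ == '__main__':
        import foo  # executes this same file a second time, as module foo
        print(foo.Something is Something)
''')

with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "foo.py")
    with open(script, "w") as fp:
        fp.write(SOURCE)
    # Run the file as the main module; its directory is on the import path.
    result = subprocess.run(
        [sys.executable, script],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=True,
    )
    output = result.stdout.strip()
```

The script reports that the two classes are not the same object, which is why isinstance() checks, caches and registries can silently disagree between the two copies.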

If you want to write a main module, make sure it cannot be imported. In this case, reversed stupidity is intelligence — just reverse the idiom:

# at the top
if __name__ != '__main__':
    raise ImportError("this module cannot be imported")

This, of course, will mean that this module cannot be unit tested: therefore, any non-trivial code should go in a different module that this one imports. Because of this, it is easy to gravitate towards a package. In that case, put the code above in a module called ‘__main__.py’. This will lead to the following layout for a simple package:

This will work in any environment where the package is on the sys.path: in particular, in any virtualenv where it was pip-installed. Unless a short command-line is important, it allows skipping over creating a console script in setup.py completely, and letting “python -m” be the official CLI. Since pex supports setting a module as an entry point, if this tool needs to be deployed in other environments, it is easy to package into a tool that will execute the script:

$ pex . --entry-point SOME_PACKAGE --output-file toolname

moshez, http://moshez.wordpress.com/?p=1107, 2016-05-31T14:43:45Z

I’ve seen a few talks about “stop writing classes”. I think they have a point, but it is a little over-stated. All debates are bravery debates, so it is hard to say which problem is harder, but as a recovering class-writing-guiltoholic, let me admit this: I took this too far. I was avoiding classes when I shouldn’t have.

Classes are best kept small

It is true that classes are best kept small. Any “method” which is not really designed to be overridden is often best implemented as a function that accepts a “duck-type” (or a more formal interface).

This, of course, sometimes leads to…

If a class has only one public method, except __init__, it wants to be a function

Especially given functools.partial, there is no need to decide ahead of time which arguments are “static” and which are “dynamic”.
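A brief sketch of that point. The function send and its arguments are hypothetical; the only real API used is functools.partial from the standard library.

```python
import functools


def send(host, port, payload):
    """Hypothetical function; a class would have fixed host and port in __init__."""
    return "{}:{} <- {}".format(host, port, payload)


# One caller treats the destination as "static" and the payload as "dynamic"...
send_to_logger = functools.partial(send, "logs.example.com", 514)
line = send_to_logger("disk full")

# ...another freezes the port and payload and varies the host instead.
ping = functools.partial(send, port=514, payload="ping")
probe = ping("10.0.0.1")
```

With a class, the split between constructor arguments and method arguments would have been decided once, by the author, for every caller.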

Classes are useful as data packets

This is the usual counter-point to the first two anti-class sentiments: a class which is nothing more than a bunch of attributes (a good example is the TCP envelope: source IP/target IP/source port/target port) is useful. Sure, the attributes could be passed around as dictionaries, but this does not make things better. Just use attrs, and it is often useful to write two more methods:

Some variant of “serialize”, an instance method that returns some lower-level format (dictionary, string, etc.)

Some variant of “deserialize”, a class method that takes the lower-level format above and returns a corresponding instance.
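A sketch of such a data packet, using the TCP envelope example. So that it runs with only the standard library, dataclasses is used as a stand-in for attrs; the class name Envelope and the field names are invented for this illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Envelope:
    """Nothing but attributes, plus the two conversion methods."""
    source_ip: str
    target_ip: str
    source_port: int
    target_port: int

    def serialize(self):
        """Instance method: return the lower-level format (a dictionary)."""
        return asdict(self)

    @classmethod
    def deserialize(cls, raw):
        """Class method: build an instance from the lower-level format."""
        return cls(**raw)


envelope = Envelope("10.0.0.1", "10.0.0.2", 49152, 443)
round_trip = Envelope.deserialize(envelope.serialize())
```

A dictionary with a missing port would fail somewhere far from its origin; here, deserialize raises a TypeError naming the missing argument right away.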

It is perfectly ok to write this class rather than shipping dictionaries around. If nothing else, error messages will be a lot nicer. Please do not feel guilty.