For whatever reason (maybe because you need to set supplier.country before you add things to the DB, or before you instantiate product objects) you need to be able to adjust the country field on your supplier fixture.

Option 1: more fixtures

We can just create more fixtures, and try to do a bit of DRY by extracting out the common logic:
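
Something like this, perhaps (a sketch: the db fixture and the Supplier model stand in for whatever your real conftest.py provides):

import pytest

def _make_supplier(db, country):
    # the common logic, extracted out
    supplier = Supplier(name="Test Supplier", country=country)
    db.add(supplier)
    db.commit()
    return supplier

@pytest.fixture
def us_supplier(db):
    return _make_supplier(db, country="US")

@pytest.fixture
def eu_supplier(db):
    return _make_supplier(db, country="DE")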

That's just one way you could do it, and maybe you can figure out ways to reduce the duplication of the db.add() stuff as well, but you are going to need a different, named fixture for each customisation of Supplier, and eventually you may decide that doesn't scale: us_supplier, eu_supplier, asia_supplier, ch_supplier, etc etc. Too many fixtures! I'd like just one, customisable fixture please.

Option 2: factory fixtures

Instead of a fixture returning an object directly, it can return a function that creates an object, and that function can take arguments:
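
A sketch of what that might look like (db and Supplier are stand-ins again, and the real code presumably had similar factories for products and other objects):

@pytest.fixture
def make_supplier(db):
    def _make_supplier(country="GB", **kwargs):
        supplier = Supplier(country=country, **kwargs)
        db.add(supplier)
        db.commit()
        return supplier
    return _make_supplier

def test_something_with_a_us_supplier(make_supplier):
    supplier = make_supplier(country="US")
    assert supplier.country == "US"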

That works, but firstly now everything is a factory-fixture, which makes them more convoluted, and secondly, your tests are filling up with extra calls to make_things, and you're having to embed some of the domain knowledge of what-depends-on-what into your tests as well as your fixtures.

Option 3: "normal" fixture parametrization

This is a pretty cool feature of Pytest. You probably already know that you can parametrize tests, injecting different values for arguments to your test and then running the same test multiple times, once for each value:
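
For example (a sketch; the second half shows the same idea applied to a fixture via params and request.param, which is what this option is really about, with Supplier and db as stand-ins):

import pytest

@pytest.mark.parametrize("country", ["US", "DE", "CN"])
def test_country_codes_are_two_letters(country):
    assert len(country) == 2

@pytest.fixture(params=["US", "DE", "CN"])
def supplier(request, db):
    s = Supplier(country=request.param)
    db.add(s)
    db.commit()
    return s

def test_supplier_has_a_country(supplier):
    # runs three times, once for each country
    assert supplier.country is not None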

Amazing huh? The only problem is that you're now likely to build a teetering tower of implicit dependencies where the only way to find out what's actually happening is to spend ages spelunking in conftest.py, but, hey, if you didn't like crazy nested fixture magic, why are you using pytest in the first place, right?

Hi, I'm Harry, Bob's coauthor for this series on architecture. Now I don't pretend to be an architect*, but I do know a bit about Python. You know the apocryphal tale about bikeshedding? Everyone wants to be able to express an opinion, even if it's only about the colour of the bikesheds? Well this will be me essentially doing that about Bob's code. Not questioning the architecture. Just the cosmetics. But, readability counts, so here we go!

"Stop Writing Classes"

Despite the fact that Bob swears blind that he was a functional programmer for years, I think Bob does occasionally let the OO-heavy habits of the C# world take over, and he sees classes everywhere, including plenty of places where they don't really help. OK OK, arguably don't help.
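
The sort of thing in question is the class-based, type-annotated NamedTuple; reconstructed from the fields discussed below, it looks something like this (the field types are my guesses):

from typing import NamedTuple
from uuid import UUID

class ReportIssueCommand(NamedTuple):
    issue_id: UUID
    reporter_name: str
    reporter_email: str
    problem_description: str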

Unless you're actually using mypy, though, those types aren't adding much value. The alternative would be the more "classic" namedtuple syntax:

ReportIssueCommand = namedtuple("ReportIssueCommand", ["issue_id", "reporter_name", "reporter_email", "problem_description"])
# or the shorter syntax if it doesn't make you nervous:
ReportIssueCommand = namedtuple("ReportIssueCommand", "issue_id reporter_name reporter_email problem_description")
# come on, have you seen the implementation? namedtuples are magic anyway, get with it!

They weren't available at the time of writing, but Python 3.7 dataclasses might be worth a look too. You'd probably want to use frozen=True to replicate the immutability of namedtuples...
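
For reference, that would look roughly like this (the field types are my guesses, as above):

from dataclasses import dataclass
from uuid import UUID

@dataclass(frozen=True)
class ReportIssueCommand:
    issue_id: UUID
    reporter_name: str
    reporter_email: str
    problem_description: str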

tying commands to handlers

You need some way of connecting commands with their handlers. The most boring way of doing that is in some sort of bootstrap/config code (as in this example) but you might also want to do so "inline" in your handler definition.

Bob's way, where the handler class inherits from Handles[message.ReportIssueCommand] definitely deserves some points for being easily readable, but you really don't want to get into the sausage-factory of the actual implementation, involving, as it does, the controversial typing module.
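
An alternative, sketched here purely to illustrate the idea (this isn't Bob's code), is a plain registration decorator:

HANDLERS = {}

def handles(command_type):
    def decorator(handler_class):
        HANDLERS[command_type] = handler_class
        return handler_class
    return decorator

@handles(ReportIssueCommand)
class ReportIssueHandler:
    ...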

But it might get confusing if you also want to use decorators for dependency injection:

@inject('start_uow')
def report_issue(start_uow, cmd):
    ...

managing units of work without a UnitOfWorkManager

The Unit of Work pattern is one of the more straightforward ones; it's easy to understand why you might want to manage blocks of code that need to be executed "together" and atomically.

In a simple project that might just mean wrapping everything in a single database transaction, but you might also want to manage some other types of permanent storage (filesystem, cloud storage...).

If you're using domain events, you might want to apply the unit-of-work concept to them as well: for a given block of code, perhaps a command handler, either raise all the events in the happy case, or raise none at all (analogous to a rollback) if an error occurs at any point. This gives you the option to replay the command handler later without worrying about duplicate events.

In that case your unit of work manager needs to grow some logic for tracking a stack of events raised by a block of code, as suggested in the domain events post.

a unit of work should probably be a context manager

Either way, Bob nailed it, a Python context manager is the right pattern here. Here's the outline of his class-based one:
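
Roughly this (a reconstruction, based on the fuller SqlAlchemy version that appears later in this collection):

class UnitOfWork:
    def __init__(self, session_factory):
        self.session_factory = session_factory

    def __enter__(self):
        self.session = self.session_factory()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            self.rollback()
        self.session.close()

    def commit(self):
        self.session.commit()

    def rollback(self):
        self.session.rollback()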

could you use an @contextmanager?

We're down to just two classes. Next you might ask whether you really need a class for your unit of work context manager. If your client code doesn't need to call a commit method explicitly, then you might be able to get away with a single method, using contextlib.contextmanager and the yield keyword:
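
Something along these lines (a sketch, assuming a SQLAlchemy-style session_factory):

from contextlib import contextmanager

@contextmanager
def unit_of_work(session_factory):
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()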

In this case our implementation is the ultra-simple "by convention there is only one instance of this class", which, as ways of implementing the singleton pattern go, has a lot going for it compared to all the complicated code-based solutions linked above. If you do want a code-based solution, or if you want to continue experimenting with non-class-based solutions to these problems, why not use the "just use a module" solution - modules are essentially already singletons in Python:
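
For example (a sketch; configure is a hypothetical hook for the bootstrap code to call):

# unit_of_work.py - the module itself plays the role of the singleton
from contextlib import contextmanager

_session_factory = None

def configure(session_factory):
    global _session_factory
    _session_factory = session_factory

@contextmanager
def start():
    session = _session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()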

Hello, in this article, I'll try to explain what Photon-pump is, and write an easy example so you can start using it for your own projects.

Photon-pump is a client for Event Store that we developed at made.com. It's the little brother to atomicpuppy (another Event Store client), but it's async-first and talks TCP rather than HTTP, which makes it faster.

I won't go into event sourcing in depth, since it's been covered in previous posts; this will just be a very simple, silly example of it.

So, let's say we have a game. For a game to happen we need players, so we need to create them. We're going to pretend we have an application that creates players and then records an event for each one, placing it in the appropriate stream in Event Store.

This is an example of the "player created" event; it's a JSON blob:

{"name": "Gil"}

Now, we also need to pick a stream, which is just a string naming the "bucket" where the event will be put. We'll use "adventure", the name of our imaginary game. Not very creative, but it's better than "game".

An event also has a type, which is like a sub-category inside the stream. This is what the event looks like:
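
The two scripts looked roughly like this. This is a reconstruction from memory rather than the original code, so treat the exact photon-pump calls (connect, publish_event, get) and the shape of the returned records as assumptions to check against the library's documentation:

# writer.py (sketch)
import asyncio
import photonpump

async def write_an_event():
    async with photonpump.connect() as conn:
        await conn.publish_event(
            "adventure",             # the stream
            "player_created",        # the event type
            body={"name": "Gil"},    # the event data
        )

def run():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(write_an_event())

if __name__ == "__main__":
    run()

# reader.py (sketch)
import asyncio
import photonpump

async def read_an_event():
    async with photonpump.connect() as conn:
        event_records = await conn.get("adventure")
        for record in event_records:
            event = record.event
            print(event.type, event.data)

def run():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(read_an_event())

if __name__ == "__main__":
    run()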

Ignoring run and the if __name__ == "__main__" block, the read_an_event function uses Photon-pump's get method to collect all the events, iterating over them and printing each one. We get back event records, and each record contains the event, so we can print out its type and data.

This was just a very simple example that I came up with, but if you want to make it more like the real world, how about following the previous posts about CQRS, using Photon-pump to store and read the events?

Stay tuned for the next part where we will talk about subscriptions.

BONUS: If you want to replicate this code, you will need Python 3.6+ (remember to install Photon-pump: pip install photon-pump) and either Docker or Event Store installed on your machine. Simply start Event Store in Docker (docker run -p 1113:1113 -p 2113:2113 eventstore/eventstore) and run the two Python scripts (writer.py and reader.py) in sequence to see it work.

Dependency injection is not crazy, not un-pythonic, and not enterprisey. Here's Wikipedia:

In software engineering, dependency injection is a technique whereby one object (or static method) supplies the dependencies of another object. A dependency is an object that can be used (a service). An injection is the passing of a dependency to a dependent object (a client) that would use it. The service is made part of the client's state. Passing the service to the client, rather than allowing a client to build or find the service, is the fundamental requirement of the pattern

In other words, Dependency Injection (DI, for all you jargon-fans out there) is when an object is given its dependencies instead of reaching out to get them by itself. For example, in this series we're building a system for managing IT support issues. Last time we had a requirement to send an email when an issue was assigned to an engineer.

Dependency injection in class constructors

Our handler is orchestration code, and it plugs together two collaborators: a View Builder that fetches data, and an Email Sender that knows how to send an email to the mail server.

We could have our handler import and use its dependencies directly (and implicitly), like this:
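
A sketch of both versions (the class and collaborator names follow the ones used later in this post; none of the collaborators are defined here):

# version 1: the handler reaches out and builds its own collaborators
class IssueAssignedHandler:
    def handle(self, event):
        view = IssueViewBuilder(db)
        sender = SmtpEmailSender(smtp_config)
        data = view.fetch(event.issue_id)
        sender.send(event.assignee_email, "You have a new issue", data)

# version 2: the handler is given its collaborators
class IssueAssignedHandler:
    def __init__(self, view, sender):
        self.view = view
        self.sender = sender

    def handle(self, event):
        data = self.view.fetch(event.issue_id)
        self.sender.send(event.assignee_email, "You have a new issue", data)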

The second version is dependency injection: we inject the dependencies (the sender and the view) by making them parameters of the constructor. That's it.

Why bother? Passing our parameters this way makes them more explicit, and so reduces the overall quantity of Unpleasant Surprise hiding in the system. It's easy to see what might have side-effects and what doesn't. Because I'm providing all my dependencies from outside of my handler, I can change them easily, or provide fakes for testing. This helps to keep the system loosely-coupled and flexible. It also means that I have to think about what the dependencies of my system ought to be, and that helps me to define meaningful abstractions.

Dependency injection with partial functions

In our implementation, dependency injection is really just a way of performing partial application on a method call. Earlier in this series, I said that I often create handlers by abusing the __call__ magic method.
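
Roughly this (a reconstruction; the domain objects and the unit of work manager come from earlier parts and aren't defined here):

from functools import partial

class ReportIssueHandler:
    def __init__(self, uowm):
        self.uowm = uowm

    def __call__(self, cmd):
        reporter = IssueReporter(cmd.reporter_name, cmd.reporter_email)
        issue = Issue(cmd.issue_id, reporter, cmd.problem_description)
        with self.uowm.start() as uow:
            uow.issues.add(issue)
            uow.commit()

def report_issue(uowm, cmd):
    reporter = IssueReporter(cmd.reporter_name, cmd.reporter_email)
    issue = Issue(cmd.issue_id, reporter, cmd.problem_description)
    with uowm.start() as uow:
        uow.issues.add(issue)
        uow.commit()

handler = ReportIssueHandler(uowm)               # dependency injected via the constructor
handler_a = partial(report_issue, uowm)          # the same thing via functools.partial
handler_b = lambda cmd: report_issue(uowm, cmd)  # ...or via a closure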

The callables handler, handler_a, and handler_b all take a single argument (the command) and run the same code on it, so we can see that they are equivalent. Dependency injection is just a way of parametrising the behaviour of our applications by partially applying function arguments.

Dependency Injection enables Clean Architecture

The advantage of building a system this way is that it's very easy to test, configure, and extend the behaviour of our application through composition. Dynamic languages offer many ways to fake the behaviour of a component, but my preference is to write explicit fakes and stubs (not just unittest.mock.Mock objects), and pass them as constructor arguments. This forces me to think about my system in terms of composable parts, and to identify the roles that they play. Instead of directly calling the database from my handler, I'm providing an IssueViewBuilder. Instead of writing a load of SMTP code in my handler, I'm providing an instance of EmailSender.

This, for me at least, is the simplest, most obvious, and least magical way of dealing with dependencies, especially across architectural boundaries. Performing dependency injection - whether by constructor injection or partial application, or some magic property-filling decorator - is mandatory if you want to do ports and adapters. It's the "one weird trick" that allows high-level code (business logic) to remain completely isolated from low level code (database transactions, file operations, email sending etc.)

You don't need to use a framework for DI

Dependency injection gets a bad rap in the Python community, partly for reasons that escape me, but partly, I think, because people assume that you need to use a framework to perform the injection, and they're terrified of ending up in an XML-driven hellscape like Spring. This isn't true: you can perform dependency injection with no frameworks at all. For example, in the code sample for the previous part in this series, I extracted all my wiring into a single module with boring code that looks like this:
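
Something like this (a condensed sketch; the real module is longer, and all the class names come from earlier parts of the series rather than being defined here):

# bootstrap.py
def build_app(config):
    db = SqlAlchemySessionFactory(config.db_uri)

    bus = MessageBus()
    uowm = SqlAlchemyUnitOfWorkManager(db)
    view_builder = IssueViewBuilder(db)
    sender = SmtpEmailSender(config.smtp)

    bus.subscribe(ReportIssueCommand, ReportIssueHandler(uowm))
    bus.subscribe(AssignIssueCommand, AssignIssueHandler(uowm))
    bus.subscribe(IssueAssignedEvent, IssueAssignedHandler(view_builder, sender))

    return bus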

This code is just a straight-line script that configures the database, creates all of our message handlers, and then registers them with the message bus. This component is what an architect would call a Composition Root. On my current teams, we tend to call this a bootstrap script. As systems grow, though, and requirements become more complex, this bootstrapper script can become more repetitive and error-prone. Dependency injection frameworks exist to remove some of the boiler-plate around registering and wiring up dependencies. In recent years the .Net-hipster crowd have started to move away from complex dependency injection containers in favour of simpler composition roots. This is known as poor man's DI, pure DI, or artisanal organic acorn-fed DI.

inject

Dependency injection frameworks exist to remove some of the boiler-plate around registering and wiring up dependencies

Usually, on our Python projects at Made.com, we use the inject library. This is a simple tool that performs the partial application trick I demonstrated above. Inject is my favourite of the Python DI libraries because it's so simple to use, but I have a dislike for its use of decorators to declare dependencies.
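
From memory, the inject-based wiring looked roughly like this; treat the exact binder and decorator API as an approximation and check the inject docs before copying it:

import inject

def configure_binder(binder):
    binder.bind('email_sender', SmtpEmailSender(smtp_config))
    binder.bind('issue_view_builder', IssueViewBuilder(db))

inject.configure(configure_binder)

class IssueAssignedHandler:
    @inject.params(sender='email_sender', view='issue_view_builder')
    def __init__(self, sender, view):
        self.sender = sender
        self.view = view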

The configure_binder function takes the place of my bootstrap script in wiring up and configuring my dependencies. When I call IssueAssignedHandler the inject library knows that it should replace the sender param with the configured SmtpEmailSender, and that it should replace the view param with an IssueViewBuilder. The decorator serves to associate the service ("email_sender") with the parameter ("sender"). It works, but it always felt inappropriate to have this kind of declaration outside of my composition root.

Introducing punq, and DI containers

I've been working on a prototype DI framework that avoids this problem by using Python 3.6's optional type hinting, and I'd like to show you some use cases.

(A "container" is the established name for a dependency injection framework's registry of services.)

So far, so underwhelming. Simple registrations don't really save us anything over the bootstrap script from earlier. Using a container for this kind of work really only cuts down on duplication - when I've registered UnitOfWorkManager once, I never have to refer to it again, whereas in the bootstrap I had to explicitly pass it to every handler. It's nice not having to decorate my class with dependency injection specific noise though, instead I can just declare what my dependencies are. As an added bonus, I can run mypy over my code and it will tell me if I've made any stupid type errors.

Using DI to compose chains of services

There are more useful things we can do with a dependency injection container, though. For example, maybe we're writing a program that needs to run a bunch of processing rules over some text. We decide to treat each processing rule as a function and use our container to fetch them all at runtime.

One of the advantages of using types over using other keys is that they're composable. I can ask for a List[T] and get all registered instances of some T. This is handy when you're writing code that processes the same message with a bunch of different steps, including rules engines and message buses (see bonus section). Having generics in our type system can make it easier to manage all of our dependencies in other ways, too. For example, I can use generics to automatically wire up all my message handlers.

class IssueAssignedHandler(Handles[IssueAssignedEvent]):
    pass

Here we're stating that our IssueAssignedHandler is a subtype of the Handles class, and it has a type parameter for the handled event. Given a module full of these, I can enumerate the module's types and perform automatic registration.

def register_all(module):
    """ Read through all the types in a module and register them if they are handlers """
    for _, type in inspect.getmembers(module, predicate=inspect.isclass):
        register(type)

def register(type):
    """ If this type is a handler type then register it in the container """
    handler_service_type = get_message_type(type)
    if handler_service_type is None:
        return
    container.register(handler_service_type, type)

def get_message_type(type):
    """ If this type subclasses the Handles[Msg] class, return the parameterised type.
        eg. for our IssueAssignedHandler, this would return Handles[IssueAssignedEvent]
    """
    try:
        return next(b for b in type.__orig_bases__ if b.__origin__ == services.Handles)
    except (AttributeError, StopIteration):
        return None

def get_handler_for(event_type):
    return container.resolve(Handles[event_type])

Nested services

Punq has one more useful trick up its sleeve: nested registrations. These are useful when you need to build some kind of chain of responsibility - a pattern where objects try to handle a request, then pass it along the chain to the next in line.

If punq has multiple services registered for a particular class it will pop one off its stack each time it's asked for one. Because each MessageHandler depends on another MessageHandler, punq treats them as a chain, and injects them into each other like a stack of Russian dolls.

In the following code we add two new message handlers, a metrics handler that records the runtime of our handler pipeline so we can monitor our application, and a de-duplicating handler that prevents us from handling the same message twice. Both of these require complex dependencies of their own, and we can delegate their creation to the container.
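
The two wrappers might look roughly like this (a sketch: MessageHandler, the metrics client, and the de-duplication store are stand-ins, and the registration relies on the nested-registration behaviour described above):

class MetricsRecordingHandler(MessageHandler):
    def __init__(self, inner: MessageHandler, metrics):
        self.inner = inner
        self.metrics = metrics

    def handle(self, msg):
        # time the rest of the pipeline and record it
        with self.metrics.time(type(msg).__name__):
            self.inner.handle(msg)

class DeDuplicatingHandler(MessageHandler):
    def __init__(self, inner: MessageHandler, seen_messages):
        self.inner = inner
        self.seen_messages = seen_messages

    def handle(self, msg):
        # skip messages we have already processed
        if self.seen_messages.contains(msg.message_id):
            return
        self.inner.handle(msg)
        self.seen_messages.add(msg.message_id)

container.register(MessageHandler, MetricsRecordingHandler)
container.register(MessageHandler, DeDuplicatingHandler)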

This is what I meant in the last part when I said that a message bus is a great place to put cross-cutting concerns. By using this pattern of composing generic MessageHandler services, we can implement things like validation, logging, exception handling, even database session management. DI makes it easy for us to write and test those components separately.

For bonus points: a generic handler can become a message bus implementation

One of the fun side-effects of having a DI container that supports nesting is that we could implement a top-level "God" handler for the generic case whose job is to resolve down to the specific message type, and that effectively becomes the implementation of our message bus:
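
A sketch of the idea, building on the get_handler_for function from earlier:

class ContainerMessageBus:
    def __init__(self, container):
        self.container = container

    def handle(self, msg):
        # resolve whatever handler (or chain of handlers) is registered
        # for this specific message type, and dispatch to it
        handler = self.container.resolve(Handles[type(msg)])
        handler.handle(msg)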

Last month I travelled to Romania to give a talk about event sourcing at CodeCamp in Iasi. The talk was a quick 101 on what eventsourcing is, why you might want to do it, and then I demoed how to implement and persist an eventsourced domain model.

Unfortunately there's no video from the conference, because I was extraordinarily handsome and especially witty that day, but I've just uploaded the video of my trial run at work.

Slides and code are up on Github in case you want to try the exercise yourself, or to impersonate me. I made a few changes after this trial run, but the video should be close enough that you can work it out. If not, you can email me - bob at made.com and I'll try and help you out.

Okay, so we have a basic skeleton for an application and we can add new issues into the database, then fetch them from a Flask API. So far, though, we don't have any domain logic at all. All we have is a whole bunch of complicated crap where we could just have a tiny Django app. Let's work through some more use-cases and start to flesh things out.

Back to our domain expert:

So when we've added a reported issue to the issue log, what happens next?

Well we need to triage the problem and decide how urgent it is. Then we might assign it to a particular engineer, or we might leave it on the queue to be picked up by anyone.

Wait, the queue? I thought you had an issue log, are they the same thing, or is there a difference?

Oh, yes. The issue log is just a record of all the issues we have received, but we work from the queue.

I see, and how do things get into the queue?

We triage the new items in the issue log to decide how urgent they are, and what categories they should be in. When we know how to categorise them, and how urgent they are, we treat the issues as a queue, and work through them in priority order.

This is because users always set things to "Extremely urgent"?

Yeah, it's just easier for us to triage the issues ourselves.

And what does that actually mean, like, do you just read the ticket and say "oh, this is 5 important, and it's in the broken mouse category"?

Mmmm... more or less, sometimes we need to ask more questions from the user so we'll email them, or call them. Most things are first-come, first-served, but occasionally someone needs a fix before they can go to a meeting or something.

So you email the user to get more information, or you call them up, and then you use that information to assess the priority of the issue - sorry, triage the issue - and work out what category it should go in... what do the categories achieve? Why categorise?

Partly for reporting, so we can see what stuff is taking up the most time, or if there are clusters of similar problems on a particular batch of laptops for example. Mostly because different engineers have different skills, like if you have a problem with the Active Directory domain, then you should send that to Barry, or if it's an Exchange problem, then George can sort it out, and Mike has the equipment log so he can give you a temporary laptop and so on, and so on.

Okay, and where do I find this "queue"?

Your customer grins and gestures at the wall where a large whiteboard is covered in post-its and stickers of different colours.

Mapping our requirements to our domain

How can we map these requirements back to our system? Looking back over our notes with the domain expert, there's a few obvious verbs that we should use to model our use cases. We can triage an issue, which means we prioritise and categorise it; we can assign a triaged issue to an engineer, or an engineer can pick up an unassigned issue. There's also a whole piece about asking questions, which we might do synchronously by making a phone call and filling out some more details, or asynchronously by sending an email. The Queue, with all of its stickers and sigils and swimlanes, looks too complicated to handle today, so we'll dig deeper into that separately.

Let's quickly flesh out the triage use cases. We'll start by updating the existing unit test for reporting an issue:

Triaging an issue, for now, is a matter of selecting a category and priority. We'll use a free string for category, and an enumeration for Priority. Once an issue is triaged, it enters the AwaitingAssignment state. At some point we'll need to add some view builders to list issues that are waiting for triage or assignment, but for now let's quickly add a handler so that an engineer can Pick an issue from the queue.
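
Something like this, perhaps (a sketch that follows the same shape as the other handlers; the command's fields are stand-ins):

class PickIssueHandler:
    def __init__(self, uowm):
        self.uowm = uowm

    def handle(self, cmd):
        with self.uowm.start() as uow:
            issue = uow.issues.get(cmd.issue_id)
            # picking is just assigning an issue to yourself
            issue.assign_to(cmd.picked_by, assigned_by=cmd.picked_by)
            uow.commit()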

At this point, the handlers are becoming a little boring. As I said way back in the first part, command handlers are supposed to be boring glue-code, and every command handler has the same basic structure:

Fetch current state.

Mutate the state by calling a method on our domain model.

Persist the new state.

Notify other parts of the system that our state has changed.

So far, though, we've only seen steps 1, 2, and 3. Let's introduce a new requirement.

When an issue is assigned to an engineer, can we send them an email to let them know?
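
A first attempt might cram everything into the one handler; here's a sketch of the version we're about to pick apart (the email sender and command fields are stand-ins):

class AssignIssueHandler:
    def __init__(self, uowm, email_sender):
        self.uowm = uowm
        self.email_sender = email_sender

    def handle(self, cmd):
        # responsibility 1: assign the issue
        with self.uowm.start() as uow:
            issue = uow.issues.get(cmd.issue_id)
            issue.assign_to(cmd.assigned_to, assigned_by=cmd.assigned_by)
            uow.commit()
        # responsibility 2: notify the new assignee
        self.email_sender.send(
            to=cmd.assigned_to,
            subject="You have been assigned issue %s" % cmd.issue_id,
            body=issue.problem_description)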

Something here feels wrong, right? Our command-handler now has two very distinct responsibilities. Back at the beginning of this series we said we would stick with three principles:

We will always define where our use-cases begin and end.

We will depend on abstractions, and not on concrete implementations.

We will treat glue code as distinct from business logic, and put it in an appropriate place.

The latter two are being maintained here, but the first principle feels a little more strained. At the very least we're violating the Single Responsibility Principle; my rule of thumb for the SRP is "describe the behaviour of your class. If you use the word 'and' or 'then' you may be breaking the SRP". What does this class do? It assigns an issue to an engineer, AND THEN sends them an email. That's enough to get my refactoring senses tingling, but there's another, less theoretical, reason to split this method up, and it's to do with error handling.

If I click a button marked "Assign to engineer", and I can't assign the issue to that engineer, then I expect an error. The system can't execute the command I've given to it, so I should retry, or choose a different engineer.

If I click a button marked "Assign to engineer", and the system succeeds, but then can't send a notification email, do I care? What action should I take in response? Should I assign the issue again? Should I assign it to someone else? What state will the system be in if I do?

Looking at the problem in this way, it's clear that "assigning the issue" is the real boundary of our use case, and we should either do that successfully, or fail completely. "Send the email" is a secondary side effect. If that part fails I don't want to see an error - let the sysadmins clear it up later.
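
So we split the email-sending out into a handler of its own, something like this (a sketch; the second handler's command and dependencies are placeholders):

class AssignIssueHandler:
    def __init__(self, uowm):
        self.uowm = uowm

    def handle(self, cmd):
        with self.uowm.start() as uow:
            issue = uow.issues.get(cmd.issue_id)
            issue.assign_to(cmd.assigned_to, assigned_by=cmd.assigned_by)
            uow.commit()

class NotifyAssigneeHandler:
    def __init__(self, view, email_sender):
        self.view = view
        self.email_sender = email_sender

    def handle(self, cmd):
        data = self.view.fetch(cmd.issue_id)
        self.email_sender.send(cmd.assigned_to, "You have been assigned an issue", data)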

That seems better, but how should we invoke our new handler? Building a new command and handler from inside our AssignIssueHandler also sounds like a violation of SRP. Worse still, if we start calling handlers from handlers, we'll end up with our use cases coupled together again - and that's definitely a violation of Principle #1.

What we need is a way to signal between handlers - a way of saying "I did my job, can you go do yours?"

All Aboard the Message Bus

In this kind of system, we use Domain Events to fill that need. Events are closely related to Commands, in that both commands and events are types of message - named chunks of data sent between entities. Commands and events differ only in their intent:

Commands are named with the imperative tense (Do this thing), events are named in the past tense (Thing was done).

Commands must be handled by exactly one handler, events can be handled by 0 to N handlers.

If an error occurs when processing a command, the entire request should fail. If an error occurs while processing an event, we should fail gracefully.

We will often use domain events to signal that a command has been processed and to do any additional book-keeping. When should we use a domain event? Going back to our principle #1, we should use events to trigger workflows that fall outside of our immediate use-case boundary. In this instance, our use-case boundary is "assign the issue", and there is a second requirement "notify the assignee" that should happen as a secondary result. Notifications, to humans or other systems, are one of the most common reasons to trigger events in this way, but they might also be used to clear a cache, or regenerate a view model, or execute some logic to make the system eventually consistent.

Armed with this knowledge, we know what to do - we need to raise a domain event when we assign an issue to an engineer. We don't want to know about the subscribers to our event, though, or we'll remain coupled; what we need is a mediator, a piece of infrastructure that can route messages to the correct places. What we need is a message bus. A message bus is a simple piece of middleware that's responsible for getting messages to the right listeners. In our application we have two kinds of message, commands and events. These two types of message are in some sense symmetrical, so we'll use a single message bus for both.

How do we start off writing a message bus? Well, it needs to look up subscribers based on the name of an event. That sounds like a dict to me:
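
Here's the bare-bones version (a sketch; I'm keying the dict on the message type itself rather than its name, which amounts to the same thing):

from collections import defaultdict

class MessageBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, msg_type, handler):
        self.subscribers[msg_type].append(handler)

    def handle(self, msg):
        # commands have exactly one subscriber; events have zero or more
        for handler in self.subscribers[type(msg)]:
            handler.handle(msg)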

Here we have a bare-bones implementation of a message bus. It doesn't do anything fancy, but it will do the job for now. In a production system, the message bus is an excellent place to put cross-cutting concerns; for example, we might want to validate our commands before passing them to handlers, or we may want to perform some basic logging, or performance monitoring. I want to talk more about that in the next part, when we'll tackle the controversial subject of dependency injection and Inversion of Control containers.

For now, let's look at how to hook this up. Firstly, we want to use it from our API handlers.

Not much has changed here - we're still building our command in the Flask adapter, but now we're passing it into a bus instead of directly constructing a handler for ourselves. What about when we need to raise an event? We've got several options for doing this. Usually I raise events from my command handlers, like this:
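
A sketch (the event and command fields are stand-ins, but the shape is the important bit):

class AssignIssueHandler:
    def __init__(self, uowm, bus):
        self.uowm = uowm
        self.bus = bus

    def handle(self, cmd):
        with self.uowm.start() as uow:
            issue = uow.issues.get(cmd.issue_id)
            issue.assign_to(cmd.assigned_to, assigned_by=cmd.assigned_by)
            uow.commit()
        # orchestration: tell the rest of the system what just happened
        self.bus.handle(
            IssueAssignedEvent(cmd.issue_id, cmd.assigned_to, cmd.assigned_by))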

I usually think of this event-raising as a kind of glue - it's orchestration code. Raising events from your handlers this way makes the flow of messages explicit - you don't have to look anywhere else in the system to understand which events will flow from a command. It's also very simple in terms of plumbing. The counter argument is that this feels like we're violating SRP in exactly the same way as before - we're sending a notification about our workflow. Is this really any different to sending the email directly from the handler? Another option is to send events directly from our model objects, and treat them as part of our domain model proper.

There are a couple of benefits to doing this: firstly, it keeps our command handler simpler; secondly, it pushes the logic for deciding when to send an event into the model. For example, maybe we don't always need to raise the event.
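
On the model, that looks something like this (a sketch):

class Issue:
    def __init__(self, issue_id, reporter, problem_description):
        self.issue_id = issue_id
        self.reporter = reporter
        self.problem_description = problem_description
        self.events = []

    def assign_to(self, assigned_to, assigned_by):
        self.assigned_to = assigned_to
        self.assigned_by = assigned_by
        # only notify the assignee if somebody else assigned the issue to them
        if assigned_to != assigned_by:
            self.events.append(
                IssueAssignedEvent(self.issue_id, assigned_to, assigned_by))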

Now we'll only raise our event if the issue was assigned by another engineer. Cases like this are more like business logic than glue code, so today I'm choosing to put them in my domain model. Updating our unit tests is trivial, because we're just exposing the events as a list on our model objects:

The have_raised function is a custom matcher I wrote that checks the events attribute of our object to see if we raised the correct event. It's easy to test for the presence of events, because they're namedtuples, and have value equality.

All that remains is to get the events off our model objects and into our message bus. What we need is a way to detect that we've finished one use-case and are ready to flush our changes. Fortunately, we have a name for this already - it's a unit of work. In this system I'm using SQLAlchemy's event hooks to work out which objects have changed, and queue up their events. When the unit of work exits, we raise the events.

class SqlAlchemyUnitOfWork(UnitOfWork):
    def __init__(self, sessionfactory, bus):
        self.sessionfactory = sessionfactory
        self.bus = bus
        # We want to listen to flush events so that we can get events
        # from our model objects
        event.listen(self.sessionfactory, "after_flush", self.gather_events)

    def __enter__(self):
        self.session = self.sessionfactory()
        # When we first start a unit of work, create a list of events
        self.flushed_events = []
        return self

    def commit(self):
        self.session.flush()
        self.session.commit()

    def rollback(self):
        self.session.rollback()
        # If we roll back our changes we should drop all the events
        self.flushed_events = []

    def gather_events(self, session, ctx):
        # When we flush changes, add all the events from our new and
        # updated entities into the events list
        flushed_objects = ([e for e in session.new]
                           + [e for e in session.dirty])
        for e in flushed_objects:
            self.flushed_events += e.events

    def publish_events(self):
        # When the unit of work completes,
        # raise any events that are in the list
        for e in self.flushed_events:
            self.bus.handle(e)

    def __exit__(self, type, value, traceback):
        self.session.close()
        self.publish_events()

Okay, we've covered a lot of ground here. We've discussed why you might want to use domain events, how a message bus actually works in practice, and how we can get events out of our domain and into our subscribers. The newest code sample demonstrates these ideas, please do check it out, run it, open pull requests, open Github issues etc.

Some people get nervous about the design of the message bus, or the unit of work, but this is just infrastructure - it can be ugly, so long as it works. We're unlikely to ever change this code after the first few user-stories. It's okay to have some crufty code here, so long as it's in our glue layers, safely away from our domain model. Remember, we're doing all of this so that our domain model can stay pure and be flexible when we need to refactor. Not all layers of the system are equal, glue code is just glue.

Next time I want to talk about Dependency Injection, why it's great, and why it's nothing to be afraid of.

In the first and second parts of this series I introduced the Command-Handler and Unit of Work and Repository patterns. I was intending to write about Message Buses, and some more stuff about domain modelling, but I need to quickly skim over this first.

If you've just started reading the Message Buses piece, and you're here to learn about Application-Controlled Identifiers, you'll find those at the end of the post, after a bunch of stuff about ORMs, CQRS, and some casual trolling of junior programmers.

What is CQS?

every method should either be a command that performs an action, or a query that returns data to the caller, but not both. In other words, "Asking a question should not change the answer". More formally, methods should return a value only if they are referentially transparent and hence possess no side effects.

Referential transparency is an important concept from functional programming. Briefly, a function is referentially transparent if you could replace it with a static value.
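
Take this little class, reconstructed from the description that follows:

class LightSwitch:
    def __init__(self):
        self.light_is_on = False

    def is_on(self):
        # query: referentially transparent, no side effects
        return self.light_is_on

    def toggle_light(self):
        # command: changes state, so it shouldn't also return a value
        self.light_is_on = not self.light_is_on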

In this class, the is_on method is referentially transparent - I can replace it with the value True or False without any loss of functionality, but the method toggle_light is side-effectual: replacing its calls with a static value would break the contracts of the system. To comply with the Command-Query separation principle, we should not return a value from our toggle_light method.

In some languages we would say that the is_on method is "pure". The advantage of splitting our functions into those that have side effects and those that are pure is that the code becomes easier to reason about. Haskell loves pure functions, and uses this reasonability to do strange things, like re-ordering your code for you at compilation time to make it more efficient. For those of us who work in more prosaic languages, if commands and queries are clearly distinguished, then I can read through a code base and understand all the ways in which state can change. This is a huge win for debugging because there is nothing worse than troubleshooting a system when you can't work out which code-paths are changing your data.

How do we get data out of a Command-Handler architecture?

When we're working in a Command-Handler system we obviously use Commands and Handlers to perform state changes, but what should we do when we want to get data back out of our model? What is the equivalent port for queries?

The answer is "it depends". The lowest-cost option is just to re-use your repositories in your UI entrypoints.
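
For example, a Flask view can just ask the repository for what it needs (a sketch; the repository, its list method, and as_dict are stand-ins for whatever your project exposes):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/issues')
def list_issues():
    # re-using the repository directly from the web entrypoint
    return jsonify([issue.as_dict() for issue in issue_repository.list()])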

This is totally fine unless you have complex formatting, or multiple entrypoints to your system. The problem with using your repositories directly in this way is that it's a slippery slope. Sooner or later you're going to have a tight deadline, and a simple requirement, and the temptation is to skip all the command/handler nonsense and do it directly in the web api.
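
You know the sort of thing; here's a sketch of the temptation in question (app, the repository, and the db session are carried over from the previous snippet's context):

from flask import request

@app.route('/issues/<issue_id>/assign', methods=['POST'])
def assign_issue(issue_id):
    # uh oh: a state change performed directly in the web controller
    issue = issue_repository.get(issue_id)
    issue.assigned_to = request.json['assigned_to']
    db.session.commit()
    return '', 204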

Aaaaand, we're back to where we started: business logic mixed with glue code, and the whole mess slowly congealing in our web controllers. Of course, the slippery slope argument isn't a good reason not to do something, so if your queries are very simple, and you can avoid the temptation to do updates from your controllers, then you might as well go ahead and read from repositories, it's all good, you have my blessing. If you want to avoid this, because your reads are complex, or because you're trying to stay pure, then instead we could define our views explicitly.
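
For example, an explicit view builder with the SQL written inline (a sketch; the table and column names are stand-ins, and newer SQLAlchemy versions want the query wrapped in text()):

class IssueListBuilder:
    def __init__(self, session):
        self.session = session

    def fetch(self):
        # hardcoded SQL, returning plain dicts rather than domain objects
        rows = self.session.execute(
            'SELECT issue_id, reporter_name, problem_description, state'
            ' FROM issues ORDER BY created_at DESC')
        return [dict(row) for row in rows]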

This is my favourite part of teaching ports and adapters to junior programmers, because the conversation inevitably goes like this:

smooth-faced youngling: Wow, um... are you - are we just going to hardcode that sql in there? Just ... run it on the database?

grizzled old architect: Yeah, I think so. Do The Simplest Thing That Could Possibly Work, right? YOLO, and so forth.

sfy: Oh, okay. Um... but what about the unit of work and the domain model and the service layer and the hexagonal stuff? Didn't you say that "Data access ought to be performed against the aggregate root for the use case, so that we maintain tight control of transactional boundaries"?

goa: Ehhhh... I don't feel like doing that right now, I think I'm getting hungry.

sfy: Right, right ... but what if your database schema changes?

goa: I guess I'll just come back and change that one line of SQL. My acceptance tests will fail if I forget, so I can't get the code through CI.

sfy: But why don't we use the Issue model we wrote? It seems weird to just ignore it and return this dict... and you said "avoid taking a dependency directly on frameworks. Work against an abstraction so that if you dependency changes, that doesn't force change to ripple through your domain". You know we can't unit test this, right?

goa: Ha! What are you, some kind of architecture astronaut? Domain models! Who needs 'em.

Why have a separate read-model?

In my experience, there are two ways that teams go wrong when using ORMs. The most common mistake is not paying enough attention to the boundaries of their use cases. This leads to the application making far too many calls to the database because people write code like this:

# Find all users who are assigned this task,
# notify them and their line manager,
# then move the task to their in-queue
notification = task.as_notification()
for assignee in task.assignees:
    assignee.manager.notifications.add(notification)
    assignee.notifications.add(notification)
    assignee.queues.inbox.add(task)

ORMs make it very easy to "dot" through the object model this way, and pretend that we have our data in memory, but this quickly leads to performance issues when the ORM generates hundreds of select statements in response. Then they get all angry about performance and write long blog posts about how ORM sucks and is an anti-pattern and only n00bs like it. This is akin to blaming OO for your domain logic ending up in the controller.

The second mistake that teams make is using an ORM when they don't need to. Why do we use an ORM in the first place? I think that a good ORM gives us two things:

A unit of work pattern which can be used to control our consistency boundaries.

A data mapper pattern that lets us map a complex object graph to relational tables, without writing tons of boring glue code.

Taken together, these patterns help us to write rich domain models by removing all the database cruft so we can focus on our use-cases. This allows us to model complex business processes in an internally consistent way. When I'm writing a GET method, though, I don't care about any of that. My view doesn't need any business logic, because it doesn't change any state. For 99.5% of use cases, it doesn't even matter if my data are fetched inside a transaction. If I perform a dirty read when listing the issues, one of three things might happen:

I might see changes that aren't yet committed - maybe an Issue that has just been deleted will still show up in the list.

I might not see changes that have been committed - an Issue could be missing from the list, or a title might be 10ms out of date.

I might see duplicates of my data - an Issue could appear twice in the list.

In many systems all these occurrences are unlikely, and will be resolved by a page refresh or following a link to view more data. To be clear, I'm not recommending that you turn off transactions for your SELECT statements, just noting that transactional consistency is usually only a real requirement when we are changing state. When viewing state, we can almost always accept a weaker consistency model.

CQRS is CQS at a system-level

CQRS stands for Command-Query Responsibility Segregation, and it's an architectural pattern that was popularised by Greg Young. A lot of people misunderstand CQRS, and think you need to use separate databases and crazy asynchronous processors to make it work. You can do these things, and I want to write more about that later, but CQRS just means that we separate the Write Model - what we normally think of as the domain model - and the Read Model - a lightweight, simple model for showing on the UI, or answering questions about the domain state.

When I'm serving a write request (a command), my job is to protect the invariants of the system, and model the business process as it appears in the minds of our domain experts. I take the collective understanding of our business analysts, and turn it into a state machine that makes useful work happen. When I'm serving a read request (a query), my job is to get the data out of the database as fast as possible and onto a screen so the user can view it. Anything that gets in the way of my doing that is bloat.

This isn't a new idea, or particularly controversial. We've all tried writing reports against an ORM, or complex hierarchical listing pages, and hit performance barriers. When we get to that point, the only thing we can do - short of rewriting the whole model, or abandoning our use of an ORM - is to rewrite our queries in raw SQL. Once upon a time I'd feel bad for doing this, as though I were cheating, but nowadays I just recognise that the requirements for my queries are fundamentally different than the requirements for my commands.

For the write side of the system, use an ORM; for the read side, use whatever is a) fast and b) convenient.

Application Controlled Identifiers

At this point, a non-junior programmer will say

Okay, Mr Smarty-pants Architect, if our commands can't return any values, and our domain models don't know anything about the database, then how do I get an ID back from my save method?
Let's say I create an API for creating new issues, and when I have POSTed the new issue, I want to redirect the user to an endpoint where they can GET their new Issue. How can I get the id back?

The way I would recommend you handle this is simple - instead of letting your database choose ids for you, just choose them yourself.

There are a few ways to do this; the most common is just to use a UUID, but you can also implement something like hi-lo. In the new code sample, I've implemented three flask endpoints, one to create a new issue, one to list all issues, and one to view a single issue. I'm using UUIDs as my identifiers, but I'm still using an integer primary key on the issues table, because using a GUID in a clustered index leads to table fragmentation and sadness.
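
The POST endpoint ends up looking something like this (a sketch; app, the handler wiring, and the view builder are assumed to exist elsewhere):

import uuid
from flask import request, jsonify, url_for

@app.route('/issues', methods=['POST'])
def create_issue():
    issue_id = uuid.uuid4()  # the application picks the id, not the database
    cmd = ReportIssueCommand(
        issue_id,
        request.json['reporter_name'],
        request.json['reporter_email'],
        request.json['problem_description'])
    report_issue_handler.handle(cmd)
    return '', 201, {'Location': url_for('get_issue', issue_id=issue_id)}

@app.route('/issues/<issue_id>')
def get_issue(issue_id):
    return jsonify(issue_view_builder.fetch(issue_id))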

Okay, quick spot-check - how are we shaping up against our original Ports and Adapters diagram? How do the concepts map?

Pretty well! Our domain is pure and doesn't know anything about infrastructure or IO. We have a command and a handler that orchestrate a use-case, and we can drive our application from tests or Flask. Most importantly, the layers on the outside depend on the layers toward the centre.

In the previous part of this series we built a toy system that could add a new Issue to an IssueLog, but had no real behaviour of its own, and would lose its data every time the application restarted. We're going to extend it a little by introducing some patterns for persistent data access, and talk a little more about the ideas underlying ports and adapters architectures. To recap, we're abiding by three principles:

Clearly define the boundaries of our use cases.

Depend on abstractions, not on concrete implementations.

Identify glue code as distinct from domain logic and put it into its own layer.

The IssueLog is a term from our conversation with the domain expert. It's the place that they record the list of all issues. This is part of the jargon used by our customers, and so it clearly belongs in the domain, but it's also the ideal abstraction for a data store. How can we modify the code so that our newly created Issue will be persisted? We don't want our IssueLog to depend on the database, because that's a violation of principle #2. This is the question that leads us to the ports & adapters architecture.

In a ports and adapters architecture, we build a pure domain that exposes ports. A port is a way for data to get into, or out of, the domain model. In this system, the IssueLog is a port. Ports are connected to the external world by Adapters. In the previous code sample, the FakeIssueLog is an adapter: it provides a service to the system by implementing an interface.

Let's use a real-world analogy. Imagine we have a circuit that detects current over some threshold. If the threshold is reached, the circuit outputs a signal. Into our circuit we attach two ports, one for current in, and one for current out. The input and output channels are part of our circuit: without them, the circuit is useless.

Because we had the great foresight to use standardised ports, we can plug any number of different devices into our circuit. For example, we could attach a light-detector to the input and a buzzer to the output, or we could attach a dial to the input, and a light to the output, and so on.

Considered in isolation, this is just an example of good OO practice: we are extending our system through composition. What makes this a ports-and-adapters architecture is the idea that there is an internal world consisting of the domain model (our ThresholdDetectionCircuit), and an external world that drives the domain model through well-defined ports. How does all of this relate to databases?

By analogy to our circuit example, the IssueLog is a WriteablePort - it's a way for us to get data out of the system. SqlAlchemy and the file system are two types of adapter that we can plug in, just like the Buzzer or Light classes. In fact, the IssueLog is an instance of a common design pattern: it's a Repository. A repository is an object that hides the details of persistent storage by presenting us with an interface that looks like a collection. We should be able to add new things to the repository, and get things out of the repository, and that's essentially it.
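
Concretely, something like this (a sketch; Foo stands in for whatever entity you're storing):

class FooRepository:
    def __init__(self, session):
        self.session = session

    def add(self, foo):
        self.session.add(foo)

    def get(self, id):
        return self.session.query(Foo).get(id)

    def find_by_latest_update(self):
        return self.session.query(Foo).order_by(Foo.updated_at.desc())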

We expose a few methods, one to add new items, one to get items by their id, and a third to find items by some criterion. This FooRepository is using a SqlAlchemy session object, so it's part of our Adapter layer. We could define a different adapter for use in unit tests.
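
For example, an in-memory version (a sketch):

class InMemoryFooRepository:
    def __init__(self):
        self.foos = {}

    def add(self, foo):
        self.foos[foo.id] = foo

    def get(self, id):
        return self.foos.get(id)

    def find_by_latest_update(self):
        return sorted(self.foos.values(), key=lambda f: f.updated_at, reverse=True)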

This adapter works just the same as the one backed by a real database, but does so without any external state. This allows us to test our code without resorting to Setup/Teardown scripts on our database, or monkey patching our ORM to return hard-coded values. We just plug a different adapter into the existing port. As with the ReadablePort and WriteablePort, the simplicity of this interface makes it simple for us to plug in different implementations.

The repository gives us read/write access to objects in our data store, and is commonly used with another pattern, the Unit of Work. A unit of work represents a bunch of things that all have to happen together. It usually allows us to cache objects in memory for the lifetime of a request so that we don't need to make repeated calls to the database. A unit of work is responsible for doing dirty checks on our objects, and flushing any changes to state at the end of a request.

This code is taken from a current production system - the code to implement these patterns really isn't complex. The only thing missing here is some logging and error handling in the commit method. Our unit-of-work manager creates a new unit-of-work, or gives us an existing one depending on how we've configured SqlAlchemy. The unit of work itself is just a thin layer over the top of SqlAlchemy that gives us explicit rollback and commit points. Let's revisit our first command handler and see how we might use these patterns together.
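
Something like this (a sketch, consistent with the description that follows; uow.issues is however the unit of work exposes our issue log):

class ReportIssueHandler:
    def __init__(self, uowm):
        self.uowm = uowm

    def handle(self, cmd):
        reporter = IssueReporter(cmd.reporter_name, cmd.reporter_email)
        issue = Issue(cmd.issue_id, reporter, cmd.problem_description)
        with self.uowm.start() as uow:
            uow.issues.add(issue)
            uow.commit()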

Our command handler looks more or less the same, except that it's now responsible for starting a unit-of-work, and committing the unit-of-work when it has finished. This is in keeping with our rule #1 - we will clearly define the beginning and end of use cases. We know for a fact that only one object is being loaded and modified here, and our database transaction is kept short. Our handler depends on an abstraction - the UnitOfWorkManager, and doesn't care if that's a test-double or a SqlAlchemy session, so that's rule #2 covered. Lastly, this code is painfully boring because it's just glue. We're moving all the dull glue out to the edges of our system so that we can write our domain model in any way that we like: rule #3 observed.

The code sample for this part adds a couple of new packages - one for slow tests (tests that go over a network, or to a real file system), and one for our adapters. We haven't added any new features yet, but we've added a test that shows we can insert an Issue into a sqlite database through our command handler and unit of work. Notice that all of the ORM code is in one module (issues.adapters.orm) and that it depends on our domain model, not the other way around. Our domain objects don't inherit from SqlAlchemy's declarative base. We're beginning to get some sense of what it means to have the domain on the "inside" of a system, and the infrastructural code on the outside.

Our unit test has been updated to use a unit of work, and we can now test that we insert an issue into our issue log, and commit the unit of work, without having a dependency on any actual implementation details. We could completely delete SqlAlchemy from our code base, and our unit tests would continue to work, because we have a pure domain model and we expose abstract ports from our service layer.

The term DDD comes from the book by Eric Evans: "Domain-Driven Design: Tackling Complexity in the Heart of Software". In his book he describes a set of practices that aim to help us build maintainable, rich software systems that solve customers' problems. The book is 560 pages of dense insight, so you'll pardon me if my summary elides some details, but in brief he suggests:

Listen very carefully to your domain experts - the people whose job you're automating or assisting in software.

Learn the jargon that they use, and help them to come up with new jargon, so that every concept in their mental model is named by a single precise term.

Use those terms to model your software; the nouns and verbs of the domain expert are the classes and methods you should use in modelling.

Whenever there is a discrepancy between your model and your shared understanding of the domain, go and talk to the domain experts again, and then refactor aggressively.

This sounds great in theory, but in practice we often find that our business logic escapes from our model objects; we end up with logic bleeding into controllers, or into fat "manager" classes. We find that refactoring becomes difficult: we can't split a large and important class, because that would seriously impact the database schema; or we can't rewrite the internals of an algorithm because it has become tightly coupled to code that exists for a different use-case. The good news is that these problems can be avoided, since they are caused by a lack of organisation in the codebase. In fact, the tools to solve these problems take up half of the DDD book, but it can be difficult to understand how to use them together in the context of a complete system.

I want to use this series to introduce an architectural style called Ports and Adapters, and a design pattern named Command Handler. I'll be explaining the patterns in Python because that's the language that I use day-to-day, but the concepts are applicable to any OO language, and can be massaged to work perfectly in a functional context. There might be a lot more layering and abstraction than you're used to, especially if you're coming from a Django background or similar, but please bear with me. In exchange for a more complex system at the outset, we can avoid much of our accidental complexity later.

The system we're going to build is an issue management system, for use by a helpdesk. We're going to be replacing an existing system, which consists of an HTML form that sends an email. The emails go into a mailbox, and helpdesk staff go through the mails triaging problems and picking up problems that they can solve. Sometimes issues get overlooked for a long time, and the helpdesk team have invented a complex system of post-it notes and whiteboard layouts to track work in progress. For a while this system has worked pretty well but, as the system gets busier, the cracks are beginning to show.

Our first conversation with the domain expert

"What's the first step in the process?" you ask, "How do tickets end up in the mail box?".

"Well, the first thing that happens is the user goes to the web page, and they fill out some details, and report an issue. That sends an email into the issue log and then we pick issues from the log each morning".

"So when a user reports an issue, what's the minimal set of data that you need from them?"

"We need to know who they are, so their name, and email I guess. Uh... and the problem description. They're supposed to add a category, but they never do, and we used to have a priority, but everyone set their issue to EXTREMELY URGENT, so it was useless.

"But a category and priority would help you to triage things?"

"Yes, that would be really helpful if we could get users to set them properly."

This gives us our first use case: As a user, I want to be able to report a new issue.

Okay, before we get to the code, let's talk about architecture. The architecture of a software system is the overall structure - the choice of language, technology, and design patterns that organise the code and satisfy our constraints. For our architecture, we're going to try and stick with three principles:

We will always define where our use-cases begin and end. We won't have business processes that are strewn all over the codebase.

We will depend on abstractions, and not on concrete implementations.

We will treat glue code as distinct from business logic, and put it in an appropriate place.

Firstly we start with the domain model. The domain model encapsulates our shared understanding of the problem, and uses the terms we agreed with the domain experts. In keeping with principle #2 we will define abstractions for any infrastructural or technical concerns and use those in our model. For example, if we need to send an email, or save an entity to a database, we will do so through an abstraction that captures our intent. In this series we'll create a separate python package for our domain model so that we can be sure it has no dependencies on the other layers of the system. Maintaining this rule strictly will make it easier to test and refactor our system, since our domain models aren't tangled up with messy details of databases and http calls.

Around the outside of our domain model we place services. These are stateless objects that do stuff to the domain. In particular, for this system, our command handlers are part of the service layer.

Finally, we have our adapter layer. This layer contains code that drives the service layer, or provides services to the domain model. For example, our domain model may have an abstraction for talking to the database, but the adapter layer provides a concrete implementation. Other adapters might include a Flask API, or our set of unit tests, or a celery event queue. All of these adapters connect our application to the outside world.

In keeping with our first principle, we're going to define a boundary for this use case and create our first Command Handler. A command handler is an object that orchestrates a business process. It does the boring work of fetching the right objects, and invoking the right methods on them. It's similar to the concept of a Controller in an MVC architecture.

A command object is a small object that represents a state-changing action that can happen in the system. Commands have no behaviour; they're pure data structures. There's no reason why you have to represent them with classes, since all they need is a name and a bag of data, but a NamedTuple is a nice compromise between simplicity and convenience. Commands are instructions from an external agent (a user, a cron job, another service etc.) and have names in the imperative mood, for example:

ReportIssue

PrepareUploadUri

CancelOutstandingOrders

RemoveItemFromCart

OpenLoginSession

PlaceCustomerOrder

BeginPaymentProcess

We should try to avoid the verbs Create, Update, or Delete (and their synonyms) because those are technical implementations. When we listen to our domain experts, we often find that there is a better word for the operation we're trying to model. If all of your commands are named "CreateIssue", "UpdateCart", "DeleteOrders", then you're probably not paying enough attention to the language that your stakeholders are using.

The command objects belong to the domain, and they express the API of your domain. If every state-changing action is performed via a command handler, then the list of Commands is the complete list of supported operations in your domain model. This has two major benefits:

If the only way to change state in the system is through a command, then the list of commands tells me all the things I need to test. There are no other code paths that can modify data.

Because our commands are lightweight, logic-free objects, we can create them from an HTTP post, or a celery task, or a command line csv reader, or a unit test. They form a simple and stable API for our system that does not depend on any implementation details and can be invoked in multiple ways.
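To make that concrete, here's a minimal sketch (not code from the real listing) of the same command being built by a web adapter. The Flask route, the form field names, and the report_issue stub are all hypothetical stand-ins; a unit test could construct exactly the same ReportIssue tuple directly.

import uuid
from typing import NamedTuple

from flask import Flask, request

class ReportIssue(NamedTuple):
    issue_id: uuid.UUID
    reporter_name: str
    reporter_email: str
    problem_description: str

def report_issue(cmd: ReportIssue) -> None:
    ...  # the real handler lives in the service layer

app = Flask(__name__)

@app.route("/issues", methods=["POST"])
def report_issue_endpoint():
    # Adapter code: translate the web request into a logic-free command
    # and hand it to the service layer.
    cmd = ReportIssue(
        issue_id=uuid.uuid4(),
        reporter_name=request.form["name"],
        reporter_email=request.form["email"],
        problem_description=request.form["description"],
    )
    report_issue(cmd)
    return "", 201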

In order to process our new command, we'll need to create a command handler.

Command handlers are stateless objects that orchestrate the behaviour of a system. They are a kind of glue code, and manage the boring work of fetching and saving objects, and then notifying other parts of the system. In keeping with principle #3, we keep this in a separate layer. To satisfy principle #1, each use case is a separate command handler and has a clearly defined beginning and end. Every command is handled by exactly one command handler.

In general all command handlers will have the same structure:

Fetch the current state from our persistent storage.

Update the current state.

Persist the new state.

Notify any external systems that our state has changed.

We will usually avoid if statements, loops, and other such wizardry in our handlers, and stick to a single possible line of execution. Command handlers are boring glue code.
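Here's a minimal sketch of that shape (not the real listing); the Issue class and the issue_log abstraction are hypothetical stand-ins for the domain model and its persistence abstraction.

class Issue:
    def __init__(self, issue_id, reporter_name, reporter_email, description):
        self.issue_id = issue_id
        self.reporter_name = reporter_name
        self.reporter_email = reporter_email
        self.description = description

class ReportIssueHandler:
    def __init__(self, issue_log):
        self.issue_log = issue_log  # an abstraction over persistent storage

    def __call__(self, cmd):
        # 1. fetch state - nothing to fetch for a brand-new issue
        # 2. update state - build the new domain object
        issue = Issue(cmd.issue_id, cmd.reporter_name, cmd.reporter_email,
                      cmd.problem_description)
        # 3. persist state
        self.issue_log.add(issue)
        # 4. notify - no external systems need to hear about this yet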
Since our command handlers are just glue code, we won't put any business logic into them - they shouldn't be making any business decisions. For example, let's skip ahead a little to a new command handler:
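A sketch of the kind of handler meant here (MarkIssueAsResolved, the is_resolved flag, and the resolution field are illustrative names, reusing the hypothetical issue_log from the sketch above):

class MarkIssueAsResolvedHandler:
    def __init__(self, issue_log):
        self.issue_log = issue_log

    def __call__(self, cmd):
        issue = self.issue_log.get(cmd.issue_id)
        # Danger! a business rule has crept into our glue code
        if not issue.is_resolved:
            issue.mark_as_resolved(cmd.resolution)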

This handler violates our glue-code principle because it encodes a business rule: "If an issue is already resolved, then it can't be resolved a second time". This rule belongs in our domain model, probably in the mark_as_resolved method of our Issue object.
I tend to use classes for my command handlers, and to invoke them with the call magic method, but a function is perfectly valid as a handler, too. The major reason to prefer a class is that it can make dependency management a little easier, but the two approaches are completely equivalent. For example, we could rewrite our ReportIssueHandler like this:
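Something like this, as a sketch; how the issue_log dependency arrives (a plain first argument here, perhaps partially applied when we wire the system up) is my assumption rather than the original listing:

def report_issue(issue_log, cmd):
    issue = Issue(cmd.issue_id, cmd.reporter_name, cmd.reporter_email,
                  cmd.problem_description)
    issue_log.add(issue)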

If magic methods make you feel queasy, you can define a handler to be a class that exposes a handle method like this:

class ReportIssueHandler:
    def handle(self, cmd):
        ...

However you structure them, the important ideas of commands and handlers are:

Commands are logic-free data structures with a name and a bunch of values.

They form a stable, simple API that describes what our system can do, and doesn't depend on any implementation details.

Each command can be handled by exactly one handler.

Each command instructs the system to run through one use case.

A handler will usually do the following steps: get state, change state, persist state, notify other parties that state was changed.

Let's take a look at the complete system. I'm concatenating all the files into a single code listing for ease of grokking, but in the git repository I'm splitting the layers of the system into separate packages. In the real world, I would probably use a single python package for the whole app, but in other languages - Java, C#, C++ - I would usually have a single binary for each layer. Splitting the packages up this way makes it easier to understand how the dependencies work.

There's not a lot of functionality here, and our issue log has a couple of problems: firstly, there's no way to see the issues in the log yet, and secondly, we'll lose all of our data every time we restart the process. We'll fix the second of those in the next part.

Last year we held a Riemann workshop for our developers. The materials for the workshop were open-sourced but I never actually got around to promoting them in any way so they've just languished, like that bag of salad at the back of the fridge, slowly softening to a soggy senescence.

Since I've been writing about Rsyslog and Riemann, though, it seemed like a good time to resurrect my hard work and share it more widely. This workshop will walk you through monitoring both technical metrics and business metrics using Riemann and Collectd, plus storing and querying them with Influx, and graphing them in Grafana.

I'm hoping this will help someone else get started with this stack, which is extremely powerful, but has a steep learning curve.

To get started, just check out the repository and then from the root run docker-compose up. You should see an enormous and confusing wall of text.

There are a bunch of Markdown files in the /docs/ directory that will give you a quick rundown of how to configure and extend Riemann. If you've any questions or feedback, I'd be overjoyed to hear from you on twitter.

Back in December I said I was interested in replacing Logstash with Rsyslog, but that we needed a Riemann module to cover some of our existing functionality. Specifically we send metrics to Riemann from Logstash for three reasons:

We send internal metrics from Logstash to monitor how events flow through our log pipeline.

We forward all ERROR and CRITICAL logs to Riemann, which performs roll-up and throttling. Errors are forwarded to Slack, and Criticals are sent to Pagerduty.

We allow developers to send application metrics in their structured log.

After some leisurely hacking over the last few days, I've got a Riemann module that should cover all of our needs. We should be able to replace the internal metrics with impstats as discussed in part one of this series, so we won't touch on that here.

As before I've created a docker-compose playground on GitHub. This time the code is under the /custom-metrics directory.

Let's see how ERRORs and CRITICALs can be sent to Riemann.

We strongly encourage our developers to use structured logging in their applications; rather than simply logging text, we log JSON objects. This allows us to add more context into our logs, which makes it simpler to aggregate and search in Kibana.

As well as the obvious timestamp and textual message fields, we also include some information about the request we're handling, including a correlation id, some debugging information so we can work out where the log was written, and some data from the application.
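For illustration, a hypothetical record of that shape might look like this once serialised; the field names are illustrative, not our exact schema.

import datetime
import json

record = {
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "message": "order placed",
    "correlation_id": "3f2c9077-7a05-4b6e-9e1c-8c6f2d1a4b2d",  # ties together all logs for one request
    "module": "checkout.views",  # debugging info: where the log was written
    "line_no": 142,
    "order_id": 12345,  # data from the application
}
print(json.dumps(record))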

We also get logs from the Systemd Journal, and from third party and legacy applications, so firstly we want to normalise the logs into a common format. As before, we're going to use mmnormalize for this task. Our mmnormalize rulebase handles 3 kinds of logs: http access logs from nginx, structured json logs from our Python API, and log4j style logs from our processing app.

Firstly, we use mmnormalize to parse these different log formats into a json structure, with a tag so we know what kind of data we have. We're going to set a severity field on the json according to some simple rules.

For http logs, we'll use INFO as our severity level if we have a status code < 400, otherwise ERROR.
For plain text and json logs, we'll convert the log4j log-level name into a syslog-severity.

INFO -> info (6)

WARN -> warn (4)

ERROR -> error (3)

CRITICAL -> critical (2)

FATAL -> emergency (0)

(action type="mmnormalize" rulebase="/etc/rsyslog/rules.rb")
# If this is an http log, use the status code
if ($!event.tags contains "http" and $!status >= 400) then {
set $!severity = 3;
set $!body!@fields!levelname = "error";
#
# If we have a json log with a level name, map it back to a severity
} else if ($!body!@fields!levelname != "") then {
set $!severity = cnum(lookup("log4j_level_to_severity", $!body!@fields!levelname));
}

Now that we've got a normalised log level, we can use it to forward errors to Riemann:

By default omriemann uses the $programname as the "service" key in the Riemann message. The prefix configuration setting prepends a fixed value to the service. When we send errors through this configuration, we should see new metrics arriving in Riemann that look like this:

With a sprinkling of Riemann magic, we can set up Slack and Email notifications with an hourly summary of errors across our applications.

Several of our applications are sending metrics to Riemann via the logging pipeline. We're moving away from this technique because it creates a dependency between our ELK stack and our metrics: when Logstash stops stashing, we lose monitoring as well as logs. Despite the flaws of this practice, it's simple to explain to developers, and simple for QA to verify, since it works by writing json to stdout.

Here we have the same basic structure as our previous structured log, but with the inclusion of a new field _riemann_metric. This metric represents the response time for a single invocation of an API end point, and it includes the status code. We want to pass this structure to Riemann.
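A hypothetical example of such a record (the exact fields inside _riemann_metric are illustrative; with mode="single" they get mapped onto the fields of one Riemann event):

import json

record = {
    "timestamp": "2017-02-20T10:15:04Z",
    "message": "GET /orders/123 200",
    "correlation_id": "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
    "_riemann_metric": {
        "service": "api.response_time",
        "metric": 32,  # response time in milliseconds
        "state": "ok",
        "description": "status=200",  # carry the status code along with the event
    },
}
print(json.dumps(record))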

As in previous articles, we're calling omriemann with a subtree that contains our metric data. This time, however, we're using mode=single. This tells omriemann that every message contains a single metric, and that the fields of the json subtree should be mapped to the fields of the Riemann message. This is different to our previous examples, where we expect that each field of the subtree is a different metric.

As before, I've included a dummy web application that returns a hard-coded status code at each endpoint. This time, you can include a sleep time in milliseconds: curl localhost/404?sleep=100.

This will result in a metric log being sent to Rsyslog and through to Riemann so that we can alert and report on it.

Now that I can reproduce our Logstash/Riemann integration with Rsyslog, I can begin the process of building a new REK stack.

It's been a little longer than I expected, but I'm finally back and working on Rsyslog. Last time, we looked at how impstats could be used to generate internal metrics from Rsyslog, and how to alert on those metrics in Riemann.

This time I want to look at how the Dynstats module in Rsyslog can be used to do more interesting monitoring of our applications.

The Dynstats module provides a simple interface for counting events in Rsyslog. Any time we see a particular pattern in our logs, or we receive a log from a particular application, we can increment a counter.

For our first use-case, we're going to use this to record error counts for each of our applications. This is one of the simplest and most important metrics that you can capture since most production issues will cause an increase in error rate. When the "Error Rate > Threshold" alarm goes off, it could be a bad deploy, a misconfigured app, a bug in your code, or a failing disk on the database server. While you definitely want more granular metrics to help you diagnose problems, this broad approach is fantastic as an early warning system.

For this post, I've put together a docker-compose playground. If you want to play along, you'll need a recent version of Docker Compose. Check out the code, and take a look in the dynstats/dummy directory.

To demonstrate an error counter, I've written a quick and dirty Go program that randomly outputs either a notice- or error-level log at regular intervals. We can set the interval, log tag, and error percentage on the command line.

docker-compose run dummy -interval 100 -tag foo

We can use the dynstats functionality in Rsyslog to count the number of errors that happen in a 10-second window, and send the counter over to Riemann.

# We use the dyn_stats function to create a new metrics bucket.
# Each bucket can contain many counters.
dyn_stats(name="error_rate")

# When we receive an "error" log, we will add 1 to the counter
# for the application that raised the error. We use the $programname
# parsed from the syslog tag to identify the application.
if ($syslogseverity-text == "error") then {
    set $foo = dyn_inc("error_rate", $programname);
}

# Lastly, here's our stats processing.
# Parse the JSON that impstats forwards to us and spit it straight out to Riemann
ruleset(name="stats") {
    action(name="parse-stats" type="mmjsonparse")
    action(name="send-to-riemann" type="omriemann" subtree="!" server="riemann")
}

We can bring this up by running docker-compose up from the root of the demo project. After everything has started up, you should start to see regular metrics counting the number of errors.

We can extend our Riemann config so that we raise an error if a metric is above an absolute threshold. Open the file riemann/riemann.config. We're going to edit the config to add a new rule: when the error_rate is > 10, log an error. In practice, we would want to send an alert to an operator via email, Slack, or PagerDuty. Add the following line below the prn on line 13:

For our second use case, we're going to look at how dynstats and riemann can give us some monitoring of web applications. One of the strong points of rsyslog is its ability to rapidly parse large amounts of log text and pull out meaningful fields. Rsyslog can parse logs fast enough that we can use it for realtime monitoring of applications. Using mmnormalize, dynstats, and omriemann we can monitor our web applications in near-realtime and see a breakdown of HTTP status codes.

Returning to our rsyslog config, we've got the following block at the bottom of the file:

Here we configure a new bucket named "http response". Every log line that comes through is matched to an mmnormalize rulebase that tries to parse the log as an HTTP access log. If the line matches, we pull out some fields - status code, request url, response time etc. - and tag the log as "http". We use that tag to decide whether we have an http log and increment the counter for the status code. The rulebase contains 4 lines that look like this:

These rules set up a tag (http) and a pattern to match. We set up 4 rules to account for extra data at the end of the log line, and leading spaces.

The docker-compose project includes an nginx container that is configured to send logs to rsyslog. I've added some dummy endpoints to the configuration so that we can use curl to generate some web logs:

/200 returns a 200 OK

/300 returns a 301 Moved Permanently to /200

/400 returns a 404 Not Found

/500 returns a 500 Internal Server Error.

Curling these endpoints and waiting a few seconds, we should see some new metrics arrive in Riemann:

The code is still rough around the edges, but hopefully you can see how omriemann can make it easier to monitor applications given only their log files. Next time I want to talk about a third way to use omriemann: custom metrics in json.


In a previous post I said I was interested in replacing Logstash with Rsyslog in our ELK stack. I've been working on one of the outstanding items from that project - the ability to send metrics to Riemann. The Made.com fork of Rsyslog now has an alpha-quality version of the Riemann output module. Although there's still some missing functionality for more advanced scenarios, we can already do interesting things with the module as it stands.

For our first example, let's imagine that we want to trigger an alert if our log volume changes drastically. For example, we may stop receiving events if there is a networking problem; or we might start writing many more events than usual if there is an error in one of our systems.

We can achieve this with the rsyslog impstats module and the riemann output module.

Firstly we need to configure rsyslog to record its statistics.

# We'll need to load the riemann module before we can use it
module(load="omriemann")

# impstats generates internal metrics for us on an interval
module(load="impstats"
       # use a json format
       format="cee"
       # send metrics every 10 secs
       interval="10"
       # send a rate-of-change event, like a collectd counter.
       resetCounters="on"
       # pass these messages to the "stats" ruleset for processing
       ruleset="stats")

module(load="mmjsonparse")

ruleset(name="stats") {
    # parse the json message
    action(name="parse-stats" type="mmjsonparse")
    # pass it to riemann
    action(name="riemann-output" type="omriemann"
           # look for metrics in the root json object
           subtree="!"
           # and forward them to MYSERVER
           server="MYSERVER")
}

At Made.com we're unabashed fans of the ELK stack, and I spend a decent amount of my time thinking about how we parse, ship, and store logs.

Currently, we use an ELK stack setup that looks like this:

Rsyslog receives logs from our Docker containers, via the syslog docker logging driver, and from the rest of the system via the journal. We tag the logs at this point (as application, http, or system logs), and normalise them all to the same json format.

We ship those json logs to an Elasticache Redis instance, and consume them in Logstash. Finally, Logstash routes the logs to the correct Elasticsearch index based on their tags.

This is more-or-less a best-practice setup for ELK but Logstash, honestly, is my least favourite part of the stack. Recently I've had some conversations on the Rsyslog mailing list about whether we can replace Logstash entirely, and just use Rsyslog.

Why might we want to do that?

RSyslog is really fast.
Logstash does a reasonable job of processing our current log volumes, but RSyslog is a couple of orders of magnitude faster for most use cases. In particular, mmnormalize is lightning fast compared to Logstash's grok. That means I can run smaller, cheaper ingest boxes.

Logstash reliability issues.
Rsyslog is definitely not a bug-free piece of software, but I know my way around the code base, and I've inadvertently become a maintainer of the Redis module. We've had repeated issues where Logstash loses its connection to either Redis or Riemann, and failed messages then block the entire pipeline.

Simplicity
Given that RSyslog has disk-backed queues, maybe we don't need to use Redis at all. If we can reliably buffer and forward messages to a central RSyslog server, and then into Elasticsearch, we can drop a component from the pipeline.

To test whether that's feasible, I want to build a couple of prototype architectures. Firstly, I want to try our basic setup, but replacing Logstash directly with RSyslog.

Secondly I'd like to try skipping Redis altogether and going directly to a central rsyslog server using relp. I suspect this will have better throughput, and lower running costs, without loss in reliability.

For all of this to really work well, I'm going to have to roll up my sleeves and write some code.

We'll need GeoIP tagging for our http logs. I've seen this done with both mmexternal and table lookup functionality.

I'll need to hack out a Redis input plugin for RSyslog.

We forward certain logs (anything with a _metric field, plus ERRORs and CRITICALs) to Riemann so that we can monitor them and use them for alerting. We'll need a new Riemann output module in rsyslog.

The chief maintainer of RSyslog is interested in writing or harvesting a component that works like Elastic's FileBeats - a simple, lightweight forwarder that just reads files and sends logs over the network reliably.

I'm going to open source all the prototypes, including Ansible playbooks and Docker files, so that you can play along at home or in the cloud. If you've got any suggestions for other reference architectures or use cases you'd like to see from the REK stack, get in touch on the mailing list.

The testing quadrants model helps to ensure that all important test types and test levels are included in the development lifecycle. This model also provides a way to differentiate and describe the types of tests to all stakeholders, including developers, testers, and business representatives.

In the testing quadrants, tests can be business (user) or technology (developer) facing. Some tests support the work done by the Agile team and confirm software behaviour. Other tests can verify the product. Tests can be fully manual, fully automated, a combination of manual and automated, or manual but supported by tools.

The four quadrants are described below.

The quadrants are not in any specific order, and there are no hard and fast rules about what goes in which quadrant. Teams think through them as they do release and iteration planning, so the whole team starts out by thinking about testing.

WHAT DOES EACH QUADRANT MEAN

The quadrants largely help the agile team to plan their testing for a project and make sure they have all the resources they need to accomplish it successfully.

Q1 - Quadrant 1 is a developer-led effort and it is technology-facing; the effort applied in this quadrant helps to support the scrum team. This involves getting complete structural and conditional unit-test coverage, plus component-test coverage as well.
This helps us answer the question: are we building the product right? NOTE: without this quadrant, every other quadrant is less effective and everything gets harder.

Q2 - Quadrant 2 covers business-facing tests that also support the scrum team.
This involves capturing business examples and collaborating with customers to get more clarity when asserting the functionality: story testing, API testing, and UI-based testing using automated tools like Selenium.
This helps us answer the question: are we building the right product?

Q3 - Quadrant 3 involves recreating the actual user experience with realistic scenarios.
This relates to regular sprint demos or informal demos, involving the real users or customer support teams, to get early and quick feedback.

Q4 - Quadrant 4
This is technology-facing but involves the use of tools to assist in understanding system behaviour.
Memory management, data migration, recovery, test environments, etc. are covered under this quadrant, and stories are written for it.
The popular 'ilities' (modifiability, usability, adaptability, reusability, security, reliability, availability, etc.) must also be considered.

HOW TO USE IT IN AGILE?
Test planning is not an individual effort and must be acknowledged by the whole team at release planning or iteration planning.
At release planning, the team must understand and agree to a high-level test strategy which encompasses unit testing/TDD, ATDD, functional and non-functional testing, etc., considering all quadrants.
Identify who should own or perform each type of testing and how best to accomplish it; shared responsibility and a focus on collaboration are the essence of this step. The team gets to highlight the requirement for any special skills or tools.
At each iteration planning or execution, tests from any quadrant can be created as stories and performed.

No story can be considered done until testing is completed: unit testing complete, automated regression tests run successfully, and customer requirements confirmed to be captured. Doneness in all quadrants achieved!