It's Friday evening and after a long day, you check the code you were working on into git, have the commit reviewed, accepted, merged, and sync it live. All seems right with the world. You let out a sigh of relief, back your chair away from your desk, and walk away in a satisfied mist of ease. In fact, you're excited because you're going to a concert with your friends tonight.

But then, twenty minutes after you leave, it begins. Errors. Fatal errors. And you're not around to know. So what happens?

In dt, we look out for one another. One of the ways we do this is through an error logging service we've built called LOGR. If you read our article on DRE, you'll remember we mentioned a variety of functions we use to generate messages. One of these is called dre_alert().

For example, you set up an alert in your code that gets triggered when a draft in Sta.sh Writer can't be loaded:

dre_alert("Failed to load draft");

And it seems to be happening every 5 seconds. Oops. It turns out the error is caused by a simple typo. Fortunately, everyone on our team gets e-mailed when severe errors like this are logged, and since we're situated all over the world, someone is almost always around to take responsibility for problems and emergencies. So yury-n kindly steps in, fixes the problem, and everyone is happy again, yay.

How does LOGR work?

We store both a message and a "payload" (any additional debugging information you pass along), together with host and backtrace information, to help figure out the chain of events that led to the alert. All of this information gets sent to a Redis server. We rotate the data using a round-robin algorithm, similar to RRDtool, to keep memory usage constant even though we store millions of different messages within LOGR.
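The constant-memory rotation can be pictured as a fixed-size ring buffer that drops the oldest entries as new ones arrive. Here's a minimal Python sketch of that idea (the class, fields, and hostname are hypothetical illustrations, not LOGR's actual code, and a plain deque stands in for Redis):

```python
import time
from collections import deque

class AlertLog:
    """Toy stand-in for a round-robin alert store: a fixed-size
    ring buffer that rotates out the oldest entries once full,
    so memory stays constant no matter how many alerts arrive."""

    def __init__(self, max_entries=1000):
        self.entries = deque(maxlen=max_entries)  # oldest entries rotate out

    def alert(self, message, payload=None):
        # Store the message plus its debugging payload, host, and
        # timestamp; the real system also captures a backtrace.
        self.entries.append({
            "message": message,
            "payload": payload or {},
            "host": "web42.example",  # hypothetical hostname
            "time": time.time(),
        })

log = AlertLog(max_entries=3)
for i in range(5):
    log.alert("Failed to load draft", {"draft_id": i})
print(len(log.entries))  # prints 3: the two oldest entries rotated out
```

The deque's `maxlen` does the rotation automatically; a Redis-backed version could get the same effect by trimming the list after each push.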

Is it just for emergencies?

Nope. We use it for several things, often recognized problems that aren't emergencies. Every time an alert fires, the incident gets grouped with other alerts of the same type, so we can get a better picture of how frequently it's occurring, whether there are any patterns, and, using the payloads available, whether there's a particular reason it's happening.
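The grouping step might be sketched like this (a simplified illustration with made-up incident data, not LOGR's actual schema): incidents sharing the same message are bucketed together, which makes frequency obvious at a glance.

```python
from collections import defaultdict

# Hypothetical incident records: (message, payload).
incidents = [
    ("Failed to load draft", {"draft_id": 1}),
    ("DB timeout",           {"server": "db3"}),
    ("Failed to load draft", {"draft_id": 7}),
    ("Failed to load draft", {"draft_id": 9}),
]

# Bucket incidents of the same type together, keeping each
# payload around so patterns can be inspected later.
groups = defaultdict(list)
for message, payload in incidents:
    groups[message].append(payload)

for message, payloads in groups.items():
    print(f"{message}: {len(payloads)} occurrences")
```

With the payloads collected per group, spotting that every "Failed to load draft" incident involves the same draft (or the same server) becomes a simple scan.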

We can tag our alerts however we want. So say we have 5 alerts set up for deviantART muro. We can tag each accordingly and then search for that tag later to find alerts related to that part of the site. Cool, right?
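Tag-based lookup reduces to a simple filter. A sketch, with hypothetical alert names and tags (not real LOGR data):

```python
# Hypothetical alert definitions, each carrying free-form tags.
alerts = [
    {"name": "brush engine error", "tags": ["muro", "canvas"]},
    {"name": "draft load failure", "tags": ["stash-writer"]},
    {"name": "layer sync timeout", "tags": ["muro"]},
]

def find_by_tag(alerts, tag):
    """Return every alert carrying the given tag."""
    return [a for a in alerts if tag in a["tags"]]

muro_alerts = find_by_tag(alerts, "muro")
print([a["name"] for a in muro_alerts])
# prints ['brush engine error', 'layer sync timeout']
```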

We can also set thresholds on alerts. Once a threshold is established, if the incident occurs more than the allowed number of times, an e-mail is sent to a particular developer.

For example, in this particular alert, $allixsenos is set up to receive an e-mail any time this particular alert happens more than 40 times in one hour:
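The threshold check can be sketched as a sliding one-hour window (the class and the notify callback are hypothetical illustrations of the idea, not LOGR's implementation):

```python
from collections import deque

ONE_HOUR = 3600  # seconds

class ThresholdAlert:
    """Fires a notification when an alert trips more than
    `threshold` times within the last hour."""

    def __init__(self, threshold, notify):
        self.threshold = threshold
        self.notify = notify        # e.g. e-mails a particular developer
        self.timestamps = deque()   # occurrence times within the window

    def record(self, now):
        self.timestamps.append(now)
        # Drop occurrences older than one hour.
        while self.timestamps and self.timestamps[0] <= now - ONE_HOUR:
            self.timestamps.popleft()
        if len(self.timestamps) > self.threshold:
            self.notify(len(self.timestamps))

fired = []
alert = ThresholdAlert(threshold=40, notify=fired.append)
for i in range(41):
    alert.record(now=1000 + i)  # 41 hits within the same hour
print(fired)  # prints [41]: the 41st hit pushed us past the threshold
```

A real deployment would debounce the notification so a noisy alert doesn't itself flood the developer's inbox.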

Mousing over the chart, we can see a bit more information, like the exact number of occurrences that happened at different points in time.

And, as mentioned, we also get a nice payload of information:

With LOGR, we're able to get an idea of which of these alerts are exceeding their thresholds, which are happening frequently, which alerts are most recent, and when they were first seen. This helps us easily rule out which events were transitory or already resolved and which are continuing to happen and need fast attention.

We also generate histograms so that we can see a breakdown of how the error is occurring. This is available for any piece of data stored so that you can see different variations that occur. For example, we could see how the error is occurring, spread out across different servers:
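A per-field histogram over the stored payloads might be computed like this (simplified, with hypothetical incident data and field names):

```python
from collections import Counter

# Hypothetical stored incidents, each with a payload.
incidents = [
    {"message": "Failed to load draft", "payload": {"server": "web1"}},
    {"message": "Failed to load draft", "payload": {"server": "web2"}},
    {"message": "Failed to load draft", "payload": {"server": "web1"}},
    {"message": "Failed to load draft", "payload": {"server": "web1"}},
]

def histogram(incidents, field):
    """Count how often each value of `field` appears in the payloads."""
    return Counter(i["payload"].get(field) for i in incidents)

by_server = histogram(incidents, "server")
print(by_server.most_common())
# prints [('web1', 3), ('web2', 1)]: web1 dominates this error
```

Because the counting is generic over any payload field, the same function breaks an error down by server, user agent, or whatever else the payload happens to carry.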

But again, this isn't limited to just servers. It's available for any data stored in the payload.

We're always looking for ways to improve LOGR to help us spot problems with the site in a timely fashion, minimizing the impact on deviants so you can continue to enjoy the site as usual.

Whoa, I like the transparency. It's so great to get a view of how you guys run things at the helm. Nicely written; it was easy to understand and a pleasure to read. Makes me see things from a different perspective.

You mentioned that the team has your back while you're out on the town, but it also appears that with your savvy systems, deviantART itself has your back too, alerting you to problems as they occur, something even a team of thousands surely couldn't handle on a site this big and complex. It's good to know that a complex site has some complex backing to keep it ticking every second of every day!

Wow, very interesting read. I suppose it sheds a bit more light on the sheer complexity of this website and what kinds of systems are necessary to help keep things in order, or at least are useful to do so.

I wouldn't call the site complex as much as massive. It's really easy to spin up a new page to do some new thing, or create a new library for some new feature and use it everywhere. It's just that there are thousands of new things and thousands of new libraries (we're pretty good with our review process at preventing the same function/class from being littered throughout different parts of the code, too).

It's interesting, most of us spend our days in one specific part of the site or another, so for us there's not really a need to know the entire site's comings and goings, just the part we deal with.