Introducing Telemetry

2019-04-30
by Arkadiusz Gil

We need monitoring

“Let it crash” has been a long-running mantra in the BEAM world. While it might be misinterpreted, there is some merit to it - our software will do unexpected things, and more often than not, the only viable choice is to crash and start over. But simply restarting parts of our application is not sufficient - we should understand what caused the error and handle it properly in future releases. We also need to know that the error occurred at all and how it affected our customers! To enable both of these things, we need a way to introspect and analyze our app’s behaviour at runtime - we need monitoring.

Telemetry is a new open source project aiming at unifying and standardising how the libraries and applications on the BEAM are instrumented and monitored. It is a suite of libraries, developed over the last couple of months by Erlang Solutions in collaboration with the Elixir core team and other contributors. But with existing metrics projects such as exometer and folsom, which have both served the community well over the years, why would we need yet another solution? It might start to feel like the situation in the popular comic strip.

By design, Telemetry does not try to cover every use case out there. Rather, it provides a small and simple interface for instrumenting your code, and allows anyone to hook into the instrumentation points at runtime. This enables modularity - projects like Ecto or Plug only need to rely on the core library, and engineers building applications can use the data exposed by those libraries for monitoring their systems.

Let’s dive a little bit deeper into the rationale behind Telemetry and its design.

The tree has fallen in the forest..

At the core of Telemetry lies the event. The event indicates that something has happened: an HTTP request was accepted, a database query returned a result, or the user has signed in. Each event can have many handlers attached to it, each performing a specific action when the event is published. For example, an event handler might update a metric, log some data, or enrich the context of a distributed trace.
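To make this concrete, here is a minimal sketch of attaching a handler and publishing an event. The [:demo, :request, :done] event, the Demo.Handlers module and the "demo-forwarder" handler id are made up for illustration; only :telemetry.attach/4 and :telemetry.execute/3 come from the library. The sketch assumes a standalone script run with a recent Elixir and Telemetry release.

```elixir
# Assumption: run as a standalone script (Elixir 1.12+). In a Mix project,
# list :telemetry in mix.exs instead of calling Mix.install/1.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Demo.Handlers do
  # Every handler receives the event name, the measurements, the metadata
  # and the config given when the handler was attached.
  def forward(event, measurements, metadata, %{reply_to: pid}) do
    send(pid, {:handled, event, measurements, metadata})
  end
end

# Attach a handler to the hypothetical [:demo, :request, :done] event.
:ok =
  :telemetry.attach(
    "demo-forwarder",
    [:demo, :request, :done],
    &Demo.Handlers.forward/4,
    %{reply_to: self()}
  )

# Publish the event: measurements carry numbers, metadata carries context.
:telemetry.execute([:demo, :request, :done], %{duration: 42}, %{path: "/users"})

# Collect what the handler forwarded back to this process.
result =
  receive do
    {:handled, event, measurements, metadata} -> {event, measurements, metadata}
  after
    1_000 -> :timeout
  end

IO.inspect(result)
```

The code that publishes the event knows nothing about the handler; the two only share the event name.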

This becomes extremely convenient when libraries emit Telemetry events. Usually, we don’t write our own web framework or database client - we use an existing package. The library can provide the instrumentation data via events, and our code can handle it in a way that suits our needs. The only thing we need to do is to implement the handlers and attach them when our application starts!

For example, since version 3.0.0, Ecto publishes an event on each database query. We could write a handler which logs a warning whenever the total query time exceeds 500ms:

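defmodule MyApp.Monitoring do
  require Logger

  # Logs a warning whenever a query takes longer than 500 ms in total.
  def log_slow_queries(_event, measurements, metadata, _config) do
    # The duration arrives in native time units, so convert it before comparing.
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_ms > 500 do
      Logger.warn("Query #{inspect(metadata.query)} completed in #{total_ms}ms")
    end
  end
end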

Here measurements and metadata are properties of a single event - each Telemetry event carries these values.

This handler needs to be attached at runtime, for example when our application starts. Assuming that our Ecto repo is called MyApp.Repo, we attach the handler using the code below:
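A minimal sketch of that call follows. The handler id "my-app-slow-query-logger" is a hypothetical name, the empty map stands in for whatever config the handler needs, and [:my_app, :repo, :query] assumes the default event name that Ecto derives from the MyApp.Repo module:

```elixir
# Assumptions: the handler id is a hypothetical name, and [:my_app, :repo, :query]
# is the default event name Ecto derives for a repo called MyApp.Repo.
:ok =
  :telemetry.attach(
    "my-app-slow-query-logger",
    [:my_app, :repo, :query],
    &MyApp.Monitoring.log_slow_queries/4,
    %{}
  )
```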

We specify the name of our handler (which needs to be unique across the system), the event we are attaching it to, the function to be invoked each time the event is emitted, and the handler config which is passed as the last argument to our handler on every invocation.

..and there was no one around to hear it

Because Telemetry is designed to have a small performance footprint, there is almost no cost associated with including it even in the most popular Elixir libraries - it is already used by Ecto and Plug, and it is coming to Phoenix soon. Telemetry requires only a single ETS lookup when an event is published, and all handlers are executed synchronously in the process emitting the event, which means that there are no bottlenecks or single points of failure in the whole library.
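The synchronous, in-process execution is easy to observe. The sketch below, which uses a made-up [:demo, :ping] event and Demo.WhoRuns module, emits an event from a spawned process and checks which process actually ran the handler. It assumes a standalone script with a recent Telemetry release.

```elixir
# Assumption: run as a standalone script (Elixir 1.12+).
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Demo.WhoRuns do
  # Reports the pid of the process in which the handler is executing.
  def report(_event, _measurements, _metadata, %{reply_to: pid}) do
    send(pid, {:handler_ran_in, self()})
  end
end

:ok =
  :telemetry.attach(
    "demo-who-runs",
    [:demo, :ping],
    &Demo.WhoRuns.report/4,
    %{reply_to: self()}
  )

# Emit the event from a separate process and remember that process's pid.
emitter = spawn(fn -> :telemetry.execute([:demo, :ping], %{count: 1}, %{}) end)

handler_pid =
  receive do
    {:handler_ran_in, pid} -> pid
  after
    1_000 -> :timeout
  end

IO.inspect(handler_pid == emitter, label: "handler ran in the emitting process")
```

One consequence of this design is that a slow handler slows down the process that emits the event, so handlers are best kept short.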

Ecosystem

Apart from the core Telemetry library, which provides the interface for emitting and handling events, we have built additional tools addressing common use cases related to monitoring.

Telemetry.Poller allows you to perform measurements periodically and emit them as Telemetry events. When you include the library in your project, the default Poller process is started, publishing measurements related to the Erlang VM, like memory usage and the length of run queues.
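As a rough sketch of how this looks in practice, the script below attaches a handler to the VM memory event and starts an additional poller with a short period. It assumes a recent telemetry_poller release, in which the poller is started through the Erlang module :telemetry_poller and the built-in :memory measurement is published as the [:vm, :memory] event; the Demo.VmMemory module and the handler id are hypothetical.

```elixir
# Assumption: run as a standalone script (Elixir 1.12+); versions are illustrative.
Mix.install([{:telemetry, "~> 1.0"}, {:telemetry_poller, "~> 1.0"}])

defmodule Demo.VmMemory do
  # Forwards the total VM memory (in bytes) to the process named in the config.
  def handle_event([:vm, :memory], measurements, _metadata, %{reply_to: pid}) do
    send(pid, {:vm_memory, measurements.total})
  end
end

:ok =
  :telemetry.attach(
    "demo-vm-memory",
    [:vm, :memory],
    &Demo.VmMemory.handle_event/4,
    %{reply_to: self()}
  )

# Start an extra poller that samples VM memory every 100 ms.
{:ok, _poller} = :telemetry_poller.start_link(measurements: [:memory], period: 100)

total_bytes =
  receive do
    {:vm_memory, total} -> total
  after
    2_000 -> :timeout
  end

IO.inspect(total_bytes, label: "total VM memory in bytes")
```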

Telemetry.Metrics provides a bunch of functions for declaring metrics based on the events. For example, the definition counter("my_app.repo.query") means that we want to count how many database queries were made by Ecto. Apart from the counter, Telemetry.Metrics also defines some other aggregations: sum, last value (sometimes referred to as gauge) and distribution. It also supports multi-dimensional metrics via tags and unit conversions.
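A few such definitions are sketched below. The sketch assumes a recent Telemetry.Metrics release, in which the last dot-separated segment of a metric name selects the measurement and the preceding segments name the event, so the names carry one more segment than the event itself. The :source tag assumes that Ecto includes the queried table in the event metadata.

```elixir
# Assumption: run as a standalone script (Elixir 1.12+); version is illustrative.
Mix.install([{:telemetry_metrics, "~> 1.0"}])

import Telemetry.Metrics

metrics = [
  # Count how many [:my_app, :repo, :query] events were published.
  counter("my_app.repo.query.count"),

  # Sum the total query time per table, converted from native units to milliseconds.
  sum("my_app.repo.query.total_time", tags: [:source], unit: {:native, :millisecond}),

  # Keep only the most recent total VM memory reading.
  last_value("vm.memory.total", unit: :byte)
]

IO.inspect(metrics)
```

These definitions are plain data. Nothing is measured until they are handed to a reporter.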

After the metrics are declared, they need to be fed to the reporter, which attaches relevant event handlers and forwards metrics to the monitoring system of choice at runtime. Currently, there are two reporters available on Hex: one for Prometheus and one for StatsD.
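To illustrate what a reporter does, here is a deliberately tiny, hypothetical one. It is not how the Prometheus or StatsD reporters are implemented: it only understands counters and keeps the values in an Agent instead of forwarding them anywhere. Demo.ToyReporter and the [:demo, :request, :done] event are made-up names, and the sketch assumes recent Telemetry and Telemetry.Metrics releases.

```elixir
# Assumption: run as a standalone script (Elixir 1.12+); versions are illustrative.
Mix.install([{:telemetry, "~> 1.0"}, {:telemetry_metrics, "~> 1.0"}])

defmodule Demo.ToyReporter do
  # A toy reporter: counter values live in an Agent keyed by metric name.
  def start_link(metrics) do
    {:ok, agent} = Agent.start_link(fn -> %{} end, name: __MODULE__)

    # Attach one handler per counter metric; other metric types are skipped here.
    for %Telemetry.Metrics.Counter{} = metric <- metrics do
      :ok =
        :telemetry.attach(
          {__MODULE__, metric.name},
          metric.event_name,
          &__MODULE__.handle_event/4,
          metric
        )
    end

    {:ok, agent}
  end

  # The metric definition arrives as the handler config.
  def handle_event(_event, _measurements, _metadata, metric) do
    Agent.update(__MODULE__, &Map.update(&1, metric.name, 1, fn n -> n + 1 end))
  end

  def counts, do: Agent.get(__MODULE__, & &1)
end

{:ok, _reporter} =
  Demo.ToyReporter.start_link([Telemetry.Metrics.counter("demo.request.done.count")])

:telemetry.execute([:demo, :request, :done], %{duration: 10}, %{})
:telemetry.execute([:demo, :request, :done], %{duration: 20}, %{})

counts = Demo.ToyReporter.counts()
IO.inspect(counts)
```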

What’s next?

Telemetry core has reached a stable API, and now is the right time to include it in libraries, so that their users can benefit from the exposed instrumentation data. But we cannot do that on our own - Telemetry is a community-run project, and without contributors it won’t be able to flourish. Telemetry has had an active group of contributors from the very early days of the project, and we would love you to be part of its growth and adoption too. If you want to get involved, you can integrate your own library with Telemetry, build a reporter for Telemetry.Metrics, or give us your feedback on APIs and documentation.

Currently, we’re improving the performance of the core, as well as extending and polishing the Poller and the Metrics. We are also working on making the existing reporters more performant and stable.

The next big thing we have planned is a metrics dashboard for Phoenix - imagine generating a new Phoenix project, and having a basic dashboard with metrics served by your endpoint, without setting up any external systems. Telemetry allows us to do this and much more in the near future!

Upcoming meetup

If you would like to learn more about Telemetry, our colleague and core Telemetry contributor Arkadiusz will be speaking at our London BEAM Meetup on Wednesday 8th May. More details and RSVP here