Reliable Insights

As of Prometheus 2.12.0 there's a new feature to help find problematic queries.

While Prometheus has many features to limit the potential impact of expensive PromQL queries on your monitoring, it's still possible that you'll run into something they don't cover, or that you haven't provisioned sufficient resources.

As of Prometheus 2.12.0 any queries which were running when Prometheus shuts down will be printed on the next startup. As all running queries are cancelled on a clean shutdown, in practice this means that they'll be printed only if Prometheus is OOM killed or similar. On the next startup you'll see log lines like these:

Here there was only one query running when Prometheus died unnaturally (which I had to go out of my way to make slow so it'd show up), so it's a likely culprit if Prometheus ran out of resources. However, there could have been other queries running that had only trivial usage, or something else entirely could have triggered the termination, so a query being in this list doesn't automatically mean it's a problem.
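To make the idea concrete, here is a toy Python sketch of a record of in-flight queries that survives a crash. It is not Prometheus's actual implementation. The class name `ActiveQueryLog`, the JSON layout, and the file name are all assumptions made purely for illustration. Each query is written to disk when it starts and removed when it finishes, so anything left in the file at startup must have been running when the previous process died.

```python
import json
import os
import tempfile


class ActiveQueryLog:
    """Toy record of in-flight queries (hypothetical, not Prometheus code)."""

    def __init__(self, path):
        self.path = path
        # Anything still on disk was in flight when the last process died.
        self.unfinished = self._load()
        self.active = {}
        self._next_id = 0
        self._flush()  # start this run with an empty record

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return list(json.load(f).values())

    def _flush(self):
        # Rewrite the whole file and sync it so it survives a hard kill.
        with open(self.path, "w") as f:
            json.dump(self.active, f)
            f.flush()
            os.fsync(f.fileno())

    def start(self, query):
        self._next_id += 1
        query_id = str(self._next_id)
        self.active[query_id] = query
        self._flush()
        return query_id

    def finish(self, query_id):
        self.active.pop(query_id, None)
        self._flush()


path = os.path.join(tempfile.mkdtemp(), "queries.active")
log = ActiveQueryLog(path)
done = log.start("up")
log.start("sum(rate(http_requests_total[5m]))")
log.finish(done)

# Simulate an OOM kill: the process vanishes with one query still running.
restarted = ActiveQueryLog(path)
print(restarted.unfinished)  # ['sum(rate(http_requests_total[5m]))']
```

A clean shutdown would call `finish` for every cancelled query, leaving the record empty, which is why only an unclean exit leaves anything to report.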

If the issue was an overly expensive query and you can't just throw resources at it, some flags you may wish to tweak are --query.max-concurrency, --query.max-samples, and --query.timeout.
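As a rough illustration of how those three flags fit together, the small Python helper below assembles a Prometheus command line with explicit query limits. The helper name `prometheus_command` is hypothetical. The default values shown are what I understand the Prometheus defaults to be, so treat them as an assumption and check `prometheus --help` for your version. The tightened values in the example are arbitrary, not recommendations.

```python
import shlex


def prometheus_command(max_concurrency=20, max_samples=50_000_000,
                       timeout="2m", binary="prometheus"):
    """Build a Prometheus command line with explicit query limits.

    Defaults are assumed to mirror Prometheus's own; verify them locally.
    """
    args = [
        binary,
        # Maximum number of queries executing at once.
        f"--query.max-concurrency={max_concurrency}",
        # Maximum samples a single query may load into memory.
        f"--query.max-samples={max_samples}",
        # How long a query may run before being aborted.
        f"--query.timeout={timeout}",
    ]
    return shlex.join(args)


# Example: halve the assumed defaults on a memory-constrained host.
print(prometheus_command(max_concurrency=10,
                         max_samples=25_000_000,
                         timeout="1m"))
```

Lowering `--query.max-samples` and `--query.max-concurrency` bounds the memory that queries can consume together, while `--query.timeout` bounds how long any one query can hold onto those resources.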