Timeline of the incident

10:51 Redis failover and malfunction

Because we had seen a trend of increasing memory usage in Redis, which stores monitoring data, we carried out an operation to build a replica in order to scale up Redis. However, building the replica took time, and while Redis was unresponsive it was detected as a node failure by our clustering software (keepalived), which triggered an unintended failover.

As a result, the application servers could not connect to Redis and were unable to respond properly.

10:55 Recovery and continued failure

We switched back over to the appropriate Redis node, which temporarily restored the application.

However, the application remained unstable. Combined with factors that had arisen before the incident, such as network errors and growing Redis memory usage, application server latency deteriorated and timeouts occurred. We then restarted the application servers, but the symptoms did not improve.

We have determined this to be the cause of the prolonged failure. The details surrounding this cause remain unconfirmed, and further investigation was eventually discontinued.

11:00-14:50 Incident response

The following specific actions were taken in response.

We detected organizations posting inappropriate metrics and cut off their requests.

Following these actions, we temporarily switched to maintenance mode, and after confirming the situation internally, we gradually began accepting external requests again.

15:20 Recovery confirmation

After confirming that application server responses were stable, that metric retransmission from mackerel-agent had completed, and that the delay in reflecting metric data in the TSDB was resolved, we declared the service restored.

Cause of the incident

This incident was caused by an unexpected failover that accompanied a maintenance operation on Redis. However, we have not been able to accurately identify the reasons behind the subsequent prolonged restoration. The following is a theory that was raised during our retrospective.

When a specific access pattern persists, waits or lock contention on a thread pool or connection pool may cause latency to deteriorate in a Scala application.
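As a rough illustration of the theory (not the actual Mackerel code), a bounded pool can be saturated so that additional requests queue behind slow calls and observed latency multiplies:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# A deliberately small pool stands in for the application's thread/connection pool.
pool = ThreadPoolExecutor(max_workers=2)

def slow_call():
    # Stands in for a blocking downstream call (e.g. Redis during a failover).
    time.sleep(0.2)
    return "ok"

start = time.monotonic()
# Submit more work than the pool can run concurrently; the excess queues up.
futures = [pool.submit(slow_call) for _ in range(8)]
results = [f.result() for f in futures]
elapsed = time.monotonic() - start

# 8 tasks over 2 workers run in ~4 sequential batches, so end-to-end latency
# is roughly 4x that of a single call even though each call is unchanged.
print(f"{elapsed:.2f}s for 8 calls that each take 0.2s")
```

The same effect appears with connection pools: once every slot is held by a slow request, even cheap requests pay the queueing delay, which matches the observed latency deterioration.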

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unsuccessful in reproducing an identical situation.

Future support

As previously mentioned, we have been unable to pin down the details surrounding the cause of this incident, but we believe we can prevent similar long-term failures in the future by implementing the following countermeasures.

Reviewing Redis failover behavior (implemented)

To address the failover malfunction that triggered this incident in the first place, we increased the number of consecutive keepalived health-check failures required before a Redis failover, so that a short-lived load increase no longer triggers one.
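In keepalived terms, this kind of change raises the failure count on the health-check script. A minimal sketch of the idea (the script path, interval, and counts here are hypothetical, not our actual configuration):

```conf
vrrp_script chk_redis {
    script "/usr/local/bin/check_redis.sh"  # hypothetical Redis health-check script
    interval 2   # run the check every 2 seconds
    fall 5       # require 5 consecutive failures before marking the node down
    rise 2       # require 2 consecutive successes before marking it healthy again
}
```

Raising `fall` trades slower failover for immunity to brief unresponsiveness, such as the replica-build operation that started this incident.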

Reinforcing the application (implemented)

We added more headroom to application performance. Specifically, the following was done.

- Scaled up Redis

- Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

As a fundamental response to the amount of memory used in Redis, we reworked the application to efficiently store only the necessary monitoring data in Redis, which reduced Redis memory usage.
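The gist of that rework can be sketched with an in-memory stand-in for Redis (the retention count and key scheme here are hypothetical): capping each metric series at a fixed number of recent datapoints keeps memory usage flat no matter how long the agent posts.

```python
from collections import deque

# In-memory stand-in for Redis: one bounded buffer per (host, metric) series.
MAX_POINTS = 5  # hypothetical retention; Mackerel's actual policy is not public
store = {}

def save_metric(host, name, timestamp, value):
    series = store.setdefault((host, name), deque(maxlen=MAX_POINTS))
    series.append((timestamp, value))  # oldest points fall off automatically

# Post 20 datapoints for one metric.
for t in range(20):
    save_metric("host-1", "loadavg5", t, 0.1 * t)

# Only the newest MAX_POINTS datapoints are retained.
print(len(store[("host-1", "loadavg5")]))  # → 5
```

With real Redis the same bound can be kept with `LPUSH` plus `LTRIM` on a list per series; the point is that retention is enforced at write time rather than growing without limit.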

Counteracting improper requests (implemented)

We created a feature to quickly block organizations making improper requests.

Going forward, we will continue to improve the accuracy of improper-request detection, and we are also considering additional features such as API rate limiting.

Reinforcing application monitoring

We plan to strengthen the application server’s internal process monitoring and prepare a system that can respond before another similar incident occurs.

Summary

We understand that this incident and its extended duration were an inconvenience, and we sincerely apologize. As we work hard to prevent recurrence, we will also do our best to contain the impact should a similar incident occur.

Billable targets are determined using the conversion 1 Distribution = 1 Host. Additionally, since CloudFront is a global service, integration with CloudFront is possible regardless of the region selected in the AWS integration settings.

If you use CloudFront, be sure to enable this feature and give it a try. We welcome your feedback!

The mackerel-check-plugins package has been updated to v0.23.0. With this update, we’ve added check-aws-cloudwatch-logs, a check plugin for AWS CloudWatch Logs! For details such as usage, check out the help page below.

This feature was co-developed together with iret Inc., a development firm with abundant AWS operational knowledge. iret Inc. offers the cloudpack service, which provides fully managed services for a variety of AWS products. iret, thank you for all your help!

Webhook can now be registered with notification channel APIs

In addition to email and Slack notifications, it is now possible to register Webhook notification channels using the API. For more details, check out the notification channel API document below.

With this phenomenon, access to the API server failed, most likely returning a 5xx status code and resulting in an error.

As the API server error rate increased, connectivity monitoring was suspended in order to prevent false reports.

After that, the unstable conditions continued for an extended period of time. At 4:20 pm (JST), recovery measures were taken by adjusting application parameters and reinforcing the server.

We were not able to identify the direct cause and will continue to further investigate this issue. Additionally, starting tomorrow, operations will be implemented to prevent secondary issues from occurring. Please note that depending on the operation, we may temporarily switch to maintenance mode (restricted access to the server).

ISUCON8 was held last weekend and Mackerel team members Matsuki (id:Songmu) and Shibasaki (id:shiba_yu36) both made it through the qualifying round. I’m looking forward to the main event.

Now on to the latest update information.

loadavg1 and loadavg15 added to system metrics

With the release of mackerel-agent v0.57.0, loadavg1 and loadavg15 have been added to the loadavg graph, which previously only displayed loadavg5. Now you can conveniently compare loadavg1, loadavg5, and loadavg15 to check whether CPU load is rising or falling. When you update mackerel-agent to the latest version, the two new system metric items will be added to the target host.
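Comparing the three windows is a quick trend heuristic: when the shorter averages sit above the longer ones, load is climbing. A small illustrative helper (not part of mackerel-agent):

```python
def load_trend(loadavg1, loadavg5, loadavg15):
    """Rough trend check: shorter windows above longer ones means load is rising."""
    if loadavg1 > loadavg5 > loadavg15:
        return "rising"
    if loadavg1 < loadavg5 < loadavg15:
        return "falling"
    return "steady"

print(load_trend(2.5, 1.8, 0.9))   # → rising
print(load_trend(0.3, 0.9, 1.6))   # → falling
```

This is the same comparison you can now do by eye on the loadavg graph.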

The log rotation tracking accuracy for check-log plugin has been improved

With the release of go-check-plugins v0.22.1, the check-log plugin now tracks log files by inode number. With this, the tracking accuracy for log files when logs are rotated has improved.
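go-check-plugins is written in Go, but the idea behind the fix — comparing inode numbers to notice that a path now refers to a different file — can be sketched in a few lines of Python:

```python
import os
import tempfile

def inode(path):
    return os.stat(path).st_ino

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "app.log")
    with open(log, "w") as f:
        f.write("line 1\n")
    before = inode(log)

    # Simulate logrotate: move the old file aside, create a fresh one at the same path.
    os.rename(log, log + ".1")
    with open(log, "w") as f:
        f.write("line 2\n")

    # A changed inode means the path now points at a brand-new file, so the
    # plugin should restart reading from offset 0 instead of the saved position.
    rotated = inode(log) != before
    print(rotated)  # → True
```

Tracking by inode rather than by path avoids both missing new lines after rotation and re-reading the old file's contents.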

You can now specify the number of redirects with check-http

With the release of go-check-plugins v0.22.1, you can now specify the number of redirects to follow with the --max-redirects option (the default is 10).

In environments where ALB / ELB metrics are obtained with the AWS Integration feature, a change was made to now post 0 if the RequestCount metric value obtained from CloudWatch is null. This fixes the problem of alerts not closing automatically, which was previously caused by the null value.
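The handling can be sketched as follows (the function name is hypothetical, not the actual AWS Integration code):

```python
def normalize_request_count(datapoint):
    """Treat a missing CloudWatch datapoint as zero requests.

    Posting 0 instead of skipping the point lets an alert such as
    "RequestCount too high" observe a value below its threshold and
    close automatically once traffic stops.
    """
    return 0 if datapoint is None else datapoint

print(normalize_request_count(None))  # → 0
print(normalize_request_count(42))    # → 42
```

Previously, a null datapoint meant no value was posted at all, so the alert had nothing to compare against its threshold and stayed open.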

With the release of mackerel-agent v0.57.0, the code-signing certificate for the Windows Server installer has been updated. Please note that if you use a previous version of the installer, a certificate expiration warning will occur.

For the second half of the program, the student interns are assigned to each team and work on task assignments and feature developments that will actually be incorporated into the service. Two student interns were also assigned to the Mackerel team and were challenged with a lot of issues.

The feature that was introduced last Monday (9/3) titled "Roles can now be registered/deleted from the API" was implemented and released by our two student interns.

Since it’s the last day of the summer internship, we are going to introduce a complete set of the features implemented by our student interns.

Here is a message from Mackerel team director Katsuya (id:daiksy) who helped mentor the student interns over this last month.

This is the fourth year that the Mackerel team has received student interns.
And the speed of development has been outstanding this year. On more than one occasion, I was surprised when checking GitHub after a meeting, thinking, "What!? This feature is already up for review!?"

After two weeks of development, today is the last day of this year’s internship.
I think that this was a good experience for the student interns, but they also inspired the team as well. It was a very fulfilling two weeks.

It truly was a surprise that so many new features were developed and released in such a short period of time.

Now on to the update information.

An Organizations list screen has been added

View a list of the organizations that you belong to by accessing the URL below.

You can also access the same screen by clicking [▼] next to the [Organization Name] on the left side menu and clicking [Organizations].

From this page you can see the number of services, hosts, members, and alerts that are currently occurring for each organization. If you belong to multiple organizations, you can use this list to see the whole picture, like when confirming for which organization an alert is occurring.

Up until now, you could register metadata to hosts, but with this release, you can now register arbitrary JSON data as metadata for services/roles. For more details, refer to the Mackerel API document for Metadata.

This API currently supports only email and Slack notifications. With this release, the information available from the notification channel list API is now more detailed for email and Slack notification channels.