How Neptune helps DevOps teams step up their Ops game

The three core pieces of our product that help modern DevOps teams step up their Ops game.

Why Neptune?

With the Cloud, businesses seek greater agility and lower costs. But to achieve those agility and cost benefits with the Cloud and DevOps, teams need automation and standardization across all elements of their infrastructure and applications. As a result, the ability to resolve problems automatically and maximize uptime is a key requirement for achieving the benefits of the Cloud and DevOps. So, if you are a DevOps team that cares about agility, increasing uptime for your apps, and reducing MTTR (mean time to recovery) for problems, you should take a serious look at Neptune!

Based on a survey of 200+ customer discussions and on our own experience building self-healing systems that managed millions of servers at Amazon Web Services, if you are an Ops engineer, your time typically goes into one of these buckets:

Setting up new infrastructure

Deployments

Automation (continuous app delivery)

Monitoring

Incident response (keeping the lights on)

Automation and incident response are among the top priorities for any DevOps team, and Neptune directly helps you take both to the next level. Specifically, Neptune offers three benefits:

Reduce your MTTR (Mean Time to Recovery):
If you look at the incident response lifecycle, only 5% of the time is spent sending alerts, whereas 95% is spent on recovery. The time to recovery breaks down into troubleshooting - 73% (triaging, running diagnostics, root cause analysis), resolution - 12%, and documentation - 10%. Most of this 95% is done manually today. Neptune helps you automate this 95% of recovery work. With Neptune, known problems are fixed automatically without manual intervention. For alerts that do require human intervention, Neptune provides automated context and diagnostics so that they can be fixed in minutes instead of hours.

Increase uptime:
By automating fixes for known issues and diagnosing unknown problems much faster, you significantly increase uptime for your infrastructure and apps. This has a direct impact on your revenue and customer experience. It also means fewer manual, error-prone operations, so you avoid an engineer fat-fingering a command at 2 AM!

Happier and more productive engineers:
Finally, your engineers will be much happier because 1/ they don't have to wake up to respond to mundane alerts and 2/ they can focus their precious time on business priorities instead of repetitive alerts. DevOps managers will be happy because of the improvement in their engineers' quality of life.

Neptune's Product Offering

In a nutshell, our product offers three things for the DevOps engineering teams:

1. Analytics

Neptune integrates deeply with your monitoring and alerting tools, such as Nagios, NewRelic and PagerDuty, and gives you a single dashboard with analytics on the most pressing problems across all your monitoring tools. You get visibility into the top 20% of alerts causing 80% of the pain. For example, Neptune sends weekly reports on the top 20% of alerts, ranked by frequency and MTTR. These reports are valuable not only in weekly ops meetings but also for your higher management.

Incident Analytics

The Analytics version of our product is free! Sign up today to get insights into what's really bothering your DevOps team.
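To make the "top 20% of alerts causing 80% of the pain" idea concrete, here is a minimal sketch of how such a ranking could be computed from incident records. It is not Neptune's actual implementation: the `Incident` type, the `top_alerts` function, and the choice to measure "pain" as total minutes to recover are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Incident(NamedTuple):
    # Hypothetical record shape; real monitoring tools expose richer data.
    alert: str                 # alert name as reported by the monitoring tool
    minutes_to_recover: float  # time from alert to recovery


def top_alerts(incidents, pain_share=0.8):
    """Return the fewest alerts that together cover `pain_share` of total recovery time."""
    count = defaultdict(int)
    total_minutes = defaultdict(float)
    for inc in incidents:
        count[inc.alert] += 1
        total_minutes[inc.alert] += inc.minutes_to_recover

    grand_total = sum(total_minutes.values())
    report, covered = [], 0.0
    # Walk alerts from most to least painful until the target share is covered.
    for alert in sorted(total_minutes, key=total_minutes.get, reverse=True):
        if grand_total == 0 or covered >= pain_share * grand_total:
            break
        covered += total_minutes[alert]
        report.append({
            "alert": alert,
            "frequency": count[alert],
            "mttr_minutes": total_minutes[alert] / count[alert],
        })
    return report


if __name__ == "__main__":
    sample = [
        Incident("disk_full", 60), Incident("disk_full", 60),
        Incident("disk_full", 60), Incident("high_cpu", 15),
        Incident("cert_expiry", 5),
    ]
    for row in top_alerts(sample):
        print(row)
```

Ranking by total recovery time combines frequency and MTTR into one number; a report could just as well sort by either column on its own.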

2. Context

For many alerts, it takes an on-call engineer an hour or two to diagnose the problem after the alert fires. There are three scenarios to consider here: 1/ there is no Runbook, but the engineer is knowledgeable - so she ends up looking at thirteen different systems (monitoring tools, dashboards, ticketing systems, logging tools, internal/external dependent services, custom scripts, etc.) and spending hours making sense of what's really going on, 2/ the engineer needs to look up a documented Runbook, and 3/ the Runbook is not documented and the engineer is not knowledgeable enough, so she has to rely on whatever she knows to figure out what's going on.

Surprisingly, many operations teams do not have good processes for breaking down silos. The knowledge often lives in the minds of a few DevOps engineers, and there are no good processes set up to make knowledge sharing easy. Neptune streamlines operations processes and eliminates these silos by providing a single source of truth (aka Runbooks) for handling any alert. In addition, it automatically captures context and diagnostics without an engineer having to log in to multiple systems manually, thus helping the engineer troubleshoot the problem in minutes instead of hours.

Before installing Neptune, this is what a typical NewRelic error rate alert looks like:

Without Neptune: NewRelic Error Rate alarm

After installing Neptune, you get the error rate alert plus all the automated context surrounding it on a single page. This automated context includes the error rate trend, its impact on latency, application log file snapshots containing 5xx errors, and the health of dependent internal/external services. It also presents a correlation analysis, which shows the recent incidents for that application/host in the last few hours to help isolate a related but different root cause. Having all this information on a single page for an alert makes troubleshooting fun and 10x faster.

With Neptune, you get alert + automated context

Here are some popular automated context and diagnostics use cases:

Whenever a disk is full, send me the top 10 home directories taking up the most disk space

Whenever CPU is high on any webserver, send me a CPU usage graph snapshot for the last few hours and also capture a top snapshot sorted by CPU usage
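As an illustration of the disk-full use case, the following is a minimal sketch of a diagnostic script that such a rule might run on the affected host. The function names and the default `/home` root are assumptions for this example, not part of Neptune itself.

```python
import os


def directory_size(path):
    """Total size in bytes of the files under `path`, without following symlinks."""
    total = 0
    # Unreadable subdirectories are silently skipped via the no-op onerror handler.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def top_home_directories(home_root="/home", limit=10):
    """Return up to `limit` (directory, bytes) pairs, largest first."""
    sizes = []
    with os.scandir(home_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.path, directory_size(entry.path)))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:limit]


if __name__ == "__main__":
    for path, size in top_home_directories():
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

The output of a script like this is what would be attached to the alert as context. The high-CPU use case could follow the same pattern, capturing a process snapshot instead of directory sizes.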

3. Remediation

Finally, if it's a known alert, Neptune can simply fix the problem automatically. Remediation actions include running a script on a single host or a cluster of machines, calling a REST API, or running a custom AWS or Heroku CLI action.

Note: We recommend that you take automated actions only when you know the root cause of the problem. If you are not sure of the root cause, we recommend that you collect context only and hold off on automated actions, since they may simply mask a bigger underlying problem.

Every morning, you receive a daily report listing the actions performed automatically in the last 24 hours. This way, even though you are not performing those actions manually, you still have an eye on what's going on (this helps with auditing as well).

Daily activity report/dashboard
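The recommendation above (remediate only when the root cause is known, otherwise collect context) and the daily report can be sketched together as follows. This is an illustrative model under stated assumptions, not Neptune's API: `Rule`, `ActionLog`, and `handle_alert` are hypothetical names, and actions are only recorded here, not executed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Rule:
    alert: str
    root_cause_known: bool  # only known root causes are considered safe to auto-fix
    action: str             # label for a script, REST call, or CLI action


@dataclass
class ActionLog:
    entries: list = field(default_factory=list)

    def record(self, when, alert, mode, action):
        self.entries.append(
            {"when": when, "alert": alert, "mode": mode, "action": action}
        )

    def daily_report(self, now):
        """Return the entries recorded in the 24 hours before `now`."""
        cutoff = now - timedelta(hours=24)
        return [e for e in self.entries if cutoff <= e["when"] <= now]


def handle_alert(rule, log, now):
    """Decide between remediation and context collection, and log the decision."""
    if rule.root_cause_known:
        mode, action = "remediate", rule.action
    else:
        # Unknown root cause: gather diagnostics instead of masking the problem.
        mode, action = "context_only", "collect_context"
    log.record(now, rule.alert, mode, action)
    return mode


if __name__ == "__main__":
    log = ActionLog()
    now = datetime.now()
    handle_alert(Rule("disk_full", True, "cleanup_tmp.sh"), log, now)
    handle_alert(Rule("error_rate", False, "restart_app.sh"), log, now)
    for entry in log.daily_report(now):
        print(entry)
```

Recording every decision, including the context-only ones, is what makes the log usable for auditing later.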

Here are some auto-remediation use cases:

Whenever out-of-memory errors are found, capture a thread/heap dump and restart the process

Whenever load or throughput is too high, increase the size of my AWS EC2 autoscaling group
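As a rough sketch of the out-of-memory use case, the function below builds the ordered list of commands a remediation could run. It makes several assumptions that do not come from Neptune: the process is a JVM (so the error shows up in logs as `OutOfMemoryError` and the JDK tools `jstack` and `jmap` apply), the service is managed by systemd, and the function name and dump directory are hypothetical. It only plans the commands and does not execute anything.

```python
def plan_oom_remediation(log_lines, pid, service, dump_dir="/var/tmp"):
    """Return the commands to run, in order, or [] if no OOM error is in the logs."""
    if not any("OutOfMemoryError" in line for line in log_lines):
        return []
    return [
        # 1. Thread dump (jstack writes to stdout; the caller should capture it).
        ["jstack", str(pid)],
        # 2. Heap dump in binary format for offline analysis.
        ["jmap", f"-dump:format=b,file={dump_dir}/heap-{pid}.hprof", str(pid)],
        # 3. Restart only after the diagnostics have been captured.
        ["systemctl", "restart", service],
    ]


if __name__ == "__main__":
    lines = ["INFO request served", "ERROR java.lang.OutOfMemoryError: Java heap space"]
    for command in plan_oom_remediation(lines, pid=4242, service="myapp"):
        print(" ".join(command))
```

Capturing the dumps before the restart matters: once the process is restarted, the evidence needed to find the root cause is gone.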