Cyclical Statistical Forecasts and Anomalies - Part 2

So you want brilliant alerts over big data?

Well, yeah, of course you do! In the previous post, "Cyclical Statistical Forecasts and Anomalies - Part 1," we discussed how to gather key measurements for every entity in a critical system, apply your business rules and operations policies, and build behavior curves for those metrics. Those curves can be used to identify anomalies and create useful alerts that filter out the noise and focus on the events you care about most. We created some interesting alerts based on cyclical anomalies and built a basic-but-working forecast using static lookup files to persist and project past behaviors.

That works great for CSV files and a small number of entities, from a handful up to hundreds, but it requires a different approach when you have 15,000 servers and billions and billions of events to process.

So now we'll adapt the workflow and use some Splunk goodness such as summary indexes (or data model accelerations if you have those handy) to operate our forecasts at greater scale.

We’ll use the same CallCenter.csv sample data from the previous post in this series to illustrate the example, although if you have live data you can just replace that part of the search. You can even use index=_internal which should show the cyclic nature of your Splunk instance if it’s been running for a few months or more, but for discussion purposes the examples will use that CSV. Just make the following adaptations:

Since we're using the Call Center data CSV for the examples, you'll see index=callcenter used to search the data streaming into Splunk. If you aren't using that example data, replace 'callcenter' with whatever index you're using. If you already have your data in data models or summary indexes, that's great; just replace the data references below.

We are going to assume a summary index has been created (Splunk Docs covers how to create one and explains summary indexing in more detail) and that it's called "callcentersummary." In this example we point our searches there and publish results there; substitute your own summary index once you have it created.

Last time, we saved the results from the Splunk Machine Learning Toolkit (MLTK) Numeric Outlier Detection Assistant to a lookup to operationalize the insights. This time, we are going to save the results to the summary index and start with the forecasting technique instead of persisting the statistical behaviors of the past.

Let's begin by making the forecast for tomorrow using the last three weeks of data just for kicks.

Just as before, we are going to take the Numeric Outlier search created by the Assistant and split it into two parts: the part that computes upperBound and lowerBound, and the part that computes isOutlier. This time we also need to filter for just the days of the week matching tomorrow (cloning just the data we need) and create the time values for the future (introducing time travel without a DeLorean)!

index=callcenter

| bin _time span=15m

| stats count by _time,source

| eval this = relative_time(now(),"+1d")

| eval filterday=strftime(this, "%A")

| eval DayOfWeek=strftime(_time, "%A")

| where filterday=DayOfWeek

| eval HourOfDay=strftime(_time, "%H")

| eval BucketMinuteOfHour=strftime(_time, "%M")

| stats avg(count) as avg stdev(count) as stdev max(_time) as time by HourOfDay,BucketMinuteOfHour,DayOfWeek,source
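To make the logic of this search concrete outside of Splunk, here is a minimal Python sketch of the same steps: keep only the 15-minute buckets that fall on tomorrow's day of the week, then compute the average and standard deviation per hour-and-minute slot. The function name `forecast_tomorrow`, the input shape (a list of `(epoch_seconds, count)` pairs for a single source, already binned to 15 minutes), and the use of UTC are assumptions for illustration. The `forecast_time` field, which shifts the most recent matching bucket forward by one week, is also an assumption about how the future timestamp could be created; the search as shown stops at the stats step.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

WEEK_SECONDS = 7 * 24 * 3600


def forecast_tomorrow(buckets, now):
    """buckets: iterable of (epoch_seconds, count), already binned to 15 minutes.
    now: timezone-aware datetime in UTC.
    Returns {(HourOfDay, BucketMinuteOfHour): stats} for tomorrow's weekday."""
    target_day = (now + timedelta(days=1)).strftime("%A")  # filterday
    counts_by_slot = defaultdict(list)
    latest_by_slot = {}
    for ts, count in buckets:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        if t.strftime("%A") != target_day:  # where filterday=DayOfWeek
            continue
        slot = (t.strftime("%H"), t.strftime("%M"))
        counts_by_slot[slot].append(count)
        latest_by_slot[slot] = max(latest_by_slot.get(slot, ts), ts)  # max(_time)

    forecast = {}
    for slot, counts in counts_by_slot.items():
        forecast[slot] = {
            "avg": mean(counts),
            # Sample standard deviation, like SPL's stdev(); a single sample
            # has none, so 0.0 is used here as a placeholder choice.
            "stdev": stdev(counts) if len(counts) > 1 else 0.0,
            "time": latest_by_slot[slot],
            # Assumption: project the latest matching bucket one week ahead.
            # This lands on tomorrow only if last week's matching day has data.
            "forecast_time": latest_by_slot[slot] + WEEK_SECONDS,
        }
    return forecast
```

With three prior Thursdays of data for the 09:00 slot (counts 10, 20, 30) and `now` set to a Wednesday, the sketch yields an average of 20 and a standard deviation of 10 for that slot, stamped for the coming Thursday.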

This search should be saved as a scheduled search (say CallCenterForecastTomorrow) that triggers at 11:55 p.m. each night, creating the forecast for tomorrow. Alternatively, you can forecast multiple days out, but remember to change MAX_DAYS_HENCE in props.conf if you go beyond 2 days into the future.

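Note that when you backfill the summary index, you can raise the -j flag (the maximum number of concurrent searches in Splunk's fill_summary_index.py backfill script) to have multiple backfill searches running at once, depending on your hardware provisioning.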

Next, we make a search to add new values as they occur to the summary index as Actual. Save that search as ActualCallCenter, make sure to set the time range to Relative last 15 minutes and schedule the search to run every 15 minutes.

Great. We now have two scheduled searches: one creating the forecast for tomorrow every night close to midnight, and another creating the actual values to compare against our forecast as the future becomes now. Thanks to backfilling, we can simulate what the last month would have looked like as we roll into the future. We will use the same techniques as we leave statistical forecasts and enter into machine learning projects, so learn to love these commands!

Now, time to get back to our alerts...

Let’s look at just one source from our sample data set so we can make an easy graph to illustrate, and see what our alerts and values would have looked like over the last week.

Pro Tip: I used the source field from an index and fed that into a summary index, where the original source field is renamed to orig_source.

In the graphic above, we can see the Actual events stopped at 30 minutes past midnight on Thursday morning when I took this snapshot, and we have outliers when call volume was abnormally high given our statistical forecast—from data that was just pushed into the summary index!
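The comparison behind those outlier markers can be sketched in a few lines of Python. This is an illustration under assumptions, not the search used for the graphic: the function name `flag_outliers`, the dictionary shapes, and the default multiplier of 2 standard deviations are all hypothetical choices. The idea is simply that an Actual value is an outlier when it falls outside the forecast's lowerBound and upperBound for the same time bucket.

```python
def flag_outliers(actuals, forecast, multiplier=2.0):
    """actuals: {epoch_seconds: count} of observed values.
    forecast: {epoch_seconds: (avg, stdev)} for the same 15-minute buckets.
    multiplier: assumed number of standard deviations for the bounds."""
    rows = []
    for ts in sorted(actuals):
        if ts not in forecast:
            continue  # no forecast for this bucket, nothing to compare against
        avg, sd = forecast[ts]
        lower = avg - sd * multiplier
        upper = avg + sd * multiplier
        count = actuals[ts]
        rows.append({
            "time": ts,
            "Actual": count,
            "lowerBound": lower,
            "upperBound": upper,
            "isOutlier": 1 if count < lower or count > upper else 0,
        })
    return rows
```

Buckets with an Actual value but no forecast are skipped rather than flagged, which keeps gaps in the forecast from generating alerts; whether that is the right policy depends on your operations rules.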

Awesome.

If you have data models, convert the searches to tstats and away you go. If you want to collect the alerts into a summary index or another persistence layer, you can do that too!

Debugging

Let's make a quick debugging dashboard to show where the statistical forecast is coming from: the past data in Splunk! This step will be very useful as we move into more complicated descriptive statistics and into machine learning algorithms, so getting into the habit of making a debugging workflow now will really help later on in our journey. Note that I am using the non-summarized data here; I'm looking at the raw data and checking to see if the forecast in my summary index makes sense.

index=callcenter source="si_call_volume"

| timechart span=15m count

| timewrap 1week

With an easy search over a few weeks of data, displayed as a line chart in Splunk with multi-series mode turned on, you can visually see each week that is contributing to your forecast.
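For readers who want to see what the week-over-week overlay amounts to, here is a small Python sketch of the idea. It only mimics the effect of wrapping a series by week and makes no claim about how timewrap is implemented; the function name `wrap_by_week` and the input shape are assumptions. Each point is shifted forward by a whole number of weeks so that it lines up with the most recent week, and the number of weeks shifted becomes the series label.

```python
from collections import defaultdict

WEEK_SECONDS = 7 * 24 * 3600


def wrap_by_week(series, latest_ts):
    """series: {epoch_seconds: count}, with every timestamp <= latest_ts.
    Returns {aligned_epoch_seconds: {weeks_back: count}}, where weeks_back=0
    is the most recent week, 1 is the week before, and so on."""
    wrapped = defaultdict(dict)
    for ts, count in series.items():
        weeks_back = (latest_ts - ts) // WEEK_SECONDS
        aligned_ts = ts + weeks_back * WEEK_SECONDS  # shift into the latest week
        wrapped[aligned_ts][weeks_back] = count
    return dict(wrapped)
```

Each key of the result is one point on the x-axis, and each inner key is one line on the chart, so a slot where one week sits far from the others is easy to spot.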

Holidays or Special Entities

In Part 1 of this series, I wanted to get into custom holidays or special cyclical treatment based on business rules, but we ran out of room... :(

I’m going to make up a completely fictitious holiday from the days in my data set, but I want to show the steps you need to take to make a real list. Just as I'm making a special case for holidays, you can make special cases for entities like Server10001 which manages your CEO’s email server; if your CEO has the same volume of email as Doug Merritt, maybe this is as critical to your business as it is to ours. We are going to create a CSV file or lookup via the Splunkbase app Lookup File Editor and maintain a list of holidays and associated values.

Create a CSV with the columns:
Time,isHoliday,isHolidayDefaultValue,isHolidayGroup,isHolidayName
11/25/2017,1,2,Splunk,SplunkDay

For example, with SPL:

| makeresults count=1

| eval Time="11/25/2017"

| eval isHoliday=1

| eval isHolidayDefaultValue=2

| eval isHolidayGroup="Splunk"

| eval isHolidayName="SplunkDay"

| fields - _time

| outputlookup isHoliday.csv

In the search ….

| eval time_key = strftime(that, "%m/%d/%Y")

| lookup isHoliday.csv Time as time_key

Pro Tip: Use a time_key field instead of joining on _time for easy control. Splunk does have a temporal lookup system, but that requires a different workflow.

You have a choice: either use hard-coded values based on your knowledge as an SME, or learn different upperBound and lowerBound values from your data! You can use isHolidayDefaultValue as the multiplier on the standard deviation for holidays, as in avg +/- stdev*exact(isHolidayDefaultValue). Or you can enrich your alerts directly with something like | eval isOutlierDougMerrit=if(isHolidayName="SplunkDay", "Danger Will Robinson", "") and put that field into your alert for added value during your event analytics step. (Don't have an Event Analytics policy? How are you managing 100,000 alerts with your resources? Go check out Splunk IT Service Intelligence.)
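As a rough illustration of the hard-coded option, the following Python sketch looks up a forecast timestamp in the holiday list by its time_key and uses isHolidayDefaultValue as the standard deviation multiplier when the day is a holiday. The CSV content matches the example above; the function names, the UTC conversion, and the non-holiday default multiplier of 3 are assumptions chosen only so the two cases are visibly different.

```python
import csv
import io
from datetime import datetime, timezone

HOLIDAY_CSV = """Time,isHoliday,isHolidayDefaultValue,isHolidayGroup,isHolidayName
11/25/2017,1,2,Splunk,SplunkDay
"""


def load_holidays(text):
    """Index the holiday rows by their Time column (the time_key)."""
    return {row["Time"]: row for row in csv.DictReader(io.StringIO(text))}


def bounds_for(forecast_ts, avg, sd, holidays, default_multiplier=3.0):
    """Return bounds for one forecast bucket, widened or narrowed on holidays.
    default_multiplier is a hypothetical value for ordinary days."""
    time_key = datetime.fromtimestamp(forecast_ts, tz=timezone.utc).strftime("%m/%d/%Y")
    holiday = holidays.get(time_key)
    if holiday:
        multiplier = float(holiday["isHolidayDefaultValue"])
        name = holiday["isHolidayName"]
    else:
        multiplier = default_multiplier
        name = ""
    return {
        "time_key": time_key,
        "isHolidayName": name,
        "lowerBound": avg - sd * multiplier,
        "upperBound": avg + sd * multiplier,
    }
```

Keying on a formatted date string rather than the raw timestamp means every 15-minute bucket on the holiday picks up the same row, which is the same convenience the time_key field provides in the search.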

Alternatively, you can use the Holiday names as keys to find new behaviors for holiday groups or specific holidays through time.

| stats avg(count) as avgHoliday stdev(count) as stdevHoliday by isHolidayName,isHolidayGroup

Compare those values to the normal “day of week” traffic that you have already calculated.
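One way to picture that comparison is the Python sketch below, which groups holiday rows by name and group, computes their average and standard deviation, and then expresses the holiday average as a ratio of the ordinary day-of-week average. The function names, the row shape, and the choice of a simple ratio are assumptions for illustration; any comparison that suits your business rules would do.

```python
from collections import defaultdict
from statistics import mean, stdev


def holiday_stats(rows):
    """rows: iterable of dicts with 'count', 'isHolidayName', 'isHolidayGroup'.
    Rows with an empty isHolidayName are treated as ordinary days and skipped."""
    groups = defaultdict(list)
    for row in rows:
        if row.get("isHolidayName"):
            groups[(row["isHolidayName"], row["isHolidayGroup"])].append(row["count"])
    return {
        key: {
            "avgHoliday": mean(counts),
            # Sample standard deviation; 0.0 when only one sample exists.
            "stdevHoliday": stdev(counts) if len(counts) > 1 else 0.0,
        }
        for key, counts in groups.items()
    }


def holiday_ratio(avg_holiday, avg_weekday):
    """How many times busier (or quieter) the holiday is than a normal day."""
    return avg_holiday / avg_weekday
```

A ratio well away from 1 suggests the holiday deserves its own bounds rather than the normal day-of-week ones.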

Victory! Using the Splunk MLTK's Numeric Outlier Assistant to guide us, we have built a scalable forecasting, thresholding, and alerting mechanism that can be applied to pretty much any type of time series metric. In our next post, we'll use a handy Splunk workflow abstraction, a customer-created macro, and some more advanced statistical methods for determining an outlier, which will be sure to impress your friends at dinner parties.

Until then, happy Splunking!

Special thanks to the Splunk ML Customer Advisory team including Andrew Stein, Brian Nash and Iman Makaremi.