This is a very basic notebook to demonstrate using Jupyter notebooks to analyze data stored in Elasticsearch rather than Kibana. It uses an R kernel; however, the same approach will work with a Python or other kernel as well.

### Set Constants
## Change the Search() default values to return something else from ES.
## These are similar to a basic default query against something like ES or Splunk.
formals(Search)$index <- index
formals(Search)$size <- 10000          # the max query size
formals(Search)$q <- "*"               # give us everything (this is the main search thing)
formals(Search)$asdf <- TRUE           # make it into a table
# formals(elastic::Search)$sort <- "@timestamp:desc"
formals(Search)$time_scroll <- "5m"    # we're going to scroll so we get more than the max query size

### Set up initial connection
elastic::connect(es_host = host, port = port)
elastic::index_get()
elastic::cat_indices()
## After this we'll just use the 'ES()' function to query Elasticsearch

In [ ]:

### For demonstration purposes, we'll just load a pre-generated query
load("THE PATH TO elk_jupyter_r_blog.Rda", verbose=TRUE)
# verbose=TRUE means it'll tell you the name of the dataframe loaded. Hint, it's "df".


In [48]:

### Let's take a quick look at what our dataframe looks like
glimpse(df)

### Let's quickly look at when things happened. I've filtered this to just the day of the mini blue-red CTF.
## It looks like a lot happened about 7 and then again right before noon.
df %>%
  ggplot(aes(x = timestamp)) +
  geom_density()

In [49]:

df %>%
  mutate(day = as.Date(timestamp)) %>%
  count(day) %>%
  arrange(-n)
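For readers without dplyr loaded, the same per-day tally can be sketched in base R. The timestamps below are synthetic, made up purely so the example is self-contained:

```r
# Synthetic timestamps standing in for df$timestamp (illustration only).
ts <- as.POSIXct("2017-12-02 00:00:00", tz = "UTC") + c(0, 3600, 7200, 86400)

# as.Date() truncates each timestamp to its calendar day; table() counts them.
daily <- table(as.Date(ts))
daily

# sort(daily, decreasing = TRUE) gives the same ordering as arrange(-n).
```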

day          n
2017-12-02   24966

In [ ]:

### A quick list of names
dput(names(df))

In [ ]:

### Let's see how many unique values there are in each field
purrr::map(df, n_distinct)
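The purrr line applies n_distinct() to every column of the dataframe. The same idea in base R, shown here on a tiny made-up dataframe (the column names are hypothetical, not from the real data):

```r
# Toy dataframe standing in for the real ES query result (illustration only).
toy <- data.frame(host     = c("a", "a", "b"),
                  event_id = c(4720, 4738, 4720))

# length(unique(x)) is base R's n_distinct(); vapply maps it over the columns.
distinct_counts <- vapply(toy, function(x) length(unique(x)), integer(1))
distinct_counts
```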

### Let's do a joy plot of the different servers
## Looks like Win2008-64v1, mysql, and BT-VM didn't get touched.
## Something happened on the rest of the windows servers a bit before noon.
## The LAMP server did something about 7am.
df %>%
  ## Next two lines as if we needed to zero in on a specific time period
  filter(timestamp >= lubridate::ymd("2017-12-01")) %>%
  filter(timestamp <= lubridate::ymd("2017-12-10")) %>%
  ## Let's make the server names look better in the plot
  mutate(source.beat.name = ifelse(source.beat.name == "elasticsearch", "ES", source.beat.name)) %>%
  mutate(source.beat.name = ifelse(source.beat.name == "blueteam-virtual-machine", "BT-VM", source.beat.name)) %>%
  mutate(source.beat.name = stringr::str_wrap(source.beat.name, 8)) %>%
  ## ggplot() starts a plot
  ggplot() +
  ## geom_joy2 means we want a joy plot
  ## aes() sets the aesthetic, i.e. what's on the x & y axes, the line and fill colors, the alpha, etc.
  ggjoy::geom_joy2(aes(x = timestamp, y = source.beat.name, fill = source.beat.name), alpha = 0.8) +
  ggthemes::scale_fill_tableau(palette = "tableau20") +
  ## Let's make the x axis labels look nicer
  scale_x_datetime(date_labels = "%H") +
  ## Some theme stuff to make it look nicer
  ggjoy::theme_joy() +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        axis.title.y = element_blank())

Picking joint bandwidth of 2650

In [81]:

### Let's do the same thing for the event source to see what sources generated events when
## We can see various logs being active at various times. Definitely not consistent the whole time.
df %>%
  ## Next two lines as if we needed to zero in on a specific time period
  filter(timestamp >= lubridate::ymd("2017-12-01")) %>%
  filter(timestamp <= lubridate::ymd("2017-12-10")) %>%
  ## Get rid of records without a source_name
  mutate(source.source_name = factor(source.source_name)) %>%
  ## Let's make the source names look better in the plot
  mutate(source.source_name = gsub("Microsoft-Windows", "Win", source.source_name)) %>%
  ggplot() +
  # the geom and aesthetic
  ggjoy::geom_joy2(aes(x = timestamp, y = source.source_name, fill = source.source_name), alpha = 0.8) +
  # the axes
  ggthemes::scale_fill_tableau(palette = "tableau20", guide = "none") +
  scale_x_datetime(date_labels = "%H") +
  # the theme
  ggjoy::theme_joy() +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.text.y = element_text(size = 6))

Picking joint bandwidth of 4150

In [89]:

### Let's look at the windows logs
## We can see some logs are happening almost constantly, some are periodic, while others happen just a few times.
## Many events happen together, such as 4720-4738.
## I'm not a windows log person so can't tell you what these are, but can see the patterns.
df %>%
  ## Next two lines as if we needed to zero in on a specific time period
  filter(timestamp >= lubridate::ymd_h("2017-12-02 00")) %>%
  filter(timestamp <= lubridate::ymd_h("2017-12-03 00")) %>%
  ## Get rid of things without an event ID. (NA means "Not Available")
  filter(!is.na(source.event_id)) %>%
  ## This isn't a number, it's an ID, so it should be a character or factor
  mutate(source.event_id = as.character(source.event_id)) %>%
  ## The figure
  ggplot() +
  ## Set the geom and aesthetic
  geom_point(aes(x = timestamp, y = source.event_id, color = source.event_id), alpha = 0.2) +
  ## Set the axis (in this case, the color)
  viridis::scale_color_viridis(discrete = TRUE, option = "C", end = 0.8) +
  ## Set the theme
  theme(legend.position = "none",
        legend.title = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_text(size = 6))

In [90]:

### Event 7036 has an interesting period of activity between 10 and 2, let's look at it
## This query just shows the columns
df %>%
  # Filter to the time period we want
  filter(timestamp >= lubridate::ymd_h("2017-12-02 10")) %>%
  filter(timestamp <= lubridate::ymd_h("2017-12-02 14")) %>%
  # Filter to the event_id we want
  filter(source.event_id == 7036) %>%
  # select(one_of(grep("source", names(df), value=TRUE))) %>%
  ## The following line is super helpful in ES data as it gets rid of columns that are all NA.
  ## These are common as some columns may be unix specific when we're just looking at a windows event.
  select_if(colSums(!is.na(.)) > 0) %>%
  glimpse()
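The select_if(colSums(!is.na(.)) > 0) trick keeps only columns with at least one non-NA value. The same idea in base R, on a toy dataframe whose column names are made up for illustration:

```r
# Toy dataframe: one column is entirely NA, as happens when unix-specific
# fields come back empty for windows events (illustration only).
toy <- data.frame(event_id   = c(7036, 7036),
                  unix_field = c(NA, NA),
                  message    = c("started", "stopped"),
                  stringsAsFactors = FALSE)

# Keep only the columns that have at least one non-NA value.
kept <- toy[, colSums(!is.na(toy)) > 0, drop = FALSE]
names(kept)
```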

### Building on the previous block about event 7036, let's visualize it
## Ah, now we can see various services starting and stopping on various servers
df %>%
  # Filter to the time period we want
  filter(timestamp >= lubridate::ymd_h("2017-12-02 10")) %>%
  filter(timestamp <= lubridate::ymd_h("2017-12-02 14")) %>%
  # Filter to the event_id we want
  filter(source.event_id == 7036) %>%
  # select(one_of(grep("source", names(df), value=TRUE))) %>%
  ## The following line is super helpful in ES data as it gets rid of columns that are all NA.
  ## These are common as some columns may be unix specific when we're just looking at a windows event.
  select_if(colSums(!is.na(.)) > 0) %>%
  mutate(source.beat.name = stringr::str_wrap(source.beat.name, 8)) %>%
  # The figure
  ggplot() +
  # The geom and aesthetic
  geom_point(aes(x = timestamp, y = source.event_data.param2)) +
  # The axes
  scale_x_datetime(date_labels = "%H") +
  # Makes a grid of figures
  facet_grid(source.event_data.param1 ~ source.beat.name) +
  # How it all looks
  theme(strip.text.y = element_text(angle = 0))

In [85]:

### TESTING
## I always keep a cell at the bottom I use for stuff I want to test out.
## Normally it's just a glimpse, but it's helpful for various transient things.
glimpse(df)