Statistical Anomaly Query
- Identify events from an event stream that are statistically significant compared to a baseline.
- Split time into 3 sections: a) the current partial hour, b) the last full hour, c) from -2h@h back until all desired baseline data is included. Ignore section a), as it leads to too many false positives (mostly because a partial hour has not yet seen enough events). Section b) is what we check against the statistics generated from the baseline, section c).
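The window arithmetic can be sketched in Python (illustrative only; in the actual query these boundaries are expressed with SPL snap-to-hour time modifiers such as @h, -h@h and -2h@h, and the helper name here is hypothetical):

```python
from datetime import datetime, timedelta

def time_sections(now, baseline_days):
    """Compute the section boundaries described above.

    Returns ((b_start, b_end), (c_start, c_end)), where section b is the
    last full hour and section c is the baseline window. Section a (the
    current partial hour, from b_end to now) is deliberately ignored.
    """
    top = now.replace(minute=0, second=0, microsecond=0)  # SPL: @h
    b_start = top - timedelta(hours=1)                    # SPL: -h@h
    b_end = top                                           # SPL: -0h@h
    c_end = top - timedelta(hours=2)                      # SPL: -2h@h
    c_start = top - timedelta(days=baseline_days)         # SPL: -<N>d@h
    return (b_start, b_end), (c_start, c_end)
```

For example, at 14:37 with a 7-day baseline, section b is 13:00 to 14:00 of today, and the baseline runs from 14:00 seven days ago up to 12:00 today.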
- Processing of section c):
- Bucketize time into 1-hour chunks.
- Cluster events based on punctuation. I was unable to get both percentage of traffic and absolute counts from the cluster command; clustering by punctuation actually works pretty well for the events I looked at.
- Select the 25 most frequently repeated events within each hour.
- Generate mean and standard deviation values from the baseline data, for both the percentage of traffic in each hour and the absolute counts.
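The baseline processing above can be sketched in Python (illustration only; punct_key is a rough stand-in for Splunk's punct field, and all names here are hypothetical):

```python
import re
from collections import Counter, defaultdict
from statistics import mean, stdev

def punct_key(event):
    """Rough stand-in for Splunk's punct field: drop letters and digits,
    keeping only the punctuation skeleton of the event."""
    return re.sub(r"[A-Za-z0-9]+", "", event)

def baseline_stats(events_by_hour, top_n=25):
    """events_by_hour maps an hour bucket to the raw events in that hour.
    For each punctuation pattern that makes an hour's top-N, collect its
    hourly count and percentage of traffic, then summarise with mean/stdev."""
    counts_by_punct = defaultdict(list)
    percents_by_punct = defaultdict(list)
    for hour, events in events_by_hour.items():
        counts = Counter(punct_key(e) for e in events)
        total = sum(counts.values())
        for p, c in counts.most_common(top_n):
            counts_by_punct[p].append(c)
            percents_by_punct[p].append(100.0 * c / total)
    return {
        p: {
            "meancount": mean(cs),
            "stddevcount": stdev(cs) if len(cs) > 1 else 0.0,
            "meanpercent": mean(percents_by_punct[p]),
            "stddevpercent": stdev(percents_by_punct[p])
                             if len(percents_by_punct[p]) > 1 else 0.0,
        }
        for p, cs in counts_by_punct.items()
    }
```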
- Join with the processing of section b):
- Very similar to the above, except we do not generate statistics: just take the top 25 most common messages and get their percentage of traffic and count.
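A sketch of this step in Python (illustrative; the inline punct extraction is only a rough stand-in for Splunk's punct field):

```python
import re
from collections import Counter

def current_hour_profile(events, top_n=25):
    """Top-N punctuation patterns in the last full hour, with their count
    and percentage of traffic (the fields the join brings in)."""
    punct = lambda e: re.sub(r"[A-Za-z0-9]+", "", e)  # rough punct stand-in
    counts = Counter(punct(e) for e in events)
    total = sum(counts.values())
    return {p: {"count": c, "percent": 100.0 * c / total}
            for p, c in counts.most_common(top_n)}
```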
- With the joined results we now do some calculations:
- Calculate a z score for both percentage of traffic and counts.
- Search to display only the z scores that hit the thresholds. The typical values I use here are -3.1 and +3.1 (assuming a normal distribution, that is roughly a 1 in 1000 chance of a false positive per tail). Note: a normal distribution is not an ideal assumption, but anything else would require adding custom commands to Splunk, so here we are.
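The z score and threshold logic, sketched in Python (the threshold and mode arguments mirror the macro's $threshold$ and AND/OR parameters; the function names are hypothetical):

```python
def z_score(value, mean, stddev):
    """Standard z score: how many standard deviations from the baseline mean."""
    return (value - mean) / stddev

def is_anomalous(z_percent, z_count, threshold=3.1, mode="OR"):
    """Mirrors the final search clause: alert when both (AND) or either (OR)
    z score exceeds +threshold, or falls below -threshold."""
    combine = all if mode == "AND" else any
    return (combine((z_percent > threshold, z_count > threshold))
            or combine((z_percent < -threshold, z_count < -threshold)))
```

With mode "OR", an hour whose count spikes while its percentage stays flat still alerts; with "AND", both measures have to move together.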
- Endgame: make the results a bit more useful:
- For each result that has been triggered, find an example event and display that instead of the punctuation pattern.
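This lookup (done with map + head 1 in the SPL version) can be sketched as (illustrative Python, hypothetical names):

```python
import re

def example_events(triggered_puncts, events):
    """Return one raw example event per triggered punctuation pattern,
    so results show a readable event rather than a punct skeleton."""
    punct = lambda e: re.sub(r"[A-Za-z0-9]+", "", e)  # rough punct stand-in
    examples = {}
    for e in events:
        p = punct(e)
        if p in triggered_puncts and p not in examples:
            examples[p] = e
    return examples
```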
- Macro parameters:
- $daysofbaseline$: numerical value. The number of days to take as the baseline. Values I frequently use: 3 (watch out for weekends being skewed), 7 and 14.
- $searchstring$: string value. A search returning a stream of events, e.g. "host=www* source=*access*".
- $threshold$: numerical value. The z score above or below which we alert. I frequently use 3.1 (which means -3.1 and +3.1); that value results in a false positive rate of approximately 1 in 1000.
- $countXpercent$: string value. Whether to alert on percentage AND count, or on percentage OR count. The two accepted values are "AND" and "OR".
- This query is meant as a macro.
- Any event stream can be used as input. Some examples:
- events from a specific host
- events from a list of hosts (e.g. a cluster)
- backup events from all hosts being backed up
- events from a specific section of a page (e.g. shopping cart processing)
- I do not like to throw away current data, but using partial hours, or splitting data anywhere other than the top of the hour, was likely to cause results that are difficult to explain. I made the decision to value replayability (at least within that one hour) and ease of explanation over completeness.
- There is no memory: one minute past the top of the hour, the anomalous events simply disappear (one of my to-dos). This is probably best implemented as a scheduled search that writes to an anomaly log.
- One of the design considerations was that I valued the ability to explain results. I wanted to avoid having a black box and saying "trust me". When getting others to use this code, there are several things that can be checked:
- Search for that specific punctuation pattern to see what events match
- Count the number of these events per hour over the baseline time period to see whether the results really do look unusual.
- Display a sample event for instant recognition.
- Implement a version of this that is based on summary searches. Summary searches will make this much faster; however, the cost is that the event streams become static and need to be pre-calculated in advance. For ad-hoc queries this macro will still have value.
- Summary setup overview:
- For each event stream we are interested in, generate the counts and percentages for the top 25 event types based on punctuation.
- Modify the query to look up that information (the beginning section) instead of calculating it.
- Error handling needs to be better:
- Handle the case where there are no events from that host within the baseline, or too few events to generate good statistics.
- Figure out parameter error handling for macros; probably something similar to using a regular expression to parse the multiple arguments.
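One way the "insufficient baseline" case could be guarded, sketched in Python (a hypothetical guard, not current macro behaviour; the min_hours cutoff is an assumed value):

```python
def safe_z_score(value, mean, stddev, baseline_hours, min_hours=24):
    """Return None instead of a z score when the baseline is unusable:
    too few baseline hours to trust the statistics, or a zero standard
    deviation (a constant series would otherwise divide by zero)."""
    if baseline_hours < min_hours or stddev == 0:
        return None
    return (value - mean) / stddev
```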
- `AnomalySearch("index=os host=edrms*",3,2.1, "OR")`
- Look through the os logs for these hosts, taking the last 3 days as a baseline, and alert on anything that has a z score greater than 2.1 or lower than -2.1 on either the count or the percentage of events within an hour.
earliest=-$daysofbaseline$d@h latest=-2h@h $searchstring$
| bucket _time span=1h
| top limit=25 punct by _time
| eval bpercent=percent
| eval bcount=count
| stats mean(bpercent) as meanpercent, mean(bcount) as meancount, stdev(bpercent) as stddevpercent, stdev(bcount) as stddevcount by punct
| table punct, meancount, stddevcount, meanpercent, stddevpercent
| join punct
    [ search earliest=-h@h latest=-0h@h $searchstring$
      | bucket _time span=1h
      | top limit=25 punct by _time
      | eval cpercent=percent
      | eval ccount=count
      | table punct, cpercent, ccount ]
| eval zpercentscore=(cpercent-meanpercent)/stddevpercent
| eval zcountscore=(ccount-meancount)/stddevcount
| search (zpercentscore > $threshold$ $countXpercent$ zcountscore > $threshold$) OR (zpercentscore < -$threshold$ $countXpercent$ zcountscore < -$threshold$)
| map search="search $searchstring$ punct=$punct$ | head 1 | eval mean_count=$meancount$ | eval mean_percent=$meanpercent$ | eval latest_count=$ccount$ | eval latest_percent=$cpercent$ | eval z_count_score=$zcountscore$ | eval z_percent_score=$zpercentscore$"
| table _raw, mean_count, latest_count, mean_percent, latest_percent, z_count_score, z_percent_score