Monday, October 7, 2013

SumStats Framework


I heard Seth talk about the SumStats framework (previously the measurements framework) at this past Bro Exchange. The presentation material was dense and at times difficult to follow. I had a previous interest in applying statistical measurements to Bro logs through R, but ran into memory issues (R is greedy), so I tried to understand as much about the framework as possible. The first time I read Seth and Aashish's udp_scan script, I had no clue what was going on. I dug into the SumStats framework's source and after banging my head on my keyboard long enough, it came to me.

SumStats isn't overly complex; it's just complex enough.

In this post I hope to shed some light on how the SumStats framework achieves scalable statistics you can use to determine what is "normal" on your network (or in your trace files). Below are some important terms SumStats relies on that you should be familiar with up front.
interval - a set amount of time
threshold - a maximum or minimum limit to something
calculation - an operation that can be applied to a set of observations (average, max, min, sample, std, etc.)
key - the thing observations are being collected about (an IP address, an HTTP host header, a port); similar to an indicator in the intel framework
observation - a piece of data associated with a key (fed to a reducer and ultimately a calculation)
reducer - a named set of operations to apply to keys and observations (including a set of calculations, a filtering function, and a key normalizer)
resultval - the value resulting from a calculation being applied to an observation (happens through a reducer)
result - a table of resultvals (multiple results can exist as multiple calculations can be applied to a single observation)
resultTable - a table of results (a table of tables)
SumStat - a named collection of reducers, intervals, thresholds and handling functions

For each key, a series of observations will occur over an interval of time. These observations are passed to a reducer that applies one or more calculations to each observation and produces a result. Result values are collected into tables. When a result crosses a user-defined threshold, a single action can be taken, an action can be taken for each key, or both. Likewise, when a user-defined interval ends, a single action can be taken, an action can be taken for each key, or both.
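The flow described above can be modeled in a few lines of Python. This is only a sketch of the concepts, not Bro's API: the class names, the `on_threshold` and `on_epoch_end` handlers, and the sample hosts and byte counts are all hypothetical, and it assumes a threshold that fires once per key per interval.

```python
from collections import defaultdict

def running_sum(prev, obs):
    return obs if prev is None else prev + obs

def running_max(prev, obs):
    return obs if prev is None else max(prev, obs)

class Reducer:
    """Applies a named set of calculations to the observations for each key."""
    def __init__(self, calculations):
        # calculations: name -> function(previous value or None, observation)
        self.calculations = calculations
        self.results = defaultdict(dict)  # key -> {calculation name: value}

    def observe(self, key, observation):
        result = self.results[key]
        for name, step in self.calculations.items():
            result[name] = step(result.get(name), observation)
        return result

class SumStatSketch:
    """A reducer plus a threshold and two handlers (hypothetical names)."""
    def __init__(self, reducer, threshold, threshold_val, on_threshold, on_epoch_end):
        self.reducer = reducer
        self.threshold = threshold
        self.threshold_val = threshold_val  # picks the number to compare
        self.on_threshold = on_threshold
        self.on_epoch_end = on_epoch_end
        self.crossed = set()  # keys that already fired in this interval

    def observe(self, key, observation):
        result = self.reducer.observe(key, observation)
        if key not in self.crossed and self.threshold_val(result) >= self.threshold:
            self.crossed.add(key)
            self.on_threshold(key, result)

    def end_epoch(self):
        # Per-key action at the end of the interval, then start over.
        for key, result in self.reducer.results.items():
            self.on_epoch_end(key, result)
        self.reducer.results.clear()
        self.crossed.clear()

alerts, report = [], []
stat = SumStatSketch(
    Reducer({"sum": running_sum, "max": running_max}),
    threshold=1000,
    threshold_val=lambda result: result["sum"],
    on_threshold=lambda key, result: alerts.append(key),
    on_epoch_end=lambda key, result: report.append((key, dict(result))),
)
for key, nbytes in [("10.0.0.1", 600), ("10.0.0.2", 50), ("10.0.0.1", 500)]:
    stat.observe(key, nbytes)
stat.end_epoch()
```

Here only 10.0.0.1 crosses the threshold (600 + 500 bytes), while both hosts show up in the end-of-interval report with their own sum and max.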

Because network communications can arrive in large, fast bursts, collecting statistics about them can mean keeping a large amount of state. To help reduce the memory footprint, the SumStats framework works only with streaming algorithms (something I'd never been exposed to until working with Bro), which process each observation once as it arrives instead of storing them all. This ensures that if you've told Bro to measure something, for example the average value of all SYN sequence numbers, and that something suddenly happens big and fast (a SYN flood), Bro won't tip over.
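To make "streaming algorithm" concrete, here is a small Python sketch that keeps a running count, mean, and standard deviation in constant memory, no matter how many observations arrive. It uses Welford's method, which I picked purely for illustration; I'm not claiming this is how Bro computes its own std calculation, and the class name and sample values are made up.

```python
import math

class StreamingStats:
    """Count, mean, and standard deviation without storing the observations."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 before any observation arrives.
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

stats = StreamingStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
```

After the loop the object holds just three numbers, and it would still hold just three numbers after a million observations. For the sample values above the mean comes out at about 5 and the standard deviation at about 2.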

Here is an example script that makes use of the SumStats framework. It measures the total number of bytes sent by each host each day and the average number of bytes sent by each host every week. Keeping this historic data could help a network operator determine 'normal' transfer amounts per host. This script could easily be adapted to "summary statistic" the total bytes sent to a remote host per SSL certificate.

The script defines two intervals. tx_summer_interval breaks every 1 "day" (I use minutes so you can see the script run in ~10 minutes instead of 1 week) and tx_aver_interval breaks every 1 "week". day_of_week maps day names to numbers. A day counter and a week counter are defined to be used as SumStat intervals (the actual value is called $epoch in the SumStat record). Two reducers are defined: one for summing, the other for averaging. Two SumStat objects are declared: one for averaging, one for summing.

Summing observations are sent to the summing SumStat object when a connection's state is removed. Averaging observations are sent to the averaging SumStat object when a summing result is calculated.
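The chaining described above (daily sums feeding a weekly average) can be sketched in plain Python. This is not the Bro script itself, just a model of the data flow under my own assumptions: the function names and sample hosts are hypothetical, and a host that sends nothing on a given day contributes no observation for that day.

```python
from collections import defaultdict

daily_bytes = defaultdict(int)        # host -> bytes sent so far today
weekly = defaultdict(lambda: [0, 0])  # host -> [sum of daily totals, days seen]

def connection_removed(host, bytes_sent):
    """Stage 1: each finished connection is an observation for the daily sum."""
    daily_bytes[host] += bytes_sent

def end_of_day():
    """Stage 2: each daily total becomes one observation for the weekly average."""
    totals = dict(daily_bytes)
    for host, total in totals.items():
        weekly[host][0] += total
        weekly[host][1] += 1
    daily_bytes.clear()
    return totals

def end_of_week():
    """Report the average daily total per host, then reset for the next week."""
    averages = {host: total / days for host, (total, days) in weekly.items()}
    weekly.clear()
    return averages

connection_removed("10.0.0.1", 100)
connection_removed("10.0.0.1", 300)
day1 = end_of_day()
connection_removed("10.0.0.1", 200)
day2 = end_of_day()
week = end_of_week()
```

The point to notice is that the weekly stage never sees individual connections. It only sees one number per host per day, which keeps its state small.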

The SumStats framework took me a bit to wrap my head around but it's pretty neat.
