Sunday, October 5, 2014

Out of the Scripts and Into the Core

In the previous post I discussed plugins in Bro and how they could be used to move code from the core to self-contained directories. After talking with Seth, I realized plugins had been around in Bro for a while (longer than I thought). I had no idea, as most of my experience with Bro was in scriptland. The input framework was built with Bro's plugin system and that's been around for a bit.
Seth suggested I try to move the geoIP lookup features of Bro into a plugin. I immediately thought, with my great new understanding of plugins, it would be quite simple. The geoIP lookup code relies on libgeoip and the legacy MaxMind C API. It exposes a few additional built-in functions in bro.bif and makes use of those functions in a few base/policy scripts. I thought it would be easy, involving a few copy/paste commands and moving some script files around. It turns out the job requires knowing some C++ and being more familiar with Bro's internals than I initially thought.
I can read C++ and write some very basic "Hello World" type programs; that is the extent of my C++ knowledge. I also had never really dug deep into Bro's source before. I did do this. However, looking back, that was relatively trivial. After grep'ing through much of the src/ directory, I decided I should document some key findings about the core before attempting to modify it. The following is meant to be rough notes and references on C++ and the source which comprises the core.

Generic C++ Programming Concepts
C++ has been around for a while. These concepts aren't anything new to many and might seem very rudimentary to some. I approach Bro from the perspective of someone with relatively strong systems, networking, and security experience. I would say my understanding of programming, especially in C++, isn't as refined.
- C++ is a statically typed, compiled language. It can be thought of as an extended version of C.
- Things need to be declared before they are used. This is fairly common across languages I'm used to.
- All things have a scope. Global vs local scope is important and influences references. Local takes precedence.
- Everything has a namespace. Namespaces are a mechanism for grouping things so that we don't have to dump everything into a global scope. Python has namespaces in packages (e.g. the match function in re) and Bro has namespaces as modules. In fact modules in scriptland translate to C++ namespaces.
- C++ is object oriented, which means classes. Classes are data type definitions. A thing of a particular class is called an object. This isn't specific to C++. Classes have member functions; in Python, for example, the string class has a split() function. Some class variables and methods are public and some are private, which determines how the variable or method can be accessed.
- C++ supports (multiple) class inheritance. Classes can inherit structure from base classes. An example of this is a class that describes a bluebird. The bluebird class inherits attributes and structure of the bird class and, in turn, the bird class inherits attributes and structure of the animal class. Class inheritance provides a generalized way of writing classes that maximizes code reuse.
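The concepts in the list above can be sketched in a few lines of C++. This is only an illustration: the zoo namespace, the Animal, Bird, and Bluebird classes, and the DescribeBluebird function are made-up names for this post and have nothing to do with Bro's source.

```cpp
#include <string>

// Hypothetical namespace, used only to show how names are grouped.
namespace zoo {

class Animal {
public:
    virtual ~Animal() = default;
    // Public member function: callable by any code holding an Animal.
    virtual std::string Describe() const { return "animal"; }
};

// Bird inherits the structure of Animal and refines it.
class Bird : public Animal {
public:
    std::string Describe() const override {
        return "bird, which is an " + Animal::Describe();
    }
    bool CanFly() const { return can_fly; }

private:
    // Private member variable: only code inside Bird can touch it.
    bool can_fly = true;
};

// Bluebird inherits from Bird, and through it from Animal.
class Bluebird : public Bird {
public:
    std::string Describe() const override {
        return "bluebird, which is a " + Bird::Describe();
    }
};

}  // namespace zoo

// Outside the namespace, names have to be qualified with zoo::.
std::string DescribeBluebird() {
    zoo::Bluebird b;
    const zoo::Animal& a = b;  // a Bluebird works wherever an Animal is expected
    return a.Describe();       // the virtual call still reaches Bluebird::Describe
}
```

Calling DescribeBluebird() walks the inheritance chain from the most specific class back to the base class, which is the code reuse the last bullet describes.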
The Core
Much of Bro's source consists of class definitions. By reading (well, OK, skimming) through the source (well, OK, the header) files, I came up with four buckets that most of the code in Bro's core falls into. These aren't meant to be comprehensive. I didn't write the core, so don't take my word for it. Instead, go read it for yourself.
1) Process network traffic
2) Glue scriptland to the core
3) Provide infrastructure the core needs
4) Offer other features

1) Network traffic processing
Bro's main source of awesome comes from its traffic analysis capabilities. Similar to humans and pantlegs, Bro has to do this analysis packet by packet, just like any other network sniffer. The code I found of interest in this bucket included the analyzers (duh), which deserve their own blog post entirely, and the NetSessions class and how it is used to interface with packet processing. The Frag.* and Reassem.* files help out here as well. The *.pac files in the src/ directory tie binpac analyzers and Bro together. One thing that surprised me is that Bro can still do some neat things without processing any traffic. It can be called with a script and no interface, much like other scripting language interpreters.

2) Scriptland
The interfaces Bro provides for scriptland to control how Bro behaves are extremely intriguing. I've never lifted the hood to see how scripting languages are implemented before this exercise. I'd be curious how other languages, like Ruby, do things. Scriptland functionality can be broken down into an additional three subtypes.
a) Interpreting the scripts
b) Script object abstraction
c) Exposing things the core can do for scriptland

a) Lexical Analysis
I had to do some reading to really understand what was going on here... Bro scripts need to be read by the core and interpreted before they take effect. Similar to how you are reading this right now, a computer can be taught to read source code given a set of rules. Written English's rules include: sentences read left to right, ideas stop at periods, words end with spaces, etc. Policy scripts don't have words or sentences but do have their own grammar. For example, expressions make up statements, which are semicolon delimited and make up frames, which are curly brace delimited. Bro's core needs to represent these things to be able to interpret them. Base classes like Expr, Stmt, Frame, and Scope offer structures which can be specialized through class inheritance to provide different kinds of expressions or statements (if statement, print statement, for statement, etc.). The Location class provides line and column numbers for debugging interpreted scripts.
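To make the idea of base classes for expressions and statements concrete, here is a minimal sketch of how an interpreter could represent a script like "print 1 + 2;". The class names echo the ones mentioned above, but everything here (ConstExpr, AddExpr, PrintStmt, RunExample, the Eval and Exec methods, integer-only values) is a simplified assumption for illustration and not how Bro actually defines these classes.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Source position, in the spirit of the Location class (simplified).
struct Location {
    int line;
    int column;
};

// Hypothetical base class for anything that evaluates to a value.
class Expr {
public:
    explicit Expr(Location loc) : loc_(loc) {}
    virtual ~Expr() = default;
    virtual int Eval() const = 0;
    const Location& GetLocation() const { return loc_; }

private:
    Location loc_;
};

// A literal constant such as 1 or 2.
class ConstExpr : public Expr {
public:
    ConstExpr(Location loc, int value) : Expr(loc), value_(value) {}
    int Eval() const override { return value_; }

private:
    int value_;
};

// An addition whose operands are themselves expressions.
class AddExpr : public Expr {
public:
    AddExpr(Location loc, std::unique_ptr<Expr> lhs, std::unique_ptr<Expr> rhs)
        : Expr(loc), lhs_(std::move(lhs)), rhs_(std::move(rhs)) {}
    int Eval() const override { return lhs_->Eval() + rhs_->Eval(); }

private:
    std::unique_ptr<Expr> lhs_;
    std::unique_ptr<Expr> rhs_;
};

// Hypothetical base class for statements, which are executed for effect.
class Stmt {
public:
    virtual ~Stmt() = default;
    virtual void Exec(std::ostream& out) const = 0;
};

// One kind of statement; if and for statements would be further subclasses.
class PrintStmt : public Stmt {
public:
    explicit PrintStmt(std::unique_ptr<Expr> expr) : expr_(std::move(expr)) {}
    void Exec(std::ostream& out) const override { out << expr_->Eval() << "\n"; }

private:
    std::unique_ptr<Expr> expr_;
};

// Builds the tree for "print 1 + 2;" by hand and executes it.
std::string RunExample() {
    auto sum = std::make_unique<AddExpr>(
        Location{1, 7},
        std::make_unique<ConstExpr>(Location{1, 7}, 1),
        std::make_unique<ConstExpr>(Location{1, 11}, 2));

    std::vector<std::unique_ptr<Stmt>> body;
    body.push_back(std::make_unique<PrintStmt>(std::move(sum)));

    std::ostringstream out;
    for (const auto& stmt : body)
        stmt->Exec(out);
    return out.str();
}
```

In a real interpreter a parser would build this tree from the script text. Here it is assembled by hand to keep the example short.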

b) Object Abstraction
Variables and data in scriptland need to be represented in C++ for the core to work with them. Variables and constants (scriptland native data types) are passed between policy scripts and Bro's core as pointers to Val objects. Scriptland defined types are created through the BroType interface. BroType creates BroObj (a base class with many children) objects. Event handlers are referenced between the core and scriptland as EventHandlerPtr objects. The boolean value of an EventHandlerPtr indicates if scriptland references it anywhere. I'm assuming this is used for optimizing callbacks in the core. Modules in scriptland are C++ namespaces in the core, where GLOBAL is the global default namespace. Below is a rough mapping of scriptland things to C++ classes.

Scriptland               The Core
everything               SerialObj
pretty much everything   BroObj
all values               Val
specific values          *Val (EnumVal, PortVal, AddrVal, RecordVal, TableVal, etc.)
function                 Func
event                    Event, EventHandlerPtr
attribute                Attr (&expire, &redef, &priority, etc.)
connection               Connection
uid                      UID
type                     BroType
when                     Trigger
Reporter                 Reporter
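
The following sketch illustrates two of the ideas above: script values handled through a common base class, and an event handler pointer whose boolean value says whether any script cares about the event. The class names are borrowed from the mapping, but the members, constructors, and the MaybeRaise function are simplified stand-ins invented for this example. They are not Bro's real definitions, and the "skip the work" use is only my assumption about why the boolean exists.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Simplified stand-in for a base value class; NOT Bro's real Val.
class Val {
public:
    virtual ~Val() = default;
    virtual std::string Describe() const = 0;
};

// One specific kind of value, in the spirit of the *Val family.
class PortVal : public Val {
public:
    PortVal(std::uint16_t port, std::string proto)
        : port_(port), proto_(std::move(proto)) {}
    std::string Describe() const override {
        return std::to_string(port_) + "/" + proto_;
    }

private:
    std::uint16_t port_;
    std::string proto_;
};

// Hypothetical record of an event and whether any script handles it.
struct EventHandler {
    std::string name;
    bool used_by_scripts;
};

// Simplified stand-in for EventHandlerPtr.
class EventHandlerPtr {
public:
    explicit EventHandlerPtr(const EventHandler* handler = nullptr)
        : handler_(handler) {}
    // True only when a handler exists and some script references it.
    explicit operator bool() const {
        return handler_ != nullptr && handler_->used_by_scripts;
    }
    const EventHandler* operator->() const { return handler_; }

private:
    const EventHandler* handler_;
};

// Core-side code can skip all the work when nobody is listening.
std::string MaybeRaise(const EventHandlerPtr& ev, const Val& arg) {
    if (!ev)
        return "skipped";
    return ev->name + "(" + arg.Describe() + ")";
}
```

Because MaybeRaise takes a reference to the base Val class, the same function works for any specific value type without knowing which one it was given.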

c) Bifs
Built-in function (Bif) files declare and define functions that are available from scriptland but are implemented in C++. They are essentially C++ source files with a special escape syntax. Events, types, and constant variables used in scriptland are also declared and/or defined in Bifs.
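
The general idea of exposing C++ functions to a scripting layer by name can be sketched with a simple lookup table. This is not how Bro processes Bif files, and it does not use the Bif escape syntax. RegisterBuiltin, CallBuiltin, and the to_upper_example function are hypothetical names made up for this illustration, and string-in, string-out is an assumed simplification.

```cpp
#include <cctype>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Assumed simplification: every built-in takes and returns one string.
using BuiltinFunc = std::function<std::string(const std::string&)>;

// Table mapping script-visible names to their C++ implementations.
std::map<std::string, BuiltinFunc>& Builtins() {
    static std::map<std::string, BuiltinFunc> table;
    return table;
}

// Makes a C++ function callable under the given script-visible name.
void RegisterBuiltin(const std::string& name, BuiltinFunc fn) {
    Builtins()[name] = std::move(fn);
}

// What an interpreter might do when a script calls a built-in by name.
std::string CallBuiltin(const std::string& name, const std::string& arg) {
    auto it = Builtins().find(name);
    if (it == Builtins().end())
        throw std::runtime_error("unknown built-in: " + name);
    return it->second(arg);
}

// Registers one hypothetical built-in that upper-cases its argument.
void RegisterExampleBuiltins() {
    RegisterBuiltin("to_upper_example", [](const std::string& s) {
        std::string out = s;
        for (char& c : out)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return out;
    });
}
```

A script author only ever sees the name. The implementation behind it is ordinary compiled C++, which is the split a Bif file describes.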

3) Core Infrastructure
Some portions of the core don't process traffic, interpret scripts, or provide cool features. Classes like Timers and Managers (*Mgr) are needed for managing Bro's complex internals but generally aren't immediately apparent to people who live in scriptland. Other things like BSD style getopts, signals, pipes, and custom queues are fundamental to Bro working well even though they are abstracted away.

4) Other features
This category is sort of a catch-all. If it's not traffic processing or script interpreting, but is available to a Bro user or programmer, it's probably in this bucket. Some features in this bucket are implemented as plugins, like the input, probabilistic, logging and file_analysis frameworks. Some features are tied into the core of Bro (which doesn't make sense to me, but could just be legacy) like IP anonymization (Anon.*) and IP geolocation.

I think moving away from libgeoip's C API would be smart. It would remove a dependency and move the code out of the core and into a plugin. It may even provide an opportunity to upgrade to MaxMind's second version file type (mmdb). However, that is easier said than done. I'm still working on it... Maybe I'll figure it out before I get distracted and start playing with Binpac++.