Ever since the London Underground Map was introduced in 1931, its iconic design has stood as a timeless piece. The design has been replicated in subway maps all over the globe, and it is no overstatement to say that millions of people depend on them daily (well, at least before COVID-19).
What are the elements of this design that make it so versatile? Without digging too deep into academic publications on the topic, we can intuitively name a few: the map abstracts away geographic detail, partitions the network into clearly delineated lines, and makes the connections between those lines immediately visible.
Can these successful elements be applied beyond subway maps? Yes, perhaps in cybersecurity. The engineering team at Confluera began its journey two years ago by asking itself a single question:
What would a perfect data representation look like, one that empowers security analysts in their daily work?
We are sharing some of our key learnings in this blog post. Hint: the results are strikingly similar to the thought process behind the subway map design (Connectedness, Partitions, Real-Time Decisions).
To start with the obvious, many modern server infrastructures can naturally be understood as graph data structures. Any operating system hosts processes, files, and network sockets that constantly interact with each other, both locally and remotely. A directed graph is a natural choice for representing these entities and their complex relationships.
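As a minimal illustration (not Confluera's actual implementation), such an activity graph can be sketched with a handful of lines; the node kinds and edge verbs below are hypothetical examples:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A system entity: a process, file, or network socket."""
    kind: str   # "process", "file", or "socket"
    name: str


class ActivityGraph:
    """A directed graph of system entities and their interactions."""

    def __init__(self):
        # Each node maps to a list of (verb, target) outgoing edges.
        self.edges = defaultdict(list)

    def add(self, src, verb, dst):
        self.edges[src].append((verb, dst))

    def successors(self, src):
        return self.edges[src]


# Example: a shell spawns curl, which writes a file and opens a socket.
g = ActivityGraph()
bash = Node("process", "bash")
curl = Node("process", "curl")
g.add(bash, "exec", curl)
g.add(curl, "write", Node("file", "/tmp/payload"))
g.add(curl, "connect", Node("socket", "10.0.0.5:443"))
```

Walking the outgoing edges from any node then reconstructs the downstream activity of that entity.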
For cybersecurity analysis, however, not all graphs are designed the same. A poorly designed graph will fail to deliver the deep insights that security problems often require. The more Confluera understood about the limitations of other graph solutions, the stronger our desire for the perfect graph became. Here are some key ingredients we discovered in our pursuit.
As data breaches become more complex and stealthy, the dwell time of an attack in a data center can easily stretch over several months. The attacker infiltrates and then makes lateral movements to several servers in the network over a long period of time. In other words, the spatial and temporal spread of cyber attacks is statistically increasing.
One of the early surprises we had as a team when talking to various customers was how much time security analysts spent correlating system events. There is a continuous flux of security signals in any given infrastructure, and analysts are constantly faced with a recurring question:
Are these signals individual parts of a larger attack, or just random false positives?
To answer such a question, analysts end up spending hours trying to correlate sparse events. We came to the conclusion that any graph representation used for security analysis needs to start from two uncompromisable premises: the graph should span an unlimited amount of time, and it should cover the entire infrastructure, so that an attacker's movements can be precisely tracked.
There are thousands of different intentions at crossroads in a given system. Here, the word intention describes a variety of activities: a separate SSH/RDP session, a database client session, or a scheduled job waking up to perform its work for the day. Each intention provides a logical group of related executions, and also draws a boundary against other activities that are orthogonal to it.
When analysts are presented with a set of security detections, they need to reverse engineer how and where the attacker came in. At this forensic stage, having data partitioned at the intention level helps them along two important dimensions: every related execution stays within reach, and orthogonal activities stay out of the investigation.
In fact, it can be claimed that a graph without proper partitioning is effectively a sequential log, given the limitations it brings. A partitioned graph breathes life into security analysis.
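To make the idea concrete, here is a hypothetical sketch of intention-level partitioning. It assumes each event is tagged at ingestion time with an intention identifier (for example, an SSH session ID or a cron job ID); the field names are illustrative, not Confluera's schema:

```python
from collections import defaultdict

# Hypothetical events, each tagged with the intention it belongs to.
events = [
    {"intention_id": "ssh-42", "actor": "bash", "action": "exec",  "target": "wget"},
    {"intention_id": "cron-7", "actor": "cron", "action": "exec",  "target": "backup.sh"},
    {"intention_id": "ssh-42", "actor": "wget", "action": "write", "target": "/tmp/x"},
]


def partition_by_intention(events):
    """Group events into per-intention subgraphs."""
    parts = defaultdict(list)
    for ev in events:
        parts[ev["intention_id"]].append(ev)
    return dict(parts)


parts = partition_by_intention(events)
# parts["ssh-42"] isolates the SSH session; the orthogonal cron job
# stays in its own partition and never clutters the investigation.
```

An analyst chasing a detection inside `ssh-42` now sees only the two related executions, instead of scanning a sequential log interleaved with unrelated activity.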
It is the year 2020, and the last two decades of batch data processing (Hadoop, MapReduce) are fading away. Instead of relying on retrospective, hindsight queries, many mission-critical businesses have moved on to real-time stream processing solutions that deliver more immediate results.
Most security products in the industry have real-time requirements at the core of their detection story. For example, running a MapReduce job at an hourly interval would give attackers a large window of invisibility. This raises an important question.
Can graphs be streamed and processed in real time without losing their context?
Stream processing of a single event is trivial. But can the concept of a graph remain intact during stream processing? This was a difficult question that our engineering team had to answer.
After a series of fail-fast experiments, we found that enough graph context can indeed be preserved in our stream processing pipeline. With a proper projection of data events plus a caching/storage hierarchy, we are able to capture and preserve the contextual graph surrounding each event for rapid lookups. Such contextual data enables the detection engine to make interesting decisions in real time.
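As a simplified sketch of this idea (our actual pipeline and storage hierarchy are more involved), a stream processor can keep a bounded cache of each entity's recent graph neighborhood and consult it before deciding on each incoming event; the detection rule below is a toy example:

```python
from collections import OrderedDict


class ContextCache:
    """An LRU-bounded cache of each entity's recent related events.

    Stands in for a real caching/storage hierarchy: it keeps enough
    graph context in memory for rapid per-event lookups.
    """

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.store = OrderedDict()  # entity -> list of related events

    def record(self, entity, event):
        ctx = self.store.pop(entity, [])
        ctx.append(event)
        self.store[entity] = ctx
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

    def context(self, entity):
        return self.store.get(entity, [])


def process_stream(events, cache):
    """Handle events one at a time, deciding with cached context only."""
    alerts = []
    for ev in events:
        ctx = cache.context(ev["actor"])
        # Toy rule: an actor that already wrote a file and now connects
        # out is flagged, without any retrospective batch query.
        if ev["action"] == "connect" and any(c["action"] == "write" for c in ctx):
            alerts.append(ev)
        cache.record(ev["actor"], ev)
    return alerts
```

The point is that each decision uses only the contextual graph already surrounding the event, so detection latency stays at stream speed rather than batch intervals.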
In stream processing, a graph is only as powerful as the context that is accessible. Achieving such capability was a huge milestone at Confluera.
Modern server infrastructures are full of unstructured noise and signals. Brute-forcing through the raw data is almost guaranteed to produce inaccurate security analysis and to drive your valuable team members toward operational fatigue.
The success of the subway map design reminds us all that the right level of data representation and partitioning expedites the decision-making process.
We have argued that such a data structure has a parallel in the domain of cybersecurity: a directed graph that is meaningfully connected, partitioned, and stream-friendly is essential to the deep security analysis we need today.
In a series of blog posts to come, we will continue to provide deeper dives into this particular journey.