Read about how we helped the Chief Information Officer at an IT services company perform flexible similarity-driven search and outlier detection in cybersecurity data. We developed a novel approach to graph data embeddings that embeds nodes into vector spaces and biases queries with human domain knowledge.
A government client wanted to perform flexible similarity-driven search and outlier detection in cybersecurity data, where network entities (e.g., routers, computers, processes, files, IP addresses) and their relationships (e.g., a process opened a file, or a connection to a local machine was opened from a remote IP address) are represented as very large graphs. The goal was to support queries like “show me more processes like this one” or “show me IP addresses with unusual behavior.” The difficulty is that computing similarity between sub-graphs directly is computationally expensive and may not capture the structure that matters for this specific task.
Synaptiq combined our significant experience with graph data and deep learning to explore ways of embedding nodes into vector spaces (turning nodes into vectors of numbers), so that similarity search and outlier detection can be done with an array of standard algorithms. We stood up an Amazon Machine Image (AMI) with a Neo4j graph database and a number of different graph embedding algorithms, and evaluated the embeddings produced by each algorithm. We also developed methods for quickly injecting domain knowledge into the embedding algorithms to tune them for the specific downstream task, which in this case is similarity search and outlier detection for cybersecurity.
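To make the idea concrete, here is a minimal sketch of how node embeddings turn both query types into simple vector operations. It is not the method used in the project: the embedding values, the node names (`proc_a`, `proc_x`, and so on), and the helper functions are all hypothetical, and cosine similarity with a k-nearest-neighbor distance score is just one standard choice among many.

```python
import math

# Hypothetical toy embeddings: in practice these vectors would come from
# a graph embedding algorithm run over the cybersecurity graph.
EMBEDDINGS = {
    "proc_a": [0.90, 0.10, 0.00],
    "proc_b": [0.80, 0.20, 0.10],
    "proc_c": [0.85, 0.15, 0.05],
    "proc_x": [0.00, 0.10, 0.95],  # deliberately unlike the others
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query, k=2):
    """'Show me more nodes like this one': top-k nodes by cosine similarity."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def outlier_scores(embeddings, k=2):
    """'Show me unusual nodes': mean cosine distance to the k nearest neighbors.

    A higher score means the node sits farther from its closest peers.
    """
    scores = {}
    for name, vec in embeddings.items():
        dists = sorted(
            1.0 - cosine(vec, other)
            for other_name, other in embeddings.items()
            if other_name != name
        )
        scores[name] = sum(dists[:k]) / k
    return scores


if __name__ == "__main__":
    print(most_similar(EMBEDDINGS, "proc_a"))
    scores = outlier_scores(EMBEDDINGS)
    print(max(scores, key=scores.get))
```

On this toy data, `proc_x` receives the highest outlier score because its vector points away from the other three. At the scale of very large graphs, the brute-force comparison shown here would normally be replaced by an approximate nearest-neighbor index.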
How Graph Data Embeddings Helped
We produced a detailed study of the strengths and weaknesses of existing embedding algorithms from two standpoints: the utility of the embeddings and computational tractability. Based on that outcome, we worked with domain experts to inject knowledge into the best-performing algorithm and demonstrated statistically significantly better accuracy on a suite of standard benchmarks for supervised learning from graph data. We delivered the AMI to the client.