A government agency wanted to perform flexible, similarity-driven search and outlier detection on cybersecurity data, in which network entities (e.g., routers, computers, processes, files, IP addresses) and their relationships (e.g., a process opened a file, or a connection to a local machine was opened from a remote IP address) are represented as very large graphs. The goal was to support queries like “show me more processes like this one” or “show me IP addresses with unusual behavior”. The difficulty is that computing similarity between subgraphs directly is computationally very expensive, and may not capture the structure that matters for this specific task.
At Synaptiq, we combined our significant experience with graph data and deep learning to explore ways of embedding nodes into vector spaces (turning each node into a vector of numbers), so that similarity search and outlier detection can be carried out with an array of standard algorithms. We stood up an Amazon Machine Image (AMI) with a Neo4j graph database and a number of different graph embedding algorithms, and evaluated the embeddings produced by each algorithm. We also developed methods for easily injecting domain knowledge into the embedding algorithms to tune them for the specific downstream task, which in this case was similarity search and outlier detection for cybersecurity.
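To illustrate why embeddings make these queries tractable, here is a minimal sketch of similarity search and outlier scoring over node vectors. It is not the pipeline used in the project: the toy `embeddings` array, the node names, and the helper functions `most_similar` and `outlier_scores` are all hypothetical, and cosine similarity with a k-nearest-neighbor outlier score is just one common choice among the standard algorithms that could be applied.

```python
import numpy as np

# Hypothetical 2-D embeddings for five nodes; real embeddings would come from
# a graph embedding algorithm and have many more dimensions.
node_names = ["proc_a", "proc_b", "proc_c", "proc_d", "proc_e"]
embeddings = np.array([
    [1.00, 0.10],
    [0.90, 0.20],
    [1.00, 0.00],
    [0.95, 0.15],
    [-0.20, 1.00],  # deliberately placed far from the other nodes
])


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity between all rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def most_similar(vectors, query_index, top_k=3):
    """Answer "show me more nodes like this one" for the node at `query_index`."""
    sims = cosine_similarity_matrix(vectors)[query_index].copy()
    sims[query_index] = -np.inf  # exclude the query node itself
    return np.argsort(-sims)[:top_k].tolist()


def outlier_scores(vectors, k=2):
    """Score each node by 1 minus its mean similarity to its k nearest neighbors.

    Assumes k is smaller than the number of nodes. Higher scores mean the node
    sits further from everything else in the embedding space.
    """
    sims = cosine_similarity_matrix(vectors)
    np.fill_diagonal(sims, -np.inf)  # a node is not its own neighbor
    top_k_sims = np.sort(sims, axis=1)[:, -k:]
    return 1.0 - top_k_sims.mean(axis=1)


if __name__ == "__main__":
    neighbors = most_similar(embeddings, query_index=0, top_k=2)
    print("Nodes most like", node_names[0], ":", [node_names[i] for i in neighbors])
    scores = outlier_scores(embeddings, k=2)
    print("Most unusual node:", node_names[int(np.argmax(scores))])
```

This brute-force version compares every pair of nodes, which is fine for a toy example but would not scale to very large graphs; in that setting an approximate nearest-neighbor index would be the more likely choice.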
We produced a detailed study of the strengths and weaknesses of existing embedding algorithms from two standpoints: the utility of the embeddings and computational tractability. Based on that study, we worked with domain experts to inject knowledge into the best-performing algorithm and demonstrated statistically significant improvements in accuracy on a suite of standard benchmarks for supervised learning from graph data. The AMI was delivered to the client.