Large graph datasets on a chalkboard

Relationships are everything – family, friends, mentors, colleagues, business contacts.  That’s true in life in general, but also in a great many applications of data science.  The connections between people, businesses, banks, and accounts can tell you if the flow of money is normal or a case of identity theft.  The movement of data from a hard disk to a computer program to a network port to another host on the internet can indicate whether your machine has been compromised or if you’re just playing an online game with friends.  In each of these cases, the data are often treated as a graph – consisting of nodes (e.g., people or computer programs) and  edges between nodes signifying relationships (e.g., Joe just launched PUBG and is ready for a long night of gaming).

Many businesses are generating large graph datasets and looking for ways to leverage them.  Common tasks include similarity search and finding anomalies.  In an online user group – where the graph nodes are people and posts, and the edges indicate shared communication – you might want to find users like ones who left the group (similarity search) so that you can draw them back in, or find users with odd behavior who might be bots (anomaly detection).

Old school approaches to these problems would look at local structure in the graph and do subgraph comparisons.  Maybe I’ll find a user who left the user group, follow all edges from her node up to 3 hops yielding a subgraph, and look for other subgraphs in the larger graph that are similar.  This is a very powerful approach, and is based on similar data driven approaches to finding words with similar meanings.  They say “you can tell the meaning of a word by the company that it keeps”.  Coke and Pepsi appear in sentences that are very similar to one another, and that are very different from sentences that contain the word “stapler”.  The same is true for nodes in a graph; you  can tell what they’re all about by the company they keep.

Though powerful, this approach has two problems.  First, algorithms for generating and comparing subgraphs are really expensive to compute; adding a node to the subgraph can double the time the algorithm takes to finish its task.  Second, subgraphs  require a hand-coded measure of what it means for two subgraphs to be similar, typically by allowing small changes in their structure to be ignored. Until recently, these problems were hard to overcome.

Enter deep learning methods and so-called graph embeddings.  An embedding is learned from graph data, and allows you to turn a node or edge into a vector of numbers.  So what?  The trick is that the learning algorithms make the numbers in the vector similar for nodes and edges that have similar local structures in the graph.  So now we might have three users who are nodes in the user group graph that look like this: Alice = [0.1, 0.1], Bob = [0.2, 0.1], and Charlie = [0.9, 1.0].  It’s trivial to apply a simple distance metric (like Euclidean distance) to see that Alice and Bob are very similar to each other, and very different from Charlie.  If Bob left the user group, we might conclude that Alice is in danger as well, or if most users are like Alice and Bob we might conclude that Charlie is a bot.

The nice thing about graph embedding algorithms is they tend to be more efficient than manually generating and comparing subgraphs, especially when the graph dataset is large, and comparing vectors of numbers is more flexible than comparing subgraphs directly.  I can change the size of the vectors (e.g., going from 2 numbers per vector to, say, 8) to make more fine grained comparisons without exploding the computational expense.  I can easily set thresholds on the distance metrics, or influence the algorithms to pay more attention to some relations than others (e.g., communication with external vs. internal hosts).

Graph embeddings are finding applications in areas as diverse as cybersecurity, “cognitive” querying of relational databases, and fraud detection.  They may be a great way to extract actionable knowledge from graph datasets you’ve got lying around as well.