# Graph embeddings for machine learning, the key to unlocking graph data, Part 1 of 3

# Graphs, graphs everywhere and yet...

Graphs are all around us. In our everyday lives we interact with a variety of graphs, even more than we might imagine.

The world wide web (WWW) may be described as a collection of web pages (more generally, resources) that are connected by hyperlinks. The entire Internet is a collection of end systems (e.g., servers, workstations, laptops, smartphones), routers, and switches, connected by communication links (e.g., electric and fiber optic cables, radio and satellite links). When you use a social network, you become part of a graph with millions of users connected by relationships such as friendships and followers. At a microscopic level, an organic molecule such as a protein may be considered a graph, where the entities are atoms and relationships are chemical bonds.

*Graph, n.: a collection of entities (objects) and the relationships linking those entities.*

Because graphs are everywhere, real-life graph data gives us an opportunity to answer important and interesting questions, for example:

- Which close-knit communities within a given social network are most likely to respond to an ad?
- Do any money laundering schemes exist within a particular set of cash flows?
- What are the topics of a set of publications, based on the network of citations?
- Based on an analysis of its structure, is a given chemical compound carcinogenic?
- Is a particular protein an enzyme?

In theory, machine learning should help answer such questions with increasing accuracy. Unfortunately, most machine learning algorithms and programming frameworks only work for tabular data. A graph is not tabular, so what can we do with the increasingly available graph data? Let’s first unpack why a solution for cracking graph data is so critical.

# A sudden interest and increase in graph data

The volume of publicly available graph data has grown enormously in recent years. One obvious factor is that web content (documents, users, relationships) has grown. Another significant factor is the emergence of Linked Data [1] consisting of graphs from a variety of domains and data providers that have been stitched together into a unified, semantic web. The Linked Open Data (LOD) cloud can be viewed and queried at https://lod-cloud.net/. It includes entities and relationships from domains such as social networks, media, publications, life sciences, geography, and government, among others.

To make such a construction possible, data providers publish their graphs using a standardized representation, namely, the Resource Description Framework (RDF). An interesting example of an RDF dataset is DBPedia, which is a graph constructed from the entire Wikipedia corpus by analyzing metadata (e.g., infoboxes) and structural information (e.g., links among Wikipedia pages as well as links from Wikipedia pages to external websites). The DBPedia dataset is available at https://wiki.dbpedia.org/develop/datasets. We will look more closely at RDF in our next post.

*Bottom line: Graph data are increasingly available and an increasingly obvious source of answers to pressing questions across industries and subject matters.*

# But there are fewer applicable algorithms or frameworks for graph data

If you try to build a predictive model from any of the above datasets (or any other graph dataset, for that matter, such as those available from the Stanford Large Network Dataset Collection [2]) using your favorite machine learning framework, you’ll face an immediate hurdle: the framework probably assumes that data come in the shape of a table. For example, scikit-learn [3], a widely used machine learning library, only accepts tabular data. As distinct from graph data, tabular data has the following characteristics:

- All the instances are independent of each other and drawn from the same underlying distribution; in machine learning terminology, we say that the instances are independent and identically distributed (i.i.d.);
- each instance is represented as a row in a table; and
- each column represents an attribute of the instances, which implies that all instances have the same set of attributes, i.e., they are homogeneous.
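The tabular shape these characteristics describe can be sketched concretely. The following minimal example (with made-up feature values and a hypothetical binary label) shows what scikit-learn actually expects: a 2-D array with one row per instance and one column per attribute.

```python
# A minimal sketch: scikit-learn estimators expect tabular input,
# i.e. a 2-D array of shape (n_instances, n_attributes).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy table: 4 instances, 3 numeric attributes each (values are invented).
X = np.array([
    [1.0, 0.5, 3.2],
    [0.9, 0.7, 2.8],
    [5.1, 4.2, 0.3],
    [4.8, 3.9, 0.5],
])
y = np.array([0, 0, 1, 1])  # one label per row

model = LogisticRegression().fit(X, y)
# Prediction for a new row of the same shape; rows are treated as i.i.d.
print(model.predict([[1.1, 0.6, 3.0]]))
```

Note that nothing in `X` can express a relationship between two rows; that is exactly the limitation the rest of this section discusses.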

Graph data are not tabular. The entities (nodes) in a graph are connected by relationships (edges), and are therefore not independent. It is not possible to convert a graph to a table without some loss of information.

Let’s take an example. The Pima Indians Diabetes Database, from the National Institute of Diabetes and Digestive and Kidney Diseases, has been used to build predictive models for detecting type 2 diabetes in women of the Pima group of Native Americans. The dataset consists of a table in which each row represents a person and each column represents an attribute of that person, namely, the number of pregnancies, glucose concentration, blood pressure, skin thickness, insulin, body mass index (BMI), diabetes pedigree function, age, and outcome (whether the person has diabetes or not). Any predictive model based on this data therefore tries to predict the outcome from the values of the other attributes alone. In doing so, however, the model treats every person in isolation and ignores whether a person is related to someone with diabetes, because the table described above does **not** capture parent-child and sibling relationships. In other words, information about relationships that could potentially be helpful in building models is lost.
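To make the information loss tangible, here is a sketch of that tabular shape. The column names follow the dataset description above, but the values are entirely made up for illustration.

```python
# A sketch of the table described above; the numbers are invented,
# only the column layout follows the dataset description.
import pandas as pd

table = pd.DataFrame({
    "pregnancies": [2, 0],
    "glucose": [120, 95],
    "blood_pressure": [70, 64],
    "skin_thickness": [25, 20],
    "insulin": [80, 60],
    "bmi": [28.5, 22.1],
    "pedigree": [0.45, 0.30],
    "age": [34, 27],
    "outcome": [1, 0],
})

# Each row is an isolated person: nothing in this structure can record
# that, say, the two people are siblings. Relationship information has
# no column to live in, so it is simply absent.
print(table.shape)
```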

Technically, we say that graph data is higher-dimensional than tabular data. Further, the entities could originate from a variety of domains and have different sets of attributes. DBPedia includes both cities and authors (among other types of entities), but they certainly do not have the same attributes.

# Is it possible to convert graph data to tabular data?

Imagine that you have a lot of graph data but cannot find any predictive model that accepts it directly. You might decide to convert the graph into a table yourself. Since we likely want the converted data to be amenable to a wide range of algorithms, we must somehow quantify how similar, or dissimilar, any two entities are. Technically, we need to define a distance metric.

To see why this is important, consider, for example, how a k-Nearest Neighbor classifier trained on these entities would work. Given any new instance, the classifier would first locate k previously seen instances that are **most similar** to the new instance. It would then look at the labels (i.e., targets) of those similar instances to predict the label of the new instance. As another example, consider a recommendation system that predicts how you would rate a new movie you haven’t watched yet. One way to make this prediction is to find viewers who have rated the new movie, and whose ratings of many other movies are similar to yours. As yet another example, if you are writing a program to go through a large collection of blog posts to cluster them into categories, you need a way to tell whether two posts are similar and should belong to the same category or not. Many machine learning algorithms use distance metrics under the hood.
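The k-Nearest Neighbor example above can be sketched in a few lines, assuming the instances are already rows of numbers (the points and labels here are toy values):

```python
# A sketch of the k-Nearest Neighbor idea described above, with k = 3.
# The two clusters and their labels are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [1, 0], [0, 1],    # cluster labeled "a"
                    [5, 5], [6, 5], [5, 6]])   # cluster labeled "b"
y_train = np.array(["a", "a", "a", "b", "b", "b"])

# KNeighborsClassifier uses Euclidean distance by default, which is
# precisely the distance metric discussed in the next section.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The three nearest previously seen instances to (0.5, 0.5) are all
# labeled "a", so the classifier predicts "a".
print(knn.predict([[0.5, 0.5]]))
```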

# Enter embeddings to help along the way

Defining distances can get tricky. We can say that 5 is closer to 1 than to 20, because the distance between 5 and 1 is 4, whereas the distance between 5 and 20 is 15. Suppose that on a two-dimensional grid of rows and columns, we pick two points P (row 1, column 1) and Q (row 4, column 5). Then, the distance between P and Q is given by √((4-1)² + (5-1)²) = 5, which is called the Euclidean distance between the two points. The notion of Euclidean distance can be naturally extended to three- and higher-dimensional spaces of numbers. On the other hand, if we ask whether the word ‘apple’ is closer to the word ‘car’ than to the word ‘table’, the answer is far less straightforward. These examples suggest that when we convert a graph to a table, it makes sense to stick to numbers, and Euclidean distances, as much as possible.
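The grid calculation above is easy to check in code. This small function computes the Euclidean distance for points of any dimension, which is the natural extension mentioned above:

```python
# Euclidean distance between two points of equal dimension,
# checked against the worked example P = (1, 1), Q = (4, 5).
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 1), (4, 5)))        # 5.0, i.e. sqrt((4-1)^2 + (5-1)^2)
print(euclidean((1, 2, 3), (4, 6, 3)))  # 5.0 again, now in three dimensions
```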

Now we can take this idea one step further: map all the entities of a graph to points (rows of numbers) in the same Euclidean space. The representation of each entity would be a row of numbers, and the distance between two entities would be the Euclidean distance between the points they are mapped to.

This is the core idea of a **graph embedding**, which maps entities in a graph to points in a Euclidean space such that “similar” entities are mapped to nearby points in that space. Once you have computed embeddings, you can use them to train your favorite predictive model, whether it is k-nearest neighbors, support vector machines, deep neural networks, or even an unsupervised model such as k-means clustering.
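The downstream step is straightforward once embeddings exist. The sketch below assumes the embeddings have already been computed by some method (the values here are invented) and simply feeds them to k-means clustering, the unsupervised model mentioned above:

```python
# A sketch of the pipeline described above: once each entity has an
# embedding (a row of numbers), any standard model can consume it.
# These embedding values are made up for illustration only.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.array([
    [0.1, 0.2], [0.0, 0.3], [0.2, 0.1],   # three mutually similar entities
    [4.0, 4.1], [4.2, 3.9], [3.9, 4.0],   # a second group of entities
])

# Nearby points (i.e. similar entities) end up in the same cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print(kmeans.labels_)
```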

We must clarify that graph embedding can also mean mapping entire graphs (as opposed to entities in a graph) to points in a Euclidean space, but such embeddings are beyond the scope of this discussion.

# What’s next?

Since graph embeddings aim to map similar entities to points that are close to one another, understanding graph embeddings requires us to understand:

- what we mean when we say two entities are similar, and
- how to actually map an entity to a point in a Euclidean space.

In our next installment, we will take a closer look at these questions. Stay tuned!

*Whitepaper: Sourav Mukherjee, Tim Oates, Ryan Wright. Graph Node Embeddings using Domain-Aware Biased Random Walks. CoRR abs/1908.02947. 2019.*

# References

[1] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5:1–22, 2009. URL: http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf

[2] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. June 2014. URL: http://snap.stanford.edu/data

[3] Pedregosa *et al.* Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825–2830, 2011. URL: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html