Sourav Mukherjee, PhD; Tim Oates, PhD; Ryan Wright
Abstract. The recent proliferation of publicly available graph-structured data has sparked an interest in machine learning algorithms for graph data. Since most traditional machine learning algorithms assume data to be tabular, embedding algorithms for mapping graph data to real-valued vector spaces has become an active research area. Existing graph embedding approaches are based purely on structural information and ignore semantic information from the underlying domain. In this paper, we demonstrate that semantic information can play a useful role in computing graph embeddings. Specifically, we present a framework for devising embedding strategies that are aware of domain-specific interpretations of graph nodes and edges and use knowledge of downstream machine learning tasks to identify relevant graph substructures. Using two real-life domains, we show that our framework yields embeddings that are simpler to implement and yet achieve equal or greater accuracy in machine learning tasks than domain-independent approaches.
In recent years, we have witnessed a dramatic increase in the volume of graph-structured data being generated and made publicly available. For example, the Stanford Large Network Dataset Collection provides graph data from a variety of domains such as social networks, web graphs (where nodes and edges represent web pages and hyperlinks, respectively), citation and collaboration networks, et cetera ( Leskovec and Krevl, 2014; Zitnik et al., 2018). Linked Data (Bizer et al., 2009 ) makes it possible for open data providers to publish graph-structured datasets using a standardized representation, namely, the Resource Description Format (RDF) ( Schmachtenberg et al., 2014 ). This standardized representation allows graph data from heterogeneous sources to be combined into a unified, web-scale graph, namely, the Linked Open Data (LOD) Cloud, that may be searched, crawled, and indexed just like the world wide web. These developments have motivated a strong interest in machine learning algorithms for extracting insights from graph data.
Traditional machine learning methods frequently rely on tabular data representation. Specifically, the existence of real-valued vector representations of instances (and of a real-valued distance metric between pairs of instances) is assumed. Consequently, such approaches are not readily applicable to graph data. However, it is possible to overcome this limitation by defining a mapping from entities in graph-structured data to points in a real-valued vector space Rd. Discovering such mappings, also known as embeddings, has emerged as an active area of research. A desirable embedding should:
- Be easy to compute.
- Map to a space with dimensionality.
- Induce a distance metric consistent with the notion of similarity in the original domain.
Complete technical paper available as a PDF.