# Algorithms for Analyzing and Mining Real-World Graphs

## Frank Takes

Promotor: prof.dr. J.N. Kok (UL)

Co-promotor: dr. W.A. Kosters (UL)

Universiteit Leiden

Date: *19 November 2014, 16:15*

Thesis: *PDF*

#### Summary

This thesis focuses on algorithms for analyzing large graphs (also referred to as networks). As opposed to synthetic graphs that are usually the result of applying some mathematical model, real-world graphs are based on data generated by some system, organisation or environment. Examples include (online) social networks, webgraphs, information networks, biological networks and scientific citation networks. Although the graphs studied in this thesis differ in terms of what kind of information the objects and relationships represent, it turns out that the structure of each these networks is surprisingly similar. Characteristic properties include a low density, a power-law degree distribution, low average node-to-node distances and usually one giant component containing the majority of the nodes.

A graph is one of the most fundamental data structures used in computer science, and a wide range of algorithms is available to compute all kinds of properties and measures of graphs. However, the graphs studied in this thesis are typically very large: the number of nodes is easily a few million, and a common number of edges is anywhere between a few million and a billion, which makes it problematic to use various traditional graph algorithms due to their (quadratic or worse) time complexity. For computer scientists, there is an obvious challenge to design efficient algorithms that allow these large graphs to be processed and analyzed in a practical setting. As opposed to using brute-force computation power or parallel systems, in this thesis the characteristic non-random structure of real-world graphs is exploited in order to efficiently compute or approximate various properties and measures of these real-world graphs.

For example, the node-to-node distance, which is traditionally computed using Dijkstra’s shortest path algorithm (or Breadth First Search in an unweighted graph), can efficiently be approximated with high accuracy using a small set of carefully chosen landmarks for which the distances are precomputed. Another example is computing the exact diameter (longest shortest path length) of a graph, which would normally require a run of an All Pairs Shortest Path algorithm to find the largest distance over all node pairs. This work introduces an efficient exact algorithm based on lower and upper bounds that only assesses a particular subset nodes in order to obtain tight bounds on the value of the diameter, allowing it to be computed with only a handful of Breadth First Searches, whereas normally each node would have to examined. Using a similar technique, exact algorithms for computing the radius, center and periphery of a graph are suggested. The thesis furthermore includes a case study of a large (former) dutch online social network in the context of network centrality measures, as well as an analysis of patterns found in paths generated by humans traversing the network of linked Wikipedia pages.