## Clustering Instances

Last graph file update: July 26, 2011, addition of files (descriptions further down):

- G_n_pin_pout.graph.bz2
- preferentialAttachment.graph.bz2
- smallworld.graph.bz2
- uk-2002.graph.bz2
- uk-2007-05.graph.bz2

The graphs in this category have been taken from various sources, as listed below. Except where otherwise noted, all graphs have been processed as follows:

- symmetrized (the union of the graph and its transpose, i.e., all edges are un/bidirectional)
- parallel edges have been removed
- self-loops have been removed
- edges with weight 0 or weight "inf" have been removed
- graphs with non-integer weights have been made unweighted
- graphs where all weights equal 1 have been made unweighted
- weights are only added up if the original graph is directed and contains integer weights

The following four datasets are taken from http://deim.urv.cat/~aarenas/data/welcome.htm , with kind permission of Alex Arenas, Dept. Enginyeria Informatica i Matematiques (Computer Science & Mathematics), Universidad Rovira i Virgili.

Graph |
n |
m |

jazz.graph.bz2 Jazz musicians network . List of edges of the network of Jazz musicians. Data compiled by P.Gleiser and L. Danon , Adv. Complex Syst.6, 565 (2003). |
198 |
2742 |

celegans_metabolic.graph.bz2 : C. elegans metabolic network. List of edges of the metabolic network of C.elegans. Community identification using Extremal Optimization J. Duch and A. Arenas, Physical Review E , vol. 72, 027104, (2005). |
453 |
2025 |

email.graph.bz2 : E-mail network URV. List of edges of the network of e-mail interchanges between members of the Univeristy Rovira i Virgili (Tarragona). Please cite R. Guimera, L. Danon, A. Diaz-Guilera, F. Giralt and A. Arenas, Physical Review E , vol. 68, 065103(R), (2003). |
1133 |
5451 |

PGPgiantcompo.graph.bz2 : PGP network. List of edges of the giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange. M. Boguña, R. Pastor-Satorras, A. Diaz-Guilera and A. Arenas, Physical Review E, vol. 70, 056122 (2004). |
10680 |
24316 |

The following 16 datasets are taken from http://www-personal.umich.edu/~mejn/netdata , with kind permission of Mark Newman, Department of Physics and Center for the Study of Complex Systems, University of Michigan and Santa Fe Institute.

Graph |
n |
m |

adjnoun.graph.bz2 Word adjacencies: adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens. M. E. J. Newman, Phys. Rev. E 74, 036104 (2006). |
112 |
425 |

as-22july06.graph.bz2 Internet: a symmetrized snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP tables posted by the University of Oregon Route Views Project. Mark Newman, using data for July 22, 2006. |
22963 |
48436 |

astro-ph.graph.bz2 Astrophysics collaborations: weighted network of coauthorships between scientists posting preprints on the Astrophysics E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). |
16706 |
121251 |

celegansneural.graph.bz2 Neural network: A directed, weighted network representing the neural network of C. Elegans. Data compiled by D. Watts and S. Strogatz and made available on the web here. Please cite D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). Original experimental data taken from J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, Phil. Trans. R. Soc. London 314, 1-340 (1986). A note on the conversion: the original graph is directed and contains integer weights. Each undirected edge in the converted graph that represents two directed edges has an edge weight that equals the sum of the weights of these edges. |
297 |
2148 |

cond-mat.graph.bz2 Condensed matter collaborations 1999: weighted network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). |
16726 |
47594 |

cond-mat-2003.graph.bz2 Condensed matter collaborations 2003: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and June 30, 2003. The largest component of this network, which contains 27519 scientists, has been used by several authors as a test-bed for community-finding algorithms for large networks; see for example J. Duch and A. Arenas, Phys. Rev. E 72, 027104 (2005). M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). |
31163 |
120029 |

cond-mat-2005.graph.bz2 Condensed matter collaborations 2005: updated network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive. This version includes all preprints posted between Jan 1, 1995 and March 31, 2005. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). A note on the conversion: the original graph contains two edges with weight inf, these were removed before making the graph unweighted. |
40421 |
175691 |

dolphins.graph.bz2 Dolphin social network: an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand. D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003). Thanks to David Lusseau for permission to post these data on this web site.. |
62 |
159 |

football.graph.bz2 American College football: network of American football games between Division IA colleges during regular season Fall 2000. M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002). |
115 |
613 |

hep-th.graph.bz2 High-energy theory collaborations: weighted network of coauthorships between scientists posting preprints on the High-Energy Theory E-Print Archive between Jan 1, 1995 and December 31, 1999. M. E. J. Newman, Proc. Natl. Acad. Sci. USA 98, 404-409 (2001). |
8361 |
15751 |

karate.graph.bz2 Zachary's karate club: social network of friendships between 34 members of a karate club at a US university in the 1970s. W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977). |
34 |
78 |

lesmis.graph.bz2 Les Miserables: coappearance network of characters in the novel Les Miserables. D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993). |
77 |
254 |

netscience.graph.bz2 Coauthorships in network science: coauthorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006. M. E. J. Newman, Phys. Rev. E 74, 036104 (2006). |
1589 |
2742 |

polblogs.graph.bz2 Political blogs: A directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance. L. A. Adamic and N. Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). Thanks to Lada Adamic for permission to post these data on this web site. |
1490 |
16715 |

polbooks.graph.bz2 Books about US politics: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent copurchasing of books by the same buyers. The network was compiled by V. Krebs and is unpublished, but can found on Krebs' web site ( http://www.orgnet.com/ ). Thanks to Valdis Krebs for permission to post these data on this web site. |
105 |
441 |

power.graph.bz2 Power grid: An undirected, unweighted network representing the topology of the Western States Power Grid of the United States. Data compiled by D. Watts and S. Strogatz and made available on the web here. D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). |
4941 |
6594 |

The following three datasets are taken from http://law.dsi.unimi.it/datasets.php, and have all been gathered by members and collaborators of the Laboratory of Web Algorithms, University of Milano. We refer to the above site for further details on the datasets. Works that these data sets base upon and should be cited when using the data are:

inproceedings{BoVWFI,

author ="Paolo Boldi and Sebastiano Vigna",

title = "The {W}eb{G}raph Framework {I}: {C}ompression Techniques",

year = 2004,

booktitle="Proc. of the Thirteenth International World Wide Web Conference (WWW 2004)",

address="Manhattan, USA",

pages="595--601",

publisher="ACM Press"

}

@inproceedings{BRSLLP,

author = "Paolo Boldi and Marco Rosa and Massimo Santini and Sebastiano Vigna",

title = "Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks",

booktitle="Proceedings of the 20th international conference on World Wide Web",

year = 2011,

publisher="ACM Press"

}

If the graphs you are using were gathered by UbiCrawler, please acknowledge the usage of UbiCrawler by quoting the following paper:

@article{BCSU3,

author="Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna",

title="UbiCrawler: A Scalable Fully Distributed Web Crawler",

journal="Software: Practice \& Experience",

year=2004,

volume=34,

number=8,

pages="711--726"

}

Graph |
n |
m |

cnr-2000.graph.bz2 : A very small crawl of the Italian CNR domain |
325557 |
2738969 |

eu-2005.graph.bz2 : A small crawl of the .eu domain |
862664 |
16138468 |

in-2004.graph.bz2 : A small crawl of the .in domain performed for the Nagaoka University of Technology |
1382908 |
13591473 |

uk-2002.graph.bz2 : This graph has been obtained from a 2002 crawl of the .uk domain performed by UbiCrawler. |
18520486 |
261787258 |

uk-2007-05.graph.bz2 : This graph is a time-aware graph generated by combining twelve monthly snapshot of the .uk domain collected for the DELIS project. You might be interested in the fact that the format the authors of this graph use compresses it as to use only 1.207 bits/link. |
105896555 |
3301876564 |

The following two networks are road networks taken from the benchmark of the 9th DIMACS Implementation Challenge on shortest paths (see http://www.dis.uniroma1.it/~challenge9/download.shtml)

Graph |
n |
m |

road_central.graph.bz2 |
14081816 |
16933413 |

road_usa.graph.bz2 |
23947347 |
28854312 |

The following graph is taken from Vladimir Batagelj and Andrej Mrvar (2006): Pajek datasets (http://vlado.fmf.uni-lj.si/pub/networks/data/)

Graph |
n |
m |

chesapeake.graph.bz2 Chesapeake Bay Mesohaline Network; D. Baird; Umcees Ref No. XXX-86; MGC/M2/SUM 1 from datall.zip from http://www.cbl.cees.edu/~ulan/ntwk/network.html Chesapeake Bay mesohaline ecosystem. Summer carbon flows. Baird, D. and R.E. Ulanowicz. 1989. The seasonal dynamics of the Chesapeake Bay ecosystem. Ecol. Monogr. 59: 329-364. transformed from SCOR format by Scor2net 15.7.2003 |
39 |
170 |

Other graphs:

Graph |
n |
m |

caidaRouterLevel.graph.bz2 The Cooperative Association for Internet Data Analysis (CAIDA) investigates the structure of the Internet and releases measurements of the adjacency matrix of the Internet router-level graph. This graph is the undirected network available from the CAIDA homepage. The data has been collected by CAIDA between Apr 21 and May 8, 2003. | 192244 |
609066 |

preferentialAttachment.graph.bz2 This graph has been generated following a preferential attachment process (see Barabási and Albert, "Emergence of scaling in random networks", Science, 1999). Starting with a clique of five vertices, the vertices are successively added to the graph. Each new vertex chooses exactly five neighbors among the existing vertices, such that the probability of choosing a particular vertex is proportional to its degree. In our implementation, a vertex can choose a neighbour only once, such that the resulting random graph is guaranteed to be simple. | 100000 |
499985 |

smallworld.graph.bz2 This graph has been generated using the small world generator of the Boost Graph Library. Starting with a ring of 100000 vertices and 500000 edges, edges are rewired with a probability of 0.2 according to the random model of Watts and Strogatz (see Watts and Strogatz, "Collective Dynamics of Small-World Networks", Nature, 1998). Multiple edges are removed after the process. | 100000 |
499998 |

G_n_pin_pout.graph.bz2 This graph has been generated using a two-level Gnp random-graph generator. First, each vertex chooses a cluster to belong to, iid randomly. Then, in the spirit of the Erdös-Rényi model, cluster-internal edges are created with a given internal probability each, then cluster-external edges are created with a smaller external probability each. The parameters for this instance are: 100000 vertices, 316 clusters, the internal and the external edge probability are chosen such that the expected number of cluster-internal and the expected number of cluster-external incidences of a node are both five. Such a graph is simple. For references, details and a dynamic version see this project page. | 100000 |
501198 |

The set of bibliometric graphs is also part of the testbed for the clustering challenge, but the files are not repeated here. Please visit the corresponding page to obtain them.