Multidimensional Network Analysis

Implemented by Alessio Orlandi (Java) and me (Python)

Paper

This page provides two different implementations of the multidimensional network analysis framework that is described in this paper.

Download

The Java implementation (download) is a larger JAR file (14 MB) without source code. It implements more functions: for example, it prints the cumulative distributions of the Dimension Relevance, and it accepts as input an arc-labelled graph, such as the one used for storing the UK Union web dataset. More importantly, it is much faster than the Python implementation: I didn’t test it rigorously, but it should be about 4x faster.

The Python implementation (download) is much simpler (2 KB), and it is designed for people who want to edit the code and adapt it to their own needs. It is not slow in absolute terms (it processes around 3.5 million edges in about 20 seconds on a dual-core 64-bit i7 @ 2.8 GHz), but of course it’s not C++.

Instructions

The shared flags between the two implementations are the following:

-f: Specifies the path of the network input file.

-d: Specifies the directory containing ONLY network files in the same format and with the same number of dimensions.

-D: Specifies the number of dimensions.

-e: Use extended format (edgelist with an “e” as first field, then “node node dimension”) compatible with gSpan format. Otherwise the simple edgelist (node node dimension) is used.

-n: Print the node stats (zipped).

-i: Print the dimension stats.

The other flags, available only in the Java version, are explained in its --help output.

So, to analyze the network “querylog” with 7 dimensions, stored in gSpan format in the same directory as the script, requesting only the node statistics, we write:

java -jar mna.jar -f querylog -D 7 -e -I -n

or

python mna.py -f querylog -D 7 -e -n

Instead, to analyze a bunch of networks in the “dblp_conf” directory, each with 31 dimensions, not in gSpan format, requesting both node and dimension statistics, we write:

java -jar mna.jar -d dblp_conf -D 31 -I -n -i

or

python mna.py -d dblp_conf -D 31 -n -i
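If you want to launch the Python version from another script, the two commands above can be assembled programmatically. This is only a sketch: the helper name build_mna_command and its keyword arguments are hypothetical, and it uses only the shared flags listed above (-f, -d, -D, -e, -n, -i).

```python
import subprocess


def build_mna_command(path, dimensions, directory=False, extended=False,
                      node_stats=False, dimension_stats=False):
    """Build the argument list for mna.py from the shared flags.

    Hypothetical helper, not part of the download. `path` is passed
    with -d when `directory` is True, otherwise with -f.
    """
    cmd = ["python", "mna.py", "-d" if directory else "-f", path,
           "-D", str(dimensions)]
    if extended:
        cmd.append("-e")   # extended (gSpan-compatible) edgelist
    if node_stats:
        cmd.append("-n")   # node statistics
    if dimension_stats:
        cmd.append("-i")   # dimension statistics
    return cmd


cmd = build_mna_command("querylog", 7, extended=True, node_stats=True)
print(" ".join(cmd))
# To actually run it (requires mna.py in the working directory):
# subprocess.run(cmd, check=True)
```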

Datasets & Input instruction

Two important input instructions. First, the requested format is either a simple edgelist (node node dimension), for example:

0 1 1
0 2 1
1 2 3

or a gSpan edge list:

e 0 1 1
e 0 2 1
e 1 2 3
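To make the difference between the two formats concrete, here is a minimal sketch of a line parser. The function names are hypothetical and this is not how mna.py reads its input. It also assumes that node ids and dimensions are integers, as in the examples above.

```python
def parse_edge(line, extended=False):
    """Parse one edge line into a (node, node, dimension) tuple of ints.

    With extended=True the line must look like "e node node dimension";
    otherwise it must look like "node node dimension".
    """
    fields = line.split()
    if extended:
        if len(fields) != 4 or fields[0] != "e":
            raise ValueError("not an extended edge line: %r" % line)
        fields = fields[1:]
    elif len(fields) != 3:
        raise ValueError("not a simple edge line: %r" % line)
    u, v, d = (int(x) for x in fields)
    return u, v, d


def to_extended(edge):
    """Format a (node, node, dimension) tuple as an extended-format line."""
    return "e %d %d %d" % edge


# Convert a simple edgelist line to the extended format.
print(to_extended(parse_edge("0 2 1")))
```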

Second, and more important: the edges are required to be sorted. So the following is a correct edgelist:

0 1 1
0 1 2
0 2 1
1 2 3

while the following is not:

0 1 1
0 2 1
0 1 2
1 2 3
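If you are not sure whether your file is sorted, a quick check like the following can help. It is a sketch for the simple format only. It assumes the order is numeric by first node, then second node, then dimension, which is consistent with the two examples above but is my reading of them, not something the tools document.

```python
def read_edges(lines):
    """Turn "node node dimension" lines into integer tuples, skipping blanks."""
    return [tuple(int(x) for x in line.split()) for line in lines if line.strip()]


def is_sorted(edges):
    """True if every edge is <= the next one in (node, node, dimension) order."""
    return all(a <= b for a, b in zip(edges, edges[1:]))


good = read_edges(["0 1 1", "0 1 2", "0 2 1", "1 2 3"])
bad = read_edges(["0 1 1", "0 2 1", "0 1 2", "1 2 3"])

print(is_sorted(good))      # True
print(is_sorted(bad))       # False
print(sorted(bad) == good)  # True: sorting repairs the second list
```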

Two datasets are available:

  • DBLP Conf (download). A co-authorship network extracted from DBLP digital bibliography. Two authors are connected if they co-authored a paper. The conference venue is the dimension for the network. We record 31 conferences and the dimension map is the following: VLDB 0, SIGMOD 1, PODS 2, CIKM 3, KDD 4, ICDM 5, PKDD 6, SDM 7, IJCAI 8, AAAI 9, ICML 10, ECAI 11, ACL 12, EACL 13, NLDB 14, SIGIR 15, WWW 16, ECIR 17, WISE 18, PPOPP 19, PACT 20, ICPP 21, ICDCS 22, ICCV 23, CVPR 24, ECCV 25, CGI 26, ACM-MM 27, ICME 28, MMM 29, MoMM 30. The id->author map is included in the zip file.
  • Querylog (download). A co-queried network. Nodes are terms, connected if they are used together in the same query. This dataset is based on the AOL dataset. There are 6 dimensions. Each dimension represents the rank of the result the user clicked for the query. The dimension map is the following: Rank 1 -> 1, Ranks 2-3 -> 2, Ranks 4-6 -> 3, Ranks 7-10 -> 4, Ranks 11-58 -> 5, Ranks 59-500 -> 6. The id->word map is included in the zip file.
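The two dimension maps above can be written down as small lookup structures, which is handy when reading the output. This is a sketch: the names DBLP_CONFERENCES, RANK_UPPER_BOUNDS and rank_to_dimension are hypothetical and are not part of the downloads.

```python
import bisect

# DBLP Conf: the position in the list is the dimension id (VLDB is 0).
DBLP_CONFERENCES = [
    "VLDB", "SIGMOD", "PODS", "CIKM", "KDD", "ICDM", "PKDD", "SDM",
    "IJCAI", "AAAI", "ICML", "ECAI", "ACL", "EACL", "NLDB", "SIGIR",
    "WWW", "ECIR", "WISE", "PPOPP", "PACT", "ICPP", "ICDCS", "ICCV",
    "CVPR", "ECCV", "CGI", "ACM-MM", "ICME", "MMM", "MoMM",
]

# Querylog: highest clicked rank covered by dimensions 1 to 6.
RANK_UPPER_BOUNDS = [1, 3, 6, 10, 58, 500]


def rank_to_dimension(rank):
    """Map a clicked result rank (1-500) to its Querylog dimension (1-6)."""
    if not 1 <= rank <= 500:
        raise ValueError("rank outside the 1-500 range of the map: %r" % rank)
    # First bucket whose upper bound is >= rank; dimensions start at 1.
    return bisect.bisect_left(RANK_UPPER_BOUNDS, rank) + 1


print(DBLP_CONFERENCES[4])    # KDD
print(rank_to_dimension(5))   # 3
print(rank_to_dimension(59))  # 6
```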

Happy multidimensional analysis!