Node Vector Distance

This page serves as an entry point for multiple publications. If you only want the NVD library click here or here for the Julia version, which is more time and space efficient. You can use it by putting the Python file in your path and then use it as:

import network_distance as nd
ge_dist = nd.ge(src, trg, G)

Assuming that src and trg are Python dictionaries whose keys are node IDs and values are the value corresponding to the node; and G is a networkx object. More advanced usage instructions are provided to replicate the results of the following papers (click to scroll):


Download Fast Generalized Euclidean material

This archive contains the data and code to replicate the results for the paper: “Fast Generalized Euclidean with Laplacian Solvers”

The code has been tested on a Intel Xeon Platinum 8358 CPU @ 2.60GHz running CentOS Linux release 7.9.2009 (Core) with Python 3.10.6 and Julia 1.8.0. The code ran on a Slurm partition which might have induced a runtime overhead — it might be a bit faster than what reported in the paper on your laptop (it is faster on mine).

The code relies on the following Pythion libraries: numpy and pandas. It also relies on the following Julia packages: Graphs, SimpleWeightedGraphs, Laplacians, BenchmarkTools, SparseArrays, LinearAlgebra, Distances, Statistics.

Run the scripts in the order they are numbered — except do not run 02a, run 02b instead (02b calls 02a for you). None of the scripts requires a parameter, so you can simply run them from the terminal. Note that they take a while to run, because of how slow the baseline is.

The *.sh, *.py, and *.gp scripts are necessary only for your convenience and for generating figures, all core results are generated by the *.jl scripts. The outputs of the scripts are comma-, space-, or tab- separated files and are named after the figure they generate.

The data folder contains all the data we used for the paper. It is included in the package because some networks underwent a non-trivial preprocessing step to make them undirected, unweighted and single layer in case the original network was more complex.

The library folder includes all the NVD implementations in Julia, even some that are not studied in the paper.

Download Fast Generalized Euclidean material


Download Multilayer Generalized Euclidean material

This archive contains the data and code to replicate the results for the paper: “Generalized Euclidean Measure to Estimate Distances on Multilayer Networks”

The “implementation” folder contains a python library with an implementation of most NVD algorithms. Put it in your Python path or in a folder where your scripts can access it to use it. You can replicate the results of the paper by leaving it in its current folder and running the experiment scripts without modifying their paths. For instance, if you put network_distance.py in you path, the following minimal script will work, assuming you have an “edges” and a “couplings” file in the correcte format (and that nodes 0 and 1 exist for the src and trg input dictionaries):

import networkx as nx
import network_distance as nd

G = nx.read_edgelist(“edges”)

with open(“couplings”, ‘r’) as f:
couplings = set([tuple(line.strip().split(‘\t’)) for line in f])

src = {0: 1}
trg = {1: 1}

dist = nd.ge(src, trg, G, multilayer = True, couplings = couplings)

The “data” folder contains the datasets used in the paper.

  • Each dataset corresponds to two files: an “_edges” file and a “_couplings” file. The first contains the intra-layer connections, while the second contains the inter-layer couplings.
  • The “_edges” files have three tab-separated columns: node1, node2, layer. Each line tells you that node1 is connected to node2 in the layer.
  • The “_couplings” files contain one line per node identity. All nodes in the same line are assumed to correspond to the same identity and thus are coupled to each other.

Each “sec_” folder contains the code to replicate the results from the corresponding section of the paper.

  • “sec_5_2”: Both scripts have no input data nor parameters. Simply run them and they will produce output files.
    • “sec_5_2_1.py”:
      • Output: is tab-separated, 7 columns. Column #1 is the number of layers between origin and destination. Columns #2-7 report the NVD for: MLGE, MLEMD, MLGFT, GE, EMD, GFT, repsectively.
    • “sec_5_2_2.py”:
      • Output: is tab-separated, 7 columns. Column #1 is the strength of inter-layer couplings. Columns #2-7 report the NVD for: MLGE, MLEMD, MLGFT, GE, EMD, GFT, repsectively.
  • “sec_5_3”:
    • All python scripts require one command line input: the name of the input network (possible values: aarhus, copenhagen, egosm, euair, ira, physics).
    • All python scripts output a tab-separated file with 9 columns: beta value, MLGE, GE, MLEMD, EMD, MLGFT, GFT, mlcount, and count. They are the inputs for the R scripts.
    • All R scripts require the corresponding part1.py script to run. They print the RMAED values for MLGE, MLEMD, MLGFT, GE, EMD, GFT, in this order.
    • NOTE: the python scripts take a while to run. It is advised to run them in parallel.
  • “sec_5_4”: All scripts have no input data nor parameters. Simply run them and they will produce output files.
    • “sec_5_4_1.py” to “sec_5_4_3.py”: 7 tab-separated columns. Column #1 is the dimension of the input we’re increasing, then the average and standad deviation of 10 runs of MLGE, MLEMD, and MLGFT.
    • “sec_5_4_4.py”: The runtime of each method for 1 and for 1000 NVDs on the same network.
  • “sec_5_5”:
    • “01_delta_l.py”:
      • Takes as input the layer you want to remove from the eu_air network to calculate the distance. Pass “orig” if you want to keep all layers in the network.
      • Outputs a layer_id.csv with six tab-separate columns: country1, country2, removed layer, MLGE NVD distance between country1 and country2, number of edges and nodes in the network.
    • “02_table_5.py”:
      • Requires the output of the previous script as input and outputs two tab-separated files.
      • “layer_distance.csv” reports the average country pairwise MLGE distance for each removed layer in the network. The rows with the highest five values are included in Table 5 in the paper.
      • “layer_country_distance.csv” reports the average MLGE distance using one country as an origin and removing one layer of the network in turns.
      • The former provides the ranking of the airline in the third column of Talbe 6 in the paper; the latter provides the fourth column in the same table.
    • “03_layer_centralization.py”: calculates the centralization values for all layers in the network, and stores them in a tab-separated file together with the output of the previous script.
    • “04_alternative_explanations.r”: prints the result of simple regressions explaining the observed MLGE distances with simpler networm statistics from the previous script.

Download Multilayer Generalized Euclidean material


Download Pearson Correlations on Complex Networks material

This is the code necessary to reproduce the results in the paper “Pearson Correlations on Complex Networks”. The folder “implementation” contains the code necessary to calculate (among other things) the network Pearson correlation. Simply put the Python script in your path and then use it like shown:

import network_distance as nd
netcorr = nd.correlation(v1, v2, G)

v1 and v2 are python dictionaries and G is a networkx object.

The library has the following dependencies: numpy, pandas, networkx, pyemd, scipy. For this specific paper you only need numpy and networkx, so you could delete all the code you don’t need. Be aware that the calculation of shortest path lengths relies on the multiprocessing library. It has been tested on Ubuntu 20.04, but not on Windows or Mac. It likely won’t work on those platforms unless patched.

Each of the “secXX” folders allows you to replicate the result from that specific subsection. All scripts can be just ran in the terminal as they are. They either print their output on standard output in the terminal or in a file. Specific instructions:

sec031:

  • fig02.py outputs four files with two tab-separated columns: the network correlation (column 1) and the non-network Pearson (column 2) for the random vectors.
  • fig03.py outputs a file with two tab-separated columns: the network correlation (column 1) and the non-network Pearson (column 2) for a Stochastic BlockModel.

sec032:

  • fig04.py produces an LFR network and some node attributes for differently correlated vectors. Note that the LFR benchmark creation might fail, so you might need to try a few times before succeeding.
  • fig05a.py outputs the table with network correlation values for a random sample of pairs of sources at a given distance in a GNM graph. Outputs a file with two tab-separated columns: the network distance between sources and the correlation value.
  • fig05b.py is the same as fig05a.py, but for a Small World network rather than a GNM.
  • tab1-2_*.py create the input for the regression between the path length between sources and the network correlation. Each script produces the correlation for its corresponding network topology. Run all these scripts before tab1-2.r.
  • tab1-2.r runs the regression and calculates accuracy and mean absolute error to reproduce Tables 1 and 2 in the paper.

sec04:

  • fig6.py should be run before tab3.r. It creates a file with four tab-separated columns: the two countries and their network and non-network correlations.
  • tab3.r runs the regression between countries’ trade volume and geographical distance, and their trade correlations.
  • tab4_part1.py performs the optimization of the product space via simulated annealing. WARNING: it takes soooooo long. That’s why I include already an optimized version (which is slightly more optimized than the version used in the paper).
  • tab4_part2.py is an equivalent of fig6.py, but for tab4.r
  • tab4.r runs the regression between countries’ trade volume and geographical distance and their trade correlations, on the Product Space at the 2 digit level, with and without optimization.

Download Pearson Correlations on Complex Networks material


Download ZIP Archive

This archive contains the code and the data to replicate the study of network node distance measures detailed in the paper “The Node Vector Distance Problem in Complex Networks“. We provide only the code and the data we have the right to share, pointing to the original sources when the original material could not be repackaged.

The implementation folder contains the library implementing in Python 3.6.5 all known node vector distance measures. The minimal code to use it, assuming you placed network_distance.py in your path or in the same folder of execution, is the following:

import network_distance as nd
ge_dist = nd.ge(src, trg, G)

Assuming that G is a networkx unweighted graph and that src and trg are two dictionaries, whose keys are the nodes the agent is occupying and whose values are the occupation intensity. The above code calculates the Laplacian Generalized Euclidean distance.

The library prerequisites are the following (the versions are the ones for which the library has been developed, newer or older version could still work): Numpy 1.17.2, Scipy 1.3.1, Pandas 0.25.1, Networkx 2.4, pyemd 0.5.1. Additionally, some experiments also require statsmodels 0.10.1.

WARNING: The calculation of shortest path lengths relies on the multiprocessing library. Moreover, MAPF relies on running an external binary with the subprocess library. Both operations are tested and work reliably on Ubuntu 18.04.1 LTS. I know Windows might have issues with the code as it stands. Thus you might need to calculate the shortest paths independently from the library.

Each folder in this archive allows you to replicate the figures and the tables of the result section. The specific figure and table to replicate is determined by the folder name.

To run the MAPF node vector distance (and reproduce its results) you will need the binary insolver_reLOC from here. For some experiments, you will also need the binary_network benchmark from here to generate LFR synthetic networks.

Download ZIP Archive


This is the code necessary to replicate the results of the paper “Generalized Euclidean Measure to Estimate Network Distances

The archive contains the library implementing in python the generalized Euclidean approach, the Graph Fourier Transform, and the Earth Mover Distance.

Each folder allows the replication of a subsection of the experiments. Simply cd into the folders and run the scripts. Each script should generate (among other things) a csv file with the result of the experiment.

IMPORTANT NOTES:

– We cannot repackage the Anobii dataset from Section 5.3. The script in the corresponding folder would work if you manage to obtain the following files from the original authors and place them in the sec53 folder:
– anobii-friendship.dat: a space-separated unweighted edgelist of the social relationships.
– anobii-bookeshelves-*.dat: a set of six files, each containing a space-separated list of which book is in which user’s bookshelf. The first column is the id of the user, the second column is the id of the book.
– anobii-books-*.dat: a set of six files, each containing a tab-separated list of metadata per book. The columns are, in order: id of the book, isbn of the book, title of the book, author of the book.

– The python libraries we use assume you are using Python 3 with the following additional packages: numpy, scipy, pandas, sklearn, networkx, and pyemd.

Click here to Download