you (U) know Because I Know

Implemented by Giulio Rossetti

UBIK is an algorithm to rank nodes in a network according to their importance. Each node starts with a weighted list of “skills”: UBIK iteratively updates the weights of each skill for each node according to the (adjusted) weights of its neighbors. The paper describing UBIK’s implementation can be downloaded here.

The code is available for download as a JAR executable file.

To execute, the following command line has to be used from the shell:

java -jar ubik.jar node_file edge_file dimension_file

Where “node_file” is the path containing your node file, “edge_file” and “dimension_file” for the edge and dimension files.

Given the complex input format, we provide two datasets to understand it. These are the datasets used in the original paper.

  • Enron: we connected two employees if they wrote to each other at least once. We used as dimensions the day of the week when the communication took place (ending up with seven dimensions from Monday to Sunday). For the set of skills, we considered the 8 most used keywords in the subject field of the emails, eliminating stopwords and stemming the words. For each employee, we weigh the keyword using the number of times she used the word in an email subject.
  • DBLP: two authors are connected if they have written together a paper. The dimension of the connection is the venue where the publication appeared (we chose 16 top-tier conferences in computer science, including VLDB, SIGKDD, CIKM, ACL, SIGGRAPH and others). We chose as skills the keywords used in their publications. We eliminated stopwords, we applied a stemming algorithm on the remaining words and then we selected the 50 most commonly used keywords in a paper title. The number of times an author used a keyword is used to evaluate how much she considers herself an expert about that topic.

The input format of the three files is the following:

  • node_file: first line has to be “t tab # tab 0 tab skill-labels”. The skill labels have to be separated by dashes. Then, for each node, a line as follows: “v tab id tab starting_rank tab recommendation_level tab skill-values”. Skill values have to be separated by dashes. The starting rank is an integer that gives the idea of the known starting importance of the node (can start with all equals to 1 if not applicable). The recommendation level is used to know in which direction the recommendation goes: if a node with value “1” is connected to a node with value “0”, “1” passes its recommendation to “0”. If values are equal, recommendation is bidirectional. Can start with all values equal to 0 if not applicable.
  • edge_file: one line per edge. The line format is “e tab id_node1 tab id_node2 tab dimension_list”. Each dimension is represented by one single symbol. The dimensions have to be written one attached to the other, without separators.
  • dimension_file: one line per dimension. The line format is “dimension_id tab weight”. The dimension id is the symbol representing the dimension in the edge file. The weight can be any float, and it depends on the kind of analysis that you are interested in.

The output format is a line per node, with comma separated fields containing:

  • Column #1: node id;
  • Column #2: the resulting total rank (sum over all the skills: the old rank plus the new values for the skills);
  • Column #3: the old rank (renormalized);
  • Column #4-X: list of the new ranking for each skill, in the same order specified in the node_file.

We also provide a synthetic network generator. The generator generates a random network, creating files with the proper input for UBIK. You are free to specify the number of nodes, of skills, of dimensions and the average degree of the network.

Happy node ranking!