Unequal Entry Requirement in Wikipedia Based on Gender
This is the data and code necessary to reproduce the main results in the paper “Traces of Unequal Entry Requirement for Illustrious People on Wikipedia Based on their Gender”.
To run this code, you will need the following Python libraries: numpy, scipy, pandas, networkx.
In the archive, you will find two folders and six Python scripts.
Folder “data”: This contains the preprocessed data from Wikipedia, in three files:
- network.csv is the network in edgelist format. It is a comma-separated file with three columns: origin node, destination node, and edge weight. Nodes are the ids of illustrious people and match those in the pantheon file.
- pantheon.csv is the table containing the node attributes. It is a comma-separated file with six columns, recording, for each illustrious person on Wikipedia included in this study: id (compatible with network.csv), gender, birth year, Wikipedia URL slug, month in which the page was added to Wikipedia, and year in which the page was added to Wikipedia.
- sampled_nlp_weights.csv is a sample of edges with their weight estimates from our two alternative NLP approaches. It is a comma-separated file with one row per sampled edge and two columns: the weight in the TRF framework and the weight in the LG framework.
Folder “libraries”: This contains a single file, backboning.py. It is a Python library that facilitates the extraction of backbones from networks. It is imported by the other scripts and does not need to be run directly.
The Python scripts should be run in sequence. They take no arguments. They are:
- 01_figs_2_3_tab_1.py: this reproduces Figures 2 and 3, and Table 1 of the main paper. It produces 16 tab-separated output files, one per data series in the figures (eight for Figure 2 and eight for Figure 3), and prints the contents of Table 1 to the terminal. Each file contains the CCDF of a specific combination of gender, network, and measure, in two columns: the x and y coordinates of the figure.
- 02_tab_2.py: this script prints in the terminal the information contained in Table 2.
- 03_fig_4.py: this script produces a tab-separated file. Each column contains a data series of Figure 4 in the main paper, except the first column, which contains the p-value threshold (the x-axis of the figure).
- 04_fig_5_left.py: this script produces a tab-separated file. Each column contains a data series of Figure 5 (left) in the main paper, except the first column, which contains the p-value threshold (the x-axis of the figure).
- 05_fig_5_right_6_7.py: this script produces three tab-separated files. The first contains the data series of Figure 5 (right) in the main paper, one per column, except the first column, which contains the p-value threshold (the x-axis of the figure). The second contains three columns: the year, and the number of profiles of women and of men added to Wikipedia in that year (for Figure 6). The third contains two columns: a birth-year time bin and the count of people born within that interval who have a Wikipedia page (for Figure 7).
- 06_fig_8.py: this script produces a tab-separated file. The first column contains an edge weight and the second a label indicating which NLP framework produced that weight. This reproduces Figure 8. The script also prints to the terminal the slope, intercept, and r-squared of the regression between the two weights.
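For reference, the CCDFs stored in the output files of 01_figs_2_3_tab_1.py follow the standard empirical definition, which can be sketched in a few lines of numpy. This is generic illustration code, not the script's exact implementation.

```python
import numpy as np

def ccdf(values):
    """Empirical complementary CDF: for each distinct value x,
    the fraction of observations that are >= x."""
    values = np.asarray(values)
    x = np.sort(np.unique(values))
    y = np.array([(values >= v).mean() for v in x])
    return x, y
```

Writing the returned `x` and `y` arrays as two tab-separated columns yields the same layout as the output files described above.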
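The slope, intercept, and r-squared printed by 06_fig_8.py are standard linear-regression statistics; scipy.stats.linregress computes them directly, as in this sketch. The function name and the idea of passing the TRF and LG weight columns from sampled_nlp_weights.csv are illustrative, not taken from the script itself.

```python
from scipy import stats

def compare_weights(trf_weights, lg_weights):
    """Ordinary least-squares fit of LG weights against TRF weights.

    Returns (slope, intercept, r_squared), the three statistics
    printed by the script for Figure 8.
    """
    fit = stats.linregress(trf_weights, lg_weights)
    return fit.slope, fit.intercept, fit.rvalue ** 2
```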