Originally crawled by Tim Weninger et al. [1]

This is the companion data of the paper “Popularity Spikes Hurt the Future Implementations of a Protomeme”.

You can download the data by clicking here (warning: the download is larger than 350MB). The data was collected from Reddit, in [1], and from Hacker News. In my own postprocessing, I selected only posts submitted to Reddit from April 5th, 2012 to April 26th, 2013, and to Hacker News from January 7th, 2010 to May 29th, 2014.

The archive contains two folders, each containing two files. I’ll describe the contents of the Reddit folder, as the Hacker News folder is logically equivalent. The first file, “id_roots_date_votes_1333584000”, contains the main dataset. The file contains 4 tab-separated columns:

  1. Post ID. This is the unique identifier of the post, assigned by the Reddit system. By following this ID, you can obtain the original post by visiting the URL<ID>. So, if you want to visit post “xbfwb”, you just go to
  2. List of meanings in the title. This is a comma-separated list of meaning IDs. To find out which meaning corresponds to which ID, look at the map provided in the archive and explained below.
  3. Timestamp. When the post was submitted to Reddit, in the standard UNIX format.
  4. Score. This is the score of the post at the moment of the data collection.
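
To make the four-column layout concrete, here is a minimal Python sketch of how one line of the main file could be parsed. It is an illustration under assumptions, not the code used for the paper: the `Post` type and the `parse_post` function are hypothetical names, the sample line is made up (only the post ID “xbfwb” and the timestamp 1333584000 appear on this page), and I assume the meaning IDs are best kept as strings and that the score is an integer.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Post(NamedTuple):
    """One row of the main dataset (hypothetical helper type)."""
    post_id: str
    meaning_ids: List[str]
    submitted: datetime
    score: int


def parse_post(line: str) -> Post:
    """Split one tab-separated line into its four columns."""
    post_id, meanings, timestamp, score = line.rstrip("\r\n").split("\t")
    # Column 2 is a comma-separated list; drop empty entries defensively.
    meaning_ids = [m for m in meanings.split(",") if m]
    # Column 3 is a UNIX timestamp; interpret it as UTC.
    submitted = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
    return Post(post_id, meaning_ids, submitted, int(score))


# Made-up sample line: the meaning IDs and the score are invented for illustration.
sample = "xbfwb\t12,57,103\t1333584000\t42"
post = parse_post(sample)
print(post.post_id, post.meaning_ids, post.submitted.isoformat(), post.score)
```

The timestamp 1333584000, which is also the number in the file name, corresponds to midnight UTC on April 5th, 2012, the first day of the Reddit observation period.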

The second file, “root_map”, contains the map from a meaning ID to its actual word root. The word root is in the first column, the ID is in the second column. The two columns are separated by a tab. The word roots are stemmed and all stopwords have already been removed.
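
The lookup from meaning IDs back to word roots can be sketched as follows. Again this is only an illustration: `load_root_map` and `decode_meanings` are hypothetical names, the two map entries are made up, and the sketch assumes one root and one ID per line with IDs kept as strings.

```python
import io
from typing import Dict, Iterable, List


def load_root_map(lines: Iterable[str]) -> Dict[str, str]:
    """Build an ID -> word root dictionary from "root<TAB>id" lines."""
    id_to_root: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # The word root is in the first column, the ID in the second.
        root, meaning_id = line.split("\t")
        id_to_root[meaning_id] = root
    return id_to_root


def decode_meanings(meaning_ids: List[str], id_to_root: Dict[str, str]) -> List[str]:
    """Translate meaning IDs into word roots, marking IDs missing from the map."""
    return [id_to_root.get(m, "<unknown>") for m in meaning_ids]


# Made-up map content standing in for the real "root_map" file.
sample_map = io.StringIO("cat\t12\nlaunch\t57\n")
id_to_root = load_root_map(sample_map)
roots = decode_meanings(["12", "57", "999"], id_to_root)
print(roots)
```

With the real data, you would pass an open file handle for “root_map” instead of the `io.StringIO` stand-in.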

Have a nice third meme hunting!

[1] Tim Weninger, Xihao Avi Zhu, and Jiawei Han. An exploration of discussion threads in social news sites: a case study of the reddit community. In ASONAM, pages 579–583, 2013.