Datasets
Here is a non-exhaustive list of openly accessible datasets released along with some of our publications.
Publication | Link | Dataset size | Time period |
---|---|---|---|
Mean Birds: Detecting Aggression and Bullying on Twitter (ACM WebSci’17) [PDF] | Zenodo | 1.65M tweets | June - August, 2016 |
Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan’s Politically Incorrect Forum and Its Effects on the Web (ICWSM’17) [PDF] | Zenodo | 11.1M 4chan /pol/, /sp/, and /int/ posts | June 30, 2016 - September 12, 2016 |
The Web Centipede: Understanding How Web Communities Influence Each Other Through the Lens of Mainstream and Alternative News Sources (IMC’17) [PDF] | Zenodo | 487K tweets, 1.8M Reddit posts/comments, 97K 4chan /pol/, /sp/, /int/, and /sci/ posts | June 30, 2016 - February 28, 2017 (Twitter, Reddit, and 4chan) |
What is Gab? A Bastion of Free Speech or an Alt-Right Echo Chamber? (CyberSafety’18) [PDF] | Zenodo | 22.1M Gab posts from 336.7K users | August 2016 - January 2018 |
Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (ICWSM’18) [PDF] | Zenodo, GitHub | 100K labeled tweets | N/A |
On the Origins of Memes by Means of Fringe Web Communities (IMC’18) [PDF] | Zenodo | 158.5M URLs and pHashes for images from Twitter, Reddit, 4chan’s /pol/, and Gab | July 2016 - July 2017 |
Who Let The Trolls Out? Towards Understanding State-Sponsored Trolls (ACM WebSci’19) [PDF] | Zenodo | 10.1M tweets and 21K subreddit posts | February 2012 - August 2018 |
The Pushshift Telegram Dataset (ICWSM’20) [PDF] | Zenodo | 2.2M Telegram users, 28K Telegram channels, and 317M Telegram messages | September 2015 - November 2019 |
The Pushshift Reddit Dataset (ICWSM’20) [PDF] | Pushshift | Reddit subreddits, posts, and users | Full history |
Disturbed YouTube for Kids: Characterizing and Detecting Disturbing Content on YouTube (ICWSM’20) [PDF] | Zenodo | 844.7K YouTube videos’ metadata | N/A |
Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board (ICWSM’20) [PDF] | Zenodo | 134.5M 4chan /pol/ posts | June 29, 2016 - November 1, 2019 |
A Large Open Dataset from the Parler Social Network (ICWSM’21) [PDF] | Zenodo | 183M Parler posts | August 2018 - January 2021 |
The Evolution of the Manosphere Across the Web (ICWSM’21) [PDF] | Zenodo | 6.7M posts from Incel Forums and 22.1M posts from Incel subreddits | Incel Forums: June 19-30, 2019. Subreddits: June 2005 - December 2018 |
“How over is it?” Understanding the Incel Community on YouTube (CSCW’21) [PDF] | Zenodo | 6.4K Incel-derived videos, 5.8K random videos (control), 37.7K Incel-derived recommended videos, and 29.3K control recommended videos | N/A |
“It is just a flu”: Assessing the Effect of Watch History on YouTube’s Pseudoscientific Video Recommendations (ICWSM’22) [PDF] | Zenodo | 1.1K science, 1.3K pseudoscience, and 3.2K irrelevant videos | N/A |
“I Can’t Keep It Up.” A Dataset from the Defunct Voat.co News Aggregator (ICWSM’22) [PDF] | Zenodo | 2.3M submissions, 16.2M comments, 113.4K user profiles, and 7K subverse profiles from Voat.co | November 8, 2013 - December 25, 2020 |