More and Beyond, and Fun!

In need of some curious datasets for teaching or demonstration purposes? Here are some ideas for you!

Airline On-Time Performance Data
The Bureau of Transportation Statistics (BTS) collects monthly data that contain scheduled and actual departure and arrival times reported by certified U.S. air carriers, starting in 1987.

“Awesome Public Datasets”
Nice listing of publicly available datasets.

Data is Plural Data Archive
The Data Is Plural newsletter is a fun listing of interesting datasets. Could be useful if you just need some data to “play” with.

Data.World is a social network geared toward helping researchers find collections of data. On the inside, Data.World looks a lot like Facebook. Each user gets a profile with a picture and their name and the ability to upload data sets. There is also a “feed” component. Rather than people, Data.World allows users to follow specific data sets.

FiveThirtyEight shares the data and code behind their stories.

FlowingData Data Sources
“Have fun and play with some numbers.” This blog points to all kinds of interesting resources: Iowa liquor sales data, data on people who went to the ER for wall-punching, and campaign finance reform are just a few examples.

Ice and Fire/Game of Thrones
This API provides access to data from Ice and Fire. The JSON files, organized by book, house, and character, are also available for download. A related data set, Network of Thrones, is available as a .csv file; it contains 107 characters and the edges among them from “A Storm of Swords.”
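For teaching purposes, an edge-list file like this one can be turned into a simple graph structure with a few lines of Python. The sketch below is only an illustration: the column names Source, Target, and Weight are an assumption about the file layout, the function names are hypothetical, and the sample rows are made-up placeholders rather than values from the real data set.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows for illustration only (not real data).
# Assumption: the file has the columns Source, Target, Weight.
SAMPLE_CSV = """Source,Target,Weight
A,B,3
A,C,5
A,D,2
"""


def build_graph(csv_text):
    """Return an undirected adjacency map: name -> {neighbor: weight}."""
    graph = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        a, b, w = row["Source"], row["Target"], int(row["Weight"])
        # Store each edge in both directions so the graph is undirected.
        graph[a][b] = w
        graph[b][a] = w
    return dict(graph)


def degrees(graph):
    """Count how many distinct neighbors each node has."""
    return {name: len(neighbors) for name, neighbors in graph.items()}


if __name__ == "__main__":
    print(degrees(build_graph(SAMPLE_CSV)))
```

To try this on a downloaded file, read the file's text and pass it to build_graph in place of SAMPLE_CSV, adjusting the column names if the file uses different ones.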

LEGO Product Catalog
BrickLink, a website for buying and selling LEGOs, also maintains an inventory of LEGO sets released since 1949. Their catalog data are searchable and downloadable.

Billboard Pop Music Lyrics
This dataset, collected by Kaylin Walker, contains the lyrics from 50 years of Billboard Year-End Hot 100 (1965-2015), along with associated variables including rank, song, artist, year, and source.

Netflix Movie Rating Dataset
This dataset contains 100 million ratings of more than 17,000 movies from 480,000 anonymized users – it is currently available through the Internet Archive. This dataset was first made open to the public by the Netflix Prize, a challenge to beat the company’s own movie-recommendation algorithm.

NYPL: Photographers’ Identities Catalog (PIC)
PIC is “a collection of biographical data describing photographers, studios, manufacturers, and others involved in the production of photographic images.” Information such as names, nationalities, dates, and locations can be found through the searchable interface and its GitHub repository.

Sci-Hub data
This Sci-Hub data set contains 28 million download requests from the server logs of Sci-Hub in a six-month period (September 2015-February 2016). A Science magazine article reveals that Sci-Hub users come from around the world.

Spelling Bee
The Scripps National Spelling Bee posts the competition results of the spelling challenge on its website; the results are also made available in machine-readable data files by Christopher Long.

This Star Wars API provides access to data from the seven Star Wars films, including planets, spaceships, vehicles, people, films and species. The JSON files are also available at GitHub.

Tarantino Movie
FiveThirtyEight published an article based on a catalog of every time someone cursed or bled out in a Quentin Tarantino movie. The full data set, with movie title, type of the event, the word (when a profane word occurred), and the number of minutes, is available on its GitHub.