The Best 25 Datasets for Natural Language Processing
Article By Meiryum Ali | June 07, 2018
Freelance writer working at Gengo; AI enthusiast

Where’s the best place to look for free online datasets for NLP? We combed the web to create the ultimate cheat sheet, broken down into datasets for sentiment analysis, text, audio speech, and general NLP.

 

Datasets for Sentiment Analysis

Multidomain sentiment analysis dataset: A slightly older dataset that features product reviews from Amazon.

IMDB reviews: An older, relatively small dataset for binary sentiment classification. It features 50,000 movie reviews, split evenly into 25,000 for training and 25,000 for testing.

Stanford Sentiment Treebank: Standard sentiment dataset of movie review sentences, with fine-grained sentiment annotations on the phrases in each sentence's parse tree.

Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed.

Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.
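
Before training on any of the sentiment datasets above, it helps to check how the labels are distributed. The following is a minimal sketch, not an official loader. It assumes a CSV file with a header row and columns named `airline_sentiment` and `text`, which is the layout we assume for the Twitter US Airline Sentiment file; the sample rows are invented for illustration, and the function name `count_labels` is hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample in the ASSUMED layout: a header row with
# `airline_sentiment` and `text` columns. Not real dataset rows.
SAMPLE_CSV = """airline_sentiment,text
negative,"Flight delayed three hours, no updates."
positive,Great crew on my flight today!
neutral,What time does check-in open?
negative,Lost my bag again.
"""

def count_labels(csv_file, label_column="airline_sentiment"):
    """Count how many rows carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

# Swap io.StringIO(...) for open("your_file.csv", newline="") on real data.
label_counts = count_labels(io.StringIO(SAMPLE_CSV))
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

A heavily skewed count is a hint that plain accuracy may be a misleading metric for that dataset.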

 

Datasets for Text

20 Newsgroups: Collection of approximately 20,000 documents across 20 different newsgroups.

Reuters News dataset: Dataset features text from Reuters circa 1987.

Penn Treebank: Dataset features Wall Street Journal articles from 1989, used for next-word prediction.

UCI’s Spambase: A spam email dataset of about 4,600 messages from the UCI Machine Learning Repository, useful for spam filtering.

Yelp Reviews: An open dataset released by Yelp, contains more than 5 million reviews.

WordNet: A large database of English ‘synsets’, or groups of synonyms that each describe a different concept.
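
Next-word prediction, the task mentioned for Penn Treebank above, can be illustrated with a simple bigram counter. This is only a sketch of the idea, not how any particular benchmark model works: the toy corpus is invented, and the helper names `train_bigrams` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current_word, next_word in zip(tokens, tokens[1:]):
            following[current_word][next_word] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Toy corpus invented for illustration; not taken from any dataset.
toy_corpus = [
    "stock prices rose sharply",
    "stock prices fell on monday",
    "stock markets closed early",
]
model = train_bigrams(toy_corpus)
print(predict_next(model, "stock"))
```

On a real corpus you would replace the whitespace split with a proper tokenizer and handle unseen words with smoothing.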

 

Datasets for Audio Speech

2000 HUB5 English: English speech data derived from 40 telephone conversations.

LibriSpeech: An audiobook dataset containing roughly 1,000 hours of English speech read by multiple speakers, organized by chapters of the book.

TED-LIUM: A collection of 1495 TED talk audio recordings.

Free Spoken Digit Dataset: A collection of 1500 recordings of spoken digits in English.

TIMIT: A collection of recordings of 630 speakers of American English.

 

Datasets for Natural Language Processing (general)

Enron Dataset: Email data from the senior management of Enron, organized into folders.

Amazon Reviews: Contains around 35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.

Google Books Ngrams: A collection of word n-grams (short word sequences) and their yearly frequencies, drawn from Google Books.

Blogger Corpus: A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.

Wikipedia Links data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase or part of a paragraph itself.

Gutenberg eBooks List: Annotated list of ebooks from Project Gutenberg.

Hansards text chunks of Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.

Jeopardy: Archive of more than 200,000 questions from the quiz show Jeopardy.

SMS Spam Collection in English: A dataset that consists of 5,574 English SMS messages, each labeled as spam or legitimate.
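
Many of the labeled text datasets above ship as simple delimited files. The following sketch reads one message per line, assuming each line holds a label, a tab, and the message text, with `ham` and `spam` as the label values. That layout is an assumption about the SMS Spam Collection file, the sample lines are invented, and `read_labeled_messages` is a hypothetical helper.

```python
import io
from collections import Counter

# Invented lines in the ASSUMED layout: label, a tab, then the message.
SAMPLE_LINES = (
    "ham\tAre we still meeting at 6?\n"
    "spam\tYou have won a prize! Reply now to claim.\n"
    "ham\tRunning late, see you soon.\n"
)

def read_labeled_messages(lines):
    """Yield (label, message) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so tabs inside a message survive.
        label, message = line.split("\t", 1)
        yield label, message

# Swap io.StringIO(...) for open("your_file.txt", encoding="utf-8") on real data.
messages = list(read_labeled_messages(io.StringIO(SAMPLE_LINES)))
counts = Counter(label for label, _ in messages)
print(counts["ham"], counts["spam"])
```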

 

Still can’t find what you need? At Gengo, we provide custom datasets for language projects. With nearly a decade of experience in the translation space, Gengo specializes in natural language tasks, including semantic annotation and sentiment analysis. Plus, our team includes more than 21,000 qualified native speakers of English and 36 other languages.

 

Sources:

https://deeplearning4j.org/opendata
https://github.com/niderhoff/nlp-datasets
https://github.com/MattTriano/Public_Dataset_Sources#naturallanguage
https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
