The Best 25 Datasets for Natural Language Processing

Articles by Meiryum Ali | June 07, 2018

Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data.

With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.


Datasets for Sentiment Analysis

Multidomain sentiment analysis dataset: A slightly older dataset that features product reviews from Amazon.

IMDB reviews: An older, relatively small dataset for binary sentiment classification. It features 50,000 movie reviews, split evenly into 25,000 for training and 25,000 for testing.

Stanford Sentiment Treebank: Standard sentiment dataset with fine-grained sentiment annotations on the parse tree of each sentence.

Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed.

Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, classified as positive, negative, and neutral tweets.


Datasets for Text

20 Newsgroups: Collection of approximately 20,000 documents across 20 different newsgroups.

Reuters News dataset: Dataset features text from Reuters circa 1987.

Penn Treebank: Dataset features Wall Street Journal articles from 1989, used for next-word prediction.

UCI’s Spambase: A large spam email dataset, useful for spam filtering.

Yelp Reviews: An open dataset released by Yelp that contains more than 5 million reviews.

WordNet: A large database of English ‘synsets’, or groups of synonyms that each describe a different concept.


Datasets for Audio Speech

2000 HUB5 English: English speech data derived from 40 telephone conversations.

LibriSpeech: An audiobook dataset containing approximately 1,000 hours of audiobooks read by multiple speakers, organized by book chapter.

TED-LIUM: A collection of 1,495 TED talk audio recordings.

Free Spoken Digit Dataset: A collection of 1,500 recordings of spoken digits in English.

TIMIT: A collection of recordings of 630 speakers of American English.


Datasets for Natural Language Processing (General)

Enron Dataset: Email data from the senior management of Enron, organized into folders.

Amazon Reviews: Contains around 35 million reviews from Amazon spanning 18 years. Data include product and user information, ratings, and the plaintext review.

Google Books Ngrams: A collection of word n-grams and their frequencies drawn from Google Books.

Blogger Corpus: A collection of 681,288 blog posts. Each blog contains a minimum of 200 occurrences of commonly used English words.

Wikipedia Links data: The full text of Wikipedia. The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase, or part of a paragraph.

Gutenberg eBooks List: Annotated list of ebooks from Project Gutenberg.

Hansards text chunks of Canadian Parliament: 1.3 million pairs of texts from the records of the 36th Canadian Parliament.

Jeopardy: Archive of more than 200,000 questions from the quiz show Jeopardy.

SMS Spam Collection in English: A dataset that consists of 5,574 English SMS messages, each labeled as spam or legitimate (‘ham’).
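Many of the text datasets listed above are distributed as simple labeled text files, so a first experiment often starts with reading labels and messages into memory. The snippet below is a minimal sketch of that step in Python. It assumes a layout of one message per line, with a label, a tab, and then the message text, which is how the SMS Spam Collection is commonly distributed; check the documentation of whichever dataset you download. The sample rows and the helper name read_labeled_messages are hypothetical and only stand in for a real file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the assumed "label<TAB>message" layout.
# These are made-up messages, not rows from any real dataset.
SAMPLE = (
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tYou have won a free prize, reply now to claim\n"
    "ham\tRunning late, see you in ten minutes\n"
)


def read_labeled_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines that lack a label or a message.
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# io.StringIO stands in for an opened file here.
messages = read_labeled_messages(io.StringIO(SAMPLE))

# Count how many messages carry each label to check class balance.
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

To try this on a downloaded file, replace the io.StringIO object with open(path, encoding="utf-8", newline=""), where path points to your local copy. Checking the label counts first is a quick way to spot class imbalance before training a classifier.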


Still can’t find what you need? Gengo creates and annotates customized datasets for a wide variety of NLP projects, including everything from chatbot variations to entity annotation. With a decade of experience in managing a crowd of more than 21,000 linguistic specialists, Gengo is perfectly placed to provide your model with a solid foundation. Contact us to find out how custom data can take your machine-learning project to the next level.

The Author
Meiryum Ali

Freelance writer working at Gengo; AI enthusiast