Artificial intelligence (AI) represents a huge growth opportunity for online retailers. Via machine learning technology, ecommerce companies can potentially boost sales, reduce waste, and increase overall efficiency while actively engaging with consumers.
Ecommerce companies also have a lot of data at their fingertips. The problem for machine learning developers is getting access to it: ecommerce data typically contains proprietary information and is consequently hard to find in publicly available databases.
Lucky for you, we at Gengo have scoured the internet to gather a list of publicly available ecommerce datasets for machine learning projects. Enjoy!
Fashion-MNIST: Dataset consisting of 60,000 training images and 10,000 test images of fashion products across 10 classes.
Innerwear Data from Victoria’s Secret and Others: Data from 600,000+ innerwear products extracted from popular retail sites. Includes product description, price, category, rating and more.
Electronic Products and Pricing Data: A list of over 7,000 electronic products with 10 fields of pricing information.
Men’s Shoe Prices: A list of 10,000 men’s shoes and the various prices at which they are sold.
Women’s Shoe Prices: A list of 10,000 women’s shoes and the various prices at which they are sold.
eCommerce Item Data: 500 SKUs and their descriptions from an outdoor apparel brand’s product catalog.
Fashion products on Amazon.com: Dataset of 22,000 fashion products on Amazon.
E-commerce Tagging for clothing: Images from e-commerce sites with bounding boxes drawn around shirts, jackets, sunglasses, etc. The dataset has 907 items, of which 504 have been manually labeled.
Online Retail Data Set (UCI Machine Learning Repository): A transnational dataset that contains all the transactions from 01/12/2010 to 09/12/2011, a period of roughly one year, for a UK-based online retail company.
Online Auctions Dataset: Dataset that contains eBay auction data on Cartier wristwatches, Xbox game consoles, Palm Pilot M515 PDAs, and Swarovski beads.
Retailrocket recommender system dataset: Collected from a real-world ecommerce website, this dataset contains information on visitor behavior including events like clicks, add to carts, and transactions.
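As a rough illustration of what a visitor-behavior event log like the Retailrocket one can be used for, the sketch below counts how many distinct visitors reached each event type and derives a view-to-cart rate. The sample rows are made up, and the column names (timestamp, visitorid, event, itemid) and event labels (view, addtocart, transaction) are assumptions; check the header of the actual file before adapting it.

```python
import csv
import io

# Made-up sample rows for illustration only. Column names and event labels
# are assumptions, not confirmed against the real dataset.
SAMPLE = """timestamp,visitorid,event,itemid
1,101,view,9001
2,101,addtocart,9001
3,101,transaction,9001
4,102,view,9001
5,103,view,9002
6,103,addtocart,9002
"""


def funnel_counts(csv_file):
    """Return the number of distinct visitors seen for each event type."""
    visitors_by_event = {}
    for row in csv.DictReader(csv_file):
        visitors_by_event.setdefault(row["event"], set()).add(row["visitorid"])
    return {event: len(visitors) for event, visitors in visitors_by_event.items()}


counts = funnel_counts(io.StringIO(SAMPLE))
# Share of viewing visitors who also added something to their cart.
view_to_cart = counts["addtocart"] / counts["view"]
print(counts, round(view_to_cart, 2))
```

To run this on a real export, pass an opened file object to funnel_counts in place of the in-memory sample.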
Search Relevance Datasets
eCommerce search relevance: This set contains image URLs, rank on page, a description for each product, the search query that led to each result, and more from five major English-language ecommerce sites.
Best Buy Search Queries NER Dataset: A dataset containing manually labeled search queries from bestbuy.com. Phrases in each query are labeled with entities such as Brand, Model Name, Category Name, and more.
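Entity-labeled queries like these are often converted to per-token tags before training a sequence model. The sketch below shows one common convention, BIO tagging, on a made-up query. The span format (start, end, label) and the label names are assumptions for illustration; the dataset's actual annotation format is not described here.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Made-up query and labels, not taken from the dataset.
tokens = "apple macbook pro laptop".split()
spans = [(0, 1, "Brand"), (1, 3, "Model"), (3, 4, "Category")]
tags = to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Tokens outside any span keep the "O" tag, so unlabeled words in a query need no special handling.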
Customer Review Datasets
Women’s E-Commerce Clothing Reviews: 23,000 Customer Reviews and Ratings. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.
Amazon Commerce reviews set: A dataset used for authorship identification in online Writeprint, a new research field of pattern recognition.
Multidomain sentiment analysis dataset: A slightly older dataset containing product reviews organized by product type and rating.
Amazon and Best Buy Electronics: A list of over 7,000 online reviews for 50 electronic products.
Grammar and Online Product Reviews: A list of 71,045 online reviews for 1,000 different products.
Ecommerce Industry Datasets
Annual Retail Trade Survey (ARTS): National estimates of total annual sales, e-commerce sales, end-of-year inventories, inventory-to-sales ratios, purchases, total operating expenses, and inventories held outside the United States.
Economic Census: Provides a detailed portrait of business activities in industries and communities once every five years, from the national to the local level.
E-Stats: Surveys that use different measures of economic activity, such as shipments for manufacturing, sales for wholesale and retail trade, and revenues for service industries.
EU External Trade Datasets: The value of imports, exports and trade surplus, volume indices, unadjusted and seasonally adjusted; price and terms of trade indices; imports and exports classified by commodity, and by country of origin or destination.
ECommerce Sales by Merchandise Category 1999-2015: Census data showing total ecommerce sales by merchandise line and compound annual growth rate from 1999-2015.