Datasets for Machine Learning and Deep Learning
-- Some of the Best Places to Explore
Last month, I shared a short list of dataset repositories that I planned to recommend to students as inspiration for their class projects.
Thanks to all the great suggestions via the Twitter thread above, this list has grown quite a bit! Now, with the semester being in full swing, I recently shared this set of dataset repositories with my deep learning class. However, beyond using this list to find inspiration for interesting student class projects, these are also good places to look for additional benchmark datasets for your model, so I am putting it out here, hoping you find it useful!
It is hard to sort by priority or to pick favorites, so the following list is sorted alphabetically.
Academic Torrents – A distributed system for sharing enormous datasets
- Currently 65 TB worth of datasets available through this site
- The sharing is facilitated through BitTorrent technology
Link: https://academictorrents.com
Awesome Public Datasets – A large GitHub README list organized by application domain
Link: https://github.com/awesomedata/awesome-public-datasets
- Links to approximately 650 datasets
- Organized by application domain (e.g., agriculture, biology, etc.)
CVonline: Image Databases – Bob Fisher’s Compilation of Computer Vision Datasets
Link: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
- An impressive collection of more than 1000 datasets for computer vision sorted by category (agriculture, general images, etc.)
Datasetlist.com – Datasets by domain
Link: https://www.datasetlist.com
- Lets you sort datasets by category (image, NLP, audio, etc.)
- Sortable by license and year
- Links to original papers (if applicable) are provided
Data is Plural – A dataset newsletter
Link: https://tinyletter.com/data-is-plural
- A weekly newsletter that compiles a list of interesting new datasets
Google Dataset Search – A search engine for datasets
Link: https://datasetsearch.research.google.com
- Lets you search datasets by name or description
- Returns summary results and links to the various sources where a dataset can be obtained
Huggingface Datasets – A Python library for loading NLP datasets
Link: https://github.com/huggingface/datasets
- A tool that makes NLP datasets directly available in Python
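
As a rough illustration of what "directly available in Python" can look like, the sketch below wraps the library's `load_dataset` function. The dataset name `"imdb"`, the `"label"` key, and the helper names `load_text_split` and `label_counts` are assumptions chosen for illustration and are not part of the list above. Running the loader needs `pip install datasets` and a network connection.

```python
from collections import Counter


def load_text_split(name="imdb", split="train"):
    """Download (or load from the local cache) one split of a dataset.

    The default dataset name is only an example; substitute any dataset id.
    """
    # Imported lazily so label_counts can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(name, split=split)


def label_counts(examples, label_key="label"):
    """Count how often each label occurs in an iterable of example dicts."""
    return Counter(example[label_key] for example in examples)
```

With the library installed, `label_counts(load_text_split())` would tally the labels of the training split, since iterating over a loaded split yields one dict per example.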
IBM’s Data Asset Exchange – A collection of datasets relevant for enterprise applications
Link: https://developer.ibm.com/exchanges/data/
- Several large datasets centered around enterprise applications with friendly community sharing licenses
- Individual datasets contain links to Jupyter notebooks showing data loading and processing examples
- Mostly tabular data and more suited for traditional machine learning
Jupyter Tutorial Data – List of dataset repositories
Link: https://jupyter-tutorial.readthedocs.io/en/latest/data/index.html
- A list linking the most common dataset repositories and search engines
Kaggle Datasets
Link: https://www.kaggle.com/datasets
- A search engine for datasets available through Kaggle
- Datasets can be discovered by search terms, category tags, and file types
OpenML – A search engine for curated datasets and workflows
Link: https://www.openml.org/search?type=data
- 3265 datasets annotated with the number of instances, features, and classes
- Workflows (e.g., scikit-learn pipelines) are available through the community
- Most datasets are tabular datasets for traditional machine learning
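
One possible way to pull an OpenML dataset into a scikit-learn workflow is scikit-learn's own `fetch_openml` function, sketched below. The dataset name `"iris"`, the version number, and the helper names `fetch_openml_table` and `describe` are assumptions for illustration. The `describe` helper simply recomputes the kind of annotations mentioned above (instances, features, classes) for whatever table it is given.

```python
def fetch_openml_table(name="iris", version=1):
    """Fetch a dataset from OpenML by name (example name; needs network access)."""
    # Imported lazily so describe() works without scikit-learn installed.
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(name=name, version=version, as_frame=True)
    return bunch.data, bunch.target


def describe(features, targets):
    """Return the number of instances, features, and classes of a table."""
    n_instances = len(features)
    if hasattr(features, "shape"):
        # pandas DataFrames and NumPy arrays expose (rows, columns).
        n_features = features.shape[1]
    else:
        # Fall back to a plain list of rows.
        n_features = len(features[0]) if n_instances else 0
    return {
        "instances": n_instances,
        "features": n_features,
        "classes": len(set(targets)),
    }
```

With scikit-learn installed, `describe(*fetch_openml_table())` would summarize the downloaded table in the same terms; the helper also accepts a plain list of rows.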
Papers with Code – Datasets with benchmarks
Link: https://www.paperswithcode.com/datasets
- 3,095 machine learning datasets with links to the original paper, if applicable
- Lists the number of papers that used each dataset
- Compiles benchmark information and links to the benchmark sources
Penn Machine Learning Benchmarks – Clean, tabular datasets
Link: https://github.com/EpistasisLab/pmlb/tree/master/datasets
- A collection of preprocessed datasets in tabular form
- More appropriate for traditional machine learning rather than deep learning
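
The repository also ships a small Python package, `pmlb`, whose `fetch_data` function downloads a dataset by name. The sketch below is a minimal example under stated assumptions: the dataset name `"mushroom"` and the helper names `load_pmlb` and `majority_baseline_accuracy` are chosen for illustration, and the loader needs `pip install pmlb` plus a network connection. The baseline is just a sanity check to compare a model against, not part of the PMLB package.

```python
from collections import Counter


def load_pmlb(name="mushroom"):
    """Fetch one PMLB dataset as feature and label arrays (example name)."""
    # Imported lazily so the baseline helper works without pmlb installed.
    from pmlb import fetch_data

    features, labels = fetch_data(name, return_X_y=True)
    return features, labels


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)
```

With `pmlb` installed, `majority_baseline_accuracy(load_pmlb()[1])` gives the accuracy a trivial classifier would reach on that dataset, which is a useful floor when benchmarking.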
Public APIs – A list of public dataset APIs
Link: https://github.com/public-apis/public-apis
- A list of approximately 650 dataset APIs
r/datasets – A subreddit for sharing and discussing datasets
Link: https://www.reddit.com/r/datasets/
- A site where you can discover datasets and/or engage in discussion around a dataset
Roboflow Public Datasets – Datasets for computer vision
Link: https://public.roboflow.com
- A list of publicly available computer vision datasets
- Categories include classification and object detection
UCI Machine Learning Repository – The classic go-to for machine learning projects
Link: https://archive.ics.uci.edu/ml/index.php
- The classic repository for machine learning datasets that can be searched by task (classification, regression, etc.), application area, data type, and size
- Most datasets in this database are more suitable for traditional machine learning rather than deep learning
VisualDataDiscovery
Link: https://www.visualdata.io/discovery
- A collection of more than 500 computer vision datasets
- Can be filtered by license and code/model availability