Last month, I shared a short list of dataset repositories that I planned to recommend to students as inspiration for their class projects.

Thanks to all the great suggestions via the Twitter thread above, this list has grown quite a bit! Now, with the semester being in full swing, I recently shared this set of dataset repositories with my deep learning class. However, beyond using this list to find inspiration for interesting student class projects, these are also good places to look for additional benchmark datasets for your model, so I am putting it out here, hoping you find it useful!

It is hard to sort by priority or to pick favorites, so the following list is sorted alphabetically.



Academic Torrents – A distributed system for sharing enormous datasets

  • Currently, 65 TB worth of datasets are available through this site
  • Sharing is facilitated through BitTorrent technology

Link: https://academictorrents.com

Awesome Public Datasets – A large GitHub README list organized by application domain

Link: https://github.com/awesomedata/awesome-public-datasets

  • Links to approximately 650 datasets
  • Organized by application domain (e.g., agriculture, biology, etc.)



CVonline: Image Databases – Bob Fisher’s Compilation of Computer Vision Datasets

Link: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm

  • An impressive collection of more than 1000 datasets for computer vision sorted by category (agriculture, general images, etc.)



Datasetlist.com – Datasets by domain

Link: https://www.datasetlist.com

  • Lets you sort datasets by category (image, NLP, audio, etc.)
  • Sortable by license and year
  • Links to original papers (if applicable) are provided



Data is Plural – A dataset newsletter

Link: https://tinyletter.com/data-is-plural

  • A weekly newsletter that compiles a list of interesting new datasets



Google Dataset Search – A search engine for datasets

Link: https://datasetsearch.research.google.com

  • Lets you search datasets by name or description
  • Returns summary results and links to the various sources where a dataset can be obtained



Huggingface Datasets – A Python library for loading NLP datasets

Link: https://github.com/huggingface/datasets

  • A tool that makes NLP datasets directly available in Python
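
To give a rough idea of what "directly available in Python" can look like, here is a minimal sketch built around the library's `load_dataset` function. The dataset name `"imdb"`, the `"train"` split, and the helper `count_labels` are assumptions picked purely for illustration; calling `load_imdb_train` requires the `datasets` package and network access.

```python
from collections import Counter


def count_labels(examples, label_key="label"):
    """Count how often each label occurs in an iterable of dict-like rows."""
    return Counter(row[label_key] for row in examples)


def load_imdb_train():
    """Download (or load from the local cache) the IMDB training split.

    Assumption: "imdb" is used here only as an example dataset name.
    """
    # Imported lazily because `datasets` is a third-party package.
    from datasets import load_dataset

    return load_dataset("imdb", split="train")
```

With the package installed, `count_labels(load_imdb_train())` should give a quick look at the class balance before any model is trained.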



IBM’s Data Asset Exchange – A collection of datasets relevant for enterprise applications

Link: https://developer.ibm.com/exchanges/data/

  • Several large datasets centered around enterprise applications with friendly community sharing licenses
  • Individual datasets contain links to Jupyter notebooks showing data loading and processing examples
  • Mostly tabular data and more suited for traditional machine learning



Jupyter Tutorial Data – List of dataset repositories

Link: https://jupyter-tutorial.readthedocs.io/en/latest/data/index.html

  • A list linking the most common dataset repositories and search engines



Kaggle Datasets

Link: https://www.kaggle.com/datasets

  • A search engine for datasets available through Kaggle
  • Datasets can be discovered by search terms, category tags, and file types



OpenML – A search engine for curated datasets and workflows

Link: https://www.openml.org/search?type=data

  • 3265 datasets annotated with the number of instances, features, and classes
  • Workflows (e.g., scikit-learn pipelines) are available through the community
  • Most datasets are tabular datasets for traditional machine learning
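
OpenML datasets can also be pulled into Python without visiting the website. The following is a minimal sketch that assumes scikit-learn is installed and uses its `fetch_openml` function; the dataset name `"iris"` with `version=1` and the helper `describe` are illustrative assumptions. The helper recomputes the same three numbers the site annotates (instances, features, classes).

```python
def describe(X, y):
    """Return the number of instances, features, and classes.

    Works for nested lists as well as dense 2D NumPy arrays.
    """
    return {
        "instances": len(X),
        "features": len(X[0]) if len(X) else 0,
        "classes": len(set(y)),
    }


def fetch_iris_from_openml():
    """Fetch the iris dataset from OpenML as (X, y) NumPy arrays.

    Assumption: "iris", version 1, is only an example choice.
    """
    # Imported lazily because scikit-learn is a third-party package.
    from sklearn.datasets import fetch_openml

    return fetch_openml(name="iris", version=1, as_frame=False, return_X_y=True)
```

Usage would be along the lines of `X, y = fetch_iris_from_openml()` followed by `describe(X, y)`; the first call needs network access, after which scikit-learn caches the download locally.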



Papers with Code – Datasets with benchmarks

Link: https://www.paperswithcode.com/datasets

  • 3,095 machine learning datasets with links to the original papers, if applicable
  • Lists the number of papers that used each dataset
  • Compiles benchmark information and links to the benchmark sources



Penn Machine Learning Benchmarks – Clean, tabular datasets

Link: https://github.com/EpistasisLab/pmlb/tree/master/datasets

  • A collection of preprocessed datasets in tabular form
  • More appropriate for traditional machine learning rather than deep learning
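
These datasets can be loaded through the `pmlb` Python package rather than downloaded by hand. The sketch below assumes the package is installed and uses its `fetch_data` function; the dataset name `"mushroom"` and the helper `majority_baseline_accuracy` are illustrative assumptions. A majority-class baseline is a cheap sanity check to run before benchmarking a real model on a classification dataset.

```python
from collections import Counter


def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    counts = Counter(y)
    return counts.most_common(1)[0][1] / len(y)


def fetch_pmlb_dataset(name="mushroom"):
    """Fetch a PMLB dataset as (X, y) NumPy arrays.

    Assumption: "mushroom" is only an example dataset name.
    """
    # Imported lazily because `pmlb` is a third-party package.
    from pmlb import fetch_data

    return fetch_data(name, return_X_y=True)
```

For example, `X, y = fetch_pmlb_dataset()` followed by `majority_baseline_accuracy(y)` gives the accuracy any trained classifier should at least beat.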



Public APIs – A list of public dataset APIs

Link: https://github.com/public-apis/public-apis

  • A list of approximately 650 dataset APIs



r/datasets – A subreddit for sharing and discussing datasets

Link: https://www.reddit.com/r/datasets/

  • A site where you can discover datasets and/or engage in discussion around a dataset



Roboflow Public Datasets – Datasets for computer vision

Link: https://public.roboflow.com

  • A list of publicly available computer vision datasets
  • Categories include classification and object detection



UCI Machine Learning Repository – The classic go-to for machine learning projects

Link: https://archive.ics.uci.edu/ml/index.php

  • The classic repository for machine learning datasets that can be searched by task (classification, regression, etc.), application area, data type, and size
  • Most datasets in this database are more suitable for traditional machine learning than for deep learning



VisualDataDiscovery

Link: https://www.visualdata.io/discovery

  • A collection of more than 500 computer vision datasets
  • Can be filtered by license and code/model availability