If one sentence summarizes the essence of learning data science, it is this:
The best way to learn data science is to apply data science.
If you are a beginner, you improve tremendously with each new project you undertake. If you are an experienced data science professional, you already know what I am talking about.
However, when I give this advice to people, they usually ask something in return: where can I get datasets for practice? They don't realize how many datasets are openly available, or how much they can learn from working on such projects and how that can boost their careers.
If you think that the situation above applies to you, don't worry! You are in the right place. This article provides a list of websites and resources whose data you can use for your own (pet) projects or even to create your own products.
There is no end to how you can use these data sources. Their application and usage are limited only by your creativity.
The simplest way to use them is to create data stories and publish them on the web. This would improve not only your data and visualization skills but also your structured thinking.
On the other hand, if you are thinking about or working on a data-based product, these datasets could add power to it by providing additional or new input data.
So, go ahead, work on these projects and share them with the larger world to showcase your data prowess!
I have divided these sources into sections to help you categorize them by application. We start with simple, generic and easy-to-handle datasets and then move to huge, industry-relevant datasets. We then provide links to datasets for specific purposes: text mining, image classification, recommendation engines, etc. This should give you a holistic list of data resources.
If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.
data.gov.in is the home of the Indian Government's open data. You can find data by industry, climate, health care, etc. The portal also has a few visualizations you can check out for inspiration. Depending on your country of residence, you can also look for similar open data portals run by other governments.
Amazon provides a few big datasets, which can be used on their platform or on your local computer. You can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Song Dataset and many more.
A few months back, Google Research released a labeled YouTube dataset, which consists of 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.
The UCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go if you are looking for datasets for machine learning. The collection ranges from popular datasets like Iris and Titanic survival to recent contributions like Air Quality and GPS trajectories. The repository contains more than 350 datasets, labeled by attributes such as domain and type of problem (classification / regression). You can use these labels as filters to identify good datasets for your needs.
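As a rough illustration of a first step with a dataset like Iris, here is a minimal sketch that computes the mean petal length per species using only the Python standard library. It assumes a header-less CSV with four measurements followed by the species name; the inline sample rows and the function name `mean_petal_length_by_species` are illustrative stand-ins, and in practice you would open the file you downloaded instead.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# A handful of rows in the assumed Iris column layout:
# sepal length, sepal width, petal length, petal width, species.
# This inline sample stands in for a downloaded CSV file.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""


def mean_petal_length_by_species(handle):
    """Return {species: mean petal length} for a header-less Iris-style CSV."""
    lengths = defaultdict(list)
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        # Column 2 is petal length, column 4 is the species label.
        lengths[row[4]].append(float(row[2]))
    return {species: mean(values) for species, values in lengths.items()}


summary = mean_petal_length_by_species(io.StringIO(SAMPLE))
for species, value in summary.items():
    print(f"{species}: {value:.2f}")
```

To run the same summary on a real file, pass an open file handle, for example `open("iris.data", newline="")`, where the file name is whatever you saved the download as. A per-group summary like this is often enough to start a small data story.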
Kaggle has come up with a platform where people can donate datasets and other community members can vote and run kernels / scripts on them. They have more than 350 datasets in total, with more than 200 as Featured datasets. While some of the initial datasets were already available elsewhere, I have seen a few interesting datasets on the platform that are not present at other places. Along with new datasets, another benefit is that you can see scripts and questions from community members on the same interface.
Quandl provides financial, economic and alternative data from various sources through their website, their API or direct integration with a few tools. Their datasets are classified as Open or Premium. You can access all the open datasets for free, but you need to pay for the premium datasets. If you search, you can still find good free datasets on the platform; for example, stock exchange data from India is available for free.
Driven Data finds real-world challenges where data science can be used to create a positive social impact. They then run online modeling competitions for data scientists to develop the best models to solve them. If you are interested in the use of data science for social good, this is the place to be.
I hope this list of resources proves extremely useful for people looking to do pet projects or side projects. For beginners, it is definitely a gold mine. Make sure you pick a few side projects and continue to work on them. If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.
Looking forward to hearing from you.
Great post Kunal.
Thanks Krishna
Hi Kunal, thanks for the article and all the sources :) You may want to check OpenDataSoft -> http://data.opendatasoft.com or https://opendatainception.io/ as other data sources. Nicolas
Thanks Terpolilli. Will check it out
Thanks a lot Kunal! That is helpful for us learners!
Glad that you liked it Doumbia