
You’ve read on these pages (and I’m guilty of writing some of these articles) that data science projects are crucial to developing a full suite of technical data science skills. True, they are. But having high-quality datasets for those projects is just as crucial. Collecting high-quality data is only one stage of a data science project, but it is the one that can decide everything.
The question is, where do you find this damn data? Fortunately, numerous websites offer a wealth of data for various purposes.


Kaggle is probably the most well-known platform in the data science community. It hosts a wide range of datasets in various formats (CSV, JSON, SQLite, BigQuery) and from many industries and topics such as health, automotive, arts and entertainment, biology, social sciences, investing, social networking, sports, etc. You can also search for datasets based on their technical focus, e.g. computer science, classification, computer vision, NLP, or data visualization.
There are currently 274,855 datasets available, so you won’t have any trouble finding data to work with.
The user-friendly interface and active community forums make Kaggle a great resource for beginners and professionals alike.
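Whichever of these formats you download, getting the data into Python takes only a few lines. Here is a minimal sketch using only the standard library. The function name `load_records` is made up for illustration, and it assumes the CSV file has a header row, the JSON file holds a list of objects, and you already know the name of the SQLite table you want.

```python
import csv
import json
import sqlite3
from pathlib import Path


def load_records(path, table=None):
    """Load a downloaded dataset file into Python objects, based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".csv":
        # Every value comes back as a string; convert types yourself afterwards.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    if suffix in {".sqlite", ".db"}:
        if table is None:
            raise ValueError("SQLite files need a table name")
        conn = sqlite3.connect(path)
        try:
            conn.row_factory = sqlite3.Row
            # Table names cannot be bound as parameters, so quote the identifier.
            quoted = table.replace('"', '""')
            rows = conn.execute(f'SELECT * FROM "{quoted}"').fetchall()
            return [dict(row) for row in rows]
        finally:
            conn.close()

    raise ValueError(f"Unsupported format: {suffix}")
```

For anything beyond a quick look, a data frame library will be more convenient, but this is enough to check what a freshly downloaded file contains.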
If you are a machine learning enthusiast, the UCI Machine Learning Repository should be your first stop. As the name suggests, this repository was created by the University of California, Irvine (UCI), which has assembled a huge collection of datasets tailored for machine learning. The datasets cover a wide range of topics and are especially useful for those who want to practice and improve their machine learning skills.
There are currently 653 datasets available. They can be browsed by data type, subject area, task, number of features and instances, and feature type.
StrataScratch provides 49 datasets and projects from real companies. This is especially beneficial for those preparing for data science interviews, as it helps users develop technical skills and the ability to draw business insights from data. This allows for a practical and industry-relevant approach to data science projects.
The projects cover various topics such as data mining, data engineering, business analysis, regression, classification, natural language processing, and clustering.
Google Dataset Search is a tool designed to search for datasets on the web. You already know how to use it, even if you have never heard of it before. Why? Well, it looks and works like regular Google search, only it is focused solely on finding datasets. It is extremely useful if you are looking for data from various sources, including academic papers and government databases.
The AWS Public Datasets program from Amazon is another place where you can find a lot of open data. With 494 datasets currently available, it is a valuable resource for data scientists. The datasets you find there can be used directly with AWS cloud services, which helps if your projects require more computing resources.
The scope of available data includes genomics, meteorology, and astronomy.
Data.gov is a data repository sponsored by the US government. It contains 283,935 datasets from 132 US organizations. The data covers a wide range of areas, such as agriculture, public health, finance, education, demographics, economics, and the environment.
Datasets are available in nearly 50 different formats, the most popular of which are HTML, XML, ZIP, CSV, PDF, ArcGIS GeoServices REST API, KML, GeoJSON, JSON, and TEXT.
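With that many formats, it helps to check what a search result actually offers before you download anything. The sketch below assumes the Data.gov catalog exposes the standard CKAN `package_search` endpoint at `catalog.data.gov` and returns the usual CKAN response shape; treat both as assumptions to verify. The helper names and the sample payload are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed endpoint: the standard CKAN search action on the Data.gov catalog.
CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=10):
    """Build a search URL for a free-text query."""
    return f"{CKAN_SEARCH}?{urlencode({'q': query, 'rows': rows})}"


def count_formats(response):
    """Tally resource formats across the datasets in a decoded search response."""
    counts = Counter()
    for dataset in response["result"]["results"]:
        for resource in dataset.get("resources", []):
            fmt = (resource.get("format") or "unknown").strip().upper()
            counts[fmt] += 1
    return counts


# Made-up payload in the assumed CKAN shape, so the example runs offline.
sample = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Example A", "resources": [{"format": "CSV"}, {"format": "json"}]},
            {"title": "Example B", "resources": [{"format": "csv"}, {"format": ""}]},
        ],
    },
}

print(build_search_url("air quality"))
print(count_formats(sample).most_common())
```

In a real session you would fetch the URL, decode the JSON body, and pass the result to `count_formats` to see which datasets come in a format you can work with.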
FiveThirtyEight by ABC News is a repository of data and code for their articles and graphics. It is an ideal resource for data journalists and anyone interested in telling statistical stories. If you are interested in doing projects covering current events, politics, sports and more, this is your source.
It offers more than 160 datasets, from 2014 to the present.
World Bank Open Data offers extensive datasets on global development. The data includes indicators on the economy, environment, and social issues in countries around the world. If you are interested in global development and socio-economic topics, you can find a lot of interesting data here.
GitHub is not just a platform for sharing code. It can also be used to find datasets for data projects. Many organizations and individuals host their datasets in GitHub repositories. This data covers a wide range of topics, often supported by extensive documentation and analysis code.
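Files in a public repository can usually be read straight from their raw URL, without cloning anything. Here is a rough sketch of that idea for CSV files. The function names are hypothetical, the default branch name `main` is an assumption (older repositories often use `master`), and `fetch_csv` needs network access.

```python
import csv
import io
from urllib.parse import quote
from urllib.request import urlopen


def raw_github_url(owner, repo, path, branch="main"):
    """Build the raw-content URL for a file in a public GitHub repository."""
    return (
        f"https://raw.githubusercontent.com/{owner}/{repo}/"
        f"{quote(branch)}/{quote(path)}"
    )


def parse_csv_text(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(owner, repo, path, branch="main"):
    """Download and parse a CSV file from a public repository (needs network)."""
    with urlopen(raw_github_url(owner, repo, path, branch)) as resp:
        return parse_csv_text(resp.read().decode("utf-8"))
```

A call would look like `fetch_csv("some-org", "some-repo", "data/file.csv")`, with the placeholders replaced by a real repository and file path.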
OpenML is a web-based machine learning platform, which also means it provides access to a lot of data: almost 5,400 datasets, to be precise. It is designed for sharing, organizing, and discussing data and machine learning experiment results. OpenML can be integrated with popular machine learning frameworks, which is a bonus for your data science projects.
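One example of that integration is scikit-learn, whose `sklearn.datasets.fetch_openml` function downloads an OpenML dataset by name. The wrapper below is only a sketch: `load_openml_frame` and its `fetch` argument are hypothetical names, and it assumes scikit-learn and pandas are installed and that you have network access when you call it without a stub.

```python
def load_openml_frame(name, version=1, fetch=None):
    """Return an OpenML dataset as a pandas DataFrame via scikit-learn.

    `fetch` is a hypothetical hook for swapping in a stub during tests;
    by default it is scikit-learn's fetch_openml, which downloads the data.
    """
    if fetch is None:
        # Imported lazily so this module loads even without scikit-learn.
        from sklearn.datasets import fetch_openml as fetch

    # as_frame=True asks for a pandas DataFrame, exposed as `.frame`.
    bunch = fetch(name=name, version=version, as_frame=True)
    return bunch.frame
```

Calling `load_openml_frame("iris")`, for instance, should give you a data frame holding both the features and the target column, ready for exploration or model training.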
The Datasets subreddit is a community-driven data source. People share everything on Reddit, so naturally they also share and request datasets for data projects. Sometimes it is difficult to find data there, but not for lack of it. Quite the opposite! The place is so full of data that searching can get confusing. The data ranges from very detailed and unusual to more established datasets. Since it is basically a forum, you can also join discussions and ask for help with datasets.
Eurostat, the Statistical Office of the European Union, is a comprehensive data source. If you are interested in high-quality statistical data on EU member states, this should be your primary data source. Its data on EU countries covers topics such as economy, population, health, and trade.
The Humanitarian Data Exchange (HDX) is an open platform where you can find humanitarian data. It is managed by the United Nations Office for the Coordination of Humanitarian Affairs. This platform provides data on humanitarian crises and emergencies in every country in the world. This can be useful if you are interested in projects focusing on global issues, disaster response, and human well-being.
There are 20,344 active and 2,570 archived datasets available, with various characteristics and formats.
On the CDC (Centers for Disease Control and Prevention) site, you can find health data. The datasets focus on various health conditions, risk factors, and public health. So if these topics interest you, you’ll find a lot of useful data here.
The BLS (US Bureau of Labor Statistics) site contains a ton of data on the US economy, job market, price changes, quality of life, and more. You will find plenty of high-quality datasets here if you are interested in these topics.
The last data source I will mention is NASA. There is a wealth of data on aerospace, applied science, applications, geoscience, management/operations, raw data, software, and space science.
The database contains over 10,000 datasets, so don’t get lost in its wealth of data!
I am sure that these 16 websites will give you enough data to work with until the end of time, which was exactly my goal! However, the amount of data is not everything.
I have chosen these sites because they will provide you with a very diverse range of datasets suitable for different data science projects. The specifics of the datasets vary from industry to industry. So working with different datasets also allows you to gain domain knowledge.
Whether you are interested in machine learning, data science, data journalism, statistical analysis, or data visualization, you can always count on these resources.
Now you can do your own data science project! If you need more ideas, here are some data science projects you can do as a beginner.
Nate Rosidi is a data scientist and product strategist. He is also an associate professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real questions from top companies. Nate writes about the latest job trends, provides interview advice, shares data science projects, and covers all things SQL.
