Table of Contents
- Beginner Level
- Iris Data
- Loan Prediction Data
- Bigmart Sales Data
- Boston Housing Data
- Time Series Analysis Data
- Wine Quality Data
- Turkiye Student Evaluation Data
- Heights and Weights Data
- Intermediate Level
- Black Friday Data
- Human Activity Recognition Data
- Siam Competition Data
- Trip History Data
- Million Song Data
- Census Income Data
- Movie Lens Data
- Twitter Classification Data
- Advanced Level
- Identify your Digits
- Urban Sound Classification
- Vox Celebrity Data
- ImageNet Data
- Chicago Crime Data
- Age Detection of Indian Actors Data
- Recommendation Engine Data
- VisualQA Data
Beginner Level
1. Iris Data Set

Problem: Predict the class of the flower based on available attributes.
2. Loan Prediction Dataset

Problem: Predict if a loan will get approved or not.
3. Bigmart Sales Data Set

Problem: Predict the sales of a store.
4. Boston Housing Data Set

Problem: Predict the median value of owner occupied homes.
5. Time Series Analysis Dataset

Time Series is one of the most commonly used techniques in data science. It has wide ranging applications – weather forecasting, predicting sales, analyzing year on year trends, etc. This dataset is specific to time series and the challenge here is to forecast traffic on a mode of transportation. The data has ** rows and ** columns.
Problem: Predict the traffic on a new mode of transport.
6. Wine Quality Dataset

Problem: Predict the quality of the wine.
7. Turkiye Student Evaluation Dataset

This dataset is based on an evaluation form filled out by students for different courses. It has different attributes including attendance, difficulty, score for each evaluation question, among others. This is an unsupervised learning problem. The dataset has 5820 rows and 33 columns.
Problem: Use classification and clustering techniques to deal with the data.
8. Heights and Weights Dataset

This is a fairly straightforward problem and is ideal for people starting off with data science. It is a regression problem. The dataset has 25,000 rows and 3 columns (index, height and weight).
Problem: Predict the height or weight of a person.
Intermediate Level
1. Black Friday Dataset

Problem: Predict purchase amount.
2. Human Activity Recognition Dataset

Problem: Predict the activity category of a human.
3. Text Mining Dataset

Problem: Classify the documents according to their labels.
4. Trip History Dataset

Problem: Predict the class of user.
5. Million Song Dataset

Problem: Predict release year of the song.
6. Census Income Dataset

Problem: Predict the income class of US population.
7. Movie Lens Dataset

Problem: Recommend new movies to users.
8. Twitter Classification Dataset

Problem: Identify the tweets which are hate tweets and which are not.
Advanced Level
1. Identify your Digits Dataset

Problem: Identify digits from an image.
2. Urban Sound Classification

When you start your machine learning journey, you go with simple machine learning problems like titanic survival prediction. But you still don’t have enough practice when it comes to real life problems. Hence, this practice problem is meant to introduce you to audio processing in the usual classification scenario. This dataset consists of 8,732 sound excerpts of urban sounds from 10 classes.
Problem: Classify the type of sound from the audio.
3. Vox Celebrity Dataset

Audio processing is rapidly becoming an important field in deep learning hence here’s another challenging problem. This dataset is for large-scale speaker identification and contains words spoken by celebrities, extracted from YouTube videos. It’s an intriguing use case for isolating and identifying speech recognition. The data contains 100,000 utterances spoken by 1,251 celebrities.
Problem: Figure out which celebrity the voice belongs to.
4. ImageNet Dataset

Problem: Problem to solve is subjected to the image type you download.
5. Chicago Crime Dataset

Problem: Predict the type of crime.
6. Age Detection of Indian Actors Dataset

This is a fascinating challenge for any deep learning enthusiast. The dataset contains thousands of images of Indian actors and your task is to identify their age. All the images are manually selected and cropped from the video frames resulting in a high degree of variability interms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup. There are 19,906 images in the training set and 6,636 in the test set.
Problem: Predict the age of the actors.
7. Recommendation Engine Dataset

This is an advanced recommendation system challenge. In this practice problem, you are given the data of programmers and questions that they have previously solved, along with the time that they took to solve that particular question. As a data scientist, the model you build will help online judges to decide the next level of questions to recommend to a user.
Problem: Predict the time taken to solve a problem given the current status of the user.
Start: Get Data
8. VisualQA Dataset

VisualQA is a dataset containing open-ended questions about images. These questions require an understanding of computer vision and language. There is an automatic evaluation metric for this problem. The dataset has 265,016 images, 3 questions per image and 10 ground truth answers per question.
Problem: Use deep learning technique to answer open-ended questions about images.
End Notes
Out of the 24 datasets listed above, you should start by finding the one that matches your skillset. Say, if you are a beginner in machine learning, avoid taking up advanced level data sets from the get go. Don’t bite more than you can chew and don’t feel overwhelmed with how much you still have to do. Instead, focus on making step-wise progress.
Once you complete 2 – 3 projects, showcase them on your resume and your GitHub profile (very important!). Lots of recruiters these days hire candidates by checking their GitHub profiles. Your motive shouldn’t be to do all the projects, but to pick out selected ones based on the problem to be solved, domain and the dataset size. If you want to look at complete project solution, take a look at this article.
Comments
Post a Comment