Machine learning has boomed lately: everyone is learning it and implementing it. Even so, many points about its concepts and approach still need clearing up.
There are a few questions one must ask when delving into machine learning and solving problems with it. What is the approach? How do I start? What is the underlying problem? Which algorithm fits the problem best?
In this article, you will learn step-by-step how to answer these questions by yourself whilst solving machine learning problems.
In the first step, we will look at where to use machine learning. In the second, we will see which algorithm to use for a specific use case. Lastly, we will learn how to prepare clean visualizations that present the results clearly.
Step 1. Where to use Machine Learning?
Not every problem that involves numbers is a machine learning problem. As the saying goes, if the only tool you have is a hammer, you tend to see every problem as a nail.
Machine learning is generally a good fit for the following kinds of problems:
- Learning from the data is required.
- Prediction of an outcome is asked for.
- Automation is involved.
- Understanding a pattern is required, as in the case of user sentiments.
- Understanding a pattern is required to build recommendation systems.
- Identification/Detection of an entity/object is required.
There are other cases too, but the ones above are the fundamentals. A use case may match more than one bullet. Some problems simply do not need machine learning at all; in such cases, go with the simpler solution, because simplicity is valued everywhere.
Now, on to how to solve a machine learning problem. The following stepwise approach will help with almost any machine learning problem.
Step 1(a). How to solve a Machine Learning problem?
1. Read the data (from CSV, JSON, etc.).
2. Identify the dependent and independent variables.
3. Check whether the data has missing values or categorical columns.
4. If it does, apply basic preprocessing operations to bring the data into a ready-to-use format.
5. Split the data into training and testing sets.
6. Fit the training data to the most suitable model. (How to find a suitable model is answered below.)
7. Validate the model. If the results are satisfactory, go with it; otherwise, tune the parameters and keep testing. In some cases, you can also try different algorithms on the same problem to compare their accuracies.
8. From step 7, one can also learn about the accuracy paradox, where a high accuracy score hides a poor model.
9. Visualize the data.
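The first seven steps above can be sketched as follows, using only the Python standard library so that it runs anywhere. The inline CSV, the column names and the tiny k-nearest-neighbours helper are assumptions made for illustration; in practice, libraries such as pandas or scikit-learn would usually handle these steps.

```python
import csv
import io
import random
from collections import Counter

# Hypothetical inline data standing in for a real CSV file.
# Empty "age" cells are the missing values.
RAW_CSV = """age,plan,churned
25,basic,yes
,basic,yes
31,basic,yes
47,premium,no
52,premium,no
38,basic,no
29,basic,yes
,premium,no
60,premium,no
23,basic,yes
45,basic,no
34,premium,no
"""

# Steps 1-2: read the data; "churned" is the dependent variable.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
TARGET = "churned"

# Steps 3-4: fill missing ages with the mean, encode "plan" as 0/1.
known_ages = [float(r["age"]) for r in rows if r["age"]]
mean_age = sum(known_ages) / len(known_ages)
X = [[float(r["age"]) if r["age"] else mean_age,
      1.0 if r["plan"] == "premium" else 0.0] for r in rows]
y = [r[TARGET] for r in rows]

# Step 5: shuffle with a fixed seed, then hold out 25% for testing.
indices = list(range(len(rows)))
random.Random(0).shuffle(indices)
cut = int(len(indices) * 0.75)
train_idx, test_idx = indices[:cut], indices[cut:]

# Step 6: "fit" a k-nearest-neighbours classifier.
# KNN simply keeps the training rows and compares new points against them.
# Note: age dominates the distance because the features are not scaled.
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def knn_predict(point, k=3):
    nearest = sorted(train_idx, key=lambda i: squared_distance(X[i], point))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

# Step 7: validate on the held-out rows.
predictions = [knn_predict(X[i]) for i in test_idx]
accuracy = sum(p == y[i] for p, i in zip(predictions, test_idx)) / len(test_idx)
print(f"train={len(train_idx)} test={len(test_idx)} accuracy={accuracy:.2f}")
```

With only twelve rows the accuracy figure means little; the point is the order of the steps, not the score.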
Visualising the data is important because it helps us understand where the data is heading, and it makes the story we tell about the data more convincing.
This nine-step approach is beginner-friendly and should help you get started.
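The accuracy paradox mentioned in the validation step is easiest to see with numbers. The sketch below uses a made-up, heavily imbalanced test set and a "model" that always predicts the majority class; both are assumptions chosen purely for illustration.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5

# A "model" that never predicts the positive class.
always_negative = [0] * 100

# Accuracy: share of predictions that match the actual labels.
accuracy = sum(a == p for a, p in zip(actual, always_negative)) / len(actual)

# Recall: share of actual positives that the model found.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, always_negative))
recall = true_positives / sum(actual)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.95 recall=0.00
```

The model scores 95% accuracy while missing every positive case, which is why accuracy alone can be misleading when the classes are imbalanced.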
Step 2. Which Algorithm to use?
To understand the basics, we first need to understand what labelling really is. In layman's terms, labels are the values we need to predict: the y variable in a machine learning problem, often called the dependent variable.
Let’s understand this with a small example.
Supervised learning is the term we use when training needs supervision. How do we give that supervision? It means the model's output has a frame of reference to be compared against. That frame of reference is the dependent variable.
Unsupervised learning has no such frame of reference, hence the name.
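To make the distinction concrete, here is a minimal sketch on made-up data. The study-hours values, the nearest-centroid predictor and the largest-gap grouping are all assumptions chosen for brevity, not standard library routines.

```python
# Hypothetical toy data: hours studied (independent variable)
# and exam result (the label, i.e. the dependent variable y).
hours = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
passed = ["no", "no", "no", "yes", "yes", "yes"]

# Supervised: labels give a frame of reference, so we can learn
# one centroid per label and compare new inputs against them.
centroids = {}
for label in set(passed):
    values = [h for h, p in zip(hours, passed) if p == label]
    centroids[label] = sum(values) / len(values)

def predict(h):
    """Return the label whose centroid is closest to h."""
    return min(centroids, key=lambda label: abs(centroids[label] - h))

# Unsupervised: no labels are used, so we can only group similar points.
# Here the sorted points are split at the largest gap between neighbours.
ordered = sorted(hours)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
split = gaps.index(max(gaps)) + 1
clusters = [ordered[:split], ordered[split:]]

print(predict(6.5))  # a named label, because labels were available
print(clusters)      # anonymous groups, with no names attached
```

The supervised half can answer "pass or fail?" for a new student, while the unsupervised half can only say that the data falls into two groups.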
Now let us see how the algorithms can be used for different purposes.
Note: the algorithms below are listed against the cases they are most commonly used for, so the mapping is a generalization; the right choice may vary with the situation.
- Linear Regression: numeric (continuous) output
- Logistic Regression: binary output
- Linear Discriminant Analysis: multi-category classification
- Decision Tree: regression and classification
- Ensembles: regression and classification
- Naive Bayes: classification
- KNN: regression and classification
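As a rough illustration, the mapping above can be turned into a small helper that inspects the target column and returns candidate algorithms. The function name `suggest_algorithms` and the threshold of ten distinct values for treating a numeric target as continuous are hypothetical choices, not established rules.

```python
def suggest_algorithms(labels):
    """Return candidate algorithms for a target column (a rough heuristic only)."""
    if labels is None:
        # No dependent variable at all, so supervised algorithms do not apply.
        return ["no labels: consider unsupervised methods"]

    distinct = set(labels)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )

    # Arbitrary assumption: many distinct numeric values means a continuous target.
    if numeric and len(distinct) > 10:
        return ["Linear Regression", "Decision Tree", "Ensembles", "KNN"]

    # Exactly two outcomes: a binary classification problem.
    if len(distinct) == 2:
        return ["Logistic Regression", "Decision Tree", "Ensembles",
                "Naive Bayes", "KNN"]

    # Otherwise treat it as multi-category classification.
    return ["Linear Discriminant Analysis", "Decision Tree", "Ensembles",
            "Naive Bayes", "KNN"]


print(suggest_algorithms(["spam", "ham", "spam"]))
print(suggest_algorithms([x * 0.5 for x in range(50)]))
```

A helper like this only narrows the shortlist; trying several candidates and validating them is still what decides the final choice.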
Ensembles include random forest, AdaBoost, XGBoost and other combined algorithms. These can be used for both classification and regression.
An ensemble can be understood as a group of more than one classifier or regressor, whether of the same type or not, working towards the same purpose.
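The idea of several models working towards the same purpose can be sketched with simple majority voting. The three rule-based classifiers and their thresholds below are hypothetical stand-ins for trained models. Voting is also only one way to combine members: boosting methods such as AdaBoost combine them differently.

```python
from collections import Counter

# Three deliberately simple, hypothetical classifiers.
# Each takes a sample of the form (height_cm, weight_kg).
def by_height(sample):
    return "adult" if sample[0] >= 150 else "child"

def by_weight(sample):
    return "adult" if sample[1] >= 45 else "child"

def by_both(sample):
    return "adult" if sample[0] + sample[1] >= 200 else "child"

def ensemble_predict(sample, members=(by_height, by_weight, by_both)):
    """Hard voting: each member predicts, and the most common answer wins."""
    votes = Counter(member(sample) for member in members)
    return votes.most_common(1)[0][0]

print(ensemble_predict((160, 40)))  # adult: two of three members vote "adult"
print(ensemble_predict((140, 50)))  # child: two of three members vote "child"
```

Using an odd number of members on a two-class problem avoids tied votes.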
Step 3. Preparing Clean Visualizations
Now coming to visualizations, here are some things to keep in mind when visualizing reports:
- A scatter plot can show the clustering of classes.
- A scatter plot shouldn't be used when there are too many data points.
- A class comparison can be shown with histograms.
- Pie charts can be used for a comparative breakdown.
- Simple line charts can be used for data with frequent fluctuations, such as stock prices.
Too many data points on a scatter plot make it look cluttered, which is not a good report to put in front of stakeholders. Avoid scatter plots in such cases.
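These rules of thumb can be written down as a small lookup helper, shown below as a sketch. The purpose names, the function `suggest_chart` and the cutoff of 1000 points for "too many" are hypothetical; the right cutoff depends on the report and the audience.

```python
# Rules of thumb from the list above; the keys are hypothetical purpose names.
CHART_FOR_PURPOSE = {
    "class clusters": "scatter plot",
    "class comparison": "histogram",
    "comparative breakdown": "pie chart",
    "frequent fluctuations": "line chart",
}

def suggest_chart(purpose, n_points, max_scatter_points=1000):
    """Return a chart type, or None when a scatter plot would be too crowded."""
    chart = CHART_FOR_PURPOSE[purpose]
    # The cutoff for "too many points" is an arbitrary assumption.
    if chart == "scatter plot" and n_points > max_scatter_points:
        return None
    return chart

print(suggest_chart("class clusters", 200))           # scatter plot
print(suggest_chart("class clusters", 50_000))        # None
print(suggest_chart("frequent fluctuations", 50_000)) # line chart
```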
This article aimed to create general awareness of machine learning tips for beginners. It covered some general dos and don'ts, along with some basic questions that beginners generally ask.
Hope you found this article useful.