If you are an absolute machine learning beginner wondering whether data analysis is a prerequisite, here’s the hard fact: data analysis, meaning the task of gathering, cleaning, exploring, and visualizing data, is an absolute must before you get started on machine learning.
However, let’s also get one thing clear: machine learning is as much about linear algebra, probability theory and statistics (especially graphical models), and information theory as it is about data analysis. Data analysis is still an important part of the picture. ML algorithms are applied to real-world data, and because data almost never comes in a structured, labeled format, you won’t get far with algorithms unless you know how to process data. According to a section of ML practitioners, data science and machine learning are essentially two sides of the same field.
Let’s see how data analysis will help you level up on ML:
- First, you won’t be able to build a good enough model if you don’t have solid data analysis skills.
- Even if you use packaged tools like Python’s scikit-learn, which end up performing the hard math, you need a solid understanding of the data to make these tools work effectively. Without a solid grasp of exploratory data analysis and data visualization, you can’t get far in machine learning.
- Even to apply tools such as caret and scikit-learn, you’ll need to be able to gather, prepare, and explore your data. You need a solid understanding of data analysis.
Let’s enumerate how one can use data science as a platform to dive into the basics of machine learning:
1) 80% of data science work involves data prep
By now, it is a commonly cited figure that 80% of data science work involves data preparation, EDA, and visualization. For most data scientists, organizing and manipulating data is still a much-needed skill, and it is the groundwork for the machine learning algorithms they then apply with tools such as scikit-learn.
This means that when one is building machine learning models, roughly 80% of the time will be spent gathering data, exploring it, cleaning it, and analyzing results with data visualization.
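To make the explore-and-clean part of that work concrete, here is a minimal sketch in pandas. The column names and values are made up for illustration, and filling gaps with the median is just one common choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical problems: a duplicate row,
# missing values, and a numeric column stored as text.
raw = pd.DataFrame({
    "age": ["34", "45", None, "45", "29"],
    "income": [52000.0, 61000.0, 48000.0, 61000.0, np.nan],
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
})

# Explore: count missing values per column before touching anything.
missing_before = raw.isna().sum()

# Clean: drop exact duplicate rows, convert text to numbers,
# then fill the remaining gaps with each column's median.
clean = raw.drop_duplicates().copy()
clean["age"] = pd.to_numeric(clean["age"])
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

print(missing_before)
print(clean)
```

Only after steps like these does it make sense to hand the table to a modeling library.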
2) Knowing how to manipulate data is critical
For beginning ML practitioners, manipulating data is more critical than understanding the math underlying the algorithms. While linear algebra is a building block of machine learning and the key to understanding the statistics applied in ML, most data science practitioners have only a working understanding of calculus and linear algebra.
They are, however, excellent data analysts who usually start with the minimum required math and fill in the gaps on the job. According to a data science practitioner from the financial sector, if you want to be able to write an algorithm from scratch, you need a very strong understanding of linear algebra. Otherwise, if you want to be a data science practitioner, you don’t need high-level knowledge of calculus to understand how an algorithm behaves.
In the long run, advanced math is an absolute must, but in the short term one should focus on the data visualization and data manipulation stack in R or Python.
These are the most widely recommended packages to get started with visualization, wrangling, and analysis:
R: ggplot2, dplyr, tidyr, stringr
Python: numpy, pandas, matplotlib, seaborn
3) Before you dive into ML, you need to master visualization
The job description of an entry-level data scientist involves a lot of data aggregation and data visualization, both of which feed directly into exploratory data analysis. Professionals who prefer R can learn ggplot2 for data visualization, including basic plots like scatterplots, histograms, and bar charts, and then learn how to use ggplot2 and dplyr together for exploratory data analysis. Python users can learn to combine pandas with a plotting library such as matplotlib or seaborn for exploratory data analysis.
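As a rough illustration of aggregation and visualization working together in Python, the sketch below groups a small table with pandas and draws a bar chart and a scatterplot with matplotlib. The dataset and its column names are invented for this example, and the non-interactive "Agg" backend is chosen only so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset, invented for illustration only.
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B", "C"],
    "spend": [120.0, 80.0, 200.0, 150.0, 250.0, 60.0],
    "visits": [4, 3, 8, 6, 9, 2],
})

# Aggregate: mean spend and total visits per segment.
summary = df.groupby("segment").agg(
    mean_spend=("spend", "mean"),
    total_visits=("visits", "sum"),
)

# Visualize: a bar chart of the aggregate next to a scatterplot of the raw rows.
fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
summary["mean_spend"].plot.bar(ax=ax_bar, title="Mean spend by segment")
df.plot.scatter(x="visits", y="spend", ax=ax_scatter, title="Spend vs. visits")
fig.tight_layout()

# Render to an in-memory PNG; fig.savefig("eda.png") would write it to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")

print(summary)
```

The same pattern (aggregate with pandas, then plot the result) carries over to seaborn if you prefer its defaults.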
4) Linear algebra is the workhorse of machine learning
That said, linear algebra is important if you want to understand the inner workings of machine learning, including optimization methods such as gradient descent. It is just as hard to overstate the importance of grasping the essential concepts of statistics and probability, given that machine learning is often dubbed statistical learning.
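To see why linear algebra shows up in gradient descent, here is a minimal sketch that fits a linear model with NumPy by minimizing mean squared error. The synthetic data, the learning rate, and the iteration count are all assumptions picked for this example; the point is that both the prediction and the gradient are plain matrix-vector products.

```python
import numpy as np

# Synthetic, noiseless data generated from known weights (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
true_bias = 0.5
y = X @ true_w + true_bias

# Append a column of ones so the bias is learned as a third weight.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

w = np.zeros(Xb.shape[1])
learning_rate = 0.1  # assumed value, small enough for this data

for _ in range(500):
    residual = Xb @ w - y                      # prediction error per sample
    grad = 2.0 * Xb.T @ residual / len(y)      # gradient of mean squared error
    w -= learning_rate * grad                  # step against the gradient

print(w)  # close to [2.0, -3.0, 0.5]
```

The gradient line is the matrix form of the derivative of the squared error, which is exactly where a working knowledge of transposes and matrix products pays off.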
The field is so vast that it is difficult to follow a focused learning plan, and most entry-level data scientists struggle to cover all the essential concepts in a short span of time. A deeper understanding of the algorithms requires statistics and stochastic processes. This is the point where things become difficult, since one also needs knowledge of calculus and linear algebra.
For an absolute beginner it can be difficult to understand all of these aspects, and that’s why a foundation in data analysis can help one build machine learning models that work. Also, remember that in a machine learning workflow, the experience gained from exploratory data analysis serves as an input to the “data transformation” step.
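As a small sketch of how an EDA finding can feed the data transformation step, suppose exploration shows a heavily right-skewed feature. The values below are hypothetical, and a log transform followed by standardization is one common response, not the only one.

```python
import numpy as np

# Hypothetical feature with one extreme value (illustrative numbers).
income = np.array([20000.0, 25000.0, 30000.0, 32000.0, 40000.0, 500000.0])

# EDA finding: the mean sits far above the median, a sign of right skew.
raw_mean = income.mean()
raw_median = np.median(income)

# Transformation informed by that finding: compress the long tail with a log,
# then standardize to zero mean and unit variance.
log_income = np.log1p(income)
standardized = (log_income - log_income.mean()) / log_income.std()

print(raw_mean, raw_median)
print(standardized)
```

Without the exploratory step, the outlier would dominate any model that is sensitive to feature scale.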
Outlook
Not everybody has a rigorously quantitative background to work their way through the math required for machine learning. Given the rising interest in the field and a lack of formal training, most beginners (who follow the self-learning path) find it challenging and frustrating to master the concepts completely. That’s why beginners can use data analysis as a platform to dive into machine learning without completely mastering linear algebra or calculus.
Meanwhile, here’s a guide to ML by Jason Brownlee in which he talks about how to get a handle on linear algebra for ML. According to Brownlee, there are a minimum of 3 topics one must cover: a) notation, which allows one to piece things together; b) operations, which means learning how to perform simple operations such as multiplying and transposing matrices; and c) matrix factorization, which requires a deep dive into concepts like SVD and QR. This forms the bedrock of machine learning.
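The operations and factorizations named above can be tried out in a few lines of NumPy. This is only a sketch with made-up matrices, not material taken from Brownlee’s guide.

```python
import numpy as np

# Small example matrices (values chosen arbitrarily).
A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 1.0]])   # shape (3, 2)
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])   # shape (2, 2)

# Operations: matrix multiplication and transpose.
product = A @ B              # shape (3, 2)
transposed = A.T             # shape (2, 3)

# Factorization 1: singular value decomposition, A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_from_svd = U @ np.diag(s) @ Vt

# Factorization 2: QR decomposition, A = Q * R with orthonormal columns in Q
# and an upper-triangular R.
Q, R = np.linalg.qr(A)
A_from_qr = Q @ R

print(product)
print(np.allclose(A, A_from_svd), np.allclose(A, A_from_qr))
```

Multiplying the factors back together and comparing with the original matrix is a quick way to check that you have read a factorization correctly.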
Besides, don’t forget to brush up on the basics with these books on ML: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, and Information Theory, Inference, and Learning Algorithms by David MacKay. For linear algebra, check out Linear Algebra, Theory, and Applications by Kuttler.