What should one learn to be a data scientist?

In 2011, INSOFE became the first and perhaps the only institute in Asia whose data science program has been certified by the Language Technologies Institute of Carnegie Mellon University (CMU) in Pennsylvania, USA, to be of the same quality as its own graduate classes.
Needless to say, we were thrilled when CMU faculty who saw our class videos said they were impressed with the energy we bring to our classes!

But the focus of today’s article is different. I want to share how we developed our data science curriculum and discuss a few insights we gained during the process.



To build the ideal curriculum, we turned to the following groups to discern which topics they believe are important for data scientists to master:

  • Universities
  • Corporations
  • Peer groups

Universities

We identified nine top U.S. universities and examined their Master’s in Analytics/Data Science curricula. We did not consider Master’s programs in Machine Learning, AI, or Operations Research. The programs we considered are typically offered by business schools, business/engineering schools, or special institutes.

We created a spreadsheet where the rows were modules or concepts and the columns were universities. The cell value for each module was 1 if the university’s curriculum covered that topic and 0 if it did not. The final column was the sum of all the cells in the row, indicating the importance of each topic.

              Univ1   Univ2   Univ3   Importance
Concept 1       1       0       1         2

 

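For concreteness, the tally can be reproduced in a few lines of pandas. The topic names, universities, and 0/1 values below are placeholders rather than our actual survey data.

```python
import pandas as pd

# Placeholder coverage matrix: rows are modules/concepts, columns are universities.
# The topics, universities, and 0/1 values are illustrative, not our survey data.
coverage = pd.DataFrame(
    {
        "Univ1": [1, 1, 0],
        "Univ2": [0, 1, 1],
        "Univ3": [1, 1, 0],
    },
    index=["Optimization", "Regression", "Hadoop"],
)

# The "Importance" column is simply the row sum: how many curricula cover the topic.
coverage["Importance"] = coverage.sum(axis=1)
print(coverage.sort_values("Importance", ascending=False))
```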
The single biggest surprise to us was that linear programming and optimization came up as the most common and important topic. The other two stakeholder groups did not even consider linear programming as part of an ideal curriculum!

Another common focus area for university study is in-depth understanding of foundational models such as regression, decision trees, Naïve Bayes classifiers, clustering, neural networks, and support vector machines (i.e., the science and math behind them).

The students’ homework assignments tended to focus on working with well-formed data. Some, but not all, were focused on industry projects. More on this later.

Corporate Executives

Executives from seven industries (distributed across services, products, and consulting) formed our industry advisory board, which guided us on curriculum design. The executives were all tool-specific (candidates must know R, SAS, SPSS, etc.) and project-crazy (candidates should have worked on real-world data). All board members agreed that data science students must understand how to (a minimal end-to-end sketch follows the list):

  • Clean up messy data from a variety of sources using one or two tools.
  • Run algorithms using one or two tools.
  • Visualize and communicate the results effectively to business users.
  • Conceptualize the big picture of solution architecting and framing, as well as a systematic framework for solving problems.
  • Demonstrate all of the above by executing a sufficiently complex project.
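
To make this wishlist concrete, here is a minimal, hypothetical sketch in Python (pandas, scikit-learn, matplotlib) covering the first three points: clean a messy file, run an algorithm, and visualize the result. The file name, column names, and model choice are assumptions made for illustration, not a prescribed recipe.

```python
# Hypothetical end-to-end sketch: clean messy data, run an algorithm with one tool,
# and visualize the result. The file name and column names are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Clean up messy data: drop duplicates, coerce types, impute missing values.
df = pd.read_csv("customers.csv")                      # assumed input file
df = df.drop_duplicates()
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # unparsable entries become NaN
df["age"] = df["age"].fillna(df["age"].median())

# 2. Run an algorithm.
X = df[["age", "tenure_months", "monthly_spend"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Visualize and communicate the result.
pd.Series(model.feature_importances_, index=X.columns).plot.barh(title="What drives churn?")
plt.tight_layout()
plt.show()
```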

Our industry advisors also recommended that students know how to set up infrastructure for large and unstructured data (e.g., Hadoop). Interestingly, 90% of the executives who recommended Hadoop skills as mandatory had never used it! Clearly, they had been swayed by the buzz.

Also, I would like to define “a sufficiently complex project.”

For a university, a complex problem is one where the mathematics is strenuous, for example recognizing a body part in a medical diagnostic image. Typically, the data is very well formed and the focus is on the math.

In a corporate context, a complex problem is one where the data is messy, unstructured, and large. If it is cleaned, the analytics become fairly simple. Typical projects are customer scoring, customer lifetime value, etc.

Industry and academia do not see eye to eye in this matter!

In fact, a professor whom I respect deeply said she removed the industry internship from her curriculum because industry mentors were making her students work on very simple problems.

Data Scientist Community

The third stakeholder we looked at was the community. We went through Kaggle and KDD Cup winning solutions to see where the peer group is heading.

Most of the high-end peers focus on clever engineering of fundamental algorithms and on computing power.

The techniques that led these competitions included random forests, gradient boosting machines, and singular value decomposition. Such techniques may not find space in most master’s-level curricula; they are studied by PhD candidates and other advanced graduate students, and much of industry has most likely not even heard of them.
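
For a flavour of what these techniques look like in practice, here is a small sketch using scikit-learn on synthetic data. It simply places the named techniques side by side and is not a reconstruction of any particular winning solution.

```python
# Illustrative only: random forests, gradient boosting, and truncated SVD on
# synthetic data; not a reconstruction of any actual competition entry.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)

# Reduce the feature space with (truncated) singular value decomposition.
X_svd = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)

for name, model in [
    ("Random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("Gradient boosting", GradientBoostingClassifier(random_state=0)),
]:
    score = cross_val_score(model, X_svd, y, cv=3).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```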

So, if you want to be a great data scientist today, you need to build a competency set that reflects the priorities of the three important stakeholders who define the space:

  • Gain a very thorough mathematical understanding of some fundamental machine learning and statistical techniques.
  • Develop an understanding of the bigger picture, along with the tool skills and data pre-processing skills needed to solve business problems quickly.
  • Understand advanced algorithms so you can apply them to real-world problems.
  • Work on one complex problem from a mathematical perspective and another from a data science perspective.

One way to acquire such skills is by studying in a dedicated program such as INSOFE’s CPEE, which teaches all of these skills through a rigorous curriculum.

If you prefer free resources, plenty are available online. If you have the discipline, prepare a self-paced plan that balances all of the above aspects and learn systematically.



Comments

  • “In a corporate context, a complex problem is one where the data is messy, unstructured, and large. If it is cleaned, the analytics become fairly simple. Typical projects are customer scoring, customer lifetime value, etc.” Very interesting; this is so true in industries like telecom, where the main problem is getting the data itself.

  • Looks like this article is a good advertisement for INSOFE. I may be wrong, so let’s not debate that.
    Analytics is all about analyzing data, facts, and other ancillary information using certain tools and methodologies to arrive at inferences or findings that enable appropriate decisions. In business analytics, the focus should be on performing analytics that helps identify patterns and behaviors, or on using certain findings to take appropriate decisions. Businesses are not trying to write a thesis on whether big data is here to stay or whether Hadoop is better than SAS; all they want is value for money, using analytics to generate dollars by taking good business decisions backed by scientific data and analysis. The results or implementation may not achieve a 100% accuracy or success rate, but even if they manage to get things right in 75-80% of their projects, the gains will offset the small errors of judgement and still deliver net gains.

    Similarly, academics are lost in methodology and process, and not really thinking about what works in industry and gets implemented to solve real-life problems. Most real-life problems may not necessarily require the most complex algorithm. Linear programming, elasticity, time series, and so on may sound like ancient stuff, but they are well proven, stand the test of time, provide true insights, and solve problems; they just don’t sound as sexy as big data or multivariate correlation. At the end of the day, what works and solves real problems wins the race, not something that is merely fancy and sounds mathematically challenging.

    • Hi Sri:

      Thanks for the comment. As mentioned in the article, I got similar feedback from a good number of industry representatives.

      While I understand this point of view (when something is working, don’t tweak it) and respect it, I believe that balancing the “ancient” (I like the term!) with the modern is becoming more essential, for two reasons.

      When one can do better, why not: Even today, I see quite a few analysts refusing to move beyond logistic regression even when advanced methods add substantial improvements to their problem. While trying out something new is less comfortable, with today’s tools I believe it is fairly easy to learn and explore the power of modern techniques. So we do need data scientists in the system who can go beyond the traditional with ease.

      Problems are genuinely getting tougher: Pleasantly and surprisingly, users are taking their expectations beyond the traditional realm. Recently, I was asked by a software services major to solve a problem with 7 rows and 38,500 attributes! Around the same time, I was working on another problem, for a product vendor, with 500 dimensions and 100 million rows. Two extremes in data sizes! These complex problems are not one-off cases, and their frequency is increasing (at least in my humble experience).

      So, while I agree that techniques like gradient boosting machines or conditional random fields sound sexy and scary at once, I find myself using them to solve real-world problems more often than not! They are definitely not confined to academic discussions. Looking at Kaggle, I get the feeling that many more people are going in that direction.

      So, I feel the time has come to make these techniques part of the standard data science curriculum. I also advise practicing data scientists to equip themselves with these skills. I believe you will end up using them gainfully in real-world projects sooner rather than later.
