What should one learn to be a data scientist?

In 2011, INSOFE became the first and perhaps the only institute in Asia whose data science program has been certified by the Language Technologies Institute of Carnegie Mellon University (CMU) in Pennsylvania, USA, to be of the same quality as its own graduate classes.
Needless to say, we were thrilled when CMU faculty who saw our class videos said they were impressed with the energy we bring to our classes!

But the focus of today’s article is different. I want to share how we developed our data science curriculum and discuss a few insights we gained during the process.



To build the ideal curriculum, we turned to the following groups to discern which topics they believe are important for data scientists to master:

  • Universities
  • Corporations
  • Peer groups

Universities

We identified nine top U.S. universities and examined their Master’s in Analytics/Data Science curricula. We did not consider Master’s programs in Machine Learning, AI, or Operations Research. The programs we considered are typically offered by business schools, business/engineering schools, or special institutes.

We created a spreadsheet where the rows were modules or concepts and the columns were universities. The cell value for each module was 1 if the university’s curriculum covered that topic and 0 if it did not. The final column was the sum of all the cells in the row, indicating the importance of each topic.

              Univ1   Univ2   Univ3   Importance
Concept 1       1       0       1         2

 

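For concreteness, the tally can be reproduced in a few lines of pandas. The topic names, universities, and 0/1 values below are placeholders rather than our actual survey data.

```python
import pandas as pd

# Placeholder coverage matrix: rows are modules/concepts, columns are universities.
# The topics, universities, and 0/1 values are illustrative, not our survey data.
coverage = pd.DataFrame(
    {
        "Univ1": [1, 1, 0],
        "Univ2": [0, 1, 1],
        "Univ3": [1, 1, 0],
    },
    index=["Optimization", "Regression", "Hadoop"],
)

# The "Importance" column is simply the row sum: how many curricula cover the topic.
coverage["Importance"] = coverage.sum(axis=1)
print(coverage.sort_values("Importance", ascending=False))
```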
The single biggest surprise to us was that linear programming and optimization came up as the most common and important topic. The other two stakeholder groups did not even consider linear programming as part of an ideal curriculum!

Another common focus area for university study is in-depth understanding of foundational models such as regression, decision trees, Naïve Bayes classifiers, clustering, neural networks, and support vector machines (i.e., the science and math behind them).

The students’ homework assignments tended to focus on working with well-formed data. Some, but not all, were focused on industry projects. More on this later.

Corporate Executives

Executives from seven industries (distributed across services, products, and consulting) formed our industry advisory board, which guided us on curriculum design. The executives were all tool-specific (candidates must know R, SAS, SPSS, etc.) and project-crazy (candidates should have worked on real-world data). All board members agreed that data science students must understand how to (a minimal end-to-end sketch follows the list):

  • Clean up messy data from a variety of sources using one or two tools.
  • Run algorithms using one or two tools.
  • Visualize and communicate the results effectively to business users.
  • Conceptualize the big picture of solution architecting and framing, as well as a systematic framework for solving problems.
  • Demonstrate all of the above by executing a sufficiently complex project.
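
To make this wishlist concrete, here is a minimal, hypothetical sketch in Python (pandas, scikit-learn, matplotlib) covering the first three points: clean a messy file, run an algorithm, and visualize the result. The file name, column names, and model choice are assumptions made for illustration, not a prescribed recipe.

```python
# Hypothetical end-to-end sketch: clean messy data, run an algorithm with one tool,
# and visualize the result. The file name and column names are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Clean up messy data: drop duplicates, coerce types, impute missing values.
df = pd.read_csv("customers.csv")                      # assumed input file
df = df.drop_duplicates()
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # unparsable entries become NaN
df["age"] = df["age"].fillna(df["age"].median())

# 2. Run an algorithm.
X = df[["age", "tenure_months", "monthly_spend"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Visualize and communicate the result.
pd.Series(model.feature_importances_, index=X.columns).plot.barh(title="What drives churn?")
plt.tight_layout()
plt.show()
```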

Our industry advisors also recommended that students know how to set up infrastructure for large and unstructured data (e.g., Hadoop). Interestingly, 90% of the executives who recommended Hadoop skills as mandatory had never used it! Clearly, they had been swayed by the buzz.

Also, I would like to define “a sufficiently complex project.”

For a university, a complex problem is one where the mathematics is strenuous, for example recognizing a body part in a medical diagnostic image. Typically, the data is very well formed and the focus is on the math.

In a corporate context, a complex problem is one where the data is messy, unstructured, and large. If it is cleaned, the analytics become fairly simple. Typical projects are customer scoring, customer lifetime value, etc.

Industry and academia do not see eye to eye in this matter!

In fact, a professor whom I respect deeply said she removed the industry internship from her curriculum because industry mentors were making her students work on very simple problems.

Data Scientist Community

The third stakeholder we looked at was the community. We went through Kaggle and KDD Cup winning solutions to see where the peer group is heading.

Most of the high-end peers focus on clever engineering of fundamental algorithms and on computing power.

The techniques that led these competitions included random forests, gradient boosting machines, and singular value decomposition. Such techniques may not find space in most master’s-level curricula; they are studied by PhD candidates and other advanced graduate students, and much of industry has most likely not even heard of them.
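
For a flavour of what these techniques look like in practice, here is a small sketch using scikit-learn on synthetic data. It simply places the named techniques side by side and is not a reconstruction of any particular winning solution.

```python
# Illustrative only: random forests, gradient boosting, and truncated SVD on
# synthetic data; not a reconstruction of any actual competition entry.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)

# Reduce the feature space with (truncated) singular value decomposition.
X_svd = TruncatedSVD(n_components=10, random_state=0).fit_transform(X)

for name, model in [
    ("Random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("Gradient boosting", GradientBoostingClassifier(random_state=0)),
]:
    score = cross_val_score(model, X_svd, y, cv=3).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```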

So, if you want to be a great data scientist today, you need to build a competency set that reflects the priorities of the three important stakeholders who define the space:

  • Gain a very thorough mathematical understanding of some fundamental machine learning and statistical techniques.
  • Develop an understanding of the bigger picture, along with the tool skills and data pre-processing skills needed to solve business problems quickly.
  • Understand advanced algorithms so you can apply them to real-world problems.
  • Work on one complex problem from a mathematical perspective and another from a data science perspective.

One way to acquire such skills is by studying in a dedicated program such as INSOFE’s CPEE, which teaches all of these skills through a rigorous curriculum.

If you prefer free resources, plenty are available online. If you have the discipline, prepare a self-paced plan that balances all of the above aspects and learn systematically.



Comments

  • “In a corporate context, a complex problem is one where the data is messy, unstructured, and large. If it is cleaned, the analytics become fairly simple. Typical projects are customer scoring, customer lifetime value, etc.” Very interesting; this is so true in industries like telecom, where the main problem is getting the data itself.

  • Looks like this article is a good advertisement for INSOFE. I may be wrong, so let’s not debate that.
    Analytics is all about analyzing data, facts, and other ancillary information using certain tools and methodologies to arrive at inferences or findings that enable appropriate decisions. In business analytics, the focus should be on performing analytics that helps identify patterns and behaviors, or on using certain findings to take appropriate decisions. Businesses are not trying to write a thesis on whether big data is here to stay or whether Hadoop is better than SAS; all they want is value for money, using analytics to generate dollars by taking good business decisions backed by scientific data and analysis. The results or implementation may not achieve a 100% accuracy or success rate, but even if they manage to get things right in 75-80% of their projects, the gains will offset the small errors of judgement and still deliver net gains.

    Similarly, academics are lost in methodology and process, and not really thinking about what works in industry and gets implemented to solve real-life problems. Most real-life problems may not necessarily require the most complex algorithm. Linear programming, elasticity, time series, and so on may sound like ancient stuff, but they are well proven, stand the test of time, provide true insights, and solve problems; they just don’t sound as sexy as big data or multivariate correlation. At the end of the day, what works and solves real problems wins the race, not something that is merely fancy and sounds mathematically challenging.

    • Hi Sri:

      Thanks for the comment. As mentioned in the article, I got similar feedback from a good number of industry representatives.

      While I understand this point of view (when something is working, don’t tweak it) and respect it, I believe that balancing the “ancient” (I like the term!) with the modern is becoming more essential, for two reasons.

      When one can do better, why not: Even today, I see quite a few analysts refusing to move beyond logistic regression even when advanced methods add substantial improvements to their problem. While trying out something new is less comfortable, with today’s tools I believe it is fairly easy to learn and explore the power of modern techniques. So we do need data scientists in the system who can go beyond the traditional with ease.

      Problems are genuinely getting tougher: Pleasantly and surprisingly, users are taking their expectations beyond the traditional realm. Recently, I was asked by a software services major to solve a problem with 7 rows and 38,500 attributes! Around the same time, I was working on another problem, for a product vendor, with 500 dimensions and 100 million rows. Two extremes in data sizes! These complex problems are not one-off cases, and their frequency is increasing (at least in my humble experience).

      So, while I agree that techniques like gradient boosting machines or conditional random fields sound sexy and scary at once, I find myself using them to solve real-world problems more often than not! They are definitely not confined to academic discussions. Looking at Kaggle, I get the feeling that many more people are going in that direction.

      So, I feel the time has come to make these techniques part of the standard data science curriculum. I also advise practicing data scientists to equip themselves with these skills. I believe you will end up using them gainfully in real-world projects sooner rather than later.
