Education

What should one learn to be a data scientist?

  • “In a corporate context, a complex problem is one where the data is messy, unstructured, and large. If it is cleaned, the analytics become fairly simple. Typical projects are customer scoring, customer lifetime value, etc.” Very interesting. This is so true in industries like telecom, where the main problem is getting the data itself.

  • Sri

    Looks like this article is a good advertisement for INSOFE. I may be wrong, so let's not debate this.
    Analytics is all about analyzing data, facts and other ancillary information using certain tools and methodologies to arrive at inferences or findings that enable appropriate decisions. In business analytics the focus should be on analysis that helps identify patterns and behaviors, or that produces findings a business can act on. Businesses are not trying to write a thesis on whether big data is here to stay or whether Hadoop is better than SAS. All they want is value for money: using analytics to generate dollars by taking good business decisions backed by scientific data and analysis. The results or implementation may not achieve 100% accuracy or success, but even if they get things right in 75-80% of their projects, the gains on these will offset the small errors of judgement and still deliver net gains.

    Similarly, academics are also lost in the methodology and process, and not really thinking about what works in industry and gets implemented to solve real-life problems. Most real-life problems may not require the most complex algorithm. Linear programming, elasticity, time series, etc. may sound like ancient stuff, but they are well proven and have stood the test of time. They provide true insights and solve problems, but they don't sound as sexy as big data or multivariate correlation. At the end of the day, what works and solves real problems wins the race, not something that is just fancy and sounds mathematically challenging.

    • Murthy

      Hi Sri:

      Thanks for the comment. As mentioned in the article, I got similar feedback from a good number of industry representatives.

      While I understand and respect this point of view (when something is working, don’t tweak it), I believe that balancing the “ancient” (I like the term!) with the modern is becoming more essential, for two reasons.

      When one can do better, why not: Even today, I see quite a few analysts refusing to move beyond logistic regression even when advanced methods give substantially better results on their problem. While trying out something new is less comfortable, I believe that with today’s tools it is fairly easy to learn and explore the power of modern techniques. So we do need data scientists in the system who can go beyond the traditional with ease.

      Problems are genuinely getting tougher: Pleasantly, and surprisingly, users are pushing their expectations beyond the traditional realm. Recently, a software services major asked me to solve a problem with 7 rows and 38,500 attributes! During the same period, I was working on another problem for a product vendor that had 500 dimensions and 100 million rows. Two extremes in data size! These complex problems are not one-off cases, and they are becoming more frequent (at least in my humble experience).

      So, while I agree that stuff like gradient boosting machines or conditional random fields sounds sexy and scary at once, I find myself using them to solve real-world problems more often than not! They are definitely not confined to academic discussions. Looking at Kaggle, I get the feeling that many more people are going in that direction.

      So, I feel that the time has come to make these techniques part of the traditional data science curriculum. I also advise practicing data scientists to equip themselves with these skills. I do believe that they will end up using them gainfully in real-world projects sooner rather than later.