Data Science Skills Study 2020 – By AIM and AnalytixLabs

Published on August 17, 2020

by Siddhartha Thomas

Analytics India Magazine (AIM), in association with AnalytixLabs, ran the Data Science Skills Survey over June and July 2020 to get an in-depth perspective on the key trends in the tools and models deployed across sectors.

AIM has now published the findings of the survey in this report. Please access last year’s Study here.

This survey provides a direct perspective on the Data Science skills and domains that AI, Analytics, and Machine Learning practitioners are working on and how organizations and Data Science personnel stay ahead of the data science pack. This report will benefit prospective job seekers, including students, and personnel seeking to transition to the Data Science function – it will help this broad audience to understand the skills, technologies, and platforms in demand across organizations.

Key Findings

The Data Science and Analytics recruitment survey ran from the end of June to mid-July 2020. The diversity of responses reveals the following:

Languages Used for Statistical Modelling

The data scientists and practitioners who were surveyed revealed that the top language preferred for Statistical Modelling is Python, favoured by 65.2% of the respondents.

This is followed by a steep drop with 16.7% of the respondents preferring R as the language for statistical modelling. Python and R dominate the preference scale, with a combined figure of 81.9% utilization for statistical modelling among those surveyed.

SQL is third in preference at 5.8%. This is followed by MATLAB at 4.8%, and the Scala / Java / PySpark pack of languages at 4.1%. SAS and Julia complete the list, each preferred by 1.7% of the practitioners surveyed.
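
The arithmetic behind these figures can be checked with a short sketch. It assumes the shares are exactly as quoted in the text above, and reads the 1.7% figure as applying to SAS and to Julia separately (the shares only add up to 100% under that reading). The variable names are illustrative and not part of the survey.

```python
from decimal import Decimal

# Shares (in percent) as quoted in the text above.
# Decimal is used so that sums of one-decimal percentages stay exact.
language_share = {
    "Python": Decimal("65.2"),
    "R": Decimal("16.7"),
    "SQL": Decimal("5.8"),
    "MATLAB": Decimal("4.8"),
    "Scala / Java / PySpark": Decimal("4.1"),
    "SAS": Decimal("1.7"),    # assumption: 1.7% each for SAS and Julia
    "Julia": Decimal("1.7"),
}

# Combined share of the two leading languages.
combined_python_r = language_share["Python"] + language_share["R"]

# Sanity check: do the reported shares cover all respondents?
total = sum(language_share.values())

# Languages ordered from most to least preferred.
ranked = sorted(language_share, key=language_share.get, reverse=True)

print(f"Python + R: {combined_python_r}%")
print(f"Total: {total}%")
print("Top three:", ranked[:3])
```

The same check can be applied to any of the other breakdowns in this report by swapping in the relevant figures.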

Data Science Models Used Across Work

While an overwhelming majority of practitioners prefer Python as the language for statistical modelling, the preference for Data Science models is not lop-sided. Logistic Regression tops the list, with 23.5% of the respondents utilizing the model for their work. Random Forests and XGBoost follow, each preferred by 15.4% of the data scientists and practitioners surveyed.

Artificial Neural Networks (ANN) are preferred by 13.6% of respondents. Clustering and Support Vector Machines are each preferred by 8.7% of the respondents. Decision Trees stand at 7.2% and Principal Component Analysis at 4.5%. Bayesian Techniques complete the list with a 3% preference across the data science community.

Preferred Python General Purpose Libraries

While Python is the language of choice for statistical modelling among the Data Science community, the preference for Python General Purpose Libraries follows a flatter curve. Pandas is the library of choice, preferred by 20.1% of the respondents. This is followed by NumPy at 13.8% and Scikit-learn at 13.6%. Matplotlib has a preference of 11% among the data scientists surveyed.

The proportion of respondents who prefer TensorFlow is 6.8% and Keras is 6%. NLTK, XGBoost, PyTorch, SciPy, and Statsmodels each have a preference below 6% among respondents: 5.7%, 5.4%, 4.5%, 3.9%, and 3.4% respectively. A combination of Other libraries has a preference of 5.8%.

Python Frameworks Utilized for AI / Deep Learning Projects

Among Python frameworks utilized for Artificial Intelligence (AI) and Deep Learning projects, TensorFlow is the preferred framework for 64.9% of respondents. PyTorch is utilized by 27.2% of the data scientists surveyed. These two frameworks have a combined preference of 92.1% among respondents.

Theano and Caffe are the other frameworks, preferred by 4.6% and 3.3% of the data science community respectively. These figures, both below 5%, highlight the concentration of preference for TensorFlow and PyTorch.

Other frameworks make up the remainder of the list, with a combined preference of 4.6%.

Preferred Platforms & Tools to Develop AI Models

Amongst all the platforms and tools used to develop AI models, Open Source platforms and tools are utilized by an overwhelming majority of 83.1% of the data scientists surveyed. The flexibility, agility, and the ability to scale and grow, give the Open Source platforms this high level of preference amongst data scientists and professionals.

Licensed tools are at the next position of preference among 10.3% of the respondents. Custom-made platforms are utilized by 4.4% of the community. A combination of Other platforms has a preference of 2.2%.

Preferred Processing Units to Develop AI / ML Models

Processing Units are at the heart of the development of AI and ML platforms as they impart the computing power to train and scale the Deep Learning models. Of all the high-performance processing units available, data scientists are increasingly preferring TPUs and GPUs for the neural network workloads.

46.4% of the respondents prefer the range of Google Cloud TPUs to develop and scale the models. Nvidia GeForce GTX 10 Series GPU is utilized by 21.4% of the respondents. In third position, a combination of Low-end processing unit models is favoured by 17.9% of the data scientists surveyed.

The next three processing units in terms of preference are from the Nvidia range of GPUs. Nvidia Tesla V100 Series GPU and Nvidia GeForce GTX 9 Series GPU are both preferred by 15.5% of the respondents. Nvidia GeForce RTX 20 Series GPU is the preferred processing unit of 14.3% of the respondents.

A combination of other High-end processors is preferred by 10.7% of the respondents. FPGA (Field Programmable Gate Array) is the last in the list of preference of processing units, with 4.8% of the respondents preferring this type.

Preferred Platforms to Share Code

While there are many platforms used by data scientists and AI / ML practitioners to share code, Git is the platform of preference with 74.7% of respondents. This highlights the popularity of the cloud-based platform amongst the Data Science community. 4.8% of respondents prefer Sublime Text to share code, while 3.6% of respondents still utilize on-premise servers across their organizations to share code.

Codepen is preferred by 2.4% of respondents. Codeshare and Coda are both favoured by 1.2% of the data scientists surveyed. A combination of other Cloud Platforms is utilized by 12% of the respondents.

Integrated Development Environments (IDEs) To Streamline Processes

Integrated Development Environments (IDEs) have emerged as the fundamental tools for software development, leading to new processes and integration platforms that provide programmers with higher levels of flexibility.

Of all the IDEs, Jupyter Notebook is the most preferred among respondents, with 31.3%. PyCharm is next, favoured by 22% of the data scientists surveyed. RStudio is favoured by 17.5% of the data science practitioners, while 7.2% of the data science practitioners utilize Spyder.

Google Cloud Shell is preferred by 5.2% of the practitioners. The remaining platforms are preferred by less than 5% of the respondents, including IDLE (3.8%), Observable (1.7%), JSFiddle and Repl.it (1.4%), and a combination of other platforms (2.1%).

Preferred Neural Network Architectures

A neural network architecture defines how a machine learning model is structured and trained, so that a computer learns to perform a task by analyzing training examples. A neural network can consist of thousands or even millions of simple processing nodes that are densely interconnected. Convolutional Neural Networks (CNN) are preferred by 24.4% of the participants, and 9.3% of the participants prefer LSTM.

8.3% of the data scientists surveyed prefer GoogLeNet and Inception, and the same proportion prefer BERT. Recurrent Neural Networks (RNN) and Network-in-Network are both preferred by 7.3% of respondents. 6.7% of survey respondents favour ResNet, while 5.7% utilize RCNN. VGG Net and Seq2Seq are each preferred by 4.7% of the respondents. Feedforward Neural Networks are favoured by 3.6% of respondents.
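
To make the idea of a processing node that learns from training examples concrete, here is a minimal sketch of a single node (a perceptron) trained on a toy task. It is not drawn from the survey: the task (logical AND), the function names, and the integer learning rate are all illustrative assumptions, and the architectures listed above combine many such nodes in far more elaborate ways.

```python
# A single processing node: weighted sum of inputs plus a bias,
# passed through a step function that outputs 0 or 1.
def node_output(weights, bias, inputs):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_node(examples, epochs=20, learning_rate=1):
    """Perceptron learning rule: nudge weights towards each missed target."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - node_output(weights, bias, inputs)
            # No change when the prediction is already correct (error == 0).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy training examples (hypothetical): the logical AND of two inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_node(AND_EXAMPLES)
predictions = [node_output(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)
```

After training, the node reproduces the target for each of the four examples. Real networks replace the step function with differentiable activations and train many layers of nodes with gradient-based methods.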

Preferred Cloud Service Platforms to Develop AI / ML Models

Of the numerous Cloud Service Platforms (CSPs) that are utilized to develop AI and ML models, the data science community prefers Amazon Web Services (AWS) the most – with 39.5% preferring this platform. Google Cloud Platform (GCP) is preferred by 20.4% while MS Azure by 19.7% of the community.

Oracle Cloud is utilized by 9.2% of the respondents, while IBM Bluemix by 7.2% of the community. Alibaba Cloud and HPE Aruba complete the list in terms of utilization by 3.3% and 0.7% respectively of the data scientists surveyed.

Preferred Business Intelligence (BI) or Dashboarding Tools

Across the Data Science community, a variety of BI and dashboarding tools are utilized depending on functionality, availability, and the overall basket of other tools used in conjunction with the BI tool.

MS Power BI and Tableau are favoured by 26% and 24.9% of the data science community respectively. MS Excel, the spreadsheet software bundled with the MS Office suite, is utilized by 14.2% of the community.

The next 3 in terms of preference cover BI tools developed by Software & Technology companies as part of their wider offerings. IBM Cognos Analytics and SAP BusinessObjects are respectively favoured by 9.5% and 5.9% of the respondents, while Oracle Analytics Server is utilized by 5.3% of the data scientists surveyed.

The remaining BI tools are all utilized by less than 5% of the community, with QlikView by 4.1%, Informatica and Spotfire both by 3.6%, and SAS Visual Analytics by 3% of the data scientists surveyed.

Preferred Big Data/Database Tools

An assortment of Big Data and Database tools is utilized by data scientists, with MySQL the most favoured tool among 26.6% of the data scientists surveyed. The combined array of Hadoop tools, including Hive, HDFS, and Impala, ties for first place at the same proportion of 26.6%.

After MySQL and Hadoop there is a significant drop in preference, with MongoDB utilized by 12.8% of the data science community surveyed.

There is another significant drop in preference after MongoDB, with Snowflake preferred by 4.8% of the respondents. Amazon Redshift and BigQuery both have a preference of 4.3%. The remaining tools, Teradata, Cassandra, Vertica, and Kafka, each have a preference of less than 4% among the community: 3.7%, 3.2%, 3.2%, and 1.6% respectively. An assortment of other tools has a preference of 9%.

Learning Resources Utilized to Upskill

While many Learning Resources are utilized by practitioners to upskill in the particular domain of Data Science, Massive Open Online Courses or MOOCs are utilized the most, by 15.8% of the data scientists surveyed. Networking via Social Media, especially via LinkedIn, is gaining as a medium to upskill and is favoured by 15.2% of the respondents.

Online Certifications and Courses, once not preferred by professionals and students alike, provide convenience in terms of pace and place of learning and are thus preferred by 14.6% of the data scientist community. Online videos hosted on such platforms as LinkedIn and YouTube are also gaining preference from the respondents, with 11.9% favouring this learning medium.

The traditional format of learning, Classroom training imparted by Private Institutes, is utilized by 6.8% of the data science community. eBooks, including other formats of Traditional Media, are utilized by 6.5% of the respondents. Online tech forums, such as GitHub, are fast gaining popularity among the data science community and are preferred by 6.3% of the data scientists surveyed.

Another traditional format, University Certifications and Courses, is at the lower end of the spectrum of preference – 5.7%. Participation in Hackathons and Attending Workshops & Conferences are respectively preferred by 5.4% and 4.8% of the community. Other formats of learning resources make up 7.1% in terms of preference.

Organizational Demographics of Respondents

The organizational demographics of the data science community surveyed cover the break-up by the Type of Company, the Sector or Industry affiliation, and Years of Experience.

Type of Company

As expected, most of the respondents were from MNC IT companies at 37%. However, the diversity of responses is exemplified by respondents from Start-up companies, represented by 21.7% of the respondents. Consulting companies are represented by 13% of the respondents.

Captive BFSI companies are represented by 8.3%, while Domestic IT and Boutique Analytics companies are each represented by 4.8% of respondents. Captive Pharma firms are represented by 3.6%, while Domestic Firms are represented by 1.2% of the respondents.

Industry or Sector Affiliation

The Industry / Sector affiliation of the data scientists surveyed reveals a concentration towards the IT / ITES sector, with 36% of the respondents. This is followed by the Technology sector, with an affiliation of 18%. This sector covers both Software and Hardware technologies.

The BFSI sector is represented by 16% of the respondents. This is followed by the eCommerce & Retail and the Pharma and Healthcare sectors, which are each represented by 7% of the respondents.

The Digital Media & Entertainment, Travel & Hospitality, and the Automotive/Industrials sectors are each represented by 4% of the respondents. The FMCG and Consumer Electronics sectors complete the list in terms of Industry affiliation with a representation of 2% each.

Experience of Respondents

The experience break-up of the respondents reveals that most of the respondents are from the 0-3 years’ experience level at 70%. 11% of the respondents are from the 3-6 years’ experience level, while 7% of the respondents are from 6-10 years’ experience level. The 10-15 years’ experience level is represented by 5% of the respondents, while the 15+ years, leadership level is represented by 6% of the respondents.

Conclusion

As the overall Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments, it is important that data scientists and the wider community of AI & Analytics practitioners are aware of the skills and tools that the broader community is working on.

Some of the platforms and tools have a wider spread in terms of preference across the Data Science community. In contrast, the graphs for Languages Used for Statistical Modelling, Python Frameworks, Processing Units to Develop AI & ML Models, and Type of Tools used to Develop Models reveal that the data science community has a greater preference for one or two of the top technologies across these categories.

Hence, AI and analytics practitioners seeking to upskill in certain areas could look at the top tools or platforms to upskill in: Python for Statistical Modelling, TensorFlow for Python Frameworks, and Git for Sharing code, among others.

However, each tool or platform has its own benefits in terms of output, data models, or functional performance. Hence aspiring data scientists should look at the holistic data modelling requirements before upskilling across certain tools and platforms.

Download the complete report here.

Siddhartha Thomas

"Siddhartha is an industry research professional with areas of interest across the Digital Media, Traditional Media, and Technology sectors. Siddhartha studies and researches organizations and industries from the perspective of innovation, finance, and strategic management. He has extensive research and knowledge management experience across numerous large and small organizations."