How To Build Analytics Platform At Petabyte Scale, Explains Sharad Agrawal, CEO and Founder, Sprinkledata

“The popularity of data science has grown drastically in the last five years, while big data and artificial intelligence have picked up in the last 2-3 years. With these trends on the rise, companies are keen to leverage data assets such as customer data and machine-generated data,” noted Sharad Agrawal, CEO and Founder, Sprinkledata, during Cypher 2017.

While this is true, the challenge is to store the huge amounts of data generated over the years in a way that can be accessed easily. The volume can reach petabyte scale. To illustrate, Agrawal explains that if an app or website has 10 billion daily active users, the activity and engagement data they generate, accumulated over 24 months, would amount to a petabyte.
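As a rough back-of-the-envelope check on the figures quoted above, the sketch below computes the average data volume per user per day that those numbers imply. It assumes a decimal petabyte (10^15 bytes) and 30-day months; the function name is hypothetical and used only for illustration.

```python
PETABYTE = 10**15      # assumption: decimal petabyte, in bytes
DAYS_PER_MONTH = 30    # assumption: average month length

def implied_bytes_per_user_day(total_bytes, daily_active_users, months):
    """Average number of bytes one active user contributes per day."""
    user_days = daily_active_users * months * DAYS_PER_MONTH
    return total_bytes / user_days

# Figures as quoted in the talk: 10 billion daily active users, 24 months.
per_user_day = implied_bytes_per_user_day(PETABYTE, 10 * 10**9, 24)
print(f"{per_user_day:.1f} bytes per user per day")
```

The same function can be run with an organization's own user count and retention window to see what per-user volume a given total implies.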

Sales and marketing analytics is widely understood to be important and can make a difference in customer retention and acquisition, yet only 2% of people say they are happy with the analytics they have in their organization. Agrawal states that there could be several reasons for this. First, the data is too large or fragmented, and processing it is a challenge. Second, the data is accessible to only a select set of people, and accessing it costs the organization money and effort. There may be further challenges, both system- or platform-driven and people-related.

“Platform challenges arise when you are collecting large amounts of data. For pipelining, you may use big data systems such as Spark or Hadoop, feed the data into a warehouse and apply a visualization tool on top. In such cases, the warehouse may not be able to take all the data beyond terabyte scale. Here lies the mismatch in scale and speed between the systems, as they have to prune the data, which may result in the loss of data,” he said.
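The information loss that pruning causes can be illustrated with a minimal sketch. The event records, field names and the daily roll-up below are hypothetical and not taken from the talk; they simply show that once raw events are aggregated to fit a smaller store, questions that need the dropped detail can no longer be answered from it.

```python
from collections import Counter

# Hypothetical raw events as they might sit on a big data cluster.
raw_events = [
    {"user": "u1", "day": "2017-09-01", "device": "mobile", "action": "view"},
    {"user": "u1", "day": "2017-09-01", "device": "web", "action": "purchase"},
    {"user": "u2", "day": "2017-09-01", "device": "mobile", "action": "view"},
    {"user": "u2", "day": "2017-09-02", "device": "mobile", "action": "purchase"},
]

def prune_to_daily_counts(events):
    """Keep only the number of events per day, dropping all other fields."""
    return Counter(event["day"] for event in events)

daily_counts = prune_to_daily_counts(raw_events)

# A question planned for in advance is still answerable after pruning.
print(daily_counts["2017-09-01"])

# A new question (purchases per device) needs the raw events,
# because the pruned table no longer carries device or action.
purchases_by_device = Counter(
    e["device"] for e in raw_events if e["action"] == "purchase"
)
print(dict(purchases_by_device))
```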

A few approaches for a typical analytics project to overcome these challenges:

Agrawal is quick to add that instead of using multiple systems, everything should be done in a single cluster. “Data at the entry stage is large, and we need a large big data cluster for ingesting it, so the big data cluster becomes an important part of the ecosystem. If ingestion and pipelining are done on the big data cluster, a major chunk of the challenge could be overcome,” he says.

Another approach is for analysts to do self-serve reporting directly on the big data platform, instead of requesting reports and going to a dashboard. “This would avoid the mismatch in speed and scalability, as there would be no information loss and the data would be accessible to everyone,” he said. “Self-serve is very important if we want to be in a data-driven industry,” added Agrawal.

Other ways to build an analytics platform at petabyte scale include building new systems that offer the agility to ask new questions, make data accessible, provide an easy-to-use interface and remove platform complexity. Most cases require a user to join data from multiple sources, so the enrichment process should be highly scalable and the turnaround time short.
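As an illustration of the enrichment step described above, the sketch below joins event data from one source with user attributes from another. Both datasets, their field names and the `enrich` helper are hypothetical; at petabyte scale this kind of join would run on a distributed engine rather than in a single Python process.

```python
# Hypothetical source 1: click events from an application log.
events = [
    {"user_id": 1, "action": "view"},
    {"user_id": 2, "action": "purchase"},
    {"user_id": 3, "action": "view"},
]

# Hypothetical source 2: user attributes from a CRM system, keyed by user_id.
profiles = {
    1: {"city": "Bangalore", "plan": "free"},
    2: {"city": "Mumbai", "plan": "paid"},
}

def enrich(events, profiles):
    """Left-join events with profiles; unmatched events get None attributes."""
    missing = {"city": None, "plan": None}
    return [{**event, **profiles.get(event["user_id"], missing)} for event in events]

enriched = enrich(events, profiles)
for row in enriched:
    print(row)
```

A left join is used here so that events without a matching profile are kept rather than silently dropped, which is in keeping with the goal of avoiding information loss.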

Agrawal summarizes by saying that in today’s world, there is a tremendous rise of data science, analytics, machine learning and artificial intelligence, yet we are not able to derive value from these systems. We are capturing huge amounts of data and applying big data technologies, yet the approach we follow is age-old and should be reconsidered in order to make the best use of data assets.


Srishti Deoras

Srishti currently works as Associate Editor at Analytics India Magazine. When not covering analytics news or editing and writing articles, she can be found reading or capturing thoughts in pictures.