Leveraging Data Virtualization to Accelerate Machine Learning Initiatives

Today, machine learning is used across all major industries, including manufacturing, retail, healthcare, travel, financial services, and energy. Top use cases include predictive maintenance and condition monitoring in manufacturing, dynamic pricing in travel, and upselling and cross-channel marketing in retail. In fact, according to Forbes, “57% of enterprise executives believe that the most critical growth benefit of AI and ML will be improving customer experiences and support.” However, enabling machine learning initiatives requires sophisticated infrastructure that is adaptable and can quickly integrate and process large amounts of data from disparate sources. Data is often scattered across multiple data platforms, tools, applications and processing engines, and establishing and maintaining such an infrastructure can be complicated and costly.

It’s Time for a New Approach

Many organisations are looking for new ways to store their data, often via a data lake or data lakehouse. The purpose of a data lake is to collect large volumes of data from multiple, disparate sources, including data of different types (both structured and unstructured), and to store this data in its original format. However, replicating data from the system of origin can be slow and costly, and sometimes only a small subset of the relevant data ends up in the data lake. To leverage this data for machine learning, it first needs to be integrated. With the increasingly distributed nature of the data ecosystem, data integration becomes more complex and harder to achieve in a reasonable time frame using traditional methods. Data is typically spread across a hybrid of cloud providers and on-premises systems, making access and integration even more challenging. According to the Total Economic Impact (TEI) of Data Virtualization survey conducted by Forrester, Data Scientists spend about 30% of their time on data wrangling and data curation.

The Benefits of Leveraging Data Virtualization

An alternative to moving data from multiple source systems into a new, centralized repository is data virtualization. It provides real-time, logical, consolidated views of data without replication, allowing the data to remain at its origin. The data can reside on-premises or in the cloud and can be of differing types and structures. For Data Scientists, this means access to data in a truly self-service and flexible way. The Data Scientist no longer needs to be concerned with the technical complexities of the underlying data sources or how the data is joined and combined. The data virtualization layer hides this complexity while providing the flexibility to model the data in different ways for different business requirements, including data science and advanced analytics purposes.
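The core idea can be illustrated with a minimal sketch. Two hypothetical "sources" (an SQLite table of customers and a CSV order feed) are exposed through a single logical view that is resolved at query time, with no data copied into a central store. This is only an illustration of the concept; real data virtualization platforms handle query federation, optimization and security at far greater scale.

```python
# Minimal, illustrative sketch of a query-time "virtual" view over two
# disparate sources, with no replication into a central repository.
# All names and data here are hypothetical.
import csv
import io
import sqlite3

# Source 1: customer records in a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "Acme"), (2, "Globex")])

# Source 2: order events delivered as a CSV feed.
orders_csv = "customer_id,amount\n1,250\n1,100\n2,75\n"

def logical_orders_by_customer():
    """Join both sources on demand -- the consolidated logical view."""
    totals = {}
    for row in csv.DictReader(io.StringIO(orders_csv)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0) + int(row["amount"])
    return {name: totals.get(cid, 0)
            for cid, name in db.execute("SELECT id, name FROM customers")}

print(logical_orders_by_customer())  # {'Acme': 350, 'Globex': 75}
```

Note that the join happens only when the view is queried: neither source is copied, and each remains the system of record for its own data.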



By providing a single access point for all corporate data assets, regardless of location and format, data virtualization delivers real data agility. Data Scientists and Data Engineers can apply functions on top of the physical data to obtain different logical views of the same physical data, without creating additional physical copies of the source data. This offers a fast and inexpensive way to address many of the specific data challenges Data Scientists face when integrating data for machine learning. Best-of-breed data virtualization tools offer a searchable Data Catalog that includes extended metadata for each data set, such as tags, column descriptions and commentary, as well as active metadata such as who uses which data set, when, and how. Knowledge of data usage is key to better serving the data needs of the business.
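The "functions on top of physical data" idea can be sketched as ordinary function composition: several logical views are derived from one physical data set, computed lazily at query time rather than materialised as copies. The column names and views below are hypothetical.

```python
# Sketch: multiple logical views over one physical data set, derived by
# composing functions instead of materialising copies. Illustrative only.
physical_rows = [
    {"region": "EU", "revenue": 120, "pii_email": "a@x.com"},
    {"region": "US", "revenue": 200, "pii_email": "b@y.com"},
    {"region": "EU", "revenue": 80,  "pii_email": "c@z.com"},
]

def masked_view(rows):
    """View for data scientists: PII columns stripped, rows untouched."""
    return ({k: v for k, v in r.items() if not k.startswith("pii_")}
            for r in rows)

def regional_totals_view(rows):
    """Aggregated view for reporting, computed at query time."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["revenue"]
    return totals

# Views compose: an aggregate over the masked view, one physical copy.
print(regional_totals_view(masked_view(physical_rows)))  # {'EU': 200, 'US': 200}
```

Both views read the same underlying rows, so a correction to the physical data is immediately reflected in every logical view, which is the agility benefit the article describes.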

Keeping it Simple

Data virtualization brings clarity and simplicity to the data integration process. Data is everywhere, so regardless of whether it is stored in a relational database, a Hadoop cluster, a SaaS application, a multi-dimensional cube or a NoSQL system, data virtualization serves it up in a consistent way. Exposing the data according to a consistent model or representation avoids creating separate pools of data that could yield different results. Data virtualization also promotes reusability: the responsibilities of IT data architects/engineers and data scientists can be clearly and cost-effectively separated. Reusable logical data sets can be developed to expose information in different ways, and the data can be standardized as it is brought together.
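Serving heterogeneous sources "in a consistent way" is essentially an adapter pattern: each connector translates its native format into one common row shape, so consumers never see source-specific structure. The connector classes below are hypothetical, not the API of any real product.

```python
# Sketch of the adapter idea behind a consistent data representation:
# each source-specific connector emits plain dicts, so downstream code
# is identical whatever the backing store. Illustrative names only.
import json
import sqlite3

class SqlSource:
    """Adapter over a relational table."""
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (sensor TEXT, value REAL)")
        self.db.execute("INSERT INTO t VALUES ('temp', 21.5)")
    def rows(self):
        cur = self.db.execute("SELECT sensor, value FROM t")
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, r)) for r in cur]

class JsonDocSource:
    """Adapter over a document store's JSON payload."""
    raw = '[{"sensor": "humidity", "value": 40.0}]'
    def rows(self):
        return json.loads(self.raw)

def unified(*sources):
    """One consistent representation, regardless of the backing store."""
    return [row for s in sources for row in s.rows()]

print(unified(SqlSource(), JsonDocSource()))
```

Adding a new source type then means writing one adapter, not rewriting every consumer, which is where the reusability and separation of responsibilities come from.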

The Forrester TEI survey found that data preparation tasks can be reduced by 67%, accelerating data science work. As the adoption of machine learning and artificial intelligence continues to grow, data lakes will become more prevalent, and data virtualization will become increasingly necessary for optimizing the productivity of data scientists and the data-intensive initiatives they work on.

However, the biggest benefit will come from the data integration perspective, where a considerable amount of time is spent. More time can be focused on the scientific methods for extracting actionable insights from data rather than on data engineering and management tasks. By simplifying the way data is accessed, data virtualization simplifies machine learning initiatives, and the entire organisation enjoys the full benefits of cost-effectively gleaning real-time business insights.

Ravi Shankar
Ravi Shankar is the senior vice president and chief marketing officer at Denodo. He is responsible for Denodo's global marketing efforts, including product marketing, demand generation, field marketing, communications, social marketing, customer advocacy, partner marketing, branding, and solutions marketing. Ravi brings to his role more than 25 years of proven marketing leadership and product management, business development, and software development expertise within both prominent and emerging enterprise software leaders such as Oracle, Informatica, and Siperian. In addition, his profound knowledge of data-related technologies facilitates increased global awareness of the Denodo Platform and accelerates its growth. He holds an MBA from the Haas School of Business at the University of California, Berkeley, and an MS and an Honors BS degree in computer science.
