What Is Dark Data Within An Organisation?

The digital universe, consisting of all the data we create annually, is currently doubling in size approximately every twelve months. According to research by IDC, the total is expected to reach 44 zettabytes by 2020. That’s 44 trillion gigabytes, containing nearly as many digital bits as there are stars in the universe. Likewise, it is predicted that by 2030 more than 90% of this data will be unstructured. This explosion of data far exceeds our capacity to actually use it. Nearly all companies (and even individuals) store data that they will never access again, simply because cloud storage is now cheap and available to everyone.
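The conversion from zettabytes to gigabytes can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units; the function name is made up for illustration.

```python
# Assumption: decimal (SI) units, i.e. 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes.
BYTES_PER_ZETTABYTE = 10**21
BYTES_PER_GIGABYTE = 10**9


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes (hypothetical helper)."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


gigabytes = zettabytes_to_gigabytes(44)
print(f"{gigabytes:,} GB")  # 44,000,000,000,000 GB, i.e. 44 trillion gigabytes
```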

Only a small fraction of all that data is in a traditional, structured form that organisations can easily access and use. A more substantial part of big data is unstructured; some of it is at least accessible, but the vast majority is hidden altogether, going unseen and unused. This is what we call dark data. The growing flow of machine and sensor data generated by the Internet of Things, and the massive stores of raw data found in the unexplored depths of the deep web, are all part of dark data.

It is clear that the majority of the data being created is dark, unstructured data. Dark data is a concept coined by the IT consulting firm Gartner, which defined it as the data assets organisations collect, process and store during normal business activities but commonly fail to apply for other purposes.


In the universe of information assets, data may be deemed dark for a number of reasons: because it’s unstructured, or because it’s behind a firewall. It may be dark due to its speed or volume, or because people simply have not made the connections between different data sets. It may also be dark because it does not sit in a relational database, or because, until recently, the techniques required to leverage it effectively did not exist. Dark data is often text-based and stays within company firewalls, but remains very much untapped.

For instance, supply chain complexity is a significant challenge for organisations. The supply chain is a data-driven industry that spans a network of global suppliers, distribution channels and customers. This industry churns out data in huge volumes, yet only an estimated 5% of it is being used. So while 95% of such data is not being utilised for analytics, it presents an opportunity for big data technologies to bring this dark data to light.

To date, organisations have explored only a small fraction of the digital universe for analytic value. Dark analytics is about turning dark data into intelligence and insight that a company can use. It seeks to overcome this limitation by casting a much wider data net that can capture a mass of currently untapped signals. Few organisations have been able to exploit non-traditional data sources such as audio and video files.

Only very late in the last decade, with advances in machine learning and image recognition techniques, did this situation begin to change. Video analytics APIs from open source tools can now go through every scene in a video and identify particular elements in those scenes, such as a dog, a birthday cake, a mountain or a house. Recent improvements in computer vision, pattern recognition and cognitive analytics are making it possible for companies to draw meaning from those untapped sources and derive new insights through dark analytics.

Managing Dark Data

Businesses need to improve their data management strategies, use the right tools to identify which data is valuable, and remove ‘dark data’ from their data centres. On average, 52% of all data stored by firms around the world is ‘dark’, as those responsible for managing it have little idea about its content or usefulness. Much has been said about the economic cost of dark data, but the environmental cost has, so far, often been ignored.

A survey conducted by Vanson Bourne for Veritas, The Value of Data, found that on average more than half (52%) of all data in companies remains untagged or unclassified. This means those organisations have no visibility, oversight or monitoring over huge volumes of potentially business-critical data, which makes that data a likely target for hackers.

The IT industry must get past this hurdle, since data volumes are growing each year. Businesses need to understand this type of data and the storage policies around it. Data mapping and data discovery are the first steps in understanding how data flows through an organisation. Gaining visibility and insight into where critical datasets are stored, who has access to them and how long they are held is a decisive first step in managing dark data.
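As a rough illustration of data discovery, the sketch below walks a directory tree and records where each file lives, how large it is, who owns it and how long it has gone unmodified. It is a minimal example using only the Python standard library, not a description of any specific product. The names `FileRecord`, `build_inventory` and `find_stale` and the 365-day threshold are assumptions made for this example.

```python
import os
import time
from dataclasses import dataclass

SECONDS_PER_DAY = 86400


@dataclass
class FileRecord:
    """One inventory entry (hypothetical structure for this sketch)."""
    path: str
    size_bytes: int
    days_since_modified: float
    owner_uid: int  # numeric owner id; reported as 0 on Windows


def build_inventory(root, now=None):
    """Walk `root` and collect basic metadata for every readable file."""
    now = time.time() if now is None else now
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            age_days = (now - info.st_mtime) / SECONDS_PER_DAY
            records.append(
                FileRecord(full_path, info.st_size, age_days, info.st_uid)
            )
    return records


def find_stale(records, max_age_days=365):
    """Return files untouched for longer than `max_age_days` (assumed threshold)."""
    return [r for r in records if r.days_since_modified > max_age_days]
```

Calling `find_stale(build_inventory("/some/share"))` would list candidate dark data under that path. A real deployment would also need to cover databases, cloud buckets and access-control lists, which this sketch does not attempt.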

Proactive data management enables firms to gain visibility into their data, storage and backup infrastructure, take control of data-related risks and make well-informed decisions with data.

According to Gartner, through 2021 more than 80% of organisations will fail to create a consolidated data security policy across their silos, leading to potential noncompliance, security breaches and business liabilities. To successfully manage data growth and security, IT managers will need to deploy the right tools and train employees to avoid data hoarding. But with the average large tech company storing millions of data files, manually classifying and tagging data is beyond what staff can realistically do by hand. Businesses should, therefore, implement data management tools backed by machine learning, algorithms, policies and processes, which can help them manage their data and draw valuable insights from it.

If global businesses continue to store huge amounts of ‘dark data’ on their premises and in the cloud, they also risk creating a honeypot for cybercriminals, according to experts. Classification tools allow companies to rapidly scan and tag data to make sure that sensitive information is adequately managed and protected, regardless of where the data is located. This wider visibility into data helps companies comply with increasingly strict data protection laws, which require specific retention policies to be implemented and enforced across a company’s entire data estate.
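To make the idea of scanning and tagging concrete, here is a minimal rule-based sketch in Python. It is only an illustration: commercial classification tools are far more sophisticated, and the two regular expressions, the tag names and the `classify_text` function are assumptions invented for this example.

```python
import re

# Illustrative patterns only (assumptions); real tools use much richer rule sets.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by spaces or hyphens.
    "payment_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def classify_text(text):
    """Return the set of sensitivity tags whose pattern appears in `text`."""
    tags = {label for label, pattern in PATTERNS.items() if pattern.search(text)}
    return tags or {"unclassified"}


print(classify_text("Contact jane@example.com for the invoice"))
print(classify_text("Minutes of the weekly planning meeting"))
```

The first call tags the text as containing an email address, while the second falls back to `unclassified`. Tags like these could then drive retention or protection rules, whichever storage system the file sits in.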

To keep pace with the large-scale data explosion, companies can automate the analytics, tracking and reporting required to provide organisational accountability for dark data, files and information security. Companies might need to manage zettabytes of data and billions of files, which is why their data analytics methods should integrate with archiving, backup and security tools to prevent data loss and ensure policy-based data retention. Certain tools automatically orchestrate the recovery of data wherever it resides, assure 24/7 availability of business-critical apps, and equip organisations with the insights they require to comply with evolving data regulations.

Vishal Chawla
Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast, Simulated Reality, featuring tech leaders, AI experts, and innovative startups of India.
