Last updated February 5, 2020

How Synthetic Data Startups Spawned An Open Data Economy

Published on July 20, 2018
by Richa Bhatia

If anything holds back the pace of global innovation in an artificial intelligence-led world, it is the torrent of data and the availability of organised datasets. Neuromation, a tech platform for distributed synthetic data for deep learning applications, observes that a huge amount of task-specific data is needed to train AI algorithms and neural networks, the virtual brain of smart robots and self-learning software programs. Organising and labelling these datasets is a lengthy process that typically requires prohibitive amounts of expensive human labour.

According to Yashar Behzadi, CEO of Neuromation, one way to counter data scarcity is to build more online AI development communities like the one he and his team have built at Neuromation. The company’s platform is the AI equivalent of a housing co-op: an ecosystem for AI developers that promotes esprit de corps among members who pool their talents. Neuromation is pegged as a “one-stop-shop” platform that enables non-data scientists to create fairly sophisticated algorithms that were once the exclusive domain of trained AI software engineers.

Shortage Of Training Data And The Rise Of Synthetic Data Platforms

Building a dataset from scratch can prove too costly, yet deep learning cannot work without large datasets. This, in turn, has led to the rise of companies that create curated datasets. For example, Seattle-based Mighty AI provides training data as a service to enable teams to build computer vision models for autonomous vehicles. The company provides sample datasets for a host of tasks such as road and lane-marking annotation, metadata attribution and classification, and full semantic segmentation, among others. San Francisco-based Figure Eight also provides a human-in-the-loop platform to train and test ML models and helps build training data that enterprises can use.

Companies that use training data include Tesco, Oracle, Bossanova, Adobe, AutoDesk and Spotify, among others. Neuromation, on the other hand, makes use of synthetic data, that is, computer-generated data that looks like real data. According to the company’s press note, the startup uses sophisticated algorithms to build simulated, or “synthetic,” data that data scientists can use to train their AI algorithms. Behzadi believes that synthetic data is the key to formal training algorithms. “In fact, some AI applications, such as object recognition, can even be trained almost exclusively with synthetic data,” said Behzadi. “Synthetic data is the game changer that’s going to help the AI industry keep pace with its own phenomenal growth,” he added.
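To make the idea of synthetic training data concrete, here is a minimal Python sketch. It is not Neuromation's method; every name in it (make_synthetic_sample, make_dataset) is hypothetical and the "images" are toy grids of pixel values. It only illustrates why synthetic data sidesteps manual labelling: the generator places the object itself, so the bounding-box label is known exactly.

```python
import random


def make_synthetic_sample(rng, size=16):
    """Render one bright rectangle on a noisy background; return (image, box)."""
    # Background: low-intensity noise with values in [0, 50).
    image = [[rng.randrange(50) for _ in range(size)] for _ in range(size)]

    # Object: a bright rectangle with a random size and position.
    w = rng.randint(3, 6)
    h = rng.randint(3, 6)
    x0 = rng.randint(0, size - w)
    y0 = rng.randint(0, size - h)
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = 255

    # The label comes for free because the generator placed the object.
    # Box format: (left, top, right, bottom), right and bottom exclusive.
    box = (x0, y0, x0 + w, y0 + h)
    return image, box


def make_dataset(n, seed=0, size=16):
    """Build n labelled samples; a fixed seed keeps the dataset reproducible."""
    rng = random.Random(seed)
    return [make_synthetic_sample(rng, size) for _ in range(n)]


dataset = make_dataset(100)
```

Each element of dataset is an (image, box) pair that could be fed to an object-detection training loop. Real platforms render far richer scenes, but the labelling advantage is the same.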

This has spawned a wave of data startups that offer solutions tailored to specific kinds of companies, for example game data or aerial imagery data. Let’s look at some of the reasons behind the rise of these data startups:

As a Neuromation news release points out, the shortage of training data is a pressing problem in the industry.
A Crowdflower data scientist report indicates that the lack of quality datasets is a bottleneck in finishing projects.
Building your own training dataset is a laborious, time-consuming task.
Most companies and big enterprises augment their datasets with external sources.
With the kind of impact data can have on the outcome and the recent debate on data privacy, companies are taking different approaches to building datasets.
Given the recent trends, there has been a spurt of growth in data exchanges and similar platforms that use blockchain technology.
Synthetic data is one way for startups to compete with data-rich companies such as Google. In many ways, this is a data-readiness strategy that helps bootstrapped startups bring AI solutions to the masses.
A lot of vertical AI startups depend on synthetic datasets to solve industry-specific problems.

Is The Adoption Of Synthetic Data Mainstream?

With the technology market getting more competitive, bootstrapped companies are increasingly dependent on proprietary datasets that can be difficult to procure and replicate. A diverse dataset is required to build product functionality that meets the end user’s needs. As tasks get commoditized, low-level tasks such as data generation and building training datasets are becoming critical to modern businesses. This is also a big barrier to entry, and data startups are disrupting business models by providing training datasets to startups and enterprises alike. Synthetic datasets have become part of the data strategy and have also sparked the notion of an open data economy.

Also, the recent GDPR rules and the renewed focus on data privacy have made it impossible for companies to collaborate on data without user consent. Experts believe that even anonymizing data is not enough to avoid the risk of re-identification in the post-GDPR world. Hence, a number of startups are providing simulated datasets which are secure and would not lead to a loss of information in the data.
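The re-identification risk can be illustrated with a small Python sketch. The records, field names and the unique_fraction helper below are all invented for illustration and come from no real dataset or library. The idea is that even with names removed, a combination of ordinary attributes (so-called quasi-identifiers) can single out an individual.

```python
from collections import Counter

# Hypothetical toy records with names already removed ("anonymized").
records = [
    {"zip": "560001", "birth_year": 1985, "gender": "F", "purchase": "laptop"},
    {"zip": "560001", "birth_year": 1985, "gender": "F", "purchase": "phone"},
    {"zip": "560002", "birth_year": 1990, "gender": "M", "purchase": "camera"},
    {"zip": "560003", "birth_year": 1972, "gender": "F", "purchase": "tablet"},
    {"zip": "560003", "birth_year": 1972, "gender": "M", "purchase": "watch"},
]

# Attributes that are not names but can still narrow down who a row is about.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


risk = unique_fraction(records)
print(f"{risk:.0%} of rows are unique on {QUASI_IDENTIFIERS}")
```

In this toy table, three of the five rows are unique on zip, birth year and gender, so anyone who knows those three facts about a person could link that person to a purchase. A synthetic dataset that reproduces only aggregate patterns avoids exposing such rows.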


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years of experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.

