How Synthetic Data Startups Spawned An Open Data Economy


If anything holds back the pace of global innovation in an artificial intelligence-led world, it is the torrent of data and the limited availability of organised datasets. Neuromation, a tech platform for distributed synthetic data for deep learning applications, observes that a huge amount of task-specific data is needed to train AI algorithms and neural networks, the virtual brains of smart robots and self-learning software programs. Organising and labelling these datasets is a lengthy process that typically requires prohibitive amounts of expensive human labour.

According to Yashar Behzadi, CEO of Neuromation, one way to counter data scarcity is to build more online AI development communities like the one he and his team have built at Neuromation. The company’s platform is the AI equivalent of a housing co-op: an ecosystem for AI developers that promotes esprit de corps among its members, who pool their talents. Neuromation is pegged as a “one-stop-shop” platform that enables non-data scientists to create fairly sophisticated algorithms that were once the exclusive domain of trained AI software engineers.


Shortage Of Training Data And The Rise Of Synthetic Data Platforms

Building a dataset from scratch can prove too costly, yet deep learning cannot do without large datasets. This, in turn, has led to the rise of companies that create curated datasets. For example, Seattle-based Mighty AI provides training data as a service to enable teams to build computer vision models for autonomous vehicles. The company provides sample datasets for a host of tasks such as road and lane-marking annotation, metadata attribution and classification, and full semantic segmentation, among others. San Francisco-based Figure Eight also provides a human-in-the-loop platform to train and test ML models and helps build training data that enterprises can use.

Some of the companies that use such training data are Tesco, Oracle, Bossanova, Adobe, AutoDesk and Spotify, among others. Neuromation, on the other hand, makes use of synthetic data: computer-generated data that looks like real data. The press note indicates that, through the use of sophisticated algorithms, the startup builds simulated, or “synthetic,” data which data scientists can use to train their AI algorithms. Behzadi believes that synthetic data is the key to formal training algorithms. “In fact, some AI applications, such as object recognition, can even be trained almost exclusively with synthetic data,” said Behzadi. “Synthetic data is the game changer that’s going to help the AI industry keep pace with its own phenomenal growth,” he added.
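To make the idea concrete, here is a minimal sketch of why synthetic data sidesteps manual labelling. It is a toy illustration only and not Neuromation's method: the function names, image size and rectangle "objects" are all assumptions made for this example. Because the generator itself places the object in each image, the bounding-box label is known exactly and no human annotation is needed.

```python
import random


def make_sample(rng, size=16):
    """Return (image, label) for one synthetic example.

    image is a size x size grid of 0/1 pixels containing one filled
    rectangle; label is its bounding box (top, left, height, width).
    All names and sizes here are hypothetical, chosen for illustration.
    """
    h = rng.randint(2, size // 2)
    w = rng.randint(2, size // 2)
    top = rng.randint(0, size - h)
    left = rng.randint(0, size - w)
    image = [[0] * size for _ in range(size)]
    for r in range(top, top + h):
        for c in range(left, left + w):
            image[r][c] = 1
    # The label comes for free: the generator knows where it drew the object.
    return image, (top, left, h, w)


def make_dataset(n, seed=0, size=16):
    """Generate n labelled examples; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [make_sample(rng, size) for _ in range(n)]


dataset = make_dataset(100)
```

A real synthetic-data pipeline would render far richer scenes, but the principle is the same: every generated example arrives already labelled.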

This has spawned a wave of data startups that offer solutions tailored to specific sets of companies, for example game data or aerial imagery data. Let’s look at some of the reasons behind the rise of these data startups:

  • As the Neuromation news release points out, the shortage of training data is a pressing problem in the industry
  • A Crowdflower data scientist report indicates that the lack of quality datasets is a bottleneck in finishing projects
  • Building your own training dataset is a laborious, time-consuming task
  • Most companies and big enterprises augment their datasets with external sources
  • Given the impact data can have on outcomes and the recent debate on data privacy, companies are taking different approaches to building datasets
  • Given recent trends, there has been a spurt of growth in data exchanges and similar platforms that use blockchain technology
  • Synthetic data is one way for startups to compete with data-rich companies such as Google. In many ways, it is a data-readiness strategy that helps bootstrapped startups bring AI solutions to the masses
  • A lot of vertical AI startups depend on synthetic datasets to solve industry-specific problems

Is The Adoption Of Synthetic Data Mainstream?

With the technology market getting more competitive, bootstrapped companies are increasingly dependent on proprietary datasets that can be difficult to procure and replicate. A diverse dataset is required to build product functionality that meets the end user’s needs. As tasks get commoditized, low-level tasks such as data generation and building training datasets are becoming critical to modern businesses. This is also a big barrier to entry, and data startups are disrupting existing business models by providing training datasets to startups and enterprises alike. Synthetic datasets have become part of the data strategy and have also sparked the notion of an open data economy.

Also, the recent GDPR rules and the renewed focus on data privacy have made it impossible for companies to collaborate on data without user consent. Experts believe that even anonymizing data is not enough to avoid the risk of re-identification in the post-GDPR world. Hence, a number of startups are providing simulated datasets which are secure and would not lead to a loss of information in the data.

Richa Bhatia
Richa Bhatia is a seasoned journalist with six years’ experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.
