With continuous advancements in artificial intelligence, the reliability of data-driven models determines how we engage with the world. Meta’s latest efforts are part of this quest to create good-quality data.
Meta’s Mephisto framework is described as a tool that makes collecting quality data for ML research projects easier and serves as a foundation for new research outcomes.
The need for quality data
Good data is at the core of AI today. However, a 2021 study reviewing several COVID-19 models revealed that they were useless in real-world applications. This was because of bad data stemming from a lack of standardisation, duplication, and mislabelling. The cost of bad data is estimated to be $15 million annually for each organisation. “The importance of data quality and master data management is very clear: people can only make the right data-driven decisions if the data they use is correct. Without sufficient data quality, data is practically useless and sometimes even dangerous,” stated BARC in a report naming data quality as a key BI trend for 2022.
With the introduction of distributed technologies and data centres, data is being created at scale. Yet, when it comes to explaining the data and ensuring its accuracy, we have yet to catch up. If anything, the types and quantity of data only keep increasing. Data is being created inside and outside organisations, and while companies are catching up on data organisation, creating good-quality data is still a challenge.
Additionally, most researchers who build public data sets for their studies have to apply data quality control techniques to overcome labelling errors and train robust models. However, given the lack of a central method to mitigate such issues, researchers risk running into the same data collection pitfalls that prior work has already resolved.
All about Meta’s Mephisto
To encourage open and reproducible science, Meta’s Mephisto is a tool to standardise and codify the best practices and infrastructure for collecting and annotating research data. Through the tool, researchers can share collection methodologies in a reusable form. This allows other researchers to swap components and find the annotations needed for their study, thereby mitigating the challenge of custom task creation. Mephisto works by identifying the common workflows that take a complex annotation task from the idea stage through to data collection. “This allows for iterating on task design and quality control in a meaningful way before any data collection task. It also allows us to publish our methodologies for the wider AI community to use or improve upon,” Meta stated.
Mephisto has three novel characteristics:
- Platform agnostic: The tool was designed from the ground up to work with different crowd providers.
- Centralised: Mephisto is a platform that shares blocklists and tracks worker utilisation across various projects.
- Extensible: Mephisto is extensible because it defines tasks in blueprints that can be published, shared and re-used quickly.
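The “centralised” characteristic can be pictured with a small sketch. The code below is an illustration only, written under the assumption that a single registry object is shared by every project; the class name `WorkerRegistry` and its methods are hypothetical and are not taken from Mephisto’s API.

```python
from collections import Counter


class WorkerRegistry:
    """Hypothetical central registry shared by all projects (not Mephisto's API)."""

    def __init__(self):
        self._blocked = set()        # worker IDs blocked for every project
        self._units_done = Counter() # (project, worker_id) -> completed units

    def block(self, worker_id):
        """Add a worker to the shared blocklist."""
        self._blocked.add(worker_id)

    def can_work(self, worker_id):
        """A worker blocked by one project is blocked for all of them."""
        return worker_id not in self._blocked

    def record_unit(self, project, worker_id):
        """Note that a worker completed one unit of work for a project."""
        self._units_done[(project, worker_id)] += 1

    def utilisation(self, worker_id):
        """Total units this worker completed across every project."""
        return sum(n for (_, w), n in self._units_done.items() if w == worker_id)


registry = WorkerRegistry()
registry.record_unit("dialogue-study", "w1")
registry.record_unit("image-captions", "w1")
registry.block("w9")
```

Because both projects write to the same registry, a second team can check `registry.can_work("w9")` or `registry.utilisation("w1")` without rebuilding that history itself.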
The Mephisto architecture
Mephisto is split into three primary sections:
- The Data Model: The data model comprises all of the ‘unchanging’ parts of a crowdsourcing workflow. It attempts to capture the state that crowdsourcing tasks require, in both the short and the long term, which lets Mephisto treat all crowdsourcing tasks the same way at a conceptual level. Classes here fall into three categories: the Runs of a task, the Assignments within a task run, and the context required to support Worker logic.
Figure: The data model cycle (Source: Mephisto)
- The Core Abstractions: The second section encapsulates the parts of crowdsourcing that may change frequently. The three main abstractions covering the differences between crowdsourcing tasks are:
  - Blueprints, which act as the proctors for a task
  - Architects, which hold the logic for how entities communicate with the Mephisto backend
  - Crowd providers, which offer a generalised API for dealing with external crowdsourcing platforms
- The Operations Layer: The last layer consists of several classes and utilities that run the common crowdsourcing task flows, such as launching and monitoring tasks and reviewing incoming data, all generalised through the APIs of the previous layers.
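To make the split between swappable abstractions and a shared operations layer more concrete, here is a minimal sketch. It borrows the article’s vocabulary only: the classes `ToyBlueprint`, `ToyArchitect` and `ToyCrowdProvider`, their method names, the `run_task` function and the localhost URL are all hypothetical and should not be read as Mephisto’s actual interfaces.

```python
from abc import ABC, abstractmethod


class ToyBlueprint(ABC):
    """What the task is: how a unit is presented and how an answer is checked."""

    @abstractmethod
    def build_unit(self, item): ...

    @abstractmethod
    def is_valid(self, answer): ...


class ToyArchitect(ABC):
    """Where the task is served, i.e. how workers reach the backend."""

    @abstractmethod
    def deploy(self): ...


class ToyCrowdProvider(ABC):
    """Who does the work: one interface over an external crowd platform."""

    @abstractmethod
    def request_answer(self, url, unit): ...


class SentimentBlueprint(ToyBlueprint):
    LABELS = {"positive", "negative"}

    def build_unit(self, item):
        return {"text": item, "question": "Is this review positive or negative?"}

    def is_valid(self, answer):
        return answer in self.LABELS


class LocalArchitect(ToyArchitect):
    def deploy(self):
        # Placeholder address; nothing is actually started in this sketch.
        return "http://localhost:3000"


class ScriptedProvider(ToyCrowdProvider):
    """Stand-in for a real crowd: replays pre-written answers in order."""

    def __init__(self, answers):
        self._answers = iter(answers)

    def request_answer(self, url, unit):
        return next(self._answers)


def run_task(blueprint, architect, provider, items):
    """Operations-layer sketch: the same loop works for any combination of parts."""
    url = architect.deploy()
    results = []
    for item in items:
        unit = blueprint.build_unit(item)
        answer = provider.request_answer(url, unit)
        results.append({"item": item, "answer": answer, "valid": blueprint.is_valid(answer)})
    return results
```

The point of the design is that `run_task` never needs to know which blueprint, architect or provider it was handed, so any one of them can be swapped without touching the others.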
How Mephisto works
Researchers and engineers can leverage Mephisto to collect data across different research domains, crowdsourcing providers, and server configurations by using the same code to run their tasks. Numerous plug-and-play abstractions work at the backend to start the data collection process, and the tool provides workflow guidelines that cover every step from ideation to a full-fledged launch.
Meta illustrated this with the example of researchers finding an existing task relevant to what they want to collect. Here, Mephisto’s blueprint can be a starting point from which the researcher makes immediate changes to the code. They can alter the data displayed, the annotations and more. Essentially, the tool allows them to test the code and iterate on it locally before piloting. Mephisto then provides a clean workflow for launching small pilot batches. Several researchers can view the results, making it easy to identify possible issues with the task or to spot workers intentionally submitting invalid data. With this verified data, researchers can finally use existing quality control methods to improve the task quality further, or they can construct their own heuristics specific to the data being collected.
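As an illustration of the kind of heuristic a researcher might build on a pilot batch, the sketch below takes a majority vote per item and flags workers who rarely agree with it. This is one common quality control approach and not something the article attributes to Mephisto; the function names, the 0.5 threshold and the pilot data are all made up for the example.

```python
from collections import Counter, defaultdict


def majority_labels(annotations):
    """Map each item to its most frequent label.

    annotations is a list of (worker_id, item_id, label) tuples.
    """
    votes = defaultdict(Counter)
    for _, item_id, label in annotations:
        votes[item_id][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def worker_agreement(annotations):
    """Fraction of each worker's labels that match the majority label."""
    majority = majority_labels(annotations)
    agree, total = Counter(), Counter()
    for worker, item, label in annotations:
        total[worker] += 1
        agree[worker] += int(label == majority[item])
    return {worker: agree[worker] / total[worker] for worker in total}


def flag_workers(annotations, min_agreement=0.5):
    """Workers whose agreement rate falls below the (assumed) threshold."""
    rates = worker_agreement(annotations)
    return sorted(w for w, rate in rates.items() if rate < min_agreement)


# A tiny, invented pilot batch: three workers label two items.
pilot = [
    ("w1", "q1", "cat"), ("w2", "q1", "cat"), ("w3", "q1", "dog"),
    ("w1", "q2", "dog"), ("w2", "q2", "cat"), ("w3", "q2", "dog"),
    ("w1", "q3", "cat"), ("w2", "q3", "cat"), ("w3", "q3", "dog"),
]
```

Calling `flag_workers(pilot)` on this batch returns a list of worker IDs to review by hand. A low agreement rate is only a signal, since a worker can disagree with the majority and still be right.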
“Once the pilots display high-quality results, they can launch the complete task and monitor progress while it’s in flight. From here, researchers can package up their data set and publish the complete code by which others can collect something similar,” Meta concluded.
Open research is a proven technique to ensure wider collaboration, greater engagement, and compliance with ethical standards. Meta aims to support the training data collection element of open research by publishing code for data collection that helps make it reproducible. The team’s mission is to create an industry-wide standard for data collection and enable everyone to share, adapt, and standardise on more responsible techniques.
Mephisto currently includes important privacy protection protocols, such as hiding worker identification. Meta plans to add solutions that report worker statistics on contributions to a data set, flag concerns about fair pay and protections, and highlight projects that explicitly try to de-bias data sets.
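The article does not say how Mephisto hides worker identification. One common way to achieve a similar effect before publishing a data set is to replace each worker ID with a keyed hash, sketched below under that assumption; the function names and the `worker_` prefix are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(worker_id, secret_key):
    """Replace a worker ID with a stable, opaque pseudonym (assumed approach)."""
    digest = hmac.new(secret_key, worker_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "worker_" + digest[:12]


def anonymise_records(records, secret_key):
    """Return copies of the records with the worker_id field pseudonymised."""
    return [
        {**record, "worker_id": pseudonymise(record["worker_id"], secret_key)}
        for record in records
    ]
```

The same worker always maps to the same pseudonym, so per-worker statistics still work on the published data, while the original ID cannot be read back without the secret key.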