Just last year, in yet another data breach scandal at Airbnb, the data of a few hosts, including personal address information and direct messages, was exposed to other hosts on the app. Hosts reported on the Airbnb subreddit that, on logging in, they were shown different names and a different inbox, while their co-host saw a second, unrelated inbox. Airbnb described the breach as a technical issue affecting only a small subset of users.
In a blog post, Elizabeth Nammour, Wendy Jin, and Shengpu Liu, software engineers at Airbnb, broke down the platform's automated data protection system. Airbnb's data is stored across MySQL, Hive, and S3, and is generated, replicated, and propagated daily; a centralised inventory system manages it.
The Data Protection Platform
The DPP is built to understand Airbnb's data automatically and enable its protection. Where the software cannot protect the data itself, it notifies the team to do so manually. The automation focuses on three areas: data discovery, prevention of sensitive data leaks, and data encryption.
The first step is to discover the personal data that needs to be secured. DPP automatically notifies data owners when it detects their data in data stores, and ensures the data gets deleted or returned. Data breaches generally occur when API keys or credentials are leaked internally, for example when an engineer logs a secret on a server. Here DPP acts preventively: it identifies potential leaks, notifies the engineer to delete the secret from the code, and has the new secret hidden using the encryption tool sets. Lastly, encryption ensures that infiltrators do not get access to sensitive data. DPP's encryption service discovers sensitive data itself instead of relying on manual identification.
Let’s take a look at the various components making up DPP.
Source: Elizabeth Nammour’s Medium Post
The data classification service is called Inspekt. It continuously scans Airbnb's data stores to tag sensitive and personal data. Angmar is the secret detection pipeline for the codebase, and Cipher is the data encryption service, which gives Airbnb developers a transparent framework for protecting sensitive data. Privacy compliance requests are handled by the orchestration service Obliviate, while Minister is the third-party risk and privacy compliance service. Madoka is the metadata service that collects security and privacy properties of data assets from various sources. Lastly, the presentation layer is the Data Protection Service, which defines jobs that automate data protection.
Source: Elizabeth Nammour’s Medium Post
One of the essential layers for data protection, Madoka is a metadata system maintaining the security and privacy-related metadata across all assets on the Airbnb platform. Madoka’s centralised repository for engineers and other internal stakeholders allows them to track and manage the metadata of their data assets.
Madoka's primary metadata comprises the list of data assets, their ownership, and their data classification, for both MySQL and S3.
Madoka has three essential functions: collecting metadata, storing metadata, and providing metadata to other services. Two main services carry these out: a crawler and a backend. The Madoka crawler is a daily crawling service that fetches metadata from other data sources and publishes it to an AWS Simple Queue Service (SQS) queue. The Madoka backend is a data service that ingests this metadata from SQS, reconciles any conflicting information, and stores the result in its database.
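The crawler-to-backend flow can be sketched as follows. This is a minimal illustration, not Airbnb's implementation: Python's in-memory `queue.Queue` stands in for the SQS queue, and `MetadataStore`, `crawl`, `ingest`, the message fields, and the "latest crawl wins" reconciliation rule are all assumptions made for the example.

```python
import json
import queue
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Hypothetical stand-in for the Madoka backend's database."""
    assets: dict = field(default_factory=dict)

    def upsert(self, message: dict) -> None:
        # Assumed reconciliation rule: per property, keep the value from the
        # most recent crawl. The source does not describe the real rule.
        asset = self.assets.setdefault(message["asset_id"], {})
        for key, value in message["properties"].items():
            current = asset.get(key)
            if current is None or message["crawled_at"] >= current["crawled_at"]:
                asset[key] = {"value": value, "crawled_at": message["crawled_at"]}


def crawl(sources, out_queue) -> None:
    """Crawler side: publish one serialized message per discovered record."""
    for source in sources:
        for record in source():
            out_queue.put(json.dumps(record))


def ingest(in_queue, store) -> int:
    """Backend side: drain the queue and store reconciled metadata."""
    handled = 0
    while True:
        try:
            body = in_queue.get_nowait()
        except queue.Empty:
            return handled
        store.upsert(json.loads(body))
        handled += 1


def git_source():
    # Hypothetical record, as if read from a repository annotation.
    yield {"asset_id": "mysql://db1/users/email", "crawled_at": 1,
           "properties": {"classification": "unknown", "owner": "team-accounts"}}


def inspekt_source():
    # Hypothetical record, as if produced by an automated classifier.
    yield {"asset_id": "mysql://db1/users/email", "crawled_at": 2,
           "properties": {"classification": "email"}}


sqs_stand_in = queue.Queue()  # in-memory stand-in for the SQS queue
store = MetadataStore()
crawl([git_source, inspekt_source], sqs_stand_in)
handled = ingest(sqs_stand_in, store)
```

Putting a queue between the two services means the crawler never writes to the database directly, so conflicting reports about the same asset are resolved in one place, the backend.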
To collect the list of all columns within the MySQL AWS account, the crawler first calls the AWS APIs to get the list of all clusters in the environment and their reader endpoints. It then uses JDBI to connect to each cluster through its reader endpoint and list all the databases, tables, columns, and column data types. It collects this data and passes it along to the Madoka backend for storage. For S3, Airbnb uses Terraform to configure AWS resources in code, so the crawler parses the Terraform files to fetch the S3 metadata. To list the objects in each bucket, the crawler uses S3 inventory reports, which are enabled on all production S3 buckets in Terraform. The crawler collects information such as account number, account name, bucket name, assumed role name, etc., to pass along to the backend for storage.
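The column-listing step can be illustrated with a small sketch. The source describes JDBI (a Java library) connecting to MySQL clusters; the example below instead uses Python's built-in `sqlite3` module with an in-memory database as a stand-in, so it runs anywhere. The function name `list_columns`, the record fields, and the cluster name are hypothetical. On MySQL, the same listing would typically come from the `information_schema.columns` view rather than SQLite's `PRAGMA table_info`.

```python
import sqlite3


def list_columns(conn: sqlite3.Connection, cluster: str) -> list[dict]:
    """Return one metadata record per column reachable through the connection."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, data_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            records.append({
                "cluster": cluster,
                "table": table,
                "column": column,
                "data_type": data_type,
            })
    return records


# Stand-in for a cluster reached through its reader endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("CREATE TABLE listings (id INTEGER, address TEXT)")

columns = list_columns(conn, cluster="example-cluster")
```

Each record carries enough context (cluster, table, column, type) for a backend to store it as a separate data asset.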
The team uses a metadata property, ownership, to describe who owns a specific data asset. Service ownership data allows the team to link a data asset to a specific codebase and take protection actions that require code changes. In addition, ownership is assigned to teams rather than to individual users, so that a data asset stays with its team and remains protected.
The data classification metadata property describes the type of data elements stored within an asset. It lets users understand the risk associated with each data set and determine the level of protection it needs.
The crawler fetches data classifications from Airbnb's Git repositories and from the automated data classification tool, Inspekt. The output lists the data elements found in each asset, so that classifications are monitored constantly and kept current as the data changes.
The team has built Madoka to be easily extensible, so that it can keep collecting and storing more security and privacy-related attributes. The Airbnb team has taken extensive steps to ensure better data protection and security.
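As a rough illustration of how classifications from two sources could be combined into a per-asset view of risk, consider the sketch below. Everything in it is assumed for the example: the asset names, the data element names, the numeric risk tiers in `RISK_BY_ELEMENT`, and the rule that an asset is as risky as its most sensitive element. The source does not describe how Airbnb actually merges or scores classifications.

```python
from collections import defaultdict

# Hypothetical risk tiers; the source does not list Airbnb's actual levels.
RISK_BY_ELEMENT = {"email": 2, "address": 3, "listing_title": 1}


def merge_classifications(*sources: dict) -> dict:
    """Union the data elements each source reports for every asset."""
    merged = defaultdict(set)
    for source in sources:
        for asset, elements in source.items():
            merged[asset].update(elements)
    return dict(merged)


def risk_level(elements: set) -> int:
    """Assumed rule: an asset is as risky as its most sensitive element."""
    return max((RISK_BY_ELEMENT.get(e, 0) for e in elements), default=0)


# Hypothetical inputs: annotations kept in Git and results from a scanner.
from_git = {"users.contact": {"email"}}
from_inspekt = {
    "users.contact": {"email", "address"},
    "listings.title": {"listing_title"},
}

merged = merge_classifications(from_git, from_inspekt)
risks = {asset: risk_level(elements) for asset, elements in merged.items()}
```

Taking the union errs on the side of caution: if either source reports a sensitive element in an asset, the asset is treated as containing it.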