Do you ever wonder how much data you create in the one minute it takes to book a single ride on Uber? Or how much data is processed in the background to let you do so? Meet Uber’s Jellyfish, the data storage project that helps you use the app effortlessly.
Uber has deployed various storage technologies to store its business data, including one called Schemaless. For a couple of years, Uber has used Schemaless to model related entries as a single row with multiple columns. It runs on fast underlying storage to deliver millisecond-order latency at high QPS.
The problem is that Schemaless has accumulated ever more data on expensive storage, substantially increasing Uber’s costs. In addition, although the access pattern showed frequent access at first followed by only occasional access later, Uber still has to ensure that the data is readily available upon request.
The Jellyfish Project
Uber had to develop a solution that treats data according to its access pattern, distinguishing frequently accessed data from infrequently accessed data. Engineers at Uber worked to keep backward compatibility, with no changes required from customers.
After experimenting with various compression methods, engineers at Uber found that batching several cells together and compressing them with the ZSTD algorithm gave an overall saving of about 40%. In addition, the data can be tiered internally, within the same tier, which reduces the footprint of old data while keeping latency within the required hundreds of milliseconds. “Since the batch size and ZSTD are configurable, we could tune our solution for the different use cases that are currently served by Schemaless,” the team explained in a blog post.
Hence the project was dubbed Jellyfish, after the jellyfish’s trait of spending very little energy for the mass it moves.
Jellyfish lets the team balance the overall savings against the impact on CPU utilisation. Two main parameters control this trade-off:
- Batch size, the number of rows batched together, and
- Compression level, which controls the trade-off between speed and compression in ZSTD.
Source: Uber Engineering
The team settled on a batch size of 100 rows with a ZSTD level of 7. This combination is light on the CPU and compresses the data by about 40%.
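To see why batch size matters, here is a minimal Python sketch under stated assumptions. It uses zlib from the standard library as a stand-in for ZSTD, and the record fields and the `make_cells` and `batched_size` helpers are hypothetical, not Uber's code. It only illustrates that compressing similar cells together tends to save more than compressing them one by one; the numbers it prints say nothing about Uber's actual data.

```python
import json
import zlib


def make_cells(n):
    # Hypothetical trip-like records; the field names are illustrative only.
    return [
        json.dumps({"trip_id": i, "status": "completed",
                    "city": "san_francisco", "fare_cents": 1000 + i}).encode()
        for i in range(n)
    ]


def batched_size(cells, batch_size, level):
    """Total bytes after joining `batch_size` cells and compressing each batch.

    zlib stands in for ZSTD here; both expose a numeric compression level.
    """
    total = 0
    for start in range(0, len(cells), batch_size):
        batch = b"\n".join(cells[start:start + batch_size])
        total += len(zlib.compress(batch, level))
    return total


cells = make_cells(1000)
raw = sum(len(c) for c in cells)
for batch_size in (1, 10, 100):
    size = batched_size(cells, batch_size, level=7)
    print(f"batch_size={batch_size:>3}  saving={1 - size / raw:.0%}")
```

Larger batches give the compressor more repeated structure to exploit, at the cost of more work whenever a single cell has to be read back, which is the trade-off the two parameters tune.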
The architecture consists of the standard real-time table and the new batch table. Customer data is first written to the real-time table and is moved to the batch table once it has been batched and compressed. In Schemaless, the basic unit is the cell, and batching is done by cell. When a cell is requested, Schemaless uses the batch index to retrieve the right batch, decompresses it, and indexes into it to extract the requested cell efficiently.
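The read path described above can be pictured with a small Python sketch. It is a toy model, not Uber's implementation: the `BatchTable` class, the `read_cell` helper, the tuple-shaped cell keys, and the use of zlib in place of ZSTD are all assumptions made for illustration.

```python
import json
import zlib


class BatchTable:
    """Toy batch backend: compressed batches plus an index of cell locations."""

    def __init__(self):
        self.batches = {}      # batch_id -> compressed bytes
        self.batch_index = {}  # cell key -> (batch_id, position in the batch)

    def add_batch(self, batch_id, cells):
        # cells: list of (key, value) pairs moved over from the real-time table.
        payload = json.dumps([value for _, value in cells]).encode()
        self.batches[batch_id] = zlib.compress(payload, 7)
        for position, (key, _) in enumerate(cells):
            self.batch_index[key] = (batch_id, position)

    def get_cell(self, key):
        location = self.batch_index.get(key)
        if location is None:
            return None
        batch_id, position = location
        # Decompress the batch, then index into it to extract one cell.
        values = json.loads(zlib.decompress(self.batches[batch_id]))
        return values[position]


def read_cell(real_time, batch_table, key):
    """Serve recent data from the real-time table, old data from the batch table."""
    if key in real_time:
        return real_time[key]
    return batch_table.get_cell(key)


batch_table = BatchTable()
batch_table.add_batch(0, [(("row1", "BASE", 1), {"fare": 10}),
                          (("row2", "BASE", 1), {"fare": 20})])
real_time = {("row3", "BASE", 1): {"fare": 30}}
print(read_cell(real_time, batch_table, ("row2", "BASE", 1)))
```

Callers only ever go through `read_cell`, so they do not need to know which table holds a given cell, which is one way to read the backward-compatibility goal.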
Bringing the Jellyfish to life
The rollout of Jellyfish went through multiple validation stages and ended with a phased rollout to production instances.
The rollout proceeds through six phases:
- Enabling Jellyfish: Jellyfish is configured for the instance, along with the migration range, and the batch backend is created.
- Migration: old data is read from the real-time backend and copied into the batch backend.
- Consistency validation: traffic for old data is shadowed. Requests are still served from the real-time table while the same data in the batch backend is validated against it. The two kinds of inconsistency reported, content and count, should both be zero for a successful migration.
- Pre-deletion: the shadowing is reversed. Traffic to old data is served from the batch backend, while digests computed from the real-time backend are used for comparison.
- Logical deletion: the read path for old data is switched away from the real-time backend.
- Physical deletion: once everything is confirmed, the old data is actually deleted. This phase is irreversible.
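The consistency check can be sketched in Python as follows. This is an illustration under assumptions: the article does not say how Uber defines or computes the two inconsistency types, so the `validate` function, the use of SHA-256 digests, and the request shapes here are hypothetical.

```python
import hashlib


def digest(value: bytes) -> str:
    # A digest lets two backends be compared without shipping full payloads.
    return hashlib.sha256(value).hexdigest()


def validate(real_time_reads, batch_reads):
    """Compare shadowed reads from the two backends.

    Each argument maps a request to the list of cells (bytes) returned.
    A "count" inconsistency means a different number of cells came back;
    a "content" inconsistency means the same number but differing data.
    """
    report = {"count": 0, "content": 0}
    for request, expected in real_time_reads.items():
        actual = batch_reads.get(request, [])
        if len(actual) != len(expected):
            report["count"] += 1
        elif [digest(c) for c in actual] != [digest(c) for c in expected]:
            report["content"] += 1
    return report


real_time_reads = {"get row1": [b"cell-a", b"cell-b"], "get row2": [b"cell-c"]}
batch_reads = {"get row1": [b"cell-a", b"cell-b"], "get row2": [b"cell-c"]}
report = validate(real_time_reads, batch_reads)
print(report)
```

In this sketch a migration would only move on to the next phase when both counters in the report are zero.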
Source: Uber Engineering
Changing the Data Access Model
Jellyfish also brought several optimisations to the data access path: decoding only the requested cells, deleting only metadata, and collating updates by batch.
Initially, serving a user request for a single cell meant fetching the whole batch; now the system decodes only the requested cell.
When cells are deleted, only the cell’s entry is removed from the batch index, which means the user can no longer access the cell. In the meantime, a background job actually removes the cells and stores selected cell information in a journal table. This keeps the online read/write path unaffected and reduces user-perceived latency.
Lastly, updates are collated by batch so that each batch cell is touched fewer times, which reduced the total update time by a factor of 4.
The Jellyfish project reduced Uber’s actual storage footprint, cutting the space taken by old backend data by 33 per cent. The team also ran macro benchmarks to characterise Jellyfish’s performance under different workloads, and stress tests to discover the relationship between throughput and latency. The tests validated that Jellyfish can meet a latency SLA of a few hundred milliseconds.
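The idea of collating updates by batch can be illustrated with a minimal Python sketch. It is not Uber's implementation: the `apply_updates_collated` function, the dictionary-based storage, and zlib standing in for ZSTD are assumptions. The point is only that grouping pending updates by batch lets each compressed batch be decompressed and rewritten once, however many of its cells change.

```python
import json
import zlib
from collections import defaultdict


def apply_updates_collated(batches, batch_index, updates):
    """Apply cell updates, rewriting each affected batch only once.

    batches:     batch_id -> compressed JSON list of cell values
    batch_index: cell key -> (batch_id, position in the batch)
    updates:     list of (cell key, new value)
    Returns the number of batch rewrites performed.
    """
    # Group the pending updates by the batch that holds each cell.
    by_batch = defaultdict(list)
    for key, new_value in updates:
        batch_id, position = batch_index[key]
        by_batch[batch_id].append((position, new_value))

    # Decompress, patch, and recompress each touched batch a single time.
    for batch_id, changes in by_batch.items():
        values = json.loads(zlib.decompress(batches[batch_id]))
        for position, new_value in changes:
            values[position] = new_value
        batches[batch_id] = zlib.compress(json.dumps(values).encode(), 7)
    return len(by_batch)


batches = {0: zlib.compress(json.dumps(["a", "b", "c"]).encode(), 7)}
batch_index = {"k0": (0, 0), "k1": (0, 1), "k2": (0, 2)}
rewrites = apply_updates_collated(batches, batch_index, [("k0", "A"), ("k2", "C")])
print(rewrites)  # 1: both updates hit the same batch, so it is rewritten once
```

Applying the same two updates one at a time would decompress and recompress the batch twice; the saving grows with the number of updates that land in the same batch.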