PayPal’s global data science group recently released Dione, an open-source Spark indexing library that enables faster interaction with Hadoop data. From time to time, users want to reuse the same data for more ad-hoc tasks in ecosystems built on Spark, Hive and HDFS (Hadoop Distributed File System). Tasks such as multi-row load or single-row fetch are traditionally solved with dedicated storage and technology stacks (HBase, Cassandra, etc.), which require data duplication and carry significant operational costs.
Dione addresses these challenges by using an index as a “shadow” table of the original data. The index contains only the key columns and pointers to the data, and it is saved in a special format inspired by Avro and bucketing. Based on this index, the library provides APIs to join, query, and fetch back the original data within the required SLAs.
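To make the “shadow” table idea concrete, here is a minimal sketch in plain Python. It is not Dione's API: the function names, the JSON-lines data file, and the `(path, offset, size)` pointer layout are all assumptions chosen for illustration. It shows only the general principle that an index holding a key and a pointer lets a reader seek straight to one row without scanning or copying the original data.

```python
import json
import os
import tempfile


def write_data_and_index(rows, data_path, key_field):
    """Write rows as JSON lines and return a key -> (path, offset, size) index.

    The index keeps only the key and a pointer; the row data stays in data_path.
    (Hypothetical layout for illustration, not Dione's on-disk format.)
    """
    index = {}
    with open(data_path, "wb") as f:
        for row in rows:
            payload = (json.dumps(row) + "\n").encode("utf-8")
            offset = f.tell()  # byte position where this row starts
            f.write(payload)
            index[row[key_field]] = (data_path, offset, len(payload))
    return index


def fetch(index, key):
    """Fetch a single row by seeking to its pointer instead of scanning the file."""
    path, offset, size = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(size).decode("utf-8"))


def demo():
    """Build a small data file plus index, then fetch one row by key."""
    rows = [
        {"id": f"user-{i}", "country": "US" if i % 2 else "DE", "amount": i * 10}
        for i in range(1000)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "data.jsonl")
        index = write_data_and_index(rows, data_path, "id")
        return fetch(index, "user-42")
```

Calling `demo()` reads exactly one row's worth of bytes from the data file, which is the property the index is meant to provide.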
A few advantages of using Dione are:
- Relies only on Spark, Hive and HDFS. No external services.
- Semi-managed — does not modify, duplicate or move the original data.
- Supports multiple indices on the same data.
- The index is exposed as a standard Hive table.
- The special Avro B-Tree format supports ad-hoc single-row fetch within an SLA of seconds.
Dione solves two main challenges that are currently faced using traditional methods:
- Given a pointer (a line in the index), how to fetch the data quickly?
- How to store the index table to meet the required SLAs?
The library is built on two main architectural components: Indexer and AvroBtreeFormat. The final index solution combines the two, although in principle each of them can stand on its own.
The Indexer’s main goal is to solve the multi-row load task. It does not solve the single-row fetch task, because using Spark to scan through the entire index table does not meet the required SLA. For that, the index’s storage format is used. Inspired by Avro’s SortedKeyValueFile, bucketing and traditional database indexing systems, the development team decided to create a “new” file format: Avro B-Tree.
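The reason a sorted, tree-like layout helps with single-row fetch can be sketched in a few lines of Python. This is a simplified two-level structure, not the actual Avro B-Tree layout: the block size, the function names, and the in-memory lists are assumptions made for illustration. Index entries are sorted by key and grouped into blocks, and a small top level holds the first key of each block, so a lookup touches at most one block instead of scanning every entry.

```python
import bisect

BLOCK_SIZE = 4  # entries per block; an arbitrary value for the example


def build_blocks(entries, block_size=BLOCK_SIZE):
    """Sort (key, pointer) entries by key and split them into fixed-size blocks.

    Returns (first_keys, blocks), where first_keys[i] is the smallest key in
    blocks[i]. first_keys plays the role of the tree's upper level.
    """
    ordered = sorted(entries, key=lambda entry: entry[0])
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    first_keys = [block[0][0] for block in blocks]
    return first_keys, blocks


def lookup(first_keys, blocks, key):
    """Return (pointer, blocks_read); pointer is None if the key is absent."""
    # Find the last block whose first key is <= the searched key.
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None, 0  # key sorts before every block, nothing to read
    block = blocks[pos]
    keys = [k for k, _ in block]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return block[i][1], 1
    return None, 1
```

For example, with twenty entries built as `[(f"k{i:04d}", ("part-0", i * 100)) for i in range(20)]`, looking up `"k0013"` reads a single block out of five. A real B-tree applies the same idea over more levels, so the number of blocks read grows with the depth of the tree and not with the size of the index.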
Both the Indexer and Avro B-Tree File Format libraries are independent packages and rely only on HDFS. Users can save any table in the Avro B-Tree format so it will be accessible for both batch analytics with Spark and single-row fetches. For a simplified Spark user experience, a high-level Spark API has been added for creating and using an index. The API is available in Scala and Python.
PayPal has open-sourced this library to share this functionality with the community and get feedback.
Victor is an aspiring Data Scientist and holds a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.