The term “open-source” was first coined in 1998. Since then, open-source software has generated billions of dollars. Moreover, the paradigm of open-source practices has been a blessing for modern-day big data systems. For instance, a decade ago, when Apache Spark was still in its nascence, companies like Yahoo, Amazon, and Intel helped improve the project.
Spark enabled data storage in the memory subsystems of the thousands of servers it pulls together. In contrast, Hadoop allowed seamless data storage on hard disks and reduced the time taken to call the data. Today, many big data companies are founded on these open-source projects. But with scale comes cost. Ride-hailing company Uber, which has accumulated tonnes of data over the years, claims to have a cost-efficient way of leveraging multiple open-source platforms. Open-source makes creating software easier and faster. Moreover, focused effort by scores of developers on specific software components will lead to higher quality instead of dozens of engineers solving the same problems many times in silos.
How Uber does it
Big data is at the core of Uber’s business. The scale of Uber’s big data platform that serves riders and eaters has multiplied from single-digit petabytes to many hundreds of petabytes. To reduce costs on the data platform, Uber’s engineering team focussed on three main factors: platform efficiency, supply, and demand.
Thanks to open-source software, Uber was able to scale up to meet business needs without reinventing the wheel. Today, Uber runs some of the largest deployments of Apache Hadoop, Apache Hive, Apache Spark, Apache Kafka, Presto, and Apache Flink globally.
Optimise for File Formats and Tables
Uber’s Hadoop File System space is occupied by Apache HiveTM tables. These tables are stored in two formats mainly: Parquet and ORC, which are column-based formats. These files consist of several blocks, which contain many rows (let’s say 10,000), stored in columns. To optimise the Parquet format, the engineers would traditionally deploy the GZIP Level 6 compression algorithm inside Parquet. As a tweak, they have utilised the latest work on compression algorithms by the Facebook team — ZSTD Level 9 and Level 19 algorithm. This algorithm helped reduce the parquet file size by 8% and 12% compared to GZIP-based Parquet files, respectively.
Optimise for Big Data Workload
Uber uses Apache YARN to run the majority of its big data Compute workload. The function of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate background processes. The Scheduler in YARN is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues, etc. The Capacity Schedule inside YARN allows the team to configure a hierarchical queue structure with MIN and MAX settings for each queue. These settings follow the Dynamic MAX algorithm, which monitors for accumulated usage within 24 hours. Adding more machines to the YARN cluster reduces the average utilisation and hurt cost efficiency. So, the team at Uber implements time-based rates. In this way, the YARN cluster’s average utilisation (measured by CPU/MemGB allocation) will be around 80%. That said, YARN clusters are relatively larger at Uber compared to other organisations. To ensure scalability and cost-efficiency, the team uses the YARN federation approach while dealing with large clusters.
Optimising for Storage
Uber’s data sets are usually stored in two systems: Online Storage systems (MySQL database) and the Analytics Storage system (Hive tables stored in HDFS on hard drives). Background tasks such as compression, secondary index building, data scrubbing, and erasure code fixes can be considered maintenance jobs.
The team let Ingestion write lightly-compressed Parquet files, which take a bit more disk space but a lot fewer CPUs. In this way, a background task scheduled to run later will not need too many CPUs.
The open-source tools have changed the way we interact with data, but how well have we adopted the open-source practices when it comes to using this data? Modern-day AI companies are founded on the aforementioned open-source tools, but experts believe that open-source is throwing AI policymakers for a loop.
Open-Source through the Lens of ML
Despite the growing evidence of the benefits of open-source tools in building large-scale real-world systems, few researchers argue that the true potential of these methods in the ML ecosystem is underutilised. “Existing implementations are not openly shared, resulting in software with low usability and weak interoperability,” said the researchers. Unlike Yester year’s open-source revolution, which gave us Hadoop and Spark, the ML ecosystem, according to the researchers, still struggles to embrace the open-source paradigm. Barring a few frameworks such as TensorFlow and PyTorch, there aren’t many open-source ML tools. According to the researchers, the following factors contribute to the current state of open-source in the ML community:
- misconceptions of open-source conflict with commercial interests,
- lack of incentives to publish open-source software,
- Issues with reproducibility,
- Poor programming skills of ML researchers, and more.
One must also observe that the ML developer landscape is relatively new, and the incentives are different to a traditional software developer ecosystem. The ML ecosystem devoid of these roadblocks can lead to software packages that are easily adopted across domains, better interoperability, and standardisation of ML toolboxes.