Rust Turns GitHub’s Long-standing Problem to Dust

The search engine is written in Rust from scratch

Like the fungus after which it was named, Rust has spread widely and become one of the most popular programming languages. The latest testament to its popularity is GitHub's announcement of a new code search engine that works at GitHub scale. Written in Rust, it creates and incrementally maintains a code search index sharded by Git blob object ID.

In a blog post on Monday, GitHub engineer Timothy Clem described Blackbird, a code search engine written entirely in Rust. It currently provides access to almost 45 million GitHub repositories. Sifting through that much code needs something more capable than grep, a command-line tool that filters plain-text data for lines matching a pattern. On an 8-core machine, using ripgrep to run a regular expression query on a 13GB file held in memory takes about 2.769 seconds, or roughly 0.6GB/sec/core.
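
The per-core figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the core count of 8 is taken from GitHub's description of its test machine and should be treated as an assumption here, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the ripgrep throughput figure.
FILE_SIZE_GB = 13.0      # size of the file scanned
ELAPSED_SECONDS = 2.769  # reported wall-clock time
CORES = 8                # assumption: core count of the test machine


def throughput_per_core(size_gb: float, seconds: float, cores: int) -> float:
    """Return scan throughput in GB per second per core."""
    return size_gb / seconds / cores


per_core = throughput_per_core(FILE_SIZE_GB, ELAPSED_SECONDS, CORES)
print(f"{per_core:.2f} GB/sec/core")  # prints "0.59 GB/sec/core"
```

The result rounds to the 0.6GB/sec/core quoted in the post.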

Blackbird works faster, handling about 640 queries per second. At an indexing rate of about 120,000 documents per second, processing 15.5 billion documents takes around 36 hours; re-indexing takes about 18 hours, since delta indexing reduces the number of documents to be crawled.
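
The indexing estimate follows directly from the two numbers quoted. A minimal sketch of the arithmetic, assuming delta indexing cuts the documents to crawl roughly in half (a figure inferred here from the 36-to-18-hour ratio, not stated in this article):

```python
# Rough indexing-time estimate from the quoted figures.
DOCUMENTS = 15_500_000_000  # documents in the corpus
DOCS_PER_SECOND = 120_000   # quoted indexing rate
DELTA_FRACTION = 0.5        # assumption: share of documents still crawled


def hours_to_index(documents: float, docs_per_second: float) -> float:
    """Return the hours needed to index the given number of documents."""
    return documents / docs_per_second / 3600


full_hours = hours_to_index(DOCUMENTS, DOCS_PER_SECOND)
reindex_hours = hours_to_index(DOCUMENTS * DELTA_FRACTION, DOCS_PER_SECOND)
print(f"full index: about {full_hours:.0f} hours")
print(f"re-index:   about {reindex_hours:.0f} hours")
```

Under those assumptions the full pass comes out at about 36 hours and the re-index at about 18.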

To keep the search index manageable, GitHub splits the data into shards using Git's content-addressable hashing scheme and uses delta encoding to reduce the data and metadata to be crawled. This works well because GitHub hosts a great deal of redundant data, which deduplication can strip out.
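
To see why content-addressable hashing makes deduplication almost free, consider how Git names a blob: the ID is the SHA-1 of a short header plus the file's bytes, so identical files get the same ID wherever they live. The Python sketch below illustrates the idea only. The sample files, the `deduplicate` and `shard_for` helpers, and the shard count of 4 are all hypothetical, and the modulo-based shard assignment is one simple option, not a description of how Blackbird does it.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content (SHA-1 format)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def deduplicate(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file paths by blob ID so identical content is indexed only once."""
    paths_by_blob: dict[str, list[str]] = {}
    for path, content in files.items():
        paths_by_blob.setdefault(git_blob_id(content), []).append(path)
    return paths_by_blob


def shard_for(blob_id: str, shard_count: int) -> int:
    """Hypothetical shard assignment: blob ID modulo the number of shards."""
    return int(blob_id, 16) % shard_count


# Hypothetical sample: the same licence text lives in two repositories.
files = {
    "repo-a/LICENSE": b"MIT License\n",
    "repo-b/LICENSE": b"MIT License\n",
    "repo-a/main.rs": b"fn main() {}\n",
}
unique = deduplicate(files)
print(len(files), "files ->", len(unique), "unique blobs")
for blob_id, paths in unique.items():
    print(blob_id[:8], "shard", shard_for(blob_id, 4), paths)
```

Because the shard is derived from the content hash, duplicate files land on the same shard and need to be indexed only once.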

(Join the GitHub Code Search Engine beta)

Why not an open-source solution?

There is a plethora of open-source solutions, such as Apache Cassandra, Solr, and Elasticsearch, that GitHub could have chosen instead of building a search engine from scratch. So why did GitHub opt for the long haul?

Clem answered in a GitHub Universe video presentation that GitHub has not had much success using general text search products to power code search: the user experience is poor, indexing is slow, and the products are expensive to run. He wrote that some newer, code-specific open-source projects are out there, but they don't work at GitHub's scale.

GitHub started experimenting with Elasticsearch in 2011. However, Elasticsearch sometimes introduces breaking changes that users have to adapt to. As Clem noted, it took months to index roughly eight million repositories. GitHub had originally used Solr for search but moved to Elasticsearch due to an excessive need for storage space.

Today, GitHub supports about 200 million dynamic code repositories. The latest Rust-written engine, Blackbird, supports search across about 45 million repositories, providing partial coverage. It still enables search across 15 terabytes of code and 15.5 billion documents for programs written in Java, Python, and JavaScript. 

In Rust, We Trust

The Blackbird project is a milestone for Rust, which is usually adopted to build new features for projects originally written in C/C++. For instance, last year Microsoft Azure's CTO said that new projects should be started in Rust rather than C/C++ because of its memory safety features.

Research shows that memory safety issues have accounted for 60% to 70% of all security vulnerabilities across operating systems. The Google security blog sees Rust as a practical language for implementing the kernel. It highlights that Rust helps reduce the number of potential bugs and provides higher security, making it a good choice for such a critical component.

Moreover, because it offers fine-grained control over memory without a garbage collector, along with low-level access to hardware, Rust is well suited to programming devices with tiny processors and limited RAM, such as microcontrollers. It could become a top language for the Internet of Things (IoT). Rust can also be a good fit for performance-sensitive back-end services. For instance, Tilde used Rust to rewrite some Java HTTP endpoints and, as a result, reduced memory usage by a factor of 100.

The Bottom Line

Blackbird is being called a game changer because GitHub's default search is not the best option for programmers: it strips out most "special" characters and doesn't support regular expression search. Another option is Sourcegraph, which is expensive at $90 per active user per month. A better search built into GitHub would be the more economical choice.

GitHub's latest search engine is being lauded: users are comparing the product to StackExchange, and some have even called it 'Google Scholar for Code'. The current use cases seem obscure, but it could indirectly be a major productivity booster for GitHub Copilot.

Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.