Rust Turns GitHub’s Long-standing Problem to Dust

The search engine is written in Rust from scratch

Like the fungus after which it was named, Rust has spread quickly to become one of the most popular programming languages. The latest testament to Rust’s popularity comes with GitHub’s announcement of a new code search engine that can function at GitHub scale. Written in Rust, it creates and incrementally maintains a code search index sharded by Git blob object ID.

In a blog post on Monday, GitHub engineer Timothy Clem described Blackbird, a code search engine written entirely in Rust. It currently provides access to almost 45 million GitHub repositories. Searching code at that scale needs something more capable than grep, a command-line tool that filters plain-text data for lines matching a pattern. Using ripgrep to run a specific regular expression query on a 13GB file held in memory takes about 2.769 seconds, or 0.6GB/sec/core.

Blackbird is much faster, handling about 640 queries per second. Since its indexing rate is about 120,000 documents per second, processing 15.5 billion documents takes around 36 hours, or about 18 hours for re-indexing, since delta indexing reduces the number of documents to be crawled.
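The indexing figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, not GitHub's code: the function name `indexing_hours` is made up for illustration, and the 50% crawl fraction is an assumption inferred from the 18-hour re-indexing figure.

```rust
/// Hours needed to process `documents` at `docs_per_second`.
/// (Hypothetical helper, used only for this estimate.)
fn indexing_hours(documents: f64, docs_per_second: f64) -> f64 {
    documents / docs_per_second / 3600.0
}

fn main() {
    let documents = 15.5e9; // 15.5 billion documents, as quoted above
    let rate = 120_000.0; // documents indexed per second, as quoted above

    let full = indexing_hours(documents, rate);
    println!("full index: {:.1} hours", full); // about 36 hours

    // Assumption: delta indexing roughly halves the documents to crawl,
    // which is what the quoted 18-hour figure implies.
    let crawl_fraction = 0.5;
    let reindex = indexing_hours(documents * crawl_fraction, rate);
    println!("re-index:   {:.1} hours", reindex); // about 18 hours
}
```

At 120,000 documents per second, 15.5 billion documents take roughly 129,000 seconds, which is where the 36-hour figure comes from.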

To keep the search index manageable, GitHub splits the data into shards using Git’s content-addressable hashing scheme and uses delta encoding to reduce the data and metadata to be crawled. This works well because GitHub holds plenty of redundant data that can be eliminated through deduplication.
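To make the idea of content-addressable deduplication concrete, here is a minimal sketch. It is an illustration under stated assumptions, not Blackbird's implementation: `ShardedIndex` and `content_id` are hypothetical names, and the standard library's `DefaultHasher` stands in for Git's real blob ID (a SHA-1 hash) so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a Git blob object ID. Git actually hashes a
/// "blob <len>\0" header plus the content with SHA-1; DefaultHasher
/// is an assumption made to keep this sketch dependency-free.
fn content_id(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical index: each shard maps a content ID to the content,
/// so identical files are stored once no matter how many repositories
/// contain them. (A 64-bit stand-in hash could collide; a real system
/// would use a cryptographic hash.)
struct ShardedIndex {
    shards: Vec<HashMap<u64, Vec<u8>>>,
}

impl ShardedIndex {
    fn new(shard_count: usize) -> Self {
        ShardedIndex { shards: vec![HashMap::new(); shard_count] }
    }

    /// Returns true if the content was new, false if it was a duplicate.
    fn insert(&mut self, content: &[u8]) -> bool {
        let id = content_id(content);
        // The content ID alone decides which shard owns the document.
        let shard = (id % self.shards.len() as u64) as usize;
        if self.shards[shard].contains_key(&id) {
            return false;
        }
        self.shards[shard].insert(id, content.to_vec());
        true
    }

    fn unique_documents(&self) -> usize {
        self.shards.iter().map(|s| s.len()).sum()
    }
}

fn main() {
    let mut index = ShardedIndex::new(4);
    // The same file content appearing twice, e.g. in a repository and its fork.
    let files: [&[u8]; 3] = [b"fn main() {}", b"print('hi')", b"fn main() {}"];
    for f in files.iter() {
        index.insert(f);
    }
    println!("{} files, {} unique", files.len(), index.unique_documents());
    // prints "3 files, 2 unique"
}
```

Because the shard is chosen from the content hash alone, duplicate files always land on the same shard, so the duplicate check never has to look across shards.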

(Join the GitHub Code Search Engine beta)

Why not an open-source solution?

There is a plethora of open-source solutions to choose from, such as Apache Cassandra, Solr, or Elasticsearch, as an alternative to building a search engine from scratch. So why did GitHub opt for the long haul?

Clem answered in a GitHub Universe video presentation that GitHub had failed to find success using general text search products to power code search: the user experience was poor, indexing was slow, and the service was expensive to run. He wrote that some newer, code-specific open-source projects are out there, but they don’t work at GitHub’s scale.

GitHub started experimenting with Elasticsearch in 2011. However, Elasticsearch sometimes introduces breaking changes that GitHub has to adapt to. As Clem noted, it took months to index roughly eight million repositories. GitHub originally used Solr for search but moved to Elasticsearch due to an excessive need for storage space.

Today, GitHub supports about 200 million dynamic code repositories. The latest Rust-written engine, Blackbird, supports search across about 45 million repositories, providing partial coverage. It still enables search across 15 terabytes of code and 15.5 billion documents for programs written in Java, Python, and JavaScript. 

In Rust, We Trust

The Blackbird project is a milestone for Rust, which is usually adopted to build new features for projects originally written in C/C++. For instance, last year the Microsoft Azure CTO declared that all new projects should be written in Rust rather than C/C++ because of its memory safety features.

Research shows that memory safety issues have accounted for 60% to 70% of all security vulnerabilities across operating systems. The Google security blog sees Rust as a practical language for implementing the kernel. It highlights that Rust helps reduce the number of potential bugs and provides higher security, making it a good choice for such a critical component.

Moreover, because it offers low-level control over memory without a garbage collector, Rust is well suited to programming hardware devices with tiny processors and limited RAM, such as microcontrollers. It could become the top language for the Internet of Things (IoT). Rust can also be a good fit for performance-sensitive back-end services. For instance, Tilde used Rust to rewrite some Java HTTP endpoints and, as a result, reduced memory usage by 100 times.

The Bottom Line

Blackbird is being called a game changer because GitHub’s default search is not the best option for programmers: it strips out most “special” characters and doesn’t support regular expression search. Another option is Sourcegraph, which is expensive at $90 per active user per month. A better search built into GitHub would be the more economical option.

GitHub’s latest search engine is being lauded: users are comparing the product to StackExchange, and some have even called it ‘Google Scholar for Code’. The current use cases seem obscure, but it could indirectly be a major productivity booster for GitHub Copilot.

Tasmia Ansari
Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.
