GitHub has announced the general availability of precise code navigation for all Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open-source framework that allows defining the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, GitHub can generate code navigation data for a repository without requiring any configuration from the repository owner or tapping into a build process or other CI job.
Code navigation is a family of features that allows developers to explore the relationships in their code and its dependencies at a deep level. The most basic code navigation features are “jump to definition” and “find all references.” Both are built on the fact that ‘names’ are pervasive in the code that we write. Programming languages allow defining things — functions, classes, modules, methods, variables, and more. Those things have names so that developers can ‘refer’ back to them in other parts of the code.
GitHub’s aim with Stack Graphs is to collect information about the lists of definitions and references and determine which definitions each reference maps to for all of the code hosted on the platform.
The problem
Many times, there are multiple definitions with the same name, and in Python, names can shadow each other. Similarly, in Rust, top-level definitions are not allowed to shadow each other, but local variables are. Often, codes are also split across multiple files, packages, and repositories. Programming languages provide the ability to refer to definitions that might be quite far away. But, the rules for how to refer to things in other files are different for different languages.
Code also changes and evolves over time. If not careful, GitHub has to reanalyse every file in the repository and in all its dependencies whenever any file changes. This increases the amount of work, which is especially problematic at GitHub’s scale. The team did not want developers to require any manual configuration on the part of each repository owner. For this, GitHub relies on incremental processing and storage, reusing the results that have already been calculated and saved for the files that haven’t changed. The second challenge was the number of supporting programming languages. GitHub hosts code written in every programming language imaginable, but itself doesn’t care about the language.
To GitHub, everything is just bytes. But for code navigation, where the name binding rules are different for each language, it is essential to know how to parse and interpret the content of those files. To support this at scale, it has to be as easy as possible for GitHub engineers and external language communities to describe the name binding rules for a language.
The solution
After examining the problem, GitHub created stack graphs to tackle these challenges, based on the scope graphs framework from Eelco Visser’s research group at TU Delft.
With a stack graph, GitHub has implemented “jump to definition” where after the user clicks on a reference, GitHub loads in the stack graphs for each file in the commit and merges them together. It then performs a path-finding search starting from the reference node corresponding to the symbol that the user clicked on, considering symbol stacks and precedences to ensure that any invalid paths are not created. Also, any valid paths that GitHub finds represent the definitions that the reference refers to. Those are displayed in a hovercard.
GitHub used Tree-sitter, an open-source parsing framework to create stack graphs from the source code. The Tree-sitter community has already written parsers for a wide variety of programming languages and is used in many places across GitHub. This made it a natural choice to build stack graphs on. As part of developing stack graphs, a new graph construction language to Tree-sitter allows the construction of arbitrary graph structures from parsed CSTs.
Developers can use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere. This approach allows creating stack graphs incrementally for each source file they receive while only having to analyse the source code content and without having to invoke any language-specific tooling or build systems.