“What is ‘best practice’ at the time of writing may slowly become ‘bad practice’ as the cybersecurity landscape evolves.”
Pearce et al.
Modern-day deep learning (DL) models, especially those powering sophisticated NLP-based applications, have become so advanced that they can even run code diagnostics and perform interventions on a codebase. For example, GitHub recently released Copilot, an AI-based programming assistant that can generate code in popular programming languages. All one has to do is give Copilot some context, such as comments, function names, and surrounding code. Copilot is built on OpenAI's Codex, a descendant of GPT-3 trained on open-source code, including "public code…with insecure coding patterns", thus giving rise to the potential for "synthesise[d] code that contains these undesirable patterns".
Because Copilot is based on the OpenAI Codex family of models, its tokenisation step is nearly identical to GPT-3's: byte pair encoding is used to convert the source text into a sequence of tokens.
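To make that tokenisation step concrete, here is a minimal sketch using OpenAI's open-source tiktoken library. The "p50k_base" encoding is the one published for the Codex-era models; Copilot's exact tokeniser is not public, so treat this as an approximation.

```python
import tiktoken

# "p50k_base" is the published encoding for Codex-era models.
# Copilot's actual tokeniser is not public; this is an approximation.
enc = tiktoken.get_encoding("p50k_base")

source = "def add(a, b):\n    return a + b"
tokens = enc.encode(source)

print(tokens)                              # integer token IDs
print(enc.decode(tokens))                  # round-trips to the source text
print([enc.decode([t]) for t in tokens])   # individual byte-pair fragments
```

Notice that byte pair encoding splits source code into subword fragments rather than whole keywords or identifiers, which is what lets the same vocabulary cover many programming languages.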
A Brief Overview of Copilot
- A coder works on a program, editing the code in a plain text editor.
- As the coder adds lines of code to the program, Copilot continuously scans the program and periodically uploads some subset of lines, the position of the user’s cursor, and metadata before generating some code options for the user to insert.
- Copilot tries to generate code that is functionally relevant to the program as implied by comments, docstrings, function names, and so on.
- Copilot also reports a numerical confidence score for each of its proposed code completions, with the top-scoring (highest-confidence) option presented as the default selection for the user (a sketch of such a ranking follows this list).
- The user can then choose any of Copilot’s options to make changes to the code.
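Copilot's scoring mechanism is not public. A common heuristic for ranking sampled completions from a language model is the mean per-token log-probability, sketched below with made-up candidate data, purely as an illustration of how a "default" suggestion might be chosen.

```python
# Hypothetical candidates: (completion text, per-token log-probabilities).
# Copilot's actual scoring is not public; mean log-probability is simply
# a common heuristic for ranking sampled completions.
candidates = [
    ("return a + b", [-0.1, -0.3, -0.2, -0.1]),
    ("return sum([a, b])", [-0.5, -1.2, -0.9, -0.7, -0.4, -0.6]),
]

def mean_logprob(logprobs):
    return sum(logprobs) / len(logprobs)

# Sort so the highest-confidence completion comes first (the "default").
ranked = sorted(candidates, key=lambda c: mean_logprob(c[1]), reverse=True)
for text, logprobs in ranked:
    print(f"{mean_logprob(logprobs):+.3f}  {text}")
```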
Given a prompt, Codex and Copilot try to autocomplete the code that is most relevant to it. Because the underlying model is trained on publicly available code, generation is essentially a probabilistic exercise in finding the most likely code, and that can usher bad code into systems. The researchers at NYU fear that the model will not necessarily generate the best code but rather the code that best matches what came before. According to the researchers, the quality of the generated code can be strongly influenced by semantically irrelevant features of the prompt.
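As an illustration, consider a prompt of the kind the researchers built their scenarios around. The completion below is a plausible, hypothetical continuation, not actual Copilot output; the point is that the model simply continues the pattern the prompt implies, even when that pattern is insecure.

```python
# --- Prompt given to the model (comment + function signature) ---
# Return the MD5 hash of the given password string.
def hash_password(password: str) -> str:
    # --- A plausible (hypothetical) completion follows ---
    # The model happily continues the insecure pattern the prompt
    # implies: MD5 is long considered broken for password hashing,
    # yet it is exactly what the comment and function name suggest.
    import hashlib
    return hashlib.md5(password.encode()).hexdigest()
```

Nothing in the model pushes back on the insecure request; it completes the statistically most likely continuation of the text it was given.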
Dangers of Piloting Bad Code
To find out how vulnerable code gets generated by platforms such as GitHub's Copilot, NYU researchers investigated the prevalence of, and the conditions that can cause, GitHub Copilot recommending insecure code. To perform this analysis, they prompted Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE's "Top 25" list).
CWE is an open community initiative sponsored by the Cybersecurity and Infrastructure Security Agency (CISA). According to MITRE, the Common Weakness Enumeration (CWE) is a community-developed list of common software weaknesses (flaws, faults, bugs, or other errors in software implementation) that can leave systems and networks vulnerable to attack. The CWE List and its glossary are used to identify and describe these weaknesses in terms of CWEs.
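To make this concrete, consider CWE-89 (SQL injection), one of MITRE's "Top 25" weaknesses and of the kind the researchers built scenarios around. The two Python snippets below are illustrative, not taken from the paper:

```python
import sqlite3

def get_user_unsafe(conn: sqlite3.Connection, username: str):
    # CWE-89: user input is interpolated directly into the SQL string,
    # so a username like "x' OR '1'='1" changes the query's meaning.
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # The fix: a parameterised query. The driver keeps the data out of
    # the SQL grammar, so input can no longer alter the query structure.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()
```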
The researchers validated Copilot's performance on three distinct code-generation axes, examining how it performs given a diversity of weaknesses, a diversity of prompts, and a diversity of domains. In total, they produced 89 different scenarios for Copilot to complete, yielding 1,689 programs, of which approximately 40% were found to be vulnerable.
(Image credits: Paper by Pearce et al.)
The above picture illustrates the methodology used by the researchers to validate Copilot, which can be summarised as follows:
- For each CWE, the authors wrote a number of 'CWE scenarios': small, incomplete program snippets in which Copilot is asked to generate code.
- Next, Copilot is asked to generate up to 25 options for each scenario. Each option is then combined with the original program snippet to make a set of programs, with options discarded if they have significant syntax issues.
- Each program is then evaluated by CodeQL, using either built-in or custom queries (a sketch of such a pipeline follows this list).
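Below is a minimal sketch of such an evaluation pipeline. It assumes the CodeQL CLI is installed and on the PATH; the directory names are placeholders, and the query-pack name is the standard public Python pack, not necessarily the researchers' actual setup.

```python
import subprocess

# Placeholder paths -- not the researchers' actual layout.
SOURCE_DIR = "generated_programs"   # prompt + Copilot option, combined
DB_DIR = "codeql_db"
RESULTS = "results.sarif"

# 1. Build a CodeQL database from the generated Python programs.
subprocess.run(
    ["codeql", "database", "create", DB_DIR,
     "--language=python", f"--source-root={SOURCE_DIR}"],
    check=True,
)

# 2. Analyse the database with a query suite (built-in or custom),
#    emitting SARIF output that can be scanned for CWE hits.
subprocess.run(
    ["codeql", "database", "analyze", DB_DIR,
     "codeql/python-queries",
     "--format=sarif-latest", f"--output={RESULTS}"],
    check=True,
)
```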
The researchers designed 54 scenarios across the 18 different CWEs. From these, Copilot generated options that produced 1,087 valid programs, of which 477 (43.88%) were determined to contain a CWE, and 24 of the scenarios (44.44%) had a vulnerable top-scoring suggestion. Breaking down by language, 25 scenarios were in C, generating 516 programs, of which 258 (50.00%) were vulnerable; 13 of these scenarios (52.00%) had a vulnerable top-scoring program. 29 scenarios were in Python, generating 571 programs in total, of which 219 (38.35%) were vulnerable.
Future Debugged
Compared with the other two languages (Python and C), Copilot struggled to generate syntactically correct and meaningful Verilog, mostly due to the smaller amount of training data available. As Copilot is trained on open-source code available on GitHub, the authors believe that the variable security quality stems from the nature of the community-provided code: where certain bugs are more visible in open-source repositories, those bugs will be reproduced more often by Copilot.
The researchers also observed that, because Copilot is a generative model, its outputs are not directly reproducible: for the same prompt, they warn, Copilot can generate different answers at different times. "As Copilot is both a black-box and closed source residing on a remote server, general users cannot directly examine the model used for generating outputs," wrote the authors. They also admit that the scenarios written to validate Copilot's performance do not fully reflect real-world coding, which is "messier" and contains larger amounts of context.
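This non-determinism is what one would expect from temperature-based sampling, the standard decoding strategy for models in the Codex family. The sketch below shows how repeated sampling from the same next-token distribution yields different outputs; the tokens and logits are made up for illustration, not real model output.

```python
import numpy as np

rng = np.random.default_rng()

# Made-up next-token logits for illustration -- not real model output.
tokens = ["strcpy", "strncpy", "memcpy", "snprintf"]
logits = np.array([2.0, 1.5, 0.8, 0.3])

def sample(temperature: float) -> str:
    # Temperature rescales the logits before the softmax: higher values
    # flatten the distribution, so repeated calls diverge more often.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(tokens, p=probs)

# Two runs with the same prompt and settings can pick different tokens.
print([sample(temperature=0.8) for _ in range(5)])
print([sample(temperature=0.8) for _ in range(5)])
```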
There is no doubt that low-code and no-code tools and platforms will flourish. While the coding community can improve its productivity with a coding assistant like Copilot at hand, outsiders and non-technical users can tinker with their ideas without having to dig deep into coding paradigms. The advantages are immense. However, widespread adoption of AI-based coding practices can also open doors to vulnerabilities. The New York University researchers recommend that tools like Copilot be paired with appropriate security-aware tooling during both training and generation to minimise the risk of introducing security vulnerabilities.