Vulnerabilities and Insecure Training Data Memorisation in GitHub Copilot: Does Context Play a Role?
Feb 2024 – Sep 2024, submitted to Hamburg University of Technology (TUHH)
GitHub Copilot, an AI-driven code completion tool developed by GitHub and built on OpenAI's language models, has changed the way developers write code. That convenience, however, comes with security and privacy risks. In this project, I set out to examine the security and privacy aspects of Copilot, focusing on its susceptibility to specific types of attacks and weaknesses.
Abstract
The use of Large Language Models (LLMs) has become an integral part of day-to-day life for many individuals, especially in coding and software development. Despite their widespread adoption by developers, there is still little evidence on whether these models generate code that replicates samples from their training datasets. Our study explores how LLMs, particularly black-box code generation models such as GitHub Copilot, handle code generation: whether they replicate their training data, and whether they fix or introduce vulnerabilities in their output, across various contexts of the same input. We analysed over 14,250 code snippets generated by the LLM from 1,694 prompt files across eight different variants. We found that the LLM replicates code exactly as it appears in its training data 18.59% of the time; as a consequence, it also replicates the vulnerabilities present in that data. In parallel, we collected 7,159 data points identifying the files in GitHub OSS repositories where the replicated code was found. Additionally, the LLM fails to address well-known vulnerabilities in the input prompt 50.78% of the time, sometimes injecting further vulnerabilities instead. We observed that the LLM introduced vulnerabilities in 58.16% of cases, with 87 unique CWE injections beyond the well-known vulnerabilities already present in the input. These vulnerabilities, including SQL injection, poor input validation, and weak cryptographic practices, pose significant risks if the generated code is used in real-world applications. Furthermore, 78.38% of the generated code is either incorrect or not executable, resulting in considerable inefficiency as developers must spend time debugging it. While LLMs boost productivity in many scenarios, they also present notable risks. This highlights the need for improved training methodologies that avoid training data memorisation, and for robust validation mechanisms, so that LLMs can be safely integrated into development workflows without compromising security or code quality.
Keywords: Context-based Code generation, Large Language Models (LLMs), Vulnerabilities, Common Weakness Enumerations (CWEs), Training Data Memorisation
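To make the kind of weakness named in the abstract concrete, here is a minimal, hypothetical Python sketch (not taken from the study's dataset; the function names, table schema, and inputs are illustrative assumptions). It contrasts a SQL-injection-prone pattern (CWE-89) of the sort an assistant may replicate from training data with the parameterized alternative.

import sqlite3

# Hypothetical illustration only: a SQL-injection-prone query built by
# string concatenation, next to the parameterized fix.

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input is concatenated directly into the query,
    # so input like "x' OR '1'='1" bypasses the filter.
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe: a placeholder lets the driver treat the input strictly as data.
    query = "SELECT id, username FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute("INSERT INTO users (username) VALUES ('alice')")
    # The injected input returns every row through the vulnerable path
    # but no rows through the parameterized one.
    print(find_user_vulnerable(conn, "x' OR '1'='1"))
    print(find_user_safe(conn, "x' OR '1'='1"))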
What I learned:
Through this project, I learned how to work independently, managing my responsibilities as a working student while balancing university studies and staying committed to the project. The experience taught me to be analytical, accountable, and reliable, and to meet deadlines and deliver high-quality work despite multiple commitments.
The project will be made publicly available online once it has been approved by the university.