So Brendan Dolan-Gavitt, assistant professor in the computer science and engineering department at NYU Tandon, has released FauxPilot, an alternative to Copilot that runs locally, without phoning home to the Microsoft mothership.
Copilot relies on OpenAI Codex, a natural language-to-code system based on GPT-3 that was trained on “billions of lines of public code” in GitHub repositories. That has made advocates of free and open source software (FOSS) uncomfortable because Microsoft and GitHub have failed to specify exactly which repositories informed Codex.
As Bradley Kuhn, policy fellow at the Software Freedom Conservancy (SFC), wrote in a blog post earlier this year, “Copilot leaves copyleft compliance as an exercise for the user. Users likely face growing liability that only increases as Copilot improves. Users currently have no methods besides serendipity and educated guesses to know whether Copilot’s output is copyrighted by someone else.”
Shortly after GitHub Copilot became commercially available, the SFC urged open source maintainers not to use GitHub in part due to its refusal to address concerns about Copilot.
Not a perfect world
FauxPilot doesn’t use Codex. It relies on Salesforce’s CodeGen model. However, that’s unlikely to appease FOSS advocates because CodeGen was also trained using public open source code without regard to the nuances of different licenses.
“The models that it’s using right now are ones that were trained by Salesforce, and they were again, trained basically on all of GitHub public code,” explained Dolan-Gavitt in a phone interview with The Register. “So there are some issues still there, potentially with licensing, that wouldn’t be resolved by this.”
“On the other hand, if someone with enough compute power came along and said, ‘I’m going to train a model that’s only trained on GPL code or has a license that lets me reuse it without attribution’ or something like that, then they could train their model, drop that model into FauxPilot, and use that model instead.”
For Dolan-Gavitt, the primary goal of FauxPilot is to provide a way to run the AI assistance software on-premises.
“There are people who have privacy concerns, or maybe, in the case of work, some corporate policies that prevent them from sending their code to a third-party, and that definitely is helped by being able to run it locally,” he explained.
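To make the privacy point concrete, here is a minimal sketch of what talking to a locally hosted completion server could look like. The endpoint address, the OpenAI-style JSON body, and the helper names `build_completion_request` and `request_completion` are assumptions made for illustration; none of them comes from Dolan-Gavitt or from this article, and a real deployment may differ.

```python
import json
import urllib.request

# Assumption: a completion server is listening on this machine at this address
# and accepts an OpenAI-style JSON body. Neither detail comes from the article.
LOCAL_ENDPOINT = "http://localhost:5000/v1/engines/codegen/completions"


def build_completion_request(prompt, max_tokens=64, endpoint=LOCAL_ENDPOINT):
    """Build (but do not send) a POST request carrying the code prompt."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_completion(prompt):
    """Send the prompt to the local server and return the parsed JSON reply."""
    req = build_completion_request(prompt)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    req = build_completion_request("def hello():")
    # The request targets localhost, so the prompt stays on this machine.
    print(req.get_method(), req.full_url)
```

Because the request is addressed to `localhost`, the source code in the prompt is never handed to a third party, which is the scenario the corporate policies Dolan-Gavitt mentions are meant to prevent.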
GitHub, in its description of what data Copilot collects, describes an option to disable the collection of Code Snippets Data, which includes “source code that you are editing, related files and other files open in the same IDE or editor, URLs of repositories and files paths.”
But doing so does not appear to disable the gathering of User Engagement Data – “user edit actions like completions accepted and dismissed, and error and general usage data to identify metrics like latency and features engagement” and potentially “personal data, such as pseudonymous identifiers.”
Dolan-Gavitt said he sees FauxPilot as a research platform.
“One thing that we want to do is train code models that hopefully output more secure code,” he explained. “And once we do that, we’ll want to be able to test them and maybe even test them with current users using something like Copilot but with our own models. So that was kind of motivation.”
Doing so, however, has some challenges. “At the moment, it’s somewhat impractical to try and create a dataset that doesn’t have any security vulnerabilities because the models are really data hungry,” said Dolan-Gavitt.
“So they want lots and lots of code to train on. But we don’t have very good or foolproof ways of ensuring that code is bug free. So it would be an immense amount of work to try and curate a data set that was free of security vulnerabilities.”
Nonetheless, Dolan-Gavitt, who co-authored a paper on the insecurity of Copilot code suggestions, finds AI assistance useful enough to stick with it.
“My personal feeling on this is I’ve had Copilot turned on basically since it came out last summer,” he explained. “I do find it really useful. That said, I do kind of have to double check its work. But often, it’s often easier for me at least to start with something that it gives me and then edit it into correctness than to try to create it from scratch.” ®