Autocoding platforms have emerged as one of the premier use-cases for large language models over the past few years. These platforms, which can generate code based on natural language prompts, have been catapulted into the toolkit of the mainstream developer.
While this might seem like a win for the developer community as a whole, many have raised concerns over the datasets used by Codex and other large language models that power autocoding platforms. Class action lawsuits have been filed against Microsoft over its use of GitHub repositories to train its Copilot model without crediting the developers.
Even as Microsoft uses its vast access to resources to build GitHub Copilot, other offerings in the field have found success with responsibly sourced datasets. The primary example among them is Tabnine, which has created an autocoding platform focused on responsibly sourced datasets and user privacy. To delve deeper into Tabnine's operations, Analytics India Magazine spoke to Brandon Jung, VP of Ecosystem and Business Development at Tabnine.
AIM: Tabnine has a focus on training models using responsibly sourced datasets. Can you tell us what are some of the benefits and challenges that come along with taking this approach?
Brandon: Generally speaking for a model, the more code the better. So, from that standpoint, the decision to only take fully permissively licensed code means there is code out there that is really, really well written and would be great to include, but that we have to leave out.
If you pull in lots of code that is not fully permissive, the code you get will also vary: code that might be open, or I should say available, but licensed differently. It can also be available but contain personal information. Fully permissive open source code will not have that, because you're not pulling in personal code.
How can we ensure that we're not putting in code that we weren't aware we just put in? The easiest way to do that is to stick to fully permissive code. More code is generally better, so that's a bit of a tradeoff of the route we went.
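The license-based filtering Jung describes can be sketched as a simple allowlist check against declared SPDX identifiers. This is an illustrative assumption, not Tabnine's actual pipeline; the repo records, the `filter_permissive` helper, and the specific allowlist are hypothetical.

```python
# Hypothetical sketch: keeping only fully permissively licensed repos
# in a training corpus. Not Tabnine's actual pipeline.

# A small allowlist of common fully permissive SPDX license identifiers.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "Unlicense"}

def filter_permissive(repos):
    """Keep only repos whose declared license is on the permissive allowlist.

    Repos with a copyleft license, or with no detectable license at all,
    are excluded -- the conservative default Jung describes.
    """
    return [r for r in repos if r.get("license") in PERMISSIVE_LICENSES]

corpus = [
    {"name": "fast-json", "license": "MIT"},
    {"name": "kernel-fork", "license": "GPL-3.0"},  # copyleft: excluded
    {"name": "web-widgets", "license": None},       # unknown: excluded
]

print([r["name"] for r in filter_permissive(corpus)])  # ['fast-json']
```

The tradeoff Jung mentions shows up directly: the stricter the allowlist, the smaller the corpus, even when the excluded code is high quality.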
AIM: Will your datasets ever grow to the point where you can compete against GitHub Copilot, which has the admittedly unfair advantage of using code from GitHub repos?
Brandon: Fit-for-purpose is a really important aspect of this. If you have all of GitHub's data, if you're taking all code, including a bunch of code from other people, the model is biased by how often it sees a given pattern. On average, code on GitHub is dated: there's far more code out there for the older version than there is for the new version. So, by training on all of GitHub's data, you're actually biasing your model towards old code and old processes.
You have a bunch of users on GitHub already. Now they're going to create a bunch more code, and it's going to be based on older code. I don't know if that actually moves things forward, because what you're doing is just reinforcing it. The downside is that it's not the highest quality code.
When we work with Google or with Amazon, the data that we pick up from partners has a bias towards current APIs, towards where the industry is going rather than where it has been. A company, or even an open source team, knows where they want to go. [Copilot's] not as useful [to them].
AIM: Tabnine has the ability to learn the coding style of the developer in question. What are some of the technical advancements that allow you to enable this?
Brandon: Tabnine really operates with two models: a local model on your computer and a cloud model. You can use one or the other, or both. What that allows us to do is some pretty good customisation for you as a user, based on the code on your computer, without sending all of your code back to Tabnine.
There are trade-offs to that. If you run fully locally, you'll get much shorter snippets, because you don't have a huge GPU sitting behind you solving that problem. That optionality, and the ethical stance we take on how we handle a developer's data and their interaction with us, is a big deal.
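The hybrid setup Jung describes can be sketched as a simple router: requests go to a small on-device model by default, and to a larger cloud model only when the user opts in. The function names, the `source` field, and the behaviour of each path are illustrative assumptions about such an architecture, not Tabnine's implementation.

```python
# Hypothetical sketch of local/cloud completion routing.
# All names and behaviours here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    source: str  # "local" or "cloud"

def complete(prefix: str, allow_cloud: bool) -> Completion:
    """Route a completion request based on the user's privacy choice."""
    if allow_cloud:
        # Large remote model: longer, richer suggestions,
        # but the code prefix leaves the machine.
        return Completion(text=prefix + "_from_cloud_model", source="cloud")
    # Small on-device model: shorter suggestions,
    # and the developer's code never leaves the device.
    return Completion(text=prefix + "_from_local_model", source="local")
```

The trade-off in the interview maps onto the two branches: the local path preserves privacy at the cost of completion length and quality, while the cloud path inverts that.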
We've oriented this in a way that lets you keep your security as a developer without sending all of it back. Copilot sucks all your code back. I think this is well known. So, there's a security implication.
AIM: Why should users pick Tabnine over others? What are the advantages your platform offers over competitors?
Brandon: First off, there's our strategy, as I talked about: innovation through architecture, being able to partner with the rest of the industry. With so many people working on this, I think the likelihood is that a Google or a Salesforce or a Meta is going to have at least equivalent, if not better, models over time.
Secondly, the data matters, both in terms of where you get it, only fully permissive, and in being able to train on your own code. The last is security. You can run it where you want to run it, you have maximum control. Your model is your model. Your developers' code doesn't leak.
I'd say those are the three easy ones: innovation through architecture, data that matters, and security, in that you can run it anywhere you want. Each is an area where we're differentiated and where people are focused.