OpenAI Codex shows the limits of large language models
In a new paper, researchers at OpenAI have revealed details about Codex, a deep learning model that generates software source code. Codex powers Copilot, an “AI pair programmer” tool developed jointly by OpenAI and GitHub. Copilot is currently available in beta test mode to a limited number of users.
The paper is a fascinating read that explains the process through which the scientists at OpenAI managed to repurpose their flagship language model, GPT-3, to create Codex. But more importantly, the paper also sheds much-needed light on how far you can trust deep learning in programming.
The “no free lunch” theorem
Codex is a descendant of GPT-3, a massive deep learning language model that was released last year. The complexity of deep learning models is often measured by the number of their parameters. In general, the learning capacity of a model increases with the number of parameters. GPT-3 came with 175 billion parameters, more than two orders of magnitude larger than its predecessor, GPT-2 (1.5 billion parameters). GPT-3 was trained on more than 600 gigabytes of data, more than 50 times the size of GPT-2’s training dataset.
Aside from its enormous size, the main innovation of GPT-3 was “few-shot learning,” the ability to perform tasks it was not trained for. The paper that introduced GPT-3 was titled “Language Models are Few-Shot Learners” and stated: “Here we show that scaling up language models greatly improves task-agnostic, few-shot performance [emphasis mine], sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”
Basically, the premise was that a sufficiently large model trained on a large corpus of text could match or exceed multiple models that specialize in specific tasks.
But according to OpenAI’s new paper, none of the various versions of GPT-3 were able to solve any of the coding problems used to evaluate Codex. To be fair, there were no coding samples in GPT-3’s training dataset, so we can’t expect it to be able to code. But the OpenAI scientists also tested GPT-J, a 6-billion-parameter model trained on the Pile, an 800-gigabyte dataset that includes 95 gigabytes of GitHub and 32 gigabytes of StackExchange data. GPT-J solved 11.4 percent of the coding problems. Codex, a version of GPT-3’s 12-billion-parameter model fine-tuned on 159 gigabytes of code samples from GitHub, solved 28.8 percent of the problems. A separate version of Codex, called Codex-S, which was fine-tuned through supervised learning, raised performance to 37.7 percent (the other GPT and Codex models are trained through unsupervised learning).
Codex shows that machine learning is still ruled by the “no free lunch” theorem (NFL), which means that generalization comes at the cost of performance. In other words, machine learning models are more accurate when they are designed to solve one specific problem. On the other hand, their performance degrades as their problem domain is broadened.
Codex can perform a specific task (converting function descriptions and signatures into source code) with high accuracy at the expense of poor natural language processing capabilities. On the other hand, GPT-3 is a general language model that can generate decent text on many topics (including complicated programming concepts), but cannot write a single line of code.
Size vs. cost
The OpenAI researchers’ experiments show that the performance of Codex improved as the machine learning model grew in size. At 300 million parameters, Codex solved 13.2 percent of the evaluation problems, versus 28.8 percent for the 12-billion-parameter model.
But the full version of GPT-3 contains 175 billion parameters, a whole order of magnitude larger than the one used to create Codex. Wouldn’t training the larger model with the Codex training data produce better results?
One likely reason for stopping at 12 billion could be the dataset size. A larger Codex model would need a larger dataset. Training it on the 159-gigabyte corpus would probably lead to overfitting, where the model becomes very good at memorizing and rehearsing its training examples and very bad at dealing with novel situations. Gathering and maintaining larger datasets is an expensive and time-consuming process.
An equally vexing problem would be the cost of Codex. Aside from being a scientific experiment, Codex was meant to become the backbone of a future product that can turn a profit for a research lab that is, so to speak, owned by a commercial entity. As I mentioned earlier, the cost of training and running the 175-billion-parameter GPT-3 model would make it very hard to build a profitable business model around it.
However, a smaller but fine-tuned version of GPT-3 would be much more manageable in terms of profits and losses.
Finally, as OpenAI’s experiments show, Codex’s size-to-performance ratio follows a logarithmic scale. This means that performance gains gradually shrink as you increase the size of the model. Therefore, the added cost of gathering data and of training and running the larger model might not be worth the small performance boost.
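To make the diminishing returns concrete, here is a rough sketch that draws a log-linear curve through the two data points quoted above (13.2 percent at 300 million parameters, 28.8 percent at 12 billion). The log-linear form and the helper names are assumptions made for illustration; this is a toy model, not the fit reported in the paper, and it should not be read as a prediction for larger models.

```python
import math

# The two pass rates quoted in this article: (parameter count, percent solved).
SMALL = (300e6, 13.2)
LARGE = (12e9, 28.8)


def fit_log_linear(p1, p2):
    """Fit score = a + b * log10(params) exactly through two points (toy model)."""
    (n1, s1), (n2, s2) = p1, p2
    b = (s2 - s1) / (math.log10(n2) - math.log10(n1))
    a = s1 - b * math.log10(n1)
    return a, b


def predict(a, b, params):
    """Score implied by the toy curve at a given parameter count."""
    return a + b * math.log10(params)


def marginal_gain_per_billion(b, params):
    """Slope of the toy curve, in percentage points per extra billion parameters."""
    return b / (params * math.log(10)) * 1e9


a, b = fit_log_linear(SMALL, LARGE)

# Under this assumed curve, every 10x increase in parameters adds the same
# number of points (b), so each extra billion parameters buys less and less.
gain_at_12b = marginal_gain_per_billion(b, 12e9)
gain_at_120b = marginal_gain_per_billion(b, 120e9)
```

Under this assumption, the value of one additional billion parameters is ten times smaller at 120 billion parameters than at 12 billion, while the cost of training and serving the model keeps growing.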
Note that code generation is a very lucrative market. Given the high hourly wages of programmers, even the savings of a few hours of programming time per month would be enough to cover Codex’s subscription fees. In other areas where labor costs are lower, automating tasks with large language models will be more of a challenge from a profit and loss perspective.
Generate vs. understand code
It is worth remembering that no matter how fascinating Codex’s output is, the deep learning model does not understand programming. Like all other deep learning–based language models, Codex captures statistical correlations between code fragments.
In their paper, the OpenAI scientists acknowledge that Codex “is not sample efficient to train” and that “even seasoned developers do not encounter anywhere near this amount of code over their careers.”
They also add that “a strong student who completes an introductory computer science course is expected to be able to solve more of the problems than Codex-12B”.
Here is an interesting excerpt from the paper: “We sample tokens from Codex until we encounter one of the following stop sequences: ‘\nclass’, ‘\ndef’, ‘\n#’, ‘\nif’, or ‘\nprint’, since the model will continue generating additional functions or statements otherwise.”
This means that Codex will mindlessly continue to generate code even after it has finished the block that addresses the problem stated in the prompt.
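A minimal sketch of what cutting a completion at those stop sequences could look like. The stop sequences are the ones quoted from the paper; the function name `truncate_at_stop` and the sample completion are hypothetical, and this is not OpenAI’s actual sampling code.

```python
# Stop sequences quoted from the paper.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop(completion, stop_sequences=STOP_SEQUENCES):
    """Cut a completion at the earliest stop sequence (hypothetical helper)."""
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


# A made-up raw completion: the body the prompt asked for, followed by
# an extra function the model kept generating on its own.
raw = (
    "    return a + b\n"
    "\n"
    "def unrelated_helper():\n"
    "    pass\n"
)

body = truncate_at_stop(raw)
```

Here `body` keeps only the `return a + b` line; everything from the unrelated `def` onward is discarded. The model itself has no notion that the task is finished, so the cutoff has to be imposed from outside.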
This is a scheme that works well when you want to solve simple problems that keep coming up. But when you zoom out and try to write a large program that addresses a problem that needs to be solved in several steps, the limitations of Codex become obvious.
The OpenAI scientists found that the model’s performance decreases exponentially as the number of components in the function description increases.
“This behavior is atypical for a human programmer who should be able to correctly implement a program for a chain of any length if he can for a chain of length two,” the researchers write in their paper.
Codex’s lack of understanding of program structure and code is further exposed by the fact that it “may recommend syntactically incorrect or undefined code and call functions, variables and attributes that are not defined or are outside the scope of the code base,” according to the paper. In practice, this means that in some cases the machine learning model will stitch together different pieces of code it has seen before, even if they don’t fit together.
In their paper, the researchers also discuss “misalignment” issues in Codex, where the model is capable of solving a particular problem but fails to do so because of various mistakes. Codex uses the contents of the file you are working on as context to generate its output. If your code contains subtle bugs (which is quite normal for human programmers), Codex may “intentionally” suggest code that looks good on the surface but is incorrect, the researchers warn.
Misalignment is an interesting phenomenon that needs further study. But OpenAI’s experiments further show that “the misalignment would likely persist and even worsen as the data, parameters and training time increased,” which might be another reason for keeping the model’s size at 12 billion parameters.
The paper also talks at length about the potential for Codex to produce deprecated and vulnerable code (which deserves a separate article, so I haven’t discussed it here).
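One way a developer might screen suggestions for the two failure modes mentioned above is sketched below, using Python’s standard `ast` module. The function name `check_suggestion` is hypothetical, and the undefined-name check is a crude heuristic that ignores scoping rules (and a few binding forms such as `except ... as e`), so treat it as an illustration rather than a real linter.

```python
import ast
import builtins


def check_suggestion(source):
    """Return (syntax_ok, possibly_undefined_names) for a code snippet.

    Crude heuristic: a name counts as defined if it is bound anywhere
    in the snippet, regardless of scope or ordering.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False, set()

    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.alias):
            # "import a.b" binds "a"; "import a as b" binds "b".
            defined.add((node.asname or node.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                defined.add(node.id)
    return True, used - defined


# A made-up suggestion that calls a helper and reads a variable
# that are defined nowhere in the snippet.
ok, missing = check_suggestion("def f(x):\n    return helper(x) + y\n")
broken_ok, _ = check_suggestion("def f(:\n")
```

In this example `missing` contains `helper` and `y`, and `broken_ok` is `False`. A check like this only catches surface problems; it says nothing about whether code that parses and resolves is actually correct.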
Responsible use and reporting of AI
As I said after the release of Copilot, “AI pair programmer,” the term used for Copilot on the GitHub website, is inaccurate.
Codex is not a programmer. And it won’t take your job (if you’re a programmer) either. Coding is only part of what programmers do. The OpenAI scientists state that Codex, in its current state, “can reduce the cost of software production somewhat by increasing programming productivity,” but will not replace the other tasks that software developers regularly do, such as “conferring with colleagues, writing design specifications, and upgrading existing software stacks.”
Mistaking Codex for a programmer can also lead to “over-reliance,” where a programmer blindly approves any code generated by the model without reviewing it. Given the obvious and subtle mistakes Codex can make, overlooking this threat can create quality and security risks. “Human supervision and vigilance are required for the safe use of code generation systems like Codex,” the OpenAI researchers warn in their paper.
Overall, the response from the programming community shows that Codex is a very useful tool with a potentially big impact on the future of the software industry. At the same time, given the hype surrounding Copilot’s release, it’s important to understand its unwanted effects. In this regard, the people at OpenAI deserve credit for responsibly studying, documenting, and reporting on Codex’s limits and threats.
Ben Dickson is a software engineer and founder of TechTalks. He writes about technology, economics and politics.
This story originally appeared on bdtechtalks.com. Copyright 2021