May 10, 2023 – OpenAI, one of the leading organizations in artificial intelligence, has developed a new tool aimed at making language models more transparent and reliable. The tool uses one language model, GPT-4, to analyze the internals of another, such as OpenAI's GPT-2, identifying the parts responsible for the model's behavior and generating natural-language explanations for them.
The tool works by running text sequences through the model under evaluation and recording when individual neurons activate strongly. It then shows these highly active neurons, along with the text that triggered them, to GPT-4, which generates a candidate explanation. To check the explanation's accuracy, the tool gives GPT-4 further text sequences and asks it to predict, or simulate, the neuron's activations based on the explanation, then compares the simulated behavior with the neuron's actual behavior.
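The verification step described above, scoring an explanation by how closely GPT-4's simulated activations track the neuron's real ones, can be illustrated with a simple correlation score. This is a minimal sketch, not OpenAI's actual implementation; the function name and the choice of Pearson correlation as the scoring metric are assumptions for illustration.

```python
def score_explanation(real_activations, simulated_activations):
    """Score a neuron explanation by correlating the neuron's real
    activations with the activations GPT-4 simulated from the explanation.
    Returns a value in [-1, 1]; closer to 1 means the explanation
    predicts the neuron's behavior well.

    (Illustrative sketch only -- a stand-in for the real scoring pipeline.)
    """
    n = len(real_activations)
    mean_r = sum(real_activations) / n
    mean_s = sum(simulated_activations) / n

    # Pearson correlation: covariance divided by the product of std devs.
    cov = sum((r - mean_r) * (s - mean_s)
              for r, s in zip(real_activations, simulated_activations))
    var_r = sum((r - mean_r) ** 2 for r in real_activations)
    var_s = sum((s - mean_s) ** 2 for s in simulated_activations)

    if var_r == 0 or var_s == 0:
        return 0.0  # a constant sequence carries no signal to correlate
    return cov / (var_r * var_s) ** 0.5
```

A simulation that rises and falls exactly with the real neuron scores 1.0, while one that moves in the opposite direction scores -1.0; most real explanations land somewhere in between.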
Using this method, the researchers generated preliminary natural-language explanations for every neuron in GPT-2 and compiled them into a dataset, which was released as open source on GitHub along with the tool's code. A tool like this could one day be used to improve the behavior of language models, for example by reducing bias or harmful language. However, the researchers admit there is still a long way to go before it becomes truly useful: they are confident in the explanations for only about 1,000 neurons, a small fraction of the total.
Some may argue that the tool is simply an advertisement for GPT-4, since it requires GPT-4 to run. Jeff Wu, head of OpenAI's Scalable Alignment team, says this is not the tool's purpose: using GPT-4 was incidental, and if anything the work exposes GPT-4's weaknesses in this area. He adds that the tool was not created for commercial applications and could in principle use language models other than GPT-4.
Many of the explanations the tool generates score poorly or fail to account for much of the neuron's actual behavior. Wu says the activity of many neurons is hard to explain, such as when a neuron activates on five or six unrelated things with no apparent pattern. Sometimes a clear pattern does exist, but GPT-4 cannot find it.
Still, Wu believes the tool's basic mechanism will not change much for newer, larger, and more complex models, including models that browse the web for information. He says only slight adjustments would be needed to figure out why such a model decides to run certain search queries or visit particular websites.
“We hope that this will open up a promising avenue for solving the interpretability problem in an automated way, so that others can build on it and contribute,” Wu said. “We hope we can really have good explanations for the behavior of these models.”