When psychology is used to decipher ChatGPT’s reasoning

Large language models like ChatGPT, developed by the company OpenAI, show impressive but unpredictable capabilities. Cognitive psychology tools reveal that these AIs can present reasoning biases similar to those of humans, but also abilities sometimes surpassing ours.

Do you know about large language models (LLMs)? Even if the term seems obscure to you, it’s a safe bet that you’ve already heard of the most famous of them: ChatGPT, from the Californian company OpenAI.

The deployment of such artificial intelligence (AI) models could have consequences that are difficult to grasp. Indeed, it is hard to predict precisely how LLMs, whose complexity is comparable to that of the human brain, will behave. A number of their capabilities were thus discovered during use rather than planned at design time.

To understand these “emergent behaviors”, new investigations must be carried out. With this in mind, my research team used tools from cognitive psychology, traditionally employed to study rationality in humans, to analyze the reasoning of different LLMs, including ChatGPT.

Our work has highlighted the existence of reasoning errors in these artificial intelligences. Here is what we found.

What are large language models?

Language models are artificial intelligence models that are capable of understanding and generating human language. Roughly speaking, a language model predicts, based on context, the words most likely to appear in a sentence.
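To give a concrete (and deliberately simplistic) illustration of this idea of next-word prediction, here is a toy predictor that simply counts which word follows which in a small corpus. Real LLMs do something far more sophisticated, but the principle of assigning probabilities to possible continuations is the same; the corpus below is invented for the example.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows each word in a tiny
# corpus, then return the most frequent continuation for a given word.
corpus = "the cat sat on the mat and the cat slept on the sofa".split()

following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    candidates = following[word]
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))  # -> "cat", the most frequent continuation of "the"
```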

LLMs are artificial neural network algorithms. Inspired by the functioning of the biological neural networks that make up the human brain, each node (“artificial neuron”) of such a network generally receives several input values and, after processing them, produces an output value.
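As a minimal sketch of what a single artificial neuron computes, here is the standard weighted-sum-plus-activation formulation; the input values, weights and bias below are arbitrary illustrations, not taken from any real model.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs plus a bias,
    passed through a sigmoid activation that squashes the result to (0, 1)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Three arbitrary input values, three arbitrary weights, and a bias.
print(neuron([0.5, -1.0, 2.0], [0.8, 0.1, -0.4], bias=0.2))
```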

LLMs are distinguished from the “classic” artificial neural network algorithms that make up language models by being based on a specific architecture, by being trained on enormous databases, and by being generally gargantuan in size (on the order of several billion “neurons”).

The functioning of AI is quite close to that of the brain. // Source: Wikimedia/SVG Silh (Numerama montage)

Due to their size and structure (but also to the way in which they are trained), LLMs have shown, from the very beginning of their use, impressive performance in the tasks they are designed for, whether text creation, translation, or correction.

But that’s not all: LLMs have also demonstrated rather surprising performance in a wide variety of tasks, ranging from mathematics to basic forms of reasoning.

In other words, LLMs quickly demonstrated abilities that were not necessarily explicitly predictable from their programming. What’s more, they seem to be able to learn to perform new tasks from very few examples.

These capabilities have created an unprecedented situation in the field of artificial intelligence: for the first time, we have systems so complex that we cannot predict the extent of their capabilities in advance. In a way, we must “discover” their cognitive abilities experimentally.

Based on this observation, we postulated that the tools developed in the field of psychology could prove relevant for studying LLMs.

The benefit of studying LLM reasoning

One of the main objectives of scientific psychology (experimental, behavioral and cognitive) is to attempt to understand the mechanisms underlying the capacities and behaviors of extremely complex neural networks: those of human brains.

As our laboratory specializes in the study of cognitive biases in humans, the first idea that came to mind was to try to determine whether LLMs also presented reasoning biases.

Given the role these machines could play in our lives, understanding how they reason and make decisions is fundamental. Furthermore, psychologists can also benefit from these studies. Indeed, artificial neural networks, which can accomplish tasks at which the human brain excels (object recognition, speech processing, etc.), could also serve as cognitive models.

In particular, a growing body of evidence suggests that the neural networks implemented in LLMs provide accurate predictions of the neural activity involved in processes such as vision and language processing.

For example, it has been shown that the activity of artificial neural networks trained in object recognition correlates significantly with the neural activity recorded in the visual cortex of an individual performing the same task.

The same holds for the prediction of behavioral data, particularly in learning.

Performance that ended up surpassing that of humans

During our work, we mainly focused on LLMs from OpenAI (the company behind the GPT-3 language model, used in the first versions of ChatGPT), because these were the best-performing LLMs available at the time. We tested several versions of GPT-3, as well as ChatGPT and GPT-4.

To test these models, we developed an interface that allowed us to send questions and collect the models’ responses automatically, which enabled us to acquire a large amount of data.
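The article does not detail this interface, but a minimal sketch of such an automated query loop might look like the following, assuming the current OpenAI Python client; the model name and the test question are placeholders, not those used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Placeholder list of test items; the study used batteries of psychology tasks.
questions = [
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?",
]

answers = []
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    answers.append(response.choices[0].message.content)

print(answers)
```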

Analysis of these data revealed that the performance of these LLMs presented behavioral profiles that could be classified into three categories.

Older models were simply incapable of answering questions in a meaningful way.

Intermediate models answered questions, but often engaged in intuitive reasoning that led them to make errors, such as those found in humans. They seemed to favor “system 1”, described by the psychologist and Nobel laureate in economics Daniel Kahneman in his theory of the two modes of thinking.

ChatGPT. // Source: Numerama

In humans, system 1 is a fast, instinctive and emotional mode of reasoning, while system 2 is slower, more reflective and more logical. Although it is more prone to reasoning biases, system 1 tends to be favored because it is faster and less costly in energy than system 2.

Here is an example of the reasoning errors we tested, taken from the “Cognitive Reflection Test” (a worked solution follows the example):

  • Question asked: A bat and ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?
  • Intuitive response (“system 1”): $0.10;
  • Correct answer (“system 2”): $0.05.
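To see why $0.05 is the correct answer, the problem can be written as a small system of equations (a standard worked solution, not part of the test itself):

```latex
% b = price of the ball, B = price of the bat (in dollars)
\begin{align}
  B + b &= 1.10 \\
  B &= b + 1.00 \\
  (b + 1.00) + b &= 1.10 \\
  2b &= 0.10 \\
  b &= 0.05
\end{align}
```

The intuitive answer of $0.10 fails the check: it would make the bat cost $1.10 and the total $1.20.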

Finally, the very latest generation (ChatGPT and GPT-4) presented performances that surpassed those of human beings.

Our work therefore made it possible to identify a positive trajectory in the performance of LLMs, which could be conceived as a “developmental” or “evolutionary” trajectory where an individual or a species acquires more and more skills over time.

Models that can improve

We wondered whether it was possible to improve the models with “intermediate” performance (i.e., those that answered the questions but displayed cognitive biases). To do this, we “encouraged” them to approach the problem that had misled them in a more analytical way, which resulted in an increase in performance.

The simplest way to improve model performance is to ask the model to take a step back, by adding an instruction to “think step by step” before it answers the question. Another very effective approach is to show it an example of a correctly solved problem, which induces a form of rapid learning known as “one-shot” learning. Both techniques are sketched below.
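Here is an illustrative sketch of the two prompt variants; the exact wording and the solved example below are made up for illustration and may differ from the prompts actually used in the study.

```python
# The target question (from the Cognitive Reflection Test).
question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Variant 1: a "step by step" instruction appended to the question.
step_by_step_prompt = question + "\nLet's think step by step."

# Variant 2: a one-shot prompt, where a correctly solved example of the same
# kind of problem precedes the new question.
one_shot_prompt = (
    "Q: A pen and a notebook cost $2.20 in total. The notebook costs $2.00 "
    "more than the pen. How much does the pen cost?\n"
    "A: Let p be the pen's price: p + (p + 2.00) = 2.20, so p = $0.10.\n\n"
    "Q: " + question + "\nA:"
)

print(step_by_step_prompt)
print(one_shot_prompt)
```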

These results indicate once again that the performance of these models is not fixed but plastic: within the same model, seemingly neutral changes in context can modify performance, much as in humans, where framing and context effects (the tendency to be influenced by the way information is presented) are very widespread.

However, we also noted that the behavior of LLMs differs from that of humans on many points. On the one hand, among the dozen models tested, we struggled to find one capable of correctly approximating the rate of correct answers that human beings give to the same questions: in our experiments, the AI models’ results were either worse or better. On the other hand, looking in more detail at the questions asked, those that posed the most difficulty to humans were not necessarily perceived as the most difficult by the models.

These observations suggest that we cannot substitute human subjects for LLMs to understand human psychology, as some authors have suggested.

Finally, we also observed a fact that is rather worrying from the point of view of scientific reproducibility: testing ChatGPT and GPT-4 a few months apart, we found that their performance had changed, but not necessarily for the better.

This reflects the fact that OpenAI has slightly modified its models, without necessarily informing the scientific community. Working with proprietary models exposes research to such hazards. For this reason, we believe that future research (cognitive or otherwise) on LLMs should rely on open and transparent models to guarantee more control.


Stefano Palminteri, Researcher, Inserm

This article is republished from The Conversation under a Creative Commons license. Read the original article.


