When posed a logical puzzle that demands reasoning about the knowledge of others and about counterfactuals, large language models (LLMs) display a "distinctive and revealing pattern of failure," according to a bulletin from the Bank for International Settlements.
With ChatGPT capturing the public imagination and central banks around the world exploring the potential applications of LLMs, BIS has been testing their cognitive limits.
To do this, it quizzed GPT-4 with the well-known Cheryl’s birthday logic puzzle, finding that the LLM solved the puzzle flawlessly when presented with the original wording.
As the authors note, GPT-4 will have encountered the puzzle and its solution during its training. However, the model consistently failed when small incidental details - such as the names of the characters or the specific dates - were changed.
This, the BIS bulletin says, suggests a lack of true understanding of the underlying logic.
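For readers unfamiliar with the puzzle, its solution reduces to three rounds of elimination over the candidate dates, each round encoding what one character can infer from the other's statements. The sketch below is purely illustrative (it is not from the BIS bulletin) and uses the ten dates from the original 2015 version of the puzzle, whose answer is July 16:

```python
from collections import Counter

# The ten candidate dates from the original puzzle. Albert is told only
# the month, Bernard only the day.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def solve(dates):
    # Round 1: Albert says he doesn't know the date, but knows Bernard
    # doesn't either. So Albert's month cannot contain a day that is
    # unique across all candidates (otherwise Bernard might know).
    day_counts = Counter(d for _, d in dates)
    unique_days = {d for d, n in day_counts.items() if n == 1}
    bad_months = {m for m, d in dates if d in unique_days}
    dates = [(m, d) for m, d in dates if m not in bad_months]

    # Round 2: Bernard now knows the date, so his day must be unique
    # among the dates that survived round 1.
    day_counts = Counter(d for _, d in dates)
    dates = [(m, d) for m, d in dates if day_counts[d] == 1]

    # Round 3: Albert now knows too, so his month must have exactly one
    # remaining candidate.
    month_counts = Counter(m for m, _ in dates)
    return [(m, d) for m, d in dates if month_counts[m] == 1]

# solve(DATES) -> [("July", 16)]
```

The point the bulletin makes is that this elimination logic is invariant to the surface details: renaming the characters or shifting the dates leaves the reasoning unchanged, which is precisely the variation under which GPT-4's performance collapsed.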
BIS says that the findings do not detract from the progress in central bank applications of machine learning to data management, macro analysis and regulation.
"Nevertheless, our findings do suggest that caution should be exercised in deploying large language models in contexts that necessitate careful and rigorous economic reasoning.
"The evidence so far is that the current generation of LLMs falls short of the rigour and clarity in reasoning required for the high-stakes analyses needed for central banking applications."