Also, I think the authors used the API, and there may be differences between API and chatgpt.com behavior...
The system prompt may still make a difference though.
o3 in Chat is similarly wrong, saying {4}.