
Quite deep and informative. One thing to point out here is how they measure accuracy when testing their LLMs. Like most neural networks, LLMs occupy a high-dimensional space that is largely sparse. In simple terms, this means it's very easy to "exercise" the model in areas where it's weak. That doesn't mean the model lacks the potential to do a (much) better job if you "exercise" its space properly, which is exactly what techniques like CoT do.

LLMs are inherently limited and don't have "true" reasoning capabilities; beyond the marketing hype, this much should be obvious to most serious ML researchers today. The only real question is how well we can get them to "mimic" reasoning for practical applications. This is where "prompt engineering", strictly speaking, is a true form of engineering: it has to take the mathematical foundations of the models into account and figure out how to extract from them the best performance they can deliver.
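To make the CoT point concrete: in its simplest (zero-shot) form, the technique is nothing more than appending a reasoning cue to the prompt so the model emits intermediate steps before its final answer. A minimal sketch (the function names and example question are mine, not from the article, and no particular model API is assumed):

```python
def build_direct_prompt(question: str) -> str:
    # Baseline: ask for the answer directly, no reasoning cue.
    return f"Q: {question}\nA:"

def build_cot_prompt(question: str) -> str:
    # Zero-shot chain-of-thought: the appended cue steers generation
    # toward intermediate reasoning steps, "exercising" the model in
    # regions of its space where it performs better.
    return f"Q: {question}\nA: Let's think step by step."

question = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
print(build_direct_prompt(question))
print(build_cot_prompt(question))
```

The whole intervention is on the input side; the model weights are untouched, which is why this counts as engineering around the model's fixed mathematical structure rather than changing it.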



