To validate the choices and configurations, feel free to give it a read. We also break down our methodology in the blog and, in more depth, in the paper.
General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measures structured output end-to-end, which is one of the biggest use cases for LLMs.
Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.
> "don't use an LLM"
Partially agree. That's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting the high-quality structured output you would get from a CNN model like EasyOCR.
The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark lets us measure how well they handle them.
We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
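To make "measuring structured output" concrete, here is a minimal sketch of a field-level check. It is not the benchmark's actual scoring method; the function name `score_structured_output`, the exact-match rule, and the flat top-level comparison are all assumptions made for illustration.

```python
import json


def score_structured_output(raw: str, expected: dict) -> float:
    """Hypothetical scorer: fraction of expected top-level fields matched exactly.

    Returns 0.0 when the model output is not a valid JSON object.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0  # output was not parseable JSON at all
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object (e.g. a list or string)
    if not expected:
        return 1.0  # nothing to check
    # Count fields that are present and equal to the expected value.
    matches = sum(1 for key, value in expected.items()
                  if key in parsed and parsed[key] == value)
    return matches / len(expected)


# Illustrative data only: one of the two fields is correct.
expected = {"invoice_id": "INV-1", "total": 42.5}
model_output = '{"invoice_id": "INV-1", "total": 40.0}'
print(score_structured_output(model_output, expected))
```

A real harness would likely also handle nested fields, type coercion, and partial credit, but the idea is the same: score the extracted values against ground truth instead of only checking that the output parses.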