khurdula's comments

We've open-sourced all the code and test sets. You can find them here: https://interfaze.ai/blog/introducing-structured-output-benc...

To validate our choices and configurations, feel free to give it a read. We also break down our methodology in the blog and, in more depth, in the paper.


We've added opus 4.6 and 4.7 to our leaderboard; they perform very close to sonnet 4.6. Feel free to check out our updated blog again :D


hey! we've evaluated gpt 5.5 along with other frontier models. gemini and gemma models outperform it across all three modalities.

Open-source models like glm 4.7 still compete closely with the table toppers.


We've updated our leaderboard after also evaluating the frontier models gemini 3.1 pro, opus 4.6 & 4.7, glm 5.1, deepseek v4, and Kimi K2.6.


We're updating our leaderboard with these model scores; it should be out soon :D


We do love Qwen! It can be an easy choice if you're confused looking at this leaderboard.


Yep, we will be adding it soon as well.


Due to high demand, we're adding it soon!


General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measure structured output end-to-end, which is one of the biggest use cases for LLMs.

Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.

> "don't use an LLM"

Partially agree. That's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.

The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark now lets us measure that.
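
For anyone wondering what "measuring structured output end-to-end" can look like in practice, here is a rough sketch, not the benchmark's actual scoring code. It assumes a made-up `SCHEMA` and a hypothetical `score_output` helper: parse the model's raw text as JSON, check it against the schema, then compare field values with the expected answer.

```python
import json

# Hypothetical schema for illustration only: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "paid": bool}

def score_output(raw: str, expected: dict) -> dict:
    """Score one raw model response against SCHEMA and the expected values."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Output that does not parse gets no credit at all.
        return {"valid_json": False, "schema_ok": False, "value_accuracy": 0.0}
    if not isinstance(parsed, dict):
        return {"valid_json": True, "schema_ok": False, "value_accuracy": 0.0}

    # Schema check: exactly the declared keys, each with the declared type.
    schema_ok = set(parsed) == set(SCHEMA) and all(
        isinstance(parsed[key], typ) for key, typ in SCHEMA.items()
    )
    # Value check: fraction of expected fields reproduced exactly.
    correct = sum(1 for key in expected if parsed.get(key) == expected[key])
    return {
        "valid_json": True,
        "schema_ok": schema_ok,
        "value_accuracy": correct / len(expected),
    }

expected = {"invoice_id": "A-17", "total": 12.5, "paid": True}
result = score_output('{"invoice_id": "A-17", "total": 12.5, "paid": false}', expected)
print(result)
```

The point of splitting the score this way is that a response can be perfectly valid JSON that matches the schema and still carry a wrong value, which is the failure mode a parse-only check would miss.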


We saw that structured decoding didn't make a difference in the quality of the output.

Check out the paper, section "6.3 Structured Decoding Ablation".

Paper: https://arxiv.org/pdf/2604.25359

We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
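
To make the two terms concrete, here is a toy single-step sketch, not the paper's setup. It assumes "structured decoding" means restricting each step to tokens a grammar or schema allows, and the function names and the tiny hand-written logits are hypothetical.

```python
def greedy_step(logits: dict[str, float]) -> str:
    """Plain greedy decoding: pick the highest-scoring token."""
    return max(logits, key=logits.get)

def constrained_step(logits: dict[str, float], allowed: set[str]) -> str:
    """Constrained decoding: drop tokens the grammar forbids, then pick greedily."""
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("no allowed token is present in the vocabulary")
    return max(masked, key=masked.get)

# Tokens a JSON grammar could accept at the first position (toy assumption).
allowed_first_tokens = {"{", "["}

# Case 1: the model already prefers a valid token, so both strategies agree.
confident = {"{": 2.1, "Sure": 1.3, "[": 0.2}
print(greedy_step(confident), constrained_step(confident, allowed_first_tokens))

# Case 2: the model prefers chatty text, so only the constraint rescues the format.
chatty = {"Sure": 3.0, "{": 2.0, "[": 0.1}
print(greedy_step(chatty), constrained_step(chatty, allowed_first_tokens))
```

When a model already puts its top score on a format-valid token, as in the first case, the mask changes nothing; the two strategies only diverge in situations like the second case.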

