
khurdula · 2026-05-05T22:33:34 1778020414

We've open-sourced all the code and test sets. You can find them here: https://interfaze.ai/blog/introducing-structured-output-benc...

To validate our choices and configurations, feel free to give it a read. We also break down our methodology in the blog, and in more depth in the paper.

khurdula · 2026-05-01T23:29:47 1777678187

We've added Opus 4.6 and 4.7 to our leaderboard; they perform very close to Sonnet 4.6. Feel free to check out our updated blog again :D

khurdula · 2026-05-01T23:28:34 1777678114

Hey! We've evaluated GPT 5.5 as well, along with other frontier models. Gemini and Gemma models outperform it across all three modalities.

Open-source models like GLM 4.7 still compete closely with the table toppers.

khurdula · 2026-05-01T23:25:55 1777677955

We've updated our leaderboard, having now evaluated the frontier models Gemini 3.1 Pro, Opus 4.6 & 4.7, GLM 5.1, DeepSeek V4, and Kimi K2.6 as well.

khurdula · 2026-04-30T16:04:30 1777565070

We're updating our leaderboard with these model scores; it should be out soon :D

khurdula · 2026-04-30T15:57:05 1777564625

We do love Qwen! It can be an easy choice if the leaderboard leaves you unsure which model to pick.

khurdula · 2026-04-29T20:15:48 1777493748

Yep, we will be adding it soon as well.

khurdula · 2026-04-29T18:55:36 1777488936

Due to high demand, we're adding it soon!

khurdula · 2026-04-29T18:54:43 1777488883

General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measure structured output end-to-end, which is one of the biggest use cases for LLMs.

Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.

> "don't use an LLM"

Partially agree, that's what we're building towards at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output like you would get from a CNN model like EasyOCR.

The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark lets us measure that now.
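To make "structured output" concrete, here is a minimal sketch of the kind of check a developer might run on a model's response: parse it as JSON and compare it against an expected schema. The schema, field names, and the `check_structured_output` helper are hypothetical and only for illustration; this is not the benchmark's scoring code.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
INVOICE_SCHEMA = {"vendor": str, "total": float, "currency": str}


def check_structured_output(raw: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    # Every schema field must be present with the expected type.
    for field, expected in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}")
    # Fields the schema does not define are flagged as well.
    for field in data:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


if __name__ == "__main__":
    reply = '{"vendor": "Acme", "total": "12.5"}'
    print(check_structured_output(reply, INVOICE_SCHEMA))
```

A check like this only covers the shape of the output. Whether the values themselves match the unstructured input is a separate question.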

khurdula · 2026-04-29T18:41:42 1777488102

We saw that structured decoding didn't make a difference in the quality of the output.

Check out the paper section "6.3 Structured Decoding Ablation"

Paper: https://arxiv.org/pdf/2604.25359

We ran the comparison and saw no difference. Since some models don't support structured decoding, we used greedy decoding on all models to keep the bench consistent.
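For anyone unfamiliar with the two terms, here is a toy sketch of a single decoding step. It assumes structured decoding works by masking out tokens the schema would not allow before picking the next one, which is a common approach but not necessarily the setup used in the paper. The token scores and function names are made up for illustration.

```python
def greedy_pick(logits: dict[str, float]) -> str:
    """Plain greedy decoding: take the highest-scoring token."""
    return max(logits, key=logits.get)


def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Greedy decoding restricted to tokens the schema allows at this step."""
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    if not masked:
        raise ValueError("no allowed token has a score")
    return max(masked, key=masked.get)


if __name__ == "__main__":
    # Made-up scores for the first token of a reply that should be JSON.
    step_logits = {"Sure": 2.1, "{": 1.9, "[": 0.3}
    print(greedy_pick(step_logits))
    print(constrained_pick(step_logits, allowed={"{", "["}))
```

In this toy case the unconstrained pick starts a chatty reply while the constrained one opens a JSON object. The ablation in section 6.3 is what measures whether that difference matters in practice.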