I don't know what you're talking about; I replaced an older GPT-4o with a fine-tuned Qwen. There is a huge amount of "AI" work that can be done with those models, or partly by those models. A huge number of people would not notice the difference, and if you prepare the context correctly, an even bigger slice would not notice.
If it helps, I mean it in a really literal sense. qwen3.6 27b is currently $3.20 per million tokens on OpenRouter, which is way overpriced. As good as the 27b is, Kimi K2.5 is $3.00 and it's just in another league in terms of capability. There's no reason to spend money on the 27b at that price.
And even Alibaba's own qwen3.6-plus is $1.95, so it's easy to conclude that neither Alibaba nor anyone else is really interested in hosting that model.
And don't get me wrong, I fully agree with you, qwen3.6 27b is an amazing model. I run it on my own hardware and I'm constantly surprised by what it can zero-shot.
Genuinely curious, what are you "fine tuning" these smaller models to do reliably? I hear this talked about a lot, but very few people actually cough up examples, and I'd love to hear of one.
It depends. One is a super small model fine-tuned to do function calling, instead of sending the request to a big model and waiting: you ask for revenue in the last month, the small LLM emits a function call -> show results. Some bigger ones handle analysis, summarization, classification.
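A minimal sketch of that routing, assuming the small model sits behind vLLM's OpenAI-compatible server; the endpoint, the model name, and the get_revenue tool are all hypothetical placeholders, not anything from the original comment:

```python
# Sketch: route a natural-language question to a small fine-tuned model
# that emits a function call, then execute that call locally.
# Assumptions (all hypothetical): a vLLM OpenAI-compatible server on
# localhost:8000, a fine-tune named "my-qwen-2b-fncall", and get_revenue()
# standing in for your real data backend.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_revenue",
        "description": "Return total revenue for a given period.",
        "parameters": {
            "type": "object",
            "properties": {
                "period": {"type": "string", "enum": ["last_month", "last_quarter"]},
            },
            "required": ["period"],
        },
    },
}]

def get_revenue(period: str) -> float:
    return 42_000.0  # placeholder: query your real data store here

resp = client.chat.completions.create(
    model="my-qwen-2b-fncall",
    messages=[{"role": "user", "content": "What was our revenue last month?"}],
    tools=tools,
)

# The small model's whole job is to produce this structured call quickly.
call = resp.choices[0].message.tool_calls[0]
if call.function.name == "get_revenue":
    args = json.loads(call.function.arguments)
    print(get_revenue(**args))
```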
What is great with the smaller ones (I'm looking at 2B, 4B) is that you can get huge throughput with just vLLM and a couple of consumer GPUs.
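For illustration, batched offline inference with vLLM looks roughly like this; the model name is illustrative, and tensor_parallel_size=2 assumes two GPUs:

```python
# Sketch: batched offline inference with vLLM across two consumer GPUs.
# The model name is illustrative; substitute your own small fine-tune.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = [f"Classify the sentiment of review #{i}: ..." for i in range(10_000)]

# vLLM batches and schedules all of these internally; this is where the
# throughput win over one-request-at-a-time serving comes from.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```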
What I usually do is basically distillation of a big model onto a smaller one.
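In its simplest form that is sequence-level distillation: sample the teacher, then run ordinary SFT on the student over those outputs. A sketch of the data-generation half, with the teacher name, prompts, and output path as placeholders:

```python
# Sketch: the data-generation half of sequence-level distillation.
# Sample completions from the big "teacher" model and save prompt/response
# pairs as JSONL; the small "student" is then fine-tuned on that file with
# an ordinary SFT stack. Teacher name and task prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # teacher behind any OpenAI-compatible endpoint

prompts = ["Summarize: ...", "Classify: ..."]  # your real task inputs

with open("distill_sft.jsonl", "w") as f:
    for p in prompts:
        resp = client.chat.completions.create(
            model="big-teacher-model",  # placeholder
            messages=[{"role": "user", "content": p}],
        )
        pair = {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": resp.choices[0].message.content},
        ]}
        f.write(json.dumps(pair) + "\n")
```

The student then trains on distill_sft.jsonl with whatever SFT tooling you already use (HF TRL's SFTTrainer, for example).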
Nice, I'll run it later against qwen3.6 27b. Speed was one of the reasons I was running Qwen and not Gemma, and the difference was big. There is some magic that happens when you get more than 100 tps.
I was part of the beta; it's a properly good model. In some sense I forgot that I'm not on Opus or GPT. Opus is still better. GPT is the one struggling for me: it has a niche in backend work, but you can get the same from Opus with skills, and it's lacking in almost all other areas.
I have GLM and Kimi. Kimi was better in most cases and my replacement for Claude when I run out of tokens. Now I'm finding myself using GLM more than Kimi. It's funny that GLM vs Kimi is like Codex vs Claude: GLM and Codex are better for backend, and Kimi and Claude more for frontend.
Since Kimi did a huge amount of Claude distillation, that pattern seems to be somewhat grounded in the data.
Yeah, it seems they did not align it too much, at least for now. Yesterday it helped me bypass the bot detection on a local marketplace that I wanted to scrape some listings from for my personal alerting system. All the others failed, but GLM 5.1 found a set of parameters and tweaks to make my browser-in-a-container not be detected.
I always jump on the Chinese models when I'm trying to do something that the US ones chastise me for; they're a little more liberal, especially around copyright.
When it works and it's not slow, it can impress. Yesterday it solved something that Kimi K2.5 could not, and Kimi was the best open-source model for me. But it's still slow sometimes. I have Z.ai and Kimi subscriptions for when I run out of tokens on Claude (Max) and Codex (Plus).
I have a feeling it's nearing Opus 4.5 level, if they could fix it going crazy after ~100k tokens.
Why don't you start a new session or use the /compact command when context gets to 100k tokens?
From my testing it was OK up to 145k tokens, which was the largest context I had before switching to a new session. I think Z.ai officially said it should be good up to 200k tokens.
Using it in Open Code, the context gets compacted automatically when it grows too large.