
There are quite a few comments here about benchmark and coding performance. I would like to offer some opinions regarding its capacity for mathematics problems in an active research setting.

I have a collection of novel probability and statistics problems at the master's and PhD level with varying degrees of feasibility. My test suite involves running these problems through the model first (often with about 2-6 papers for context) and then requesting a rigorous proof as a follow-up. Since the problems are pretty tough, there is no quantitative measure of performance here; I'm just judging based on how useful the output is toward outlining a solution that would hopefully become publishable.

Just prior to this model, Gemini led the pack, with GPT-5 as a close second. No other model came anywhere near these two (no, not even Claude). Gemini would sometimes have incredible insight on some of the harder problems (insightful guesses on relevant procedures are often most useful in research), but both of them tend to struggle with outlining a concrete proof in a single follow-up prompt. This DeepSeek V4 Pro with max thinking does remarkably well here. I'm not seeing the same level of insight in the first response as Gemini (closer to GPT-5), but it often gets much better in the follow-up, and the proofs can be _very_ impressive; nearly complete in several cases.

Given that both Gemini and DeepSeek also seem to lead on token throughput, I'm guessing that might play a role in their capacity for these types of problems. It's probably more a matter of just how far they can get within a sensible computational budget.

Despite what the benchmarks seem to show, this feels like a huge step up for open-weight models. Bravo to the DeepSeek team!




They have had the best math models for about a year; most folks just didn't know about it. You can't find inference for them on APIs, but I run these at home, which is also the advantage of open models.

https://huggingface.co/deepseek-ai/DeepSeek-Math-V2
https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B


You are of course specifically referring to the math-optimised models, not the chat ones folks would generally encounter. Not that I'm trying to contradict you; your point is super valid and I agree with you! I'm just supplementing for anyone following along who may be making choices.

This is when it happened for anyone interested: https://binaryverseai.com/deepseek-math-v2-benchmarks-review...


Shouldn't one use, e.g., a Wolfram Alpha MCP endpoint for math in AI? From what I've seen, even on premium non-quantized models, I would never ever trust the innate ability of an LLM to calculate.
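
Something like this is all it takes to expose exact arithmetic as a tool the model can call (a minimal sketch using the official MCP Python SDK; the tool below is just Python's arbitrary-precision integers standing in for a real Wolfram Alpha endpoint, and the server name is my own):

    # pip install mcp
    # Minimal MCP server exposing exact arithmetic as a callable tool.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("exact-math")  # hypothetical server name

    @mcp.tool()
    def multiply(a: int, b: int) -> int:
        """Multiply two integers exactly (arbitrary precision)."""
        return a * b

    if __name__ == "__main__":
        mcp.run()  # serves over stdio by default

Point the client at that instead of trusting the sampler, and the arithmetic stops being probabilistic.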

You run a 671B model at home?

Yes, and plenty of others do too. Quantized. Join us at r/localllama

My largest models:

   318G    /llmzoo/models/Qwen3.5-397B
   377G    DeepSeekv3.2-nolight
   380G    /llmzoo/models/DeepSeek-V3.2-UD
   400G    /llmzoo/models/Qwen3.5-397B-Q8
   443G    DeepSeek-Math-v2
   443G    DeepSeek-V3-0324-Q5
   522G    /llmzoo/models/GLM5.1
   545G    /llmzoo/models/kimi2.6
   546G    /llmzoo/models/KimiK2.5
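
For anyone wanting to try: loading one of these is a few lines with the llama-cpp-python bindings (a sketch; the file name and layer split are illustrative, so tune n_gpu_layers to whatever fits your VRAM and let the rest run from system RAM):

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="/llmzoo/models/DeepSeek-V3.2-UD/q4_k_m.gguf",  # hypothetical file
        n_ctx=32768,       # context window
        n_gpu_layers=20,   # offload what fits on the GPU
    )
    out = llm("State and prove the weak law of large numbers.", max_tokens=512)
    print(out["choices"][0]["text"])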

Is your house's heating system based on H100s?

What hardware do you use?

I think the answer to this is: "yes".

a Beowulf cluster of 256 x Raspberry Pi 3.

I used to maintain a 2,000-node Pi 4 cluster, before LLMs were relevant, with around 6 GB of free RAM per node. I wonder what I could have done with something like this.

All of it.

even quantised, those are HUGE

It's a big house.

Maybe if there was a 1-bit quant.

Apple was briefly selling the Mac Studio with 512 GB of unified RAM, meaning all of that was available as VRAM.
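
Back of the envelope on why 512 GB is the interesting number (a rough sketch that ignores KV cache and runtime overhead):

    # approximate weight footprint of a 671B-parameter model by quant width
    params = 671e9
    for bits in (1, 4, 5, 8):
        print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.0f} GB")
    # 1-bit: ~84 GB, 4-bit: ~336 GB, 5-bit: ~419 GB, 8-bit: ~671 GB

So a 4- or 5-bit quant squeezes into 512 GB of unified memory, which matches the file sizes in the listing above.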

Vertex AI has had DeepSeek available via API for a while.

I'm talking about their specialized math models, not the general model.

When you say "Gemini", which exact model do you mean? You know there are several, and they vary a lot in how capable they are: Pro 3.1 Preview, 2.5 Pro (their latest non-preview Pro model), Flash 3 Preview, ...

Same with GPT-5: the latest 5.5, the prior 5.4, or the original 5.0?

You can't talk about model performance without specifying the exact model.


My apologies, I thought it would be implicit that I am using the top-tier model of the time, given the difficulty of the tasks. GPT-5.5 was too new when I wrote this top comment (although I did test it a bit in a comment below), so I was using GPT-5.4. Gemini is Pro 3.1 Preview.

My bet is on 3.1 Pro. I use it a lot for math and classic engineering; it's very strong.

I reviewed how DeepSeek V4-Pro, Kimi K2.6, Opus 4.6, and Opus 4.7 perform across the same AI benchmarks. All results are for the Max editions, except for Kimi.

Summary: Opus 4.6 forms the baseline all three are trying to beat. DeepSeek V4-Pro roughly matches it across the board, Kimi K2.6 edges it on agentic/coding benchmarks, and Opus 4.7 surpasses it on nearly everything except web search.

DeepSeek V4-Pro Max shines in competitive coding benchmarks. However, it trails both Opus models on software engineering. Kimi K2.6 is remarkably competitive as an open-weight model. Its main weakness is in pure reasoning (GPQA, HMMT) where it trails Opus.

Speculation: The DeepSeek team wanted to come out with a model that surpassed proprietary ones. However, OpenAI dropped 5.4 and 5.5 and Anthropic released Opus 4.6 and 4.7. So they chose to just release V4 and iterate on it.

Basis for speculation? (i) The original reported timeline for the model was February. (ii) Their Hugging Face model card starts with "We present a preview version of DeepSeek-V4 series". (iii) V4 isn't multimodal yet (unlike the others) and their technical report states "We are also working on incorporating multimodal capabilities to our models."


I feel like people suck at prompting Opus. Baseline, it's pretty much on par with GPT-5.5.

But if you prompt it well - give it the reasoning behind why you're asking it to do something - it pulls far ahead.


That's fine for procedural tasks, and I understand its value there. But these particular tasks I'm referring to occur on the front lines of research. You can't expect the prompts to be incredibly detailed, since those details are the whole challenge of the problem. I think there is value in having models that are capable of making really good preliminary insights to help guide the research.

really depends on your area of research

I really wanted to get excited about Opus, but in my own real-world usage I wasn't getting much out of it before hitting my limits. Meanwhile, I can abuse Codex on 5.5 for hours, getting a whole lot of work done. Plus, open code and PI are much more fun and interesting harnesses to work from than Claude Code, IMHO.

I will, however, say that Claude's work and design are really great, up until I blow its limit.


Would love to know how GLM 5.1 stacks up in this ranking. Seems like it's on par with Kimi K2.6.

I'd be interested to know when that Opus 4.6 baseline is from given their recent recognition of performance issues. Do you have a paper posted on this review?

Ack. I took the benchmark results that the AI labs themselves published for their models. So the Opus 4.6 baseline would be from the time Anthropic released the model.

Wondering how GPT-5.5 is doing in your tests. Happy to hear that DeepSeek performs well for you, because my experience seems to correlate with yours on the coding problems I am working on. Claude doesn't seem to be so good if you stray away from writing HTTP handlers (the modern web-app stack in its various incarnations).

Very cool to hear there is agreement with (probably quite challenging?) coding problems as well.

Just ran a couple of them through GPT-5.5, but this is a single attempt, so take any of this with a grain of salt. I'm on the Plus tier with memory off, so each chat should have no memory of any other attempt (the same goes for the other models too).

It seems to be getting more of the impressive insights that Gemini got and doing so much faster, but I'm having a really hard time getting it to spit out a proper lengthy proof in a single prompt, as it loves its "summaries". For the random matrix theory problems, it also doesn't seem to adhere to the notation used in the documents I give it, which is a bit weird. My general impression at the moment is that it is probably on par with Gemini for the important stuff, and both are a bit better than DeepSeek.

I can't stress enough how much better these three models are than everything else, though (at least on my type of math problems). Claude can't get anything nontrivial on any of the problems within ten (!!) minutes of thinking, so I have to shut it off before I run into usage limits. I have colleagues who love using Claude for tiny lemmas and such, so your mileage may vary, but it seems pretty bad at the hard stuff. Kimi and GLM are so vague as to be useless.


My work is on a p2p database with quite weird constraints and complex, emergent interactions between peers, so it's more a system-design problem than a coding one. ChatGPT 5.x has been helping me close the loop slowly, while Opus helped me a lot initially but later missed many of the important details, leading to going in circles to some degree. It remains to be seen whether this whole endeavour will be successful with the current class of models.

Do you have an idea of how well these models perform on set theory problems, or in more niche fields of mathematics? So the model would have to both understand a paper that's not in its training data and use it to write proofs.

This is all fairly niche stuff I'm trying it on (well, the first three problems anyway), so yes, it needs me to give it several papers that are not in its training data and use them to write proofs. I would expect my experiences to transfer to set theory problems as well.

Doesn't the Plus tier not have access to their best (Pro) model?

Very interesting. I wonder how much of this is due to context length. I'm unclear on the implementation strategy: did you run each problem one-shot in chat mode, or in an agent harness?

It has nothing to do with context length; they have experience training math models. They have a model that would take gold at the IMO, and a Lean prover. Both have been out for almost a year.

> there is no quantitative measure of performance here

Have them do multiplication or other complicated arithmetic. You'd say that isn't difficult, so why do they burn 200k tokens in 20 minutes without converging? I did a deep exploration to help myself understand here [0].

[0] https://adamsohn.com/reliably-incorrect/
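
The failure is trivial to reproduce (a minimal sketch; the operands are arbitrary and `claimed` is a hypothetical model answer): exact integer arithmetic is a one-liner for a tool, so any drift in the model's output is immediately visible.

    # compare a model's arithmetic claim against exact integer math
    a = 748_239_117
    b = 992_014_583
    exact = a * b                      # Python ints are arbitrary precision
    claimed = 742_264_211_600_000_000  # hypothetical model output
    print("match" if claimed == exact else f"off by {abs(claimed - exact):,}")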


Yes, DeepSeek can really help save money.

Have you also tried the Pro versions of ChatGPT and Gemini (Deep Think)?

Yes to both, I'm paying for them and use the top-tier thinking models.

Curious to know what kind of problems you are talking about here

I don't want to give away too much due to anonymity reasons, but the problems are generally in the following areas (in order from hardest to easiest):

- One problem on using quantum mechanics and C*-algebra techniques for non-Markovian stochastic processes. The interchange between the physics and probability languages often trips the models up, so pretty much everything tends to fail here.

- Three problems in random matrix theory and free probability; these require strong combinatorial skills and a good understanding of novel definitions, requiring multiple papers for context.

- One problem in saddle-point approximation; I've just recently put together a manuscript for this one with a master's student, so it isn't trivial either, but it does not require as much insight.

- One problem pertaining to bounds on integral probability metrics for time-series modelling.


Regarding the first problem: are you looking at NCP maps for non-Markovian processes given you mention C*-algebra? Or is it more of a continuous weak monitoring of a stochastic system that results in dynamics with memory effects?

I'd be very curious to know how any LLMs fare. I completely understand if you don't want to continue the discussion because of anonymity reasons.


More of the latter. It's a pet project of mine, and all of the LLMs tend to utterly fail at getting anywhere with it, at least in chats. In an agentic setup, it can chip away at some aspects, but it needs serious guidance on relevant language, notation, and concepts. To me, it demonstrates that the LLMs are not particularly good at crossing literatures, but then again, humans rarely seem to be good at that either...

It would be wonderful to have a deeper insight, but I understand that you can't disclose your identity. (I understand that you work in an applied research field, right?)

Yes, I do mostly applied work, but I come from a background in pure probability so I sometimes dabble in the fundamental stuff when the mood strikes.

Happy to try to answer more specific questions if anyone has any, but yes, these are among my active research projects so there's only so much I can say.


Thanks a lot for your kind and detailed answer. I'm no longer in the research field, but you gave me good ideas to work on.

Any plans to publish the benchmark results?

I have plans to publish the problems, but no plans to publish how well the LLMs perform on them. The standard for publishing benchmarks is very high, and I'm really just posting vibes here. Still, I hope my experiences are useful to some people, as others' experiences have been useful to me.


