m101's comments

I was having a back-and-forth with Claude over a somewhat controversial topic, and it kept misinterpreting my questions. It was like speaking to a motivated reasoner who misread the 3 important words because the 10 others gave it cognitive dissonance.

Eventually I cracked it and it said this:

“I treated the subject as denial-adjacent and reflexively re-asserted the obvious, which means I was answering an imaginary opponent instead of you.”


Is this why online forums like Reddit are dying? Because people are moving their time-wasting arguing with the void to arguing with an AI? This is really bizarre to me.

My experience of Reddit forums is extremely poor. I admit to sometimes wanting to see if I can crack the AI on something, but I mostly use it like a search engine for topics I'm not familiar with rather than to speak to/debate.

I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT 5.5 xhigh to code up the scenario, and looping over it to check with Opus 4.8.

Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:

- all intermediaries were given the prices of all buyers up front

- private price information in certain auction types was actually being broadcast to everyone

- multiple contradictions in instructions

If it was any one of these things then I might have understood - but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common sense type thing that I think you only get to notice when your task doesn't involve a measurable metric, but rather some sort of real world fuzzy task.

There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.


Unless you're coming up with a deterministic set of criteria for evaluating these bugs and issues, every single model is going to keep telling you it finds new things and to fix them.

I'm sure you said the same "find mistakes please" thing to Opus 4.8 and GPT 5.5 when you were using $previous_amazing_latest_model, and they also found and fixed them.

Once the next "Fable"-type model comes out I'm sure it's going to find even more mistakes that the "special" Fable made.

You're using these models to make mistakes and then using upgraded versions of them to find their previous mistakes and fix them, until a new version comes along that can magically fix even more mistakes their previous versions made. There's no end to it.
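To make "a deterministic set of criteria" concrete, here is a minimal sketch of one such check for the auction scenario discussed above: no agent should be able to see another agent's private price. The `Agent` structure, the `visible_text` field and `find_price_leaks` are all hypothetical names invented for illustration; the real scenario format isn't shown in this thread, and the substring match is deliberately crude.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Agent:
    # Hypothetical scenario record: not taken from any real framework.
    name: str
    role: str  # "buyer", "seller" or "intermediary"
    private_price: Optional[float] = None
    visible_text: List[str] = field(default_factory=list)  # everything this agent is shown


def find_price_leaks(agents: List[Agent]) -> List[Tuple[str, str]]:
    """Return (viewer, owner) pairs where viewer's visible text contains owner's private price."""
    leaks = []
    for owner in agents:
        if owner.private_price is None:
            continue
        # Crude check: look for the price formatted to two decimals as a substring.
        token = f"{owner.private_price:.2f}"
        for viewer in agents:
            if viewer is owner:
                continue
            if any(token in text for text in viewer.visible_text):
                leaks.append((viewer.name, owner.name))
    return leaks


agents = [
    Agent("alice", "buyer", 120.00, ["Your maximum price is 120.00."]),
    Agent("bob", "buyer", 95.50, ["Your maximum price is 95.50."]),
    Agent("broker", "intermediary", None,
          ["Buyer alice will pay up to 120.00; negotiate for bob."]),
]

leaks = find_price_leaks(agents)
for viewer, owner in leaks:
    print(f"LEAK: {viewer} can see {owner}'s private price")
```

A check like this gives the same answer on every run, so it can sit in a test suite and fail the build, rather than depending on whichever model happens to review the scenario next.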


Yes - I was thinking this - however I had already worked on it so many times with Opus and GPT that I thought they had had enough time to realise some common sense things that Fable just got and understood first time, on the first pass. The difference seemed significant enough to comment about.

Maybe you are something special by letting those slip through in the first place?..

The point is that there's a difference in these models and everyone is looking for where the differences are. Stop being an arse.

GP literally caught them?

Prompt: can you reformat your sentence to be less unkind?

This conversation is about capabilities of Fable 5 vs. older models, not about the GP's abilities.

It's just much more thorough and spins up a lot of subagents to basically do a lot more E2E testing. Not necessarily smarter, imo you could get the same result with a lesser model by procedurally prompting, but a lot more compute and orchestration.

I had to specifically tell Fable not to use a bunch of subagents in order to preserve my token allowance.

This seems like the exact project you should try out Codex Security for. It catches a lot of stuff:

https://chatgpt.com/codex/cloud/security/


> ... and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through

Wait... Are you telling me models everybody told me were better than coders up to just one month ago are actually making lots of mistakes?

This is shocking.


The solution to all these expenses is to just have the user pay the transaction costs. Then everyone will start using bank transfers.

> solution to all these expenses is to just have the user pay the transaction costs

Then I offer an all-in price and take your customers.


Except your prices would have to be higher. Costco works on <0.5% margin, whilst these card fees are 1, 2, 3%+

The issue with bank transfers today is that the SEPA system is robust and established, but has no web-compatible API.

But there are two projects (why have one, if you can have two!?): Wero, run by a group of banks, and the Digital Euro from the European Central Bank. If either finds good adoption (Wero is rolling out slowly, and at quite a few banks every customer already has a Wero account automatically) this could move things around ...


I'm Irish, but I've built a website for an Australian client and they integrated something which did that. In the checkout, you could choose to pay with a system which would log you into your bank's website, where you could approve a payment, then return to the site on which you'd made your purchase, where it would instantly be marked as paid. I think that it may have taken a few days for the money to actually arrive in their bank account, but the payment was authorised instantly.

This stuff is very popular in the Baltics: there are many payment options, and banks provide the necessary connections for users to complete payments with 2FA. Not to mention crypto. E.g. check out varle.lt as an example of an online retailer; the options are sort of normal and expected.

This is POLi and it's a massive security risk that they have everyone's bank passwords.

It could also be something PSD2 based. You should be able to create payment using PSD2, but the client still has to approve it inside their bank app.

They said Australian - it's POLi.

It was indeed POLi. Strange that it's using actual bank passwords!

I believe the usual SEPA flow is either scan this QR code or type this IBAN+reference into your bank's mobile app? SEPA is a "giro" system, meaning the person who owns the money has to push it, rather than a cheque system where the money owner writes something to the merchant who then pulls money from the money owner. These are always less convenient because the money owner has to contact their bank. They're also more secure.

That would seem like a logical solution. So wouldn't it be convenient for the expensive payment methods if legalities prevented merchants from charging higher fees to customers using them?

That’s exactly the system the card companies try to impose on vendors (mostly successfully).

In the UK it’s the system the law imposes on everyone.


Indeed. It's a triumph of consumer protection laws failing to protect consumers. Merchants here have to set their prices a bit higher to compensate for the fees and you still have to pay those higher prices as a customer even if you're using a more efficient payment method. I will never understand why the law wasn't set the other way - requiring explicit disclosure of payment fees to end customers and prohibiting payment services from incorporating these kinds of anticompetitive terms in merchant agreements - so that everyone could make an informed choice and market pressures would push the transaction overheads down.

I would say it’s regulatory capture. Some others would call it incompetence. Probably both in the UK.

It might have been regulatory capture - though I have seen no specific evidence of that myself. It might simply have been the old story about a road and good intentions. At this point it doesn't really matter how it happened - it would be better if the situation were fixed in any case.

> The solution to all these expenses is to just have the user pay the transaction costs.

So the Ticketmaster model?


Ticketmaster is a monopoly.

Customers paying the price would: 1) induce scrutiny from the public on Visa and Mastercard (the monopolies) 2) encourage the competitive market amongst issuers to compress prices


Interesting question, and I used AI for help on this one:

$ value of equity purchased in indices:

- total market cap of those 3: $3.6tn

- index inclusion weights is based on free float, not full market cap

- free floats ~5%

=> 5% * 3.6tn = 180bn of these stocks in MV weight in the index

$ value of index funds: $18tn

$ value of market cap that is tracked by these index funds: $57tn

=> index funds are 18/57 = 31.6% of the market value

=> 180bn * 31.6% = $57bn of stock included in the index funds

so $57bn in sales in other companies => 57bn/18tn = 0.32% of all other stocks sold

Now for the asymmetry here:

- 57bn in sales is about 7% of daily volume for all incumbents combined

- 57bn in purchase is about 15-30 days of volume for typical stocks (hence Elon's eagerness to get them included asap)
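For anyone who wants to rerun the arithmetic above with their own inputs, here is a minimal sketch. Every number is one of the rough estimates quoted in this comment, not verified market data, and `inclusion_flow` is just an illustrative helper name.

```python
TN = 1e12
BN = 1e9

# Rough estimates from the comment above (assumptions, not verified data).
combined_market_cap = 3.6 * TN   # total market cap of the three companies
free_float = 0.05                # ~5% free float
index_fund_assets = 18 * TN      # value held by index funds
tracked_market_cap = 57 * TN     # market cap tracked by those funds


def inclusion_flow(market_cap, float_fraction, fund_assets, tracked_cap):
    """Estimate how much index funds must buy, and what share of holdings they sell to fund it."""
    float_value = market_cap * float_fraction    # float-weighted value in the index
    fund_share = fund_assets / tracked_cap       # share of the market owned by index funds
    forced_purchase = float_value * fund_share   # stock the funds have to buy
    sold_fraction = forced_purchase / fund_assets  # fraction of existing holdings sold
    return float_value, fund_share, forced_purchase, sold_fraction


float_value, fund_share, forced_purchase, sold_fraction = inclusion_flow(
    combined_market_cap, free_float, index_fund_assets, tracked_market_cap
)

print(f"float-weighted value: ${float_value / BN:.0f}bn")
print(f"index fund share:     {fund_share:.1%}")
print(f"forced purchase:      ${forced_purchase / BN:.0f}bn")
print(f"share of funds sold:  {sold_fraction:.2%}")
```

The result is most sensitive to the free float assumption: doubling it to 10% doubles the forced purchase, while the selling stays spread thinly across everything else in the index.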


Companies rush to IPO because they think the price they are selling at is so high that it outweighs the painful nature of being a public company.

I spent a few weeks making this - it would be great to get some feedback on v1.

The versions of this that I have seen elsewhere are extremely unsatisfactory and this sort of problem just lends itself to being solved with AI.


Anthropic killing headless usage in their plans on June 15th pushed me to Codex. I heard there’s a tmux workaround though.


For those of you that use DeepSeek v4 occasionally, what harness do you use it with? I’m only familiar with Claude Code and Codex.

Any comments on what you can or cannot rely on it for relative to cc and codex would be appreciated too!


Maybe check out Goose. It is the standard agent harness being developed by The Linux Foundation under the AAIF. Under active development and the implementation seems to have a good leg up on the other popular agents.

https://github.com/aaif-goose/goose

https://goose-docs.ai/


I see their name mentioned everywhere along with Aider, presumably for being among the first agents, but I've never met anyone that actually uses them.


Check out pi.dev. OpenCode is a nice batteries-included Claude Code replacement, but I’m in love with the extensibility of Pi.


Any Pi extensions you'd specifically recommend? I'm just starting out with Pi, but I've had mixed results with extensions. I'm using Pi with gemma4 26b locally, so anything that's friendly to small local models would be appreciated. I think the only extension I'm using right now is pi-total-recall.


I think Pi wants you to write your own extensions, adapted to your needs.

I haven't had a need for any extensions though. Maybe subagents, but I solved that with tmux. For all the rest, I just use "skills".


This will be interesting. I can see some world where it’s used with consumers, but for the most part I think it will be in the cloud and that would make most sense


Is this true because training companies have not been training AI for both performance and brevity (or some other metric like that)? If this becomes a much more serious issue, surely they would adjust the training processes.

