
You’ve missed the point completely - if the important experiences are things built on top of foundation models, where the model itself is just an API call, then you don’t need to have a foundation model to build them, and the model is just commodity infra


Yes, but OpenAI has 900M+ user reach, plus staggering amounts of cash, plus early access + deep integration with the latest and greatest models. I hardly think that is tantamount to "just an API call".


Microsoft, Google, and Meta also had enormous reach, but that didn’t mean that they built everything on the Internet, nor could they.


Deep Research doesn’t give the numbers that are in statcounter and statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.


Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context, the LLM can reliably parrot that number back out again later on.
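(For concreteness, the kind of trivial round-trip I mean - a minimal sketch with the OpenAI Python client, where the model name and the figure are just made-up examples, not anything Deep Research actually runs:)

    # Toy RAG check: put a number in the context, ask for it back.
    # Assumes the openai package is installed and OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()
    # Made-up example figure, standing in for a retrieved document chunk.
    context = "Statcounter reports mobile's share of web traffic as 58.2%."

    response = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": "What is mobile's share of web traffic?"},
        ],
    )
    print(response.choices[0].message.content)  # comes back as "58.2%" essentially every time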

Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?


Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.

https://www.ben-evans.com/benedictevans/2025/1/the-problem-w...


I have a hunch that's a problem unique to the way ChatGPT web edition handles PDFs.

Claude gets that question right: https://claude.ai/share/7bafaeab-5c40-434f-b849-bc51ed03e85c

ChatGPT treats a PDF upload as a data extraction problem, where it first pulls out all of the embedded textual content in the PDF and feeds that into the model.

This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.

Claude and Gemini both apply their vision capabilities to PDF content, so they can "see" the data.
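(A rough sketch of the difference, assuming pypdf for text extraction and pdf2image plus a vision-capable model via the OpenAI client - the file name and question are placeholders, and this is my guess at the pipeline, not OpenAI's actual code:)

    # Approach 1: extract embedded text (what ChatGPT's PDF feature appears to do).
    # For a scanned PDF this yields little or nothing - the "text" is just pixels.
    from pypdf import PdfReader

    reader = PdfReader("scanned.pdf")
    embedded_text = "\n".join(page.extract_text() or "" for page in reader.pages)
    print(len(embedded_text))  # often ~0 for scans

    # Approach 2: render each page to an image and let a vision model read it
    # (roughly what Claude and Gemini do with PDF input).
    import base64
    from io import BytesIO

    from pdf2image import convert_from_path  # needs poppler installed
    from openai import OpenAI

    page_image = convert_from_path("scanned.pdf")[0]
    buf = BytesIO()
    page_image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What figure is in the table on this page?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)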

I talked about this problem here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....

So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.

That's a huge failure on OpenAI's part, but it's not illustrative of models being unable to interpret documents: it's illustrative of ChatGPT's PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).


Interesting, thanks. I think the higher-level problem is that 1: I have no way to know about this failure mode when using the product, and 2: I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.


Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.

This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.

[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]

It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants, etc.) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.


1: It's the TITLE of a 100-slide presentation. It's not the only thing it said, and it's a way to think about what was happening.

2: Mobile replaced the PC as the main way people use the internet and do their day-to-day computing. The consumer Internet runs on smartphone apps, not PCs. In 2013 a lot of people didn't understand that that was happening, so it was worth saying.


Well, I'm not trying to explain the state of the science and the engineering, but to work out what this means to everyone else. There are no products to analyse yet - which is part of the problem.


See slides 58 and 59 - this can take a while.

ChatGPT got to 100m users much faster than anything else because it's riding on all the infrastructure we already built in the last 20 years. To a consumer, it's 'just' a website, and you don't have to wait for telcos to build broadband networks or get everyone to buy a $600 smartphone.

But, most people go to the website and say 'well, that's very cool, but I don't know what I'd use it for'. It's very useful for coding and marketing, and a few general purposes, but it isn't - YET - very helpful for most of the things that most people do all day. A lot of the presentation is wondering about this.


Only OpenAI knows for sure, but so many non-tech people I know use ChatGPT as a sounding board for whatever. "My boyfriend sent me this text, how should I respond?" or "Teach me about investing." There are a bunch of people I know who don't use ChatGPT; I'm just surprised that people I didn't think would have a use for it have found it very useful.


How long is a while, and what is it that most people do all day?

A quick Google search for "most common job" came back with

    Cashier

    A cashier works in a retail environment and
    processes transactions for a customer's purchase.
I wouldn't be surprised if robots can do that on their own in 10 years.


Robots can already do that; they are used at large chains (McDonald's) all the time.

What they can't do is call the police when a hobo gets too wild, fix the inevitable bug in the process (by doing some fourth-level menu bypass), or handle the other random stuff that might pop up.

And when the robot can do all that, humans are no longer viable as economic entities and will be outcompeted.


The problem is, the robot has to know what I want it to do without me having to dictate it.

That's the beauty of human interaction: it can't be truncated down to even just pointing a finger.


I tried to capture this on the last slide before the conclusion - maybe all AI questions have one of two answers - "no-one knows" or "it will be the same as the last time"

this is one of the "no-one knows" questions


The question I'm asking isn't whether hallucinations can be fixed. It's what, if they are not fixed, are the economic consequences for the industry? How necessary is it that LLMs become trustworthy? How much valuation assumes that they will?


And is it even fixable?


The "hallucinations" problem feels to me like an inherent feature. For LLMs to have interesting output the temperature needs to be higher then zero. The whole system is interesting because it is probabilistic. "Hallucinations" (hate the word btw) are to LLMs as melting is to ice. There will be no 'meltless' ice because the melting is what makes it cold and useful.


Most of the presentation is saying that it isn't clear how this will work, that it will take a long time, and that it probably won't do everything.

Indeed, you would see that if you'd read even the first half-dozen slides ;)


I've decided not to read the article/slides because the title, in conjunction with the other titles on the page, sounded stupid to me.

My time is not free, sorry.


Thank you - I will add this to my file of people expressing strong opinions about things they haven’t read and know nothing about.


Americans generally still have to compile and file tax returns. In other countries that is often entirely automated.


He didn’t object to jobs. He objected to a woman having a job.


He objected to a woman wanting to have a job.


"Where are the numbers coming from?"

The numbers are sourced, both in the charts and in the text. You can find half a dozen more sources that say the same.

It's great for you that you're in the low-single-digit percentage of people that have already found use cases. Now, look at the data.

