
I understand the 'fun factor', but at this point I really wonder what this pelican still proves. I mean, providers certainly could have adapted for it if they wanted, and if you want to test how well a model adapts to potential out-of-distribution contexts, it might be more worthwhile to mix different animals with different activity types (a whale on a skateboard) than to always use the same one.


That's why I did the flamingo on a unicycle.

For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.


It is completely wild to me that you prefer Qwen's flamingo. I think it's really bad and Opus' is pretty good.


The Opus one doesn't even have a bowtie.


The Opus one looks like a flamingo, and looks like it's riding the unicycle. Sitting on the seat. Feet on the pedals.

The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.

But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.


Let's not oversell Opus' output. The Qwen flamingo is flawed but could be easily fixed with 1-2 prompts if you're really upset with it. The Opus SVG is not any better than something that I could make in Inkscape with 3 minutes and sufficient motivation. Calling Opus' flamingo "programmer art" would be an insult to programmers.


Game over opus


r/LocalLlama is now doing a horse in a racing car:

https://redd.it/1slz38i


To me the Opus flamingo is waaaay better than the Qwen one. Qwen has the better pelican, though.


Is a flamingo on a unicycle not merely a special case of a pelican on a bicycle?


If I (commercially) made models I’d put specific care into producing SVGs of various animals doing (riding) various things ... I find it interesting how confident you seem to be that they’re not.


Google Gemini featured a bunch of examples of exactly that in their release video for 3.1 Pro: https://x.com/JeffDean/status/2024525132266688757


This is a gag that's long outlived its humor, but we're in a space so driven by hype there are people who will unironically take some signal from it. They'll swear up and down they know it's for fun, but let a great pelican come out and see if they don't wave it as proof the model is great alongside their carwash test.


Consider reading the article, which addresses all of the points you raise.

It's directly stated in the post that the entire test is meant to be humorous, not taken seriously, only that it has vaguely followed model performance to date. The author also writes that this new result shows that trend has broken.


Yeah, I can imagine these popular benchmarks get special treatment in the training of new models. I wonder how they would perform for "Elephant riding a car" or "Lion sleeping in a bed".


They're certainly aware of the test, but a turtle doing a kickflip on a skateboard? I seriously doubt they train their models for that.

https://x.com/JeffDean/status/2024525132266688757

If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx.


I think I found the leaked Claude Mythos version of the turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk



