Genie: Best AI Software Engineer

pacomerh · on Aug 12, 2024

These tools remind me of the Dreamweaver era of solving problems no matter how the code looks

uzumak · on Aug 12, 2024

looks like they trained their model on SWE-bench and tried to submit https://github.com/swe-bench/experiments/pull/45

mronetwo · on Aug 12, 2024

Sorry for not discussing the product itself, but...

I'm just not seeing how a machine that is "likely correct" and constantly interrupts the "operator" is that much of a win. I have seen some software influencers reflect on how much more fun it is to code after dropping the LLM assistant.

All of these feel like offerings to the Productivity God. As a salary guy I'll never get excited that I can do more during my work day. It's already easy to hit my capacity.

henning · on Aug 12, 2024

So most of the time it still gets it wrong. And then when it gets it right it will still be subtly wrong. What a waste of electrical power and time.

difosfor · on Aug 12, 2024

God I hate pages that hijack scrolling..

Y_Y · on Aug 12, 2024

Any external verification of the benchmark results?

potatoman22 · on Aug 12, 2024

I'm skeptical. Partially because if you go to https://www.swebench.com/, you can see this company underreported results from their competitors like Amazon Q Developer. I've also seen plenty of other projects claim they've reached 30%+ on SWE-bench without verifying or posting their results on this site.

carls · on Aug 12, 2024

I skimmed the technical report: https://cosine.sh/blog/genie-technical-report

At the bottom, they noted the following:

> SWE-Bench has recently modified their submission requirements, now asking for the full working process of our AI model in addition to the final results - their condition to have us appear on the official leaderboard. This change poses a significant challenge for us, as our proprietary methodology is evident in these internal processes. Publicly sharing this information would essentially open-source our approach, undermining the competitive advantage we’ve worked hard to develop. For now, we’ve decided to keep our model’s internal workings confidential. However we’ve made the model’s final outputs publicly available on GitHub for independent verification. These outputs clearly demonstrate our model’s 30% success rate on the SWE-Bench tasks.

Their model outputs are here: https://github.com/CosineAI/experiments/tree/cos/swe-bench-s...

Y_Y · on Aug 12, 2024

> However we’ve made the model’s final outputs publicly available on GitHub for independent verification.

Sounds legit

ramon156 · on Aug 12, 2024

I thought this was going to be a blog post and just turned out to be a "use our product!" jumpscare. I'll gladly pass

Bjorkbat · on Aug 12, 2024

Something I'm kind of curious about is the degree to which eval performance might be due to parts of the SWE-bench dataset getting into the latest LLM models.

A while back someone on Twitter seemed to confirm that Claude-3.5 was aware of the GitHub issues inside the dataset by mentioning them, but I couldn't find the original post.

30% performance on the full SWE-bench benchmark is quite the leap, but just how "real" of an achievement is this? Anecdotal reports mention that GPT-4o is marginally better than GPT-4 turbo at best, and yet agents leveraging the LLM did perform better.

What would happen if SWE-bench was updated, top to bottom, with completely new GitHub issues? Would all these agents just completely shit the bed?

log101 · on Aug 12, 2024

“…a human reasoning lab”

closes the tab