Hacker News | mgrund's comments

I was under the impression that swe-bench (and I guess most other benchmarks) were supposed to be run offline?

I get that you may accidentally include something in local git history, but it feels off to me to run these kinds of benchmarks online.


The article has this to say:

> Blocking web access outright would indeed prevent this, but isn’t possible as many benchmarks do require network access to download resources and hit relevant APIs to solve the task - the example above requires a video download from YouTube. Even if this weren’t the case, searching the web for context is a vital agent capability, so blocking it would stray from the downstream agent experience we wish to measure.


swe-bench is a standardized evaluation suite, so that's why I'm asking - hopefully there are well-defined criteria on whether this is an open- or closed-book benchmark.

As I understand it, it is designed to evaluate the LM itself, not agentic systems with online access (which carry a very high likelihood of unintentional cheating/solution leaking). The paper and docs are not super clear on the concrete requirements (although reproducibility is emphasized, which argues against online access). So I was hoping someone with more familiarity would chip in.

Obviously not a problem for internal evaluations, but for fair scoreboard submissions it matters. It's not a matter of whether internet searches are useful, but rather what the benchmark is intended to benchmark.


Some, like TerminalBench-2.0, require web access for some tasks.

If agents are expected to use the web as a tool productively, which is a very useful SWE skill, they should be evaluated with that setting. Otherwise you risk behavior drift from the agent you are actually shipping.


I really really want to like local AI, but I highly doubt it will see wide adoption for a long time.

The additional up-front cost of hardware designed to run an LLM on top of normal workloads is unlikely to be accepted by most consumers.

The scale will be very constrained (like Apple's on-device models, which are small, heavily quantized, and have a small 4K-token context window). It's also terrible for battery life.

AI as it is implemented today is simply computationally expensive, and unless you put in dedicated hardware (like the ANE) for only this purpose - a large cost driver - I don't really see it getting large-scale adoption.

Companies will probably need a server-backed solution as a fallback if they want a reasonable user experience, so why even invest in diverse hardware support?
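To put rough numbers on "computationally expensive": a back-of-envelope sketch of weight memory at different quantization levels (the model sizes below are illustrative assumptions, not actual figures for Apple's models):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate RAM needed just for the weights of an LLM at a given
    quantization level (ignores KV cache, activations, and runtime overhead)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A hypothetical ~3B-parameter on-device model at 4-bit quantization
# needs roughly 1.5 GB of RAM for weights alone:
print(model_memory_gb(3, 4))   # 1.5

# The same model unquantized at 16-bit would need ~6 GB:
print(model_memory_gb(3, 16))  # 6.0
```

That weight memory has to coexist with the normal app working set, which is why heavy quantization and small context windows are pretty much forced on consumer devices.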


My thought exactly! First the usage limits + model limitations, and now a fundamental change to the billing. Hope some consumer watchdogs are looking into this!


There is but I don’t think this is it.

I’ve worked most of my career in US tech satellite offices, and I have not experienced EU team members being less productive than US team members, nor spending less time on work (if anything, more, since they also need to be available for US time-zone overlap).

It’s true there are chill jobs here, as there are in the US.

But ambitious people tend to work as much as ambitious US people (and it’s really more like 40-hour work weeks - 39.5 where I live, since lunch is not work time). But again, many are not really counting; it’s just a full-time job.

Vacations (typically 3 weeks of summer holiday plus additional weeks distributed over the year) do create longer stretches on skeleton crew. Skilled tech labour is also cheaper, so you can just hire more to make up for it.


Are you hiring internationally (EU specifically) or only US?


Transparent screens don’t make much sense for consumer TVs (I know the article indeed points to other use cases). You still need a black background to facilitate display of black content.


Couldn’t you just add an LCD layer that turns on when you turn the TV on, to act as a black substrate?


[W]ho wants to see their bookshelves showing through in the background while they’re watching Dune? That’s why the transparent OLED TV LG demonstrated at CES 2024 included a “contrast layer”—basically, a black cloth—that unrolls and covers the back of the display on demand.

From TFA.


Doesn't the screen just go black as needed?


More likely, crash looping of so many VMs overloaded some system with insufficient back pressure, possibly combined with unfortunate cluster-management scheduler behavior at this scale of crash looping (e.g. too eager to retry scheduling instances, maybe even on new hosts, which causes more infrastructure load).
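For what it's worth, the standard mitigation for that "too eager to retry" failure mode is exponential backoff with jitter, so a fleet of crash-looping instances doesn't hit the scheduler in lockstep. A minimal sketch (function name and parameters are illustrative, not from any real cluster manager):

```python
import random

def retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: the delay before retry number
    `attempt` is drawn uniformly from [0, min(cap, base * 2**attempt)],
    which both spaces retries out and desynchronizes them across hosts."""
    return random.uniform(0.0, min(cap, base * 2.0 ** attempt))

# Average delay grows per attempt, but is always bounded by the cap:
for attempt in range(5):
    print(round(retry_delay(attempt), 2))
```

Without the jitter term, every instance that crashed at the same moment retries at the same moment too, which is exactly the synchronized load spike that can overwhelm a scheduler at this scale.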


I believe Show HN is also what the guidelines suggest for non-YC startups, which are not allowed to use Launch HN (although some text along with the post would have been nice).


Exactly. It’s just leveling the playing field. I’ve been generating cover letters based on my CV, the job post, and a few other data sources, with manual review + adjustments, and getting a very decent callback rate.


Sounds interesting, any reading material you could point to about this?


Hmmm, it’s a very hard thing to sum up in one piece; not sure I have a good author… I highly recommend skimming Marvin Minsky’s “The Frame Problem” (TL;DR: personal assistants will only be helpful once they have intuition), and Rosalind Picard’s writings on affective computing (TL;DR: computers need to simulate our mental states in order to help us effectively). This is a good reminder to look for more lit though!

Philosophy-wise, I’m frustratingly forgetting the two people who write about how one’s consciousness can be said to include things like personal journals, record books, calendars, etc. The idea is that you’re offloading part of your mind into artificial forms, in a very meaningful and real way. Will update if I remember their names soon! In the meantime, this is a great skim on the more general topic IMO: https://plato.stanford.edu/entries/identity-personal/


Thanks!

