"Good enough" bridges still last 50+ years. We could design a bridge to last 200 years but we won't even know if the design we have today will even be needed in 200 years. Maybe by then we all use trains in underground tunnels.
I don't think that's true. Engineers would largely want to build the best bridge, costs be damned. But they would end up undercut by anyone who cuts corners, with the result that the only companies getting contracts are the ones who cut the most corners. Even if no one wants to build bridges that collapse, preventing it would be impossible without the counterweight of laws and accountability.
Microsoft has had a lot of naming blunders in the past, but this has to be their worst. Copilot is currently a tool for reviewing PRs on GitHub, the new name for Windows Cortana, the new name for Microsoft Office, a line of Windows laptops/PCs, a plugin for VS Code that can use many models, and probably a number of other things. None of these products/features have any relation to each other.
So if someone says they use Copilot, that could mean anything from using Word to running Claude in VS Code.
>Microsoft has had a lot of naming blunders in the past but this has to be their worst.
Nah, I still rate "Windows App": the Windows app that lets you remotely access Windows apps. I hate it to death; it's like a black hole that sucks all meaning out of conversations about it.
This feels like an AI-generated comment, but I'll reply anyway. AI has been a massive negative for open source, since every project is now drowning in AI-generated PRs that don't work, reports for issues that don't exist, and a general mountain of time-wasting automated slop.
We are getting to the point where many projects may have to close submissions from the general public, since those submissions waste far more time than they save.
And then you get a new hire who already knows the common SaaS products but has to relearn your vibe-coded version that no one else uses and about which no information exists online.
There is a reason large proprietary products remain prevalent even when cheaper, better alternatives exist. Being "industry standard" matters more than being the best.
It will. By "translation" I mean something like a front-end client that translates the API into a user interface they prefer. They will build something localized to their own workflow, and if it doesn't end well, the damage is localized to them alone. A sketch of what I mean is below.
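For illustration, a minimal hypothetical sketch of such a personal translation layer: a few lines that wrap a generic issue-tracker REST API into the one-line CLI output one particular user prefers. The endpoint URL and field names here are made up, not any real product's API.

    # Hypothetical sketch: a personal "translation layer" that reshapes a
    # generic issue-tracker REST API into the tiny CLI one user prefers.
    # The endpoint and JSON field names are invented for illustration.
    import json
    import urllib.request

    API = "https://tracker.example.com/api/issues"  # hypothetical endpoint

    def my_issues(assignee: str) -> None:
        """Fetch issues and print them in the one-line format I like."""
        with urllib.request.urlopen(f"{API}?assignee={assignee}") as resp:
            issues = json.load(resp)
        for issue in issues:
            print(f"[{issue['priority']}] #{issue['id']} {issue['title']}")

    if __name__ == "__main__":
        my_issues("me")  # localized to my workflow; nobody else sees this UI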
Maybe, but I don't really believe users can or want to start designing software, even if it were possible, which today it isn't unless you already have software dev skills.
That would basically make users product managers and UX designers, roles they aren't really capable of filling currently. At most they will discover that what they think they want isn't what they actually want.
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the dataset. Only the private problems actually matter.
The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. For that reason, this isn't a qualifying score.
I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try something like 10^10 variations of harnesses and select the one that performs best. And if you then look at it, it probably won't look like cheating. But you have biased the estimator by selecting the harness according to the value it produces.
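To make that bias concrete, here's a minimal simulation (all numbers hypothetical, and far fewer than 10^10 variations): even when every harness has exactly the same true pass rate, reporting the best-scoring one of many inflates the estimate well above the truth.

    # Selection bias sketch: benchmark scores are noisy estimates of a
    # fixed true pass rate; picking the max over many identical harnesses
    # still biases the reported number upward.
    import random

    random.seed(0)

    TRUE_PASS_RATE = 0.30   # assumed true capability, identical for all harnesses
    NUM_PROBLEMS = 100      # benchmark size
    NUM_HARNESSES = 1000    # harness variations tried

    def benchmark_score() -> float:
        """One noisy estimate: fraction of problems passed at the true rate."""
        passed = sum(random.random() < TRUE_PASS_RATE for _ in range(NUM_PROBLEMS))
        return passed / NUM_PROBLEMS

    scores = [benchmark_score() for _ in range(NUM_HARNESSES)]
    print(f"average harness score: {sum(scores) / len(scores):.2f}")  # ~0.30, unbiased
    print(f"best harness score:    {max(scores):.2f}")                # ~0.45, biased upward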
Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.
OK! So if someone uses an existing, checkpointed, open-source model, then the answer is yes: the results are valid, and it doesn't matter that the tests are public.
You live in a conspiracy world. Those AI providers don't update their models that fast. You can ask them to solve ARC-AGI-3 without a harness yourself and watch them struggle just like they did yesterday.
Where do you see that? I only skimmed the prompts, but I don't see any aspects of any of the games explained in there. There are a few hints that are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.
What is the use in keeping it open when no one will ever look at it again after it goes stale? It still exists in the system if you ever want to find it again, or if someone reports the same issue again. But after a certain time without reconfirming the bug exists, there is no point investigating, because you will never know whether you just haven't found it yet or it was fixed already.
See my reply to eminence32 - bug tracking serves as a list of known defects, not as a list of work the engineers are going to do this [day/month/year].
The primary purpose is not usually a list of known defects, and many "bugs" are not actually bugs but feature requests or misunderstandings from users (e.g., the RFC disallows the data you want my HTML parser to allow).
The people who filed them would disagree, and many would vehemently argue that their bug is in fact a bug, is the most important bug, and how dare you close it.