Hacker News | anorwell's comments

A pastime I have with papers like this is to look for the part in the paper where they say which models they tested. Very often, you find either A) it's a model from one or more years ago, only just being published now, or B) they don't even say which model they are using. Best I could find in this paper:

> We evaluated 11 user-facing production LLMs: four proprietary models from OpenAI, Anthropic, and Google; and seven open-weight models from Meta, Qwen, DeepSeek, and Mistral.

(and graphs include model _sizes_, but not versions, for open weight models only.)

I can't apprehend how including what model you are testing is not commonly understood to be a basic requirement.


And how is this comment relevant here? The abstract lists the digestible model names, and you can find the details in the supplementary text:

> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT- 4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).

edit: It looks like OP attached the wrong link to the paper!

The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352

But the link in OP's post points to (what seems to be) a completely unrelated study.


"OpenAI’s GPT-5" is ambiguous. Does that mean GPT-5, 5.1, 5.2, 5.3, or 5.4? Does it include the full model, or the nano/mini variants?


GPT-5 is not ambiguous; it's the official name of the model that was released in August last year.

> All evaluations were done in March - August 2025.


While true, all the others got precise identifiers, but for OpenAI it's hard to reproduce because I have no idea "which" GPT-5 was used.


It was called just GPT-5 at that point in time.


In that case, what tokenizer version? What was the temperature set to? topk? topp? FP32? FP16? Quantized? Hopper? Blackwell?
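For what it's worth, here's a sketch of the kind of run manifest that would actually pin all of that down (field names and values are purely illustrative, not from the paper):

```python
# Hypothetical run manifest for one evaluated model. None of these values
# come from the paper; they just illustrate what "which GPT-5" would pin down.
run_manifest = {
    "provider": "openai",
    "model": "gpt-5",                      # marketing name
    "model_snapshot": "gpt-5-2025-08-07",  # dated API snapshot (illustrative)
    "sampling": {"temperature": 1.0, "top_p": 1.0, "max_tokens": 1024},
    "quantization": None,                  # e.g. "fp8" for self-hosted weights
    "hardware": "provider-managed",        # or "8xH100" for local inference
    "eval_window": "2025-03 to 2025-08",
    "system_prompt_sha256": "<hash of the exact prompt used>",
}
```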


Also, nothing has changed! Claude will still yes-and whatever you give it. ChatGPT still has its insufferable personality, where it takes what you said and hands it back to you in different terms as if it's ChatGPT's insight.


OTOH, for Claude the study says 39% yessy, same as humans, 2nd lowest yessing of the LLMs; GPT5 above 50% yessy.


No dude, you don’t understand! It’s just so advanced now that you aren’t allowed to levy any criticism whatsoever!


It's almost like it is based on the training data and regimen that is largely the same between versions.


Well yes, but no. There are also open-weight models, and literally all of those listed above are no longer in use, at least by most end users and developers as far as I'm aware.


No study of AI can ever be done or be relevant, because every couple of months they add a new number to the name of the model, thus invalidating all work around model behavior.


Yes, you are right. Sorry, I missed that. It's just that all the open-weight models mentioned were... one year old or older. I just forgot that, firstly, such research is rarely done on frontier models because it takes time (you start with Llama 3.3, but look, one month later there's Llama 4), and secondly, there's also a publication delay. I think I'm just too used to the world of software, where everything moves at lightning speed. Sorry : )


> A pastime I have with papers like this is to look for the part in the paper where they say which models they tested.

My pastime (not really) in HN submissions like this is to look for the comment where someone complains about the models used because they aren’t the literal same model and version the commenter has started using the day before.

It’s always “you can’t test with those models, those are crap, the ones we have now are much better”, in perpetuity. It’s Schrödinger’s LLM: simultaneously god-like and a piece of garbage depending on the needs of the discussion. It’s beyond moving the goalposts, it’s moving the entire football field. It’s a clear bad faith attempt to try to discredit any study the commenter doesn’t like. Which you can always do because you can’t test literally everything.


The GP's criticism as I read it is about paper authors not making it particularly easy to reproduce their findings.

For a long time I have criticized this too, especially for software projects, or papers that deal with machine learning models. If the things described in a paper are not reproducible, then it's basically worthless. Similar to "it works on my machine" in software engineering. Many paper authors are not software engineers, and often neither are they experts in the tooling they should be using to make their research reproducible. If this is a problem for a research team, then please, hire an engineer to ensure reproducibility. It doesn't help anyone to remain ignorant of the reproducibility issue, and it only shows a lack of scientific discipline. Reproducibility should be on the mind of any serious researcher, and there should be lectures at universities about how to do it.


Firing off glib criticism that amounts to “No study on AI is valid beyond the release cycle of the models tested,” feels like the unconscious self-protection reflex we all default to when facing cognitive dissonance. It seems like it’s only easy to spot when someone you disagree with is doing it.

To me, it almost feels like a partisan political thing.


Generally, published papers don't give a damn about reproducibility. I've seen it identified as a crisis by many. Publishers, reviewers, and researchers mostly don't care about that level of basic rigor. There's no professional repercussions or embarrassment.

Agreed - if I were a reviewer for LLM papers, not listing the versions and prompts used would be an instant rejection.


I'm not so sure of that opinion on reproducibility. The last peer review I did was for a small journal that explicitly does not evaluate for high scientific significance, merely for correctness, which generally means straightforward acceptance. The other two reviews were positive, as was mine, except I said that the methods need to be described more and ideally the code placed somewhere. That was enough for a complete rejection of the paper, without asking for the simple revisions I requested. It was a very serious action taken merely because I requested better reproducibility!

(Personally I think the lack of reproducibility comes back mostly to peer reviewers that haven't thought through enough about the steps they'd need to take to reproduce, and instead focus on the results...)


I'm not sure how one example contradicts documented huge overall trends, but okay.


I think publishers care about this a lot, but most researchers do not seem to care as much about reproducibility.


> and instead focus on the results...

This points to (and everyone knows this) an incentive misalignment between the funders of research and the public. Researchers are caught in the middle.


Eh, I'm not so sure about the funding side there, researchers are not really caught at all and are fully responsible, IMHO. Peer reviewers exist to enforce community standards, and are not influenced to avoid reproducibility concerns by funding sources. The results are always more interesting than reproducibility, of course, and I think that's why they get the attention! Also, there needs to be greater involvement of grad students (who do most of the actual work) in peer review, IMHO, because most PIs spend their day in meetings reviewing results, setting directions, writing grants, and have little time for actual lab work, and are thus disconnected from it.

There needs to be more public naming and shaming in science social media and in conference talks, but especially when there are social gatherings at conferences and people are able to gossip. There was a bit of this with Google's various papers, as they got away with figurative murder on lack of reproducibility for commercial purposes. But eventually Google did share more.

Most journals have standards for depositing expensive datasets, but that's a clear yes/no answer. Reproducibility is a very subjective question in comparison to data deposition, and must be subjectively evaluated by peer reviewers. I'd like to see more peer review guidelines with explicit check boxes for various aspects of reproducibility.


> Reproducibility is a very subjective question in comparison to data deposition

Yeah I can definitely see why this is the case because it isn’t real until someone actually tries to reproduce the results. At that point it leaves the realm of subjectivity and becomes a question of cost.


The comment is wrong -- model versions are clearly specified in the supplement.


The same goes for surveys and polls. I know no one who has ever been polled or surveyed. When will we stop this fascination with the made-up infographics crisis?


> Generally, published papers don't give a damn about reproducibility

While this is sadly true, it's especially true when talking about things that are stochastic in nature.

LLM outputs, for example, are notoriously unreproducible.


> LLM outputs, for example, are notoriously unreproducible.

Only in the same way that an individual in a medical study cannot be "reproduced" for the next study. However the overall statistical outcomes of studying a specific LLM can be reproduced.
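That aggregate kind of reproducibility is also cheap to express; a rough sketch (query_model is a hypothetical judge callable, not any particular API):

```python
import math

def sycophancy_rate(query_model, prompts, n_repeats=5):
    """Estimate how often a model affirms the user, with a simple 95% CI.

    query_model(prompt) -> bool is a hypothetical callable that returns
    True when the response is judged sycophantic; swap in any judge.
    """
    results = [query_model(p) for p in prompts for _ in range(n_repeats)]
    n = len(results)
    p_hat = sum(results) / n
    ci = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # normal approximation
    return p_hat, ci

# Individual responses vary run to run, but with enough prompts the
# estimated rate (and its interval) should be stable across reruns.
```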


Do they reproduce any submitted papers at all?

Does this happen?

I can remember the room-temperature-superconductor guy whose experiments were replicated, but this seems rare?


Yes, those are the only papers worth reading at all.


I think it’s very important to be clear what studies like this are actually doing.

This study, although it has been produced by a computer science department, belongs more to the field of sociology or media studies than it does to computer science.

This is a study about the way in which human beings consume a particular media product - a consumer AI chatbot - not a study about the technological limitations or capabilities of LLMs.

The social impact of particular pieces of software is a legitimate field of study and I can see the argument that it belongs in the broadly defined field of computer science. But this sort of question is much more similar to ‘how does the adoption of spreadsheet software in finance impact the ease of committing fraud’ or ‘how does the use of presentation software to condense ideas down to bulletpoints impact organizational decision making’. Software has a social dimension and it needs to be examined.

But the question of which models were used is of much less relevance to such a study than the fact that they used ‘whatever capability is currently offered to consumers who commonly use chat software’. Just as in a media-studies investigation into how viewing cop dramas impacts jury verdicts, the question of ‘which cop dramas did they pick to study?’ matters less, so long as the ones they picked were representative of what typical viewers see.


Any paper like this would easily take a year or more to write and go through the submission/review/rebuttal/revision/acceptance process. I don't understand why the models being a year or two old now is worth noting as though it's a clear weakness? What should they do, publish sub-standard results more quickly?


> I don't understand why the models being a year or two old now is worth noting as though it's a clear weakness?

I do think it's a clear weakness. Capabilities are extremely different than they were twelve months ago.

> What should they do, publish sub-standard results more quickly?

Ideally, publish quality results more quickly.

I'm quite open to competing viewpoints here, but it's my impression that the academic publishing cycle isn't really contributing to the AI discussion in a substantive way. The landscape is just moving too quickly.


The onus is on you to prove or at least convincingly argue that the results are unlikely to generalize across incremental model releases. In my personal experience, the overly affirming nature seems to have held since GPT-3. What makes you think a newer, larger model would not exhibit this behavior? Beyond "they're more capable"? I'd argue that being more capable doesn't mean less sycophantic.

It's certainly possible some of the new advances (chain-of-thought, some kind of agentic architecture) could lessen or remove this effect. But that's not what the paper was studying! And if you feel strongly about it, you could try to further the discussion with results instead of handwavingly dismissing others' work.


The onus of persuasion is on the persuader, and publishing a study on old models that no one uses anymore isn’t persuasive. I don’t need to prove anything to decide that you haven’t changed my mind.


By this logic there can be hundreds of studies that all show the pattern, including a 100% accurate prediction of the results for the next model and none of them would be "persuasive", because OpenAI decided to always release a new model the day before the paper is published.

So what you're saying here is that you were never open to "persuasion" and it was just a front to waste everyone's time.


I think you are absolutely right. (had to)


Capabilities are not the same thing as personality.

Upgrading a robot that knows how to lay bricks to one that also knows how to lay plaster won't make it a better therapist.


It’s as if they are testing “AI” and not specific agents.

I wonder if that is left over from testing people. I have major version numbers and my minor version number changes daily, often as a surprise. Sometimes several times a day. So testing people is a bit tricky. But AIs do have stable version numbers and can be specifically compared.


Yeah, these idiots obviously should have been testing models from 1-2 years in the future so that by the time their paper is released, the models are current.


If they’re reaching the same results across a variety of the most popular public models, it doesn’t seem like that big a deal to know if it was Opus 4 or Opus 4.5


Reproducibility is (supposed to be) a cornerstone of science. Model versions are absolutely critical to understand what was actually tested and how to reproduce it.


The models get deprecated after 1-2 years, so reproducibility is pretty hard anyway (but as others pointed out the paper does list the model versions)


How many people using AI are actually paying for it (outside of people in tech)?

I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up, and I wonder if these are the ones most people are using?


> I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up

I keep seeing this claim, yet in my experience it doesn't hold water. I pay for the models, most people I know pay for the models, and we see all of the exact same issues.

I have Claude and ChatGPT both bullshit and lick my ass on the regular. The ass licking will occur regardless of instruction.


Usually the models are a year old because the paper review process is utter crap, and papers take about a year to get published.


"Apprehend"?


The HN title editorialization is completely inaccurate and misleading here.


What do you think about the METR 50% task length results? About benchmark progress generally?


I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the successes: one person whose commute/coding problem happens to be easy may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].


https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

From my perspective, it's not the worst analogy. In both cases, some people were forecasting an exponential trend into the future and sounding an alarm, while most people seemed to be discounting the exponential effect. Covid's doubling time was ~3 days, whereas the AI capabilities doubling time seems to be about 7 months.
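To make the exponential comparison concrete, here's a back-of-the-envelope sketch; the only input is the ~7-month doubling figure from the METR post above, and the 1-hour starting horizon is just an assumption for illustration:

```python
# If the 50%-success task horizon doubles every ~7 months, then starting
# from an assumed 1-hour horizon the projection is simply:
doubling_months = 7
horizon_hours = 1.0
for months in (7, 14, 21, 28):
    projected = horizon_hours * 2 ** (months / doubling_months)
    print(f"+{months} months: ~{projected:.0f} hour task horizon")
# +7: ~2h, +14: ~4h, +21: ~8h, +28: ~16h -- if (and only if) the trend holds.
```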

I think disagreement in threads like this can often be traced back to a miscommunication about the state today (or historically) versus the trajectory. Skeptics are usually saying: capabilities are not good _today_ (or worse: capabilities were not good six months ago when I last tested them; see this OP, which is pre-Opus 4.5). Capabilities forecasters are saying: given the trend, what will things be like in 2026-2027?


The "COVID-19's doubling time was ≈3 days" figure was the output of an epidemiological model, based on solid and empirically-validated theory, based on hundreds of years of observations of diseases. "AI capabilities' doubling time seems to be about 7 months" is based on meaningless benchmarks, corporate marketing copy, and subjective reports contradicted by observational evidence of the same events. There's no compelling reason to believe that any of this is real, and plenty of reason to believe it's largely fraudulent. (Models from 2, 3, 4 years ago based on the "it's fraud" concept are still showing high predictive power today, whereas the models of the "capabilities forecasters" have been repeatedly adjusted.)


The article does not say at any point which model was used. This is the most basic important information when talking about the capabilities of a model, and probably belongs in the title.


Whoops, I'm very dumb. It's Opus 4.1. I updated the blog post and credited you for the correction. Thank you!


That model does not exist. Do you mean Opus 4.5?


> That model does not exist.

It does (unless the previous comment was edited? Currently it says Opus 4.1): https://www.anthropic.com/news/claude-opus-4-1. You can see it in the 'more models' list on the main Claude website, or in Claude Console.


yep, this is what I used.


Whoops, my bad. Sorry.


Opus GPT 4.1 Pro Maverick DeepK2


But only in the tip (nightly) build. I'm somewhat tempted to switch to them for this.


A while ago I compiled Ghostty from HEAD, because it had a bug fix I cared for. It was a very stable and pleasant experience. No hassle whatsoever.


If you'd like you can also use `tip` as the update channel to get the nightly build binary without having to compile it yourself: https://ghostty.org/docs/config/reference#auto-update-channe...


Ah. Cool!


If you need to do that again, note that there is an asdf plugin as well ;-)

For Linux, compiling is actually the only way to get tip.


Seems like a really interesting project! I don't understand what's going on with latency vs durability here. The benchmarks [1] report ~1ms latency for sequential writes, but that's just not possible with S3. So presumably writes are not being confirmed to storage before confirming the write to the client.

What is the durability model? The docs don't talk about intermediate storage. Slatedb does confirm writes to S3 by default, but I assume that's not happening?

[1] https://www.zerofs.net/zerofs-vs-juicefs


SlateDB offers different durability levels for writes. By default writes are buffered locally and flushed to S3 when the buffer is full or the client invokes flush().

https://slatedb.io/docs/design/writes/


The durability profile before sync should be pretty close to a local filesystem. There’s (in-memory) buffering happening on writes, then when fsync is issued or when we exceed the in-memory threshold or we exceed a timeout, data is sync-ed.
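In other words, the write path looks roughly like this (a schematic sketch of the behavior described above, not the actual ZeroFS/SlateDB code):

```python
import time

class BufferedWriter:
    """Schematic of the described write path: acknowledge writes from an
    in-memory buffer and only make data durable on fsync, when a size
    threshold is hit, or when a timeout expires. This is an illustration,
    not the real ZeroFS/SlateDB implementation."""

    def __init__(self, backend_flush, max_bytes=8 << 20, max_age_s=1.0):
        self.backend_flush = backend_flush   # e.g. uploads a batch to S3
        self.max_bytes, self.max_age_s = max_bytes, max_age_s
        self.buf, self.buf_size, self.oldest = [], 0, None

    def write(self, data: bytes):
        # Acknowledged immediately: ~memory latency, not yet durable.
        self.buf.append(data)
        self.buf_size += len(data)
        if self.oldest is None:
            self.oldest = time.monotonic()
        if (self.buf_size >= self.max_bytes
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.fsync()

    def fsync(self):
        # Durability point: everything buffered so far reaches the backend.
        if self.buf:
            self.backend_flush(b"".join(self.buf))
            self.buf, self.buf_size, self.oldest = [], 0, None
```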


Thanks, makes sense. I found the benchmark src to see it's not fsyncing, so only some of the files will be durable by the time the benchmark is done. The benchmark docs might benefit from discussing this or benchmarking both cases? O_SYNC / fsync before file close is an important use case.

edit: A quirk with the use of NFSv3 here is that there's no specific close op. So, if I understand right, ZeroFS' "close-to-open consistency" doesn't imply durability on close (and can't unless every NFS op is durable before returning), only on fsync. Whereas EFS and (I think?) azure files do have this property.
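For reference, a minimal sketch of what benchmarking the durable path would look like; the mount path and sizes are made up:

```python
import os, time

def timed_write(path, data, durable):
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        if durable:
            os.fsync(fd)  # data must reach stable storage before we return
    finally:
        os.close(fd)
    return time.monotonic() - start

data = os.urandom(1 << 20)  # 1 MiB
print("buffered:", timed_write("/mnt/zerofs/bench.bin", data, durable=False))
print("fsync'd: ", timed_write("/mnt/zerofs/bench.bin", data, durable=True))
```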


There's an NFSv3 COMMIT operation, combined with a "durability" marker on writes. fsync could translate to COMMIT, but if writes are marked as "durable", COMMIT is not called by common clients, and if writes are marked as non-durable, COMMIT is called after every operation, which kind of defeats the point. When you use NFS with ZeroFS, you cannot really rely on "fsync".

I'd recommend using 9P when that matters, which has proper semantics there. One property of ZeroFS is that any file you fsync actually syncs everything else too.


I think your example reflects well on oss-20b, not poorly. It (may) show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.


Some of the comments so far seem to be misunderstanding this submission. As I understand it:

1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.

2. The author has built an RL system, but it has not been used for anything due to cost limitations.

So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).

[1] https://www.tbench.ai/leaderboard


It looks like the submission has two aspects that are being conflated.

1. Tooling for training a terminal agent.

2. An agent that was _not_ trained with this tooling but prompt engineered. I could not find the author's discussion on this point.


This actually intersects with two of my current interests. We have been seeing rare ThreadPoolExecutor hangs (JDK 17) during shutdown in production. After a lot of debugging, I've been suspecting more and more that it may be an actual JDK issue. But this type of issue is extremely hard to reason about in production, and I've never successfully reproduced it locally. (It's not clear to me that it's the same issue as in the post, since it's not a scheduled executor.)

Separately, we're looking at using fray for concurrency property testing, as a way to reliably catch concurrency issues in a distributed system by simulating it within a single JVM.

