
> Wah, it can't write code like a Senior engineer with 20 years of experience!

No, that's not my problem with it. My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.

Sure, sometimes it produces useful code. And often, it'll simply call the "doTheHardPart()" method. I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over.
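
For anyone who wants the reference point, here is a minimal selection sort in Python (my own sketch, not the commenter's prompt or any model's output). The tell is in the inner loop: selection sort scans for the minimum and does one swap per pass, whereas bubble sort swaps adjacent out-of-order pairs as it scans.

  def selection_sort(items):
      """Sort a list in place by repeatedly selecting the minimum of the unsorted tail."""
      n = len(items)
      for i in range(n):
          # find the index of the smallest remaining element...
          min_idx = i
          for j in range(i + 1, n):
              if items[j] < items[min_idx]:
                  min_idx = j
          # ...and swap it into position i: one swap per pass,
          # unlike bubble sort, which swaps adjacent pairs as it goes
          items[i], items[min_idx] = items[min_idx], items[i]
      return items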

Outside of programming, this is much worse. I've both seen online and heard people quote LLM output as if it were authoritative. That to me is the bigger danger of LLMs to society. People just don't understand that LLMs aren't high powered attorneys, or world-renowned doctors. And, unfortunately, the incorrect perception of LLMs is being hyped both by LLM companies and by "journalists" who are all too ready to simply run with and discuss the press releases from said LLM companies.



> inbuilt into the models of all LLMs is that they'll fabricate a lot.

Still the elephant in the room. We need an AI technology that can output "don't know" when appropriate. How's that coming along?


Unfortunately they are trained first and foremost as plausibility engines. The central dogma is that plausibility will (with continuing progress & scale) converge towards correctness, or "faithfulness" as it's sometimes called in the literature.

This remains very far from proven.

The null hypothesis that would be necessary to reject, therefore, is a most unfortunate one, viz. that by training for plausibility we are creating the world's most convincing bullshit machines.


> plausibility [would] converge towards correctness

That is the most horribly dangerous idea, as we demand that the agent guesses not, even - and especially - when the agent is a champion at guessing - we demand that the agent checks.

If G guesses from the multiplication table with remarkable success, we more strongly demand that G computes its output accurately instead.

Oracles that, out of extraordinary average accuracy, people may forget are not computers, are dangerous.


One man's "plausibility" is another person's "barely reasoned bullshit". I think you're being generous, because LLMs explicitly don't deal in facts; they deal in making stuff up that is vaguely reminiscent of fact. Only a few companies are even trying to make reasoning (as in axioms-cum-deductions, i.e., logic per se) a core part of the models, and they're really struggling to hand-engineer the topology and methodology necessary for that to work even roughly as a facsimile of technical reasoning.


I’m not really being generous. I merely think if I’m gonna condemn something as high-profile snake oil for the tragically gullible, it’s helpful to have a solid basis for doing so. And it’s also important to allow oneself to be wrong about something, however remote the possibility may currently seem, and preferably without having to revise one’s principles to recognise it.


As a sort of related anecdote... if you remember the days before Google, people sitting around your dinner table arguing about stuff used to spew all sorts of bullshit, then drop that they have a degree from XYZ university and they won the argument. When Google/Wikipedia came around, it turned out that those people were in fact just spewing bullshit. I'm sure there was some damage, but it feels like a similar thing. Our "bullshit-radar" seems to be able to adapt to these sorts of things.


Well, conspiracy theories are thriving in this day and age, with access to technology and information at one's fingertips. Add to that a US administration now effectively spewing bullshit every few minutes.

The best example of this was an argument I had a little while ago about self-driving. I mentioned that I have a hard time trusting any system relying only on cameras, to which I was told that I didn't understand how machine learning works, that obviously they were correct and I was wrong, and that every car would be self-driving within 5 years. All of these things could easily be verified independently.

Suffice to say that I am not sure that the "bullshit-radar" is that adaptive...

Mind you, this is not limited to the particular issue at hand, but I think those situations need to be highlighted, because we get fooled easily by authoritative delivery...


Language models are closing the gaps that still remain at an amazing rate. There are still a few gaps, but consider what has happened just in the last year, and extrapolate 2-3 years out...


If the training data had a lot of humans saying "I don't know", then the LLMs would too.

Humans don't, and LLMs are essentially trained to resemble most humans.


I have seen many people not saying "don't know" when appropriate. If you believe whomever without some double-checking you will have (bad) surprises.

To make another parallel: that's why we have automated testing in software (long before LLMs). Because you can't trust without checking.


And what's your opinion of people that never say "don't know"?

Unless you are in sales or marketing, getting caught lying is really detrimental to your career.


Or politics, where confidently lying is an essential skill.


Is it? Seems like a lost art these days. Modern politicians hardly put in any effort making up convincing and subtle lies anymore..


And it still works. And the problem seems to be related to the unconditional trust of LLM output.


Too many "don't know"s when they should know, and they come across as stupid.

Too few "don't know"s and they end up being wrong, an idiot.


There is lying and there is being incompetent.


Then the optimal strategy is to never say "don't know" and never get caught lying.

Seems to work for many people. I suspect my career has been hampered by a higher-than-average willingness to say "I don't know"...


I think you are discounting the fact that you can weed out people who make a habit of that, but you can't do that with LLMs if they are all doing that.


Some people trust Alex Jones, while the vast majority realize that he just fabricates untruths constantly. Far fewer people realize that LLMs do the same.

People know that computers are deterministic, but most don't realize that determinism and accuracy are orthogonal. Most non-IT people give computers authoritative deference they do not deserve. This has been a huge issue with things like Shot Spotter, facial recognition, etc.


That's what I really want.

One thing I see a lot on X is people asking Grok what movie or show a scene is from.

LLMs must be really, really bad at this because not only is it never right, it actually just makes something up that doesn't exist. Every, single, time.

I really wish it would just say "I'm not good at this, so I do not know."


When your model of the world is built on the relative probabilities of the next opaque apparently-arbitrary number in context of prior opaque apparently-arbitrary numbers, it must be nearly impossible to tell the difference between “there are several plausible ways to proceed, many of which the user will find useful or informative, and I should pick one” and “I don’t know”. Attempting to adjust to allow for the latter probably tends to make the things output “I don’t know” all the time, even when the output they’d have otherwise produced would have been good.


I thought about this of course, and I think a reasonable 'hack' for now is to more or less hardcode things that your LLM sucks at, and override it to say it doesn't know. Because continually failing at basic tasks is bad for confidence in said product.

I mean, it basically does the same thing if you ask it to do anything racist or offensive, so that override ability is obviously there.

So if it identifies the request as identifying a movie scene, just say 'I don't know', for example.


Hardcode by whom? Who do we trust with this task to do it correctly? Another LLM that suffers from the same fundamental flaw or by a low paid digital worker in a developing country? Because that's the current solution. And who's gonna pay for all that once the dumb investment money runs out, who's gonna stick around after the hype?


By the LLM team (Grok team, in this case). I don't mean for the LLM to be sentient enough to know it doesn't know the answer, I mean for the LLM to identify what is being asked of it, and checking to see if that's something on the 'blacklist of actions I cannot do yet', said list maintained by humans, before replying.

No different than when asking ChatGPT to generate images or videos or whatever before it could, it would just tell you it was unable to.


> It's impossible to predict with certainty who will be the U.S. President in 2046. The political landscape can change significantly over time, and many factors, including elections, candidates, and events, will influence the outcome. The next U.S. presidential election will take place in 2028, so it would be difficult to know for sure who will hold office nearly two decades from now.

So it can say “I don’t know”


It can do this because it is in fact the most likely thing to continue with, word by word.

But the most likely way to continue a paper is not to end it with "I don't know". It is to actually provide sources, which it proceeds to do wrongly.


>> We need an AI technology that can output "don't know" when appropriate. How's that coming along?

Heh. Easiest answer in the world. To be able to say "don't know", one has first to be able to "know". And we ain't there yet, not by a long shot. Not even within a million miles of it.


Needs meta-annotation of certainty on all nodes and tokens that accumulates while reasoning. Also gives the ability to train in beliefs, as in overriding any uncertainty. Right now we are in the pure belief phase. AI is its own god right now, pure blissful belief without the sin of doubt.


Not sure. We haven't figured it out for humans yet.


Sure we have. We don't have a perfect solution but it's miles better than what we have for LLMs.

If a lawyer consistently makes stuff up on legal filings, in the worst cases they can lose their license (though they'll most likely end up getting fines).

If a doctor really sucks, they become uninsurable and ultimately could lose their medical license.

Devs that don't double check their work will cause havoc with the product and, not only will they earn low opinions from their colleagues, they could face termination.

Again, not perfect, but also not unfigured out.


By the same measure, if an LLM really sucks we stop using it. Same solution.


Many people haven’t gotten the message yet, it seems.


Sure. I don't use GPT-3.5-Turbo for similar reasons. I fired it.


How many companies train on data that contains "I don't know" responses? Have you ever talked with a toddler / young child? You need to explicitly teach children not to bullshit. At least I needed to teach mine.


Never mind toddlers, have you ever hired people? A far smaller proportion of professional adults will say “I don’t know” than a lot of people here seem to believe.


I never thought about this but I have experienced this with my children.


> train on data that contains 'i don't know' responses

The "dunno" must not be hardcoded in the data, it must be an output of judgement.


Judgement is what we call a system trained on good data.


No I call judgement a logical process of assessment.

You have an amount of material that speaks of the endeavours in some sport of some "Michael Jordan", the logic in the system decides that if a "Michael Jordan" in context can be construed to be "that" "Michael Jordan" then there will be sound probabilities he is a sportsman; you have very little material about a "John R. Brickabracker", the logic in the system decides that the material is insufficient to take a good guess.


AI is not a toddler. It's not human. It fails in ways that are not well understood and sometimes in an unpredictable manner.


Actually it fails exactly as I would expect of something trained purely on knowledge and not on morals.


Then I expect your personal fortunes are tied up in hyping the "generative AI are just like people!" meme. Your comment is wholly detached from the reality of using LLMs. I do not expect we'll be able to meet eye-to-eye on the topic.


I always see this, and I always answer the same.

This exists: each next token has a probability assigned to it. High probability means "it knows"; if there are two or more tokens of similar probability, or the probability of the first token is low in general, then you are less confident about that datum.

Of course there's areas where there's more than one possible answer, but both possibilities are very consistent. I feel LLMs (chatgpt) do this fine.

Also, can we stop pretending with the generic name for ChatGPT? It's like calling Viagra sildenafil instead of Viagra. Cut it out; there's the real deal and there's imitations.


> low in general, then you are less confident about that datum

It’s very rarely clear or explicit enough when that’s the case. Which makes sense considering that the LLMs themselves do not know the actual probabilities


Maybe this wasn't clear, but the probabilities are a low-level variable that may not be exposed in the UI; it IS exposed through the API as logprobs in the ChatGPT API. And of course if you have weights-level access, like with a Llama LLM, you may have even deeper access to this p variable.
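
To make that concrete, here is a rough sketch of pulling per-token logprobs through the OpenAI Python client. The model name and the 0.5 threshold are arbitrary choices of mine, and field names may differ across SDK versions:

  import math
  from openai import OpenAI

  client = OpenAI()
  resp = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder model name
      messages=[{"role": "user", "content": "Who wrote The Art of Computer Programming?"}],
      logprobs=True,    # ask for per-token log-probabilities
      top_logprobs=3,   # and the top alternatives at each position
  )

  for tok in resp.choices[0].logprobs.content:
      p = math.exp(tok.logprob)  # convert log-probability back to a probability
      note = "  <- low confidence" if p < 0.5 else ""
      print(f"{tok.token!r:>15}  p={p:.2f}{note}")

This only surfaces the model's own uncertainty over next tokens, which, as the replies note, is not the same thing as factual confidence.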


> it IS exposed through API as logprobs in the ChatGPT api

Sure but they often are not necessarily easily interpretable or reliable.

You can use it to compare a model’s confidence of several different answers to the same question but anything else gets complicated and not necessarily that useful.


>can we stop pretending with the generic name for ChatGPT?

What? I use several LLM's, including ChatGPT, every day. It's not like they have it all cornered..


This is very subjective, but I feel they are all imitators of ChatGPT. I also contend that the ChatGPT API (and UI) will become, or has become, a de facto standard in the same manner that Intel's 8086 instruction set evolved into x86.


> How's that coming along?

It isn't. LLMs are autocomplete with a huge context. It doesn't know anything.


I’d love a confidence percentage accompanying every answer.


So we just need to solve the halting problem in NLP?


would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out? because presumably both happen at some rate, and if it hallucinates an answer i can at least check what that answer is or accept it with a grain of salt.

nobody freaks out when humans make mistakes, but we assume our nascent AIs, being machines, should always function correctly all the time


> would you rather the LLM make up something that sounds right when it doesn't know, or would you like it to claim "i don't know" for tasks it actually can figure out?

The latter option every single time


> but we assume our nascent AIs, being machines, should always function correctly all the time

A tool that does not function is a defective tool. When I issue a command, it had better do it correctly or it will be replaced.


And that's part of the problem - you're thinking of it like a hammer when it's not a hammer. It's asking someone at a bar a question. You'll often get an answer - but even if they respond confidently that doesn't make it correct. The problem is people assuming things are fact because "someone at a bar told them." That's not much better than, "it must be true I saw it on TV".

It's a different type of tool - a person has to treat it that way.


Asking a question is very contextual. I don't ask a lawyer about house engineering problems, nor my doctor how to bake a cake. That means if I'm asking someone at a bar, I'm already prepared to deal with the fact that the person is maybe drunk, probably won't know,... And more often than not, I won't even ask the question unless in dire need. Because it's the most inefficient way to get an informed answer.

I wouldn't bat an eye if people were taking code suggestions, then reviewing and editing them to make them correct. But from what I see, it's pretty much a direct push to production if they got it to compile, which is different from correct.


Sounds like a trillion dollar industry.


It would be nice to have some kind of "confidence level" annotation.


> What's worse, people are treating them as authoritative. … I've both seen online and heard people quote LLM output as if it were authoritative.

That's not an LLM problem. But indeed quite bothersome. Don't tell me what ChatGPT told you. Tell me what you know. Maybe you got it from ChatGPT and verified it. Great. But my jaw kind of drops when people cite an LLM and just assume it's correct.


It might not be an LLM problem, but it’s an AI-as-product problem. I feel like every major player’s gamble is that they can cement distinct branding and model capabilities (as perceived by the public) faster than the gradual calcification of public AI perception catches up with model improvements - every time a consumer gets burned by AI output in even small ways, the “AI version of Siri/Alexa only being used for music and timers” problem looms a tiny, tiny bit larger.


Why is that a problem tho?

Branding for current products has this property today - for example, Apple products are seen as being used by creatives and such.


> when people cite an LLM and just assume it’s correct.

People used to say the exact same thing about Wikipedia back when it first started.


These are not similar. Wikipedia says the same thing to everybody, and when what it says is wrong, anybody can correct it, and they do. Consequently it's always been fairly reliable.


Lies and mistakes persist on Wikipedia for many years. They just need to sound truthy so they don't jump out to Wikipedia power users who aren't familiar with the subject. I've been keeping tabs on one for about five years, and it's several years older than that, which I won't correct because I am IP range banned and I don't feel like making an account and dealing with any basement-dwelling power editor NEETs who read Wikipedia rules and processes for fun. I know I'm not the only one to have noticed, because this glaring error isn't in a particularly obscure niche; it's in the article for a certain notorious defense initiative which has been in the news lately, so this error has plenty of eyes on it.

In fact, the error might even be a good thing; it reminds attentive readers that Wikipedia is an unreliable source and you always have to check if citations actually say the thing which is being said in the sentence they're attached to.


Maybe you're just wrong about it.


Citation Needed - you can track down WHY it's reliable too if the stakes are high enough or the data seems iffy.


That's true too, but the bigger difference from my point of view is that factual errors in Wikipedia are relatively uncommon, while, in the LLM output I've been able to generate, factual errors vastly outnumber correct facts. LLMs are fantastic at creativity and language translation but terrible at saying true things instead of false things.


> Consequently it's always been fairly reliable.

Comments like these honestly make me much more concerned than LLM hallucinations. There have been numerous times when I've tracked down the source for a claim, only to find that the source was saying something different, or that the source was completely unreliable (sometimes on the crackpot level).

Currently, there's a much greater understanding that LLM's are unreliable. Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.

The big problem is that people in general are terrible at exercising critical thinking when they're presented with information. It's probably less of an issue with LLMs at the moment, since they're new technology and a certain amount of skepticism gets applied to their output. But the issue is that once people have gotten more used to them, they'll turn off their critical thinking in the same manner that they turn it off when absorbing information from other sources that they're used to.


Wikipedia is fairly reliable if our standard isn't a platonic ideal of truth but real-world comparators. Reminds me of Kant's famous line. "From the crooked timber of humankind, nothing entirely straight can be made".

See the Wikipedia page on the subject :)

https://en.m.wikipedia.org/wiki/Reliability_of_Wikipedia


The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources. LLMs are sold as "replace knowledge work(ers)", you cannot check their sources, and the only way you can check their work is by going to something like Wikipedia. They're just fundamentally different things.


> The sell of Wikipedia was never "we'll think so you don't have to", it was never going to disarm you of your skepticism and critical thought, and you can actually check the sources.

You can check them, but Wikipedia doesn't care what they say. When I checked a citation on the French Toast page, and noted that the source said the opposite of what Wikipedia did by annotating that citation with [failed verification], an editor showed up to remove that annotation and scold me that the only thing that mattered was whether the source existed, not what it might or might not say.


I feel like I hear a lot of criticism about Wikipedia editors, but isn't Wikipedia overall pretty good? I'm not gonna defend every editor action or whatever, but I think the product stands for itself.


Wikipedia is overall pretty good, but it sometimes contains erroneous information. LLMs are overall pretty good, but they sometimes contain erroneous information.

The weird part is when people get really concerned that someone might treat the former as a reliable source, but then turn around and argue that people should treat the latter as a reliable source.


I had a moment of pique where I was just gonna copy paste my reply to this rehash of your original point that is non-responsive to what I wrote, but I've found myself. Instead, I will link to the Wikipedia article for Equivocation [0] and ChatGPT's answer to "are wikipedia and LLMs alike?"

[0]: https://en.wikipedia.org/wiki/Equivocation

[1]: https://chatgpt.com/share/67e6adf3-3598-8003-8ccd-68564b7194...


Wikipedia occasionally has errors, which are usually minor. The LLMs I've tried occasionally get things right, but mostly emit limitless streams of plausible-sounding lies. Your comment paints them as much more similar than they are.


In my experience, it's really common for wikipedia to have errors, but it's true that they tend to be minor. And yes, LLMs mostly just produce crazy gibberish. They're clearly worse than wikipedia. But I don't think wikipedia is meeting a standard it should be proud of.


Yes, I agree. What kind of institution do you think could do better?


It's scored better than other encyclopedias it has been compared against, which is something.


Wikipedia is one of the better sources out there for topics that are not seen as political.

For politically loaded topics, though, Wikipedia has become increasingly biased towards one side over the past 10-15 years.


source: the other side (conveniently works in any direction)


> Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.

One of these things is not like the others! Almost always, when I see somebody claiming Wikipedia is wrong about something, it's because they're some kind of crackpot. I find errors in Wikipedia several times a year; probably the majority of my contribution history to Wikipedia https://en.wikipedia.org/wiki/Special:Contributions/Kragen consists of me correcting errors in it. Occasionally my correction is incorrect, so someone corrects my correction. This happens several times a decade.

By contrast, I find many YouTube videos and studies from advocacy groups to be full of errors, and there is no mechanism for even the authors themselves to correct them, much less for someone else to do so. (I don't know enough about posts on AskHistorians to comment intelligently, but I assume that if there's a major factual error, the top-voted comments will tell you so—unlike YouTube or advocacy-group studies—but minor errors will generally remain uncorrected; and that generally only a single person's expertise is applied to getting the post right.)

But none of these are in the same league as LLM output, which in my experience usually contains more falsehoods than facts.


> Currently, there's a much greater understanding that LLM's are unreliable.

Wikipedia being world-editable and thus unreliable has been beaten into everyone's minds for decades.

LLMs just popped into existence a few years ago, backed by much hype and marketing about "intelligence". No, normal people you find on the street do not in fact understand that they are unreliable. Watch some less computer literate people interact with ChatGPT - it's terrifying. They trust every word!


Look at the comments here. No one is claiming that LLMs are reliable, while numerous people are claiming that Wikipedia is reliable.


Isn't that the issue with basically any medium?

If you read a non-fiction book on any topic, you can probably assume that half of the information in it is just extrapolated from the author's experience.

Even scientific articles are full of inaccurate statements, the only thing you can somewhat trust are the narrow questions answered by the data, which is usually a small effect that may or may not be reproducible...


No, different media are different—or, better said, different institutions are different, and different media can support different institutions.

Nonfiction books and scientific papers generally only have one person, or at best a dozen or so (with rare exceptions like CERN papers), giving attention to their correctness. Email messages and YouTube videos generally only have one. This limits the expertise that can be brought to bear on them. Books can be corrected in later printings, an advantage not enjoyed by the other three. Email messages and YouTube videos are usually displayed together with replies, but usually comments pointing out errors in YouTube videos get drowned in worthless me-too noise.

But popular Wikipedia articles are routinely corrected by hundreds or thousands of people, all of whom must come to a rough consensus on what is true before the paragraph stabilizes.

Consequently, although you can easily find errors in Wikipedia, they are much less common in these other media.


Yes, though by different degrees. I wouldn't take any claim I read on Wikipedia, got from an LLM, saw in a AskHistorians or Hacker News reply, etc., as fact, and I would never use any of those as a source to back up or prove something I was saying.

Newspaper articles? It really depends. I wouldn't take paraphrased quotes or "sources say" as fact.

But as you move to generally more reliable sources, you also have to be aware that they can mislead in different ways, such as constructing the information in a particular way to push a particular narrative, or leaving out inconvenient facts.


And that is still accurate today. Information always contains a bias from the narrator's perspective. Having multiple sources allows one to triangulate the accuracy of information. Making people use one source of information would allow the business to control the entire narrative. It's just more of a business around people and sentiments than being bullish on science.


Correct; Wikipedia is still not authoritative.


Wikipedia will cite and often broadly source. Wikipedia has an auditable decision trail for content conflicts.

It behaves more like an accountable mediator of authority.

Perhaps LLMs offering those (among other) features would be reasonably matched in an authoritativeness comparison.


Authority, yes, accountable, not so much.

Basically at the level of other publishers, meaning they can be as biased as MSNBC or Fox News, depending on who controls them.


And they were right, right? They recognized it had structural faults that made it possible for bad data to seep in. The same is valid for LLMs: they have structural faults.

So what is your point? You seem to have placed assumptions there. And broad ones, so that differences between the two things, and complexities, the important details, do not appear.


> Thats not an LLM problem

It is, if the purpose of LLMs was to be AI. "Large language model" as a choir of pseudorandom millions converged into a voice - that was achieved, but it is by definition out of the professional realm. If it is to be taken as "artificial intelligence", then it has to have competitive intelligence.


> But my jaw kind of drops when people cite an LLM and just assume it’s correct.

Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor, so is it totally their fault?

They've heard about the uncountable sums of money spent on creating such software, why would they assume it was anything short of advertised?


> Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor

Why does this imply that they’re always correct? I’m always genuinely confused when people pretend like hallucinations are some secret that AI companies are hiding. Literally every chat interface says something like “LLMs are not always accurate”.


> Literally every chat interface says something like “LLMs are not always accurate”.

In small, de-emphasized text, relegated to the far corner of the screen. Yet, none of the TV advertisements I've seen have spent any significant fraction of the ad warning about these dangers. Every ad I've seen presents someone asking a question to the LLM, getting an answer and immediately trusting it.

So, yes, they all have some light-grey 12px disclaimer somewhere. Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.


> In small, de-emphasized text, relegated to the far corner of the screen.

I just opened ChatGPT.com and typed in the question “When was Mr T born?”.

When I got the answer there were these things on screen:

- A menu trigger in the top-left.

- Log in / Sign up in the top right

- The discussion, in the centre.

- A T&Cs disclaimer at the bottom.

- An input box at the bottom.

- “ChatGPT can make mistakes. Check important info.” directly underneath the input box.

I dislike the fact that it’s low contrast, but it’s not in a far corner, it’s immediately below the primary input. There’s a grand total of six things on screen, two of which are tucked away in a corner.

This is a very minimal UI, and they put the warning message right where people interact with it. It’s not lost in a corner of a busy interface somewhere.


Maybe it's just down to different screen sizes, but when I open a new chat in ChatGPT, the prompt is in the center of the screen, and the disclaimer is quite a distance away at the very bottom of the screen.

Though, my real point is we need to weigh that disclaimer, against the combined messaging and marketing efforts of the AI industry. No TV ad gives me that disclaimer.

Here's an Apple Intelligence ad: https://www.youtube.com/watch?v=A0BXZhdDqZM. No disclaimer.

Here's a Meta AI ad: https://www.youtube.com/watch?v=2clcDZ-oapU. No disclaimer.

Then we can look at people's behavior. Look at the (surprisingly numerous) cases of lawyers getting taken to the woodshed by a judge for submitting filings to a court with ChatGPT-introduced fake citations! Or someone like Ana Navarro confidently repeating an incorrect fact, and when people pushed back saying "take it up with chat GPT" (https://x.com/ananavarro/status/1864049783637217423).

I just don't think the average person who isn't following this closely understands the disclaimer. Hell, they probably don't even really read it, because most people skip over reading most de-emphasized text in most UIs.

So, in my opinion, whether it's right next to the text-box or not, the disclaimer simply cannot carry the same amount of cultural impact as the "other side of the ledger" that are making wild, unfounded claims to the public.


I remember when Google results called out the ads as distinct from the search results.

That was necessary to build trust until they had enough power to convert that trust into money and power.


Speculating that they may at some point in the future remove that message does not mean that it is not there now. This was the point being made:

> Literally every chat interface says something like “LLMs are not always accurate”.


> Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.

Thank you.


> Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor, so is it totally their fault?

Well... yeah.


I’ve come to believe a more depressing take: they _want_ to believe it, and therefore do.

No disclaimer is gonna change that.


Is it worse or better than quoting a Twitter comment and taking that as it were authoritative?


It remains proportional to the earned prestige of the speaker.


The superficial view: „they hallucinate“

The underlying cause: 3rd order ignorance:

3rd Order Ignorance (3OI)—Lack of Process. I have 3OI when I don't know a suitably efficient way to find out I don't know that I don't know something. This is lack of process, and it presents me with a major problem: If I have 3OI, I don't know of a way to find out there are things I don't know that I don't know.

—- not from an llm

My process: use llms and see what I can do with them while taking their Output with a grain of salt.


But the issue of the structural fault remains. To state the phenomenon (hallucination) is not "superficial", as the root does not add value in the context.

Symptom: "Response was, 'Use the `solvetheproblem` command'". // Cause: "It has no method to know that there is no `solvetheproblem` command". // Alarm: "It is suggested that it is trying to guess a plausible world through lacking wisdom and data". // Fault: "It should have a database of what seems to be states of facts, and it should have built the ability to predict the world more faithfully to facts".


My company just broadly adopted AI. It’s not a tech company and usually late to the game when it comes to tech adoption.

I’m counting down the days when some AI hallucination makes its way all the way to the C-suite. People will get way too comfortable with AI and don’t understand just how wrong it can be.

Some assumption will come from AI, no one will check it, and it'll become a basic business input. Then suddenly one day someone smart will say "that's not true" and someone will trace it back to AI. I know it.

I assume at that point in time there will be some general directive on using AI and not assuming it’s correct. And then AI will slowly go out of favor.


People fabricate a lot too. Yesterday I spent far less time fixing issues in the far more complex and larger changes Claude Code managed to churn out than what the junior developer I worked with needed. Sometimes it's the reverse. But with my time factored in, working with Claude Code is generally more productive for me than working with a junior. The only reason I still work with a junior dev is as an investment into teaching him.

Claude is cheaper, faster, produces better code.


You are mixing a point and the issue, largely heterogeneous: Claude being a champion in producing good code vs LLMs in general being delirious.

If your junior developer is just "junior", that is one matter; if your junior developer hallucinates documentation details, that's different.


Every developer I've ever worked with has gotten things wrong. Whether you call that hallucinating or not is irrelevant. What matters is the effort it takes to fix.


On the logically practical point I agree with you (what counts in the end in the specific process you mention is the gain vs loss game), but my point was that if your assistant is structurally delirious you will have to expect a big chunk of the "loss" as structural.

--

Edit: new information may contribute to even this exchange, see https://www.anthropic.com/research/tracing-thoughts-language...

> It turns out that, in Claude, refusal to answer is the default behavior

I.e., boxes that incline to different approaches to heuristic will behave differently and offer different value (to be further assessed within a framework of complexity, e.g. "be creative but strict" etc.)


And my direct experience is that I often spend less time directing, reviewing and fixing code written by Claude Code at this point than I do for a junior irrespective of that loss. If anything, Claude Code "knows" my code bases better. The rest, then, to me at least is moot.

Claude is substantially cheaper for me, per reviewed, fixed change committed. More importantly to me, it demands less of my limited time per reviewed, fixed change committed.

Having a junior dev working with me at this point wouldn't be worth it to me if it wasn't for the training aspect: We still need pipelines of people who will learn to use the AI models, and who will learn to do the things it can't do well.


> irrespective of that loss

Just to be clear, since that expression may reveal a misunderstanding, I meant the sophisticated version of

  ((gain_jd-loss_jd)>(gain_llm-loss_llm))?(jd):(llm)
But my point was: it's good that Claude has become a rightful legend in the realm of coding, but before and regardless, a candidate that told you "that class will have a .SolveAnyProblem() method: I want to believe" presents a handicap. As you said, no assistant has proved to be perfect, but assistants who attempt mixing coding sessions and creative fiction writing raise alarms.


But this was true before LLMs. People would and still do take any old thing from an internet search and treat it as true. There is a known, difficult-to-remedy failure to properly adjudicate information and source quality, and you can find it discussed in research prior to the computer age. It is a user problem more than a system problem.

In my experience, with the way I interact with LLMs, they are more likely to give me useful output than not, and this is borne out by mainstream non-edge-case academic peer-reviewed work. Useful does not necessarily equal 100% correct, just as a Google search does not. I judge and vet all information, whether from an LLM, search, book, paper, or wherever.

We can build a straw person who "always" takes LLM output as true and uses it as-is, but those are the same people who use most information tools poorly, be they internet search, dictionaries, or even looking in their own files for their own work or sent mail (I say this as an IT professional who has seen the worker types from the pre-internet days through now). In any case, we use automobiles despite others misusing them. But only the foolish among us completely take our hands off the wheel for any supposed "self-driving" features. While we must prevent and decry the misuse by fools, we cannot let their ignorance hold us back. Let's let their ignorance help make tools, as they help identify more undesirable scenarios.


Have you talked to a non-artificial intelligence lately? I’ve got some news for you…


This is the problem of the internet writ large.

The solution is to be selective and careful like always


> My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.

The same is true about the internet, and people even used to use these arguments to try to dissuade people from getting their information online (back when Wikipedia was considered a running joke, and journalists mocked blogs). But today it would be considered silly to dissuade someone from using the internet just because the information there is extremely unreliable.

Many programmers will say Stack Overflow is invaluable, but it's also unreliable. The answer is to use it as a tool and a jumping off point to help you solve your problem, not to assume that its authoritative.

The strange thing to me these days is the number of people who will talk about the problems with misinformation coming from LLMs, but then who seem to uncritically believe all sorts of other misinformation they encounter online, in the media, or through friends.

Yes, you need to verify the information you're getting, and this applies to far more than just LLMs.


Shades of grey fallacy. You have way more context clues about the information on the internet than you do with an LLM. In fact, with an LLM you have zero(-ish?).

I can peruse your previous posts to see how truthful you are, I can tell if your post has been down/upvoted, I can read responses to your post to see if you've been called out on anything, etc.

This applies tenfold in real life where over time you get to build comprehensive mental models of other people.


I have decided it must be attached to a sort of superiority complex. These types of people believe they are capable of deciphering fact from fiction but the general population isn’t so LLMs scare them because someone might hear something wrong and believe it. It almost seems delusional. You have to be incredibly self aggrandizing in your mind to think this way. If LLMs were actually causing “a problem” then there would be countless examples of humans making critical mistakes because of bad LLM responses, and that is decidedly not happening. Instead we’re just having fun ghiblifying the last 20 years of the internet.


> that is decidedly not happening

Regardless of anything else it’s extremely too early to make such claims. We have to wait until people start allowing “AI agents” to make autonomous blackbox decision with minimal supervision since nobody has any clue what’s happening.

Even if we tone down the SciFi dystopia angle, not that many people really use LLMs in non-superficial ways yet. What I'm most afraid of is the next generation growing up without the ability to critically synthesize information on their own.


Most people - the vast majority of people - cannot critically synthesize information on their own.

But the implication of what you are saying is that academic rigour is going to be ditched overnight because of LLMs.

That’s a little bit odd. Has the scientific community ever thrown up its collective hands and said “ok, there are easier ways to do things now, we can take the rest of the decade off, phew what a relief!”


> what you are saying is that academic rigour is going to be ditched overnight

Not across all level and certainly not overnight. But a lot of children entering the pipeline might end up having a very different experience than anyone else before LLMs (unless they are very lucky to be in an environment that provides them better opportunities).

> cannot critically synthesize information on their own.

That’s true, but if we even less people will try to so that or even know where to start that will get even worse.


No matter how things will evolve, that Ghiblification is something we will look back to in twenty years and say: "Remember how cool that was?"


People need time to adapt to LLMs' capacity to spit out nonsense. It'll take time but I'm sure they will.


> What's worse, people are treating them as authoritative

Because in people's experience, LLMs are often correct.

You are right LLMs are not authoritative, but people trust it exactly because they often do produce correct answers.


> I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm

Happened to me as well. Wanted it to quickly write an algorithm for standard deviation over a stream of data, which is a textbook algorithm. It did it almost right, but messed up the final formula and the code gave wrong answers. Weird, considering correct code for that problem exists on Wikipedia.
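
For context, the textbook streaming approach here is Welford's online algorithm; a minimal Python sketch of it (mine, not the LLM's attempt) looks like this:

  import math

  class RunningStd:
      """Welford's online algorithm: mean and standard deviation over a stream,
      without storing the values or using the cancellation-prone sum-of-squares formula."""
      def __init__(self):
          self.n = 0
          self.mean = 0.0
          self.m2 = 0.0  # running sum of squared deviations from the current mean

      def update(self, x):
          self.n += 1
          delta = x - self.mean
          self.mean += delta / self.n
          self.m2 += delta * (x - self.mean)

      def std(self, sample=True):
          if self.n < 2:
              return 0.0
          return math.sqrt(self.m2 / (self.n - 1 if sample else self.n))

You feed it the stream value by value via update() and can read std() at any point for the running standard deviation.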


It's always perplexing when people talk about LLMs as "it", as if there's only one model out there, and they're all equally accurate.

FWIW, here's 4o writing a selection sort: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d...


I don't understand the point of that share. There are likely thousands of implementations of selection sort on the internet and so being able to recreate one isn't impressive in the slightest.

And all the models are identical in not being able to discern what is real from something they just made up.


I guess the point is that op was pretty adamant that LLMs refuse to write selection sorts?


No? I mean if they refused that would actually be a reasonably good outcome. The real problem is if they generally can write selection sorts but occasionally go haywire due to additional context and start hallucinating.

I mean asking a straightforward question like: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d... is entirely pointless as a test


Because, to be blunt, I think this is total bullshit if you're using a decent model:

"I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over."


I was part of preparing an offer a few weeks ago. The customer prepared a lot of documents for us - maybe 100 pages in total. The boss insisted on using ChatGPT to summarize this stuff and read only the summary. I did a longer, slower reading and caught some topics ChatGPT outright dropped. Our offer was based on the summary - and fell through because we missed these nuances. But hey, the boss did not read as much as previously...


I saw someone saying 80% of doctors believe that LLMs are trustworthy consultation partners.

Code created by LLMs doesn't compile: hallucinated APIs, invalid syntax and completely broken logic. Why would you trust it with someone's life?


I wonder if the exact phrasing has varied from the source, but even then if "consultation partners" is doing the heavy lifting there. If it was something like "useful consultation partners", I can absolutely see value as an extra opinion that is easy to override. "Oh yeah, I hadn't thought about that option - I'll look into it further."

I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.


> I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.

I'd like to think so. Trust is also one of those non-concrete terms that have different meanings to different people. I'd like to think that doctors use their own judgement to include the output from their trained models, I just wonder how long it is till they become the default judgement when humans get lazy.


I think that's a fair assessment on trust as a term, and incorporating via personal judgement. If this was any public story, I'd also factor in breathless reporting about new tech.

Black-box decisions I absolutely have a problem with. But an extra resource considered by people with an understanding of risks is fine by me. Like I've said in other comments, I understand what it is and isn't good at, and have a great time using ChatGPT for feedback or planning or extrapolating or brainstorming. I automatically filter out the "Good point! This is a fantastic idea..." response it inevitably starts with...


I'll see if I can dig it up; it was from a real-life meeting, and I tossed the printed notes a while back in disgust.


Because LLM’s, with like 20% hallucination rate, are more reliable than overworked, tired doctors that can spend only one ounce of their brainpower on the patient they’re currently helping?


Yeah, I'm gonna need really strong evidence for that claim before I entrust my life to an AI.


Apologies, but have you noticed that if your entrusted (the "doctor") trusted the unentrustable (the "LLM"), then your entrusted is not trustworthy?


yes, I have noticed.. and I am concerned.


"Quis custodiet ipsos custodes". The old problem.

In fact, the phenomenon of pseudo-intelligence scares those who were hoping to get tools that limited the original problem, as opposed to potentially boosting it.


>I saw someone saying 80% of doctors believe that LLM's are trustworthy consultation partners.

See, now that is something I don't know why I should trust: a random person on the internet citing a statistics that they saw someone else say.


The claim seems plausible because it doesn't say there was any formal evaluation, just that some doctors (who may or may not understand how LLMs work) hold an opinion.


I wish I could cite the actual study, but my feeble mind only remembers the anger I felt at the statistic.

Unlike the LLM, i'm willing to be truthful about my memory.


Luckily, we programmers can fix things like syntax errors.


> I saw someone saying

The irony...


> What's worse, people are treating them as authoritative.

So what? People are wrong all the time. What happens when people are wrong? Things go wrong. What happens then? People learn that the way they got their information wasn't robust enough and they'll adapt to be more careful in the future.

This is the way it has always worked. But people are "worried" about LLMs... Because they're new. Don't worry, it's just another tool in the box, people are perfectly capable of being wrong without LLMs.


Being wrong when you are building a grocery management app is one thing, being wrong when building a bridge is another.

For those sensitive use cases, it is imperative we create regulation, like every other technology that came before it, to minimize the inherent risks.

In an unrelated example, I saw someone saying recently they don't like a new version of an LLM because it no longer has "cool" conversations with them, so take that as you will from a psychological perspective.


I have a hard time taking that kind of worry seriously. In ten years, how many bridges will have collapsed because of LLMs? How many people will have died? Meanwhile, how many will have died from fentanyl or cars or air pollution or smoking? Why do people care so much about the hypothetical bad effects of new technology and so little about the things we already know are harmful?


A tool is good but lots of people are stupid and misuse it… That’s just life buddy.


Humans bullshit and hallucinate and claim authority without citation or knowledge. They will believe all manner of things. They frequently misunderstand.

The LLM doesn’t need to be perfect. Just needs to beat a typical human.

LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.


LLMs can't be held accountable.

And many, many companies are proposing and implementing uses for LLMs to intentionally obscure that accountability.

If a person makes up something, innocently or maliciously, and someone believes it and ends up getting harmed, that person can have some liability for the harm.

If an LLM hallucinates something that someone believes, and they end up getting harmed, there's no accountability. And it seems that AI companies are pushing for laws & regulations that further protect them from this liability.

These models can be useful tools, but the targets these AI companies are shooting for are going to be actively harmful in an economy that insists you do something productive for the continued right to exist.


This is correct. On top of that, the failure modes of AI systems are unpredictable and incomprehensible. Present-day AI systems can fail on/be fooled by inputs in surprising ways that no human would.


Accountability exists for two reasons:

1. To make those harmed whole. On this, you have a good point. The desire of AI firms or those using AI to be indemnified from the harms their use of AI causes is a problem as they will harm people. But it isn't relevant to the question of whether LLMs are useful or whether they beat a human.

2. To incentivize the human to behave properly. This is moot with LLMs. There is no laziness or competing incentive for them.


> This is moot

That’s not a positive at all, the complete opposite. It’s not about laziness but being able to somewhat accurately estimate and balance risk/benefit ratio.

The fact that making a wrong decision would have significant costs for you and other people should have a significant influence on decision making.


> This is moot with LLMs. There is no laziness or competing incentive for them.

The incentives for the LLM are dictated by the company, at the moment it only seems to be 'whatever ensures we continue to get sales'.


[flagged]


That reads as "people shouldn't trust what AI tells them", which is in opposition to what companies want to use AI for.

An airline tried to blame its chatbot for inaccurate advice it gave (whether a discount could be claimed after a flight). Tribunal said no, its chatbot was not a separate legal entity.

https://www.bbc.com/travel/article/20240222-air-canada-chatb...


Yeah. Where I live, we are always reminded that our conversations with insurance provider personnel over phone are recorded and can be referenced while making a claim.

Imagine a chatbot making false promises to prospective customers. Your claim gets denied, you fight it out only to learn their ToS absolves them of "AI hallucinations".


> LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.

On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.

My problem isn't that humans are doing similar things to LLMs, my problem is that humans can understand consequences of bullshitting at the wrong time. LLMs, on the other hand, operate purely on bullshitting. Sometimes they are right, sometimes they are wrong. But what they'll never do or tell you is "how confident am I that this answer is right". They leave the hard work of calling out the bullshit on the human.

There's a level of social trust that exists which LLMs don't follow. I can trust when my doctor says "you have a cold" that I probably have a cold. They've seen it a million times before and they are pretty good at diagnosing that problem. I can also know that doctor is probably bullshitting me if they start giving me advice for my legal problems, because it's unlikely you are going to find a doctor/lawyer.

> Just needs to beat a typical human.

My issue is we can't even measure accurately how good humans are at their jobs. You now want to trust that the metrics and benchmarks used to judge LLMs are actually good measures? So much of the LLM advocates try and pretend like you can objectively measure goodness in subjective fields by just writing some unit tests. It's literally the "Oh look, I have an oracle java certificate" or "Aws solutions architect" method of determining competence.

And so many of these tests aren't being written by experts. Perhaps the coding tests, but the legal tests? Medical tests?

The problem is LLM companies are bullshiting society on how competently they can measure LLM competence.


> On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.

Some humans can, certainly. Humans as a race? Maybe, ish.


Well, there are still millions that can. There is a handful of competitive LLMs, and their output given the same inputs is near identical in relative terms (compared to humans).


"On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something."

You can do the same with an LLM; I gaslight ChatGPT all the time so it doesn't hallucinate.


Your second point directly contradicts your first point.

In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.

As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.

No external help, just conversations with ChatGPT and some Googling.

Obviously LLMs have issues, but if we're now in the "Beginners can program their own custom apps" phase of the cycle, the potential is huge.


> As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.

This is actually an interesting one - I’ve seen a case where some copy/pasted PDF saving code caused hundreds of thousands of subtly corrupted PDFs (invoices, reports, etc.) over the span of years. It was a mistake that would be very easy for an LLM to make, but I sure wouldn’t want to rely on chatgpt to fix all of those PDFs and the production code relying on them.


Well, humans are not a monolithic hive mind that all behave exactly the same as an “average” lawyer, doctor, etc. That provides very obvious and very significant advantages.

> days to go from a cold start with zero dev experience

How is that relevant?


>> In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims agains lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.

This paragraph makes little sense. A negligence claim is based on a deviation from some reasonable standard, which is essentially a proxy for the level of care/service that most practitioners would apply in a given situation. If doctors were as regularly incompetent as you are trying to argue then the standard for negligence would be lower because the overall standard in the industry would reflect such incompetence. So the existence of negligence claims actually tells us little about how good a doctor is individually or how good doctors are as a group, just that there is a standard that their performance can be measured against.

I think most people would agree with you that medical negligence claims are a huge problem, but I think that most of those people would say the problem is that so many of these claims are frivolous rather than meritorious, resulting in doctors paying more for malpractice insurance than necessary and also resulting in doctors asking for unnecessarily burdensome additional testing with little diagnostic value so that they don’t get sued.

I won’t defend lawyers. They’re generally scum.


It's fine if it isn't perfect if whoever is spitting out answers assumes liability when the robot is wrong. But what people want is for the robot to answer questions and there to be no liability, when it is well known that the robot can be wildly inaccurate sometimes. They want the illusion of value without the liability of the known deficiencies.

If LLM output is like a magic 8 ball you shake, that is not very valuable unless it is workload management for a human who will validate the fitness of the output.


I never ask a typical human for help with my work, why should that be my benchmark for using an information tool? Afaik, most people do not write about what they don't know, and if one made a habit of it, they would be found and filtered out of authoritative sources of information.


ok, but people are building determinative software _on top of them_. It's like saying "it's ok, people make mistakes, but lets build infrastructure on some brain in a vat". It's just inherently not at the point that you can make it the foundation of anything but a pet that helps you slop out code, or whatever visual or textual project you have.

It's one of those "the quantity is so fascinating, let's ignore how we got here in the first place" situations.


You’re moving the goalposts. LLMs are masquerading as superb reference tools and as sources of expertise on all things, not as mere “typical humans.” If they were presented accurately as being about as fallible as a typical human, typical humans (users) wouldn’t be nearly as trusting or excited about using them, and they wouldn’t seem nearly as futuristic.



