That’s a very real example of the core problem: LLMs don’t reliably honor constraints, even when they’re explicit and simple. Instruction drift shows up fast in learning tasks — and quietly in production systems.
That’s why trusting them “agentically” is risky. The safer model is to assume outputs are unreliable and validate after generation.
I’m working on this exact gap with Verdic Guard (verdic.dev) — treating LLM output as untrusted input and enforcing scope and correctness outside the model. Less about smarter prompts, more about predictable behavior.
Your Spanish example is basically the small-scale version of the same failure mode.
A helpful way to learn this is to separate models, machines, and practice.
For computation models, the circuit model and measurement-based computation cover most real work. Aaronson’s Quantum Computing Since Democritus and Nielsen & Chuang explain why quantum differs from classical (interference, amplitudes, complexity limits).
For computers/architecture, think of qubits as noisy analog components and error correction as how digital reliability is built on top. Preskill’s NISQ notes are very clear here.
For programming, most work is circuit construction and simulation on classical hardware (Qiskit, Cirq). That's normal and expected (a short sketch follows below).
Beyond Shor, look at Grover, phase estimation, and variational algorithms—they show how quantum advantage might appear, even if it’s limited today.
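As a first taste of that toolchain, here is a minimal sketch, assuming Qiskit is installed; it builds a two-qubit Bell state and simulates it classically by inspecting the statevector, which is the kind of circuit construction and simulation that makes up most day-to-day work:

```python
# Minimal sketch (assumes Qiskit is installed): construct a two-qubit
# Bell-state circuit and simulate it classically via its statevector.
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

qc = QuantumCircuit(2)
qc.h(0)      # put qubit 0 into an equal superposition
qc.cx(0, 1)  # entangle qubit 1 with qubit 0

state = Statevector.from_instruction(qc)
print(state.probabilities_dict())  # ~ {'00': 0.5, '11': 0.5}
```

Everything here runs on a classical machine; the only "quantum" part is that the measurement statistics come from interfering amplitudes rather than classical probabilities.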
This matches our experience too. The biggest reduction in hallucinations usually comes from shrinking the action space, not improving the prompt. When inputs, tools, and outputs are explicitly constrained, the model stops “being creative” in places where creativity is actually risk.
It’s less about smarter models and more about making the system boring and deterministic at each step.
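For what it's worth, "shrinking the action space" can be as simple as an explicit allowlist check in front of every tool call. A rough sketch in Python (tool names and arguments are purely illustrative, not from any particular framework):

```python
# Illustrative sketch: constrain the model's action space by validating a
# proposed tool call against an explicit allowlist before executing anything.
ALLOWED_TOOLS = {
    "search_orders": {"order_id"},           # tool name -> permitted argument names
    "refund_order": {"order_id", "amount"},
}

def validate_action(tool: str, args: dict) -> bool:
    """Reject any tool or argument the system was not explicitly given."""
    allowed_args = ALLOWED_TOOLS.get(tool)
    if allowed_args is None:
        return False                          # unknown tool: never execute
    return set(args) <= allowed_args          # no extra, unapproved arguments

print(validate_action("refund_order", {"order_id": "A1", "amount": 20}))  # True
print(validate_action("delete_user", {"user_id": "A1"}))                  # False
```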
Fair enough. A healthy dose of skepticism has served us well for every overhyped wave so far. The difference this time seems to be that AI systems don’t just fail noisily — they fail convincingly, which changes how risk leaks into production.
Treating them with the same paranoia we applied to web-scale infra and crypto is probably the right instinct. The chupacabra deserved it too.
To someone that has paid zero attention and/or deliberately ignores any coverage about the numerous (and often hilarious) ways that spicy autocomplete just completely shits the bed? Sure, maybe.
That’s fair — if you’re already skeptical and paying attention, the failures are obvious and often funny. The risk tends to show up more with non-experts or downstream systems that assume the output is trustworthy because it looks structured and confident.
Autocomplete failing loudly is annoying; autocomplete failing quietly inside automation is where things get interesting.
This hits a key point that isn't emphasised enough. A few interactions with technology and people have shaped my view:
I fiddled with Apple's Image Playground thing sometime last year, and it was quite rewarding to see a result from a simple description. It wasn't exactly what I'd asked for, but it was close, kind of. As someone who has basically zero artistic ability, it was fun to be able to create something like that. I didn't think much about it at the time, but recently I thought about this again, and I keep that in mind when seeing people who are waxing poetic about using spicy autocomplete to "write code" or "analyse business requests" or whatever the fuck it is they're using it for. Of course it seems fucking magical and foolproof if you don't know how to do the thing you're asking it for yourself.
I had to fly back to my childhood home at very short notice in August to see my (as it turned out, dying) father in hospital. I spoke to more doctors, nurses and specialists in two weeks than I think I have ever spoken to about my own health in 40+ years. I was relaying the information from doctors to my brother via text message. His initial response to said information was to send me back a Chat fucking GPT summary/analysis of what I'd passed along to him... because apparently my own eyeballs seeing the physical condition of our father, and a doctor explaining the cause, prognosis and chances of recovery were not reliable enough. Better ask Dr Spicy Autocomplete for a fucking second opinion I guess.
So now my default view about people who choose to use spicy autocomplete for anything besides shits and giggles like "Write a Star Trek fan fiction where Captain Jack Sparrow is in charge of DS9", or "Draw <my wife> as a cute little bunny" is essentially "yeah of course it looks like infallible magic, if you don't know how to do it yourself".
The real risk with LLMs isn’t when they fail loudly — it’s when they fail quietly and confidently, especially for non-experts or downstream systems that assume structured output equals correctness.
When you don’t already understand the domain, AI feels infallible. That’s exactly when unvalidated outputs become dangerous inside automation, decision pipelines, and production workflows.
This is why governance can’t be an afterthought. AI systems need deterministic validation against intent and execution boundaries before outputs are trusted or acted on — not just better prompts or post-hoc monitoring.
That gap between “sounds right” and “is allowed to run” is where tools like Verdic Guard are meant to sit.
Deep nesting: The indexer enforces a 255-depth limit (and gives a clear error if exceeded). The limit keeps depth in a u8 and doubles as a stack-overflow safety guard. Details on the Known Limitations page: https://giantjson.com/docs/known-limitations/
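For anyone curious what that kind of guard looks like, here is a rough sketch (not GiantJSON's actual code) of a single-pass depth check that never recurses and fails fast once the limit is exceeded:

```python
# Illustrative sketch: track nesting depth while scanning JSON structurally,
# skipping brackets inside strings, and reject anything deeper than 255.
MAX_DEPTH = 255

def check_depth(json_text: str, max_depth: int = MAX_DEPTH) -> int:
    depth, deepest, in_string, escaped = 0, 0, False, False
    for ch in json_text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
            if depth > max_depth:
                raise ValueError(f"nesting exceeds {max_depth} levels")
        elif ch in "]}":
            depth -= 1
    return deepest

print(check_depth('{"a": [[1, 2], {"b": 3}]}'))  # 3
```

A single linear pass with no recursion is what makes a hard 255 limit cheap to enforce even on very large files.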
Wide objects / long lines: This was actually the harder problem. In Text Mode, extremely long lines (especially without spaces, like minified JSON or base64 blobs) caused serious issues with Android's text layout engine. I ended up detecting those early and truncating at ~5KB for display.
In Browser Mode, cards truncate values aggressively (100 chars collapsed, 1000 chars expanded), but the full value is still available for copy-to-clipboard operations. I also tried to make truncation "useful" by sniffing for magic bytes—if it looks like base64-encoded data, it shows a badge with the detected format (PNG, PDF, etc.) and lets you extract/download it.
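The magic-byte sniffing can be done without decoding the whole blob. A rough sketch of the idea (the signature table is a small illustrative subset, not GiantJSON's actual detection code):

```python
# Illustrative sketch: decode only a short prefix of a base64 value and check
# it against well-known magic numbers to label the payload's format.
import base64
from typing import Optional

MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP",
}

def sniff_base64(value: str) -> Optional[str]:
    try:
        head = base64.b64decode(value[:64], validate=True)  # decode a small prefix only
    except Exception:
        return None                                          # not valid base64
    for magic, label in MAGIC.items():
        if head.startswith(magic):
            return label
    return None

png_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 32).decode()
print(sniff_base64(png_b64))  # PNG
```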
Index build time & memory: These are definitely the limiting factors right now. The structural index itself grows linearly with node count (32 bytes/node stored on disk), and for minified JSON I also keep a sparse line index in memory. For big files, the initial indexing can take a minute—I'm not sure if that scares users away or if they expect it for a GB-sized file.
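To make the 32 bytes/node concrete, here is a purely hypothetical fixed-width layout (GiantJSON's real record format isn't shown in this thread); the point is just that fixed-size records are what keep the index linear in node count and cheap to seek into:

```python
# Hypothetical 32-byte node record, for illustration only:
# type (1B) + depth (1B) + pad (2B) + parent (4B) + first_child (4B) +
# next_sibling (4B) + byte_offset (8B) + byte_length (8B) = 32 bytes
import struct

NODE_FORMAT = "<BBxxIIIQQ"

def pack_node(node_type, depth, parent, first_child, next_sibling, offset, length):
    return struct.pack(NODE_FORMAT, node_type, depth, parent,
                       first_child, next_sibling, offset, length)

record = pack_node(node_type=1, depth=3, parent=7, first_child=0,
                   next_sibling=9, offset=1024, length=58)
print(len(record))  # 32
```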
I've been watching Play Console for ANRs/OOMs and so far just had 1-2 isolated cases that I could fix from the stack traces. But honestly, I'm still figuring out which direction to prioritize next—real-world usage patterns will tell me more than my synthetic tests did.
This matches what I’ve seen as well. A lot of “debt relief” and “settlement” services are essentially rent-seeking intermediaries that leave consumers worse off or stuck in long programs with unclear outcomes.
Non-profit credit counseling (with transparent fee structures and regulatory oversight) tends to be the only consistently legitimate option. Anything that promises easy reductions or fast fixes should probably be treated with extreme skepticism.
Consumer finance is one of those areas where incentives are misaligned enough that doing nothing is often safer than trusting a glossy solution.
That’s a very sane stance. Treating LLM output as untrusted input is probably the correct default when correctness matters.
The worst failures I’ve seen happen when teams half-trust the model — enough to automate, but still needing heavy guardrails. Putting the checks outside the model keeps the system understandable and deterministic.
Ignoring AI unless it can be safely boxed isn’t anti-AI — it’s good engineering.
That framing resonates a lot. In production, creativity is often just unbounded variance.
Once each step is intentionally boring and constrained, failures become predictable and debuggable — which is what engineering actually optimizes for. That tradeoff is almost always worth it.
I’m building Verdic Guard (verdic.dev) around the same idea: treat LLMs as creative generators, but enforce scope and correctness outside the model so systems stay calm under load.
This is a very pragmatic take. The “90% accuracy is a liability” line resonates — in high-stakes systems, partial automation often costs more than it saves.
What I like here is the field-level confidence gating instead of a single document score. That maps much better to real failure modes, where one bad value (amount, date, vendor) can invalidate the whole record.
One question I’m curious about: how stable are the confidence thresholds over time? In similar systems I’ve seen, models tend to get confidently wrong under distribution shift, which makes static thresholds tricky.
Have you considered combining confidence with explicit intent or scope constraints (e.g., what the system is allowed to infer vs. must escalate), rather than confidence alone?
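To make that question concrete, here is a rough sketch of what I mean by combining per-field confidence with explicit scope rules (field names, thresholds, and the escalation policy are all illustrative):

```python
# Illustrative sketch: per-field confidence thresholds plus a scope rule
# (some fields are never auto-accepted, no matter how confident the model is).
FIELD_THRESHOLDS = {"vendor": 0.90, "date": 0.95, "amount": 0.98}
ALWAYS_ESCALATE = {"amount"}   # scope rule: this field always goes to review

def gate(extraction):
    """extraction: {field: (value, confidence)} -> (accepted, escalated) split."""
    accepted, escalated = {}, {}
    for field, (value, conf) in extraction.items():
        in_scope = field in FIELD_THRESHOLDS and field not in ALWAYS_ESCALATE
        if in_scope and conf >= FIELD_THRESHOLDS[field]:
            accepted[field] = value
        else:
            escalated[field] = value
    return accepted, escalated

accepted, escalated = gate({
    "vendor": ("Acme GmbH", 0.97),   # above threshold -> accepted
    "date": ("2024-07-03", 0.91),    # below threshold -> escalated
    "amount": (129.90, 0.99),        # out of scope -> escalated despite high confidence
})
print(accepted, escalated)
```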
Overall, this feels much closer to how production systems should treat AI — not as an oracle, but as a component that earns trust incrementally.
This resonates. A lot of AI reading tools optimize for removal of effort (summaries, shortcuts), which often ends up weakening comprehension rather than strengthening it.
One thing I’m curious about: how do you decide when the AI should intervene versus stay silent? In deep reading, timing matters a lot — too much contextual help can break flow, too little can frustrate.
Have you observed differences across use cases (e.g. technical papers vs. philosophy vs. fiction)? It feels like the “right amount” of AI assistance probably isn’t static and might depend on reader intent and text difficulty.
Interesting direction overall — especially the idea of AI as a reading companion rather than a replacement.
Whether or not hallucination “happens often” depends heavily on the task domain and how you define correctness. In a simple conversational question about general knowledge, an LLM might be right more often than not — but in complex domains like cloud config, compliance, law, or system design, even a single confidently wrong answer can be catastrophic.
The real risk isn’t frequency averaged across all use cases — it’s impact when it does occur. That’s why confidence alone isn’t a good proxy: models inherently generate fluent text whether they know the right answer or not.
A better way to think about it is: Does this output satisfy the contract you intended for your use case? If not, it’s unfit for production regardless of overall accuracy rates.
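Concretely, "satisfies the contract" can be as literal as a deterministic schema check before anything downstream runs. A minimal sketch (the schema, fields, and the jsonschema dependency are illustrative choices, not a prescription):

```python
# Illustrative sketch: validate an LLM's structured output against an explicit
# contract (a JSON Schema) before it is trusted or acted on.
from jsonschema import validate, ValidationError  # pip install jsonschema

CONTRACT = {
    "type": "object",
    "properties": {
        "region": {"enum": ["eu-west-1", "us-east-1"]},
        "instance_count": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["region", "instance_count"],
    "additionalProperties": False,
}

def is_fit_for_production(llm_output: dict) -> bool:
    try:
        validate(instance=llm_output, schema=CONTRACT)
        return True
    except ValidationError:
        return False  # fluent but out of contract: reject instead of executing

print(is_fit_for_production({"region": "eu-west-1", "instance_count": 3}))     # True
print(is_fit_for_production({"region": "eu-central-9", "instance_count": 3}))  # False
```

The point isn't the library; it's that fitness for production is decided by an explicit, deterministic check rather than by how confident the text sounds.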