It beeps at you if you stop paying attention, which is superior. Hands-on-wheel is an arbitrary design decision, more likely there to placate what a layman would think is necessary to ensure safe AI steering.
This doesn't mean it's over for SO. It just means we'll probably trend towards quality over quantity. Measuring SO's success by the number of questions asked is like measuring code quality by lines of code. Eventually SO would have trended down anyway, simply because advances in search technology help users find existing answers rather than ask new ones. It just so happened that AI advances made that even better (in terms of not needing to ask redundant questions).
With coding agents I almost never manually type code anymore. It would be great to have a code editor that runs on my phone so I can do voice prompts and let the coding agents type for me.
To be fair, with the languages I use there are only a finite number of ways a particular line or even function can be implemented, thanks to high-level algebraic data types and strict type checking. Business logic is encoded as data requirements, which are encoded into types, which are enforced by the type checker. Even a non-AI system could technically be made to fill in the code, but an AI system lets this be generalized across many languages that never implemented that kind of auto-completion.
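To make that concrete, here's a minimal hypothetical sketch (the order/pay/refund names are made up) of what I mean by business logic living in the types: the rule "only paid orders can be refunded" is enforced by the type checker rather than by runtime checks or reviewers.

    (* Minimal sketch, not real code from any project: a phantom type
       parameter tracks payment state at compile time. *)
    type unpaid
    type paid

    type 'state order = { id : int; amount : float }

    let create ~id ~amount : unpaid order = { id; amount }

    let pay (o : unpaid order) : paid order = { id = o.id; amount = o.amount }

    (* [refund] only accepts a [paid order]; passing an unpaid one is a
       compile-time type error, so the invalid flow never runs at all. *)
    let refund (o : paid order) : unit =
      Printf.printf "refunding order %d for %.2f\n" o.id o.amount

    let () =
      let o = create ~id:1 ~amount:10.0 in
      refund (pay o)
      (* refund o  -- rejected by the compiler: unpaid <> paid *)

In real code you'd hide the record behind a module signature so the only way to get a paid order is through pay, but the idea is the same: the blank the AI has to fill is heavily constrained before anything executes.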
I have been doing this with GitHub Copilot's agent web interface on my phone; a word-vomit voice prompt plus instructions to always run the tests or take screenshots (so I can evaluate the change) works really well.
We already have verification layers: high-level strictly typed languages like Haskell, OCaml, ReScript/Melange (JS ecosystem), PureScript (JS), Elm, Gleam (Erlang), F# (.NET ecosystem).
These aren't just strict type systems: the languages also support algebraic data types, nominal types, etc., which allow for encoding higher-level types that are enforced by the language's compiler.
The AI essentially becomes a glorified blank-filler. Basic syntax or type errors, while common, are caught automatically by the compiler as part of the vibe-coding feedback loop.
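As a made-up example of the kind of blank the compiler checks on every iteration: if the agent fills in a match over a variant and forgets a case, the compiler reports it immediately and the error goes straight back into the loop, no human needed.

    (* Hypothetical payment ADT; the constructor set is effectively the spec. *)
    type payment =
      | Cash
      | Card of string                               (* last four digits *)
      | Voucher of { code : string; expired : bool }

    (* Dropping any of these branches produces a non-exhaustive-match
       warning/error from the compiler, which the agent sees and fixes. *)
    let describe (p : payment) : string =
      match p with
      | Cash -> "paid in cash"
      | Card last4 -> "card ending in " ^ last4
      | Voucher { code; expired = false } -> "voucher " ^ code
      | Voucher { expired = true; _ } -> "expired voucher"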
Interestingly, coding models often struggle with complex type systems, e.g. in Haskell or Rust. Of course, part of this has to do with the relative paucity of relevant training data, but there are also "cognitive" factors that mirror what humans tend to struggle with in those languages.
One big factor behind this is the fact that you're no longer just writing programs and debugging them incrementally, iteratively dealing with simple concrete errors. Instead, you're writing non-trivial proofs about all possible runs of the program. There are obviously benefits to the outcome of this, but the process is more challenging.
Actually, I've found the coding models work really well with these languages, and the type systems aren't that complex. OCaml's type system is really simple, which is probably why the compiler can be so fast. Even back in the "beta" days of Copilot, despite it being marketed as Python-only, I found it handled OCaml syntax just as well.
The coding models work really well with esoteric syntaxes, so if the biggest hurdle to Haskell adoption was syntax, that's definitely less of a hurdle now.
> Instead, you're writing non-trivial proofs about all possible runs of the program.
All possible runs of a program is exactly what HM type systems check for. Fed into the coding model, this means it automatically iterates until it finds a solution that doesn't violate any possible run of the program.
There's a reason I mentioned Haskell and Rust specifically. You're right, OCaml's type system is simpler in some relevant respects, and may avoid the issues that I was alluding to. I haven't worked with OCaml for a number of years, since before the LLM boom.
The presence of type classes in Haskell and traits in Rust, and of course the memory lifetime types in Rust, is a big part of the complexity I mentioned.
(Edit: I like type classes and traits. They're a big reason I eventually settled on Haskell over OCaml, and one of the reasons I like Rust. I'm also not such a fan of the "O" in OCaml.)
> All possible runs of a program is exactly what HM type systems type check for.
Yes, my point was this can be a more difficult goal to achieve.
> This fed into the coding model automatically iterates until it finds a solution that doesn't violate any possible run of the program.
Only if the model is able to make progress effectively. I have some amusing transcripts of the opposite situation.
I also try to do verbose type classes using OCaml's module system, and it's been handling these patterns pretty well. My guess is there is probably good training data for these patterns since they are well documented. I haven't actually used coding agents with Haskell yet, so it's possible that OCaml's verbosity helps the agent.
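For anyone unfamiliar, this is roughly the shape of the pattern I mean (names here are made up): the module type plays the role of the class, a module is an instance, and a functor is a generic function constrained by the class. More verbose than Haskell type classes, but it's all plain modules.

    (* "Type class via modules" sketch: SHOW is the class, ShowInt is an
       instance, MakePrinter is a generic function constrained by SHOW. *)
    module type SHOW = sig
      type t
      val show : t -> string
    end

    module MakePrinter (S : SHOW) = struct
      let print_all (xs : S.t list) =
        List.iter (fun x -> print_endline (S.show x)) xs
    end

    module ShowInt : SHOW with type t = int = struct
      type t = int
      let show = string_of_int
    end

    module IntPrinter = MakePrinter (ShowInt)

    let () = IntPrinter.print_all [ 1; 2; 3 ]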
I've been in many situations where I wanted translations, and I can't think of one where I'd actually want to rely on either glasses or the airpods working like they do in the demos.
The crux of it for me:
- if it's not a person, it will be out of sync; you'll be stopping it every 10 seconds to get the translation. You could just as well use your phone, it would be the same, and there's a strong chance the media is already playing from the phone anyway, so having the translation embedded there would be an option.
- with a person, the other person needs to understand when your translation is going on, and when it's over, so they know when to answer or when they can go on. Having a phone in plain sight is actually great for that.
- the other person has no way to check if your translation is completely out of whack. Most of the time they have some vague understanding, even if they can't really speak. Having the translation in the glasses removes any possible control.
There are a ton of smaller points, but all in all the bar for a translation device to become magic and just work, plugged into your ear or glasses, is so high that I don't expect anything to beat a smartphone within my lifetime.
Some of your points are already addressed by current implementations. AirPods Live Translation uses your phone to display what you say to the target person, and the target person's speech is played to your AirPods. I think the main issues are that there is a massive delay and that Apple's translation models are inferior to ChatGPT. The other thing is the AirPods don't really add much: it works about the same as if you just had the translation app open and both people were talking to it.
Aircaps demos show it to be pretty fast and almost real time. Meta's live captioning works really fast and is supposed to be able to pick out who is talking in a noisy environment by having you look at the person.
I think most of your issues are just a matter of the models improving and running faster. I've found translations tend not to be out of whack, but that's something that can only be solved by better translation models anyway. In the case of AirPods Live Translation, the app shows both people's text.
That's understating the lag. Faster will always be better, but even "real time" still requires the other person to complete their sentence before you get a translation (there is the edge case of the other language having a similar grammatical structure and word order, but IMHO that's rare), and you catch up from there. That's enough lag to warrant putting the whole translation process literally on the table.
I do see real improvements in the models; for IRL translation I just think phones are already very good at this, and improving from there will be exponentially difficult.
IMHO it's the same for "bots" intervening in meetings (commenting/reacting on exchanges, etc.). Interfacing multiple humans in the same scene is always a delicate problem.
I have the G1 glasses and unfortunately the microphones are terrible, so the live translation feature barely works. Even if you sit in a quiet room and try to make conditions perfect, the accuracy of transcription is very low. If you try to use it out on the street it rarely gets even a single word correct.
This is the sad reality of most of these AI products: they are just tacking poor feature implementations onto the hardware. It seems like if they just picked one of these features and did it well, the glasses would be useful.
Meta has a model just for isolating speech in noisy environments (the “live captioning feature”) and it seems that’s also the main feature of the Aircaps glasses. Translation is a relatively solved problem. The issue is isolating the conversation.
I’ve found Meta is pretty good about not overpromising features, and as a result, even though they probably have the best hardware and software stack of any glasses, the stuff you can do with the Ray-Ban Displays is extremely limited.
Is it even possible to translate in real time? In many languages and sentences the meaning and translation needs to completely change all thanks to one additional word at the very end. Any accurate translation would need to either wait for the end of a sentence or correct itself after the fact.
Live translation is a well-solved problem by this point: the translation updates as it goes, so while you may see a mistranslation mid-sentence, it corrects itself once the last word is spoken. The user does need to be aware of this, but in my experience it works well.
Bear in mind that simultaneous interpretation by humans (eg with a headset at a meeting of an international organisation) has been a thing for decades.
But these models are more like generalists no? Couldn’t they simply be hooked up to more specialized models and just defer to them the way coding agents now use tools to assist?
There would be no point in going via an LLM then, if I had a specialist model ready I'd just invoke it on the images directly. I don't particularly need or want a chatbot for this.
Current LLMs are doing this for coding, and it's very effective. It delegates to tool calls, but a specialized model can just be thought of as another tool. The LLM can be weak in some stuff handled by simple shell scripts or utilities, but strong in knowing what scripts/commands to call. For example, doing math via the model natively may be inaccurate, but the model may know to write the code to do math. An LLM can automate a higher level of abstraction, in the same way a manager or CEO might delegate tasks to specialists.
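To make the shape of that concrete, here's a toy sketch (the tool names and the routing heuristic are made up; in practice the LLM itself emits the tool call): a specialized model is just one more variant the orchestrator can route to, next to a calculator or a shell script.

    (* Toy dispatcher: the generalist picks the tool, the tools do the precise work. *)
    type tool =
      | Calculator            (* exact arithmetic instead of the LLM "doing math" *)
      | VisionModel of string (* a specialized model, addressed by name *)
      | Shell of string       (* a script or utility the LLM knows how to invoke *)

    let route (request : string) : tool =
      (* Stand-in for the model's decision about which tool to call. *)
      if String.length request > 0 && request.[0] = '=' then Calculator
      else if Filename.check_suffix request ".png" then VisionModel "handwriting-ocr"
      else Shell "run_tests.sh"

    let () =
      match route "invoice_0042.png" with
      | Calculator -> print_endline "delegating to exact arithmetic"
      | VisionModel name -> Printf.printf "delegating to specialized model: %s\n" name
      | Shell cmd -> Printf.printf "delegating to shell: %s\n" cmd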
In this case I'm building a batch workflow: images come in, images get analyzed through a pipeline, images go into a GUI for review. The idea of using a VLM was just to avoid hand-building a solution, not because I actually want to use it in a chatbot. It's just interesting that a generalist model that has expert-level handwriting recognition completely falls apart on a different, but much easier, task.
I think a bigger problem is HN readers mind-reading what the rest of the world wants. At least when an HN reader tells us what they want, it's a primary source; a comment from an HN reader postulating what the rest of the world wants is simply noisier than an unrepresentative sample of what the world may want.
I would guess HN readers are not an average cross-section of broader society, but I would also guess that because of that HN readers would be pretty bad at understanding what broader society is thinking.
Agreed. At best, most of the stuff I ended up buying from an Instagram ad turned out to be oversold, overpromised, and underdelivered. While not outright scams, it's sort of training me to avoid buying anything from ads...
There is an entire network of "get rich quick, just buy my PDF" Instagrammers, who peddle a PDF teaching you how to find a Chinese product, make a website, and then drop-ship that product to unsuspecting buyers at 3x the cost.
Probably 75% of the products you see in Instagram ads you can go find on Temu for their actual cost, usually at an 80% discount.
It got so bad that even non-tech savvy people around me learnt to do a lot of research about any product shown on Instagram ads.
To me, any product advertised on Instagram or through YouTuber sponsorships has become synonymous with overpromised bullshit, if not an outright scam. Every single time I see a sponsorship deal in a YouTube video I do some research just to validate it, and the vast majority are outright shitty products.
It's been working great as a signal of what products not to buy.
One of my theories is that there aren't actually enough honest companies buying ad space to satisfy the shareholders of companies like Alphabet or Meta. If they actually cared to filter out the ads for junk products and services, there would probably be a minor collapse in the industry.
Honest companies are priced out by scammy ones, and as long as the ad platforms share in the profits, they are totally fine profiting off scams. They make more money off the scams, simply put.