I think spatial tokens could help, but they're not really necessary. Lots of physics/physical tasks can be solved with pencil and paper.
On the other hand, it's amazing that a 512x512 image can be represented by 85 tokens (as in OAI's API), or 263 tokens per second for video (with Gemini). It's as if the memory vs compute tradeoff has morphed into a memory vs embedding question.
This dichotomy reminds me of the "apple rotators" question - can you rotate an apple in your head? The spatial embeddings will likely solve dynamics questions a lot more intuitively (i.e., without extended thinking).
We're also working in this space at FlyShirley - training pilots to fly, then training Shirley to fly - where we benefit from established simulation tools. Looking forward to trying Fei-Fei's models!
They're building on a false premise that human equivalent performance using cameras is acceptable.
That's the whole point of AI - when you can think really fast, the world is really slow. You simulate things. Even with lifetimes of data, the cars will still fail in visual scenarios where the error bars on ground truth shoot through the roof.
Elon seems to believe his cars will fail in similar ways to humans because they use cameras. False premise. As Waymo scales, human-level just isn't good enough, except for humans.
So, I agree with what you're saying, but that doesn't matter.
The legal standing doesn't care what tech is behind it - it could be 1000 monkeys for all it matters. The point is that Level 3 is the most dangerous level, because neither the public nor the manufacturer properly operates in this space.
Definitely curious what circuits light up from a Neuralese perspective. We want reasoning traces that are both faithful to the thought process and interpretable. If the other language segments are lighting up meanings much different than their translations, that would raise questions for me.
Yes absolutely this! We're working on these problems at FlyShirley for our pilot training tool. My go-to is: I'm facing 160 degrees and want to face north. What's the quickest way to turn and by how much?
For small models, and when attention is "taken up", these sorts of questions really throw a model for a loop. Agreed - especially noticeable with small reasoning models.
I just tried this with a smaller "thinking" model (deepseek distill, running locally) and boy are you right. It keeps flipping between which direction it should turn, second guessing its thought process and then getting sidetracked with a different approach.
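The heading question above has a clean closed-form answer, which is part of why it makes a good probe: the model only has to reason about modular arithmetic on a compass. A minimal sketch (my own illustrative helper, not from any mentioned tool):

```python
def shortest_turn(heading, target):
    """Return (direction, degrees) for the quickest turn from heading to target.

    Headings are in degrees, 0 = north, increasing clockwise.
    """
    diff = (target - heading) % 360
    if diff == 0:
        return ("none", 0)
    if diff <= 180:
        return ("right", diff)
    return ("left", 360 - diff)

# Facing 160 degrees and wanting to face north (0 degrees):
# right turn would be 200 degrees, left turn is 160 degrees.
print(shortest_turn(160, 0))  # ('left', 160)
```

So the expected answer is "turn left 160 degrees" - exactly the kind of two-branch modular reasoning that small thinking models flip-flop on.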
Hello! On this topic: could you please look into making the path that 'next/image' uses for caching images user-definable? Currently I can't use Google App Engine Standard because the directory it uses is write-protected. The only real solution seems to be custom image providers, which is a drag, so I'm on App Engine Flex spending way more than I probably should :-)
I did try, but in the wrong way: I applied SVD to the quantization error to recover quality, i.e. SVD(W - Q(W)). The lightbulb moment in this paper is to do SVD on W first and then quantize the remainder.
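The two orderings can be contrasted in a few lines of NumPy. This is only a toy sketch with a made-up uniform quantizer and an arbitrary rank; it illustrates the structural difference, not the paper's actual method:

```python
import numpy as np

def quantize(W, bits=4):
    """Toy symmetric uniform quantizer (illustrative only)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
k = 8  # arbitrary rank for the low-rank part

# My attempt: quantize W, then SVD the quantization error.
E = W - quantize(W)
U, S, Vt = np.linalg.svd(E, full_matrices=False)
W_mine = quantize(W) + (U[:, :k] * S[:k]) @ Vt[:k]

# The paper's ordering: take a low-rank part of W first,
# then quantize only the residual.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
L = (U[:, :k] * S[:k]) @ Vt[:k]
W_paper = L + quantize(W - L)

print(np.linalg.norm(W - W_mine), np.linalg.norm(W - W_paper))
```

The key difference: in the first ordering the SVD is asked to model the quantization noise (which is nearly full-rank and structureless), while in the second the low-rank part soaks up the large singular directions of W before quantization ever happens.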