Yes, there’s rudimentary evidence that there’s essentially a 3D engine within the model that participates in generating the image. If we could inspect and interpret the whole process, it would likely be bizarre and byzantine — a sort of convergent evolution that independently recreates a tangled spaghetti mess of Unreal Engine, Adobe Lightroom, and a physical simulation of a Canon 5D.
Perhaps it's essentially similar to the 3D engine a human brain runs: one that generates a single "3D" image from two 2D cameras (the eyes), fills in missing objects in the blind spots, and so on.
Note that while having two eyes helps build a more accurate 3D image, people with one eye still see in 3D. Eye movement is at least as important a part of 3D vision as stereoscopy.
And apparently 3D structure can be inferred from 2D images alone, as these image-gen models demonstrate — so even without video or parallax, a brain could probably still model the world in 3D.
I remember a Wittgenstein exercise along these lines: think of a tree, then try to point to the place in your head where that thought exists. It's kind of like that.