Don’t know the exact details, but I imagine the further you get from the original input image, the more the system has to make things up. That’s also why generative video models are limited to a few seconds. It will improve.
Can you point to some data that would indicate it will improve? There are lots of statements today about GenAI along the lines of "that will get fixed later," but we don't actually seem to know what will genuinely improve and what will just get incrementally prettier without fixing the underlying issue.
AI can outpaint more images in a similar style, then map them to 3D. IMHO, AI should instead generate a story from the image, then use the image + story + location & direction to generate a consistent 3D world.