I think those types of visual glitches can probably be fixed with more or better training, and I have no doubt that future versions of this type of system will produce outputs that are indistinguishable from real videos.
But better training can't fix the more general problem that I'm describing. Perfect-looking videos aren't useful if you can't get the system to follow your instructions.