I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a technological advantage and outcompete those that disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.
You can say the same of any legal restriction, like patent or copyright law, or the ban on making Champagne outside France. Yet the sky isn’t falling, even with so many legally protected industries. Maybe the markets such an industry might offshore to are too small and insular to matter much, and are probably language-bound, which makes English models less relevant there than native-language models.