Not necessarily, not all tactics can be used symmetrically like that. Many of the sites they scrape feel the need to support search engine crawlers and RSS crawlers, but OpenAI feels no such need to grant automated anonymous access to ChatGPT users.
And at the end of the daty, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said these responses were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.
If you want to pick apart my hastily concocted examples, well, have fun I guess. My overall point is that ensuring data quality is something OpenAI is probably very good at. They likely have many clever techniques, some of which we could guess at, some of which would surprise us, all of which they’ve validated through extensive testing including with adversarial data.
If people want to keep playing pretend that their data poisoning efforts are causing real pain to OpenAI, they’re free to do so. I suppose it makes people feel good, and no one’s getting hurt here.
I'm interested in why you think OpenAI is probably very good at ensuring data quality. Also interested if you are trying to troll the resistance into revealing their working techniques.
What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?
And at the end of the daty, they can always look at the responses coming in and make decisions like “95% of users said these responses were wrong, 5% said these responses were right, let’s go with the 95%”. As long as the vast majority of their data is good (and it will be) they have a lot of statistical tools they can use to weed out the poison.