Hacker News

> Pushshift.io Reddit corpus

Pushshift is a single person with some very strong political opinions who has specifically used his datasets to attack political opponents. Frankly I wouldn't trust his data to be untainted.

These models really need to be trained on more official data sources, or at least something with some type of multi-party oversight rather than data that effectively fell off the back of a truck.

edit: That's not even to mention I believe it's flat-out illegal for him to collect and redistribute this data as Reddit users did not agree to any terms of use with him. Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR: https://www.reddit.com/r/pushshift/comments/pat409/online_re...



That's interesting. Any good sources for this accusation?


Not handy, and I'm not going to spend my evening digging. It may've also been one of the NGOs ideologically aligned with him that credited him for the data + assistance


If it's so egregious is it really that hard to find an example of the bias?

Calling the integrity of a single-person operation into question, then backing out with no evidence and even saying it might not have been them, seems a bit irresponsible.


On the other hand, they warned you with their username...


You can just look at the data…


Web scraping is legal. Reddit users, like all other members of public forums, put their comments on the internet for the whole world to see. And collect, parse, process and manipulate. If you don't want the whole world to have access to your writing, you'd have to join a private forum.

Trying to shoehorn social media posts into some contorted post-hoc bastardization of the concept of privacy is ridiculous.

Shockingly, things that people post to publicly accessible websites are accessible by the public. We're starting to see social damage from this, with facial recognition and authoritarian governments using people's posts for tracking and oppression.

Decentralized services with strong legislation protecting personal data, and globally recognized content licensing, will all be needed to prevent future abuse, but everyone currently on the planet over the age of 20 is more or less personally responsible for the massive and naive oversharing. We know better now, but 15+ years ago nobody except sci-fi authors and fringe activists had a grasp of how badly unprotected, globally shared streams of consciousness could go wrong.


> Just look at the disastrous mess of his half-baked "opt-out" thing that flagrantly violates GDPR

Pushshift collects data from Reddit using the same API as the mobile app and public site. It does not have any privileged access to the Reddit database, nor is it collecting any PII that would be subject to GDPR.

You as a user grant a pretty broad license to Reddit when you post content. One of the things the license allows them to do is redistribute the content to other users as well as search indexes and things like the Wayback Machine or Pushshift.

(While I did work for Reddit at one point, these opinions are my own)


> nor is it collecting any PII that would be subject to GDPR

Yeah that's not how that works. Reddit is a free text input interface. I'm free to put PII in any post or comment I want to and you have to comply with data protection laws accordingly if I want my information redacted later on.

The same way you wouldn't just "let it ride" if someone uploaded illegal content - the content itself is what's protected, doesn't matter how Reddit structures its web forms.


That has already been hashed out in the European courts. The processor of the data needs to have a reasonable way of establishing that the data belongs to an identifiable natural person.

But by all means, if you disagree feel free to report Pushshift to the EU regulators. As far as I know Pushshift is based in the US and has no presence to establish a nexus to EU law.


The opt-out form doesn't even get processed these days. It's a fig leaf for GDPR compliance that doesn't actually work.



