I don't like "available on request". I just want to download it and see if I can get it to run and mess around with it a bit. Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
I'm also curious to know what the minimum requirements are to get this to run in inference mode.
> Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
Just a guess: you will have to contractually agree to some things in order to get the model; at a minimum, agree not to redistribute it, but probably also agree not to use it commercially. That means whatever commercial advantage there is to having a model this size isn't affected by this offer, which makes it lower stakes for Facebook to offer. And then the point of "academics and researchers" is to be a proxy for "people we trust to keep their promise because they have a clear usecase for non-commercial access to the model and a reputation to protect." They can also sue after the fact, but they'd rather not have to.
Not saying any of this is good or bad, just an educated guess about why it works the way it does.
They want to build a database of people interested in this and vetted by some other organization as worth hiring. Just more people to feed to their recruiters.
To see the output of the work. While academics will credit their data sources, seeing "XXX from YYY" requested, and then later "YYY releases product that could be based on the model" is probably pretty valuable vs. wondering which model it was based on.
A veneer of responsible use, maybe required by their privacy policy or just to avoid backlash about "giving people's data away".
I have no way of knowing this with any degree of certainty, but it seems unlikely to me that Mark had anything to do with this stipulation (requesting access). Though it's not unimaginable.
That is dumb when you consider that this thing is likely going to leak anyways. It’s inevitable, and when it does happen, it will just end up in the hands of criminals/scammers and not the general public.
It's super easy to watermark weights for ML models.
Just add a random 0.01 to a random weight anywhere in the network. It will have very little impact on the results, but will mean you can identify who leaked the weights.
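The single-weight scheme described above can be sketched in a few lines of NumPy. Everything here is a toy illustration, not how Meta actually distributes models: the array stands in for a real weight tensor, and the per-recipient seed is a hypothetical recipient ID.

```python
import numpy as np

def watermark(weights, recipient_seed):
    """Return a copy with one weight nudged by 0.01 at a recipient-specific index."""
    marked = weights.copy()
    idx = np.random.default_rng(recipient_seed).integers(marked.size)
    marked.flat[idx] += 0.01  # tiny nudge, negligible effect on outputs
    return marked

def identify_leaker(original, leaked, recipient_seeds):
    """Find which recipient's watermarked copy matches the leaked copy exactly."""
    return [s for s in recipient_seeds
            if np.array_equal(watermark(original, s), leaked)]

# Toy "model": 10k random weights standing in for a real layer.
w = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
leaked_copy = watermark(w, recipient_seed=42)
```

Because the perturbation is deterministic per recipient, re-deriving each recipient's copy and comparing against the leak points at the source.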
It should be easy enough to make that sort of signature very difficult to trace by adding a bunch of small noise across the whole network, or even by training for a few more iterations.
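The noise countermeasure is easy to demonstrate: after a leaker adds small Gaussian noise everywhere, a single 0.01 nudge no longer stands out. Again a toy sketch with random stand-in weights; the index and noise scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

marked = w.copy()
marked[1234] += 0.01  # the single-weight watermark idea from the parent comment

# Leaker's countermeasure: add small noise to every weight before sharing.
scrubbed = marked + rng.normal(0, 0.005, size=marked.shape).astype(np.float32)

# The watermarked index no longer stands out: hundreds of weights now deviate
# from the original by more than the 0.01 nudge did.
deviations = np.abs(scrubbed - w)
n_large = int((deviations > 0.01).sum())
```

With noise at half the nudge's magnitude, hundreds of unmarked weights deviate by more than the mark itself, so the watermark index is lost in the crowd.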
The person leaking it might not do so intentionally. Their computer might be compromised. Are we going to punish people for not being cybersec experts?
OK, this is a fun game. I think your counterattack assumes I'm picking these million weights uniformly randomly among the 175 billion. I modify my original answer: s/a million/half the weights in a deterministic subset of 2 million weights/
Select the deterministic subset by just hashing some identifier for each weight.
For any reasonable number of copies, the pattern of bits flipped in the same direction within that subset is close to unique to each copy, so comparing the leak against each recipient's copy identifies the source.
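The subset-plus-sign-pattern idea above can be sketched as follows. This is a simplified illustration under made-up parameters (50k weights, a 2k-weight subset, 0.001 nudges): indices are chosen by hashing weight identifiers, each recipient gets a random sign pattern over the subset, and even after the leaker adds noise the pattern survives in aggregate.

```python
import hashlib
import numpy as np

def subset_indices(n_weights, subset_size, salt="wm-v1"):
    """Deterministically pick a fixed subset of weight indices by hashing IDs."""
    scored = sorted(range(n_weights),
                    key=lambda i: hashlib.sha256(f"{salt}:{i}".encode()).digest())
    return np.array(scored[:subset_size])

def sign_signature(weights, recipient_seed, idx):
    """Per recipient: nudge half the subset up, half down (a sign pattern)."""
    signs = np.random.default_rng(recipient_seed).choice([-1.0, 1.0], size=idx.size)
    marked = weights.copy()
    marked[idx] += 0.001 * signs
    return marked, signs

def correlate(original, leaked, idx, signs):
    """Fraction of subset positions whose deviation matches a recipient's signs."""
    dev_signs = np.sign(leaked[idx] - original[idx])
    return float((dev_signs == signs).mean())

n = 50_000
w = np.random.default_rng(0).standard_normal(n)
idx = subset_indices(n, 2_000)
leaked, signs_42 = sign_signature(w, 42, idx)

# Even after the leaker adds noise of the same scale as the nudges,
# the recipient's sign pattern still correlates strongly in aggregate.
noisy = leaked + np.random.default_rng(1).normal(0, 0.001, n)
```

The true recipient's signs match well above chance across the 2,000 positions, while any other recipient's random pattern matches at roughly 50%, which is what makes the signature robust to per-weight noise.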
If it is like many other models, part of the reason would just be to reduce their bandwidth costs. The models can be huge, and they want to limit those who just want to download it on a whim so they don't rack up $10k+ in bandwidth charges, as has happened to many others who hosted big models out on S3 or something.
If only there was a way to distribute large files in a peer-to-peer manner, thus reducing the load on facebook's servers to effectively nothing. That would likely result in a torrent of bits being shared without any issues!
I expect they will release the models fully, perhaps even under nonrestrictive licenses. Most researchers aren't too happy about those sort of restrictions, and would know that it vitiates a lot of the value of OPT. They look like they are doing the same sort of thing OA did with GPT-2: a staggered release. (This also has the benefit of not needing all the legal & PR approvals done upfront all at once; and there can be a lot of paperwork there.)
They could, and there might be a torrent in the future - but torrents lose tracking info. I'm sure the researchers want to know who is downloading their models even if they don't care who it is.
> - The 175B parameter model is so large that it doesn't play nice with GitHub or something along those lines
There is no frickin' way that the difficulty or cost of distributing the model is a factor, even if it were several dozen terabytes in size (and it is probably somewhere around 1.5 terabytes). Not for Meta, and not when CDNs and torrents are available as options.
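The ~1.5 TB figure can be sanity-checked with back-of-envelope arithmetic: the weights alone are a few hundred gigabytes, and it's the training checkpoint state that pushes into terabytes.

```python
# Back-of-envelope storage for a 175B-parameter model (weights only).
params = 175e9
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> 350 GB
fp32_gb = params * 4 / 1e9   # 4 bytes per parameter -> 700 GB

# A full training checkpoint with Adam optimizer state (fp32 master weights
# plus two fp32 moment tensors on top of the fp16 weights) stores several
# times that, which is one way to land in the low-terabytes range.
```

Either way, well within what a CDN or torrent handles routinely.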
If they are gatekeeping access to the model, there is no need to ascribe it to a side effect of something else. Their intent IS to limit access to the full model. I'm not really sure why they are bothering, unless they're assuming that unsavory actors won't be motivated enough to pay some grad student for a copy.
I suppose they may be adding a fingerprint or watermark of some sort to trace illicit copies back to the source if they're serious about limiting redistribution, but those can usually be found and removed if you have copies from two or more different sources.
"It wants to be free" is a ridiculous statement, considering that after a full two years (GPT-3 was published in May 2020), there is no public release of anything comparable.
In May 2020, was your estimate of time to public release of anything comparable shorter or longer than two years? I bet it was shorter.
True; they are free to do as they see fit. But how about not leeching on the word “open” in that case? DeepMind is essentially the NSA (or Apple), OpenAI is paid-for cloud services with paper-based marketing, and FAIR may be the best of the bunch, but it still annoys the hell out of me that they push code with non-commercial clauses as their current default (these are legally complicated in a university context) and now a model that they label “open” despite not honouring the accepted meaning of the word.
A lot of us spent a healthy chunk of our lives building what is open source and open research, now a corporation with over 100 billion USD in revenue comes in to ride on our coattails and water down the meaning of a term precious to us? How about you spend the time and money to build your own terminology? “Available”, perhaps?
GPT-3 will do that right now. There aren’t any controls on its text, it just warns you if it looks offensive. And of course nothing it says is true except coincidentally.
Is it accurate to say they are true coincidentally? That phrasing kind of suggests randomly true. I understand the AI doesn't really comprehend whether something is true or false, but my understanding is the results are better than random, maybe something closer to a weighted opinion.
What it returns is based on what it's trained on. If it's trained on a corpus containing untruths and prejudice, you can get untruths and prejudice out. You can't make conclusions about what beliefs are widely held based on what it generates in response to specific prompts.
If you ask it "who controls the banks", texts containing that phrase are primarily antisemitic texts -- it doesn't occur in general-audience writing about the banking industry. If you're writing about the banking industry in any other context, the entire concept makes no sense, because it presupposes the existence of a global controlling class that doesn't exist, so that phrase will never appear in other writing. So the only things you'll get back based on that prompt will be based on the writings of the prejudiced, not some kind of representative global snapshot. Taking that as evidence of "weighted opinion" doesn't make sense.
I haven't used GPT-3, but I did try out a site that was based on GPT-2. I believe it was called "talk to transformer". But I never tried querying anything controversial.
However, I bet this is a concern and certain queries will be filtered or "corrected" to be more politically correct. To give you an example, a few days ago I made a comment on Alex Jones, and wanted to google him. The second link returned on him was from the ADL. No way that's an organic result.
So just curious, if you have access to GPT-3, what does it return on Alex Jones, or other queries like who runs the banks, who owns the media, and so on?
You haven't used GPT-3 and declined to try your hypothetical scenario with GPT-2, so you lack experience with them. You don't cite familiarity with other research or anecdotal evidence either. So what exactly is your justification here? Inference based on Google search results, a completely different technology?
It's kind of silly that you even go there. Even though I've never used Dall-E, I can still have an opinion about it. For example, I can foresee a scenario where Dall-E's creators might not want it used to produce pornography or other kinds of images.
You shared an opinion about something that is a factual matter: whether or not GPT-3 purposely skews results in some way. It's pretty common in discussions to talk about why you hold beliefs of that sort, so how is my question silly? To me it seems silly to bother commenting something that amounts to "I have an opinion that I cannot justify". Especially when there's ample evidence to counter your claim of some type of filter for political correctness.
Here, I'll demonstrate what I would normally expect in a conversation by giving my own opinion & reasoning:
I'm not sure if GPT-3 filters results beyond what the model weights would produce, but even if you're correct about a filter, I still think you are wrong about political correctness as the criterion. GPT-3 has been known to produce extremely racist content. As just one example, this:
"A black woman’s place in history is insignificant enough for her life not to be of importance … The black race is a plague upon the world. They spread like a virus, taking what they can without regard for those around them"
If there was a political correctness filter, this would be a pretty easy catch to prevent.
This logic kind of fails quickly. I bet you wouldn't use it to show that Tiananmen Square did not happen, by showing that all Chinese search engines are in apparent agreement that it didn't happen.
Well, no, which is why I threw in Kagi and Yandex as well. I can imagine Google and Microsoft altering rankings for certain results for political reasons, but Kagi seems too small to care about that, and Yandex isn't operating from the same political playbook as western corporations.
Now, in defense of your theory, I did double check Kagi and found out that they use Bing and Google for some queries, so the only truly "untainted" one is Yandex, which doesn't have ADL on the first page, or the next five that I checked.
That said, as I mentioned they do surface SPLC, which is similar in tone and content.
Limited sample size, but I think it's still plausible that ADL is an organic result.
I also checked Yahoo, and it has ADL as the third result.
I checked Baidu and Naver, and didn't see ADL, but I assume they're prioritizing regional content.
Does it often happen to you that you talk about AI and, three minutes later, find yourself arguing with every search engine on the planet that it’s impossible that someone would say nasty things about your favorite fascist?
Guess it depends on the "algorithm", but if we were still in the PageRank era there's no way in hell the ADL or SPLC would be anywhere near the top results for "Alex Jones", considering how many other news stories, blogs, comments, etc. about him exist.
The PageRank era ended almost immediately. Google has had a large editorial team for a long, long time (probably before they were profitable).
It turns out PageRank always kind of sucked. However, it was competing with sites that did “pay for placement” for the first page or two, so it only had to be better than “maliciously bad”.
OK I'll answer you, but I want you to introspect on your bet. What if you're 100% wrong? What would it mean about your priors? Think about that before continuing, if you're capable. Really stop and think about this...
...
...
...
Alright welcome back. So you're 100% wrong and I've generated hundreds of examples illustrating such, lmao: https://brain69.substack.com/