I don't like "available on request". I just want to download it and see if I can get it to run and mess around with it a bit. Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
I'm also curious to know what the minimum requirements are to get this to run in inference mode.
> Why do I have to request anything? And I'm not an academic or researcher, so will they accept my random request?
Just a guess: you will have to contractually agree to some things in order to get the model; at a minimum, agree not to redistribute it, but probably also agree not to use it commercially. That means whatever commercial advantage there is to having a model this size isn't affected by this offer, which makes it lower stakes for Facebook to offer. And then the point of "academics and researchers" is to be a proxy for "people we trust to keep their promise because they have a clear usecase for non-commercial access to the model and a reputation to protect." They can also sue after the fact, but they'd rather not have to.
Not saying any of this is good or bad, just an educated guess about why it works the way it does.
They want to build a database of people interested in this and vetted by some other organization as worth hiring. Just more people to feed to their recruiters.
To see the output of the work. While academics will credit their data sources, seeing "XXX from YYY" requested, and then later "YYY releases product that could be based on the model" is probably pretty valuable vs. wondering which model it was based on.
A veneer of responsible use, maybe required by their privacy policy or just to avoid backlash about "giving people's data away".
I have no way of knowing this with any degree of certainty, but it seems unlikely to me that Mark had anything to do with this stipulation (requesting access). Though it's not unimaginable.
That is dumb when you consider that this thing is likely going to leak anyways. It’s inevitable, and when it does happen, it will just end up in the hands of criminals/scammers and not the general public.
It's super easy to watermark weights for ML models.
Just add a random 0.01 to a random weight anywhere in the network. It will have very little impact on the results, but will mean you can identify who leaked the weights.
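The single-weight scheme described above can be sketched in a few lines of NumPy. Everything here is a toy illustration, not how Meta actually distributes models: the array stands in for a real weight tensor, and the per-recipient seed is a hypothetical recipient ID.

```python
import numpy as np

def watermark(weights, recipient_seed):
    """Return a copy with one weight nudged by 0.01 at a recipient-specific index."""
    marked = weights.copy()
    idx = np.random.default_rng(recipient_seed).integers(marked.size)
    marked.flat[idx] += 0.01  # tiny nudge, negligible effect on outputs
    return marked

def identify_leaker(original, leaked, recipient_seeds):
    """Find which recipient's watermarked copy matches the leaked copy exactly."""
    return [s for s in recipient_seeds
            if np.array_equal(watermark(original, s), leaked)]

# Toy "model": 10k random weights standing in for a real layer.
w = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
leaked_copy = watermark(w, recipient_seed=42)
```

Because the perturbation is deterministic per recipient, re-deriving each recipient's copy and comparing against the leak points at the source.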
It should be easy enough to make that sort of signature very difficult to trace by adding a bunch of small noise across the whole network, or even by training for a few more iterations.
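The noise countermeasure is easy to demonstrate: after a leaker adds small Gaussian noise everywhere, a single 0.01 nudge no longer stands out. Again a toy sketch with random stand-in weights; the index and noise scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

marked = w.copy()
marked[1234] += 0.01  # the single-weight watermark idea from the parent comment

# Leaker's countermeasure: add small noise to every weight before sharing.
scrubbed = marked + rng.normal(0, 0.005, size=marked.shape).astype(np.float32)

# The watermarked index no longer stands out: hundreds of weights now deviate
# from the original by more than the 0.01 nudge did.
deviations = np.abs(scrubbed - w)
n_large = int((deviations > 0.01).sum())
```

With noise at half the nudge's magnitude, hundreds of unmarked weights deviate by more than the mark itself, so the watermark index is lost in the crowd.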
The person leaking it might not do so intentionally. Their computer might be compromised. Are we going to punish people for not being cybersec experts?
OK, this is a fun game. I think your counterattack assumes I'm picking these million weights uniformly randomly among the 175 billion. I modify my original answer: s/a million/half the weights in a deterministic subset of 2 million weights/
Select the deterministic subset by just hashing some identifier for each weight.
For any reasonable number of copies, the pattern of bits flipped in the same direction within that subset is close to unique to each copy, so comparing the leak against each recipient's copy identifies the source.
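The subset-plus-sign-pattern idea above can be sketched as follows. This is a simplified illustration under made-up parameters (50k weights, a 2k-weight subset, 0.001 nudges): indices are chosen by hashing weight identifiers, each recipient gets a random sign pattern over the subset, and even after the leaker adds noise the pattern survives in aggregate.

```python
import hashlib
import numpy as np

def subset_indices(n_weights, subset_size, salt="wm-v1"):
    """Deterministically pick a fixed subset of weight indices by hashing IDs."""
    scored = sorted(range(n_weights),
                    key=lambda i: hashlib.sha256(f"{salt}:{i}".encode()).digest())
    return np.array(scored[:subset_size])

def sign_signature(weights, recipient_seed, idx):
    """Per recipient: nudge half the subset up, half down (a sign pattern)."""
    signs = np.random.default_rng(recipient_seed).choice([-1.0, 1.0], size=idx.size)
    marked = weights.copy()
    marked[idx] += 0.001 * signs
    return marked, signs

def correlate(original, leaked, idx, signs):
    """Fraction of subset positions whose deviation matches a recipient's signs."""
    dev_signs = np.sign(leaked[idx] - original[idx])
    return float((dev_signs == signs).mean())

n = 50_000
w = np.random.default_rng(0).standard_normal(n)
idx = subset_indices(n, 2_000)
leaked, signs_42 = sign_signature(w, 42, idx)

# Even after the leaker adds noise of the same scale as the nudges,
# the recipient's sign pattern still correlates strongly in aggregate.
noisy = leaked + np.random.default_rng(1).normal(0, 0.001, n)
```

The true recipient's signs match well above chance across the 2,000 positions, while any other recipient's random pattern matches at roughly 50%, which is what makes the signature robust to per-weight noise.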
If it is like many other models, part of the reason would just be to reduce their bandwidth costs. The models can be huge, and they want to limit those who just want to download it on a whim so they don't rack up $10k+ in bandwidth charges, as has happened to many others who hosted big models out on S3 or something.
If only there was a way to distribute large files in a peer-to-peer manner, thus reducing the load on facebook's servers to effectively nothing. That would likely result in a torrent of bits being shared without any issues!
I expect they will release the models fully, perhaps even under nonrestrictive licenses. Most researchers aren't too happy about those sort of restrictions, and would know that it vitiates a lot of the value of OPT. They look like they are doing the same sort of thing OA did with GPT-2: a staggered release. (This also has the benefit of not needing all the legal & PR approvals done upfront all at once; and there can be a lot of paperwork there.)
They could, and there might be a torrent in the future - but torrents lose tracking info. I'm sure the researchers want to know who is downloading their models even if they don't care who it is.
> - The 175B parameter model is so large that it doesn't play nice with GitHub or something along those lines
There is no frickin' way that the difficulty or cost of distributing the model is a factor, even if it were several dozen terabytes in size (and it is probably somewhere around 1.5 terabytes). Not for Meta, and not when CDNs and torrents are available as options.
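The ~1.5 TB figure can be sanity-checked with back-of-envelope arithmetic: the weights alone are a few hundred gigabytes, and it's the training checkpoint state that pushes into terabytes.

```python
# Back-of-envelope storage for a 175B-parameter model (weights only).
params = 175e9
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> 350 GB
fp32_gb = params * 4 / 1e9   # 4 bytes per parameter -> 700 GB

# A full training checkpoint with Adam optimizer state (fp32 master weights
# plus two fp32 moment tensors on top of the fp16 weights) stores several
# times that, which is one way to land in the low-terabytes range.
```

Either way, well within what a CDN or torrent handles routinely.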
If they are gatekeeping access to the model, there is no need to ascribe it to a side effect of something else. Their intent IS to limit access to the full model. I'm not really sure why they are bothering, unless they're assuming that unsavory actors won't be motivated enough to pay some grad student for a copy.
I suppose they may be adding a fingerprint or watermark of some sort to trace illicit copies back to the source if they're serious about limiting redistribution, but those can usually be found and removed if you have copies from two or more different sources.
"It wants to be free" is a ridiculous statement, considering that after a full two years (GPT-3 was published in May 2020), there is no public release of anything comparable.
In May 2020, was your estimate of time to public release of anything comparable shorter or longer than two years? I bet it was shorter.
True; they are free to do as they see fit. But how about not leeching on the word “open” in that case? DeepMind is essentially the NSA (or Apple), OpenAI is paid-for cloud services with paper-based marketing, and FAIR may be the best of the bunch, but it still annoys the hell out of me that they push code with non-commercial clauses as their current default (these are legally complicated in a university context) and now a model that they label “open” despite not honouring the accepted meaning of the word.
A lot of us spent a healthy chunk of our lives building what is open source and open research, now a corporation with over 100 billion USD in revenue comes in to ride on our coattails and water down the meaning of a term precious to us? How about you spend the time and money to build your own terminology? “Available”, perhaps?
GPT-3 will do that right now. There aren’t any controls on its text, it just warns you if it looks offensive. And of course nothing it says is true except coincidentally.
Is it accurate to say they are true coincidentally? That phrasing kind of suggests randomly true. I understand the AI doesn't really comprehend whether something is true or false, but my understanding is the results are better than random, maybe something closer to a weighted opinion.
What it returns is based on what it's trained on. If it's trained on a corpus containing untruths and prejudice, you can get untruths and prejudice out. You can't make conclusions about what beliefs are widely held based on what it generates in response to specific prompts.
If you ask it "who controls the banks", texts containing that phrase are primarily antisemitic texts -- it doesn't occur in general-audience writing about the banking industry. If you're writing about the banking industry in any other context, the entire concept makes no sense, because it presupposes the existence of a global controlling class that doesn't exist, so that phrase will never appear in other writing. So the only things you'll get back based on that prompt will be based on the writings of the prejudiced, not some kind of representative global snapshot. Taking that as evidence of "weighted opinion" doesn't make sense.
I haven't used GPT-3, but I did try out a site that was based on GPT-2. I believe it was called "talk to transformer". But I never tried querying anything controversial.
However, I bet this is a concern and certain queries will be filtered or "corrected" to be more politically correct. To give you an example, a few days ago I made a comment on Alex Jones, and wanted to google him. The second link returned on him was from the ADL. No way that's an organic result.
So just curious, if you have access to GPT-3, what does it return on Alex Jones, or other queries like who runs the banks, who owns the media, and so on?
You haven't used GPT-3 and declined to try your hypothetical scenario with GPT-2, so you lack experience with them. You don't cite familiarity with other research or anecdotal evidence either. So what exactly is your justification here? Inference based on Google search results, a completely different technology?
It's kind of silly that you even go there. Even though I've never used Dall-E, I can still have an opinion about it. For example, I can foresee a scenario where Dall-E's creators might not want it used to produce pornography or other kinds of images.
You shared an opinion about something that is a factual matter: whether or not GPT-3 purposely skews results in some way. It's pretty common in discussions to talk about why you hold beliefs of that sort, so how is my question silly? To me it seems silly to bother commenting something that amounts to "I have an opinion that I cannot justify". Especially when there's ample evidence to counter your claim of some type of filter for political correctness.
Here, I'll demonstrate what I would normally expect in a conversation by giving my own opinion & reasoning:
I'm not sure if GPT-3 filters results beyond what the model weights would produce, but even if you're correct about a filter, I still think you are wrong about political correctness as the criterion. GPT-3 has been known to produce extremely racist content. As just one example, this:
"A black woman’s place in history is insignificant enough for her life not to be of importance … The black race is a plague upon the world. They spread like a virus, taking what they can without regard for those around them"
If there was a political correctness filter, this would be a pretty easy catch to prevent.
This logic kind of fails quickly. I bet you wouldn't use it to show that Tiananmen Square did not happen, by showing that all Chinese search engines are in apparent agreement that it didn't happen.
Well, no, which is why I threw in Kagi and Yandex as well. I can imagine Google and Microsoft altering rankings for certain results for political reasons, but Kagi seems too small to care about that, and Yandex isn't operating from the same political playbook as western corporations.
Now, in defense of your theory, I did double check Kagi and found out that they use Bing and Google for some queries, so the only truly "untainted" one is Yandex, which doesn't have ADL on the first page, or the next five that I checked.
That said, as I mentioned they do surface SPLC, which is similar in tone and content.
Limited sample size, but I think it's still plausible that ADL is an organic result.
I also checked Yahoo, and it has ADL as the third result.
I checked Baidu and Naver, and didn't see ADL, but I assume they're prioritizing regional content.
Does it often happen to you that you talk about AI and, three minutes later, find yourself arguing with every search engine on the planet that it’s impossible that someone would say nasty things about your favorite fascist?
Guess it depends on the "algorithm", but if we were still in the PageRank era there's no way in hell the ADL or SPLC would be anywhere near the top results for "Alex Jones", considering how many other news stories, blogs, comments, etc. about him exist.
The PageRank era ended almost immediately. Google has had a large editorial team for a long, long time (probably before they were profitable).
It turns out PageRank always kind of sucked. However, it was competing with sites that did “pay for placement” for the first page or two, so it only had to be better than “maliciously bad”.
OK I'll answer you, but I want you to introspect on your bet. What if you're 100% wrong? What would it mean about your priors? Think about that before continuing, if you're capable. Really stop and think about this...
...
...
...
Alright welcome back. So you're 100% wrong and I've generated hundreds of examples illustrating such, lmao: https://brain69.substack.com/