The results of this LLM are consistently far better than any other that I choose. I asked ‘what is the most efficient approach to building an LED grow light with off-the-shelf parts?’ and its response was incredible. Very much in line with how I’ve done it in the past after weeks of research, trial and error, and feedback from people. The other LLMs gave mostly reasonable yet sparse and incomplete answers.
It also opted to include an outline of how to include an integrated timer. That’s a great idea and very practical, but wasn’t prompted at all. Some might consider that a bad thing, though.
Whatever it is, it’s substantially better than what I’ve been using. Exciting.
I'm asking it about how to make turbine blades for a high bypass turbofan engine and it's giving very good answers, including math and some very esoteric material science knowledge. Way past the point where the knowledge can be easily checked for hallucinations without digging into literature including journal papers and using the math to build some simulations.
I don't even have to prompt it much, I just keep saying "keep going" and it gets deeper and deeper. Opus has completely run off the rails in comparison. I can't wait till this model hits general availability.
That's what I've observed. I gave it a task for a PoC on something I've been thinking about for a while, and its answer, while syntactically correct, is entirely useless (in the literal sense) because it ignores parts of the task.
You know, at some point we won't be able to benchmark them, due to the sheer complexity of the test required. I.e., if you are testing a model on maths, the problem will have to be extremely difficult to even be considered a 'hassle' for the LLM; it would then take you a day to work out the solution yourself.
See where this is going? When humans are no longer on the same spectrum as LLMs, that's probably the definition of AGI.
I can't test the bot right now, because it seems to have been hugged to death. But there are quite a lot of simple tests LLMs fail: basically anything where the answer is both precise/discrete and unlikely to be directly in the training set. There are lots of examples in this [1] post, which oddly enough ended up flagged. In fact this guy [2] is offering $10k to anybody who creates a prompt that gets an LLM to solve a simple replacement problem he's found they fail at.
They also tend to be incapable of playing even basic level chess, in spite of there being undoubtedly millions of pages of material on the topic in their training base. If you do play, take the game out of theory ASAP (1. a3!? 2. a4!!) such that the bot can't just recite 30 moves of the ruy lopez or whatever.
The entire problem with LLMs is that you don't want to prompt them into solving specific problems. The reason why instruction finetuning is so popular is that it makes it easier to just write whatever you want. Text completion on the other hand requires you to conform to the style of the previously written text.
In a sense, LLMs need an affordance model so that they can estimate the difficulty of a task and automatically plan a longer sequence of iterations according to its perceived difficulty.
The comment I replied to, "a huge class of problems that's extremely difficult to solve but very easy to check", sounded to me like an assertion that P != NP, which everyone takes for granted but actually hasn't been proved. If, contrary to all expectations, P = NP, then that huge class of problems wouldn't exist, right? Since they'd be in P, they'd actually be easy to solve as well.
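To make "hard to solve, easy to check" concrete, here is a minimal sketch using subset sum. That problem is my own choice of illustration, not something the parent comment named, and the function names `verify` and `solve_brute_force` are hypothetical. Checking a proposed answer takes polynomial time, while the naive search below tries up to 2**n subsets. Note that the slowness of this particular search proves nothing about whether a fast algorithm exists; that is exactly the open question.

```python
from itertools import combinations

def verify(numbers, target, candidate):
    """Check a proposed solution in polynomial time."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:       # candidate must only use available numbers
            return False
        pool.remove(x)          # each number may be used once
    return len(candidate) > 0 and sum(candidate) == target

def solve_brute_force(numbers, target):
    """Search every non-empty subset: up to 2**n candidates."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None

numbers = [3, 34, 4, 12, 5, 2]
solution = solve_brute_force(numbers, 9)
print(solution, verify(numbers, 9, solution))
```

If P = NP turned out to be true (with a usable algorithm), `solve_brute_force` could in principle be replaced by something about as cheap as `verify`.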
We could end up with a non-constructive proof of P=NP. That is, a proof that the classes are equal but no algorithm to convert a problem in one into the other (or construct a solution of one into a solution of the other).
I recently tried a Fermi estimation problem on a bunch of LLMs and they all failed spectacularly. It was crossing too many orders of magnitude, all the zeroes muddled them up.
E.g.: the right way to work with numbers like a “trillion trillion” is to concentrate on the powers of ten, not to write the number out in full.
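A rough sketch of what "concentrate on the powers of ten" could look like in practice: carry each quantity as a (mantissa, exponent) pair and add exponents instead of writing out zeroes. The helper name `fermi_product` and the sample factors are made up for illustration, and it assumes positive mantissas.

```python
import math

def fermi_product(factors):
    """Multiply numbers given as (mantissa, exponent) pairs, i.e. m * 10**e."""
    mantissa = 1.0
    exponent = 0
    for m, e in factors:
        mantissa *= m      # small numbers: multiply directly
        exponent += e      # powers of ten: just add the exponents
    # Renormalise so that 1 <= mantissa < 10.
    shift = math.floor(math.log10(mantissa))
    return mantissa / 10**shift, exponent + shift

# "A trillion trillion" is 10^12 * 10^12.
m, e = fermi_product([(1, 12), (1, 12)])
print(f"{m:g} x 10^{e}")
```

The point is that the only arithmetic that ever happens on large values is integer addition of exponents, so there are no long strings of zeroes to miscount.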
Predicting the next character alone cannot achieve this kind of compression, because the probability distribution obtained from training is tied to the corpus, and multi-scale compression and alignment cannot be fully learned through backpropagation in this kind of model.
You know, people often complain about goal shifting in AI. We hit some target that was supposed to be AI (or even AGI), kind of go meh - and then change to a new goal. But the problem isn't goal shifting, the problem is that the goals were set to a level that had nothing whatsoever to do with where we "really" want to go, precisely in order to make them achievable. So it's no surprise that when we hit these neutered goals we aren't then where we hope to actually be!
So here, with your example. Basic software programs can multiply million digit numbers near instantly with absolutely no problem. This would take a human years of dedicated effort to solve. Solving work, of any sort, that's difficult for a human has absolutely nothing to do with AGI. If we think about what we "really" mean by AGI, I think it's the exact opposite even. AGI will instead involve computers doing what's relatively easy for humans.
Go back not that long ago in our past and we were glorified monkeys. Now we're glorified monkeys with nukes and who've landed on the Moon! The point of this is that if you go back in time we basically knew nothing. State of the art technology was 'whack it with stick!', communication was limited to various grunts, and our collective knowledge was very limited, and many assumptions of fact were simply completely wrong.
Now imagine training an LLM on the state of human knowledge from this time, perhaps alongside a primitive sensory feed of the world. AGI would be able to take this and not only get to where we are today, but then go well beyond it. And this should all be able to happen at an exceptionally rapid rate, given that historic human knowledge transfer and storage rates have always been some number really close to zero. AGI not only would not suffer such problems but would have perfect memory, orders of magnitude greater 'conscious' raw computational ability (as even a basic phone today has), and so on.
---
Is this goal achievable? No, not anytime in the foreseeable future, if ever. But people don't want this. They want to believe AGI is not only possible, but might even happen in their lifetime. But I think if we objectively think about what we "really" want to see, it's clear that it isn't coming anytime soon. Instead we're doomed to just goal shift our way endlessly towards creating what may one day be a really good natural language search engine. And hey, that's a heck of an accomplishment that will have immense utility, but it's nowhere near the goal that we "really" want.
There are different shades of AGI, but we don’t know if they will happen all at once or not.
For example, if an AI can replace the average white collar worker and therefore cause massive economic disruption, that would be a shade of AGI.
Another shade of AGI would be an AI that can effectively do research level mathematics and theoretical physics and is therefore capable of very high-level logical reasoning.
We don’t know if shades A and B will happen at the same time, or if there will be a delay between developing one and the other.
AGI doesn’t imply simulation of a human mind or possessing all of human capabilities. It simply refers to an entity that possesses General Intelligence on par with a human. If it can prove the Riemann hypothesis but it can’t play the cello, it’s still an AGI.
One notable shade of AGI is the singularity: an AI that can create new AIs better than humans can create new AIs. When we reach shades A and B then a singularity AGI is probably quite close, if not before. Note that a singularity AGI doesn’t require simulation of the human mind either. It’s entirely possible that a cello-playing AI is chronologically after a self-improving AI.
The term "AGI" has been loosely used for so many years that it doesn't mean anything very specific. The meaning of words derives from their usage.
To me, Shane Legg's (DeepMind) definition of AGI as meaning human level across the full spectrum of abilities makes sense.
Being human or super-human level at a small number of specialized things like math is the definition of narrow AI - the opposite of general/broad AI.
As long as the only form of AI we have is pre-trained transformers, then any notion of rapid self-improvement is not possible (the model can't just commandeer $1B of compute for a 3-month self-improvement run!). Self-improvement would only seem possible if we have an AI that is algorithmically limited and does not depend on slow/expensive pre-training.
What if it sleeps for 8 hours every 16 hours and during that sleep period, it updates its weights with whatever knowledge it learned that day? Then it doesn't need $1B of compute every 3 months, it would use the $1B of compute for 8 hours every day. Now extrapolate the compute required for this into the future and the costs will come down. I don't know where I was going with that...
These current LLMs are purely pre-trained - there is no way to do incremental learning (other than a small amount of fine-tuning) without disrupting what they were pre-trained on. In any case, even if someone solves incremental learning, this is just a way of growing the dataset, which is happening anyway, and under the much more controlled/curated way needed to see much benefit.
There is very much a recipe (10% of this, 20% of that, curriculum learning, mix of modalities, etc) for the type of curated dataset creation and training schedule needed to advance model capabilities. There have even been some recent signs of "inverse scaling" where a smaller model performs better in some areas than a larger one due to getting this mix wrong. Throwing more random data at them isn't what is needed.
I assume we will eventually move beyond pre-trained transformers to better architectures where maybe architectural advances and learning algorithms do have more potential for AI-designed improvement, but it seems the best role for AI currently is synthetic data generation, and developer tools.
At one time it was thought that software that could beat a human at chess would be, in your lingo, "a shade of AGI." And for the same reason you're listing your milestones - because it sounded extremely difficult and complex. Of course now we realize that was quite silly. You can develop software that can crush even the strongest humans through relatively simple algorithmic processes.
And I think this is the trap we need to avoid falling into. Complexity and intelligence are not inherently linked in any way. Primitive humans did not solve complex problems, yet obviously were highly intelligent. And so, to me, the great milestones are not some complex problem or another, but instead achieving success in things that have no clear path towards them. For instance, many (if not most) primitive tribes today don't even have the concept of numbers. Instead they rely on, if anything, broad concepts like a few, a lot, and more than a lot.
Think about what an unprecedented and giant leap it is to go from that to actually quantifying things and imagining relationships and operations. If somebody did try to do this, he would initially just look like a fool. Yes here is one rock, and here is another. Yes you have "two" now. So what? That's a leap that has no clear guidance or path towards it. All of the problems that mathematics solves don't even exist until you discover it! So you're left with something that is not just a recombination or stair step from where you currently are, but something entirely outside what you know. That we are not only capable of such achievements, but repeatedly achieve them is, to me, perhaps the purest benchmark for general intelligence.
So if we were actually interested in pursuing AGI, it would seem that such achievements would also be dramatically easier (and cheaper) to test for. Because you need not train on petabytes of data, because the quantifiable knowledge of these peoples is nowhere even remotely close to that. And the goal is to create systems that get from that extremely limited domain of input, to what comes next, without expressly being directed to do so.
I agree that general, open ended problem solving is a necessary condition for General intelligence. However I differ in that I believe that such open ended problem solving can be demonstrated via current chat interfaces involving asking questions with text and images.
It’s hard for people to define AGI because Earth only has one generally intelligent family: Homo. So there is a tendency to identify Human intelligence or capabilities with General intelligence.
Imagine if dolphins were much more intelligent and could write research-level mathematics papers on par with humans, communicating with clicks. Even though dolphins can’t play the cello or do origami, lacking the requisite digits, UCLA still has a dolphin tank to house some of their mathematics professors, who work hand-in-flipper with their human counterparts. That’s General intelligence.
Artificial General Intelligence is the same but with a computer instead of a dolphin.
> It also opted to include an outline of how to include an integrated timer. That’s a great idea and very practical, but wasn’t prompted at all. Some might consider that a bad thing, though.
When I've seen GPT-* do this, it's because the top articles about that subject online include that extraneous information and it's regurgitating them without being asked.
This really is the fastest growing technology of all time. Do you feel the curve?
I remember Mixtral8x7b dominating for months; I expected Databricks to do the same! But it was washed out of existence in days by 8x22b, llama3, gemini1.5...
WOW.
I must be missing something because the output from two years ago feels exactly the same as the output now. Any comment saying the output is significantly better can be equally paired with a comment saying the output is terrible/censored/"nerfed".
How do you see "fastest growing technology of all time" and I don't? I know that I keep very up to date with this stuff, so it's not that I'm unaware of things.
I do massive amounts of zero shot document classification tasks, the performance keeps getting better. It’s also a domain where there is less of a hallucination issue as it’s not open ended requests.
It strikes me as unprecedented that a technology which takes arbitrary language-based commands can actually surface and synthesize useful information, and it gets better at doing it (even according to extensive impartial benchmarking) at a fairly rapid pace. It’s technology we haven’t really seen before recently, improving quite quickly. It’s also being adopted very rapidly.
I’m not saying it’s certainly the fastest growth of all time, but I think there’s a decent case for it being a contender. If we see this growth proceeding at a similar rate for years, it seems like it would be a clear winner.
> unprecedented that a technology [...] It’s technology we haven’t really seen before recently
This is what frustrates me: First that it's not unprecedented, but second that you follow up with "haven't really" and "recently".
> fairly rapid pace ... decent case for it being a contender
Any evidence for this?
> extensive impartial benchmarking
Or this? The last two "benchmarks" I've seen that were heralded both contained an incredible gap between what was claimed and what was even proven (4 more required you to run the benchmarks to even get the results!)
What is the precedent for this? The examples I’m aware of were fairly bad at what GPTs are now quite good at. To me that signals growth of the technology.
By “haven’t really seen until recently” I mean that similar technologies have existed, so we’ve seen something like it, but they haven’t actually functioned well enough to be comparable. So we can say there’s a precedent, but arguably there isn’t in terms of LLMs that can reliably do useful things for us. If I’m mistaken, I’m open to being corrected.
In terms of benchmarks, I agree that there are gaps but I also see a clear progression in capability as well.
Then in terms of evidence for there being a decent case here, I don’t need to provide it. I clearly indicated that’s my opinion, not a fact. I also said conditionally it would seem like a clear winner, and that condition is years of a similar growth trajectory. I don’t claim to know which technology has advanced the fastest, I only claim to believe LLMs seem like they have the potential to fit that description. The first ones I used were novel toys. A couple years later, I can use them reliably for a broad array of tasks and evidence suggests this will only improve in the near future.
I put my hands out, count to the third finger from the left, and put that finger down. I then count the fingers to the left (2) and count the fingers to the right (2 + hand aka 5) and conclude 27.
I have memorised the technique, but I definitely never memorised my nine times table. If you’d said ‘6’, then the answer would be different, as I’d actually have to sing a song to get to the answer.
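For anyone who hasn't seen the finger trick, here is a small sketch of it. The function name is made up; the logic just mirrors the steps described above. Fold down finger n, read the fingers to its left as the tens digit and the fingers to its right as the ones digit.

```python
def nine_times_by_fingers(n):
    """Compute 9 * n for 1 <= n <= 10 using the ten-finger trick."""
    if not 1 <= n <= 10:
        raise ValueError("the trick only works for 1 through 10")
    left = n - 1      # fingers to the left of the folded one: tens digit
    right = 10 - n    # fingers to the right of the folded one: ones digit
    return 10 * left + right

print(nine_times_by_fingers(3))  # 2 fingers left, 7 right
```

It works because 10 * (n - 1) + (10 - n) simplifies to 9 * n, so the "technique" is the nine times table in disguise.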
100% of the time when I post a critique someone replies with this. I tell them I've used literally every LLM under the sun quite a bit to find any use I can think of and then it's immediately crickets.
RT-2 is a vision language model fine-tuned to take the current vision input and produce actuator positions as the output. Google uses a bunch of TPUs to produce a full response at a cycle rate of 3 Hz and the VLM has learned the kinematics of the robot and knows how to pick up objects according to given instructions.
Given the current rate of progress, we will have robots that can learn simple manual labor from human demonstrations (e.g. Youtube as a dataset, no I do not mean bimanual teleoperation) by the end of the decade.
Usually when I encounter sentiment like this it is because they have only used 3.5 (evidently not the case here) or because their prompting is terrible/misguided.
When I show a lot of people GPT4 or Claude, some percentage of them jump right to "What year did Nixon get elected?" or "How tall is Barack Obama?" and then kind of shrug with a "Yeah, Siri could do that ten years ago" take.
Beyond that you have people who prompt things like "Make a stock market program that has tabs for stocks, and shows prices" or "How do you make web cookies". Prompts that even a human would struggle greatly with.
For the record, I use GPT4 and Claude, and both have dramatically boosted my output at work. They are powerful tools, you just have to get used to massaging good output from them.
That is not the reality today. If you want good results from an LLM, then you do need to speak LLM. Just because they appear to speak English doesn't mean they act like a human would.
People don’t even know how to use traditional web search properly.
Here’s a real scenario: A Citrix virtual desktop crashed because a recent critical security fix forced an upgrade of a shared DLL. The output is a really specific set of errors in a stack trace. I watched with my own two eyes as an IT professional typed the following phrase into Google: “Why did my PC crash?”
Then he sat there and started reading through each result… including blog posts by random kids complaining about Windows XP.
I wish I could say this kind of thing is an isolated incident.
I mean, you need to speak German to talk to a German. It’s not really much different for an LLM: just because the language they speak has a root in English doesn’t mean it actually is English.
And even if it was, there’s plenty of people completely unintelligible in English too…
You see no difference between non-RLHFed GPT3 from early 2022 and GPT-4 in 2024? It's a very broad consensus that there is a huge difference so that's why I wanted to clarify and make sure you were comparing the right things.
What type of usages are you testing? For general knowledge it hallucinates way less often, and for reasoning and coding and modifying its past code based on English instructions it is way, way better than GPT-3 in my experience.
It's fine, you don't have a use for it so you don't care. I personally don't spend any effort getting to know things that I don't care about and have no use for; but I also don't tell people who use tools that I don't need for their job or hobby how useless those tools are, or that their experience using them is distorted or wrong.
Usually people who post such claims haven’t used anything beyond gpt3. That’s why you get questions.
Also, the difference is so big and so plainly visible that I guess people don’t know how to even answer someone saying they don’t see it. That’s why you get crickets.
The difference matters because, in my experience, Llama 3, by virtue of its giant vocabulary, generally tokenizes text with 20-25% fewer tokens than something like Mistral. So even if it's 18% slower in terms of tokens/second, it may, depending on the text content, actually output a given body of text faster.
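A quick back-of-the-envelope check of that claim, as a sketch. I'm reading "18% slower" as a token rate of 0.82x the baseline, which is an assumption, and `relative_time` is just a name I made up. Time to emit a text is tokens divided by tokens/second, so the relative time is the token ratio divided by the speed ratio.

```python
def relative_time(token_ratio, speed_ratio):
    """Wall-clock time for the same text, relative to the baseline model.

    token_ratio: tokens needed relative to baseline (0.80 = 20% fewer)
    speed_ratio: tokens/second relative to baseline (0.82 = 18% slower)
    """
    return token_ratio / speed_ratio

for fewer in (0.20, 0.25):
    t = relative_time(1 - fewer, 1 - 0.18)
    print(f"{fewer:.0%} fewer tokens -> {t:.3f}x the baseline time")
```

Under that reading the break-even point is an 18% token reduction, so anything in the 20-25% range comes out ahead, though only by a few percent at the low end.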
Don't sleep on Gemini 1.5. The 1,000,000 token context window is crazy when you can dump everything from a single project (hundreds, even thousands of documents) into it and then run inference. Sure it's not the strongest model, but it is still good, and it's the best when you can basically train it on whatever you are working with.
llama3 on groq hits the sweet spot of being so fast that I now avoid going back to waiting on gpt4 unless I really need it, and being smart enough that for 95% of the cases I won't need to.
I simply asked it "what are you" and it responded that it was GPT-4 based.
> I'm ChatGPT, a virtual assistant powered by artificial intelligence, specifically designed by OpenAI based on the GPT-4 model. I can help answer questions, provide explanations, generate text based on prompts, and assist with a wide range of topics. Whether you need help with information, learning something new, solving problems, or just looking for a chat, I'm here to assist!
Why would the model be self aware? There is no mechanism for the llm to know the answer to “what are you” other than training data it was fed. So it’s going to spit out whatever it was trained with, regardless of the “truth”
I agree there's no reason to believe it's self-aware (or indeed aware at all) but capabilities and origins is probably among the questions they get most, especially as the format is so inviting for anthropomorphizing and those questions are popular starters in real human conversation. It's simply due diligence in interface design to add that task to the optimization.
It would be easy to mislead about this if the maker wished to do that, of course, but it seems plausible that it would usually have been put truthfully as a service to the user.
This doesn't necessarily confirm that it's 4, though. For example, when I write a new version of a package on some package management system, the code may be updated by 1 major version but it stays the exact same version until I enter the new version into the manifest. Perhaps that's the same here; the training and architecture are improved, but the version number hasn't been ticked up (and perhaps intentionally; they haven't announced this as a new version openly, and calling it GPT-2 doesn't explain anything either).
Yeah that isn't reliable, you can ask mistral 7b instruct the same thing and it will often claim to be created by OpenAI, even if you prompt it otherwise.