
We are months away from being able to do this with images too.

All the pieces are there, and multi-modal (smallish) large language + image models are already being used in research labs; eg MS Kosmos-1[1]. Check out the visual IQ test results in the paper.

Kosmos-1 is only 1.6B parameters. When that or similar models scale up to 50B+ parameters they will be pretty amazing; there's a rough sketch below of what prompting one of these looks like.

[1] https://arxiv.org/abs/2302.14045
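
As far as I know the Kosmos-1 weights were never released, but the follow-up Kosmos-2 checkpoint is on Hugging Face, so here's a rough sketch of the interleaved image+text prompting the paper describes. Treat the model id, prompt prefix, and file name as assumptions on my part, not a verified recipe:

    # Rough sketch: interleaved image + text prompting with a Kosmos-style model.
    # Kosmos-1 itself isn't public (as far as I know), so this uses the Kosmos-2
    # checkpoint on Hugging Face; model id and prompt format are assumptions.
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    model_id = "microsoft/kosmos-2-patch14-224"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)

    image = Image.open("scene.png")    # any local image you want described
    prompt = "<grounding>An image of"  # Kosmos-2's documented prompt prefix

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        max_new_tokens=128,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Strip the grounding tokens to get a plain caption plus detected entities.
    caption, entities = processor.post_process_generation(raw)
    print(caption)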



The linked article briefly mentions interleaving text and images as essential for certain kinds of learning; that particular intersection reminds me of the profoundly compelling multimodal book by Nick Sousanis, "Unflattening". Highest possible recommendation.

https://en.m.wikipedia.org/wiki/Unflattening


I had that book but lost it in LA. This is the page I remembered the most and have shown to the most people:

https://twitter.com/nsousanis/status/245176914900299776?lang...


That is the truly mind-blowing thing that's coming.

I am currently creating reference images for a game which I expect within 6-12 months I'll be able to feed into a multimodal ChatGPT to create 3D assets out of the 2D pics.

We'll be able to conjure up worlds at a whim - so start imagining them already!


Mark gets a lot of flak for the metaverse, but I can imagine world design where you start in a blank room and describe what you want around you, and it appears. Like the white loading room in The Matrix. And with voice recognition and eye tracking (and brain scans), how close are we to “you have to use your hands? It’s like a baby’s toy.”


This was always the coolest capability of Star Trek's holodecks, not the projector technology. Super cool that we might just get there within the lifetime of someone who watched the show in the 80s.

My favorite Star Trek episode has always been "Identity Crisis". Not because it's one of the good ones (it's pretty clunky), but because it contains a fantastic 5-7 minute montage featuring Geordi La Forge interacting with computers (by touch, by voice, and on the holodeck) to solve a murder mystery, analyzing and live-manipulating 3D "holo footage" to discover a vital clue. Whoever imagined that sequence is the hero of my childhood and perhaps the reason I became a software engineer, doing an oddball mix of HMI and systems engineering.

There's so much in that sequence. The free mixing of different input modes, the complementary collaboration between a human and an AI system, carrying state with you from room to room. Analyzing and generating. Following instructions and making suggestions. Powerful inference, precision of control.

We're getting close now!



Thought-to-space?


YES!!! That's exactly what's coming. I'm getting a head start by envisioning a rich world and creating pictorial references for it along with text descriptions. But it's a tiny head start: once the "holodeck" technology arrives, we'll all be able to create anything with just natural language, correcting whatever the model gets wrong as we go. ITERATIVE CHISELING YO!!!

By the way, I am still using the SD 1.5 inpainting .ckpt - thank you Runway for releasing it; it's perfect for my needs and abilities. I never even tried SD 2 and later versions - I heard they work completely differently, and I'm too busy creating to spend time relearning.
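
For anyone curious about that workflow, here's a minimal sketch using the diffusers library to drive the same checkpoint. The checkpoint filename, image paths, and prompt are placeholders rather than my actual setup, and it assumes you already have the Runway-released sd-v1-5-inpainting.ckpt on disk:

    # Rough sketch of an SD 1.5 inpainting pass with the diffusers library.
    # Filenames, prompt, and settings are placeholders; assumes the Runway
    # sd-v1-5-inpainting.ckpt is available locally.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_single_file(
        "sd-v1-5-inpainting.ckpt",
        torch_dtype=torch.float16,
    ).to("cuda")  # or "cpu" with float32 if you have no GPU

    # The reference picture, plus a mask that is white where new content goes.
    init_image = Image.open("reference.png").convert("RGB").resize((512, 512))
    mask_image = Image.open("mask.png").convert("L").resize((512, 512))

    result = pipe(
        prompt="ancient stone gateway overgrown with glowing vines",
        image=init_image,
        mask_image=mask_image,
        num_inference_steps=30,
        guidance_scale=7.5,
    ).images[0]
    result.save("inpainted.png")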



