1. Create a point cloud from a scene (either via lidar, or via photogrammetry from multiple images)
2. Replace each point of the point cloud with a fuzzy ellipsoid that has a bunch of parameters for its position + size + orientation + view-dependent color (via spherical harmonics up to some low order)
3. If you render these ellipsoids using a differentiable renderer, then you can subtract the resulting image from the ground truth (i.e. your original photos), and calculate the partial derivatives of the error with respect to each of the millions of ellipsoid parameters that you fed into the renderer.
4. Now you can run gradient descent using the differentiable renderer, which makes your fuzzy ellipsoids converge to something that closely reproduces the ground truth images (from multiple angles); a toy code sketch of this loop follows after this list.
5. Since the ellipsoids started at the 3D point cloud's positions, the 3D structure of the scene will likely be preserved during gradient descent, so the resulting scene will support novel camera angles with plausible-looking results.
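If it helps to see what the math actually does, here is a toy 2D version of steps 2-4 as code. This is just a sketch of the idea, not the real 3D Gaussian splatting pipeline: real splats are anisotropic 3D Gaussians with opacity and spherical-harmonic color, and the real renderer computes its gradients analytically on the GPU rather than by the finite-difference nudging used here.

    import numpy as np

    # The "ground truth photo" we want to reproduce -- here just a
    # synthetic 32x32 image with a bright square in the middle.
    H = W = 32
    target = np.zeros((H, W))
    target[10:22, 10:22] = 1.0

    # Each blob is a 2D stand-in for a fuzzy ellipsoid; its parameters
    # are (center_x, center_y, radius, brightness).
    params = np.array([
        [ 8.0,  8.0, 3.0, 0.5],
        [20.0, 20.0, 3.0, 0.5],
    ])

    ys, xs = np.mgrid[0:H, 0:W]

    def render(params):
        """'Renderer': splat every blob onto the pixel grid and add them up."""
        img = np.zeros((H, W))
        for cx, cy, r, b in params:
            img += b * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * r ** 2))
        return img

    def error(params):
        """Sum of squared pixel differences between the render and the photo."""
        return np.sum((render(params) - target) ** 2)

    def gradient(params, eps=1e-4):
        """Partial derivative of the error w.r.t. every blob parameter,
        estimated by nudging each parameter slightly (finite differences)."""
        base = error(params)
        g = np.zeros_like(params)
        for i in range(params.shape[0]):
            for j in range(params.shape[1]):
                p = params.copy()
                p[i, j] += eps
                g[i, j] = (error(p) - base) / eps
        return g

    # Gradient descent: repeatedly nudge every parameter downhill on the error.
    lr = 1e-3
    for step in range(300):
        params = params - lr * gradient(params)

    print("final error:", error(params))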
ELI5 has meant friendly simplified explanations (not responses aimed at literal five-year-olds) since forever, at least on the subreddit where the concept originated.
Now, perhaps referring to differentiability isn't layperson-accessible, but this is HN after all. I found it to be the perfect degree of simplification personally.
If one actually tried to explain this to a five-year-old, they could use things like analogy, simile, metaphor, and other forms of rhetoric. This was just a straight-up technical explanation.
Lol. Def not for 5 year olds but it's about exactly what I needed
How about this:
Take a lot of pictures of a scene from different angles, do some crazy math, and then you can later pretend to zoom and pan the camera around however you want
Saying "math" (even using it in a dismissive tl;dr) is immensely helpful. Specifically, I've never encountered these terms before:
- point cloud
- fuzzy ellipsoid
- view-dependent color
- spherical harmonics
- low order
- differentiable renderer (what makes it differentiable? A renderer creates images, right?)
- subtract the resulting image from the ground truth (good to know this means your original photos, but how do you subtract images from images?)
- millions of ellipsoid parameters (the explanation previously mentioned 4 parameters by name. Where are the millions coming from?)
- gradient descent (I've heard of this in AI, but usually ignore it because I haven't gotten deep enough into it to need to understand what it means)
- 3D point cloud's positions (are all point clouds 3d? The point cloud mentioned earlier wasn't. Or was it? Is this the same point cloud?)
In other words, you've explained this at far too high a level for me. Given that the request was for ELI5, I expected an explanation that I could actually follow without knowing any specific terminology. It's fine to disregard the specifics and just call it math; but don't call it math and then skip past it entirely. Call it math and explain what you're actually doing with the math, rather than trying to explain the math itself; same for all the other words. If a technical term is only needed once in a conversation, then don't use it.
Given that I actually do know what photogrammetry is at a basic level, I can make a best-effort translation here, but it's 100% guesswork rather than actual understanding:
1. Create a 3d scan of a real-life scene or object. It uses radar (intentionally the wrong term, but more familiar) or multiple photographs from different angles to capture the 3-dimensional shape.
2. For some reason, break up the scan into smaller shapes.
This is where my understanding goes to nearly 0:
3-5: Somehow, looking at the difference between a rendering of your 3d scene and a picture of the actual scene allows you to correct the errors in the 3d scene and make it more realistic. Using complex math works better, and having the computer do it is less effort than manually correcting the models in your 3d scene.
How hard is it to handle cases where the starting positions of the ellipsoids in 3D are not correct (i.e. too far off)? How common is such a scenario with the state of the art? E.g., with only a stereoscopic image pair, the correspondences are often not accurate.
I assume that the differentiable renderer is only given the camera position and viewing angle at any one time (in order to be able to generalize to new viewing angles)?
No. There are no neural networks here. The renderer is just a function that takes a bunch of ellipsoid parameters and outputs a bunch of pixels. You render the scene, then subtract the ground truth pixels from the result, and sum the squared differences to get the total error. Then you ask the question "how would the error change if the X position of ellipsoid #1 was changed slightly?" (then repeat for all ellipsoid parameters, not just the X position, and for all ellipsoids, not just ellipsoid #1). In other words, compute the partial derivative of the error with respect to each ellipsoid parameter. This gives you a gradient that you can use to adjust the ellipsoids to decrease the error (i.e. get closer to the ground truth image).
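If it helps, that "nudge one parameter and see how the error moves" question can be written as a few lines of code. The renderer below is a made-up stand-in (not the actual 3D Gaussian rasterizer); in the real thing the renderer is written so that all of these partial derivatives fall out of one analytic backward pass instead of one nudge per parameter.

    import numpy as np

    def render(ellipsoids):
        # Stand-in renderer: any function from ellipsoid parameters to
        # pixels works for showing the mechanics; this one is made up.
        return np.array([[ellipsoids[0, 0] ** 2, ellipsoids[0, 1]],
                         [ellipsoids[1, 0] * 2.0, ellipsoids[1, 1] ** 3]])

    def error(ellipsoids, ground_truth):
        # Render, subtract the ground-truth pixels, square, and sum.
        return np.sum((render(ellipsoids) - ground_truth) ** 2)

    ground_truth = np.ones((2, 2))        # the "original photo"
    ellipsoids = np.array([[0.5, 0.3],    # ellipsoid #1 (first entry = its X position)
                           [0.2, 0.4]])   # ellipsoid #2

    # "How would the error change if the X position of ellipsoid #1
    #  was changed slightly?"
    eps = 1e-6
    nudged = ellipsoids.copy()
    nudged[0, 0] += eps
    partial = (error(nudged, ground_truth) - error(ellipsoids, ground_truth)) / eps
    print(partial)

    # Repeat for every parameter of every ellipsoid and you have the
    # gradient; step each parameter a little against it and the rendered
    # "image" gets a little closer to the photo.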