Yes, actually: that's the entire point of the paper! The concept is that a weight like 0.00006103515625 carries no more useful information than 0, that -0.99951172 is equivalent to -1, 1.26406236 to 1, etc. In other words, there's no practical difference when actually using the model (if it's trained in ternary from the start).
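To make that concrete, here's a tiny sketch (Python/NumPy, using the values above) of the mapping being described. It's only an illustration of the correspondence, not a conversion procedure:

```python
import numpy as np

# Just illustrating the mapping described above -- NOT a valid way to convert
# an existing FP model (see the next paragraph): round each weight to the
# nearest of {-1, 0, +1}.
weights = np.array([0.00006103515625, -0.99951172, 1.26406236])
print(np.clip(np.round(weights), -1, 1).astype(int))  # [ 0 -1  1]
```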
The paper posits (and provides evidence) that if you train a model using ternary values instead of floating-point values, you get equivalent (useful/practical) information. You can't take an existing model and simply round all of its weights to `{-1,0,+1}`, but you can (re)train a model using ternary weights and get the same end result (equivalent information/output).
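To make "train using ternary values" concrete, here's a rough sketch of the usual quantization-aware-training trick: keep full-precision "latent" weights for the optimizer, quantize them to `{-1,0,+1}` on every forward pass, and pass gradients straight through to the latent weights. The details below (absmean scaling, the toy regression task) are my own illustrative assumptions, not necessarily the paper's exact recipe:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    # Scale by the mean absolute weight, then round and clip to {-1, 0, +1}.
    # (The absmean scaling is an assumption for illustration, not necessarily
    # the paper's exact scheme.)
    scale = np.mean(np.abs(w)) + eps
    return scale, np.clip(np.round(w / scale), -1, 1)

# Toy regression where the forward pass only ever sees ternary weights,
# but the optimizer updates full-precision "latent" weights
# (straight-through estimator).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
true_w = rng.choice([-1.0, 0.0, 1.0], size=8)
y = X @ true_w

latent_w = 0.1 * rng.standard_normal(8)
for _ in range(500):
    scale, w_q = ternary_quantize(latent_w)
    pred = X @ (scale * w_q)              # forward pass uses ternary weights
    grad = 2 * X.T @ (pred - y) / len(y)  # gradient w.r.t. the weights actually used
    latent_w -= 0.05 * grad               # ...applied straight through to the latent weights

print(ternary_quantize(latent_w)[1])  # the learned ternary pattern
```

The point is that the gradient signal still flows in full precision, so the network can learn which ternary value each weight should settle on, which is exactly what post-hoc rounding can't do.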
Technically, a model trained using FP16 weights contains vastly more information than one trained using ternary weights; practically, though, it seems to make no difference.
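For a rough sense of the gap: an FP16 weight takes 16 bits, while a ternary weight carries at most log2(3) ≈ 1.58 bits, so the raw storage difference per weight is about 10x. Back-of-the-envelope (the 7B parameter count is just an example):

```python
import math

params = 7e9                                  # example model size (an assumption)
fp16_gb = 16 * params / 8 / 1e9               # 16 bits per weight
ternary_gb = math.log2(3) * params / 8 / 1e9  # ~1.58 bits per weight at best

print(f"FP16 weights:    ~{fp16_gb:.1f} GB")     # ~14 GB
print(f"Ternary weights: ~{ternary_gb:.1f} GB")  # ~1.4 GB
```

(You can't address 1.58 bits directly, so real implementations pack weights into 2 bits or group several per byte, but the order of magnitude holds.)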
My prediction: floating-point models will still be used extensively by scientists and academics in their AI research, but nearly all real-world, publicly distributed AI models will be ternary. It's just too practical and enticing! Even if the ternary version of a model is only 90% as effective, it's going to be so much faster and cheaper to run in practice. We're talking about the difference between requiring a $500 GPU and a $5 microcontroller.
I don't think you really answered my question. What the paper has done is show experimentally that networks don't contain enough information to justify their weight precision, and that's really good and a very important result, but what I was asking is whether there's a rigorous way to take an arbitrary network and determine its information content (either on its own, or compared to another network). Possibly that could be measured relative to its outputs.
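For example, the kind of thing I have in mind on the "relative to its outputs" side would be comparing two networks' predictive distributions on the same inputs, say via average KL divergence, though that only measures divergence between two specific models, not information content in any rigorous sense. A sketch of that proxy (the names and random logits are purely illustrative):

```python
import numpy as np

def avg_kl(p_logits, q_logits):
    # Mean KL(P || Q) between two models' output distributions on the same
    # batch -- a crude proxy for how much "information" Q loses relative to P.
    # Illustrative only; NOT a rigorous measure of a network's information content.
    p = np.exp(p_logits - p_logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    q = np.exp(q_logits - q_logits.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))

# Stand-in logits for two hypothetical models evaluated on the same batch:
rng = np.random.default_rng(0)
logits_a = rng.standard_normal((4, 10))
logits_b = logits_a + 0.1 * rng.standard_normal((4, 10))
print(avg_kl(logits_a, logits_b))  # small value -> similar output distributions
```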