Hash function performance

gojomo · on Sept 9, 2010

That's an awful hash function. The largest bucket has more than 4X as many items as the smallest. He should pick another function that uses all the characters in the object name. Whatever language or libraries he's already using probably has a better function for strings handy.

acqq · on Sept 9, 2010

I don't know in which language he implements this, but if it's in any efficient language he should certainly use all the characters. Some interesting benchmarks (and also some very interesting links in the comments!) can be found at:

http://www.strchr.com/hash_functions

Xk · on Sept 9, 2010

Looks like a case of premature optimization to me. Optimizing for hash function speed when you end up with thousands of comparisons is just pathetic. Who cares if your hash function takes a whole millisecond instead of a fraction of a millisecond if it ends up that you get an equal distribution in the different buckets?

(On average they're going to need to compare with 6324 other objects before they've gotten the right one. A perfect hash function would end up 5686 checks. Willing to bet that the 639 fewer checks would make up for a better hash function.)

But that's just the beginning of the problem! I mean, even with a hash function that's that bad, ~100 buckets would get them nearly 10X better performance! And realistically, why not use several thousand buckets? Come on ...

wiredfool · on Sept 9, 2010

This is used for everything in his environment, and has been for 20 years or so. The Odb. Local hash variables, stack frames. Both speed and size wewe important when this was created. Why it's coming up now, I have no idea.

jallmann · on Sept 9, 2010

The ignorance in this article is almost embarrassing. Why does Dave Winer keep getting posts on HN?

codexon · on Sept 9, 2010

I thought he was being sarcastic when he said it looked random. Isn't that why he is asking for input?

Natsu · on Sept 9, 2010

If he wants my input he should:

* Learn more about the theory behind hashes before designing one.

* Use a hash that distributes evenly across buckets.

* Learn what random looks like.

If it was random, all the numbers would be chosen with close to the same probability. So there would be an even number of items in each bucket. What we have here shows that bucket 1 is rarely chosen and bucket 7 is really common. The spread is so ridiculous (17.6k vs. 3.8k) that it cannot be due to chance.

In short, this is an absurd hash function. Do not use it in cryptography or elsewhere. If you need simple, low-cost, non-cryptographic hashes, look here:

http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homewo...

wiredfool · on Sept 9, 2010

Many of the design decisions in his software make a lot of sense for the design of an environment that could fit on a floppy.

jemfinch · on Sept 9, 2010

None of the ones in this post do, though.

Using an appropriate number of buckets would actually reduce memory usage.

Using an appropriate hash function (one that at least considered all the characters in the key) wouldn't increase the time to hash a string read off a floppy disk, because he's using the first and last characters anyway.

wiredfool · on Sept 10, 2010

What's the appropriate number of buckets, taking into account everywhere it's used?

jemfinch · on Sept 10, 2010

Depends on your way of hashing. If you're using linear chaining (basically, hashing a linked list of entries) you can get away with N buckets for N items if you have a good hash function. If you're using open addressing, you'll usually want your load factor (the ratio of filled slots to unfilled slots) not to be much higher than 2/3 or so. It may appear on the surface that open addressing uses more memory, but don't forget the overhead of linear chaining (an extra pointer for every element in the hash table, many more cache misses relative to a low-load open addressed table).

16s · on Sept 9, 2010

I read that headline and straight away thought of md4 versus md5 performance. I need to take a break :)

watersco · on Sept 9, 2010

Why only 11 buckets? That seems like the bigger issue. Converting an O(n) lookup to O(n/10) seems silly. Why not have 1000 buckets and get two more orders of magnitude improvement in lookup performance.

Of course that will expose how bad the hash function really is. Saving a few cycles in the hash function and then chaining through 17k linked list entries doesn't make any sense.

wiredfool · on Sept 9, 2010

Because it was a design decision from 20 years ago for small n hash tables. It's baked into the design of his object db files.

jallmann · on Sept 9, 2010

To be fair, we don't know the number of objects -- most of these 125k accesses could be lookups. But then again, if the number of objects is that small, then a hashtable strikes me as overkill. Otherwise yes, he should notch it up to a reasonable load factor.

But there's no denying how atrocious the hash function is. To quote:

> first and last characters of the name of the object, adds them together and mods the result by the number of buckets, which is 11

I don't know much about the name representation, but I'm guessing it's human readable ASCII. Which means your keys are confined to a very narrow range, and they'll be distributed along the same lines as the language itself (English or whatever). That means collisions up the wazoo.

houseabsolute · on Sept 9, 2010

I upvoted because I really hope to hear from someone who knows this stuff. Off the top of my head (as a non-expert), this sounds like a horrible hash function that is vulnerable to a plethora of attacks, but I don't know for sure. Why not just use the built-in hash function, or some function someone smarter than you has written?

drblast · on Sept 9, 2010

I'm not a hash function expert, but I did sleep at a Holiday Inn Express last night.

This is a horrible hash function for a number of reasons. You want your hash function to use every bit of information from the data you're keying on because that's more likely to give you a spread that doesn't contain a collision. Consider this list of keys:

aaz abz axz

Those three keys would collide, and for no reason. Even adding up each letter (another horrible hash function because it eliminates information about the position of the character, so "the" and "eht" would collide) would be better than this.

Information theory is important here; you want to preserve as much information from your key as possible. Check out http://en.wikipedia.org/wiki/Entropy_(information_theory) for a good discussion on this.

Also, there is almost no earthly reason to use a hash function with 11 buckets, and you certainly wouldn't want to evaluate your hash based on that. Assuming you'd have to search each bucket of 10,000 for your match, hashing it into 11 sections buys you very little time.

Also, there's no reason not to do something more complicated; assuming your key is in CPU cache because you're going to add the first and last letters, why not at least add up all the letters? You're wasting free CPU cycles after you've already loaded the key from memory.

Finally, you don't want output that "looks pretty random." You want output that sorts exactly evenly between buckets. He's nowhere close.

jemfinch · on Sept 9, 2010

It's absolutely, positively horrible. My real comments are on the blog itself.

wiredfool · on Sept 9, 2010

It's an internal hash for hash tables and the like. It's not for crypto.

It's main issue is that it turns large tables into an O(n/10) linked list. It was kinda painful for a few things 10 years ago, and it was a reasonable hack 10 years before that when it was likely originally written. Iirc, there was one pathological case where all the items ended up in one bucket, but that's lost to the sands of time.

Fwiw, md5 has been available in his system since 98 or so, and any security related stuff would have been using that.