Tuesday, November 10, 2009

The Assumptions We Make

I loved my algorithms courses in university. The material just 'clicked' and I did extremely well. But one day, a while after I had finished the last of them, a thought about hashing occurred to me. I realized I had been making some very serious assumptions when analyzing algorithmic complexity. This post covers two of those assumptions.

The thought, restructured to actually follow a logical argument, goes something like this: hash tables can't be constant time. There's just no way. If a hash function takes constant time, then its output has constant size. If the output has N bits, then a table indexed by that output has at most 2^N buckets. If the table has a maximum number of buckets, we can overflow it just by adding more and more items (i.e. it will eventually degenerate into linear-time searches of the bins). What's going on?!
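
To make that overflow argument concrete, here's a little sketch (my own illustration, not code from any real hash library): a table whose hash output is fixed at 8 bits, so it can never have more than 2^8 = 256 buckets no matter how many items you stuff into it.

    # A table whose hash output is fixed at 8 bits, so it can never have
    # more than 2**8 = 256 buckets, no matter how many items it holds.
    class FixedOutputHashTable:
        OUTPUT_BITS = 8  # constant-time hash => constant-size output

        def __init__(self):
            self.buckets = [[] for _ in range(2 ** self.OUTPUT_BITS)]

        def _hash(self, key):
            return hash(key) & ((1 << self.OUTPUT_BITS) - 1)

        def insert(self, key, value):
            self.buckets[self._hash(key)].append((key, value))

        def lookup(self, key):
            # Worst case: scan a bucket holding about n / 256 entries.
            for k, v in self.buckets[self._hash(key)]:
                if k == key:
                    return v
            raise KeyError(key)

    table = FixedOutputHashTable()
    for i in range(100000):
        table.insert(i, i * i)
    # Each bucket now holds roughly 100000 / 256 ~= 390 entries, so a
    # lookup is a linear scan of a bin, not a constant-time operation.
    print(table.lookup(54321))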

Constant Word Size

The constant word size assumption is the assumption that, even though you may have an unbounded number of unique items, each item still has constant size and can still be manipulated in constant time. Copying, comparing, swapping, and a bunch of other operations we usually treat as constant time necessarily become more expensive as the inputs get huge.

Suppose you want to sort 2^2048 items (clearly this is a thought experiment!). If each of those items is unique, they must each have a unique binary representation. So, on average, each item must use at least 2048 bits. It suddenly becomes clear that comparing items can't be O(1); it must take at least O(lg n) time [because each unique item needs at least lg n bits].
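
Here's a tiny experiment (Python, with sizes I picked arbitrarily) showing that a single 'constant-time' comparison really does get slower as the items get bigger:

    # Time a batch of equality comparisons between two equal b-bit integers.
    # Equal values force Python to scan all b bits before it can answer, so
    # the cost of one "comparison" grows with the size of the items.
    import time

    def time_equal_compare(bits, trials=1000):
        x = (1 << bits) - 1   # a b-bit number, all ones
        y = (1 << bits) - 1   # an equal but distinct copy: forces a full scan
        start = time.perf_counter()
        for _ in range(trials):
            _ = (x == y)
        return time.perf_counter() - start

    for bits in (2 ** 10, 2 ** 14, 2 ** 18, 2 ** 22):
        print(bits, "bits:", round(time_equal_compare(bits), 4), "seconds")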

Let's analyze merge sort using this new realization. It should be immediately obvious that merging becomes more expensive: merging two sublists with combined length L, drawn from a list of n unique items, requires O(L * lg n) time, because each comparison costs O(lg n). Since each of the O(lg n) levels of merging handles sublists whose combined size is O(n), merge sort takes O(n * (lg n) * (lg n)) time. Wait, what?! Merge sort is O(n lg n), not O(n lg^2 n)! What's going on?!
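
As a rough sanity check (a sketch of my own, not a proof), here's a merge sort instrumented to count comparisons; multiplying the count by the roughly lg n bits each unique key needs shows where the extra logarithmic factor comes from:

    # Merge sort with a comparison counter.  For n unique items the counter
    # comes out near n lg n, and since each comparison of (lg n)-bit keys
    # costs about lg n bit operations, the total bit work is about n (lg n)^2.
    import math, random

    comparisons = 0

    def merge_sort(items):
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))

    def merge(left, right):
        global comparisons
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            comparisons += 1
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out

    n = 1 << 16
    data = list(range(n))                      # n unique keys
    random.shuffle(data)
    merge_sort(data)
    bits_per_key = math.ceil(math.log2(n))     # each unique key needs >= lg n bits
    print(comparisons, "comparisons")          # roughly n lg n
    print(comparisons * bits_per_key, "worst-case bit operations")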

What's going on is the constant word size assumption. The O(n lg n) bound actually applies to the number of comparisons you have to perform, not the total computational time. The two just happen to coincide exceedingly well in practice (e.g. I doubt anyone will ever actually sort 2^64 unique items, certainly never 2^128 unique items).

Essentially every algorithm has a theoretical O(lg n) word-size factor hidden away. Once I realized this, my constant-time hashing dilemma went away. It's the exact same issue, except I was focusing on the output size instead of the input size.

RAM Model vs Physics

Perhaps the most ingrained assumption is the random-access memory model, where memory is unbounded and every access takes constant time. Unlike the constant word size assumption, this assumption was at least stated in my algorithms course. But reality is, of course, not so forgiving.

You can only fit so much information into a given volume, and the distance from the processor puts a lower bound on the time it takes to retrieve a bit of information. A sphere with volume n has radius O(n^(1/3)), so accessing a particular bit of data is not just O(lg n), it's O(n^(1/3))! [I remember reading that you can only store O(surface area) bits before a black hole forms, which would make it O(n^(1/2))!]
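
Just to get a feel for the cube-root bound, here's a back-of-the-envelope calculation (the storage density is a number I made up purely for illustration, and I'm ignoring everything except the speed of light):

    # Pack n bits into the smallest possible sphere at an assumed density,
    # then look at the light-speed travel time to its edge.  Every factor
    # of 1000 in data size multiplies the latency by 10: the cube-root
    # bound in action.
    import math

    BITS_PER_CUBIC_METER = 1e27   # assumed density: one bit per cubic nanometer
    SPEED_OF_LIGHT = 3e8          # meters per second

    def minimum_access_time(total_bits):
        volume = total_bits / BITS_PER_CUBIC_METER          # cubic meters
        radius = (3 * volume / (4 * math.pi)) ** (1 / 3)    # meters
        return radius / SPEED_OF_LIGHT                      # one-way, in seconds

    for bits in (1e30, 1e33, 1e36, 1e39):
        print("%.0e bits -> at least %.1e seconds per access"
              % (bits, minimum_access_time(bits)))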

I don't even want to think about what that does to the time complexity of sorting or hashing.

Summary

It's a bit unsettling to realize I'd been ignoring all these logarithmic factors, and that no one had ever mentioned them to me. Oh well, at least it doesn't matter in practice... If anything, processor word sizes have grown faster than the data we feed them, subtly reinforcing the constant-size assumption.
