I haven't tried to compare but the method I've used in the past uses a multiply instead of divide per iteration. It's a simple binary search - If you want a 16 bit result from a 32 bit source then it always takes 15 iterations and is not an estimate so it's as accurate as 16 bits can be. Combined with the multiply, which is often a lot faster than a divide, then the binary search method might be a better solution.
The search loop goes something like:
square=12345678 ' The number we want to find the square-root of.
sqroot=0 ' The resultant square-root.
index = &8000 ' The bit mask indexing.
sqrt_loop:
sqroot = sqroot + index
if sqroot * sqroot > square then sqroot = sqroot - index
index = index >> 1
if index<>0 then sqrt_loop
Evan