Home » Category » Python

Python: hash()

200| Wed, 07 May 2008 00:26:00 GMT| john_marshall| Comments (9)
Hi,

For strings of > 1 character, what are the chances
that hash(st) and hash(st[::-1]) would return the
same value?

My goal is to uniquely identify multicharacter strings,
all of which begin with "/" and never end with "/".
Therefore, st != st[::-1].

Thanks,
John

Keywords & Tags: hash, python

URL: http://programming.itags.org/python/32369/
 
«« Prev - Next »» 9 helpful answers below.
John Marshall wrote:
> For strings of > 1 character, what are the chances
> that hash(st) and hash(st[::-1]) would return the
> same value?


Why not grab a dictionary and do the stats yourself?

--Scott David Daniels
scott.daniels...acm.org

scottdavid_daniels | Wed, 07 May 2008 00:27:00 GMT |

Scott David Daniels wrote:
> John Marshall wrote:
>
>
> Why not grab a dictionary and do the stats yourself?


I was actually interested in the mathematical/probability
side rather than the empirical w/r to the current
hash function in python. Although I imagine I could do
a brute force test for x-character strings.

John

john_marshall | Wed, 07 May 2008 00:28:00 GMT |

John Marshall wrote:

> I was actually interested in the mathematical/probability
> side rather than the empirical w/r to the current
> hash function in python. Although I imagine I could do
> a brute force test for x-character strings.


Hah. No.

At least on the version I have handy (Py 2.2.3 on Itanium2), hash
returns a 64-bit value. Brute-forcing that in any reasonable length of
time is rather impossible.

christopher_subich | Wed, 07 May 2008 00:29:00 GMT |

John Marshall wrote:
> Scott David Daniels wrote:
> I was actually interested in the mathematical/probability
> side rather than the empirical w/r to the current
> hash function in python.

Well, the probability depends on the universe you are choosing from.
That was why I was suggesting a dictionary: words may well have a
different distribution than arbitrary strings.

--Scott David Daniels
scott.daniels...acm.org

scottdavid_daniels | Wed, 07 May 2008 00:30:00 GMT |

John Marshall wrote:

> For strings of > 1 character, what are the chances
> that hash(st) and hash(st[::-1]) would return the
> same value?


the algorithm is described here:

http://effbot.org/zone/python-hash.htm

feel free to do a mathematical analysis. a non-mathematical analysis
says that if the chance is high if your word is a palindrome, and low if
it's not.

</F>

fredrik_lundh | Wed, 07 May 2008 00:31:00 GMT |

[John Marshall]
> For strings of > 1 character, what are the chances
> that hash(st) and hash(st[::-1]) would return the
> same value?


Python's string hash algorithm is non-commutative, so a collision with
a reversed string is not likely. The exact answer depends on the
population of strings being hashed, but it's not hard to compute
collision statistics for a sampling of those strings:

collisions = len(sample) - len(set(hash(s) for s in sample))

FWIW, here is how Python computes string hash values:

static long
string_hash(PyStringObject *a)
{
register int len;
register unsigned char *p;
register long x;

len = a->ob_size;
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
x = (1000003*x) ^ *p++;
x ^= a->ob_size;
if (x == -1)
x = -2;
return x;
}


> My goal is to uniquely identify multicharacter strings,
> all of which begin with "/" and never end with "/".
> Therefore, st != st[::-1].


Just use a set -- no string reversal is needed for detection of unique
multicharacter strings..

Raymond

raymond_hettinger | Wed, 07 May 2008 00:32:00 GMT |

[John Marshall]
> For strings of > 1 character, what are the chances
> that hash(st) and hash(st[::-1]) would return the
> same value?


First, if `st` is a string, `st[::-1]` is a list. Do you really mean
to compare string hashes with list hashes here? I'm going to assume
not.

Second, what are your assumptions about (a) the universe of strings;
and, (b) the hash function?

Assuming a finite universe of strings (also finite ;-)), and a hash
function that returns each of its H possible results "at random"
(meaning that there's no algorithmic way to predict any bit of the
hash output short of running the hash function), then the probability
that two distinct strings have the same hash is 1/H. It doesn't
matter to this outcome whether one input is the reversal of the other.

> My goal is to uniquely identify multicharacter strings,


Unclear what that means. Obviously, if your string universe contains
more than H strings, it's impossible for any hash function with H
possible values to return a different hash value each input.

> all of which begin with "/" and never end with "/".
> Therefore, st !=3D st[::-1].


As at the start, I think you mean to say st !=3D "".join(st[::-1]). I
don't know why you might think that matters, though. Is it simply
because this condition eliminates palindromes from your input
universe?

Anyway, to be concrete, using CPython's hash function on a 32-bit box,
H =3D 2**32-1. Call a string `s` bad iff:

s[0] =3D=3D "/" and s[-1] !=3D "/" and hash(s) =3D=3D hash("".join(reve=
rsed(s)))

Then there are no bad strings of length 1, 2, 3, or 4. There are 4
bad strings of length 5:

'/\xde&\xf6C'
'/\xca\x0e\xfaC'
'/\xc4\x06\xfcC'
'/\xad\xd6\x01\xd6'

I didn't think about this -- I just wrote a little program to try all
~=3D 4 billion such strings. So if your universe is the set of all
5-character 8-bit strings that start with '/' but don't end with '/',
and you pick inputs uniformly at random from that universe, the chance
of a hash match between a string and its reversal is

4 / (256**3 * 255)

or a little less than 1 in a billion. For a truly random hash
function with a 32-bit output, the chance would be a little less than
1 in 4 billion.

It would be mildly surprising if those odds got worse as string length
increased. The md5 and sha hashes have much larger H, and were
designed for (among other things) good collision resistance.

timpeters | Wed, 07 May 2008 00:33:00 GMT |

[John Marshall]

[Tim Peters]

[Jeff Epler]
> It is?
>
> 'sgorf hcnerf'
> (Python 2.3)


Indeed that's right. Python 2.4+ also. My apologies! Good thing it
doesn't matter to the rest of the exposition ;-)

timpeters | Wed, 07 May 2008 00:34:00 GMT |

Tim Peters wrote:
> First, if `st` is a string, `st[::-1]` is a list.


I hate to question the great timbot, but am I missing something?

'edcba'

STeVe

steven_bethard | Wed, 07 May 2008 00:35:00 GMT |

Python Hot Answers

Python New questions

Python Related Categories