Python: hash()

  • john_marshall / 200 / Fri, 27 Mar 2009 05:33:00 GMT / Comments (9)
  • Hi,

    For strings of > 1 character, what are the chances
    that hash(st) and hash(st[::-1]) would return the
    same value?

    My goal is to uniquely identify multicharacter strings,
    all of which begin with "/" and never end with "/".
    Therefore, st != st[::-1].

    Thanks,
    John
  • Keywords:

    hash, python

  • http://programming.itags.org/python/32369/«« Last Thread - Next Thread »»
    1. John Marshall wrote:
      > For strings of > 1 character, what are the chances
      > that hash(st) and hash(st[::-1]) would return the
      > same value?


      Why not grab a dictionary and do the stats yourself?

      --Scott David Daniels
      scott.daniels...acm.org

      scottdavid_daniels | Wed, 07 May 2008 00:27:00 GMT |

    2. Scott David Daniels wrote:
      > John Marshall wrote:
      >
      >
      > Why not grab a dictionary and do the stats yourself?


      I was actually interested in the mathematical/probability
      side rather than the empirical w/r to the current
      hash function in python. Although I imagine I could do
      a brute force test for x-character strings.

      John

      john_marshall | Wed, 07 May 2008 00:28:00 GMT |

    3. John Marshall wrote:

      > I was actually interested in the mathematical/probability
      > side rather than the empirical w/r to the current
      > hash function in python. Although I imagine I could do
      > a brute force test for x-character strings.


      Hah. No.

      At least on the version I have handy (Py 2.2.3 on Itanium2), hash
      returns a 64-bit value. Brute-forcing that in any reasonable length of
      time is rather impossible.

      christopher_subich | Wed, 07 May 2008 00:29:00 GMT |

    4. John Marshall wrote:
      > Scott David Daniels wrote:
      > I was actually interested in the mathematical/probability
      > side rather than the empirical w/r to the current
      > hash function in python.

      Well, the probability depends on the universe you are choosing from.
      That was why I was suggesting a dictionary: words may well have a
      different distribution than arbitrary strings.

      --Scott David Daniels
      scott.daniels...acm.org

      scottdavid_daniels | Wed, 07 May 2008 00:30:00 GMT |

    5. John Marshall wrote:

      > For strings of > 1 character, what are the chances
      > that hash(st) and hash(st[::-1]) would return the
      > same value?


      the algorithm is described here:

      http://effbot.org/zone/python-hash.htm

      feel free to do a mathematical analysis. a non-mathematical analysis
      says that if the chance is high if your word is a palindrome, and low if
      it's not.

      </F>

      fredrik_lundh | Wed, 07 May 2008 00:31:00 GMT |

    6. [John Marshall]
      > For strings of > 1 character, what are the chances
      > that hash(st) and hash(st[::-1]) would return the
      > same value?


      Python's string hash algorithm is non-commutative, so a collision with
      a reversed string is not likely. The exact answer depends on the
      population of strings being hashed, but it's not hard to compute
      collision statistics for a sampling of those strings:

      collisions = len(sample) - len(set(hash(s) for s in sample))

      FWIW, here is how Python computes string hash values:

      static long
      string_hash(PyStringObject *a)
      {
      register int len;
      register unsigned char *p;
      register long x;

      len = a->ob_size;
      p = (unsigned char *) a->ob_sval;
      x = *p << 7;
      while (--len >= 0)
      x = (1000003*x) ^ *p++;
      x ^= a->ob_size;
      if (x == -1)
      x = -2;
      return x;
      }


      > My goal is to uniquely identify multicharacter strings,
      > all of which begin with "/" and never end with "/".
      > Therefore, st != st[::-1].


      Just use a set -- no string reversal is needed for detection of unique
      multicharacter strings..

      Raymond

      raymond_hettinger | Wed, 07 May 2008 00:32:00 GMT |

    7. [John Marshall]
      > For strings of > 1 character, what are the chances
      > that hash(st) and hash(st[::-1]) would return the
      > same value?


      First, if `st` is a string, `st[::-1]` is a list. Do you really mean
      to compare string hashes with list hashes here? I'm going to assume
      not.

      Second, what are your assumptions about (a) the universe of strings;
      and, (b) the hash function?

      Assuming a finite universe of strings (also finite ;-)), and a hash
      function that returns each of its H possible results "at random"
      (meaning that there's no algorithmic way to predict any bit of the
      hash output short of running the hash function), then the probability
      that two distinct strings have the same hash is 1/H. It doesn't
      matter to this outcome whether one input is the reversal of the other.

      > My goal is to uniquely identify multicharacter strings,


      Unclear what that means. Obviously, if your string universe contains
      more than H strings, it's impossible for any hash function with H
      possible values to return a different hash value each input.

      > all of which begin with "/" and never end with "/".
      > Therefore, st !=3D st[::-1].


      As at the start, I think you mean to say st !=3D "".join(st[::-1]). I
      don't know why you might think that matters, though. Is it simply
      because this condition eliminates palindromes from your input
      universe?

      Anyway, to be concrete, using CPython's hash function on a 32-bit box,
      H =3D 2**32-1. Call a string `s` bad iff:

      s[0] =3D=3D "/" and s[-1] !=3D "/" and hash(s) =3D=3D hash("".join(reve=
      rsed(s)))

      Then there are no bad strings of length 1, 2, 3, or 4. There are 4
      bad strings of length 5:

      '/\xde&\xf6C'
      '/\xca\x0e\xfaC'
      '/\xc4\x06\xfcC'
      '/\xad\xd6\x01\xd6'

      I didn't think about this -- I just wrote a little program to try all
      ~=3D 4 billion such strings. So if your universe is the set of all
      5-character 8-bit strings that start with '/' but don't end with '/',
      and you pick inputs uniformly at random from that universe, the chance
      of a hash match between a string and its reversal is

      4 / (256**3 * 255)

      or a little less than 1 in a billion. For a truly random hash
      function with a 32-bit output, the chance would be a little less than
      1 in 4 billion.

      It would be mildly surprising if those odds got worse as string length
      increased. The md5 and sha hashes have much larger H, and were
      designed for (among other things) good collision resistance.

      timpeters | Wed, 07 May 2008 00:33:00 GMT |

    8. [John Marshall]

      [Tim Peters]

      [Jeff Epler]
      > It is?
      >
      > 'sgorf hcnerf'
      > (Python 2.3)


      Indeed that's right. Python 2.4+ also. My apologies! Good thing it
      doesn't matter to the rest of the exposition ;-)

      timpeters | Wed, 07 May 2008 00:34:00 GMT |

    9. Tim Peters wrote:
      > First, if `st` is a string, `st[::-1]` is a list.


      I hate to question the great timbot, but am I missing something?

      'edcba'

      STeVe

      steven_bethard | Wed, 07 May 2008 00:35:00 GMT |