Python: hash() yields different results for different platforms

  • qiangninghong / 206 / Fri, 27 Mar 2009 05:36:00 GMT / Comments (10)
  • I'm writing a spider. I have millions of urls in a table (mysql) to
    check if a url has already been fetched. To check fast, I am
    considering to add a "hash" column in the table, make it a unique key,
    and use the following sql statement:
    insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
    to add new url.

    I believe this will be faster than making the "url" column unique key
    and doing string comparation. Right?

    However, when I come to Python's builtin hash() function, I found it
    produces different values in my two computers! In a pentium4,
    hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
    hash function depend on machine's word length?

    If it does, I must consider another hash algorithm because the spider
    will run concurrently in several computers, some are 32-bit, some are
    64-bit. Is md5 a good choice? Will it be too slow that I have no
    performance gain than using the "url" column directly as the unique
    key?

    I will do some benchmarking to find it out. But while making my hands
    dirty, I would like to hear some advice from experts here. :)

  • Keywords:

    hash, yields, different, results, platforms, python

  • http://programming.itags.org/python/32372/«« Last Thread - Next Thread »»
    1. "Qiangning Hong" <hongqn...gmail.comwrites:
      Quote:
      Originally Posted by
      However, when I come to Python's builtin hash() function, I found it
      produces different values in my two computers! In a pentium4,
      hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
      hash function depend on machine's word length?

      The hash function is unspecified and can depend on anything the
      implementers feel like. It may(?) even be permitted to differ from
      one run of the interpreter to another (I haven't checked the spec for
      this). Don't count on it being consistent from machine to machine.
      Quote:
      Originally Posted by
      If it does, I must consider another hash algorithm because the spider
      will run concurrently in several computers, some are 32-bit, some are
      64-bit. Is md5 a good choice? Will it be too slow that I have no
      performance gain than using the "url" column directly as the unique key?

      If you're going to accept the overhead of an SQL database you might as
      well enjoy the use of the abstraction it gives you, instead of trying
      to implement what amounts to your own form of indexing instead of
      letting the db take care of it. But md5(url) is certainly very fast
      compared with processing the outgoing http connection that you
      presumably plan to open for each url.
      Quote:
      Originally Posted by
      I will do some benchmarking to find it out.

      That's the right way to answer questions like this.

      paulrubin | Wed, 02 Jan 2008 22:50:00 GMT |

    2. On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
      Quote:
      Originally Posted by
      I'm writing a spider. I have millions of urls in a table (mysql) to
      check if a url has already been fetched. To check fast, I am
      considering to add a "hash" column in the table, make it a unique key,
      and use the following sql statement:
      insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
      to add new url.
      >
      I believe this will be faster than making the "url" column unique key
      and doing string comparation. Right?

      I doubt it will be significantly faster. Comparing two strings
      and hashing a string are both O(N).
      Quote:
      Originally Posted by
      However, when I come to Python's builtin hash() function, I
      found it produces different values in my two computers! In a
      pentium4, hash('a') --468864544; in a amd64, hash('a') ->
      12416037344. Does hash function depend on machine's word
      length?

      Apparently. :)

      The low 32 bits match, so perhaps you should just use that
      portion of the returned hash?
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >>hex(12416037344)

      '0x2E40DB1E0L'
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffffffffffff)

      '0xFFFFFFFFE40DB1E0L'
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >>hex(12416037344 & 0xffffffff)

      '0xE40DB1E0L'
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffff)

      '0xE40DB1E0L'

      --
      Grant Edwards grante Yow! Uh-oh!! I forgot
      at to submit to COMPULSORY
      visi.com URINALYSIS!

      grantedwards | Wed, 02 Jan 2008 22:51:00 GMT |

    3. Grant Edwards wrote:
      Quote:
      Originally Posted by
      On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
      >
      Quote:
      Originally Posted by
      I'm writing a spider. I have millions of urls in a table (mysql) to
      check if a url has already been fetched. To check fast, I am
      considering to add a "hash" column in the table, make it a unique key,
      and use the following sql statement:
      insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
      to add new url.

      I believe this will be faster than making the "url" column unique key
      and doing string comparation. Right?


      >
      I doubt it will be significantly faster. Comparing two strings
      and hashing a string are both O(N).

      Playing Devil's Advocate: The hash would be a one-time operation during
      database insertion, whereas string comparison would happen every
      search. Conceivably, comparing hash strings (which is O(1)) could
      result in a big savings compared to comparing regular strings; but I
      expect most decent sql implementations already hash data internally, so
      rolling your own hash would be useless at best.

      If the OP's database is lacking, md5 is probably fine. Perhaps using a
      subset of the md5 (the low 32 bits, say) could speed up comparisons at
      risk of more collisions. Probably a good trade off unless the DB is
      humungous.

      Carl Banks

      carlbanks | Wed, 02 Jan 2008 22:52:00 GMT |

    4. Grant Edwards wrote:
      Quote:
      Originally Posted by
      On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
      Quote:
      Originally Posted by
      However, when I come to Python's builtin hash() function, I
      found it produces different values in my two computers! In a
      pentium4, hash('a') --468864544; in a amd64, hash('a') ->
      12416037344. Does hash function depend on machine's word
      length?

      >
      Apparently. :)
      >
      The low 32 bits match, so perhaps you should just use that
      portion of the returned hash?
      >
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >hex(12416037344)

      '0x2E40DB1E0L'
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >hex(-468864544 & 0xffffffffffffffff)

      '0xFFFFFFFFE40DB1E0L'
      >
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >hex(12416037344 & 0xffffffff)

      '0xE40DB1E0L'
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >hex(-468864544 & 0xffffffff)

      '0xE40DB1E0L'

      Is this relationship (same low 32 bits) guaranteed? Will it change in
      the future version?

      qiangninghong | Wed, 02 Jan 2008 22:53:00 GMT |

    5. [Grant Edwards]
      Quote:
      Originally Posted by
      Quote:
      Originally Posted by
      >...
      >The low 32 bits match, so perhaps you should just use that
      >portion of the returned hash?
      >>
      Quote:
      Originally Posted by
      >>hex(12416037344)

      >'0x2E40DB1E0L'
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffffffffffff)

      >'0xFFFFFFFFE40DB1E0L'
      >>
      Quote:
      Originally Posted by
      >>hex(12416037344 & 0xffffffff)

      >'0xE40DB1E0L'
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffff)

      >'0xE40DB1E0L'

      [Qiangning Hong]
      Quote:
      Originally Posted by
      Is this relationship (same low 32 bits) guaranteed?

      No. Nothing about hashes is guaranteed, except that when x and y are
      of a hashable type, and x == y, then hash(x) == hash(y) too.
      Quote:
      Originally Posted by
      Will it change in the future version?

      That's possible, but not planned. Note that the guts of string
      hashing in CPython today is implemented via

      while (--len >= 0)
      x = (1000003*x) ^ *p++;

      where x is C type "long", and the C language doesn't even define what
      that does (behavior when signed multiplication overflows isn't defined
      in C).

      timpeters | Wed, 02 Jan 2008 22:54:00 GMT |

    6. Qiangning Hong wrote:
      Quote:
      Originally Posted by
      /.../ add a "hash" column in the table, make it a unique key

      at this point, you should have slapped yourself on the forehead, and gone
      back to the drawing board.

      </F>

      fredriklundh | Wed, 02 Jan 2008 22:55:00 GMT |

    7. Using Python's hash as column in the table might not be a good idea.
      You just found out why. So you could instead just use the base url and
      create an index based on that so next time just quickly get all urls
      from same base address then do a linear search for a specific one, or
      even easier, implement your own hashes without using any of the
      Python's built-in hash() functions. For example, transform each
      character to an int and multply them all mod 2^32-1 or something like
      that. Even better I think someone already posted the Python's way of
      generating hashes for string, well, just re-implement it in Python such
      that your version will yield the same hash on any platform.

      Hope this helps,
      Nick V.

      Qiangning Hong wrote:
      Quote:
      Originally Posted by
      I'm writing a spider. I have millions of urls in a table (mysql) to
      check if a url has already been fetched. To check fast, I am
      considering to add a "hash" column in the table, make it a unique key,
      and use the following sql statement:
      insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
      to add new url.
      >
      I believe this will be faster than making the "url" column unique key
      and doing string comparation. Right?
      >
      However, when I come to Python's builtin hash() function, I found it
      produces different values in my two computers! In a pentium4,
      hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
      hash function depend on machine's word length?
      >
      If it does, I must consider another hash algorithm because the spider
      will run concurrently in several computers, some are 32-bit, some are
      64-bit. Is md5 a good choice? Will it be too slow that I have no
      performance gain than using the "url" column directly as the unique
      key?
      >
      I will do some benchmarking to find it out. But while making my hands
      dirty, I would like to hear some advice from experts here. :)

      nickvatamaniuc | Wed, 02 Jan 2008 22:57:00 GMT |

    8. >>>>Grant Edwards <grante...visi.com(GE) wrote:
      Quote:
      Originally Posted by
      >GEThe low 32 bits match, so perhaps you should just use that
      >GEportion of the returned hash?

      If the hashed should be unique, 32 bits is much too low if you have
      millions of entries.
      --
      Piet van Oostrum <piet...cs.uu.nl>
      URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
      Private email: piet...vanoostrum.org

      pietvanoostrum | Wed, 02 Jan 2008 22:57:00 GMT |

    9. On 2006-07-12, Carl Banks <invalidemail...aerojockey.comwrote:
      Quote:
      Originally Posted by
      Grant Edwards wrote:
      Quote:
      Originally Posted by
      >On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
      >>
      Quote:
      Originally Posted by
      I'm writing a spider. I have millions of urls in a table (mysql) to
      check if a url has already been fetched. To check fast, I am
      considering to add a "hash" column in the table, make it a unique key,
      and use the following sql statement:
      insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
      to add new url.
      >
      I believe this will be faster than making the "url" column unique key
      and doing string comparation. Right?

      >>
      >I doubt it will be significantly faster. Comparing two strings
      >and hashing a string are both O(N).

      >
      Playing Devil's Advocate: The hash would be a one-time operation during
      database insertion, whereas string comparison would happen every
      search.

      Good point.
      Quote:
      Originally Posted by
      Conceivably, comparing hash strings (which is O(1)) could
      result in a big savings compared to comparing regular strings;

      Still, I doubt that the URLs are long enough so that there's a
      significant difference.
      Quote:
      Originally Posted by
      but I expect most decent sql implementations already hash data
      internally, so rolling your own hash would be useless at best.

      Precisely. DB designers and implementers have been working on
      this problem for 30 years. I doubt the OP is going to be able
      to best them with a few minutes work.
      Quote:
      Originally Posted by
      If the OP's database is lacking, md5 is probably fine. Perhaps
      using a subset of the md5 (the low 32 bits, say) could speed
      up comparisons at risk of more collisions. Probably a good
      trade off unless the DB is humungous.

      My advice: do it the simple way first (let the DB handle it).
      Don't try to fix a problem until you know it exists.

      Premature optimization...

      --
      Grant Edwards grante Yow! It's strange, but I'm
      at only TRULY ALIVE when I'm
      visi.com covered in POLKA DOTS and
      TACO SAUCE...

      grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |

    10. On 2006-07-12, Qiangning Hong <hongqn...gmail.comwrote:
      Quote:
      Originally Posted by
      Grant Edwards wrote:
      Quote:
      Originally Posted by
      >On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
      Quote:
      Originally Posted by
      However, when I come to Python's builtin hash() function, I
      found it produces different values in my two computers! In a
      pentium4, hash('a') --468864544; in a amd64, hash('a') ->
      12416037344. Does hash function depend on machine's word
      length?

      >>
      >Apparently. :)
      >>
      >The low 32 bits match, so perhaps you should just use that
      >portion of the returned hash?
      >>
      Quote:
      Originally Posted by
      >>hex(12416037344)

      >'0x2E40DB1E0L'
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffffffffffff)

      >'0xFFFFFFFFE40DB1E0L'
      >>
      Quote:
      Originally Posted by
      >>hex(12416037344 & 0xffffffff)

      >'0xE40DB1E0L'
      Quote:
      Originally Posted by
      >>hex(-468864544 & 0xffffffff)

      >'0xE40DB1E0L'

      >
      Is this relationship (same low 32 bits) guaranteed?

      No, I don't believe so.
      Quote:
      Originally Posted by
      Will it change in the future version?

      It may.

      --
      Grant Edwards grante Yow! Is this an out-take
      at from the "BRADY BUNCH"?
      visi.com

      grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |