I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?
However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?
If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique
key?
I will do some benchmarking to find it out. But while making my hands
dirty, I would like to hear some advice from experts here. :)
Hi!I had found (since only 2 hours) that KixForm can to do used by Python +PyWin32At the end of the ...
By doremichellasido
Hi!I had found (since only 2 hours) that KixForm can to do used by Python +PyWin32At the end of the ...
By do_re_mi_chel_la_si_do
Mike Kent <mrmakent@cox.net> writes:> A bug issue has been opened in the Python Trac system...
By ben_finney
"Ben Finney" <bignose+hates-spam@benfinney.id.au> wrote in messagenews:8763xi7wme.fsf_-_@benfi...
By terry_reedy
On Jan 24, 5:13 pm, "Terry Reedy" <tjre...@udel.edu> wrote:> "Ben Finney" <bignose+hates...
By mike_kent
Hi,I'd like to create a temporaty shell script. Therefore I use tempfilewith NamedTemporaryFile. The...
By christophe_delarue_gmail_com
Hi,I'd like to create a temporaty shell script. Therefore I use tempfilewith NamedTemporaryFile...
By christophe_delarue_gmail
hi. Is the hash() algorithm standard ? Does hash(some_string) will alwaysreturn the same hash code o...
By benotdejean, 7 Comments
hi. Is the hash() algorithm standard ? Does hash(some_string) will alwaysreturn the same hash code o...
By benot_dejean, 7 Comments
Hi,For strings of > 1 character, what are the chancesthat hash(st) and hash(st[::-1]) would retur...
By john_marshall, 9 Comments
I am very new to both programming and Pyhton and while trying to dosome practice using A byte of pyt...
By willkab6_gmail, 3 Comments
Hi,I have a program that makes a call to a function in a different pythonscript that I wrote. But, w...
By seancron, 1 Comments
Is anyone using the Kiwi wrappers for pygtk? Is there an updatedversion of it? The one I can find is...
By laughlin_josephv, 1 Comments
Hi guys,probably dumb question but after googling a lot I couldn't find an answer.with a simple...
By gmtaglia, 2 Comments
Greetings.Im trying to write a program that can be run from the command line.If I want to search for...
By peter_hansen, 6 Comments
"Qiangning Hong" <hongqn...gmail.comwrites:
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?
The hash function is unspecified and can depend on anything the
implementers feel like. It may(?) even be permitted to differ from
one run of the interpreter to another (I haven't checked the spec for
this). Don't count on it being consistent from machine to machine.
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique key?
If you're going to accept the overhead of an SQL database you might as
well enjoy the use of the abstraction it gives you, instead of trying
to implement what amounts to your own form of indexing instead of
letting the db take care of it. But md5(url) is certainly very fast
compared with processing the outgoing http connection that you
presumably plan to open for each url.
That's the right way to answer questions like this.
paulrubin | Wed, 02 Jan 2008 22:50:00 GMT |
On 2006-07-11, Qiangning Hong <hongqn...gmail.comwrote:
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.
>
I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?
I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?
Apparently. :)
The low 32 bits match, so perhaps you should just use that
portion of the returned hash?
'0x2E40DB1E0L'
'0xFFFFFFFFE40DB1E0L'
'0xE40DB1E0L'
'0xE40DB1E0L'
--
Grant Edwards grante Yow! Uh-oh!! I forgot
at to submit to COMPULSORY
visi.com URINALYSIS!
grantedwards | Wed, 02 Jan 2008 22:51:00 GMT |
Grant Edwards wrote:
>
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.
I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?
>
I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).
Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen every
search. Conceivably, comparing hash strings (which is O(1)) could
result in a big savings compared to comparing regular strings; but I
expect most decent sql implementations already hash data internally, so
rolling your own hash would be useless at best.
If the OP's database is lacking, md5 is probably fine. Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
risk of more collisions. Probably a good trade off unless the DB is
humungous.
Carl Banks
carlbanks | Wed, 02 Jan 2008 22:52:00 GMT |
Grant Edwards wrote:
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?
>
Apparently. :)
>
The low 32 bits match, so perhaps you should just use that
portion of the returned hash?
>
'0x2E40DB1E0L'
'0xFFFFFFFFE40DB1E0L'
>
'0xE40DB1E0L'
'0xE40DB1E0L'
Is this relationship (same low 32 bits) guaranteed? Will it change in
the future version?
qiangninghong | Wed, 02 Jan 2008 22:53:00 GMT |
[Grant Edwards]
>The low 32 bits match, so perhaps you should just use that
>portion of the returned hash?
>>
>'0x2E40DB1E0L'
>'0xFFFFFFFFE40DB1E0L'
>>
>'0xE40DB1E0L'
>'0xE40DB1E0L'
[Qiangning Hong]
No. Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.
That's possible, but not planned. Note that the guts of string
hashing in CPython today is implemented via
while (--len >= 0)
x = (1000003*x) ^ *p++;
where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
timpeters | Wed, 02 Jan 2008 22:54:00 GMT |
Qiangning Hong wrote:
at this point, you should have slapped yourself on the forehead, and gone
back to the drawing board.
</F>
fredriklundh | Wed, 02 Jan 2008 22:55:00 GMT |
Using Python's hash as column in the table might not be a good idea.
You just found out why. So you could instead just use the base url and
create an index based on that so next time just quickly get all urls
from same base address then do a linear search for a specific one, or
even easier, implement your own hashes without using any of the
Python's built-in hash() functions. For example, transform each
character to an int and multply them all mod 2^32-1 or something like
that. Even better I think someone already posted the Python's way of
generating hashes for string, well, just re-implement it in Python such
that your version will yield the same hash on any platform.
Hope this helps,
Nick V.
Qiangning Hong wrote:
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.
>
I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?
>
However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?
>
If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique
key?
>
I will do some benchmarking to find it out. But while making my hands
dirty, I would like to hear some advice from experts here. :)
nickvatamaniuc | Wed, 02 Jan 2008 22:57:00 GMT |
>>>>Grant Edwards <grante...visi.com(GE) wrote:
>GEportion of the returned hash?
If the hashed should be unique, 32 bits is much too low if you have
millions of entries.
--
Piet van Oostrum <piet...cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: piet...vanoostrum.org
pietvanoostrum | Wed, 02 Jan 2008 22:57:00 GMT |
On 2006-07-12, Carl Banks <invalidemail...aerojockey.comwrote:
>>
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.
>
I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?
>>
>I doubt it will be significantly faster. Comparing two strings
>and hashing a string are both O(N).
>
Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen every
search.
Good point.
result in a big savings compared to comparing regular strings;
Still, I doubt that the URLs are long enough so that there's a
significant difference.
internally, so rolling your own hash would be useless at best.
Precisely. DB designers and implementers have been working on
this problem for 30 years. I doubt the OP is going to be able
to best them with a few minutes work.
using a subset of the md5 (the low 32 bits, say) could speed
up comparisons at risk of more collisions. Probably a good
trade off unless the DB is humungous.
My advice: do it the simple way first (let the DB handle it).
Don't try to fix a problem until you know it exists.
Premature optimization...
--
Grant Edwards grante Yow! It's strange, but I'm
at only TRULY ALIVE when I'm
visi.com covered in POLKA DOTS and
TACO SAUCE...
grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |
On 2006-07-12, Qiangning Hong <hongqn...gmail.comwrote:
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?
>>
>Apparently. :)
>>
>The low 32 bits match, so perhaps you should just use that
>portion of the returned hash?
>>
>'0x2E40DB1E0L'
>'0xFFFFFFFFE40DB1E0L'
>>
>'0xE40DB1E0L'
>'0xE40DB1E0L'
>
Is this relationship (same low 32 bits) guaranteed?
No, I don't believe so.
It may.
--
Grant Edwards grante Yow! Is this an out-take
at from the "BRADY BUNCH"?
visi.com
grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |