I believe this will be faster than making the "url" column a unique key
and doing string comparison. Right?
However, when I tried Python's builtin hash() function, I found that it
produces different values on my two computers! On a Pentium 4,
hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344. Does the
hash function depend on the machine's word length?
If it does, I must consider another hash algorithm because the spider
will run concurrently on several computers, some 32-bit, some
64-bit. Is md5 a good choice? Will it be so slow that I gain no
performance over using the "url" column directly as the unique
key?
I will do some benchmarking to find out. But while getting my hands
dirty, I would like to hear some advice from the experts here. :)
Quote: Originally Posted by However, when I come to Python's builtin hash() function, I found it produces different values in my two computers! In a pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does hash function depend on machine's word length? |
Quote: Originally Posted by If it does, I must consider another hash algorithm because the spider will run concurrently in several computers, some are 32-bit, some are 64-bit. Is md5 a good choice? Will it be too slow that I have no performance gain than using the "url" column directly as the unique key? |
Quote: Originally Posted by I will do some benchmarking to find it out. |
paulrubin | Wed, 02 Jan 2008 22:50:00 GMT |
Quote: Originally Posted by I'm writing a spider. I have millions of urls in a table (mysql) to check if a url has already been fetched. To check fast, I am considering to add a "hash" column in the table, make it a unique key, and use the following sql statement: insert ignore into urls (url, hash) values (newurl, hash_of_newurl) to add new url. > I believe this will be faster than making the "url" column unique key and doing string comparison. Right? |
Quote: Originally Posted by However, when I come to Python's builtin hash() function, I found it produces different values in my two computers! In a pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does hash function depend on machine's word length? |
The low 32 bits match, so perhaps you should just use that
portion of the returned hash?
--
Grant Edwards, grante at visi.com
Yow! Uh-oh!! I forgot to submit to COMPULSORY URINALYSIS!
grantedwards | Wed, 02 Jan 2008 22:51:00 GMT |
Quote: Originally Posted by On 2006-07-11, Qiangning Hong <hongqn...gmail.com> wrote:
> I doubt it will be significantly faster. Comparing two strings and hashing a string are both O(N). |
If the OP's database is lacking, md5 is probably fine. Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
risk of more collisions. Probably a good trade off unless the DB is
humungous.
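A minimal sketch of the md5 idea discussed here, assuming Python 3 and the standard hashlib module. The helper names url_md5 and url_md5_low32 are hypothetical, and taking the "low 32 bits" to mean the last four bytes of the digest is just one possible choice. Unlike the builtin hash(), the result does not depend on the machine's word length.

```python
import hashlib


def url_md5(url):
    """Return the full 128-bit md5 digest of a URL as 32 hex characters."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def url_md5_low32(url):
    """Keep only 32 bits of the digest (assumption: its last four bytes)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[-4:], "big")


print(url_md5("a"))
print(hex(url_md5_low32("a")))
```

Keep in mind that a 32-bit key collides far sooner than the full digest: by the birthday estimate, a million distinct URLs already give on the order of a hundred colliding pairs, so a truncated hash is probably better suited to a plain index than to a unique key.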
Carl Banks
carlbanks | Wed, 02 Jan 2008 22:52:00 GMT |
Quote: Originally Posted by On 2006-07-11, Qiangning Hong <hongqn...gmail.com> wrote:
> Apparently. :) > The low 32 bits match, so perhaps you should just use that portion of the returned hash? >
'0x2E40DB1E0L'
'0xFFFFFFFFE40DB1E0L' >
'0xE40DB1E0L'
'0xE40DB1E0L' |
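For reference, a sketch of how hex values like the four quoted above can be reproduced, assuming Python 3 (which prints lowercase digits and no trailing L). The two input numbers are the hash('a') results reported earlier in the thread; the variable names are only illustrative.

```python
h32 = -468864544      # hash('a') as reported on the 32-bit Pentium 4
h64 = 12416037344     # hash('a') as reported on the amd64 machine

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

print(hex(h64))            # the 64-bit value as it is
print(hex(h32 & MASK64))   # the 32-bit value, sign-extended to 64 bits
print(hex(h64 & MASK32))   # low 32 bits of the 64-bit value
print(hex(h32 & MASK32))   # low 32 bits of the 32-bit value
```

The last two lines print the same digits, which is the "low 32 bits match" observation.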
qiangninghong | Wed, 02 Jan 2008 22:53:00 GMT |
Quote: Originally Posted by Is this relationship (same low 32 bits) guaranteed? |
Quote: Originally Posted by Will it change in the future version? |
while (--len >= 0)
x = (1000003*x) ^ *p++;
where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
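To see why the low 32 bits happen to agree, here is a rough Python 3 emulation of the loop above with the width of the C "long" made explicit. The function name legacy_str_hash is hypothetical. The start value (first byte shifted left by 7), the final XOR with the length and the -1 to -2 fix-up are not shown in this thread; they are my recollection of the old CPython 2 string hash and should be treated as assumptions. Wrapping the multiplication modulo 2**bits is also an assumption about what the compiler does, since C leaves it undefined.

```python
def legacy_str_hash(s, bits):
    """Emulate the old string hash loop for a C long that is `bits` wide."""
    data = s.encode("latin-1")
    if not data:
        return 0
    mask = (1 << bits) - 1
    x = (data[0] << 7) & mask              # assumed start value
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask  # the loop quoted above, wrapped
    x = (x ^ len(data)) & mask             # assumed final step
    if x >= 1 << (bits - 1):               # reinterpret as a signed long
        x -= 1 << bits
    return -2 if x == -1 else x


print(legacy_str_hash("a", 32))
print(legacy_str_hash("a", 64))
```

With these assumptions the sketch gives -468864544 for 32 bits and 12416037344 for 64 bits, matching the numbers reported in the thread. Since every step is arithmetic modulo a power of two, the 32-bit result is simply the low half of the 64-bit one, but nothing in the language promises it. Newer Python versions also randomize string hashes per process by default, so this only describes the old behaviour.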
timpeters | Wed, 02 Jan 2008 22:54:00 GMT |
Quote: Originally Posted by /.../ add a "hash" column in the table, make it a unique key |
</F>
fredriklundh | Wed, 02 Jan 2008 22:55:00 GMT |
Hope this helps,
Nick V.
Qiangning Hong wrote:
Quote: Originally Posted by I'm writing a spider. I have millions of urls in a table (mysql) to check if a url has already been fetched. To check fast, I am considering to add a "hash" column in the table, make it a unique key, and use the following sql statement: insert ignore into urls (url, hash) values (newurl, hash_of_newurl) to add new url. > I believe this will be faster than making the "url" column unique key and doing string comparison. Right? > However, when I come to Python's builtin hash() function, I found it produces different values in my two computers! In a pentium4, hash('a') -> -468864544; in a amd64, hash('a') -> 12416037344. Does hash function depend on machine's word length? > If it does, I must consider another hash algorithm because the spider will run concurrently in several computers, some are 32-bit, some are 64-bit. Is md5 a good choice? Will it be too slow that I have no performance gain than using the "url" column directly as the unique key? > I will do some benchmarking to find it out. But while making my hands dirty, I would like to hear some advice from experts here. :) |
nickvatamaniuc | Wed, 02 Jan 2008 22:57:00 GMT |
Quote: Originally Posted by GE> The low 32 bits match, so perhaps you should just use that portion of the returned hash? |
pietvanoostrum | Wed, 02 Jan 2008 22:57:00 GMT |
Quote: Originally Posted by Grant Edwards wrote:
> Playing Devil's Advocate: The hash would be a one-time operation during database insertion, whereas string comparison would happen every search. |
Quote: Originally Posted by Conceivably, comparing hash strings (which is O(1)) could result in a big savings compared to comparing regular strings; |
Quote: Originally Posted by but I expect most decent sql implementations already hash data internally, so rolling your own hash would be useless at best. |
Quote: Originally Posted by If the OP's database is lacking, md5 is probably fine. Perhaps using a subset of the md5 (the low 32 bits, say) could speed up comparisons at risk of more collisions. Probably a good trade off unless the DB is humungous. |
Premature optimization...
--
Grant Edwards, grante at visi.com
Yow! It's strange, but I'm only TRULY ALIVE when I'm covered in POLKA DOTS and TACO SAUCE...
grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |
Quote: Originally Posted by Grant Edwards wrote:
> Is this relationship (same low 32 bits) guaranteed? |
Quote: Originally Posted by Will it change in the future version? |
--
Grant Edwards, grante at visi.com
Yow! Is this an out-take from the "BRADY BUNCH"?
grantedwards | Wed, 02 Jan 2008 22:59:00 GMT |