sirdf.com Search & Information Retrieval Development Forum
DataGuy (Newbie)
Joined: 17 Nov 2004, Posts: 3
Posted: Wed Nov 17, 2004 6:46 pm
Hello,
I was just referred here from WebmasterWorld. I would love to be able to converse openly with other SE operators, so I'm glad to be here...
I operate a string of SEs, my most popular site being SearchSight.com. My sites run mostly on Windows-based machines, so I'm in an uphill battle from the start.
I've been doing this for 6 years, and the SE business has been good to me. I have always considered it somewhat of a hobby, though I do have employees and, at my wife's prompting, I try to run it as a real business.
I'm fascinated with data aggregation, and I hope to be able to contribute something to the Internet on a worldwide scale. I just haven't been able to do it yet!
Which SE sites are represented here?
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Thu Nov 18, 2004 10:51 am
I run www.boitho.com, and Fischerlaender runs www.neomo.de. Both are algorithmic, crawler-based search engines built on our own file structures (not an SQL database), written in Perl and C.
What languages are you developing in?
(Please don't judge me based on the search engine that you find at www.boitho.com; it is a 1.5-year-old demo we made to show potential investors how thumbnail pictures can be used in search engines.)
_________________
CTO @ Searchdaimon company search.
DataGuy (Newbie)
Joined: 17 Nov 2004, Posts: 3
Posted: Thu Nov 18, 2004 3:37 pm
Quote: "What languages are you developing in?"
Well, please don't judge me based on the languages that we use!
Our crawler uses Visual Basic, mostly because of the thumbnail image retrieval; speed has not been something we've been concerned with. You just can't get very fast if you have to download and render an entire web page instead of just its source code.
Our database runs on SQL Server... again, probably the worst choice for running a search engine. We are testing an index manager from Surfinity that makes searching on SQL Server much more efficient, and we hope to have Microsoft's full-text search replaced with this product within the next few days.
Ease of development has been the primary concern up until this point. Since I am in charge of programming, marketing, and everything in between, I don't have the time to spend developing the fastest system.
I do have some pretty good marketing systems in place right now, and I'd be interested in working with someone to create a new search engine based on a new platform. Anyone interested?
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sat Apr 08, 2006 11:53 pm
Well, a belated reply to this post, but what the heck! :rolleyes:
I've ended up writing the system for searchserf.market-uk.com.
I know... it's on a subdomain. :rolleyes: I'll get a domain name for it shortly...
See, I actually just got side-tracked while writing myself a little app for something else, and somehow ended up with this thing!
Anyhow, I've been finding it really fascinating, so I've stuck with it and am doing it as a hobby, with a view to perhaps creating a job for myself with it so I can quit chasing the end of the month and start chasing my dreams, man!!!
I'm actually quite shocked I even got this far, having had zero idea about search engines other than some basic SEO, etc.
So basically it's 4 pieces of software I wrote with Delphi. One does the crawling, with seeds added from another piece that handles submissions (as well as sending out confirmation emails); another handles the indexing on keywords; and yet another monitors that everything is up and running and emails me reports daily.
It's running on one of those ol' wind-up computers, so it's sure gonna be interesting to see it run on its own server one day! And, of course, a whole lot bigger connection would work wonders! But... one step at a time, I suppose. I'm learning as I go, sort of really just dreaming up how it should work. It's really such a trip doing it - I wish I had a job that was this much fun!!! :blink:
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon Apr 10, 2006 12:08 am
I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi.
How many pages have you crawled so far?
_________________
CTO @ Searchdaimon company search.
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sun Apr 23, 2006 4:00 pm
Quote (runarb @ Apr 10 2006, 12:08 AM): "I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi. How many pages have you crawled so far?"
Hi Runarb,
I've had a couple of test runs... debugging can be a pain, as you know.
Considering I first just banged a bit of code together, really, I've been learning more and more from the experiment as I go along, and I adapt the thing further as I go.
The crawler does maybe up to 60,000 pages or a bit more a day - not good at all, but it's doing too much of the work: filtering unwanted sites, parsing entire pages, and feeding itself URLs from every page it downloads. The backend is MySQL, running on the same PC as the crawler. So not a fast setup by any standard.
But the results are encouraging enough for me to be planning a complete rewrite based upon the things I've learned so far from its behaviour.
For example, I'd like to perhaps multi-thread it so that it can be caching pages while waiting for other pages to download. At the moment, it waits for the page to download, then caches it, then gets the next URL to parse, waiting for MySQL to execute the query, etc. etc.
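The wait-download-then-process loop is exactly what multithreading overlaps. A minimal Python sketch of the idea (the `fetch` function here is a hypothetical stand-in that simulates network latency with `sleep`; a real crawler would do an HTTP GET):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a page download: time.sleep simulates the
# network wait, which is exactly what worker threads let you overlap.
def fetch(url):
    time.sleep(0.1)
    return url, "<html>page body for %s</html>" % url

urls = ["http://example.com/%d" % i for i in range(20)]

# With 10 worker threads, 20 simulated 0.1 s downloads overlap instead
# of running back to back: ~0.2 s total instead of ~2 s serially.
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = dict(pool.map(fetch, urls))
elapsed = time.time() - start
```

Since the threads spend nearly all their time blocked on I/O (here, sleeping), even a modest pool gives close to a 10x speedup over the serial loop.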
Another thing I'd like to do is have it crawl only a limited-length route from the initial seeds, e.g. max xyz links away from a seed. The thing is, I'm finding that the further away from a good seed you go, the more 404s you find! I was actually quite shocked to see how many outdated links many people have on their sites!
Any suggestions?
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon Apr 24, 2006 1:56 pm
Just having one thread that does the crawling serially won't be optimal.
You should look into either using multithreading, as you mention, or asynchronous I/O.
You can implement asynchronous I/O by having a large array of non-blocking sockets, sending a request through each, one by one, but not waiting for any of them to finish - just go on to the next one. Then start back at the first and check whether all its data has come in. If it has, give that socket another page to download.
Unfortunately this is more complicated, and may be tricky to get working.
Also see http://en.wikipedia.org/wiki/Asynchronous_I/O
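A rough illustration of that scheme in Python, using the standard `selectors` module. To keep the example self-contained it uses local socketpairs rather than real HTTP connections; in a crawler each registered socket would be `connect()`ed to a web server, and the slot names are made up for the demo:

```python
import selectors
import socket

# Sketch of the "array of non-blocking sockets" idea: register several
# sockets with a selector, then service whichever ones are ready instead
# of waiting on each in turn.
sel = selectors.DefaultSelector()
writers = []
for slot in range(3):
    w, r = socket.socketpair()
    r.setblocking(False)  # never block waiting on one socket
    sel.register(r, selectors.EVENT_READ, data="slot-%d" % slot)
    writers.append(w)

# Pretend responses arrive on two of the three slots.
writers[0].send(b"HTTP/1.0 200 OK\r\n")
writers[2].send(b"HTTP/1.0 404 Not Found\r\n")

# One select() call reports every ready socket, so we read only those
# and leave the idle slot alone - no per-socket waiting.
received = {}
for key, _events in sel.select(timeout=1.0):
    received[key.data] = key.fileobj.recv(4096)

for w in writers:
    w.close()
sel.close()
```

The same loop structure scales to hundreds of sockets: after draining a ready socket you would parse the response and hand that slot the next URL from the queue.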
_________________
CTO @ Searchdaimon company search.
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sat May 06, 2006 4:15 am
Thanks runarb!
How many sockets in the array would you recommend, or should I just play around and monitor the difference in results?
Also, how would you recommend feeding and finding good seeds? I am thinking of keeping a scoreboard of all links spawned from crawled pages and cutting entire chains of those that fall below a certain score (e.g. site "A" gives x links, of which y spawn y2 good links and z spawn z2 bad links (404s etc.)).
I'm definitely going to do a complete rewrite of the crawler, so I'm really keen on hearing any ideas I can before I start. B)
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon May 08, 2006 4:25 am
Try 500 to start, and see how that performs. Then adjust until you have good CPU utilization.
For seeds, most search engines use the Open Directory RDF dump from http://rdf.dmoz.org/ . It is an XML-like file containing all the links in DMOZ.
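Extracting seed URLs from it can be sketched like this; the snippet below only imitates the layout of the dump's ExternalPage entries (the URLs are invented), and since the real file is huge and not always strictly valid XML, a line-oriented regex is a common shortcut over a full XML parser:

```python
import re

# A small excerpt imitating the layout of the ODP RDF dump: each
# ExternalPage element carries a listed URL in its about attribute.
sample = '''
<Topic r:id="Top/Computers/Internet/Searching">
  <link r:resource="http://www.example-search.com/"/>
</Topic>
<ExternalPage about="http://www.example-search.com/">
  <d:Title>Example Search</d:Title>
</ExternalPage>
<ExternalPage about="http://www.another-engine.org/">
  <d:Title>Another Engine</d:Title>
</ExternalPage>
'''

# Pull every listed URL out as a crawl seed, in document order.
seeds = re.findall(r'<ExternalPage about="([^"]+)">', sample)
```

On the full dump you would apply the same pattern while streaming the file line by line, rather than loading it all into memory.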
_________________
CTO @ Searchdaimon company search.
Powered by phpBB © 2001, 2005 phpBB Group