sirdf.com Search & Information Retrieval Development Forum
DataGuy (Newbie)
Joined: 17 Nov 2004, Posts: 3
Posted: Wed Nov 17, 2004 6:46 pm
Hello,
I was just referred here from WebmasterWorld. I would love to be able to converse openly with other SE operators, so I'm glad to be here...
I operate a string of SEs, my most popular site being SearchSight.com. My sites run mostly on Windows-based machines, so I'm in an uphill battle from the start.
I've been doing this for 6 years, and the SE business has been good to me. I have always considered it somewhat of a hobby, though I do have employees and, at my wife's prompting, I try to run it as a real business.
I'm fascinated with data aggregation, and I hope to be able to contribute something to the Internet on a worldwide scale. I just haven't been able to do it yet!
Which SE sites are represented here?
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Thu Nov 18, 2004 10:51 am
I run www.boitho.com, and Fischerlaender runs www.neomo.de. Both are algorithmic, crawler-based search engines built on our own file structures (not an SQL database), written in Perl and C.
What languages are you developing in?
(Please don't judge me based on the search engine that you find at www.boitho.com; it is a 1.5-year-old demo we made to show potential investors how thumbnail pictures can be used in search engines.)
_________________
CTO @ Searchdaimon company search.
DataGuy (Newbie)
Joined: 17 Nov 2004, Posts: 3
Posted: Thu Nov 18, 2004 3:37 pm
Quote: "What languages are you developing in?"
Well, please don't judge me based on the languages that we use!
Our crawler uses Visual Basic, mostly because of the thumbnail image retrieval; speed has not been something we've been concerned with. You just can't get very fast if you have to download and render an entire web page instead of just its source code.
Our database runs on SQL Server... again, probably the worst choice for running a search engine. We are testing an index manager from Surfinity that makes searching on SQL Server much more efficient, and we hope to have Microsoft's full-text search replaced with this product within the next few days.
Ease of development has been the primary concern up until this point. Since I am in charge of programming, marketing, and everything in between, I don't have the time to spend developing the fastest system.
I do have some pretty good marketing systems in place right now, and I'd be interested in working with someone to create a new search engine based on a new platform. Anyone interested?
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sat Apr 08, 2006 11:53 pm
Well, a belated reply to this post, but what the heck! :rolleyes:
I've ended up writing the system for searchserf.market-uk.com.
I know... it's on a subdomain. :rolleyes: I'll get a domain name for it shortly...
See, I actually just got side-tracked while writing myself a little app for something else, and somehow ended up with this thing!
Anyhow, I've been finding it really fascinating, so I've stuck with it and am doing it as a hobby, with a view to perhaps creating a job for myself with it so I can quit chasing the end of the month and start chasing my dreams, man!!!
I'm actually quite shocked I even got this far, having had zero idea about search engines other than some basic SEO, etc.
So basically it's 4 pieces of software I wrote with Delphi. One does the crawling, with seeds added from another piece that handles submissions (as well as sending out confirmation emails); another handles the indexing on keywords; and yet another monitors that everything is up and running and emails me reports daily.
It's running on one of those ol' wind-up computers, so it's sure gonna be interesting to see it run on its own server one day! And, of course, a whole lot bigger connection would work wonders! But... one step at a time, I suppose. I'm learning as I go, sort of really just dreaming up how it should work. It's really such a trip doing it - I wish I had a job that was this much fun!!! :blink:
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon Apr 10, 2006 12:08 am
I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi.
How many pages have you crawled so far?
_________________
CTO @ Searchdaimon company search.
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sun Apr 23, 2006 4:00 pm
Quote (runarb @ Apr 10 2006, 12:08 AM): "I have been using some Delphi too. The crawler for boitho.com, for example, is in Delphi. How many pages have you crawled so far?"
Hi Runarb,
I've had a couple of test runs... debugging can be a pain, as you know.
Considering I first just banged a bit of code together, really, I've been learning more and more from the experiment as I go along, and I adapt the thing further as I go.
The crawler does maybe up to 60,000 pages or a bit more a day - not good at all, but it's doing too much of the work: filtering unwanted sites, parsing entire pages, and feeding itself URLs from every page it downloads. The backend is MySQL, running on the same PC as the crawler. So not a fast setup by any standard.
But the results are encouraging enough for me to be planning a complete rewrite based upon the things I've learned so far from its behaviour.
For example, I'd like to perhaps multi-thread it so that it can be caching pages while waiting for other pages to download. At the moment, it waits for the page to download, then caches it, then gets the next URL to parse, waiting for MySQL to execute the query, etc. etc.
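The wait-download-then-process loop is exactly what multithreading overlaps. A minimal Python sketch of the idea (the `fetch` function here is a hypothetical stand-in that simulates network latency with `sleep`; a real crawler would do an HTTP GET):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a page download: time.sleep simulates the
# network wait, which is exactly what worker threads let you overlap.
def fetch(url):
    time.sleep(0.1)
    return url, "<html>page body for %s</html>" % url

urls = ["http://example.com/%d" % i for i in range(20)]

# With 10 worker threads, 20 simulated 0.1 s downloads overlap instead
# of running back to back: ~0.2 s total instead of ~2 s serially.
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = dict(pool.map(fetch, urls))
elapsed = time.time() - start
```

Since the threads spend nearly all their time blocked on I/O (here, sleeping), even a modest pool gives close to a 10x speedup over the serial loop.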
Another thing I'd like to do is have it crawl only a limited-length route from the initial seeds, e.g. max xyz links away from a seed. The thing is, I'm finding that the further away from a good seed you go, the more 404s you find! I was actually quite shocked to see how many outdated links many people have on their sites!
Any suggestions?
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon Apr 24, 2006 1:56 pm
Just having one thread that does the crawling serially won't be optimal.
You should look into either using multithreading, as you mention, or asynchronous I/O.
You can implement asynchronous I/O by having a large array of non-blocking sockets, sending a request through each, one by one, but not waiting for any of them to finish - just go on to the next one. Then start back at the first and check whether all its data has come in. If it has, give that socket another page to download.
Unfortunately this is more complicated, and may be tricky to get working.
Also see http://en.wikipedia.org/wiki/Asynchronous_I/O
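A rough illustration of that scheme in Python, using the standard `selectors` module. To keep the example self-contained it uses local socketpairs rather than real HTTP connections; in a crawler each registered socket would be `connect()`ed to a web server, and the slot names are made up for the demo:

```python
import selectors
import socket

# Sketch of the "array of non-blocking sockets" idea: register several
# sockets with a selector, then service whichever ones are ready instead
# of waiting on each in turn.
sel = selectors.DefaultSelector()
writers = []
for slot in range(3):
    w, r = socket.socketpair()
    r.setblocking(False)  # never block waiting on one socket
    sel.register(r, selectors.EVENT_READ, data="slot-%d" % slot)
    writers.append(w)

# Pretend responses arrive on two of the three slots.
writers[0].send(b"HTTP/1.0 200 OK\r\n")
writers[2].send(b"HTTP/1.0 404 Not Found\r\n")

# One select() call reports every ready socket, so we read only those
# and leave the idle slot alone - no per-socket waiting.
received = {}
for key, _events in sel.select(timeout=1.0):
    received[key.data] = key.fileobj.recv(4096)

for w in writers:
    w.close()
sel.close()
```

The same loop structure scales to hundreds of sockets: after draining a ready socket you would parse the response and hand that slot the next URL from the queue.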
_________________
CTO @ Searchdaimon company search.
scolls (Newbie)
Joined: 08 Apr 2006, Posts: 8
Posted: Sat May 06, 2006 4:15 am
Thanks runarb!
How many sockets in the array would you recommend, or should I just play around and monitor the difference in results?
Also, how would you recommend feeding and finding good seeds? I am thinking of keeping a scoreboard of all links spawned from crawled pages and cutting entire chains of those that fall below a certain score (e.g. site "A" gives x links, of which y spawn y2 good links and z spawn z2 bad links (404s etc.)).
I'm definitely going to do a complete rewrite of the crawler, so I'm really keen on hearing any ideas I can before I start. B)
_________________
WebWobot Search Engine | MassDebation.com ~ No Ordinary Debate!
runarb (Site Admin)
Joined: 29 Oct 2006, Posts: 4
Posted: Mon May 08, 2006 4:25 am
Try 500 to start, and see how that performs. Then adjust until you have good CPU utilization.
For seeds, most search engines use the Open Directory RDF dump from http://rdf.dmoz.org/ . It is an XML-like file containing all the links in DMOZ.
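Extracting seed URLs from it can be sketched like this; the snippet below only imitates the layout of the dump's ExternalPage entries (the URLs are invented), and since the real file is huge and not always strictly valid XML, a line-oriented regex is a common shortcut over a full XML parser:

```python
import re

# A small excerpt imitating the layout of the ODP RDF dump: each
# ExternalPage element carries a listed URL in its about attribute.
sample = '''
<Topic r:id="Top/Computers/Internet/Searching">
  <link r:resource="http://www.example-search.com/"/>
</Topic>
<ExternalPage about="http://www.example-search.com/">
  <d:Title>Example Search</d:Title>
</ExternalPage>
<ExternalPage about="http://www.another-engine.org/">
  <d:Title>Another Engine</d:Title>
</ExternalPage>
'''

# Pull every listed URL out as a crawl seed, in document order.
seeds = re.findall(r'<ExternalPage about="([^"]+)">', sample)
```

On the full dump you would apply the same pattern while streaming the file line by line, rather than loading it all into memory.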
_________________
CTO @ Searchdaimon company search.
Powered by phpBB © 2001, 2005 phpBB Group