zootreeves Newbie
Joined: 10 Dec 2005 Posts: 8
Posted: Mon Mar 06, 2006 7:00 pm
Now that I've got my search engine prototype completed, I need to invest in some good hardware. Can anyone recommend any servers with high storage (above 600 GB) that also have a powerful CPU? Or do you think I'm better off going for lots of smaller servers, like Google?
I currently have 200,000 files in my database (it's a file search engine, so each file can be as big as 600 MB, though most are about 1-2 MB). I hope to expand to about 10x this, so I'm going to need huge storage.
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Mon Mar 13, 2006 4:22 pm
If you continue to expand, you will have to use many small servers sooner or later. Using one big server will not scale economically, because you don't get a doubling in performance when you double the price.
The problem is that in order to use many computers in parallel, instead of one, you have to create a system for them to work together.
For example, for searching, each server could hold a portion of all the documents and an index for them. You can then build a meta search engine that queries all your servers and combines the results.
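As a rough illustration only (the shard layout, data and function names here are all made up, not Boitho's actual code), such a scatter-gather meta search might look like this:

```python
# Toy "meta search": fan the query out to several index shards,
# then merge the per-shard hits into one ranked result list.
import heapq

# Pretend each shard maps a term to a list of (score, doc_id) postings.
SHARDS = [
    {"linux": [(0.9, "a1"), (0.4, "a2")]},
    {"linux": [(0.7, "b1")]},
    {"linux": [(0.8, "c1"), (0.2, "c2")]},
]

def search_shard(shard, term, k):
    # Each shard returns only its own best k hits.
    return heapq.nlargest(k, shard.get(term, []))

def meta_search(term, k=3):
    # Query every shard, then merge the per-shard top-k lists.
    hits = []
    for shard in SHARDS:
        hits.extend(search_shard(shard, term, k))
    return heapq.nlargest(k, hits)
```

In a real system the per-shard query would be a network call to another server, but the merge step stays the same.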
Creating such a system takes time, so many end up using one powerful server to get the first version up and running, then change it to use many small servers later.
At Boitho we use Super Micro 5014C-MT 1U servers to hold the data and do the searching. See http://www.sirdf.com/forum/index.php?showtopic=23 . The servers have Intel P4 or Intel dual-core processors.
If you need a cheap server with a lot of processor power, take a look at the Dell PowerEdge SC1425. A setup with two Xeon processors is relatively cheap. Unfortunately it only has space for two hard drives.
_________________ CTO @ Searchdaimon company search.
zootreeves Newbie
Joined: 10 Dec 2005 Posts: 8
Posted: Wed Mar 15, 2006 12:54 am
Thanks for the advice. I'm not sure which route to go down. One option is a few individual servers, each running a crawler, indexer and query server, plus one server to handle all outside requests and query all the servers at once (a meta search type thing). If I went with this option I would go for the PowerEdge SC1425.
The other option is one big storage server, a separate crawler/indexer, and two or more query servers.
If I go for the second option I was looking at these: http://cgi.ebay.com/ws/eBayISAPI.dll?ViewI...Y_BIN_Stores_IT for the query and indexer servers, and maybe something like this: http://cgi.ebay.com/4-Terabyte-RAID-Storag...1QQcmdZViewItem for the storage server.
What do you think?
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Fri Mar 17, 2006 2:21 am
Search speed can be a problem using one big server. If you for example use 5 small servers, you have 5 CPUs working on the problem in parallel, each working on 1/5 of the problem.
If you use one big server, you either have to use only one processor at a time. Even if you get the most expensive one, it won't be as powerful as 5 cheap Intel P4s working in parallel.
Or you can go for a server with more than one processor and use threads or fork to work in parallel, but then you have to first split up the problem and then merge the results. Getting that to work will probably be as hard as writing a system based on parallel servers.
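A minimal sketch of that split/merge pattern on a single multi-processor box (the names and the toy scoring rule are invented; threads are used here for brevity, though in practice you would fork worker processes for CPU-bound scoring):

```python
# Split the document set into partitions, score each partition
# concurrently, then merge the partial results into one ranked list.
from concurrent.futures import ThreadPoolExecutor

DOCS = list(range(1, 101))  # stand-in document IDs

def score_partition(part):
    # Toy scoring rule: documents divisible by 7 "match" the query.
    return [(d, f"doc-{d}") for d in part if d % 7 == 0]

def parallel_search(docs, workers=5):
    # Split step: carve the collection into one chunk per worker.
    size = (len(docs) + workers - 1) // workers
    parts = [docs[i:i + size] for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(score_partition, parts))
    # Merge step: flatten the partial hit lists and re-rank.
    merged = [hit for part in partials for hit in part]
    merged.sort(reverse=True)  # best "score" first
    return merged
```

The same split and merge logic is what a multi-server setup needs too, which is why the two approaches end up being comparable amounts of work.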
Another problem is disk I/O. If you use separate servers, each only has to work on local hard drives. If you have one query server and one storage server, you have to move a lot of data across the network. This can be slow, especially if you plan to use NFS.
Also, with separate servers you can grow your search engine by adding new servers. And if one of the servers is down, your service isn't down; you can still return answers from the remaining servers.
The conclusion is that separate servers are cheaper, more scalable and more robust, but one big server is easier to develop for. At Boitho we first used one dual Xeon server with an SM 5014C-MT for storage over NFS. Then we switched so that the SM 5014C-MT did the searching.
_________________ CTO @ Searchdaimon company search.
zootreeves Newbie
Joined: 10 Dec 2005 Posts: 8
Posted: Fri Mar 17, 2006 6:25 pm
I agree with you that it's more scalable to use several small servers.
But say you are skimming the top 10 results from 3 servers; this is not going to be the top 30 most relevant results. If one server has crawled the part of the web most relevant to the search term, then result 11 on that server could still be more relevant than the first result on another server, yet it will not be returned. Do you have a way to get round this at Boitho?
One solution, if you wanted the top 30 results, could be to query the 3 servers for their top 20 each, then combine the results and discard the last 30. But this would not be very efficient, and you could still miss more relevant results.
I am using CLucene for my database, so it cannot span more than one hard drive unless I use a RAID array, but then I can't use separate servers. I might look into a distributed file system like Lustre.
runarb Site Admin
Joined: 29 Oct 2006 Posts: 4
Posted: Sun Mar 19, 2006 4:03 am
Merging the results to get the best is, as you have identified, one of the problems with a parallel architecture. At Boitho we now just get the same number of results from each node as we are going to display. So to display 20 results, each node has to return 20 results.
This is flexible and easy. As long as at least one node is up, we can serve some results.
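A quick sketch of why fetching k results from every node is enough (toy data, assuming each node ranks its own documents exactly): every document in the true global top k lives on some node, and on that node at most k-1 documents can outrank it, so it is always inside that node's local top k.

```python
# Demonstration: taking the top k from each node and re-ranking the
# candidates reproduces the exact global top k.
import heapq
import random

random.seed(1)
docs = [(random.random(), f"doc{i}") for i in range(90)]
nodes = [docs[i::3] for i in range(3)]  # spread the docs over 3 nodes

def top_k_from_nodes(nodes, k):
    candidates = []
    for node in nodes:
        candidates.extend(heapq.nlargest(k, node))  # k per node
    return heapq.nlargest(k, candidates)            # global re-rank
```

This is why asking 3 servers for their top 20 can miss results when you want the top 30, but asking each for the full 30 (or each for the 20 you will display) cannot.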
Quote: "so it cannot span more than one hard drive unless I use a raid array"
You can use LVM (the Logical Volume Manager) on Linux to make one virtual disk out of many physical disks.
I don't think using a distributed file system is especially smart. What happens if one hard disk crashes? Does the entire file system then become corrupted? It is also probably pretty slow.
Making a system with independent nodes will be more flexible. The "shared nothing" architecture, where no node shares anything with any other node, is the cheapest architecture on the planet.
_________________ CTO @ Searchdaimon company search.
masidani Member
Joined: 10 Jan 2006 Posts: 23
Posted: Fri Mar 24, 2006 11:41 pm
There's a very interesting discussion on WebmasterWorld about the server hardware needed for high-traffic sites, if it's of any interest. Find it here: http://www.webmasterworld.com/forum23/4496.htm.
Simon