Have you ever thought about building a full-featured search engine that works like Google or Bing? Google became one of the largest companies on the Internet within a very short span of time, and most web entrepreneurs have been fascinated by its success as an organization. On the technical side, how does Google manage to be so fast and powerful? How does Google handle fault tolerance? Where does Google store the data for all those billions of web pages? Can you create a search engine like Google? If so, how?
Well, to build a search engine like Google, you need to understand several aspects. First of all, building a search engine like Google cannot be done overnight. It takes weeks or years to crawl and store the data and to rank the results if you want to cover virtually the whole web. But you should usually be able to start generating search results within a couple of weeks.
Where do you store the data? Where does Google store the data? Google has a special NoSQL database called BigTable where it stores its search data. BigTable runs on top of a dependable distributed file system: Google's own file system in Google's case, with HDFS as the open source counterpart. Such a file system supports distributed computing across tens of thousands of nodes attached to the system.
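To make the storage idea a bit more concrete, here is a minimal sketch of a BigTable-style data model: a sorted map where each value is addressed by a row key, a column key and a timestamp. This is only an in-memory toy for illustration. `TinyTable`, `row_key` and the `contents:html` column name are made-up names, not part of any real BigTable, HBase or Hypertable API.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory stand-in for a BigTable-style table (hypothetical API)."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell; older versions are kept."""
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is missing."""
        cells = self._rows.get(row, {}).get(column, [])
        return cells[0][1] if cells else None

    def scan(self, prefix):
        """Yield row keys that start with `prefix`, in sorted order."""
        for row in sorted(self._rows):
            if row.startswith(prefix):
                yield row


def row_key(host):
    """Reverse the host name so pages of one domain sort next to each other."""
    return ".".join(reversed(host.split(".")))


table = TinyTable()
table.put(row_key("www.example.com"), "contents:html", "<html>v1</html>", timestamp=1)
table.put(row_key("www.example.com"), "contents:html", "<html>v2</html>", timestamp=2)
table.put(row_key("blog.example.com"), "contents:html", "<html>blog</html>", timestamp=1)
table.put(row_key("www.example.org"), "contents:html", "<html>org</html>", timestamp=1)

print(table.get("com.example.www", "contents:html"))  # newest stored version
print(list(table.scan("com.example.")))               # all rows of one domain
```

Reversing the host name is a design choice that keeps all pages of one domain in adjacent rows, so a single prefix scan can read them together.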
If you want to build a search engine from scratch in Python or PHP to power, for example, your own site search, you can do it almost for free after taking a few courses on Udemy or edX. In fact, it costs you around $100 (if we take paid courses into account).
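For a small site search, the core can be sketched in a few lines of Python. The example below is a minimal sketch, assuming the pages are already available as plain text. It builds an inverted index (term to pages) and ranks matches by a simple term count. The page paths, the `build_index` and `search` helpers and the scoring rule are all illustrative assumptions, not a recommendation for production ranking.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> {url: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, query):
    """Return pages containing every query term, best score first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    # Score = total occurrences of the query terms; ties sort by URL.
    scored = [(sum(p[url] for p in postings), url) for url in matches]
    return [url for score, url in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical pages of a small website.
pages = {
    "/pricing": "Pricing plans for the search engine",
    "/docs": "Search engine docs: how the search index is built",
    "/about": "About our company",
}
index = build_index(pages)
print(search(index, "search engine"))
```

A real site search would add stemming, stop words and a better scoring function, but the index-then-lookup structure stays the same.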
If you would like to build a search engine like Google (with adequate search quality) and are curious about the price, we would say it costs about $100M (for a prototype) from scratch. That includes the cost of servers, bandwidth, colocation, electricity and so on. Incidentally, renting servers is not cost-effective. Maintaining such a cluster costs you roughly $25M each year.
We would say that there is no sense in creating yet another global search engine. If you would like to build a search engine, we suggest that you bring something new to the market.
If you would like to build a custom commercial search engine for your business (insurance, bioinformatics, healthcare, e-commerce, finance and others), the development costs you $10,000 to $60,000, with a low maintenance fee.
What Technology Should I Use?
You cannot run Google on MySQL. Period. Not even on Oracle, if you are aiming for international-scale support. You need something very similar to BigTable that works on top of a file system such as HDFS. But BigTable is Google's proprietary technology: it is not open source and not readily available to the public, although a hosted version of it has been made available on Google Cloud.
Hadoop: Hadoop is a collection of big data components and tools, including HDFS, which is widely considered the best distributed file system available now. Hadoop is open source and is actively developed under Apache. HDFS is the best file system you can use to run highly scalable, multi-machine software such as internet search engines, analytics and so on. Hadoop enables you to join tens of thousands of nodes together to work as one scalable file system.
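To illustrate how a distributed file system can spread one large file over many nodes, here is a small sketch. It splits a file into fixed-size blocks and assigns each block to several nodes, so that losing one node does not lose data. The round-robin placement in `place_blocks` is a deliberate simplification chosen for this example. Real HDFS uses its own placement policy, and the node names and sizes below are made up.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Return a list of (block_index, [nodes holding a replica of the block])."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = -(-file_size // block_size)  # ceiling division
    placement = []
    for block in range(block_count):
        # Simplified round-robin placement: consecutive nodes hold the replicas.
        replicas = [nodes[(block + i) % len(nodes)] for i in range(replication)]
        placement.append((block, replicas))
    return placement


# Hypothetical cluster: a 300 MB file, 128 MB blocks, four nodes.
layout = place_blocks(file_size=300, block_size=128, nodes=["n1", "n2", "n3", "n4"])
for block, replicas in layout:
    print(block, replicas)
```

Because every block lives on three different nodes in this sketch, any single node can fail and each block still has two copies left.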
HBase: HBase is the database that runs on top of Hadoop's HDFS. It is written in Java and regarded as a dependable database. Like Hadoop, HBase is maintained by Apache!
Hypertable: Hypertable is written in C++, and the Hypertable company claims that its performance is much faster than HBase's. Hypertable support is also very good, and it is more flexible with queries than HBase.
Therefore, to run a Google clone, you will use either Hadoop + HBase or Hadoop + Hypertable.
What Hardware Can I Use?
Obviously, you don't want to start with your own datacenter initially. Google has its own ever-expanding datacenters across the world. The best way to begin is to partner with a datacenter or hosting company that can provide a set of nodes (computers) in one network. The key reason for wanting the nodes in a single network is that, as you add more nodes to a scalable distributed system in the future, keeping the nodes in the same physical network can significantly improve the performance of your search engine.
How Can I Code a Google Clone Program?
Here comes the trickiest and most interesting part of your journey to build a Google clone search engine. Whatever technology and infrastructure you pick, if the code isn't robust and designed to handle scalability, your spider will not be effective enough. I am not able to cover the details of the application logic and algorithms needed to build a spider here. However, the diagram below, found on Inout Spider, will give you a really good idea of the major components required to build a spider. Inout Spider is a commercial application (widely considered a powerful search engine spider application, and a typical Google clone script) that works on Hadoop and Hypertable technologies.
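To give a rough feel for the core loop of a spider, here is a minimal sketch: a frontier queue of URLs to visit, a set of URLs already seen, and a function that returns the links found on a page. It is an illustration under stated assumptions, not how Inout Spider or Google works. `crawl`, `fetch_links` and `FAKE_WEB` are hypothetical names, and the "web" is a small in-memory dictionary so the example runs without network access.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl; fetch_links(url) returns the URLs a page links to."""
    frontier = deque([seed])  # URLs waiting to be visited
    seen = {seed}             # URLs already queued, to avoid revisiting
    order = []                # URLs in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# A fake web (page -> outgoing links) standing in for real HTTP fetches.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

print(crawl("a", lambda url: FAKE_WEB.get(url, [])))
```

In a scalable spider the frontier and the seen set would live in a distributed store rather than in one process, and fetching would respect robots.txt and per-host rate limits.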
Building a search engine like Google isn't a simple endeavor, or else we would have seen many Google clones online. However, with the right technologies, hardware and software (your own, or commercial software like Inout Spider), your dream is attainable.