Google’s Analogy
Altamash R. Jiwani, Government College of Engineering, Amravati
Abstract: The amount of information on the web is growing rapidly, and with it grow the challenges the web poses for information retrieval, as well as the number of users inexperienced in the art of web research. A search engine is an information retrieval system designed to help find information stored on the World Wide Web. It allows one to ask for information matching specific criteria and retrieves a list of items that match those criteria, sorted by some measure of relevance. In this paper, I present the Google search engine, which has become the prototype of a large-scale search engine: it is in heavy use today and will be used far more in the near future, as can be estimated from the fact that Google has an indexed database of at least 24 million pages. This paper mainly covers how Google works, including Google's hardware architecture, its servers, what Google indexes, its features and limitations, Google's ranking principles and tips, the Googleplex, and more. Everybody is chasing this amazing system, which has changed the way we surf the net, and webmasters are devising new ways to get a high Google PageRank. This paper also provides tricks and hacks for increasing your page rank in the world's most popular search engine.
Why is Google considered in this paper? Because Google is the most popular large-scale search engine, and it addresses many of the problems of existing systems. It makes heavy use of the additional structure present in hypertext to provide much higher quality search results. Its features include fast crawling technology to gather web documents and keep them up to date, efficient use of storage space to store indices, and a query system with minimal response time. In short, it is the best navigation service: instead of making things easier for the computer, it makes things easier for the user and makes the computer work harder. As Google users, we are familiar with the speed and accuracy of a Google search. How exactly does Google manage to find the right results for every query as quickly as it does? Questions like this are answered in this paper. There is something deeper to learn about Google, like a mystery waiting to be solved. For example, Google has built a single very large, custom computer comprising some 100,000 servers, running its own cluster operating system. Google makes this big computer even bigger and faster each month, while lowering the cost of CPU cycles, yielding an efficient system with a unique combination of advanced hardware and software. Google has taken the last ten years of systems software research out of university labs and built its own proprietary, production-quality system. What it will do next with the world's biggest computer and most advanced operating system remains a mystery.
TYPES OF SEARCH ENGINES: There are basically three types of search engines: 1) those powered by robots (called crawlers, ants or spiders); 2) those powered by human submissions; and 3) hybrids of the two.
Crawler-based search engines use automated software agents (called crawlers) that visit a web site, read the information on the actual site, read the site's meta tags, and follow the links the site connects to, indexing all linked web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed.
Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued; only information that is submitted is put into the index.
In both cases, when you query a search engine to locate information, you are actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Why will the same search on different search engines produce different results? Part of the answer is that not all indices are exactly the same: it depends on what the spiders find or what the humans submit. More importantly, not every search engine uses the same algorithm to search through the indices.
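To make the idea of searching an index rather than the Web concrete, here is a minimal sketch of an inverted index in Python. The sample documents and every name in it are illustrative only, not taken from any real engine:

```python
# A toy inverted index: the engine looks up pre-computed
# word -> document mappings instead of scanning pages at query time.
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word."""
    word_sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

docs = {
    1: "Google is a search engine",
    2: "A web crawler feeds the search index",
    3: "Yahoo is also a search engine",
}
index = build_index(docs)
print(search(index, "search engine"))   # {1, 3}
```

This also shows why dead links happen: the index is a snapshot, so a page can vanish from the Web while its entry survives until the crawler revisits it.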
Google's developers:
Larry Page, Co-Founder & President, Google Products
Sergey Brin, Co-Founder & President, Google Technology
Google's Hardware: To provide sufficient service capacity, Google's physical structure consists of clusters of computers situated around the world, known as server farms. These server farms consist of a large number of commodity-level computers running Linux-based systems that operate with GFS, the Google File System. It has been speculated that Google has the world's largest computer. One estimate credits Google with up to:
- 899 racks
- 79,112 machines
- 158,224 CPUs
- 316,448 GHz of processing power
- 158,224 GB of RAM
- 6,180 TB of hard drive space
How Google Handles Search Queries: When a user enters a query into the search box at Google.com, it is randomly sent to one of many Google clusters. The query will then be handled solely by that cluster. A load balancer monitoring the cluster spreads the request over the servers in the cluster to keep the load on the hardware even. The cluster then performs the following steps:
- Determine the documents pointed to by the keywords.
- Sort these documents using each one's PageRank.
- Provide links to these documents on the Web.
- Provide a link to view the cached version of each document in the doc server farm.
- Pull an excerpt from the cached version of the page, to give a quick idea of what it is about.
- Return an initial result set of document excerpts and links, rendered as HTML, with links to retrieve further result sets of matches.
By default, Google returns results in sets of ten matches (as an HTML page); you can change the number of results you see on the Google Preferences page. Google prides itself on the fact that most queries are answered in less than half a second. Considering the number of steps involved in answering a query, this is quite a technological feat.
Let's see how Google processes a query.
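As an illustration only, here is a rough Python sketch of the steps above. The INDEX, PAGERANK and CACHE stores are hypothetical stand-ins for Google's index servers and doc server farm, not its real interfaces:

```python
# Hypothetical stand-ins for the index, the PageRank table and the
# doc server farm; Google spreads these across clusters of machines.
INDEX = {"google": {"a.com", "b.com"}, "search": {"a.com", "c.com"}}
PAGERANK = {"a.com": 0.9, "b.com": 0.4, "c.com": 0.6}
CACHE = {"a.com": "Google search engine page ...",
         "b.com": "All about Google ...",
         "c.com": "How search works ..."}

def handle_query(query, page=0, page_size=10):
    # 1. Determine the documents pointed to by the keywords.
    word_sets = [INDEX.get(w, set()) for w in query.lower().split()]
    matches = set.intersection(*word_sets) if word_sets else set()
    # 2. Sort these documents using each one's PageRank (highest first).
    ranked = sorted(matches, key=lambda d: PAGERANK.get(d, 0.0), reverse=True)
    # 3. Return one result set of up to ten matches: a link, an excerpt
    #    pulled from the cached copy, and the full cached version.
    start = page * page_size
    return [{"url": d, "excerpt": CACHE[d][:60], "cached_copy": CACHE[d]}
            for d in ranked[start:start + page_size]]

print(handle_query("google search"))   # only a.com matches both words
```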
The PageRank System: The PageRank algorithm is used to sort pages returned by a Google search request. PageRank, named after Larry Page, who came up with it, is one of the ways in which Google determines the importance of a page, which in turn decides where the page turns up in the results list. PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes cast for a page, the more important the page must be. Moreover, the importance of the page casting the vote determines how important the vote itself is. The crucial element that makes PageRank work is the nature of the Web itself, which depends almost entirely on hyperlinking between pages and sites. In the system that makes Google's PageRank algorithm work, links are a Web popularity contest: if Webmaster A thinks Webmaster B's site has good information (or is cool, or looks good, or is funny), Webmaster A may decide to add a link to Webmaster B's site.
PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
This is the equation that calculates a page's PageRank. It is the original one, published when PageRank was being developed, and it is probable that Google now uses a variation of it. In the equation, t1 ... tn are the pages linking to page A, C(t) is the number of outbound links that page t has, and d is a damping factor, usually set to 0.85. We can think of it in a simpler way:
a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
This equation shows that a website has a maximum amount of PageRank that is distributed between its pages by internal links, and that this maximum increases as the number of pages in the site increases: the more pages a site has, the more PageRank it has.
Let's consider a three-page site (pages A, B and C) with no links coming in from the outside, where the only link is from page A to page B. The site's maximum PageRank is the amount of PageRank in the site. Suppose the initial PageRank values are: Page A = 0.15, Page B = 1, Page C = 0.15. Page A has "voted" for page B and, as a result, page B's PageRank increases. This looks good for page B. After 100 iterations the figures are: Page A = 0.15, Page B = 0.2775, Page C = 0.15. The total PageRank in the site is now 0.15 + 0.2775 + 0.15 = 0.5775. Hence you can see that sparse or poor internal linking lowers your PageRank. PageRank is also displayed on the toolbar of your browser if you have installed the Google Toolbar (http://toolbar.google.com/).
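This behaviour can be checked with a short Python sketch of the iterative calculation, under the stated assumption that the site's only link is from A to B. This is an illustration of the published formula, not Google's implementation:

```python
# Reproduces the 3-page example using the published formula
# PR(A) = (1 - d) + d * (PR(t1)/C(t1) + ... + PR(tn)/C(tn)).
d = 0.85
links = {"A": ["B"], "B": [], "C": []}   # A links to B; B and C link nowhere

pr = {page: 1.0 for page in links}       # arbitrary starting values
for _ in range(100):                     # iterate until the values settle
    new_pr = {}
    for page in links:
        inbound = [p for p, outs in links.items() if page in outs]
        new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in inbound)
    pr = new_pr

print(pr)   # roughly: A = 0.15, B = 0.2775, C = 0.15
```

The values converge after only a couple of iterations here; on the real web graph, with billions of interlinked pages, many more iterations are needed.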
Google's Web Crawler, the Googlebot: A web crawler (also known as a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.
Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code, and to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
There are two methods by which the Googlebot finds a web page:
- it reaches the page by crawling through links, or
- it reaches the page after the webmaster has submitted it at www.google.com/addurl.html.
Given the base link of a site, for example http://wiki.media-culture.org.au/, the Googlebot will go through all links in the index page and every subsequent page, until the entire site has been indexed.
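The following is a minimal, illustrative crawler sketch in Python. It is nothing like the real Googlebot, which runs distributed across many machines, but it shows the follow-the-links idea using only the standard library:

```python
# Minimal breadth-first crawler: fetch a page, harvest its links,
# and keep following them within the same domain.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Fetch pages breadth-first, staying on the seed's domain."""
    domain = urlparse(seed).netloc
    queue, seen, pages = [seed], set(), {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                    # dead link: skip it
        pages[url] = html               # store the copy for later indexing
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)
    return pages
```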
Google File System: The Google File System (GFS) is a proprietary file management system developed by Sanjay Ghemawat, Shun-Tak Leung and Urs Holzle for Google, as a means to handle the massive number of requests over a large number of server clusters.
Because of Google's decision to use a large number of commodity-level computers instead of a smaller number of server-class systems, the Google File System had to be designed to handle system failures; it therefore provides constant monitoring of systems, error detection, fault tolerance and automatic recovery. That means clusters hold multiple replicas of the information created by Google's web crawlers.
Like most other distributed file systems, the system was designed for maximum performance, to handle the large number of users; for scalability, to accommodate inevitable expansion; for reliability, to ensure maximum uptime; and for availability, to ensure computers are available to handle queries.
Because of the size of the Google database, the system had to be designed to handle huge multi-gigabyte files totaling many terabytes in size. GFS ensures that Google has maximum control over the system while allowing the system to stay flexible.
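As a toy illustration of the replication idea, consider the sketch below. The 64 MB chunk size and threefold replication are figures from Google's published GFS paper, not from this text, and everything else (server names, random placement) is invented for the example:

```python
# Toy sketch of GFS-style chunking and replication: a file is split into
# fixed-size chunks and each chunk is stored on several chunkservers, so
# a single machine failure never loses data.
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, per the published GFS paper
REPLICAS = 3                    # likewise the paper's default

def place_chunks(file_size, chunkservers):
    """Assign each chunk of a file to REPLICAS distinct chunkservers."""
    num_chunks = -(-file_size // CHUNK_SIZE)    # ceiling division
    return {chunk_id: random.sample(chunkservers, REPLICAS)
            for chunk_id in range(num_chunks)}

servers = [f"chunkserver-{i}" for i in range(8)]
print(place_chunks(200 * 1024 * 1024, servers))   # 4 chunks, 3 replicas each
```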
More on Google: Google almost certainly knows more about you than you would tell your mother. Did you ever search for information about AIDS, cancer, mental illnesses or bomb-making equipment? Google knows, because it has put a unique reference number in a permanent cookie on your hard drive (which doesn't expire until 2038). It also knows your internet (IP) address. Google's privacy policy says that it "notes and saves information such as time of day, browser type, browser language, and IP address with each query. That information is used to verify our records and to provide more relevant services to users. For example, Google may use your IP address or browser language to determine which language to use when showing search results or advertisements." If you add the Google Toolbar to your Windows browser, it can send Google information about the pages you view, and Google can update the Toolbar code automatically, without asking you.
Searching Tips and Tricks for Google:
- If you want to search Google for the words that will be on the page you want, not for a description of the page or website, type your query inside square brackets, e.g. [your query].
- To limit the scope of a search to a particular file type, use the file type syntax (filetype:). For example, filetype:ppt google finds mentions of Google in PowerPoint slides. Other formats include .pdf (Adobe Acrobat), .doc (Word) and .xls (Excel).
- You can use an asterisk (*) as a wildcard. Example: "George * Bush" finds George W. Bush. Example: "To * * * to be" finds "To be or not to be". You can also use this strategy to find email addresses: "email * * <domain>".
- To find out who links to a Web page, use the link syntax (link:). The search link:www.virtualchase.com performs a reverse link search on the given URL. This is useful for seeing how popular your site is.
- Use quotation marks (" ") to locate an entire string. E.g. "bill gates conference" will only return results containing that exact string.
- Mark essential words with a '+'. If a search term must contain certain words or phrases, mark it with a + symbol. E.g. +"bill gates" conference will return all results containing "bill gates", but not necessarily those pertaining to a conference.
- Negate unwanted words with a '-'. You may search for the term bass, pertaining to the fish, and be returned a list of music links as well. To narrow down your search, try bass -music. This will return all results with "bass" and NOT "music".
- site: E.g. site:www.cwire.org will search only pages that reside on that domain.
- related: E.g. related:www.cwire.org will display all pages that Google finds to be related to your URL.
- spell:word runs a spell check on your word.
- define:word returns the definition of the word.
- stocks: [symbol, symbol, ...] returns stock information. E.g. stocks:msft.
- maps: is a shortcut to Google Maps.
- phone: name_here attempts to look up the phone number for a given name.
- cache: shows the cached version of a web page. If you include other words in the query, Google will highlight those words within the cached document. For instance, cache:www.cwire.org web will show the cached content with the word "web" highlighted.
- info: presents some information that Google has about a web page. For instance, info:www.cwire.org will show information about the CyberWyre homepage.
- weather: finds the weather in a particular city. E.g. weather: new york.
- allinurl: if you start a query with [allinurl:], Google will restrict the results to those with all of the query words in the URL. For instance, [allinurl: google search] will return only documents that have both "google" and "search" in the URL.
- inurl: if you include [inurl:] in your query, Google will restrict the results to documents containing that word in the URL. For instance, [inurl:google search] will return documents that mention the word "google" in their URL, and mention the word "search" anywhere in the document (URL or not).
- allintitle: if you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both "google" and "search" in the title.
- intitle: if you include [intitle:] in your query, Google will restrict the results to documents containing that word in the title. For instance, [intitle:google search] will return documents that mention the word "google" in their title, and mention the word "search" anywhere in the document (title or not). Note that there can be no space between "intitle:" and the following word.
- allinlinks: searches only within links, not text or title.
- allintext: searches only within the text of pages, but not in the links or page title.
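Such queries can also be assembled programmatically. The helper below is hypothetical; only the public /search URL format and the operators listed above come from real usage:

```python
# Hypothetical helper that builds a Google query URL from the
# operators described above.
from urllib.parse import urlencode

def google_query(terms, site=None, filetype=None, exclude=(), exact=None):
    parts = list(terms)
    if exact:
        parts.append(f'"{exact}"')             # quotation marks: exact string
    if site:
        parts.append(f"site:{site}")           # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to one file format
    parts += [f"-{word}" for word in exclude]  # negate unwanted words
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})

print(google_query(["bass"], exclude=["music"]))   # the fish, not the music
print(google_query(["google"], filetype="ppt"))    # Google in PowerPoint slides
```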
Steps to get a high Google PageRank:
Server speed: Your website's pages must download almost at the speed of light; Google gives more visibility to websites that reside on fast servers.
Site updating: The Googlebot can check WHEN your pages were uploaded to the server. Here's a simple hack: upload all your pages every day, even if nothing has changed.
Lots of light HTML pages: Google adores simple websites with hundreds of pages. If you are building a page that (because of its extensive content) is going to be larger than 50 KB, split it into two or three pages.
Start out slowly: If possible, begin with a new site that has never been submitted to the search engines or directories. Choose an appropriate domain name, and start out by optimizing just the home page.
Learn basic HTML: Many search engine optimization techniques involve editing the behind-the-scenes HTML code. Your high rankings can depend on knowing which codes are necessary and which aren't.
Choose keywords wisely: The keywords you think might be perfect for your site may not be what people are actually searching for. To find the optimal keywords for your site, use tools such as WordTracker.
Create a killer Title tag: HTML title tags are critical because they're given a lot of weight by all of the search engines. You must put your keywords into this tag and not waste space with extra words. Do not use the Title tag to display your company name or to say "Home Page." Think of it more as a "Title Keyword Tag" and create it accordingly. Add your company name to the end of this tag, if you must use it.
Create meaty Meta tags: Meta tags can be valuable, but they are not a magic bullet. Create a Meta Description tag that uses your keywords and also describes your site. The information in this tag often appears under your Title in the search engine results pages.
Use extra "goodies" to boost rankings: Things like headlines, image alt tags, header tags, links from other pages, keywords in file names, and keywords in hyperlinks can cumulatively boost search engine rankings. Use any or all of these where they make sense for your site.
Don't expect quick results: Getting high rankings takes time; there's no getting around that fact. Once your site is added to a search engine or directory, its ranking may start out low and then slowly work its way up the ladder. Google measures "click-through popularity": the more people who click on a particular site, the higher its ranking will go. Be patient and give your site time to mature.
Avoid frames: Search engines don't index framed sites very well, so if frames aren't necessary, simply remove them for the better ranking of your site.
Sites that use dynamic URLs: Most search engines don't list dynamic URLs from database-driven or script-running sites. If your URL contains any of the following elements: ? & % + = $ cgi-bin .cgi, it is considered a dynamic URL by search engines. The solution is to make a static page, with a URL containing none of the elements in that list.
Sites that use Flash: Flash itself is not a problem; the problem is unprofessional application that wastes the effort. Homepages, intro pages and the like are often built in Flash to make a cool, interactive impression in the browser. But a problem arises when most or all of a page depends on Flash, because search engines don't index Flash. Another major problem is that links inside Flash can't be crawled by search engines, so they can't be indexed. The solution in such cases is to write highly effective TITLE and META tags; to solve the links problem, the standard way is to create a sitemap and link it to each web page via a standard HTML hyperlink tag.
Sites that use image maps for navigation: Search engines' link crawlers mostly get confused by image maps and don't spider most of the links. You can overcome this by adding an alternate simple navigational menu, or by the standard sitemap technique.
Sites that use JavaScript for navigation: Search engines mostly don't follow links provided in JavaScript. You can overcome this by adding an alternate simple navigational menu, or by the standard sitemap technique.
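The dynamic-URL test above is mechanical enough to express in a few lines of Python; this sketch simply checks a URL against the markers the text lists:

```python
# Flags a URL as "dynamic" if it contains any of the markers listed
# above, which search engines of the time tended to skip.
DYNAMIC_MARKERS = ["?", "&", "%", "+", "=", "$", "cgi-bin", ".cgi"]

def is_dynamic_url(url):
    """Return True if the URL looks database- or script-generated."""
    return any(marker in url for marker in DYNAMIC_MARKERS)

print(is_dynamic_url("http://example.com/page?id=42"))   # True
print(is_dynamic_url("http://example.com/page.html"))    # False
```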
Eventually, you'll see the fruits of your labor: your site's listing in Google and the rest of the search engines!
Conclusion: Google is now the world's most powerful website. Google's mission is to organize the world's information and make it universally accessible and useful. Google said, "We're about not ever accepting that the way something has been done in the past is necessarily the best way to do it today." At first nobody believed it, but by now Google has done far more than that. The Google web site is powered by some amazing technology. People often ask, "What are you working on? Isn't search a solved problem?" The answer lies with Google, the master of the internet, which averages about 250 million searches per day.