The Google Goal of Indexing 100 Billion Web Pages
In their paper ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’, Sergey Brin and Lawrence Page make it evident that Google’s goal has always been to be one of the best search engines in terms of the quality of its results. They knew, however, that to achieve this Google needed to store information efficiently and cost-effectively, and to have excellent crawling, indexing, and sorting techniques. Google aimed not only to give quality results but to produce them as fast as possible. It started as a high-quality search engine and continues to be the best search engine today. It has stayed true to its original intent: to be a search engine that not only crawls and indexes the web efficiently but also produces more satisfying results than other existing search engines.
To stay true to their goal of providing the best search results, Google’s founders knew right from the start that the search engine had to be designed to keep up with the web’s growth.
According to Brin and Page, “In designing Google we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index.” They knew that they needed a great deal of space to store an ever-growing index.
Google’s index, which started out at 24 million web pages, was large for its time and has since grown to around 25 billion web pages, still keeping Google ahead of its competitors. However, Google is a company that doesn’t settle for just beating the competition.
It truly aims to give its users the best service there is, and for a search engine that means giving users access to all, or at least most, of the quality information available on the web.
Google’s New System for Indexing More Pages
As mentioned earlier, Google aims to give access to even more information and has been devoting much time and effort to realizing this goal. It seems that the new patent entitled ‘Multiple Index Based Information Retrieval System’, filed by Google employee Anna Patterson, might be the answer to the problem.
The patent, filed back in January of 2005 and published just this May of 2006, shows that Google might actually be aiming to expand its index to as many as 100 billion web pages or even more.
According to the patent, conventional information retrieval systems, more commonly known as search engines, are able to index only a small part of the documents available on the Internet. According to estimates, the number of web pages on the Internet as of last year was around 200 billion; however, Patterson claimed that even the best search engine (that is, Google) was able to index only 6 to 8 billion web pages.
The disparity between the number of indexed pages and the number of existing pages clearly signaled a need for a new breed of information retrieval system. Conventional systems simply could not index enough web pages to give users access to a large enough share of the information available on the web.
The Multiple Index Based Information Retrieval System, however, is up to the challenge and is Google’s answer to the problem. Two characteristics make the new system stand out from conventional systems.
One is that it has the “capability to index an extremely large number of documents, on the order of a hundred billion or more”. And the other is its capability to “index multiple versions or instances of documents for archiving…enabling a user to search for documents within a specific range of dates, and allowing date or version related relevance information to be used in evaluating documents in response to a search query and in organizing search results.”
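To make the second capability more concrete, here is a minimal Python sketch of a date-restricted search over multiple versions of the same document. It is only an illustration of the general idea and not the method described in the patent; the class name VersionedIndex, its methods, and the example URLs and dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date


class VersionedIndex:
    """Toy inverted index that keeps every dated version of a document.

    Hypothetical illustration only; not the system described in the patent.
    """

    def __init__(self):
        # term -> list of (url, version_date) postings
        self.postings = defaultdict(list)

    def add_version(self, url, version_date, text):
        # Index each distinct term once per document version.
        for term in set(text.lower().split()):
            self.postings[term].append((url, version_date))

    def search(self, term, start, end):
        # Keep only versions containing the term and dated within [start, end].
        hits = [
            (url, d)
            for url, d in self.postings.get(term.lower(), [])
            if start <= d <= end
        ]
        # Newest versions first, then by URL for a stable order.
        return sorted(hits, key=lambda hit: (-hit[1].toordinal(), hit[0]))


# Made-up documents: two versions of one page and one version of another.
index = VersionedIndex()
index.add_version("example.com/a", date(2004, 3, 1), "search engine index")
index.add_version("example.com/a", date(2005, 6, 1), "search engine patent")
index.add_version("example.com/b", date(2005, 9, 1), "patent filing")

# Only versions dated in 2005 that mention "patent" are returned.
results = index.search("patent", date(2005, 1, 1), date(2005, 12, 31))
for url, version_date in results:
    print(url, version_date.isoformat())
```

Storing the version date alongside each posting is what lets one query both filter by a date range and use recency when ordering the results.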
With the new system developed by Patterson, Google now has the ability to expand its index to a hundred billion pages or more, as well as to improve document analysis and processing, document annotation, and even ranking according to contained and anchor phrases.
History of Google’s Index Size
Google started out with an index of around 24 million web pages in 1996. By August of 2000, Google had grown its index roughly fortyfold, to approximately one billion web pages. In September of 2003, Google’s front page boasted an index of 3.3 billion web pages.
Microdoc, however, revealed that the actual number of web pages Google had indexed at that time was already more than five billion. In their article ‘Google Understates the Size of Its Database’, they emphasized that Google specialized not only in simplicity but also in understating its power and complexity.
Google was still managing to stay ahead of its competitors and continued to surprise everyone with what it had up its sleeve.
As Google’s index continued to grow, the number on its front page grew impressively large as well before it plateaued at eight billion web pages. This was around the time that Patterson filed the new patent. Then in 2005, with controversies over index size growing, Google decided to stop counting in public and simply claimed that its index was three times larger than that of its nearest competitor.
Google also maintained that what mattered was not just the number of indexed pages but how relevant the returned results were. Then in September of 2005, as part of Google’s 7th anniversary, Anna Patterson, the same software engineer who filed the patent on the Multiple Index Based Information Retrieval System, posted an entry on Google’s official blog claiming that the index was now 1,000 times larger than the original index.
This pegged the index at around 24 billion web pages, about a fourth of the Google goal of indexing 100 billion web pages. It seems, then, that Google must have started using the new system in mid-2005. With the new system in place, we can only wait and see how fast Google will reach the goal of 100 billion web pages in its index. It’s most likely, though, that when Google has reached that goal it will set an even higher one in order to keep providing quality service.
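The arithmetic behind these figures is easy to check. The short Python sketch below simply recomputes them from the numbers quoted in this article (24 million original pages, a 1,000-fold increase, a 100 billion page goal, and an estimated 200 billion pages on the web against 6 to 8 billion indexed). The variable names are my own, and the inputs are the article's estimates rather than measured values.

```python
# Figures quoted in the article; all are estimates, not measurements.
ORIGINAL_INDEX = 24_000_000          # pages in the original index
GROWTH_FACTOR = 1_000                # "1,000 times larger"
GOAL = 100_000_000_000               # the 100 billion page goal
WEB_ESTIMATE = 200_000_000_000       # estimated pages on the web
INDEXED_LOW = 6_000_000_000          # lower bound from the patent
INDEXED_HIGH = 8_000_000_000         # upper bound from the patent

# Index size implied by the anniversary blog post.
implied_index = ORIGINAL_INDEX * GROWTH_FACTOR

# Share of the 100 billion goal that this represents.
share_of_goal = implied_index / GOAL

# Share of the estimated web covered by a 6 to 8 billion page index.
coverage_low = INDEXED_LOW / WEB_ESTIMATE
coverage_high = INDEXED_HIGH / WEB_ESTIMATE

print(f"Implied index size: {implied_index:,} pages")
print(f"Share of goal: {share_of_goal:.0%}")
print(f"Web coverage before the new system: {coverage_low:.0%} to {coverage_high:.0%}")
```

In other words, 24 billion pages is 24 percent of the goal, and an index of 6 to 8 billion pages covered only about 3 to 4 percent of the estimated web.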