Computer Science, Volume VI, Issue I

Search Engines: Guiding You in the Digital World

About the Author: Reid Hirata

In Fall 2003, Reid was a junior majoring in computer engineering/computer science at the University of Southern California. He hails from the island of Oahu and enjoys applying web programming and entrepreneurship in the development of his company, Xaveon Security.

The daily life of the average American depends more and more on the Internet. Information and entertainment sites abound, but many of the billions of web pages remain unknown. Search engines are the pipeline through which users quickly find the content they desire. The history of online search engines dates back to the very beginnings of the Internet, and they have evolved into complex tools that index and rank the enormous, highly dynamic environment of the World Wide Web. Much can be learned from investigating how search engines rank sites and what a webmaster can do to rank highly in the major engines.

Introduction

When we open up a web browser, what is the first page we visit? More often than not, it is our favorite search engine. Maybe we even set it as our default home page because we use it so frequently. But what engineering lies behind this simple exterior, allowing us to access hundreds of millions of web pages with a few keystrokes and the click of a mouse? What complex processes run from the time we click “search” to the time we see the results page? A bit more than we might think.
Search engines bind the World Wide Web together for the average Joe Surfer, linking him to previously unknown and unreachable sites that contain the information he desires. Perhaps this is how you, the reader, have found Illumin. Yet there are billions of web pages still unindexed by any search engine – the “invisible web”. Pages that require a subscription or registration, as well as pages generated dynamically based on user input, are off limits to the search engines that scan the Internet for new sites [1]. This article will introduce you to the inner workings of search engines, paint a picture of the search engine scene today, and offer some tips for better searching along the way.

A Brief History

Although the Internet was officially born in 1969 with ARPANET, online searching took its first steps in 1964, when the concept was still in incubation. Teams at MIT, Data Corporation, and IBM, among others, concurrently developed many features that are still in use in modern search engines. Because the basic searching functions were already in place from ordinary text retrieval, most of the improvements came in browsing features and in added functionality and flexibility for viewing search results [2]. Many search engine companies came and went over the next 30 years as early search algorithms became obsolete in the rapidly changing World Wide Web environment. Those that managed to adapt, such as Lycos and AltaVista, still survive today, but none share the success of newcomer Google (see Fig. 1).

Figure 1: Various search engines of today (image: Wikimedia Commons).

The Process

Identifying documents and displaying them to the user is a fine art. Search engine users are often frustrated when they fail to find results that they expect. Undoubtedly the necessary websites exist somewhere, but finding one page in a few billion does not make for good odds. Understanding the way search engines work goes a long way towards narrowing down search results to the best candidates.
The search process consists of four main functions: document processing, query processing, searching and matching, and ranking. Although users focus on the search itself, any of the four modules may be responsible for the results, expected or unexpected, that a search returns.
The document processor and the query processor perform similar tasks but on different objects. The document processor, which goes through several steps when analyzing web page content, runs as the search engine’s robot scans new websites. First, it standardizes the content to plain text so subsequent steps can more easily handle the incoming data, and then it identifies key terms that can be searched for in the document. The methods for doing this, and the terms themselves, whether words or phrases, differ from search engine to search engine. Next, stop words (common prepositions, conjunctions, and articles) are removed to prevent false matches that add little relevance to the topic being searched.
Most search engines then truncate words that have alternate forms so that all forms will match, a technique known as stemming. An example would be truncating analysis to analy- so that it matches searches for analysis, analyze, analyzed, analyzing, and so on. Although some engines allow you to put the truncated version, analy, in the query string yourself, this step makes it unnecessary to enter many forms of a word to find a match. Finally, the key terms are assigned scores based on relevance, and the information is stored in a separate file. This is called indexing a page, and it is typically done every two to six weeks [3]. The query processor goes through many of the same steps, but it does so in real time as queries are entered.
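To make these steps concrete, here is a minimal sketch in Python of how a document processor might reduce a page to scored, indexed terms. The stop-word list, the five-character truncation, and the index_document helper are simplifications invented for illustration; production engines use far richer term identification, stemming, and scoring.

```python
import re
from collections import Counter, defaultdict

# A toy stop-word list; real engines maintain much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "for"}

def process_document(html):
    """Reduce a page to scored key terms, roughly as a document processor might."""
    # 1. Standardize the content to plain text (here: strip tags, lowercase).
    text = re.sub(r"<[^>]+>", " ", html).lower()
    # 2. Identify candidate terms (single words only in this sketch).
    words = re.findall(r"[a-z]+", text)
    # 3. Remove stop words to avoid low-relevance matches.
    words = [w for w in words if w not in STOP_WORDS]
    # 4. Truncate alternate word forms so "analysis", "analyzing", etc. collapse
    #    to the same stem ("analy") -- a crude stand-in for real stemming.
    stems = [w[:5] for w in words]
    # 5. Score each term; here the score is simply its frequency in the page.
    return Counter(stems)

# The index: each stem maps to the documents that contain it, with scores.
index = defaultdict(dict)

def index_document(doc_id, html):
    for stem, score in process_document(html).items():
        index[stem][doc_id] = score

index_document("page1", "<p>Analyzing search engines and the analysis of queries</p>")
print(index["analy"])   # {'page1': 2} -- both forms of "analyze" matched one stem
```

A query processor would run the same normalization, stop-word, and truncation steps on the user’s query string in real time before matching its stems against the index.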
The searching and matching function and the ranking function work hand in hand, analyzing the outputs of the query processor and document processor to generate a ranked list of matches. These functions account for almost all of the computing time needed to return a search and vary widely between engines [3]. Early search engines used simple binary criteria to determine matches: if a document contained the search terms, it was a match. As the number of web pages increased and ranking demanded more precision, fuzzy logic was introduced. This gives partial weight to related terms, so for a search query of “wooded area,” a document containing “wooded hill” would be ranked higher than one containing “wooden house.”
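The difference between the two approaches can be illustrated with a small Python sketch. The related-term weights below are invented for the example; real engines derive such weights statistically from large text collections.

```python
# Hypothetical weights linking a query term to related terms.
RELATED = {"wooded": {"wood": 0.7, "wooden": 0.5, "forest": 0.6}}

def binary_match(query_terms, doc_terms):
    """Early-style matching: the document matches only if every term appears."""
    return all(t in doc_terms for t in query_terms)

def fuzzy_score(query_terms, doc_terms):
    """Fuzzy-style matching: exact terms get full weight, related terms partial."""
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            score += 1.0  # exact match: full weight
        else:
            # partial weight for the best related term present in the document
            related = RELATED.get(term, {})
            score += max((w for t, w in related.items() if t in doc_terms), default=0.0)
    return score

query = ["wooded", "area"]
print(fuzzy_score(query, {"wooded", "hill"}))    # 1.0 -- exact hit on "wooded"
print(fuzzy_score(query, {"wooden", "house"}))   # 0.5 -- only a related term matched
```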
It is worth keeping in mind the criteria engines generally use to find quality matches in documents. One of the first checks an engine performs is to count the occurrences of the search terms in a document; the more frequent the terms are, and the closer they appear to the top, the higher the document’s rating. Also considered are the number of links to that particular document, the number of hits the document has received, and its date of publication. All of these indicate the document’s relevance and its importance relative to others that matched similar searches in the past [3].
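A toy scoring function along these lines might combine those signals as follows. The weights and the exact formula are hypothetical; each engine guards its own ranking recipe.

```python
from datetime import date

def rank_score(term_count, first_position, inbound_links, past_hits, published):
    """Blend the signals described above into a single ranking score.

    The individual weights here are invented for illustration, not any
    engine's actual formula.
    """
    frequency = term_count                         # more occurrences of the search terms
    prominence = 1.0 / (1 + first_position)        # terms near the top count more
    popularity = inbound_links + 0.1 * past_hits   # links to the page and past hits
    age_days = (date.today() - published).days
    freshness = 1.0 / (1 + age_days / 365)         # recently published pages score higher
    return 2.0 * frequency * prominence + popularity + freshness

# A page that mentions the terms five times, starting in its first sentence,
# with 40 inbound links, 120 prior hits, and a recent publication date.
print(rank_score(term_count=5, first_position=0,
                 inbound_links=40, past_hits=120, published=date(2003, 9, 1)))
```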
Until now, search engine providers have mainly opted for less complex processing of documents and queries. Typical search results therefore leave a lot of work to the searcher, who must sift through the results and likely view a number of documents before finding exactly what he or she seeks. The current trend toward greater flexibility and an increasingly web-savvy user base suggests that this will not continue. Search engines that go further in the complexity and quality of their processing will be rewarded with greater user loyalty and more opportunities to provide search services for organizations’ intranets.

Today’s Arena

Google, AltaVista, AlltheWeb, Lycos, MSN, AOL, Ask Jeeves, Excite, HotBot, Yahoo! The field is crowded, but three major points differentiate the top search engines today: the store of information they draw from to provide matches, the efficiency and proficiency with which they match searches to relevant documents, and the way they present those matches back to the user. The last two are vital, and a deficiency in either will lead to an unsatisfactory experience for the user.
Many search engines, such as Google and AlltheWeb, display the number of pages they index up front. As of January 2003, Google holds the lead with just over 3 billion pages, followed by AlltheWeb and AltaVista at 2.1 and 1.7 billion, respectively [4]. With all major search engines well over 1 billion pages, however, these differences become negligible; for all but the most diligent searcher, every search engine has essentially the same content. The challenge then becomes how to sort through these billions of pages to find quality matches.
With so many similarities between engines, a user’s choice may come down to personal preference in the user interface. Results displayed in a clear, logical fashion, free from pop-ups and intrusive ads, will gain the support of many, as Google and Yahoo! have demonstrated. Additionally, these two sites clearly distinguish between search results and paid ads. Others, such as AOL and MSN, mix paid-for sites, their own sites, and relevant results together without any indication of which is which. Many web surfers feel that companies should not be allowed to pay search engine companies for high rankings, as this crowds out small companies and non-profit organizations and skews the search results [5].

Search Engine Optimization

While some search engines allow sites to buy high rankings for certain keywords, in others the same rankings must be earned through search engine optimization (SEO). This process, usually handled by an external SEO company, often costs more than the search engine fee; high-traffic keywords for a large site can cost $50K-$100K per year to maintain [6]. SEO is a fairly new and extremely broad field with very few truly knowledgeable and ethical experts. There are two types of SEO companies: those that get high rankings by tricking the search engine spider into thinking a site is highly relevant to a keyword, and those that genuinely redesign a site to make it more relevant to that keyword.
The methods of duping a search engine spider vary widely and have evolved over the years as search engine companies have caught on and improved their spiders. The earliest attempts involved placing abundant hidden words on a page to inflate a word’s frequency and thereby improve the spider’s relevancy rating, a practice known as search engine spamming. Little could be done to improve the spider’s ability to detect this, since a truly relevant site would look much the same to the spider. Search engine companies, however, now regularly check the top web pages for spamming and ban any offending sites they find [6]. Another practice frowned upon by search engine companies is the use of shadow domains: offenders beef up an empty page for certain keywords through spamming or other means, then redirect visitors to the actual site, which may or may not be related. These and numerous other techniques are cause for removal from a search engine’s index, so while a site may achieve top-ten results as promised, it will only stay there for a week or two before the search engine company finds and bans it.
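One way an engine might flag this kind of keyword stuffing is a simple density check, sketched below in Python. The check and its threshold are hypothetical; real spam detection weighs many more signals than word frequency alone.

```python
def looks_stuffed(text, keyword, density_threshold=0.15):
    """Flag a page whose keyword density is implausibly high.

    A hypothetical heuristic for the spam check described above; the
    threshold is invented and real engines combine many other signals.
    """
    words = text.lower().split()
    if not words:
        return False
    density = words.count(keyword.lower()) / len(words)
    return density > density_threshold

page = "cheap flights cheap flights cheap flights book your cheap flights now"
print(looks_stuffed(page, "cheap"))   # True -- 4 of the 11 words are the keyword
```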
A good SEO company will strictly follow the guidelines set by search engines and take a legitimate (and ultimately more effective) approach to website optimization. This often involves completely redesigning the website’s structure by assigning proper naming to web pages, images, and links within the site. The results, however, are upright and lasting. The best test of a good SEO company, of course, is to search for “search engine optimization” or a similar phrase on several search engines over the course of a few weeks to see which sites are consistently in the top ten.
The conscientious web searcher knows that if a search for “Sun Microsystems” returns “abc-computers.com” in the first spot, that site has most likely usurped the position fraudulently, is unrelated, and will be gone next week.

Conclusion

Search engines, like the Internet they bind together, are constantly evolving and expanding. The algorithms that comb billions of web pages are structured but immensely complex. To get better search results, users must understand the basics of how a modern search engine operates and be willing to experiment with different engines and the various options each offers. Millions of pages, just waiting to be indexed and viewed, are added to each search engine every day. Go!

References

[1] N. Medeiros. (2002) “Reap what you sow: harvesting the deep web.” OCLC Systems & Services. [Online]. 18(1), pp. 18-20. Available: http://www.emeraldinsight.com/journals.htm?articleid=863167&show=html [Nov. 4, 2003].

[2] T.B. Hahn. “Text retrieval online: Historical perspective on Web search engines.” American Society for Information Science, vol. 24, no. 1, pp. 7-10, Apr. 1998.

[3] E. Liddy. “How a search engine works.” Internet: http://www.infotoday.com/searcher/may01/liddy.htm, 2001 [Nov. 4, 2003].

[4] “Search Engine Showdown.” Internet: http://www.searchengineshowdown.com/ [Nov. 4, 2003].

[5] A. Kassel. “Power Searching Strategies for Success.” Internet: http://www.allbusiness.com/media-telecommunications/internet-www/10622026-1.html, Apr. 2001 [Nov. 4, 2003].

[6] B. Clay. “Bruce Clay, LLC: Internet Business Consultants.” Internet: http://www.bruceclay.com, Oct. 2003 [Nov. 4, 2003].
