So Much Information, So Little Time. Evaluating Web Resources With Search Engines
The Internet is one of the youngest and fastest growing media in the world. Its growth is still accelerating at a rate of about 7.3 million pages per day, doubling every eight months (Murray and Moore 2000), indicating that the Internet has not yet reached its highest period of expansion. Buried in this vast, quickly growing collection of documents lies information of interest and use to almost everyone. The trick is finding it. An abundance of search engine tools can be used to retrieve information from the World Wide Web. Search Engine Watch (2001) reports that more than 75 search engine tools are available and provide links to many relevant sources. Many tutorials are also available online, providing details on search engine tools and guidelines for effective searches.
Students are using the Web as a source of information for both educational and personal topics. To conduct an effective search, students must understand the structure of various search engines. Not all search engines are created equal. Many do not consistently provide the right information and often subject the user to an influx of disjointed and irrelevant data. Without systematic instruction in information literacy, students cannot realize the potential of the Web. Students who know both effective search strategies and the criteria for evaluating search engine results can enhance their current studies and, more importantly, continue to update their knowledge through access to relevant and useful resources readily available on the Web.
The Internet is a valuable resource, but it should be used with caution. The motto "An educated consumer is the best customer" is aptly applied to the Internet. To move toward the goal of creating "educated information consumers," we developed a hands-on exercise that was used in an introductory management information systems (MIS) course. The remainder of this article describes the exercise, including an introduction to information retrieval and a discussion about our experience. The lesson in information retrieval focuses on answering the following three questions:
1. Do all search engines find the same information?
2. How can we judge the retrieval effectiveness of these results?
3. Why do we get different results using different search engines at the same time, or the same search engine at different times?
Information Retrieval Concepts
We introduce students to the concept of retrieval effectiveness through a classroom demonstration. We ask three students to suggest different search engines, then perform a search using each of them. For example, we want information on mobile commerce in healthcare management, so we conduct a search using the terms "mobile commerce" and "healthcare management." We ask students to note the number of sites returned by each engine. Comparing these numbers shows how widely search engines vary in the sites they retrieve. Preliminary results were: Google returned 19 sites; LookSmart returned five directory topics and 2,000 sites; and Metacrawler returned three sites. Clearly, the number of sites returned varies substantially. This can be attributed to the fact that different search engines are used, as well as the fact that each engine belongs to a different category of search tool.
We then introduce students to the concepts of recall and precision, two of the most widely used measures of information retrieval effectiveness. Recall measures how well an engine retrieves all the relevant documents, whereas precision measures how well the system retrieves only the relevant documents (Blair and Maron 1985). For the purposes of this exercise, a site is relevant if the user who initiated the search deems it relevant; otherwise, it is noted as irrelevant.
Figure 1, above, graphically represents the concepts of recall and precision. U represents everything on the Web, A+B are the sites we want, B+C are the sites that are returned, C represents the sites we do not want but always get, and A represents the documents we want but always miss. Recall is calculated as B/(A+B), the percentage of the sites we want that were retrieved. Precision is calculated as B/(B+C), the percentage of sites retrieved that we actually wanted. Recall is more difficult to calculate because the total number of relevant sites on the Web is unknown. A relative value for recall can be calculated by estimating the total number of relevant sites. This can be done by using the highest number of relevant sites found for any of the three search engines being evaluated, or by counting the total number of unique relevant sites that are returned from all of the engines.
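The definitions above translate directly into a small calculation. The following is a minimal sketch in Python, assuming that relevance judgments are recorded as sets of site addresses. The function names and example addresses are hypothetical and serve only as illustration. Relative recall here uses the pooled set of unique relevant sites from all engines as the estimate of A+B.

```python
def precision(retrieved, relevant):
    """B / (B + C): share of the retrieved sites that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def relative_recall(retrieved, pooled_relevant):
    """B / (A + B), with A + B estimated by the pooled relevant sites."""
    if not pooled_relevant:
        return 0.0
    return len(retrieved & pooled_relevant) / len(pooled_relevant)


def pool_relevant(judged_relevant_by_engine):
    """Union of the relevant sites found by every engine (estimate of A + B)."""
    pooled = set()
    for sites in judged_relevant_by_engine.values():
        pooled |= sites
    return pooled


# Hypothetical data: the sites each engine returned, and those judged relevant.
returned = {
    "engine_1": {"a.example", "b.example", "c.example", "d.example"},
    "engine_2": {"b.example", "e.example"},
}
judged_relevant = {
    "engine_1": {"a.example", "b.example"},
    "engine_2": {"b.example", "e.example"},
}

pooled = pool_relevant(judged_relevant)
for name, sites in returned.items():
    print(name, precision(sites, pooled), relative_recall(sites, pooled))
```

Because the pooled set only approximates the true number of relevant sites, the recall values are relative: they are useful for comparing engines with one another, not as absolute measures.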
We illustrate the concepts by calculating the retrieval effectiveness of the three previously chosen search engines. This is done by having the students review the first 20 sites returned by each engine. As they do so, the students note the number of relevant sites and the total number of unique sites returned by each search engine. The retrieval effectiveness results, with 11 unique relevant sites identified among the three engines, were:
Google: Precision = 9/20; Recall = 9/11
LookSmart (looked at first 20 sites for estimate): Precision = 3/20; Recall = 3/11
Metacrawler: Precision = 1/3; Recall = 1/11
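The figures reported above can be reproduced with a few lines of Python. This is only a sketch: the counts are taken from the demonstration, while the variable and function names are hypothetical.

```python
from fractions import Fraction

# Counts reported in the classroom demonstration:
# (relevant sites found, sites reviewed) for each engine.
results = {
    "Google": (9, 20),
    "LookSmart": (3, 20),
    "Metacrawler": (1, 3),
}
pooled_relevant = 11  # unique relevant sites across the three engines


def effectiveness(relevant_found, reviewed, pooled):
    """Return (precision, relative recall) as exact fractions."""
    return Fraction(relevant_found, reviewed), Fraction(relevant_found, pooled)


for engine, (found, reviewed) in results.items():
    p, r = effectiveness(found, reviewed, pooled_relevant)
    print(f"{engine}: precision = {float(p):.2f}, recall = {float(r):.2f}")
```

Expressed as decimals, Google's precision is 0.45 and its relative recall about 0.82, while Metacrawler's precision of about 0.33 rests on only three returned sites.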
Analyzing Search Results
The Web offers numerous search tools for finding information. Each tool differs in its databases, searching capabilities, command languages and method of displaying results. The two basic approaches to searching the Web are search engines and subject directories. Search engines can be further divided into multithreaded search engines or metasearch engines. Subject directories can be further divided into subject-specific search engines and specialized subject directories.
Search engines. Search engines, such as Google, and metasearch engines are best used to locate a specific piece of information or a known document. All search engines prompt a user to enter a keyword or phrase, and based on the words entered, the search engine produces lists of Web documents containing the specific keyword(s). A major misconception about search engines is that they search the entire Internet as it exists at the very moment the search is conducted. Instead, each search engine searches a fixed database of Web documents that has been automatically compiled by "spiders" or "robots" prior to the search. Search engine results differ in the specific sites retrieved as well as the order and quantity of sites. The differences can be attributed to the size of a search engine's database and the frequency with which it is updated. Search engines also differ in the speed with which they retrieve results and in their user-friendliness.
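The point that a search runs against a previously compiled database, not the live Web, can be illustrated with a toy inverted index. This is a minimal sketch under simplifying assumptions (whitespace tokenization, all query words required). The page addresses and function names are hypothetical, and real engines are far more elaborate.

```python
from collections import defaultdict


def build_index(crawled_pages):
    """Build an inverted index (word -> set of URLs) from pages a spider
    collected earlier. The index is a snapshot and is not updated at query time."""
    index = defaultdict(set)
    for url, text in crawled_pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs whose indexed text contains every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Hypothetical snapshot gathered by a spider before any search is run.
snapshot = {
    "http://example.com/a": "mobile commerce in healthcare",
    "http://example.com/b": "healthcare management basics",
}
index = build_index(snapshot)

# A page published after the crawl stays invisible until the next crawl.
live_web = dict(snapshot)
live_web["http://example.com/new"] = "mobile commerce and healthcare management"

print(search(index, "healthcare"))                  # the two crawled pages only
print(search(build_index(live_web), "healthcare"))  # includes the new page
```

Two engines with snapshots of different size or age would return different results for the same query, which is the behavior students observe in the exercise.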
Metasearch engines, such as Metacrawler, are a class of engines that search multiple databases simultaneously via a single interface. When keywords or phrases are entered, the metasearch engine searches up to 13 different search engines, including AltaVista, Excite, FindWhat.com, Google, LookSmart, Lycos and WebCrawler. While metasearch engines are able to search multiple engines simultaneously, there are limitations. Most metasearch engines time out before they can comprehensively search each index, and because they use various indexes, the search may be slower. As a result, some complex search statements may not be recognized.
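The idea of a metasearch engine can be sketched as a function that forwards one query to several engines and merges what comes back. The engines below are hypothetical stand-ins with canned results, and the max_per_engine parameter is an assumption used here to mimic the time-out that keeps a metasearch engine from reading each index in full. This is not how Metacrawler or any other product is actually implemented.

```python
def metasearch(query, engines, max_per_engine=10):
    """Send one query to several engines and merge the results.

    `engines` maps an engine name to a function that takes a query string
    and returns a ranked list of URLs. `max_per_engine` is a stand-in for
    the time-out that limits how much of each index is read.
    """
    merged = []   # URLs in first-seen order
    sources = {}  # URL -> names of the engines that returned it
    for name, run_query in engines.items():
        for url in run_query(query)[:max_per_engine]:
            if url not in sources:
                sources[url] = []
                merged.append(url)
            sources[url].append(name)
    return merged, sources


# Hypothetical engines with canned results, used only for illustration.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2"]


def engine_b(query):
    return ["http://example.com/2", "http://example.com/3"]


urls, sources = metasearch("mobile commerce", {"A": engine_a, "B": engine_b})
print(urls)
print(sources["http://example.com/2"])
```

Duplicates are collapsed into a single entry, so the merged list can be shorter than the sum of the individual result lists.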
Subject directories. Subject directories, such as LookSmart, are best used when searching for general subject matter rather than a specific piece of information. Subject directories arrange Internet resources by subject or category in a hierarchy and provide links to those resources. The databases for these directories are compiled and maintained by humans rather than by spiders. As a result, subject directory databases are typically smaller than those of search engines, but the results tend to be more relevant. While search engines may retrieve all Web pages containing the specified keywords, subject directories typically retrieve only a site's main page. Specialized subject directories are comprehensive directories limited to a particular subject area and compiled by experts in that field.
Effective Search Strategies
Regardless of the search tool selected, an effective search strategy is fundamental to a successful search. Developing an effective strategy requires an understanding of two types of logic: search and Boolean. Students are asked to explore Web sites that focus on instruction for using Internet resources, Web searching and links to other ways of finding information via the Internet. In addition, there are several studies that provide commentary on how various search engines perform (Lawrence and Giles 1999; The Star-Ledger 1999; Search Engine Watch 2001).
Search logic refers to the rules that the engine applies when interpreting the search phrase the user enters. For example, entering the search terms "online payment systems" could be interpreted as returning sites that contain the entire phrase, all of the search terms or any of the search terms. Each returns a very different set of sites. Determining the rules used by each search tool is often difficult. Using trial and error is often a good method for understanding the unique search logic of each search engine. There are also Internet sites that provide convenient information about using search engines, such as Infopeople's "Search Engines Quick Guide" (2001).
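The three interpretations described above can be made concrete with a short Python sketch. The page texts are hypothetical, and matching on whole lowercase words is a simplifying assumption; actual engines apply their own, often undocumented, rules.

```python
def matches(text, query, mode):
    """Check a page's text against a query under one of three search logics.

    mode is "phrase" (the exact phrase), "all" (every term) or
    "any" (at least one term).
    """
    words = text.lower().split()
    terms = query.lower().split()
    if mode == "phrase":
        n = len(terms)
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    if mode == "all":
        return all(term in words for term in terms)
    if mode == "any":
        return any(term in words for term in terms)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical pages, used only for illustration.
pages = {
    "page_1": "secure online payment systems for merchants",
    "page_2": "payment systems that work online",
    "page_3": "online banking news",
}
query = "online payment systems"

for mode in ("phrase", "all", "any"):
    hits = [name for name in sorted(pages) if matches(pages[name], query, mode)]
    print(mode, hits)
```

In this example the phrase interpretation matches one page, the all-terms interpretation two, and the any-term interpretation all three, which shows how the same query can return very different sets of sites.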
Boolean logic refers to the set of logical operators used by search tools to combine search terms. The basic set of Boolean operators includes and, or and not. And returns documents that contain both of the search terms. Or returns documents that contain either search term. Not returns documents that contain the first term, but not the second term. Boolean operators are useful for complex searches.
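The three operators correspond to set operations on the pages that contain each term. A minimal sketch in Python, with hypothetical page names:

```python
# Hypothetical result sets: the pages that contain each search term.
privacy = {"page_1", "page_2", "page_3"}
security = {"page_2", "page_3", "page_4"}

both = privacy & security        # privacy AND security
either = privacy | security      # privacy OR security
only_first = privacy - security  # privacy NOT security

print(sorted(both))        # pages containing both terms
print(sorted(either))      # pages containing at least one term
print(sorted(only_first))  # pages containing the first term but not the second
```

And narrows a search, or broadens it, and not excludes an unwanted topic. The exact syntax for these operators varies from one search tool to another.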
Initial search features. There is a significant difference in the presentation of the initial search pages among various engines. Some, such as Google, are focused solely on searching and provide only this feature. In contrast, others, such as Yahoo!, have created an online community offering features such as news, e-mail and chatting.
Ease of use. Some search engine sites are so complex that viewers may lose track of where they are within the site or find it difficult to stay on the topic. Links to other sites may be related to the original search topic, but some are barely related.
Presentation of results. Not only do search engines differ in the content of the sites that are returned, but also in how results are presented to the user. Most present a short description of the site, which aids in the user's decision to open a page. Some sites allow the user to further narrow their results by searching within the results. A few of the search engines also provide users with a relevancy ranking and shortcuts for the user to find sites similar to those returned during the search.
Once students are introduced to basic information retrieval concepts, we have them work through a hands-on exercise to evaluate the retrieval effectiveness of various search tools (see the "Search Engine Evaluation Exercise" at the end of this article). The topics for the assignment can vary depending on the content of the course. The exercise uses concepts from an MIS course.
The "Search Engine Evaluation Exercise" was conducted over several semesters in an introductory MIS course at Montclair State University in New Jersey. The exercise was given as a homework assignment; therefore, the searches were not conducted simultaneously as they may have been in an in-class assignment. The general consensus from students was that although many of them used search engines prior to this lesson, they did not realize that not all search engines return the same results. Furthermore, they believed that search engines searched the entire Web as it existed at the moment the search was conducted. Once students learned and witnessed the differences among search engines firsthand, they found the concepts of recall and precision useful in their comparison of search engine effectiveness. Students commented that this exercise made them more informed Web information consumers. Students also commented that prior to this exercise they were unaware of the vast number of search engines that exist.
One interesting observation from this exercise is that calculations of recall and precision varied immensely from student to student. These differences can be attributed to both the actual time the searches were conducted and each student's interpretation of relevancy when viewing individual Web sites. Upon completion of the hands-on exercise, students were asked to select their favorite search engine based on their findings. These responses are shown above in Figure 2. It is interesting to note that there was no clear favorite among the students. In choosing a favorite search engine, students relied on more than their calculations of recall and precision. The initial interface, speed of the search tool, presentation of results and other criteria were also mentioned in support of their decision.
Using the Web as a source of information is becoming more popular. There are numerous search tools from which to choose to retrieve information from the Web. Educating our students to be better consumers of the information on the Web is an important goal. The exercise described in this article is a step toward this education. As noted by Lawrence and Giles (1999), the overlap in information between the various engines is relatively low; therefore, one obvious conclusion is that combining the results of multiple engines greatly improves coverage of the Web in searches. Two obvious extensions of this case are to include additional or alternative search tools, and to include additional evaluation criteria.
Whether using the Web as a resource for academic or personal research, skill in searching the Web and locating useful sites can increase the advantages of using this dynamic tool. Students can learn to judge the quality and accuracy not only of the information returned by search engines, but also of the search engines themselves. The "Search Engine Evaluation Exercise" can assist in this goal. One goal of education is to help students develop the capacity for learning in a constantly changing environment. With effective guidance, the Web provides an excellent learning environment for this task.
Search Engine Evaluation Exercise
Researching Online Privacy and Security Using the Internet
Your goal is to research the issue of online privacy and security using the Internet. You will begin by reviewing and evaluating several Internet search engines. Once you have completed your review, you will use the search engine(s) that you find to be best at returning relevant information to find information on online privacy and security. A significant amount of media attention has been paid to the issues of online privacy and security. You are interested in obtaining more information on the issue, as well as finding ways to protect your own information as you access the Web. You have decided to do your research on the Web and are considering four popular search engines - Google, Northern Light, Metacrawler and AltaVista - along with an additional search engine of your choice.
Be sure to read the Blair and Maron article (1985) and "Accessibility of Information On the Web" (Lawrence and Giles 1999) before beginning this part of the assignment. You should also refer to the Web to obtain additional information on search engines, differences between search tools, refining searches and other information about retrieving information from the Internet. Prior to beginning your research, conduct a comprehensive analysis of the five search engines to determine which is the best at finding relevant information. Begin by entering a broad search term for each of the engines to get an idea of the types of sites found. Use "online security threats" to begin your search.
Try to identify the breadth of the search. How many articles does each search engine return? Review the first 20, then comment on the relevancy and types of sites found. Did you get any that do not relate to online security at all? What happens if you redefine your search to include only one word, such as "security" instead of "online security threats"? Can you comment on the recall and precision of the sites each engine returned? Which engine do you prefer to use and why? Write a short (80-100 word) statement comparing and contrasting what you learned from your experience.
Once your evaluation is complete, begin your research on online privacy and security. Use the search engine(s) you identified to perform your search. Be sure to list the Internet address(es) you use. Include the following items in your answer:
1. The number of items found by each search engine.
2. A statement about the items found by each search tool with comments about relevancy, recall and precision.
3. An evaluation of the different search engines with links to two references. These links should be to sites that review or discuss search engines.
4. A statement discussing your experience evaluating the search tools, such as what you learned.
5. A statement describing how you used the Web to obtain information about online security and privacy, along with a list of search engines, sites and terms you used.
6. A brief summary of your findings.
References
"As Old Engines Morph, New Ones Take Searches to Another Level." 1999. The Star-Ledger, Newark, New Jersey, 23 August.
Blair, D.C. and M.E. Maron. 1985. "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System." Communications of the ACM, (28) 3, 289-99.
Infopeople. 2001. "Search Engines Quick Guide." Online: www.infopeople.org/search/guide.html.
Lawrence, S. and C.L. Giles. 1999. "Accessibility of Information On the Web." Nature, (400) 8 July, 107-9.
Murray, B. and A. Moore. 2000. "Sizing the Internet." Cyveillance. Online: www.cyveillance.com/web/downloads/Sizing_the_Internet.pdf.
Search Engine Watch. 2001. "Guide to Search Engines." Online: www.searchenginewatch.com.
This article originally appeared in the 08/01/2002 issue of THE Journal.