Understanding the qualitative differences between the sets of results from different search engines can be a difficult task. How many links must you follow from each list before you can reach a conclusion? We describe a user interface that allows users to quickly identify the most significant differences in content between two lists of Web pages. We have implemented this interface in CenSEARCHip, a system for comparing the effects of censorship policies on search engines.

Introduction

We all understand that different search engines return different results for the same query. The same is true of different countries’ versions of the same search engine. For example, the results returned by Yahoo! United States and Yahoo! Germany will not be the same. This is true even if we restrict our query to English-language pages. Some of these differences may be due to geographic location and culture. Other differences may be the effect of censorship mandated by the local government.

Hundreds of millions of users around the world must contend with state censorship of Internet content. This censorship may occur by executive order or democratic legislation. In either case, users face a problem beyond simply being denied access to information. They cannot be sure when censorship is taking place and how it affects what they see. What reason do you have to believe censors when they tell you what you are and are not seeing? The relationship between a user and the censoring agent is inherently adversarial.

Major search engines now offer censored versions of their services in China, which has caused much controversy (Zeller, 2006). These modified search engines comply with Chinese policies that restrict access to a variety of Web sites. Censored topics range from pornography to Islamic fundamentalism to democracy and human rights (Zittrain and Edelman, 2003a). The U.S. government has been strongly critical of search engines’ cooperation in this censorship (U.S. House of Representatives, 2006; Schrage, 2006). Even so, the major search engines have implemented Chinese regulations to various degrees (Pan, 2006).

The censorship debate is not limited to China. Some Western governments also restrict the activity of search engines. For example, both France and Germany regulate access to sites that contain hate speech or sell Nazi artifacts (Zittrain and Edelman, 2003a). Search engines in those countries must avoid promoting such sites by omitting them from search results (Chilling Effects Clearinghouse, 2007).

Censorship by search engines presents even greater problems for users. Search engines are essential in discovering new sources of information. Therefore, a censored search engine can hide that a blocked site even exists. How can you know what you’re not being shown?

Suppose we have access to both a censored and uncensored version of a search engine. We want to see how the censored version actually affects the results users get. Do these government policies really change the character of search results?

This is a difficult question to answer. It relies on a subjective comparison that will vary widely from person to person. Comparing lists of search results is time-consuming and imprecise. The subject of censorship is also emotionally charged. This makes many suspicious of any study on the topic. Interested users need a way to compare for themselves rather than trusting the judgment of others.

We have developed an interface for comparing lists of search results that addresses these problems. It reduces subjectivity by processing the results using text retrieval techniques. It saves the user time by drawing attention to the features of greatest difference. Finally, it lets users judge for themselves whether those differences are important.

The remainder of this paper describes this interface and how it works. We discuss its implementation at a site we call CenSEARCHip, available at http://homer.informatics.indiana.edu/censearchip/. This site allows users to explore censorship on different versions of the Yahoo! and Google search engines.

Discussion

The CenSEARCHip home page is shown in Figure 1. It allows users to choose a search engine, two countries to compare, and a search query. They can then choose whether to compare the results of a normal search or an image search. When they submit the form, their browser will display the differences between the results using our interface.

Figure 1: The home page of CenSEARCHip.

The system architecture is shown in Figure 2. There is a single Web page that contains the search form and displays our interface. This page uses JavaScript to submit asynchronous queries to two scripts on the server. The first, engine.cgi, handles communication with the search engines. The second, wordbag.cgi, retrieves a result page, parses it, and reduces it to a set of terms. All other processing is done in the user’s browser using JavaScript. This JavaScript code is fully commented and not obfuscated. This lets users verify for themselves that our description of the interface matches its implementation.

Figure 2: CenSEARCHip system architecture.

A session begins when a user selects a search engine (Google or Yahoo!) and the two countries to compare. They then enter a search query. Finally, they choose either an image search or a traditional search. Their browser then uses JavaScript to contact the engine.cgi script. That script reads in the user’s choices and builds query URLs for the two versions of the search engine.

Building the query URLs involves several design decisions. To reduce the burden of our tool on search engines, we limit the number of results to 25 for image searches and 30 for traditional searches. To make the comparison fair, we also request only English-language pages. If we do not do this, the two sets will be in different languages. Their contents will have very little in common. Finally, we remove safe search features that filter out adult content. Asking for additional censorship would not make sense for this tool.
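
These decisions can be illustrated with a short Python sketch. It is not the engine.cgi code: the parameter names (q, count, lang, safe) and the base URLs are hypothetical placeholders, since each real search engine uses its own query syntax. Only the limits of 25 and 30 results, the English-only restriction, and the disabled safe search come from the description above.

```python
from urllib.parse import urlencode

# Result limits stated in the text: 25 for image searches, 30 for traditional ones.
RESULT_LIMITS = {"image": 25, "web": 30}


def build_query_url(base_url, query, search_type):
    """Build one query URL. All parameter names are hypothetical."""
    params = {
        "q": query,
        "count": RESULT_LIMITS[search_type],  # cap results to limit load on the engine
        "lang": "en",                         # request English-language pages only
        "safe": "off",                        # do not ask for extra adult-content filtering
    }
    return base_url + "?" + urlencode(params)


def build_query_pair(base_a, base_b, query, search_type):
    """Build the same query for the two national versions being compared."""
    return (
        build_query_url(base_a, query, search_type),
        build_query_url(base_b, query, search_type),
    )
```

Keeping the two URLs identical except for the national endpoint means that any difference in the results comes from the engine version rather than from the query.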

Suppose the user is comparing countries A and B. Once engine.cgi has result sets for A and B, it filters out the shared results. It gives the browser two sets. Set A is the list of pages that are returned only for country A. Set B is the list returned only for country B. The script also returns the estimated number of hits for both A and B. This is useful because the more censored country will likely have a smaller number.
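
The filtering step amounts to a pair of set differences. The following Python sketch assumes that two results count as shared when their URLs match exactly; the text does not say how engine.cgi decides this, and the function name is ours.

```python
def split_unique_results(results_a, results_b):
    """Return (only_a, only_b), keeping each engine's original ranking order.

    Assumption: two results are "shared" when their URL strings are identical.
    """
    shared = set(results_a) & set(results_b)
    only_a = [url for url in results_a if url not in shared]
    only_b = [url for url in results_b if url not in shared]
    return only_a, only_b
```

For example, if A returns pages x, y, and z and B returns y, z, and w, then Set A contains only x and Set B contains only w.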

Our interface is simple in the case of an image search. On the left side of the page, we show thumbnails of images returned only for A. On the right side, we show thumbnails returned only for B. An example is shown in Figure 3. The user can get an impression of the difference just by looking at the images. This is especially true for controversial topics.

The more difficult case is a traditional text search. In this case, we have two lists of URLs for entire Web pages. The goal is to show the difference between these lists quickly. We don’t want to make users click on the URLs, and we don’t want to place a large burden on the search engine.

Our solution is shown in Figure 4. On each side are the fifty terms most particular to that nation’s results. This is defined as a ratio of term frequencies. If the word football shows up on the left side, then the word football occurs more frequently in A’s results than in B’s. The terms are listed in alphabetical order. The size of each term reflects how exclusive the term is to that nation. If A uses football five times as often as B, but uses soccer only twice as often, then football will be in a larger font. This display is similar to the tag clouds made popular by sites such as del.icio.us (Wikipedia, 2007).

The client first creates two associative arrays for tracking the terms in each set of results. These arrays are indexed by term. The associated value is a count of the number of times that term appears in a result page. For each URL in the sets A and B, the user’s browser submits a query to wordbag.cgi.

The wordbag.cgi script retrieves a single URL. It strips away all scripts, style information, metadata, comments, and HTML tags. Only actual text remains. It then removes punctuation surrounding terms and forces them into lowercase. It forces all terms into the standard ISO Latin-1 character set. This is done to eliminate exotic punctuation such as smart quotes. We do not apply a stemming algorithm to remove suffixes. This is because we will display the terms, and stemming algorithms make the terms hard to recognize. Finally, wordbag.cgi returns the list of terms to the browser.
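
A minimal Python sketch of this reduction, using only the standard library, is given below. It is an approximation under stated assumptions rather than the wordbag.cgi code: it works on HTML that has already been downloaded, it drops characters outside ISO Latin-1 instead of transliterating them, and it applies the character-set step before trimming punctuation so that a smart quote cannot shield an adjacent comma. The names page_to_terms and _TextExtractor are ours.

```python
import string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags and comments never reach this method, so they are dropped implicitly.
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_to_terms(html):
    """Reduce an HTML document to a list of lowercase terms, without stemming."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumption: characters outside ISO Latin-1 (such as smart quotes) are dropped.
    text = text.encode("latin-1", errors="ignore").decode("latin-1")
    terms = []
    for token in text.split():
        term = token.strip(string.punctuation).lower()
        if term:
            terms.append(term)
    return terms
```

Only punctuation at the edges of a token is removed, so internal characters such as the hyphen in a compound word survive.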

When the browser receives a bag of words from wordbag.cgi, it removes very common terms such as “the”, “that”, and so on. It then updates the appropriate associative array. The tally of a term is increased for each occurrence of that term.
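
The tallying step can be sketched in Python as follows, with a Counter standing in for the browser’s associative array. The stop list shown is a tiny illustrative one; the actual list used by the interface is not given here.

```python
from collections import Counter

# Illustrative stop list only; the real list is not specified in the text.
STOP_WORDS = {"the", "that", "a", "an", "and", "of", "to", "in", "is"}


def update_tally(tally, terms, stop_words=STOP_WORDS):
    """Add one page's bag of words to the running tally for its result set."""
    tally.update(term for term in terms if term not in stop_words)
    return tally
```

One tally is kept per result set, so a page from Set A updates only the tally for A.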

Text retrieval tools usually adjust such tallies with an inverse document frequency (IDF) correction to decrease the impact of very general terms. We avoid this because the sets A and B are too small to provide good document frequencies. We could use frequency data from a large body of text, but that would force the user to download a lot of data.

After the tallies are updated, the browser redraws the display. For each term T in array A, it finds the frequency of T in array A and array B. It then finds the ratio of these frequencies and subtracts one. Suppose football is used 10 times out of 100 total terms in A and 5 times out of 200 total terms in B. Then the ratio is (10/100) / (5/200), or 4. Subtracting one then gives us 3.

This score measures how specific T is to nation A. If it is positive, then T is more common in A than B. If it is zero, T is equally common in both sets. If it is negative, T is more common in B than A.
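
The score can be written as a short Python function. The sketch takes the total for each set to be the sum of its tallies. The text does not say how a term that never appears in B is scored, so the sketch assumes it is treated as if it had occurred once (the unseen_count parameter), which keeps the ratio finite.

```python
def specificity(term, tally_a, tally_b, unseen_count=1):
    """Return (frequency of term in A) / (frequency of term in B), minus one."""
    total_a = sum(tally_a.values())
    total_b = sum(tally_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both result sets need at least one term")
    count_a = tally_a.get(term, 0)
    # Assumption: a term absent from B is counted as if it occurred unseen_count times.
    count_b = tally_b.get(term, 0) or unseen_count
    # (count_a / total_a) / (count_b / total_b), rearranged into a single division.
    ratio = (count_a * total_b) / (count_b * total_a)
    return ratio - 1


# The worked example from the text: 10 of 100 terms in A, 5 of 200 terms in B.
tally_a = {"football": 10, "other": 90}
tally_b = {"football": 5, "other": 195}
score = specificity("football", tally_a, tally_b)  # ratio 4, score 3
```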

Once every term in A has a score, the browser takes the 50 terms with the highest scores. It arranges them in alphabetical order. It also calculates a point size for each term. This is done on a linear scale. We limit the maximum size to 30 points. We also adjust the scale so that the smallest term will be the same size on both sides. This makes the output more intuitive to users.
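
The selection and sizing steps might look like the following Python sketch. The 50-term limit and the 30-point maximum come from the text. The 10-point minimum is a hypothetical value, and mapping each side’s lowest and highest selected scores onto that range is our assumption about how the linear scale is anchored.

```python
MAX_POINTS = 30  # maximum size stated in the text
MIN_POINTS = 10  # hypothetical minimum; the text gives no value


def layout_terms(scores, limit=50, min_points=MIN_POINTS, max_points=MAX_POINTS):
    """Return [(term, point_size)] for the top-scoring terms, in alphabetical order."""
    top = sorted(scores, key=scores.get, reverse=True)[:limit]
    if not top:
        return []
    lowest = min(scores[term] for term in top)
    highest = max(scores[term] for term in top)
    span = highest - lowest
    sized = []
    for term in sorted(top):
        # Linear interpolation between the minimum and maximum point sizes.
        fraction = (scores[term] - lowest) / span if span else 0.0
        sized.append((term, min_points + fraction * (max_points - min_points)))
    return sized
```

Because both sides share the same minimum point size, the smallest term on the left is drawn at the same size as the smallest term on the right.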

The process is repeated for array B. The browser then uses dynamic HTML to update the display. This lets the user see the interface being updated in real time as more results come in. This real-time update is an important feature. Most Web browsers allow only two connections to a single server, so we can process only two result URLs at a time. Without the update feature, the user might have to wait several minutes before seeing results.
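
The same pattern, two requests in flight with a redraw after each reply, can be sketched outside the browser in Python. This is an analogy rather than the interface’s own code: fetch_terms is a hypothetical callable standing in for a request to wordbag.cgi, and on_update stands in for the step that updates the tallies and redraws the display.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_results(urls, fetch_terms, on_update, max_connections=2):
    """Fetch term lists at most max_connections at a time, reporting each on arrival."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        futures = {pool.submit(fetch_terms, url): url for url in urls}
        for future in as_completed(futures):
            # Called once per finished page, so the display can be refreshed
            # without waiting for the remaining pages.
            on_update(futures[future], future.result())
```

Pages are reported in the order they finish, not the order they were requested, so a slow site delays only its own contribution.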

To illustrate how the interface works, let us look at three examples. Figures 3 and 4 show the effects of Chinese political censorship. We use the query Tiananmen Square, in reference to the pro-democracy protests of 1989. When we search for images (Figure 3), we see photos of tanks and protests on the right. This is the U.S. version of Google. Most images on the left do not suggest any violence. This is the Chinese version of Google. We then do a text search on the same topic (Figure 4). Once again, we see clear references to the events of 1989 on the right. Most terms on the left deal with tourism.

Figure 5 shows the effect of European censorship laws. We examine the German law related to the sale of Nazi artifacts. If we search the U.S. version of Yahoo! for German military decorations, we get many keywords associated with online sales. The German version reflects a more general discussion.

Figure 5: Traditional search comparing German and American versions of Yahoo! The search query is iron cross purchase.

Conclusion

Our interface is one of several approaches developed to visualize the effects of Internet censorship. Other researchers have also developed tools, especially for the Chinese version of Google.

One such tool (Smith, 2006) shows the two sets of results in separate frames. This has the advantage that the user’s browser can contact Google directly. There is no potentially untrustworthy intermediary. However, the interface does not focus on the differences between the sets. The user receives raw data but not much information.

Another tool (Langreiter, 2006) uses a Flash interface. This interface compares the relative positions of the sites common between the result sets. This is effective in showing differences in ranking. However, the interface does not consider differences in content. It measures difference in emphasis rather than restriction of information.

We are pleased with the interface itself. In the future we will use this visual comparison technique to allow users to compare different search engines in other domains, such as shopping, news, recipes, and so on. This effort will involve tailoring the comparison to the distinctions most interesting to the user. For example, in our case study, not all differences show the work of censors. We need a way to distinguish between differences caused by censorship and differences caused by search algorithms. You might argue that the Chinese Google is just being practical in our example. Tourist information on Tiananmen Square is relevant for Chinese citizens who visit often. American citizens cannot travel there without a visa. Is the distinction because of censorship or utility? How can we decide?

About the authors

Mark Meiss is a doctoral student in Computer Science at Indiana University, Bloomington. His concentrations are in artificial intelligence and data networks. His dissertation work involves analysis of Internet flow traffic and application networks.

Filippo Menczer is an associate professor of Informatics, Computer Science, and Cognitive Science at Indiana University, Bloomington. Currently he is on sabbatical as a Lagrange Senior Fellow at the Institute for Scientific Interchange Foundation in Torino, Italy. He works on modeling and applications of complex information, communication, and social networks.

Acknowledgments

The authors would like to thank the Advanced Network Management Laboratory at Indiana University for its development support and the School of Informatics at Indiana University for hosting and promoting CenSEARCHip. We also thank the search engines themselves for handling the queries our system generates.

References

Elliot Schrage, 2006. Testimony of Google Inc. Before the Subcommittee on Asia and the Pacific, and the Subcommittee on Africa, Global Human Rights, and International Operations, United States House of Representatives, at http://www.internationalrelations.house.gov/, accessed 1 May 2007.

U.S. House of Representatives. Committee on International Relations. Subcommittee on Africa, Global Human Rights and International Operations and Subcommittee on Asia and the Pacific, 2006. The Internet in China: A tool for freedom or suppression? at http://www.internationalrelations.house.gov/, accessed 1 May 2007.