Web Graph Analysis in Perspective: Description and Evaluation in terms of Krippendorff's Conceptual Framework for Content Analysis (version 1.0)
by Kenneth Farrall, based on work in Krippendorff's Content Analysis class, Fall 2004, Annenberg School for Communication, University of Pennsylvania.
last updated October 1, 2005, please do not quote without permission
permanent url: http://farrall.org/papers/webgraph_as_content.html
questions? contact kfarrall AT gmail DOTCOM

A new method for studying Internet-mediated social networks, web graph analysis, is gaining popularity among Internet researchers. Web graph analysis involves the study of link patterns emerging between documents and web sites on the World Wide Web. Link patterns are described formally using the mathematical language of graph theory. Inferences from these patterns are made using analytical constructs generated largely within the tradition of social network theory. Because of its relative newness, however, numerous questions about the validity and ultimate value of this tool remain.

This paper uses Krippendorff's (2004) model for content analysis as a prism through which to evaluate the strengths and weaknesses of web graph analysis and to use it effectively in research. The paper begins with a brief overview of Krippendorff's conceptual framework and then proceeds with a step-by-step evaluation of web graph analysis in terms of this model. The evaluation begins with a brief discussion of the context in which the method appears to have use, discusses the mechanics and reliability of the data-making process, introduces the analytical constructs of social network analysis on which web graph analysis is based, and closes with an extensive discussion of validity issues.

Krippendorff defines content analysis as "a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use." His framework for content analysis includes the following conceptual components: 1) a body of text; 2) a research question; 3) a context within which to make sense of the body of text; 4) an analytical construct; 5) inferences and 6) validating evidence.

In order to draw inferences from a body of text, the content analyst must engage in the process of data making, the creation of "computable data from raw or unedited texts" (p. 18). Data making consists of four components: 1) unitizing; 2) sampling; 3) recording/coding; and 4) reducing. The act of recording and reducing data is defined by the data language, which consists of variables and their values, constants whose operational meanings are fixed within the data language, a grammar, and a logic. Data generated in the process of data making are expected to be reliable, meaning that they are essentially stable, reproducible, and accurate.

Analytical constructs provide a formal procedure for the analyst to draw inferences from the data and in their simplest form can be thought of as if-then statements. They are based firmly within a theoretical context and within the boundaries of the particular research question. Constructs may take many forms, including extrapolations, applications of standards, and indices. Sources of certainty for these constructs derive from, among others, expert knowledge and established theory. Sources of uncertainty for analytical constructs include limitations of the theory on which a construct is based, improper application of theory, and the potential for published results to affect the research context and weaken subsequent findings.

Content analyses are expected to be validatable in principle. Although the particular circumstances of the inferences, such as statements about events in the past or analysis of events in an enemy country during wartime, may make validation impractical or even impossible, they must contain assertions that go beyond simply concluding the text in question has certain content. An example of an invalidatable analysis would be a conclusion that the number of occurrences of the word "patriot" in network news broadcasts increased dramatically after September 11th, without some reference to the context of this word and the meaning of such an increase. The validity of a content analysis can be evaluated in terms of its content, its internal structure, and its relationship to external variables.

In general, web graph analysis draws sociological inferences based on the local structure of the web graph. Described formally using the mathematical language of graph theory, the web graph is a directed graph whose vertices consist of all publicly available documents on the Web and whose edges are determined by the links (in the form of URLs) connecting these documents (Kumar et al., 2000). The web graph's local structure consists of the graph surrounding particular documents or web sites of interest. Web graph analysis is usually understood to fit within the theoretical domain of social network analysis (SNA). SNA assumes that the structure of relationships among interacting units, be they people, organizations or countries, can tell us a great deal about the nature of these relationships.
Web graph analysis may be understood in terms of its relationship to two more established forms of content analysis, citation analysis and semantic network analysis. All three methods share a data language based on the formation of triplets and their assemblage into complex patterns of relation in a graph.

Citation analysis shares a great deal in common with web graph analysis. Both methods reduce the body of texts under study to the directed, but otherwise value and attribute-free, indications of association between individual documents, and draw inferences from the patterns of the resulting networks to the population of individuals that create these documents. Analytical constructs developed for citation analysis, however, deal with a much narrower range of texts (peer-reviewed academic articles and books) as well as a narrower range of possible types of association between documents, than does web graph analysis. For this reason, citation analysis could be considered a special case of web graph analysis, at least in terms of the variance that must be dealt with.

Semantic network analysis differs from web graph analysis particularly in the range of attributes that might be assigned to connections within the graph. Kleinnijenhuis et al. (1997), for example, connect two concept nodes based on five predicate types: evaluative, reality, action, causal, and affective. Further, semantic networks are generally assumed to represent abstract knowledge or the social construction of knowledge rather than the existence of social networks, although this is not always the case. For example, Johnson and Krempel (2004) recently used a type of semantic network analysis, Centering Resonance Analysis, to make inferences about behind-the-scenes relationships between key figures in the Bush White House after the September 11th terrorist attacks. A specific implementation of web graph analysis with which we will deal in some detail, issue network analysis (Rogers and Marres, 2000), focuses on those areas of the web graph produced by organizations and individuals dealing with civic and political issues. Issue network analysis is well suited for identifying key actors (government bodies, NGOs, corporations, individuals) within a particular issue space, their interrelationships, and their orientation towards actors and institutions within the broader social space. Further, the method can help answer questions about how certain political issues cluster together in the public sphere and how actors might serve as bridges between social groups with different or even opposing issue focus.

The specific context of an issue network analysis (or any form of web graph analysis for that matter) plays a particularly important role in the process of making inferences. Although the process of analysis begins with the structure of a network graph, the nodes of the graph are anchored to their textual content, to which the analysts must frequently refer. The actual analysis, including the formation of analytical constructs, often proceeds back and forth between the units of the web graph and the context units of the site content, in a hermeneutic circle.

Krippendorff distinguishes between sampling units, recording units, and context units. Recording units differ from sampling units as a function of the data reduction that occurs in the data-making process. In the case of web graph analysis, the sampling unit is, at its most basic level, an individual web page. The web document, in the process of data reduction, is reduced to the recording unit of the individual URLs, or links, which appear in the document. The context unit, on which the analysts must depend to make valid inferences from the web graph, consists of the text of the un-reduced web document. Depending on the nature of the analysis, the context unit may be expanded to include the web site where the document is stored, and the other sites to which this document links.
A full definition of the sampling unit cannot be provided independent of the sampling and subsequent data-making process itself, as the two are connected. This will become clearer as the rest of the data-making process is explained below.

In general, web graph analysis gathers data using some form of snowball sampling. The specifics of this sampling process vary according to the particular implementation of web graph analysis. For the purposes of this paper, we focus on issue network analysis, briefly introduced in the context section, above.
The Issue Crawler software developed for issue network analysis builds the web graph from a seed of URLs provided by the researcher. The seed is expected to include significant websites within the issue of interest. The software scans the seed documents for links pointing to external domains and stores these links in a matrix. Any link that is not present in at least two of the seed documents is thrown out. The linked documents that remain are then scanned again for external links, with the same criterion for discarding solitary links (a process known as co-link analysis). The process is usually carried out two or three times (iterations) and may also involve the retrieval of deeper links within domains (depth).
This interactive process (the software crawler interacts with the public documents on the web) results in a square association matrix. In this matrix, every document that has been retained by the crawler appears in both the rows and columns. Each potential pair of URLs (links) is represented by two cells in the matrix, one for each direction. The value within each cell is an integer: 0 means there is no link between the two documents in the given direction, while larger integers indicate the number of links in that direction.
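The co-link filtering step described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Issue Crawler's actual code; the function name and data shapes are assumptions:

```python
# Illustrative reconstruction of one co-link iteration; the function
# name and data shapes are assumptions, not Issue Crawler's actual API.

def colink_iteration(outlinks):
    """Keep only link targets that appear in at least two source pages."""
    counts = {}
    for source, targets in outlinks.items():
        for t in set(targets):              # count each source once per target
            counts[t] = counts.get(t, 0) + 1
    return {t for t, c in counts.items() if c >= 2}

seed_links = {
    "a.org": ["x.org", "y.org"],
    "b.org": ["x.org", "z.org"],
    "c.org": ["y.org", "x.org"],
}
retained = colink_iteration(seed_links)
print(retained)  # x.org (3 sources) and y.org (2) survive; z.org is dropped
```

In a full crawl, the retained pages would themselves be fetched and scanned, and the same filter applied again for each iteration.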

From the raw association matrix, data is generated using the data language of graph theory. A graph consists of nodes (also, vertices) and the edges/arcs that connect them, which can be either directed, pointing like an arrow from one node to another, or undirected. Each document (URL) in the association matrix defines a node in the graph. The graph's edges are defined by the non-zero cells in the matrix. In the case of the web graph, the edges have a direction. Graph theory metrics for association matrices are often based on undirected graphs which can then be generalized to include directed graphs. The distinction is important when inferences are being made from graph patterns and will be discussed in detail below.
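The translation from association matrix to graph-theoretic data can be illustrated with a short sketch; the labels and matrix layout below are hypothetical:

```python
# A minimal sketch of reading directed, weighted edges out of a square
# association matrix; labels and values are illustrative only.

def edges_from_matrix(labels, matrix):
    """Return a (source, target, weight) edge for every non-zero cell."""
    return [(labels[r], labels[c], matrix[r][c])
            for r in range(len(labels))
            for c in range(len(labels))
            if matrix[r][c] > 0]

labels = ["a.org", "b.org"]
matrix = [[0, 2],   # a.org links to b.org twice
          [1, 0]]   # b.org links back once
print(edges_from_matrix(labels, matrix))
# [('a.org', 'b.org', 2), ('b.org', 'a.org', 1)]
```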

Degree (d) - Defined at the nodal level, degree refers to the total number of edges connecting an individual node. When directed edges are taken into account, a node can have an in-degree, the number of edges directed toward the node, and an out-degree, the total number of edges directed from the node to other nodes.

Geodesic (g) - Generally defined for two selected nodes in the graph, the geodesic refers to the smallest number of edges that must be traversed to get from one node to the other. In a directed graph, a geodesic can also be defined directionally.

Nodes (n) - The total number of nodes in the graph.

Using the basic metrics of degree, geodesic, and total nodes, more complex metrics can be defined that describe structural characteristics of the graph at multiple levels of analysis, from the position of single nodes to the graph as a whole. Metrics of particular interest include degree centrality, closeness centrality, betweenness centrality, density, network centrality, and cliques.

Degree centrality - In its simplest terms, degree centrality is defined as a node's degree. If a node in an undirected graph shares 20 edges with other nodes, it has a degree centrality of 20. If the node has 15 directed edges pointing toward it, it has an in-degree centrality of 15. To allow for easier comparison between graphs, however, this number is usually normalized based on the total number of nodes in the graph. So, if d is the degree (in, out, or undirected) of a given node and N is the total number of nodes, normalized centrality is d/(N-1).
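Normalized degree centrality can be computed directly from an adjacency structure. The sketch below is not tied to any particular SNA package; it assumes a directed adjacency dict and counts a neighbor once whether the tie runs in, out, or both ways:

```python
# A minimal sketch, assuming an adjacency dict mapping each node to
# {target: link count}; not tied to any particular SNA package.

def normalized_degree(adj, node):
    """Degree centrality d/(N-1), counting a neighbor once whether
    the tie runs in, out, or both directions."""
    d = sum(1 for other in adj if other != node and
            (adj[node].get(other, 0) > 0 or adj[other].get(node, 0) > 0))
    return d / (len(adj) - 1)

adj = {"a": {"b": 1, "c": 1},   # a links to b and c
       "b": {"a": 1},           # b links back to a
       "c": {}}                 # c has no outbound links
print(normalized_degree(adj, "a"))  # 1.0 -- tied to both other nodes
print(normalized_degree(adj, "c"))  # 0.5 -- tied only to a
```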

Closeness centrality (Freeman, 1978) - Closeness centrality (Ci) for a given node i is defined as the sum of the geodesics between that node and all other nodes (j) in the graph. Defined this way, it is an inverse measure; the higher the closeness centrality score, the less central the node. When visualized in a two-dimensional graph, nodes with low closeness centrality scores move towards the middle of the graph. Closeness centrality is calculated in terms of an undirected graph.
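The sum-of-geodesics measure defined above (sometimes called "farness") can be computed with a breadth-first search. The sketch below assumes small undirected graphs given as adjacency-list dicts:

```python
from collections import deque

# Computes the inverse closeness measure defined in the text (the sum
# of geodesic distances from a node); an illustrative sketch only.

def farness(neighbors, i):
    """Sum of geodesic distances from node i to all reachable nodes."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values())

# Path graph a - b - c: b is the most central node (lowest score)
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(farness(path, "b"))  # 2  (distance 1 to a, 1 to c)
print(farness(path, "a"))  # 3  (distance 1 to b, 2 to c)
```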

Betweenness centrality (Freeman, 1978) - Betweenness centrality measures the position of a given node as a mediator of geodesic paths between other nodes in the graph. It is defined as the number of times any given node, i, needs node k (the subject of the measurement) to reach any other given node, j, along a geodesic path. If gij is defined as the number of geodesic paths between i and j and gikj as the number of these geodesics that pass through k, k's betweenness centrality is defined as:

Cb(k) = Σ(i<j) gikj / gij

If the edges of the graph are assumed to mediate flows of any kind (information, for example), the removal of nodes with high betweenness centrality will most dramatically restrict these flows. As is the case for closeness centrality, betweenness centrality is generally defined in terms of an undirected graph.
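Freeman's betweenness measure can be sketched for small undirected graphs with two breadth-first passes, one to count geodesics (gij) and one to test which geodesics pass through k. This is an illustrative brute-force implementation, not tied to any SNA package:

```python
from collections import deque

# A sketch of Freeman's betweenness for small undirected graphs given
# as adjacency-list dicts; illustrative, not an optimized algorithm.

def bfs_counts(neighbors, s):
    """Geodesic distances and shortest-path counts from source s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(neighbors, k):
    """Sum over pairs i < j of the fraction of i-j geodesics through k."""
    dk, sk = bfs_counts(neighbors, k)
    others = sorted(n for n in neighbors if n != k)
    total = 0.0
    for a, i in enumerate(others):
        di, si = bfs_counts(neighbors, i)
        for j in others[a + 1:]:
            # k lies on an i-j geodesic iff d(i,k) + d(k,j) == d(i,j)
            if j in di and k in di and di[k] + dk[j] == di[j]:
                total += si[k] * sk[j] / si[j]
    return total

# Star graph: the center c mediates every path between the three leaves
star = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"]}
print(betweenness(star, "c"))  # 3.0 -- one geodesic per leaf pair
print(betweenness(star, "x"))  # 0.0 -- leaves mediate nothing
```

Removing the high-betweenness center in this example disconnects the graph entirely, illustrating the flow-restriction point made above.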

Network Density - Network density is defined simply as the ratio of all actual ties within a network to all possible ties. In an undirected network, the number of possible ties is defined as n*(n-1)/2. In a directed network, the number of possible ties is simply n*(n-1). This normalized value can range from 0 to 1.
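The directed-network case of the density formula can be shown in a few lines; the adjacency layout below is an illustrative assumption:

```python
# Directed density for an adjacency dict mapping node -> {target: links};
# each non-zero cell counts as one tie, per the definition in the text.

def density(adj):
    """Actual directed ties divided by the n*(n-1) possible ties."""
    n = len(adj)
    ties = sum(1 for u in adj for w in adj[u].values() if w > 0)
    return ties / (n * (n - 1))

net = {"a": {"b": 1}, "b": {"a": 1, "c": 2}, "c": {}}
print(density(net))  # 3 ties out of 6 possible directed ties -> 0.5
```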

Centralization - Network (or graph) centralization measures the differences between the centrality of the most central node and that of all the other nodes. The larger the relative differences, the more centralized the graph. The formal definition of centralization varies according to one's choice of degree, closeness, or betweenness centrality, and in the interest of brevity these definitions are not included here.

Cliques - The most basic definition of clique refers to three or more nodes which form a maximal complete sub-graph. In this case, each node in the clique shares a connection with every other node, such that the sub-graph has a density of 1. Since this is often considered too restrictive, many other definitions have arisen, including the N-clique, the N-clan, the K-plex and the K-core (Hanneman, 1998). Each of these definitions relaxes the definition of clique in some way to identify sub-groups which cohere without exhibiting the perfect interconnectivity of the maximal complete sub-graph.
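The strict, maximal-complete-sub-graph definition can be enumerated with the classic Bron-Kerbosch algorithm. The sketch below is the unpivoted textbook variant for small undirected graphs, not tied to any particular SNA package:

```python
# Maximal complete sub-graphs via the classic (unpivoted) Bron-Kerbosch
# algorithm; the graph is a dict mapping each node to its neighbor set.

def maximal_cliques(neighbors):
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))    # r is maximal: nothing extends it
        for v in list(p):
            expand(r | {v}, p & neighbors[v], x & neighbors[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(neighbors), set())
    return cliques

# Triangle a-b-c plus a pendant edge c-d
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(sorted(sorted(c) for c in maximal_cliques(g)))
# [['a', 'b', 'c'], ['c', 'd']]
```

Note that single edges such as c-d count as maximal complete sub-graphs here; the three-or-more-nodes restriction in the definition above would filter them out.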

For the purposes of web graph analysis, the specific numeric values of the individual metrics defined above are often of less importance than the more general, overall structure of the graph. To facilitate this more general interpretation of graph structure, the Issue Crawler software visualizes the association matrix using the ReseauLu clustering algorithm developed by Aguidel, an analytical solutions company in Paris, France. The visualization allows the analyst to quickly assess, in relative terms, the metrics defined above. In most cases the visualization is sufficient for the analysis. When specific comparisons between nodes, clusters, or entire graphs are necessary, the researcher can refer back to the specific metrics.

Reliability is generally not a problem in computer aided text analysis. A computer, given specific instructions, should execute them in exactly the same way each time. As the Issue Crawler scans a given web page, links are either present or not and they point to one and only one location. Since human coders are not involved in the crawling process, there should be a consistent result each time given the same crawling parameters.

The problem, however, is that the conditions under which crawls take place can vary in ways the analyst cannot control. Crawls may take anywhere from a few hours to a few days, and during this time, the pages being read may be changing. This "snapshot" of the web graph is thus blurred slightly across time, introducing systematic, not random, noise into the data. Halavais (2003) notes:

In collecting information from the weblogging world ... a "crawl" of several thousand pages might take several hours, and therefore is not, exactly, a snapshot, but an estimation of content over that period. Faster crawls are naturally more desirable, but given the ethical restrictions on crawling a data set too quickly, a balance must be struck. Researchers should report the entire period over which data was collected and, whenever practicable, make that data available to other researchers.(p.4)

If the network being measured changes during the process of crawling, the map will blur, getting fuzzier as the crawl time increases. In practice, however, the issue of web graph reliability turns out to be a minor one. Over short periods of time links generally do not change significantly. The general structural characteristics of an issue network tend to remain stable for several months at a time. Further, as the computing power for web crawls has grown, the average time for a crawl has dropped from several hours to about twenty minutes. At these speeds, the difference between a perfect snapshot of the local web graph and its estimation over the given crawl period approaches insignificance. For particularly dynamic networks, where links are changing in response to a real time event, the "time blur" might still be an issue, but such issue networks have yet to be encountered in the field.

From the patterns discerned within the graph, analytical constructs are used to make inferences about the networks of individuals and organizations that co-produce and read the documents comprising the graph. These inferences necessarily involve an extrapolation from the domain of electronic text to the domain of people and their social networks. As such, the development of analytical constructs for web graph analysis represents an extension of social network theory, where the nodes of the graph are traditionally individuals or organizations.

Social Network Analysis (SNA) studies the patterns of relations among individuals, organizations, and other social groups such as states (Wasserman and Faust, 1994). The primary assumption of SNA is that inferences about the nature of relationships between people, within an organization, and between clusters of larger social groups can be made from the structure of relationships alone. The unit of analysis is the relation rather than specific attributes of the people who form these relations. Further, SNA attempts to discern certain individual attributes - one's general role within an organization, for example - solely from one's position within the network. Inferences that have been made from social network patterns include an individual's relative power within an organization (Brass, 1984), one's chances of getting a job or being promoted (Granovetter, 1973; Burt, 1992), a group's relative efficiency in completing a complex task (Shaw, 1964), the overall level of conflict within an organization (Nelson, 1989), and the likelihood of engaging in unethical behavior (Brass, Butterfield, & Skaggs, 1998).

SNA is conducted at several levels of analysis: individuals, dyads, triads, groups, and the entire network. Social networks that manifest a single type of relation, such as the exchange of goods or friendship, are defined as uniplex networks. Multiplex networks, on the other hand, encompass two or more relations. The majority of social network analyses to date are studies of uniplex networks, despite the widely noted desirability of multiplex analyses (Wasserman and Faust, 1994; Monge and Contractor, 2003).

Inferences made about the importance and implication of particular network patterns are most valid when made on the basis of a specific operational theory. The theory is expected to explain some mechanism through which the network structure in question constrains or enables relevant aspects of the social system and its actors. Some common operational theories include social capital theory, structural holes, social learning theory, and social exchange theory. As Monge and Contractor (2003) note, a weakness in much of the literature in social network analysis is that inferences are often drawn from network structure without explicit reference to an operational theory. We address this issue explicitly below when considering the adaptation of analytical constructs from SNA for use in web graph analysis. First, we review some of the most common analytical constructs in social network theory according to the network parameters on which they are based.

In simplest terms, a clique is a group of nodes that share more connections among themselves than they do with the rest of the network. A clique is likely to represent a community of some form, in which shared values, goals, or physical location lead the members of the clique to engage more with each other than with the "outside world." A clique might be found among four or five office workers who always gather around the water cooler to chat and often get together after work, for example. Coleman (1988) based his theory of social capital on closed, dense social networks. Coleman argued that cohesive networks, in which there is significant interconnection between actors, improve access to information for all actors and provide an environment where social sanctions and rewards can effectively emerge. Such "closed" networks, he argued, are thus better suited for the generation of social capital.

Centrality measures within social networks have been used to infer social parameters including power, prestige, prominence, and importance. Actors who occupy a central position within a network have greater control over relevant resources and enjoy a wide range of benefits and opportunities unavailable to those on the margins (Brass, 1992; Ibarra, 1993). For example, Brass (1984) found that degree, closeness, and betweenness centrality in workflow, communication, and friendship networks correlated with power in business settings. Raven (1965) found that central actors more readily acquire information resources that allow their opinions to become influential than do actors on the periphery.

Mark Granovetter (1973) argued that weak ties are important aspects of social structure through which novelty, such as information important for finding a new job, is likely to flow. Burt (1992) took Granovetter's argument a step further with the concept of "structural holes." Structural holes can be found in the vicinity of nodes with high betweenness centrality that mediate connections between two or more otherwise isolated cliques. In contrast to Coleman's view of social capital as inherent in closed networks, Burt argued, and empirically demonstrated, that open networks connected via bridges across structural holes carry more consistent social benefits.

Krippendorff (2004) classifies empirical validity questions into a three-fold typology: those that deal with the text of the analysis itself (content), those that deal with the logic of abductive inferences made from that text (internal structure), and those that deal with questions about the results themselves (relations to other variables). Within the broad category of content validity are sampling validity and semantic validity. Semantic validity reflects the degree to which categories within the analysis accurately reflect the meanings and uses of those categories within the chosen context. Sampling validity can be further resolved into sampling validity of members, the extent to which the sample represents the population from which it was drawn, and of representatives, the extent to which the sample represents "a population or phenomena other than that from which it is drawn." Within the classification of internal structure are two sub-categories: structural and functional validity. Structural validity refers to the degree to which the analytical construct is applicable and reflects the relations in the chosen context, while functional validity refers to the degree to which the analytical construct is vindicated in use.

We begin this section by addressing the issues of convergent and functional validity. The functional validity of one class of web graph inferences, authority, has been demonstrated with the success of web search engines such as AltaVista and Google. Convergent validity can be observed in the cases of both authority and community inferences.

In the landmark paper "Authoritative sources in a hyperlinked environment," Kleinberg (1998) offers an algorithm for determining authoritative documents based on their surrounding link structure. The algorithm assesses documents based on the analytical construct that documents more frequently cited by other documents tend to be viewed as more authoritative. Hubs, web documents that link to a large number of authoritative pages within a narrow topical domain, allow authoritative pages on crude oil, for example, to be separated from those on olive oil. Kleinberg's HITS algorithm was validated through wide use by many early search engines. The algorithm represented a dramatic improvement in a search engine's ability to return documents a human being might find useful and relevant to a query. A major assumption the algorithm makes is that "the creator of page p, by including a link to page q, has in some measure conferred authority on q." (p.2)

The means by which interaction with a link structure can facilitate the discovery of information is a general and far-reaching notion, and we feel that it will continue to offer a range of fascinating algorithmic possibilities. (Kleinberg 1998, pp. 28-29)

In related work with Gibson and Raghavan, Kleinberg (1998) surmised that "web communities" dedicated to broad topics of interest consist of "authorities" that are highly referenced on the topic and "hubs" that pull these communities together:

Our analysis of the link structure of the WWW suggests that the on-going process of page creation and linkage, while very difficult to understand at a local level, results in structure that is considerably more orderly than is typically assumed. Thus it gives us a global understanding of the way in which independent users have built connections to one another, and a basis for predicting the way in which on-line communities in less computer-oriented disciplines will develop as they become increasingly "wired." Gibson, Kleinberg, and Raghavan (1998, n.p.)

IBM researchers Kumar et al. (1999) developed a more complex definition that sought to identify a wider range of web communities, including those still in the process of emerging. Without restricting their web community definition to the presence of highly central hubs and authorities, they established a simple, graph-based definition of a community core:

Web communities are characterized by dense directed bipartite subgraphs. A bipartite graph is a graph whose node set can be partitioned into two sets, which we denote F and C. Every directed edge in the graph is directed from a node u in F to a node v in C. A bipartite graph is dense if many of the possible edges between F and C are present.
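A complete bipartite core of this kind can be enumerated by brute force for tiny graphs. The sketch below illustrates the idea behind Kumar et al.'s definition; the function name and data are assumptions, not their trawling algorithm:

```python
from itertools import combinations

# A brute-force sketch of Kumar et al.'s community-core idea: find
# complete (i, j) bipartite cores, where each of i "fan" pages (F)
# links to all j "center" pages (C). Illustration only.

def bipartite_cores(outlinks, i, j):
    cores = []
    for fans in combinations(sorted(outlinks), i):
        # centers every chosen fan links to
        shared = set.intersection(*(set(outlinks[f]) for f in fans))
        for centers in combinations(sorted(shared), j):
            cores.append((fans, centers))
    return cores

links = {"f1": ["c1", "c2"], "f2": ["c2", "c1"], "f3": ["c3"]}
print(bipartite_cores(links, 2, 2))
# [(('f1', 'f2'), ('c1', 'c2'))]
```

At web scale this enumeration is infeasible, which is why Kumar et al. relied on pruning heuristics; the sketch only shows what counts as a core.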

By noting the occurrence of "co-citation," when two different websites are cited in the same place, the researchers discovered they could "extract all communities that have taken shape on the Web, even before the participants have realized that they have formed a community."(p. 405)

Despite their use of the term, the identification of "web communities" within the computer science literature is best understood as a form of semantic community, in which documents form clusters of common meaning, rather than a true community of people. In certain contexts, however, making an inference from document communities to true social communities and the extended networks in which they interact can be justified.

An increasing number of social organizations with civic and political orientations maintain a presence on the web. Large scale environmental organizations such as Greenpeace and the Sierra Club, human rights organizations such as Amnesty International, and online privacy organizations such as the Electronic Privacy Information Center, maintain extensive, active presences on the web. At the other end of the scale, small, informal organizations of people dedicated to particular political issues are also establishing web presences. For example, a group of citizens concerned about the presidential candidacy of Arnold Schwarzenegger and the constitutional amendment it would require recently launched a website, arnoldexposed.com. Most of these web sites maintain external links which often attest to their awareness of and relationship with other actors in the same issue space. Although nodes within the web graph do not represent individual people, they often represent social organizations, suggesting that much of the social network literature on inter-organizational networks may apply to the study of the web graph.

Rogers and Marres (2000) have developed an operational theory to describe the mechanism of interlinking between issue-oriented web actors they call issue network analysis. In their conceptualization, when issue network actors engage in the process of giving and receiving hyper-links, they position and are positioned by other actors in the network (Rogers, 2002). By mapping the local web graph, the analysts can discern the overall topological structure of actors within a particular issue network and the overall landscape of the discourse. Through analysis of the content of these sites one can gauge the character of the discourse as well.

Numerous studies of link structure over the past several years have empirically validated inferences made from community-type web graph patterns to the structure of the social networks behind them. For example, Terveen and Hill (1998) found that the in-degree connectivity of web sites was highly correlated with expert judgments of credibility. Palmer et al (2000) used the number of in-links for e-commerce web sites as indicators of consumer trust and found that high numbers of in-links correlated with other indicators of trust such as reference to trusted third parties and the presence of privacy statements. Park, Kim, and Barnett (2000) analyzed the hyperlink structure of political web sites in South Korea and found that the link structures the sites formed were highly correlated with the party identification of the site organizers. Adamic's (2001) link analysis of homepages at both MIT and Stanford University found that patterns of connection between homepages were very similar to the small world social network patterns regularly found in the real world. For example, the distance between any two homepages was small, averaging 6.4 hops in the MIT network, mirroring Milgram's six degrees of separation. Further, homepages tended to exhibit a high degree of clustering, much like personal networks in the real world, and a small number of homepages received a high number of links from other pages, mirroring the power law structure that is seen in real world social networks.

Two questions about the validity of web graph analysis will be addressed in this section:

1. Are analytical constructs based in social network theory valid when they are used to make abductive inferences from the domain of electronic texts to the domain of social networks?

2. Are the web graphs generated in the process of crawling issue spaces on the web truly representative of the issue network in question? Does the resulting web graph vary significantly with small changes in the seed, such that valid inferences from the web graph structure itself become impossible?

Within this validity typology, question 1 concerns structural validity, while question 2 comprises both sampling and semantic validity, or, more generally, content validity. We address these questions below in the order that they are presented above.

1. Are analytical constructs based in social network theory valid when they are used to make abductive inferences from the domain of electronic texts to the domain of social networks? As is the case with all forms of content analysis, inferences in web graph analysis are abductive, meaning that they proceed logically across different domains, from the domain of text to the domain of interest (public opinion, for example). Standard social network analysis, however, typically remains within a single domain: inferences about how people ultimately relate to one another are drawn from data about the network structure of those relations. Further, the web graph is by nature a multiplex network, meaning that it encodes multiple types of relations in a single graph. One site might link to another for instrumental reasons (identifying a site that hosts necessary software, for example), to identify an object of criticism, to declare affinity, or for a range of other reasons. Traditional social networks, by contrast, tend to be constructed on the basis of a single type of social relation - the exchange of money or collaboration in work, for example (Monge and Contractor, 2003).

In their book Theories of Communication Networks, Monge and Contractor (2003) propose a Multitheoretical, Multilevel (MTML) analytical framework. In general terms, the MTML framework guides the integration of multiple levels of analysis (node, dyad, triad, cluster, network, inter-network), multiple types of relation, and multiple operational theories into a single research perspective. In addition, the MTML model moves "from purely network explanations to hybrid models that also account for attributes of the individual nodes" (p. 45). Although MTML has not been used explicitly in the development of web graph analysis, this general model provides a useful frame for understanding how valid analytical constructs can be developed.

Borgatti and Foster (2003) distinguish between two broad categories of social network theory: topology and flow mechanisms, also known as the girder-pipe distinction. Topology mechanisms discount the actual content of ties while focusing on the overall patterns of association. The topology identifies an overall ontology of the system or subsystem in question. Social theories describing the network structure of social capital generally fall into this category. On the other hand, flow mechanisms consider network ties as explicit conduits for the flow of social goods, be they tangible (manufactured products, resources, coin) or intangible (information, social support). Social contagion theories, such as the diffusion of innovations, are of this type.

As we consider the emerging literature of web graph analysis and its analytical constructs, it is useful to keep Borgatti and Foster's distinction in mind. It is possible to conceive of the web graph in both terms: as evidence of the associational topology of the relations of web site producers, or as the path through which human attention flows as one surfs the web. For the latter conceptualization there are clearly numerous validity issues. People access the web in many ways - directly typing in URLs, using bookmarks, clicking on links in emails arriving in their inboxes - such that the actual structure of the web graph is unable to give us an accurate picture of these individual paths. However, if we conceive of the graph as an overall topology of association between web actors, analytical constructs can be formed on more solid theoretical ground.

As has been discussed above, a number of analytical constructs based in social network theory can be adapted and applied effectively and consistently to make valid inferences from the web graph. Although the analyst should always ground his inferences within the context units of web content, the use of in-degree centrality as a measure of prominence and authority, for example, or of web site cliques within an issue space as a measure of a cohesive community, tends to be reliable on its own. At times, however, making inferences based on network patterns is more problematic.

Consider the following example. A recent issue network analysis of post-September 11th online civic engagement mapped a cohesive community of organizations and individuals who challenge the official version of events and share the belief that the Bush administration either consciously failed to act on that day or was in fact complicit. The community calls itself the September 11th Truth Movement. Although the characterization of this community as particularly cohesive and the identification of prominent actors within the issue network was relatively straightforward, certain nodes of interest, despite structural similarities, played divergent roles within the issue network. Two nodes, newamericancentury.org and truthout.org, both received 8 in-links from the network without any supporting links back to the network, and appeared close to each other in the web graph. Newamericancentury.org, a policy institute which calls for American global dominance both diplomatically and militarily, is at the far opposite end of the political spectrum from truthout.org, a progressive-left news digest. This fact is relatively easily discerned by considering the contexts of both links, but is there any way an algorithm might be developed to distinguish them automatically?

Menczer (2004b) describes how it is possible to graph relations between documents on the web in terms of three different topologies: link, lexical, and semantic. Link topology, as has been discussed in detail above, manifests in the web graph. Two documents within the web graph appear close together if they share mutual links both between each other and with other documents in the graph. In a lexical topology, two documents will be placed close to each other if they are similar in content, in the words that comprise them. In a semantic topology, the proximity of two documents is determined by their shared meaning. The two web sites in question are clearly in close proximity within the link topology but far apart in semantic topology.

Link topology is easily mapped using the web crawling method described above. Mapping lexical topology, however, is more complex. Salton and McGill (1983), for example, developed a method that represents documents in a vector space. A lexical vector contains one dimension for each term along with a weight that estimates the contribution of the term to the overall meaning of the document. Mapping semantic topology, however, cannot be done with a simple computer algorithm; it must be grounded in shared meanings within a given human population and informed by context.
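As a rough sketch of the vector-space approach, the following fragment builds TF-IDF term vectors and measures lexical proximity as the cosine of the angle between them. The tokenization and weighting are deliberately minimal, and the documents are invented:

```python
import math
from collections import Counter

# Sketch of the vector-space model: each document becomes a term vector
# weighted by TF-IDF; lexical proximity is cosine similarity.

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["voting machines security",     # invented toy documents
        "electronic voting security",
        "ocean pollution"]
vectors = tfidf_vectors(docs)
# the two voting documents sit closer together in the lexical topology
# than either does to the pollution document
```

Production systems refine this basic scheme considerably (stemming, stop-word removal, smoothed IDF), but the underlying geometry of a lexical topology is the same.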

As Menczer points out, and as was discussed above in the analytical construct section, advanced search algorithms in use today make use of both link and lexical topology to infer semantic similarity. It is possible, however, to calculate semantic similarity directly through the use of semantic networks. To understand how this might be done, let us first consider the class of semantic network that Sowa (2002) defines as a definitional network.

Definitional networks, also known as generalization or subsumption hierarchies, "emphasize the subtype or is-a relation between a concept type and a newly defined subtype." The network defines the rules of inheritance, showing how properties in more general categories are related to their subtypes. The earliest known network of this type was the Tree of Porphyry, drawn by the Greek philosopher Porphyry in the 3rd century AD as part of a commentary on Aristotle's categories. The network included the "substance" category at the top of the hierarchy, defined as the "supreme genus," with increasingly specific subcategories, such as "living" and "mineral" or "animal" and "plant" branching out below. A more modern version of the definitional semantic network is the Open Directory Project, "the largest, most comprehensive human-edited directory of the Web." The directory facilitates the navigation of information and knowledge on the web through its classification in a hierarchy of general to specific categories.

Menczer (2004a) has developed an information theory-based algorithm for calculating semantic similarity which uses the Open Directory Project (ODP) hierarchy as a reference point. The informational distance between two pages is based on their respective positions within the directory. Entropy, or information content, is calculated for the lowest (most specific) common ancestor of the two pages as well as for their respective individual categories, comparing both the informational difference and the shared meaning between the two documents. Specifically, semantic similarity, ss(p,q), is defined as:

    ss(p,q) = 2 log Pr[t0] / (log Pr[t(p)] + log Pr[t(q)])

where t(p) is the topic containing p in the ODP, t0 is the lowest common ancestor of p and q in the ODP tree, and Pr[t] represents the prior probability that any page is classified under topic t (p. 1).

The entropy measure is based on the understanding that information decreases as one moves up a topical hierarchy from specific to general. Entropy here is defined as the negative log likelihood that a randomly selected document falls into a given category (the negative disappears in this ratio since it appears in both the numerator and the denominator). As categories become more general, the chance that a randomly selected document falls into them increases, thus decreasing their information content. If the ODP included a single all-inclusive category, its entropy, and thus its information content, would be 0.
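A minimal sketch of this ratio in Python, with hypothetical topic probabilities standing in for frequencies that would be estimated from the actual ODP:

```python
import math

# Sketch of the information-theoretic similarity above. The probability
# values are invented for illustration; in practice Pr[t] would be
# estimated from how many pages the directory classifies under topic t.

def semantic_similarity(pr_t_p, pr_t_q, pr_t0):
    """ss(p, q) = 2 log Pr[t0] / (log Pr[t(p)] + log Pr[t(q)])."""
    return 2 * math.log(pr_t0) / (math.log(pr_t_p) + math.log(pr_t_q))

# Pages whose lowest common ancestor is nearly as specific as their own
# topics score close to 1.
close = semantic_similarity(0.001, 0.002, 0.004)
# Pages whose only common ancestor is an all-inclusive root category
# (Pr = 1, entropy 0) score 0.
far = semantic_similarity(0.001, 0.002, 1.0)
```

Note how the behavior matches the entropy argument above: as the common ancestor t0 becomes more general, Pr[t0] grows toward 1, its log shrinks toward 0, and the similarity falls.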

Menczer's operationalization of semantic similarity through an independently developed semantic network of web documents appears to be an attractive option for improving the structural validity of web graph analysis. If sites are already placed within semantic categories, their informational distance should be a simple calculation, without the need to compute distances between complex lexical vectors or to consult meaning dictionaries. There are several hurdles that must be overcome, however, as the following example will demonstrate.

Example: What is the semantic similarity between newamericancentury.org and truthout.org?

We lack the specific entropy values for each category in the ODP directory and thus cannot directly calculate Menczer's semantic similarity. However, we can use an approximation of this value based on a more traditional method known as edge counting (Resnick, 1999). To determine the informational distance between two sites, we count the smallest number of edges, or linked categories, one must travel within the category tree to get from one category to the other. One starts at either site's corresponding ODP category, moves up the tree in stepwise fashion until the first category encompassing both subcategories is encountered, and then moves down the tree to the second document's specific category. The respective positions of truthout.org and newamericancentury.org are as follows:

Based on this calculation, truthout.org and newamericancentury.org have an informational distance of 4. From truthout.org's immediate category of "Digests, Readers, and Compilations," one would step up through Progressive and Left and News and Media to Politics (the first category the two sites have in common) and then down to Policy Institutes. For comparison, if we calculate the informational distance between truthout.org and a site known to be semantically similar that also appears in the graph, buzzflash.com (category Top: Society: Politics: News and Media: Progressive and Left), we find that the two sites have an informational distance of 1.
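The edge-counting approximation can be sketched as follows. The tree fragment mirrors the ODP categories discussed above, though its exact shape is an assumption:

```python
# Edge counting (Resnick, 1999): informational distance as the number of
# category links on the shortest path through the topic tree. The tree
# fragment below is an assumed reconstruction of the ODP categories
# discussed in the text, rooted at Politics for simplicity.

parent = {
    "Digests, Readers, and Compilations": "Progressive and Left",
    "Progressive and Left": "News and Media",
    "News and Media": "Politics",
    "Policy Institutes": "Politics",
    "Politics": None,
}

def ancestors(cat):
    """The category itself followed by its chain of ancestors up to the root."""
    path = [cat]
    while parent[cat] is not None:
        cat = parent[cat]
        path.append(cat)
    return path

def edge_distance(cat_a, cat_b):
    """Steps up from cat_a to the lowest common ancestor, then down to cat_b."""
    path_a, path_b = ancestors(cat_a), ancestors(cat_b)
    for steps_up, anc in enumerate(path_a):
        if anc in path_b:
            return steps_up + path_b.index(anc)
    raise ValueError("categories share no ancestor")

truthout = "Digests, Readers, and Compilations"
nac = "Policy Institutes"
buzzflash = "Progressive and Left"
# edge_distance(truthout, nac) -> 4; edge_distance(truthout, buzzflash) -> 1
```
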

Although the above seems to suggest that an informational distance calculation might help to distinguish two sites that are otherwise structurally similar, closer inspection of the ODP directory shows that its use in this way is likely to be inconsistent, problematic, and ultimately invalid. The reason is perhaps best summed up in one word: context. Consider, for example, the placement of newamericancentury.org within the "Politics: Policy Institutes" category. The site, which could be classified politically as neoconservative, shares the same directory category with nationinstitute.org, a progressive-left policy institute which produces The Nation magazine. Both would have the same informational distance from truthout.org, despite having oppositional politics. The semantic distinctions made within the context of an encyclopedic web directory are of a different sort than those one might make in the context of civic society and politics.

The inadequacy of the ODP for the task of determining semantic similarity in issue network analysis does not invalidate Menczer's measure itself, but to be used effectively, another, politically-focused semantic network would have to be substituted, and this network would need to categorize the vast majority of issue-oriented web sites that might be encountered in a crawl. For the time being, the existence of such a semantic network remains a distant dream; more complex lexical similarity measures will need to be used instead.

2. Are the web graphs generated in the process of crawling issue spaces on the web truly representative of the issue network in question? Does the resulting web graph vary significantly with small changes in the seed, such that valid inferences from the web graph structure itself become impossible? This question of content validity involves sampling validity of members, sampling validity of representatives, and semantic validity.

Sampling validity of members

There tends to be an assumption when graphs are made of issue networks online that, in a valid graph, the majority of key actors should appear. This is based partially on the fact that isolated, unlinked sites that deal with the issue in question are, by definition, not part of the issue network, and that those sites which interlink should be retrieved in the crawl process. This validity of members, however, is intimately tied to the question of semantic validity, addressed below.

Sampling validity of representatives

One can expect that inferences drawn from the web graph to a particular issue space will vary in validity in proportion to the degree with which relevant actors in that space make use of the Internet as a medium of activity.

Within Krippendorff's conceptual framework, semantic validity "is the degree to which analytical categories of texts respond to the meanings these texts have for particular readers or the roles they play within a chosen context" (p. 323). Within the framework of most content analyses, the process of categorization creates a level of abstraction "above that of ordinary talk" to facilitate the subsequent analysis. When particular text units are coded and assigned to categories in a semantically valid manner, independent analysts who share the research context should be able to agree that 1) although units assigned to one category may differ in significant ways, they share properties that are significant to the research question at hand, and 2) units that appear in different categories differ in their analytically relevant meanings.

To understand how semantic validity may be approached in web graph analysis, we first consider its role in standard electronic text searches as described by Krippendorff. When evaluating the semantic validity of a text search, one can compare the results of a given computer query to human judgments of their ultimate relevance. Assuming that within a given textual universe human judges could be used to identify the relevance or non-relevance of all documents for a given query, we could generate a fourfold table of frequencies: a) texts retrieved by the query and deemed to be relevant; b) texts retrieved by the query but deemed to be irrelevant; c) texts not retrieved by the query but deemed to be relevant; and finally d) texts not retrieved by the query and judged to be irrelevant. Within such a table, semantic validity is defined as (a+d)/n, where n is the total number of documents searched. Invalid results appear in cells b and c, defined as errors of commission and omission respectively.
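The fourfold table reduces to a one-line calculation. The counts below are invented for illustration, not taken from an actual study:

```python
# Semantic validity of a text search as agreement between the query's
# retrieval decisions and human relevance judgments. Counts are invented.

def semantic_validity(a, b, c, d):
    """(a + d) / n: proportion of documents on which query and judges agree.

    a: retrieved, relevant          b: retrieved, irrelevant (commission)
    c: missed, relevant (omission)  d: not retrieved, irrelevant
    """
    n = a + b + c + d
    return (a + d) / n

validity = semantic_validity(a=40, b=5, c=10, d=45)  # -> 0.85
```
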

When a web graph analyst crawls the web for a specific issue network, this network is initially hypothesized as a particular issue category, such as "land mine safety," "online privacy" or "copyright protection." The analyst then uses a representative set of URLs as a seed from which to begin the crawl. Since this seed is ultimately determined by the analyst and this seed forms the basis of the subsequent crawl, an obvious question arises: What is the semantic validity of the resulting web graph? Is the network as it appears in the generated web graph consistent with the true issue network as it exists out there in the global web graph?

The simplest answer to this question is that valid issue networks will only be generated by valid seeds. To study the electronic voting issue network, one cannot start with a seed of web sites that deal with the issue of Turkey's acceptance into the European Union. But what if the researcher takes care to identify sites that, when judged independently, are agreed to fit within the category "electronic voting"? Since there are likely many sites that would fit this semantic category, how can the analyst be confident that the graph he retrieves is valid, or valid enough to serve as the basis for confident inferences within a given research context?

Due to the context within which web graphs are analyzed, errors of commission and omission must be viewed in different terms. Let us first deal with errors of commission, which in the context of web graph analysis are not necessarily errors at all.

Consider again the web graph analysis focusing on the September 11th Truth Movement. A crawl conducted on December 2nd using a seed of four September 11th Truth Movement sites returned a graph with 95 nodes. Although the majority of the nodes within the network fit within the September 11th Truth Movement category, a number of sites do not. Rather than being an error in need of correction, however, many of these sites indicate the broader orientation of the September 11th Truth Movement within the virtual space of global civic society. For example, the web graph identified a significant relationship between the Truth Movement community and what may best be described as the progressive-left issue community, consisting of popular web sites such as thenation.org, moveon.org and michaelmoore.com (see graph, page 50). There are cases when nodes retrieved in the network are not relevant to the analysis, such as when standard links to microsoft.com or adobe.com are provided by numerous sites for needed plug-ins, but these nodes are easily removed during the editing process. Even if the researcher were to inadvertently use a seed URL that did not fit semantically within the issue network of interest, the process of co-link analysis would tend to throw out its potential contributions to the association matrix, since its outward links would have to be shared by at least one other seed.

Errors of omission, however, are of a more serious nature, and their impact on the overall validity of the data extends beyond the simple absence of relevant nodes. If a crawl fails to retrieve significant web sites within an issue network, the analyst loses not only contextual data about those sites (their content, their contribution to the issue discourse) but also the structural data the sites contribute to the overall network. Without this data, structural inferences from link patterns -- about the relative centrality of a particular actor in the network, for example -- may ultimately be incorrect.

The question of seed validity is one of the most important issues facing web graph analysis. To date, little, if any, research has been published that discusses this problem. From my general experience of graphing many issues, it appears clear that web graph analysts can improve seed validity by following a few general principles. First, familiarity with the issue to be analyzed will lead to the most valid seeds. When such familiarity is not possible, the researcher can begin with a set of sites ranked highly by the PageRank algorithm when keywords representing the issue in question are entered into Google. The researcher can then follow an iterative process in which a graph is generated using this initial seed and prominent sites identified in this graph are fed back into the seed. If there is a significant issue network to be found, such a process will usually find it.
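The iterative seeding heuristic can be sketched as follows. Here fetch_outlinks is a stand-in for a real crawler, and LINKS is a hypothetical issue space, not real data:

```python
from collections import Counter

# Sketch: crawl outward from a seed for a fixed number of rounds, rank
# retrieved sites by in-degree, and fold the most prominent back into the
# seed. LINKS and the site names are hypothetical.

LINKS = {
    "seed1.org": ["hub.org", "a.org"],
    "seed2.org": ["hub.org", "b.org"],
    "hub.org": ["a.org", "b.org"],
    "a.org": ["hub.org"],
    "b.org": [],
}

def fetch_outlinks(url):
    """Stand-in for an HTTP crawler: look up out-links in a static table."""
    return LINKS.get(url, [])

def crawl_and_rank(seed, rounds=2):
    """Breadth-first crawl for a fixed number of rounds; rank by in-degree."""
    frontier, seen = list(seed), set(seed)
    in_degree = Counter()
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            for target in fetch_outlinks(url):
                in_degree[target] += 1
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return [site for site, _ in in_degree.most_common()]

def iterate_seed(seed, iterations=2, top_k=3):
    """Augment the seed with the most prominent sites found in each crawl."""
    for _ in range(iterations):
        seed = sorted(set(seed) | set(crawl_and_rank(seed)[:top_k]))
    return seed
```

A real implementation would also apply the co-link filtering described earlier, but the feedback loop between ranking and seeding is the essential idea.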

A related question regarding seed validity has to do with the ultimate sensitivity of the whole graphing process to small variations in otherwise semantically valid seeds. If sensitivity is too great, such that key structural properties of important actors vary wildly with the seed, web graph analysis will have little ultimate utility within the social sciences. The standard visualization of web graphs will not be sufficient to test seed validity, as it is extremely difficult to compare graphical images in any systematic way. Instead, a seed validity study would need to agree on a set of graph metrics, or a composite index, which can be compared across seeds and within particular issues. Variation within these metrics could then be analyzed as the seed varies.

Pilot research in this area suggests that there is a significant degree of consistency within graphs as seeds vary, although more research needs to be done. A recent study of the "Electronic Voting" issue network led to the conclusion that the web site for the Electronic Privacy Information Center, (epic.org), has the highest in-degree centrality within this online issue space. The conclusion was widely challenged within the electronic voting community itself, on the grounds that the graph generated was too dependent on the crawl seed and thus ultimately invalid.

To test the validity of the high centrality finding for epic.org, four crawls using varying seeds were run in the span of approximately six weeks. The number of URLs in each seed varied from a low of 2 to a high of 9 (the specific URLs in each seed are included in Appendix B, p. 51). Epic.org itself was present in two of the four seeds. From the resulting graphs, in-degree centrality was calculated for the most prominent nodes, leading to a ranking of the five most prominent nodes for each of the four graphs. The results appear below, with the date of each crawl along the top row, the number of URLs in the seed (seed n) in the second row, and the total number of nodes in the returned graph (nodes n) in the third row, followed by the sites in rank order from one to five. Sites are listed by their domain name (all are .org sites, except for thomas.gov) with the normalized in-degree centrality metric appearing in parentheses. Dates marked with an asterisk indicate seeds in which epic.org was not present:

             22-Jul          2-Aug*          16-Aug*         5-Sep
    seed n   5               4               2               9
    nodes n  84              81              79              93
    1        epic (.3735)    epic (.2375)    epic (.2692)    epic (.3261)
    2        aclu (.3012)    thomas (.1875)  eff (.2308)     eff (.2935)
    3        eff (.2891)     eff (.175)      cdt (.2051)     aclu (.2391)
    4        cdt (.1928)     cdt (.175)      acm (.2051)     cdt (.1848)
    5        acm (.1687)     acm (.1375)     aclu (.1923)    cps (.1739)

Epic.org appears as the most prominent node within each graph, including the two crawls conducted on August 2nd and August 16th whose seeds did not include the site itself. Although these results suggest we can be confident that epic.org's dominant position within the "electronic voting" web graph is not simply an artifact of a particular seed, there is enough variance in the second to fifth positions to warrant further investigation of broader seed validity issues. Further, since the four crawls were conducted across a time span of six weeks, an undetermined proportion of this variance is due to changes within the web graph itself over that period (as discussed in the reliability section, above). To get an idea of the relative degree of variance this time span might have introduced into the seed study, the August 2nd seed was run again on September 6th:

             2-Aug           6-Sep
    1        epic (.2375)    epic (.2338)
    2        thomas (.1875)  thomas (.2208)
    3        eff (.175)      cdt (.2208)
    4        cdt (.175)      eff (.2077)
    5        acm (.1375)     acm (.1558)

The results suggest that the majority of variance in this pilot seed validity study came from the seed variance and not the time span.
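One simple candidate for the kind of composite cross-seed metric proposed above is top-k rank overlap: the fraction of the top-k sites that two seed-specific centrality rankings share. The rankings below are those returned by the July 22nd and August 2nd crawls:

```python
# Sketch of one cross-seed consistency metric. The rankings are the
# top-five in-degree orderings from the July 22nd and August 2nd crawls
# reported above.

def top_k_overlap(rank_a, rank_b, k=5):
    """Fraction of the top-k sites that appear in both rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k

july = ["epic", "aclu", "eff", "cdt", "acm"]
august = ["epic", "thomas", "eff", "cdt", "acm"]
overlap = top_k_overlap(july, august)  # 4 of 5 sites shared -> 0.8
```

A fuller study would combine several such metrics, perhaps weighting agreement at the top of the ranking more heavily, but even this crude measure allows seed variance to be quantified rather than eyeballed from graph images.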

More extensive research into this question needs to be conducted to help analysts systematically improve seed validity, understand its idiosyncrasies, and build confidence in their results. The two relevant variables in the study of seed validity are the seed and the issue itself. An "issue" can be defined at varying levels of specificity. A semantically narrow issue, such as "ocean pollution," is likely to present different validity issues than a more general one, such as "environmental protection." Further, the salience of a particular issue - whether it is widely known and debated within the public sphere - is also likely to impact the researcher's ability to generate a valid seed. A thorough study of seed validity should therefore test a range of issues of varying generality and salience. As for the seed, it can vary according to composition (specific URLs) and length.

Using Krippendorff's conceptual model of content analysis we have been able to examine and evaluate web graph analysis in great detail. We began with a brief discussion of the theoretical context and potential uses of the method, noting its similarities to citation and semantic network analysis, and highlighting its applicability to the study of political issue networks in civil society. We outlined each step of the data-making process, beginning with the identification of relevant seed URLs and the subsequent crawling of the seed using co-link analysis, continuing with the construction of an association matrix from the documents collected in the completed crawl, and culminating in the identification, using the data language of graph theory, of key network patterns. Next, we discussed how network patterns such as betweenness centrality and cohesion may be used to draw inferences using analytical constructs based on the literature of social network theory.

The validity of web graph analysis was then considered in terms of both structural and content validity. We saw that abductive inferences based on social network theory often can be applied effectively to the domain of electronic texts, particularly when patterns such as centrality and cohesion are identified, but noted that structurally similar nodes can play divergent roles that can only be understood within context. We explored the possibility of using independently produced semantic networks to determine semantic proximity for structurally similar nodes, but found that generic topic hierarchies such as the Open Directory Project do not make the kinds of semantic distinctions necessary for the context of politics and civic society.

Finally, we explored what may be the most significant validity issue facing web graph analysis, the ultimate representativeness of the issue graph and the significance of seed variance on crawl results. We reviewed the results of a pilot study suggesting that structural inferences about prominent actors within an issue graph can be made with confidence, but acknowledged the need for more thorough research about the nature of seed variance.

Web graph analysis is a promising new method of content analysis with a bright future. As validity issues continue to be explored in more detail, it is perhaps best used as one component of a more comprehensive content analysis, rather than as a standalone method. Nevertheless, its unique advantages cannot be easily dismissed. Web graph analysis is fast, inexpensive and unobtrusive and can quickly orient researchers to an issue space of interest in ways that traditional forms of content analysis are unable to do. As the method itself improves, especially as ways are found to integrate lexical and semantic topologies into the graphing process, it appears certain to gain widespread use.