COMPUTERS AND TEXT ANALYSIS

Finally . . . anyone who collects mountains of text will want to take advantage of modern text analysis software. Don’t take the phrase “text analysis software” literally. Computer programs do a lot, but in the end, you do the analysis; you make the connections and formulate hypotheses to test; you draw conclusions and point them out to your readers.

The two broad approaches in text analysis—inductive, hypothesis-generating research and deductive, hypothesis-testing research—are reflected in the available software. Programs for automated content analysis are based on the concept of a computerized, contextual dictionary. You feed the program a piece of text; the program looks up each word in the dictionary and runs through a series of disambiguation rules to see what the words mean—for example, whether the word “concrete” is a noun (stuff you pave your driveway with) or an adjective, as in “I need a concrete example of this or I won’t believe it.”
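To make the idea of a contextual dictionary concrete, here is a minimal sketch in Python. It is not how any real content analysis program works internally; the word lists, the single rule, and the function names (`concrete_rule`, `tag_text`) are all invented for illustration. The rule looks only at the words on either side of “concrete,” which is far cruder than the rule sets in production systems.

```python
import re

# Hypothetical word lists for the illustration; real dictionaries are far larger.
DETERMINERS = {"a", "an", "the", "this", "that"}
FUNCTION_WORDS = {"of", "or", "and", "in", "on", "with", "is", "was", "to"}


def concrete_rule(prev_word, next_word):
    """Crude disambiguation rule: determiner + 'concrete' + content word
    is read as an adjective; anything else is read as a noun."""
    if prev_word in DETERMINERS and next_word and next_word not in FUNCTION_WORDS:
        return "adjective"
    return "noun"


# The "dictionary" maps each ambiguous word to its disambiguation rule.
DICTIONARY = {"concrete": concrete_rule}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_text(text):
    """Look up every token; apply the rule for any word found in the dictionary."""
    tokens = tokenize(text)
    tagged = []
    for i, word in enumerate(tokens):
        rule = DICTIONARY.get(word)
        if rule is None:
            continue  # word is not in the dictionary
        prev_word = tokens[i - 1] if i > 0 else ""
        next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
        tagged.append((word, rule(prev_word, next_word)))
    return tagged


print(tag_text("I need a concrete example of this or I won't believe it."))
print(tag_text("We paved the driveway with concrete."))
```

A rule this simple will misread plenty of sentences (“The concrete cracked,” for one), which is why working systems need thousands of rules.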

Work on automated text analysis began in the 1960s. Philip Stone and others (1966) developed a program for doing automated content analysis called the General Inquirer. They tested it on 66 suicide notes—33 written by men who had actually taken their own lives, and 33 written by men who were asked to produce simulated suicide notes. The program parsed the texts and picked the actual suicide notes 91% of the time (Ogilvie et al. 1966). The latest version of the system (which runs with a dictionary called the Harvard IV-4) has a 13,000-word dictionary and over 6,000 rules. It can tell whether the word “broke” means “fractured,” or “destitute,” or “stopped functioning,” or (when paired with “out”) “escaped” (Rosenberg et al. 1990:303). (For more about the General Inquirer, see appendix E.)

About the same time, Nicholas Colby (1966) developed a special-purpose dictionary for Zuni and Navajo texts. From his ethnographic work, Colby had the impression that

BOX 19.5

THE BOOLEAN LOGIC IN SCHWEIZER’S ANALYSIS

Here are the details of the Boolean logic of Schweizer’s analysis (Schweizer 1996). Three possible hypotheses can be derived from two binary variables: “If A then B,” “If B then A,” and “If A, then and only then, B.” In the first hypothesis, A is a sufficient condition for B and B is a necessary condition for A. This hypothesis is falsified by all cases having A and not B. In the second hypothesis, B is a sufficient condition for A and A is a necessary condition for B. The second hypothesis is falsified by all cases of B and not A. These two hypotheses are implications or conditional statements. The third hypothesis (an equivalence or biconditional statement) is the strongest: Whenever you see A, you also see B and vice versa; the absence of A implies the absence of B and vice versa. This hypothesis is falsified by all cases of A and not B, and all cases of B and not A.
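The falsification logic just described can be sketched in a few lines of Python. This is an illustration only: the function name `falsifiers` and the six cases are invented, and they are not the Chen Village data.

```python
def falsifiers(cases):
    """Count the cases that falsify each of the three hypotheses.

    cases: an iterable of (a, b) pairs of booleans, one pair per case.
    """
    cases = list(cases)
    a_not_b = sum(1 for a, b in cases if a and not b)  # falsify "if A then B"
    b_not_a = sum(1 for a, b in cases if b and not a)  # falsify "if B then A"
    return {
        "if A then B": a_not_b,
        "if B then A": b_not_a,
        # The biconditional is falsified by both kinds of case.
        "A if and only if B": a_not_b + b_not_a,
    }


# Invented cases: (A present?, B present?)
example_cases = [
    (True, True),
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (False, True),
]

print(falsifiers(example_cases))
```

Because the biconditional collects the falsifying cases of both implications, it can never have fewer disconfirming cases than either implication alone. That is the sense in which it is the strongest of the three hypotheses.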

Applied to the data from Chen Village, the strong hypothesis is falsified by many cases, but the sufficient condition hypotheses (urban origin implies success; proletarian background implies success; having external ties implies success) are true in 86% of the cases (this is an average of the three sufficient condition hypotheses). The necessary condition hypotheses (success implies urban origin; success implies proletarian background; success implies external ties) are true in just 73% of cases (again, an average). (There are seven disconfirming cases in 51 possible outcomes of the 12 sufficient condition possibilities—4 possible outcomes for each of three independent variables and one dependent variable. There are 14 disconfirming cases in 51 possible outcomes of the 12 necessary condition possibilities.) To improve on this, Schweizer tested multivariate hypotheses, using the logical operators OR and AND (Schweizer 1996).
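As a rough check on the arithmetic, the counts of disconfirming cases given in parentheses can be turned into proportions. The box describes the percentages as averages across three hypotheses; the pooled figures below are an assumption about how the counts relate to those averages, and they agree with the reported values to the nearest percent.

```latex
\begin{align*}
\text{sufficient condition hypotheses: } & 1 - \tfrac{7}{51} = \tfrac{44}{51} \approx 0.863 \quad (\text{about } 86\%) \\
\text{necessary condition hypotheses: } & 1 - \tfrac{14}{51} = \tfrac{37}{51} \approx 0.725 \quad (\text{about } 73\%)
\end{align*}
```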

the Navajo regarded their homes as havens and places of relaxation but that the Zuni viewed their homes as places of discord and tension. To test this idea, Colby developed two groups of words, one group associated with relaxation (words like assist, comfort, affection, happy, and play) and one associated with tension (words like discomfort, difficult, sad, battle, and anger). He then had the computer look at the 35 sentences that contained the word “home” and one of the words in the two word groups. When talking about home, the Navajo were more than twice as likely as the Zuni to use relaxation words rather than tension words; the Zuni were almost twice as likely as the Navajo to use tension words when they talked about their home.
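A special-purpose dictionary of this kind is easy to sketch. The Python below is a minimal illustration, not a reconstruction of Colby’s program: the two word groups contain only the example words listed above, the function name `home_profile` is invented, the sentences are made up, and matching is on exact word forms with no stemming.

```python
import re
from collections import Counter

# Only the example words named in the text; a real dictionary would be longer.
RELAXATION = {"assist", "comfort", "affection", "happy", "play"}
TENSION = {"discomfort", "difficult", "sad", "battle", "anger"}


def home_profile(sentences):
    """Count sentences that mention 'home' together with a word from each group."""
    counts = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if "home" not in words:
            continue  # only sentences about home are of interest
        if words & RELAXATION:
            counts["relaxation"] += 1
        if words & TENSION:
            counts["tension"] += 1
    return counts


# Invented sentences, for illustration only.
example_sentences = [
    "The children play at home.",
    "There was anger at home that day.",
    "He was happy in the field.",
    "Home is a comfort, but sad news came.",
]

print(home_profile(example_sentences))
```

Running the same function separately on the texts from each group of narrators gives the two sets of counts that can then be compared.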

Over the years, computer-assisted content analysis has developed into a major industry. When you hear “This call may be monitored for quality assurance purposes,” it’s likely that the conversation will be turned into text that will be submitted to a high-end data-mining program for analysis.

Most content analysis, however, is not based on computerized dictionaries. It’s based on the tried-and-true method of coding a set of texts for themes, producing a text-by-theme profile matrix, and then analyzing that matrix with statistical tools. Most text analysis packages today support this kind of work. You code themes on the fly, as you read the text on the screen, and the program produces the text-by-theme matrix for you. Then, you import the matrix into your favorite stats package (SPSS®, SAS®, SYSTAT®, etc.) and run all the appropriate tests (about which, more in chapters 20, 21, and 22).
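Here is a minimal sketch, in Python, of what a text-by-theme profile matrix looks like and how it might be written out for a statistics package. The text identifiers, the themes, and the function names (`profile_matrix`, `to_csv`) are invented for the example; commercial packages produce the matrix in their own ways.

```python
import csv
import io


def profile_matrix(codings, themes):
    """Build one row per text: the text id followed by 1/0 for each theme.

    codings: dict mapping a text id to the set of themes coded in that text.
    themes: list of theme names, which fixes the column order.
    """
    return [
        [text_id] + [1 if theme in coded else 0 for theme in themes]
        for text_id, coded in sorted(codings.items())
    ]


def to_csv(rows, themes):
    """Render the matrix as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text"] + list(themes))
    writer.writerows(rows)
    return buffer.getvalue()


# Invented codings: which themes the coder marked in each text.
example_codings = {
    "t1": {"work", "family"},
    "t2": {"family"},
    "t3": set(),
}
example_themes = ["family", "work", "health"]

rows = profile_matrix(example_codings, example_themes)
print(to_csv(rows, example_themes))
```

Each row is a text and each column is a theme, so the file can be read by most statistics packages as an ordinary cases-by-variables data set.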

Programs for doing grounded-theory-type research are also widely available. No program does everything, so do your homework before deciding on what to buy. A good place to start is the CAQDAS Networking Project (http://caqdas.soc.surrey.ac.uk). CAQDAS (pronounced cactus—really) stands for computer-assisted qualitative data analysis software. The CAQDAS site is continually updated with information on tools for analyzing qualitative data. Those tools are getting more sophisticated, with more features added all the time. Were transcription not such a nuisance, social science would have focused long ago on the wealth of qualitative data that describe life and history across the globe. But with voice-recognition software coming on strong (see chapter 8), and transcription becoming less and less intimidating, all forms of text analysis—narrative analysis, discourse analysis, grounded theory, content analysis—will become more and more attractive.

Text analysis is only just beginning to come into its own. It’s going to be very exciting.

FURTHER READING