Social question and answer (Q&A) Web sites field a remarkable variety of questions: while one user seeks highly technical information, another looks to start a social exchange. Prior work in the field has adopted informal taxonomies of question types as a mechanism for interpreting user behavior and community outcomes. In this work, we contribute a formal taxonomy of question types to deepen our understanding of the nature and intent of questions that are asked online. Our taxonomy is grounded in Aristotelian rhetorical theory, and complemented by contributions of leading twentieth century rhetorical theorists. This taxonomy offers a way to differentiate between similar–sounding questions, while remaining flexible enough to encompass the wide range of questions asked online. To ground the taxonomy in reality, we code questions drawn from three popular social Q&A sites, and report on the distributions of several objective and subjective measures.

1. Introduction

Social question and answer sites (Q&A sites) are online spaces where people ask and answer tens of thousands of questions every day. These questions are striking in their variety: while one person asks a technical question about linear programming, another person asks others to name their favorite football player. In many successful Q&A sites, these questions come together as part of a diverse and vibrant whole, supporting users with different goals, interests, and expectations.

It is intuitive, therefore, to think of different types of questions in these sites. Consider the following two questions: “do you drink soda or coffee to wake up?” and “does soda or coffee have more caffeine?” In the former question, the asker appears to be looking to start a conversation about the relative merits of soda vs. coffee, while in the latter question the asker appears to be looking for information to guide his or her future behavior. Despite both questions sharing a topic — caffeinated drinks — we interpret each question to be functionally different (Pomerantz, 2005) in the types of answers it expects.

Recent studies of social Q&A sites have built on the concept that there are different types of questions in order to better understand user behaviors and community outcomes. The field is accumulating results that demonstrate a range of distinct question types that have implications for designers and researchers. For example:

Enjoyment and likelihood of return. A comparative study of two similar Q&A sites found that the one which fielded more “socially conducive” questions was favored by users, who reported that they enjoyed their experience more, and were more likely to keep using the system (Hsieh and Counts, 2009).

Answerer effort. A cross–site comparison of five Q&A sites found that, across all sites, different types of questions were met with different levels of effort from answerers (Harper, et al., 2008). Specifically, advice–seeking questions received answers that were twice as long, on average, as fact–seeking questions.

Evaluation of quality. A study of how question askers select “best answers” found that users evaluate answers differently depending on the type of the question (Kim, et al., 2007). For instance, users who ask information–seeking questions appear to value the answer’s clarity and accuracy, while users who ask for opinions value factors such as agreement with the answer or the presence of emotional support.

These differences help us understand why Q&A sites are used, and how we can make them better. Being able to quantify what types of questions are being asked — and in what proportions — offers a helpful window into how a Q&A site is performing and what improvements specific to that site might be most beneficial. To customize the user experience, we might look at users’ preferences for answering certain types of questions, rather than just looking at topics. To maximize the informational value of a Q&A site, we might ensure that question types with high information value are emphasized in the interface. To increase participation, we might instead emphasize engaging question types.

We are optimistic that machine learning and other forms of automation will be able to help with these goals, as research has shown striking structural differences between different question types. A study of “conversational” and “informational” questions (Harper, et al., 2009) found differences between these question types in terms of the use of words and in terms of the social network characteristics of users. Another study used unsupervised learning to identify three clusters of questions: “discussion forums”, “advice”, and “factual”, based on structural factors such as the length of the question and the asking/answering ratio of participants (Adamic, et al., 2008).

We are working toward a broad vision of Q&A sites where algorithms intelligently index, categorize, and personalize content. To get there, we first seek a more structured understanding of the nature of user contributions in these communities.

1.1. Research goals

The aim of this research is to deepen our understanding of the nature and intent of questions that are asked online. We organize this contribution around three primary research goals:

Goal 1: Produce a taxonomy of question types that draws on core principles of rhetorical theory and is tailored to the study of online question asking.

A team of scholars with expertise in both rhetorical studies and computer science worked to develop a new taxonomy of online questions. The resulting taxonomy is flexible enough to encompass the range of questions asked online, and maintains meaningful divisions between types. We believe this taxonomy will help researchers and practitioners to understand the nature of questions that are asked online, as well as the intent behind those questions.

Goal 2: Develop an initial understanding of the properties of the different question types.

We apply our taxonomy to a set of online questions to better understand the differences between the types. This analysis serves to strengthen our definitions of the question types. It also illustrates some of the differences between the question types that may be used by developers of classification algorithms.

Goal 3: Develop an initial understanding of the quality implications of the different question types.

We collect quality–related data on a set of questions to establish a preliminary understanding of the implications of a question’s type. We map existing results to the new taxonomy, and report on new results that can inform system designers.

While much of rhetorical theory is directed at analyzing and understanding overtly persuasive communication, we found that the principles embedded in these theories offered substantial insight when adapted to the indirectly persuasive exchanges that are commonly found in online Q&A sites.

2. A rhetorical framework for classifying questions

Understanding online question asking requires interdisciplinary thinking. Though language theorists have developed taxonomies of questions in isolation (Pomerantz, 2005), recent studies of Q&A sites have either ignored these taxonomies, or have been forced to customize them for the online world (e.g., Ignatova, et al., 2009).

By contrast, the taxonomy at the heart of this project was developed collaboratively, with scholars from rhetorical studies and human–computer interaction jointly applying theoretical principles and practices from their respective disciplines. Drafts of this taxonomy were tested and evaluated using real questions asked in a dozen Q&A sites and discussion forums [1]. We believe the resulting taxonomy is both theoretically justified and practical.

2.1. Background: The three species of rhetoric

Composed in the fourth century BCE, Aristotle’s On rhetoric (Aristotle, 2007) has served as a central text for scholars of persuasive communication for over 24 centuries. While Aristotle’s text is best understood as a guide to the construction of persuasive speeches, its broad outlines have been adapted and applied in a variety of contexts. Most significantly, Aristotelian rhetoric has proven useful for enhancing composition strategies in persuasive writing, and sharpening critical analyses of spoken or written arguments.

Central to Aristotle’s work is the notion that there are three “species” of rhetoric: deliberative, epideictic, and forensic. Aristotle describes deliberative rhetoric as the rhetoric of legislative bodies, directed at informed judgments about potential actions. He describes epideictic rhetoric as a phenomenon of providing evaluative or qualitative judgments — in Aristotle’s words, “praising or blaming” a given person, topic, or idea. And he identifies forensic rhetoric as the rhetoric of the courtroom, directed at determining whether a crime has or has not occurred.

Importantly, each of the three species is distinguished by its temporal focus, as Aristotle explains:

“Each of these [species] has its own ‘time’; for the deliberative speaker, the future (for whether exhorting or dissuading he advises about future events); for the speaker in court [i.e., the forensic speaker], the past (for he always prosecutes or defends concerning what has been done); in epideictic the present is most important, for all speakers praise and blame in regard to existing qualities … .” [2]

Thus, at root, deliberative rhetoric is future–focused, epideictic is present–focused, and forensic is past–focused.

Applications of Aristotle’s account of these three species of rhetoric are many and varied. Recent examples of scholars across disciplines applying these theoretical tools include: analysis and critique of corporate annual reports as an Aristotelian genre (White and Hanson, 2000); argument that pedagogical presentations emphasizing epideictic rhetoric perform a necessary community–building function within educational institutions (Deacon and WynSculley, 2007); analysis of the function of these species in the “crisis rhetoric” of United States Presidents (Dow, 1989).

2.2. Applying rhetorical analysis to online Q&A

Our project investigates a further adaptation of Aristotle’s three species of rhetoric, applying the deliberative, epideictic, and forensic categories to distinguish among the particular types of assistance or information sought by online questioners. Early in our research, we recognized that most questions submitted to online Q&A sites could be easily categorized according to an adapted version of Aristotle’s rhetorical species. For example, questions seeking advice are directed at moving toward informed future action, and thus can be understood as deliberative questions; questions seeking factual information are properly understood as forensic questions seeking clarity with respect to what has already been done; questions of approval or disapproval and questions of quality correspond closely to present–focused “praising and blaming” activities central to the Aristotelian understanding of epideictic rhetoric.

But these three species alone are not adequate to the task of explaining the range of questions found in social Q&A sites. In particular, the species do not adequately address questions that are simply asked to facilitate social interaction. Prior work (e.g., Adamic, et al., 2008; Agichtein, et al., 2008; Harper, et al., 2009) has demonstrated that such open–ended subjective conversations are pervasive in social Q&A sites.

By contrast, the Aristotelian species each point toward definable endpoints. Deliberative questions typically point toward policy or action. Forensic questions typically lead to greater understanding of past facts or data. And epideictic questions routinely produce settled judgments with respect to qualities (or their lack). The Aristotelian species are less helpful in cases where Q&A exchanges are not so expressly goal–directed. This gap is underscored by the work of twentieth century rhetorical theorists, who complemented Aristotelian definitions of rhetoric by explaining the value of rhetorical exchanges that seem — at first blush — to be conversational rather than persuasive.

In order to develop a taxonomy expansive enough to address these more conversational moments, we drew upon scholars central to the twentieth century’s “social turn” in rhetorical theory. The middle of that century saw two major efforts to produce an identifiably “new” rhetoric. Burke’s (1950) articulation of a rhetoric that is directed not at Aristotelian persuasion, but rather at two or more individuals establishing a shared space of “identification”, begins the process of accounting for the great mass of discursive interactions that do not — initially — seem reducible to a persuasive or productive purpose. Burke argues that the establishment — through language — of even a small zone of shared substance is a foundational step in social cohesion. Indeed, identification sets the stage for meaningful future action. Burkean identification, with its implied orientation toward building future relationships, is reflected directly in our use of Burke’s term to describe one of our two “deliberative” categories.

In a related vein, in The new rhetoric, Perelman and Olbrechts–Tyteca (1969) argue that Aristotle’s species are too heavily weighted toward the deliberative and forensic, and further, that the “adherence” generated by participating in collective exchanges centered around the core epideictic activities of praising and blaming is particularly important. They write:

“epideictic oratory has significance and importance for argumentation because it strengthens the disposition toward action by increasing adherence to the values it lauds.” [3]

Taken together, these new classes of rhetoric complement the Aristotelian species by offering functional explanations as to what might actually be accomplished when questions seem superficially detached from tangible goals. While this type of adherence might occur as a by–product of responses to a variety of questions, Perelman and Olbrechts–Tyteca’s work informs our taxonomy’s recognition of a distinction between the subjective task of offering opinions about one’s favorite or least favorite of a given class, and the relatively objectively grounded task of identifying the best of a given class.

In addition to the distinctions suggested by the theoretical work of Burke and the later work of Perelman and Olbrechts–Tyteca, we offer our own observation that there is a notable distinction to be drawn between questions seeking a stray snippet of information and questions seeking step–by–step instructions to address a problem or challenge. If a questioner asks, “is there a way to set my iPhone so that it will shut off each night at midnight?” most readers would recognize that a respondent simply answering “yes” — even if this is factually accurate — is not being especially helpful given the import of the question. Our rhetorical framework acknowledges this distinction in hopes of helping to inform the development of more responsive Q&A systems.

2.3. A rhetorical framework of question types

Our taxonomy consists of six question types, two “subspecies” for each of Aristotle’s species. We were at first reluctant to subdivide Aristotle’s deliberative, epideictic, and forensic rhetoric, as these species had been used and maintained by scholars without significant amendment for the better part of 24 centuries. On the other hand, few rhetorical scholars had undertaken sustained examination of question–and–answer exchanges, in part because questions seem — at least superficially — to fall outside the scope of persuasive communication. Indeed, the rhetorical question is traditionally (and fairly) understood as a question soliciting no answer. But this study is directed at questions that do seek answers, some of them urgently, some of them plaintively. Online Q&A sites generate a tremendous daily churn of questions that range from trivia to major life decisions. But all of these questions are marked by an implicit sense of urgency — the question asker is seeking an answer (or answers) as soon as possible, and depending on the scale and speed of the Internet to deliver. As we came to understand the range of questions on these sites, we found the Aristotelian species suggestive, but unable to offer a comprehensive account of the nature and intent of the questions we were seeing. We also saw, however, that questions fell fairly neatly into a small pool of subcategories.

Our project ultimately offers a “friendly amendment” to Aristotle’s species of rhetoric — calibrated to the specific dynamics of online question–and–answer exchanges. On one hand, we are challenging Aristotle by suggesting his species of rhetoric are further divisible, but on the other hand we are honoring the method of stabilizing a definition outlined by Aristotle in his Posterior analytics (1975):

“Our procedure makes it clear that no elements in the definable form have been omitted: we have taken the differentia that comes first in the order of division, pointing out that animal, e.g., is divisible exhaustively into A and B, and that the subject accepts one of the two as its predicate. Next we have taken the differentia of the whole thus reached, and shown that the whole we finally reach is not further divisible — i.e., that as soon as we have taken the last differentia to form the concrete totality, this totality admits of no division into species.” [4]

In the case at hand, the application of Aristotle’s species to the distinctive rhetorical strategies of online Q&A sites prompted a need for further division. And it is a settled principle of taxonomies, whether Aristotelian or biological, that a species must either have no subspecies at all, or two or more. In this case, once we tested our subspecies against a representative pool of questions drawn from leading online Q&A sites, we found that we needed no more than two subspecies per species in order to adequately identify and classify these questions. Indeed, our six categories readily encompassed all of the questions that we encountered. While some of the questions submitted to online Q&A sites are arguably “interspecies” — representing a blend of two or more question types — none of the questions we encountered seemed to fall outside our taxonomy’s subspecies or prompt a sustainable argument for further division. We thus see our taxonomy as offering distinct categories which, taken together, provide a comprehensive account of online question–and–answer exchanges. We also see potential for using this taxonomy to analyze and interpret question–and–answer exchanges more generally.

Table 1: The six question types, with definitions and example questions.

Advice (deliberative). Directed at generating a new (or specifically tailored) solution, approach, or plan rather than locating or implementing an already existing solution. Grounded in the questioner’s desire to inform future action.
Example: My parents say that playing “The Beatles: Rock Band” is a waste of time. How can I persuade them that it will actually help me learn to play music?

Identification (deliberative). Directed at establishing a focused discussion (and potentially building relationships) among people with a shared commitment to a topic.
Example: What’s the next band you want to see get a Rock Band “special edition”? I wish they would do the Ramones. I would want to be Dee–Dee. Who would you be?

(Dis)Approval (epideictic). Directed at encouraging readers to offer a “favorite” or “least favorite”, with the implicit understanding that answers will be — at root — subjective opinions.
Example: What’s your favorite Beatles song?

Quality (epideictic). Directed at seeking the “best” or “worst” example of a given class, or at weighing the relative merits of a given product, item, or concept, with the implicit understanding that answers will be — at root — objectively grounded.
Example: What’s the best Ramones song?

Prescriptive (forensic). Directed at pursuing an already developed solution to a problem or challenge. Grounded in the questioner’s desire to learn steps or strategies that are known (through experience) to address or resolve the issue at hand.
Example: I’ve heard there is an Easter Egg in “The Beatles: Rock Band” where you can play “Eleanor Rigby” but I haven’t been able to find it. How do you unlock the song?

Factual (forensic). Directed at seeking an answer that is objectively or empirically true, such as existing information, data, or settled knowledge.
Example: Will my controllers for the Wii version of “Guitar Hero” also work on the Wii version of “The Beatles: Rock Band”?

2.4. Our taxonomy: “Subspecies” of rhetoric

Our taxonomy offers two subdivisions of each of Aristotle’s three species of rhetoric.

Deliberative questions are divided into advice and identification. Advice questions articulate specific and presumably novel situations and seek guidance in anticipation of impending action. Identification questions are directed at attracting respondents with a shared commitment (however ephemeral) to the significance of a given topic. Identification questions often appear to be directed at cultivating conversations and building relationships rather than securing a definitive or functional answer.

Epideictic questions are divided into those that seek responses voicing approval/disapproval and those seeking responses articulating the quality (or qualities) of a thing or a concept. These two types are distinguished by their biases toward the subjective and the objective, respectively. Approval/disapproval questions call upon respondents to offer their “favorites,” whereas quality questions ask respondents to make a case for whether something is the “best” or “worst,” or somewhere in between.

Forensic questions are divided into those seeking prescriptive answers, and those seeking factual answers. Prescriptive questions solicit identified solutions to reasonably common problems. Respondents typically offer answers that take the form of steps expected to resolve the questioner’s problem or concern. Factual questions pursue already existing information, data, or settled knowledge. These can often be answered briefly with lists, tables, or the solicited piece of data.

2.5. Comparison to existing taxonomies

Many scholars have developed taxonomies of question types, because the act of classification aids in the process of interpreting and understanding questions (Pomerantz, 2005), a process that is essential to fields such as information science, natural language processing, and linguistics.

However, these existing taxonomies are typically quite complex. Two “functional” taxonomies identified in a review of existing taxonomies (Pomerantz, 2005) contain 18 and 11 distinct question types. A taxonomy for automated question generation (Nielsen, et al., 2008) contains 17 types. And a “roadmap” document for automated question answering technology (Burger, et al., 2001) contributes exemplar taxonomies containing 13 and 18 types. We argue here that a more streamlined question taxonomy is easier to apply and interpret, and is more tractable for the end goal of building algorithms that classify questions in social Q&A sites.

Another limitation of existing taxonomies is that they fail to acknowledge social question asking, which has been shown to be a major component of Q&A sites (Adamic, et al., 2008; Agichtein, et al., 2008; Harper, et al., 2009). As a case in point, one study of Yahoo Answers (Ignatova, et al., 2009) adapted Graesser’s well–known taxonomy of questions (Graesser, et al., 1994). However, the researchers added an overarching “opinion nature” question type — functionally serving as a “none of the above” response. In contrast, our taxonomy supports classification of all types of questions asked in social Q&A sites, whether or not they are asked with purely social intent.

Finally, existing work in question taxonomies treats questions as single interrogative clauses; we seek a taxonomy that accommodates the messy whole of contributions to social Q&A sites: submissions that are often highly contextualized, often containing multiple sentences, and sometimes ambiguous or poorly worded.

3. Applying the rhetorical framework

We now turn to investigate what we can learn from applying the rhetorical framework of question types to actual questions from social Q&A sites. In this section, we describe the process of coding 300 questions from three Q&A sites.

3.1. Methods

To learn about the prevalence of the different question types in online Q&A sites, two co–authors of this paper — scholars in the field of rhetoric — hand–coded a set of 300 questions. We selected these questions at random from the data set reported in Harper, et al. (2009), which contains several years of questions from three popular Q&A sites: Yahoo Answers, Answerbag, and Ask Metafilter. These three sites offer similar Q&A interfaces, but differ in the volume of contribution and membership. We selected 100 questions from each of these sites to collect a random sample that reflects a wide range of question–asking behavior online.

To facilitate the coding process, we developed an online tool that isolates the text and category of the question from its context, blinding the coders to any potential bias attributable to the Q&A site. The coding tool asks coders to assign exactly five points among the six question types. We chose this design over a “pick the best match” design, because in early trials we determined that many questions combine aspects of more than one question type. For instance, a question may be asking for both factual information and advice. In this case, a coder might determine that the question is 60 percent factual (three points) and 40 percent advice (two points). In the case where the coder believes that the question does not fit into any of the categories, we provided a “none of the above” option.

Figure 1: A screenshot of the online tool for assigning types to questions. Coders assigned exactly five points across the different types.

To create a “gold standard” of question types, the expert coders first independently scored the 300 questions, then discussed their differences in person, and re–coded in cases of extreme disagreement. We did not require eventual agreement; there are cases where the two coders agreed to disagree. This disagreement, for better or worse, reflects the ambiguity and challenge in coding online questions.

We employ a metric called majority type for summarizing the coding of each question. We consider a question to have a majority type if a single type received at least 60 percent of the expert coders’ combined points. In the case that a question did not receive a majority of the combined points in a single type, we say that question has “no majority.” Finally, if either of the coders checked “no category applies,” noting that the purported questioner had not, in fact, asked a question (the two cases are discussed below), we assigned the type of “not a question,” or “N.A.Q.”
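The majority–type computation can be sketched in a few lines. This is a minimal illustration of the metric described above; the function and type names are ours, not taken from our coding tool:

```python
# Sketch of the "majority type" metric: each coder distributes exactly five
# points across the six types, giving ten combined points per question; a
# type holding at least 60 percent of the combined points (6 of 10) is the
# majority type, otherwise the question has no majority.

TYPES = ["advice", "identification", "approval", "quality",
         "prescriptive", "factual"]

def majority_type(coder_a, coder_b, threshold=0.6):
    """coder_a, coder_b: dicts mapping question type -> points (each sums to 5)."""
    assert sum(coder_a.values()) == 5 and sum(coder_b.values()) == 5
    combined = {t: coder_a.get(t, 0) + coder_b.get(t, 0) for t in TYPES}
    total = sum(combined.values())  # always 10 here
    best = max(combined, key=combined.get)
    if combined[best] / total >= threshold:
        return best
    return "no majority"

# One coder rates a question 60% factual / 40% advice; the other, all factual:
print(majority_type({"factual": 3, "advice": 2}, {"factual": 5}))  # → factual
```

A 3/2 split by one coder can thus still yield a majority type if the second coder's points concentrate on the same type.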

3.2. Agreement and disagreement

Though our expert coders discussed their disagreements, and re–coded in some cases, they were not required to reach consensus. There are a number of ways to measure the eventual agreement between the coders. Looking at primary type assignment — where a coder assigns three or more points to a single type — the two experts agreed on the primary type 94.3 percent of the time (283/300 questions). Of the 17 questions with disagreement between the two coders, 10 are 3/2 splits where the two coders assigned all of their points to the same two types, but disagreed concerning the primary type. We may also look at how the two coders “aligned,” by measuring the extent to which types received points from both coders; the two coders achieved 100 percent alignment in 61.0 percent of the questions, 90 percent alignment in 87.0 percent of the questions, and 80 percent alignment in 98.3 percent of the questions. Overall, the two coders rated with a Pearson product–moment correlation coefficient [5] of r=0.953.
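The text above does not give a formula for “alignment”; one natural reading, sketched here as an assumption, treats it as the overlap between the two coders’ five–point allocations — the sum of per–type minima, as a fraction of the five points:

```python
# Hypothetical reconstruction of the "alignment" measure (our reading, not a
# published formula): the share of the five points that both coders placed
# on the same types.  Identical allocations align 100 percent.

TYPES = ("advice", "identification", "approval", "quality",
         "prescriptive", "factual")

def alignment(coder_a, coder_b):
    """coder_a, coder_b: dicts mapping question type -> points (each sums to 5)."""
    overlap = sum(min(coder_a.get(t, 0), coder_b.get(t, 0)) for t in TYPES)
    return overlap / 5.0

# A 4/1 split versus a 5/0 split on the same primary type aligns 80 percent:
print(alignment({"factual": 4, "advice": 1}, {"factual": 5}))  # → 0.8
```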

We may use those instances of disagreement to help understand where the taxonomy is most vulnerable to confusion — or, where the grey area of question–asking is most revealed. Interestingly, the most commonly confused categories were factual and identification: one coder chose each of these as the primary type in 10 cases before discussion, and in three cases after discussion. One example of this type of “borderline” question is from Answerbag: “After losing my job, I began drinking daily and heavily. I stopped cold turkey. It has been 1 week and I am nauseous, my stomach hurts and I keep having hot flashes. Am I experiencing withdrawl [sic] to some degree?” In this case, the question asker could be asking for facts regarding the symptoms of withdrawal, or just looking for others to identify with. It’s also possible that the asker is looking for both of these things. The other question types that were most commonly confused: factual/prescriptive (eight cases before discussion, two after), and advice/identification (six cases before discussion, one after).

4. Analysis: Quantifying the differences among question types

The six question types presented in the rhetorical framework each exhibit different characteristics that may help us to understand their differences. In this section, our goal is to learn about the prevalence and descriptive characteristics of the different question types as they occur on different sites. These analyses all use the “majority type” metric developed in the previous section to associate each question with a single question type.

4.1. Distribution of question types by site

Figure 2 shows the distribution of questions overall and by site. We find that factual (31 percent of questions) and identification (28 percent) are the most common types, while quality (seven percent) and (dis)approval (five percent) are the least common. However, these distributions do vary, site by site. For instance, Answerbag has relatively more identification and (dis)approval questions, Ask Metafilter has more advice and prescriptive questions, and Yahoo Answers has more factual questions. Yahoo also produced the only two questions that the coders rated as “not a question.” In full, these questions read: “myspace layouts?” and “heavy metal music? I am doing a project about it and need info.”

Figure 2: The distribution of majority question types by Q&A site. Each number is a percentage; the rows sum to 100. The first six columns show the percentages for the types from our rhetorical framework; “noMaj” means the question had no majority type, while “N.A.Q.” means “not a question”.

4.2. Word use by question type

To better understand the verbal characteristics of the six question types, we conduct a simple analysis of the frequency of word use. We first construct a “bag of words” for each question, representing each question by a vector of Boolean values indicating the presence or absence of each word in the overall corpus of questions. From this vector, we calculate the “importance” of each word to each question type, using the following formula:

Importance(wi, tj) = T(wi, tj) − G(wi)

Where T(wi, tj) is the percentage of questions of type tj containing the word wi and G(wi) is the percentage of all questions containing the word wi. This metric is related to the idea of TF–IDF (Salton and McGill, 1983) — a standard algorithm for computing the relative importance of a word to a document — but in this case better accounts for the small corpus and the goal of summarizing by type (rather than by question).
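As a concrete illustration, the following sketch computes this importance measure over a toy corpus. The corpus and the naive whitespace tokenization are ours, simplified for clarity:

```python
# Word importance for a question type: Importance(w, t) = T(w, t) - G(w),
# where T is the share of questions of type t containing word w, and G is
# the share of all questions containing w.

def contains(question, word):
    # Naive tokenization: lowercase, split on whitespace.
    return word in question.lower().split()

def importance(word, qtype, questions):
    """questions: list of (type, text) pairs."""
    of_type = [q for t, q in questions if t == qtype]
    T = sum(contains(q, word) for q in of_type) / len(of_type)
    G = sum(contains(q, word) for _, q in questions) / len(questions)
    return T - G

corpus = [
    ("approval", "what is your favorite beatles song"),
    ("approval", "who is your favorite player"),
    ("quality",  "what is the best ramones song"),
    ("factual",  "does soda or coffee have more caffeine"),
]
print(importance("favorite", "approval", corpus))  # → 0.5 (T=1.0, G=0.5)
```

A word common to one type but rare overall (like “favorite” here) scores high; a word equally common everywhere scores near zero, which is why function words only surface when a type over-uses them.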

Table 2 provides a summary of the four most “important” words for each question type. Several observations stand out: advice questions feature the words “me” and “I” — indicating that the asker is the target of the question — while identification questions feature the words “you” and “your” — indicating that these question types target the answerers. (Dis)approval questions feature the words “favorite” and “fav”, as compared to quality questions, which use the words “best” and “good”. These differences reflect the more subjective nature of (dis)approval questions and the more objective nature of quality questions. Prescriptive questions often use the question word “how”; similar to advice questions, they are self–targeted, using “my” and “I”.

Table 2: An analysis of the four words that are most “important” to each type, relative to their overall importance. Left–to–right, the three percentages represent: (1) the percentage of questions of that type containing the word; (2) the percentage of all questions in the data set containing the word; and, (3) the difference between these two percentages.

Type             Word       Type %   Global %   Diff
Advice           me         68%      24%        43%
                 but        74%      32%        41%
                 in         82%      44%        39%
                 I          97%      59%        38%
Identification   you        60%      37%        23%
                 none       52%      33%        19%
                 your       25%      13%        12%
                 think      24%      16%        8%
(Dis)Approval    do         71%      34%        37%
                 favorite   29%      2%         27%
                 none       57%      33%        24%
                 fav        14%      1%         14%
Quality          best       38%      6%         32%
                 good       43%      13%        30%
                 what       67%      40%        27%
                 a          86%      60%        25%
Prescriptive     how        62%      26%        36%
                 my         71%      40%        31%
                 on         59%      29%        30%
                 I          88%      59%        30%
Factual          can        40%      30%        10%
                 does       22%      13%        8%
                 name       11%      4%         7%
                 find       15%      11%        5%

4.3. Length of question, number of responses

We also examine the length of the question and the number of answers received as objective metrics by which to separate questions by type. See Table 3 for a summary.

Table 3: The average length (characters) and number of answers received per question type.

Type             Length (characters)   Number of answers
Advice           919                   12.0
Identification   278                   11.4
(Dis)Approval    175                   8.1
Quality          435                   7.0
Prescriptive     573                   6.1
Factual          334                   6.2

Before we jump to conclusions about these metrics, we must first remember that these questions are drawn from three Q&A sites with different characteristics. Each site has a different average question length (mean in characters: Answerbag=103, Metafilter=844, Yahoo=323) and number of answers per question (mean: Answerbag=6.5, Metafilter=14.3, Yahoo=5.5). Therefore, some of the differences may in fact be due to the different proportions of question types across sites. Thus, in testing for statistical significance, we control for site by building a regression model predicting the quantitative outcome from the site and the question type. We then conduct a Tukey’s HSD (Kramer, 1956) test across the least squares means of the question type to determine which differences exist.
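To make the site–adjustment step concrete (this illustrates only the adjustment, not the full regression or the HSD test itself), the following Python sketch removes each site's mean before comparing per–type means; the sites, types, and values below are invented toy data:

```python
from collections import defaultdict
from statistics import mean

def site_adjusted_means(records):
    """records: list of (site, qtype, value) tuples. Subtract each site's
    mean (a rough stand-in for the site term in the regression model) and
    re-center on the grand mean, then report per-type means of the
    adjusted values."""
    grand = mean(v for _, _, v in records)
    by_site = defaultdict(list)
    for site, _, v in records:
        by_site[site].append(v)
    site_mean = {s: mean(vs) for s, vs in by_site.items()}
    by_type = defaultdict(list)
    for site, qtype, v in records:
        by_type[qtype].append(v - site_mean[site] + grand)
    return {t: mean(vs) for t, vs in by_type.items()}

# Toy question lengths: one long-form site, one short-form site.
data = [
    ("metafilter", "advice", 900), ("metafilter", "factual", 800),
    ("yahoo", "advice", 400), ("yahoo", "factual", 250),
]
adj = site_adjusted_means(data)
print(adj)
```

After the adjustment, what remains of the advice/factual gap cannot be attributed to one site simply hosting longer questions, which is the intuition behind controlling for site in the regression.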

We find evidence that advice questions are longer than other types of questions, after controlling for site. The full model has a predictive power of r2=0.44, and both site (p<0.01) and type (p<0.01) are statistically significant predictors of question length. The HSD test reveals that advice questions are significantly longer than other questions (α=0.05), while there are no statistically significant differences among the remaining categories.

We also observe a relationship between a question’s type and the number of answers it receives. This model has a predictive power of r2=0.25, and both site (p<0.01) and type (p<0.01) are statistically significant. The HSD test reveals that identification questions attract more responses than factual, quality, or prescriptive questions (α=0.05). However, we cannot detect a statistically significant difference among identification, (dis)approval, and advice questions.

4.4. Compound questions

We have observed that many questions in social Q&A sites in fact contain several interrogative phrases that are distinct in meaning and intent. For example, one question in our data set reads: “What is the Democratic Party’s official stance on the future of Iraq? If we were in power right now, what would we be doing differently?” This question contains two adjacent interrogative sentences, each of which has a distinct type (factual and identification, respectively), and each calls for a different kind of answer. We call this a compound question.

To evaluate whether or not a question is compound, we employed seven undergraduate students to answer the question “Does this submission contain more than one distinct question?” (Yes or No), using an interface similar to that shown in Figure 1. We take the median of their responses as our gold standard for whether a question is compound. See Figure 3 for a summary of the results.

Figure 3: A comparison, by type, of the percentage of questions that are compound vs. single questions.

Question type is predictive of whether a question is compound. Because different question types are asked in different proportions across the three experimental sites, we employ a regression model using site and the majority question type to predict whether a question is compound. Both site (p<0.01) and type (p=0.01) are statistically significant predictors (r2=0.19). Looking deeper at a Tukey HSD over the least squares means, we find that advice questions are more often compound than factual questions (α=0.05). However, we cannot distinguish other statistically significant differences between the types with our current number of observations.

5. The value of question types

We now turn from the descriptive characteristics of questions to outcome–oriented characteristics. Our goal in this analysis is to develop an understanding of the qualitative value characteristics of the different question types, and to contribute new results to inform the design of Q&A sites.

5.1. Methods and metrics

We employed seven undergraduate students to evaluate the same 300 questions discussed above. We provided the students with an online tool to evaluate questions across several dimensions. As with the expert coders, they viewed questions as plain text with only the topical category for context. We discuss the specific measures used in this analysis below.

To condense the responses of the seven coders into a single “score” for a question, we employ the following metric:

Adjusted average — drop the highest and lowest scores, then compute the arithmetic mean.

This metric is highly correlated with the median coded score (r=0.95), giving us confidence that it fairly summarizes the scores. It is an appropriate metric for this analysis, as it is more robust to outliers than the arithmetic mean and offers finer granularity than the median.
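As a sketch (not the authors' code), the adjusted average can be computed as follows in Python; the example ratings are invented:

```python
def adjusted_average(scores):
    """Drop one highest and one lowest score, then average the rest."""
    if len(scores) <= 2:
        raise ValueError("need at least three scores")
    trimmed = sorted(scores)[1:-1]  # discard the extremes
    return sum(trimmed) / len(trimmed)

# Seven coders' ratings on a five-point Likert scale.
ratings = [1, 3, 4, 4, 5, 5, 5]
print(adjusted_average(ratings))  # drops the 1 and one 5 -> 4.2
```

A single outlying coder (the 1 above) thus moves the score far less than it would move a plain arithmetic mean, while the result remains finer-grained than the integer-valued median.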

The student coders were generally consistent in their ratings across the measures reported in this paper. Individuals’ ratings correlated with the median in the range of r=0.69 (the least consistent coder) to r=0.83 (the most consistent coder). The average pairwise correlation between coders is r=0.57.

As above, to test for statistical significance, we first build a regression model including a categorical variable representing the Q&A site, then test for differences among categories by computing Tukey’s HSD across the least squared means.

5.2. Which question types have high archival value?

We first examine the relationship between question type and the potential for that question to create archival value information, by asking our student coders to evaluate the following statement on a five–point Likert scale (1=strongly disagree, 5=strongly agree):

“I think high–quality answers to this question will provide information of lasting/archival value to others.”

This metric was introduced in Harper, et al. (2009) using the same language and scale; we include it in this study to understand the relationship between our (more nuanced) taxonomy and the archival value of questions.

See Figure 4 for an overview of our results, formatted as a standard box plot with notches (Chambers, et al., 1983).

Figure 4: A box plot representing the distribution of adjusted averages for archival value, for each question type. Center lines represent the median value; boxes represent the range between the lower and upper quartiles (i.e., 25 percent–75 percent); whiskers indicate the full range of the data, excluding outliers.

We find a stark contrast between question types that appear to typically have a high potential archival value (advice, quality, prescriptive, and factual), and those that have a low potential archival value (identification and (dis)approval). This difference is statistically significant (α=0.05). We also find that quality and prescriptive questions have a higher archival potential than factual questions (α=0.05).

This is a surprising result! We borrow the “archival value” metric from Harper, et al. (2009), who found evidence that “informational” questions generally have high potential archival value, while “conversational” questions do not. On its surface, this might lead us to hypothesize that factual and prescriptive questions lead to the generation of the best material for helping future Google searchers. However, the top category is actually quality (mean: 3.9); advice questions are also rated highly (they are statistically indistinguishable from factual using this metric).

While the distinction between the largely subjective “favorites” and “least favorites” solicited by (dis)approval questions and the more objectively–grounded “best” and “worst” in the quality category is superficially subtle, student coders strongly distinguished between the potential archival value of questions in these categories. A representative (dis)approval question, “Do you have a favorite Star Wars quote from any of the movies?”, was coded by students as having minimal archival value (adjusted average: 1.2), reflecting, we suspect, an appropriate understanding that subjective opinions, while potentially helping to build relationships among discussants, do not tend to have value that extends beyond the natural life of the discussion. A representative quality question, “What type of passive solar house roof works better: a peaked or a flat sloping?”, was scored as having potentially high archival value by student coders (adjusted average: 4.8), who recognized that answers would almost certainly tend toward valuable data and information, and not merely an array of ephemeral opinions. Thus, while the two epideictic categories involve discussants negotiating standards of value, the responses to these types of questions can often be distinguished by the degree to which they are understood to solicit subjective opinions or relatively objective, grounded information.

5.3. Which question types are most personalized?

One aspect of a question that potentially affects its usefulness and uniqueness is the degree to which it is personalized to the asker’s situation. To examine the relationship between question type and the degree of personalization, coders evaluated the following statement on a five–point Likert scale (1=generic, 5=personalized):

“Is this question highly personalized with regard to the asker’s situation, completely independent of the asker’s situation (generic), or somewhere in between?”

Figure 5: A box plot representing the distribution of the degree of personalization in questions, for each question type.

Advice questions are the most personalized to the asker’s situation (mean: 4.0), while factual questions are the most generic (mean: 1.9). The striking difference between advice questions and all other question types is statistically significant (α=0.05). We also find evidence that identification and (dis)approval questions — which span the spectrum of personalization — are more personalized than factual questions (α=0.05). We see considerable potential in future work directed at identifying and understanding the linguistic markers that produce these highly varied results.

5.4. Which question types need the most coaching?

Prior work has established a correlation between the value of the question and the value of subsequent answers (Agichtein, et al., 2008). Thus, we seek to learn which types of questions might benefit the most from improvement. We pursue this question by asking our student coders to evaluate the following statement on a five–point Likert scale (1=strongly disagree, 5=strongly agree):

“I think this question will receive more or better answers if it is revised.”

Figure 6: A box plot representing the distribution of adjusted averages for the potential for improvement in questions.

Our student coders rated prescriptive questions (mean: 2.68) as having the most potential to receive better answers through rewriting the question. The difference between prescriptive questions and all other question types except advice is statistically significant (α=0.05).

5.5. Correlations

For completeness, we include an analysis of the correlations between the different metrics reported in this paper (see Table 4). Generally speaking, our outcome measures are only weakly correlated, though we do find that question length is positively correlated with whether the question is compound (r=0.38), the degree of personalization (r=0.57), and the potential archival value (r=0.28).
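The Pearson correlations reported here can be computed as in the following Python sketch; the paired values below are invented toy data, not measurements from our data set:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences:
    covariance of the deviations divided by the product of the standard
    deviations. Ranges from -1 to +1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: question lengths paired with a 0/1 compound flag.
lengths = [103, 844, 323, 500, 120]
compound = [0, 1, 0, 1, 0]
print(round(pearson_r(lengths, compound), 2))  # strongly positive here
```

In this toy data the longer questions are the compound ones, so r is strongly positive, mirroring the direction (though not the magnitude) of the r=0.38 relationship reported above.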

Table 4: A correlation matrix of several measures (objective and subjective) discussed in this research. Column headers abbreviate the row labels.

                          Comp.   Arch.   Pers.   Impr.   Length   Answers
Compound                  1.00    0.22    0.24    0.15    0.38     0.12
Archival value            0.22    1.00    0.14    0.01    0.28     -0.08
Personalized              0.24    0.14    1.00    0.22    0.57     0.13
Improved with revisions   0.15    0.01    0.22    1.00    0.07     -0.21
Question length           0.38    0.28    0.57    0.07    1.00     0.32
Number of answers         0.12    -0.08   0.13    -0.21   0.32     1.00

6. Discussion

In this paper, we presented a taxonomy of question types that is based on rhetorical theory and tailored to the study of online question asking. We also took a preliminary look at the structural and value features that distinguish the question types in the taxonomy.

6.1. Limitations and next steps

There are several limitations to this study. Primary among these is the lack of data concerning disagreement between the two coders. It would be useful to understand the rates of initial disagreement between the expert coders, to better understand which question types were most often confused.

Another limitation is that we do not have evidence that the taxonomy is easy to apply to questions. We employed five of the seven undergraduate coders for a follow–up study, in which we asked them to re–code the full 300 question data set for type. We did not train the students beforehand; we simply gave them a chart very similar to Figure 1 for definitions and reference. Like the expert coders, they used the online coding tool to evaluate questions, but unlike the expert coders, they were not given the chance to discuss or change their answers. The rate at which they agreed with the expert coders was not especially high: taking a plurality of their votes across question types, they agreed with the experts’ majority type 63.6 percent of the time (ranging from 92.9 percent agreement on (dis)approval questions to just 28.6 percent agreement on quality questions).

In future work, we hope to develop tools that make it easier to accurately code large numbers of questions with little human effort. Preliminary tests show that accurate classification will be a challenging task using computers alone: using a naïve Bayes classifier over a bag of words feature set, we were able to achieve just 41.7 percent classification accuracy. We are interested in developing coding tools that leverage a wider feature set of the question — including those reported in Agichtein, et al. (2008) and Harper, et al. (2009), along with natural language features — to highlight the most salient attributes of a question in a coding interface. We are also interested in exploring the potential for coding along a decision tree, where, for example, the coder might begin by stating whether the question is subjective or objective before working towards more specific aspects of the classification.
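To make the baseline concrete, here is a minimal naïve Bayes classifier over a bag–of–words feature set, written as a pure–Python sketch with Laplace smoothing. This is not the classifier used in our preliminary tests, and the training examples and labels below are invented for illustration:

```python
from collections import defaultdict
from math import log

def train_nb(examples):
    """examples: list of (label, text) pairs. Returns a multinomial naive
    Bayes model with Laplace (add-one) smoothing over a bag of words."""
    vocab = set()
    word_counts = defaultdict(lambda: defaultdict(int))
    label_counts = defaultdict(int)
    for label, text in examples:
        label_counts[label] += 1
        for w in text.lower().split():
            vocab.add(w)
            word_counts[label][w] += 1
    total = sum(label_counts.values())
    model = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        model[label] = {
            "prior": log(label_counts[label] / total),
            "likelihood": {w: log((counts[w] + 1) / (n_words + len(vocab)))
                           for w in vocab},
            # smoothed log-probability for words never seen in training
            "unseen": log(1 / (n_words + len(vocab))),
        }
    return model

def classify(model, text):
    def score(label):
        m = model[label]
        return m["prior"] + sum(m["likelihood"].get(w, m["unseen"])
                                for w in text.lower().split())
    return max(model, key=score)

examples = [
    ("factual", "how many ounces in a pound"),
    ("factual", "what year did the war end"),
    ("advice", "should i take this job offer"),
    ("advice", "how do i handle my noisy roommate"),
]
model = train_nb(examples)
print(classify(model, "should i accept the offer"))  # -> advice
```

Even this toy example hints at why word features alone struggle: many type-discriminating cues (“should I” vs. “what year”) are shared across types, which is consistent with the modest 41.7 percent accuracy we observed.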

6.2. Notes for researchers and practitioners

Researchers and practitioners may wish to use a simplified version of the taxonomy reported here. Here are some alternatives, along with rough mappings from our taxonomy:

However, we have observed in this paper several instances where such simplifications would substantially blur real differences. For instance, if we had combined the factual and prescriptive types, we would have missed a statistically significant difference in potential archival value between the two types.

7. Conclusion

To better support the diverse set of uses to which social Q&A sites are being put, it is necessary to clarify the nature of the questions being asked, determine the intent behind those questions, and then to operationalize that understanding. In this paper, we have taken steps towards this goal by offering a new rhetorically grounded taxonomy of question types that is flexible enough to encompass the range of questions asked online.

We contributed results and discussed related work where the type of a question is used to explain differences in user behavior. However, a question’s type is not purely of academic interest. Real systems — such as Answerbag and Microsoft’s Live QnA — already organize their content based on the informational or conversational type of the question. This is a first step towards the development of interfaces that change how questions are asked, answered, aggregated, and searched for, in response to the type of information users seek.

We hope the tools and insights provided in this paper will help researchers and practitioners to innovate next–generation Q&A platforms that ultimately lead to richer user experiences and to more productive communicative exchanges.

About the authors

F. Maxwell Harper is a post–doctoral associate in the Department of Computer Science and Engineering at the University of Minnesota.

Joseph Weinberg is a doctoral student in Rhetoric and Scientific & Technical Communication in the Department of Writing Studies at the University of Minnesota.

John Logie is an Associate Professor of Rhetoric in the Department of Writing Studies at the University of Minnesota.

Joseph A. Konstan is Distinguished McKnight University Professor and Distinguished University Teaching Professor in the Department of Computer Science and Engineering at the University of Minnesota.

Acknowledgments

The authors would like to thank Aditya Pal and Arun Kumar Mannava for their help developing our research ideas and testing our taxonomies.

This work was supported by the National Science Foundation through grants 03–24851 and 08–12148.

5. Pearson product–moment correlation is a standard measure of the linear dependence of two variables. Values range from +1 (correlated) to -1 (inversely correlated). A value of 0 indicates no observable relationship between the two variables.
